March 5, 2026 · 7 min read

Every Tool Call Your Agent Makes Is Resending Your Entire Conversation History

The stateless HTTP API problem is quietly bankrupting your agent in tokens and time — and WebSockets are the fix nobody talks about enough

Diagram comparing stateless HTTP (full context re-sent per tool call) vs persistent WebSocket (incremental deltas only)

Here's something that should bother you more than it probably does.

Every time your AI agent calls a tool — searches the web, reads a file, queries a database — it re-sends your entire conversation history to the API. The system prompt. Every prior message. Every tool result that came before. All of it, again, from scratch.

For a simple chatbot, this is a footnote. For an agent making 20+ tool calls in a session, this is a structural tax that compounds with every step.

Why the API Knows Nothing About You (By Design)

LLM APIs are stateless. This is not an oversight — it's a deliberate architectural choice. Each HTTP request is independent. The server holds no session. You send a complete snapshot of the conversation, the model responds, and the connection closes.

This is clean. It's horizontally scalable. It makes the API simple to reason about.

But it also means the server can't remember you. Every request is the first request.

The "context window" people talk about — 128K tokens, 200K tokens — that's not a server-side buffer that grows as you chat. It's the payload size limit for a single HTTP request. You're not streaming into a running session. You're reconstructing one from scratch, every time.
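A minimal sketch of what that reconstruction looks like from the client side. The `call_api` function is a stand-in for a real SDK call, not any provider's actual API:

```python
# Each "turn" re-sends the full message list: the server holds no session,
# so the client reconstructs the conversation on every single request.

def call_api(messages):
    """Stand-in for a real SDK call (e.g. an HTTP POST to a chat endpoint)."""
    return {"role": "assistant", "content": f"reply to {len(messages)} messages"}

history = [{"role": "system", "content": "You are a helpful agent."}]

for user_msg in ["Find the report", "Summarize section 2"]:
    history.append({"role": "user", "content": user_msg})
    reply = call_api(history)   # the entire history goes over the wire, again
    history.append(reply)

# After two turns: system prompt + 2 user messages + 2 replies,
# all of which will be re-sent in full on the next call.
print(len(history))  # 5
```

The list only ever grows, and every element of it rides along on every subsequent request.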

Tool Calls Are a Multiplier

In a normal conversation, the context grows linearly. Message by message. Manageable.

Agents break this assumption. An agent doesn't just respond — it acts, observes, and acts again. Each action is a tool call. Each tool call is a round-trip to the API. Each round-trip carries the full context.

Here's what that looks like concretely. Say your agent has:

  • A 2,000-token system prompt
  • A 500-token user task
  • Tool call results averaging 800 tokens each

By tool call 10, you're sending roughly 9,700 tokens per request — and that number grows with every subsequent call. By tool call 20, you're at 17,700 tokens per request. The cumulative token bill for a 20-tool-call session isn't 20 × (average request size). It's the triangular sum of an ever-growing context.

Tool call 1:  2,500 tokens sent
Tool call 2:  3,300 tokens sent
Tool call 3:  4,100 tokens sent
...
Tool call 10: 9,700 tokens sent
Tool call 20: 17,700 tokens sent

Total tokens sent: ~202,000
Tokens of actual new information: ~18,500

You're sending roughly 11x more data than you need to. And you're paying for every token of it.
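The arithmetic is easy to reproduce. Under the stated assumptions (2,500-token base, 800 tokens per tool result), a few lines confirm the triangular blow-up:

```python
BASE = 2_000 + 500   # system prompt + user task
RESULT = 800         # average tokens per tool result

def tokens_sent(call_n):
    """Context transmitted for tool call n: the base plus all prior results."""
    return BASE + (call_n - 1) * RESULT

total_sent = sum(tokens_sent(n) for n in range(1, 21))
new_info = BASE + 20 * RESULT   # each token only needed to be sent once

print(tokens_sent(10))                  # 9700
print(tokens_sent(20))                  # 17700
print(total_sent)                       # 202000
print(round(total_sent / new_info, 1))  # 10.9
```

The redundancy factor keeps climbing with session length: at 40 tool calls the same formula puts you past 20x.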

Token growth comparison between HTTP and WebSocket across 20 tool calls

The Speed Problem Is Just as Bad

Token cost gets the headlines. Latency rarely does.

Each HTTP round-trip for a tool call has overhead: connection setup, TLS handshake, serialization of the full context payload, network transit, server-side reconstruction of session state from your payload, inference, response serialization, network transit back.

The connection setup and TLS handshake alone add 50–150ms per call on a good day. Multiply by 20 tool calls. You've added 1–3 seconds of pure infrastructure overhead before a single token is generated.

For an agent that's supposed to feel responsive, this death-by-a-thousand-cuts is real. The inference time is the fast part. The overhead is what slows you down.

Connection overhead comparison: HTTP repeated handshakes vs WebSocket persistent connection

What WebSockets Actually Fix

OpenAI's WebSocket mode for the Responses API is the specific mechanism worth understanding here. This is distinct from the Realtime API (which targets voice and audio) — WebSocket mode is built for extended agentic workflows over text.

The key difference is in how turns work. With standard HTTP, every request carries the full conversation. With WebSocket mode, each subsequent turn sends only two things: a previous_response_id referencing the prior turn, and the new input items for this turn. The server maintains connection-local state in memory, so it can continue from where the last response left off without you reconstructing the full history.

# HTTP (every turn)
POST /responses
{ messages: [system_prompt, turn_1, result_1, turn_2, result_2, ..., turn_N] }

# WebSocket mode (after turn 1)
{ previous_response_id: "resp_xyz", input: [new_tool_result] }

A persistent WebSocket connection means:

Delta-based turns, not full context reconstruction. You send new items only. The server reuses its cached state from the previous response. The growing history stops being your problem to transmit.

The connection stays open. No TLS handshake per tool call. No connection setup overhead. The round-trip cost collapses to network latency plus inference time.

The server streams incrementally. Rather than buffering a complete response, you receive tokens as they're generated. First token latency drops dramatically.

The confirmed number from OpenAI's own documentation: up to ~40% faster end-to-end execution for rollouts with 20+ tool calls. That's not a theoretical ceiling — it's what they've observed in practice.

What This Means If You're Building Agents

If you're making fewer than 5 tool calls per session, this probably doesn't matter enough to change your architecture today. The overhead is real but not catastrophic at small scale.

If you're building agents that run extended workflows — research agents, coding agents, multi-step automation — the math changes fast. The stateless HTTP model is actively working against you.

A few things worth internalizing:

Prompt caching is a partial fix, not a solution. Anthropic and OpenAI both offer prompt caching that reduces cost for repeated prefixes (like your system prompt) — up to 80% latency reduction and 90% input cost reduction for cached tokens. This helps. But it only works for the static prefix. It doesn't solve the fundamental problem of re-transmitting growing tool call history on every request.
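As a concrete illustration of why caching only covers the prefix: Anthropic's prompt caching marks the static portion with a `cache_control` breakpoint, and everything after the marker is still re-sent and re-processed. The request below shows the shape only (model name and contents are placeholders; nothing is sent anywhere):

```python
# The static prefix (system prompt, tool definitions) gets a cache breakpoint;
# the per-turn messages after it are still transmitted in full every request.
request = {
    "model": "claude-sonnet-4-5",   # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are an agent with these tools... (the 2,000-token prompt)",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    # The growing tool-call history lands below the breakpoint, so caching
    # does nothing for it -- it rides along on every subsequent request.
    "messages": [
        {"role": "user", "content": "Find the quarterly numbers."},
        {"role": "assistant", "content": "(tool_use: query_database)"},
        {"role": "user", "content": "(tool_result: 800 tokens of rows)"},
    ],
}
```

The cached prefix stays cheap; the `messages` list is the part that grows with every tool call, and it sits entirely outside the cache breakpoint.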

Token cost scales with context length × tool calls. The per-call cost grows linearly, so the cumulative bill grows quadratically. If you're building a commercial agent product, this is a unit economics problem that will bite you at scale before you expect it to.

Architecture matters now. Retrofitting WebSocket session management onto an HTTP-based agent is non-trivial. OpenAI's conversation state guide shows how much state management you're currently handling yourself that WebSocket mode would abstract away. Building this in from the start is easier than adding it later.

The cache has limits you need to design around. WebSocket mode caches one previous response state per connection in memory. With store=false or Zero Data Retention policies, there's no fallback to persisted storage — only the in-memory state is accessible. Failed requests (4xx/5xx) automatically evict the cached state. If your agent needs to resume across connection drops or handle errors gracefully, you need explicit reconnection logic.
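What "explicit reconnection logic" might look like in practice — a sketch of a retry wrapper that falls back to re-sending the full context when the server's in-memory state is gone. All names here (`send_delta`, `send_full`, `ConnectionLost`) are hypothetical stand-ins for your own transport layer:

```python
import time

class ConnectionLost(Exception):
    """Raised when the socket drops and the server's in-memory state is gone."""

def run_turn(send_delta, send_full, new_items, max_retries=3):
    """Try the cheap delta path first; on connection loss, back off and retry,
    then fall back to rebuilding the session from the client-side transcript."""
    for attempt in range(max_retries):
        try:
            return send_delta(new_items)      # previous_response_id path
        except ConnectionLost:
            time.sleep(0.1 * 2 ** attempt)    # simple exponential backoff
    # Cached state is unrecoverable: re-send full context over a fresh
    # session, exactly as the plain HTTP model always does.
    return send_full(new_items)
```

The design consequence: even with server-side state, you keep a client-side transcript as the source of truth. The delta path is an optimization you can lose at any moment, not a replacement for owning your own history.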

Not all tasks need a persistent session. A quick lookup agent that makes 2–3 tool calls and terminates? HTTP is fine. A coding assistant that iteratively reads files, makes edits, runs tests, and interprets output over 30+ steps? That's where persistent sessions earn their complexity. The 60-minute maximum connection duration also means very long-running tasks need a reconnection strategy built in.

The Bigger Picture

The stateless API design made LLMs accessible. It made them easy to integrate. It made them scalable on the provider side. All of that remains true.

But agent workloads aren't chatbot workloads. They're long-running, tool-heavy, iterative. They were designed to be. And the infrastructure assumptions that work for "user sends message, model responds" start breaking down when the pattern is "agent calls tool 40 times while completing a multi-hour task."

WebSockets aren't a silver bullet. The added complexity is real. But for agents at any meaningful scale, the question isn't whether to move toward persistent sessions — it's when and how fast.

Right now, most agents are running HTTP and paying the stateless tax on every step. That tax is invisible until it isn't. When it becomes visible, it shows up as unexpectedly high token bills, sluggish tool call latency, and agents that feel slower than they should.

The fix exists. The infrastructure is being built. The question is whether you're designing for the world where it's available.


Further reading:

The video that sparked this post: this breakdown of OpenAI's Realtime API architecture. Worth watching if you're building anything agent-adjacent.
