March 5, 2026 · Updated March 6, 2026 · 10 min read

Every Tool Call Your Agent Makes Is Resending Your Entire Conversation History

The stateless HTTP API problem is quietly bankrupting your agent in tokens and time — and WebSockets are the fix nobody talks about enough

[Diagram: stateless HTTP (full context re-sent per tool call) vs. persistent WebSocket (incremental deltas only)]

Here's something that should bother you more than it probably does.

Every time your AI agent calls a tool — searches the web, reads a file, queries a database — it re-sends your entire conversation history to the API. The system prompt. Every prior message. Every tool result that came before. All of it, again, from scratch.

For a simple chatbot, this is a footnote. For an agent making 20+ tool calls in a session, this is a structural tax that compounds with every step.

Why the API Knows Nothing About You (By Design)

LLM APIs are stateless. This is not an oversight — it's a deliberate architectural choice. Each HTTP request is independent. The server holds no session. You send a complete snapshot of the conversation, the model responds, and the connection closes.

This is clean. It's horizontally scalable. It makes the API simple to reason about.

But it also means the server can't remember you. Every request is the first request.

The "context window" people talk about — 128K tokens, 200K tokens — that's not a server-side buffer that grows as you chat. It's the payload size limit for a single HTTP request. You're not streaming into a running session. You're reconstructing one from scratch, every time.

Why WebSockets Weren't the Default

This is the question worth asking before accepting WebSockets as the obvious fix.

Historical inertia. LLMs launched as simple request/response APIs. HTTP matched how developers already thought about APIs — send a prompt, get a completion. WebSockets require a different mental model and more complex client code.

Statelessness is operationally easier for providers. Running stateless HTTP at scale is well-understood infrastructure. Maintaining server-side session state across millions of concurrent WebSocket connections is a genuinely hard distributed systems problem — sticky sessions, memory management, failover handling. It's a much bigger operational commitment.

The agentic use case didn't exist yet. When GPT-3 launched, nobody was running 20-tool-call workflows. The chatbot pattern dominated, and for chatbots HTTP is perfectly fine. WebSockets only became obviously necessary once agentic workloads became common — which is recent.

Streaming HTTP was good enough. Server-sent events and chunked HTTP responses gave the feel of a persistent connection for token streaming. It papered over the problem enough that the statelessness wasn't painful for most users.

The ecosystem built around the constraints of early LLMs. Now those constraints are changing faster than the infrastructure assumptions.

Tool Calls Are a Multiplier

In a normal conversation, the context grows linearly. Message by message. Manageable.

Agents break this assumption. An agent doesn't just respond — it acts, observes, and acts again. Each action is a tool call. Each tool call is a round-trip to the API. Each round-trip carries the full context.

Here's what that looks like concretely. Say your agent has:

  • A 2,000-token system prompt
  • A 500-token user task
  • Tool call results averaging 800 tokens each

By tool call 10, you're sending roughly 9,700 tokens per request — and that number grows with every subsequent call. By tool call 20, you're at 17,700 tokens per request. The cumulative token bill for a 20-tool-call session isn't 20 × (average response size). It's the triangular sum of an ever-growing context (and this ignores the model's own intermediate replies, which accumulate too).

Tool call 1:  2,500 tokens sent
Tool call 2:  3,300 tokens sent
Tool call 3:  4,100 tokens sent
...
Tool call 10: 9,700 tokens sent
Tool call 20: 17,700 tokens sent

Total tokens sent: ~202,000
Tokens of actual new information: ~18,500

You're sending roughly 11x the data you actually need to. And you're paying for every token of it.
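If you want to sanity-check those figures against your own agent's numbers, the arithmetic is a short loop. A minimal sketch, using the 2,500-token base and 800-token results assumed above:

# Reproduces the arithmetic above: 2,500-token base context, 800-token tool
# results, 20 tool calls, full history re-sent on every request.
BASE, RESULT, CALLS = 2_500, 800, 20

sent_per_call = [BASE + (n - 1) * RESULT for n in range(1, CALLS + 1)]
total_sent = sum(sent_per_call)
new_information = BASE + CALLS * RESULT   # each token of real content, counted once

print(sent_per_call[0], sent_per_call[9], sent_per_call[-1])   # 2500 9700 17700
print(total_sent)                                              # 202000
print(round(total_sent / new_information, 1))                  # 10.9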

[Chart: token growth per request across 20 tool calls, HTTP vs. WebSocket]

The Speed Problem Is Just as Bad

Token cost gets the headlines. Latency rarely does.

Each HTTP round-trip for a tool call has overhead: connection setup, TLS handshake, serialization of the full context payload, network transit, server-side reconstruction of session state from your payload, inference, response serialization, network transit back.

The connection setup and TLS handshake alone add 50–150ms per call on a good day. Multiply by 20 tool calls. You've added 1–3 seconds of pure infrastructure overhead before a single token is generated.

For an agent that's supposed to feel responsive, this death-by-a-thousand-cuts is real. The inference time is the fast part. The overhead is what slows you down.

[Chart: connection overhead, repeated HTTP handshakes vs. a single persistent WebSocket connection]

What WebSockets Actually Fix

OpenAI's WebSocket mode for the Responses API is the specific mechanism worth understanding here. This is distinct from the Realtime API (which targets voice and audio) — WebSocket mode is built for extended agentic workflows over text.

The key difference is in how turns work. With standard HTTP, every request carries the full conversation. With WebSocket mode, each subsequent turn sends only two things: a previous_response_id referencing the prior turn, and the new input items for this turn. The server maintains connection-local state in memory, so it can continue from where the last response left off without you reconstructing the full history.

# HTTP (every turn)
POST /responses
{ messages: [system_prompt, turn_1, result_1, turn_2, result_2, ..., turn_N] }

# WebSocket mode (after turn 1)
{ previous_response_id: "resp_xyz", input: [new_tool_result] }

A persistent WebSocket connection means:

Delta-based turns, not full context reconstruction. You send new items only. The server reuses its cached state from the previous response. The growing history stops being your problem to transmit.

The connection stays open. No TLS handshake per tool call. No connection setup overhead. The round-trip cost collapses to network latency plus inference time.

The server streams incrementally. Rather than buffering a complete response, you receive tokens as they're generated. First token latency drops dramatically.

OpenAI's own documentation cites up to ~40% faster end-to-end execution for rollouts with 20+ tool calls, based on what they report observing in practice.

Caveat worth noting: that is still a vendor-reported figure, not an independent benchmark. Real-world gains depend heavily on tool call payload sizes, network conditions, and whether you hit the in-memory cache. Treat it as directionally correct, not guaranteed.
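Here's roughly what the delta pattern looks like from the client side. This is a sketch under assumptions: the endpoint URL and event shapes are illustrative, and only the previous_response_id chaining is the point; check the current Responses API documentation for the real payloads.

# Sketch of delta-chained turns over a persistent connection. Endpoint URL and
# payload/event shapes are assumptions for illustration, not the documented API.
import json, os
import websocket   # pip install websocket-client

ws = websocket.create_connection(
    "wss://api.openai.com/v1/responses",   # hypothetical endpoint
    header={"Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]},
)

# Turn 1: the only turn that carries the full setup.
ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "instructions": "You are a research agent.",             # system prompt
        "input": "Find the latest filings and summarise them.",
    },
}))
first = json.loads(ws.recv())   # simplification: in reality, read streamed events until the turn completes

# Turn 2: a delta. A pointer to the previous response plus only the new items;
# the server continues from its cached state instead of re-reading the history.
ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "previous_response_id": first["response"]["id"],          # assumed event shape
        "input": [{"type": "tool_result", "output": "..."}],      # new items only
    },
}))
ws.close()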

Advanced Patterns: Warmup and Compaction

Two capabilities in the WebSocket API that don't get enough attention.

The Warmup Trick (generate: false)

You can send response.create with generate: false before the user asks anything. This pre-loads your tools, system prompt, and static context onto the connection — OpenAI prepares the request state and returns a response ID without running inference.

When the actual task arrives, the first real turn starts faster because setup is already done. Think of it like mise en place — kitchen prep before orders come in.
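As a sketch, the warmup is one more response.create on the open connection before the real task arrives, with generation switched off. The exact field layout below is an assumption; generate: false is the piece that matters.

# Warmup sketch, reusing the `ws` connection from the earlier example.
# Payload layout is an assumption; the generate flag is what matters.
ws.send(json.dumps({
    "type": "response.create",
    "response": {
        "instructions": "You are a research agent.",   # static system prompt
        "tools": [],                                   # tool schemas go here
        "generate": False,                             # prepare state, skip inference
    },
}))
warmup = json.loads(ws.recv())
warmup_id = warmup["response"]["id"]   # chain the first real turn off this ID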

Context Compaction (/responses/compact)

Even with delta-chaining, context eventually grows too long over a very extended session. The /responses/compact endpoint takes your full conversation history and returns a compressed summary of it. You use that as the new starting point — effectively resetting the chain with a smaller footprint.
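A sketch of the reset, assuming compaction is a plain HTTPS call and that you chain subsequent turns off whatever it returns. The request and response fields here are guesses for illustration; only the endpoint path comes from the description above.

# Compaction sketch: trade the long history for a compressed summary, then chain
# future turns off the compacted result. Request/response fields are assumptions.
import os, requests

resp = requests.post(
    "https://api.openai.com/v1/responses/compact",
    headers={"Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]},
    json={"previous_response_id": "resp_xyz"},   # the chain you want to compress
)
compacted_id = resp.json()["id"]   # new, smaller starting point for the next turn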

This is OpenAI's answer to "what happens when even the incremental history gets unwieldy." The warmup + WebSocket + compaction combination is the full stack for long-running agents.

Prompt Caching: The Partial Fix

Prompt caching (available from both OpenAI and Anthropic) works differently from WebSocket mode and solves a different problem.

Caching works by hashing the prefix of your request. If you send the same system prompt and conversation history again, the provider recognises the repeated prefix and skips reprocessing it — you pay a fraction of the token cost.

But you're still transmitting the full context every request. The bandwidth overhead is still there. You're just paying less to process it.

|                          | Prompt Caching            | WebSocket Mode                  |
|--------------------------|---------------------------|---------------------------------|
| Transmission             | Full context every time   | Delta only                      |
| Processing cost          | Reduced for cached prefix | Reduced (less input)            |
| Latency saving           | Some (cache hit)          | More (less data + no handshake) |
| Works across connections | Yes                       | No (per-connection, in-memory)  |
| Works with any provider  | Anthropic + OpenAI        | OpenAI only right now           |
| Shared across users      | Yes (same prefix)         | No                              |

One underrated advantage of caching: if all your users share the same system prompt, only the first request pays full processing cost. Everyone else hits the cache. WebSocket can't do that — each connection is isolated.

In practice for a heavy agentic workflow, you'd want both — caching handles the static prefix cheaply, WebSocket handles the growing tool call history.
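If you go the caching route (or combine both), the practical rule is to keep the static prefix identical across requests and put it first. With Anthropic you mark the cacheable block explicitly; a minimal sketch of that is below (OpenAI's prefix caching is automatic, so there's nothing to annotate).

# Minimal prompt-caching sketch using Anthropic's explicit cache_control marker.
# The long static system prompt goes first and is flagged as cacheable, so only
# the first request pays full processing cost for it.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are a research agent. ...",   # the big static prefix
        "cache_control": {"type": "ephemeral"},    # mark it as cacheable
    }],
    messages=[{"role": "user", "content": "Find the latest filings and summarise them."}],
)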

The Real Drawbacks

WebSockets aren't a silver bullet. The tradeoffs deserve more airtime than they usually get.

Server-side state is a black box. With HTTP you own your conversation history. With WebSocket mode, state lives in OpenAI's memory. If the connection drops mid-agent, you're potentially in an inconsistent state with no clean way to inspect what the server thinks happened.

The in-memory cache only holds one previous response. Not the full history — just the most recent response state. Trying to branch mid-conversation from an earlier turn won't work from cache. Think of it like git: you only have one HEAD.

Debugging gets harder. HTTP is trivially loggable. With WebSocket delta-mode, reconstructing "what did the agent know at step 15?" requires you to track incremental events yourself. Tools like LangSmith exist for HTTP agents — the WebSocket equivalent is less mature.

Vendor lock-in. previous_response_id is OpenAI-specific. Any agent built around this pattern is tightly coupled to their infrastructure. Switching to Claude or Gemini means rearchitecting.

The 60-minute connection limit is real. Very long-running tasks need a reconnect strategy built in from the start. And there's no multiplexing: one in-flight response per connection means parallel tool calls require multiple connections.

Error eviction. If a turn fails (4xx/5xx), the service evicts the referenced previous_response_id from cache. You need explicit error handling to recover gracefully.
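In practice that means the client needs a fallback path: if a delta turn fails and the referenced state is gone, re-send the full context once to re-establish it, then go back to chaining deltas. A rough sketch, with hypothetical helper names and an assumed error shape:

# Recovery sketch for evicted state. Helper names and the error-detection shape
# are hypothetical; the point is the fallback from delta mode to one full turn.
import json

class PreviousResponseEvicted(Exception):
    """Hypothetical: the server no longer has the referenced previous response."""

def send_delta_turn(ws, previous_id, new_items):
    ws.send(json.dumps({"type": "response.create",
                        "response": {"previous_response_id": previous_id,
                                     "input": new_items}}))
    event = json.loads(ws.recv())
    if event.get("type") == "error":        # assumed error event shape
        raise PreviousResponseEvicted(event)
    return event

def send_full_turn(ws, history):
    # Pay the full-context cost once to rebuild server-side state.
    ws.send(json.dumps({"type": "response.create",
                        "response": {"input": history}}))
    return json.loads(ws.recv())

def send_turn(ws, previous_id, new_items, full_history):
    try:
        return send_delta_turn(ws, previous_id, new_items)
    except PreviousResponseEvicted:
        return send_full_turn(ws, full_history + new_items)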

Will Anthropic and Google Adopt This?

The bandwidth cost argument is compelling, but there's a stronger forcing function: context windows are getting enormous.

When you're regularly working with 100K–1M token contexts, re-transmitting that on every tool call isn't just expensive — it's a latency killer regardless of caching. Anthropic's prompt caching is a stepping stone that shows they're aware of the problem. Google with Gemini's 1M token context probably feels this pain most acutely.

The direction is clear. The question is whether someone builds a provider-agnostic standard, or whether each provider ships their own flavour and we end up with fragmentation.

For now, most agents are running HTTP and paying the stateless tax on every step. The fix exists. The infrastructure is being built. The question is whether you're designing for the world where it's available.

What This Means If You're Building

  • Fewer than 5 tool calls per session: HTTP is fine. Don't add WebSocket complexity.
  • 20+ tool calls, long-running workflows: the math changes fast. Model this in terms of your actual token costs.
  • Commercial agent product at scale: this is a unit economics problem that will bite before you expect it.
  • Building now: architect for WebSocket mode from the start. Retrofitting is painful.

Further reading:

The video that sparked this post: this breakdown of OpenAI's Realtime API architecture. Worth watching if you're building anything agent-adjacent.
