How to make SSE token streams resumable, cancellable, and multi-device — /dev/knill


Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks.

But a lot of folks are saying: “No, you can just use Server-Sent Events (SSE) with Last-Event-ID to get a durable stream, it’s easy”. And yes, all of this is do-able. But I contest that it’s easy. So let’s walk through how to do it, and you can decide for yourself.

Catch up on the previous article and discussion

The advanced chatbot features I want to walk through are:

  • Resumable streams — refresh the page mid-response and get the in-progress tokens back, instead of waiting for the full response to land in the database.
  • Cancellations — stopping the LLM mid-response when the user changes their mind, even though the connection is now allowed to drop and reconnect.
  • Multi-device — open the same conversation on a second device or browser, and have it pick up the in-flight response and any new prompts in realtime.

Each of these is do-able on SSE. Whether they’re easy is what we’re going to find out.

Tokens vs. the API responses

Tokens are the individual pieces of text that LLMs generate, but the actual responses you get back from LLM providers have a bunch more stuff in them. The responses have slightly different structure and format, but pretty much all follow a similar pattern.

Some kind of ‘start’ event, some ‘delta’ events that contain text or tool call requests, and then some kind of ‘end’ event.

To get the full response text, you either concatenate the text deltas together, or some of the APIs will give you the ‘full’ response as its own event type at the end.

Vercel AI SDK:

{"type":"text-delta","value":"Let me"}
{"type":"text-delta","value":" look that up."}
{"type":"tool-call-streaming-start","value":{"toolCallId":"call_001","toolName":"search"}}
{"type":"tool-call-delta","value":{"toolCallId":"call_001","argsTextDelta":"{\"query\":\"weather Belfast\"}"}}
{"type":"tool-call","value":{"toolCallId":"call_001","toolName":"search","args":{"query":"weather Belfast"}}}
{"type":"tool-result","value":{"toolCallId":"call_001","result":{"temp":"14°C","condition":"cloudy"}}}
{"type":"text-delta","value":"It's currently"}
{"type":"text-delta","value":" 14°C and cloudy"}
{"type":"text-delta","value":" in Belfast."}
{"type":"finish-message","value":{"finishReason":"stop","usage":{"promptTokens":30,"completionTokens":28}}}

OpenAI Responses API:

{"event":"response.created","data":{"id":"resp_abc123","object":"response","status":"in_progress","model":"gpt-4o"}}
{"event":"response.output_item.added","data":{"output_index":0,"item":{"id":"item_001","type":"message","role":"assistant"}}}
{"event":"response.content_part.added","data":{"output_index":0,"content_index":0,"part":{"type":"output_text","text":""}}}
{"event":"response.output_text.delta","data":{"output_index":0,"content_index":0,"delta":"Hello"}}
{"event":"response.output_text.delta","data":{"output_index":0,"content_index":0,"delta":"! How can I"}}
{"event":"response.output_text.delta","data":{"output_index":0,"content_index":0,"delta":" help you today?"}}
{"event":"response.output_text.done","data":{"output_index":0,"content_index":0,"text":"Hello! How can I help you today?"}}
{"event":"response.output_item.done","data":{"output_index":0,"item":{"id":"item_001","type":"message","role":"assistant","content":[{"type":"output_text","text":"Hello!  How can I help you today?"}]}}}
{"event":"response.completed","data":{"id":"resp_abc123","object":"response","status":"completed","usage":{"input_tokens":25,"output_tokens":12}}}

Anthropic API:

{"event":"message_start","data":{"type":"message_start","message":{"id":"msg_01XFDUDYJgAACzvnptvVoYEL","type":"message","role":"assistant","content":[],"model":"claude-sonnet-4-20250514"}}}
{"event":"content_block_start","data":{"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}}
{"event":"content_block_delta","data":{"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}}
{"event":"content_block_delta","data":{"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"!  How"}}}
{"event":"content_block_delta","data":{"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" can I"}}}
{"event":"content_block_delta","data":{"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" help you"}}}
{"event":"content_block_delta","data":{"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" today?"}}}
{"event":"content_block_stop","data":{"type":"content_block_stop","index":0}}
{"event":"message_delta","data":{"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":12}}}
{"event":"message_stop","data":{"type":"message_stop"}}

What you’ll notice is that each ‘line’ of these responses contains quite a lot of metadata for not very much text-delta data.

For example, a single event (line) from the Anthropic API contains 125 characters for just 5 characters of text delta:

{
  "event": "content_block_delta",
  "data": {
    "type": "content_block_delta",
    "index": 0,
    "delta": {
      "type": "text_delta",
      "text": "Hello"
    }
  }
}

Why does this matter? It matters as soon as you decide to store every token, which we will get to in a sec.

Baseline: How to stream tokens to one client

This is table-stakes, right?

Starting with the basics that pretty much all AI chatbot demos have. A user makes an HTTP POST request with the prompt and current conversation history to the server. The server runs the agent, and the client keeps the connection open, waiting for an SSE stream of response tokens from the agent.

SSE agent server architecture

Once streaming is finished, the server stores the full response in the database. The ‘conversation history’ stored in the database is really to re-hydrate the client on page refresh; it’s not for anything on the server.
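
For reference, here's a minimal sketch of that baseline, assuming Node with Express and placeholder `streamAgent` / `saveResponse` helpers (both made up for illustration):

```ts
import express from "express";

const app = express();
app.use(express.json());

// Placeholder: in a real app this calls your LLM provider's streaming API
// and yields text deltas as they arrive.
async function* streamAgent(prompt: string, history: unknown[]): AsyncGenerator<string> {
  yield "Hello";
  yield "! How can I help you today?";
}

// Placeholder: writes the finished response to your database.
async function saveResponse(conversationId: string, text: string): Promise<void> {}

app.post("/conversations/:id/messages", async (req, res) => {
  // Hold the connection open and stream tokens back as SSE.
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  let full = "";
  for await (const delta of streamAgent(req.body.prompt, req.body.history)) {
    full += delta;
    res.write(`data: ${JSON.stringify({ type: "text-delta", value: delta })}\n\n`);
  }

  // Only once streaming finishes does the full response land in the database.
  await saveResponse(req.params.id, full);
  res.end();
});

app.listen(3000);
```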

This works fine for one user, on one device, on one connection that doesn’t drop. But out of the box, you get the behaviour in the gif below: if you refresh the page, the in-progress token stream is lost, and the response only becomes available again once the stream has finished and the ‘full’ response is stored in the database.

Claude.ai page refresh

How to make the stream resumable

The SSE spec has a mechanism for resuming streams, which is based on the HTTP header Last-Event-ID.

The idea is that if every ‘event’ in your server-sent events stream has a unique ID, then the client can keep track of the last event it received, and when the connection drops, it can reconnect and tell the server to resume from that last event.

Maybe you decide that each ‘response’ will have an id, say: abcxyz. And each token in-order will have an index. So abcxyz:0 is the first token, abcxyz:1 is the second token, and so on.
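
On the wire, that means attaching an `id:` field to every SSE event; the browser’s `EventSource` (or your fetch-based client) remembers the last one it saw and sends it back as the `Last-Event-ID` header when it reconnects. A sketch of the two helpers involved, using the `abcxyz:N` scheme just described (the names are made up):

```ts
// Every SSE event gets an id the client can resume from, e.g.
//
//   id: abcxyz:41
//   data: {"type":"text-delta","value":" cloudy"}
//
function writeTokenEvent(
  res: { write(chunk: string): void },
  responseId: string,
  index: number,
  text: string,
) {
  res.write(`id: ${responseId}:${index}\n`);
  res.write(`data: ${JSON.stringify({ type: "text-delta", value: text })}\n\n`);
}

// On reconnect, parse the Last-Event-ID header to work out where to resume from.
function parseLastEventId(header: string | undefined): { responseId: string; index: number } | null {
  if (!header) return null;
  const [responseId, index] = header.split(":");
  return { responseId, index: Number(index) };
}
```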

Horizontal scale architecture

Most applications are built on an architecture like the one above, where there are a number of stateless, horizontally scalable server replicas that can handle client requests. Data is stored in the database.

So all the tokens, from each LLM response, need to be stored in the database or in some token cache in case the ‘resume’ request with the Last-Event-ID is routed to a different server replica than the one that started the stream.

Because with the stateless server replica design, any replica can handle any user request. All the state lives in some database, and the replicas look it up on demand.
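
So the resume path ends up looking something like the sketch below, assuming the Express `app` from the baseline sketch and a hypothetical token store keyed by response ID and index:

```ts
// Placeholder token store; any replica can query it.
const tokenStore = {
  async getTokensAfter(responseId: string, afterIndex: number): Promise<{ index: number; text: string }[]> {
    // e.g. SELECT index, text FROM tokens WHERE response_id = $1 AND index > $2 ORDER BY index
    return [];
  },
};

app.get("/responses/:responseId/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");

  const header = req.headers["last-event-id"];
  const lastIndex = typeof header === "string" ? Number(header.split(":")[1]) : -1;

  // 1. Replay everything the client missed from the shared token store,
  //    because this replica probably isn't the one running the LLM call.
  const missed = await tokenStore.getTokensAfter(req.params.responseId, lastIndex);
  for (const token of missed) {
    res.write(`id: ${req.params.responseId}:${token.index}\n`);
    res.write(`data: ${JSON.stringify({ type: "text-delta", value: token.text })}\n\n`);
  }

  // 2. Then keep watching the store (poll it, or tail a notification channel)
  //    for new tokens until the response is marked finished.
  // ...
});
```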

Resume on a different server replica

The problem is that you now need to write every token to the database. And you’re writing these tokens to a database on the off-chance that the client might drop. For successful requests where the HTTP SSE stream stays intact and delivers all the tokens, you will never need the per-token data that you’re writing to your database.

Remember that each token event has a lot of metadata for not that much text delta? Well, that means a lot of database writes for not much text. A lot of writes, for not much value.

And as soon as that response is finished by the LLM, the individual tokens are useless, because the ‘full’ response supersedes them all. So once the stream completes, you then need to go and clean up all the individual tokens in favour of the ‘full response’.

This is a pretty expensive write amplification for a feature that is only needed in the case of dropped connections. It used to be 1 request, 1 response, and a couple of database queries. Now you’ve got a db write per token. (And don’t tell me you can batch them without thinking that through, because batching is just promising the client it can resume from an ID that you might not yet have in the db.)

How to handle cancellations

Cancellations get a bit awkward once you’ve made the stream resumable. Maybe before you assumed that a dropped connection from the client meant the server could cancel the request. But now, you assume that the client might come back with a Last-Event-ID to resume the stream, so a dropped connection doesn’t mean a cancellation any more.

Instead, you need to thread cancellations through the database too. A separate POST /cancel/{response_id} endpoint that writes a cancel marker into the same shared store the LLM inference process is writing to. The process handling the inference checks for the marker between tokens, and aborts the upstream LLM call if it sees one.

The replica that’s handling the LLM response might not be the replica that receives the cancellation request from the client, so the shared store is what routes the cancel through.
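
A sketch of both halves, again assuming the Express `app` from earlier and a placeholder shared store that every replica can see:

```ts
// Placeholder shared store (database or cache) visible to every replica.
const cancelStore = {
  async markCancelled(responseId: string): Promise<void> { /* e.g. UPDATE responses SET cancelled = true */ },
  async isCancelled(responseId: string): Promise<boolean> { return false; },
};

// Any replica can take the cancel request and write the marker.
app.post("/cancel/:responseId", async (req, res) => {
  await cancelStore.markCancelled(req.params.responseId);
  res.status(202).end();
});

// On the replica that's actually running the inference: check the marker between
// tokens, and abort the upstream LLM call if it has been set.
async function pumpTokens(
  responseId: string,
  llmStream: AsyncIterable<string>,
  upstream: AbortController,
  onToken: (text: string) => Promise<void>,
) {
  for await (const delta of llmStream) {
    if (await cancelStore.isCancelled(responseId)) {
      upstream.abort(); // cancels the fetch/SDK request to the LLM provider
      break;
    }
    await onToken(delta);
  }
}
```

Checking the store on every token adds yet more read traffic, so in practice you’d probably check every N tokens or on a timer, which trades off cancellation latency instead.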

How to support multi-device

Multi-device is two problems, not one, and they don’t have the same answer.

You’re already some of the way there for the first problem, because you’re storing the individual tokens of the response in the database. So you can serve those to multiple devices. A second device can request the same conversation and receive the history and then the token stream for any in-flight responses.

But then there’s the second problem:

If device A sends a new prompt and starts receiving a token stream response, how does device B know there’s a new prompt and response that it should render?

… well device B doesn’t know.

Folks would tell you that you need all your clients to poll the server for new data. In “Patterns for building realtime features” we discussed why polling sucks; it’s a trade-off between latency if you poll infrequently and hammering your servers with traffic if you poll frequently. Neither is great.

Of course you can make polling longer, with, yunno, long-polling. But that’s still polling.

Or, use a transport that does this for you

Warning

Stop reading here if you just wanted the how-to. Because I’m going to talk about what I think is better, and that is probably too ‘commercial’ for some folks.

All these features are doable on SSE; that’s what this post has been about. But I contest that they are ‘easy’. They are janky and inefficient solutions to work around the fact that HTTP is just not a good transport for streaming LLM tokens and for building async agentic applications.

Ably pub/sub channel

I work for Ably, and I’m building a dedicated transport for AI applications that supports token streaming really well. It’s based on the pub/sub pattern in the diagram above.

The key differences from HTTP-based SSE streaming are:

  1. The pub/sub channel still exists even if a client drops the connection to it, so the server can keep publishing tokens, and those tokens are available to the client the moment it reconnects. This decouples the connection lifetime from the agent lifecycle.
  2. Multiple users or multiple devices can connect to the same pub/sub channel and get the exact same token stream. The channel makes sure that clients get the tokens in realtime, and the Ably SDKs automatically handle reconnection, rewind for missed tokens, and history.
  3. The channel automatically compacts the token-deltas into full responses, so clients who are catching up only get 1 message for each full response instead of streaming every single token.
  4. Cancellations, interrupts, and steering are easy, because the channel routes the interrupt from the client that publishes it to the server process that’s running the agent. No more routing cancellations or follow-ups through the database; the channel handles that routing for you automatically.
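
Purely to illustrate that pub/sub shape, here is a rough sketch using Ably’s standard pub/sub SDK (not the dedicated AI transport itself; the channel and event names are made up, and a real client would use token auth rather than an API key):

```ts
import * as Ably from "ably";

// Server side: publish each token to the conversation's channel as it arrives.
// The channel outlives any single connection, so a client dropping doesn't stop the agent.
const ably = new Ably.Realtime({ key: process.env.ABLY_API_KEY! });

async function publishTokens(conversationId: string, llmStream: AsyncIterable<string>) {
  const channel = ably.channels.get(`conversation:${conversationId}`);
  for await (const delta of llmStream) {
    await channel.publish("token", { text: delta });
  }
  await channel.publish("done", {});
}

// Client side (any device, any number of them): attach with rewind to catch up on
// missed tokens, then keep receiving new ones in realtime.
async function attachToConversation(conversationId: string, onToken: (text: string) => void) {
  const client = new Ably.Realtime({ key: process.env.ABLY_API_KEY! });
  const channel = client.channels.get(`conversation:${conversationId}`, {
    params: { rewind: "100" }, // replay recent messages on attach
  });
  await channel.subscribe("token", (message) => onToken(message.data.text));
}
```

The important part is that both sides talk to the channel rather than to each other, so reconnection, catch-up, and fan-out to extra devices come from the channel instead of from your database.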

So yes, you can build all these things if you want. But I’m not convinced they are ‘easy’, or that SSE over HTTP is a suitable transport for streaming tokens from LLMs and building async agentic applications.