SSE sucks for transporting LLM tokens


SSE sucks

I’m just going to cut to the chase here. SSE as a transport mechanism for LLM tokens is naff. It’s not that it can’t work, obviously it can, because people are using it and SDKs are built around it. But it’s not a great fit for the problem space.

The basic SSE flow goes something like this:

  1. Client makes an HTTP POST request to the server with a prompt
  2. Server responds with a 200 OK and keeps the connection open
  3. Server streams tokens back to the client as they are generated, using the SSE format
  4. Client processes the tokens as they arrive on the long-lived HTTP connection
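
Here's a minimal sketch of that flow, assuming a plain Node server and a made-up generateTokens() helper standing in for the actual model call:

```ts
import { createServer } from "node:http";

// Stand-in for the real model call: an async iterator of generated tokens.
async function* generateTokens(prompt: string): AsyncGenerator<string> {
  for (const token of ["Why ", "did ", "the ", "chicken ", "..."]) {
    yield token;
  }
}

const server = createServer(async (req, res) => {
  // 1. Client POSTs a prompt.
  let body = "";
  for await (const chunk of req) body += chunk;
  const { prompt } = JSON.parse(body || "{}");

  // 2. Respond 200 OK and keep the connection open as an event stream.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  });

  // 3. Stream tokens back in SSE format as they're generated.
  for await (const token of generateTokens(prompt)) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.end();
});

server.listen(3000);
```

On the client side it's just a fetch POST and a read loop over response.body. The important bit is that everything rides on that single long-lived HTTP response.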

Sure, the approach has some benefits, like simplicity and compatibility with existing HTTP infrastructure. But it still sucks.

When you’re building an app that integrates LLM model responses, the most expensive part of any call is the model inference. The cost of generating the tokens dwarfs the cost of transporting them over the network. So the transport mechanism should be bulletproof. It would suck to have a transport where some network interruption meant that you had to re-run the model inference. But that’s exactly what you get with SSE.

If the SSE connection drops halfway through the response, the client has to re-POST the prompt, the model has to re-run the generation, and the client has to start receiving tokens from scratch again. This is sucky.

SSE might be fine for server-to-server communication where network reliability is high, but for end user client connections over the internet, where connections can be flaky, it’s a poor choice.

If your user goes into a tunnel, or switches networks, or their phone goes to sleep, or any number of other common scenarios, the SSE connection can drop. And then your user has to wait for the entire response to be re-generated. This leads to a poor user experience. And someone has to pay the model providers for the extra inference calls.

And don’t even think about wanting to steer the generation (or AI agent) mid-response. Nope, not gonna happen with SSE. It’s uni-directional after all. Once that prompt is sent, you’re stuck with it for the duration of the response generation. Maybe you choose to cancel the model inference by dropping the connection (given that’s your only feedback mechanism), but then it’s impossible to distinguish between accidental disconnects and intentional cancellations. So you end up having to re-run the entire inference anyway when the user reconnects. RIP resumable streams.
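
To make that concrete, here's roughly what cancellation looks like from the client's side (the endpoint and payload are made up); the server just sees a closed socket either way:

```ts
// With SSE, the only "control channel" is the connection itself, so
// cancelling means aborting the whole request.
const controller = new AbortController();

const response = await fetch("/chat", {
  method: "POST",
  body: JSON.stringify({ prompt: "Summarise this document..." }),
  signal: controller.signal,
});

// Wire the UI's "stop" button to tear down the connection.
document.querySelector("#stop")?.addEventListener("click", () => {
  controller.abort();
});

const reader = response.body!.getReader();
try {
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // ...render the received token bytes...
  }
} catch {
  // From the server's point of view, this abort is indistinguishable from a
  // flaky network: both just look like the socket going away.
}
```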

Expecting more from your transport

At the very least, a transport mechanism for LLM tokens should support resuming from where it left off. If the connection drops, the client should be able to reconnect and request the remaining tokens without having to re-run the entire model inference.

This would require some state management on the server side, to keep track of both the tokens that have been generated, and which of those tokens have been successfully delivered to the client.

If you start to design this, it quickly starts to look like writing every token to your database. You’ve got to store the tokens, track the position the client has received up to, and handle reconnections.
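
Something like this, sketched with an in-memory map where a real setup would use that database or cache:

```ts
// The bookkeeping a resumable stream needs.
type GenerationState = {
  tokens: string[]; // every token generated so far
  done: boolean;    // has the model finished?
};

const generations = new Map<string, GenerationState>();

// Called as the model produces tokens, whether or not a client is connected.
function appendToken(generationId: string, token: string) {
  const state = generations.get(generationId) ?? { tokens: [], done: false };
  state.tokens.push(token);
  generations.set(generationId, state);
}

// Called when a client (re)connects: "I've seen everything up to index N."
function tokensSince(generationId: string, lastSeenIndex: number): string[] {
  const state = generations.get(generationId);
  return state ? state.tokens.slice(lastSeenIndex + 1) : [];
}
```

The per-client cursor can live on the client (sent along on reconnect) or on the server; either way, you're persisting every token somewhere.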

What it actually ends up looking like for lots of folks is this: in the happy path, while the SSE connection is intact, tokens are streamed over SSE. But if the connection drops, the client falls back to polling the server until the entire response has been generated. The experience for the user is sucky; they see some tokens, the connection drops, and then they see nothing until the entire response is finished, at which point all the remaining tokens arrive at once.
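
In code, that fallback tends to look something like this (the endpoints are made up):

```ts
// Stream over SSE while the connection holds; if it drops, poll until the
// whole response is done and dump the remainder in one go.
async function readResponse(id: string, onTokens: (text: string) => void) {
  try {
    const res = await fetch(`/responses/${id}/stream`);
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) return;
      onTokens(decoder.decode(value)); // raw SSE frames; parsing omitted
    }
  } catch {
    // Connection dropped: no more tokens until the whole thing is finished.
    while (true) {
      const poll = await fetch(`/responses/${id}`);
      const body = await poll.json();
      if (body.status === "complete") {
        onTokens(body.text); // everything that was missed arrives in one lump
        return;
      }
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }
  }
}
```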

So why not just use WebSockets?

WebSockets don’t really help. Yes, WebSockets provide a bi-directional communication channel, but they don’t do anything to solve the core problem of resuming from where you left off. If the WebSocket connection drops, the client still has to re-POST the prompt, and the server still has to re-run the model inference. So you’re back to square one.

Wait a sec, I thought SSE was already a resumable protocol?

Well, kind of, but not really. SSE as a protocol runs over an HTTP connection. So it supports headers at the start of the response, and then a stream of events. To make the event stream resumable, you need to put some kind of index/serial/identifier in each event. Then you need to store that index on the server side, and when the client reconnects, it needs to tell the server the index of the last event it received. But you still end up writing every 'event' (read: token) to a database or cache and looking that up on reconnect. You’re halfway there, but it’s a lot of faff.
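
For what it's worth, here's roughly what wiring up SSE's built-in id: and Last-Event-ID fields looks like on the server; the token store is still entirely your problem:

```ts
import { createServer } from "node:http";

const tokens: string[] = []; // stand-in for the database/cache of generated tokens

const server = createServer((req, res) => {
  // EventSource sends this header automatically on reconnect; if you're
  // streaming via fetch, you have to plumb the last id through yourself.
  const lastEventId = Number(req.headers["last-event-id"] ?? -1);

  res.writeHead(200, { "Content-Type": "text/event-stream" });

  // Replay everything the client missed, tagging each event with its index...
  tokens.slice(lastEventId + 1).forEach((token, i) => {
    const id = lastEventId + 1 + i;
    res.write(`id: ${id}\ndata: ${JSON.stringify({ token })}\n\n`);
  });
  // ...then keep streaming new tokens with incrementing ids (omitted).
});

server.listen(3000);
```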

And even if you do that, you’re still fighting the SDKs. For example, with the Vercel AI SDK you have to choose between stream abort and stream resume. You can’t have both.

A better approach: Pub/Sub

So all of this leads to the conclusion that a better approach for transporting LLM tokens is to use a Pub/Sub model. In this model, the client subscribes to a topic for the response tokens, and the server publishes tokens to that topic as they are generated. If the connection drops, the client can simply re-subscribe to the topic and request the remaining tokens without having to re-run the model inference.

The model or server side can continue to push the generated tokens into the topic without having to worry about whether the client is connected or not. The client can consume the tokens at its own pace, and if it disconnects, it can simply re-subscribe and pick up where it left off.
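
Stripped right down, the shape is something like this (an in-memory stand-in; in practice the topic would live in a pub/sub provider or something like Redis Streams):

```ts
type Subscriber = (token: string, offset: number) => void;

class TokenTopic {
  private history: string[] = [];
  private subscribers = new Set<Subscriber>();

  // Model side: publish fire-and-forget, whether or not anyone is connected.
  publish(token: string) {
    const offset = this.history.push(token) - 1;
    for (const sub of this.subscribers) sub(token, offset);
  }

  // Client side: replay from the last offset we saw, then stay live.
  subscribe(fromOffset: number, sub: Subscriber) {
    this.history.slice(fromOffset).forEach((t, i) => sub(t, fromOffset + i));
    this.subscribers.add(sub);
    return () => this.subscribers.delete(sub); // unsubscribe on disconnect
  }
}

// Reconnecting is just re-subscribing with the last offset received.
const topic = new TokenTopic();
topic.publish("Hello");
const unsubscribe = topic.subscribe(0, (t, o) => console.log(o, t));
topic.publish(" world"); // delivered live
unsubscribe();           // simulate a dropped connection
topic.publish("!");      // still recorded in the topic
topic.subscribe(2, (t, o) => console.log("resumed:", o, t)); // picks up "!"
```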

But maybe the sucky thing is…

So yes SSE sucks, but maybe the truly sucky thing is that you’re going to end up paying a pub/sub provider to be your transport. And unless those providers can transport the tokens more cheaply than you can generate them, you’re going to end up paying more for transport than you do for inference.

At that point, you might as well eat the bad UX of SSE, because at least it’s cheap.