I used to think batching requests for a machine learning model was mostly solved. I have hosted models and served requests. Batch them, send them through the GPU together, get better throughput. Simple enough.
If you skip batching and send one tiny request at a time, the GPU is basically a giant bus carrying one passenger. It still moves. The economics are just awful. This is why batching exists. But LLMs make batching weird.
The Idea
In a traditional web backend or a standard computer vision pipeline, batching is straightforward. You put requests into a queue, wait until you hit a batch size of 4 or 8, slam them into the GPU, and return the results. Standard engineering. A single trip through the model.
input -> model -> output
An image classifier does this. An embedding model does this. You pass in data, the model runs, and you get the result. But LLM serving completely breaks this mental model. LLM generation is iterative. It generates one token at a time.
prompt -> prefill
token 1 -> decode
token 2 -> decode
token 3 -> decode
...
Serving an LLM is not "run the model once." It is a scheduling problem that repeats every token. Treat LLM requests like ordinary inference requests and GPU efficiency drops fast. Latency gets worse. The cloud bill follows.
LLM serving is a loop. Every token is another chance to waste the GPU or fill it.
Prefill, Decode, and the KV Cache
Each request has two phases. In prefill, the model reads the prompt and builds the internal attention state. In decode, the model uses autoregressive decoding to generate text one token at a time, feeding each new token back in to predict the next one.
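The two phases can be sketched as a toy loop. This is only an illustration, not how any real engine is written: `toy_next_token` is a hypothetical stand-in for a transformer forward pass, and the "cache" is just a list of token ids rather than real attention state.

```python
from typing import List, Tuple


def toy_next_token(cache: List[int]) -> int:
    # Hypothetical stand-in for a forward pass: predicts "last token + 1".
    return cache[-1] + 1


def generate(prompt: List[int], eos: int, max_new_tokens: int) -> Tuple[List[int], int]:
    # Prefill: read the whole prompt once and build the cached state.
    cache = list(prompt)
    output: List[int] = []
    # Decode: one model call per generated token.
    for _ in range(max_new_tokens):
        tok = toy_next_token(cache)
        if tok == eos:          # end condition reached
            break
        output.append(tok)
        cache.append(tok)       # feed the new token back in
    return output, len(cache)


output, cache_len = generate(prompt=[1, 2], eos=5, max_new_tokens=10)
```

The point of the sketch is the shape: the prompt is processed once, and then the loop runs an unknown number of times until the end condition fires.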
Decode runs until the request hits an end condition, so request lengths vary wildly. To avoid recomputing attention over the whole history at every step, the server keeps a KV cache (the attention keys and values for every token processed so far) in GPU memory. The cache grows with each generated token, so the scheduler has to keep the GPU busy without running out of that finite memory.
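To get a feel for why the cache is a scarce resource, here is a back-of-the-envelope calculation. The model shape and the 16 GiB budget are made-up numbers chosen only for illustration. The formula counts one key vector and one value vector per layer and per KV head for every cached token.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # Factor of 2: each token stores a key AND a value per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (not taken from any specific model card).
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2  # 2 bytes = fp16
)

# Assume 16 GiB of GPU memory is left for the cache after loading weights.
budget_bytes = 16 * 1024**3
max_cached_tokens = budget_bytes // per_token
```

Under these assumptions each token costs 512 KiB of cache, so the whole budget holds 32,768 tokens shared across every active request.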
Static Batching
Imagine three requests arrive together.
A -> needs 8 output tokens
B -> needs 2 output tokens
C -> needs 6 output tokens
Static batching puts them on the same bus and makes the bus finish the whole trip before taking new passengers.
Step 1: A B C
Step 2: A B C # B is done
Step 3: A _ C
Step 4: A _ C
Step 5: A _ C
Step 6: A _ C # C is done
Step 7: A _ _
Step 8: A _ _ # A is done
Even though B and C finished early, their seats cannot be reassigned. The GPU keeps running, but it wastes memory and computes empty padding tokens. That is the waste.
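The waste is easy to count. A minimal sketch, assuming every request occupies its seat until the longest request in the batch finishes (the function name is mine, not from any library):

```python
from typing import List, Tuple


def static_batch_utilization(output_lengths: List[int]) -> Tuple[int, int, int]:
    # The batch runs until its longest request is done.
    steps = max(output_lengths)
    total_slots = steps * len(output_lengths)   # seats x steps
    useful = sum(output_lengths)                # slots that produced a real token
    wasted = total_slots - useful               # slots spent on finished requests
    return steps, useful, wasted


# Requests A, B, C from the example: 8, 2, and 6 output tokens.
steps, useful, wasted = static_batch_utilization([8, 2, 6])
utilization = useful / (useful + wasted)
```

For the three requests above that is 8 steps and 24 seat-steps, of which 16 produce tokens and 8 are empty, so about a third of the batch capacity is thrown away.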
The Tour Bus vs. The City Bus
I like thinking about the difference between static and in-flight batching as the difference between a pre-booked tour bus and a public city bus. Both are buses, but they operate very differently.
- The Tour Bus (Static Batching): A tour bus leaves the station with a fixed passenger list. If someone gets off at stop 2, their seat stays empty for the rest of the route. The bus does not pick up anyone new. It finishes the tour, returns to the station, and loads the next group.
- The City Bus (In-Flight Batching): A bus that runs a continuous loop. As soon as a passenger reaches their destination and steps off, the bus pauses briefly at the next stop, lets a new passenger board to fill the empty seat, and immediately continues its journey.
In LLM serving, the bus is the active batch. A seat is not just a "batch slot." It represents GPU memory and KV cache capacity. Getting off means a request has hit an end condition. Boarding means a new request has enough memory budget to join the active generation loop.
[Interactive figure: "The bus that changes passengers while moving." Each seat is a batch slot backed by GPU memory and KV cache; each tick is one generation iteration. The scheduler swaps riders at token boundaries when capacity opens up.]
The batch is no longer a fixed group of requests. It changes between tokens.
In-flight Batching
In-flight batching is also called continuous batching or iteration-level batching. Engines like vLLM and TensorRT-LLM make it work by moving the scheduling boundary from the request level to the iteration level.
At every generation iteration, the scheduler asks:
- Which requests are still active?
- Which requests just finished?
- Which new requests are waiting?
- Is there enough KV cache space?
- Can we add new work without hurting latency too much?
Instead of waiting for a batch of requests to finish entirely before running the next batch, the execution engine runs a single forward pass of the transformer model (generating one new token for each active request), briefly checks the queue, and rebuilds the batch for the very next token.
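Here is a toy version of that loop. It is a sketch under loud assumptions, not how vLLM or TensorRT-LLM are implemented: all requests are already queued at the start, prefill is treated as free, and each request reserves KV cache for its full prompt plus output up front. The names `run_inflight`, `seats`, and `kv_budget` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Request = Tuple[str, int, int]  # (name, prompt_tokens, output_tokens)


def run_inflight(requests: List[Request], seats: int, kv_budget: int) -> List[List[str]]:
    """Return the active batch at every generation iteration."""
    waiting: Deque[Request] = deque(requests)
    active: Dict[str, List[int]] = {}   # name -> [tokens_left, kv_reserved]
    kv_used = 0
    timeline: List[List[str]] = []

    while waiting or active:
        # Admit waiting requests while a seat and enough KV cache are free.
        while waiting and len(active) < seats:
            name, prompt_len, out_len = waiting[0]
            need = prompt_len + out_len  # simplification: reserve everything up front
            if kv_used + need > kv_budget:
                break
            waiting.popleft()
            active[name] = [out_len, need]
            kv_used += need
        if not active:
            raise ValueError("next request can never fit in the KV budget")

        # One forward pass: every active request gets exactly one new token.
        timeline.append(sorted(active))
        for name in list(active):
            active[name][0] -= 1
            if active[name][0] == 0:     # request finished: free its seat and cache
                kv_used -= active[name][1]
                del active[name]
    return timeline


# A, B, C from the earlier example plus two hypothetical latecomers D and E.
reqs = [("A", 4, 8), ("B", 4, 2), ("C", 4, 6), ("D", 4, 3), ("E", 4, 4)]
timeline = run_inflight(reqs, seats=3, kv_budget=64)
```

In this toy run D takes B's seat at iteration 3 and E takes D's seat at iteration 6, and all 23 tokens are produced in 9 iterations. A static batcher with the same three seats would need 8 iterations for A, B, C and then 4 more for D and E.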
Of course, production systems are more complicated. They deal with priorities, timeouts, chunked prefill, speculative decoding, multiple GPUs, tensor parallelism, and fairness. But the loop above is the core of the idea.
Why It Matters
The naive way to improve throughput is to make the batch larger. That works until it does not. Bigger batches can increase throughput, but they can also make users wait longer before the first token. In a chatbot, the first token matters a lot. A user can tolerate a long answer streaming over time, but waiting too long before anything appears feels broken. So LLM serving has a slightly different objective than plain inference:
- maximize tokens/sec
- without destroying time to first token
- without wasting KV cache
- without letting long requests block short ones
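The first two goals in this list are easy to measure if you log three timestamps per request. A minimal sketch with made-up trace values; `RequestTrace` and `summarize` are hypothetical names, not part of any serving framework:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTrace:
    arrival_s: float       # when the request entered the queue
    first_token_s: float   # when its first output token was emitted
    finish_s: float        # when its last output token was emitted
    output_tokens: int


def summarize(traces: List[RequestTrace]) -> Dict[str, float]:
    start = min(t.arrival_s for t in traces)
    end = max(t.finish_s for t in traces)
    ttfts = [t.first_token_s - t.arrival_s for t in traces]
    return {
        # Aggregate throughput over the whole window.
        "tokens_per_s": sum(t.output_tokens for t in traces) / (end - start),
        # Time to first token: what a chat user actually feels.
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "max_ttft_s": max(ttfts),
    }


# Illustrative numbers only, not measurements.
traces = [
    RequestTrace(0.0, 0.5, 4.0, 80),
    RequestTrace(1.0, 1.5, 2.0, 20),
    RequestTrace(1.0, 3.0, 5.0, 60),
]
stats = summarize(traces)
```

Tracking the maximum alongside the mean is a deliberate choice: a scheduler can post a healthy average while one request sits in the queue far too long.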
In-flight batching addresses these constraints by keeping the active batch dense with useful work on every iteration. It avoids the empty seats and padding that static batching leaves behind. The model gets most of the attention. The scheduler is what makes serving it affordable.