On the Summation app, every LLM request starts with the same large prefix: a system prompt plus a client-specific semantic layer with table schemas, column descriptions, and business rules. That prefix can be thousands of tokens long and stay unchanged for hours. The user’s actual question is often a single sentence.
We were paying full price to recompute that identical prefix on every request. Time to first token was dominated by work we had already done.
Prompt caching fixed the economics. For stable prefixes, input costs dropped by about 10x, and cached requests started responding much faster.
But most explanations stop at “identical prefixes get cheaper.” They do not explain what is being reused, why whitespace can break a hit, or why temperature does not matter.
This post builds the mechanism from first principles. We will walk a prompt through the transformer pipeline, isolate the deterministic parts, and show why providers can reuse them across requests.
The assembly line
When you send text to an LLM, it passes through four stages. Think of it as an assembly line: text enters one end, a predicted token comes out the other.
Let’s trace a real prompt, “May the force”, through each stage.
Stage 1: Tokens
LLMs do not operate on raw text. They operate on numbers, so the first step is converting text into a sequence of integers.
That is what the tokenizer does. It chops text into tokens, usually using a learned subword vocabulary rather than whole words or single characters.
"May the force" becomes [7108, 262, 3081]. Three integers. The word “May” maps to 7108. Always. No matter what you set the temperature to. It is a deterministic lookup.
Key property
Given a tokenizer, the same text always produces the same token IDs. This is pure lookup, no randomness, no model weights involved. Remember this. It’s the foundation that makes caching possible.
Some tokens map neatly to words. Others map to fragments. The important part for caching is not where the boundaries land. It is that the same text produces the same token IDs.
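To make the determinism concrete, here is a toy tokenizer with a hand-picked four-entry vocabulary (the IDs match the example above; a real tokenizer uses a learned subword vocabulary with tens of thousands of entries, but the property is identical):

```python
# Toy subword tokenizer: with a fixed vocabulary, encoding is a pure lookup.
# Illustrative only -- real tokenizers use learned merge rules (e.g. BPE),
# but the determinism property is the same.
VOCAB = {"May": 7108, " the": 262, " force": 3081, " be": 307}

def encode(text, vocab=VOCAB):
    tokens = []
    while text:
        # Greedily match the longest vocabulary entry at the current position.
        match = max((t for t in vocab if text.startswith(t)), key=len)
        tokens.append(vocab[match])
        text = text[len(match):]
    return tokens

print(encode("May the force"))  # [7108, 262, 3081] -- every single time
```

No model weights, no randomness, no sampling parameters: just a table lookup.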
Stage 2: Embeddings
Token IDs are just labels. The embedding stage maps each token ID into a dense vector the model can compute with.
After embedding, each token becomes a dense vector:
Token 1 ("May"): [0.12, -0.84, 0.33, 0.07, ...] ← 3,072 numbers
Token 2 ("the"): [0.56, 0.21, -0.17, 0.91, ...]
Token 3 ("force"): [-0.03, 0.67, 0.44, -0.22, ...]
Stack these row by row and you get a matrix. 3 rows (one per token) by 3,072 columns (one per dimension). Call this matrix E.
The model also injects position information. In the simplified picture, you can think of it as combining each token embedding with a position vector. Whether the implementation uses additive embeddings, rotary encodings, or something else, the key fact is the same: token "the" at position 2 is not the same state as token "the" at position 5.
This is how the model distinguishes "dog bites man" from "man bites dog". Same tokens, different positions, different input states.
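A minimal numpy sketch of this stage, with tiny dimensions so it fits on screen (a real model would use something like 3,072 dimensions per token, and the tables would hold trained weights rather than random numbers):

```python
import numpy as np

# Tiny dimensions for the demo; a real model uses e.g. 3,072 per token.
rng = np.random.default_rng(0)
vocab_size, d_model, max_len = 10_000, 8, 128

# Learned lookup tables (random here; frozen once training is done).
token_emb = rng.standard_normal((vocab_size, d_model))
pos_emb = rng.standard_normal((max_len, d_model))

def embed(token_ids):
    positions = np.arange(len(token_ids))
    # Simplified additive position encoding: token identity + position.
    return token_emb[token_ids] + pos_emb[positions]

E = embed([7108, 262, 3081])      # "May the force"
print(E.shape)                    # (3, 8): one row per token
# Same token at different positions -> different input states:
print(np.array_equal(embed([262, 262])[0], embed([262, 262])[1]))  # False
```

Run `embed` twice on the same IDs and you get the same matrix, bit for bit. That determinism carries forward into everything that follows.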
Key property
The same token at the same position always produces the same input state. Same prompt + same model = the same starting representation.
Stage 3: Attention
This is the heart of a transformer. This is where the model goes from isolated token representations to contextual understanding.
Here’s the fundamental question attention answers:
For each token, how much should it care about the tokens before it?
“May” by itself is vague. Is it a month? A permission verb? A wish?
But "May the force" is already much less ambiguous. Attention is the mechanism that wires those relationships together.
The Q, K, V framework
The model learns three projections:
- Q (Query): what this token is looking for
- K (Key): what this token offers
- V (Value): the content that gets mixed into the result
I personally like the library analogy:
You walk in with a question (your Query). Every book on the shelf has a label on its spine (its Key). When your question matches a label, you pull that book and read its contents (its Value).
In shorthand:
Q = E × W_Q
K = E × W_K
V = E × W_V
These projections are recomputed in every transformer layer from that layer’s current hidden state. For prompt caching, the important part is simpler: for a fixed prefix, the model will compute the same K and V every time.
The attention calculation
Now the mechanism.
Take the last token in our prompt: "force". The model builds a query for "force" and compares it against the keys of every token it is allowed to see. Higher score means higher relevance.
In other words: compare the current token against every earlier token and score how relevant each one is. But there is a constraint. A token can only look left. The model applies a causal mask that blocks each token from seeing anything after it.
          May   the   force
May        ✓     ✗     ✗
the        ✓     ✓     ✗
force      ✓     ✓     ✓
"May" sees only itself. "the" sees "May" and itself. "force" sees all three. Earlier rows never depend on later tokens.
After masking, those rows are still just raw relevance scores. The model next normalizes each row into attention weights, using softmax so the weights add up to 1. It then uses those weights to mix the values: tokens with higher scores contribute more, tokens with lower scores contribute less. The result is a new representation at each position enriched by the tokens before it.
Softmax, in plain English: take a row of scores and turn it into a set of shares that add up to 1, so the model can treat them as “how much attention goes to each token.”
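The whole calculation fits in a few lines of numpy. This is a single-head sketch with random weights, not a faithful production implementation (real models use multiple heads and per-layer projections), but every step above appears in order: projections, scores, causal mask, softmax, value mixing:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(E, W_Q, W_K, W_V):
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # raw relevance scores
    n = scores.shape[0]
    scores[np.triu_indices(n, k=1)] = -np.inf  # causal mask: no looking right
    weights = softmax(scores)                  # each row now sums to 1
    return weights @ V                         # mix values by relevance

rng = np.random.default_rng(0)
d = 8
E = rng.standard_normal((3, d))                # "May the force", embedded
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = causal_attention(E, W_Q, W_K, W_V)
print(out.shape)  # (3, 8): one context-enriched vector per position
```

Setting masked positions to negative infinity before softmax is the standard trick: `exp(-inf)` is zero, so blocked tokens get exactly zero attention weight.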
Now here is the key step. What happens when we append "be"?
The grid grows by one row and one column, but the first three rows are unchanged. "May", "the", and "force" still see exactly what they saw before. Appending a new token creates new work only for that token’s row.
That is the fact prompt caching depends on. Once the model has computed attention state for earlier positions, later tokens cannot invalidate it.
This happens in every attention layer. You do not need every implementation detail for prompt caching. You only need this consequence: the result for earlier tokens does not change when you append later ones.
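You can verify that consequence numerically. The sketch below (random weights, single head, illustrative only) runs causal attention on "May the force", then again with a fourth token appended, and compares the first three output rows:

```python
import numpy as np

def attn(E, W):
    """Single-head causal attention; W holds the Q, K, V projections."""
    Q, K, V = E @ W[0], E @ W[1], E @ W[2]
    s = Q @ K.T / np.sqrt(K.shape[-1])
    s[np.triu_indices(len(E), k=1)] = -np.inf       # causal mask
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((3, d, d))
E3 = rng.standard_normal((3, d))                    # "May the force"
E4 = np.vstack([E3, rng.standard_normal((1, d))])   # ... plus "be"

out3, out4 = attn(E3, W), attn(E4, W)
# First three output rows are identical: appending "be" created
# new work only for its own position.
print(np.allclose(out3, out4[:3]))  # True
```

The mask guarantees it: rows for earlier positions never see the new column, so their scores, weights, and outputs are untouched.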
But we have only processed the input so far. How does the model turn all of this into an actual word?
Stage 4: Prediction
After the final attention layer, each position holds a context-aware representation. But for next-token prediction, we only care about the last position.
The model projects that final vector into vocabulary-sized logits:
logits = final_embedding × W_out
probs = softmax(logits)
This is where temperature enters the picture. Temperature scales the logits before softmax. It changes how the model samples from the distribution, but it does not change tokenization, embeddings, K, or V.
That detail matters because it explains a common confusion: changing temperature can change the output, but it does not invalidate a prompt cache.
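A small sketch of where temperature lives (toy logits, not from a real model). Dividing the logits by the temperature happens after all the prefix work is done:

```python
import numpy as np

def sample_probs(logits, temperature=1.0):
    # Temperature scales logits *before* softmax -- nothing upstream changes.
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # pretend three-token vocabulary

p_cold = sample_probs(logits, temperature=0.5)  # sharper distribution
p_hot = sample_probs(logits, temperature=2.0)   # flatter distribution
print(p_cold.round(3), p_hot.round(3))
```

Low temperature concentrates probability on the top token; high temperature spreads it out. But the logits themselves, and everything that produced them, are identical in both cases.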
The model samples from this distribution and produces a single token: "be".
Now what? We need to generate the rest of the response.
The generation loop
LLMs generate text one token at a time. After running our 3-token prompt through all the stages, the model predicts "be". To get the next word, it appends that token and continues with "May the force be".
The important thing to notice is not that the model adds one new token. It is that, naively, it has to revisit the whole prefix to do it.
After "May the force" comes "be", then maybe "with", then "you". At each step, the visible context grows:
"May the force"
"May the force be"
"May the force be with"
"May the force be with you"
Without caching, the model keeps recomputing attention state for the same earlier tokens again and again. "May" gets processed on every decode step. So does "the". So does "force".
That is the waste.
For a toy prompt, it looks harmless. For a real application, it is not. If your system prompt is 2,000 tokens long and you generate 100 output tokens, the model has to keep dragging that long prefix through the loop. The most expensive part of the input is also the part that changed the least: not at all.
This is the first place caching pays off. There are really two separate opportunities:
- KV caching saves repeated work within a single request during decode.
- Prompt caching saves repeated prefill work across multiple requests that share a prefix.
KV caching
The core insight
For a causal transformer, the computed state for position i depends only on the prefix up to position i and the model’s weights. Appending token i + 1 cannot change it retroactively.
That is exactly what KV caching exploits.
The model stores the K and V tensors for the tokens it has already processed. On the next generation step, it computes fresh Q, K, and V only for the new token, appends the new K and V rows, and reuses the rest.
The pattern to notice:
- Without KV cache: the model keeps recomputing old rows
- With KV cache: the model computes just one new row and reuses history
We cache K and V because they are the heavy reusable history. Q is only needed for the current token.
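Here is the decode loop with a KV cache, as a simplified single-head sketch (random weights, global state for brevity; real engines manage the cache per request and per layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))   # grows one row per processed token
V_cache = np.empty((0, d))

def decode_step(x):
    """Compute fresh Q/K/V only for the new token; reuse cached history."""
    global K_cache, V_cache
    q = x @ W_Q                               # Q needed only for this token
    K_cache = np.vstack([K_cache, x @ W_K])   # append one new K row
    V_cache = np.vstack([V_cache, x @ W_V])   # append one new V row
    s = q @ K_cache.T / np.sqrt(d)            # score against all history
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V_cache            # mix cached values

for token_embedding in rng.standard_normal((4, d)):   # four decode steps
    out = decode_step(token_embedding)
print(K_cache.shape)  # (4, 8): history accumulated, never recomputed
```

Each step does one row of new projection work regardless of how long the context has grown; the old rows just sit in the cache and get read.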
Prompt caching: taking it across requests
KV caching works within one API call. Prompt caching applies the same idea across calls.
If every request starts with the same 2,000-token system prompt, every request would otherwise recompute the same prefix state from scratch, then discard it at the end. Prompt caching keeps that reusable prefix around so the next request can start from the cached state instead of paying the full prefill cost again.
Here’s a concrete example. You’re building a customer support bot. Every request starts with the same 2,000-token system prompt (your instructions, persona, formatting rules). Then comes the user’s question, which varies each time:
Request 1: [system prompt] + "How do I reset my password?"
Request 2: [system prompt] + "What are your business hours?"
Request 3: [system prompt] + "Can I get a refund?"
The shared system prompt produces the same tokens, the same positions, and therefore the same reusable prefix state each time. Only the user suffix changes.
With prompt caching, the provider stores that prefix state after the first request and reuses it for subsequent requests.
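A toy sketch of the cross-request lookup. Real systems hash fixed-size token blocks and store actual K/V tensors (often with eviction policies); here a dict keyed on the exact token-ID prefix stands in for all of that, but the matching rule, exact prefix, is the same:

```python
# Toy cross-request prompt cache, keyed on the exact token-ID prefix.
prefix_cache = {}

def prefill(prompt_tokens):
    prefix = tuple(prompt_tokens)
    if prefix in prefix_cache:
        return prefix_cache[prefix], True          # cache hit: reuse state
    state = f"kv-state-for-{len(prefix)}-tokens"   # stand-in for real K/V
    prefix_cache[prefix] = state                   # pay prefill once, store
    return state, False

system = [7108, 262, 3081] * 100                   # shared 300-token prefix
_, hit1 = prefill(system)   # first request: miss, pays full prefill
_, hit2 = prefill(system)   # second request: hit, reuses stored state
print(hit1, hit2)  # False True
```

Change even one token in `system` and the tuple no longer matches: the next request misses and pays full prefill again. That is the exact-prefix rule in miniature.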
That is why the economics are so good. OpenAI documents up to 90% lower input cost and up to 80% lower latency on cache hits. Anthropic prices cache reads at 0.1x the base input rate and lets you mark the reusable prefix explicitly with cache_control.
Same mechanism, different APIs
The underlying idea is the same across providers: if the prefix matches exactly, they can reuse the cached prefix state. What changes is mostly the API shape.
OpenAI leans automatic. On supported models, exact prefix matches can be cached without you explicitly marking the boundary.
Anthropic exposes that boundary directly. You mark the reusable prefix with cache_control, which gives you more control over what gets cached.
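As a sketch, an Anthropic-style request body with an explicit cache boundary looks roughly like this. The field names follow Anthropic's documented Messages API, but treat the exact shape (and the placeholder model name) as illustrative; check the current docs before relying on it:

```python
# Illustrative request body; "claude-<model>" is a placeholder, and the
# field layout should be verified against Anthropic's current API docs.
payload = {
    "model": "claude-<model>",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a customer support bot. <2,000 tokens of rules>",
            # Everything up to and including this block becomes the
            # cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "How do I reset my password?"}
    ],
}
```

The `cache_control` marker draws the line: stable instructions above it, per-request content below it.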
Different ergonomics, same rule: exact prefix match. If the prefix stays stable, you win. If it drifts, you miss.
What affects the cache
At this point, the practical question is simple: what changes between requests actually break reuse?
What determines a cache hit is the prefix itself, not the sampling knobs. Temperature, top_p, and top_k only affect the final sampling step, after the prefix state has already been computed.
What invalidates the cache?
Change the prefix, even by one token, and the cache no longer matches from that point onward. Stable content belongs first. Variable content belongs last.
That is why prompt caching is prefix-shaped. Providers can reuse a shared beginning of the prompt, not arbitrary fragments from the middle.
If you are building an application, the practical rule is simple: keep the front of the prompt stable.
Practical tips for cache hits
Understanding the theory is one thing. Getting consistent cache hits in production is another.
Put stable content first. System prompts, formatting rules, examples, tool schemas, long documents. Variable content belongs at the end.
Keep context append-only. If you are building a multi-turn conversation, append new turns. Do not rewrite or reshuffle earlier ones unless you are willing to lose the cache.
Serialize deterministically. If your prompt includes JSON or structured data, keep key order and formatting stable. Semantically identical JSON can still tokenize differently.
Pin tool definitions. Many frameworks regenerate tool schemas with tiny ordering or whitespace changes. Treat tool definitions as immutable prefix content.
Watch your cache metrics. If cached-token counts suddenly drop, treat that as a prompt regression. Most cache misses are caused by prompt churn, not provider mystery.
The complete picture
Zoom all the way out:
- Text tokenizes deterministically.
- Those tokens become deterministic input states once position is fixed.
- Causal attention means later tokens cannot change the computed prefix state.
- So providers can store that prefix state and reuse it when the same prefix appears again.
Once you see that chain, prompt caching stops feeling magical. It becomes a prompt-design constraint.
Put the reusable stuff first. Keep it stable. Move anything user-specific to the end.
That is the real payoff. Lower cost and faster time to first token are not lucky side effects. They are what you should expect when your prompts preserve the prefix.
If you want to go deeper, I recommend Sebastian Raschka’s Build a Large Language Model (From Scratch), Andrej Karpathy’s Neural Networks: Zero to Hero series, and the Transformer Explainer interactive visualization from Georgia Tech. For the implementation side (paged attention, vLLM internals, block-level hashing), Sankalp’s How Prompt Caching Works is excellent.