Deep Dive into LLM Token Cost — Blog Series Part 2: How Prompt Caching Actually Works

19 min read Original article ↗

The first post in this series, Part 1: A Real-World Case Study, ended with a single number: a 31-hour Claude Code session that cost $172.58, of which $114.98 — about 66% of the bill — was cache reads. Caching wasn’t a side effect of that session. It was the dominant cost line, by volume so large that it outweighed every other line combined. Anyone trying to reason about LLM cost without a precise mental model of how the cache works is missing the part that matters most.

This post is that mental model. It’s not a strategies post — that’s the next one. It’s the mechanics post for caching specifically: how the prefix match actually behaves on the wire, what happens on the second message of a conversation versus the first, what happens when you walk away for two days, and what really happens to a single 800-token file you read into the session at turn 5.

A note on scope. Claude is used as the worked example throughout because its cache surface is the most explicit — every read, write, and TTL has a corresponding counter in the usage block, so the mechanics can be traced line by line. Where GPT or Gemini diverge in ways that change the practical answer, you’ll find a short cross-provider callout. The third post in this series goes much further on the cross-provider story; here the goal is to get the Claude mechanics rock-solid first.

What’s in this post:

  • Part 1 — Three questions that reveal how caching actually works. Does every chat message use the cache? Is the full context window really sent on every turn? What happens when you resume a session two days later? The short answers expose the asymmetry that governs almost every Claude cost decision.
  • Part 2 — A worked example. Following a single 800-token file through its full lifecycle in a 20-turn session: arrival as fresh input, the one-turn-lagged cache write, then cache reads for the rest of the session. Three diagrams make the mechanics tangible. Once you can trace 800 tokens, you can trace anything.

The series. This is the second of three posts:

  1. Part 1: A Real-World Case Study (previous) — the mental model, anchored in a real $172 case study.
  2. Part 2: How Prompt Caching Actually Works (this post) — three questions and one worked example.
  3. Part 3: Strategies and Anti-Patterns (next) — five strategies ranked by impact, the silent failures that undo them, and the cross-provider comparison.

Each post stands on its own. But if you arrived here without reading the first one, the $172 case study is the proof-by-data of why this post matters: caching reads were two-thirds of that bill, and getting your mental model wrong about caching means optimizing the wrong line for the rest of your time on Claude.

Part 1: Three Questions That Reveal How Caching Actually Works

Three questions come up almost immediately the first time someone tries to reason about a prompt cache:

  1. When I send a chat message, does the cache actually get used?
  2. If my context window is 53.6K tokens, is the full payload really sent every turn?
  3. What happens if I pause and resume the session two days later?

The questions sound simple. The answers expose precisely how statelessness, the cache prefix, and the 5-minute TTL interact — and they explain why some usage patterns are cheap and others are surprisingly expensive.

Question 1 — When I send a chat message, does Claude actually use the cache?

Yes — almost certainly, and aggressively. This is what’s happening on every message you send in a tool like Claude Code.

Every time you send a message, the client makes a fresh API call that re-ships the entire conversation so far. The payload looks roughly like:

The client places cache_control breakpoints to mark the static prefix as cacheable. On each new message you send:

  • System prompt + tool definitions → cache read at ~0.1x. (This block alone is often 10K–20K tokens.)
  • All prior conversation turns → also cache read at ~0.1x, as long as nothing earlier was edited.
  • Your newest message → fresh input at 1x.
  • The assistant’s reply → output at 5x, then becomes part of the cached prefix on your next turn.

So yes, the cache is working hard for you in any back-and-forth chat. The economic shape of the conversation is dominated by the small fresh tail at the end of each turn, not the large repeating prefix.

The 5-minute catch. Cache entries live for 5 minutes from the last read. If you walk away for longer than that between messages, the entry expires. Your next message no longer reads from cache — it pays to rewrite the cache from scratch. We’ll come back to this in Question 3, because it’s where the biggest hidden costs hide.

Cross-provider note. The same “every message uses the cache” pattern holds on GPT (cache lookup is automatic for prompts above ~1K tokens) and on Gemini (only if you’ve explicitly created a CachedContent object and the prefix exceeds the ~32K minimum). On GPT it’s effortless and invisible; on Gemini it requires deliberate setup. The asymmetry: short Claude or GPT chats benefit from caching automatically once your context grows; short Gemini chats can’t use the feature at all.


Question 2 — If my context window is 53.6K tokens, does Claude actually send all 53.6K back to the API every turn?

Yes — every byte of those 53.6K tokens travels over the wire on every turn. Claude is stateless: there is no server-side memory of your conversation. The full payload ships each time.

But “shipped” and “billed at full rate” are very different things. Here is what actually happens to those 53.6K tokens when you send a new message of, say, 200 tokens:

Bill lineTokensRateEffective cost
cache_read_input_tokens~53,4000.1x~5,340
input_tokens (your new message)~2001x~200
cache_creation_input_tokens (none — already cached)01.25x0
output_tokens (the new reply)TBD5xTBD

Effective input cost: ~5,540 token-equivalents, instead of 53,600. Roughly a 10x reduction on the input side, on this one turn.

Note the asymmetry this creates: network bandwidth and billing diverge. Your network is moving 53.6K tokens of payload. Anthropic’s billing system is charging you for the equivalent of ~5.5K tokens of input. The full payload still has to be parsed and the model still has to be aware of all 53.6K tokens of context — but the dollars correspond only to the small fresh tail.

This is the central insight that makes long Claude conversations economically viable. Without caching, every turn of a 50-turn conversation would re-bill the entire growing prefix at the 1x rate, and the per-turn cost would balloon as the conversation went on. With caching, the per-turn input cost stays roughly proportional to how much new content you’re adding, not how long the conversation already is.

One small nuance. The assistant’s reply from the previous turn was not in the cache yet when you sent this new message. The cache breakpoint placed on the previous turn cached everything through your previous message, but not the reply itself. So in practice the new request also writes that previous reply into the cache as part of this turn’s cache_creation_input_tokens line — billed once at 1.25x. From the next turn onward, that reply joins the cached prefix and reads at 0.1x. This is the same “one-turn lag” mechanic we’ll see in detail in the next part.

Cross-provider note. The “full payload ships every turn” part is universal — every major LLM API is stateless and re-receives the entire conversation each call. What differs is only how much of it ends up billed at the discounted rate: ~10% on Claude, ~50% on GPT, ~25% (of the cached portion) on Gemini. The network cost is identical across providers; the billing cost is where caching mechanics produce the divergence.


Question 3 — What if I pause my work and reopen the session two days later?

The cache is gone. And the consequences of that are bigger than people expect.

The 5-minute TTL expired ~47 hours and 55 minutes ago. There is no cache entry for the prefix anymore. When you send your first message after the long pause:

Bill lineTokensRateEffective cost
cache_read_input_tokens00.1x0
cache_creation_input_tokens~53,4001.25x~66,750
input_tokens (new message)~2001x~200
output_tokensTBD5xTBD

Effective input cost on this first resumed message: ~66,950 token-equivalents.

Two comparisons to put that number in perspective:

  • Versus a message sent within the 5-minute TTL (~5,540 token-equivalents): about 12x more expensive. The cache discount is gone.
  • Versus a message sent with caching never enabled at all (53,600 token-equivalents at 1x): 25% more expensive. Cache writes cost 1.25x, so rebuilding a cache you never get to read from is strictly worse than not caching in the first place.

You pay extra to rebuild a cache that you may or may not get to use.

What happens next. Once the first resumed message has paid to rebuild the cache, subsequent messages within 5 minutes of it benefit normally — cache reads at 0.1x, just like before. The cost shape is one expensive spike to rebuild, then back to cheap operation.

So the lifecycle of a resumed-after-pause session looks like:

  • Resume message: ~12x cost spike. Cache fully rebuilt.
  • Next 5 minutes of activity: normal cheap reads.
  • Pause again past 5 minutes: cache expires, the next resume pays the spike again.

The pattern this implies

This produces a very specific bad pattern in real workflows:

“I’ll just send one quick follow-up message in this old session.” → Pays the full ~53.6K rebuild cost. → Gets one reply. → Closes the session. → Two days later does it again with a different old session. → Pays another full rebuild.

If you do this across five different old sessions in a week, you’ve paid the rebuild tax five times. The “quick follow-up” is anything but quick on the bill.

Better pattern: If you’re going to resume an old session for substantial work, batch your questions. The first message pays the rebuild tax. Messages 2–10 are cheap. Front-load your work into one sitting instead of spreading it across days.

Best pattern for genuinely intermittent work: Don’t resume the old session at all. Start a fresh session with only the context you actually need carried forward. A new session with a 5K-token brief is dramatically cheaper than resuming a 53.6K-token old session for a small task — and once you’re in the fresh session, its cache starts paying off immediately.

Two reasons this matters even on subscription plans

For users on subscription plans (rather than per-token API billing), the dollar cost is absorbed by the provider. But the cache rebuild still has real consequences for you:

  1. Latency. Rebuilding a cache entry on 53.6K tokens is slower than reading the same 53.6K tokens from cache. The first message after a long pause genuinely feels slower than subsequent ones. Many users have noticed this without understanding why.
  2. Rate limits and quota. Even on subscription plans, cache writes count more heavily toward your usage limits than cache reads. Habitually resuming many stale sessions can cause you to hit rate limits faster than you’d otherwise expect.

The 5-minute TTL is not a billing footnote. It’s a real constraint on how to structure long-running work — and once you see it, you start designing around it instead of accidentally fighting it.

Cross-provider note. On GPT, the two-day resume picture is similar — automatic caching is best-effort and certainly won’t survive overnight, so your first message after a long pause runs at full input price. On Gemini the answer depends entirely on what TTL you paid for when you created the cache: if you provisioned a 48-hour cache, your resume reads from it cheaply; if you provisioned a 1-hour cache, it’s long gone. Gemini is the only one of the three where “resume an old session cheaply” is a knob you can turn — at the cost of paying storage rent the whole time.


What these three questions add up to

Three small questions, one underlying insight: Claude‘s cache rewards continuity and punishes sporadic touch-points.

  • Within a continuous conversation, the cache is your friend — it reduces your effective input cost by an order of magnitude.
  • Across the 5-minute TTL, the cache is your enemy — it adds a 25% premium to a message that would otherwise have just been “fresh input.”

Once you internalize that asymmetry, a lot of usage decisions get easier. Batching, session reuse, fresh-vs-resumed sessions, when to use /compact — these stop being matters of taste and start being matters of arithmetic.

In the next part, we’ll follow a single 800-token file through its entire lifecycle in a session, and watch all of these mechanics interact in miniature.

Part 2: A Worked Example — What Happens When You Read an 800-Token File

Part 1 walked through three questions about how caching actually behaves. This part zooms all the way in on a single, concrete event: the model reads one 800-token file during a conversation. Following those 800 tokens from arrival to end-of-session reveals exactly how the four-line bill, prompt caching, and the cost-compounding effect of statelessness interact in practice.

The example is small on purpose. Once you can trace the cost of 800 tokens through 20 turns, you can trace the cost of anything.


The setup

A Claude Code session is in progress. The model has been running for a few turns when, at turn N, you ask it to read a file. The file is small — ~3 KB of source code, tokenizing to roughly 800 tokens. The session continues for many more turns after the file is read.

The question this part answers: what happens to those 800 tokens, billing-wise, from the moment they enter the conversation until the session ends?

The intuitive guesses are usually wrong. A reader who has just learned about caching often assumes one of:

  • “The 800 tokens are cached immediately, so they cost ~0.1x forever.”
  • “The 800 tokens are billed once at 1.25x to write to the cache, then ~0.1x after.”
  • “The 800 tokens are billed at 1x on every turn from now on.”

None of these is correct. The actual lifecycle has three distinct phases.


Phase 1 — Turn N: arrival as fresh input (1x)

When Claude Code reads the file on turn N, the request payload sent to Anthropic looks like this:

The cache only knows about content that existed at the previous breakpoint. The file is brand new — it has never been seen before. So on turn N, the 800 tokens are billed at the fresh-input rate of 1x. No cache involvement, no premium, no discount. Just 800 token-equivalents of input cost.

This contradicts the “it’s cached immediately” intuition. The cache write doesn’t happen on the turn the content arrives — it happens on the turn after.


Phase 2 — Turn N+1: written to cache (1.25x, one-time)

On your next message — turn N+1 — Claude Code places a new cache_control breakpoint that extends the cached region to include everything from turn N. The request now looks like:

This is the one-time onboarding cost. The 800 tokens get billed at the cache-write rate of 1.25x exactly once — 1,000 token-equivalents — and from this point forward they live in the prompt cache.

This is the phase most people miss. The “cache write” line on the bill is not the act of caching itself — it is the moment the cache grows to include new content. And it always happens one turn behind the content’s arrival.


Phase 3 — Turns N+2 through end of session: cache reads (0.1x)

From turn N+2 onward, the 800 tokens are part of the cached prefix. Every subsequent turn reads them at the cache-read rate of 0.1x — 80 token-equivalents per turn — for as long as the conversation continues (and as long as nothing earlier in the conversation is edited, and as long as the 5-minute TTL never expires between calls).

This is the phase that makes long sessions economically viable. Once those 800 tokens are cached, they essentially stop costing money.


The three diagrams

Diagram 1 — Lifecycle of the 800 tokens across turns

The three phases visualized as a timeline. Yellow is fresh input (1x), red is the one-time cache write (1.25x), green is cache reads (0.1x). The lesson at a glance: two expensive turns of “onboarding,” then the cost falls off a cliff.

Diagram 2 — Request payload: Turn N vs Turn N+1

A side-by-side view of what’s actually sent on the wire each turn. The file (yellow on turn N) gets pulled into the cache write region (red on turn N+1). From turn N+2 onward, it lives quietly inside the green “cached prefix” block at the top.

The single most important visual idea: the breakpoint advances one turn at a time, and that’s why the cache write always lags the content by one turn.

Diagram 3 — Cumulative cost over 20 turns: with caching vs without

For an 800-token file read at turn 5 of a 20-turn session, the cumulative cost of those 800 tokens through the rest of the session, with caching on vs caching off.

  • Without caching: a straight line climbing to ~12,800 token-equivalents by turn 20. Every turn re-bills 800 tokens at the 1x fresh-input rate.
  • With caching: an early spike at turn 6 (the 1.25x cache write), then an almost-flat crawl ending at ~2,920 token-equivalents. A ~4.4x reduction on this one file over this one session.

The diagram makes the punchline of caching visible: it’s not that caching makes any individual turn cheap — it’s that caching breaks the linear-growth curve of stateless re-shipping.


The total bill for those 800 tokens

For an 800-token file read at turn 5 of a 20-turn session, the lifecycle math works out to:

Without caching, the same file would have cost 800 × 16 turns = 12,800 token-equivalents — 4.4x more.

Cross-provider note. The three-phase lifecycle (1x arrival → 1.25x cache write → 0.1x reads) is Claude-specific. The same file looks different elsewhere:

  • On GPT: no 1.25x cache-write line (writes are free), and reads cost ~0.5x instead of ~0.1x. The total for the same 20-turn session works out to roughly 6,800 token-equivalents (800 × 1x on turn 5, then 800 × 0.5x × 15 for turns 6–20) — better than no-caching, but worse than Claude‘s ~2,920.
  • On Gemini: the 800-token file is too small to cache at all (below the ~32K minimum). The file gets billed at full 1x on every turn — closer to the “without caching” 12,800 number than to the Claude one. Same file, same conversation length, three very different bills. The choice of provider is implicitly a choice of which cost shape you’re signing up for.

Three practical lessons that fall out of this

The 800-token walkthrough isn’t just an explanation of mechanics. It produces three practical rules that change how you design Claude-powered workflows.

Lesson 1 — File reads have a fixed “onboarding cost”

Every file you bring into a session costs 1x on the turn it arrives and 1.25x on the next turn, regardless of how the session unfolds afterward. That’s a fixed 2.25x premium that gets paid up front for every file. The cost is amortized across however many turns the file then sticks around for cache reads.

Implication: read files early in a session, not late. A file read at turn 2 of a 20-turn session amortizes its 2.25x onboarding across 18 cache-read turns. A file read at turn 18 amortizes it across only 1. The same file in the same conversation can be 5x more expensive depending on when you read it.

Lesson 2 — The cache write lags by one turn

The mental model of “caching turns content cheap immediately” is wrong. There is always a one-turn lag: content arrives at 1x, then gets written to cache at 1.25x on the next turn. The cache only starts paying off on turn N+2.

Implication: caching is a bet on conversation length. Reading a 10K-token file and then ending the session two turns later is a textbook example of caching costing you money. You paid the 1x + 1.25x onboarding without ever collecting enough 0.1x reads to break even. The break-even math: onboarding costs 2.25x, and each cache-read turn saves you 0.9x relative to no-caching. So you need at least three cache-read turns (i.e., the file needs to live through roughly five turns total from arrival to session end) before caching that file pays for itself.

Lesson 3 — /compact is an economic operation, not just a UX one

In Claude Code, the /compact command summarizes earlier conversation history into a smaller representation. People often think of it as “freeing context window space.” It is also — and arguably more importantly — a token economics operation.

If /compact replaces 40K tokens of accumulated history with a 5K-token summary, every remaining turn now reads 5K from cache (500 token-equivalents) instead of 40K (4,000 token-equivalents). On a 10-turn continuation, that’s a 35,000-token-equivalent savings, easily worth the onboarding cost of writing the 5K summary into the cache: 5,000 × 1x (arrival) + 5,000 × 1.25x (cache write) = 11,250 token-equivalents. Net savings: ~23,750 token-equivalents from one /compact invocation.

Implication: in long Claude Code sessions, periodic /compact is one of the highest-leverage economic actions you can take. The win is bigger than it looks because the savings compound on every subsequent turn.


Why this matters for the rest of the post

The 800-token walkthrough is the smallest possible example of how Claude‘s billing mechanics actually behave. Every larger pattern you’ll see in real workloads — long agentic sessions, tool result re-injection, RAG pipelines, multi-turn chat — is just this same lifecycle, multiplied across many pieces of content arriving at different turns.

If you understand exactly what happens to one 800-token file, you have everything you need to reason about the cost shape of any Claude workload.

In the final post in this series — Part 3: Strategies and Anti-Patterns — we’ll turn this understanding into action: the five strategies that consistently deliver the biggest cost reductions in production, the silent failures that quietly undo them, and a side-by-side of how the cache mechanics differ across Claude, GPT, and Gemini. For now, the takeaway is simpler: when you watch a number go by in your usage block, you should be able to point at which of the three phases it came from. Everything downstream of that ability — instrumenting, optimizing, debugging the bill — gets dramatically easier.