CC v2.1.100+ inflates cache_creation by ~20K tokens vs v2.1.98 — same payload, server-side


Summary

Claude Code versions 2.1.100 and 2.1.101 consume ~20,000 more cache_creation_input_tokens per request than v2.1.98, despite sending fewer bytes in the request payload. The inflation is server-side and version-specific (likely User-Agent routing).

This is not a billing-only issue — these tokens enter the model's context window and may affect output quality.

Reproduction

# 1. Install proxy to capture full API request/response bodies
mkdir /tmp/cc-test && cd /tmp/cc-test
npx -y claude-code-logger@1.0.2 start --port 8000 --log-body --merge-sse

# 2. In another terminal — test with older version
export ANTHROPIC_BASE_URL="http://localhost:8000"
claude-2.1.98 --print "1+1"
# Note cache_creation_input_tokens in response

# 3. Same setup — test with newer version
claude-2.1.100 --print "1+1"
# Note cache_creation_input_tokens in response

Each test is a cold cache, single API call, no session state (--print mode). Same machine, same project, same account, minutes apart.
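Once the proxy has captured a few exchanges, the usage numbers can be pulled out of the logged response bodies. A minimal sketch — the log directory and JSON field layout are assumptions; adjust them to whatever claude-code-logger@1.0.2 actually writes:

```python
import glob
import json

def extract_usage(entry: dict) -> tuple:
    """Pull cache token counts out of one captured request/response exchange."""
    usage = entry.get("response", {}).get("usage", {})
    return (usage.get("cache_creation_input_tokens", 0),
            usage.get("cache_read_input_tokens", 0))

# Log path is hypothetical — point it at the proxy's real output directory.
for path in sorted(glob.glob("/tmp/cc-test/logs/*.json")):
    with open(path) as f:
        creation, read = extract_usage(json.load(f))
    print(f"{path}: cache_creation={creation} cache_read={read}")
```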

Evidence

Version    Content-Length (bytes)  cache_creation_input_tokens  cache_read  Total
v2.1.98    169,514                 49,726                       0           49,726
v2.1.100   168,536 (-978 B)        69,922                       0           69,922
v2.1.101   171,903 (+2,389 B)      ~72,000                      0           ~72,000

v2.1.100 sends 978 fewer bytes than v2.1.98 yet is billed 20,196 more tokens. That sign mismatch rules out a client-side payload difference as the cause — the inflation happens server-side, after the request is received.
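The arithmetic behind that conclusion, using the figures from the table:

```python
# Figures copied from the table above (v2.1.98 vs v2.1.100)
bytes_98, tokens_98 = 169_514, 49_726
bytes_100, tokens_100 = 168_536, 69_922

delta_bytes = bytes_100 - bytes_98      # payload size change
delta_tokens = tokens_100 - tokens_98   # billed cache_creation change

# A client-side cause would move both deltas in the same direction;
# here the payload shrank while billing grew.
assert delta_bytes == -978 and delta_tokens == 20_196
print(f"payload: {delta_bytes:+} B, billed: {delta_tokens:+,} tokens")
```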

Cross-account test (same v2.1.98 binary, two different Max accounts): the delta was under 500 tokens — noise level. The inflation is not account-specific.

Interactive mode confirms

/tmp/claude-cache-*.json files across 40+ sessions show a bimodal distribution:

  • Group A: ~50K (cache_read=0, cache_create=~50K) — matches v2.1.98 --print
  • Group B: ~71K (cache_read=0, cache_create=~71K) — matches v2.1.100+ --print

Some sessions start cold at 71K with cache_read=0, confirming this is not accumulated cache — it's the baseline.
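A quick way to bucket those session files into the two groups. The field names (`cache_create`, `cache_read`) are assumptions about the cache-file schema — rename them to match the actual files:

```python
import glob
import json

def classify(cache_create: int) -> str:
    """Threshold midway between the two observed modes (~50K and ~71K)."""
    return "A" if cache_create < 60_000 else "B"

counts = {"A": 0, "B": 0}
for path in glob.glob("/tmp/claude-cache-*.json"):
    with open(path) as f:
        data = json.load(f)
    # Field names are assumptions about the statusline cache schema.
    if data.get("cache_read", 0) == 0:  # cold sessions only
        counts[classify(data.get("cache_create", 0))] += 1

print(f"Group A (~50K, pre-2.1.100): {counts['A']}")
print(f"Group B (~71K, 2.1.100+):   {counts['B']}")
```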

Quality concern

The 20K extra tokens are cache_creation_input_tokens — this means they enter the model's context window, not just the billing ledger. If the server injects additional content invisible to the user:

  • Instruction dilution: Hidden system content may compete with user-provided CLAUDE.md rules, leading to inconsistent agent behavior
  • Reduced effective context: 20K fewer tokens available for actual conversation history — in long sessions this compounds with every turn
  • Unverifiable behavior: Users cannot audit what the model actually "sees" vs what they sent — makes debugging agent misbehavior significantly harder

We observed qualitative differences between v2.1.98 and v2.1.100+ sessions (instruction adherence, tool selection accuracy), but cannot isolate whether this is caused by the extra hidden tokens or other version changes. The lack of transparency makes it impossible to tell.

Additional findings

  • After /login account switch, statusline can jump ±20K — this is cache invalidation (new account_uuid = new cache key), not a billing difference between accounts.
  • 56 count_tokens burst calls observed immediately after the first interactive prompt — these inflate the apparent "first visible" context number in the statusline.
  • Current v2.1.104 is untested — the investigation was done on v2.1.98/100/101. We cannot confirm whether v2.1.104 carries the same inflation.
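The count_tokens burst above can be verified from a proxy capture. A sketch assuming a JSON-lines log with a `url` field per request — the log path and format are assumptions, though `/v1/messages/count_tokens` is the documented endpoint:

```python
import json

def count_token_calls(log_lines) -> int:
    """Count requests to the count_tokens endpoint in a JSON-lines log."""
    n = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if "/v1/messages/count_tokens" in entry.get("url", ""):
            n += 1
    return n

# Usage (path is hypothetical): count_token_calls(open("/tmp/cc-test/proxy.log"))
```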

Impact

~20K extra tokens per session is roughly 40% overhead on a clean project. On a Max plan with usage limits, this means hitting the 5-hour cap significantly faster on newer CC versions.
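The ~40% figure follows directly from the measured baselines:

```python
baseline = 49_726   # v2.1.98 cold-cache cache_creation tokens (from the table)
inflated = 69_922   # v2.1.100
overhead = (inflated - baseline) / baseline
print(f"overhead: {overhead:.1%}")  # ≈ 40.6%
```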

Combined with the quality concern: users are paying more for potentially degraded output, with no visibility into why.

Workaround

Downgrade to v2.1.98 or earlier. Check ~/.local/share/claude/versions/ for available binaries, or use npx claude-code@2.1.98.

Note: Auto-updates may overwrite older versions. v2.1.98 is no longer retained in the local versions directory after v2.1.104 is installed.


Environment

  • Claude Code: v2.1.98 / v2.1.100 / v2.1.101 (tested all three)
  • OS: Linux (WSL2), Windows 11
  • Plan: Max (5x)
  • Install method: native
  • Measurement: HTTP proxy capturing full request/response bodies