How Claude Code Compresses Your Conversation


March 2026

Claude Code runs in your terminal, reads your files, writes code, runs tests, all inside a single continuous conversation. But that conversation has a hard limit: a context window of around 200k tokens. When it fills up, something has to give.

I wanted to know not only what that "something" is, but how it works. Turns out you can read the answer straight out of the binary.

Inside the binary

Claude Code ships as a single executable file, not a folder full of .js files. Under the hood it's still JavaScript, but a bundler packs the source, the Node.js runtime, and all dependencies into one ~200MB file. Most of that is compiled machine code. But the JavaScript source is embedded in plain text:

[ compiled machine code | JavaScript source | more compiled machine code ]

If you know what to search for, you can find it. grep -boa finds the byte offset of a string match, even in binaries:

```shell
$ grep -boa "Primary Request" claude
115503119:Primary Request
```

115,503,119 is a byte offset. We can use dd to extract the raw bytes around it:

```shell
$ dd if=claude bs=1 skip=115498000 count=15000 2>/dev/null \
    | tr '\0' '\n' | grep -v '^$'
```

What comes out is readable JavaScript: the compaction prompt, token budget constants, threshold logic, file restoration rules. I spent a while pulling on this thread, searching for related strings and cross-referencing offsets. Everything in this post comes from that process.
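That grep-then-dd workflow can be wrapped in a small shell function. A sketch, with the 5,000-byte backtrack and 15,000-byte window as arbitrary choices of mine:

```shell
# Dump the printable text around the first match of NEEDLE in FILE.
# A sketch of the grep/dd workflow above; window sizes are arbitrary.
extract_around() {
  file=$1; needle=$2
  # grep -boa prints the byte offset of each match, even in binaries
  off=$(grep -boa "$needle" "$file" | head -n1 | cut -d: -f1)
  [ -z "$off" ] && { echo "no match" >&2; return 1; }
  # Back up a bit so we see the code before the match, not just after it
  start=$(( off > 5000 ? off - 5000 : 0 ))
  dd if="$file" bs=1 skip="$start" count=15000 2>/dev/null \
    | tr '\0' '\n' | grep -v '^$'
}
```

Running `extract_around claude "Primary Request"` reproduces the extraction above.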

What's inside the context window?

Every API call to Claude is the same shape: you send an array of message objects, the model sends back a response. The "context window" is this array, and it has a size limit of around 200k tokens.

Each message in the array has a role and some content. Here's what the array looks like when you ask Claude Code to edit a file:

```javascript
messages = [
  { role: "system",    content: "You are Claude Code..." },
  { role: "user",      content: "Add dark mode to the blog" },
  { role: "assistant", content: "I'll read your styles first." },
  { role: "assistant", content: { tool_use: Read("shared.css") } },
  { role: "user",      content: { tool_result: "*, *::before, *::after..." } },
  { role: "assistant", content: "Found hardcoded colors, replacing..." },
  // ...this array keeps growing with every turn
]
```

The system prompt is invisible and always present. Claude Code injects it at the start of every request: about 800 lines of instructions, tool definitions, and your CLAUDE.md config. You never see it in the terminal, but it costs ~16k tokens, and it's there on every single API call.

Why tool results are "user" messages
Notice that tool results have role: "user". That's because tools execute on your machine, outside the model. Claude asks to read a file, your computer does it, and the contents come back as if you sent them.

Tool calls and tool results are just messages. When Claude decides to read a file, it outputs a tool_use block, basically "call Read with this path." That's tiny, maybe 85 tokens. But the result comes back as a tool_result message containing the entire file, verbatim. A 142-line CSS file is 4,800 tokens. A large source file can be 10k+.

The entire array gets sent on every turn. This is the crucial part. Claude doesn't have memory between turns. It re-reads the whole conversation from scratch on every API call. The array never shrinks. Every file ever read, every command output, every response is still in there. It only grows.
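A minimal sketch of that growth, with estimateTokens as a crude 4-characters-per-token stand-in of my own (real token counts come from the API):

```javascript
// The whole conversation is re-sent on every call; the array only grows.
// estimateTokens is a rough heuristic, not how the API actually counts.
const estimateTokens = (msgs) =>
  msgs.reduce((sum, m) => sum + Math.ceil(JSON.stringify(m.content).length / 4), 0);

const messages = [{ role: "system", content: "You are Claude Code..." }];

function turn(userContent, assistantReply) {
  messages.push({ role: "user", content: userContent });
  // In the real client, the ENTIRE messages array goes over the wire here
  messages.push({ role: "assistant", content: assistantReply });
  return estimateTokens(messages);
}

turn("Add dark mode to the blog", "I'll read your styles first.");
// A tool result drags the entire file into the array, verbatim:
turn({ tool_result: "*, *::before, *::after { /* ...142 lines of CSS... */ }" },
     "Found hardcoded colors, replacing...");
```

Every call pays for everything before it, which is why tool results pile up so fast.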

Picture the conversation as a tape: each segment is a message, sized by its token cost. Plot that over eight API calls, broken down by message type, and one pattern stands out: tool results quickly dominate everything else.

Nearly half the window is tool results. The system prompt that dominated early on is now just 8%.

The auto-compact trigger

Claude Code doesn't wait until the context window is completely full. The threshold is a budget, not a percentage. The code reserves room for two things: space for the model's response (up to 20k tokens), plus a 13k-token buffer on top of that. For a 200k context window, auto-compact fires at roughly 167k tokens, about 83% full. The question the code asks is: "do I have less than 33k tokens of room left to work with?"

CLAUDE_AUTOCOMPACT_PCT_OVERRIDE
Override the threshold with an environment variable. Set it to a number between 0 and 100 and the threshold becomes that percentage of the effective window. Lower values mean more frequent, smaller compactions.

DISABLE_AUTO_COMPACT
Set to true to turn off auto-compact entirely while still letting you /compact manually. DISABLE_COMPACT=true kills compaction completely, even manual.

Here's what that budget looks like as actual space inside the 200k window: 167k of usable room, with 20k reserved for the response and a 13k buffer held back on top.

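The budget check reduces to a few constants. A sketch using the numbers above; the function names are mine, and I'm assuming the percentage override applies to the raw window:

```javascript
// Auto-compact budget, using the constants recovered from the binary.
const CONTEXT_WINDOW = 200_000;
const MAX_OUTPUT = 20_000;  // reserved for the model's response
const BUFFER = 13_000;      // safety margin kept free on top of that

function autocompactThreshold(pctOverride /* CLAUDE_AUTOCOMPACT_PCT_OVERRIDE */) {
  if (pctOverride !== undefined) {
    // Override: the threshold becomes a straight percentage of the window.
    // (Assumption: applied to the raw 200k, not the window minus reservations.)
    return Math.floor(CONTEXT_WINDOW * pctOverride / 100);
  }
  return CONTEXT_WINDOW - MAX_OUTPUT - BUFFER; // 167,000 tokens, ~83% full
}

const shouldCompact = (usedTokens, pct) => usedTokens >= autocompactThreshold(pct);
```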


How compaction works

When compaction fires, the model doesn't just chop off old messages. It does something more interesting. First, an analysis scratchpad: which files are still relevant, which errors are resolved, what the user actually wants. Then a structured 9-section summary that replaces the entire conversation.

In the resulting summary, the "Files" section can pull from three different messages, and "Pending" work is inferred: the model connects the assistant's note about hex literals to the user's unresolved complaint and recognizes unfinished work. Everything else fades. File contents, intermediate reasoning, tool call details: all compressible.

Sections 1–6 capture what happened: the goal, the tech, the files, what broke, what was tried, and every user message verbatim. Sections 7–9 capture what's next: unfinished work, current state, next action. Two-thirds backward, one-third forward. Less a transcript, more a handoff note.

The compaction instructions demand concrete detail: "Include specific code snippets, file paths with line numbers, exact function signatures, and error messages rather than general descriptions." Section 3 doesn't say "edited some CSS." It says shared.css:1-8 with the exact change.

The compact API call

Compaction is itself an API call, the model summarizing its own conversation. A deliberately constrained one:

Everything gets stripped away except the ability to read and write text. The model can't use tools, can't think step-by-step, can't see images or documents. It gets the full conversation, a tight output budget, and one job: compress.
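Sketched as a request object in the shape of the Anthropic Messages API, with the output budget as an illustrative number and the model string elided:

```javascript
// A sketch of the compact call's shape. Field names follow the Anthropic
// Messages API; the exact values Claude Code uses aren't all visible here.
function buildCompactRequest(messages, summaryInstructions) {
  return {
    model: "claude-...",     // the model summarizes its own conversation
    max_tokens: 16_000,      // a tight output budget (illustrative number)
    tools: [],               // stripped: the model can only read and write text
    messages: [
      ...messages,           // the full conversation so far
      { role: "user", content: summaryInstructions }, // the 9-section summary prompt
    ],
  };
}
```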

There are also two versions of the analysis phase instructions. The full version demands a chronological walkthrough with code snippets and function signatures. The lean version (behind a feature flag) treats analysis as a "planning scratchpad":

```
<analysis>
Treat this as a private planning scratchpad — it is not the place for content
meant to reach the user. Use it to plan, not to draft.
- Walk through chronologically and note what belongs in each of the 9 sections below
- Do NOT write code snippets here — save those for <summary> where they will
  actually be kept
The goal of <analysis> is coverage, not detail. The detail goes in <summary>.
</analysis>
```

After the model writes both tags, the <analysis> block is stripped entirely. Regex-replaced with an empty string. Only the <summary> survives:

```
<analysis> scratchpad notes... </analysis>    → stripped
<summary> 9-section summary... </summary>     → kept
```
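A reconstruction of that stripping step in JavaScript; the exact regex in the binary may differ:

```javascript
// Strip the <analysis> scratchpad; only <summary> survives.
// This regex is my reconstruction, not the one in the binary.
function stripAnalysis(output) {
  return output.replace(/<analysis>[\s\S]*?<\/analysis>/g, "").trim();
}

const raw = "<analysis>\nscratchpad notes...\n</analysis>\n" +
            "<summary>\n9-section summary...\n</summary>";
stripAnalysis(raw);
// → "<summary>\n9-section summary...\n</summary>"
```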

If the summary fails to stream (network issue, model hiccup) it retries once. And if the post-compact token count still exceeds the threshold, compaction triggers again on the very next turn, chaining until it fits.

That's the full compression pipeline: trigger, summarize under constraints, strip the scratchpad, resume.
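The whole pipeline can be sketched as one function. All names are mine, and the api object bundles hypothetical helpers; the behavior (threshold check, one retry, analysis stripping, file restoration, chaining) follows the description above:

```javascript
// A sketch of the compaction pipeline. `api` bundles hypothetical helpers:
// countTokens, streamSummary, continuationPrompt, restoreRecentFiles.
const THRESHOLD = 167_000; // for a 200k window

async function maybeCompact(messages, api) {
  if (api.countTokens(messages) < THRESHOLD) return messages;

  let output;
  try {
    output = await api.streamSummary(messages);   // the constrained compact call
  } catch {
    output = await api.streamSummary(messages);   // on failure, retry exactly once
  }
  // The <analysis> scratchpad is regex-replaced away; only <summary> survives
  const summary = output.replace(/<analysis>[\s\S]*?<\/analysis>/g, "");

  const next = [
    { role: "user", content: api.continuationPrompt(summary) }, // "Resume directly..."
    ...api.restoreRecentFiles(), // up to 5 files, 5k tokens each, 50k total
  ];
  // If the result still exceeds the threshold, compaction chains again
  return api.countTokens(next) < THRESHOLD ? next : maybeCompact(next, api);
}
```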

The continuation prompt

After compaction, the old conversation is gone. In its place, the model receives a single user message that starts with:

```
This session is being continued from a previous conversation that ran out of
context. The summary below covers the earlier portion of the conversation.

[...structured summary...]

Continue the conversation from where it left off without asking the user any
further questions. Resume directly — do not acknowledge the summary, do not
recap what was happening, do not preface with "I'll continue" or similar.
Pick up the last task as if the break never happened.
```

"Pick up the last task as if the break never happened." If compaction works well, you shouldn't notice it at all.

What gets restored

This is the part that surprised me. After compaction, the model doesn't just get a summary. The system re-attaches several things automatically:

File restoration is the big one. It re-reads up to 5 recently accessed files (capped at 5k tokens each, 50k total) and attaches them to the context. The files you were actively editing are present as actual content, not just path names in a summary.

Background task statuses, invoked skills, and plan mode state are also re-attached.
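Those caps can be sketched as a small restoration loop; the names and the 4-characters-per-token estimate are mine:

```javascript
// File restoration caps as described above: at most 5 recent files, each
// truncated to ~5k tokens, ~50k tokens total. The token estimate is a
// rough 4-characters-per-token heuristic of my own.
const MAX_FILES = 5, PER_FILE_TOKENS = 5_000, TOTAL_TOKENS = 50_000;
const tokens = (text) => Math.ceil(text.length / 4);

function restoreFiles(recentFiles /* [{ path, content }], newest first */) {
  const restored = [];
  let budget = TOTAL_TOKENS;
  for (const f of recentFiles.slice(0, MAX_FILES)) {
    const clipped = f.content.slice(0, PER_FILE_TOKENS * 4); // per-file cap
    if (tokens(clipped) > budget) break;                      // total cap
    budget -= tokens(clipped);
    restored.push({ path: f.path, content: clipped });
  }
  return restored;
}
```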

What gets lost

Even with file restoration, compression is lossy. The summary preserves state over story, and some things don't fit neatly into either.

Pre-compaction safety net
There's even a feature flag (tengu_summarize_tool_results) that, when enabled, instructs the model: "write down any important information you might need later in your response, as the original tool result may be cleared later." A mitigation against its own compression system. The model is told to save its notes before the conversation gets compacted.

The template keeps state (what files exist, what's broken, what's next) and drops story (how you got there, what you tried first, what you talked about along the way).


Why it matters

Understanding this system changed how I work with Claude Code. A few things I do differently now.

Section 1 captures "primary request and intent," so stating your goal clearly at the start means it survives compaction. Vague requests get vaguely summarized.

Explicit preferences like "always use single quotes" or "never auto-commit" get captured in section 6, "All User Messages," and persist across compactions. Implicit preferences shown through example are more likely to get lost.

You can also guide what the summary focuses on with /compact:

```shell
$ /compact focus on the test failures and the auth refactor
# or
$ /compact include file reads verbatim, remember the CSS bugs
```

These get appended to the summary prompt. Useful when switching from debugging to feature work.

The threshold is tunable too. If you want compaction to kick in earlier or later:

```shell
# Compact earlier (at 60% of effective window)
$ CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60 claude

# Never auto-compact, only manually
$ DISABLE_AUTO_COMPACT=true claude
```

The context window is a conveyor belt, not a wall. Understanding the machinery underneath helps you work with it instead of against it.