Charlie O'Neill (@oneill_c) on X

2/ The KV cache is the wall everyone hits with long-horizon LLMs, e.g. multi-day agents, repo-scale reasoning, long tool chains. It grows linearly with context and you can't get around it. So far you've had two bad options:
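To make "grows linearly" concrete, here is a back-of-the-envelope sketch. The layer count, KV-head count, head dimension and 2-byte precision below are illustrative assumptions, not figures from the thread or from any specific model card.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard KV cache: one key and one value vector
    per token, per KV head, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens

# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

for tokens in (8_192, 65_536):
    full = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{tokens:>6} tokens: {full / 2**30:.3f} GiB full cache")
    for ratio in (8, 200):  # the compression range quoted in the thread
        print(f"    at {ratio}x: {full / ratio / 2**20:.1f} MiB")
```

Under those assumed numbers the cache is about 1.1 GiB at 8k tokens and 9 GiB at 64k, per sequence, and it scales one-for-one with context length.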

3/ Option A, selection: throw away tokens. Cheap, but your compact cache can only ever be a subset of the original. Option B, synthesis: build new compact entries. Expressive, but you re-run an optimization for every single context. We wanted both: cheap and expressive.

4/ So we amortize the synthesis. We train one small per-layer module (a Perceiver) once, against a frozen base model. At inference, learned latent queries cross-attend the full cache and write out compact keys & values. There are no reference queries, no gradient steps, no
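A rough picture of that forward pass, as a toy sketch rather than the actual Still module: single head, no learned projections, and random vectors standing in for the trained latent queries. All names here (`cross_attend`, `compact`, `latents`) are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query returns a
    softmax-weighted average of `values`."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def compact(keys, values, latent_queries):
    """One forward pass, no gradient steps: m latent queries read the
    n cached entries and emit m synthesized keys and values."""
    compact_keys = cross_attend(latent_queries, keys, keys)
    compact_values = cross_attend(latent_queries, keys, values)
    return compact_keys, compact_values

random.seed(0)
n, m, d = 64, 8, 8  # 64 cached tokens -> 8 compact entries (8x)
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
latents = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # stand-in for trained latents

ck, cv = compact(keys, values, latents)
print(len(ck), len(cv))  # compact cache length equals the number of latents
```

The output size is set by the number of latents, not the context length, and the compact entries are mixtures of cached entries rather than a subset of them, which is the difference from selection.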

5/ Across 8×–200× compression and 8k–64k contexts on Qwen3-4B, Still is the only method that stays both fast and accurate. Selection collapses under tight budgets. Per-context synthesis is slow. Still sits alone on the good side of the speed/accuracy frontier.

6/ The closest published method is KV-Distill, which is also amortized, but it can only ever keep original tokens, not synthesize new ones. On matched-training RULER, Still beats it by 8–22 points across 16 of 18 cells

7/ And the compact cache isn't just for multiple-choice tasks; it also works for open-ended generation. On HELMET summarization, Still keeps 74–95% of the full-context gain. On LongBench, it wins 60% of head-to-head judge comparisons vs KV-Distill. This means Still can be a real working

8/ Because compaction is just a forward pass, you can call it over and over as context streams in. That opens up iterative long-horizon compaction, a regime that's basically closed to per-context methods, since their fitting cost compounds every time you compact
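The loop this enables looks roughly like the sketch below. `toy_compact` is a placeholder that just averages neighbouring entries; it stands in for a learned compactor and is not how Still works internally. The budget and chunk sizes are made up.

```python
def toy_compact(entries, budget):
    """Placeholder compactor: average contiguous groups so that
    exactly `budget` entries remain. Requires len(entries) >= budget."""
    n = len(entries)
    out = []
    for i in range(budget):
        group = entries[i * n // budget:(i + 1) * n // budget]
        out.append(sum(group) / len(group))
    return out

def stream_with_compaction(chunks, budget, compact_fn):
    """Append each incoming chunk, then compact whenever the cache
    exceeds the budget. Each compaction is a single function call."""
    cache, sizes, num_compactions = [], [], 0
    for chunk in chunks:
        cache = cache + list(chunk)
        if len(cache) > budget:
            cache = compact_fn(cache, budget)
            num_compactions += 1
        sizes.append(len(cache))
    return cache, sizes, num_compactions

# Ten chunks of 100 scalar "entries" each, with a budget of 128.
chunks = [[float(100 * c + t) for t in range(100)] for c in range(10)]
cache, sizes, num_compactions = stream_with_compaction(chunks, 128, toy_compact)
print(sizes, num_compactions)
```

The cache never exceeds the budget no matter how much context streams in. With a per-context method, each of those compaction calls would instead be a fresh optimization run, which is where the compounding cost comes from.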

9/ Still transfers across Qwen3 dense (4B→32B), the 30B-A3B MoE, and Gemma-3's mixed attention stack, where it just compacts the global layers and leaves the rest alone

10/ There are a few rough edges: it's not lossless, iterative compaction isn't free extrapolation (training horizon matters a lot), and exact needle retrieval is still hard. But we've made many improvements to the architecture and training process since we wrote this paper, and