1/ You can shrink a language model's KV cache by 200×, in a single forward pass, and it still answers correctly. At 256k context that's 36 GiB of cache down to ~360 MiB, with no change to the base model. Here's how we did it 👇 https://t.co/He1ucvxGyf
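
For a sense of where a figure like 36 GiB at 256k context can come from, here is a back-of-the-envelope sketch of uncompressed KV cache size. The layer count, KV-head count, head dimension, fp16 precision, and batch size of 1 are assumptions picked for illustration; the post does not say which model or settings were used.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Uncompressed KV cache size in bytes for one sequence.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical configuration (an assumption, not taken from the post):
# 36 layers, 8 KV heads, head dimension 128, fp16 values, 256k tokens.
baseline = kv_cache_bytes(
    num_layers=36,
    num_kv_heads=8,
    head_dim=128,
    context_len=256 * 1024,
)

print(f"{baseline / GIB:.1f} GiB")  # 36.0 GiB
```

The cache grows linearly with context length, so under these assumed settings each token costs 144 KiB of cache, which is why long contexts become memory-bound.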



2:59 PM · Jun 10, 2026 · 67.8K Views