Show HN: Standalone TurboQuant KV Cache Inference

3 points by g023 3 months ago · 4 comments · 1 min read


Implements TurboQuant (ICLR 2026, arXiv:2504.19874) KV cache compression directly inside a Transformers inference script. All algorithms are self-contained. Minimal dependencies.

- uses https://huggingface.co/g023/Qwen3-1.77B-g023 as the demonstration model (throw model files in Qwen3-BEST folder)
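For readers new to the idea, here is a minimal sketch of the rotate-then-quantize step that, as far as I understand the paper, sits at the core of TurboQuant. It is not the code in the script. It swaps in a plain uniform quantizer with a per-vector scale in place of the paper's codebooks and leaves out any residual correction, and the function names (`random_rotation`, `quantize`, `dequantize`) are made up for illustration.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the result does not depend on QR sign conventions.
    return q * np.sign(np.diag(r))

def quantize(x, rot, bits=4):
    """Rotate each row of x, then round to signed integer codes.

    ASSUMPTION: uniform levels with one scale per row, a simplification
    chosen for this sketch rather than the paper's quantizer.
    """
    y = x @ rot.T                                  # rotate every vector
    levels = 2 ** (bits - 1) - 1                   # e.g. 7 for 4 bits
    scale = np.abs(y).max(axis=-1, keepdims=True) / levels
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero rows
    codes = np.clip(np.round(y / scale), -levels, levels).astype(np.int8)
    return codes, scale

def dequantize(codes, scale, rot):
    """Undo the scaling, then apply the inverse (transpose) rotation."""
    return (codes.astype(np.float32) * scale) @ rot
```

In a KV cache setting, `x` would be the key or value vectors for one head (shape `[tokens, head_dim]`), and only `codes` plus `scale` would be kept in memory.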

santander_cl 3 months ago

Starred immediately.

This is exactly the kind of practical quantization work that makes running longer-context models on consumer GPUs actually feasible. Looking forward to seeing it generalized beyond the one model. Great stuff, g023.

g023OP 3 months ago

I had some issues in the original, but had to step away for a bit to do some backups (weak). Anyway, I've pushed the necessary fixes, added some more tunable values at the top to play with, and dialed in the params for this specific model a bit more. My next step in this little experiment is to start testing with some other models. Thanks for the interest. Feel free to try the latest version and run the interactive mode to chat with the model and get feedback on the results as you go. If you have any suggestions, let me know. I'm trying to keep this one as barebones as possible so it's easier for others to port to other languages or integrate into other uses.
edit: just added Mirostat v2 to clean up repetitive output from the model
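For anyone unfamiliar with Mirostat v2, a rough sketch of a single sampling step is below. It is not taken from the script. The function name and the defaults for `tau` (target surprise) and `eta` (learning rate) are assumptions for illustration, and this version measures surprise on the renormalized probabilities, which other implementations may handle differently.

```python
import math
import random

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """One Mirostat v2 sampling step (illustrative sketch).

    probs: token probabilities that sum to 1.
    mu:    current surprise cutoff, commonly started at 2 * tau.
    Returns (token_index, updated_mu).
    """
    # Keep only tokens whose surprise (-log2 p) does not exceed mu.
    kept = [(i, p) for i, p in enumerate(probs)
            if p > 0 and -math.log2(p) <= mu]
    if not kept:
        # Always keep at least the single most likely token.
        best = max(range(len(probs)), key=lambda i: probs[i])
        kept = [(best, probs[best])]

    # Renormalize the surviving tokens and sample one of them.
    total = sum(p for _, p in kept)
    indices = [i for i, _ in kept]
    weights = [p / total for _, p in kept]
    pos = rng.choices(range(len(indices)), weights=weights, k=1)[0]

    # Move mu toward the target: higher than target surprise lowers mu.
    surprise = -math.log2(weights[pos])
    mu = mu - eta * (surprise - tau)
    return indices[pos], mu
```

The feedback loop is what cuts down on repetition: when the sampled tokens are too predictable, `mu` rises and admits less likely tokens on the next step.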

ensotrade_tech 3 months ago

What does it actually do?

g023OP 3 months ago

A single-file, Python-based TurboQuant playground with minimal, recognizable dependencies. Barebones af, with some easy-to-access globals at the top of 'run_tquant.py' to experiment with. The test model is a 1.77B model that I made by duplicating a layer in a Qwen3 1.7B model. It will probably work fine with the regular Qwen3 1.7B model as well, but for right now I'm just working with my surgically altered one while I work on the script.
