Show HN: Standalone TurboQuant KV Cache Inference
github.com

Implements TurboQuant (ICLR 2026, arXiv:2504.19874) KV cache compression directly inside a Transformers inference script. All algorithms are self-contained. Minimal dependencies.
- uses https://huggingface.co/g023/Qwen3-1.77B-g023 as the demonstration model (put the model files in the Qwen3-BEST folder)

Starred immediately. This is exactly the kind of practical quantization work that makes running longer-context models on consumer GPUs actually feasible. Looking forward to seeing it generalized beyond the one model.

Great stuff, g023.

I had some issues in the original, but had to jump away for a bit to do some backups (weak). Anyway, I've updated it with the necessary fixes, added some more tunable values at the top to play with, and dialed in the params a bit more for this specific model. Testing with some other models is my next step in this little experiment. Thanks for the interest. Feel free to try the latest version and run the interactive mode to chat with the model and get feedback on the results as you go. If you have any suggestions, let me know. I'm trying to keep this one as barebones as possible to make it easier for others to port to other languages or integrate into other projects.

edit: just added Mirostat v2 to clean up repetitive output from the model

What does it actually do?

A single-file, Python-based TurboQuant playground, barebones af, with minimal/recognizable dependencies and some easy-to-access globals at the top of 'run_tquant.py' to experiment with. The test model is a 1.77B model that I made by duplicating a layer in a Qwen3 1.7B model. It will probably work fine with the regular Qwen3 1.7B model as well, but for right now I'm just working with my surgically altered one while I work on the script.
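
For anyone wondering what the Mirostat v2 edit mentioned in the thread refers to, here is a minimal sketch of one sampling step as the algorithm is commonly described. It is not taken from 'run_tquant.py'. The function name `mirostat_v2_step` and the defaults `tau=5.0` and `eta=0.1` are assumptions chosen for illustration, and the script's actual implementation and settings may differ. The idea is to drop tokens whose surprise exceeds a running threshold `mu`, sample from what remains, then nudge `mu` toward a target surprise `tau`.

```python
import math
import random


def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """One Mirostat v2 sampling step (illustrative sketch, hypothetical name).

    probs: next-token probabilities (already softmaxed, summing to 1)
    mu:    current surprise threshold in bits (often initialized to 2 * tau)
    tau:   target surprise in bits (assumed default)
    eta:   learning rate for the threshold update (assumed default)
    Returns (token_index, updated_mu).
    """
    # Keep only tokens whose surprise, -log2(p), does not exceed mu.
    candidates = [(i, p) for i, p in enumerate(probs)
                  if p > 0 and -math.log2(p) <= mu]
    if not candidates:
        # Never truncate everything: fall back to the most likely token.
        best = max(range(len(probs)), key=lambda i: probs[i])
        candidates = [(best, probs[best])]

    # Renormalize over the surviving tokens and sample one of them.
    total = sum(p for _, p in candidates)
    weights = [p / total for _, p in candidates]
    pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
    token = candidates[pick][0]

    # Observed surprise of the sampled token under the truncated,
    # renormalized distribution; move mu against the error.
    observed = -math.log2(weights[pick])
    new_mu = mu - eta * (observed - tau)
    return token, new_mu
```

In a generation loop you would carry `mu` across steps, feeding the returned value back in for the next token. A lower `tau` pushes output toward more predictable text, while a higher one allows more surprising tokens.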