and others added 5 commits
Integrate QJL CUDA kernels from amirzandieh/QJL for fused quantization and attention score computation. Restructure flat files into turboquant/ package with setup.py for installable distribution.

New files:
- turboquant/cuda_backend.py: QJL CUDA kernel wrappers with PyTorch fallback
- turboquant/csrc/*.cu: CUDA kernels (quant, score, gqa_score, quantization)
- turboquant/benchmark_cuda.py: PyTorch vs CUDA kernel benchmarks
- setup.py: pip-installable package with optional CUDA build

Validation results on Qwen2.5-3B-Instruct (8K context):
- 3-bit: 5.0x compression, 0.9945 cosine sim, 289MB -> 57.6MB
- 4-bit: 3.8x compression, 0.9983 cosine sim, 289MB -> 75.6MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RotorQuant replaces TurboQuant's random orthogonal matrix (QR decomposition) with Clifford rotors R = exp(B/2) acting via sandwich product R x R̃.

New files:
- turboquant/clifford.py: Full Cl(3,0) geometric algebra (8-component multivectors, geometric product, rotor construction, sandwich product)
- turboquant/rotorquant.py: RotorQuantMSE, RotorQuantProd, RotorQuantKVCache with grade-aware Lloyd-Max quantization + QJL residual correction
- turboquant/benchmark_rotorquant.py: 7-test comparative benchmark

Benchmark results (d=128, GPU: RTX PRO 4000 Blackwell):

| Metric              | TurboQuant | RotorQuant | Notes                        |
|---------------------|------------|------------|------------------------------|
| MSE (3-bit)         | 0.034      | 0.081      | TQ wins on raw MSE           |
| IP correlation      | 0.918      | 0.878      | TQ's full rotation helps     |
| Needle retrieval    | 9/9 exact  | 9/9 exact  | Both perfect                 |
| Parameters          | 16,399     | 372        | RQ 44x more efficient        |
| Speed (10K vectors) | 0.41 ms    | 4.87 ms    | TQ 12x faster (no GP kernel) |

Key finding: RotorQuant's 44x parameter efficiency comes at a cost in MSE because grouping 3 dims into Cl(3,0) multivectors changes the per-component distribution vs TurboQuant's full-matrix rotation. A fused CUDA kernel for the geometric product would close the speed gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
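The rotor sandwich product described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the contents of turboquant/clifford.py: the bitmask blade indexing and the names `geometric_product`, `reverse`, `make_rotor`, and `sandwich` are assumptions made for the example.

```python
import numpy as np

# Basis blades of Cl(3,0) indexed by bitmask: bit 0 = e1, bit 1 = e2, bit 2 = e3.
# Index 0 is the scalar, 1/2/4 are vectors, 3/5/6 are bivectors, 7 is e123.
VECTOR_IDX = [1, 2, 4]
BIVECTOR_IDX = [3, 5, 6]  # e12, e13, e23


def reorder_sign(a: int, b: int) -> float:
    """Sign picked up when sorting the basis vectors of blade a * blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1.0 if swaps & 1 else 1.0


def geometric_product(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Geometric product of two 8-component multivectors (Euclidean metric)."""
    out = np.zeros(8)
    for a in range(8):
        for b in range(8):
            out[a ^ b] += reorder_sign(a, b) * x[a] * y[b]
    return out


def reverse(x: np.ndarray) -> np.ndarray:
    """Reversion: grades 2 and 3 change sign, grades 0 and 1 do not."""
    out = x.copy()
    out[BIVECTOR_IDX + [7]] *= -1.0
    return out


def make_rotor(bivector_coeffs) -> np.ndarray:
    """R = exp(B/2) = cos(|B|/2) + (B/|B|) sin(|B|/2) for a nonzero bivector B."""
    b = np.asarray(bivector_coeffs, dtype=float)
    theta = np.linalg.norm(b)
    rotor = np.zeros(8)
    rotor[0] = np.cos(theta / 2)
    rotor[BIVECTOR_IDX] = np.sin(theta / 2) * b / theta
    return rotor


def sandwich(rotor: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Embed a 3-vector as a grade-1 multivector and compute R x R~."""
    x = np.zeros(8)
    x[VECTOR_IDX] = v
    return geometric_product(geometric_product(rotor, x), reverse(rotor))


rotor = make_rotor([0.3, -1.1, 0.7])
v = np.array([1.0, 2.0, -0.5])
rotated = sandwich(rotor, v)
# Sandwiching with the reversed rotor undoes the rotation.
restored = sandwich(reverse(rotor), rotated[VECTOR_IDX])
```

The result stays a pure vector with the same norm as the input, which is why the rotor can stand in for an orthogonal rotation on each 3-dimensional group. Only 4 of the rotor's 8 components (the scalar and the three bivector coefficients) are nonzero.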
On actual KV cache data from Qwen2.5-3B-Instruct, RotorQuant achieves essentially identical attention fidelity to TurboQuant despite higher synthetic MSE: the QJL residual correction compensates for the weaker Stage 1.

Results (8/36 layers, first 16 KV heads):

2K context:
  TQ 3-bit: cosine=0.9906, top1=81.2%, top5=93.8%
  RQ 3-bit: cosine=0.9903, top1=81.2%, top5=93.8%

4K context:
  TQ 3-bit: cosine=0.9875, top1=81.2%, top5=87.5%
  RQ 3-bit: cosine=0.9870, top1=81.2%, top5=93.8%
  TQ 4-bit: cosine=0.9880, top1=75.0%, top5=93.8%
  RQ 4-bit: cosine=0.9874, top1=81.2%, top5=93.8%

At 4K context RotorQuant actually beats TurboQuant on top-1 match at 4-bit (81.2% vs 75.0%) and on top-5 match at 3-bit (93.8% vs 87.5%), while the remaining top-k figures are tied. This suggests the Clifford rotor decorrelation may better preserve the directional structure of real KV cache vectors despite worse MSE on random unit vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
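The commit does not spell out how the cosine and top-k figures are computed, so the following is only one plausible sketch. It assumes cosine is taken between the reference and approximate softmax attention distributions, and that "top-k match" means the reference argmax appears among the k highest approximate scores. The function name `attention_fidelity` and the Gaussian noise standing in for quantization error are hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_fidelity(query, keys, keys_hat, k=5):
    """Compare attention over exact keys with attention over reconstructed keys."""
    d = query.shape[0]
    ref = softmax(keys @ query / np.sqrt(d))
    approx = softmax(keys_hat @ query / np.sqrt(d))
    cosine = float(ref @ approx / (np.linalg.norm(ref) * np.linalg.norm(approx)))
    top_ref = int(np.argmax(ref))
    top1 = top_ref == int(np.argmax(approx))
    # Assumed definition: the true top key is among the k best approximate keys.
    topk = bool(top_ref in np.argsort(approx)[-k:])
    return cosine, top1, topk


rng = np.random.default_rng(0)
keys = rng.standard_normal((2048, 128))
query = rng.standard_normal(128)
# Hypothetical stand-in for quantize/dequantize error on the key cache.
keys_hat = keys + 0.05 * rng.standard_normal(keys.shape)

cosine, top1, top5 = attention_fidelity(query, keys, keys_hat)
exact = attention_fidelity(query, keys, keys)
```

Averaging `top1` and `top5` over heads and layers would give percentages of the kind reported above.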
Fused kernel does embed → rotor_sandwich → quantize → inverse → extract in a single kernel launch. Exploits rotor sparsity (4 of 8 multivector components are zero) to cut FMAs by ~50%.

Benchmark (RTX PRO 4000 Blackwell, d=128, 3-bit):

| n_vectors | TurboQuant | RQ PyTorch | RQ CUDA | CUDA vs TQ |
|-----------|------------|------------|---------|------------|
| 1,024     | 69 us      | 3.30 ms    | 6 us    | 11x faster |
| 4,096     | 132 us     | 3.86 ms    | 12 us   | 11x faster |
| 8,192     | 285 us     | 4.70 ms    | 20 us   | 14x faster |
| 16,384    | 740 us     | 6.71 ms    | 39 us   | 19x faster |

CUDA kernel is 170-530x faster than PyTorch RotorQuant and 10-19x faster than TurboQuant's cuBLAS matmul.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
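The five stages of the fused kernel can be traced in NumPy as a reference for what a single launch has to compute. This is a sketch under several assumptions, not the CUDA kernel itself: a rotor sandwich on a 3-vector is an ordinary 3D rotation, so 3x3 rotation matrices (Rodrigues formula) replace the multivector arithmetic; a uniform scalar quantizer is swapped in for the Lloyd-Max codebook; and d=128 is zero-padded to 129 so it splits into 43 groups of 3. All function names are hypothetical.

```python
import numpy as np


def rotation_matrices(bivectors: np.ndarray) -> np.ndarray:
    """One 3x3 rotation per group via the Rodrigues formula; input shape (G, 3)."""
    theta = np.linalg.norm(bivectors, axis=1)
    k = bivectors / theta[:, None]
    K = np.zeros((len(k), 3, 3))  # skew-symmetric matrix of each unit axis
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    s = np.sin(theta)[:, None, None]
    c = (1.0 - np.cos(theta))[:, None, None]
    return np.eye(3) + s * K + c * (K @ K)


def uniform_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantize values in [-1, 1] to bin midpoints (stand-in for Lloyd-Max)."""
    levels = 2 ** bits
    idx = np.clip(np.floor((x + 1.0) / 2.0 * levels), 0, levels - 1)
    return (idx + 0.5) / levels * 2.0 - 1.0


def rotor_quant_roundtrip(x: np.ndarray, rots: np.ndarray, bits: int) -> np.ndarray:
    n, d = x.shape
    G = rots.shape[0]
    padded = np.zeros((n, 3 * G))
    padded[:, :d] = x                                     # embed
    groups = padded.reshape(n, G, 3)
    rotated = np.einsum("gij,ngj->ngi", rots, groups)     # rotor sandwich
    quantized = uniform_quantize(rotated, bits)           # quantize
    restored = np.einsum("gji,ngj->ngi", rots, quantized) # inverse (transpose)
    return restored.reshape(n, 3 * G)[:, :d]              # extract


rng = np.random.default_rng(0)
d = 128
G = -(-d // 3)  # ceil(d / 3) = 43 groups
rots = rotation_matrices(rng.standard_normal((G, 3)))
x = rng.standard_normal((16, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit vectors, so values stay in [-1, 1]
x_hat = rotor_quant_roundtrip(x, rots, bits=8)
```

Every stage here is elementwise or a small per-group product with no dependency across groups, which is the property that lets the whole pipeline run in one kernel launch.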
nalditopr pushed a commit to nalditopr/llama-cpp-turboquant that referenced this pull request
Two turbo3 attention paths now available:
1. FA path (-fa on): explicit inverse WHT on K+V, then MMA flash attention
2. MMVQ path (no -fa): fused WHT in K dot product, WHT linearity for V
WHT linearity trick: instead of dequanting entire V cache with inverse
WHT (O(n_kv * head_dim) cooperative kernels), dequant V to rotated
space (cheap: centroid*gamma, no syncthreads), matmul in rotated space,
then apply inverse WHT only to the tiny output (O(n_q * head_dim)).
During decode n_q=1, saving ~1000x cooperative kernel invocations.
Implementation:
- CPY turbo3→F32 now does no-WHT dequant (centroid*gamma only)
- FA path adds explicit ggml_turbo_wht(k/v, 1) after cast
- Non-FA path defers inverse WHT to after kqv matmul
- Re-enabled TURBO_WHT CUDA kernel for post-matmul inverse
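The WHT linearity trick rests on matrix associativity, which a small NumPy check can make concrete. This sketch leaves out quantization entirely and uses an explicit orthonormal Hadamard matrix rather than a fast transform. The shapes and the name `hadamard` are illustrative assumptions, not code from the commit.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


rng = np.random.default_rng(0)
n_kv, head_dim = 1024, 128
H = hadamard(head_dim)  # symmetric and orthonormal, so H is its own inverse

V = rng.standard_normal((n_kv, head_dim))
V_rot = V @ H  # what the cache holds: V in the rotated (WHT) space

weights = rng.random((1, n_kv))  # one decode query, n_q = 1
weights /= weights.sum()

# Eager: inverse WHT on all n_kv cached rows, then the matmul.
eager = weights @ (V_rot @ H)
# Deferred: matmul in rotated space, inverse WHT on the single output row.
deferred = (weights @ V_rot) @ H
```

Both orderings give the same output, but the deferred form transforms n_q rows instead of n_kv rows, which is the source of the saving during decode.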
Benchmarks on Qwen3.5-35B-A3B (RTX 5090):
                Prefill   Decode   vs f16
f16 baseline:      6860      187     100%
turbo3 FA:         6832      177      95%
turbo3 MMVQ:        107      153      82%   ← WHT linearity
Compression: 4.6x KV cache
Inspired by: tonbistudio/turboquant-pytorch#4 (kernel fusion insights)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>