RotorQuant: Clifford algebra reimagining of TurboQuant by johndpope · Pull Request #4 · tonbistudio/turboquant-pytorch


and others added 5 commits

March 26, 2026 19:20
Integrate QJL CUDA kernels from amirzandieh/QJL for fused
quantization and attention score computation. Restructure
flat files into turboquant/ package with setup.py for
installable distribution.

New files:
- turboquant/cuda_backend.py: QJL CUDA kernel wrappers with PyTorch fallback
- turboquant/csrc/*.cu: CUDA kernels (quant, score, gqa_score, quantization)
- turboquant/benchmark_cuda.py: PyTorch vs CUDA kernel benchmarks
- setup.py: pip-installable package with optional CUDA build

Validation results on Qwen2.5-3B-Instruct (8K context):
- 3-bit: 5.0x compression, 0.9945 cosine sim, 289MB -> 57.6MB
- 4-bit: 3.8x compression, 0.9983 cosine sim, 289MB -> 75.6MB
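
The compression figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the sizes come from the commit message, while the fp16 (16-bit) baseline and the helper names are assumptions made for illustration.

```python
# Sizes reported in the commit message, in MB.
BASELINE_MB = 289.0
QUANTIZED_MB = {"3-bit": 57.6, "4-bit": 75.6}

# Assumption (not stated in the commit): the uncompressed cache is fp16.
BASELINE_BITS = 16


def compression_ratio(baseline_mb, quantized_mb):
    """How many times smaller the quantized cache is."""
    return baseline_mb / quantized_mb


def effective_bits(baseline_mb, quantized_mb, baseline_bits=BASELINE_BITS):
    """Average stored bits per cached value, including any side information."""
    return baseline_bits * quantized_mb / baseline_mb


for name, size_mb in QUANTIZED_MB.items():
    ratio = compression_ratio(BASELINE_MB, size_mb)
    bits = effective_bits(BASELINE_MB, size_mb)
    print(f"{name}: {ratio:.1f}x, about {bits:.2f} bits per value")
```

Under the fp16 assumption both settings land roughly 0.2 bits above their nominal width, which would be consistent with a small amount of per-vector side information such as norms or scales.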

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RotorQuant replaces TurboQuant's random orthogonal matrix (QR decomposition)
with Clifford rotors R = exp(B/2) acting via sandwich product R x R̃.
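
As an illustration of the rotor construction and sandwich product, here is a small pure-Python sketch of Cl(3,0). It is not the code in turboquant/clifford.py: the bitmask blade layout and all function names are assumptions chosen for brevity.

```python
import math

# Basis blades of Cl(3,0) indexed by bitmask: bit 0 = e1, bit 1 = e2, bit 2 = e3.
# Index: 0=1, 1=e1, 2=e2, 3=e12, 4=e3, 5=e13, 6=e23, 7=e123.


def reorder_sign(a, b):
    """Sign picked up when sorting the basis vectors of blade a followed by blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1.0 if swaps & 1 else 1.0


def geometric_product(x, y):
    """Full 8x8 geometric product; every basis vector squares to +1."""
    out = [0.0] * 8
    for a, xa in enumerate(x):
        for b, yb in enumerate(y):
            out[a ^ b] += reorder_sign(a, b) * xa * yb
    return out


def reverse(x):
    """Reversion flips the sign of the grade-2 and grade-3 components."""
    return [-v if bin(i).count("1") >= 2 else v for i, v in enumerate(x)]


def rotor_from_bivector(b12, b13, b23):
    """R = exp(B/2) for the bivector B = b12*e12 + b13*e13 + b23*e23."""
    theta = math.sqrt(b12 * b12 + b13 * b13 + b23 * b23)
    rotor = [0.0] * 8
    if theta == 0.0:
        rotor[0] = 1.0
        return rotor
    scale = math.sin(theta / 2) / theta
    rotor[0] = math.cos(theta / 2)
    rotor[3], rotor[5], rotor[6] = scale * b12, scale * b13, scale * b23
    return rotor


def sandwich(rotor, vec3):
    """Rotate a 3-vector via R x R~ and read back the grade-1 part."""
    x = [0.0] * 8
    x[1], x[2], x[4] = vec3
    y = geometric_product(geometric_product(rotor, x), reverse(rotor))
    return (y[1], y[2], y[4])


R = rotor_from_bivector(math.pi / 2, 0.0, 0.0)
rotated = sandwich(R, (1.0, 0.0, 0.0))
```

With this sign convention, a quarter-turn bivector in the e12 plane sends e1 to -e2. The rotor only populates the scalar and the three bivector slots, so each group of 3 dimensions needs at most 4 rotation parameters.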

New files:
- turboquant/clifford.py: Full Cl(3,0) geometric algebra (8-component
  multivectors, geometric product, rotor construction, sandwich product)
- turboquant/rotorquant.py: RotorQuantMSE, RotorQuantProd, RotorQuantKVCache
  with grade-aware Lloyd-Max quantization + QJL residual correction
- turboquant/benchmark_rotorquant.py: 7-test comparative benchmark
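
For readers unfamiliar with Lloyd-Max quantization, the following is a minimal scalar sketch. It fits a single codebook to Gaussian samples and is not grade-aware; how rotorquant.py splits codebooks by grade is not shown in this PR description, and the function names here are hypothetical.

```python
import random


def quantize(x, centroids):
    """Index of the centroid nearest to x."""
    return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))


def mse(samples, centroids):
    """Mean squared reconstruction error over the samples."""
    return sum((x - centroids[quantize(x, centroids)]) ** 2 for x in samples) / len(samples)


def lloyd_max(samples, bits, iters=30):
    """Alternate nearest-centroid assignment and cell-mean updates."""
    levels = 2 ** bits
    lo, hi = min(samples), max(samples)
    # Start from uniformly spaced centroids over the sample range.
    centroids = [lo + (hi - lo) * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for x in samples:
            k = quantize(x, centroids)
            sums[k] += x
            counts[k] += 1
        # Empty cells keep their previous centroid.
        centroids = [sums[k] / counts[k] if counts[k] else centroids[k]
                     for k in range(levels)]
    return centroids


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]
codebook = lloyd_max(samples, bits=3)
print("3-bit codebook:", [round(c, 3) for c in codebook])
print("MSE:", round(mse(samples, codebook), 4))
```

Each iteration can only lower (or keep) the MSE, so the fitted codebook is never worse than the uniform starting grid on the training samples.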

Benchmark results (d=128, GPU: RTX PRO 4000 Blackwell):

| Metric              | TurboQuant | RotorQuant | Notes                    |
|---------------------|-----------|-----------|--------------------------|
| MSE (3-bit)         | 0.034     | 0.081     | TQ wins on raw MSE       |
| IP correlation      | 0.918     | 0.878     | TQ's full rotation helps |
| Needle retrieval    | 9/9 exact | 9/9 exact | Both perfect             |
| Parameters          | 16,399    | 372       | RQ 44x more efficient    |
| Speed (10K vectors) | 0.41 ms   | 4.87 ms   | TQ 12x faster (no GP kernel) |

Key finding: RotorQuant's 44x parameter efficiency comes at a cost in
MSE because grouping 3 dims into Cl(3,0) multivectors changes the
per-component distribution vs TurboQuant's full-matrix rotation.
A fused CUDA kernel for the geometric product would close the speed gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On actual KV cache data from Qwen2.5-3B-Instruct, RotorQuant
achieves essentially identical attention fidelity to TurboQuant
despite higher synthetic MSE — the QJL residual correction
compensates for the weaker Stage 1.
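
The QJL idea referred to here can be sketched as a 1-bit inner-product estimator: store only the signs of random Gaussian projections of a vector plus its norm, then estimate dot products against an unquantized query. The sketch below applies it to a plain vector; in the PR it is described as acting on the Stage 1 residual, and the dimensions, names, and projection count are assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def qjl_encode(k, proj):
    """Keep the sign of each random projection of k, plus the norm of k."""
    signs = [1.0 if dot(row, k) >= 0.0 else -1.0 for row in proj]
    return signs, math.sqrt(dot(k, k))


def qjl_inner_product(q, signs, norm, proj):
    """Unbiased estimate of <q, k> from the sign bits and the stored norm."""
    acc = sum(dot(row, q) * s for row, s in zip(proj, signs))
    return math.sqrt(math.pi / 2) * norm * acc / len(proj)


random.seed(0)
d, m = 16, 4096  # vector dimension and number of projections (illustrative)
proj = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
q = normalize([random.gauss(0.0, 1.0) for _ in range(d)])
k = normalize([random.gauss(0.0, 1.0) for _ in range(d)])

signs, norm = qjl_encode(k, proj)
estimate = qjl_inner_product(q, signs, norm, proj)
exact = dot(q, k)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

The sqrt(pi/2) factor undoes the shrinkage introduced by taking signs, so the estimator is unbiased and its error falls off as 1/sqrt(m). A correction of this kind can offset a coarser first-stage quantizer when the downstream quantity is an attention score rather than the reconstructed vector itself.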

Results (8/36 layers, first 16 KV heads):

2K context:
  TQ 3-bit: cosine=0.9906, top1=81.2%, top5=93.8%
  RQ 3-bit: cosine=0.9903, top1=81.2%, top5=93.8%

4K context:
  TQ 3-bit: cosine=0.9875, top1=81.2%, top5=87.5%
  RQ 3-bit: cosine=0.9870, top1=81.2%, top5=93.8%
  TQ 4-bit: cosine=0.9880, top1=75.0%, top5=93.8%
  RQ 4-bit: cosine=0.9874, top1=81.2%, top5=93.8%

At 4K context RotorQuant matches or beats TurboQuant on top-1/top-5
match (top-1 at 4-bit: 81.2% vs 75.0%; top-5 at 3-bit: 93.8% vs 87.5%),
suggesting the Clifford rotor decorrelation may better preserve the
directional structure of real KV cache vectors despite worse MSE on
random unit vectors.
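
The fidelity metrics reported above could be computed along these lines. This is a sketch under an assumed definition: "topK match" is taken to mean that the key with the highest exact attention score appears among the K highest approximate scores. The benchmark script may define it differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two score vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def topk_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def topk_match(exact, approx, k):
    """Assumed definition: is the exact top-1 key inside the approximate top-k?"""
    return topk_indices(exact, 1)[0] in topk_indices(approx, k)


# Toy attention scores for one query over four keys (made-up numbers).
exact_scores = [0.1, 2.0, 0.5, 1.9]
approx_scores = [0.2, 1.8, 0.4, 1.95]

print("cosine:", round(cosine(exact_scores, approx_scores), 4))
print("top-1 match:", topk_match(exact_scores, approx_scores, 1))
print("top-2 match:", topk_match(exact_scores, approx_scores, 2))
```

In this toy case the two best keys swap places after quantization, so the top-1 check fails while the top-2 check passes, even though the cosine similarity stays close to 1. That is why cosine and top-k match can move independently in the results above.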

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fused kernel does embed → rotor_sandwich → quantize → inverse → extract
in a single kernel launch. Exploits rotor sparsity (4 of 8 multivector
components are zero) to cut FMAs by ~50%.
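
One way to arrive at the ~50% figure is to count multiply-add terms in a single geometric product. The commit does not say how the kernel counts FMAs, so the blade indexing and the counting below are assumptions for illustration only.

```python
# Blade indices by bitmask: 0=1, 1=e1, 2=e2, 3=e12, 4=e3, 5=e13, 6=e23, 7=e123.
ALL_BLADES = tuple(range(8))
ROTOR_BLADES = (0, 3, 5, 6)   # scalar + three bivectors: the only nonzero rotor slots
VECTOR_BLADES = (1, 2, 4)     # a pure grade-1 input


def product_terms(left_blades, right_blades):
    """Each (left, right) pair of nonzero components costs one multiply-add."""
    return [(a, b, a ^ b) for a in left_blades for b in right_blades]


full_terms = len(product_terms(ALL_BLADES, ALL_BLADES))
rotor_terms = len(product_terms(ROTOR_BLADES, ALL_BLADES))
rotor_vector_terms = len(product_terms(ROTOR_BLADES, VECTOR_BLADES))

print("dense 8x8 product:", full_terms)
print("rotor x general multivector:", rotor_terms)
print("rotor x pure vector:", rotor_vector_terms)
```

Skipping the four zero rotor components halves the dense term count. A kernel that also knows its input is a pure vector could skip more, though whether the PR's kernel does so is not stated.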

Benchmark (RTX PRO 4000 Blackwell, d=128, 3-bit):

| n_vectors | TurboQuant | RQ PyTorch | RQ CUDA  | CUDA vs TQ |
|-----------|-----------|------------|----------|------------|
| 1,024     | 69 us     | 3.30 ms    | 6 us     | 11x faster |
| 4,096     | 132 us    | 3.86 ms    | 12 us    | 11x faster |
| 8,192     | 285 us    | 4.70 ms    | 20 us    | 14x faster |
| 16,384    | 740 us    | 6.71 ms    | 39 us    | 19x faster |

CUDA kernel is roughly 170-550x faster than PyTorch RotorQuant and
11-19x faster than TurboQuant's cuBLAS matmul (ratios taken from the
rounded timings in the table above).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@repne repne mentioned this pull request

Mar 26, 2026


nalditopr pushed a commit to nalditopr/llama-cpp-turboquant that referenced this pull request

Mar 28, 2026
Two turbo3 attention paths now available:
1. FA path (-fa on): explicit inverse WHT on K+V, then MMA flash attention
2. MMVQ path (no -fa): fused WHT in K dot product, WHT linearity for V

WHT linearity trick: instead of dequanting entire V cache with inverse
WHT (O(n_kv * head_dim) cooperative kernels), dequant V to rotated
space (cheap: centroid*gamma, no syncthreads), matmul in rotated space,
then apply inverse WHT only to the tiny output (O(n_q * head_dim)).
During decode n_q=1, so this cuts cooperative kernel invocations by ~1000x.
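
The linearity argument can be checked numerically. The sketch below uses an orthonormal Walsh-Hadamard transform (which is its own inverse) in pure Python and compares the two orders of operations. The sizes and names are illustrative, and the real kernels may include extra sign flips or scaling that are not modelled here.

```python
import math
import random


def wht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [v * scale for v in y]


random.seed(0)
head_dim, n_kv = 8, 5
# V cache rows as they would sit in rotated (WHT) space after dequantization.
v_rot = [[random.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(n_kv)]

# Softmax attention weights for a single decode query (n_q = 1).
logits = [random.gauss(0.0, 1.0) for _ in range(n_kv)]
z = sum(math.exp(l) for l in logits)
p = [math.exp(l) / z for l in logits]

# Path A: inverse WHT on every V row, then the weighted sum (n_kv transforms).
rows = [wht(r) for r in v_rot]
out_per_row = [sum(p[i] * rows[i][d] for i in range(n_kv)) for d in range(head_dim)]

# Path B: weighted sum in rotated space, then one inverse WHT on the output.
mixed = [sum(p[i] * v_rot[i][d] for i in range(n_kv)) for d in range(head_dim)]
out_deferred = wht(mixed)

print(max(abs(a - b) for a, b in zip(out_per_row, out_deferred)))
```

Both paths agree to floating-point precision, while path B runs the transform once per query row instead of once per cached row. That is where the saving comes from when n_kv is in the thousands and n_q is 1.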

Implementation:
- CPY turbo3→F32 now does no-WHT dequant (centroid*gamma only)
- FA path adds explicit ggml_turbo_wht(k/v, 1) after cast
- Non-FA path defers inverse WHT to after kqv matmul
- Re-enabled TURBO_WHT CUDA kernel for post-matmul inverse

Benchmarks on Qwen3.5-35B-A3B (RTX 5090):
                    Prefill    Decode    vs f16
  f16 baseline:     6860       187       100%
  turbo3 FA:        6832       177        95%
  turbo3 MMVQ:       107       153        82%  ← WHT linearity
  Compression: 4.6x KV cache

Inspired by: tonbistudio/turboquant-pytorch#4 (kernel fusion insights)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>