Title: Deterministic inference mode (CUDA): RMSNorm, MatMul, Attention, KV-cache
Summary
Adds an opt-in deterministic mode that makes CUDA inference bit-identical for identical inputs—independent of batch size, prompt chunking, or concurrency. When enabled, RMSNorm, dense MatMul, and Attention use batch-invariant, fixed-reduction kernels and a stable, padded KV-cache layout.
Motivation
- Research & implementation inspired by Thinking Machines’ analysis of batch invariance and reduction order in LLM inference: [Defeating Nondeterminism in LLM Inference](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/).
- This work underpins reproducibility guarantees needed by Steadytext: [steadytext.julep.ai](https://steadytext.julep.ai).
What’s included
- Deterministic RMSNorm (fixed per-row reduction order; batch-invariant); see the first sketch after this list.
- Deterministic MatMul for FP16/BF16 (fixed tiling, no split-K, FP32 accumulation); second sketch below.
- Deterministic Attention (fixed split size over KV, stable softmax reduction, unified KV path; KV cache aligned/padded so chunked and one-shot prefill are identical); third sketch below.
- Deterministic MoE: the `mul_mat_id` path is made deterministic.
- Off by default; the normal fast paths are unchanged.
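
To make the reduction-order point concrete, here is a minimal sketch (not the PR's actual kernel) of a batch-invariant RMSNorm: one block per row with a fixed, power-of-two block size, so the FP32 summation order for a given row width is identical no matter how many rows (tokens) are in the batch.

```cuda
#include <cuda_runtime.h>

// Batch-invariant RMSNorm sketch: the reduction shape depends only on ncols
// and BLOCK_SIZE, never on the number of rows, so adding rows to the batch
// cannot change the rounding of any individual row.
template <int BLOCK_SIZE> // power of two, e.g. 256
__global__ void rms_norm_det_f32(const float * x, float * dst,
                                 const int ncols, const float eps) {
    const int row = blockIdx.x;   // one block per row
    const int tid = threadIdx.x;

    const float * xr = x   + (size_t) row * ncols;
    float       * dr = dst + (size_t) row * ncols;

    // Fixed strided accumulation: thread t always sums columns t, t+BLOCK_SIZE, ...
    float sumsq = 0.0f;
    for (int col = tid; col < ncols; col += BLOCK_SIZE) {
        const float v = xr[col];
        sumsq += v * v;
    }

    // Fixed-shape shared-memory tree reduction: same pairing order every launch.
    __shared__ float buf[BLOCK_SIZE];
    buf[tid] = sumsq;
    __syncthreads();
    for (int offset = BLOCK_SIZE / 2; offset > 0; offset >>= 1) {
        if (tid < offset) {
            buf[tid] += buf[tid + offset];
        }
        __syncthreads();
    }

    const float scale = rsqrtf(buf[0] / ncols + eps);
    for (int col = tid; col < ncols; col += BLOCK_SIZE) {
        dr[col] = xr[col] * scale;
    }
}

// Launch: rms_norm_det_f32<256><<<nrows, 256>>>(x, dst, ncols, 1e-6f);
```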
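
The matmul idea, reduced to its essence (the PR's kernels use fixed tiling for throughput; this deliberately naive version only shows the fixed, single-pass FP32 reduction over K, with no split-K and no atomics):

```cuda
#include <cuda_fp16.h>

// Deterministic FP16 matmul sketch: C[M x N] = A[M x K] * B[K x N], row-major.
// Each output element is one strictly sequential FP32 reduction over K, so the
// summation order never depends on batch size or work scheduling.
__global__ void matmul_det_f16(const half * A, const half * B, half * C,
                               const int M, const int N, const int K) {
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) {
        return;
    }

    float acc = 0.0f; // FP32 accumulation
    for (int k = 0; k < K; ++k) {
        acc += __half2float(A[(size_t) row * K + k]) *
               __half2float(B[(size_t) k * N + col]);
    }
    C[(size_t) row * N + col] = __float2half(acc);
}

// Launch: dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16);
//         matmul_det_f16<<<grid, block>>>(A, B, C, M, N, K);
```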
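
And the attention idea, simplified to a single head and one thread per query row (names and layout are illustrative, not the PR's code): the KV entries are walked in a fixed order with an online softmax carried in FP32, which is what lets chunked and one-shot prefill land on the same bits.

```cuda
#include <cuda_fp16.h>

// Deterministic single-head attention sketch.
// Q: [n_q, HEAD_DIM], K/V: [n_kv, HEAD_DIM], mask: [n_q, n_kv] additive (may be nullptr),
// out: [n_q, HEAD_DIM]. One thread per query row; fixed traversal order over KV.
template <int HEAD_DIM>
__global__ void attn_det_f16(const half * Q, const half * K, const half * V,
                             const float * mask, float * out,
                             const int n_q, const int n_kv, const float scale) {
    const int iq = blockIdx.x * blockDim.x + threadIdx.x;
    if (iq >= n_q) {
        return;
    }

    float q[HEAD_DIM];
    for (int d = 0; d < HEAD_DIM; ++d) {
        q[d] = __half2float(Q[(size_t) iq * HEAD_DIM + d]);
    }

    float m = -1e30f;             // running max (large negative start, so fully
                                  // masked leading entries contribute zero weight)
    float l = 0.0f;               // running softmax denominator
    float acc[HEAD_DIM] = {0.0f}; // FP32 output accumulator

    // Single pass over the KV entries in a fixed order, FP32 throughout.
    for (int t = 0; t < n_kv; ++t) {
        float s = 0.0f;
        for (int d = 0; d < HEAD_DIM; ++d) {
            s += q[d] * __half2float(K[(size_t) t * HEAD_DIM + d]);
        }
        s *= scale;
        if (mask) {
            s += mask[(size_t) iq * n_kv + t];
        }

        const float m_new = fmaxf(m, s);
        const float alpha = expf(m - m_new); // rescale previous partial results
        const float p     = expf(s - m_new);
        l = l * alpha + p;
        for (int d = 0; d < HEAD_DIM; ++d) {
            acc[d] = acc[d] * alpha + p * __half2float(V[(size_t) t * HEAD_DIM + d]);
        }
        m = m_new;
    }

    for (int d = 0; d < HEAD_DIM; ++d) {
        out[(size_t) iq * HEAD_DIM + d] = acc[d] / l;
    }
}

// Launch: attn_det_f16<128><<<(n_q + 63) / 64, 64>>>(Q, K, V, mask, out, n_q, n_kv, scale);
```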
Usage
- Build: `-DGGML_DETERMINISTIC=ON`
- Run: `--deterministic` (or `GGML_DETERMINISTIC=1`)
- For fully reproducible generation: `temperature=0`, `top_k=1`, `top_p=1`.
Scope & perf
- Targets CUDA (BF16/FP16). CPU is already deterministic; other GPU backends unchanged.
- Throughput trade-off in deterministic mode; default builds/perf unaffected when flag is off.
Tests
- New tests assert:
  - run-to-run bit equality,
  - batch & chunking invariance,
  - attention/masking (incl. ALiBi),
  - deterministic MoE.
All passing on tested NVIDIA GPUs.
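
For clarity, "bit equality" means the raw float bit patterns match, with no epsilon tolerance; conceptually the check is along these lines (hypothetical helper, not the PR's test code):

```cpp
#include <cstring>
#include <cstddef>

// Bit-exact comparison of two logits buffers: memcmp on the raw bytes, so any
// difference in the floating-point bit pattern counts as a failure.
static bool logits_bit_equal(const float * a, const float * b, size_t n) {
    return std::memcmp(a, b, n * sizeof(float)) == 0;
}
```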
Notes
- Happy to rename the flag to `LLAMA_DETERMINISTIC` if maintainers prefer; currently `GGML_DETERMINISTIC`.