Title: Deterministic inference mode (CUDA): RMSNorm, MatMul, Attention, KV-cache
Summary
Adds an opt-in deterministic mode that makes CUDA inference bit-identical for identical inputs—independent of batch size, prompt chunking, or concurrency. When enabled, RMSNorm, dense MatMul, and Attention use batch-invariant, fixed-reduction kernels and a stable, padded KV-cache layout.
Motivation
- Research & implementation inspired by Thinking Machines’ analysis of batch invariance and reduction order in LLM inference: [Defeating Nondeterminism in LLM Inference](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/).
- This work underpins reproducibility guarantees needed by Steadytext: [steadytext.julep.ai](https://steadytext.julep.ai).
What’s included
- Deterministic RMSNorm (fixed per-row reduction order; batch-invariant); see the first sketch after this list.
- Deterministic MatMul for FP16/BF16 (fixed tiling, no split-K, FP32 accumulation); second sketch below.
- Deterministic Attention (fixed split size over KV, stable softmax reduction, unified KV path; KV cache aligned/padded so chunked and one-shot prefill are identical); third sketch below.
- Deterministic MoE: the `mul_mat_id` path is made deterministic.
- Off by default; the normal fast paths are unchanged.
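
To make the reduction-order point concrete, here is a minimal sketch (not the PR's actual kernel) of a batch-invariant RMSNorm: one block per row with a fixed, power-of-two block size, so the FP32 summation order for a given row width is identical no matter how many rows (tokens) are in the batch.

```cuda
#include <cuda_runtime.h>

// Batch-invariant RMSNorm sketch: the reduction shape depends only on ncols
// and BLOCK_SIZE, never on the number of rows, so adding rows to the batch
// cannot change the rounding of any individual row.
template <int BLOCK_SIZE> // power of two, e.g. 256
__global__ void rms_norm_det_f32(const float * x, float * dst,
                                 const int ncols, const float eps) {
    const int row = blockIdx.x;   // one block per row
    const int tid = threadIdx.x;

    const float * xr = x   + (size_t) row * ncols;
    float       * dr = dst + (size_t) row * ncols;

    // Fixed strided accumulation: thread t always sums columns t, t+BLOCK_SIZE, ...
    float sumsq = 0.0f;
    for (int col = tid; col < ncols; col += BLOCK_SIZE) {
        const float v = xr[col];
        sumsq += v * v;
    }

    // Fixed-shape shared-memory tree reduction: same pairing order every launch.
    __shared__ float buf[BLOCK_SIZE];
    buf[tid] = sumsq;
    __syncthreads();
    for (int offset = BLOCK_SIZE / 2; offset > 0; offset >>= 1) {
        if (tid < offset) {
            buf[tid] += buf[tid + offset];
        }
        __syncthreads();
    }

    const float scale = rsqrtf(buf[0] / ncols + eps);
    for (int col = tid; col < ncols; col += BLOCK_SIZE) {
        dr[col] = xr[col] * scale;
    }
}

// Launch: rms_norm_det_f32<256><<<nrows, 256>>>(x, dst, ncols, 1e-6f);
```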
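
The matmul idea, reduced to its essence (the PR's kernels use fixed tiling for throughput; this deliberately naive version only shows the fixed, single-pass FP32 reduction over K, with no split-K and no atomics):

```cuda
#include <cuda_fp16.h>

// Deterministic FP16 matmul sketch: C[M x N] = A[M x K] * B[K x N], row-major.
// Each output element is one strictly sequential FP32 reduction over K, so the
// summation order never depends on batch size or work scheduling.
__global__ void matmul_det_f16(const half * A, const half * B, half * C,
                               const int M, const int N, const int K) {
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) {
        return;
    }

    float acc = 0.0f; // FP32 accumulation
    for (int k = 0; k < K; ++k) {
        acc += __half2float(A[(size_t) row * K + k]) *
               __half2float(B[(size_t) k * N + col]);
    }
    C[(size_t) row * N + col] = __float2half(acc);
}

// Launch: dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16);
//         matmul_det_f16<<<grid, block>>>(A, B, C, M, N, K);
```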
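
And the attention idea, simplified to a single head and one thread per query row (names and layout are illustrative, not the PR's code): the KV entries are walked in a fixed order with an online softmax carried in FP32, which is what lets chunked and one-shot prefill land on the same bits.

```cuda
#include <cuda_fp16.h>

// Deterministic single-head attention sketch.
// Q: [n_q, HEAD_DIM], K/V: [n_kv, HEAD_DIM], mask: [n_q, n_kv] additive (may be nullptr),
// out: [n_q, HEAD_DIM]. One thread per query row; fixed traversal order over KV.
template <int HEAD_DIM>
__global__ void attn_det_f16(const half * Q, const half * K, const half * V,
                             const float * mask, float * out,
                             const int n_q, const int n_kv, const float scale) {
    const int iq = blockIdx.x * blockDim.x + threadIdx.x;
    if (iq >= n_q) {
        return;
    }

    float q[HEAD_DIM];
    for (int d = 0; d < HEAD_DIM; ++d) {
        q[d] = __half2float(Q[(size_t) iq * HEAD_DIM + d]);
    }

    float m = -1e30f;             // running max (large negative start, so fully
                                  // masked leading entries contribute zero weight)
    float l = 0.0f;               // running softmax denominator
    float acc[HEAD_DIM] = {0.0f}; // FP32 output accumulator

    // Single pass over the KV entries in a fixed order, FP32 throughout.
    for (int t = 0; t < n_kv; ++t) {
        float s = 0.0f;
        for (int d = 0; d < HEAD_DIM; ++d) {
            s += q[d] * __half2float(K[(size_t) t * HEAD_DIM + d]);
        }
        s *= scale;
        if (mask) {
            s += mask[(size_t) iq * n_kv + t];
        }

        const float m_new = fmaxf(m, s);
        const float alpha = expf(m - m_new); // rescale previous partial results
        const float p     = expf(s - m_new);
        l = l * alpha + p;
        for (int d = 0; d < HEAD_DIM; ++d) {
            acc[d] = acc[d] * alpha + p * __half2float(V[(size_t) t * HEAD_DIM + d]);
        }
        m = m_new;
    }

    for (int d = 0; d < HEAD_DIM; ++d) {
        out[(size_t) iq * HEAD_DIM + d] = acc[d] / l;
    }
}

// Launch: attn_det_f16<128><<<(n_q + 63) / 64, 64>>>(Q, K, V, mask, out, n_q, n_kv, scale);
```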
Usage
- Build: `-DGGML_DETERMINISTIC=ON`
- Run: `--deterministic` (or `GGML_DETERMINISTIC=1`)
- For fully reproducible generation: `temperature=0`, `top_k=1`, `top_p=1`.
Scope & perf
- Targets CUDA (BF16/FP16). CPU is already deterministic; other GPU backends unchanged.
- Throughput trade-off in deterministic mode; default builds/perf unaffected when flag is off.
Tests
- New tests assert:
  - run-to-run bit equality,
  - batch & chunking invariance,
  - attention/masking (incl. ALiBi),
  - deterministic MoE.
All passing on tested NVIDIA GPUs.
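
For clarity, "bit equality" means the raw float bit patterns match, with no epsilon tolerance; conceptually the check is along these lines (hypothetical helper, not the PR's test code):

```cpp
#include <cstring>
#include <cstddef>

// Bit-exact comparison of two logits buffers: memcmp on the raw bytes, so any
// difference in the floating-point bit pattern counts as a failure.
static bool logits_bit_equal(const float * a, const float * b, size_t n) {
    return std::memcmp(a, b, n * sizeof(float)) == 0;
}
```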
Notes
- Happy to rename the flag to `LLAMA_DETERMINISTIC` if maintainers prefer; currently `GGML_DETERMINISTIC`.