Deterministic inference mode (CUDA): RMSNorm, MatMul, Attention, KV-cache by creatorrr · Pull Request #16016 · ggml-org/llama.cpp

Summary
Adds an opt-in deterministic mode that makes CUDA inference bit-identical for identical inputs—independent of batch size, prompt chunking, or concurrency. When enabled, RMSNorm, dense MatMul, and Attention use batch-invariant, fixed-reduction kernels and a stable, padded KV-cache layout.
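
Why a fixed reduction order is required for bit-identical results (general floating-point background, not taken from the PR): float addition is not associative, so summing the same values in a different order, which is what happens when batch size or work splitting changes a kernel's reduction tree, can change the result. A tiny host-side illustration:

```cuda
// Host-side demo (plain C++; also compiles as CUDA): the same three numbers
// summed in two different orders give different results, because float
// addition is not associative.
#include <cstdio>

int main() {
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    const float left  = (a + b) + c;  // (1e8 + -1e8) + 1 = 0 + 1  -> 1.0f
    const float right = a + (b + c);  // -1e8 + 1 rounds to -1e8, then + 1e8 -> 0.0f
    std::printf("(a+b)+c = %g,  a+(b+c) = %g\n", left, right);
    return 0;
}
```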

Motivation

Identical inputs do not currently guarantee bit-identical outputs on CUDA: results can shift with batch size, prompt chunking, and concurrent requests, because the kernels' floating-point reduction order depends on how the work is split. This mode fixes that order so runs are reproducible.

What’s included

  • Deterministic RMSNorm (fixed per-row reduction order; batch-invariant; a kernel sketch follows this list).
  • Deterministic MatMul for FP16/BF16 (fixed tiling, no split-K, FP32 accumulation).
  • Deterministic Attention (fixed split-size over KV, stable softmax reduction, unified KV path; KV cache aligned/padded so chunked vs one-shot prefill are identical).
  • MoE mul_mat_id path made deterministic.
  • Off by default; normal fast paths unchanged.
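
To make the "batch-invariant, fixed reduction order" idea concrete, here is a hypothetical RMSNorm kernel sketch (not the PR's actual ggml-cuda code; names and parameters are made up): one thread block per row and a fixed-shape shared-memory reduction, so the summation order depends only on the row length and block size, never on how many rows are in the batch. The deterministic MatMul and Attention kernels follow the same principle by fixing tiling and KV split sizes instead of letting them vary with the workload.

```cuda
// Hypothetical batch-invariant RMSNorm kernel (illustrative only).
#include <cuda_runtime.h>

__global__ void rms_norm_deterministic(const float * x, float * y,
                                       int ncols, float eps) {
    const int row = blockIdx.x;                    // one block per row
    const float * xr = x + (size_t) row * ncols;
    float       * yr = y + (size_t) row * ncols;

    // Each thread accumulates a partial sum over a fixed stride pattern that
    // depends only on ncols and blockDim.x, never on the batch size.
    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum += xr[col] * xr[col];
    }

    // Fixed-shape tree reduction in shared memory: the combination order is a
    // function of the block size alone, so it is identical for every row and
    // every launch. (Assumes blockDim.x == 256, a power of two.)
    __shared__ float buf[256];
    buf[threadIdx.x] = sum;
    __syncthreads();
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (threadIdx.x < (unsigned) offset) {
            buf[threadIdx.x] += buf[threadIdx.x + offset];
        }
        __syncthreads();
    }

    const float scale = rsqrtf(buf[0] / ncols + eps);
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        yr[col] = xr[col] * scale;
    }
}

// Launch sketch: one block per row, fixed block size:
//   rms_norm_deterministic<<<nrows, 256>>>(x, y, ncols, 1e-6f);
```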

Usage

  • Build: -DGGML_DETERMINISTIC=ON
  • Run: --deterministic (or GGML_DETERMINISTIC=1)
  • For fully reproducible generation: temperature=0, top_k=1, top_p=1 (a combined build-and-run example follows this list).
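
Putting the above together, a build-and-run sequence might look like the following. The GGML_CUDA switch, binary name, and model path are the usual llama.cpp conventions shown for illustration; only -DGGML_DETERMINISTIC, --deterministic, and GGML_DETERMINISTIC=1 come from this PR:

```sh
# Build with CUDA and the deterministic kernels compiled in
cmake -B build -DGGML_CUDA=ON -DGGML_DETERMINISTIC=ON
cmake --build build --config Release

# Run with deterministic kernels plus greedy sampling, so the generated text
# (not just the logits) is reproducible
./build/bin/llama-cli -m model.gguf --deterministic --temp 0 --top-k 1 --top-p 1

# Equivalently, enable the mode through the environment variable
GGML_DETERMINISTIC=1 ./build/bin/llama-cli -m model.gguf --temp 0 --top-k 1 --top-p 1
```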

Scope & perf

  • Targets CUDA (BF16/FP16). CPU is already deterministic; other GPU backends unchanged.
  • Deterministic mode trades some throughput for reproducibility; default builds and performance are unaffected when the flag is off.

Tests

  • New tests assert (a sketch of the bit-equality check follows this list):
    • run-to-run bit equality,
    • batch & chunking invariance,
    • attention/masking (incl. ALiBi),
    • deterministic MoE.
  • All passing on tested NVIDIA GPUs.
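
For reference, "bit equality" here means exact byte-for-byte agreement rather than a numeric tolerance. A minimal sketch of that kind of check (hypothetical helper, not the PR's actual test code):

```cuda
// Host-side bit-exactness check on two logit buffers: memcmp flags any
// difference at all, which an epsilon comparison would hide.
#include <cstddef>
#include <cstring>

static bool bits_identical(const float * a, const float * b, size_t n) {
    return std::memcmp(a, b, n * sizeof(float)) == 0;
}
```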

Notes

  • Happy to rename the flag to LLAMA_DETERMINISTIC if maintainers prefer; currently GGML_DETERMINISTIC.