ggml : x2 speed for WASM by optimizing SIMD by ngxson · Pull Request #11453 · ggml-org/llama.cpp


Simlowker added a commit to Simlowker/llama.cpp that referenced this pull request


Vectorize the 6-bit weight unpacking phase of ggml_vec_dot_q6_K_q8_K
on the WASM SIMD128 code path. PR ggml-org#11453 (Jan 2025) vectorized the
Q4/Q5/Q8 WASM paths but left Q6_K's ql/qh unpacking as a scalar loop
running 256 stores per block with per-byte bit manipulation. This PR
closes that remaining scalar region.

Approach: process 16 output lanes at a time using strict (non-relaxed)
WASM SIMD128 intrinsics. For each j-iteration (128 decoded weights),
the loop now handles 2 chunks of 16 lanes instead of 32 iterations of
4 scalar stores. All intrinsics used (v128_load/store, i8x16_splat,
v128_and/or, u8x16_shr, i8x16_shl, i8x16_sub) have fully specified
semantics in the WASM SIMD128 spec, with no relaxed_simd ops, so the
output is bit-identical across conforming implementations. This
matters for runtimes that require deterministic compute
(consensus-based VMs, fuelled runtimes, reproducible-research
pipelines).
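
The chunking described above can be sketched in portable C. This is an
illustration, not the code from the patch: the ql/qh bit layout is
assumed to match ggml's generic scalar Q6_K path, the function names
are hypothetical, and the 16-iteration inner loop stands in for the
lane-wise SIMD128 operations (load, and, shr, shl, or, sub, store).

```c
#include <stdint.h>
#include <string.h>

/* Scalar reference: decode one group of 128 six-bit weights.
 * Assumed layout: ql holds 64 bytes of low nibbles, qh holds 32 bytes
 * with four 2-bit high parts each. Outputs are re-centered by -32. */
static void unpack_q6_scalar(const uint8_t *ql, const uint8_t *qh, int8_t *a) {
    for (int l = 0; l < 32; ++l) {
        a[l +  0] = (int8_t)(((ql[l +  0] & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32);
        a[l + 32] = (int8_t)(((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32);
        a[l + 64] = (int8_t)(((ql[l +  0] >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32);
        a[l + 96] = (int8_t)(((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32);
    }
}

/* Chunked variant: 2 chunks of 16 lanes. Each memcpy models a 128-bit
 * load; each statement in the lane loop models one group of lane-wise
 * and/shift/or/sub operations followed by a 128-bit store. */
static void unpack_q6_chunked(const uint8_t *ql, const uint8_t *qh, int8_t *a) {
    for (int c = 0; c < 32; c += 16) {
        uint8_t lo0[16], lo1[16], hi[16];
        memcpy(lo0, ql + c, 16);        /* low nibbles for outputs 0..31 and 64..95  */
        memcpy(lo1, ql + 32 + c, 16);   /* low nibbles for outputs 32..63 and 96..127 */
        memcpy(hi,  qh + c, 16);        /* packed 2-bit high parts */
        for (int i = 0; i < 16; ++i) {
            a[c + i +  0] = (int8_t)(((lo0[i] & 0x0F) | (((hi[i] >> 0) & 3) << 4)) - 32);
            a[c + i + 32] = (int8_t)(((lo1[i] & 0x0F) | (((hi[i] >> 2) & 3) << 4)) - 32);
            a[c + i + 64] = (int8_t)(((lo0[i] >> 4)   | (((hi[i] >> 4) & 3) << 4)) - 32);
            a[c + i + 96] = (int8_t)(((lo1[i] >> 4)   | (((hi[i] >> 6) & 3) << 4)) - 32);
        }
    }
}
```

Because both variants use only integer masks, shifts, and subtraction,
their outputs can be compared byte for byte with memcmp.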

Microbench (Emscripten -O3 -msimd128, Node.js v24, N=4 runs):

  Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
  -------------------------|---------|--------|-------|----------
  Baseline (scalar unpack) |  357.81 |  22.91 | 7.98% | reference
  Patched  (vectorized)    |  349.85 |  23.43 | 2.53% | identical

Mean speedup is +2.3%. Run-to-run variation drops about 3× (CV 7.98%
→ 2.53%) because the vectorized path has fewer branches and more
predictable cycle counts. The modest mean speedup reflects that LLVM
-O3 already recovers a non-trivial fraction of the SIMD parallelism
from the scalar loop through auto-vectorization; the explicit SIMD
path guarantees SIMD codegen independent of compiler version, reduces
run-to-run variation, and provides a stable baseline for further
tuning.
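
For reference, the two headline figures follow directly from the table.
The helper names below are hypothetical, and the only inputs are the
reported means and CVs; the per-run samples are not part of this
message.

```c
/* Percent by which the patched path is faster than the baseline,
 * computed from mean ns/iter. */
static double speedup_pct(double baseline_ns, double patched_ns) {
    return (baseline_ns / patched_ns - 1.0) * 100.0;
}

/* Factor by which the coefficient of variation shrinks. */
static double cv_reduction(double baseline_cv, double patched_cv) {
    return baseline_cv / patched_cv;
}
```

With the table values, speedup_pct(357.81, 349.85) is about 2.28 and
cv_reduction(7.98, 2.53) is about 3.15.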

Bit-exactness: the microbench seeds the Q6_K and Q8_K blocks
deterministically (xorshift32, seed=42) and compares the float result.
All 8 runs (4 per variant) produced result=56754044928.000000,
identical across the baseline and patched paths.
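
A minimal sketch of this kind of deterministic seeding and bit-exact
comparison is shown below. The generator is the standard 32-bit
xorshift (shifts 13, 17, 5); how the microbench maps its output onto
the block fields is not stated here, so fill_bytes and
floats_bit_equal are hypothetical helpers.

```c
#include <stdint.h>
#include <string.h>

/* Standard xorshift32 step; the state must be nonzero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill a buffer with reproducible bytes from a given seed
 * (assumed mapping: low 8 bits of each generator output). */
static void fill_bytes(uint8_t *buf, size_t n, uint32_t seed) {
    uint32_t s = seed;
    for (size_t i = 0; i < n; ++i) {
        buf[i] = (uint8_t)(xorshift32(&s) & 0xFF);
    }
}

/* Compare two floats by bit pattern rather than by value, so that
 * 0.0f and -0.0f count as different. */
static int floats_bit_equal(float a, float b) {
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Seeding both variants with the same value gives them byte-identical
inputs, so any difference in the float result would come from the
unpacking path itself.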

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.