Vectorize the 6-bit weight unpacking phase of `ggml_vec_dot_q6_K_q8_K` on the WASM SIMD128 code path.

PR ggml-org#11453 (Jan 2025) vectorized the Q4/Q5/Q8 WASM paths but left Q6_K's `ql`/`qh` unpacking as a scalar loop running 256 stores per block with per-byte bit manipulation. This PR closes that remaining scalar region.

Approach: process 16 output lanes at once using strict (non-relaxed) WASM SIMD128 intrinsics. For each j-iteration (128 decoded weights), the loop now runs 2 × 16-lane chunks instead of 32 × 4 scalar stores. All intrinsics used (`v128_load`/`v128_store`, `i8x16_splat`, `v128_and`/`v128_or`, `u8x16_shr`, `i8x16_shl`, `i8x16_sub`) have fully specified semantics in the WASM SIMD128 spec, with no relaxed_simd ops, so the result is bit-exact across conforming implementations. This matters for runtimes that require deterministic compute: consensus-based VMs, fuelled runtimes, and reproducible-research pipelines.

Microbench (Emscripten `-O3 -msimd128`, Node.js v24, N=4 runs):

| Variant                  | ns/iter | GFLOPS | CV    | Bit-exact |
|--------------------------|---------|--------|-------|-----------|
| Baseline (scalar unpack) | 357.81  | 22.91  | 7.98% | reference |
| Patched (vectorized)     | 349.85  | 23.43  | 2.53% | identical |

Mean speedup: +2.3%. Per-run variance drops roughly 3× (CV 7.98% → 2.53%) because the vectorized path has fewer branches and more predictable cycle counts. The modest mean speedup reflects that LLVM `-O3` already extracts a non-trivial fraction of SIMD parallelism from the scalar loop via auto-vectorization; the explicit SIMD path guarantees SIMD codegen independent of compiler version, reduces variance, and provides a stable baseline for further tuning.

Bit-exactness: the microbench seeds the Q6_K and Q8_K blocks deterministically (xorshift32, seed=42) and compares the float result. All 8 runs produced result=56754044928.000000 identically across the baseline and patched paths.

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are unchanged.