GitHub - m0at/rvllm: rvLLM: High-performance LLM inference in Rust. Drop-in vLLM replacement.

LLM inference engine. Rust+CUDA on GPU, JAX+XLA on TPU.

Three Gemma 4 models on TPU v6e-4: E4B (16,794 tok/s peak, 78.3 tok/s B=1, PPL 5.87), 26B-A4B MoE (14,899 tok/s peak), 31B (9,600 tok/s peak, 128K context). GPU: 31B on H100 at 63 tok/s single-user decode (84 with speculative decoding on real text), 8,786 tok/s peak batch (FP8, CUDA graph). Zero custom kernels on TPU -- ~500 lines of JAX. Native Rust binary on GPU -- zero Python in the serving path.

Full benchmarks | June 2026 H100 session record

At a glance

	E4B (4B)	26B-A4B (MoE)	31B TPU	31B GPU	vLLM H100
B=1 tok/s	78.3	52.9	44.2	63.0 (83.9 spec)	66.9
Peak tok/s	16,794	14,899	9,600	8,786	3,848
PPL	5.87	90.21	24.76	14.75*	-
Cached TTFT	25.9 ms	35.3 ms	73.3 ms	63 ms	-
Peak tok/s/$	3,230	2,865	1,846	4,576	2,004

TPU: v6e-4, $5.20/hr, int8, max-ctx 2048 (measured 2026-04/05). GPU: H100 SXM, $1.92/hr, FP8; B=1 measured 2026-06-09 (commit 0d5f276a5), peak-batch and vLLM comparison measured 2026-04. *The GPU PPL row predates the 2026-06-15 rvllm-ppl harness fix (KV leak + cold-chunk) and is superseded; see the perplexity section.
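
The "Peak tok/s/$" row appears to be peak throughput divided by the hourly price quoted in the footnote, so its unit is tok/s per $/hr. That reading is an assumption rather than something the table states, but it reproduces every cell. A minimal check, with all figures copied from the table and footnote:

```python
# Hourly prices from the footnote above.
PRICE_PER_HOUR = {"tpu_v6e4": 5.20, "h100_sxm": 1.92}

# Peak tok/s per column of the "At a glance" table, with the hardware it ran on.
PEAK_TOK_S = {
    "E4B": (16_794, "tpu_v6e4"),
    "26B-A4B": (14_899, "tpu_v6e4"),
    "31B TPU": (9_600, "tpu_v6e4"),
    "31B GPU": (8_786, "h100_sxm"),
    "vLLM H100": (3_848, "h100_sxm"),
}


def peak_per_dollar(tok_s: float, price_per_hour: float) -> float:
    """Peak throughput per dollar of hourly spend (tok/s per $/hr)."""
    return tok_s / price_per_hour


for name, (tok_s, hw) in PEAK_TOK_S.items():
    value = peak_per_dollar(tok_s, PRICE_PER_HOUR[hw])
    print(f"{name:10s} {value:8,.0f}")
```

Because the metric divides by list price only, it changes with any price change and says nothing about utilization.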

TPU: Gemma 4 on v6e-4

Pure JAX + XLA. No custom kernels. XLA compiles the entire forward pass to TPU machine code from a ~500-line JAX script. Three models, one codebase.

Models supported

Property	E4B (4B)	26B-A4B (MoE)	31B
Total / active params	~4B / 4B	26B / ~4B	31B / 31B
Layers	42	30	60
Hidden size	2,560	2,816	5,376
Q / KV heads (sliding)	8 / 2	16 / 8	32 / 16
Q / KV heads (global)	8 / 2	16 / 2 (V=K)	32 / 4 (V=K)
Head dim (sliding / global)	256 / 512	256 / 512	256 / 512
Sliding window	512	1,024	1,024
MoE	none	128 experts, top-8	none
KV-shared layers	18 (of 42)	0	0
Per-layer input injection	256-d gated (5.6 GB embed)	none	none

Batch scaling (max-ctx 2048)

Batch	E4B tok/s	26B-A4B tok/s	31B tok/s	vLLM H100
1	78	53	44	66.9
8	542	390	318	515
64	3,661	2,662	2,112	2,794
128	6,298	4,915	3,853	3,848
256	10,214	8,192	6,246	3,709
512	13,773	12,390	8,550	3,788
768	15,514	14,899	9,600	3,671
1024	16,794	-	-	-

31B context scaling (B=1)

Context	ms/step	tok/s	Architecture	KV type
512	12.79	78.2	Single-scan, 60-layer scan + cond	bf16
2,048	22.6	44.2	Single-scan	bf16
32K	~66	~15	Single-scan	bf16
64K	~91	~11	Split-cache, 10 groups x 6	int8
128K	40.56	24.7	Split-cache + blockwise global	int8

Dual-path architecture auto-switches at the 32K boundary.

TPU deployment

# Create TPU v6e-4 ($5.20/hr)
gcloud compute tpus tpu-vm create rvllm-gemma4 \
  --zone=us-east5-b --accelerator-type=v6e-4 --version=v2-alpha-tpuv6e \
  --boot-disk-size=200

# Install (30 seconds)
pip3 install 'jax[tpu]' huggingface_hub tokenizers \
  -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Download model
huggingface-cli download google/gemma-4-E4B-it --local-dir ~/models/gemma-4-E4B-it

# Run E4B (78.3 tok/s B=1)
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-E4B-it --max-tokens 200 --max-ctx 2048

# Run 31B batched (9,600 tok/s B=768)
LIBTPU_INIT_ARGS="--xla_tpu_enable_async_collective_fusion=true \
  --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true \
  --xla_tpu_enable_async_collective_fusion_multiple_steps=true \
  --xla_tpu_overlap_compute_collective_tc=true \
  --xla_tpu_scoped_vmem_limit_kib=131072" \
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 2048 --batch 768

# 128K context (24.7 tok/s)
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 131072

# API server (OpenAI-compatible)
python3 tpu/harness/api_server.py --model-dir ~/models/gemma-4-31B-it --port 8080

# Perplexity
python3 tpu/harness/gemma4_tpu_infer.py \
  --model-dir ~/models/gemma-4-31B-it --perplexity --max-ctx 2048

No Docker. No conda. No torch. No vLLM. One pip install, one Python file, one command.

EAGLE-3 Speculative Decoding (TPU, experimental)

450M-param draft head proposes K=5 tokens per cycle; the full 31B verifies K+1=6 in one forward pass. Lossless for greedy decode.

Metric	Value
Baseline (B=1, 512 ctx)	78.2 tok/s, 12.79 ms/step
EAGLE-3 fused cycle	31.0 ms/cycle
Projected @ tau=3.5	~145 tok/s (1.8x)
Hardware ceiling	~300 tok/s (3.8x)

Requires 50K+ training examples for production tau. Current: 2K examples, loss 7.1, pipeline validated end-to-end. See tpu/harness/EAGLE3_SPEC.md.
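
One reading that reproduces the projected figure: each cycle emits the accepted draft tokens plus one token from the verifier, so throughput is (tau + 1) divided by the cycle time. Treating tau as "accepted drafts per cycle" is an assumption here, and `projected_tok_s` is a hypothetical helper, not part of the harness.

```python
BASELINE_TOK_S = 78.2   # B=1, 512 ctx, from the table above
CYCLE_MS = 31.0         # EAGLE-3 fused cycle, from the table above


def projected_tok_s(tau: float, cycle_ms: float = CYCLE_MS) -> float:
    """Tokens per second if each cycle yields tau accepted drafts + 1 verifier token.

    Assumption: tau counts accepted draft tokens per cycle.
    """
    return (tau + 1.0) / (cycle_ms / 1e3)


rate = projected_tok_s(3.5)
print(f"{rate:.1f} tok/s, {rate / BASELINE_TOK_S:.2f}x baseline")
```

At tau=3.5 this lands at about 145 tok/s, in line with the projected row.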

GPU: 31B Gemma 4 on H100

Rust + CUDA on H100 SXM 80GB. FP8 weights with per-channel scales, FP8 or F16 paged KV, FA3 SM90 attention for sliding layers + a split-KV FP8 decode kernel for the fallback/global path. All 60 layers captured in a single CUDA graph. 63.0 tok/s single-user decode (2026-06-09, commit 0d5f276a5), 83.9 tok/s with speculative decoding on real text at real context, 8,786 tok/s peak (B=512, 2026-04).

Single-user decode (batch=1): the memory-bound regime

A single user generates one token at a time (batch=1, M=1). This is the hardest case for a 31B model and is purely weight-bandwidth bound: every token reads all ~30 GB of FP8 e4m3 weights once, so on an H100 (HBM3 peak 3.35 TB/s) the floor is ~9 ms/token ≈ 104-109 tok/s for plain decode. Speculative decoding is the only way past that roofline (it amortizes the weight read over multiple verified tokens).
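
The roofline arithmetic can be sketched directly. This assumes each token streams every weight byte exactly once at peak HBM bandwidth and ignores KV and activation traffic. At exactly 30 GB it gives about 112 tok/s; the quoted 104-109 range corresponds to roughly 31-32 GB read per token, which is an inference from the numbers rather than a stated weight size.

```python
HBM_PEAK_BYTES_PER_S = 3.35e12  # H100 HBM3 peak, from the text


def roofline(weight_gb: float, bandwidth: float = HBM_PEAK_BYTES_PER_S):
    """Return (ms_per_token, tok_per_s) when one token reads all weights once."""
    seconds = weight_gb * 1e9 / bandwidth
    return seconds * 1e3, 1.0 / seconds


for gb in (30.0, 31.0, 32.0):
    ms, tok_s = roofline(gb)
    print(f"{gb:.0f} GB -> {ms:.2f} ms/token, {tok_s:.1f} tok/s")
```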

What actually makes B=1 fast (each item measured, June 2026):

1. Route small-M GEMMs through cuBLASLt (RVLLM_FP8_GEMM_LT_M1=1). The CUTLASS channelscale GemmUniversal that serves large-batch GEMMs runs M=1 at only ~51% of HBM in situ (69.7 µs avg per GEMM, node-level nsys); cuBLASLt runs the same shapes at ~80% (v3/M1_OUTCOME.md had this right). Rerouting M≤16 through cuBLASLt + a scale_cols pass + f32→f16 cast is +27% end-to-end (44.4 → 56.5 tok/s at B=1, 388.8 → 495.9 at B=8) even though it adds ~480 graph nodes per step. Two non-obvious rules survive: cuBLASLt OUTER_VEC channelscale fails the sm_90 heuristic (measured LaunchFailed) so the scale stays a separate pass, and do not hand-roll an FP8 GEMV -- measured to lose to cuBLASLt on every shape.
2. CUDA-graph the decode loop (default-on; RVLLM_DECODE_GRAPH=0 for eager). Eager re-issues ~122 sync HtoD copies per step and host-serializes the GPU: 17 → 44.9 tok/s (2.6×), token-hash identical. One correction to an earlier analysis that claimed ~12 ms/step of "inter-node dispatch gap": node-level tracing (nsys --cuda-graph-trace=node) shows the real gap is ~1.1 ms/step. The missing time was GEMM efficiency (point 1) and the attention kernel (point 3), not dispatch. Fusing dispatch (megakernel) was built anyway and measured 1.7× slower -- see v3/MEGAKERNEL_OUTCOME.md.
3. A paged-attention kernel that doesn't scale with context like a brick. The original FP8 decode kernel walked the KV window one token at a time (two __syncthreads() per token, one block per query head): 551 µs/call at a full 1024-token sliding window, which put long-context FP8 decode at 14 tok/s. The split-KV GQA-grouped rewrite (v3/kernels/fp8_decode_v2.cu) loads each KV chunk once for all query heads of its KV head, zero barriers in the token loop: 15.7 µs at the full window (35×), 40 µs at ctx 8192 (171×), parity ≤ 4.9e-4 vs the old kernel across 16 shape/scale variants. FP8-KV decode now beats the F16-KV path at every context length.

Measured B=1 generate (H100 SXM5, FP8 e4m3 weights, graphed, 2026-06-09/10):

path	tok/s	notes
FP8 KV, short ctx	63.0	production kernel set, `RVLLM_FP8_GEMM_LT_M1=1`
FP8 KV, ctx 1200-1500	61.4	the production serving regime (fast prefill forces FP8 KV)
F16 KV, short ctx	56.5	FA3 sliding decode
+ speculative decoding (K=4, real text, ctx 1200+)	83.9	n-gram drafter, see below
+ speculative decoding (K=4, repetitive text)	153.0	acceptance-dependent upper range
HBM weight roofline (plain decode)	~104-109	~30 GB FP8 ÷ 3.35 TB/s
historical: graphed baseline before June 2026	44.9	F16 KV; FP8-KV long-context was 14

Speculative decoding (GPU, shipped)

Draft up to K tokens by n-gram prompt lookup (zero extra model), verify [last, drafts...] in ONE forward at M=K+1 -- the 30 GB weight read amortizes over every accepted token. Greedy acceptance; every emitted token is a model argmax. The verify forward is graph-captured per chunk size (RVLLM_SPEC_GRAPH=0 to force eager). Gates: K=0 is bit-identical to plain decode (400-token hash equality, including past the sliding-window ring wrap); graphed and eager spec runs are bit-identical to each other; K>0 vs plain decode is quality-identical but not bit-identical -- batched verify GEMMs differ from M=1 by ulps and flip genuine near-tie argmaxes (measured example: "PPL (Peak)" vs "PPL (Cached)", both coherent). Enable: RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4. Measured accept rate on real prose: 0.42-0.56/draft, 2.4-3.1 tokens per verify cycle. K=4 is the sweet spot; K=6-8 lose acceptance. Next: Gemma4-E4B as a model drafter (same tokenizer family) to lift acceptance on novel text.
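
The draft-and-verify loop can be sketched in a few lines. This is an illustrative Python model of the idea, not the engine's Rust code: `ngram_draft`, `verify`, `generate`, the 3-token match length, and the `next_token` callable (standing in for the model's greedy argmax) are all hypothetical. A real engine scores all K+1 positions in one batched forward; the sketch calls the model position by position for clarity.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # hypothetical: greedy argmax of the model


def ngram_draft(tokens: Sequence[int], k: int, n: int = 3) -> List[int]:
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the last n tokens (prompt lookup)."""
    if k <= 0 or len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    for start in range(len(tokens) - n - 1, -1, -1):  # newest match first
        if list(tokens[start:start + n]) == suffix:
            return list(tokens[start + n:start + n + k])
    return []


def verify(next_token: NextToken, tokens: Sequence[int], drafts: Sequence[int]) -> List[int]:
    """Greedy acceptance: every emitted token is the model's own argmax.
    Stops at the first draft the model disagrees with."""
    ctx = list(tokens)
    emitted: List[int] = []
    for draft in drafts:
        tok = next_token(ctx)
        emitted.append(tok)
        ctx.append(tok)
        if tok != draft:          # mismatch: keep the model's token, drop the rest
            return emitted
    emitted.append(next_token(ctx))  # all drafts accepted: one extra token
    return emitted


def generate(next_token: NextToken, prompt: Sequence[int], max_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(verify(next_token, tokens, ngram_draft(tokens, k)))
    return tokens[:len(prompt) + max_new]


toy_model = lambda ctx: (ctx[-1] + 1) % 5   # toy deterministic "model"
print(generate(toy_model, [0, 1, 2], max_new=12, k=4))
```

In this sequential toy, K>0 output matches K=0 exactly because each verify step calls the same function as plain decode. The batched-GEMM rounding effect described above is what breaks that exact equality in the real engine.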

GPU batch scaling (fresh spread, measured 2026-06-10, commit `544b1309e`)

Decode-step bench (run_bench, FP8 weights, 40 iters/8 warmup), default configuration -- the engine now auto-routes small-M GEMMs through cuBLASLt up to the measured crossover (M≤64) and CUTLASS above it:

Batch	tok/s (default)	ms/step	vs CUTLASS-only	route
1	64.4	15.5	+32%	cuBLASLt
2	125.3	16.0	+29%	cuBLASLt
4	249.1	16.1	+27%	cuBLASLt
8	495.5	16.1	+27%	cuBLASLt
16	949.2	16.9	+21%	cuBLASLt
32	1,741	18.4	+16%	cuBLASLt
64	2,997	21.4	+4%	cuBLASLt
128	5,211	24.6	(CUTLASS wins +12%)	CUTLASS
256	7,607	33.7	(CUTLASS wins +20%)	CUTLASS

The crossover (RVLLM_FP8_GEMM_LT_MAX_M, default 64) was calibrated by running every batch size both ways. Historical April-2026 table (different harness settings, 100 iters): B=512 reached 8,786 tok/s; B≥64 April rows ran ~5-10% above the fresh 40-iter numbers -- treat cross-date deltas at B≥64 as methodology noise, not regression, until re-run at matched iters.
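
A minimal sketch of that calibration, using the "vs CUTLASS-only" column above. `calibrate_max_m` and `route` are hypothetical helpers, and the sketch assumes a single crossover (cuBLASLt wins below it, CUTLASS above it), which is what the table shows.

```python
# Gain of the cuBLASLt route relative to CUTLASS-only, in percent, per batch size.
# Negative entries are the rows where the table says CUTLASS wins; only the sign
# is used here, since the table may state those two percentages on a different basis.
LT_GAIN_PCT = {1: 32, 2: 29, 4: 27, 8: 27, 16: 21, 32: 16, 64: 4, 128: -12, 256: -20}


def calibrate_max_m(gain_pct: dict) -> int:
    """Largest measured batch size at which cuBLASLt is still faster."""
    winners = [m for m, gain in gain_pct.items() if gain > 0]
    return max(winners) if winners else 0


def route(m: int, max_m: int) -> str:
    """Pick the GEMM backend for a given M."""
    return "cuBLASLt" if m <= max_m else "CUTLASS"


max_m = calibrate_max_m(LT_GAIN_PCT)
print(max_m, [route(m, max_m) for m in (1, 64, 128)])
```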

Single-user / greedy generate (full pipeline incl. lm_head + sampling, same commit, same day):

config	decode tok/s	e2e tok/s
FP8 KV, short prompt	63.0	57.9
FP8 KV, 1200-token prompt (production shape)	61.4	11.8*
F16 KV, short prompt	56.6	44.7
spec-decode K=4, real text, 1200-token prompt	83.9	12.6*
spec-decode K=4, repetitive text, short	153.1	126.0

*Long-prompt e2e is dominated by per-token prefill (20.5 s TTFT for 1200 tokens on the bench; the open work item). Opt-in RVLLM_FP8_GEMM_LT_F16OUT=1 (cuBLASLt writes f16 directly, one in-place channel-scale kernel) adds another +1.4-2% deterministically but changes rounding (f16-before-scale); it stays opt-in until a unified ppl gate clears it.

rvLLM vs vLLM on H100

vLLM column measured 2026-04 (vLLM 0.19, FP8, CUDA graphs); rvLLM B=1 updated 2026-06 -- an apples-to-apples re-run against current vLLM is owed before claiming the B=1 row.

Batch	rvLLM tok/s	vLLM tok/s (2026-04)	Delta
1	63.0 (83.9 spec)	69	-9% (+22% with spec)
32	1,743	1,748	~0%
64	3,265	3,130	+4%
128	5,802	4,689	+24%
256	7,808	7,077	+10%
512	8,786	8,243	+7%

GPU: 12B Gemma 4 — the solidSF IDE "brain" (added 2026-06-15)

gemma-4-12B-it (RedHatAI FP8-Dynamic) runs on the same Rust+CUDA engine as the 31B and serves as the solidSF IDE "brain". 15 GB FP8 weights, ~17 GB resident (weights + 8192-block KV), so well over half of the 80 GB card stays free for a co-resident model. It is launched with the 31B's production recipe (RVLLM_FP8_GEMM_LT_M1, RVLLM_DECODE_GRAPH, RVLLM_BATCH_PREFILL, RVLLM_NUM_BLOCKS=8192, --max-inflight-requests 4), now baked into company/solidsf-ide/app/deploy_rvllm_brain.sh (it had previously been launched without them, at ~119 tok/s).

Decode-step batch sweep (graphed forward, FP8_GEMM_LT_M1, H100 SXM, 80 iters / 20 warmup):

B	1	8	32	64	128	256
tok/s	130	983	3,572	6,193	10,531	16,323

~2× the 31B at matched batch (half the weights). Single-user full generate (incl. lm_head + sampling, greedy): ~124 tok/s B=1 decode; TTFT ~220 ms for short prompts (per-token prefill, same open item as the 31B). Generation is fluent and correct — see the perplexity note below for why raw-text PPL reads high for this peaked FP8 instruct model.

GPU perplexity (rvllm-ppl harness fixed 2026-06-15; old absolute numbers superseded)

The old rvllm-ppl path had two bugs that made any multi-chunk or long-corpus run untrustworthy — they, not the model, produced the wild numbers people saw (running PPL spiking past 100,000, and OOM on long inputs):

KV-cache leak. Gemma4Bringup::run_ppl re-allocated the multi-GB paged KV region (plus all scratch) every chunk and never restored the arena, so it OOM'd by the third chunk. Fixed with arena.checkpoint() at entry + arena.restore() before return (the captured graph is dropped first).
Cold-chunk eval. The driver split the corpus into independent chunks and ran each from position 0 with no BOS and no prior context, so every chunk after the first scored mid-sentence tokens cold. Fixed with a proper sliding-window driver: a fresh BOS per window (Gemma is BOS-sensitive) and a score_from context-only prefix, so each target is scored once with real left context.
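
The sliding-window scheme in the second fix can be sketched as follows. This is an illustration of the windowing logic only, not the rvllm-ppl code: `windows`, `perplexity`, the `stride` parameter, and the `logprob(context, target)` callable are hypothetical names. Each window gets a fresh BOS, positions before `score_from` are context only, and every corpus token is scored exactly once.

```python
import math
from typing import Callable, Iterator, Sequence, Tuple

LogProb = Callable[[Sequence[int], int], float]  # hypothetical: log P(target | context)


def windows(n_tokens: int, max_len: int, stride: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, score_from). The window is tokens[begin:end];
    only positions in [score_from, end) are scored, the rest is left context."""
    assert 0 < stride <= max_len
    scored_to = 0
    while scored_to < n_tokens:
        end = min(scored_to + stride, n_tokens)
        begin = max(0, end - max_len)
        yield begin, end, scored_to
        scored_to = end


def perplexity(logprob: LogProb, tokens: Sequence[int], bos: int,
               max_len: int, stride: int) -> float:
    nll, count = 0.0, 0
    for begin, end, score_from in windows(len(tokens), max_len, stride):
        window = [bos] + list(tokens[begin:end])      # fresh BOS per window
        for pos in range(score_from, end):
            i = pos - begin + 1                        # target index inside `window`
            nll -= logprob(window[:i], window[i])      # real engines do one forward per window
            count += 1
    if count == 0:
        raise ValueError("empty corpus")
    return math.exp(nll / count)


uniform = lambda ctx, tok: math.log(0.25)   # toy model: 4 equally likely tokens
print(perplexity(uniform, list(range(10)), bos=-1, max_len=4, stride=2))
```

A uniform toy model over four tokens should score a perplexity of 4, which makes a convenient smoke test for the bookkeeping.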

Sanity check on the corrected harness: highly predictable repetitive text scores PPL 1.46, and perplexity decreases monotonically as window context accumulates — i.e. the forward, attention, paged KV, lm_head, and NLL math are all sound. (Localised by RVLLM_NO_GRAPH=1 giving the identical result, ruling out graph capture.)

What the corrected harness shows (gemma-4-12B-it-FP8-Dynamic, F16 KV, sliding window): the model is heavily peaked. It scores its own generated prose at ~35 PPL but raw WikiText-2 at ~840. It stays argmax-coherent (generation is correct and fluent), but its probability mass is miscalibrated on out-of-distribution text, so teacher-forced PPL on a raw external corpus is high. That is consistent with a heavily instruct- + multimodal-tuned FP8 model; it is not the old harness bug. A <20 figure is not reproducible on raw WikiText for this model — it traces to a different corpus/format than this row.

Old (pre-fix, mixed-path) numbers — superseded, kept for history only:

Weight path	KV cache	PPL (old harness)	tok/s (B=1, at the time)
FP8-Dynamic + CUTLASS channelscale epilogue	F16	14.75	53
BF16 split QKV per-tensor FP8	F16	17.96	37.9
F16 weights (no FP8)	F16	19.79	37.9
HuggingFace BF16 reference	--	19.62	--

The FP8 row measuring better than the BF16 reference (14.75 < 19.62) was the tell these were never apples-to-apples (softcap applied on the ppl path only; the reference ran through a different harness). With the harness now fixed, a unified clean-corpus run is the way to a trustworthy absolute number; only the relative quantization ordering from the old table is worth keeping.

Gemma 4 forward pass (14 launches per layer)

For each layer in 0..60:
  1.  fused_rmsnorm_fp8_quant           input layernorm + FP8 quantize
  2.  cutlass_fp8_gemm_channelscale     fused Q||K||V + channelscale epilogue
  3.  fused_qkv_rmsnorm                 Q/K norm (learned) + V norm (parameter-free)
  4.  fused_rope_partial_f16kv          partial RoPE + F16 KV cache write
  5.  paged_decode (FA3 SM90)           attention (head_dim=256 sliding, 512 global)
  6.  quantize_fp8_per_token            attn output to FP8
  7.  fp8_gemm                          O projection
  8.  fused_norm_add_residual           channelscale + rmsnorm + residual add
  9.  fused_rmsnorm_fp8_quant           pre-FFN layernorm + FP8 quantize
  10. cutlass_fp8_gemm_channelscale     fused gate||up + channelscale epilogue
  11. fused_gelu_mul_fp8_quant          GELU(tanh)(gate) * up to FP8
  12. fp8_gemm                          down projection
  13. fused_norm_add_residual           channelscale + rmsnorm + residual + layer_scalar

Sampling tail:
  fused_rmsnorm                       final layernorm
  f16_gemm_f32                        lm_head
  logit_softcap                       30 * tanh(logits / 30)
  argmax_kernel                       token selection
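
The last two steps of the sampling tail are simple enough to write out. The sketch below is plain Python for illustration, not the CUDA kernels; the function names are hypothetical. Since tanh is monotonic, the cap bounds logits to (-30, 30) without changing the greedy argmax, except where very large logits saturate to the same floating-point value.

```python
import math
from typing import List, Sequence


def logit_softcap(logits: Sequence[float], cap: float = 30.0) -> List[float]:
    """cap * tanh(logits / cap), applied elementwise."""
    return [cap * math.tanh(x / cap) for x in logits]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-100.0, 0.5, 12.0, 250.0]
capped = logit_softcap(logits)
print(capped, argmax(capped))
```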

Kernel fusion summary

Four rounds of fusion + custom CUTLASS epilogue reduced graph nodes from 1776 to ~935 (47% reduction):

Fusion	Kernels eliminated	Nodes saved
f32_to_bf16 + rmsnorm + vector_add -> fused_norm_add_residual	3 -> 1 (x2/layer)	240
scale_cols_f32 fused into norm+add kernel (O-proj, down)	1 -> 0 (x2/layer)	120
residual_scale_f16 fused into post-ff norm+add	1 -> 0 (x1/layer)	60
vnorm_f16 fused into qk_rmsnorm -> fused_qkv_rmsnorm	2 -> 1 (x1/layer)	60
CUTLASS channelscale epilogue (QKV, gate_up)	3 -> 1 (x2/layer)	240+

The CUTLASS channelscale kernel uses a custom SM90 EVT epilogue that applies per-token activation scale (ColBroadcast) and per-channel weight scale (RowBroadcast) directly in the GEMM epilogue while the accumulator is still F32, then casts to F16. At small M this kernel is no longer on the fast path: RVLLM_FP8_GEMM_LT_M1=1 reroutes small-M GEMMs through cuBLASLt (+27% measured at M≤16; the default crossover, RVLLM_FP8_GEMM_LT_MAX_M, is now 64), so the smaller-tile CUTLASS variant that used to be "help wanted" here is obsolete.
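
The epilogue math can be written as a reference in a few lines. This is a pure-Python sketch under stated assumptions, not the CUTLASS kernel: Python floats stand in for the F32 accumulator, the quantized operands are passed as plain floats, and `gemm_channelscale` and `to_f16` are hypothetical names.

```python
import struct
from typing import List, Sequence


def to_f16(x: float) -> float:
    """Round to IEEE half precision and back (raises OverflowError above the f16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def gemm_channelscale(a_q: Sequence[Sequence[float]], w_q: Sequence[Sequence[float]],
                      a_scale: Sequence[float], w_scale: Sequence[float]) -> List[List[float]]:
    """a_q: M x K activations, w_q: N x K weights (both already quantized),
    a_scale: one scale per token (length M), w_scale: one per output channel (length N)."""
    out = []
    for m, a_row in enumerate(a_q):
        row = []
        for n, w_row in enumerate(w_q):
            acc = sum(a * w for a, w in zip(a_row, w_row))      # wide accumulator
            row.append(to_f16(acc * a_scale[m] * w_scale[n]))   # scale first, cast last
        out.append(row)
    return out


print(gemm_channelscale([[1.0, 2.0]], [[3.0, 4.0], [0.5, 0.5]], [0.5], [2.0, 4.0]))
```

Applying both scales before the F16 cast is the ordering the paragraph describes; casting first and scaling afterwards would round differently.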

Help wanted (current, real -- updated 2026-06-11):

Close out the two-source prefill wiring (#58): the kernel is parity-proven bit-identical and measures TTFT 20.5 s -> 0.67 s at 1200-token prompts behind RVLLM_PREFILL_TWO_SOURCE=1, but an engine wiring divergence keeps it opt-in. The issue has the full evidence chain and a precise repro.
Serve-session spec-decode optimization: spec behind the API measures 39.7 tok/s where the bench-side machinery does 83.9 -- the graphed verify isn't fully exploited in the session loop yet.
Model drafter for spec-decode (Gemma4-E4B), replacing n-gram lookup for novel text.

Shipped from the previous help-wanted list (2026-06-11): cross-request graph persistence (served long-prompt decode 27.4 -> 59.6 tok/s) and the full API sampling contract (temperature/top_p/top_k/seed/stop -- the endpoint was greedy-only before).

GPU build and run

# One-time on H100 box (~15 min)
bash kernels/build.sh               # fused PTX
bash kernels/build_cutlass_so.sh    # libcutlass_kernels.so
bash kernels/build_fa3.sh           # libfa3_kernels.so (real FA3 -- needs flash-attention checkout;
                                    # includes the hdim256 combine instantiation upstream lacks)
bash kernels/build_fa_sm89_so.sh    # libfa_sm89_kernels.so (split-KV FP8 decode + global hd512)

# Build
cargo build --release --features cuda --manifest-path v3/Cargo.toml -p rvllm-bench

# Run
RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it \
RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90 \
RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so \
RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so \
RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json \
RVLLM_BATCH=128 RVLLM_ITERS=30 RVLLM_WARMUP=5 \
  ./v3/target/release/rvllm-bench

OpenAI-compatible Gemma 4 server

The server is a Rust-only Gemma 4 path with an OpenAI-compatible HTTP surface. It keeps CUDA execution on a single engine owner thread and accepts requests through /v1/chat/completions.

For the solidSF agents production shape, including 256K context, four-seat admission, the paid-plan busy response, CAD harness prompting, systemd service shape, and verification scripts, see docs/solidsf-agent-serving.md.

export CUDA_ARCH=sm_90
export RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it
export RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90
export RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so
export RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so
export RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json
export RVLLM_SERVED_MODEL_NAME=gemma4-31b
export RUST_LOG=info

bash kernels/build.sh sm_90
bash kernels/build_cutlass_so.sh sm_90
bash kernels/build_fa3.sh
cargo build --release --features cuda,cublaslt --manifest-path v3/Cargo.toml -p rvllm-serve

./v3/target/release/rvllm-server \
  --host 127.0.0.1 \
  --port 8080 \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --max-prefill-chunk 128

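The server exposes GET /health, GET /v1/models, and POST /v1/chat/completions with non-stream and SSE streaming responses. The smoke requests below use greedy decoding (temperature: 0); per the help-wanted notes, the endpoint was greedy-only until the full sampling contract (temperature/top_p/top_k/seed/stop) shipped on 2026-06-11.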

Smoke:

curl -fsS http://127.0.0.1:8080/health
curl -fsS http://127.0.0.1:8080/v1/models
curl -fsS http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma4-31b","messages":[{"role":"user","content":"Reply exactly: RVLLM_RUST_OK"}],"max_tokens":16,"temperature":0}'
curl -fsS --no-buffer http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gemma4-31b","messages":[{"role":"user","content":"hi"}],"max_tokens":16,"temperature":0,"stream":true}'
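
The same non-streaming smoke request can be issued from Python with only the standard library. This is a sketch: `chat_request` and `chat` are hypothetical helpers, and the response shape (`choices[0].message.content`) is assumed from the OpenAI chat-completions convention the server says it follows, not verified against this server.

```python
import json
import urllib.request


def chat_request(base_url: str, model: str, prompt: str,
                 max_tokens: int = 16) -> urllib.request.Request:
    """Build the POST /v1/chat/completions request used in the curl smoke test."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy, matching the curl examples
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the reply text (assumed OpenAI response shape)."""
    request = chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `chat("http://127.0.0.1:8080", "gemma4-31b", "Reply exactly: RVLLM_RUST_OK")` would mirror the first chat curl call.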

For bind-only local checks without CUDA:

RVLLM_DRY_RUN=1 cargo run --manifest-path v3/Cargo.toml -p rvllm-serve -- \
  --host 127.0.0.1 \
  --port 8080

Kernels

Every kernel has a known purpose, a pinned variant, and a workspace contract. No dispatch fallback chains.

Kernel	Purpose
`cutlass_fp8_gemm_channelscale`	SM90 FP8 GEMM with EVT channelscale epilogue (QKV, gate_up)
`fused_rmsnorm_fp8_quant`	layernorm + FP8 quantize in one launch
`fused_qkv_rmsnorm`	per-head RMSNorm on Q, K (learned) and V (parameter-free)
`fused_rope_partial_f16kv`	partial RoPE + F16 KV cache write
`fused_gelu_mul_fp8_quant`	GELU(tanh)(gate) * up to FP8
`fused_norm_add_residual`	channelscale + RMSNorm + residual add (+ optional layer_scalar)
`logit_softcap`	30 * tanh(logits / 30)
`quantize_fp8_per_token`	activation to FP8 with per-token scale
`argmax`	f32 logits to i32 token
`fp8_decode_v2` (`fa_sm89_*` .so)	split-KV GQA-grouped FP8 paged decode + LSE combine (35-171x over the serial kernel)

No fallbacks. Missing kernel .so = engine refuses to start. One earned scar: the loader probes the FA3 .so for fa3_sm90_* symbols and quietly selects the Ada-generation kernel set if they're absent -- which is how a production H100 served sm_89 attention for four days when a fallback .so got copied over the FA3 filename. Verify with nm -D $RVLLM_FA3_SO | grep fa3_sm90 after any kernel deploy; the engine now also refuses this combination on sm_90 unless explicitly overridden.

v3 crate map

v3/crates/
  rvllm-core         typed errors, IDs, dtype, shape, config, env
  rvllm-mem          HbmArena, Region, Stream, Event, PinnedBuf, CudaContextHandle
  rvllm-kernels      manifest (sha-pinned), PTX loader, kernel catalog
  rvllm-fused        8 fused-kernel launchers + pure-Rust f32 references
  rvllm-attention    FA3 SM90 paged decode/prefill dlopen
  rvllm-cutlass      FP8 variant catalog + schedule pairing trait + cuBLASLt wrapper
  rvllm-metadata     frozen-layout metadata per bucket (one upload path)
  rvllm-loader       safetensors mmap -> HBM + CPU-path FP8 quant + clamp gate
  rvllm-sampling     argmax tail, pinned DtoH
  rvllm-graph        captured-graph pool keyed on MetaLayoutHash
  rvllm-runtime      Engine, scheduler, layer_exec, bring_up
  rvllm-bench        RVLLM_* env-driven bench binary
  rvllm-invariants   DAG-dep test, no-megakernel gate

Correctness discipline

No fallbacks. Missing autotune entry = engine panic. Missing .so = refuse start. No silent degradation.
Graph-capture invariant. Metadata buffer layout frozen per (bucket, max_blocks_per_seq). Captured graphs bind exact offsets.
CUTLASS schedule/epilogue pairing. Mainloop and epilogue schedules must match. Enforced via static_assert.
No unwrap() in libraries. Result<T, RvllmError> end-to-end with structured context.
Real block-change detection. Scheduler emits block table updates; missing signals = stale KV reads caught at the type level.

License

Apache-2.0.

Updates

2026-06-09/10 -- H100 single-user maximization + production deploy (commits `f70b4bf..0d5f276`, PR #54)

Full record: v3/H100_MAXPERF_PLAN.md. Summary:

Production-regime decode 14 -> 61.4 tok/s (4.4x); short-ctx FP8 31.2 -> 63.0; spec-decode 83.9 on real text at ctx 1200+, 153 on repetitive text. Deployed to production 2026-06-10 (API-measured 3.38x short / 1.78x long).
Corrected the published B=1 narrative: the "~12 ms inter-node dispatch gap" analysis was a graph-level-trace artifact -- node-level nsys shows ~1.1 ms gap; the real costs were the CUTLASS channelscale GEMM at ~51% HBM at M=1 (fixed: cuBLASLt routing, +27%) and the serial-token FP8 attention kernel (fixed: split-KV rewrite, 35-171x).
Found and fixed a production incident: the deployed libfa3_kernels.so was the Ada fallback under the FA3 filename (silent symbol-probe fallback) -- H100 served sm_89 attention 2026-06-06..10. The engine now refuses that combination on sm_90; kernels/build_fa3.sh gained the missing hdim256 combine instantiation so real FA3 is rebuildable from upstream.
Speculative decoding shipped for GPU (n-gram drafter, graphed verify, lossless gates): RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4.
Known weak spots, in the open: per-token prefill TTFT at >1024-token prompts (sound chunked prefill is the fix, design proven in the verify chunk), serve-side graph lifecycle at long context (27.4 API vs 61.4 bench), and the GPU PPL table needs a unified re-eval.
Next: TPU revisit (v6e access incoming), E4B model drafter, spec-decode in rvllm-serve.