LLM inference engine. Rust+CUDA on GPU, JAX+XLA on TPU.
Three Gemma 4 models on TPU v6e-4: E4B (16,794 tok/s peak, 78.3 tok/s B=1, PPL 5.87), 26B-A4B MoE (14,899 tok/s peak), 31B (9,600 tok/s peak, 128K context). GPU: 31B on H100 at 63 tok/s single-user decode (84 with speculative decoding on real text), 8,786 tok/s peak batch (FP8, CUDA graph). Zero custom kernels on TPU -- ~500 lines of JAX. Native Rust binary on GPU -- zero Python in the serving path.
Full benchmarks | June 2026 H100 session record
At a glance
| | E4B (4B) | 26B-A4B (MoE) | 31B TPU | 31B GPU | vLLM H100 |
|---|---|---|---|---|---|
| B=1 tok/s | 78.3 | 52.9 | 44.2 | 63.0 (83.9 spec) | 66.9 |
| Peak tok/s | 16,794 | 14,899 | 9,600 | 8,786 | 3,848 |
| PPL | 5.87 | 90.21 | 24.76 | 14.75* | - |
| Cached TTFT | 25.9 ms | 35.3 ms | 73.3 ms | 63 ms | - |
| Peak tok/s/$ | 3,230 | 2,865 | 1,846 | 4,576 | 2,004 |
TPU: v6e-4, $5.20/hr, int8, max-ctx 2048 (measured 2026-04/05). GPU: H100 SXM, $1.92/hr, FP8; B=1 measured 2026-06-09 (commit 0d5f276a5), peak-batch and vLLM comparison measured 2026-04. *The GPU PPL row predates the 2026-06-15 rvllm-ppl harness fix (KV leak + cold-chunk) and is superseded; see the perplexity section.
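The Peak tok/s/$ row is peak throughput divided by the hourly instance price quoted above, so its unit is tok/s per $/hr. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of the repo.

```python
def tok_s_per_dollar_hour(peak_tok_s: float, price_per_hour: float) -> float:
    """Peak throughput normalized by hourly price (tok/s per $/hr)."""
    return peak_tok_s / price_per_hour


# (peak tok/s, $/hr) taken from the table and the pricing note above.
rows = {
    "E4B, TPU v6e-4": (16_794, 5.20),
    "26B-A4B, TPU v6e-4": (14_899, 5.20),
    "31B, TPU v6e-4": (9_600, 5.20),
    "31B, H100": (8_786, 1.92),
    "vLLM, H100": (3_848, 1.92),
}

for name, (peak, price) in rows.items():
    print(f"{name}: {tok_s_per_dollar_hour(peak, price):,.0f} tok/s per $/hr")
```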
TPU: Gemma 4 on v6e-4
Pure JAX + XLA. No custom kernels. XLA compiles the entire forward pass to TPU machine code from a ~500 line JAX script. Three models, one codebase.
Models supported
| Property | E4B (4B) | 26B-A4B (MoE) | 31B |
|---|---|---|---|
| Total / active params | ~4B / 4B | 26B / ~4B | 31B / 31B |
| Layers | 42 | 30 | 60 |
| Hidden size | 2,560 | 2,816 | 5,376 |
| Q / KV heads (sliding) | 8 / 2 | 16 / 8 | 32 / 16 |
| Q / KV heads (global) | 8 / 2 | 16 / 2 (V=K) | 32 / 4 (V=K) |
| Head dim (sliding / global) | 256 / 512 | 256 / 512 | 256 / 512 |
| Sliding window | 512 | 1,024 | 1,024 |
| MoE | none | 128 experts, top-8 | none |
| KV-shared layers | 18 (of 42) | 0 | 0 |
| Per-layer input injection | 256-d gated (5.6 GB embed) | none | none |
Batch scaling (max-ctx 2048)
| Batch | E4B tok/s | 26B-A4B tok/s | 31B tok/s | vLLM H100 |
|---|---|---|---|---|
| 1 | 78 | 53 | 44 | 66.9 |
| 8 | 542 | 390 | 318 | 515 |
| 64 | 3,661 | 2,662 | 2,112 | 2,794 |
| 128 | 6,298 | 4,915 | 3,853 | 3,848 |
| 256 | 10,214 | 8,192 | 6,246 | 3,709 |
| 512 | 13,773 | 12,390 | 8,550 | 3,788 |
| 768 | 15,514 | 14,899 | 9,600 | 3,671 |
| 1024 | 16,794 | - | - | - |
31B context scaling (B=1)
| Context | ms/step | tok/s | Architecture | KV type |
|---|---|---|---|---|
| 512 | 12.79 | 78.2 | Single-scan, 60-layer scan + cond | bf16 |
| 2,048 | 22.6 | 44.2 | Single-scan | bf16 |
| 32K | ~66 | ~15 | Single-scan | bf16 |
| 64K | ~91 | ~11 | Split-cache, 10 groups x 6 | int8 |
| 128K | 40.56 | 24.7 | Split-cache + blockwise global | int8 |
Dual-path architecture auto-switches at the 32K boundary.
TPU deployment

    # Create TPU v6e-4 ($5.20/hr)
    gcloud compute tpus tpu-vm create rvllm-gemma4 \
      --zone=us-east5-b --accelerator-type=v6e-4 --version=v2-alpha-tpuv6e \
      --boot-disk-size=200

    # Install (30 seconds)
    pip3 install 'jax[tpu]' huggingface_hub tokenizers \
      -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

    # Download model
    huggingface-cli download google/gemma-4-E4B-it --local-dir ~/models/gemma-4-E4B-it

    # Run E4B (78.3 tok/s B=1)
    python3 tpu/harness/gemma4_tpu_infer.py \
      --model-dir ~/models/gemma-4-E4B-it --max-tokens 200 --max-ctx 2048

    # Run 31B batched (9,600 tok/s B=768)
    LIBTPU_INIT_ARGS="--xla_tpu_enable_async_collective_fusion=true \
      --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true \
      --xla_tpu_enable_async_collective_fusion_multiple_steps=true \
      --xla_tpu_overlap_compute_collective_tc=true \
      --xla_tpu_scoped_vmem_limit_kib=131072" \
    python3 tpu/harness/gemma4_tpu_infer.py \
      --model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 2048 --batch 768

    # 128K context (24.7 tok/s)
    python3 tpu/harness/gemma4_tpu_infer.py \
      --model-dir ~/models/gemma-4-31B-it --fused --max-tokens 200 --max-ctx 131072

    # API server (OpenAI-compatible)
    python3 tpu/harness/api_server.py --model-dir ~/models/gemma-4-31B-it --port 8080

    # Perplexity
    python3 tpu/harness/gemma4_tpu_infer.py \
      --model-dir ~/models/gemma-4-31B-it --perplexity --max-ctx 2048

No Docker. No conda. No torch. No vLLM. One pip install, one Python file, one command.
EAGLE-3 Speculative Decoding (TPU, experimental)
450M-param draft head proposes K=5 tokens per cycle; the full 31B verifies K+1=6 in one forward pass. Lossless for greedy decode.
| Metric | Value |
|---|---|
| Baseline (B=1, 512 ctx) | 78.2 tok/s, 12.79 ms/step |
| EAGLE-3 fused cycle | 31.0 ms/cycle |
| Projected @ tau=3.5 | ~145 tok/s (1.8x) |
| Hardware ceiling | ~300 tok/s (3.8x) |
Requires 50K+ training examples for production tau. Current: 2K examples, loss 7.1, pipeline validated end-to-end. See tpu/harness/EAGLE3_SPEC.md.
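The projected row can be reproduced from the measured cycle time. This sketch assumes tau counts accepted draft tokens per cycle and that each cycle also emits one token from the verifier; that reading is inferred from the numbers in the table, not stated in the spec, and the function name is illustrative.

```python
def projected_tok_s(tau: float, cycle_ms: float) -> float:
    """Tokens per second if each cycle yields tau accepted drafts plus one verifier token."""
    return (tau + 1.0) / (cycle_ms / 1000.0)


baseline = 1000.0 / 12.79            # measured baseline: 12.79 ms/step
projected = projected_tok_s(3.5, 31.0)  # measured fused cycle: 31.0 ms

print(f"baseline  {baseline:.1f} tok/s")
print(f"projected {projected:.0f} tok/s ({projected / baseline:.2f}x)")
```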
GPU: 31B Gemma 4 on H100
Rust + CUDA on H100 SXM 80GB. FP8 weights with per-channel scales, FP8 or F16 paged KV, FA3 SM90 attention for sliding layers + a split-KV FP8 decode kernel for the fallback/global path. All 60 layers captured in a single CUDA graph. 63.0 tok/s single-user decode (2026-06-09, commit 0d5f276a5), 83.9 tok/s with speculative decoding on real text at real context, 8,786 tok/s peak (B=512, 2026-04).
Single-user decode (batch=1): the memory-bound regime
A single user generates one token at a time (batch=1, M=1). This is the hardest case for a 31B model and is purely weight-bandwidth bound: every token reads all ~30 GB of FP8 e4m3 weights once, so on an H100 (HBM3 peak 3.35 TB/s) the latency floor is ~9 ms/token, which caps plain decode at roughly 104-109 tok/s. Speculative decoding is the only way past that roofline (it amortizes the weight read over multiple verified tokens).
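The roofline arithmetic is short enough to check by hand. The sketch below assumes decode time is exactly bytes read divided by peak HBM bandwidth and ignores KV-cache and activation traffic; the function name is illustrative. Under that model the quoted 104-109 tok/s range corresponds to roughly 31-32 GB read per token, while exactly 30 GB would give about 112 tok/s.

```python
def decode_roofline(bytes_per_token: float, hbm_bytes_per_s: float):
    """Return (ms per token, tokens per second) for a purely bandwidth-bound decode step."""
    ms = bytes_per_token / hbm_bytes_per_s * 1e3
    return ms, 1e3 / ms


H100_HBM3_PEAK = 3.35e12  # bytes/s, the peak figure quoted above

for gb in (30.0, 31.0, 32.0):
    ms, tok_s = decode_roofline(gb * 1e9, H100_HBM3_PEAK)
    print(f"{gb:.0f} GB/token -> {ms:.2f} ms/token, {tok_s:.1f} tok/s")
```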
What actually makes B=1 fast (each item measured, June 2026):
1. Route small-M GEMMs through cuBLASLt (RVLLM_FP8_GEMM_LT_M1=1). The CUTLASS channelscale GemmUniversal that serves large-batch GEMMs runs M=1 at only ~51% of HBM in situ (69.7 µs avg per GEMM, node-level nsys); cuBLASLt runs the same shapes at ~80% (v3/M1_OUTCOME.md had this right). Rerouting M≤16 through cuBLASLt + a scale_cols pass + f32→f16 cast is +27% end-to-end (44.4 → 56.5 tok/s at B=1, 388.8 → 495.9 at B=8) even though it adds ~480 graph nodes per step. Two non-obvious rules survive: cuBLASLt OUTER_VEC channelscale fails the sm_90 heuristic (measured LaunchFailed) so the scale stays a separate pass, and do not hand-roll an FP8 GEMV -- measured to lose to cuBLASLt on every shape.
2. CUDA-graph the decode loop (default-on; RVLLM_DECODE_GRAPH=0 for eager). Eager re-issues ~122 sync HtoD copies per step and host-serializes the GPU: 17 → 44.9 tok/s (2.6×), token-hash identical. One correction to an earlier analysis that claimed ~12 ms/step of "inter-node dispatch gap": node-level tracing (nsys --cuda-graph-trace=node) shows the real gap is ~1.1 ms/step. The missing time was GEMM efficiency (point 1) and the attention kernel (point 3), not dispatch. Fusing dispatch (megakernel) was built anyway and measured 1.7× slower -- see v3/MEGAKERNEL_OUTCOME.md.
3. A paged-attention kernel that doesn't scale with context like a brick. The original FP8 decode kernel walked the KV window one token at a time (two __syncthreads() per token, one block per query head): 551 µs/call at a full 1024-token sliding window, which put long-context FP8 decode at 14 tok/s. The split-KV GQA-grouped rewrite (v3/kernels/fp8_decode_v2.cu) loads each KV chunk once for all query heads of its KV head, zero barriers in the token loop: 15.7 µs at the full window (35×), 40 µs at ctx 8192 (171×), parity ≤ 4.9e-4 vs the old kernel across 16 shape/scale variants. FP8-KV decode now beats the F16-KV path at every context length.
Measured B=1 generate (H100 SXM5, FP8 e4m3 weights, graphed, 2026-06-09/10):
| path | tok/s | notes |
|---|---|---|
| FP8 KV, short ctx | 63.0 | production kernel set, RVLLM_FP8_GEMM_LT_M1=1 |
| FP8 KV, ctx 1200-1500 | 61.4 | the production serving regime (fast prefill forces FP8 KV) |
| F16 KV, short ctx | 56.5 | FA3 sliding decode |
| + speculative decoding (K=4, real text, ctx 1200+) | 83.9 | n-gram drafter, see below |
| + speculative decoding (K=4, repetitive text) | 153.0 | acceptance-dependent upper range |
| HBM weight roofline (plain decode) | ~104-109 | ~30 GB FP8 ÷ 3.35 TB/s |
| historical: graphed baseline before June 2026 | 44.9 | F16 KV; FP8-KV long-context was 14 |
Speculative decoding (GPU, shipped)
Draft up to K tokens by n-gram prompt lookup (zero extra model), verify [last, drafts...] in ONE forward at M=K+1 -- the 30 GB weight read amortizes over every accepted token. Greedy acceptance; every emitted token is a model argmax. The verify forward is graph-captured per chunk size (RVLLM_SPEC_GRAPH=0 to force eager). Gates: K=0 is bit-identical to plain decode (400-token hash equality, including past the sliding-window ring wrap); graphed and eager spec runs are bit-identical to each other; K>0 vs plain decode is quality-identical but not bit-identical -- batched verify GEMMs differ from M=1 by ulps and flip genuine near-tie argmaxes (measured example: "PPL (Peak)" vs "PPL (Cached)", both coherent). Enable: RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4. Measured accept rate on real prose: 0.42-0.56/draft, 2.4-3.1 tokens per verify cycle. K=4 is the sweet spot; K=6-8 lose acceptance. Next: Gemma4-E4B as a model drafter (same tokenizer family) to lift acceptance on novel text.
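The draft-then-verify loop described above can be sketched in a few lines. This is an illustration of n-gram prompt lookup with greedy acceptance, not the engine's Rust implementation: the function names, the match length n=2, and the toy model are all assumptions, and the per-position calls to next_token stand in for what the engine does in one forward pass at M=K+1.

```python
from typing import Callable, List, Sequence


def ngram_draft(tokens: Sequence[int], k: int, n: int = 2) -> List[int]:
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the last n tokens."""
    if k <= 0 or len(tokens) <= n:
        return []
    tail = list(tokens[-n:])
    # Scan right-to-left over earlier positions, skipping the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == tail:
            return list(tokens[start + n:start + n + k])
    return []


def verify_greedy(next_token: Callable[[Sequence[int]], int],
                  context: Sequence[int],
                  drafts: Sequence[int]) -> List[int]:
    """Return the tokens to emit; every emitted token is a model argmax."""
    # preds[i] is the model's token after context + drafts[:i].
    preds = [next_token(list(context) + list(drafts[:i]))
             for i in range(len(drafts) + 1)]
    emitted: List[int] = []
    for i, draft in enumerate(drafts):
        if preds[i] != draft:
            emitted.append(preds[i])  # first mismatch: take the model's token and stop
            break
        emitted.append(draft)         # draft equals the model argmax: accept
    else:
        emitted.append(preds[len(drafts)])  # all drafts accepted: one bonus token
    return emitted


def toy_next_token(prefix: Sequence[int]) -> int:
    """Toy stand-in for the model: counts upward modulo 5."""
    return (prefix[-1] + 1) % 5


context = [0, 1, 2, 3, 4, 0, 1]
drafts = ngram_draft(context, k=4)
print("drafts :", drafts)
print("emitted:", verify_greedy(toy_next_token, context, drafts))
```

With no drafts the loop body never runs and the function emits exactly one model token, which mirrors the K=0 gate: speculative decoding with nothing to verify reduces to plain decode.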
GPU batch scaling (fresh spread, measured 2026-06-10, commit 544b1309e)
Decode-step bench (run_bench, FP8 weights, 40 iters/8 warmup), default
configuration -- the engine now auto-routes small-M GEMMs through cuBLASLt
up to the measured crossover (M≤64) and CUTLASS above it:
| Batch | tok/s (default) | ms/step | vs CUTLASS-only | route |
|---|---|---|---|---|
| 1 | 64.4 | 15.5 | +32% | cuBLASLt |
| 2 | 125.3 | 16.0 | +29% | cuBLASLt |
| 4 | 249.1 | 16.1 | +27% | cuBLASLt |
| 8 | 495.5 | 16.1 | +27% | cuBLASLt |
| 16 | 949.2 | 16.9 | +21% | cuBLASLt |
| 32 | 1,741 | 18.4 | +16% | cuBLASLt |
| 64 | 2,997 | 21.4 | +4% | cuBLASLt |
| 128 | 5,211 | 24.6 | (CUTLASS wins +12%) | CUTLASS |
| 256 | 7,607 | 33.7 | (CUTLASS wins +20%) | CUTLASS |
The crossover (RVLLM_FP8_GEMM_LT_MAX_M, default 64) was calibrated by
running every batch size both ways. Historical April-2026 table (different
harness settings, 100 iters): B=512 reached 8,786 tok/s; B≥64 April rows
ran ~5-10% above the fresh 40-iter numbers -- treat cross-date deltas at
B≥64 as methodology noise, not regression, until re-run at matched iters.
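The routing rule in the table reduces to a single threshold on M. The sketch below only illustrates that rule; the real selection lives in the Rust engine, and the function name is hypothetical. The environment variable and its default of 64 are the ones named above.

```python
import os
from typing import Optional


def fp8_gemm_route(m: int, max_m: Optional[int] = None) -> str:
    """Pick the GEMM backend for a batch of m rows: cuBLASLt up to the crossover, CUTLASS above."""
    if max_m is None:
        # Default crossover of 64, overridable through the environment.
        max_m = int(os.environ.get("RVLLM_FP8_GEMM_LT_MAX_M", "64"))
    return "cuBLASLt" if m <= max_m else "CUTLASS"


for batch in (1, 16, 64, 128, 256):
    print(batch, fp8_gemm_route(batch, max_m=64))
```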
Single-user / greedy generate (full pipeline incl. lm_head + sampling, same commit, same day):
| config | decode tok/s | e2e tok/s |
|---|---|---|
| FP8 KV, short prompt | 63.0 | 57.9 |
| FP8 KV, 1200-token prompt (production shape) | 61.4 | 11.8* |
| F16 KV, short prompt | 56.6 | 44.7 |
| spec-decode K=4, real text, 1200-token prompt | 83.9 | 12.6* |
| spec-decode K=4, repetitive text, short | 153.1 | 126.0 |
*Long-prompt e2e is dominated by per-token prefill (20.5 s TTFT for 1200
tokens on the bench; the open work item). Opt-in RVLLM_FP8_GEMM_LT_F16OUT=1
(cuBLASLt writes f16 directly, one in-place channel-scale kernel) adds
another +1.4-2% deterministically but changes rounding (f16-before-scale);
it stays opt-in until a unified ppl gate clears it.
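A rough model shows why the starred e2e numbers sit so far below the decode rate. It assumes e2e tok/s is generated tokens divided by TTFT plus decode time, and it assumes 300 generated tokens; the generation length is not stated above, so that value is hypothetical and chosen only because it lands near the measured figure.

```python
def e2e_tok_s(n_generated: int, ttft_s: float, decode_tok_s: float) -> float:
    """End-to-end rate: generated tokens over (time to first token + decode time)."""
    return n_generated / (ttft_s + n_generated / decode_tok_s)


N = 300  # hypothetical generation length, not taken from the bench

# 1200-token prompt with the measured 20.5 s TTFT and 61.4 tok/s decode.
print(f"TTFT 20.5 s: {e2e_tok_s(N, 20.5, 61.4):.1f} tok/s e2e")

# Same decode rate with a sub-second TTFT: prefill stops dominating.
print(f"TTFT 0.67 s: {e2e_tok_s(N, 0.67, 61.4):.1f} tok/s e2e")
```

Under these assumptions almost all of the wall time at the 1200-token prompt is prefill, which is why raising the decode rate (for example with speculative decoding) barely moves the e2e column.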
rvLLM vs vLLM on H100
vLLM column measured 2026-04 (vLLM 0.19, FP8, CUDA graphs); rvLLM B=1 updated 2026-06 -- an apples-to-apples re-run against current vLLM is owed before claiming the B=1 row.
| Batch | rvLLM tok/s | vLLM tok/s (2026-04) | Delta |
|---|---|---|---|
| 1 | 63.0 (83.9 spec) | 69 | -9% (+22% with spec) |
| 32 | 1,743 | 1,748 | ~0% |
| 64 | 3,265 | 3,130 | +4% |
| 128 | 5,802 | 4,689 | +24% |
| 256 | 7,808 | 7,077 | +10% |
| 512 | 8,786 | 8,243 | +7% |
GPU: 12B Gemma 4 — the solidSF IDE "brain" (added 2026-06-15)
gemma-4-12B-it (RedHatAI FP8-Dynamic) runs on the same Rust+CUDA engine as the
31B and serves as the solidSF IDE "brain". 15 GB FP8 weights, ~17 GB resident
(weights + 8192-block KV) — half the 80 GB card stays free for a co-resident
model. It is launched with the 31B's production recipe (RVLLM_FP8_GEMM_LT_M1,
RVLLM_DECODE_GRAPH, RVLLM_BATCH_PREFILL, RVLLM_NUM_BLOCKS=8192,
--max-inflight-requests 4), now baked into
company/solidsf-ide/app/deploy_rvllm_brain.sh (it had been launched without
them at ~119 tok/s).
Decode-step batch sweep (graphed forward, FP8_GEMM_LT_M1, H100 SXM, 80 iters / 20 warmup):
| B | 1 | 8 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|
| tok/s | 130 | 983 | 3,572 | 6,193 | 10,531 | 16,323 |
~2× the 31B at matched batch (half the weights). Single-user full generate (incl. lm_head + sampling, greedy): ~124 tok/s B=1 decode; TTFT ~220 ms for short prompts (per-token prefill, same open item as the 31B). Generation is fluent and correct — see the perplexity note below for why raw-text PPL reads high for this peaked FP8 instruct model.
GPU perplexity (rvllm-ppl harness fixed 2026-06-15; old absolute numbers superseded)
The old rvllm-ppl path had two bugs that made any multi-chunk or long-corpus
run untrustworthy — they, not the model, produced the wild numbers people saw
(running PPL spiking past 100,000, and OOM on long inputs):
- KV-cache leak. Gemma4Bringup::run_ppl re-allocated the multi-GB paged KV region (plus all scratch) every chunk and never restored the arena, so it OOM'd by the third chunk. Fixed with arena.checkpoint() at entry + arena.restore() before return (the captured graph is dropped first).
- Cold-chunk eval. The driver split the corpus into independent chunks and ran each from position 0 with no BOS and no prior context, so every chunk after the first scored mid-sentence tokens cold. Fixed with a proper sliding-window driver: a fresh BOS per window (Gemma is BOS-sensitive) and a score_from context-only prefix, so each target is scored once with real left context.
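The sliding-window scheme from the second fix can be sketched as follows. This is not the rvllm-ppl driver: the BOS id, the window and stride parameters, and the logprob_fn callback are hypothetical stand-ins. The point is only the bookkeeping: every window starts with a fresh BOS, tokens before score_from serve as context, and each target is counted exactly once.

```python
import math
from typing import Callable, List, Sequence, Tuple

BOS = 0  # hypothetical BOS token id


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (begin, end, score_from) spans; targets in [score_from, end) are scored."""
    assert 0 < stride <= window
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))  # tokens before prev_end were already scored
        prev_end = end
        if end == n_tokens:
            break
    return spans


def perplexity(tokens: Sequence[int],
               logprob_fn: Callable[[Sequence[int]], List[float]],
               window: int, stride: int) -> float:
    """logprob_fn(seq) returns log p(seq[i] | seq[:i]) for i = 1..len(seq)-1."""
    nll, count = 0.0, 0
    for begin, end, score_from in sliding_windows(len(tokens), window, stride):
        seq = [BOS] + list(tokens[begin:end])  # fresh BOS per window
        logprobs = logprob_fn(seq)             # logprobs[j] scores tokens[begin + j]
        for j, lp in enumerate(logprobs):
            if begin + j >= score_from:        # earlier positions are context only
                nll -= lp
                count += 1
    return math.exp(nll / count)


def uniform_logprobs(seq: Sequence[int]) -> List[float]:
    """Toy model: uniform over a 4-token vocabulary, so PPL should be 4."""
    return [math.log(0.25)] * (len(seq) - 1)


print(sliding_windows(10, window=4, stride=2))
print(perplexity(list(range(10)), uniform_logprobs, window=4, stride=2))
```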
Sanity check on the corrected harness: highly predictable repetitive text
scores PPL 1.46, and perplexity decreases monotonically as window context
accumulates — i.e. the forward, attention, paged KV, lm_head, and NLL math are
all sound. (Localised by RVLLM_NO_GRAPH=1 giving the identical result, ruling
out graph capture.)
What the corrected harness shows (gemma-4-12B-it-FP8-Dynamic, F16 KV,
sliding window): the model is heavily peaked. It scores its own generated
prose at ~35 PPL but raw WikiText-2 at ~840. It stays argmax-coherent
(generation is correct and fluent), but its probability mass is miscalibrated
on out-of-distribution text, so teacher-forced PPL on a raw external corpus is
high. That is consistent with a heavily instruct- + multimodal-tuned FP8 model;
it is not the old harness bug. A <20 figure is not reproducible on raw
WikiText for this model — it traces to a different corpus/format than this row.
Old (pre-fix, mixed-path) numbers — superseded, kept for history only:
| Weight path | KV cache | PPL (old harness) | tok/s (B=1, at the time) |
|---|---|---|---|
| FP8-Dynamic + CUTLASS channelscale epilogue | F16 | 14.75 | 53 |
| BF16 split QKV per-tensor FP8 | F16 | 17.96 | 37.9 |
| F16 weights (no FP8) | F16 | 19.79 | 37.9 |
| HuggingFace BF16 reference | -- | 19.62 | -- |
The FP8 row measuring better than the BF16 reference (14.75 < 19.62) was the tell these were never apples-to-apples (softcap applied on the ppl path only; the reference ran through a different harness). With the harness now fixed, a unified clean-corpus run is the way to a trustworthy absolute number; only the relative quantization ordering from the old table is worth keeping.
Gemma 4 forward pass (14 launches per layer)
For each layer in 0..60:
1. fused_rmsnorm_fp8_quant input layernorm + FP8 quantize
2. cutlass_fp8_gemm_channelscale fused Q||K||V + channelscale epilogue
3. fused_qkv_rmsnorm Q/K norm (learned) + V norm (parameter-free)
4. fused_rope_partial_f16kv partial RoPE + F16 KV cache write
5. paged_decode (FA3 SM90) attention (head_dim=256 sliding, 512 global)
6. quantize_fp8_per_token attn output to FP8
7. fp8_gemm O projection
8. fused_norm_add_residual channelscale + rmsnorm + residual add
9. fused_rmsnorm_fp8_quant pre-FFN layernorm + FP8 quantize
10. cutlass_fp8_gemm_channelscale fused gate||up + channelscale epilogue
11. fused_gelu_mul_fp8_quant GELU(tanh)(gate) * up to FP8
12. fp8_gemm down projection
13. fused_norm_add_residual channelscale + rmsnorm + residual + layer_scalar
Sampling tail:
fused_rmsnorm final layernorm
f16_gemm_f32 lm_head
logit_softcap 30 * tanh(logits / 30)
argmax_kernel token selection
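The last two steps of the sampling tail are easy to state in scalar form. The sketch below is a plain-Python reference for the formula listed above, not the CUDA kernels; the function names are illustrative. Because tanh is monotonic, the softcap bounds logits to (-30, 30) without changing which token greedy argmax selects, but it does change the probabilities, which matters for perplexity.

```python
import math
from typing import List, Sequence


def logit_softcap(logits: Sequence[float], cap: float = 30.0) -> List[float]:
    """cap * tanh(logit / cap), applied elementwise."""
    return [cap * math.tanh(x / cap) for x in logits]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.0, 50.0, -3.0, 12.5]
capped = logit_softcap(logits)
print([round(v, 3) for v in capped])
print(argmax(logits), argmax(capped))
```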
Kernel fusion summary
Four rounds of fusion + custom CUTLASS epilogue reduced graph nodes from 1776 to ~935 (47% reduction):
| Fusion | Kernels eliminated | Nodes saved |
|---|---|---|
| f32_to_bf16 + rmsnorm + vector_add -> fused_norm_add_residual | 3 -> 1 (x2/layer) | 240 |
| scale_cols_f32 fused into norm+add kernel (O-proj, down) | 1 -> 0 (x2/layer) | 120 |
| residual_scale_f16 fused into post-ff norm+add | 1 -> 0 (x1/layer) | 60 |
| vnorm_f16 fused into qk_rmsnorm -> fused_qkv_rmsnorm | 2 -> 1 (x1/layer) | 60 |
| CUTLASS channelscale epilogue (QKV, gate_up) | 3 -> 1 (x2/layer) | 240+ |
The CUTLASS channelscale kernel uses a custom SM90 EVT epilogue that applies per-token activation scale (ColBroadcast) and per-channel weight scale (RowBroadcast) directly in the GEMM epilogue while the accumulator is still F32, then casts to F16. At M≤16 this kernel is not used in the fast path anymore: RVLLM_FP8_GEMM_LT_M1=1 reroutes small-M GEMMs through cuBLASLt (+27% measured) -- the smaller-tile CUTLASS variant that used to be "help wanted" here is obsolete.
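In scalar terms the epilogue computes out[m][n] = acc[m][n] * act_scale[m] * w_scale[n] and then rounds to F16. The sketch below is a pure-Python reference for that math, assuming those index conventions (one activation scale per token row, one weight scale per output channel); it is not the CUTLASS EVT code, and the names are illustrative.

```python
import struct
from typing import List, Sequence


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def channelscale_epilogue(acc: Sequence[Sequence[float]],
                          act_scale: Sequence[float],
                          w_scale: Sequence[float]) -> List[List[float]]:
    """Scale F32 accumulators by per-token and per-channel factors, then cast to F16.

    acc is M x N, act_scale has length M (per token), w_scale has length N (per channel).
    """
    return [
        [to_f16(acc[m][n] * act_scale[m] * w_scale[n]) for n in range(len(w_scale))]
        for m in range(len(act_scale))
    ]


acc = [[1.0, 2.0], [3.0, 4.0]]
out = channelscale_epilogue(acc, act_scale=[0.5, 2.0], w_scale=[1.0, 0.25])
print(out)
```

Applying both scales while the accumulator is still F32 and rounding once at the end is the property the fused epilogue preserves; the opt-in F16-output path mentioned earlier rounds before scaling, which is why it changes results slightly.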
Help wanted (current, real -- updated 2026-06-11):
- Close out the two-source prefill wiring (#58): the kernel is parity-proven bit-identical and measures TTFT 20.5 s -> 0.67 s at 1200-token prompts behind RVLLM_PREFILL_TWO_SOURCE=1, but an engine wiring divergence keeps it opt-in. The issue has the full evidence chain and a precise repro.
- Serve-session spec-decode optimization: spec behind the API measures 39.7 tok/s where the bench-side machinery does 83.9 -- the graphed verify isn't fully exploited in the session loop yet.
- Model drafter for spec-decode (Gemma4-E4B), replacing n-gram lookup for novel text.
Shipped from the previous help-wanted list (2026-06-11): cross-request graph persistence (served long-prompt decode 27.4 -> 59.6 tok/s) and the full API sampling contract (temperature/top_p/top_k/seed/stop -- the endpoint was greedy-only before).
GPU build and run

    # One-time on H100 box (~15 min)
    bash kernels/build.sh                 # fused PTX
    bash kernels/build_cutlass_so.sh      # libcutlass_kernels.so
    bash kernels/build_fa3.sh             # libfa3_kernels.so (real FA3 -- needs flash-attention checkout;
                                          # includes the hdim256 combine instantiation upstream lacks)
    bash kernels/build_fa_sm89_so.sh      # libfa_sm89_kernels.so (split-KV FP8 decode + global hd512)

    # Build
    cargo build --release --features cuda --manifest-path v3/Cargo.toml -p rvllm-bench

    # Run
    RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it \
    RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90 \
    RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so \
    RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so \
    RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json \
    RVLLM_BATCH=128 RVLLM_ITERS=30 RVLLM_WARMUP=5 \
      ./v3/target/release/rvllm-bench

OpenAI-compatible Gemma 4 server
The server is a Rust-only Gemma 4 path with an OpenAI-compatible HTTP surface.
It keeps CUDA execution on a single engine owner thread and accepts requests
through /v1/chat/completions.
For the solidSF agents production shape, including 256K context, four-seat
admission, the paid-plan busy response, CAD harness prompting, systemd service
shape, and verification scripts, see
docs/solidsf-agent-serving.md.

    export CUDA_ARCH=sm_90
    export RVLLM_MODEL_DIR=/workspace/models/gemma-4-31B-it
    export RVLLM_KERNELS_DIR=/workspace/rvllm/kernels/sm_90
    export RVLLM_CUTLASS_SO=/workspace/rvllm/kernels/sm_90/libcutlass_kernels.so
    export RVLLM_FA3_SO=/workspace/rvllm/kernels/sm_90/libfa3_kernels.so
    export RVLLM_POLICY=/workspace/rvllm/kernels/sm_90/policy.json
    export RVLLM_SERVED_MODEL_NAME=gemma4-31b
    export RUST_LOG=info

    bash kernels/build.sh sm_90
    bash kernels/build_cutlass_so.sh sm_90
    bash kernels/build_fa3.sh
    cargo build --release --features cuda,cublaslt --manifest-path v3/Cargo.toml -p rvllm-serve

    ./v3/target/release/rvllm-server \
      --host 127.0.0.1 \
      --port 8080 \
      --max-model-len 8192 \
      --max-num-seqs 1 \
      --max-num-batched-tokens 2048 \
      --max-prefill-chunk 128

The server exposes GET /health, GET /v1/models, and
POST /v1/chat/completions with non-stream and SSE streaming responses.
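Earlier builds were greedy-only; since 2026-06-11 the endpoint accepts the full sampling contract (temperature/top_p/top_k/seed/stop). Set temperature: 0 for deterministic greedy output.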
Smoke:

    curl -fsS http://127.0.0.1:8080/health
    curl -fsS http://127.0.0.1:8080/v1/models

    curl -fsS http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"gemma4-31b","messages":[{"role":"user","content":"Reply exactly: RVLLM_RUST_OK"}],"max_tokens":16,"temperature":0}'

    curl -fsS --no-buffer http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"gemma4-31b","messages":[{"role":"user","content":"hi"}],"max_tokens":16,"temperature":0,"stream":true}'

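The same smoke requests can be driven from Python with only the standard library. This is a sketch under assumptions: the helper names are hypothetical, and the streaming parser assumes the usual OpenAI chunk shape (data: lines carrying choices[0].delta.content, terminated by data: [DONE]), which the text above implies by "OpenAI-compatible" but does not spell out.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_request(prompt: str, model: str = "gemma4-31b",
                       max_tokens: int = 16, stream: bool = False) -> bytes:
    """JSON body matching the curl smoke requests."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "stream": stream,
    }
    return json.dumps(body).encode("utf-8")


def post_chat(base_url: str, prompt: str) -> dict:
    """Non-streaming call; needs a running server, so it is not exercised here."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from SSE lines (assumed OpenAI chunk format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"RVLLM_"}}]}',
    'data: {"choices":[{"delta":{"content":"RUST_OK"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```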
For bind-only local checks without CUDA:

    RVLLM_DRY_RUN=1 cargo run --manifest-path v3/Cargo.toml -p rvllm-serve -- \
      --host 127.0.0.1 \
      --port 8080

Kernels
Every kernel has a known purpose, a pinned variant, and a workspace contract. No dispatch fallback chains.
| Kernel | Purpose |
|---|---|
| cutlass_fp8_gemm_channelscale | SM90 FP8 GEMM with EVT channelscale epilogue (QKV, gate_up) |
| fused_rmsnorm_fp8_quant | layernorm + FP8 quantize in one launch |
| fused_qkv_rmsnorm | per-head RMSNorm on Q, K (learned) and V (parameter-free) |
| fused_rope_partial_f16kv | partial RoPE + F16 KV cache write |
| fused_gelu_mul_fp8_quant | GELU(tanh)(gate) * up to FP8 |
| fused_norm_add_residual | channelscale + RMSNorm + residual add (+ optional layer_scalar) |
| logit_softcap | 30 * tanh(logits / 30) |
| quantize_fp8_per_token | activation to FP8 with per-token scale |
| argmax | f32 logits to i32 token |
| fp8_decode_v2 (fa_sm89_* .so) | split-KV GQA-grouped FP8 paged decode + LSE combine (35-171x over the serial kernel) |
No fallbacks. Missing kernel .so = engine refuses to start. One earned scar: the loader probes the FA3 .so for fa3_sm90_* symbols and quietly selects the Ada-generation kernel set if they're absent -- which is how a production H100 served sm_89 attention for four days when a fallback .so got copied over the FA3 filename. Verify with nm -D $RVLLM_FA3_SO | grep fa3_sm90 after any kernel deploy; the engine now also refuses this combination on sm_90 unless explicitly overridden.
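The nm check is easy to script as a post-deploy gate. The sketch below wraps the same nm -D call and looks for the fa3_sm90_ prefix named above; the function names and the sample symbol names in the demo are hypothetical.

```python
import subprocess


def has_fa3_sm90_symbols(nm_output: str) -> bool:
    """True if any symbol in `nm -D` output starts with the fa3_sm90_ prefix."""
    for line in nm_output.splitlines():
        parts = line.split()
        if parts and parts[-1].startswith("fa3_sm90_"):
            return True
    return False


def check_fa3_so(path: str) -> bool:
    """Run `nm -D` on the shared object and check for real FA3 symbols."""
    result = subprocess.run(["nm", "-D", path], check=True,
                            capture_output=True, text=True)
    return has_fa3_sm90_symbols(result.stdout)


# Demo on canned output; the symbol names here are made up for illustration.
real_fa3 = "0000000000012340 T fa3_sm90_paged_decode\n                 U cudaMalloc\n"
fallback = "0000000000012340 T fa_sm89_paged_decode\n                 U cudaMalloc\n"
print(has_fa3_sm90_symbols(real_fa3), has_fa3_sm90_symbols(fallback))
```

In a deploy script, check_fa3_so would be called with the path in RVLLM_FA3_SO and the deploy aborted when it returns False.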
v3 crate map
v3/crates/
rvllm-core typed errors, IDs, dtype, shape, config, env
rvllm-mem HbmArena, Region, Stream, Event, PinnedBuf, CudaContextHandle
rvllm-kernels manifest (sha-pinned), PTX loader, kernel catalog
rvllm-fused 8 fused-kernel launchers + pure-Rust f32 references
rvllm-attention FA3 SM90 paged decode/prefill dlopen
rvllm-cutlass FP8 variant catalog + schedule pairing trait + cuBLASLt wrapper
rvllm-metadata frozen-layout metadata per bucket (one upload path)
rvllm-loader safetensors mmap -> HBM + CPU-path FP8 quant + clamp gate
rvllm-sampling argmax tail, pinned DtoH
rvllm-graph captured-graph pool keyed on MetaLayoutHash
rvllm-runtime Engine, scheduler, layer_exec, bring_up
rvllm-bench RVLLM_* env-driven bench binary
rvllm-invariants DAG-dep test, no-megakernel gate
Correctness discipline
- No fallbacks. Missing autotune entry = engine panic. Missing .so = refuse start. No silent degradation.
- Graph-capture invariant. Metadata buffer layout frozen per (bucket, max_blocks_per_seq). Captured graphs bind exact offsets.
- CUTLASS schedule/epilogue pairing. Mainloop and epilogue schedules must match. Enforced via static_assert.
- No unwrap() in libraries. Result<T, RvllmError> end-to-end with structured context.
- Real block-change detection. Scheduler emits block table updates; missing signals = stale KV reads caught at the type level.
License
Apache-2.0.
Further reading
- docs/bench.html - interactive benchmark results with charts
- v3/H100_MAXPERF_PLAN.md - the June 2026 measured session record: GEMM routing, FA3 incident, split-KV FP8 attention, spec-decode, production deploy
- v3/GEMMA4_SPEC.md - 31B Gemma 4 architecture details and weight shapes
- v3/M1_OUTCOME.md - batch=1 decode: cuBLASLt FP8 vs hand-GEMV measurements (vindicated June 2026 -- now the production route)
- v3/MEGAKERNEL_OUTCOME.md - persistent megakernel: built, measured, refuted (26 vs 45 tok/s)
- v3/SPECDECODE_SPEC.md - GPU speculative decoding design + lossless gates
- v3/SPEC_FP8_DECODE_ATTN_REWRITE.md - the split-KV FP8 attention kernel spec
- v3/SPEC.md, v3/IMPL_PLAN.md - v3 rewrite plan, 16 agent specs
- tpu/harness/EAGLE3_SPEC.md - EAGLE-3 speculative decoding spec (TPU, experimental)
- docs/arch.md - crate architecture (April 2026 snapshot; numbers therein superseded)
Updates
2026-06-09/10 -- H100 single-user maximization + production deploy (commits f70b4bf..0d5f276, PR #54)
Full record: v3/H100_MAXPERF_PLAN.md. Summary:
- Production-regime decode 14 -> 61.4 tok/s (4.4x); short-ctx FP8 31.2 -> 63.0; spec-decode 83.9 on real text at ctx 1200+, 153 on repetitive text. Deployed to production 2026-06-10 (API-measured 3.38x short / 1.78x long).
- Corrected the published B=1 narrative: the "~12 ms inter-node dispatch gap" analysis was a graph-level-trace artifact -- node-level nsys shows ~1.1 ms gap; the real costs were the CUTLASS channelscale GEMM at ~51% HBM at M=1 (fixed: cuBLASLt routing, +27%) and the serial-token FP8 attention kernel (fixed: split-KV rewrite, 35-171x).
- Found and fixed a production incident: the deployed libfa3_kernels.so was the Ada fallback under the FA3 filename (silent symbol-probe fallback) -- H100 served sm_89 attention 2026-06-06..10. The engine now refuses that combination on sm_90; kernels/build_fa3.sh gained the missing hdim256 combine instantiation so real FA3 is rebuildable from upstream.
- Speculative decoding shipped for GPU (n-gram drafter, graphed verify, lossless gates): RVLLM_SPEC_DECODE=1 RVLLM_SPEC_K=4.
- Known weak spots, in the open: per-token prefill TTFT at >1024-token prompts (sound chunked prefill is the fix, design proven in the verify chunk), serve-side graph lifecycle at long context (27.4 API vs 61.4 bench), and the GPU PPL table needs a unified re-eval.
- Next: TPU revisit (v6e access incoming), E4B model drafter, spec-decode in rvllm-serve.