Last year I got Gemma 3 27B to 22K output tokens per second on a single 4x H100 node with vLLM. If you’re not familiar with inference engines: they’re the software layer between the raw model weights and the API your application calls. They handle batching, scheduling, KV cache management, and GPU orchestration — the difference between a model that serves 100 tok/s and one that serves 100K tok/s. vLLM, SGLang, and TensorRT-LLM are the main ones.
Caleb has a solid overview on YouTube if you want to go deeper.
That Gemma 3 22K result felt fast at the time. Then I got access to B200s and started doing math on napkins.
96 B200 GPUs. 12 nodes. A load balancer distributing requests across them. The result: 1,103,941 total tokens per second.
Here’s how I got there, including everything that went wrong along the way.
Why Qwen 3.5 27B
Earlier this year, Alibaba released several Qwen 3.5 models in different waves. One of them was the dense 27B variant — all 27 billion parameters active on every token, no expert routing. Others in the family (122B-A10B and 35B-A3B among them) are MoE models that activate only a small subset of parameters per token, which makes them faster but potentially less capable per parameter.
The 27B has an unusual architecture. Alibaba calls it an “Efficient Hybrid Architecture”: 64 layers arranged in repeating groups of 4. In each group, 3 layers use Gated Delta Networks (GDN) and 1 layer uses standard grouped-query attention. In code: 16 * (3 * (GDN -> FFN) -> 1 * (GQA -> FFN)). Think of it like a highway system: the GDN layers are express lanes that handle nearby context cheaply using linear attention (O(n) instead of O(n^2)), while every 4th layer is a full interchange that lets the model look across the entire sequence using traditional attention. This hybrid design gives the model a 262K native context window, extensible to 1M via YaRN RoPE scaling, while keeping compute costs per token well below what a full-attention model would need.
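If the layer pattern is easier to see in code, here's a toy sketch (the names are just labels for illustration, not the actual module names in the implementation):

# The repeating 64-layer hybrid pattern: 16 groups of 3 GDN layers + 1 GQA layer.
def build_layer_pattern(num_groups: int = 16) -> list:
    layers = []
    for _ in range(num_groups):
        layers += ["GDN", "GDN", "GDN", "GQA"]  # 3 linear-attention layers, then 1 full-attention layer
    return layers

pattern = build_layer_pattern()
assert len(pattern) == 64
assert pattern.count("GQA") == 16   # only every 4th layer attends over the full sequence
assert pattern.count("GDN") == 48   # the rest are O(n) Gated Delta Network layers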
In FP8 the weights are about 27.3 GB (1 byte per param), then embeddings and layer norms are typically kept in BF16 (2 bytes) which adds another 1–2 GB. So let’s assume about 29 GB. That’s small relative to what B200s can hold — each GPU has 180 GB of HBM3e, leaving ~151 GB for KV cache.
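The napkin math, spelled out (the 180 GB and ~29 GB figures are from above; the BF16 overhead is a rough estimate):

# FP8 weights plus BF16 embeddings/norms, and what's left of a B200 for KV cache.
params_b = 27.3                             # billions of parameters, approx.
fp8_weights_gb = params_b                   # 1 byte per param in FP8 -> ~27.3 GB
bf16_extras_gb = 1.5                        # embeddings + norms in BF16, rough estimate
model_gb = fp8_weights_gb + bf16_extras_gb  # ~29 GB total

kv_headroom_gb = 180 - model_gb             # ~151 GB of HBM3e left for KV cache
print(f"model ~{model_gb:.0f} GB, KV cache headroom ~{kv_headroom_gb:.0f} GB")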
I picked this model for a few reasons. It’s a reasoning model — it generates <think> blocks before answering, so real-world usage produces a lot of output tokens. If you're benchmarking throughput, that matters. It scores well on instruction following (IFEval 95.0), which is what you need for structured extraction from documents. On SWE-Bench Verified it hits 72.4, competitive with models 5-10x its size. Community reception is strong for practical tasks while honestly mixed on harder coding benchmarks. But hey it's also Apache 2.0 licensed. OSS #FTW.
The 27B dense variant is slower per token than the MoE siblings (all parameters active vs a small subset), but it produces higher quality output. For a throughput benchmark, the dense model is harder to push fast — an interesting challenge, and in Google’s Think 10X spirit, exactly the kind of problem I wanted to throw hardware at.
The hardware
B200 vs. H100
+--------------------+---------------+----------------+---------+
| Spec | H100 SXM | B200 SXM | Ratio |
+--------------------+---------------+----------------+---------+
| HBM capacity | 80 GB HBM3 | 180 GB HBM3e | 2.25x |
| Memory bandwidth | 3.35 TB/s | 8 TB/s | 2.4x |
| FP8 compute | 3.96 PFLOPS | 9 PFLOPS | 2.3x |
+--------------------+---------------+----------------+---------+

The bandwidth number is what matters for decode. Every output token requires reading the model weights and KV cache from HBM. Going from 3.35 to 8 TB/s is a direct speedup on the bottleneck that determines how fast tokens come out.
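A rough roofline sketch of why that matters, assuming every decode step streams the full ~29 GB of weights from HBM and ignoring KV cache reads, overlap, and speculative decoding. It's not a performance model, just the shape of the bottleneck:

# Rough decode roofline: if each decode step must read the full weights from
# HBM, memory bandwidth caps the number of decode steps per second per GPU.
# Every sequence in the batch shares that one weight read, which is why batched
# throughput is far higher than this per-step bound suggests.
model_bytes = 29e9

for name, bw_tb_s in [("H100", 3.35), ("B200", 8.0)]:
    step_time_s = model_bytes / (bw_tb_s * 1e12)   # time to stream the weights once
    steps_per_s = 1 / step_time_s                  # bandwidth-bound decode steps/s
    print(f"{name}: ~{step_time_s * 1e3:.1f} ms per weight read, ~{steps_per_s:.0f} steps/s")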
Each A4 node has 8x B200 GPUs connected via NVLink. My first instinct was tensor parallelism — split the model across all 8 GPUs, let NVLink handle the communication.
That was wrong.
The journey
I didn’t start at 1M tok/s. I started at 9,500.
Setting up
I’m using GKE Autopilot, which means no node pool management. Autopilot provisions A4 nodes automatically when pods request nvidia.com/gpu: 8 with the B200 accelerator selector. When there's no workload, there are no GPU nodes and no GPU bill.
export PROJECT_ID="your-project-id"
export HF_TOKEN="your-hf-token"

gcloud container clusters create-auto vllm-inference-cluster \
--project="${PROJECT_ID}" \
--region=europe-west4 \
--release-channel=rapid
gcloud container clusters get-credentials vllm-inference-cluster \
--project="${PROJECT_ID}" \
--region=europe-west4
kubectl create secret generic hf-token \
--from-literal=token="${HF_TOKEN}"
Phase 1: TP=8 — the wrong parallelism (9,500 → 22K tok/s)
One thing I had to get right before any benchmarking: Qwen 3.5’s hybrid GDN architecture requires --mamba-cache-mode=align and --mamba-block-size=8 in vLLM. Without these, the engine reprocesses the entire prompt on every response — throughput drops by an order of magnitude and you'll spend hours debugging before realizing it's a cache alignment issue. The vLLM Qwen 3.5 recipe documents this, but it's easy to miss.
Tensor parallelism (TP=8) splits the ~29GB model across 8 GPUs — about 3.6 GB each. Each GPU processes its slice of every layer, then they synchronize via NCCL all-reduce over NVLink before moving to the next layer. The idea is that 8 GPUs working together are faster than 1. (For the full theory on parallelism strategies and when each one makes sense, How to Scale Your Model is the best reference I’ve found — it’s TPU-focused but the core principles apply to any accelerator.)
In practice, each GPU finished its 3.6 GB slice almost instantly and then spent most of its time waiting for the other 7 to synchronize. If you’ve managed storage systems, you know this pattern: it’s IOWait. The GPUs are fast enough, but they waste cycles waiting for data from each other. The DeepSeek V3 team spent months writing custom PTX instructions just to optimize this exact communication overhead — allocating warps, tuning chunk sizes, overlapping communication with compute. That’s how hard this problem is. We sidestepped it entirely.
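To get a feel for how much of each decode step TP=8 spends synchronizing, here's an illustrative (not measured) estimate, assuming roughly two all-reduces per layer and a made-up 20 µs latency per small, decode-sized all-reduce:

# Illustrative only: a latency floor for TP=8 synchronization per decode step.
# Assumes ~2 NCCL all-reduces per layer (after attention and after the FFN) and
# a 20 microsecond ballpark per small all-reduce -- both are assumptions, not
# measurements from this cluster.
layers = 64
allreduces_per_layer = 2
allreduce_latency_s = 20e-6

sync_per_step_s = layers * allreduces_per_layer * allreduce_latency_s
print(f"~{sync_per_step_s * 1e3:.1f} ms of latency-bound sync per decode step")
# A few milliseconds of sync against a ~3-4 ms weight-read budget per step is a
# heavy tax. DP=8 (next phase) pays none of it.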
I also tried MTP-5 (5 draft tokens instead of 1). It gave ~21% more throughput at first — but crashed on the second run with cudaErrorIllegalAddress. The per-position acceptance rates told the story: 84.7% at position 0, then 58.7%, 49.3%, 44.1%, 41.2%. Only the first two positions carry real value; positions 2-4 are coin flips that cost more compute than they save. MTP-1 at 90% acceptance was slower but stable across every run.
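Here's the arithmetic behind that, assuming standard speculative verification where a draft token only counts if every earlier draft position was also accepted:

# Expected tokens per decode step from the per-position acceptance rates above.
accept = [0.847, 0.587, 0.493, 0.441, 0.412]   # MTP-5, positions 0-4

expected_draft = 0.0
running = 1.0
for i, a in enumerate(accept):
    running *= a                     # all earlier positions must also be accepted
    expected_draft += running
    print(f"position {i}: +{running:.2f} tokens/step")

print(f"MTP-5: ~{1 + expected_draft:.2f} tokens per step")  # ~2.7, but unstable
print(f"MTP-1: ~{1 + 0.90:.2f} tokens per step")            # ~1.9, and it never crashed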
Another ceiling I didn’t notice at first: max-num-seqs=512 capped how many sequences the scheduler could admit. Throughput was flat from 500 to 1000 concurrency because the scheduler was full. I raised it to 8192 in Phase 2.
Best result with TP=8: ~22,300 total tok/s with MTP-1 speculative decoding. GPU utilization stayed low.
Phase 2: DP=8 — the right idea, wrong tuning (22K → 75K tok/s)
Instead of splitting one model across 8 GPUs, run 8 independent copies of the full model, one per GPU. Data parallelism (DP=8). Each GPU loads the full ~29 GB of model weights and serves requests on its own. Zero inter-GPU communication.
Throughput jumped to 74,848 tok/s immediately. But something was off — GPU compute read 0%. The model is so small relative to the B200 that the hardware wasn’t breaking a sweat. The engine logs told the story: Running: 135 reqs, Waiting: 138 reqs, GPU KV cache usage: 99.4%. The bottleneck was KV cache, wasted by the default 131K context window. My sequences used ~1,500 tokens—that's an 85x overallocation per slot.
Phase 3: context window tuning (75K → 85K tok/s)
Dropped max-model-len from 131,072 to 4,096 (my benchmark prompts are ~1,500 tokens). But my first attempt at gpu-memory-utilization=0.95 crashed with 908 failed requests. The MTP draft model takes ~29GB per GPU — same as the base model — so at 0.95 utilization, memory pressure from model + draft + KV cache + CUDA graphs was too high. One engine choked at 99.6% KV cache with 801 queued requests while another sat completely idle. Dropping to 0.90 fixed it: throughput hit 85,509 tok/s with zero failures. Still 0% GPU compute though.
This is a configuration most people skip over. A coding assistant needs a long context window (maybe 32K-128K). An insurance document processor working with RAG-retrieved excerpts might only need 4–8K. A benchmark with fixed 1K input and 512 output tokens can use an even tighter fit. Tuning max-model-len to your actual workload can matter more than you'd expect.
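A quick sanity check you can run for your own workload: take the KV cache token capacity vLLM reports at startup and divide by your context window. Using the 959K tokens/engine figure from Phase 4 as an example input:

# Worst-case sequence slots for a few context-window settings. kv_capacity_tokens
# is whatever vLLM logs at startup ("GPU KV cache size: ... tokens"); 959K is the
# per-engine figure quoted in Phase 4, used here as an example.
kv_capacity_tokens = 959_000

for max_model_len in (131_072, 4_096, 2_560):
    slots = kv_capacity_tokens // max_model_len
    print(f"max-model-len={max_model_len:>7}: {slots:>4} worst-case sequences")
# At 131K context, a handful of long sequences could pin the entire cache.
# At 4K or 2.5K, hundreds of sequences fit even in the worst case.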
Phase 4: FP8 KV cache + MTP — the breakthrough (85K → 96K tok/s)
Two changes made the difference. First, upgrading to vLLM v0.18.0 and enabling --kv-cache-dtype=fp8_e4m3. KV cache capacity tripled (959K tokens/engine vs 288K). GPU utilization finally spiked to 100% during prefill. (Note: vLLM v0.18.0 has a known issue with FP8 KV cache accuracy, already fixed and coming in v0.18.1.)
Second, Multi-Token Prediction (MTP) speculative decoding. I swept MTP configs:
+----------------------+-------------+----------+-----------------------------------------+
| Config | Total tok/s | GPU util | Notes |
+----------------------+-------------+----------+-----------------------------------------+
| MTP-1, gpu-mem=0.92 | 96,023 | 84-90% | Best single-node |
| MTP-2, gpu-mem=0.95 | 90,567 | 87-100% | MTP-2 overhead costs more than it saves |
| No MTP, gpu-mem=0.95 | 66,021 | 0% | Back to 0% GPU -- MTP is the multiplier |
| MTP-1, gpu-mem=0.95 | 95,391 | 84-90% | No improvement over 0.92 |
+----------------------+-------------+----------+-----------------------------------------+

MTP-1 is the single biggest throughput lever. At 90% acceptance rate, each decode step produces ~1.9 tokens instead of 1. Removing MTP drops throughput by a third and GPU utilization back to zero. For a model this small on B200s, speculative decoding is what makes the GPUs do actual work. At peak load, each B200 was pulling 864–922W — close to the 1000W TDP. The GPUs were finally earning their electricity bill.
Here’s the final optimized config (see also the vLLM deployment guide for Qwen 3.5, full manifest on GitHub):
args:
- --model=Qwen/Qwen3.5-27B-FP8
- --data-parallel-size=8
- --max-model-len=2560
- --max-num-batched-tokens=16384
- --kv-cache-dtype=fp8_e4m3
- --gpu-memory-utilization=0.92
- --enable-prefix-caching
- --enable-chunked-prefill
- --attention-backend=FLASHINFER
- -O3
- --speculative-config={"method":"mtp","num_speculative_tokens":1}
env:
- name: VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS
  value: "1"

Each flag maps to something I learned:
- --data-parallel-size=8: the Phase 2 discovery. Zero inter-GPU communication.
- --kv-cache-dtype=fp8_e4m3: Phase 4. Triples KV cache capacity.
- --speculative-config (MTP-1): the throughput multiplier. About 1.9 tokens per decode step at 90% acceptance.
- --max-model-len=2560: Phase 3. Tuned to the benchmark workload.
- --gpu-memory-utilization=0.92: the sweet spot. 0.95 didn't help.
- VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1: prevents OOM by accounting for CUDA graph memory during profiling.
A note on GPU provisioning: A4 nodes with B200 GPUs don’t support on-demand. For this blog I used Spot VMs. For production, the options are Flex-start, Future Reservations in calendar mode, or AI Hypercomputer reservations.
Node provisioning takes about 5–6 minutes on Autopilot. Model loading and CUDA graph compilation add another 10 minutes. Without pre-downloaded weights, add time for the ~29 GB HuggingFace download.
Phase 4b: the 397B detour
Before scaling out, I tried the bigger model: Qwen 3.5 397B-A17B MoE in NVFP4 on the same 8 GPUs (TP=8 + expert parallel). GPU compute hit 98% — the opposite problem from the 27B. But throughput was only 19,725 tok/s. The 397B model is genuinely compute-bound.
Getting the 397B downloaded was its own adventure. HuggingFace had recently migrated this checkpoint to Xet storage, and Xet has a bug: downloads stall at around 16–20 GB. snapshot_download() stalls. huggingface-cli download stalls. What worked: plain git clone with Git LFS, which bypasses Xet entirely. I ran the download as a Kubernetes Job on a cheap C4A node (Arm-based, no GPUs) with a Hyperdisk ML PVC attached — Hyperdisk ML is only supported on C4A, which is why the download job runs there. The full download workaround and Job manifest are in the GitHub repo.
Benchmark methodology
The benchmark follows the InferenceMAX methodology from SemiAnalysis, the industry standard for comparing inference engine performance. The methodology is fully documented on GitHub and was adopted by the vLLM team for their large-scale serving benchmarks.
SemiAnalysis runs the same benchmark on Qwen 3.5 FP8 with B200 and MTP, using SGLang with EAGLE speculative decoding. We use vLLM with native MTP. The parameters are comparable: FP8 model weights, FP8 KV cache, speculative decoding, and the same benchmark harness (benchmark_serving.py).
Our config: input-len=1024, output-len=512, random-range-ratio=0.5, unique random prompts with 0% prefix cache hit rate. This is worst-case performance — no caching whatsoever, which is nearly impossible in a real deployment where requests share system prompts. Total tok/s = input + output tokens processed per second.
A note on the ISL/OSL ratio: our benchmark uses a 2:1 input-to-output ratio (1024 in, 512 out). Each request processes twice as many input tokens as it generates, but the wall-clock time is dominated by decode: prefill chews through the whole prompt in a few large batched steps, while every output token needs its own decode step. That matters because decode is bandwidth-bound (reading model weights and KV cache from HBM on every step), which is exactly where the B200’s 8 TB/s memory bandwidth pays off. Many benchmarks you’ll see online use ratios like 16:1 (e.g., 2048 in, 128 out), which are prefill-heavy — the model chews through a long input but generates very little output. Prefill is compute-bound, not bandwidth-bound, so those results tell a different story. If you’re comparing throughput numbers across benchmarks, check the ISL/OSL ratio before drawing conclusions.
I run benchmarks from inside the cluster on a dedicated 16 vCPU C4 node to avoid client-side bottlenecks (more on that below). The benchmark pod YAML and parallel benchmark script are in the repo.
Multi-node scaling: the client bottleneck
My first multi-node results made no sense. Two nodes gave 162K, four nodes gave 275K, eight nodes gave… 242K. Going from 4 to 8 nodes decreased throughput.
The servers were fine. The benchmark client was the bottleneck. vllm bench serve is a single Python process, and at 16K concurrent async connections it burns 99% of one CPU core just managing the event loop. It can't push requests fast enough to saturate 8 nodes.
I wasted GPU hours figuring this out. In hindsight, the sign was obvious: going from 4 to 8 nodes can’t decrease throughput if the servers are the bottleneck. It can only decrease if the client is.
The fix: run multiple benchmark clients in parallel, each handling 1K concurrency, on a 16 vCPU C4 node. This matches what SemiAnalysis does in InferenceMAX. I wrote parallel-bench.sh to launch N clients and sum throughput across all of them.
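The idea in code, as a minimal sketch (not the repo's parallel-bench.sh, and the vllm bench serve flags are illustrative, so check --help for your version):

# Launch N independent benchmark clients so no single Python event loop becomes
# the bottleneck. Flags and the service name are assumptions for illustration.
import subprocess

N_CLIENTS = 16
BASE_URL = "http://qwen35-server:8000"

procs = []
for i in range(N_CLIENTS):
    cmd = [
        "vllm", "bench", "serve",
        "--base-url", BASE_URL,
        "--model", "Qwen/Qwen3.5-27B-FP8",
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "512",
        "--max-concurrency", "1000",
        "--num-prompts", "4000",
    ]
    log = open(f"client-{i}.log", "w")
    procs.append((subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT), log))

for proc, log in procs:
    proc.wait()
    log.close()
# Sum the per-client throughput lines from client-*.log to get the cluster total.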
With proper client saturation, scaling was linear:
+-------+---------+-------------+----------+------------+-------------+
| Nodes | GPUs | Total tok/s | Per-node | Efficiency | TPOT median |
+-------+---------+-------------+----------+------------+-------------+
| 1 | 8 B200 | 95,317 | 95,317 | 100% | 44 ms |
| 2 | 16 B200 | 190,000 | 95,000 | 99.7% | 47 ms |
| 4 | 32 B200 | 376,074 | 94,019 | 98.6% | 46 ms |
| 8 | 64 B200 | 740,192 | 92,524 | 97.1% | 46 ms |
+-------+---------+-------------+----------+------------+-------------+

96.5% scaling efficiency at 12 nodes. Near-linear. Each node runs independently (DP=8, no cross-node communication), so adding a node is almost exactly +92K tok/s. TPOT stays flat at ~46ms regardless of node count. MTP acceptance stays at 90%. That’s ~11.5K tok/s per individual B200 GPU.
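For completeness, the per-node and efficiency columns are just arithmetic on the measured totals:

# How the per-node and efficiency columns are derived.
single_node = 95_317
measured = {1: 95_317, 2: 190_000, 4: 376_074, 8: 740_192, 12: 1_103_941}

for nodes, total in measured.items():
    per_node = total / nodes
    efficiency = total / (nodes * single_node)
    print(f"{nodes:>2} nodes: {per_node:>7,.0f} tok/s per node, {efficiency:.1%} efficiency")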
ClusterIP vs Inference Gateway
I expected the Inference Gateway to outperform ClusterIP at multi-node. It’s purpose-built for LLM serving — it tracks KV cache utilization and queue depth per replica and routes requests to the least-loaded node. ClusterIP just round-robins.
The results surprised me.
+-------+-----------------+---------------+------------------+
| Nodes | ClusterIP tok/s | Gateway tok/s | Gateway overhead |
+-------+-----------------+---------------+------------------+
| 2 | 190,000 | 131,303 | 31% |
| 4 | 376,074 | 245,009 | 35% |
| 8 | 740,192 | 469,433 | 37% |
+-------+-----------------+---------------+------------------+

The Gateway adds ~35% overhead at all node counts. Part of that is structural: every request and every streamed token passes through the EPP pod via an ext_proc gRPC stream. But part of it is that our benchmark was worst-case for smart routing. With 0% prefix cache hit rate and unique synthetic prompts, KV cache utilization was the same across all replicas at all times. The Gateway’s KV-cache-aware routing had nothing useful to route on — round-robin would have made the same decisions. A real workload with shared system prompts and uneven KV cache pressure across replicas is where smart routing earns its keep. What we measured here is purely the ext_proc proxy cost.
The single active EPP (even with 10 CPU) becomes the throughput ceiling. Given this overhead, extending Gateway testing to 12 nodes wasn’t worth the GPU hours.
Getting the Gateway working took longer than expected. The default EPP (v1.0.1, 500m CPU, 1 replica) crashed under 8K connections — a known pattern with ext_proc. I upgraded to v1.4.0 with 10 CPU, which fixed the crashes but not the overhead. The inferenceExtension.replicas setting turned out to be HA standby (leader election), not load distribution — only one EPP replica handles traffic regardless of how many you deploy.
The Gateway also required prerequisites I didn’t expect: NetworkServices API had to be enabled on the project, and a proxy-only-subnet had to be created for the regional internal L7 load balancer. Without both, the Gateway silently fails to program. No error message, no warning — just a Gateway resource stuck at PROGRAMMED: False forever.
The Gateway deployment details (CRD installation, Helm commands, prerequisites) are in the GitHub repo. The key Helm command:
helm install qwen35-pool \
--set inferencePool.modelServers.matchLabels.app=qwen35-server \
--set provider.name=gke \
--set provider.gke.autopilot=true \
--set inferenceExtension.monitoring.gke.enabled=true \
--set inferenceExtension.resources.requests.cpu=10 \
--set inferenceExtension.resources.requests.memory=16Gi \
--set inferenceExtension.resources.limits.cpu=10 \
--set inferenceExtension.resources.limits.memory=16Gi \
--version v1.4.0 \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

Meanwhile, ClusterIP round-robin just works. At 12 nodes with 96.5% scaling efficiency, the “dumb” approach won. The Gateway’s intelligent routing will matter more when workloads are uneven (mixed prompt lengths, some requests much heavier than others) or when you need to enforce per-model routing policies. For our uniform synthetic benchmark, round-robin was enough.
Path to 1M
+-------+---------+-------------+-------------+-------------+
| Nodes | GPUs | Total tok/s | TPOT median | TTFT median |
+-------+---------+-------------+-------------+-------------+
| 1 | 8 B200 | 95,317 | 44 ms | 8,065 ms |
| 2 | 16 B200 | 190,000 | 47 ms | 6,744 ms |
| 4 | 32 B200 | 376,074 | 46 ms | 7,331 ms |
| 8 | 64 B200 | 740,192 | 46 ms | 5,580 ms |
| 12 | 96 B200 | 1,103,941 | 46 ms | 5,600 ms |
+-------+---------+-------------+-------------+-------------+

From 9,500 tok/s with the wrong parallelism strategy to over 1.1 million with the right one. The per-token latency (TPOT) stays flat at 46ms across all node counts — adding nodes increases throughput without hurting latency. Time-to-first-token (TTFT) actually improves as you add nodes, because each node handles fewer concurrent requests.
There’s a second reason to keep each node self-contained beyond raw performance. GPUs fail. It’s a well-known issue in training, and inference hardware is no different. With TP across nodes, a single GPU failure takes down the entire serving pipeline. With DP=8 per node, a node failure only affects the requests on that node. The other nodes keep serving. You lose 1/12th of your capacity instead of all of it. At scale, this matters more than squeezing another few percent from cross-node parallelism.
Why this matters: an insurance example
A million tokens per second sounds like a benchmark number. It isn’t.
An insurance company has hundreds of thousands of policies — some legacy systems still coded in COBOL, others in modern databases. Each policy document runs from a few pages to hundreds of pages of legal language. When someone needs to know “does my policy cover water damage from a washing machine leak?”, something needs to find the right policy, locate the relevant sections, and interpret the legal wording.
The modern architecture for this is a RAG pipeline: a user authenticates (whether that’s a customer through a web portal or an internal operator through a claims dashboard), a tool retrieves the relevant policies, a vector database finds the most applicable sections, and those passages get fed to an LLM that synthesizes an answer. The model never sees the full document archive — it reads a curated slice and interprets it.
The insurance company can offer specialized internal agents for operators (with deeper access to policy internals and claim history) and a simplified agent for customers (focused on coverage questions and next steps). You don’t need 12 nodes to start. A single A4 node handles 2,000 concurrent users at 95K tok/s. Two nodes get you to 4,000, four nodes to 8,000. At 8 nodes, 16,000 operators and customers query these systems simultaneously — over 640 requests per second with sub-50ms per-token latency. Scale beyond that and the math keeps working: 12 nodes hit 1.1M tok/s.
+-------+---------+------------------+-------+-------------+
| Nodes | GPUs | Concurrent users | Req/s | Total tok/s |
+-------+---------+------------------+-------+-------------+
| 1 | 8 B200 | 2,000 | ~83 | 95,317 |
| 2 | 16 B200 | 4,000 | ~165 | 190,000 |
| 4 | 32 B200 | 8,000 | ~326 | 376,074 |
| 8 | 64 B200 | 16,000 | ~642 | 740,192 |
+-------+---------+------------------+-------+-------------+

Why 2,000 concurrent users but only ~83 req/s? Each request takes about 24 seconds end to end (TTFT plus ~384 output tokens at 46ms each). 2,000 in-flight requests divided by 24 seconds ≈ 83 completions per second.
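Spelled out, this is just Little's law with the numbers from this post (the exact req/s depends on which TTFT you plug in):

# Why 2,000 concurrent users works out to roughly 83 completions per second.
ttft_s = 6.0          # time to first token, roughly (see the latency tables)
output_tokens = 384   # average output length in this benchmark
tpot_s = 0.046        # ~46 ms per output token

request_duration_s = ttft_s + output_tokens * tpot_s   # ~24 s end to end
completions_per_s = 2_000 / request_duration_s         # ~83-85 req/s
print(f"~{request_duration_s:.0f} s per request, ~{completions_per_s:.0f} req/s")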
And like any cloud workload, the infrastructure can scale with demand. Spin up A4 nodes before business hours when claim volumes peak, scale back after. Not fully automated yet (today it’s kubectl scale), but the pattern is the same one cloud teams have used for years with web servers and databases.
Batch processing is where the numbers pay off. Say you need to review 50,000 policies for compliance with a new regulation. Each policy needs a structured extraction: identify defined terms, flag non-standard clauses, check for specific exclusions. At 1M tokens per second, you chew through those 50,000 documents in hours, not weeks. The LLM doesn’t need to be the smartest model on the planet for this — it needs to follow instructions precisely (Qwen 3.5 27B scores 95.0 on IFEval, the instruction-following benchmark) and process at volume.
This is where self-hosted open-source models on dedicated hardware start making economic sense. High-volume, domain-specific work where the model is good enough and the cost-per-token matters.
What makes this work
The B200’s 180 GB of HBM3e does most of the work. ~29 GB for model weights leaves 151 GB for KV cache per GPU. On previous-generation GPUs with 80 GB you’d be choosing between shorter context and fewer concurrent sequences. Here you don’t have to choose.
MTP speculative decoding matters more than anything else I tried. Without it, the 27B model is too small to keep B200s busy — GPU utilization drops to 0% and throughput falls by a third. One draft token at 90% acceptance rate nearly doubles output per decode step.
The node-per-replica architecture keeps scaling simple. Each node is fully independent — no cross-node communication, no shared state. A node failure takes out 1/12th of capacity instead of crashing the whole system. Adding a node adds ~92K tok/s. Removing a node removes ~92K tok/s. I’ll take that predictability over clever multi-node optimization.
And the routing layer turned out to be less important than I expected. ClusterIP round-robin gave 96.5% scaling efficiency at 12 nodes. The Inference Gateway’s KV-cache-aware routing added ~35% overhead from the ext_proc proxy. For uniform workloads, the “dumb” approach won. For production with varied prompt lengths and latency SLOs, the Gateway or projects like llm-d would add value — but the overhead needs to come down first.
The economics
Online APIs like Gemini and Claude are the right choice when you need state-of-the-art quality across a wide range of tasks. They handle the infrastructure, provide SLAs, and keep up with the latest model improvements. For general-purpose use, you’re paying for quality and convenience.
But for high-volume, domain-specific workloads — like the insurance example above — the economics of self-hosting look different. The model doesn’t need to be the best at everything. It needs to be good enough at your specific task and cheap enough to run at scale.
Blended cost per million tokens (input + output at a 2:1 ratio):
+------------------+----------------------------+------------+
| Approach | Model | $/M tokens |
+------------------+----------------------------+------------+
| Self-hosted GKE | 12x A4 nodes, Qwen 3.5 27B | $0.30 |
| Google Cloud API | Gemini 3.1 Flash-Lite | $0.67 |
| Google Cloud API | Gemini 3.1 Pro | $5.33 |
| Anthropic API | Claude Sonnet 4.6 | $7.00 |
| Anthropic API | Claude Opus 4.6 | $11.67 |
+------------------+----------------------------+------------+

Links with official pricing:
- Google Cloud A4 Machine Type, CUD 1 year list pricing;
- Google Gemini 3.1 Pro and Flash-Lite models list pricing;
- Anthropic Sonnet and Opus models list pricing;
The self-hosted cost uses 1-year Compute Resource CUD pricing at $88.93/hr per A4 node (12 nodes x $88.93 = $1,067/hr). I used Spot VMs for this blog’s experiments (much cheaper, fine for benchmarking, extremely tight budget on my end), but production deployments would use CUD or Future Reservations. With Spot VMs ($32.61/hr per node), the cost drops to $391/hr — or less than the psychological $0.10/M tokens threshold.
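The $/M-token math, if you want to rerun it with your own prices:

# Blended cost per million tokens from cluster price and measured throughput.
tokens_per_hour = 1_103_941 * 3600          # ~3.97B tokens/hour at 12 nodes

for label, node_hr in [("1-year CUD", 88.93), ("Spot", 32.61)]:
    cluster_hr = 12 * node_hr
    cost_per_m = cluster_hr / (tokens_per_hour / 1e6)
    print(f"{label}: ${cluster_hr:,.0f}/hr -> ${cost_per_m:.2f} per million tokens")
# ~$0.27/M on CUD (the table shows $0.30) and just under $0.10/M on Spot.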
A few things this table doesn’t capture:
- The online APIs are delivering far more capable models. Gemini 3.1 Pro and Claude Opus 4.6 are frontier models that handle complex reasoning, nuanced language, and harder tasks that Qwen 3.5 27B can’t match. If you need that capability, the API price is justified.
- Self-hosted requires real engineering: cluster management, model serving, monitoring, upgrades. The API handles all of that for you.
- The comparison makes sense only for workloads where the open-source model is good enough — structured extraction, document processing, policy analysis, RAG pipelines at scale. For those use cases, the per-token economics of self-hosted hardware at high utilization are hard to beat.
- And you don’t need to run 12 nodes 24/7. GKE Autopilot scales GPU nodes to zero when there’s no workload. Spin up for a batch job, process millions of documents, tear down. You pay for the hours you use. Scaling is manual today (kubectl scale), but automated scaling based on queue depth or request rate is a natural next step.
What’s left on the table
96K tok/s per node with no custom kernels. A few things could push it higher.
Our benchmarks hit 0% prefix cache. Every prompt was synthetic and unique, meaning prefix caching couldn’t reuse any KV cache entries. Real production workloads (RAG pipelines, chatbots, batch processing) typically share system prompts across requests. With prefix caching enabled (it already is in our config), those shared prefixes skip prefill entirely. Depending on your workload’s prefix overlap, this alone could add meaningful throughput.
GDN kernel optimizations are in progress. Three out of four layers in Qwen 3.5 use Gated Delta Networks, and the vLLM project board for Qwen 3.5 currently has 18 active items including GDN decode kernel optimization and Blackwell-specific GDN prefill kernels. These directly target the compute path that dominates our workload.
FP4 model weights. We’re running FP8 (~29 GB model). NVFP4 would cut that to roughly 14 GB, freeing more HBM for KV cache and potentially enabling larger batch sizes. B200 GPUs have native FP4 tensor core support, and vLLM already supports NVFP4 dispatch. The tradeoff is a small quality reduction that may or may not matter for your use case. On a related note, FP4 KV cache compression is also emerging as a way to further stretch KV cache capacity beyond what FP8 offers.
Disaggregated prefill and decode. Our current setup runs prefill and decode on the same GPUs, which means prefill bursts can interfere with decode latency. Separating them onto dedicated GPU pools reduces TTFT and TPOT variance. vLLM supports this natively via the NIXL connector (--kv-transfer-config), and llm-d adds Kubernetes-native orchestration with Qwen 3.5 support coming in v0.6.0. This is my next project once llm-d has solid Qwen 3.5 and GDN kernel support — the brute-force horizontal scaling in this post works, but disaggregated P/D is the more elegant architecture. See also BentoML's analysis of the throughput and latency tradeoffs.
Fused kernels and newer vLLM releases. vLLM has been rolling out fused kernel passes for Blackwell (AllReduce+RMSNorm+Quantization in one kernel launch, FP8 KV cache fused operations) that typically yield 4–6% end-to-end gains. Each vLLM release brings incremental improvements through FlashInfer upgrades, async scheduling improvements, and torch.compile integration. CUDA graph support for chunked prefill batches is another open optimization gap that could reduce kernel launch overhead.
Inference Gateway on faster compute. We threw 10 CPU at the EPP pod but never looked at what kind of CPU it was running on. GKE Autopilot landed it on whatever node was available. A C4 machine with higher single-thread performance might cut into that 35% overhead — the ext_proc gRPC path is latency-sensitive per-token work, not a parallelizable batch job. Worth testing before concluding the overhead is inherent to the architecture.
The hardware is moving too. NVIDIA’s recent $20B acquisition of Groq and its LPU inference chips says something about where inference hardware is going — specialized silicon for inference workloads alongside general-purpose GPUs.
These are the numbers you get today with vLLM v0.18.0 out of the box, no custom kernels. As Ethan Mollick puts it: “Assume this is the worst AI you will ever use.” The same applies to inference software. vLLM, the kernels, the routing layers — they’re all getting faster. Today’s 96K per node is the floor.
Conclusion
To be quite honest, this was not elegant. 12 nodes, round-robin load balancing, no disaggregated serving, no custom kernels. A brute-force approach: take a single-node setup that works, find the bottlenecks one at a time (wrong parallelism strategy, context window waste, missing FP8 KV cache, client-side saturation limits), fix them, then scale horizontally.
It works. 1,103,941 total tokens per second at 96.5% scaling efficiency. I got there by removing bottlenecks one at a time, not by designing something clever.
The more elegant path exists. llm-d with disaggregated prefill and decode, KV-cache-aware routing that doesn’t add 35% overhead, GDN kernels optimized for Blackwell. But you don’t need to wait for any of that. The tools available today — vLLM v0.18.0, GKE Autopilot, DP=8 with MTP, simple round-robin — get you to 1M if you’re willing to throw 12 nodes at it.
Reproducing this
Everything is in the GitHub repo.
# 1. Set your project and HF token
export PROJECT_ID="your-project-id"
export HF_TOKEN="your-hf-token"

# 2. Create cluster
bash scripts/setup-cluster.sh
# 3. Deploy single replica, verify it works
kubectl apply -f k8s/single-replica-qwen35-27b.yaml
# 4. Deploy benchmark pod, run single-replica benchmark
kubectl apply -f k8s/benchmark-pod.yaml
# 5. Scale out
kubectl apply -f k8s/hyperdisk-ml-readonly.yaml # ReadOnlyMany PVC
kubectl apply -f k8s/multi-replica-qwen35-27b.yaml # Multi-node deployment
# 6. Run parallel benchmark
kubectl cp scripts/parallel-bench.sh vllm-benchmark:/usr/local/bin/parallel-bench.sh
kubectl exec -it vllm-benchmark -- parallel-bench.sh qwen35-server 8000 16 1000
Update: I tried disaggregated prefill/decode
I actually tried disaggregated serving with vLLM’s NixlConnector on Qwen 3.5 27B after publishing. Split prefill (2 GPUs) and decode (6 GPUs) onto dedicated pods on the same A4 node, so prefill bursts don’t interfere with decode latency.
GKE Autopilot won’t let you request fewer than 8 B200 GPUs per pod, so I spun up a GKE Standard cluster with partial GPU allocation. Both pods landed on the same node via pod affinity, and the NVIDIA device plugin handled GPU assignment correctly — no manual CUDA_VISIBLE_DEVICES needed.
The first problem was Qwen 3.5-specific. NixlConnector crashes on hybrid GDN models with All kv cache tensors must have the same number of blocks. The fix: --block-size=64 and --mamba-ssm-cache-dtype=float16, which come from a recent PR adding hybrid SSM-FA support to NixlConnector. With those flags, both pods start, load the model, and serve requests independently.
The second problem is the one that stops everything. vLLM v0.18.0 has no cross-instance scheduler. The prefill pod processes prompts and registers KV cache with NIXL, but kv_transfer_params — the field that tells the decode pod where to pull KV cache from — is always null in the API response. I wrote a proxy that routes prefill first (max_tokens=1) then sends the full request to decode. Tried kv_role=kv_both, kv_ip=0.0.0.0, exposed the ZMQ side channels — none of it produced a single KV transfer. Both pods just serve independently, unaware of each other.
vLLM has the NixlConnector plumbing for RDMA-based KV transfer, but on its own there seems to be nothing to coordinate which instance prefills, which decodes, and how KV cache metadata moves between them. This is what llm-d is for. I truly hope it will support Qwen 3.5 soon 🤞.
Update 2: the missing metrics
A free Saturday morning and a little budget left for the month, enough for up to a four-node test. After the original post a lot of people asked about GPU utilization and latency distributions, and I’d accidentally deleted the raw data from the first round.
This time I ran three collectors in parallel during each benchmark: nvidia-smi dmon on every serving pod (SM%, power, temperature at 1s intervals), vLLM /metrics scraping (KV cache utilization, queue depth, preemptions), and DCGM profiling via Cloud Monitoring (tensor core activity, DRAM bandwidth, NVLink throughput). Same config as before.
Output tokens per second
16000 +-------------------------------------------------------------------+
| |
14000 | * |
| * * * * * |
| *** * * * * * * * * *** * * |
12000 | * * * * * * * * * * * * ** |
| * * * * * ** * |
10000 | * * |
| * * |
8000 | * * |
| * * |
| * * |
6000 | * * |
| * * |
4000 | * * |
| * * |
| * * |
2000 | * * |
| ***** * |
0 +-------------------------------------------------------------------+
      0        5       10       15       20       25       30       35

The numbers hold up
+-------+---------+-------------+----------+------------+
| Nodes | GPUs | Total tok/s | Per-node | Efficiency |
+-------+---------+-------------+----------+------------+
| 1 | 8 B200 | 96,959 | 96,959 | 100% |
| 2 | 16 B200 | 196,548 | 98,274 | 101.4% |
| 4 | 32 B200 | 380,185 | 95,046 | 98.0% |
+-------+---------+-------------+----------+------------+

2-node is slightly above 2x (noise). 4 nodes at 98% matches the original data.
Latency stays flat
+--------------+-----------+-----------+-----------+
| Metric | 1 Node | 2 Nodes | 4 Nodes |
+--------------+-----------+-----------+-----------+
| TTFT mean | 6,285 ms | 6,165 ms | 6,018 ms |
| TTFT median | 6,239 ms | 6,242 ms | 6,094 ms |
| TTFT P99 | 11,581 ms | 11,368 ms | 11,583 ms |
| TPOT mean | 46.8 ms | 46.1 ms | 45.5 ms |
| TPOT median | 46.9 ms | 46.1 ms | 45.4 ms |
| TPOT P99 | 72.9 ms | 72.6 ms | 74.8 ms |
| ITL mean | 95.6 ms | 92.3 ms | 91.9 ms |
| ITL median | 74.4 ms | 73.3 ms | 72.2 ms |
| ITL P99 | 620.2 ms | 531.1 ms | 617.7 ms |
+--------------+-----------+-----------+-----------+

Each node sees the same 2K concurrency regardless of cluster size, so per-request latency doesn’t change. TPOT stays at ~46ms median, ~73ms P99. The interesting number is ITL (inter-token latency) P99 at ~620ms — that’s prefill interference. When a new batch of prompts starts prefilling, in-flight decode requests stall briefly.
The B200 is bored
From an nvidia-smi standpoint, the GPUs look busy: SM (Streaming Multiprocessor) utilization peaks at 100%.
+-------------------------------+-------------+-------------+-------------+
| Metric | 1 Node | 2 Nodes | 4 Nodes |
+-------------------------------+-------------+-------------+-------------+
| SM utilization (peak / mean) | 100% / 60% | 100% / 57% | 100% / 57% |
| Power draw (peak / mean) | 988W / 680W | 977W / 672W | 975W / 661W |
| Temperature (peak) | 73C | 74C | 75C |
| KV cache (peak) | 73% | 73% | 75% |
| Requests running (peak) | 251 | 253 | 256 |
| Requests waiting (peak) | 42 | 50 | 51 |
| Preemptions | 0 | 0 | 0 |
+-------------------------------+-------------+-------------+-------------+

The DCGM profiling is where it gets interesting:
+------------------------------------+-------------+--------------+---------------+
| DCGM Metric | 1 Node | 2 Nodes | 4 Nodes |
+------------------------------------+-------------+--------------+---------------+
| Tensor core activity (peak / mean) | 3.1% / 1.5% | 9.2% / 4.5% | 18.5% / 2.7% |
| SM active (peak / mean) | 13.2% / 6.3%| 28.5% / 14.7%| 77.5% / 11.8% |
| DRAM bandwidth (peak / mean) | 8.2% / 3.9% | 13.6% / 7.4% | 32.2% / 5.5% |
| NVLink TX/RX | 0 | 0 | 0 |
+------------------------------------+-------------+--------------+---------------+

Tensor cores active less than 5% of the time. DRAM bandwidth under 8%. The 27B model finishes each decode step in microseconds and then the hardware sits there waiting for the scheduler to hand it the next batch. MTP-1 keeps throughput high not by using more GPU compute, but by squeezing ~1.9 tokens out of each decode step instead of 1.
I went back one more time with --enable-mfu-metrics to get vLLM’s own FLOPS and bandwidth accounting. Per-GPU averages across the 8 DP engines:
+------------+--------------+--------------+--------------------+-------------+
| MFU metric | Mean | Peak | B200 spec | Utilization |
+------------+--------------+--------------+--------------------+-------------+
| FLOPS | 395 TFLOPS/s | 405 TFLOPS/s | 9,000 TFLOPS (FP8) | 4.4% |
| HBM read | 872 GB/s | 931 GB/s | 8,000 GB/s | 10.9% |
| HBM write | 80 GB/s | 82 GB/s | -- | -- |
+------------+--------------+--------------+--------------------+-------------+

4.4% compute, 10.9% memory bandwidth — lines up with the DCGM numbers two tables up. The 27B model doesn’t give each decode step enough work to stress either one.
Where the wall is
KV cache at 73–75% peak, zero preemptions, power at 66–68% of TDP, temperature nowhere near throttling, bandwidth at 11%. On paper there’s room everywhere.
TPOT won’t move though. Locked at ~46ms median regardless of concurrency or node count. I tried pushing from 2K to 8K concurrency per node during Phase 4c and throughput actually went down by 2%. More requests, more scheduler contention, fewer tokens.
The wall is the decode loop. Each step produces ~1.9 tokens (MTP-1 at 90% acceptance), and the model is so small relative to B200 that more concurrent work doesn’t help. More per-node throughput probably needs faster decode kernels (the GDN optimizations on the vLLM project board), or disaggregated P/D to kill those ITL P99 spikes where prefill bursts stall decode.
Horizontal scaling still works — 98% at 4 nodes, 97% at 8 in the original run — but per-node throughput is what it is. ~96K tok/s per A4 node with this model and this software. The silicon has more to give. The software isn’t there yet.
Prefix caching: not yet for Qwen 3.5
I also tested prefix caching with shared 512-token prefixes. Zero cache hits. Qwen 3.5’s hybrid architecture probably doesn’t support standard KV block reuse.