When we started working with NVIDIA's GH200, we did what most people do—treated it like an H100 with more RAM. That gets you running, but it leaves a lot on the table.
The GH200 isn't a GPU with a big memory pool. It's a tightly coupled CPU–GPU system: 99 GB of HBM3, 452 GB of LPDDR5X, connected by a 900 GB/s coherent link, with 72 ARM Neoverse V2 cores sitting right next to the accelerator. That architecture enables inference patterns that simply aren't possible on discrete GPUs over PCIe.
We built ionattention to find out what those patterns are. It's a C++ inference runtime designed specifically around Grace Hopper's memory hierarchy and coherence model. This post covers the three techniques we think are most interesting—and the benchmark results that came out of them.
Results
We benchmarked on Qwen2.5-7B-Instruct and Qwen3-VL-8B, single GH200, no tensor parallelism.
Together AI reported 298 tok/s on the same workload—measured with their own benchmark tooling. Running the same tool against ionattention on a single GH200, we measured 588 tok/s on the multimodal pipeline; the 7B text model reached 7,167 tok/s. p50 latency rose from 0.74s to 1.46s, and we're actively investigating ways to bring it down.
None of these are tricks you can bolt onto an existing engine. They come from designing around what GH200 actually is.
1. Coherent Memory Makes CUDA Graphs Dynamic
CUDA graphs are a huge win for decode—capture the forward pass once, replay it, skip hundreds of kernel launches. The problem is that graphs are static: parameters like sequence length get baked in at capture time.
The standard fix is to either re-capture graphs (slow) or patch kernel nodes before each replay (tedious, per-node overhead). Both cost you something every step.
Grace Hopper has a third option. NVLink-C2C maintains hardware coherence between CPU and GPU caches. If you place runtime state in the right memory region, the CPU can update it between graph replays and the GPU picks up the new values during execution. No copy, no patch, no re-capture.
This only works on hardware with cache-coherent CPU–GPU interconnects. On PCIe, unified memory accesses trigger page faults; the consistency model is different. On GH200, it just works—and it makes CUDA graphs behave as if they had dynamic parameters at zero per-step cost.
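To make the pattern concrete, here's a minimal host-side sketch. The struct and function names are ours, and the coherent allocation itself (e.g. `cudaMallocManaged`, or plain system memory, which NVLink-C2C keeps coherent on GH200) is elided; this models only the update-between-replays logic, not ionattention's actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Parameter block read by kernels captured in the CUDA graph. On GH200
// this struct would live in coherent memory, so a CPU store between
// replays is visible to the GPU with no copy, patch, or re-capture.
struct DecodeParams {
    int32_t seq_len;     // current sequence length for this step
    int32_t batch_size;  // active requests this step
};

// CPU side: update state between graph replays. No
// cudaGraphExecKernelNodeSetParams, just a store to coherent memory.
inline void set_step_params(DecodeParams* p, int32_t seq_len, int32_t batch) {
    p->seq_len = seq_len;
    p->batch_size = batch;
}

// Stand-in for the captured kernel body: it dereferences the pointer at
// execution time, so every replay sees the latest values.
inline int32_t kernel_reads_seq_len(const DecodeParams* p) {
    return p->seq_len;
}
```

The key point is that the graph captures a *pointer*, not a value; on coherent hardware the pointed-to data can change freely between replays.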
In our benchmarks, this delivered 10–20% lower decode latency on variable-length workloads compared to the node-patching approach. We haven't seen this technique published elsewhere.
2. Eager Writeback of Immutable KV Blocks
When KV cache fills up, you need to evict—push blocks back to CPU memory. The naive way is reactive: wait until you're out of space, then copy a big chunk. Even on GH200's fast link, that's a stall in the tens of milliseconds, right in your decode loop.
The insight we exploited: filled KV blocks are immutable. Once a block has all its token positions written, nothing will modify it again. That means you can copy it to CPU memory in the background, on a spare stream, without coordinating with compute.
We use this property to drive a background mirroring policy: blocks that are finished and haven't been backed up yet get streamed to LPDDR5X, overlapped with GPU work. By the time eviction actually triggers, most blocks are already mirrored. Eviction becomes a metadata update instead of a blocking transfer.
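A sketch of the lifecycle-driven policy. Names, block capacity, and the state machine are illustrative, and the actual copy to LPDDR5X (issued on a spare stream in the real runtime) is elided:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Lifecycle of a KV block. A Full block is immutable, so it can be
// mirrored to CPU memory in the background without synchronizing
// with compute.
enum class BlockState { Filling, Full, Mirrored };

struct KvBlock {
    BlockState state = BlockState::Filling;
    int tokens_written = 0;
    int capacity = 16;  // token positions per block
};

struct MirrorPolicy {
    std::queue<int> pending;  // blocks waiting for a background copy

    // Called when a token position is appended to a block.
    void on_write(std::vector<KvBlock>& blocks, int id) {
        KvBlock& b = blocks[id];
        if (++b.tokens_written == b.capacity) {
            b.state = BlockState::Full;  // immutable from here on
            pending.push(id);            // schedule background mirror
        }
    }

    // Background stream: mirror one Full block (the copy itself is elided).
    bool mirror_one(std::vector<KvBlock>& blocks) {
        if (pending.empty()) return false;
        blocks[pending.front()].state = BlockState::Mirrored;
        pending.pop();
        return true;
    }

    // Eviction of a Mirrored block is a metadata update, not a transfer.
    bool evict_is_free(const std::vector<KvBlock>& blocks, int id) const {
        return blocks[id].state == BlockState::Mirrored;
    }
};
```

Immutability is what makes the background copy safe: no lock, no fence against compute, just a completeness check when the block fills.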
Blocking eviction dropped from 10ms+ to under 0.25ms—over 40× faster. We also run eviction and restoration concurrently on separate streams (NVLink-C2C is bidirectional), which gives another ~1.2× on swap operations.
vLLM's team has identified eager eviction as future work in their KV offloading RFC but hasn't shipped it yet. The lifecycle-driven approach—using immutability as the trigger rather than memory pressure—is, as far as we know, new.
3. Phantom-Tile Scheduling for Small-Batch Attention
Split-K attention—dividing the KV sequence into tiles processed in parallel—works great at large batch sizes. At small batch sizes, the GPU is mostly idle. Depending on your batch and head configuration, it's common to launch far fewer programs than the chip has SMs. Attention becomes the bottleneck purely because the chip is underoccupied.
Our fix is counterintuitive: at small batch sizes, we deliberately over-provision the grid beyond what the actual KV length requires. The extra tiles hit a bounds check and exit immediately. They exist solely to fill SMs.
It sounds wasteful. Per tile, it is. But the cost of a quick early-exit is negligible compared to leaving most of the chip dark. The GPU's warp scheduler handles the throwaway work efficiently.
At larger batch sizes where occupancy is already healthy, we fall back to a conventional tiling strategy.
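A sketch of the grid-sizing arithmetic, assuming a hypothetical 132-SM part (H100-class) and illustrative names; the real kernel's launch logic is more involved:

```cpp
#include <algorithm>
#include <cassert>

// Number of split-K tiles actually needed to cover the KV sequence.
inline int needed_tiles(int kv_len, int tile_size) {
    return (kv_len + tile_size - 1) / tile_size;
}

// Phantom-tile grid: at small batch sizes, pad the per-(batch, head)
// tile count so batch * heads * tiles covers the SMs. Extra tiles fail
// the bounds check below and exit immediately; they exist to fill SMs.
inline int phantom_tiles(int batch, int heads, int kv_len, int tile_size,
                         int num_sms) {
    int need = needed_tiles(kv_len, tile_size);
    int programs = batch * heads * need;
    if (programs >= num_sms) return need;  // occupancy already healthy
    int per_bh = (num_sms + batch * heads - 1) / (batch * heads);
    return std::max(need, per_bh);
}

// The bounds check each tile performs; phantom tiles return immediately.
inline bool tile_has_work(int tile_idx, int kv_len, int tile_size) {
    return tile_idx * tile_size < kv_len;
}
```

With batch 1, 8 heads, 1,000 KV tokens, and 128-token tiles, only 64 programs are needed; the padded grid launches 17 tiles per (batch, head) pair so the full chip is covered, and tiles past the real KV length exit at the bounds check.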
This cut attention time by over 60% in the worst-affected regimes and shifted the bottleneck from attention to MLP, which is where you want it. End-to-end, it contributed 10–20% higher throughput at high concurrency.
Supporting Work
The three techniques above are the ones we think are new. The rest applies known ideas that happen to be well suited to GH200:
GPU sampling. We moved temperature + top-p sampling entirely to GPU using a sorting-free approach (similar in spirit to FlashInfer's recent work, with a different algorithm for the top-p cutoff). This was the single largest factor in the VLM jump from 82 to 588 tok/s—sampling had been adding 37–50ms per step on CPU, replaced by ~150 microseconds on GPU.
Three-stream pipelining. Compute, prefetch, and writeback on coordinated streams. Standard HPC technique, but GH200's link bandwidth makes it effective for layer-by-layer weight staging between memory tiers.
ARM co-processing. Speculative draft model on the 72 ARM cores, vision encoder caching on CPU. Contributed an incremental 10–20% throughput on top of GPU-side gains.
Multi-model serving. Shared workspace and KV pools across models, NVLink-C2C weight migration on switch. Under 750ms to swap a 7B model, with CUDA graphs surviving the switch.
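The post doesn't spell out our top-p cutoff, but one sorting-free scheme is to binary-search a probability threshold: each probe is a single reduction over the distribution, which parallelizes trivially on GPU, and no sort is ever materialized. The sketch below is a CPU model of that idea, not ionattention's kernel:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sorting-free top-p cutoff: binary-search a threshold t such that the
// total mass of tokens with prob >= t still covers top_p. Each probe is
// one pass over the distribution (a parallel reduction on GPU).
inline double top_p_threshold(const std::vector<double>& probs, double top_p,
                              int iters = 40) {
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < iters; ++i) {
        double mid = 0.5 * (lo + hi);
        double mass = 0.0;
        for (double p : probs)
            if (p >= mid) mass += p;
        if (mass >= top_p) lo = mid;  // can afford a higher cutoff
        else hi = mid;                // cutoff too aggressive, back off
    }
    return lo;  // sample only from tokens with prob >= lo
}
```

For the distribution {0.5, 0.3, 0.1, 0.1} with top_p = 0.7, the search converges to a threshold of about 0.3, keeping exactly the two-token nucleus whose mass (0.8) first covers 0.7.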
The Broader Point
Most inference stacks treat GH200 as a compatibility target—make sure the code runs, maybe use CPU memory as overflow. That misses what makes the hardware interesting.
Grace Hopper's coherent memory fabric, bidirectional high-bandwidth link, and serious ARM CPU enable optimization patterns that don't exist on discrete GPUs. Managed scalars as graph parameters. Background KV streaming driven by data lifecycle. CPU-side draft models running concurrently with GPU decode.
These aren't portable tricks. They're architecture-specific bets, and they paid off: 7,167 tok/s on a 7B model, 588 tok/s on a multimodal pipeline, single chip, no tensor parallelism.
If you're running fine-tuned or open-source models of any kind, come talk to us. We'll serve them for less.