Show HN: Open-source megakernel that matches M5 Max tok/W at 2x the throughput on an RTX 3090

github.com

6 points by GreenGames 20 days ago · 1 comment

Hey there, we fused all 24 layers of Qwen3.5-0.8B (a hybrid DeltaNet + attention model) into a single CUDA kernel launch and open-sourced it for anyone to try.

On an RTX 3090 power-limited to 220W:

- 411 tok/s vs 229 tok/s on M5 Max (1.8x)
- 1.87 tok/J, beating M5 Max efficiency
- 1.55x faster decode than llama.cpp on the same GPU
- 3.4x faster prefill

The RTX 3090 launched in 2020. Everyone calls it power-hungry. It isn't; the software is. The conventional wisdom goes: NVIDIA is fast but thirsty, Apple Silicon is slow but sips power. Pick a side.

With stock frameworks, the numbers back that up:

Setup | tok/s | Power | tok/J
RTX 3090 (llama.cpp) | 267 | 350W | 0.76
M5 Max (LM Studio) | 229 | ~130W | 1.76

Case closed. Except the 3090 has 936 GB/s of bandwidth and 142 TFLOPS of FP16 compute, and llama.cpp extracts 267 tok/s out of it. That ratio is absurd.
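A rough roofline check makes the point (back-of-the-envelope, assuming decode is bound purely by streaming the weights and ignoring KV-cache and activation traffic): 0.8B params x 2 bytes (BF16) is about 1.6 GB read per token, and 936 GB/s / 1.6 GB is roughly 585 tok/s. 267 tok/s is well under half of that ceiling.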

Traditional inference dispatches one kernel per operation. For 24 layers, that's roughly 100 launches per token. Every boundary means:

- Return control to the CPU
- Dispatch the next kernel
- Re-fetch weights from global memory
- Synchronize threads
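In code, the baseline pattern looks roughly like this (an illustrative CUDA sketch; the kernel names and op breakdown are placeholders, not llama.cpp's actual graph):

    #include <cuda_runtime.h>

    // Hypothetical per-op dispatch: each op is its own kernel, driven by a host loop.
    // Nothing stays resident on-chip between launches; activations and weights
    // bounce through global memory at every boundary.
    __global__ void rmsnorm_kernel(float* x)  { /* normalize in place */ }
    __global__ void qkv_proj_kernel(float* x) { /* read weights from global memory */ }
    __global__ void mixer_kernel(float* x)    { /* DeltaNet or attention */ }
    __global__ void o_proj_kernel(float* x)   { /* read weights again */ }
    __global__ void mlp_kernel(float* x)      { /* and again */ }

    void decode_one_token(float* d_act, int n_layers) {
        for (int layer = 0; layer < n_layers; ++layer) {
            // ~4-5 separate launches per layer -> ~100 launches per token for 24 layers.
            rmsnorm_kernel<<<82, 512>>>(d_act);
            qkv_proj_kernel<<<82, 512>>>(d_act);
            mixer_kernel<<<82, 512>>>(d_act);
            o_proj_kernel<<<82, 512>>>(d_act);
            mlp_kernel<<<82, 512>>>(d_act);
        }
        cudaDeviceSynchronize(); // host/GPU round-trip
    }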

Why hadn't anyone done this yet? Qwen3.5-0.8B isn't a vanilla transformer. It alternates:

- 18 DeltaNet layers: linear attention with a learned recurrence
- 6 full-attention layers: standard MHA

This hybrid pattern is where frontier models are heading: Qwen3-Next, Kimi Linear, all of them. DeltaNet scales linearly with context length instead of quadratically.
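For intuition, the delta-rule recurrence those layers are built on looks roughly like this (a minimal single-head, host-side C++ sketch; the layout and names are illustrative, not the repo's warp-cooperative kernel). The whole history is compressed into a fixed d_k x d_v state matrix, so each new token costs O(d_k * d_v) regardless of how long the context is:

    #include <vector>

    // Conceptual delta-rule step for one token (single head).
    // S is the d_k x d_v recurrent state; q, k have length d_k; v, o have length d_v.
    // Update: S <- S - beta * k (k^T S - v^T), i.e. S <- (I - beta k k^T) S + beta k v^T
    // Readout: o = S^T q
    void deltanet_step(float* S, const float* q, const float* k, const float* v,
                       float beta, float* o, int dk, int dv) {
        // err = k^T S - v^T: the "delta" between what the state predicts and the new value
        std::vector<float> err(dv, 0.0f);
        for (int j = 0; j < dv; ++j) {
            float acc = 0.0f;
            for (int i = 0; i < dk; ++i) acc += k[i] * S[i * dv + j];
            err[j] = acc - v[j];
        }
        // State update: S -= beta * outer(k, err)
        for (int i = 0; i < dk; ++i)
            for (int j = 0; j < dv; ++j)
                S[i * dv + j] -= beta * k[i] * err[j];
        // Readout: o = S^T q
        for (int j = 0; j < dv; ++j) {
            float acc = 0.0f;
            for (int i = 0; i < dk; ++i) acc += q[i] * S[i * dv + j];
            o[j] = acc;
        }
    }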

It's new, and nobody has shipped a fused kernel for it. MLX doesn't have DeltaNet kernels at all. llama.cpp supports it generically. Everyone else is waiting. The 267 tok/s wasn't a hardware ceiling; it was the software ceiling for a brand-new architecture.

We wrote a single CUDA kernel that runs the entire forward pass in one dispatch. Data stays in registers and shared memory as it flows through the network. Zero CPU round-trips, zero redundant memory fetches.

- 82 blocks x 512 threads, all SMs occupied
- BF16 weights and activations, FP32 accumulation
- DeltaNet recurrence runs in warp-cooperative FP32 registers
- Full attention fuses QKV, RoPE, causal softmax, and output projection
- Cooperative grid sync replaces kernel launches between layers
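Structurally, the fused forward pass looks something like the sketch below (heavily simplified; layer_forward and the launch shape are placeholders, not the actual kernel). The key is a cooperative launch, so a grid-wide barrier can stand in for a kernel boundary:

    #include <cuda_bf16.h>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Placeholder for the fused per-layer math: DeltaNet recurrence or
    // QKV + RoPE + softmax + output projection, plus the MLP, with activations
    // kept in registers and shared memory throughout.
    __device__ void layer_forward(int layer, const __nv_bfloat16* weights, float* act_smem) {}

    __global__ void megakernel(const __nv_bfloat16* weights, int n_layers) {
        cg::grid_group grid = cg::this_grid();
        extern __shared__ float act_smem[];           // per-block activation scratch
        for (int layer = 0; layer < n_layers; ++layer) {
            layer_forward(layer, weights, act_smem);  // all ops for this layer, fused
            grid.sync();                              // grid-wide barrier instead of a new launch
        }
    }

    // Host side must use a cooperative launch for grid.sync() to be valid, e.g.:
    //   void* args[] = { &d_weights, &n_layers };
    //   cudaLaunchCooperativeKernel((void*)megakernel, dim3(82), dim3(512), args, smem_bytes);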

Results on the same RTX 3090, same model, same weights:

Setup | Prefill (pp520) | Decode (tg128)
Megakernel | 37,800 tok/s | 413 tok/s
llama.cpp BF16 | 11,247 tok/s | 267 tok/s
PyTorch + HF | 7,578 tok/s | 108 tok/s

Then we turned the power down. Fewer wasted cycles means less heat, so we swept nvidia-smi -pl:

Power limit | Clock | Draw | tok/s | tok/J | Notes
420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline
300W | 1935 MHz | 299W | 432 | 1.44 | -5% power, 99.8% speed
220W | 1635 MHz | 220W | 411 | 1.87 | -30% power, 95% speed
150W | 405 MHz | 150W | 194 | 1.29 | clock cliff, too aggressive

At 220W we hit the sweet spot: 95% of the throughput for 70% of the power. Tighter execution converts almost directly into saved watts. Measurement: NVML energy counters for NVIDIA, powermetrics for Apple Silicon, matching Hazy Research's Intelligence Per Watt methodology. Accelerator power only, not wall draw.
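On the NVIDIA side, that measurement boils down to reading NVML's cumulative energy counter around the benchmark. A minimal sketch, assuming a Volta-or-newer GPU (where nvmlDeviceGetTotalEnergyConsumption is supported) and with error handling trimmed:

    #include <nvml.h>
    #include <stdio.h>

    // Read the GPU's cumulative energy counter in millijoules.
    unsigned long long read_energy_mj(nvmlDevice_t dev) {
        unsigned long long mj = 0;
        nvmlDeviceGetTotalEnergyConsumption(dev, &mj);
        return mj;
    }

    int main(void) {
        nvmlDevice_t dev;
        nvmlInit_v2();
        nvmlDeviceGetHandleByIndex_v2(0, &dev);

        unsigned long long before = read_energy_mj(dev);
        // ... run the decode benchmark here ...
        unsigned long long after = read_energy_mj(dev);

        // tok/J = tokens_generated / joules
        printf("energy used: %.3f J\n", (after - before) / 1000.0);
        nvmlShutdown();
        return 0;
    }
    // Build (assumption): g++ energy.cc -lnvidia-ml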

Without the megakernel the 3090 barely edges out a laptop chip. With it, a five-year-old GPU beats Apple's latest on throughput, matches it on efficiency, and costs a quarter as much. The NVIDIA vs Apple efficiency gap isn't silicon. It's software.

Try it:

git clone https://github.com/Luce-Org/luce-megakernel.git
cd luce-megakernel
pip install -e .
python bench_pp_tg.py

Requires: NVIDIA Ampere+ (tested on 3090), CUDA 12+, PyTorch 2.0+, ~1.5GB VRAM.

Code is open source (MIT): https://github.com/Luce-Org/luce-megakernel

Let us know if you have any feedback

emanuele-em 20 days ago

Really cool to see someone actually prove that the NVIDIA vs Apple efficiency gap is mostly a software problem. A 2020 GPU matching M5 Max tok/J at 1.8x the throughput just by fusing all 24 layers into one persistent kernel is a strong result. The DVFS sweep losing only 5% between 420W and 220W is surprising. Have you looked at what this would take on Hopper with TMA?
