I Ran a 70B Model on a $2,000 Gaming GPU. Here’s the Code.

NVIDIA claims 2x faster inference. We tested it on a $2,000 gaming GPU. Here’s what actually happened.

Delanoe Pirard

Distilling 70 billion parameters into 4 bits. The old art, applied to new weights.

TL;DR

  • Memory reduction: NVFP4 compresses weights to ~4.5 effective bits (4-bit values plus per-block FP8 scales), about 3.5x less VRAM than FP16 (NVIDIA Blog, June 2025)
  • Practical sweet spot: 32B in NVFP4 on 1x RTX 5090 (32 GB) is comfortable. 70B needs 2x RTX 5090 or CPU offloading: the weights alone take ~39 GB, with KV cache and runtime overhead on top (see the VRAM sketch after this list).
  • Performance: NVIDIA’s “2x vs FP8” claim is generation-over-generation (RTX 50 vs RTX 40), not FP4 vs FP8 on the same GPU. On a single RTX 5090, the FP4 vs FP8 gain is ~1.6x (Signal65)
  • Accuracy: The “<1% degradation” claim holds on DeepSeek-R1-0528 with QAD (quantization-aware distillation). Without distillation, and on small models (<7B), degradation can reach 2–8% (arXiv:2509.23202)
  • Hardware required: Blackwell only (RTX 5070, 5080, 5090). RTX 40xx cards can only emulate NVFP4 in software, with no hardware acceleration, which makes it pointless in practice.
  • If you don’t have Blackwell: this guide has a section for you. Short answer: GGUF Q4_K_M via llama.cpp (a quick sketch follows below).
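
The memory numbers above follow directly from the bit math. Here is a minimal back-of-the-envelope estimator, assuming ~4.5 effective bits per weight for NVFP4; the KV-cache and overhead allowances are rough placeholders I chose for illustration, not measured values:

```python
def estimate_vram_gb(params_b: float, effective_bits: float = 4.5,
                     kv_cache_gb: float = 4.0, overhead_gb: float = 2.0) -> float:
    """Rough total-VRAM estimate for an NVFP4-quantized model.

    params_b       -- parameter count in billions
    effective_bits -- ~4.5 for NVFP4 (4-bit values + per-block FP8 scales)
    kv_cache_gb    -- ballpark KV-cache allowance (assumption, workload-dependent)
    overhead_gb    -- ballpark runtime/activation overhead (assumption)
    """
    weights_gb = params_b * effective_bits / 8  # billions of params * bits -> GB
    return weights_gb + kv_cache_gb + overhead_gb

print(f"70B weights alone: {70 * 4.5 / 8:.1f} GB")          # ~39.4 GB
print(f"70B total estimate: {estimate_vram_gb(70):.1f} GB")  # over one 32 GB card
print(f"32B total estimate: {estimate_vram_gb(32):.1f} GB")  # fits one RTX 5090
```

Under these assumptions a 70B model lands around 45 GB total, which is why a single 32 GB RTX 5090 can’t hold it without a second card or CPU offloading.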
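For the no-Blackwell path, here is a minimal sketch using the llama-cpp-python bindings to run a Q4_K_M GGUF; the model path is a placeholder for whatever GGUF file you download:

```python
# pip install llama-cpp-python  (build with CUDA support for GPU offload)
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",  # placeholder: any Q4_K_M GGUF file
    n_gpu_layers=-1,                   # offload every layer that fits on the GPU
    n_ctx=8192,                        # context window
)

out = llm("Explain NVFP4 quantization in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```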