NVIDIA claims 2x faster inference. We tested it on a $2,000 gaming GPU. Here’s what actually happened.
TL;DR
- Memory reduction: NVFP4 compresses weights to ~4.5 effective bits, which is 3.5x less VRAM than FP16 (NVIDIA Blog, June 2025)
- Practical sweet spot: 32B in NVFP4 on 1x RTX 5090 (32 GB) is comfortable. 70B needs 2x RTX 5090 or CPU offloading: the weights alone occupy ~39 GB, with KV cache and overhead on top.
- Performance: NVIDIA’s “2x vs FP8” claim is generation-over-generation (RTX 50 vs RTX 40), not FP4 vs FP8 on the same GPU. On a single RTX 5090, the FP4 vs FP8 gain is ~1.6x (Signal65)
- Accuracy: The “<1% degradation” holds on DeepSeek-R1–0528 with QAD. Without distillation, and on small models (<7B), degradation can reach 2–8% (arXiv:2509.23202)
- Hardware required: Blackwell only (RTX 5070, 5080, 5090). RTX 40xx cards can only emulate NVFP4 in software, with no hardware acceleration, which makes it pointless in practice.
- If you don’t have Blackwell: this guide has a section for you. Short answer: GGUF Q4_K_M via llama.cpp.
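The VRAM figures above follow directly from the effective bit-width. A quick back-of-envelope sketch (a rough estimate, not a measurement: decimal GB, weights only, no KV cache or runtime overhead):

```python
def weight_footprint_gb(params_billions: float, effective_bits: float) -> float:
    """Approximate weight footprint in decimal GB (1 GB = 1e9 bytes).

    params_billions * 1e9 parameters, each stored at `effective_bits` bits,
    divided by 8 bits/byte and 1e9 bytes/GB.
    """
    return params_billions * effective_bits / 8

# NVFP4 stores weights at ~4.5 effective bits (4-bit values + scaling metadata)
print(weight_footprint_gb(70, 4.5))   # ~39 GB: too big for one 32 GB RTX 5090
print(weight_footprint_gb(32, 4.5))   # ~18 GB: fits comfortably on 32 GB
print(16 / 4.5)                        # ~3.5x compression vs FP16
```

The same arithmetic explains the 3.5x figure: 16 bits / 4.5 effective bits ≈ 3.56.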