Nvidia’s Nemotron 3 Super is a Bigger Deal Than You Think | SignalBloom AI posts


March 14th, 2026 • Max Trivedi

Nvidia released Nemotron 3 Super (120B-A12B) earlier this week to a decent but somewhat muted response. It is easy to see why: most of the community is benchmark-pilled, and the model trails comparable releases like Alibaba's Qwen3.5-122B-A10B, which has already been out for a few weeks.

We argue that this model is a big deal for reasons entirely unrelated to its benchmark performance.

(Almost) Fully open source

This is an almost fully open-source model, and by a wide margin: the weights are open, much of the training data is open (10 trillion pre-training tokens, 40 million post-training supervised and alignment samples, 21 RL environment configurations, and 10 of the 37 datasets it was trained on), and the full evaluation recipes are published.

This drops the cost of training fully custom mid-tier models to well within the budgets of startups.

Caveat: Nvidia's so-called "Rug-Pull" license allows Nvidia to unilaterally update terms, includes aggressive patent retaliation triggers, and provides asymmetric legal indemnity.

Cheap Agentic Workhorse


The benchmark degradation from the 16-bit version to the 4-bit (NVFP4) version is practically zero. Across general knowledge (MMLU) and long-context retrieval (RULER), the 4-bit model retains ~99.8% of the base model's accuracy. In subjective logic evaluations like Arena-Hard-V2, the 4-bit version actually scored slightly higher.

This makes it an ideal "workhorse" candidate for agentic workloads. For most agentic use cases, you want a model that is capable (the capability floor matters more than the ceiling), reliable, as fast as possible, and cheap, regardless of how it scores on the so-called 'hard' benchmarks.

Early real-world benchmarks show the 4-bit Nemotron 3 Super maintaining ~62 tokens per second at a massive 512,000-token context window on a single workstation GPU (an RTX Pro 6000). The speed drop from 1K context to 512K context is only 11%. API providers hosting the model are offering 400+ TPS, which is AI-accelerator-tier speed.

If we assume that the majority of single-agent runs are not extremely complex, agentic systems can divert a meaningful share of their inference load away from frontier labs entirely, for pennies on the dollar.

Cool Architectural Features

(Figure: architecture overview, taken from the Nemotron 3 technical report.)

LatentMoE

Standard Mixture of Experts (MoE) architectures rapidly hit a memory-bandwidth wall because tokens are routed and processed in their full hidden dimension d. Nemotron 3 Super circumvents this hardware bottleneck using LatentMoE (Elango et al., Jan 2026), a first-of-its-kind approach that projects tokens from the model dimension (d = 4096) into a highly compressed latent space (d_latent = 1024) before any routing or expert computation occurs. This reduces both the routed parameter loads and the all-to-all communication traffic across GPUs by a factor of d/d_latent (4x here). Using these bandwidth savings, the model scales to a large 512 experts with an aggressive top-22 routing strategy without increasing inference-time cost.
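The idea can be sketched in a few lines: project down, route and run experts entirely in latent space, then project back up. Dimensions below are shrunk for illustration (the real model uses d = 4096, d_latent = 1024, 512 experts, top-22 routing), and the expert internals are hypothetical, not Nvidia's implementation.

```python
import numpy as np

# Minimal LatentMoE sketch: all routing and expert math happens in the
# compressed latent space, cutting bandwidth by D_MODEL / D_LATENT.
D_MODEL, D_LATENT = 64, 16          # real model: 4096, 1024 (factor 4)
N_EXPERTS, TOP_K = 32, 4            # real model: 512 experts, top-22

rng = np.random.default_rng(0)
W_down = rng.standard_normal((D_MODEL, D_LATENT)) / np.sqrt(D_MODEL)
W_up = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)
W_router = rng.standard_normal((D_LATENT, N_EXPERTS)) / np.sqrt(D_LATENT)
experts = rng.standard_normal((N_EXPERTS, D_LATENT, D_LATENT)) / np.sqrt(D_LATENT)

def latent_moe(x):
    """x: (D_MODEL,) token activation. Experts never see the full dimension."""
    z = x @ W_down                              # compress before routing
    logits = z @ W_router
    top = np.argsort(logits)[-TOP_K:]           # pick the TOP_K experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax over selected experts
    out = sum(g * (z @ experts[e]) for g, e in zip(gates, top))
    return out @ W_up                           # expand back to model dim

y = latent_moe(rng.standard_normal(D_MODEL))
```

Note how both the router weights and the expert weights are sized by D_LATENT, not D_MODEL; that is where the parameter-load and communication savings come from.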

Mamba-2 / Attention Hybrid

Maintaining huge context windows is increasingly hard for most models because the KV cache grows linearly with context length (and attention compute quadratically), eventually consuming more memory than the model weights themselves. Nemotron 3 Super uses an 88-layer hybrid backbone built heavily on Mamba-2 (Dao & Gu, May 2024). The linear-time, constant-memory Mamba-2 blocks handle the bulk of sequence processing, and sparse self-attention layers are periodically interleaved as "global anchors." This interleaving preserves the exact associative recall and long-range information routing that pure state space models typically struggle with, while eliminating the majority of the KV-cache memory overhead.

(Side note: state space models are an extremely cool concept; read up on them if you don't already know about them.)
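A back-of-the-envelope calculation shows why the hybrid matters at 512K context. The KV head count, head dimension, and 1-in-8 attention ratio below are assumptions for illustration, not figures from the report; the Mamba-2 layers add only a constant-size state on top.

```python
# KV-cache memory for a hybrid backbone vs. a hypothetical all-attention
# stack of the same depth. All hardware-level numbers are assumptions.
def kv_cache_bytes(n_attn_layers, context, n_kv_heads=8, head_dim=128, bytes_per=2):
    # Each attention layer stores a K and a V vector for every token in context.
    return 2 * n_attn_layers * context * n_kv_heads * head_dim * bytes_per

LAYERS, CONTEXT = 88, 512_000
full_attn = kv_cache_bytes(LAYERS, CONTEXT)      # if all 88 layers attended
hybrid = kv_cache_bytes(LAYERS // 8, CONTEXT)    # assume 1-in-8 attention layers
print(f"full attention: {full_attn / 1e9:.1f} GB, hybrid: {hybrid / 1e9:.1f} GB")
```

Under these assumptions the all-attention cache would not fit on any single GPU, while the hybrid's cache shrinks by the interleaving ratio.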

Native NVFP4 Pretraining

Nemotron 3 Super was pre-trained natively in 4-bit precision on 10 trillion curated tokens. The model pioneers true hardware-software co-design by wrapping its 4-bit numbers (E2M1) in a two-tier scaling system: every 16 numbers share a local scale (E4M3) to preserve intricate details, while a high-precision global scale (FP32) maintains the integrity of the overall system. To maintain FP8-level training stability and prevent gradient underflow at this scale, the pipeline applies Random Hadamard Transforms (RHTs) to inputs for weight gradients (wgrad) to disperse magnitude outliers, alongside stochastic rounding for unbiased gradient estimation.
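The two-tier scaling scheme is easy to see in a quantize/dequantize round trip. The sketch below is a simplification for intuition, not Nvidia's kernel: the per-block E4M3 scale is modeled as a plain float, and rounding is nearest-value rather than stochastic.

```python
import numpy as np

# NVFP4-style two-tier scaling: 4-bit E2M1 values, one shared scale per
# 16-value block (E4M3 in hardware), and one global FP32 scale.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_roundtrip(x, block=16):
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    global_scale = np.abs(x).max()                       # FP32 tier
    xn = x / global_scale
    # E4M3 tier: one scale per 16 numbers, mapping each block max onto 6.0
    block_scale = np.abs(xn).max(axis=1, keepdims=True) / 6.0
    block_scale = np.where(block_scale == 0.0, 1.0, block_scale)
    units = np.abs(xn) / block_scale                     # magnitudes in [0, 6]
    nearest = np.abs(units[..., None] - E2M1_MAGNITUDES).argmin(axis=-1)
    q = np.sign(xn) * E2M1_MAGNITUDES[nearest]           # the stored 4-bit value
    return (q * block_scale * global_scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
x_hat = nvfp4_roundtrip(x)
```

The local scales are what preserve "intricate details": a block of small values gets its own fine-grained step size instead of being crushed by the tensor-wide maximum.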

Multi-Token Prediction (MTP)

The model is trained with an MTP objective to predict multiple future tokens in a single forward pass. While independent MTP heads traditionally suffer from distribution shifts during inference, Nemotron 3 Super utilizes two MTP layers with shared parameters. This shared-weight formulation exposes a unified prediction head to multiple offsets, heavily regularizing it and improving robustness to autoregressive drafting. This acts as highly stable "native speculative decoding," accelerating generation speeds for long-horizon agentic workflows without the architectural bloat or latency of an external draft model.
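A toy version of the shared-weight objective makes the regularization concrete: one head, W_head, is trained against every offset, instead of one independent head per offset. All names, shapes, and the mean-NLL formulation below are illustrative, not taken from the Nemotron report.

```python
import numpy as np

# Shared-parameter multi-token-prediction loss: the same output head is
# reused to predict tokens at offsets t+1 and t+2.
rng = np.random.default_rng(0)
VOCAB, HIDDEN, MAX_OFFSET = 100, 32, 2

W_head = 0.02 * rng.standard_normal((HIDDEN, VOCAB))   # shared across offsets

def mtp_loss(hidden_states, tokens):
    """hidden_states: (T, HIDDEN); tokens: (T,) ids. Mean NLL over all offsets."""
    total, count = 0.0, 0
    for k in range(1, MAX_OFFSET + 1):
        logits = hidden_states[:-k] @ W_head           # same head, offset k
        logits -= logits.max(axis=1, keepdims=True)    # numerically stable
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -logp[np.arange(len(logp)), tokens[k:]].sum()
        count += len(logp)
    return total / count

T = 16
loss = mtp_loss(rng.standard_normal((T, HIDDEN)), rng.integers(0, VOCAB, size=T))
```

Because the single head must score well at multiple horizons, it cannot overfit to the exact teacher-forced distribution of offset +1, which is what makes its drafts hold up under autoregressive drift at inference time.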

Nvidia Strategy

One of the following is true.

Either Nvidia truly cares about open source and making the world a better place.

OR

By commoditizing state-of-the-art training infrastructure and highly optimized 4-bit models and lowering the barriers to entry, Nvidia is creating demand and fracturing the software moats of traditional model builders.

See "commoditize your complement" by Joel Spolsky.