DiffusionGemma

DiffusionGemma abandons the sequential, token-by-token process of typical autoregressive Large Language Models.

Built on Gemma 4 and Gemini Diffusion research, it prioritizes unprecedented speed and parallel layout generation, unlocking novel workflows for developers building real-time interactive AI applications.

Blazing fast inference

By shifting the decode bottleneck from memory bandwidth to raw compute, DiffusionGemma generates tokens up to 4x faster, exceeding 1,000 tokens per second on a single NVIDIA H100 GPU.
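To see why parallel decoding moves the bottleneck, here is a back-of-the-envelope sketch in Python. It assumes each forward pass has to stream the model weights from GPU memory once, so the bandwidth ceiling is simply passes per second times tokens produced per pass. The function name, the 16-bit weight size, the roughly 3.35 TB/s bandwidth figure, and the `passes_per_block` parameter are illustrative assumptions, not published DiffusionGemma numbers.

```python
def bandwidth_bound_tokens_per_s(weight_bytes_per_pass,
                                 mem_bandwidth_bytes_per_s,
                                 tokens_per_pass=1,
                                 passes_per_block=1):
    """Upper bound on decode speed when memory bandwidth is the only limit.

    weight_bytes_per_pass: bytes of weights streamed per forward pass (assumed).
    passes_per_block: hypothetical number of refinement passes spent on one block.
    """
    passes_per_s = mem_bandwidth_bytes_per_s / weight_bytes_per_pass
    return passes_per_s * tokens_per_pass / passes_per_block


BANDWIDTH = 3.35e12           # assumed memory bandwidth in bytes/s
ACTIVE_BYTES = 3.8e9 * 2      # 3.8B active parameters at an assumed 2 bytes each

# One token per pass: the ceiling is set by how fast weights can be streamed.
one_at_a_time = bandwidth_bound_tokens_per_s(ACTIVE_BYTES, BANDWIDTH, tokens_per_pass=1)

# 256 tokens per pass: the same weight traffic is amortized over the whole block.
in_parallel = bandwidth_bound_tokens_per_s(ACTIVE_BYTES, BANDWIDTH, tokens_per_pass=256)

print(f"bandwidth ceiling, 1 token per pass:    {one_at_a_time:,.0f} tokens/s")
print(f"bandwidth ceiling, 256 tokens per pass: {in_parallel:,.0f} tokens/s")
```

Under these assumptions the one-token ceiling is a few hundred tokens per second, while the parallel ceiling is far above anything the GPU can compute, which is what "compute-bound" means here. In practice a 256-token block routes to more experts than a single token does and needs several refinement passes, so the real ceiling is lower than this sketch suggests.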

Accessible hardware footprint

DiffusionGemma is a Mixture of Experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. When quantized, it fits comfortably within 24GB of VRAM, so it runs on a consumer NVIDIA RTX 4090 or RTX 5090.
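The arithmetic behind the quantization requirement is easy to check. The sketch below counts weight storage only (it ignores activations, caches, and runtime overhead) and assumes every one of the 26B parameters must be resident in VRAM even though only 3.8B are active per token. The function name is hypothetical.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Weight storage in gigabytes (10**9 bytes), excluding any runtime overhead."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9   # 26B total parameters, from the text above
VRAM_GB = 24          # the VRAM budget quoted above

for bits in (16, 8, 4):
    size = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:.0f} GB ({verdict} in {VRAM_GB} GB)")
```

Only the 4-bit row lands under the 24GB budget with room to spare, which is consistent with the "when quantized" qualifier.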

Bi-directional attention

Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing and code infilling.
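The difference from an autoregressive model comes down to the attention mask. As a minimal sketch in plain Python (the helper names are hypothetical, and a real model would build these as tensors), a causal mask lets position i see only positions up to i, while a bi-directional mask lets every position see the whole block:

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i only)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


n = 5
hole, suffix = 1, 3   # filling position 1 while position 3 already holds code

print("causal: hole sees suffix?       ", causal_mask(n)[hole][suffix])
print("bidirectional: hole sees suffix?", bidirectional_mask(n)[hole][suffix])
```

For infilling, the token being written at the hole has to agree with the text that follows it. Under the causal mask that later text is invisible; under the bi-directional mask it is available on every pass.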

Intelligent self-correction

The model iteratively refines its own output. Because it evaluates the entire text block at once, it can close complex formatting correctly and fix mistakes in real time.
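A toy sketch can make the refinement loop concrete. This is not DiffusionGemma's actual sampler: `toy_denoiser` is a hypothetical stand-in for the model that proposes a token and a confidence for every position, and `refine` is one common confidence-based unmasking schedule, assumed here for illustration. The toy task is closing nested brackets, where a closer can only be chosen correctly once its opener is visible.

```python
MASK = "_"
OPENERS = "{[("
PAIRS = {"{": "}", "[": "]", "(": ")"}


def toy_denoiser(tokens):
    """Hypothetical model stand-in: returns (token, confidence) for every position."""
    n = len(tokens)
    proposals = []
    for i in range(n):
        if i < n // 2:
            proposals.append((OPENERS[i], 0.9))        # openers are easy
        else:
            opener = tokens[n - 1 - i]                 # matching opener, if revealed
            if opener == MASK:
                proposals.append((")", 0.3))           # blind, low-confidence guess
            else:
                proposals.append((PAIRS[opener], 0.8))
    return proposals


def refine(denoise, length, steps):
    """Re-predict the whole block each step, keeping only the most confident tokens."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposals = denoise(tokens)
        keep = length * step // steps                  # commit more positions each step
        ranked = sorted(range(length), key=lambda i: proposals[i][1], reverse=True)
        kept = set(ranked[:keep])
        tokens = [proposals[i][0] if i in kept else MASK for i in range(length)]
    return "".join(tokens)


print("1 pass:  ", refine(toy_denoiser, 6, steps=1))
print("3 passes:", refine(toy_denoiser, 6, steps=3))
```

With a single pass the low-confidence guesses are committed and the brackets come out mismatched. With three passes the guesses are discarded, the openers are committed first, and the closers are re-predicted once their openers are visible.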

Next-gen compute with NVFP4

Native support for NVIDIA's new NVFP4 (4-bit floating-point) format on Blackwell GPUs dramatically accelerates compute throughput, allowing the model to run faster with near-lossless accuracy.
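For intuition about what a 4-bit floating-point format can represent, here is a simplified sketch of block-scaled quantization. It assumes the layout NVIDIA describes for NVFP4: 4-bit E2M1 values sharing one scale per block of 16 elements. As a simplification, the scale is kept as an ordinary Python float rather than an 8-bit value, values are stored as floats rather than packed 4-bit codes, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of elements sharing one scale


def quantize_block(values):
    """Map a block of floats to (scale, signed grid values) by nearest rounding."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-magnitude if v < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the shared scale and the grid values."""
    return [c * scale for c in codes]


weights = [i / 10 - 0.8 for i in range(BLOCK_SIZE)]   # toy block of 16 weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, worst absolute error={worst:.4f}")
```

Because the scale is chosen per small block rather than per tensor, a single outlier only costs precision for its 15 neighbors, which is one reason block-scaled 4-bit formats can stay close to the original accuracy.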