LLM Architecture Gallery


Llama 3 8B

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

Scale
8B parameters

Date
2024-04-18

Decoder type
Dense

Attention
GQA with RoPE

Key detail
Pre-norm baseline; wider than OLMo 2 at a similar scale.

OLMo 2 7B

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

Scale
7B parameters

Date
2024-11-25

Decoder type
Dense

Attention
MHA with QK-Norm

Key detail
Uses inside-residual post-norm instead of the usual pre-norm layout.
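
The pre-norm vs. inside-residual post-norm distinction can be sketched in a few lines. This is an illustrative sketch, not either model's actual code; the sublayer stands in for attention or the MLP, and shapes are arbitrary:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the last dimension.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Llama-style pre-norm: normalize the sublayer *input*.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # OLMo 2-style: normalize the sublayer *output*, still inside the
    # residual, so the skip connection itself stays un-normalized.
    return x + rms_norm(sublayer(x))
```

In both layouts the residual stream passes through unchanged; the output-side placement is what OLMo 2 credits for improved training stability.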

DeepSeek V3

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

Scale
671B total, 37B active

Date
2024-12-26

Decoder type
Sparse MoE

Attention
MLA

Key detail
Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
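
The shared-expert idea can be sketched as follows. Sizes and the router are invented for illustration; DeepSeek V3's real router additionally uses bias terms, load balancing, and fine-grained experts, and its first few layers skip routing entirely (the dense prefix):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Toy experts: each is a small MLP represented here by a single matrix.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
shared = rng.standard_normal((d, d))       # the always-on shared expert
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    """x: (tokens, d). The shared expert runs on every token; the router
    additionally sends each token to its top_k routed experts."""
    logits = x @ router
    top = np.argsort(-logits, axis=-1)[:, :top_k]
    g = np.take_along_axis(logits, top, axis=-1)
    g = np.exp(g - g.max(-1, keepdims=True))
    g = g / g.sum(-1, keepdims=True)       # softmax over the chosen experts
    out = x @ shared                        # dense, always-active path
    for t in range(x.shape[0]):
        for s in range(top_k):
            out[t] += g[t, s] * (x[t] @ experts[top[t, s]])
    return out

y = moe_forward(rng.standard_normal((5, d)))
```

The shared expert guarantees every token a common dense path, which in practice lets the routed experts specialize more aggressively.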

DeepSeek R1

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

Scale
671B total, 37B active

Date
2025-01-20

Decoder type
Sparse MoE

Attention
MLA

Key detail
Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.

Gemma 3 27B

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

Scale
27B parameters

Date
2025-03-11

Decoder type
Dense

Attention
GQA with QK-Norm and 5:1 sliding-window/global attention

Key detail
Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
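
The 5:1 cadence amounts to a per-layer choice of attention mask. The sketch below uses a tiny illustrative window; Gemma 3's actual window is far larger:

```python
import numpy as np

def layer_mask(seq_len, layer_idx, window=4, local_per_global=5):
    """Boolean attention mask for one layer in a local/global cadence.

    Five consecutive sliding-window layers are followed by one global
    layer, so layer_idx % 6 == 5 gets full causal attention.
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if layer_idx % (local_per_global + 1) == local_per_global:
        return causal                         # global layer
    return causal & (q - k < window)          # sliding-window layer
```

Because only every sixth layer keeps full-length keys and values, the KV cache shrinks dramatically relative to an all-global stack.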

Mistral Small 3.1 24B

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

Scale
24B parameters

Date
2025-03-18

Decoder type
Dense

Attention
Standard GQA

Key detail
Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.

Llama 4 Maverick

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

Scale
400B total, 17B active

Date
2025-04-05

Decoder type
Sparse MoE

Attention
GQA

Key detail
Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.

Qwen3 235B-A22B

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

Scale
235B total, 22B active

Date
2025-04-28

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
High-capacity MoE design optimized for serving efficiency without a shared expert.

Qwen3 32B

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

Scale
32B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Reference dense Qwen stack with QK-Norm and 8 KV heads.

Qwen3 4B

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

Scale
4B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.

Qwen3 8B

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

Scale
8B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Reference Qwen3 dense stack with QK-Norm and 8 KV heads.
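
The GQA-with-QK-Norm pattern shared by the Qwen3 dense family can be sketched as follows; head counts and dimensions are illustrative:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def gqa_scores(q, k, n_kv_heads):
    """q: (n_q_heads, seq, d), k: (n_kv_heads, seq, d).

    QK-Norm: RMS-normalize queries and keys per head before the dot
    product, which bounds the attention logits. GQA: each group of
    n_q_heads // n_kv_heads query heads reuses one KV head, shrinking
    the KV cache by the same factor.
    """
    group = q.shape[0] // n_kv_heads
    q, k = rms_norm(q), rms_norm(k)
    k_shared = np.repeat(k, group, axis=0)   # broadcast KV heads to query groups
    return q @ k_shared.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

scores = gqa_scores(np.random.randn(8, 6, 16), np.random.randn(2, 6, 16), n_kv_heads=2)
```

With unit-RMS queries and keys, each logit is bounded by sqrt(d), which is the stability argument for QK-Norm.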

SmolLM3 3B

Compact dense model that experiments with leaving out positional encodings in selected layers.

Scale
3B parameters

Date
2025-06-19

Decoder type
Dense

Attention
GQA with periodic NoPE layers

Key detail
Every fourth layer omits RoPE to test a NoPE-style cadence.
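
The cadence can be sketched as a per-layer switch around a standard RoPE implementation; the interleaved pairing convention below is one common variant, chosen for illustration:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary embedding on x: (seq, d) with d even."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def apply_positions(x, positions, layer_idx):
    # SmolLM3-style cadence: every fourth layer is a NoPE layer and
    # applies no explicit positional rotation at all.
    if (layer_idx + 1) % 4 == 0:
        return x
    return rope(x, positions)
```

The NoPE layers rely on the causal mask alone for position information, which the NoPE literature argues can generalize better to long contexts.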

Kimi K2

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

Scale
1T total, 32B active

Date
2025-07-10

Decoder type
Sparse MoE

Attention
MLA

Key detail
More experts and fewer MLA heads than DeepSeek V3.

GLM-4.5 355B

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

Scale
355B total, 32B active

Date
2025-07-28

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Starts with three dense layers before MoE routing and keeps a shared expert.

GPT-OSS 120B

The larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

Scale
120B total, 5.1B active

Date
2025-08-04

Decoder type
Sparse MoE

Attention
GQA with alternating sliding-window and global layers

Key detail
Shared architectural template scaled up for OpenAI's flagship open-weight release.

GPT-OSS 20B

OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.

Scale
20B total, 3.6B active

Date
2025-08-04

Decoder type
Sparse MoE

Attention
GQA with alternating sliding-window and global layers

Key detail
Wider and shallower than Qwen3, with attention bias and sink mechanisms.

Grok 2.5 270B

Rare production-model release that shows an older MoE style with fewer, larger experts.

Scale
270B total

Date
2025-08-22

Decoder type
Sparse MoE

Attention
GQA

Key detail
Adds an always-on SwiGLU path that effectively behaves like a shared expert.

Qwen3 Next 80B-A3B

Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.

Scale
80B total, 3B active

Date
2025-09-09

Decoder type
Sparse hybrid

Attention
3:1 Gated DeltaNet and Gated Attention

Key detail
Adds many more experts, a shared expert, and a native 262k context.

MiniMax M2 230B

MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.

Scale
230B total, 10B active

Date
2025-10-23

Decoder type
Sparse MoE

Attention
GQA with QK-Norm and partial RoPE

Key detail
Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.

Kimi Linear 48B-A3B

Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.

Scale
48B total, 3B active

Date
2025-10-30

Decoder type
Sparse hybrid

Attention
3:1 Kimi Delta Attention and MLA

Key detail
Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.

OLMo 3 32B

Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.

Scale
32B parameters

Date
2025-11-20

Decoder type
Dense

Attention
GQA with QK-Norm and 3:1 sliding-window/global attention

Key detail
Keeps post-norm while scaling width and applying YaRN only on global layers.

OLMo 3 7B

New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.

Scale
7B parameters

Date
2025-11-20

Decoder type
Dense

Attention
MHA with QK-Norm and 3:1 sliding-window/global attention

Key detail
Retains post-norm, keeps MHA, and applies YaRN only on global layers.

DeepSeek V3.2

DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.

Scale
671B total, 37B active

Date
2025-12-01

Decoder type
Sparse MoE

Attention
MLA with DeepSeek Sparse Attention

Key detail
An evolutionary update focused on efficiency rather than a new base layout.

Mistral 3 Large

Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.

Scale
673B total, 41B active

Date
2025-12-02

Decoder type
Sparse MoE

Attention
MLA

Key detail
Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.

Nemotron 3 Nano 30B-A3B

NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.

Scale
30B total, 3B active

Date
2025-12-04

Decoder type
Hybrid MoE

Attention
Mostly Mamba-2 with a few GQA layers

Key detail
Interleaves Mamba-2 and MoE blocks, using attention only sparingly.

Xiaomi MiMo-V2-Flash 309B

Large MoE model that pushes sliding-window attention harder than most contemporaries.

Scale
309B total, 15B active

Date
2025-12-16

Decoder type
Sparse MoE

Attention
5:1 sliding-window/global attention

Key detail
Uses an unusually small 128-token local window plus multi-token prediction.

GLM-4.7 355B

Immediate predecessor to GLM-5 that stays closer to the older GLM-4.5 style, before the shift to MLA.

Scale
355B total, 32B active

Date
2025-12-22

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.

Arcee AI Trinity Large 400B

Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.

Scale
400B total, 13B active

Date
2026-01-27

Decoder type
Sparse MoE

Attention
GQA with gated attention and 3:1 sliding-window/global attention

Key detail
Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.
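
Sandwich norm, one of the tricks listed above, places normalization on both sides of each sublayer; a minimal sketch, with the sublayer standing in for attention or the MLP:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sandwich_block(x, sublayer):
    # Normalize both the sublayer input (as in pre-norm) and its output
    # (as in post-norm), while leaving the residual path untouched.
    return x + rms_norm(sublayer(rms_norm(x)))
```

The output-side norm caps how much any single block can perturb the residual stream, combining the stability arguments for pre-norm and post-norm.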

GLM-5 744B

Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.

Scale
744B total, 40B active

Date
2026-02-11

Decoder type
Sparse MoE

Attention
MLA with DeepSeek Sparse Attention

Key detail
Bigger than GLM-4.7, with more experts and fewer layers.

Nemotron 3 Super 120B-A12B

The Super variant scales up Nano and adds both latent experts and native speculative decoding support.

Scale
120B total, 12B active

Date
2026-03-11

Decoder type
Hybrid MoE

Attention
Mostly Mamba-2 with a few GQA layers

Key detail
Adds latent-space MoE and shared-weight MTP for fast inference.

Step 3.5 Flash 196B

Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.

Scale
196B total, 11B active

Date
2026-02-01

Decoder type
Sparse MoE

Attention
GQA with 3:1 sliding-window attention

Key detail
Uses MTP-3 during both training and inference for unusually high throughput.
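
At inference time, multi-token prediction works like self-speculative decoding: extra heads draft several future tokens, which are then checked in a single verification pass. The interface below is hypothetical (`draft` and `verify` are invented names), a sketch of the idea rather than Step 3.5's implementation:

```python
def mtp_decode_step(draft, verify, prefix, n_draft=3):
    """One decode step with up to n_draft extra accepted tokens.

    draft(prefix)  -> (next_token, [d1, ..., dn]) from the MTP heads.
    verify(prefix) -> the next token the full model would emit.
    Accepted drafts extend the sequence at no extra sequential cost;
    the first mismatch stops acceptance.
    """
    next_tok, drafts = draft(prefix)
    out = prefix + [next_tok]
    for d in drafts:
        if verify(out) != d:      # full model disagrees: stop accepting
            break
        out.append(d)
    return out
```

When draft heads are accurate, most steps emit several tokens per forward pass, which is where the throughput claim comes from.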

Nanbeige 4.1 3B

Small model aimed at on-device use that stays close to Llama 3.2 while nudging the scaling choices.

Scale
3B parameters

Date
2026-02-10

Decoder type
Dense

Attention
GQA

Key detail
Llama-like stack without tying input embeddings to the output layer.

MiniMax M2.5 230B

Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.

Scale
230B total, 10B active

Date
2026-02-12

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.

Tiny Aya 3.35B

Compact multilingual model from Cohere with a rare parallel transformer block.

Scale
3.35B parameters

Date
2026-02-13

Decoder type
Dense

Attention
GQA with 3:1 sliding-window attention

Key detail
Runs attention and the MLP in parallel while mixing RoPE with NoPE.
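
The parallel transformer block (familiar from GPT-J and PaLM) feeds attention and the MLP the same normalized input instead of chaining them; a sketch with stand-in sublayers:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sequential_block(x, attn, mlp):
    # Conventional layout: the MLP sees the attention output.
    h = x + attn(rms_norm(x))
    return h + mlp(rms_norm(h))

def parallel_block(x, attn, mlp):
    # Parallel layout: attention and MLP read the same input and their
    # outputs are summed, letting the two matmul chains run concurrently.
    n = rms_norm(x)
    return x + attn(n) + mlp(n)
```

The payoff is latency: the attention and MLP projections can be fused or overlapped, at the cost of the MLP no longer conditioning on the current layer's attention output.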

Ling 2.5 1T

Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.

Scale
1T total, 63B active

Date
2026-02-15

Decoder type
Sparse hybrid

Attention
Lightning Attention plus MLA

Key detail
Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.

Qwen3.5 397B

Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.

Scale
397B total, 17B active

Date
2026-02-16

Decoder type
Sparse hybrid

Attention
3:1 Gated DeltaNet and Gated Attention

Key detail
Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.

Sarvam 105B

Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.

Scale
105B total

Date
2026-03-03

Decoder type
Sparse MoE

Attention
MLA with KV LayerNorm and NoPE + RoPE

Key detail
Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.

Sarvam 30B

Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.

Scale
30B total

Date
2026-03-03

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.