LLM Architecture Gallery


Llama 3 8B

Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.

Scale
8B parameters

Date
2024-04-18

Decoder type
Dense

Attention
GQA with RoPE

Key detail
Pre-norm baseline; wider than OLMo 2 at a similar scale.

OLMo 2 7B

Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.

Scale
7B parameters

Date
2024-11-25

Decoder type
Dense

Attention
MHA with QK-Norm

Key detail
Uses inside-residual post-norm instead of the usual pre-norm layout.
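
The pre-norm vs. inside-residual post-norm distinction can be sketched in a few lines. This is an illustrative sketch, not either model's actual code; the sublayer stands in for attention or the MLP, and shapes are arbitrary:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the last dimension.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Llama-style pre-norm: normalize the sublayer *input*.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # OLMo 2-style: normalize the sublayer *output*, still inside the
    # residual, so the skip connection itself stays un-normalized.
    return x + rms_norm(sublayer(x))
```

In both layouts the residual stream passes through unchanged; the output-side placement is what OLMo 2 credits for improved training stability.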

DeepSeek V3

DeepSeek's flagship template kicked off the recent wave of large open MoE models.

Scale
671B total, 37B active

Date
2024-12-26

Decoder type
Sparse MoE

Attention
MLA

Key detail
Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
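
The shared-expert idea can be sketched as follows. Sizes and the router are invented for illustration; DeepSeek V3's real router additionally uses bias terms, load balancing, and fine-grained experts, and its first few layers skip routing entirely (the dense prefix):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Toy experts: each is a small MLP represented here by a single matrix.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
shared = rng.standard_normal((d, d))       # the always-on shared expert
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    """x: (tokens, d). The shared expert runs on every token; the router
    additionally sends each token to its top_k routed experts."""
    logits = x @ router
    top = np.argsort(-logits, axis=-1)[:, :top_k]
    g = np.take_along_axis(logits, top, axis=-1)
    g = np.exp(g - g.max(-1, keepdims=True))
    g = g / g.sum(-1, keepdims=True)       # softmax over the chosen experts
    out = x @ shared                        # dense, always-active path
    for t in range(x.shape[0]):
        for s in range(top_k):
            out[t] += g[t, s] * (x[t] @ experts[top[t, s]])
    return out

y = moe_forward(rng.standard_normal((5, d)))
```

The shared expert guarantees every token a common dense path, which in practice lets the routed experts specialize more aggressively.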

DeepSeek R1

Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.

Scale
671B total, 37B active

Date
2025-01-20

Decoder type
Sparse MoE

Attention
MLA

Key detail
Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.

Gemma 3 27B

Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.

Scale
27B parameters

Date
2025-03-11

Decoder type
Dense

Attention
GQA with QK-Norm and 5:1 sliding-window/global attention

Key detail
Built around a 27B sweet spot with heavier local attention and a large multilingual vocabulary.
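
The 5:1 cadence amounts to a per-layer choice of attention mask. The sketch below uses a tiny illustrative window; Gemma 3's actual window is far larger:

```python
import numpy as np

def layer_mask(seq_len, layer_idx, window=4, local_per_global=5):
    """Boolean attention mask for one layer in a local/global cadence.

    Five consecutive sliding-window layers are followed by one global
    layer, so layer_idx % 6 == 5 gets full causal attention.
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if layer_idx % (local_per_global + 1) == local_per_global:
        return causal                         # global layer
    return causal & (q - k < window)          # sliding-window layer
```

Because only every sixth layer keeps full-length keys and values, the KV cache shrinks dramatically relative to an all-global stack.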

Mistral Small 3.1 24B

Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.

Scale
24B parameters

Date
2025-03-18

Decoder type
Dense

Attention
Standard GQA

Key detail
Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.

Llama 4 Maverick

Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.

Scale
400B total, 17B active

Date
2025-04-05

Decoder type
Sparse MoE

Attention
GQA

Key detail
Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.

Qwen3 235B-A22B

Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.

Scale
235B total, 22B active

Date
2025-04-28

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
High-capacity MoE design optimized for serving efficiency without a shared expert.

Qwen3 32B

Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.

Scale
32B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Reference dense Qwen stack with QK-Norm and 8 KV heads.

Qwen3 4B

Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.

Scale
4B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.

Qwen3 8B

Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.

Scale
8B parameters

Date
2025-04-28

Decoder type
Dense

Attention
GQA with QK-Norm

Key detail
Reference Qwen3 dense stack with QK-Norm and 8 KV heads.
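
The GQA-with-QK-Norm pattern shared by the Qwen3 dense family can be sketched as follows; head counts and dimensions are illustrative:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def gqa_scores(q, k, n_kv_heads):
    """q: (n_q_heads, seq, d), k: (n_kv_heads, seq, d).

    QK-Norm: RMS-normalize queries and keys per head before the dot
    product, which bounds the attention logits. GQA: each group of
    n_q_heads // n_kv_heads query heads reuses one KV head, shrinking
    the KV cache by the same factor.
    """
    group = q.shape[0] // n_kv_heads
    q, k = rms_norm(q), rms_norm(k)
    k_shared = np.repeat(k, group, axis=0)   # broadcast KV heads to query groups
    return q @ k_shared.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

scores = gqa_scores(np.random.randn(8, 6, 16), np.random.randn(2, 6, 16), n_kv_heads=2)
```

With unit-RMS queries and keys, each logit is bounded by sqrt(d), which is the stability argument for QK-Norm.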

SmolLM3 3B

Compact dense model that experiments with leaving out positional encodings in selected layers.

Scale
3B parameters

Date
2025-06-19

Decoder type
Dense

Attention
GQA with periodic NoPE layers

Key detail
Every fourth layer omits RoPE to test a NoPE-style cadence.
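
The cadence can be sketched as a per-layer switch around a standard RoPE implementation; the interleaved pairing convention below is one common variant, chosen for illustration:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary embedding on x: (seq, d) with d even."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def apply_positions(x, positions, layer_idx):
    # SmolLM3-style cadence: every fourth layer is a NoPE layer and
    # applies no explicit positional rotation at all.
    if (layer_idx + 1) % 4 == 0:
        return x
    return rope(x, positions)
```

The NoPE layers rely on the causal mask alone for position information, which the NoPE literature argues can generalize better to long contexts.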

Kimi K2

Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.

Scale
1T total, 32B active

Date
2025-07-10

Decoder type
Sparse MoE

Attention
MLA

Key detail
More experts and fewer MLA heads than DeepSeek V3.

GLM-4.5 355B

Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.

Scale
355B total, 32B active

Date
2025-07-28

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Starts with three dense layers before MoE routing and keeps a shared expert.

GPT-OSS 120B

The larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.

Scale
120B total, 5.1B active

Date
2025-08-04

Decoder type
Sparse MoE

Attention
GQA with alternating sliding-window and global layers

Key detail
Shared architectural template scaled up for OpenAI's flagship open-weight release.

GPT-OSS 20B

OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.

Scale
20B total, 3.6B active

Date
2025-08-04

Decoder type
Sparse MoE

Attention
GQA with alternating sliding-window and global layers

Key detail
Wider and shallower than Qwen3, with attention bias and sink mechanisms.

Grok 2.5 270B

Rare production-model release that shows an older MoE style with fewer, larger experts.

Scale
270B total

Date
2025-08-22

Decoder type
Sparse MoE

Attention
GQA

Key detail
Adds an always-on SwiGLU path that effectively behaves like a shared expert.

Qwen3 Next 80B-A3B

Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.

Scale
80B total, 3B active

Date
2025-09-09

Decoder type
Sparse hybrid

Attention
3:1 Gated DeltaNet and Gated Attention

Key detail
Adds many more experts, a shared expert, and a native 262k context.

MiniMax M2 230B

MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.

Scale
230B total, 10B active

Date
2025-10-23

Decoder type
Sparse MoE

Attention
GQA with QK-Norm and partial RoPE

Key detail
Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.

Kimi Linear 48B-A3B

Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.

Scale
48B total, 3B active

Date
2025-10-30

Decoder type
Sparse hybrid

Attention
3:1 Kimi Delta Attention and MLA

Key detail
Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.

OLMo 3 32B

Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.

Scale
32B parameters

Date
2025-11-20

Decoder type
Dense

Attention
GQA with QK-Norm and 3:1 sliding-window/global attention

Key detail
Keeps post-norm while scaling width and applying YaRN only on global layers.

OLMo 3 7B

New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.

Scale
7B parameters

Date
2025-11-20

Decoder type
Dense

Attention
MHA with QK-Norm and 3:1 sliding-window/global attention

Key detail
Retains post-norm, keeps MHA, and applies YaRN only on global layers.

DeepSeek V3.2

DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.

Scale
671B total, 37B active

Date
2025-12-01

Decoder type
Sparse MoE

Attention
MLA with DeepSeek Sparse Attention

Key detail
An evolutionary update focused on efficiency rather than a new base layout.

Mistral 3 Large

Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.

Scale
673B total, 41B active

Date
2025-12-02

Decoder type
Sparse MoE

Attention
MLA

Key detail
Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.

Nemotron 3 Nano 30B-A3B

NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.

Scale
30B total, 3B active

Date
2025-12-04

Decoder type
Hybrid MoE

Attention
Mostly Mamba-2 with a few GQA layers

Key detail
Interleaves Mamba-2 and MoE blocks, using attention only sparingly.

Xiaomi MiMo-V2-Flash 309B

Large MoE model that pushes sliding-window attention harder than most contemporaries.

Scale
309B total, 15B active

Date
2025-12-16

Decoder type
Sparse MoE

Attention
5:1 sliding-window/global attention

Key detail
Uses an unusually small 128-token local window plus multi-token prediction.

GLM-4.7 355B

Immediate predecessor to GLM-5 that stays closer to the older GLM-4.5 style, before the shift to MLA.

Scale
355B total, 32B active

Date
2025-12-22

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.

Arcee AI Trinity Large 400B

Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.

Scale
400B total, 13B active

Date
2026-01-27

Decoder type
Sparse MoE

Attention
GQA with gated attention and 3:1 sliding-window/global attention

Key detail
Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.
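
Sandwich norm, one of the tricks listed above, places normalization on both sides of each sublayer; a minimal sketch, with the sublayer standing in for attention or the MLP:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sandwich_block(x, sublayer):
    # Normalize both the sublayer input (as in pre-norm) and its output
    # (as in post-norm), while leaving the residual path untouched.
    return x + rms_norm(sublayer(rms_norm(x)))
```

The output-side norm caps how much any single block can perturb the residual stream, combining the stability arguments for pre-norm and post-norm.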

GLM-5 744B

Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.

Scale
744B total, 40B active

Date
2026-02-11

Decoder type
Sparse MoE

Attention
MLA with DeepSeek Sparse Attention

Key detail
Bigger than GLM-4.7, with more experts and fewer layers.

Nemotron 3 Super 120B-A12B

The Super variant scales up Nano and adds both latent experts and native speculative decoding support.

Scale
120B total, 12B active

Date
2026-03-11

Decoder type
Hybrid MoE

Attention
Mostly Mamba-2 with a few GQA layers

Key detail
Adds latent-space MoE and shared-weight MTP for fast inference.

Step 3.5 Flash 196B

Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.

Scale
196B total, 11B active

Date
2026-02-01

Decoder type
Sparse MoE

Attention
GQA with 3:1 sliding-window attention

Key detail
Uses MTP-3 during both training and inference for unusually high throughput.
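
At inference time, multi-token prediction works like self-speculative decoding: extra heads draft several future tokens, which are then checked in a single verification pass. The interface below is hypothetical (`draft` and `verify` are invented names), a sketch of the idea rather than Step 3.5's implementation:

```python
def mtp_decode_step(draft, verify, prefix, n_draft=3):
    """One decode step with up to n_draft extra accepted tokens.

    draft(prefix)  -> (next_token, [d1, ..., dn]) from the MTP heads.
    verify(prefix) -> the next token the full model would emit.
    Accepted drafts extend the sequence at no extra sequential cost;
    the first mismatch stops acceptance.
    """
    next_tok, drafts = draft(prefix)
    out = prefix + [next_tok]
    for d in drafts:
        if verify(out) != d:      # full model disagrees: stop accepting
            break
        out.append(d)
    return out
```

When draft heads are accurate, most steps emit several tokens per forward pass, which is where the throughput claim comes from.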

Nanbeige 4.1 3B

Small model aimed at on-device use that stays close to Llama 3.2 while nudging the scaling choices.

Scale
3B parameters

Date
2026-02-10

Decoder type
Dense

Attention
GQA

Key detail
Llama-like stack without tying input embeddings to the output layer.

MiniMax M2.5 230B

Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.

Scale
230B total, 10B active

Date
2026-02-12

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.

Tiny Aya 3.35B

Compact multilingual model from Cohere with a rare parallel transformer block.

Scale
3.35B parameters

Date
2026-02-13

Decoder type
Dense

Attention
GQA with 3:1 sliding-window attention

Key detail
Runs attention and the MLP in parallel while mixing RoPE with NoPE.
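
The parallel transformer block (familiar from GPT-J and PaLM) feeds attention and the MLP the same normalized input instead of chaining them; a sketch with stand-in sublayers:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sequential_block(x, attn, mlp):
    # Conventional layout: the MLP sees the attention output.
    h = x + attn(rms_norm(x))
    return h + mlp(rms_norm(h))

def parallel_block(x, attn, mlp):
    # Parallel layout: attention and MLP read the same input and their
    # outputs are summed, letting the two matmul chains run concurrently.
    n = rms_norm(x)
    return x + attn(n) + mlp(n)
```

The payoff is latency: the attention and MLP projections can be fused or overlapped, at the cost of the MLP no longer conditioning on the current layer's attention output.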

Ling 2.5 1T

Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.

Scale
1T total, 63B active

Date
2026-02-15

Decoder type
Sparse hybrid

Attention
Lightning Attention plus MLA

Key detail
Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.

Qwen3.5 397B

Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.

Scale
397B total, 17B active

Date
2026-02-16

Decoder type
Sparse hybrid

Attention
3:1 Gated DeltaNet and Gated Attention

Key detail
Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.

Sarvam 105B

Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.

Scale
105B total

Date
2026-03-03

Decoder type
Sparse MoE

Attention
MLA with KV LayerNorm and NoPE + RoPE

Key detail
Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.

Sarvam 30B

Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.

Scale
30B total

Date
2026-03-03

Decoder type
Sparse MoE

Attention
GQA with QK-Norm

Key detail
Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.