Llama 3 8B
Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.
- Scale: 8B parameters
- Date: 2024-04-18
- Decoder type: Dense
- Attention: GQA with RoPE
- Key detail: Pre-norm baseline; wider than OLMo 2 at a similar scale.
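The card above names grouped-query attention, where several query heads share one key/value head to shrink the KV cache. A minimal sketch of that sharing (shapes and the helper name are illustrative, not Llama's actual implementation; the 32-query/8-KV split matches Llama 3 8B's published config):

```python
import numpy as np

def gqa_scores(q, k, n_kv_heads):
    """Grouped-query attention scores: each group of query heads
    shares one key head (illustrative, not Llama's code).

    q: (n_q_heads, seq, d)   k: (n_kv_heads, seq, d)
    """
    n_q_heads, _, d = q.shape
    # Repeat each KV head so every query head has a matching key head.
    k_expanded = np.repeat(k, n_q_heads // n_kv_heads, axis=0)
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d)

# Llama 3 8B: 32 query heads, 8 KV heads -> 4 query heads per KV head.
q = np.random.randn(32, 5, 16)
k = np.random.randn(8, 5, 16)
print(gqa_scores(q, k, n_kv_heads=8).shape)  # (32, 5, 5)
```

With 8 KV heads instead of 32, the cached keys and values shrink by 4x while the score tensor keeps its full per-query-head shape.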
Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.
DeepSeek's flagship template kicked off the recent wave of large open MoE models.
Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.
Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.
Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.
Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.
Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.
Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.
Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.
Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.
Compact dense model that experiments with leaving out positional encodings in selected layers.
Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.
Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.
Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.
OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.
Rare production-model release that shows an older MoE style with fewer, larger experts.
Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.
MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.
Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.
Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.
New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.
DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.
Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.
NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.
Large MoE model that pushes sliding-window attention harder than most contemporaries.
Immediate GLM predecessor that stays closer to the older GLM-4.5 style before the MLA shift.
Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.
Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.
The Super variant scales up Nano and adds both latent experts and native speculative decoding support.
Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.
Small, on-device-oriented model that stays close to Llama 3.2 while nudging the scaling choices.
Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.
Compact multilingual model from Cohere with a rare parallel transformer block.
Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.
Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.
Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.
Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.
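Many of the entries above are sparse mixture-of-experts designs in the DeepSeek V3 mold. A minimal sketch of the top-k expert routing they share, for a single token (sizes, names, and the plain ReLU experts are illustrative, not any listed model's configuration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Top-k mixture-of-experts routing for one token (illustrative,
    not any specific model's implementation).

    x: (d,) token activation
    gate_w: (n_experts, d) router weights
    experts: list of (W1, W2) weight pairs, one small MLP per expert
    """
    logits = gate_w @ x                           # router score per expert
    top = np.argsort(logits)[-top_k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                      # softmax over selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        W1, W2 = experts[idx]
        out += w * (W2 @ np.maximum(W1 @ x, 0))   # weighted ReLU-MLP expert output
    return out

d, n_experts = 8, 4
rng = np.random.default_rng(0)
experts = [(rng.normal(size=(16, d)), rng.normal(size=(d, 16)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), rng.normal(size=(n_experts, d)), experts)
print(y.shape)  # (8,)
```

Only the selected experts run per token, which is why these models can carry hundreds of billions of total parameters while activating only a small fraction of them; the "shared expert" some entries mention is an extra expert that bypasses the router and runs for every token.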
Source articles
- The original comparison article that walks through the architecture figures in context and explains the key design choices across dense, MoE, MLA, and hybrid decoder families.
- Follow-up article covering the additional open-weight architecture releases from early 2026, including the newer MiniMax, Qwen, Ling, and Sarvam families.