Arena AI Model Elo History

Why this exists

AI labs frequently update their models post-launch, and users regularly report perceived "nerfs": excessive quantization (to save compute costs), aggressive censorship, or behavioral degradation. This chart plots each flagship's public Elo lifecycle on one timeline, so any such trend would be visible at a glance.

Data is fetched daily from the official Arena AI Leaderboard Dataset on Hugging Face, built from thousands of blind, crowdsourced head-to-head human votes. It's an imperfect lens (see caveats below) but the most consistent long-running signal currently available.

How the chart works

Each lab gets exactly one curve, tracking that lab's highest-rated flagship-eligible model at every point in time, not just the most recently announced one.

  • If a lab ships a mid-tier model (e.g. Sonnet) while a higher-tier one (e.g. Opus) still ranks above it, the curve stays on Opus.
  • Inference-mode variants (suffixes like -thinking, -reasoning, -high) are merged into the parent so the curve doesn't flip-flop between modes.
  • New releases appear as labeled marker points, often with a jump in score.
  • Downward trends between releases are visible too, but read the caveats below before treating them as proof.
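The selection rules above can be sketched roughly as follows. This is an illustrative sketch, not the tracker's actual code: the suffix list, model names, and ratings are hypothetical.

```python
import re

# Hypothetical variant suffixes; the tracker's real merge list isn't specified here.
VARIANT_SUFFIXES = re.compile(r"-(thinking|reasoning|high)$")

def canonical(model: str) -> str:
    """Collapse an inference-mode variant into its parent model name."""
    return VARIANT_SUFFIXES.sub("", model)

def flagship(ratings: dict[str, float]) -> str:
    """Pick one lab's curve point: merge variants (keeping each parent's best
    score), then return the highest-rated parent model."""
    merged: dict[str, float] = {}
    for name, elo in ratings.items():
        parent = canonical(name)
        merged[parent] = max(merged.get(parent, float("-inf")), elo)
    return max(merged, key=merged.get)

# A mid-tier release (even with a -thinking variant) doesn't move the curve
# while the higher tier still ranks above it.
snapshot = {"opus-4": 1410.0, "sonnet-4": 1395.0, "sonnet-4-thinking": 1402.0}
print(flagship(snapshot))  # stays on opus-4
```

Merging variants first, then taking the per-parent maximum, is what keeps the curve from flip-flopping when a model's thinking mode briefly outranks its sibling.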

Caveats

01

Web UIs vs. API

Arena tests models via API endpoints, i.e. the "raw" model. Consumer chat interfaces (gemini.google.com, chatgpt.com, etc.) add system prompts, safety filters, and UI wrappers not present in the raw API, and providers may silently serve quantized (lower-precision) versions under load. Perceived "nerfing" in those products may not show up here.

02

Elo is relative

Elo measures strength relative to the rest of the leaderboard, not on an absolute scale. When stronger models enter (or peers improve), an unchanged model's Elo can drift down anyway; conversely, if every model regresses in parallel, Elo won't reveal it. A fixed-benchmark longitudinal dataset would be cleaner, but no such public archive seems to exist.
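The relativity effect can be seen in a toy online-Elo simulation. (Crowdsourced leaderboards are typically fit with a Bradley-Terry model rather than sequential Elo updates, but the relative-rating argument is the same; the K-factor and ratings below are illustrative.)

```python
K = 32  # illustrative K-factor, not Arena's actual fitting procedure

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """One head-to-head vote: score_a is 1 (A wins), 0 (A loses), or 0.5."""
    e = expected(r_a, r_b)
    return r_a + K * (score_a - e), r_b + K * (e - score_a)

# Model A never changes, but a stronger newcomer B enters at the same rating
# and wins every matchup: A's Elo drifts down with zero change in A itself.
a, b = 1300.0, 1300.0
for _ in range(50):
    a, b = update(a, b, 0.0)
```

Rating points are only transferred between players (the sum stays constant), which is exactly why a plateau or dip on the chart can mean "rivals got better" rather than "this model got worse".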

Related: for a Claude-only view, marginlab.ai's Claude Code tracker follows Claude specifically and can surface signals an Arena-wide Elo view misses. PRs welcome for web-interface evaluations or other longitudinal trackers worth linking.