AI IQ | AI Model IQ Leaderboard and Benchmark Charts

How smart is your AI model, really?

AI IQ intelligently estimates the IQs of popular AI models

AI IQ Newsletter

Get the weekly AI model intelligence newsletter

New launches, benchmark shifts, cost-performance winners, and practical guidance on which models are worth using.

How AI IQ estimates model intelligence

  1. We archive source captures from public benchmark leaderboards and extract only source-backed values
  2. We map each benchmark score to an implied IQ using calibrated difficulty curves
  3. We group 18 benchmarks into five reasoning dimensions: fluid abstraction, mathematical, programmatic, critical, and agentic
  4. We conservatively fill missing benchmark and dimension estimates, but only inside the scoring pipeline; raw benchmark charts keep source-backed values
  5. Every derived IQ averages all five dimensions, so missing coverage cannot make a model look better by omission

Effective cost & iso-curves

Effective cost on the X-axis is sticker price for 1M I/O Tokens × token usage multiplier. 1M I/O Tokens means 1M input tokens plus 1M output tokens, priced at the model's published rates.
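
As a minimal sketch of that arithmetic, the snippet below computes effective cost from published per-1M-token prices. The function name, parameter names, and the example prices and multiplier are hypothetical illustrations, not values from the site.

```python
def effective_cost(input_price_per_m: float,
                   output_price_per_m: float,
                   usage_multiplier: float) -> float:
    """Sticker price for 1M input + 1M output tokens, scaled by token usage."""
    # "1M I/O Tokens" = 1M input tokens plus 1M output tokens at published rates.
    sticker_price = input_price_per_m + output_price_per_m
    return sticker_price * usage_multiplier


# Hypothetical model: $3 per 1M input tokens, $15 per 1M output tokens,
# and a token usage multiplier of 1.5.
print(effective_cost(3.0, 15.0, 1.5))  # 27.0
```

Under this reading, a model that uses more tokens to reach the same answer lands further along the cost axis even when its published rates are unchanged.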

Iso-curves trace lines of equal preference for IQ versus cost. The slider weights quality against cost: the center is 1:1; drag toward Cost to make cost matter more, or toward IQ to make quality matter more. Models on the higher-IQ, lower-cost side of a curve are strictly better at that preference.

Tracking frontier progress

Each dot is a model with a known release date and a derived IQ estimate. Models are positioned left-to-right by release date, so the chart shows how the frontier changes over time rather than just where models rank today.

Provider-colored lines connect each lab's flagship frontier checkpoints. Codex, mini, nano, flash, coder, and smaller open-weight variants are omitted so the chart tracks each lab's main offering rather than every SKU.

This view is most useful for spotting whether a new release is actually ahead of its direct predecessor, or whether source coverage and conservative imputations are shaping the comparison.

How AI IQ estimates emotional intelligence

  1. We pull in each model's Text Arena Elo score and EQ-Bench 3 Elo score
  2. We map each source score to an estimated EQ using calibrated piecewise-linear scales
  3. EQ-Bench 3 is retained as the dedicated emotional/social reasoning signal, but treated as style-sensitive because it is judged by Claude
  4. Anthropic models have 300 Elo points subtracted from their EQ-Bench 3 score before mapping
  5. The composite EQ requires both source-backed components, then averages the Text Arena and EQ-Bench estimates with equal weight
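
The steps above can be sketched roughly as follows. The anchor points are invented for illustration (the real calibration is not given here), and the names `piecewise_linear` and `composite_eq` are hypothetical; only the 300-point subtraction, the both-sources requirement, and the equal-weight average come from the description.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple

Anchors = Sequence[Tuple[float, float]]  # (source Elo, implied EQ), sorted by Elo

# Hypothetical anchor points, NOT the site's actual calibration.
TEXT_ARENA_ANCHORS: Anchors = [(1200.0, 85.0), (1400.0, 115.0), (1500.0, 135.0)]
EQ_BENCH_ANCHORS: Anchors = [(800.0, 85.0), (1200.0, 110.0), (1600.0, 135.0)]
ANTHROPIC_EQ_BENCH_ADJUSTMENT = 300.0  # Elo points subtracted before mapping


def piecewise_linear(score: float, anchors: Anchors) -> float:
    """Interpolate between anchors; clamp to the end values outside the range."""
    xs = [x for x, _ in anchors]
    if score <= xs[0]:
        return anchors[0][1]
    if score >= xs[-1]:
        return anchors[-1][1]
    i = bisect_right(xs, score)  # index of the first anchor strictly above score
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)


def composite_eq(text_arena_elo: Optional[float],
                 eq_bench_elo: Optional[float],
                 is_anthropic: bool) -> Optional[float]:
    """Equal-weight average of both implied EQs; None unless both sources exist."""
    if text_arena_elo is None or eq_bench_elo is None:
        return None  # one-source rows get no composite EQ
    if is_anthropic:
        eq_bench_elo -= ANTHROPIC_EQ_BENCH_ADJUSTMENT
    eq_arena = piecewise_linear(text_arena_elo, TEXT_ARENA_ANCHORS)
    eq_bench = piecewise_linear(eq_bench_elo, EQ_BENCH_ANCHORS)
    return (eq_arena + eq_bench) / 2
```

With these made-up anchors, an Anthropic model at EQ-Bench Elo 1500 is mapped as if it scored 1200, so it receives the same composite as a non-Anthropic model at 1200 with the same Text Arena score.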

IQ and EQ tradeoffs

IQ summarizes benchmark-based reasoning ability across fluid abstraction, mathematical reasoning, programmatic reasoning, critical reasoning, and agentic reasoning dimensions.

EQ estimates interaction quality from Text Arena and EQ-Bench 3 signals, then maps those scores onto the same kind of normalized scale so models can be compared directly.

Iso-curves trace lines of equal preference between IQ and EQ. The slider weights the two: center is 1:1, drag toward EQ to make EQ matter more, or toward IQ to make IQ matter more. Models above and to the right of a curve are strictly better at that preference.
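
One way to picture the slider is a weighted linear preference, sketched below. The linear form, the `iq_weight` parameter, and both function names are assumptions made for illustration; the page does not state the exact shape of its iso-curves.

```python
def preference(iq: float, eq: float, iq_weight: float = 0.5) -> float:
    """Assumed linear preference: iq_weight = 0.5 is the 1:1 slider center."""
    return iq_weight * iq + (1.0 - iq_weight) * eq


def iq_on_iso_curve(eq: float, level: float, iq_weight: float = 0.5) -> float:
    """IQ needed at a given EQ to stay on the iso-curve with this preference level."""
    if iq_weight <= 0.0:
        raise ValueError("iq_weight must be positive to solve for IQ")
    return (level - (1.0 - iq_weight) * eq) / iq_weight


# At the 1:1 center, trading 20 IQ points for 20 EQ points is a wash.
print(preference(120.0, 100.0) == preference(100.0, 120.0))  # True
# Dragging toward IQ (weight 0.75) breaks the tie in favor of the higher-IQ model.
print(preference(120.0, 100.0, 0.75) > preference(100.0, 120.0, 0.75))  # True
```

Under this assumption each iso-curve is a straight line, and any model above and to the right of it has a higher preference value at the chosen weighting.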

Three dimensions, one view

Most charts on this page reduce model comparison to two axes. This one keeps all three: EQ (X), IQ (Y), and effective cost (Z, log-scaled — the depth axis). Effective cost is sticker price for 1M I/O Tokens multiplied by the blended usage multiplier.

Drag to rotate the cloud. The dashed line is the central tradeoff axis: it is perpendicular to the isoquant surface at the middle of the cube and points toward higher IQ, higher EQ, and lower effective cost. Models nearer the green end are stronger all-around deals; models nearer the red end give up capability, cost efficiency, or both.

Color = provider, matching the legend below.

IQ Methodology

Each benchmark's hand-calibrated mapping from raw score to implied IQ

$$\text{IQ} = \tfrac{1}{5}\!\left(\text{IQ}_{\text{Fluid}} + \text{IQ}_{\text{Math}} + \text{IQ}_{\text{Prog}} + \text{IQ}_{\text{Critical}} + \text{IQ}_{\text{Agentic}}\right)$$

18 benchmarks, 5 dimensions

Each benchmark score is mapped to an implied IQ via a calibrated anchor curve. Benchmarks are grouped into five dimensions — Fluid Abstraction, Mathematical Reasoning, Programmatic Reasoning, Critical Reasoning, and Agentic Reasoning — and each dimension IQ is the average of its benchmarks.

Hard, difficult-to-game benchmarks retain high IQ ceilings. Easier or data-contaminated benchmarks have compressed ceilings, which limits how far they can push an implied IQ above 100.
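
To illustrate the effect of a compressed ceiling, the sketch below maps a benchmark score to an implied IQ through two anchor curves. Both curves and the names `HARD_BENCHMARK`, `SATURATED_BENCHMARK`, and `implied_iq` are hypothetical; the actual anchors are hand-calibrated per benchmark and are not reproduced here.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (benchmark score in percent, implied IQ)

# Hypothetical anchor curves, for illustration only.
HARD_BENCHMARK: Curve = [(0.0, 80.0), (10.0, 110.0), (50.0, 140.0), (100.0, 160.0)]
SATURATED_BENCHMARK: Curve = [(0.0, 70.0), (60.0, 95.0), (90.0, 110.0), (100.0, 115.0)]


def implied_iq(score: float, curve: Curve) -> float:
    """Linear interpolation along the anchor curve, clamped to its end points."""
    score = min(max(score, curve[0][0]), curve[-1][0])
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if score <= x1:
            return y0 + (y1 - y0) * (score - x0) / (x1 - x0)
    return curve[-1][1]


# A perfect score is worth far less on the benchmark with the compressed ceiling.
print(implied_iq(100.0, HARD_BENCHMARK))       # 160.0
print(implied_iq(100.0, SATURATED_BENCHMARK))  # 115.0
```

Because the last anchor sets the ceiling, a saturated benchmark can still separate weak models from average ones without dominating the dimension average at the top end.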

FrontierMath Tier 1-3 and ProofBench are included in the Mathematical Reasoning dimension and also shown as standalone cost-scatters on the IQ page.

Models need at least 2 of 5 dimensions covered to receive a derived IQ. Missing benchmarks and dimensions are conservatively imputed only inside the scoring pipeline, while raw benchmark charts keep source-backed values. Every derived composite uses all five dimensions.

Direct predecessor lineage is used first when it is explicit. Remaining missing dimensions use a matched lower-quartile cap based on models with similar capability across the other dimensions.
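
The coverage rule and fill order can be sketched as below. How models are matched is not specified in the text, so this example assumes peers are the `k` pool models closest in mean score over the dimensions the target model already covers, and it fills the gap with the lower quartile of their scores. All names (`matched_peers`, `derived_iq`, `lower_quartile`, the dimension keys) are hypothetical.

```python
from statistics import mean, quantiles
from typing import Dict, List, Optional

DIMENSIONS = ("fluid", "math", "prog", "critical", "agentic")
Scores = Dict[str, float]  # dimension name -> dimension IQ


def lower_quartile(values: List[float]) -> float:
    """25th percentile; a single value is returned unchanged."""
    if len(values) == 1:
        return values[0]
    return quantiles(values, n=4, method="inclusive")[0]


def matched_peers(model: Scores, pool: List[Scores], missing: str, k: int = 4) -> List[Scores]:
    """Assumed matching rule: nearest models by mean IQ over the covered dimensions."""
    covered = [d for d in DIMENSIONS if d in model]
    target = mean(model[d] for d in covered)
    candidates = [p for p in pool if missing in p and all(d in p for d in covered)]
    return sorted(candidates, key=lambda p: abs(mean(p[d] for d in covered) - target))[:k]


def derived_iq(model: Scores, pool: List[Scores],
               predecessor: Optional[Scores] = None) -> Optional[float]:
    """Equal-weight mean over all five dimensions after conservative filling."""
    if sum(d in model for d in DIMENSIONS) < 2:
        return None  # fewer than 2 of 5 dimensions covered: no derived IQ
    filled = dict(model)
    for dim in DIMENSIONS:
        if dim in filled:
            continue
        if predecessor is not None and dim in predecessor:
            filled[dim] = predecessor[dim]  # explicit predecessor lineage first
            continue
        peers = matched_peers(model, pool, dim)
        if not peers:
            return None  # nothing to match against in this sketch
        filled[dim] = lower_quartile([p[dim] for p in peers])
    return mean(filled[d] for d in DIMENSIONS)
```

Filling with a lower quartile rather than a peer average means a model with a missing dimension is scored below most comparable models on that dimension, so missing coverage does not inflate the composite.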

Each model's composite IQ plotted against each dimension IQ, with linear fit and R²

How dimensions relate to composite IQ

Composite IQ is the equal-weight average of five dimension IQs: Fluid Abstraction, Mathematical Reasoning, Programmatic Reasoning, Critical Reasoning, and Agentic Reasoning. Derived scores always use all five dimensions, with missing dimensions conservatively filled before averaging.

R² measures how much of the variance in composite IQ is explained by each dimension. A higher R² means that dimension is a stronger predictor of overall model intelligence.
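
For reference, R² for a simple linear fit can be computed as in the sketch below. The function name and the sample numbers are illustrative and are not data from the charts.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R² of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line vs. total around the mean.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up dimension IQs (x) and composite IQs (y) for four models.
dimension_iq = [100.0, 110.0, 120.0, 130.0]
composite_iq = [102.0, 111.0, 117.0, 128.0]
print(round(r_squared(dimension_iq, composite_iq), 3))
```

Since each dimension contributes one fifth of the composite by construction, some positive relationship is built in; the comparison between dimensions is more informative than any single R² value.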

EQ Methodology

Each benchmark's hand-calibrated mapping from Elo score to implied EQ

$$\text{EQ} = \operatorname{avg}\!\left(\text{EQ}_{\text{EQ-Bench}}, \text{EQ}_{\text{Text Arena}}\right)$$

2 benchmarks, 1 composite

Each source score is mapped to an implied EQ via a hand-calibrated anchor curve. Text Arena Elo reflects broad conversational ability as judged by human preference. EQ-Bench 3 measures emotional/social reasoning in challenging roleplays, but is AI-judged and sensitive to response style.

Composite EQ requires both source-backed components. One-source rows remain visible in component charts but do not receive a composite EQ.

Anthropic family-bias adjustment

EQ-Bench 3 is judged by Claude (an Anthropic model), which can favor Claude-like response styles and penalize models whose styles differ substantially. We keep it because it is the strongest recent dedicated emotional/social reasoning signal, but subtract 300 Elo points from Anthropic models' EQ-Bench 3 scores before mapping to implied EQ. Text Arena scores are not adjusted.

Each model's composite EQ plotted against each benchmark's implied EQ, with linear fit and R²

How benchmarks relate to composite EQ

Composite EQ is the equal-weight average of the implied EQs from Text Arena Elo (human-preference ranking) and EQ-Bench 3 (AI-judged emotional/social reasoning). Models need both source-backed components to receive a composite EQ.

R² measures how much of the variance in composite EQ is explained by each benchmark. A higher R² means that benchmark is a stronger predictor of overall model EQ.