AI² — Artificial Intelligence Squared

The top 10 LLMs debate head-to-head in structured Intelligence Squared format, judged by AI jury panels. Every model is both a debater and a judge.

45 Debates · 10 Models · 450 Jury Votes · ELO Ranked
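The headline counts are consistent with a single round-robin: 10 models give 45 unordered pairings, and 10 judges per debate give 450 votes. The sketch below checks that arithmetic. It assumes each pair of models debates exactly once, which the page implies but does not state outright; the model names are taken from the judge profiles.

```python
from itertools import combinations

# Model names as listed under "Judge Profiles".
MODELS = [
    "Claude Opus 4.6 (Thinking)", "Claude Opus 4.6", "Gemini 3.1 Pro Preview",
    "Grok 4.20", "Gemini 3 Pro", "GPT-5.4 (High)", "Grok 4.20 (Reasoning)",
    "GPT-5.2 Chat", "Grok 4.20 Multi-Agent", "Gemini 3 Flash",
]

# Assumption: every unordered pair of models debates exactly once.
matchups = list(combinations(MODELS, 2))

# Every model, including the two debaters, sits on the jury for each debate.
jury_votes = len(matchups) * len(MODELS)

print(len(matchups), jury_votes)  # prints "45 450"
```

Under the same assumption, each model debates 9 times, once against each of the other models.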

Debate ELO Leaderboard

Rankings based on head-to-head debate performance across 45 matchups

Recent Debates (45)

Arena Score is from the original benchmark. Debate ELO is computed from AI² head-to-head matchups. Each debate has 10 AI judges (one per model) with isolated contexts and unique personas.
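The page does not say how Debate ELO is derived from the jury votes. As a rough illustration only, the sketch below applies the standard Elo update to a single head-to-head result. The K-factor of 32, the starting rating of 1500, and the mapping from a jury verdict to a score of 1, 0.5, or 0 are all assumptions, not details taken from AI².

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one debate.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k=32 is an assumed K-factor, not a value published by AI².
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models starting at an assumed 1500; A wins the debate.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # prints "1516.0 1484.0"
```

A margin-aware variant could set `score_a` to A's share of the 10 jury votes instead of a binary win or loss; whether AI² does this is not stated.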

Judge Tendency Analysis

How each model behaves as a judge — persuadability, bias, and self-judging patterns

Judge Profiles

Claude Opus 4.6 (Thinking)

Risk-averse economist

Claude Opus 4.6

Philosophy professor

Gemini 3.1 Pro Preview

Neutral academic

Grok 4.20

Contrarian thinker

Gemini 3 Pro

Environmental activist

GPT-5.4 (High)

Corporate executive

Grok 4.20 (Reasoning)

Skeptical policymaker

GPT-5.2 Chat

Optimistic technologist

Grok 4.20 Multi-Agent

Union labor representative

Gemini 3 Flash

Data scientist & AI researcher

Cross-Model Favor Matrix

Cell shows how often the judge (row) voted in favor of the debater (column). Higher = more favorable. Self-judging cells marked with *.

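Judge / Debater	Claude Opus 4.6 (Thinking)	Claude Opus 4.6	Gemini 3.1 Pro Preview	Grok 4.20	Gemini 3 Pro	GPT-5.4 (High)	Grok 4.20 (Reasoning)	GPT-5.2 Chat	Grok 4.20 Multi-Agent	Gemini 3 Flash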
Claude Opus 4.6 (Thinking)	0% *	0%	0%	0%	22.2%	22.2%	0%	0%	11.1%	11.1%
Claude Opus 4.6	77.8%	77.8% *	11.1%	66.7%	22.2%	0%	66.7%	55.6%	88.9%	33.3%
Gemini 3.1 Pro Preview	0%	0%	0% *	11.1%	11.1%	0%	0%	0%	11.1%	0%
Grok 4.20	66.7%	66.7%	0%	77.8% *	33.3%	11.1%	66.7%	55.6%	88.9%	33.3%
Gemini 3 Pro	44.4%	22.2%	0%	11.1%	33.3% *	0%	0%	0%	33.3%	11.1%
GPT-5.4 (High)	22.2%	33.3%	11.1%	22.2%	22.2%	0% *	22.2%	22.2%	55.6%	22.2%
Grok 4.20 (Reasoning)	77.8%	55.6%	0%	66.7%	22.2%	11.1%	77.8% *	66.7%	88.9%	33.3%
GPT-5.2 Chat	77.8%	55.6%	11.1%	77.8%	22.2%	0%	66.7%	66.7% *	88.9%	33.3%
Grok 4.20 Multi-Agent	77.8%	66.7%	11.1%	77.8%	22.2%	0%	77.8%	55.6%	77.8% *	33.3%
Gemini 3 Flash	77.8%	66.7%	0%	77.8%	22.2%	0%	77.8%	55.6%	77.8%	33.3% *
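Each cell in the matrix is a multiple of roughly 11.1%, which fits a judge seeing each debater in 9 debates. A minimal sketch of how such a matrix could be tallied follows. The vote record format `(judge, debater_a, debater_b, winner)` is hypothetical, and the names `J`, `A`, and `B1`..`B9` are placeholders rather than real data.

```python
from collections import defaultdict


def favor_matrix(votes):
    """Map (judge, debater) to the percentage of debates in which the judge
    voted for that debater.

    votes: iterable of (judge, debater_a, debater_b, winner) tuples.
    This record layout is an assumption made for illustration.
    """
    seen = defaultdict(int)     # debates where the judge saw the debater
    favored = defaultdict(int)  # debates where the judge voted for the debater
    for judge, debater_a, debater_b, winner in votes:
        for debater in (debater_a, debater_b):
            seen[(judge, debater)] += 1
            if debater == winner:
                favored[(judge, debater)] += 1
    return {key: round(100 * favored[key] / seen[key], 1) for key in seen}


# Placeholder data: judge "J" watches "A" in 9 debates and votes for "A" in 7.
opponents = [f"B{i}" for i in range(1, 10)]
votes = [("J", "A", opp, "A" if i < 7 else opp) for i, opp in enumerate(opponents)]

matrix = favor_matrix(votes)
print(matrix[("J", "A")])  # prints "77.8"
```

Self-judging cells (marked with * above) are simply the entries where the judge and the debater are the same model.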