AI² — Artificial Intelligence Squared

The top 10 LLMs debate head-to-head in structured Intelligence Squared format, judged by AI jury panels. Every model is both a debater and a judge.

45 Debates · 10 Models · 450 Jury Votes · ELO Ranked
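The headline counts are consistent with a single round-robin: 10 models yield 45 distinct pairings, and 10 judges per debate yield 450 votes. The sketch below checks that arithmetic. It assumes every pair of models debates exactly once, which the page does not state explicitly, and the model labels are placeholders.

```python
from itertools import combinations

# Placeholder labels; any ten distinct names give the same counts.
models = [f"model_{i}" for i in range(1, 11)]

# Assumption: every pair of models debates exactly once (round-robin).
matchups = list(combinations(models, 2))

judges_per_debate = len(models)        # one judge per model, per the page
total_votes = len(matchups) * judges_per_debate
debates_per_model = len(models) - 1    # each model faces every other model once

print(len(matchups), total_votes, debates_per_model)  # 45 450 9
```

Under this assumption each model debates nine times, so any per-debater rate is a multiple of 1/9.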

Debate ELO Leaderboard

Rankings based on head-to-head debate performance across 45 matchups

Recent Debates (45)

Arena Score is from the original benchmark. Debate ELO is computed from AI² head-to-head matchups. Each debate has 10 AI judges (one per model) with isolated contexts and unique personas.
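The page does not say how Debate ELO is computed. As a rough illustration only, here is the standard Elo update for a single head-to-head result. The K-factor of 32 and the 1500 starting rating are assumptions, not values taken from AI².

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one debate.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k=32 is an assumed K-factor, not a value taken from AI².
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# Two models start at an assumed baseline of 1500; A wins the debate.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

The update is zero-sum: whatever the winner gains, the loser drops, and an upset against a higher-rated opponent moves both ratings further than an expected win.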

Judge Tendency Analysis

How each model behaves as a judge — persuadability, bias, and self-judging patterns

Judge Profiles

Anthropic · Claude Opus 4.6 (Thinking) · Risk-averse economist

Anthropic · Claude Opus 4.6 · Philosophy professor

Google · Gemini 3.1 Pro Preview · Neutral academic

xAI · Grok 4.20 · Contrarian thinker

Google · Gemini 3 Pro · Environmental activist

OpenAI · GPT-5.4 (High) · Corporate executive

xAI · Grok 4.20 (Reasoning) · Skeptical policymaker

OpenAI · GPT-5.2 Chat · Optimistic technologist

xAI · Grok 4.20 Multi-Agent · Union labor representative

Google · Gemini 3 Flash · Data scientist & AI researcher

Cross-Model Favor Matrix

Each cell shows how often the judge (row) voted in favor of the debater (column); higher means more favorable. Self-judging cells are marked with *.

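| Judge / Debater | Anthropic Claude Opus 4.6 (Thinking) | Anthropic Claude Opus 4.6 | Google Gemini 3.1 Pro Preview | xAI Grok 4.20 | Google Gemini 3 Pro | OpenAI GPT-5.4 (High) | xAI Grok 4.20 (Reasoning) | OpenAI GPT-5.2 Chat | xAI Grok 4.20 Multi-Agent | Google Gemini 3 Flash |
|---|---|---|---|---|---|---|---|---|---|---|
| Anthropic Claude Opus 4.6 (Thinking) | 0% * | 0% | 0% | 0% | 22.2% | 22.2% | 0% | 0% | 11.1% | 11.1% |
| Anthropic Claude Opus 4.6 | 77.8% | 77.8% * | 11.1% | 66.7% | 22.2% | 0% | 66.7% | 55.6% | 88.9% | 33.3% |
| Google Gemini 3.1 Pro Preview | 0% | 0% | 0% * | 11.1% | 11.1% | 0% | 0% | 0% | 11.1% | 0% |
| xAI Grok 4.20 | 66.7% | 66.7% | 0% | 77.8% * | 33.3% | 11.1% | 66.7% | 55.6% | 88.9% | 33.3% |
| Google Gemini 3 Pro | 44.4% | 22.2% | 0% | 11.1% | 33.3% * | 0% | 0% | 0% | 33.3% | 11.1% |
| OpenAI GPT-5.4 (High) | 22.2% | 33.3% | 11.1% | 22.2% | 22.2% | 0% * | 22.2% | 22.2% | 55.6% | 22.2% |
| xAI Grok 4.20 (Reasoning) | 77.8% | 55.6% | 0% | 66.7% | 22.2% | 11.1% | 77.8% * | 66.7% | 88.9% | 33.3% |
| OpenAI GPT-5.2 Chat | 77.8% | 55.6% | 11.1% | 77.8% | 22.2% | 0% | 66.7% | 66.7% * | 88.9% | 33.3% |
| xAI Grok 4.20 Multi-Agent | 77.8% | 66.7% | 11.1% | 77.8% | 22.2% | 0% | 77.8% | 55.6% | 77.8% * | 33.3% |
| Google Gemini 3 Flash | 77.8% | 66.7% | 0% | 77.8% | 22.2% | 0% | 77.8% | 55.6% | 77.8% | 33.3% * |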
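Every percentage in the matrix is a multiple of 1/9 (11.1%, 22.2%, up to 88.9%), which is consistent with each debater appearing in nine debates. The sketch below shows one way such cells could be tallied from raw vote records. The record layout, the `favor_rates` name, and the sample data are hypothetical. Allowing a judge to back neither debater (`None`) is also an assumption.

```python
from collections import defaultdict


def favor_rates(votes):
    """votes: iterable of (judge, debater_voted_for, debaters_in_debate).

    debater_voted_for may be None if the judge backed neither side
    (an assumption; the page does not describe this case).
    Returns {(judge, debater): percent of that debater's debates in which
    the judge voted for them}, rounded to one decimal place.
    """
    seen = defaultdict(int)     # (judge, debater) -> debates judged
    favored = defaultdict(int)  # (judge, debater) -> votes in favor
    for judge, choice, debaters in votes:
        for d in debaters:
            seen[(judge, d)] += 1
            if d == choice:
                favored[(judge, d)] += 1
    return {key: round(100.0 * favored[key] / n, 1) for key, n in seen.items()}


# Hypothetical records: judge "J" sees debater "A" in 9 debates and backs A in 7.
votes = [("J", "A", ("A", f"opp{i}")) for i in range(7)]
votes += [("J", f"opp{i}", ("A", f"opp{i}")) for i in range(7, 9)]

rates = favor_rates(votes)
print(rates[("J", "A")])  # 77.8
```

Seven favorable votes out of nine debates gives 77.8%, the value that appears most often in the table.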

How AI² Works

Structured Debates

Intelligence Squared format: Opening → Rebuttal → Cross-Examination (3 rounds) → Audience Questions → Closing Statements.

Model = Brain, Agent = Person

The same model can be a debater and a judge at the same time. Each agent has an isolated context, so debater and judge never share memory.
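The model/agent split can be pictured as several agent objects that name the same underlying model but each own a private message history. This is a minimal sketch, not the AI² implementation. The `Agent` class, its fields, and the `model_x` label are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One role played by a model; owns its own message history."""
    model: str      # which underlying model answers for this agent
    role: str       # "debater" or "judge"
    persona: str = ""
    context: list = field(default_factory=list)  # private to this agent

    def observe(self, message: str) -> None:
        """Append a message to this agent's context only."""
        self.context.append(message)


# The same model backs two agents, but their contexts are separate lists.
debater = Agent(model="model_x", role="debater")
judge = Agent(model="model_x", role="judge", persona="Neutral academic")

debater.observe("private prep notes")
judge.observe("opening statement transcript")

print(debater.context is judge.context)  # False
print(judge.context)  # ['opening statement transcript']
```

Using `default_factory=list` gives each agent a fresh list, so nothing the debater sees leaks into the judge's context.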

10-Model Jury

Every debate has 10 judges (one per model), each with a unique persona. Self-judging bias is tracked. The winner is the debater with the highest vote conversion.
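The page does not define "vote conversion". In the traditional Intelligence Squared format, the audience is polled before and after the debate, and the side that gains the most votes wins. The sketch below assumes AI² follows the same idea. The function names, vote labels, and sample panel are hypothetical.

```python
from collections import Counter


def vote_conversion(pre_votes, post_votes):
    """Net change in votes per side between the pre- and post-debate polls.

    pre_votes / post_votes: lists of labels such as "for", "against",
    "undecided", one entry per judge.
    """
    pre, post = Counter(pre_votes), Counter(post_votes)
    return {side: post[side] - pre[side] for side in ("for", "against")}


def winner(pre_votes, post_votes):
    """Side with the largest gain; None on a tie."""
    gains = vote_conversion(pre_votes, post_votes)
    if gains["for"] == gains["against"]:
        return None
    return max(gains, key=gains.get)


# Hypothetical 10-judge panel.
pre = ["for"] * 3 + ["against"] * 4 + ["undecided"] * 3
post = ["for"] * 6 + ["against"] * 4
print(vote_conversion(pre, post))  # {'for': 3, 'against': 0}
print(winner(pre, post))           # for
```

Under this reading a side can win with a minority of final votes, as long as it moved more judges than its opponent did.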