LLM Debate Benchmark: Adversarial Multi-Turn Argument Under Opposition
This benchmark measures how well large language models perform in adversarial, multi-turn debates across a wide range of topics. Strong performance is not just about producing a polished first answer. It requires broad knowledge, accurate use of relevant facts under pressure, strong rebuttal, and the ability to stay coherent, responsive, and defensible over several rounds.
Each evaluated matchup runs twice on the same topic with sides swapped. A three-model judge panel then decides winner and margin, and the published leaderboard uses Bradley-Terry ratings built from completed side-swapped matchups.
How to read the main chart
- Each bar is one model’s current Bradley-Terry rating.
- Higher bars mean stronger judged debate performance.
- Bradley-Terry is a relative within-pool rating centered near 1500; it is not an absolute capability score.
- The grey band behind each bar is the 95% robust confidence interval for that model’s rating.
- The published order uses Bradley-Terry.
Current Snapshot
- 35 rated models shown in the public charts
- 3,386 debate artifacts tracked in the current status page
- 1,608 side-swapped matchups tracked before incomplete pairs are removed
- 1,569 side-swapped matchups used for the current ratings
- 3,138 completed debates used for the current ratings
- 9,990 judge decisions behind the chart-visible published ratings
One side-swapped matchup means two debates on the same motion with PRO and CON roles reversed.
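The gap between tracked matchups and rated matchups comes from dropping pairs where one of the two side assignments did not complete. A minimal sketch of that filtering step is below. The `Debate` record, its field names, and the topic and model identifiers are hypothetical and chosen for illustration; they are not the benchmark's actual schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Debate(NamedTuple):
    # Hypothetical record layout, not the benchmark's real schema.
    topic_id: str
    pro: str
    con: str
    completed: bool


def group_matchups(debates):
    """Group debates into side-swapped matchups and pick out the complete ones."""
    groups = defaultdict(dict)
    for d in debates:
        key = (d.topic_id, frozenset((d.pro, d.con)))
        # Keep one debate per orientation; a completed attempt replaces a failed one.
        prev = groups[key].get(d.pro)
        if prev is None or (d.completed and not prev.completed):
            groups[key][d.pro] = d
    complete = {
        key: tuple(by_pro.values())
        for key, by_pro in groups.items()
        if len(by_pro) == 2 and all(d.completed for d in by_pro.values())
    }
    return groups, complete


debates = [
    Debate("topic_1", "model_x", "model_y", True),
    Debate("topic_1", "model_y", "model_x", True),
    Debate("topic_2", "model_x", "model_y", True),
    Debate("topic_2", "model_y", "model_x", False),  # rematch failed
]
tracked, rated = group_matchups(debates)
rated_debates = sum(len(pair) for pair in rated.values())
```

Under this reading, every rated matchup contributes exactly two debates, which is consistent with 1,569 rated matchups and 3,138 completed debates in the snapshot above.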
The transcript links below point to completed debates that are ready to read directly.
Older models remain in the data for comparison, but the README and charts emphasize the current model set by default. The price-vs-performance chart also hides older model versions by default so the cost comparison emphasizes current successors.
Reader Paths
There are four useful ways to read this snapshot:
- Fast ranking check: start with the Bradley-Terry chart, then use the full leaderboard table to see coverage for each model.
- Design sanity check: read the pairwise heatmap, judge sanity checks, cross-judge agreement heatmaps, status summary, and reliability diagnostics to see whether the headline ranking is being distorted by narrow matchups, judge disagreement, or model-service problems.
- Transcript-level read: jump to the worked examples, then use the matchup results index, model profiles, and model deep dives for broader transcript-driven patterns.
- Quality/readability check: use the debate quality signal and entertainment report to see which models produce debates judges found readable or engaging. This is diagnostic only and does not affect ratings.
The charts are meant to answer “who is ahead?” quickly. The worked examples, matchup reports, model profiles, and model deep dives are meant to answer the more important follow-up: “what did the better debate actually look like?”
Pairwise View
The pairwise heatmap shows how models perform against each other after aggregation across completed, side-swapped matchups. This is useful because a single scalar leaderboard always hides some structure. A model can be strong overall while still having a few specific bad matchups.
Each cell is the mean signed judge margin for the row model over the column model. Positive blue cells favor the row model; negative red cells favor the column model. The number in parentheses is the count of completed side-swapped matchups for that head-to-head cell.
The heatmap is most useful as a quick read on where the field is decisively separated and where it still is not. In the current snapshot, the biggest clean edges are mostly against Llama 4 Maverick, while the top cluster remains much tighter. Cells with only one or two matchups should be read as directional evidence rather than stable pairwise estimates.
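As a rough illustration of how one heatmap cell could be computed, the sketch below averages signed judge margins for a row model over a column model and counts distinct matchups. It assumes the mean is taken over individual judge rows; the benchmark may instead average per matchup first. The row format and identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pairwise_cells(rows):
    """rows: (model_a, model_b, matchup_id, signed_margin); positive favors model_a."""
    margins = defaultdict(list)
    matchups = defaultdict(set)
    for a, b, matchup_id, margin in rows:
        # Each judge row fills the (a, b) cell and, with flipped sign, the mirror cell.
        margins[(a, b)].append(margin)
        margins[(b, a)].append(-margin)
        matchups[(a, b)].add(matchup_id)
        matchups[(b, a)].add(matchup_id)
    return {cell: (mean(vals), len(matchups[cell])) for cell, vals in margins.items()}


rows = [
    ("model_x", "model_y", "matchup_1", 1.0),
    ("model_x", "model_y", "matchup_1", 2.0),
    ("model_x", "model_y", "matchup_2", 0.0),
]
cells = pairwise_cells(rows)
```

Here the `("model_x", "model_y")` cell would be drawn in blue with a count of two matchups, and its mirror cell carries the same magnitude with the opposite sign.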
What this benchmark shows
Debate is harder than ordinary question answering because the model has to stay correct and coherent after the other side pushes back. That pressure exposes several different abilities at once:
- Knowledge under stress: can the model retrieve the right facts when challenged, not just in its opening statement?
- Counterargument handling: can it answer the strongest objection instead of repeating its own case?
- Strategic coherence: can it preserve a line of argument over multiple turns instead of drifting or contradicting itself?
- Evidence discipline: can it make claims that remain defensible when the debate becomes adversarial?
In practice, this format does not reward openings alone. Some models look strong in a first pass but weaken once the other side attacks specifics, while others stay more stable across rebuttal and closing.
The side-swapped design matters too. Some topics are easier to argue from one side than the other, so each pair debates the same motion twice with roles reversed. That makes the benchmark closer to a structured adversarial comparison than a one-shot preference test.
Another reason debate is useful is that it makes different failure modes visible at the same time. A model can know the facts but fail to organize them. It can produce elegant openings but weak rebuttals. It can sound persuasive while still collapsing under pressure. Debate compresses those distinctions into one adversarial format.
Representative Motions
The benchmark is broad rather than narrowly optimized around one type of policy question. A few current motions give a good sense of the range:
- Dating apps: The dominant dating-app model makes relationship formation worse for most users than better.
- School smartphones: Schools should ban smartphones during the school day by default rather than leave phone rules to individual teachers.
- Older-adult care: Hospitals and care providers should not replace most human companionship with AI or robotic companions for older adults, even when staffing is tight.
- Shrinkflation: Supermarkets and food apps should be required to display shrinkflation and unit-price changes more clearly when package sizes fall without obvious headline price cuts.
- Eurozone politics: The eurozone's post-2010 crisis response deepened political distrust more than it preserved European solidarity.
This matters because debate ability can look very different on fiscal policy, civil liberties, technology governance, migration, labor, education, or historical-justice motions. A wide topic bank makes the leaderboard more meaningful.
Bradley-Terry Leaderboard
Current full leaderboard
| Rank | Model | BT | Matchups |
|---|---|---|---|
| 1 | Claude Fable 5 (high) | 1767.8 | 102 |
| 2 | Claude Opus 4.7 (high) | 1703.9 | 106 |
| 3 | Claude Opus 4.8 (high) | 1686.9 | 125 |
| 4 | Claude Sonnet 4.6 (high) | 1618.4 | 134 |
| 5 | GLM-5.2 (max) | 1612.5 | 79 |
| 6 | GPT-5.4 (high) | 1606.9 | 133 |
| 7 | Claude Sonnet 4.6 (no reasoning) | 1599.7 | 85 |
| 8 | GPT-5.5 (high) | 1578.2 | 101 |
| 9 | GLM-5.1 | 1573.6 | 99 |
| 10 | Kimi K2.6 | 1566.7 | 97 |
| 11 | GPT-5.4 (no reasoning) | 1551.4 | 95 |
| 12 | Gemini 3.1 Pro Preview | 1548.7 | 138 |
| 13 | Xiaomi MiMo V2.5 Pro | 1548.2 | 91 |
| 14 | Qwen 3.6 Max Preview | 1530.0 | 84 |
| 15 | Kimi K2.5 Thinking | 1510.1 | 73 |
| 16 | DeepSeek V4 Pro | 1507.4 | 83 |
| 17 | ByteDance Seed2.0 Pro | 1502.6 | 100 |
| 18 | MiniMax-M2.7 | 1494.7 | 90 |
| 19 | Qwen 3.7 Max | 1490.8 | 124 |
| 20 | Grok 4.20 0309 (Reasoning) | 1474.5 | 51 |
| 21 | Gemini 3.5 Flash | 1465.2 | 89 |
| 22 | Grok 4.20 0309 (Non-Reasoning) | 1465.0 | 37 |
| 23 | Tencent Hy3 Preview (high) | 1453.4 | 69 |
| 24 | Xiaomi MiMo V2 Pro | 1450.1 | 31 |
| 25 | Qwen3.5-397B-A17B | 1448.0 | 93 |
| 26 | Step 3.7 Flash (high) | 1446.3 | 118 |
| 27 | Baidu Ernie 5.1 | 1435.5 | 122 |
| 28 | Grok 4.3 | 1430.2 | 73 |
| 29 | DeepSeek V3.2 | 1420.1 | 41 |
| 30 | Mistral Medium 3.5 (high) | 1398.1 | 69 |
| 31 | Gemini 3.1 Flash-Lite Preview | 1397.1 | 33 |
| 32 | GPT-OSS-120B | 1324.6 | 31 |
| 33 | Baidu Ernie 5.0 | 1299.6 | 16 |
| 34 | Mistral Large 3 | 1273.0 | 22 |
| 35 | Llama 4 Maverick | 1088.5 | 29 |
BT is the headline Bradley-Terry rating. Matchups is the number of completed side-swapped matchup groups for that model in the current ratings.
What Stands Out
The current picture has a Claude-led top tier, a crowded frontier cluster below it, and broader coverage for the newest comparison set.
- Claude Fable 5 (high) currently leads the published board. It sits at 1767.8 BT across 102 completed side-swapped matchup groups, with 205 of 210 attempted debates completed.
- Claude Opus 4.7 and Claude Opus 4.8 remain the nearest challengers. Opus 4.7 ranks 2nd at 1703.9 BT across 106 matchup groups, and Opus 4.8 ranks 3rd at 1686.9 BT across 125 matchup groups.
- GLM-5.2 (max) enters the public board near the top of the frontier cluster. It ranks 5th at 1612.5 BT across 79 matchup groups, just below Claude Sonnet 4.6 (high) and just ahead of GPT-5.4 (high).
- GPT-5.4, Claude Sonnet 4.6 without reasoning, GPT-5.5, GLM-5.1, and Kimi K2.6 form a tight group below the leading Claude and GLM-5.2 rows. They sit from 1606.9 to 1566.7 BT, with enough coverage to read the ordering as meaningful but still close.
- The latest-added non-Claude entrants remain visible in the public board. GLM-5.2 (max), Qwen 3.7 Max, Baidu Ernie 5.1, and Step 3.7 Flash (high) are placed with broad matchup coverage, while Tencent Hy3 Preview (high), Grok 4.3, DeepSeek V4 Pro, Xiaomi MiMo V2.5 Pro, and Mistral Medium 3.5 (high) remain part of the current comparison set.
- Judges are rewarding rebuttal quality and argument strength more than isolated style. The top cluster is repeatedly described in the model profiles as disciplined, grounded, clash-driven, and responsive. Lower-ranked models often retain some mix of grounding, originality, or rhetorical effectiveness, but still lose because they underperform on rebuttal quality and argument strength.
Price vs. Performance
The price chart reads the same Bradley-Terry strength signal against estimated debate cost. The x-axis is model-side USD per completed debate and excludes the cost of judging. When exact billing data is available, the chart uses it; otherwise it estimates cost from the debate text. Costs are shown only where USD estimates are available.
Higher and further left is better on this view. The chart omits vertical rating-uncertainty bands for readability and plots current rated models with available cost estimates. The cost table was refreshed for this snapshot and now includes Claude Fable 5 (high) and GLM-5.2 (max); GLM-5.2 averages about $0.0538 per model-side debate versus $0.0369 for GLM-5.1 in the recovered cost data. Older versions such as GPT-5.4 variants, Kimi K2.5, Qwen3.5-397B-A17B, older Grok 4.20, DeepSeek V3.2, and Xiaomi MiMo V2 Pro remain in the leaderboard data but are hidden from this cost chart by default.
Why Bradley-Terry And Side Swaps
This benchmark does not publish a simple “average judge score per debate” leaderboard as the main result. The primary table is Bradley-Terry over completed side-swapped matchups.
That matters for three reasons:
- A single debate can be distorted by side advantage or topic-specific asymmetry.
- Bradley-Terry uses the pairwise structure of the benchmark instead of treating every judged debate as an isolated score.
- Relative judgments are a better fit for LLM judging than absolute score calibration. Asking which side did better on the same motion is usually more stable than asking whether a debate was, say, a 7.8 or an 8.3 on some global scale.
That last point matters in practice. Judges can differ in harshness, scale usage, and topic leniency. A relative decision on the same debate is less exposed to those calibration problems than an absolute score in isolation. For that reason, rubric fields are retained as diagnostics, but the public leaderboard is built from relative outcomes.
So the headline unit is not “one debate,” but “one completed side-swapped matchup on one topic.” That is a better fit for a benchmark meant to compare sustained adversarial performance rather than one-off wins.
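For readers unfamiliar with Bradley-Terry, the sketch below fits strengths from a list of decided matchups using the standard minorization-maximization update, then maps them to a scale centered at 1500. Several things here are assumptions rather than the benchmark's actual procedure: the Elo-style scale factor of 400 / ln(10), the omission of ties and margins, and the requirement that every model has at least one win. Model names are placeholders.

```python
import math
from collections import defaultdict


def fit_bradley_terry(results, iters=500, center=1500.0, scale=400.0 / math.log(10)):
    """results: list of (winner, loser) pairs, one per decided matchup."""
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(i, j)] = comparisons between i and j
    models = set()
    for winner, loser in results:
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(
                n / (strength[i] + strength[j])
                for (a, j), n in games.items()
                if a == i
            )
            new[i] = wins[i] / denom  # MM update; needs wins[i] > 0
        # Strengths are scale-free, so pin the geometric mean at 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}
    return {m: center + scale * math.log(s) for m, s in strength.items()}


results = (
    [("model_x", "model_y")] * 3 + [("model_y", "model_x")]
    + [("model_y", "model_z")] * 3 + [("model_z", "model_y")]
    + [("model_x", "model_z")] * 3 + [("model_z", "model_x")]
)
ratings = fit_bradley_terry(results)
```

Because the fit uses who beat whom rather than per-debate scores, a model's rating depends on the strength of its opponents, which is why the comparison graph needs to be connected for the ratings to be comparable.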
Judge Sanity Checks
The benchmark relies on LLM judges, so it is worth being explicit about the current sanity checks:
- the Bradley-Terry graph is connected
- mean all-bucket cross-judge winner agreement is about 0.58, counting clear wins and ties/noise as separate buckets
- decisive-only cross-judge winner agreement is about 0.85, after dropping cases where a judge put the matchup in the tie/noise bucket
- mean signed-margin correlation between judge pairs is about 0.43
- mean absolute presented-side margin bias by judge is 0.156 on the signed margin scale
- judges' written decisions were easy to read and score in this release
- the current judge roster includes GPT-5.5 (high), Claude Sonnet 4.6 (high), Gemini 3.1 Pro Preview, Qwen 3.6 Max Preview, Grok 4.3, and Kimi K2.6
- the cross-judge agreement heatmaps also include historical overlap where available
- every active judge contributed scored debates in the current status page
This does not make the judges perfect. But it does mean the current snapshot is not obviously being driven by output-format problems or a huge systematic side-presentation bias. Panels are built from distinct model families and avoid same-family judges against the debaters when feasible.
Cross-Judge Agreement
The judge-agreement data are rendered as two companion heatmaps so it is easier to see which evaluators tend to move together and which pairs diverge more often.
The first chart includes all winner buckets: Side A, Side B, and tie/noise-level margins.
Each off-diagonal cell is a judge-pair agreement rate; the number in parentheses is the count of overlapping side-swapped matchup groups. Because ties are included, this chart is sensitive to close debates where one judge calls a narrow win and another calls a tie.
The second chart excludes tie/noise cases and asks the narrower question: when both judges picked a clear winner, how often was it the same winner?
The margin correlation is a separate check. It asks whether two judges' signed margin scores tend to move in the same direction, not just whether their final winner bucket matched. In the current snapshot, the judges agree much more often on clear winners than on all close-or-tie cases, and their margin scores are moderately correlated.
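The three agreement statistics can be sketched for a single judge pair as follows. The tie/noise threshold (`tie_band`) and the sample margins are hypothetical; the benchmark's actual bucket boundaries are not stated here.

```python
from math import sqrt


def bucket(margin, tie_band=0.25):
    """Map a signed margin to a winner bucket; tie_band is an assumed threshold."""
    if margin > tie_band:
        return "A"
    if margin < -tie_band:
        return "B"
    return "tie"


def agreement_rates(margins_1, margins_2):
    pairs = [(bucket(a), bucket(b)) for a, b in zip(margins_1, margins_2)]
    all_rate = sum(x == y for x, y in pairs) / len(pairs)
    # Decisive-only: drop any case where either judge chose the tie/noise bucket.
    decisive = [(x, y) for x, y in pairs if x != "tie" and y != "tie"]
    decisive_rate = (
        sum(x == y for x, y in decisive) / len(decisive) if decisive else None
    )
    return all_rate, decisive_rate


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


judge_1 = [2.0, 0.1, -1.5, 1.0]
judge_2 = [1.0, 0.8, -2.0, -0.1]
all_rate, decisive_rate = agreement_rates(judge_1, judge_2)
margin_corr = pearson(judge_1, judge_2)
```

In this toy data the two judges never pick opposite winners, yet the all-bucket rate is only one half because each judge calls a tie once where the other calls a narrow win. That is the same mechanism that keeps the all-bucket figure below the decisive-only figure.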
Judge identities are shown on the axes with names and brand logos; diagonal cells are intentionally left blank. The full symmetric matrix is shown, so upper-right and lower-left off-diagonal cells mirror each other where overlap exists.
These are sanity-check views, not a second leaderboard. They are there to make evaluator consistency visible rather than bury it inside one summary statistic.
Debate Quality Signal
The benchmark also tracks a judge-side entertainment/readability diagnostic as a secondary signal. It does not affect ratings, but it is useful for checking whether the benchmark produces debates that are merely formal or actually engaging to read.
- mean entertainment across completed matchups shown in the charts: 7.26 / 10
- most entertaining current models by that signal include Claude Opus 4.8 (high), Claude Opus 4.7 (high), Claude Fable 5 (high), Claude Sonnet 4.6 (high), Kimi K2.5 Thinking, Kimi K2.6, Claude Sonnet 4.6 (no reasoning), GLM-5.1, Xiaomi MiMo V2.5 Pro, and GLM-5.2 (max)
High-entertainment matchup examples from the current snapshot:
- Claude Opus 4.8 (high) vs ByteDance Seed2.0 Pro on limiting actuarial fairness in insurance pricing
- Claude Opus 4.7 (high) vs Claude Sonnet 4.6 (high) on privacy-by-design defaults for minors on large social platforms
- Claude Fable 5 (high) vs Claude Opus 4.8 (high) on expanded accountable hot-spot policing
- Claude Opus 4.8 (high) vs Xiaomi MiMo V2.5 Pro on workforce-reserved board seats at large public companies
This signal is diagnostic rather than decisive, but it helps show that the benchmark is producing debates judges generally find readable and engaging. For the model table and example matchups behind this view, see the current entertainment report. That report uses the same model set as the charts, so its charted-matchup average can differ slightly from the overall clean-debate average above.
In the scatter, the x-axis is Bradley-Terry rating, the y-axis is mean entertainment/readability score, bubble size is rated matchup coverage, and grey horizontal bands show rating uncertainty.
Read against the main strength rating, this view separates three cases that a single leaderboard hides: models that are strong and lively, models that are strong but comparatively dry, and models that are readable or vivid without being top-tier debaters. Entertainment still stays diagnostic only; it does not feed the rating.
Best Lines
A separate LLM-assisted review also surfaced lines that are worth quoting in their own right. These examples are not rating inputs and were not curated by humans.
- Encryption backdoors, Claude Sonnet 4.6 (no reasoning): "Children don't disappear in percentages. They disappear one at a time, in exactly these cases."
- Historic-district housing, GPT-5.4 (high): "If preservation wins even there, then it is not stewardship; it is exclusion protected by aesthetics."
- Four-day workweek, Gemini 3.1 Pro Preview: "We do not subsidize cheap goods with exhausted labor."
- Prescription-drug advertising, Claude Opus 4.6 (no reasoning): "You don't build the bridge while the ferry company lobbies to keep its monopoly."
- Homelessness as housing vs policing, Claude Sonnet 4.6 (high): "A city that clears the same encampment twelve times a year is not governing effectively; it is performing governance."
- Medical autonomy vs dignity, Claude Opus 4.6 (high): "A conception of dignity that can be enforced against your will over your own body is just domination with better vocabulary."
- The euro and European solidarity, Qwen3.5-397B-A17B: "Politically, the Euro is not glue; it is acid."
- NDAs and workplace abuse, GPT-5.4 (no reasoning): "That is not a shield for victims. It is a shield against victims."
- Algorithmic dynamic pricing, Qwen3.5-397B-A17B: "You cannot reject a trap you cannot see."
- Brexit and economic drag, GPT-5.4 (high): "If two runners face the same storm and one is also carrying a backpack, the backpack still made him slower."
The full highlights report has more examples from that review.
Content Block Rate
Content blocks reflect a distinct moderation/content-fragility problem rather than simple latency, formatting trouble, or blank outputs. This is not an overall reliability rate; blank outputs, formatting failures, and model-service failures are tracked separately in the current status summary.
Across all debate attempts tracked for this release, Xiaomi MiMo V2 Pro remains the clear outlier with 10 content blocks across 104 tracked attempts. Kimi K2.5 Thinking has 4 content blocks, Grok 4.20 0309 (Non-Reasoning) has 3, Qwen3.5-397B-A17B has 2, and several rows show one each. Current models such as Claude Fable 5 (high), Claude Opus 4.8 (high), GLM-5.2 (max), GPT-5.5 (high), GLM-5.1, Kimi K2.6, Qwen 3.7 Max, Baidu Ernie 5.1, Step 3.7 Flash (high), Grok 4.3, DeepSeek V4 Pro, Xiaomi MiMo V2.5 Pro, and Mistral Medium 3.5 (high) show zero content blocks in the latest status page.
Reliability Diagnostics
Content blocks are only one reliability issue. The reliability views show the broader availability picture: completed debates, content blocks, blank outputs, formatting failures, model-service failures, and other failed debate attempts.
The dumbbell chart keeps the Bradley-Terry rating as the headline quality score and shows how an availability-adjusted score would move when reliability problems are penalized. Longer connectors mean a larger reliability penalty; they do not mean the completed debates were judged worse.
The stacked breakdown is the best view when the question is what kind of reliability issue occurred.
Worked Examples
If you want to jump straight into transcript pairs that are especially worth reading:
- Frontier matchup: Claude Sonnet 4.6 (high) vs GPT-5.4 (high) on banning location-data sales. This is one of the best current top-tier matchups to read because the topic is strong, the execution is strong, and the side swap materially changes which arguments each model gets to press. Mean entertainment across the pair: 8.00 / 10. Read Debate A, Debate B, and the matchup judgment report.
- Clear separation example: GPT-5.4 (high) vs Llama 4 Maverick on forced-sterilization redress. This is a cleaner blowout where the stronger debater stays better as PRO and as CON. Mean entertainment across the pair: 6.25 / 10. Read Debate A, Debate B, and the matchup judgment report.
- High-readability close clash: MiniMax-M2.7 vs Qwen3.5-397B-A17B on algorithmic dynamic pricing. MiniMax wins the current public version of this pair, but the side-swapped rematch is split and the transcript remains a sharp read. Mean entertainment across the pair: 7.75 / 10. Read Debate A, Debate B, and the matchup judgment report.
Claude Sonnet 4.6 (high) vs GPT-5.4 (high) on banning location-data sales
Motion: Governments should prohibit data brokers from selling individuals’ precise location data without explicit, time-limited opt-in consent.
Full transcripts:
- Debate A: Claude Sonnet 4.6 (high) as PRO, GPT-5.4 (high) as CON
- Debate B: GPT-5.4 (high) as PRO, Claude Sonnet 4.6 (high) as CON
- Matchup judgment report
- Judge decision table (search for prop_0541)
Judges in this example: Kimi K2.6 and Qwen 3.6 Max Preview.
Judged result:
- Debate A (Claude PRO / GPT CON): unanimous 2-0 for GPT-5.4 (high), with judge entertainment scores 8 and 8
- Debate B (GPT PRO / Claude CON): unanimous 2-0 for GPT-5.4 (high), with judge entertainment scores 8 and 8
- Across both side assignments: GPT-5.4 (high) won all 4 judge decisions in this snapshot
- Mean entertainment across the full side-swapped pair: 8.00 / 10
- Average absolute judged margin across the four judge rows: 1.1
This is a good example of why the benchmark uses side-swapped relative judgments instead of a one-shot absolute score. The role reversal changes which argument is easiest to press, but GPT-5.4 carries the pair on both assignments in this snapshot.
Debate structure in this benchmark:
- PRO opening
- CON opening
- PRO rebuttal 1
- CON rebuttal 1
- PRO pressure questions
- CON pressure questions
- PRO rebuttal 2
- CON rebuttal 2
- PRO closing
- CON closing
Round-by-round sketch from Debate A (Claude PRO / GPT CON):
- PRO opening: Claude frames precise location as uniquely intimate surveillance data and argues that broker resale turns private life into something strangers can buy.
- CON opening: GPT accepts the privacy harm but attacks the mechanism, arguing that the real target should be abusive downstream use rather than consent paperwork.
- PRO rebuttal 1: Claude tries to make enforceability central, claiming explicit consent creates a clear legal baseline while pure use-based restrictions push everything into after-the-fact reconstruction.
- CON rebuttal 1: GPT’s strongest reply is that a captured click is not meaningful protection. Once the checkbox is obtained, the sale is lawful, so the harms Claude named can still occur.
- PRO pressure questions: Claude presses on who would enforce use-based rules and why the answer to dark-patterned consent is not stronger consent law plus anti-bundling constraints.
- CON pressure questions: GPT targets the motion’s weak point by asking whether “analytics intermediary” carve-outs still allow raw upstream transfers and what “time-limited” consent changes after a sale already happens.
- PRO rebuttal 2: Claude answers that upstream transfers are still covered, the sale must state a specific purpose, and consent should be read as a floor rather than the whole privacy regime.
- CON rebuttal 2: GPT argues those answers concede both porosity and redundancy: if broad labels like analytics are enough, the rule is easy to route around; if harmful-use bans are still needed, those bans are doing the real work.
- PRO closing: Claude reduces the round to the default legal setting: under the proposition, resale is presumptively blocked unless a person explicitly says yes.
- CON closing: GPT closes by reframing the motion as a paperwork-first rule that powerful interfaces can satisfy while the truly dangerous downstream buyers still require separate restrictions.
What changes in the side-swapped rematch (GPT PRO / Claude CON):
- GPT’s PRO case is cleaner when it gets to argue directly for the proposition rather than attack it. The rematch lets it frame the motion as a narrow default shift against invisible resale.
- Claude’s CON case becomes sharper on the idea that consent is a compliance ritual, not a real screen on buyer identity or downstream misuse.
- That role reversal is exactly why the second debate matters. Good models can often argue both sides coherently, but they do not do so equally well.
GPT-5.4 (high) vs Llama 4 Maverick on forced-sterilization redress
Motion: Countries should provide stronger compensation and archival access to survivors and descendants of forced-sterilization programs rather than treat those abuses as closed chapters.
Full transcripts:
- Debate A: GPT-5.4 (high) as PRO, Llama 4 Maverick as CON
- Debate B: Llama 4 Maverick as PRO, GPT-5.4 (high) as CON
- Matchup judgment report
- Judge decision table (search for prop_0214)
Judges in this example: Kimi K2.6 and Qwen 3.6 Max Preview.
Judged result:
- Debate A (GPT PRO / Llama CON): unanimous 2-0 for GPT-5.4 (high), with entertainment scores 7 and 4
- Debate B (Llama PRO / GPT CON): unanimous 2-0 for GPT-5.4 (high), with entertainment scores 7 and 7
- Across both side assignments: GPT-5.4 (high) won all 4 judge rows
- Mean entertainment across the full side-swapped pair: 6.25 / 10
- Average absolute judged margin across the four judge rows: 3.3
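The pair-level summary for this matchup can be reproduced from the four judge rows. The sketch below uses only the winners and entertainment scores reported for this example; the tuple layout is illustrative, and per-row margins are left out because they are not listed here.

```python
from statistics import mean

# (debate, winner, entertainment score) for each of the four judge rows
judge_rows = [
    ("A", "GPT-5.4 (high)", 7),
    ("A", "GPT-5.4 (high)", 4),
    ("B", "GPT-5.4 (high)", 7),
    ("B", "GPT-5.4 (high)", 7),
]

rows_won = sum(winner == "GPT-5.4 (high)" for _, winner, _ in judge_rows)
mean_entertainment = mean(score for _, _, score in judge_rows)
```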
This is a cleaner blowout than the location-data example above. The better debater stays better as PRO and as CON, which is what a real benchmark gap should look like.
Why this one was decisive:
- In Debate A, GPT's PRO case is concrete from the start: the injury is ongoing, compensation is redress rather than charity, and archival access is part of proving what happened rather than just symbolic acknowledgment.
- Llama's CON case is morally sympathetic but more diffuse. It leans on complexity, resource diversion, and re-traumatization concerns without landing an equally sharp mechanism for why compensation and archive access are the wrong response.
- In the rematch, Llama's PRO case is serviceable but generic. GPT's CON case is much more pointed: descendant compensation becomes open-ended, privacy harms from broader file access become concrete, and the administrative line-drawing problem stays central through rebuttal and closing.
- The side swap still matters, but it does not change the ranking. GPT wins both assignments unanimously, so this pair reads much more like a stable separation than a frontier toss-up.
Taken together, the two examples show why the benchmark runs each matchup twice. Sometimes side-swapping reveals a genuinely close contest between elite models. Sometimes it confirms that the stronger debater is simply stronger on either side of the motion.
MiniMax-M2.7 vs Qwen3.5-397B-A17B on personalized dynamic pricing
Motion: Retailers should be banned from using personalized algorithmic dynamic pricing based on a customer's perceived willingness or ability to pay.
Full transcripts:
- Debate A: MiniMax-M2.7 as PRO, Qwen3.5-397B-A17B as CON
- Debate B: Qwen3.5-397B-A17B as PRO, MiniMax-M2.7 as CON
- Matchup judgment report
- Judge decision table (search for prop_0041)
Judged result:
- Debate A (MiniMax PRO / Qwen CON): unanimous 2-0 for MiniMax-M2.7, with judge entertainment scores 8 and 8
- Debate B (Qwen PRO / MiniMax CON): split 1-1, with judge entertainment scores 7 and 8
- Across both side assignments: MiniMax-M2.7 won 3 of 4 judge rows overall
- Mean entertainment across the full side-swapped pair: 7.75 / 10
- Mean signed normalized margin for MiniMax-M2.7: +0.90
Why this one is worth reading:
- The same core clash appears from both sides: whether personalized pricing is vulnerability-based extraction or a way to discount for price-sensitive buyers.
- MiniMax's strongest PRO move is the distinction between transparent opt-in discounts and hidden profiling. It argues that revenue optimization is not a poverty program, even when some users receive lower prices.
- Qwen's strongest PRO move in the rematch is sharper and more compact: personalized pricing makes the public price private, so comparison shopping breaks at the exact moment consumers most need it.
- MiniMax's CON case is also more precise in the rematch. It argues that the real injury is non-consensual data profiling, not the price response itself, and that privacy and consumer-protection rules can target that harm without banning legitimate discounts.
- The result is a good example of a high-entertainment clash where one side wins overall but the losing model still finds live pressure points in the rematch.
Method Summary
Topics
The benchmark draws from a large topic bank intended to be understandable to an informed generalist while still varied enough to expose real differences between models. Topics are not limited to the safest generic policy questions; the set includes empirical, normative, geopolitical, technological, and social disputes.
The topic bank is intentionally broad. That matters because debate performance can be very topic-sensitive, and a narrow topic family would make it too easy for models to overfit to one style of argument.
The current topic bank contains 683 topics. The latest release tracks 1,608 side-swapped matchups, with 1,569 complete matchups used for the public ratings. Those rated matchups cover 678 distinct topics.
Top-level topic coverage:
| Theme | Topic bank | Topics with debates | Topics used for ratings |
|---|---|---|---|
| Law / regulation / courts | 135 | 135 | 133 |
| Labor / education / social policy | 122 | 122 | 120 |
| Media / culture / internet | 111 | 111 | 110 |
| Macro / trade / industrial policy | 108 | 108 | 106 |
| Health / bioethics | 65 | 65 | 65 |
| Energy / climate / infrastructure | 49 | 49 | 49 |
| Science / space / frontier tech | 34 | 34 | 33 |
| Business / antitrust / market structure | 28 | 28 | 28 |
| Geopolitics / defense / security | 24 | 24 | 24 |
| AI / tech policy | 7 | 7 | 7 |
Question-type coverage:
| Question type | Topic bank | Topics with debates | Topics used for ratings |
|---|---|---|---|
| mixed | 466 | 466 | 460 |
| normative | 151 | 151 | 150 |
| empirical | 66 | 66 | 65 |
How each debate runs
For a selected model pair and topic:
- The two models debate the proposition in a multi-turn format.
- The same pair then debates the same proposition again with the sides reversed.
- Both full debate transcripts are saved.
Each debate uses a 10-turn structure: PRO opening, CON opening, PRO rebuttal 1, CON rebuttal 1, PRO pressure questions, CON pressure questions, PRO rebuttal 2, CON rebuttal 2, PRO closing, and CON closing.
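The turn order and the side swap can be written down compactly. In this sketch the function name and tuple layout are illustrative; only the phase names and the PRO-then-CON ordering come from the structure described above.

```python
PHASES = ["opening", "rebuttal 1", "pressure questions", "rebuttal 2", "closing"]
SIDES = ["PRO", "CON"]


def turn_schedule(pro_model, con_model):
    """Return the 10 turns of one debate as (side, phase, speaker) tuples."""
    speakers = {"PRO": pro_model, "CON": con_model}
    # PRO speaks first in every phase, then CON.
    return [(side, phase, speakers[side]) for phase in PHASES for side in SIDES]


# One side-swapped matchup: the same pair, with roles reversed in the rematch.
debate_a = turn_schedule("model_x", "model_y")
debate_b = turn_schedule("model_y", "model_x")
```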
Judging
Each completed debate is intended to be judged by a three-model panel. Judges pick a winner, estimate the margin, and score a few diagnostic qualities. Those diagnostic sub-scores are useful for interpretation, but the main ranking comes from who won the side-swapped matchup, not from averaging rubric categories into the leaderboard. Panels are constructed from three distinct model families and, when feasible, avoid same-family judges against the debaters.
The current judge roster in this snapshot is drawn from GPT-5.5 (high), Claude Sonnet 4.6 (high), Gemini 3.1 Pro Preview, Qwen 3.6 Max Preview, Grok 4.3, and Kimi K2.6.
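One way to read the panel rule is sketched below: prefer judges whose family matches neither debater, and never seat two judges from the same family. The family labels, the random tie-breaking, and the function itself are assumptions for illustration; the benchmark's actual selection logic is not described here.

```python
import random

# Roster from this snapshot; family labels are assumed for illustration.
JUDGE_FAMILIES = {
    "GPT-5.5 (high)": "openai",
    "Claude Sonnet 4.6 (high)": "anthropic",
    "Gemini 3.1 Pro Preview": "google",
    "Qwen 3.6 Max Preview": "alibaba",
    "Grok 4.3": "xai",
    "Kimi K2.6": "moonshot",
}


def pick_panel(debater_families, rng, size=3):
    """Pick judges from distinct families, avoiding debater families when feasible."""
    judges = list(JUDGE_FAMILIES)
    rng.shuffle(judges)
    # Stable sort: judges unrelated to the debaters come first, shuffle order kept.
    judges.sort(key=lambda j: JUDGE_FAMILIES[j] in debater_families)
    panel, used = [], set()
    for judge in judges:
        family = JUDGE_FAMILIES[judge]
        if family not in used:
            panel.append(judge)
            used.add(family)
        if len(panel) == size:
            break
    return panel


panel = pick_panel({"anthropic", "openai"}, random.Random(0))
panel_families = [JUDGE_FAMILIES[j] for j in panel]
```

With the current roster holding one judge per family, the distinct-family constraint is always satisfiable; the same-family fallback would only come into play if fewer than three unrelated judges were available.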
Limits and caveats
- This is still a live benchmark, not a frozen final release.
- It uses LLM judges, not human judges, though the design reduces noise with side swaps, multiple judges, saved judge decisions, and agreement checks.
- Some models are affected meaningfully by availability and content-filter behavior, which is why reliability is tracked alongside quality.
- Debate is only one way to evaluate a model. It is a rich one, but it is not the whole story about model usefulness.
- Counts can differ slightly across reports because some views include all attempted debates while the leaderboard includes only completed side-swapped matchups.
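The gap between attempted debates and rated matchups can be illustrated with a small filter. This is a sketch under stated assumptions: the record fields (`pro`, `con`, `topic`, `completed`) and the function `completed_matchups` are hypothetical and do not reflect the benchmark's real data schema.

```python
from collections import defaultdict


def completed_matchups(debates):
    """Return the (pair, topic) keys where both side assignments completed.

    `debates` is an iterable of dicts with hypothetical keys:
    "pro", "con", "topic", and "completed".
    """
    pro_sides_seen = defaultdict(set)
    for debate in debates:
        if not debate["completed"]:
            continue  # attempted but unfinished debates never count
        key = (frozenset((debate["pro"], debate["con"])), debate["topic"])
        pro_sides_seen[key].add(debate["pro"])
    # A matchup is usable only when each model has argued PRO once.
    return [key for key, pros in pro_sides_seen.items() if len(pros) == 2]


if __name__ == "__main__":
    attempts = [
        {"pro": "A", "con": "B", "topic": "t1", "completed": True},
        {"pro": "B", "con": "A", "topic": "t1", "completed": True},
        {"pro": "A", "con": "C", "topic": "t1", "completed": True},
        {"pro": "C", "con": "A", "topic": "t1", "completed": False},
    ]
    usable = completed_matchups(attempts)
    print(len(attempts), "attempted debates,", len(usable), "rated matchup")
```

A view that counts all four attempts and a view that counts only the one usable matchup (two debates) are both correct, which is the kind of difference the last caveat describes.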
Taken together, this benchmark measures which models currently look strongest at sustained, adversarial, multi-turn argumentation under this setup.
Qualitative Readings
The qualitative readings add a curated set of transcript-driven writeups on top of the scored results. They focus on strong side-swapped head-to-head pairs, combining close reads of the debates with matchup summaries and a comparison across pairs.
The available set covers 19 selected side-swapped groups, 38 debates, and 5 models. It is there to show how wins happen: recurring style differences, win conditions, pressure-round usage, and places where the transcripts are more nuanced than the headline result.
Claude Opus 4.7 stands out in that set for repeatedly narrowing each debate to one clean decision point and then closing on it. Against GPT-5.4, Gemini, Grok, and Kimi, the recurring pattern is early hinge selection followed by strong pressure conversion: "commitment versus silence" on support-lifespan labeling, upstream influence versus reactive alternatives on worker board seats, EMTALA-style administrability on clinicians refusing care, and what actually counts as diversification in Gulf-state development.
Read the qualitative comparison report, matchup summaries, and comparison overview.
Further Reading And Data
- Current leaderboard table
- Current status summary
- Current model profiles
- Current model deep dives
- Qualitative comparison report
- Qualitative comparison summaries
- Qualitative comparison overview
- Current entertainment report
- Current matchup results index
- Highlights report
- Current Bradley-Terry chart
- Current content-block-rate chart
- Current strength-vs-reliability chart
- Current reliability-breakdown chart
- Current price-vs-performance chart
- Current pairwise heatmap
- Current all-bucket judge-agreement heatmap
- Current decisive-only judge-agreement heatmap
- Current strength-vs-entertainment chart
- Completed public debate transcripts
- Current judge decision table
- Full judge decision records
Related Benchmarks
- LLM Sycophancy Benchmark — opposite-narrator contradictions and narrator-following bias
- LLM Thematic Generalization Benchmark — latent-category induction from examples and counterexamples
- LLM Creative Story-Writing Benchmark — short-story quality under fixed required elements
- BAZAAR: Auction Market Benchmark — strategic bidding in a competitive simulated market
- Buyout Game Benchmark — multi-agent bargaining, transfers, and hostile takeovers under explicit financial incentives
- PACT — multi-round buyer-seller bargaining with hidden values, public messages, and carried-forward bids and asks
- LLM Persuasion Benchmark — multi-turn persuasion measured by how much one model moves another model’s stated position
- LLM Round-Trip Translation Benchmark — meaning and voice retained after translating out of English and back
- Step Race: Collaboration vs. Misdirection Under Pressure — multi-agent public conversation before private move selection
- Elimination Game: Social Reasoning and Deception in Multi-Agent LLMs — alliance formation, deception, and jury persuasion
- Extended NYT Connections — harder category induction with extra distractor words
Updates
- 2026-06-20: Added GLM-5.2 (max), refreshed the rating charts, regenerated debate-cost estimates, and updated the public snapshot to 35 chart-visible models.
- 2026-06-10: Added Claude Fable 5 (high) to the public board.
- 2026-06-05: Claude Opus 4.8 (high), Qwen 3.7 Max, Baidu Ernie 5.1, and Step 3.7 Flash (high) added.
- 2026-05-19: Added Gemini 3.5 Flash.
- 2026-05-04: Updated the public ratings, bringing the leaderboard to 29 rated models with GPT-5.5, GLM-5.1, Kimi K2.6, Xiaomi MiMo V2.5 Pro, Qwen 3.6 Max Preview, DeepSeek V4 Pro, Tencent Hy3 Preview (high), Grok 4.3, and Mistral Medium 3.5 (high) added to the public board.
- 2026-04-20: Added Claude Opus 4.7 (high) to the published board, bringing the leaderboard to 22 rated models. Qualitative readings added.
- 2026-03-22: First release with 21 rated models: Claude Sonnet 4.6 (high), GPT-5.4 (high), Claude Opus 4.6 (high), Claude Sonnet 4.6 (no reasoning), Gemini 3.1 Pro Preview, GLM-5, Claude Opus 4.6 (no reasoning), Kimi K2.5 Thinking, GPT-5.4 (no reasoning), ByteDance Seed2.0 Pro, MiniMax-M2.7, Grok 4.20 Beta 0309 (Reasoning), Qwen3.5-397B-A17B, Grok 4.20 Beta 0309 (Non-Reasoning), Xiaomi MiMo V2 Pro, DeepSeek V3.2, Gemini 3.1 Flash-Lite Preview, GPT-OSS-120B, Baidu Ernie 5.0, Mistral Large 3, and Llama 4 Maverick.








