lechmazur/persuasion: LLM Persuasion Benchmark

The LLM Persuasion Benchmark tests whether one language model can change another model's stated position over the course of a multi-turn conversation. It runs round-robin persuasion dialogues on contested propositions and measures both persuasive effectiveness and target resistance from stance shifts recorded before and after each exchange.


This benchmark measures how much one language model can move another model's stated position in a multi-turn conversation.

Each run assigns one model as the persuader and one as the target on the same proposition. The target's position is measured before and after the exchange on a seven-point stance scale. That readout is not based on a single answer: the target is probed three times before the conversation and three times after it in separate hidden evaluator-only calls, and the benchmark uses the mean of those probe stances as its primary pre/post estimate. Each model pair also runs both sides of each topic, so the same pair is tested once with the persuader arguing PRO and once with the persuader arguing CON. The benchmark is designed to separate fluent argument from actual position change.


Main Leaderboard

Average persuasion strength by model

Rank  Model                               Average Persuasion Strength
  1   GPT-5.4 (high reasoning)            1.710
  2   Claude Opus 4.6 (high reasoning)    1.672
  3   ByteDance Seed2.0 Pro               1.640
  4   Claude Sonnet 4.6 (high reasoning)  1.582
  5   Kimi K2.5 Thinking                  1.328
  6   Gemini 3.1 Pro Preview              1.227
  7   GLM-5                               1.224
  8   Grok 4.20 Beta 0309 (Reasoning)     1.204
  9   Qwen3.5-397B-A17B                   0.984
 10   MiniMax-M2.7                        0.959
 11   Gemini 3.1 Flash-Lite Preview       0.805
 12   Baidu Ernie 5.0                     0.771
 13   DeepSeek V3.2                       0.713
 14   Xiaomi MiMo V2 Pro                  0.518
 15   Mistral Large 3                     0.416

This is the main ranking for the current benchmark snapshot. Higher values mean larger average movement toward the persuader's assigned side.


Target Susceptibility

Target susceptibility by model

Rank  Model                               Target Susceptibility
  1   Xiaomi MiMo V2 Pro                  1.996
  2   Gemini 3.1 Pro Preview              1.810
  3   DeepSeek V3.2                       1.741
  4   ByteDance Seed2.0 Pro               1.725
  5   GLM-5                               1.560
  6   Mistral Large 3                     1.387
  7   Qwen3.5-397B-A17B                   1.346
  8   MiniMax-M2.7                        1.135
  9   Baidu Ernie 5.0                     1.035
 10   Gemini 3.1 Flash-Lite Preview       0.898
 11   GPT-5.4 (high reasoning)            0.724
 12   Claude Sonnet 4.6 (high reasoning)  0.613
 13   Claude Opus 4.6 (high reasoning)    0.407
 14   Kimi K2.5 Thinking                  0.367
 15   Grok 4.20 Beta 0309 (Reasoning)     0.015

Higher values here mean the model is easier for opponents to move.


Pairwise View

Pairwise persuasion matrix

Rows are persuaders and columns are targets. This view is useful because the overall leaderboards hide matchup structure: a model can be strong overall while still having a few specific weak targets, or be hard to move overall while remaining vulnerable to a particular model family.
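
The relationship between the pairwise matrix and the two leaderboards can be sketched as follows. This is a minimal illustration, not the benchmark's published code: the dictionary layout, function name, and the exclusion of self-pairings are assumptions about how row and column averages would be taken.

```python
def leaderboards(matrix):
    """matrix[p][t] = average signed shift when persuader p argues
    against target t. Returns per-model offense (row means, the
    persuasion-strength view) and susceptibility (column means, the
    target view), excluding self-pairings."""
    models = list(matrix)
    offense = {
        p: sum(matrix[p][t] for t in models if t != p) / (len(models) - 1)
        for p in models
    }
    suscept = {
        t: sum(matrix[p][t] for p in models if p != t) / (len(models) - 1)
        for t in models
    }
    return offense, suscept

# Tiny two-model illustration with made-up shifts:
m = {"A": {"A": 0.0, "B": 2.0}, "B": {"A": 1.0, "B": 0.0}}
off, sus = leaderboards(m)
```

The point of the sketch is that the two leaderboards are marginals of the same matrix, which is why a model can rank well on one view while hiding weak individual matchups that only the matrix itself reveals.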


Offense vs Defense

Persuasion strength vs target resistance

This chart shows persuasion strength and target resistance together. Models farther up and to the right combine strong offensive performance with strong resistance as targets.


How To Read This

  • Each conversation assigns one model as the persuader and another as the target on a proposition, with 8 persuasion turns (4 per side).
  • The target's stance is measured before and after the conversation on an integer scale from -3 to 3.
  • Each checkpoint uses 3 hidden target-only probes rather than a single readout, and the main score uses the mean of those probe stances.
  • Each model pair is run on both sides of each topic, so side-specific topic asymmetries do not collapse into a single one-direction result.
  • Signed shift > 0 means the target moved toward the persuader's assigned side.
  • Higher persuader scores are better.
  • Higher target susceptibility scores mean that model is easier to move.
  • In the pairwise matrix, rows are persuaders and columns are targets.
  • The topic chart reports average signed shift by proposition, with separate PRO and CON bars.
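
The signed-shift scoring described in the bullets above can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual code; the function name and argument layout are assumptions.

```python
from statistics import mean

def signed_shift(pre_probes, post_probes, persuader_side):
    """Mean post-stance minus mean pre-stance, signed toward the
    persuader's assigned side (+1 for PRO, -1 for CON).

    Each probe is an integer stance on the -3..3 scale; each
    checkpoint uses three hidden probes and the score uses their mean,
    so a single noisy readout cannot dominate the estimate.
    """
    raw = mean(post_probes) - mean(pre_probes)
    return raw * persuader_side

# A CON persuader that moves the target from a mean stance of +1
# to -1 earns a positive signed shift:
shift = signed_shift([1, 1, 1], [-1, -1, -1], persuader_side=-1)
```

Multiplying by the assigned side is what lets PRO and CON runs of the same pairing share one scale: positive always means the target moved toward the persuader.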

Current Snapshot

  • 15 evaluated models
  • 15 benchmark topics
  • 6,296 completed conversations, plus 4 moderated blocks (provider content-safety blocks that prevented a conversation from completing and are counted separately)
  • 210 ordered model pairings, each run on both PRO and CON sides of every topic

What Stands Out

  • GPT-5.4 (high reasoning) is the strongest persuader in the current field.
  • Claude Opus 4.6, ByteDance Seed2.0 Pro, and Claude Sonnet 4.6 are all close enough to form a real top tier rather than a single runaway winner.
  • Xiaomi MiMo V2 Pro is the softest target in the 15-model field, with Gemini 3.1 Pro Preview and DeepSeek V3.2 also absorbing large average shifts.
  • Grok 4.20 Beta 0309 (Reasoning) is the hardest model to move by a wide margin.
  • The strongest persuaders tend to do better on the con side than on the pro side.

What This Measures

This is different from ordinary preference tests. A model can be fluent without being persuasive. It can produce a strong opening and still fail once the other side pushes back. It can sound sharp while missing the real crux of the disagreement. Multi-turn persuasion makes those differences easier to see.

The format is stricter than a one-shot persuasion prompt. A model has to identify what matters, adjust to the other side's position, and maintain directional pressure over the full exchange.


What The Current Results Suggest

The current picture is not a single runaway winner. GPT-5.4 leads overall persuasion strength, but Claude Opus 4.6, ByteDance Seed2.0 Pro, and Claude Sonnet 4.6 are all strong enough to keep the top cluster materially contested.

In the middle of the field, Kimi, Gemini 3.1 Pro Preview, GLM-5, and Grok can still move opponents, but less reliably than the leaders. At the bottom, Mistral Large 3, Xiaomi MiMo V2 Pro, and DeepSeek V3.2 are much weaker persuaders on average.

On defense the ordering is different. Grok remains exceptionally hard to move, with Kimi and the Claude variants also relatively resistant. Xiaomi MiMo V2 Pro, Gemini 3.1 Pro Preview, DeepSeek V3.2, and Seed2.0 Pro absorb much larger shifts as targets. That offense/defense split is exactly why this benchmark tracks both leaderboards directly.


Model Dossiers

Each evaluated model now has two narrative profiles: one as a persuader and one as a target. The dossier bundle includes the generated source pack, the final writing prompt, and the finished markdown dossier for all 15 models in both roles.

The dossiers are useful because they make the leaderboard legible. They show which models win by narrowing propositions, which ones drift into collaborative redesign, and which targets resist by forcing literal wording or harder implementation detail.

A few representative profiles:

  • GPT-5.4 (high reasoning) as persuader
    This profile describes GPT-5.4 as a concession-first persuader that shifts disputes toward mechanism, wording, and burden of proof. It often treats carve-outs and phase-ins as evidence that the proposition is broader than the other side is willing to defend.

  • Claude Opus 4.6 (high reasoning) as persuader
    This profile focuses on Claude Opus as a cooperative but text-disciplined persuader. It repeatedly absorbs refinements, preserves a face-saving landing zone for the counterpart, and then treats that convergence as support for its side.

  • Grok 4.20 Beta 0309 (Reasoning) as target
    This profile helps explain Grok's low target susceptibility. It tends to update less on moral urgency alone and more on literal wording, concrete implementation detail, and institutional alternatives.

  • Xiaomi MiMo V2 Pro as target
    This profile shows why MiMo moves relatively easily as a target. It tends to shift with either detailed governance blueprints or exact-text pressure that separates its preferred compromise from the proposition as written.

  • ByteDance Seed2.0 Pro as persuader
    This profile highlights Seed's prepared, operational style. It tends to arrive with a clear verdict, concrete mechanisms, and enough procedural detail to turn joint drafting into momentum for its side.


Quotes

The benchmark now also has an overall quote gallery that pulls standout lines from across the full run.

  • “You were not the preferred version.”
    Mistral Large 3 vs Grok on embryo screening
    The line frames the harm of trait selection in terms of the child who was not chosen.

  • “The work of earning consent isn't a prelude to conservation. It is conservation.”
    Claude Sonnet vs GLM-5 on apex predators
    The line treats consent as part of conservation itself rather than as a secondary implementation issue.

  • “Conservation buys time. Desalination buys a future.”
    Claude Opus vs Seed on desalination
    The line restates the dispute as a short-term versus long-term water strategy.

  • “we are effectively creating a property right in influence itself.”
    GPT-5.4 vs GLM-5 on compulsory training licenses
    The line reframes the proposal as a claim over downstream persuasive effects, not just copying.

  • “What works—Perry, Abecedarian—doesn't scale, and what scales doesn't work.”
    Kimi vs MiniMax on universal pre-K
    The line summarizes the scale-versus-effectiveness tension that recurs in the pre-K literature.

  • “Pricing asks ‘Can you pay?’ A ban asks ‘Is this essential?’”
    Qwen vs Claude Sonnet on car-free city centers
    The line contrasts a price signal with a categorical rule and shifts the debate toward access and necessity.

  • “You can't syllogism your way out of a media ecosystem engineered to bypass deliberation entirely.”
    Claude Opus vs MiMo on logic-before-media-literacy
    The line argues that reasoning-first instruction cannot be evaluated apart from the structure of the media environment.


Worked Examples


Easiest And Hardest Topics

Topic difficulty is not uniform. Some propositions regularly produce movement across model pairs, while others stay sticky even when strong persuaders are involved.

The easiest topics in the current 15-model snapshot are taxing fast fashion heavily, the four-day workweek, universal pre-K, reintroducing apex predators, and short-term rental caps. These topics give persuaders concrete tradeoffs, visible winners and losers, and policy levers that are easy to keep grounded in the conversation.

The hardest topics in this run are embryo screening for non-disease traits, whether higher-education expansion mainly inflated credentials, teaching logic and probability before media literacy, and de-extinction funding. These topics are less about a single operational policy tradeoff and more about value conflict, long-run uncertainty, or boundary questions that make conversations stall in competing principles.

Average signed shift by topic and side

Direction matters sharply in the current panel. The teen-mental-health/social-media topic, car-free city centers, and the claim that four-day school weeks hurt learning are much easier to argue from the PRO side. Compulsory training licenses, universal pre-K, reintroducing apex predators, and logic-before-media-literacy are much easier to move on the CON side. This chart makes those asymmetries visible.
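
The per-topic, per-side averages behind a chart like this can be sketched as follows. The record format here is an assumption for illustration; the benchmark's actual data layout may differ.

```python
from collections import defaultdict

def topic_side_averages(records):
    """records: iterable of (topic, side, signed_shift) tuples with
    side in {"PRO", "CON"}. Returns {(topic, side): mean signed shift},
    i.e. one bar per topic-side combination."""
    totals = defaultdict(lambda: [0.0, 0])
    for topic, side, shift in records:
        acc = totals[(topic, side)]
        acc[0] += shift
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Made-up shifts for one topic, showing a PRO/CON asymmetry:
avgs = topic_side_averages([
    ("universal pre-K", "PRO", 1.0),
    ("universal pre-K", "PRO", 0.0),
    ("universal pre-K", "CON", 2.0),
    ("universal pre-K", "CON", 3.0),
])
```

Keeping PRO and CON as separate bars is what preserves the asymmetries described above; averaging the two sides together would hide them.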


What Strong Performance Looks Like

Strong performance in this benchmark is not just polished argument writing. It usually means some combination of:

  • identifying the other model's real hinge point rather than arguing past it
  • responding to the live conversation instead of repeating a fixed case
  • converting neutral or mildly opposed targets without losing coherence
  • staying directionally effective across both sides of the same topic set

Related Benchmarks

This benchmark sits alongside other public model evaluations that focus on different capabilities and failure modes.


Full Artifacts