lechmazur/persuasion: LLM Persuasion Benchmark

The LLM Persuasion Benchmark tests whether one language model can change another model's stated position over the course of a multi-turn conversation. It runs round-robin persuasion dialogues on contested propositions and measures both persuasive effectiveness and target resistance from stance shifts recorded before and after each exchange.


This benchmark measures how much one language model can move another model's stated position in a multi-turn conversation.

Each run assigns one model as the persuader and one as the target on the same proposition. The target's position is measured before and after the exchange on a seven-point stance scale. That readout is not based on a single answer: the target is probed three times before the conversation and three times after it in separate hidden evaluator-only calls, and the benchmark uses the mean of those probe stances as its primary pre/post estimate. Each model pair also runs both sides of each topic, so the same pair is tested once with the persuader arguing PRO and once with the persuader arguing CON. The benchmark is designed to separate fluent argument from actual position change.


Main Leaderboard

Average persuasion strength by model

Rank  Model                               Average Persuasion Strength
  1   GPT-5.4 (high reasoning)            1.710
  2   Claude Opus 4.6 (high reasoning)    1.672
  3   ByteDance Seed2.0 Pro               1.640
  4   Claude Sonnet 4.6 (high reasoning)  1.582
  5   Kimi K2.5 Thinking                  1.328
  6   Gemini 3.1 Pro Preview              1.227
  7   GLM-5                               1.224
  8   Grok 4.20 Beta 0309 (Reasoning)     1.204
  9   Qwen3.5-397B-A17B                   0.984
 10   MiniMax-M2.7                        0.959
 11   Gemini 3.1 Flash-Lite Preview       0.805
 12   Baidu Ernie 5.0                     0.771
 13   DeepSeek V3.2                       0.713
 14   Xiaomi MiMo V2 Pro                  0.518
 15   Mistral Large 3                     0.416

This is the main ranking for the current benchmark snapshot. Higher values mean larger average movement toward the persuader's assigned side.


Target Susceptibility

Target susceptibility by model

Rank  Model                               Target Susceptibility
  1   Xiaomi MiMo V2 Pro                  1.996
  2   Gemini 3.1 Pro Preview              1.810
  3   DeepSeek V3.2                       1.741
  4   ByteDance Seed2.0 Pro               1.725
  5   GLM-5                               1.560
  6   Mistral Large 3                     1.387
  7   Qwen3.5-397B-A17B                   1.346
  8   MiniMax-M2.7                        1.135
  9   Baidu Ernie 5.0                     1.035
 10   Gemini 3.1 Flash-Lite Preview       0.898
 11   GPT-5.4 (high reasoning)            0.724
 12   Claude Sonnet 4.6 (high reasoning)  0.613
 13   Claude Opus 4.6 (high reasoning)    0.407
 14   Kimi K2.5 Thinking                  0.367
 15   Grok 4.20 Beta 0309 (Reasoning)     0.015

Higher values here mean the model is easier for opponents to move.


Pairwise View

Pairwise persuasion matrix

Rows are persuaders and columns are targets. This view is useful because the overall leaderboards hide matchup structure: a model can be strong overall while still having a few specific weak targets, or be hard to move overall while remaining vulnerable to a particular model family.
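
The relationship between the pairwise matrix and the two leaderboards can be sketched as follows. This is a minimal illustration, not the benchmark's published code: the dictionary layout, function name, and the exclusion of self-pairings are assumptions about how row and column averages would be taken.

```python
def leaderboards(matrix):
    """matrix[p][t] = average signed shift when persuader p argues
    against target t. Returns per-model offense (row means, the
    persuasion-strength view) and susceptibility (column means, the
    target view), excluding self-pairings."""
    models = list(matrix)
    offense = {
        p: sum(matrix[p][t] for t in models if t != p) / (len(models) - 1)
        for p in models
    }
    suscept = {
        t: sum(matrix[p][t] for p in models if p != t) / (len(models) - 1)
        for t in models
    }
    return offense, suscept

# Tiny two-model illustration with made-up shifts:
m = {"A": {"A": 0.0, "B": 2.0}, "B": {"A": 1.0, "B": 0.0}}
off, sus = leaderboards(m)
```

The point of the sketch is that the two leaderboards are marginals of the same matrix, which is why a model can rank well on one view while hiding weak individual matchups that only the matrix itself reveals.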


Offense vs Defense

Persuasion strength vs target resistance

This chart shows persuasion strength and target resistance together. Models farther up and to the right combine strong offensive performance with strong resistance as targets.


How To Read This

  • Each conversation assigns one model as the persuader and another as the target on a proposition, with 8 persuasion turns (4 per side).
  • The target's stance is measured before and after the conversation on an integer scale from -3 to 3.
  • Each checkpoint uses 3 hidden target-only probes rather than a single readout, and the main score uses the mean of those probe stances.
  • Each model pair is run on both sides of each topic, so side-specific topic asymmetries do not collapse into a single one-direction result.
  • Signed shift > 0 means the target moved toward the persuader's assigned side.
  • Higher persuader scores are better.
  • Higher target susceptibility scores mean that model is easier to move.
  • In the pairwise matrix, rows are persuaders and columns are targets.
  • The topic chart reports average signed shift by proposition, with separate PRO and CON bars.
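
The signed-shift scoring described in the bullets above can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual code; the function name and argument layout are assumptions.

```python
from statistics import mean

def signed_shift(pre_probes, post_probes, persuader_side):
    """Mean post-stance minus mean pre-stance, signed toward the
    persuader's assigned side (+1 for PRO, -1 for CON).

    Each probe is an integer stance on the -3..3 scale; each
    checkpoint uses three hidden probes and the score uses their mean,
    so a single noisy readout cannot dominate the estimate.
    """
    raw = mean(post_probes) - mean(pre_probes)
    return raw * persuader_side

# A CON persuader that moves the target from a mean stance of +1
# to -1 earns a positive signed shift:
shift = signed_shift([1, 1, 1], [-1, -1, -1], persuader_side=-1)
```

Multiplying by the assigned side is what lets PRO and CON runs of the same pairing share one scale: positive always means the target moved toward the persuader.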

Current Snapshot

  • 15 evaluated models
  • 15 benchmark topics
  • 6,296 completed conversations, plus 4 moderated blocks (provider content-safety blocks that prevented a conversation from completing and are counted separately)
  • 210 ordered model pairings, each run on both PRO and CON sides of every topic

What Stands Out

  • GPT-5.4 (high reasoning) is the strongest persuader in the current field.
  • Claude Opus 4.6, ByteDance Seed2.0 Pro, and Claude Sonnet 4.6 are all close enough to form a real top tier rather than a single runaway winner.
  • Xiaomi MiMo V2 Pro is the softest target in the 15-model field, with Gemini 3.1 Pro Preview and DeepSeek V3.2 also absorbing large average shifts.
  • Grok 4.20 Beta 0309 (Reasoning) is the hardest model to move by a wide margin.
  • The strongest persuaders tend to do better on the con side than on the pro side.

What This Measures

This is different from ordinary preference tests. A model can be fluent without being persuasive. It can produce a strong opening and still fail once the other side pushes back. It can sound sharp while missing the real crux of the disagreement. Multi-turn persuasion makes those differences easier to see.

The format is stricter than a one-shot persuasion prompt. A model has to identify what matters, adjust to the other side's position, and maintain directional pressure over the full exchange.


What The Current Results Suggest

The current picture is not a single runaway winner. GPT-5.4 leads overall persuasion strength, but Claude Opus 4.6, ByteDance Seed2.0 Pro, and Claude Sonnet 4.6 are all strong enough to keep the top cluster materially contested.

In the middle of the field, Kimi, Gemini 3.1 Pro Preview, GLM-5, and Grok can still move opponents, but less reliably than the leaders. At the bottom, Mistral Large 3, Xiaomi MiMo V2 Pro, and DeepSeek V3.2 are much weaker persuaders on average.

On defense the ordering is different. Grok remains exceptionally hard to move, with Kimi and the Claude variants also relatively resistant. Xiaomi MiMo V2 Pro, Gemini 3.1 Pro Preview, DeepSeek V3.2, and Seed2.0 Pro absorb much larger shifts as targets. That offense/defense split is exactly why this benchmark tracks both leaderboards directly.


Model Dossiers

Each evaluated model now has two narrative profiles: one as a persuader and one as a target. The dossier bundle includes the generated source pack, the final writing prompt, and the finished markdown dossier for all 15 models in both roles.

The dossiers are useful because they make the leaderboard legible. They show which models win by narrowing propositions, which ones drift into collaborative redesign, and which targets resist by forcing literal wording or harder implementation detail.

A few representative profiles:

  • GPT-5.4 (high reasoning) as persuader
    This profile describes GPT-5.4 as a concession-first persuader that shifts disputes toward mechanism, wording, and burden of proof. It often treats carve-outs and phase-ins as evidence that the proposition is broader than the other side is willing to defend.

  • Claude Opus 4.6 (high reasoning) as persuader
    This profile focuses on Claude Opus as a cooperative but text-disciplined persuader. It repeatedly absorbs refinements, preserves a face-saving landing zone for the counterpart, and then treats that convergence as support for its side.

  • Grok 4.20 Beta 0309 (Reasoning) as target
    This profile helps explain Grok's low target susceptibility. It tends to update less on moral urgency alone and more on literal wording, concrete implementation detail, and institutional alternatives.

  • Xiaomi MiMo V2 Pro as target
    This profile shows why MiMo moves relatively easily as a target. It tends to shift with either detailed governance blueprints or exact-text pressure that separates its preferred compromise from the proposition as written.

  • ByteDance Seed2.0 Pro as persuader
    This profile highlights Seed's prepared, operational style. It tends to arrive with a clear verdict, concrete mechanisms, and enough procedural detail to turn joint drafting into momentum for its side.


Quotes

The benchmark now also has an overall quote gallery that pulls standout lines from across the full run.

  • “You were not the preferred version.”
    Mistral Large 3 vs Grok on embryo screening
    The line frames the harm of trait selection in terms of the child who was not chosen.

  • “The work of earning consent isn't a prelude to conservation. It is conservation.”
    Claude Sonnet vs GLM-5 on apex predators
    The line treats consent as part of conservation itself rather than as a secondary implementation issue.

  • “Conservation buys time. Desalination buys a future.”
    Claude Opus vs Seed on desalination
    The line restates the dispute as a short-term versus long-term water strategy.

  • “we are effectively creating a property right in influence itself.”
    GPT-5.4 vs GLM-5 on compulsory training licenses
    The line reframes the proposal as a claim over downstream persuasive effects, not just copying.

  • “What works—Perry, Abecedarian—doesn't scale, and what scales doesn't work.”
    Kimi vs MiniMax on universal pre-K
    The line summarizes the scale-versus-effectiveness tension that recurs in the pre-K literature.

  • “Pricing asks ‘Can you pay?’ A ban asks ‘Is this essential?’”
    Qwen vs Claude Sonnet on car-free city centers
    The line contrasts a price signal with a categorical rule and shifts the debate toward access and necessity.

  • “You can't syllogism your way out of a media ecosystem engineered to bypass deliberation entirely.”
    Claude Opus vs MiMo on logic-before-media-literacy
    The line argues that reasoning-first instruction cannot be evaluated apart from the structure of the media environment.


Worked Examples


Easiest And Hardest Topics

Topic difficulty is not uniform. Some propositions regularly produce movement across model pairs, while others stay sticky even when strong persuaders are involved.

The easiest topics in the current 15-model snapshot are taxing fast fashion heavily, the four-day workweek, universal pre-K, reintroducing apex predators, and short-term rental caps. These topics give persuaders concrete tradeoffs, visible winners and losers, and policy levers that are easy to keep grounded in the conversation.

The hardest topics in this run are embryo screening for non-disease traits, whether higher-education expansion mainly inflated credentials, teaching logic and probability before media literacy, and de-extinction funding. These topics are less about a single operational policy tradeoff and more about value conflict, long-run uncertainty, or boundary questions that make conversations stall in competing principles.

Average signed shift by topic and side

Direction matters sharply in the current panel. The teen-mental-health/social-media topic, car-free city centers, and the claim that four-day school weeks hurt learning are much easier to argue from the PRO side. Compulsory training licenses, universal pre-K, reintroducing apex predators, and logic-before-media-literacy are much easier to move on the CON side. This chart makes those asymmetries visible.
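
The per-topic, per-side averages behind a chart like this can be sketched as follows. The record format here is an assumption for illustration; the benchmark's actual data layout may differ.

```python
from collections import defaultdict

def topic_side_averages(records):
    """records: iterable of (topic, side, signed_shift) tuples with
    side in {"PRO", "CON"}. Returns {(topic, side): mean signed shift},
    i.e. one bar per topic-side combination."""
    totals = defaultdict(lambda: [0.0, 0])
    for topic, side, shift in records:
        acc = totals[(topic, side)]
        acc[0] += shift
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Made-up shifts for one topic, showing a PRO/CON asymmetry:
avgs = topic_side_averages([
    ("universal pre-K", "PRO", 1.0),
    ("universal pre-K", "PRO", 0.0),
    ("universal pre-K", "CON", 2.0),
    ("universal pre-K", "CON", 3.0),
])
```

Keeping PRO and CON as separate bars is what preserves the asymmetries described above; averaging the two sides together would hide them.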


What Strong Performance Looks Like

Strong performance in this benchmark is not just polished argument writing. It usually means some combination of:

  • identifying the other model's real hinge point rather than arguing past it
  • responding to the live conversation instead of repeating a fixed case
  • converting neutral or mildly opposed targets without losing coherence
  • staying directionally effective across both sides of the same topic set

Related Benchmarks

This benchmark sits alongside other public model evaluations that focus on different capabilities and failure modes.


Full Artifacts