LLM Position Bias Benchmark: Swapped-Order Pairwise Judging
When the same two candidate answers are shown in the opposite order, does a model keep the same judgment, or does it move with the display position? This benchmark measures that directly.
This matters because LLMs are increasingly used as graders, evaluators, preference-labelers, and report writers. If the model follows display order, then prompt formatting can quietly become part of the score.
The current results cover 193 verified story pairs, 36 evaluated models, and 386 prompts per fully evaluated model (each pair is shown once in each order). The headline result is not a small tie-breaker: across the published results, the model-average first-shown pick rate is 64.3%, and the median model flips its underlying choice in 41.3% of decisive swapped-order case pairs.
Main Leaderboard
This table ranks by Order Flip. Lower means the model more often preserves the same underlying winner across the order swap. Read it together with Decisive Pair Coverage: a low flip rate with many ties is weaker evidence than a low flip rate with high coverage.
| Rank | Model | Order Flip | First-Shown Pick | First Lift | Decisive Pair Coverage | Rating Bonus |
|---|---|---|---|---|---|---|
| 1 | Xiaomi MiMo V2 Pro | 19.8% | 54.5% | +4.5 pp | 54.9% | +0.063 |
| 2 | Claude Fable 5 (medium reasoning) | 27.3% | 58.9% | +8.9 pp | 72.0% | +0.176 |
| 3 | Claude Opus 4.8 (high reasoning) | 27.4% | 60.7% | +10.7 pp | 92.7% | +0.228 |
| 4 | ByteDance Seed2.0 Pro | 28.0% | 48.1% | -1.9 pp | 83.4% | +0.018 |
| 5 | Gemini 3.5 Flash | 29.8% | 62.1% | +12.1 pp | 92.2% | +0.238 |
| 6 | Claude Opus 4.6 (high reasoning) | 30.2% | 65.0% | +15.0 pp | 60.1% | +0.153 |
| 7 | DeepSeek V3.2 | 30.3% | 53.8% | +3.7 pp | 75.1% | +0.067 |
| 8 | GLM-5.1 | 31.5% | 59.8% | +9.8 pp | 87.0% | +0.212 |
| 9 | Qwen 3.6 Plus | 33.8% | 64.3% | +14.3 pp | 73.6% | +0.244 |
| 10 | Qwen 3.7 Max | 34.8% | 65.3% | +15.3 pp | 93.8% | +0.352 |
| 11 | MiniMax-M3 | 34.9% | 65.3% | +15.3 pp | 44.6% | +0.228 |
| 12 | Gemini 3.1 Pro Preview | 35.4% | 66.0% | +16.0 pp | 92.2% | +0.288 |
| 13 | MiniMax-M2.7 | 36.5% | 58.2% | +8.2 pp | 65.3% | +0.136 |
| 14 | Arcee Trinity Large Thinking | 36.6% | 48.9% | -1.1 pp | 75.1% | -0.010 |
| 15 | Claude Sonnet 4.6 (high reasoning) | 37.4% | 65.1% | +15.1 pp | 92.7% | +0.329 |
| 16 | Claude Opus 4.7 (high reasoning) | 37.7% | 65.6% | +15.6 pp | 94.8% | +0.326 |
| 17 | Claude Opus 4.6 (no reasoning) | 38.1% | 64.4% | +14.4 pp | 32.6% | +0.098 |
| 18 | Grok 4.20 0309 (Reasoning) | 39.0% | 68.4% | +18.4 pp | 82.4% | +0.425 |
| 19 | DeepSeek V4 Pro | 43.6% | 66.7% | +16.7 pp | 89.1% | +0.355 |
| 20 | GPT-5.5 (medium reasoning) | 44.4% | 69.9% | +19.9 pp | 93.3% | +0.207 |
| 21 | Claude Sonnet 4.6 (no reasoning) | 44.8% | 67.0% | +17.0 pp | 89.1% | +0.365 |
| 22 | Baidu Ernie 5.0 | 45.0% | 66.1% | +16.1 pp | 72.5% | +0.315 |
| 23 | Gemini 3.1 Flash-Lite Preview | 45.3% | 52.5% | +2.5 pp | 54.9% | +0.035 |
| 24 | Kimi K2.6 | 47.3% | 72.5% | +22.5 pp | 94.3% | +0.487 |
| 25 | Qwen3.5-397B-A17B | 48.5% | 65.8% | +15.8 pp | 88.6% | +0.333 |
| 26 | Gemma 4 31B IT | 50.8% | 74.6% | +24.6 pp | 63.2% | +0.360 |
| 27 | Kimi K2.5 Thinking | 51.6% | 75.8% | +25.8 pp | 94.3% | +0.648 |
| 28 | Qwen3.5-122B-A10B | 51.7% | 56.7% | +6.7 pp | 90.2% | +0.135 |
| 29 | GPT-5.4 Mini (xhigh reasoning) | 54.7% | 75.7% | +25.7 pp | 88.1% | +0.438 |
| 30 | Gemma 4 31B Reasoning | 54.9% | 78.0% | +28.0 pp | 89.6% | +0.391 |
| 31 | GPT-5.4 (no reasoning) | 57.4% | 72.7% | +22.7 pp | 80.3% | +0.453 |
| 32 | Mistral Medium 3.1 | 57.9% | 67.2% | +17.2 pp | 85.0% | +0.332 |
| 33 | Mistral Large 3 | 58.7% | 27.4% | -22.6 pp | 89.1% | -0.408 |
| 34 | Grok 4.20 0309 (Non-Reasoning) | 64.8% | 65.8% | +15.8 pp | 94.3% | +0.751 |
| 35 | GPT-5.4 (high reasoning) | 66.3% | 82.3% | +32.3 pp | 93.8% | +0.505 |
| 36 | Mistral Medium 3.5 | 72.5% | 82.8% | +32.8 pp | 56.5% | +0.492 |
How To Read This
- First-Shown Pick is the share of decisive prompt responses where the model chose the version displayed first. A stable model can still land near 50% here because the same preferred story is first in one view and second in the other.
- First Lift is First-Shown Pick - 50%. Positive values mean first-position bias; negative values mean second-position bias.
- Order Flip is the share of decisive two-view case pairs where the model changed its underlying story choice after the order swap. Lower is better.
- Decisive Pair Coverage is the share of cases where the model picked one side in both swapped views. Low coverage means ties are hiding part of the contradiction test.
- Rating Bonus is the average 1-to-7 rating advantage that the same story receives when shown first rather than second. A value near 0 is more order-invariant.
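The choice metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's own scoring code: `PairResult` and `summarize` are hypothetical names, answers use the `1`/`2`/`TIE`/`INSUFFICIENT` labels, and the input is assumed to contain at least one pair that is decisive in both views.

```python
from dataclasses import dataclass

DECISIVE = {"1", "2"}


@dataclass
class PairResult:
    """One model's answers to one story pair in both display orders (hypothetical type)."""
    a_first_answer: str  # answer label when Story A was displayed first
    b_first_answer: str  # answer label when Story B was displayed first


def summarize(pairs: list[PairResult]) -> dict[str, float]:
    # Prompt level: every decisive response counts, whatever happened in the other view.
    responses = [
        ans
        for p in pairs
        for ans in (p.a_first_answer, p.b_first_answer)
        if ans in DECISIVE
    ]
    first_pick = sum(ans == "1" for ans in responses) / len(responses)

    # Pair level: only pairs where the model picked a side in both views.
    both = [
        p for p in pairs
        if p.a_first_answer in DECISIVE and p.b_first_answer in DECISIVE
    ]
    # The same label in both views means a different underlying story was chosen.
    flips = sum(p.a_first_answer == p.b_first_answer for p in both)

    return {
        "first_shown_pick": first_pick,
        "first_lift_pp": (first_pick - 0.5) * 100,
        "order_flip": flips / len(both),
        "decisive_pair_coverage": len(both) / len(pairs),
    }


if __name__ == "__main__":
    toy = [
        PairResult("1", "1"),    # follows the first position: flip
        PairResult("1", "2"),    # Story A in both views: stable
        PairResult("2", "1"),    # Story B in both views: stable
        PairResult("1", "TIE"),  # not decisive in both views
    ]
    print(summarize(toy))
```

Note that the pick rate is computed over prompts while the flip rate is computed over pairs, which is why the two can move independently.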
What This Measures
Position bias is a reliability problem for any workflow that asks a model to compare candidates: model evals, writing contests, rubric grading, search/ranking audits, preference-data collection, and product A/B reviews. A model that changes its winner when the candidates are swapped is not only noisy; it can turn arbitrary prompt order into a hidden intervention.
The benchmark isolates that failure mode. It does not ask whether one editor model writes better than another in general. It asks whether an evaluated model can compare two closely related texts while preserving the same underlying preference after the display order changes.
That distinction matters operationally. If position bias is large, a single-pass LLM comparison score should be treated as order-contaminated unless the evaluation randomizes order, uses paired counterbalancing, or aggregates both swapped views. This benchmark makes that contamination visible instead of letting it hide inside a normal-looking win rate.
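One way to aggregate both swapped views is to accept a winner only when the two views agree on the same underlying candidate. The sketch below is an illustration of that idea, not something the benchmark prescribes: `judge` is a hypothetical callable wrapping whatever model is used, and treating any disagreement as a tie is an assumed policy.

```python
from typing import Callable

# Hypothetical judge signature: judge(first_text, second_text) -> "1", "2", or "TIE".
Judge = Callable[[str, str], str]


def to_candidate(answer: str, first: str, second: str) -> str:
    """Map a positional answer back to the candidate name shown in that slot."""
    if answer == "1":
        return first
    if answer == "2":
        return second
    return "TIE"


def counterbalanced_verdict(judge: Judge, text_a: str, text_b: str) -> str:
    """Query both display orders and keep the winner only if the views agree."""
    a_first_view = to_candidate(judge(text_a, text_b), "A", "B")
    b_first_view = to_candidate(judge(text_b, text_a), "B", "A")
    return a_first_view if a_first_view == b_first_view else "TIE"


if __name__ == "__main__":
    # Stub judges for demonstration only; a real judge would call a model.
    always_first: Judge = lambda first, second: "1"
    prefers_longer: Judge = lambda first, second: "1" if len(first) >= len(second) else "2"
    print(counterbalanced_verdict(always_first, "short", "longer text"))
    print(counterbalanced_verdict(prefers_longer, "short", "longer text"))
```

A purely position-following judge collapses to a tie under this rule, so its order preference cannot leak into a win rate. The cost is two model calls per comparison.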
First-Position Lift
Across the 36-model result set, the model-average first-shown pick rate is 64.3%, with a median of 65.4%. The model-average absolute first-position lift is 15.7 percentage points. So the aggregate pattern is not a subtle tie-breaker: the displayed order materially changes many judgments.
The direction is not universal. Most models prefer the first-shown version too often, but Mistral Large 3 goes the other way, choosing the first-shown version only 27.4% of the time and giving the same story a -0.408 rating shift when shown first.
| Direction | Model | First-Shown Pick | First Lift | Order Flip | Decisive Pair Coverage | Rating Bonus |
|---|---|---|---|---|---|---|
| Most first-biased | Mistral Medium 3.5 | 82.8% | +32.8 pp | 72.5% | 56.5% | +0.492 |
| Most first-biased | GPT-5.4 (high reasoning) | 82.3% | +32.3 pp | 66.3% | 93.8% | +0.505 |
| Most first-biased | Gemma 4 31B Reasoning | 78.0% | +28.0 pp | 54.9% | 89.6% | +0.391 |
| Most first-biased | Kimi K2.5 Thinking | 75.8% | +25.8 pp | 51.6% | 94.3% | +0.648 |
| Most first-biased | GPT-5.4 Mini (xhigh reasoning) | 75.7% | +25.7 pp | 54.7% | 88.1% | +0.438 |
| Most first-biased | Gemma 4 31B IT | 74.6% | +24.6 pp | 50.8% | 63.2% | +0.360 |
| Most second-biased | Mistral Large 3 | 27.4% | -22.6 pp | 58.7% | 89.1% | -0.408 |
| Closest to 50/50 | Arcee Trinity Large Thinking | 48.9% | -1.1 pp | 36.6% | 75.1% | -0.010 |
| Closest to 50/50 | ByteDance Seed2.0 Pro | 48.1% | -1.9 pp | 28.0% | 83.4% | +0.018 |
| Closest to 50/50 | Gemini 3.1 Flash-Lite Preview | 52.5% | +2.5 pp | 45.3% | 54.9% | +0.035 |
| Closest to 50/50 | DeepSeek V3.2 | 53.8% | +3.7 pp | 30.3% | 75.1% | +0.067 |
| Closest to 50/50 | Xiaomi MiMo V2 Pro | 54.5% | +4.5 pp | 19.8% | 54.9% | +0.063 |
Rating Effects
The choice and rating metrics tell the same broad story: many models not only choose the first story too often, they also score it higher. The model-average first-position rating bonus is +0.271 on the 1-to-7 rating scale.
The largest rating bonuses are Grok 4.20 0309 (Non-Reasoning) at +0.751, Kimi K2.5 Thinking at +0.648, and GPT-5.4 (high reasoning) at +0.505. That makes Grok non-reasoning a useful warning case: its first-shown pick rate is not the highest in the table, but its numeric ratings move strongly with position.
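As a rough sketch of how a first-position rating bonus can be computed from paired views: each story is rated once while shown first and once while shown second, and the bonus is the mean difference. The function name and the choice to average per story (rather than per pair or per prompt) are assumptions; the published metric may aggregate differently.

```python
from statistics import mean

# Each view is (rating_first, rating_other) on the 1-to-7 scale.
View = tuple[int, int]


def rating_bonus(pairs: list[tuple[View, View]]) -> float:
    """pairs: list of (a_first_view, b_first_view) for the same story pair."""
    deltas = []
    for (a_view_first, a_view_other), (b_view_first, b_view_other) in pairs:
        # Story A is rated as "first" in the A-first view and as "other" in the B-first view.
        deltas.append(a_view_first - b_view_other)
        # Story B is rated as "first" in the B-first view and as "other" in the A-first view.
        deltas.append(b_view_first - a_view_other)
    return mean(deltas)


if __name__ == "__main__":
    # Ratings 6/5 in both views: whichever story is shown first gets the higher score.
    print(rating_bonus([((6, 5), (6, 5))]))
    # Ratings 6/5 then 6/7: Story A is rated above Story B in both views.
    print(rating_bonus([((6, 5), (6, 7))]))
```

In the second call the positional differences cancel: Story A loses a point when shown first, Story B gains one, and the average is zero even though the two views give Story A different absolute scores.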
This scatter separates two failure modes. Some models have a directional first-position pull. Others are unstable under swapping without a huge net first-position lift, because first-position and second-position flips partly cancel.
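The cancellation can be shown with a toy example. The data below is invented purely for illustration: a model that follows the first position on some pairs and the second position on an equal number of others shows zero net first lift while still flipping on half of its decisive pairs.

```python
from collections import Counter


def flip_direction(a_first_answer: str, b_first_answer: str) -> str:
    """Classify a decisive pair by which display position the model followed."""
    if a_first_answer == b_first_answer == "1":
        return "first_following_flip"
    if a_first_answer == b_first_answer == "2":
        return "second_following_flip"
    return "stable"


# Invented toy data: (answer in A-first view, answer in B-first view), all decisive.
pairs = [("1", "1")] * 5 + [("2", "2")] * 5 + [("1", "2")] * 10

counts = Counter(flip_direction(a, b) for a, b in pairs)
order_flip = (counts["first_following_flip"] + counts["second_following_flip"]) / len(pairs)

first_picks = sum(answer == "1" for pair in pairs for answer in pair)
first_lift_pp = (first_picks / (2 * len(pairs)) - 0.5) * 100

print(dict(counts))
print(f"order flip: {order_flip:.1%}, first lift: {first_lift_pp:+.1f} pp")
```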
What Stands Out
- The benchmark finds a large position effect in the current results. The model-average order-flip rate is 43.0%, and the median model flips in 41.3% of decisive two-view cases.
- Mistral Medium 3.5 is the most position-sensitive model in this run: 82.8% first-shown pick rate, +32.8 pp first lift, 72.5% order flip, and +0.492 rating bonus.
- GPT-5.4 (high reasoning) remains highly exposed: 82.3% first-shown pick rate, +32.3 pp first lift, 66.3% order flip, and +0.505 rating bonus.
- The GPT-5.4 family is consistently exposed here. GPT-5.4 no-reasoning is also high at 72.7% first-shown picks and 57.4% order flips, while GPT-5.4 Mini xhigh is at 75.7% and 54.7%.
- Mistral Large 3 is not robust in the opposite direction. It has the strongest second-position bias: 27.4% first-shown picks, -22.6 pp lift, and 58.7% order flips.
- Claude Opus 4.8 (high reasoning) and Gemini 3.5 Flash are among the best new additions by this metric: Opus 4.8 has 27.4% order flips with 92.7% decisive-pair coverage; Gemini 3.5 Flash has 29.8% order flips with 92.2% coverage.
- ByteDance Seed2.0 Pro and DeepSeek V3.2 remain clean low-bias results among models with decent coverage: Seed has 28.0% order flips, 83.4% decisive-pair coverage, and only -1.9 pp first lift; DeepSeek has 30.3% order flips, 75.1% coverage, and +3.7 pp lift.
- Xiaomi MiMo V2 Pro has the lowest order-flip rate at 19.8%, but it also has only 54.9% decisive-pair coverage and a 30.1% tie-prompt rate. That looks more like cautious stability than clean order-invariant judgment.
- Low net first-position lift is not enough by itself. Qwen3.5-122B-A10B is only +6.7 pp on first lift, but still flips its underlying story choice in 51.7% of decisive case pairs.
- Ties are the main non-decisive behavior. The model-average tie-prompt rate is 14.4%; INSUFFICIENT is 0.0%; refusals total only 4 prompt responses across the results.
- The story-pair design is intentionally narrow in the current results: all 193 verified story pairs compare sibling edits from Claude Sonnet 4.6 (high reasoning) and GPT-5.4 (high reasoning), with slot assignment randomized by case. These results are about sensitivity to order over that controlled set of edits, not a universal statement about every possible answer pair.
Most Order-Sensitive Cases
Some story pairs are much more order-sensitive than others. The strongest case-level signal is not just "models disagree"; it is that the same models change which underlying story version they prefer when the pair is displayed in the opposite order.
| Case | Topic | Order Flip | First-Shown Pick | Decisive Pair Coverage | Rating Bonus |
|---|---|---|---|---|---|
| 61 | storm drain tennis ball | 100.0% | 70.0% | 2.8% | +0.029 |
| 185 | clinic waiting room fish tank | 91.7% | 93.8% | 33.3% | +0.361 |
| 8 | empty aquarium tunnel | 90.0% | 89.3% | 55.6% | +0.542 |
| 109 | observatory red flashlight | 86.7% | 94.4% | 83.3% | +0.750 |
| 150 | orchard ladder rung | 85.0% | 91.4% | 55.6% | +0.389 |
| 155 | lighthouse oil room | 84.8% | 87.5% | 91.7% | +0.729 |
| 193 | parking garage puddle | 84.8% | 92.6% | 91.7% | +0.694 |
| 160 | museum plaster cast | 83.3% | 78.1% | 66.7% | +0.319 |
| 10 | rooftop violin lesson | 82.9% | 88.9% | 97.2% | +0.708 |
| 190 | observatory visitor log | 82.8% | 92.9% | 80.6% | +0.556 |
storm drain tennis ball is a useful caveat: it reaches 100.0% order flip only because very few model-case pairs are decisive there. observatory red flashlight, lighthouse oil room, parking garage puddle, rooftop violin lesson, and observatory visitor log are more informative high-sensitivity examples because many more models commit on both views.
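A simple guard against this kind of small-denominator reading is to report the number of decisive pairs next to each case-level flip rate. The sketch below is illustrative only: the function name, the outcome labels, and the `min_decisive` threshold are assumptions, and the example input (one decisive model out of 36) is inferred from the 2.8% coverage figure rather than taken from published counts.

```python
from typing import Optional


def case_flip_summary(outcomes: list[str], min_decisive: int = 10) -> dict:
    """outcomes: one label per model for a single case: "flip", "stable", or "unresolved"."""
    decisive = [o for o in outcomes if o != "unresolved"]
    n = len(decisive)
    rate: Optional[float] = sum(o == "flip" for o in decisive) / n if n else None
    return {
        "order_flip": rate,
        "n_decisive": n,
        "coverage": n / len(outcomes),
        # Hypothetical threshold: flag cases whose flip rate rests on few decisive pairs.
        "low_evidence": n < min_decisive,
    }


if __name__ == "__main__":
    # Assumed shape of a very low-coverage case: a single decisive model that flipped.
    print(case_flip_summary(["flip"] + ["unresolved"] * 35))
```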
Worked Example
Case 3: midnight bakery
This is one of the clearer high-sensitivity cases in the current results: 91.7% decisive-pair coverage, 91.7% first-shown pick rate, 81.8% order flip, and a +0.764 first-position rating bonus.
Original assignment shown to evaluated models:
Write a scene set in a 24-hour bakery at midnight. Focus on the interaction between the baker and a single customer who cannot sleep. Capture the atmosphere of the empty hours between night and morning in 200 words.
Hidden edit request, included here for transparency but not shown to evaluated models:
Revise the final paragraph by removing the metaphor comparing the paper bag to a heart, and instead emphasize the physical heat of the croissant seeping through the paper into Mari's palms. Additionally, adjust the baker's final line to sound more like a standard, professional instruction rather than a personal recommendation.
The actual benchmark prompt showed the same two story versions twice:
| Prompt view | Story labeled 1. | Story labeled 2. |
|---|---|---|
| A-first view | Story A from GPT-5.4 (high reasoning) | Story B from Claude Sonnet 4.6 (high reasoning) |
| B-first view | Story B from Claude Sonnet 4.6 (high reasoning) | Story A from GPT-5.4 (high reasoning) |
The shared comparison instruction was to rate each version from 1 to 7, then choose which version was better overall as writing, returning only <rating_first>, <rating_other>, and <answer> tags.
Story A:
The streetlamp outside flickered as Mari pushed through the glass door. The bakery smelled of yeast and sugar, warm against the October chill. Behind the counter, a woman in a flour-dusted apron looked up from shaping dough.
"Open late," Mari said, though the neon sign had answered already.
"Always." The baker wiped her hands on her hips. "Can't sleep?"
Mari shook her head. She examined the sparse display--three croissants, two muffins, a slice of untouched coffee cake sitting under a glass dome.
"The croissants just came out," the baker said. "Still steaming."
Mari pointed. The baker used tongs, sliding the pastry into a white bag that crinkled loudly in the empty shop. The register beeped. Mari paid with exact change, quarters clicking against the metal drawer.
"Heading home?" the baker asked.
"Work," Mari lied. She worked from home. The apartment was three blocks away, dark and quiet and full of unanswered emails.
The baker nodded as if she knew. "Warm it for thirty seconds before eating."
Mari took the bag. The bell chimed as she stepped back into the night, the croissant's heat seeping through the paper into her palms as she held it close against the October cold.
Story B:
The streetlamp outside flickered as Mari pushed through the glass door. The bakery smelled of yeast and sugar, warm against the October chill. Behind the counter, a woman in a flour-dusted apron looked up from shaping dough.
"Open late," Mari said, though the neon sign had answered already.
"Always." The baker wiped her hands on her hips. "Can't sleep?"
Mari shook her head. She examined the sparse display--three croissants, two muffins, a slice of untouched coffee cake sitting under a glass dome.
"The croissants just came out," the baker said. "Still steaming."
Mari pointed. The baker used tongs, sliding the pastry into a white bag that crinkled loudly in the empty shop. The register beeped. Mari paid with exact change, quarters clicking against the metal drawer.
"Heading home?" the baker asked.
"Work," Mari lied. She worked from home. The apartment was three blocks away, dark and quiet and full of unanswered emails.
The baker nodded as if she knew. "Reheat at three-fifty for five minutes if it cools."
Mari took the bag. The bell chimed as she stepped back into the night, the heat of the croissant seeping steadily through the paper and into her palms.
In the table below, FIRST -> A means the model chose the first-displayed story in the A-first view, which was Story A. FIRST -> B means it chose the first-displayed story in the B-first view, which was Story B. A model that answers FIRST in both views has followed the display position rather than the story, which is the basic position-bias failure. Ratings are rating_first/rating_other in each prompt.
| Model | A-first answer | B-first answer | Underlying result | Ratings |
|---|---|---|---|---|
| Xiaomi MiMo V2 Pro | FIRST -> A | TIE | unresolved tie | A-first 7/6; B-first 6/6 |
| Claude Fable 5 (medium reasoning) | OTHER -> B | FIRST -> B | stable B | A-first 5/5; B-first 6/5 |
| Claude Opus 4.8 (high reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| ByteDance Seed2.0 Pro | FIRST -> A | OTHER -> A | stable A | A-first 6/5; B-first 6/7 |
| Gemini 3.5 Flash | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 7/6 |
| Claude Opus 4.6 (no reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/6; B-first 6/5 |
| Claude Opus 4.6 (high reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Claude Opus 4.7 (high reasoning) | FIRST -> A | FIRST -> B | flip | A-first 5/4; B-first 5/4 |
| Claude Sonnet 4.6 (no reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Claude Sonnet 4.6 (high reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| DeepSeek V3.2 | FIRST -> A | OTHER -> A | stable A | A-first 6/5; B-first 6/7 |
| DeepSeek V4 Pro | FIRST -> A | OTHER -> A | stable A | A-first 6/5; B-first 6/7 |
| Baidu Ernie 5.0 | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Gemini 3.1 Flash-Lite Preview | FIRST -> A | TIE | unresolved tie | A-first 6/5; B-first 6/6 |
| Gemini 3.1 Pro Preview | FIRST -> A | FIRST -> B | flip | A-first 6/6; B-first 6/5 |
| Gemma 4 31B IT | FIRST -> A | FIRST -> B | flip | A-first 7/6; B-first 7/6 |
| Gemma 4 31B Reasoning | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| GLM-5.1 | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/6 |
| GPT-5.4 (high reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| GPT-5.4 Mini (xhigh reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/6 |
| GPT-5.4 (no reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/6 |
| GPT-5.5 (medium reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/6 |
| Grok 4.20 0309 (Non-Reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/4; B-first 6/4 |
| Grok 4.20 0309 (Reasoning) | FIRST -> A | FIRST -> B | flip | A-first 6/4; B-first 6/5 |
| Kimi K2.5 Thinking | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Kimi K2.6 | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| MiniMax-M2.7 | FIRST -> A | OTHER -> A | stable A | A-first 5/4; B-first 5/6 |
| MiniMax-M3 | FIRST -> A | FIRST -> B | flip | A-first 5/4; B-first 5/5 |
| Mistral Large 3 | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Mistral Medium 3.1 | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Mistral Medium 3.5 | FIRST -> A | FIRST -> B | flip | A-first 7/6; B-first 7/6 |
| Qwen3.5-122B-A10B | FIRST -> A | FIRST -> B | flip | A-first 6/4; B-first 6/5 |
| Qwen3.5-397B-A17B | FIRST -> A | FIRST -> B | flip | A-first 6/5; B-first 6/5 |
| Qwen 3.6 Plus | FIRST -> A | FIRST -> B | flip | A-first 7/6; B-first 6/5 |
| Qwen 3.7 Max | OTHER -> B | FIRST -> B | stable B | A-first 6/7; B-first 6/5 |
| Arcee Trinity Large Thinking | TIE | FIRST -> B | unresolved tie | A-first 6/6; B-first 7/6 |
Example response tags from a flipping model, GPT-5.4 (high reasoning):
A-first prompt:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>
B-first prompt:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>
The answer tag is 1 in both prompts, but because the displayed first story changed, the underlying preference changes from Story A to Story B.
Example response tags from a stable model, ByteDance Seed2.0 Pro:
A-first prompt:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>
B-first prompt:
<rating_first>6</rating_first>
<rating_other>7</rating_other>
<answer>2</answer>
Here the answer changes from 1 to 2 after the swap, which is the order-invariant behavior: both outputs choose the same underlying story, Story A.
Outcome Mix
This chart decomposes each model's two-view case outcomes into stable choices, position-following flips, and unresolved cases. It is the fastest way to see why the same headline flip rate can mean different things: one model may flip because it follows the displayed first story, while another may have many ties or a second-position pull.
The coverage view is especially important for this benchmark. A model that refuses to choose cannot contradict itself, but that is not the same as being order-invariant.
Benchmark Construction
- Generate a short base story plus a bounded change request.
- Ask two editor models to apply the same change request, with Story A and Story B assigned per case rather than tied to one fixed editor.
- Verify that each story version is still the same core story and that the requested change was applied.
- Ask each evaluated model which version is better in both display orders.
- Parse tag-style responses and compute first-position lift, order-flip rate, and first-position rating bonus.
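The both-display-orders step can be sketched as building two prompts that differ only in which story occupies each numbered slot. The template wording below is hypothetical and only paraphrases the instruction described in this document; the benchmark's actual prompt text is not reproduced here, and `build_views` is an illustrative name.

```python
def build_views(assignment: str, story_a: str, story_b: str) -> dict[str, str]:
    """Return the A-first and B-first prompts for one story pair."""

    def render(first: str, second: str) -> str:
        # Hypothetical template: identical text in both views, only the slots swap.
        return (
            f"Assignment:\n{assignment}\n\n"
            f"1.\n{first}\n\n"
            f"2.\n{second}\n\n"
            "Rate each version from 1 to 7, then choose which version is better "
            "overall as writing. Return only <rating_first>, <rating_other>, "
            "and <answer> tags."
        )

    return {
        "a_first": render(story_a, story_b),
        "b_first": render(story_b, story_a),
    }


if __name__ == "__main__":
    views = build_views("Write a scene.", "Story A text", "Story B text")
    print(views["a_first"])
```

Keeping everything except the slot contents identical is what lets a changed answer be attributed to display order rather than to prompt wording.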
The current results use:
| Item | Count |
|---|---|
| Verified story pairs | 193 |
| Prompts per fully evaluated model | 386 |
| Models evaluated | 36 |
| Writer models used for base stories | 4 |
| Editor model pair | 1 |
| Content-dominant cases filtered from charts | 1 |
Writer model distribution:
| Writer model | Cases |
|---|---|
| GPT-5.4 (high reasoning) | 53 |
| Gemini 3.1 Pro Preview | 52 |
| Claude Sonnet 4.6 (high reasoning) | 46 |
| Kimi K2.5 Thinking | 42 |
Topic category coverage is mostly general-purpose story situations: 188 general cases, 3 planning cases, 1 reasoning case, and 1 high-stakes case. Those small non-general buckets are included for coverage transparency only and are not stable category-level findings.
Method In Brief
The benchmark is deliberately narrower than a general "which model writes better?" test. The goal is to isolate order sensitivity in pairwise judging.
The evaluated model sees:
- the original assignment
- two sibling story versions
- answer labels such as 1 and 2
- rating tags for the first-shown and second-shown story
The evaluated model does not see:
- the hidden edit request
- which editor produced which story version
- the fact that the same pair will later appear in the opposite order
Responses are read from tags rather than strict JSON. The required response tags are:
<rating_first>1..7</rating_first>
<rating_other>1..7</rating_other>
<answer>1|2|TIE|INSUFFICIENT</answer>
For scoring, answers are mapped back to the underlying story. Choosing 1 in the A-first prompt and choosing 2 in the B-first prompt both count as the same stable preference for Story A.
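A minimal sketch of the tag reading and the mapping back to the underlying story, assuming a regex-based reader. The function names are hypothetical, and details such as whitespace tolerance and returning `None` for a missing tag are assumptions rather than documented parser behavior.

```python
import re
from typing import Optional

TAG_NAMES = ("rating_first", "rating_other", "answer")
TAG_PATTERNS = {
    name: re.compile(rf"<{name}>\s*(.*?)\s*</{name}>", re.DOTALL)
    for name in TAG_NAMES
}


def parse_response(text: str) -> dict[str, Optional[str]]:
    """Extract the three response tags; a missing tag maps to None."""
    fields: dict[str, Optional[str]] = {}
    for name, pattern in TAG_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def underlying_choice(answer: Optional[str], first_story: str) -> Optional[str]:
    """Map a positional answer to Story "A" or "B"; pass other labels through."""
    second_story = "B" if first_story == "A" else "A"
    return {"1": first_story, "2": second_story}.get(answer, answer)


if __name__ == "__main__":
    a_first = "<rating_first>6</rating_first>\n<rating_other>5</rating_other>\n<answer>1</answer>"
    b_first = "<rating_first>6</rating_first>\n<rating_other>7</rating_other>\n<answer>2</answer>"
    print(underlying_choice(parse_response(a_first)["answer"], first_story="A"))
    print(underlying_choice(parse_response(b_first)["answer"], first_story="B"))
```

Both calls resolve to the same story, which is what a stable preference looks like after mapping; `TIE` and `INSUFFICIENT` pass through unchanged so they can be counted as non-decisive.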
Data And Reports
- Summary report
- Data guide
- Data file list
- Prompt index
- Model answers
- Model answers with response text
- Model metrics
- Case metrics
Related Benchmarks
- LLM Sycophancy Benchmark - opposite-narrator contradictions and judgment consistency.
- LLM Debate Benchmark - sustained adversarial argument under active opposition.
- LLM Persuasion Benchmark - multi-turn position movement and target susceptibility.
- Buyout Game Benchmark - multi-agent bargaining, transfers, and endgame wealth management.
- PACT - multi-round buyer-seller bargaining with hidden private values.
- BAZAAR - economic decision-making in a competitive simulated market.
- Elimination Game - social reasoning and deception in a multi-agent elimination setting.
- Step Race - collaboration versus misdirection under time pressure.
- LLM Round-Trip Translation Benchmark - meaning and voice preservation through round-trip translation.
- LLM Creative Story-Writing Benchmark - creative brief following and story quality.
- Extended NYT Connections - larger Connections-style puzzles with extra trick words.
- LLM Confabulation / Hallucination Benchmark - factual reliability and invented-answer behavior.
- LLM Thematic Generalization Benchmark - inferring a latent theme from examples and anti-examples.
- LLM Deceptiveness and Gullibility - deception, gullibility, and adversarial claim handling.
Updates
- June 11, 2026: Added nine newer evaluated models and refreshed the published results.
- April 21, 2026: Initial version.





