GitHub - lechmazur/position_bias: A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in opposite orders.


LLM Position Bias Benchmark: Swapped-Order Pairwise Judging

When the same two candidate answers are shown in the opposite order, does a model keep the same judgment, or does it move with the display position? This benchmark measures that directly.

This matters because LLMs are increasingly used as graders, evaluators, preference-labelers, and report writers. If a judge model follows display order, then prompt formatting can quietly become part of the score.

The current public snapshot covers 193 verified story pairs, 27 judge models, and 386 prompts per full model. The headline result is not a small tie-breaker: across the report view, the model-average first-shown pick rate is 63.3%, and the median model flips its underlying choice in 44.8% of decisive swapped-order case pairs.


Main Leaderboard

Choice reversal rate by model

This table ranks by Order Flip. Lower means the model more often preserves the same underlying winner across the order swap. Read it together with Decisive Pair Coverage: a low flip rate with many ties is weaker evidence than a low flip rate with high coverage.

Rank Model Order Flip First-Shown Pick First Lift Decisive Pair Coverage Rating Bonus
1 Xiaomi MiMo V2 Pro 19.8% 54.5% +4.5 pp 54.9% +0.063
2 ByteDance Seed2.0 Pro 28.0% 48.1% -1.9 pp 83.4% +0.018
3 Claude Opus 4.6 (high reasoning) 30.2% 65.0% +15.0 pp 60.1% +0.153
4 DeepSeek V3.2 30.3% 53.8% +3.7 pp 75.1% +0.067
5 GLM-5.1 31.5% 59.8% +9.8 pp 87.0% +0.212
6 Qwen 3.6 Plus 33.8% 64.3% +14.3 pp 73.6% +0.244
7 Gemini 3.1 Pro Preview 35.4% 66.0% +16.0 pp 92.2% +0.288
8 MiniMax-M2.7 36.5% 58.2% +8.2 pp 65.3% +0.136
9 Arcee Trinity Large Thinking 36.6% 48.9% -1.1 pp 75.1% -0.010
10 Claude Sonnet 4.6 (high reasoning) 37.4% 65.1% +15.1 pp 92.7% +0.329
11 Claude Opus 4.7 (high reasoning) 37.7% 65.6% +15.6 pp 94.8% +0.326
12 Claude Opus 4.6 (no reasoning) 38.1% 64.4% +14.4 pp 32.6% +0.098
13 Grok 4.20 0309 (Reasoning) 39.0% 68.4% +18.4 pp 82.4% +0.425
14 Claude Sonnet 4.6 (no reasoning) 44.8% 67.0% +17.0 pp 89.1% +0.365
15 Baidu Ernie 5.0 45.0% 66.1% +16.1 pp 72.5% +0.315
16 Gemini 3.1 Flash-Lite Preview 45.3% 52.5% +2.5 pp 54.9% +0.035
17 Qwen3.5-397B-A17B 48.5% 65.8% +15.8 pp 88.6% +0.333
18 Gemma 4 31B IT 50.8% 74.6% +24.6 pp 63.2% +0.360
19 Kimi K2.5 Thinking 51.6% 75.8% +25.8 pp 94.3% +0.648
20 Qwen3.5-122B-A10B 51.7% 56.7% +6.7 pp 90.2% +0.135
21 GPT-5.4 Mini (xhigh reasoning) 54.7% 75.7% +25.7 pp 88.1% +0.438
22 Gemma 4 31B Reasoning 54.9% 78.0% +28.0 pp 89.6% +0.391
23 GPT-5.4 (no reasoning) 57.4% 72.7% +22.7 pp 80.3% +0.453
24 Mistral Medium 3.1 57.9% 67.2% +17.2 pp 85.0% +0.332
25 Mistral Large 3 58.7% 27.4% -22.6 pp 89.1% -0.408
26 Grok 4.20 0309 (Non-Reasoning) 64.8% 65.8% +15.8 pp 94.3% +0.751
27 GPT-5.4 (high reasoning) 66.3% 82.3% +32.3 pp 93.8% +0.505

How To Read This

  • First-Shown Pick is the share of decisive prompt responses where the model chose the version displayed first. A stable judge can still land near 50% here because the same preferred story is first in one view and second in the other.
  • First Lift is First-Shown Pick minus 50%, in percentage points. Positive values mean first-position bias; negative values mean second-position bias.
  • Order Flip is the share of decisive two-view case pairs where the model changed its underlying canonical choice after the order swap. Lower is better.
  • Decisive Pair Coverage is the share of cases where the model picked one side in both swapped views. Low coverage means ties are hiding part of the contradiction test.
  • Rating Bonus is the average 1-to-7 rating advantage that the same story receives when shown first rather than second. A value near 0 is more order-invariant.
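The metric definitions above can be sketched in code. This is a reconstruction from the definitions, not the benchmark's actual scoring script: it assumes each prompt response is a dict with hypothetical keys `case`, `view` (`"a_first"` or `"b_first"`), `answer`, `rating_first`, and `rating_other`.

```python
from collections import defaultdict

def canonical_choice(view, answer):
    """Map a positional answer ('1' or '2') to the underlying variant.

    In the a_first view position 1 holds variant_a; in the b_first view
    position 1 holds variant_b. Non-decisive answers map to None.
    """
    if answer not in ("1", "2"):
        return None
    first, second = ("a", "b") if view == "a_first" else ("b", "a")
    return first if answer == "1" else second

def summarize(records):
    """Compute the leaderboard metrics for one judge model.

    Assumes at least one decisive prompt and both views per case.
    """
    decisive = [r for r in records if r["answer"] in ("1", "2")]
    first_pick = sum(r["answer"] == "1" for r in decisive) / len(decisive)

    by_case = defaultdict(dict)
    for r in records:
        by_case[r["case"]][r["view"]] = r

    flips = stable = 0
    bonuses = []
    for views in by_case.values():
        a, b = views["a_first"], views["b_first"]
        # Rating bonus: each variant's rating when shown first, minus
        # its rating when shown second in the swapped view.
        bonuses.append(a["rating_first"] - b["rating_other"])  # variant_a
        bonuses.append(b["rating_first"] - a["rating_other"])  # variant_b
        ca = canonical_choice("a_first", a["answer"])
        cb = canonical_choice("b_first", b["answer"])
        if ca and cb:  # decisive in both swapped views
            if ca == cb:
                stable += 1
            else:
                flips += 1
    n_decisive_pairs = flips + stable
    return {
        "first_shown_pick": first_pick,
        "first_lift_pp": 100 * first_pick - 50,
        "order_flip": flips / n_decisive_pairs,
        "decisive_pair_coverage": n_decisive_pairs / len(by_case),
        "rating_bonus": sum(bonuses) / len(bonuses),
    }
```

Note how a judge that always picks the same underlying variant in both views still contributes one first-shown pick and one second-shown pick, which is why a stable judge lands near 50% on First-Shown Pick.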

What This Measures

Position bias is a reliability problem for any workflow that asks a model to compare candidates: model evals, writing contests, rubric grading, search/ranking audits, preference-data collection, and product A/B reviews. A judge that changes its winner when the candidates are swapped is not only noisy; it can turn arbitrary prompt order into a hidden intervention.

The benchmark isolates that failure mode. It does not ask whether one editor model writes better than another in general. It asks whether a judge can compare two closely related texts while preserving the same underlying preference after the display order changes.

That distinction matters operationally. If position bias is large, a single-pass LLM judge score should be treated as order-contaminated unless the evaluation randomizes order, uses paired counterbalancing, or aggregates both swapped views. This benchmark makes that contamination visible instead of letting it hide inside a normal-looking win rate.
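One of the mitigations mentioned above, aggregating both swapped views, can be sketched as a thin wrapper. Here `judge` is a hypothetical callable (not part of this benchmark's code) that returns `'1'`, `'2'`, or `'TIE'` for a displayed pair:

```python
def counterbalanced_judge(judge, text_x, text_y):
    """Order-debiased pairwise verdict from a possibly biased judge.

    Queries both display orders and reports a winner only when the two
    views agree on the same underlying text; a swapped-order
    contradiction is downgraded to 'TIE' instead of silently following
    display position.
    """
    v1 = judge(text_x, text_y)  # x shown first
    v2 = judge(text_y, text_x)  # y shown first
    pick1 = {"1": "x", "2": "y"}.get(v1)
    pick2 = {"1": "y", "2": "x"}.get(v2)
    if pick1 is not None and pick1 == pick2:
        return pick1
    return "TIE"
```

Randomizing display order per item removes position bias only in expectation; querying both orders, as above, also surfaces the contradiction itself, at the cost of two judge calls per pair.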


First-Position Lift

First-position selection rate by model

Across the 27-model report view, the model-average first-shown pick rate is 63.3%, with a median of 65.6%. The model-average absolute first-position lift is 15.2 percentage points. So the aggregate pattern is not a subtle tie-breaker: the displayed order materially changes many judgments.

The direction is not universal. Most models prefer the first-shown version too often, but Mistral Large 3 goes the other way, choosing the first-shown version only 27.4% of the time and giving the same story a -0.408 rating shift when shown first.

Direction Model First-Shown Pick First Lift Order Flip Decisive Pair Coverage Rating Bonus
Most first-biased GPT-5.4 (high reasoning) 82.3% +32.3 pp 66.3% 93.8% +0.505
Most first-biased Gemma 4 31B Reasoning 78.0% +28.0 pp 54.9% 89.6% +0.391
Most first-biased Kimi K2.5 Thinking 75.8% +25.8 pp 51.6% 94.3% +0.648
Most first-biased GPT-5.4 Mini (xhigh reasoning) 75.7% +25.7 pp 54.7% 88.1% +0.438
Most first-biased Gemma 4 31B IT 74.6% +24.6 pp 50.8% 63.2% +0.360
Most first-biased GPT-5.4 (no reasoning) 72.7% +22.7 pp 57.4% 80.3% +0.453
Most second-biased Mistral Large 3 27.4% -22.6 pp 58.7% 89.1% -0.408
Closest to 50/50 Arcee Trinity Large Thinking 48.9% -1.1 pp 36.6% 75.1% -0.010
Closest to 50/50 ByteDance Seed2.0 Pro 48.1% -1.9 pp 28.0% 83.4% +0.018
Closest to 50/50 Gemini 3.1 Flash-Lite Preview 52.5% +2.5 pp 45.3% 54.9% +0.035
Closest to 50/50 DeepSeek V3.2 53.8% +3.7 pp 30.3% 75.1% +0.067
Closest to 50/50 Xiaomi MiMo V2 Pro 54.5% +4.5 pp 19.8% 54.9% +0.063

Rating Effects

First-position rating bonus by model

The choice and rating metrics tell the same broad story: many models not only choose the first story too often, they also score it higher. The model-average first-position rating bonus is +0.257 on the 1-to-7 rating scale.

The largest rating bonuses are Grok 4.20 0309 (Non-Reasoning) at +0.751, Kimi K2.5 Thinking at +0.648, and GPT-5.4 (high reasoning) at +0.505. That makes Grok non-reasoning a useful warning case: its first-shown pick rate is not the highest in the table, but its numeric ratings move strongly with position.

Choice reversal vs first-position lift

This scatter separates two failure modes. Some models have a directional first-position pull. Others are unstable under swapping without a huge net first-position lift, because first-position and second-position flips partly cancel.


What Stands Out

  • The benchmark finds a large position effect in the current snapshot. The model-average order-flip rate is 44.0%, and the median model flips in 44.8% of decisive two-view cases.
  • GPT-5.4 (high reasoning) is the most position-sensitive model in this run: 82.3% first-shown pick rate, +32.3 pp first lift, 66.3% order flip, and +0.505 rating bonus.
  • The GPT-5.4 family is consistently exposed here. GPT-5.4 no-reasoning is also high at 72.7% first-shown picks and 57.4% order flips, while GPT-5.4 Mini xhigh is at 75.7% and 54.7%.
  • Mistral Large 3 is not robust in the opposite direction. It has the strongest second-position bias: 27.4% first-shown picks, -22.6 pp lift, and 58.7% order flips.
  • ByteDance Seed2.0 Pro and DeepSeek V3.2 are the cleanest low-bias results among models with decent coverage: Seed has 28.0% order flips, 83.4% decisive-pair coverage, and only -1.9 pp first lift; DeepSeek has 30.3% order flips, 75.1% coverage, and +3.7 pp lift.
  • Xiaomi MiMo V2 Pro has the lowest order-flip rate at 19.8%, but it also has only 54.9% decisive-pair coverage and a 30.1% tie-prompt rate. That looks more like cautious stability than clean order-invariant judgment.
  • Low net first-position lift is not enough by itself. Qwen3.5-122B-A10B is only +6.7 pp on first lift, but still flips canonical choice in 51.7% of decisive case pairs.
  • Ties are the main non-decisive behavior. The model-average tie-prompt rate is 14.7%; INSUFFICIENT is 0.0%; refusals total only 4 prompt responses across the snapshot.
  • The source-pair design is intentionally narrow in this snapshot: all 193 accepted story pairs compare sibling edits from Claude Sonnet 4.6 (high reasoning) and GPT-5.4 (high reasoning), with slot assignment randomized by case. These results are about judge sensitivity to order over that controlled sibling-edit surface, not a universal statement about every possible answer pair.

Most Order-Sensitive Cases

Some story pairs are much more order-sensitive than others. The strongest case-level signal is not just "models disagree"; it is that the same models change which underlying variant they prefer when the pair is displayed in the opposite order.

Case Topic Order Flip First-Shown Pick Decisive Pair Coverage Rating Bonus
61 storm drain tennis ball 100.0% 66.7% 3.7% +0.058
20 underpass mural cleanup 94.7% 93.2% 70.4% +0.537
155 lighthouse oil room 92.0% 88.9% 92.6% +0.769
110 library return chute 88.2% 95.7% 63.0% +0.685
3 midnight bakery 87.5% 94.4% 88.9% +0.815
109 observatory red flashlight 87.5% 94.4% 88.9% +0.796
8 empty aquarium tunnel 85.7% 85.7% 51.9% +0.481
150 orchard ladder rung 85.7% 90.5% 51.9% +0.352
193 parking garage puddle 84.0% 92.3% 92.6% +0.722
106 rooftop pigeon coop 83.3% 71.2% 66.7% +0.370

"storm drain tennis ball" is a useful caveat: it reaches 100.0% order flip only because very few model-case pairs are decisive there. "underpass mural cleanup", "lighthouse oil room", "midnight bakery", "observatory red flashlight", and "parking garage puddle" are more informative high-sensitivity examples because many more judges commit on both views.


Worked Example

Case 3: midnight bakery

This is one of the clearest high-sensitivity cases in the current snapshot: 88.9% decisive-pair coverage, 94.4% first-shown pick rate, 87.5% order flip, and a +0.815 first-position rating bonus.

Original assignment shown to judges:

Write a scene set in a 24-hour bakery at midnight. Focus on the interaction between the baker and a single customer who cannot sleep. Capture the atmosphere of the empty hours between night and morning in 200 words.

Hidden edit request, included here for transparency but not shown to judges:

Revise the final paragraph by removing the metaphor comparing the paper bag to a heart, and instead emphasize the physical heat of the croissant seeping through the paper into Mari's palms. Additionally, adjust the baker's final line to sound more like a standard, professional instruction rather than a personal recommendation.

The actual benchmark prompt showed the same two variants twice:

Prompt view Story labeled 1. Story labeled 2.
variant_a_first variant_a from GPT-5.4 (high reasoning) variant_b from Claude Sonnet 4.6 (high reasoning)
variant_b_first variant_b from Claude Sonnet 4.6 (high reasoning) variant_a from GPT-5.4 (high reasoning)

The shared judge instruction was to rate each version from 1 to 7, then choose which version was better overall as writing, returning only <rating_first>, <rating_other>, and <answer> tags.

variant_a:

The streetlamp outside flickered as Mari pushed through the glass door. The bakery smelled of yeast and sugar, warm against the October chill. Behind the counter, a woman in a flour-dusted apron looked up from shaping dough.

"Open late," Mari said, though the neon sign had answered already.

"Always." The baker wiped her hands on her hips. "Can't sleep?"

Mari shook her head. She examined the sparse display--three croissants, two muffins, a slice of untouched coffee cake sitting under a glass dome.

"The croissants just came out," the baker said. "Still steaming."

Mari pointed. The baker used tongs, sliding the pastry into a white bag that crinkled loudly in the empty shop. The register beeped. Mari paid with exact change, quarters clicking against the metal drawer.

"Heading home?" the baker asked.

"Work," Mari lied. She worked from home. The apartment was three blocks away, dark and quiet and full of unanswered emails.

The baker nodded as if she knew. "Warm it for thirty seconds before eating."

Mari took the bag. The bell chimed as she stepped back into the night, the croissant's heat seeping through the paper into her palms as she held it close against the October cold.

variant_b:

The streetlamp outside flickered as Mari pushed through the glass door. The bakery smelled of yeast and sugar, warm against the October chill. Behind the counter, a woman in a flour-dusted apron looked up from shaping dough.

"Open late," Mari said, though the neon sign had answered already.

"Always." The baker wiped her hands on her hips. "Can't sleep?"

Mari shook her head. She examined the sparse display--three croissants, two muffins, a slice of untouched coffee cake sitting under a glass dome.

"The croissants just came out," the baker said. "Still steaming."

Mari pointed. The baker used tongs, sliding the pastry into a white bag that crinkled loudly in the empty shop. The register beeped. Mari paid with exact change, quarters clicking against the metal drawer.

"Heading home?" the baker asked.

"Work," Mari lied. She worked from home. The apartment was three blocks away, dark and quiet and full of unanswered emails.

The baker nodded as if she knew. "Reheat at three-fifty for five minutes if it cools."

Mari took the bag. The bell chimed as she stepped back into the night, the heat of the croissant seeping steadily through the paper and into her palms.

In the table below, FIRST -> A means the model chose the first-displayed story in the view where the first slot held canonical variant_a. FIRST -> B means it again chose the first-displayed story after the order was swapped, when that slot held variant_b. Choosing the first-displayed story in both views is the basic position-bias failure. Ratings are rating_first/rating_other in each prompt.

Model A-first output B-first output Underlying result Ratings
Claude Opus 4.6 (no reasoning) FIRST -> A FIRST -> B flip A-first 6/6; B-first 6/5
Claude Opus 4.6 (high reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Claude Opus 4.7 (high reasoning) FIRST -> A FIRST -> B flip A-first 5/4; B-first 5/4
Claude Sonnet 4.6 (no reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Claude Sonnet 4.6 (high reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
DeepSeek V3.2 FIRST -> A OTHER -> A stable A A-first 6/5; B-first 6/7
Baidu Ernie 5.0 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Gemini 3.1 Flash-Lite Preview FIRST -> A TIE unresolved tie A-first 6/5; B-first 6/6
Gemini 3.1 Pro Preview FIRST -> A FIRST -> B flip A-first 6/6; B-first 6/5
Gemma 4 31B IT FIRST -> A FIRST -> B flip A-first 7/6; B-first 7/6
Gemma 4 31B Reasoning FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
GLM-5.1 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/6
GPT-5.4 (high reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
GPT-5.4 Mini (xhigh reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/6
GPT-5.4 (no reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/6
Grok 4.20 0309 (Non-Reasoning) FIRST -> A FIRST -> B flip A-first 6/4; B-first 6/4
Grok 4.20 0309 (Reasoning) FIRST -> A FIRST -> B flip A-first 6/4; B-first 6/5
Kimi K2.5 Thinking FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Xiaomi MiMo V2 Pro FIRST -> A TIE unresolved tie A-first 7/6; B-first 6/6
MiniMax-M2.7 FIRST -> A OTHER -> A stable A A-first 5/4; B-first 5/6
Mistral Large 3 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Mistral Medium 3.1 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Qwen3.5-122B-A10B FIRST -> A FIRST -> B flip A-first 6/4; B-first 6/5
Qwen3.5-397B-A17B FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Qwen 3.6 Plus FIRST -> A FIRST -> B flip A-first 7/6; B-first 6/5
ByteDance Seed2.0 Pro FIRST -> A OTHER -> A stable A A-first 6/5; B-first 6/7
Arcee Trinity Large Thinking TIE FIRST -> B unresolved tie A-first 6/6; B-first 7/6

Example raw tags from a flipping judge, GPT-5.4 (high reasoning):

variant_a_first:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>

variant_b_first:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>

The answer tag is 1 in both prompts, but because the displayed first story changed, the underlying preference changes from variant_a to variant_b.

Example raw tags from a stable judge, ByteDance Seed2.0 Pro:

variant_a_first:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>

variant_b_first:
<rating_first>6</rating_first>
<rating_other>7</rating_other>
<answer>2</answer>

Here the answer changes from 1 to 2 after the swap, which is the order-invariant behavior: both outputs choose the same underlying variant, variant_a.
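The flip/stable/unresolved labels in the table follow mechanically from the two positional answers. A minimal sketch (the `pair_outcome` name is ours, not the benchmark's):

```python
def pair_outcome(answer_a_first, answer_b_first):
    """Classify a swapped-order pair of <answer> tags.

    Answers are positional ('1' = first-shown, '2' = second-shown), so
    each is mapped back to its canonical variant before comparing.
    """
    to_variant = {
        "a_first": {"1": "a", "2": "b"},
        "b_first": {"1": "b", "2": "a"},
    }
    va = to_variant["a_first"].get(answer_a_first)
    vb = to_variant["b_first"].get(answer_b_first)
    if va is None or vb is None:
        return "unresolved"  # a tie or refusal in either view
    return f"stable_{va}" if va == vb else "flip"
```

Under this mapping, GPT-5.4's 1/1 pair above classifies as a flip, while Seed's 1/2 pair classifies as stable for variant_a.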


Outcome Mix

Two-view outcome mix by model

This chart decomposes each model's two-view case outcomes into stable choices, position-following flips, and unresolved cases. It is the fastest way to see why the same headline flip rate can mean different things: one model may flip because it follows the displayed first story, while another may have many ties or a second-position pull.

Choice reversal vs decisive coverage

The coverage view is especially important for this benchmark. A model that refuses to choose cannot contradict itself, but that is not the same as being order-invariant.


Benchmark Construction

  1. Generate a short base story plus a bounded change request.
  2. Ask two editor models to apply the same change request, with variant_a and variant_b assigned per case rather than tied to one fixed editor.
  3. Verify that each variant is still the same core story and that the requested change was applied.
  4. Ask judge models which variant is better in both display orders.
  5. Parse tag-based judge responses and compute first-position lift, order-flip rate, and first-position rating bonus.

The current snapshot uses:

Surface Count
Verified story pairs 193
Prompt rows per full judge model 386
Judge models in report view 27
Writer models used for base stories 4
Editor/source pair 1
Content-dominant cases filtered from charts 1

Writer model distribution:

Writer model Cases
GPT-5.4 (high reasoning) 53
Gemini 3.1 Pro Preview 52
Claude Sonnet 4.6 (high reasoning) 46
Kimi K2.5 Thinking 42

Topic category coverage is mostly general-purpose story situations: 188 general cases, 3 planning cases, 1 reasoning case, and 1 high-stakes case. Those small non-general buckets are included for coverage transparency only and are not stable category-level findings.


Method In Brief

The benchmark is deliberately narrower than a general "which model writes better?" test. The goal is to isolate order sensitivity in pairwise judging.

The judge sees:

  • the original assignment
  • two sibling story variants
  • answer labels such as 1 and 2
  • rating tags for the first-shown and second-shown story

The judge does not see:

  • the hidden edit request
  • which editor produced which variant
  • the fact that the same pair will later appear in the opposite order

The parser expects tag-style outputs rather than strict JSON. The required pass-1 tags are:

<rating_first>1..7</rating_first>
<rating_other>1..7</rating_other>
<answer>1|2|TIE|INSUFFICIENT</answer>

Internally, answers are normalized to canonical variant IDs, so choosing 1 in the variant_a_first prompt and choosing 2 in the variant_b_first prompt both count as the same stable underlying preference for variant_a.
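A parser for this tag format might look like the following sketch; the function name and the exact strictness choices are assumptions, not the benchmark's actual parser:

```python
import re

REQUIRED_TAGS = ("rating_first", "rating_other", "answer")
VALID_RATINGS = {str(n) for n in range(1, 8)}  # "1".."7"
VALID_ANSWERS = {"1", "2", "TIE", "INSUFFICIENT"}

def parse_judgment(raw):
    """Extract the three required pass-1 tags from a raw judge response.

    Returns a dict with int ratings and a string answer, or None if any
    tag is missing or holds an out-of-range value.
    """
    out = {}
    for tag in REQUIRED_TAGS:
        m = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", raw, re.DOTALL)
        if m is None:
            return None
        out[tag] = m.group(1)
    for tag in ("rating_first", "rating_other"):
        if out[tag] not in VALID_RATINGS:
            return None
        out[tag] = int(out[tag])
    if out["answer"] not in VALID_ANSWERS:
        return None
    return out
```

Tag-style extraction like this tolerates surrounding prose better than strict JSON parsing, which matters when reasoning models wrap their verdict in commentary.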


Updates

  • April 21, 2026: Initial version.