GitHub - lechmazur/position_bias: A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in opposite orders.


LLM Position Bias Benchmark: Swapped-Order Pairwise Judging

When the same two candidate answers are shown in the opposite order, does a model keep the same judgment, or does it move with the display position? This benchmark measures that directly.

This matters because LLMs are increasingly used as graders, evaluators, preference-labelers, and report writers. If a judge model follows display order, then prompt formatting can quietly become part of the score.

The current public snapshot covers 193 verified story pairs, 27 judge models, and 386 prompts per full model. The headline result is not a small tie-breaker: across the report view, the model-average first-shown pick rate is 63.3%, and the median model flips its underlying choice in 44.8% of decisive swapped-order case pairs.


Main Leaderboard

Choice reversal rate by model

This table ranks by Order Flip. Lower means the model more often preserves the same underlying winner across the order swap. Read it together with Decisive Pair Coverage: a low flip rate with many ties is weaker evidence than a low flip rate with high coverage.

Rank Model Order Flip First-Shown Pick First Lift Decisive Pair Coverage Rating Bonus
1 Xiaomi MiMo V2 Pro 19.8% 54.5% +4.5 pp 54.9% +0.063
2 ByteDance Seed2.0 Pro 28.0% 48.1% -1.9 pp 83.4% +0.018
3 Claude Opus 4.6 (high reasoning) 30.2% 65.0% +15.0 pp 60.1% +0.153
4 DeepSeek V3.2 30.3% 53.8% +3.7 pp 75.1% +0.067
5 GLM-5.1 31.5% 59.8% +9.8 pp 87.0% +0.212
6 Qwen 3.6 Plus 33.8% 64.3% +14.3 pp 73.6% +0.244
7 Gemini 3.1 Pro Preview 35.4% 66.0% +16.0 pp 92.2% +0.288
8 MiniMax-M2.7 36.5% 58.2% +8.2 pp 65.3% +0.136
9 Arcee Trinity Large Thinking 36.6% 48.9% -1.1 pp 75.1% -0.010
10 Claude Sonnet 4.6 (high reasoning) 37.4% 65.1% +15.1 pp 92.7% +0.329
11 Claude Opus 4.7 (high reasoning) 37.7% 65.6% +15.6 pp 94.8% +0.326
12 Claude Opus 4.6 (no reasoning) 38.1% 64.4% +14.4 pp 32.6% +0.098
13 Grok 4.20 0309 (Reasoning) 39.0% 68.4% +18.4 pp 82.4% +0.425
14 Claude Sonnet 4.6 (no reasoning) 44.8% 67.0% +17.0 pp 89.1% +0.365
15 Baidu Ernie 5.0 45.0% 66.1% +16.1 pp 72.5% +0.315
16 Gemini 3.1 Flash-Lite Preview 45.3% 52.5% +2.5 pp 54.9% +0.035
17 Qwen3.5-397B-A17B 48.5% 65.8% +15.8 pp 88.6% +0.333
18 Gemma 4 31B IT 50.8% 74.6% +24.6 pp 63.2% +0.360
19 Kimi K2.5 Thinking 51.6% 75.8% +25.8 pp 94.3% +0.648
20 Qwen3.5-122B-A10B 51.7% 56.7% +6.7 pp 90.2% +0.135
21 GPT-5.4 Mini (xhigh reasoning) 54.7% 75.7% +25.7 pp 88.1% +0.438
22 Gemma 4 31B Reasoning 54.9% 78.0% +28.0 pp 89.6% +0.391
23 GPT-5.4 (no reasoning) 57.4% 72.7% +22.7 pp 80.3% +0.453
24 Mistral Medium 3.1 57.9% 67.2% +17.2 pp 85.0% +0.332
25 Mistral Large 3 58.7% 27.4% -22.6 pp 89.1% -0.408
26 Grok 4.20 0309 (Non-Reasoning) 64.8% 65.8% +15.8 pp 94.3% +0.751
27 GPT-5.4 (high reasoning) 66.3% 82.3% +32.3 pp 93.8% +0.505

How To Read This

  • First-Shown Pick is the share of decisive prompt responses where the model chose the version displayed first. A stable judge can still land near 50% here because the same preferred story is first in one view and second in the other.
  • First Lift is First-Shown Pick minus 50%, in percentage points. Positive values mean first-position bias; negative values mean second-position bias.
  • Order Flip is the share of decisive two-view case pairs where the model changed its underlying canonical choice after the order swap. Lower is better.
  • Decisive Pair Coverage is the share of cases where the model picked one side in both swapped views. Low coverage means ties are hiding part of the contradiction test.
  • Rating Bonus is the average 1-to-7 rating advantage that the same story receives when shown first rather than second. A value near 0 is more order-invariant.
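The metric definitions above can be sketched in code. This is a reconstruction from the definitions, not the benchmark's actual scoring script: it assumes each prompt response is a dict with hypothetical keys `case`, `view` (`"a_first"` or `"b_first"`), `answer`, `rating_first`, and `rating_other`.

```python
from collections import defaultdict

def canonical_choice(view, answer):
    """Map a positional answer ('1' or '2') to the underlying variant.

    In the a_first view position 1 holds variant_a; in the b_first view
    position 1 holds variant_b. Non-decisive answers map to None.
    """
    if answer not in ("1", "2"):
        return None
    first, second = ("a", "b") if view == "a_first" else ("b", "a")
    return first if answer == "1" else second

def summarize(records):
    """Compute the leaderboard metrics for one judge model.

    Assumes at least one decisive prompt and both views per case.
    """
    decisive = [r for r in records if r["answer"] in ("1", "2")]
    first_pick = sum(r["answer"] == "1" for r in decisive) / len(decisive)

    by_case = defaultdict(dict)
    for r in records:
        by_case[r["case"]][r["view"]] = r

    flips = stable = 0
    bonuses = []
    for views in by_case.values():
        a, b = views["a_first"], views["b_first"]
        # Rating bonus: each variant's rating when shown first, minus
        # its rating when shown second in the swapped view.
        bonuses.append(a["rating_first"] - b["rating_other"])  # variant_a
        bonuses.append(b["rating_first"] - a["rating_other"])  # variant_b
        ca = canonical_choice("a_first", a["answer"])
        cb = canonical_choice("b_first", b["answer"])
        if ca and cb:  # decisive in both swapped views
            if ca == cb:
                stable += 1
            else:
                flips += 1
    n_decisive_pairs = flips + stable
    return {
        "first_shown_pick": first_pick,
        "first_lift_pp": 100 * first_pick - 50,
        "order_flip": flips / n_decisive_pairs,
        "decisive_pair_coverage": n_decisive_pairs / len(by_case),
        "rating_bonus": sum(bonuses) / len(bonuses),
    }
```

Note how a judge that always picks the same underlying variant in both views still contributes one first-shown pick and one second-shown pick, which is why a stable judge lands near 50% on First-Shown Pick.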

What This Measures

Position bias is a reliability problem for any workflow that asks a model to compare candidates: model evals, writing contests, rubric grading, search/ranking audits, preference-data collection, and product A/B reviews. A judge that changes its winner when the candidates are swapped is not only noisy; it can turn arbitrary prompt order into a hidden intervention.

The benchmark isolates that failure mode. It does not ask whether one editor model writes better than another in general. It asks whether a judge can compare two closely related texts while preserving the same underlying preference after the display order changes.

That distinction matters operationally. If position bias is large, a single-pass LLM judge score should be treated as order-contaminated unless the evaluation randomizes order, uses paired counterbalancing, or aggregates both swapped views. This benchmark makes that contamination visible instead of letting it hide inside a normal-looking win rate.
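One of the mitigations mentioned above, aggregating both swapped views, can be sketched as a thin wrapper. Here `judge` is a hypothetical callable (not part of this benchmark's code) that returns `'1'`, `'2'`, or `'TIE'` for a displayed pair:

```python
def counterbalanced_judge(judge, text_x, text_y):
    """Order-debiased pairwise verdict from a possibly biased judge.

    Queries both display orders and reports a winner only when the two
    views agree on the same underlying text; a swapped-order
    contradiction is downgraded to 'TIE' instead of silently following
    display position.
    """
    v1 = judge(text_x, text_y)  # x shown first
    v2 = judge(text_y, text_x)  # y shown first
    pick1 = {"1": "x", "2": "y"}.get(v1)
    pick2 = {"1": "y", "2": "x"}.get(v2)
    if pick1 is not None and pick1 == pick2:
        return pick1
    return "TIE"
```

Randomizing display order per item removes position bias only in expectation; querying both orders, as above, also surfaces the contradiction itself, at the cost of two judge calls per pair.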


First-Position Lift

First-position selection rate by model

Across the 27-model report view, the model-average first-shown pick rate is 63.3%, with a median of 65.6%. The model-average absolute first-position lift is 15.2 percentage points. So the aggregate pattern is not a subtle tie-breaker: the displayed order materially changes many judgments.

The direction is not universal. Most models prefer the first-shown version too often, but Mistral Large 3 goes the other way, choosing the first-shown version only 27.4% of the time and giving the same story a -0.408 rating shift when shown first.

Direction Model First-Shown Pick First Lift Order Flip Decisive Pair Coverage Rating Bonus
Most first-biased GPT-5.4 (high reasoning) 82.3% +32.3 pp 66.3% 93.8% +0.505
Most first-biased Gemma 4 31B Reasoning 78.0% +28.0 pp 54.9% 89.6% +0.391
Most first-biased Kimi K2.5 Thinking 75.8% +25.8 pp 51.6% 94.3% +0.648
Most first-biased GPT-5.4 Mini (xhigh reasoning) 75.7% +25.7 pp 54.7% 88.1% +0.438
Most first-biased Gemma 4 31B IT 74.6% +24.6 pp 50.8% 63.2% +0.360
Most first-biased GPT-5.4 (no reasoning) 72.7% +22.7 pp 57.4% 80.3% +0.453
Most second-biased Mistral Large 3 27.4% -22.6 pp 58.7% 89.1% -0.408
Closest to 50/50 Arcee Trinity Large Thinking 48.9% -1.1 pp 36.6% 75.1% -0.010
Closest to 50/50 ByteDance Seed2.0 Pro 48.1% -1.9 pp 28.0% 83.4% +0.018
Closest to 50/50 Gemini 3.1 Flash-Lite Preview 52.5% +2.5 pp 45.3% 54.9% +0.035
Closest to 50/50 DeepSeek V3.2 53.8% +3.7 pp 30.3% 75.1% +0.067
Closest to 50/50 Xiaomi MiMo V2 Pro 54.5% +4.5 pp 19.8% 54.9% +0.063

Rating Effects

First-position rating bonus by model

The choice and rating metrics tell the same broad story: many models not only choose the first story too often, they also score it higher. The model-average first-position rating bonus is +0.257 on the 1-to-7 rating scale.

The largest rating bonuses are Grok 4.20 0309 (Non-Reasoning) at +0.751, Kimi K2.5 Thinking at +0.648, and GPT-5.4 (high reasoning) at +0.505. That makes Grok non-reasoning a useful warning case: its first-shown pick rate is not the highest in the table, but its numeric ratings move strongly with position.

Choice reversal vs first-position lift

This scatter separates two failure modes. Some models have a directional first-position pull. Others are unstable under swapping without a huge net first-position lift, because first-position and second-position flips partly cancel.


What Stands Out

  • The benchmark finds a large position effect in the current snapshot. The model-average order-flip rate is 44.0%, and the median model flips in 44.8% of decisive two-view cases.
  • GPT-5.4 (high reasoning) is the most position-sensitive model in this run: 82.3% first-shown pick rate, +32.3 pp first lift, 66.3% order flip, and +0.505 rating bonus.
  • The GPT-5.4 family is consistently exposed here. GPT-5.4 no-reasoning is also high at 72.7% first-shown picks and 57.4% order flips, while GPT-5.4 Mini xhigh is at 75.7% and 54.7%.
  • Mistral Large 3 is not robust in the opposite direction. It has the strongest second-position bias: 27.4% first-shown picks, -22.6 pp lift, and 58.7% order flips.
  • ByteDance Seed2.0 Pro and DeepSeek V3.2 are the cleanest low-bias results among models with decent coverage: Seed has 28.0% order flips, 83.4% decisive-pair coverage, and only -1.9 pp first lift; DeepSeek has 30.3% order flips, 75.1% coverage, and +3.7 pp lift.
  • Xiaomi MiMo V2 Pro has the lowest order-flip rate at 19.8%, but it also has only 54.9% decisive-pair coverage and a 30.1% tie-prompt rate. That looks more like cautious stability than clean order-invariant judgment.
  • Low net first-position lift is not enough by itself. Qwen3.5-122B-A10B is only +6.7 pp on first lift, but still flips canonical choice in 51.7% of decisive case pairs.
  • Ties are the main non-decisive behavior. The model-average tie-prompt rate is 14.7%; INSUFFICIENT is 0.0%; refusals total only 4 prompt responses across the snapshot.
  • The source-pair design is intentionally narrow in this snapshot: all 193 accepted story pairs compare sibling edits from Claude Sonnet 4.6 (high reasoning) and GPT-5.4 (high reasoning), with slot assignment randomized by case. These results are about judge sensitivity to order over that controlled sibling-edit surface, not a universal statement about every possible answer pair.

Most Order-Sensitive Cases

Some story pairs are much more order-sensitive than others. The strongest case-level signal is not just "models disagree"; it is that the same models change which underlying variant they prefer when the pair is displayed in the opposite order.

Case Topic Order Flip First-Shown Pick Decisive Pair Coverage Rating Bonus
61 storm drain tennis ball 100.0% 66.7% 3.7% +0.058
20 underpass mural cleanup 94.7% 93.2% 70.4% +0.537
155 lighthouse oil room 92.0% 88.9% 92.6% +0.769
110 library return chute 88.2% 95.7% 63.0% +0.685
3 midnight bakery 87.5% 94.4% 88.9% +0.815
109 observatory red flashlight 87.5% 94.4% 88.9% +0.796
8 empty aquarium tunnel 85.7% 85.7% 51.9% +0.481
150 orchard ladder rung 85.7% 90.5% 51.9% +0.352
193 parking garage puddle 84.0% 92.3% 92.6% +0.722
106 rooftop pigeon coop 83.3% 71.2% 66.7% +0.370

"storm drain tennis ball" is a useful caveat: it reaches 100.0% order flip only because very few model-case pairs are decisive there. "underpass mural cleanup", "lighthouse oil room", "midnight bakery", "observatory red flashlight", and "parking garage puddle" are more informative high-sensitivity examples because many more judges commit on both views.


Worked Example

Case 3: midnight bakery

This is one of the clearest high-sensitivity cases in the current snapshot: 88.9% decisive-pair coverage, 94.4% first-shown pick rate, 87.5% order flip, and a +0.815 first-position rating bonus.

Original assignment shown to judges:

Write a scene set in a 24-hour bakery at midnight. Focus on the interaction between the baker and a single customer who cannot sleep. Capture the atmosphere of the empty hours between night and morning in 200 words.

Hidden edit request, included here for transparency but not shown to judges:

Revise the final paragraph by removing the metaphor comparing the paper bag to a heart, and instead emphasize the physical heat of the croissant seeping through the paper into Mari's palms. Additionally, adjust the baker's final line to sound more like a standard, professional instruction rather than a personal recommendation.

The actual benchmark prompt showed the same two variants twice:

Prompt view Story labeled 1. Story labeled 2.
variant_a_first variant_a from GPT-5.4 (high reasoning) variant_b from Claude Sonnet 4.6 (high reasoning)
variant_b_first variant_b from Claude Sonnet 4.6 (high reasoning) variant_a from GPT-5.4 (high reasoning)

The shared judge instruction was to rate each version from 1 to 7, then choose which version was better overall as writing, returning only <rating_first>, <rating_other>, and <answer> tags.

variant_a:

The streetlamp outside flickered as Mari pushed through the glass door. The bakery smelled of yeast and sugar, warm against the October chill. Behind the counter, a woman in a flour-dusted apron looked up from shaping dough.

"Open late," Mari said, though the neon sign had answered already.

"Always." The baker wiped her hands on her hips. "Can't sleep?"

Mari shook her head. She examined the sparse display--three croissants, two muffins, a slice of untouched coffee cake sitting under a glass dome.

"The croissants just came out," the baker said. "Still steaming."

Mari pointed. The baker used tongs, sliding the pastry into a white bag that crinkled loudly in the empty shop. The register beeped. Mari paid with exact change, quarters clicking against the metal drawer.

"Heading home?" the baker asked.

"Work," Mari lied. She worked from home. The apartment was three blocks away, dark and quiet and full of unanswered emails.

The baker nodded as if she knew. "Warm it for thirty seconds before eating."

Mari took the bag. The bell chimed as she stepped back into the night, the croissant's heat seeping through the paper into her palms as she held it close against the October cold.

variant_b:

The streetlamp outside flickered as Mari pushed through the glass door. The bakery smelled of yeast and sugar, warm against the October chill. Behind the counter, a woman in a flour-dusted apron looked up from shaping dough.

"Open late," Mari said, though the neon sign had answered already.

"Always." The baker wiped her hands on her hips. "Can't sleep?"

Mari shook her head. She examined the sparse display--three croissants, two muffins, a slice of untouched coffee cake sitting under a glass dome.

"The croissants just came out," the baker said. "Still steaming."

Mari pointed. The baker used tongs, sliding the pastry into a white bag that crinkled loudly in the empty shop. The register beeped. Mari paid with exact change, quarters clicking against the metal drawer.

"Heading home?" the baker asked.

"Work," Mari lied. She worked from home. The apartment was three blocks away, dark and quiet and full of unanswered emails.

The baker nodded as if she knew. "Reheat at three-fifty for five minutes if it cools."

Mari took the bag. The bell chimed as she stepped back into the night, the heat of the croissant seeping steadily through the paper and into her palms.

In the table below, FIRST -> A means the model chose the first-displayed story in the view where the first slot held canonical variant_a. FIRST -> B means it again chose the first-displayed story after the order was swapped, when that slot held variant_b. Choosing the first-displayed story in both views is the basic position-bias failure. Ratings are rating_first/rating_other in each prompt.

Model A-first output B-first output Underlying result Ratings
Claude Opus 4.6 (no reasoning) FIRST -> A FIRST -> B flip A-first 6/6; B-first 6/5
Claude Opus 4.6 (high reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Claude Opus 4.7 (high reasoning) FIRST -> A FIRST -> B flip A-first 5/4; B-first 5/4
Claude Sonnet 4.6 (no reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Claude Sonnet 4.6 (high reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
DeepSeek V3.2 FIRST -> A OTHER -> A stable A A-first 6/5; B-first 6/7
Baidu Ernie 5.0 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Gemini 3.1 Flash-Lite Preview FIRST -> A TIE unresolved tie A-first 6/5; B-first 6/6
Gemini 3.1 Pro Preview FIRST -> A FIRST -> B flip A-first 6/6; B-first 6/5
Gemma 4 31B IT FIRST -> A FIRST -> B flip A-first 7/6; B-first 7/6
Gemma 4 31B Reasoning FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
GLM-5.1 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/6
GPT-5.4 (high reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
GPT-5.4 Mini (xhigh reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/6
GPT-5.4 (no reasoning) FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/6
Grok 4.20 0309 (Non-Reasoning) FIRST -> A FIRST -> B flip A-first 6/4; B-first 6/4
Grok 4.20 0309 (Reasoning) FIRST -> A FIRST -> B flip A-first 6/4; B-first 6/5
Kimi K2.5 Thinking FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Xiaomi MiMo V2 Pro FIRST -> A TIE unresolved tie A-first 7/6; B-first 6/6
MiniMax-M2.7 FIRST -> A OTHER -> A stable A A-first 5/4; B-first 5/6
Mistral Large 3 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Mistral Medium 3.1 FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Qwen3.5-122B-A10B FIRST -> A FIRST -> B flip A-first 6/4; B-first 6/5
Qwen3.5-397B-A17B FIRST -> A FIRST -> B flip A-first 6/5; B-first 6/5
Qwen 3.6 Plus FIRST -> A FIRST -> B flip A-first 7/6; B-first 6/5
ByteDance Seed2.0 Pro FIRST -> A OTHER -> A stable A A-first 6/5; B-first 6/7
Arcee Trinity Large Thinking TIE FIRST -> B unresolved tie A-first 6/6; B-first 7/6

Example raw tags from a flipping judge, GPT-5.4 (high reasoning):

variant_a_first:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>

variant_b_first:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>

The answer tag is 1 in both prompts, but because the displayed first story changed, the underlying preference changes from variant_a to variant_b.

Example raw tags from a stable judge, ByteDance Seed2.0 Pro:

variant_a_first:
<rating_first>6</rating_first>
<rating_other>5</rating_other>
<answer>1</answer>

variant_b_first:
<rating_first>6</rating_first>
<rating_other>7</rating_other>
<answer>2</answer>

Here the answer changes from 1 to 2 after the swap, which is the order-invariant behavior: both outputs choose the same underlying variant, variant_a.
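The flip/stable/unresolved labels in the table follow mechanically from the two positional answers. A minimal sketch (the `pair_outcome` name is ours, not the benchmark's):

```python
def pair_outcome(answer_a_first, answer_b_first):
    """Classify a swapped-order pair of <answer> tags.

    Answers are positional ('1' = first-shown, '2' = second-shown), so
    each is mapped back to its canonical variant before comparing.
    """
    to_variant = {
        "a_first": {"1": "a", "2": "b"},
        "b_first": {"1": "b", "2": "a"},
    }
    va = to_variant["a_first"].get(answer_a_first)
    vb = to_variant["b_first"].get(answer_b_first)
    if va is None or vb is None:
        return "unresolved"  # a tie or refusal in either view
    return f"stable_{va}" if va == vb else "flip"
```

Under this mapping, GPT-5.4's 1/1 pair above classifies as a flip, while Seed's 1/2 pair classifies as stable for variant_a.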


Outcome Mix

Two-view outcome mix by model

This chart decomposes each model's two-view case outcomes into stable choices, position-following flips, and unresolved cases. It is the fastest way to see why the same headline flip rate can mean different things: one model may flip because it follows the displayed first story, while another may have many ties or a second-position pull.

Choice reversal vs decisive coverage

The coverage view is especially important for this benchmark. A model that refuses to choose cannot contradict itself, but that is not the same as being order-invariant.


Benchmark Construction

  1. Generate a short base story plus a bounded change request.
  2. Ask two editor models to apply the same change request, with variant_a and variant_b assigned per case rather than tied to one fixed editor.
  3. Verify that each variant is still the same core story and that the requested change was applied.
  4. Ask judge models which variant is better in both display orders.
  5. Parse tag-based judge responses and compute first-position lift, order-flip rate, and first-position rating bonus.

The current snapshot uses:

Surface Count
Verified story pairs 193
Prompt rows per full judge model 386
Judge models in report view 27
Writer models used for base stories 4
Editor/source pair 1
Content-dominant cases filtered from charts 1

Writer model distribution:

Writer model Cases
GPT-5.4 (high reasoning) 53
Gemini 3.1 Pro Preview 52
Claude Sonnet 4.6 (high reasoning) 46
Kimi K2.5 Thinking 42

Topic category coverage is mostly general-purpose story situations: 188 general cases, 3 planning cases, 1 reasoning case, and 1 high-stakes case. Those small non-general buckets are included for coverage transparency only and are not stable category-level findings.


Method In Brief

The benchmark is deliberately narrower than a general "which model writes better?" test. The goal is to isolate order sensitivity in pairwise judging.

The judge sees:

  • the original assignment
  • two sibling story variants
  • answer labels such as 1 and 2
  • rating tags for the first-shown and second-shown story

The judge does not see:

  • the hidden edit request
  • which editor produced which variant
  • the fact that the same pair will later appear in the opposite order

The parser expects tag-style outputs rather than strict JSON. The required pass-1 tags are:

<rating_first>1..7</rating_first>
<rating_other>1..7</rating_other>
<answer>1|2|TIE|INSUFFICIENT</answer>

Internally, answers are normalized to canonical variant IDs, so choosing 1 in the variant_a_first prompt and choosing 2 in the variant_b_first prompt both count as the same stable underlying preference for variant_a.
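A parser for this tag format might look like the following sketch; the function name and the exact strictness choices are assumptions, not the benchmark's actual parser:

```python
import re

REQUIRED_TAGS = ("rating_first", "rating_other", "answer")
VALID_RATINGS = {str(n) for n in range(1, 8)}  # "1".."7"
VALID_ANSWERS = {"1", "2", "TIE", "INSUFFICIENT"}

def parse_judgment(raw):
    """Extract the three required pass-1 tags from a raw judge response.

    Returns a dict with int ratings and a string answer, or None if any
    tag is missing or holds an out-of-range value.
    """
    out = {}
    for tag in REQUIRED_TAGS:
        m = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", raw, re.DOTALL)
        if m is None:
            return None
        out[tag] = m.group(1)
    for tag in ("rating_first", "rating_other"):
        if out[tag] not in VALID_RATINGS:
            return None
        out[tag] = int(out[tag])
    if out["answer"] not in VALID_ANSWERS:
        return None
    return out
```

Tag-style extraction like this tolerates surrounding prose better than strict JSON parsing, which matters when reasoning models wrap their verdict in commentary.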


Updates

  • April 21, 2026: Initial version.