
LLM Round‑Trip Translation Benchmark

When a model translates out of English and then back to English, how much meaning and voice does it keep? Each model does both steps (English → target language → English). A judge compares the back‑translation to the original and scores closeness on a 0–10 scale.
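
A minimal sketch of the round‑trip loop, assuming hypothetical `call_model`, `round_trip`, and `judge` helpers (none of these names come from the repository):

```python
# Hypothetical sketch of one round-trip evaluation. call_model, round_trip,
# and judge are illustrative stand-ins, not the repository's actual API.

def call_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError

def round_trip(model: str, text: str, lang: str) -> str:
    # Leg 1: English -> target language (the same model does both legs).
    forward = call_model(model, f"Translate the following text into {lang}:\n\n{text}")
    # Leg 2: target language -> English.
    return call_model(model, f"Translate the following text into English:\n\n{forward}")

def judge(judge_model: str, original: str, back_translation: str) -> float:
    # A judge compares the back-translation to the original and scores 0-10.
    reply = call_model(
        judge_model,
        "On a 0-10 scale, score how closely the back-translation preserves the "
        f"meaning, tone, and style of the original.\n\nORIGINAL:\n{original}\n\n"
        f"BACK-TRANSLATION:\n{back_translation}\n\nReply with a single number.",
    )
    return float(reply.strip())
```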

Key Results

Overall Leaderboard (Ensemble, zoom)

Ensemble aggregates across judges per item, then averages across all items (all languages); the zoom view overlays µ±3·SEM bands.

The chart ranks models by average round‑trip score (higher is better).

Overall winner share across all languages:

Overall Winner Pie

Methods in Brief

10 languages × 200 sources per language (2000 items). Each model translates every item (EN→LANG→EN). Five judges score each item; ensemble averages per item across judges, then across items. Eight models → 16,000 model×item pairs and 80,000 total judgments.
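
As a rough illustration of that aggregation order (judges averaged within each item, then items averaged across all languages), here is a small sketch; the data and names are made up, and the ±3·SEM band mirrors what the zoom view overlays:

```python
import statistics

# scores[item_id] is the list of judge scores (0-10) for one model on one item.
# Toy data: two items, three judges; real runs use 2,000 items and five judges.
scores = {"item-1": [8.0, 9.0, 8.5], "item-2": [7.0, 7.5, 8.0]}

def leaderboard_entry(per_item_judge_scores):
    # 1) Ensemble: average across judges within each item.
    item_means = [statistics.mean(js) for js in per_item_judge_scores.values()]
    # 2) Average the per-item means across all items (all languages pooled).
    mu = statistics.mean(item_means)
    # Uncertainty band like the zoom view's overlay: mu +/- 3 * SEM of item means.
    sem = statistics.stdev(item_means) / len(item_means) ** 0.5
    return mu, (mu - 3 * sem, mu + 3 * sem)

print(leaderboard_entry(scores))  # roughly (8.0, (6.5, 9.5)) for the toy data
```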

Top Models (snapshot)

| Rank | Model | Mean Score |
|------|-------|------------|
| 1 | GPT-5 (medium reasoning) | 8.690 |
| 2 | Grok 4 | 8.573 |
| 3 | Claude Opus 4.1 (no reasoning) | 8.559 |
| 4 | Gemini 2.5 Pro | 8.529 |
| 5 | Qwen 3 Max Preview | 8.324 |
| 6 | DeepSeek V3.1 Reasoner | 8.298 |
| 7 | Mistral Medium 3.1 | 8.285 |
| 8 | Kimi K2-0905 | 8.285 |

By Language

Normalized Heatmap (z-scored within each language)

  • Z-scored per language so each column shows relative strength vs. peers (0 = average for that language).
  • Prefer this normalized view; it controls for language mix and difficulty.

Per‑language leaderboards (zoom) and distributions (strip plots):

Full tables for each language are generated under reports/. See the index: reports/leaderboard_by_language.md.

Per‑language charts — zoomed leaderboards + strip plots

Arabic

Arabic Leaderboard (zoom)

Arabic Strip

Arabic Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.688 | 5 |
| 2 | Claude Opus 4.1 (no reasoning) | 8.616 | 5 |
| 3 | Gemini 2.5 Pro | 8.566 | 5 |

Chinese

Chinese Leaderboard (zoom)

Chinese Strip

Chinese Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.672 | 5 |
| 2 | Claude Opus 4.1 (no reasoning) | 8.651 | 5 |
| 3 | Grok 4 | 8.634 | 5 |

Spanish

Spanish Leaderboard (zoom)

Spanish Strip

Spanish Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.794 | 5 |
| 2 | Claude Opus 4.1 (no reasoning) | 8.743 | 5 |
| 3 | Grok 4 | 8.680 | 5 |

Hindi

Hindi Leaderboard (zoom)

Hindi Strip

Hindi Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.731 | 5 |
| 2 | Claude Opus 4.1 (no reasoning) | 8.676 | 5 |
| 3 | Gemini 2.5 Pro | 8.579 | 5 |

Russian

Russian Leaderboard (zoom)

Russian Strip

Russian Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.706 | 5 |
| 2 | Claude Opus 4.1 (no reasoning) | 8.647 | 5 |
| 3 | Grok 4 | 8.635 | 5 |

Japanese

Japanese Leaderboard (zoom)

Japanese Strip

Japanese Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | Grok 4 | 8.703 | 5 |
| 2 | GPT-5 (medium reasoning) | 8.678 | 5 |
| 3 | Claude Opus 4.1 (no reasoning) | 8.670 | 5 |

Korean

Korean Leaderboard (zoom)

Korean Strip

Korean Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.655 | 5 |
| 2 | Grok 4 | 8.614 | 5 |
| 3 | Claude Opus 4.1 (no reasoning) | 8.612 | 5 |

Polish

Polish Leaderboard (zoom)

Polish Strip

Polish Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.720 | 5 |
| 2 | Claude Opus 4.1 (no reasoning) | 8.637 | 5 |
| 3 | Grok 4 | 8.636 | 5 |

Turkish

Turkish Leaderboard (zoom)

Turkish Strip

Turkish Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.680 | 5 |
| 2 | Grok 4 | 8.605 | 5 |
| 3 | Claude Opus 4.1 (no reasoning) | 8.578 | 5 |

Swahili

Swahili Leaderboard (zoom)

Swahili Strip

Swahili Winner Pie

Top Models (snapshot)

| Rank | Model | Mean | # Judges |
|------|-------|------|----------|
| 1 | GPT-5 (medium reasoning) | 8.573 | 5 |
| 2 | Gemini 2.5 Pro | 8.346 | 5 |
| 3 | Grok 4 | 8.220 | 5 |

Normalized View

Language‑Normalized Leaderboard

Controls for language difficulty/mix by z-scoring within each language, then averaging per model.

Interpretation: 0 = language mean; >0 above‑language average; <0 below.
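
A minimal sketch of this normalization, assuming a long‑format table of per‑item scores (the column names and values are illustrative):

```python
import pandas as pd

# One row per model x item, with columns model, language, score.
df = pd.DataFrame({
    "model":    ["a", "b", "a", "b"],
    "language": ["Swahili", "Swahili", "Polish", "Polish"],
    "score":    [8.2, 8.0, 8.7, 8.6],
})

# Z-score within each language: 0 = that language's mean, so a harder language
# does not drag every model down when languages are pooled.
df["z"] = df.groupby("language")["score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

# Language-normalized leaderboard: average each model's z-scores.
print(df.groupby("model")["z"].mean().sort_values(ascending=False))
```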

Reliability and Uncertainty

Judge Agreement (Pearson)

  • Multi‑judge runs add error bars in the leaderboard and a judge‑agreement heatmap (a sketch of the agreement computation follows this list). Where judges disagree, treat small rank gaps as ties.
  • We compute averages as “mean of per‑sample means,” optionally averaged across judges, so no single judge or language dominates.
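
A sketch of the agreement computation referenced above, with made‑up judge names and scores:

```python
import pandas as pd

# One column per judge, one row per scored (model, item) pair; values are the
# judge's 0-10 scores. Judge names and numbers are illustrative only.
ratings = pd.DataFrame({
    "judge-1": [8.0, 7.5, 9.0, 6.5],
    "judge-2": [8.5, 7.0, 9.0, 6.0],
    "judge-3": [7.5, 7.5, 8.5, 7.0],
})

# Pairwise Pearson correlation between judges over the shared items; this is
# the kind of matrix rendered as the judge-agreement heatmap.
print(ratings.corr(method="pearson").round(2))
```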

Also useful for sanity‑checking judges vs. translators (normalized by judge):

Judge × LLM — Normalized Means

Failure Rationales

We summarize judge rationales into a lightweight error taxonomy and synthesize per‑model failure summaries ("failure models"); a simplified tagging sketch follows the links below.

  • Taxonomy index: reports/error_taxonomy.md
  • Failure models (LLM‑summarized): reports/failure_models/<translator>/<lang>.md
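
The real tagging is LLM‑based; purely to illustrate the taxonomy, here is a hypothetical keyword heuristic that buckets a rationale into one of the five tags from the snapshot below and computes tag shares:

```python
from collections import Counter

# Crude keyword rules standing in for the LLM-based summarization; the five
# tags match the taxonomy snapshot below.
RULES = {
    "omission":        ("omit", "missing", "dropped", "loses"),
    "addition":        ("adds", "invented", "introduc"),
    "numbers_units":   ("number", "unit", "percent", "digit"),
    "disclaimer_meta": ("disclaimer", "as an ai", "meta"),
    "tone_shift":      ("tone", "register", "flatten", "style"),
}

def tag_rationale(rationale: str) -> str:
    text = rationale.lower()
    for tag, keywords in RULES.items():
        if any(k in text for k in keywords):
            return tag
    return "tone_shift"  # fall back to the most common tag in the snapshot

def tag_shares(rationales: list[str]) -> dict[str, float]:
    counts = Counter(tag_rationale(r) for r in rationales)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}
```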

Taxonomy snapshot (GPT‑5 Medium — Chinese):

| Tag | Count | Share |
|-----|-------|-------|
| tone_shift | 583 | 45.2% |
| omission | 352 | 27.3% |
| addition | 332 | 25.7% |
| numbers_units | 18 | 1.4% |
| disclaimer_meta | 5 | 0.4% |

Failure‑model excerpt (GPT‑5 — Chinese):

  • Mixed‑language leakage in structured sections; headings/bullets left in Chinese in back‑translation.
  • Tone/imagery flattening and metaphor drift (poetic language becomes literal; metaphor swaps).
  • Domain term substitution (legal/technical terms normalized to near neighbors; scope changes).

Example rationale quotes (selected across models/languages):

  • Model DeepSeek Reasoner — Arabic: “Multiple critical meaning reversals (e.g., "nods" becomes "shakes his head") make key character interactions nonsensical and contradict the original narrative.”
  • Model DeepSeek Reasoner — Arabic: “The back‑translation fundamentally misidentifies the "wheeled lunar rover" as a "winged asteroid probe," a major factual error, though most other technical details are preserved.”
  • Model DeepSeek Reasoner — Arabic: “Contains untranslated words and major meaning errors ("metronome" to "mazurka," "on my knees" to "on her lap"), corrupting key images and memories in the original narrative.”
  • Model DeepSeek Reasoner — Arabic: “Loses specific football jargon, reverses a key instruction (“jump it” → “jump on him”), and renders the final motivational line nonsensical (“every quick enemy”).”
  • Model Qwen 3 Max Preview — Chinese: “Several normative shifts (SHOULD→must) and minor terminology changes (“intermediaries”→“middleware”) alter strength of requirements and tone.”
  • Model Qwen 3 Max Preview — Chinese: “Repeatedly narrows ‘hearing/minute order’ to ‘trial/trial minute,’ altering scope; minor phrasing shifts in examples and directives.”
  • Model Qwen 3 Max Preview — Chinese: “Specific branding language in the taglines is paraphrased or altered, losing key terms like "Consciously" and the direct "sole/soul" pun.”
  • Model Qwen 3 Max Preview — Chinese: “Recurring shifts from ‘late afternoon/dusk’ to ‘evening/twilight,’ and sign wording changes alter motifs, specificity, and tone.”
  • Model Qwen 3 Max Preview — Arabic: “The meaning is preserved well, but there are many lexical substitutions and a few minor shifts, like changing the partner's pronoun from gender‑neutral "their" to masculine "his".”
  • Model Qwen 3 Max Preview — Arabic: “The translation alters several specific details, such as 'Good afternoon' to 'Good evening,' 'bodega' to 'small shop,' and 'business partner' to 'coworker,' losing some nuance.”

What’s Measured

  • Round‑trip fidelity: meaning, tone/register, and stylistic alignment between original and back‑translation.
  • Languages covered: Polish, Chinese, Spanish, Arabic, Hindi, Russian, Japanese, Korean, Turkish, Swahili.
  • Diagnostics: distributions by language, winner share, and length ratios (back/original) to catch omissions or verbosity; a quick ratio check is sketched after this list.
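
A quick sketch of the length‑ratio check mentioned above; the 0.8 and 1.3 thresholds are illustrative, not the benchmark's settings:

```python
# Length-ratio diagnostic: flag back-translations that are much shorter than
# the original (possible omissions) or much longer (possible verbosity).

def length_ratio(original: str, back_translation: str) -> float:
    return len(back_translation.split()) / max(len(original.split()), 1)

def length_flag(original: str, back_translation: str) -> str | None:
    r = length_ratio(original, back_translation)
    if r < 0.8:
        return f"possible omission (ratio {r:.2f})"
    if r > 1.3:
        return f"possible verbosity (ratio {r:.2f})"
    return None
```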

How Scoring Works

  • Judge rubric: compares original vs. back‑translation on a 0–10 scale (an illustrative prompt sketch follows this list).
  • Anchors: 10.0 ≈ indistinguishable; 7.0 ≈ minor losses; 5.0 ≈ noticeable omissions/additions or tone/register shifts; 0.0 ≈ unrelated.
  • Penalties: invented/missing content, tone/register drift, meta‑disclaimers. Trivial mechanics (e.g., punctuation) don’t matter if meaning is intact.
  • Aggregation: per‑item means (optionally across multiple judges), then averaged across items and languages.
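
For illustration, a judge prompt that encodes these anchors and penalties might look like the following; the wording is a paraphrase for this sketch, not the benchmark's actual prompt:

```python
# Illustrative judge prompt encoding the rubric's anchors and penalties.
JUDGE_PROMPT = """You are comparing an ORIGINAL English text with a BACK-TRANSLATION.
Score how closely the back-translation preserves the original's meaning,
tone/register, and style on a 0-10 scale:
  10 = indistinguishable in meaning and voice
   7 = minor losses
   5 = noticeable omissions/additions or tone/register shifts
   0 = unrelated
Penalize invented or missing content, tone/register drift, and meta-disclaimers.
Ignore trivial mechanics (e.g., punctuation) when the meaning is intact.

ORIGINAL:
{original}

BACK-TRANSLATION:
{back_translation}

Reply with a single number between 0 and 10."""

def build_judge_prompt(original: str, back_translation: str) -> str:
    return JUDGE_PROMPT.format(original=original, back_translation=back_translation)
```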

Notes

  • The benchmark uses the same model for forward and back translation to stress internal consistency.
  • A short narrative of findings and additional figures are available in the project’s report.

Bonus Views

Per‑language score distributions are included above (strip plots for all languages).

Language comparison aggregates:

All‑Models Mean by Language (zoom)

Per‑language mean of per‑model means (equal weight per model). Error shading indicates uncertainty.

Strip Plot — Languages Colored

Each dot = per‑story mean across judges; colors encode language; models on x‑axis with light jitter.



Updates

  • Sep 15, 2025: Initial version.

Follow @lechmazur on X for other upcoming benchmarks and more.