Dataset
ErrataBench tests models against a dataset of English text from various sources (literature, law, technical manuals). Each source has been altered with a wide variety of writing errors across several categories, and the injected errors are tracked in a corresponding file.
Source text is altered using a corruptor model + review model pair: the corruptor is instructed to make small changes to the text that insert errors of specific types, sampled from the error category taxonomy, and the review model then provides a second opinion on each change before it is accepted into the dataset. A minimal sketch of how such a pair could be wired together is shown below (the function names, prompts, and the error taxonomy shown are illustrative, not the benchmark's actual API).
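```python
import random

# Illustrative subset of an error-category taxonomy (not the full list).
ERROR_CATEGORIES = ["spelling", "agreement", "homophone", "punctuation", "word-swap"]

def corrupt_chunk(corruptor_llm, text, category):
    """Ask the corruptor model to insert one error of the given category."""
    prompt = (f"Make a single small change to the text below that introduces "
              f"a {category} error. Return only the altered text.\n\n{text}")
    return corruptor_llm(prompt)

def review_change(reviewer_llm, original, altered, category):
    """Second opinion: confirm the change is a plausible, in-category error."""
    prompt = (f"Original:\n{original}\n\nAltered:\n{altered}\n\n"
              f"Does the altered text contain exactly one new {category} error "
              f"and otherwise preserve the meaning? Answer yes or no.")
    return reviewer_llm(prompt).strip().lower().startswith("yes")

def build_dataset_entry(corruptor_llm, reviewer_llm, source_text):
    """Alter one source text and record the injected error for later scoring."""
    category = random.choice(ERROR_CATEGORIES)
    altered = corrupt_chunk(corruptor_llm, source_text, category)
    if review_change(reviewer_llm, source_text, altered, category):
        return {"original": source_text, "altered": altered, "category": category}
    return None  # rejected by the reviewer; resample or skip
```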
Agent Loop
The benchmark runs a simple agent loop over each altered text, showing the model chunks of up to 2000 words at a time while keeping whole paragraphs together. The model is simply asked to proofread the text carefully, fixing errors while preserving meaning and details. The model is not given any specifics about what types of errors are hiding in the text or how many of them there are.
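One way to build ~2000-word chunks while keeping whole paragraphs together looks roughly like this (a sketch, not the benchmark's actual chunker):

```python
def chunk_paragraphs(text, max_words=2000):
    """Group whole paragraphs into chunks of at most max_words words.

    Paragraphs are never split: a paragraph that would overflow the current
    chunk starts a new one instead (a single oversized paragraph becomes
    its own chunk).
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```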
The agent loop gives the model up to 3 turns with each chunk, during which it can modify the text using simple tools: find_and_replace (for surgical changes) and replace_paragraph (for wider rewrites). If the model restores the text to its original form, the error is immediately counted as fixed; if it produces a different wording, an LLM judge decides whether that alternative is also correct.
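The tools and the scoring rule can be thought of roughly as follows (a simplified sketch; `judge_accepts` stands in for the LLM judge and is not the benchmark's real interface):

```python
def find_and_replace(text, old, new):
    """Surgical edit: swap the first exact occurrence of a substring."""
    return text.replace(old, new, 1)

def replace_paragraph(text, index, new_paragraph):
    """Wider rewrite: replace an entire paragraph by position."""
    paras = text.split("\n\n")
    paras[index] = new_paragraph
    return "\n\n".join(paras)

def score_error(edited_span, original_span, judge_accepts):
    """An error counts as fixed if the edit restores the original text,
    or if the judge deems the alternative wording equally correct."""
    if edited_span == original_span:
        return True
    return judge_accepts(original_span, edited_span)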
Metrics
The headline Fix rate score is an equal-weighted per-dataset mean, not a pooled average over all runs. For each source dataset d, ErrataBench averages the successful-run quality scores for that dataset to get mean(q_d), then averages those dataset-level means across datasets. That keeps datasets with more successful repeats from counting more heavily than datasets with fewer.
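In code, the aggregation looks roughly like this (a sketch, with `quality_by_dataset` mapping each dataset name to its successful-run quality scores):

```python
from statistics import mean

def fix_rate(quality_by_dataset):
    """Equal-weighted per-dataset mean: average each dataset's successful-run
    quality scores first, then average those dataset-level means."""
    dataset_means = [mean(scores) for scores in quality_by_dataset.values()]
    return mean(dataset_means)

# Example: the dataset with four repeats does not outweigh the one with two.
print(fix_rate({"literature": [0.9, 0.8, 0.85, 0.9], "law": [0.6, 0.7]}))
```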
The other scatterplot metrics are derived similarly. Cost Efficiency is corrected issues per USD at the run level, then averaged by dataset and across datasets. Speed is corrected issues per minute of run time, aggregated the same way. Tool Call Efficiency is shown as issues fixed per 100 benchmark-scoped tool characters, an inverted display of the underlying tool-characters-per-resolved-issue metric. Turns Used is the mean number of turns per chunk, averaged across datasets.
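The same two-level aggregation applies to these metrics; Cost Efficiency, for instance, could be computed roughly like this (a sketch; the field names are illustrative):

```python
from statistics import mean

def cost_efficiency(runs_by_dataset):
    """Issues fixed per USD at the run level, averaged within each dataset
    and then equally across datasets."""
    per_dataset = [mean(r["issues_fixed"] / r["cost_usd"] for r in runs)
                   for runs in runs_by_dataset.values()]
    return mean(per_dataset)
```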
The scatterplot's consistency metric is median_d range(q_d), the median per-dataset range of quality. For each dataset, ErrataBench measures the spread between the highest and lowest successful-run quality score, then takes the median of those dataset-level ranges. In the UI this appears as Consistency. Lower values mean the model's proofreading quality is more consistent from run to run, while the median makes the metric less sensitive to a few noisy source texts.
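A sketch of that calculation, using the same `quality_by_dataset` mapping as above:

```python
from statistics import median

def consistency(quality_by_dataset):
    """Median per-dataset range of successful-run quality scores.
    Lower is better: quality varies less from run to run."""
    ranges = [max(scores) - min(scores) for scores in quality_by_dataset.values()]
    return median(ranges)
```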
Reproducing Results
Source code, dataset, and results are available on GitHub. Results can be reproduced using an OpenRouter API key. The repository also includes a tool for creating your own dataset using any source text, making it easy to run this benchmark against new samples.
The benchmark has two primary inputs that determine how the agent loop works: the chunk size and the number of turns per chunk. Changing these values can produce different results: larger chunks and fewer turns favor models that do less reasoning, since high-reasoning models tend to get lost when presented with larger chunks and given fewer turns to edit them. The data published here was generated by running the benchmark with --samples 3 --chunk-size 2000 --max-turns-per-chunk 3, settings chosen as a reasonable general-purpose configuration.
The full list of error categories used by ErrataBench can be found below.