tl;dr - As of 2026-04-25, models are quite bad at reading music. Friends at OpenAI, Anthropic, and Google: Please saturate this benchmark! kthx.
This repo benchmarks the performance of an LLM on the task of musical OCR. The LLM is given an image of some written music. Its job is to determine what notes are written.
Right now this benchmark is very easy: it's only single notes on a single staff. No chords, no rests, no ties, etc. I stopped here because the models were so bad at it that it didn't make sense to make anything harder.
This repo was entirely vibe-coded (using Codex with GPT 5.4-high). I have not looked at any of the code.
## Test results
On the current 48-example hidden private_test split:
| Model | Exact Match | Avg F1 | Edit Distance |
|---|---|---|---|
| gpt-5.4 | 8.3% | 0.3265 | 2.9375 |
| gpt-5.5 | 0.0% | 0.1581 | 3.7708 |
| claude-opus-4-7 | 2.1% | 0.1647 | 3.5833 |
| gemini-3-flash-preview | 4.2% | 0.2657 | 3.4583 |
(Exact Match is the fraction of test cases the model gets exactly correct. Avg F1 is the average of the per-testcase F-scores.)
These numbers come from a split generated from a private seed that's not committed to the repo. The results on the public dataset are similar.
The harness defaults gpt-5.5 to no reasoning; otherwise it spends its full 256-token output budget on reasoning and returns no JSON answer.
In case it's not clear, these results are not good for any of the models I tested. If you have another model you want me to test, send a PR adding support for it.
## Example Wrong Outputs
Here is an example that all of the models I tried got wrong.
Target measure: 1
Correct output:
{"notes":["F4","A4","F5","A4","G5"]}gpt-5.5:
{"notes":["C4","D4","G4","B4","A4","E5"]}claude-opus-4-7:
{"notes":["E4","G4","C5","G4","C5"]}gemini-3-flash-preview:
{"notes":["A4","C5","E5","B4","D5"]}Everything below here is AI slop that I didn't bother to read. Good luck.
## Quickstart
Create and inspect a dataset:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
python -m music_bench generate --output-dir data/generated --dev-count 8 --public-test-count 16
```
Render score images after installing LilyPond:
```bash
brew install lilypond
python -m music_bench render --manifest data/generated/dev/manifest.jsonl
python -m music_bench render --manifest data/generated/public_test/manifest.jsonl
```
Run an offline replay evaluation:
```bash
python -m music_bench evaluate \
  --manifest data/generated/dev/manifest.jsonl \
  --provider replay \
  --replay-file path/to/responses.jsonl \
  --results-file data/generated/dev/results/replay.jsonl
```
Generate a report:
```bash
python -m music_bench report \
  --manifest data/generated/dev/manifest.jsonl \
  --results-file data/generated/dev/results/replay.jsonl \
  --output-dir reports/dev
```
## Provider configuration
The benchmark includes adapters for OpenAI, Anthropic, Google, and a JSONL replay mode.
- `openai`: requires `OPENAI_API_KEY`
- `anthropic`: requires `ANTHROPIC_API_KEY`
- `google`: requires `GEMINI_API_KEY` or `GOOGLE_API_KEY`
- `replay`: requires a JSONL file containing `{"id":"example-id","response":"..."}` (see the sketch after this list)
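
For example, a two-line replay file might look like the sketch below. The ids are hypothetical; they must match the `id` values in the manifest you evaluate against, and each `response` holds the raw model output as a string.

```jsonl
{"id": "dev-0001", "response": "{\"notes\":[\"F4\",\"A4\",\"F5\",\"A4\",\"G5\"]}"}
{"id": "dev-0002", "response": "{\"notes\":[\"G4\",\"B4\",\"D5\"]}"}
```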
Use the same prompt contract across all providers. The default prompt is zero-shot, requests strict JSON only, and asks models to spell accidentals with ASCII # and b like F#4 and Bb3.
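
For example, a well-formed answer for a measure containing an F-sharp and a B-flat would look like this (the pitches are illustrative):

```json
{"notes":["F#4","Bb3","D4"]}
```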
## Dataset design
Each generated example includes:
- `id`
- `image_path`
- `question`
- `target_measure`
- `answer_notes`
- `metadata`
metadata includes clef, key signature, time signature, note count, accidental count, pitch range, generator seed, skill tags, and per-measure note sequences for error analysis.
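
As a sketch of what one manifest record contains (one JSON object per line in `manifest.jsonl`; all values below are illustrative and the exact metadata key names are assumptions, only the field list above comes from the repo):

```json
{
  "id": "dev-0001",
  "image_path": "images/dev-0001.png",
  "question": "What notes are written in measure 1?",
  "target_measure": 1,
  "answer_notes": ["F4", "A4", "F5", "A4", "G5"],
  "metadata": {
    "clef": "treble",
    "key_signature": "C major",
    "time_signature": "4/4",
    "note_count": 5,
    "accidental_count": 0,
    "pitch_range": ["F4", "G5"],
    "seed": 101,
    "skill_tags": ["single_notes"],
    "measures": [["F4", "A4", "F5", "A4", "G5"]]
  }
}
```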
The generator emits contrast pairs so that two nearly identical score images have different correct answers. This reduces shortcutting and makes blind baselines easier to detect.
## Split policy
Split seeds are defined in benchmark_config.toml.
- `dev`: fixed public seed, small, answers visible, used for prompt iteration and harness debugging.
- `public_test`: fixed public seed, larger, answers visible, used for reproducible public comparisons.
- `private_test`: fixed secret seed, not committed, used for headline claims and contamination resistance.
The checked-in config currently pins:
- `dev.seed = 101`
- `public_test.seed = 202`
- `private_test` has no checked-in seed
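
For orientation, the split section of benchmark_config.toml looks roughly like the sketch below. Only the pinned seed values and the environment variable name come from this README; the table layout, the `seed_file` / `seed_env` key names, and the seed filename are assumptions.

```toml
[dev]
seed = 101

[public_test]
seed = 202

[private_test]
# No checked-in seed; the config instead names a gitignored local seed file
# and a fallback environment variable.
seed_file = "private_test_seed.txt"        # hypothetical filename
seed_env = "MUSIC_BENCH_PRIVATE_TEST_SEED"
```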
By default, private_test is generated from the gitignored local seed file named in the config:
```bash
python -m music_bench generate --output-dir data/generated
```
The generator falls back to the environment variable named in the config if the local seed file is missing:
```bash
export MUSIC_BENCH_PRIVATE_TEST_SEED=123456789
python -m music_bench generate --output-dir data/generated
```

You can still override the private seed with `--private-seed`, but the local gitignored seed file is the safest default because it keeps the seed reproducible without leaking it into shell history.
Each generated split writes a split_info.json file next to its manifest so the benchmark version, split name, seed, and visibility are explicit in the artifacts.
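
A split_info.json might look roughly like this; the key names and version string are assumptions, since the README only says which facts get recorded (benchmark version, split name, seed, visibility):

```json
{
  "benchmark_version": "0.1.0",
  "split": "public_test",
  "seed": 202,
  "answers_visible": true
}
```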
## Metrics
- Exact match on normalized note sequence.
- Note-level precision, recall, and F1.
- Levenshtein edit distance on the predicted note sequence.
- Error categories: formatting failure, wrong measure, octave mistake, accidental mistake, or other mismatch.
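
Below is a minimal sketch of the first three metrics on normalized note sequences, written from the descriptions above rather than taken from the repo's code (the actual harness may normalize or align notes differently):

```python
from collections import Counter

def exact_match(pred: list[str], gold: list[str]) -> bool:
    # Exact match: the full normalized note sequence is identical.
    return pred == gold

def note_f1(pred: list[str], gold: list[str]) -> float:
    # Note-level F1 over multisets of note names (order-insensitive).
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def edit_distance(pred: list[str], gold: list[str]) -> int:
    # Levenshtein distance over note tokens (insert/delete/substitute).
    prev = list(range(len(gold) + 1))
    for i, p in enumerate(pred, start=1):
        cur = [i]
        for j, g in enumerate(gold, start=1):
            cur.append(min(prev[j] + 1,              # delete
                           cur[j - 1] + 1,           # insert
                           prev[j - 1] + (p != g)))  # substitute
        prev = cur
    return prev[-1]

print(note_f1(["F4", "A4", "F5"], ["F4", "A4", "F5", "A4", "G5"]))  # 0.75
print(edit_distance(["C4", "D4"], ["C4", "E4"]))                    # 1
```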
## Notes
- LilyPond is not bundled. `music-bench render` fails with a clear message until `lilypond` is installed.
- `private_test` generation is supported, but its seed file and generated artifacts should stay out of source control.
