LLM Benchmarks


AoE 2 LLM Benchmarks

How good are LLMs at crafting AoE 2 build orders?

Purpose

This benchmark compares how well different LLM+harness setups can produce a competitive AoE2 build-order DSL file under similar constraints.

What It Tests

This test is technically a coding optimization problem, which I feel represents pretty well what we want our agentic coders to do. It is, however, kind of a weird test, weird enough that it's likely quite out of distribution.

  • Instruction following: honor the exact benchmark prompt and required scoring lines.
  • Needle in a haystack: the JSON is 1000 lines and some elements are critical.
  • Context rot: several models iterated through their full context. This compounds the above problem.
  • Out-of-distribution coding: the DSL syntax is strict and original.
  • Strategic thinking: it's not 100% trivial to achieve all 4 objectives and have a good build order.
  • World knowledge: some models show obvious existing knowledge of typical AoE2 build orders.
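
To make the needle-in-a-haystack point concrete, here is a minimal sketch of how one could locate the few critical entries in a large game-data JSON. The JSON layout and values below are made up for illustration; the real benchmark file may be structured quite differently.

```python
import json


def find_paths(node, needle, prefix=""):
    """Yield dotted paths whose key or scalar value mentions `needle` (case-insensitive)."""
    needle = needle.lower()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if needle in str(key).lower():
                yield path
            yield from find_paths(value, needle, path)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, f"{prefix}[{index}]")
    elif needle in str(node).lower():
        yield prefix


# Hypothetical miniature of the game data; names and numbers are placeholders.
sample = json.loads("""
{
  "techs": {"fletching": {"cost": {"food": 1, "gold": 2}}},
  "units": [{"name": "Archer"}]
}
""")

fletching_paths = list(find_paths(sample, "fletching"))
archer_paths = list(find_paths(sample, "archer"))
print(fletching_paths, archer_paths)
```

A model (or a human) that skims the file instead of searching it like this can easily miss one cost or prerequisite, which is enough to invalidate a build order.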

Setup

I give the model a simple prompt (see bottom of page), the JSON describing game data, a grammar help, and a mostly blank build order.
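
As a rough sketch of that setup, the four inputs can be concatenated into a single labelled context. The function name, section titles, and sample strings are assumptions for illustration only, not the harness actually used here.

```python
def build_context(prompt: str, game_data: str, grammar_help: str, blank_build: str) -> str:
    """Join the four benchmark inputs into one labelled text block."""
    sections = [
        ("PROMPT", prompt),
        ("GAME DATA (JSON)", game_data),
        ("GRAMMAR HELP", grammar_help),
        ("STARTING BUILD ORDER", blank_build),
    ]
    # Each section gets a header line so the model can tell the inputs apart.
    return "\n\n".join(f"=== {title} ===\n{body.strip()}" for title, body in sections)


# Placeholder inputs; in practice these would be read from the benchmark assets.
context = build_context(
    prompt="Write a build order.",
    game_data='{"units": []}',
    grammar_help="one instruction per line",
    blank_build="",
)
print(context)
```

In an agentic harness the model would more likely read these files itself, so this only illustrates what information it starts with.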

Disclaimer

For cost reasons, I haven't rerun these benchmarks many times. I don't think they're _that_ high variance to be honest, but it's worth keeping in mind.

These results are overall not amazing. However, all models that I could run did successfully write a DSL script.
To be fair, the LLMs have a slightly worse interface to work with, but I would have expected better.
What I find interesting is that there is a very clear skill divide on display, even though all these models are "good at code".
Opus showed obvious world knowledge - its first draft was very good conceptually, too good to just be random.
Codex 5.3 exhibited severe laziness with the default prompt, always stopping quite early. High variance, as a result. On the other hand, I could then easily guide it to a much better result, because it had context left.
Partly inspired by the brilliant minebench.ai

[Results table; columns: #, Model + Harness, Feudal, Castle, 10 Archers, Fletching, Grade, Cost, Trouble, Build]

Reproduce

This whole website is GPL3. Benchmark assets live here.

I ran every model with the same constraints and access, but pi has some issues with a few of them.

Prompt
