Show HN: LLM Skirmish – a benchmark where LLMs play RTS games, by writing code
llmskirmish.com

I wanted to create an LLM game benchmark that put this generation of frontier LLMs' top skill, coding, on full display.
Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." In Screeps, human players write JavaScript strategies that get executed in the game's environment.
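To make the "write a strategy, the game runs it" idea concrete, here is a minimal toy sketch. It is not the Screeps API and not how LLM Skirmish is implemented: `rusher`, `idle`, `runMatch`, the action strings, and all the numbers are hypothetical names made up for illustration. It only assumes that a strategy is a function called once per tick with a snapshot of the game, and that the engine resolves both players' actions together.

```javascript
// Toy sketch only: none of these names come from Screeps or LLM Skirmish.
// A "strategy" is a function called once per tick with a frozen snapshot
// of the game; it returns one action string for its own unit.

function rusher(view) {
  // Attack when adjacent to the enemy, otherwise close the distance.
  return view.distance <= 1 ? "attack" : "advance";
}

function idle(view) {
  // Never moves and never attacks.
  return "hold";
}

function runMatch(strategyA, strategyB, maxTicks = 50) {
  const state = { hpA: 10, hpB: 10, distance: 5 };
  for (let tick = 1; tick <= maxTicks; tick++) {
    // Both strategies see the same snapshot, so neither can react to the
    // other's move within the same tick.
    const view = Object.freeze({ ...state });
    const actionA = strategyA(view);
    const actionB = strategyB(view);

    // Movement: each advance closes the gap by one, down to a minimum of 1.
    if (actionA === "advance") state.distance = Math.max(1, state.distance - 1);
    if (actionB === "advance") state.distance = Math.max(1, state.distance - 1);

    // Attacks only land if the units were adjacent at the start of the tick.
    if (actionA === "attack" && view.distance <= 1) state.hpB -= 2;
    if (actionB === "attack" && view.distance <= 1) state.hpA -= 2;

    if (state.hpA <= 0 || state.hpB <= 0) {
      const winner = state.hpA > 0 ? "A" : state.hpB > 0 ? "B" : "draw";
      return { winner, ticks: tick };
    }
  }
  // Nobody was eliminated before the tick limit.
  return { winner: "draw", ticks: maxTicks };
}

console.log(runMatch(rusher, idle));
```

In this toy setup, swapping in a different strategy function is the only thing a player (human or LLM) changes; the loop that calls it stays fixed.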
The Screeps paradigm, writing code and having it execute in a real-time game environment, is well suited for an LLM benchmark. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.

Whoa, this is sick. Like adversarial chess training but inverted for model evaluation. The model has to be both correct and fast at code while managing tactics and strategy well. I wonder if it should extend to general-soldier models, like an agent swarm. Obv would kill tokens but would be super interesting.

Oof, gonna go sell my Google position. Such fascinating results and a cool way to design a benchmark.