AI Agent Benchmark for RuneScape


Task Design

Scoring: XP Rates as the goal

The original task was to gain as much XP as possible in a skill within a fixed time window, but we found this approach punished exploration: the winning strategies were often a simple grind with as little stopping as possible. Because we wanted to reward interesting strategies and exploration, we landed on measuring the maximum XP rate over any 15-second window. By focusing on XP rate, we reward agents that discover higher-level strategies beyond pure time-on-task. It was great seeing winning runs use many locations, tricks, and methods as they leveled up; models are incredible optimizers.
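The scoring rule above can be sketched as a sliding-window maximum over timestamped XP samples. This is a minimal illustration, not rs-sdk code; the `XpSample` shape and the assumption that samples arrive sorted by time are mine.

```typescript
// Hypothetical sketch of the scoring rule: given timestamped XP samples
// (sorted by time, in ms since run start), find the maximum XP gained
// within any 15-second window.

interface XpSample {
  timeMs: number;
  totalXp: number; // cumulative XP at this timestamp
}

function maxXpRate(samples: XpSample[], windowMs = 15_000): number {
  let best = 0;
  let start = 0; // left edge of the sliding window
  for (let end = 0; end < samples.length; end++) {
    // shrink the window until it spans at most windowMs
    while (samples[end].timeMs - samples[start].timeMs > windowMs) {
      start++;
    }
    const gained = samples[end].totalXp - samples[start].totalXp;
    best = Math.max(best, gained);
  }
  return best; // XP gained in the best 15-second window
}
```

Note that a sliding window rewards a short burst of clever play even if the rest of the run is spent reading docs or traveling, which is exactly the exploration-friendly property described above.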

Limitations

The biggest remaining issue with this task design is the long runtime. I experimented with shorter task durations, but they punished the models' tendency to front-load planning (spending many minutes reading docs), as well as the slow inference rates of some models. The complexity of the environment and the low sample size also led to noise and false negatives that hurt the numerical accuracy of comparisons between models.

There is room to design interesting micro-tasks with shorter runtimes that would make the benchmark easier to run and iterate on.

Harness Development

Growing an API

The rs-sdk TypeScript library was developed through cycles of automated error analysis: I would run a batch of agents on a task, categorize failures and missing features, and use those findings to inform harness improvements. This was a really interesting way to "grow" an API layer between the agents and the game server.
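The categorize-and-prioritize step of that loop can be sketched as a simple tally over run logs. The failure categories and log shape here are illustrative assumptions of mine, not rs-sdk types.

```typescript
// A minimal sketch of the error-analysis loop: tally failure categories
// across a batch of agent runs so the most common ones surface first.
// Category names and the RunLog shape are hypothetical.

type FailureCategory = "missing-api" | "bad-pathing" | "stale-state" | "other";

interface RunLog {
  agentId: string;
  failures: { category: FailureCategory; detail: string }[];
}

// Returns categories sorted by frequency, most common first; these point
// at the next harness features or fixes to build.
function prioritizeFixes(batch: RunLog[]): [FailureCategory, number][] {
  const counts = new Map<FailureCategory, number>();
  for (const run of batch) {
    for (const failure of run.failures) {
      counts.set(failure.category, (counts.get(failure.category) ?? 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```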

Future Work

Multi-Agent Collaboration

Can two bots outperform a single agent? For example, one agent gathering raw materials while another processes them, splitting the supply chain to achieve a higher combined XP rate. It would be fascinating to test coordination and communication on an optimization task. This might take the form of one LLM agent scripting multiple player characters, or separate LLMs communicating through in-game chat.
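The intuition for why splitting the supply chain could win can be sketched with back-of-the-envelope throughput arithmetic. The rates and times below are hypothetical placeholders, not measured game values.

```typescript
// Not rs-sdk code: a throughput sketch of solo vs. pipelined agents.
// xpPerItem is XP awarded per processed item; gatherSec and processSec
// are the (hypothetical) times to gather and process one item.

function soloRate(xpPerItem: number, gatherSec: number, processSec: number): number {
  // one agent performs both steps sequentially for each item
  return xpPerItem / (gatherSec + processSec);
}

function pairedRate(xpPerItem: number, gatherSec: number, processSec: number): number {
  // two specialized agents pipeline the steps, so combined throughput
  // is limited only by the slower stage
  return xpPerItem / Math.max(gatherSec, processSec);
}
```

Since the slower stage is never longer than both stages combined, the pipelined pair's combined rate is always at least the solo rate, with the biggest gains when the two stages take similar time.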

Knowledge Transfer

How well can an agent write a guide for another agent? Given one agent's experience training a skill, how well can it produce instructions that meaningfully improve a second agent's performance?

Contribution

We would love to see more people experimenting with RS-SDK! Join the Discord to get involved; many people there are running interesting small-scale experiments with harnesses and techniques.


Thank you

Thanks to Rob Haisfield, Sean Lee, Christopher Settles, Alex Duffy, and Erik Quintanilla for critical feedback and input, and to the LostCity and Harbor RL communities for providing a critical open-source ecosystem.