Task Design
Scoring: XP Rate as the Goal
The original task was to gain as much XP as possible in a skill within a fixed time window, but we found this approach punished exploration: the winning strategies were often a simple grind with as little stopping as possible. Because we wanted to reward interesting strategies and exploration, we landed on measuring the maximum XP rate over any 15-second window. By focusing on XP rate, we reward agents that discover higher-level strategies beyond pure time-on-task. It was great seeing winning runs use many locations, tricks, and methods as they level up - models are incredible optimizers.
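The metric above can be sketched as a sliding-window pass over cumulative XP readings. This is illustrative only: the `XpSample` shape and `maxXpRate` function are assumptions for the example, not part of the rs-sdk API.

```typescript
// Illustrative sketch of the scoring metric: the maximum XP gained in any
// 15-second window over a run. The sample type and function name are
// hypothetical, not part of rs-sdk.
interface XpSample {
  timestampMs: number; // when the reading was taken
  totalXp: number;     // cumulative XP for the tracked skill at that time
}

const WINDOW_MS = 15_000;

function maxXpRate(samples: XpSample[]): number {
  let best = 0;
  let start = 0; // left edge of the sliding window
  for (let end = 0; end < samples.length; end++) {
    // shrink the window until it spans at most 15 seconds
    while (samples[end].timestampMs - samples[start].timestampMs > WINDOW_MS) {
      start++;
    }
    // XP gained between the window's edges
    best = Math.max(best, samples[end].totalXp - samples[start].totalXp);
  }
  return best;
}
```

Because samples are time-ordered and cumulative, a two-pointer sweep like this finds the best window in a single linear pass.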
Limitations
The biggest remaining issue with this task design is the long runtime. I experimented with shorter task durations, but they punished the models' tendency to front-load planning (spending many minutes reading docs), as well as the slow inference rates of some models. The complexity of the environment and the low sample size also led to noise and false negatives that hurt the numerical accuracy of comparisons between models.
There is room to design interesting micro-tasks with shorter runtimes that would make the benchmark easier to run and iterate on.
Harness Development
Growing an API
The rs-sdk TypeScript library was developed through cycles of automated error analysis: I would run a batch of agents on a task, categorize failures and missing features, and use those findings to inform harness improvements. This was a really interesting way to "grow" an API layer between the agents and the game server.
Future Work
Multi-Agent Collaboration
Can two bots outperform a single agent? For example, one agent could gather raw materials while another processes them - splitting the supply chain to achieve a higher combined XP rate. It would be fascinating to test coordination and communication around an optimization task. This might take the form of one LLM agent scripting multiple player characters, or separate LLMs communicating through in-game chat.
Knowledge Transfer
How well can an agent write a guide for another agent? Given one agent's experience training a skill, how well can it produce instructions that meaningfully improve a second agent's performance?
Contribution
We would love to see more people experimenting with RS-SDK! Join the Discord to get involved - there are many people doing interesting small-scale experiments with harnesses and techniques.
Thank you
Thanks to Rob Haisfield, Sean Lee, Christopher Settles, Alex Duffy, and Erik Quintanilla for critical feedback and input, and to the LostCity and Harbor RL communities for providing a vital open-source ecosystem.