Model leaderboard with rounds, tool-call reliability, tokens, time, and cost
Columns: #, Model, Vendor, Round, tool-call reliability, In, Out, [s], [m$]

Round: Average final round reached across all runs (± std. dev.).

Tool-call reliability, in three response categories:
Responses with valid tool calls that can be executed in the current game state.
Responses with valid tool calls that cannot be executed in the current game state.
Responses without valid tool calls.

In: Average input tokens per tool call (± std. dev.).

Out: Average output tokens per tool call (± std. dev., including reasoning tokens).

[s]: Average time per tool call in seconds (± std. dev.).

[m$]: Average cost per tool call in milli-dollars (± std. dev.).
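
As a rough illustration of how a per-tool-call figure such as the cost column could be derived, the following Python sketch averages a list of per-call costs and converts dollars to milli-dollars (1 m$ = 0.001 $). The function names, the sample values, and the use of the sample standard deviation are assumptions made for this example. The leaderboard does not state which standard deviation estimator it uses.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample std. dev.) of per-tool-call values.

    Assumption: sample standard deviation (n - 1 denominator).
    A single value has no spread, so 0.0 is returned in that case.
    """
    avg = mean(values)
    spread = stdev(values) if len(values) > 1 else 0.0
    return avg, spread


def to_milli_dollars(dollars):
    """Convert a dollar amount to milli-dollars (1 m$ = 0.001 $)."""
    return dollars * 1000.0


# Hypothetical per-tool-call costs in dollars, not taken from the leaderboard.
costs_usd = [0.002, 0.004, 0.006]

avg_cost, std_cost = summarize([to_milli_dollars(c) for c in costs_usd])
print(f"{avg_cost:.1f} ± {std_cost:.1f} m$")
```

The same `summarize` helper would apply unchanged to input tokens, output tokens, or seconds per tool call.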