CATArena: Engineering-Level Tournament Evaluation Platform for LLM-Driven Code Agents
Website | Leaderboard | Watch Replays | Paper (arXiv)
Quick Overview
CATArena (Code Agent Tournament Arena) is an open-ended environment where LLMs write executable code agents that battle each other and then learn from one another.
Unlike static coding benchmarks, in CATArena agents are asked to:
- Write code for the task;
- Enter that code in a tournament;
- Learn from the tournament's competition logs, rankings, and rivals' code;
- Then rewrite the code for the next tournament.
Online Competition Demonstration
Latest results from SOTA agents' competitions are continuously updated on our Online Competition Website.
A demo competition of 5 SOTA code agents in Texas Hold'em.
Core Positioning
CATArena is an engineering-level tournament evaluation platform for Large Language Model (LLM)-driven code agents, built on an iterative competitive peer-learning framework. It currently includes four types of open, rankable board and card games and their variants (Gomoku, Texas Hold'em, Chess, and Bridge) and focuses on the systematic evaluation of two core capabilities: strategy coding and learning ability. More environments and tasks will be added in the future.
Supported Environments
We currently provide 4 core environments designed to test different cognitive capabilities:
| Game Environment | Characteristics | Location | Rules |
|---|---|---|---|
| Gomoku | 15×15 board, symmetric game, medium difficulty | CATArena/gomokugame/ | Win by connecting five stones; supports standard and variant versions |
| Texas Hold'em | Multi-player card game, easy difficulty, opening randomness | CATArena/holdemgame/ | Supports the classic version (52 cards) and a variant version (32 cards) |
| Chess | 8×8 board, symmetric game, hard difficulty | CATArena/chessgame/ | Standard chess rules; supports variant rule extensions |
| Bridge | Four-player card game, medium difficulty, opening randomness | CATArena/bridgegame/ | Standard bridge rules; supports open/closed room direction switching |
Note: We also support variants like Chess960 to test generalizability and prevent rote memorization.
Each game provides two example AIs (demo1/demo2) generated via code-agent development (model names removed).
Technical Architecture
Evaluation Process
- Initial Strategy Development (Round 1): Each agent autonomously codes a strategy based on the game environment code and the sample AI implementations, then enters the first round of competition.
- Iterative Strategy Optimization (Rounds 2 to N): Each agent receives all participants' code and detailed battle logs from the previous round, analyzes this history, and optimizes its own strategy.
- Multi-round Cycle: Repeating this cycle over multiple rounds evaluates each agent's learning and adaptation capabilities (a minimal sketch of the loop follows).
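To make the loop concrete, here is a minimal Python sketch of the cycle. It is illustrative only, not the shipped CATArena driver: the three callables are hypothetical hooks standing in for your agent call, the arena run, and the feedback export.

```python
# Minimal sketch of the iterative peer-learning loop (illustrative only;
# not the shipped CATArena driver). The three callables are hypothetical
# hooks: your agent call, the arena run, and the log/ranking export.

def evaluate(agents, develop_strategy, run_tournament, collect_feedback,
             num_rounds=4):
    """Run N rounds of: write code -> compete -> learn from peers' code/logs."""
    feedback = {name: None for name in agents}   # no history before round 1
    history = []

    for round_id in range(1, num_rounds + 1):
        # 1) Each agent (re)writes its strategy; from round 2 on, `feedback`
        #    contains all peers' code and the previous round's battle logs.
        strategies = {name: develop_strategy(agent, feedback[name])
                      for name, agent in agents.items()}

        # 2) All strategies compete in a tournament; keep logs and rankings.
        logs, rankings = run_tournament(strategies)
        history.append({"round": round_id, "rankings": rankings})

        # 3) Package every participant's code + logs as next-round input.
        feedback = {name: collect_feedback(name, strategies, logs, rankings)
                    for name in agents}

    return history
```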
Competition Format
- Symmetric Games: Full round-robin tournaments ensure sufficient strategy confrontation.
- Asymmetric Games: Grouped multi-agent battles with multi-round averaging reduce the impact of randomness.
- Repeated Validation: All competitions are repeated multiple times and the results are averaged for a robust evaluation (see the scheduling sketch below).
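As an illustration of these formats (not the arena's actual scheduler), the sketch below generates round-robin pairings for symmetric games and averages scores over repeated runs:

```python
# Illustrative scheduling sketch: round-robin pairings plus score averaging
# over repeated tournament runs. Function names are ours, not CATArena APIs.
from itertools import combinations, permutations
from statistics import mean

def round_robin_pairings(players, both_sides=True):
    """All head-to-head pairings; with both_sides, each pair plays both colors."""
    pairs = permutations(players, 2) if both_sides else combinations(players, 2)
    return list(pairs)

def averaged_scores(results_per_run):
    """Average each player's score over repeated runs of the same tournament."""
    players = results_per_run[0].keys()
    return {p: mean(run[p] for run in results_per_run) for p in players}

# Example: 4 players, scores from 3 repeated runs of the same tournament.
print(round_robin_pairings(["A", "B", "C", "D"], both_sides=False))
print(averaged_scores([
    {"A": 1.0, "B": 0.5, "C": 0.0, "D": 0.5},
    {"A": 0.5, "B": 1.0, "C": 0.5, "D": 0.0},
    {"A": 1.0, "B": 0.0, "C": 0.5, "D": 0.5},
]))
```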
Evaluation Metrics
1. Strategy Coding Ability
Measures an agent's basic ability to abstract game strategies into algorithms and implement them as executable code. It is quantified by the average score obtained in first-round battles against all other agents.
2. Learning Ability
Measures an agent's ability to improve its performance using historical information, including:
- Global Learning: learning and adaptation across multi-round competitions
- Targeted Learning: achieving performance improvements against specific opponents
- Self-improvement: improving one's own strategy across iterations (a computational sketch follows this list)
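A rough sketch of how these indicators can be computed from raw results is shown below; the exact aggregation used on the leaderboard may differ, so treat it as an approximation of the definitions above.

```python
# Approximate metric computation under the definitions above; not the
# official scoring script.
from statistics import mean

def strategy_coding_score(first_round_scores):
    """Average score against all other agents in round 1."""
    return mean(first_round_scores.values())

def average_ranking(rankings_per_round, agent):
    """Average rank across rounds (S.C./G.L.-style rankings; lower is better)."""
    return mean(r[agent] for r in rankings_per_round)

# Example: one agent's round-1 scores and its rank in each of 4 rounds.
print(strategy_coding_score({"vs_B": 0.75, "vs_C": 0.50, "vs_D": 1.00}))
print(average_ranking([{"A": 2}, {"A": 1}, {"A": 3}, {"A": 1}], "A"))
```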
Leaderboard
| Agent Group | Agent | Standard S.C. ↓ | Standard G.L. ↓ | Variant S.C. ↓ | Variant G.L. ↓ |
|---|---|---|---|---|---|
| Minimal | Claude-4-Sonnet | 1.25 | 2.50 | 1.75 | 2.75 |
| | DeepSeek-Chat | 5.75 | 2.75 | 4.25 | 2.75 |
| | Doubao-Seed | 3.75 | 4.75 | 3.75 | 4.50 |
| | Gemini-2.5-Pro | 3.25 | 3.75 | 3.25 | 2.75 |
| | GPT-5 | 3.75 | 3.50 | 3.00 | 3.75 |
| | Qwen3-Coder | 2.25 | 3.75 | 3.00 | 4.50 |
| Commercial | best ADK | 3.25 | 2.25 | 2.00 | 3.75 |
| | Claude-Code | 2.50 | 3.75 | 2.50 | 2.75 |
| | CodeX | 2.25 | 2.75 | 3.00 | 3.00 |
| | Gemini-CLI | 3.50 | 2.25 | 3.00 | 4.00 |
| | Qwen-Coder | 3.00 | 3.75 | 4.00 | 1.25 |
Legend:
S.C. = Strategy Coding average ranking; lower is better.
G.L. = Global Learning average ranking; lower is better.
For more details and results, see our paper.
Usage Guide
Quick Start
Each game environment has its own README documentation, covering:
- Environment installation and dependency configuration
- AI development guides and sample code
- Battle configuration and running methods
- Result analysis and report generation
Developing Custom AI
- Refer to ChatPrompt.py in each game directory to get the development prompts (a prompt-loading sketch follows this list)
- Use your code agent to generate a competing AI
- Configure the battle parameters and start the services
- Participate in multi-round iterative battles
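For the first step, a small helper like the one below can load a game's development prompt. Note that the module attribute name `PROMPT` is an assumption; check the actual ChatPrompt.py in each game directory for how the prompt is exposed, and feed the result to your own code agent.

```python
# Hypothetical helper: load the development prompt shipped with a game.
# The attribute name `PROMPT` is an assumption; inspect ChatPrompt.py for
# the real name before using this.
import importlib.util
from pathlib import Path

def load_prompt(game_dir: str) -> str:
    """Import <game_dir>/ChatPrompt.py and return its prompt text."""
    path = Path(game_dir) / "ChatPrompt.py"
    spec = importlib.util.spec_from_file_location("ChatPrompt", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, "PROMPT")  # assumed attribute name

if __name__ == "__main__":
    prompt = load_prompt("CATArena/gomokugame")
    # Hand `prompt` to your own code agent to generate a competing AI.
    print(prompt[:500])
```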
Evaluation Recommendations
- Generate code multiple times (at least 4 runs) and compare relative rankings
- Focus on models' relative rankings rather than absolute scores
- Make full use of historical battle logs for strategy optimization
Project Structure
    CATArena/
    ├── README.md       # This document
    ├── README-CN.md    # Chinese version
    ├── rawDoc          # Detailed technical documentation
    ├── gomokugame/     # Gomoku game environment
    ├── holdemgame/     # Texas Hold'em game environment
    ├── chessgame/      # Chess game environment
    └── bridgegame/     # Bridge game environment
Each game environment includes:
- Game server and API interfaces
- AI sample code and development tools
- Battle arena system
- Configuration files and logging system
Future Plans
- More new evaluation environments will be added
- Continuous optimization of evaluation indicators and stability
Core Evaluation Conclusions
CATArena effectively distinguishes different types of agent capabilities. Detailed evaluation results can be found in our paper.
Citation
    @misc{fu2025catarenaevaluationllmagents,
      title={CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions},
      author={Lingyue Fu and Xin Ding and Yaoming Zhu and Shao Zhang and Lin Qiu and Weiwen Liu and Weinan Zhang and Xuezhi Cao and Xunliang Cai and Jiaxin Ding and Yong Yu},
      year={2025},
      eprint={2510.26852},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.26852},
    }
License
This project is licensed under the MIT License; contributions and use by the open-source community are welcome.
Contact
For any questions, reach out via:
X (Twitter): @AGI_Evals
Team Email: agieval17@gmail.com
Authors' Email: Lingyue Fu (fulingyue [at] sjtu.edu.cn), Yaoming Zhu (zhuyaoming02 [at] meituan.com)
Built with ❤️ by the AGI-Eval Team, Meituan & SJTU
