
CATArena: Engineering-Level Tournament Evaluation Platform for LLM-Driven Code Agents

CATArena Logo

๐ŸŒ Website | ๐Ÿ† Leaderboard | ๐Ÿ“บ Watch Replays | ๐Ÿ“„ Paper (arXiv)

License: MIT Python 3.10+ Paper Twitter

โšก๏ธQuick Overview

CATArena (Code Agent Tournament Arena) is an open-ended environment where LLMs write executable code agents to battle each other and then learn from each other.

Unlike static coding benchmarks, CATArena asks agents to:

  1. Write code for the task;
  2. Compete with that code in a tournament;
  3. Learn from the tournament's competition logs, rankings, and rivals' code;
  4. Rewrite the code for the next tournament (sketched below).
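The sketch below walks through that four-step loop in plain Python. The `CodeAgent` class, the `run_tournament` helper, and the log format are hypothetical placeholders for illustration, not CATArena's actual interfaces.

```python
# Minimal sketch of the write-compete-learn-rewrite loop.
# CodeAgent, run_tournament, and the log format are hypothetical
# placeholders, not CATArena's actual API.
from dataclasses import dataclass, field


@dataclass
class CodeAgent:
    name: str
    strategy_code: str = ""                        # executable strategy source
    history: list = field(default_factory=list)    # logs, rankings, rivals' code

    def write_code(self, task_spec: str) -> None:
        """Round 1: draft a strategy from the game spec and sample AIs."""
        self.strategy_code = f"# strategy for {task_spec} by {self.name}"

    def rewrite_code(self) -> None:
        """Rounds 2..N: revise the strategy using accumulated history."""
        self.strategy_code += f"\n# revision after {len(self.history)} round(s)"


def run_tournament(agents: list[CodeAgent]) -> dict:
    """Placeholder: run all matches, return per-agent ranking and battle logs."""
    return {a.name: {"rank": i + 1, "logs": [], "rival_code": {}}
            for i, a in enumerate(agents)}


def play(agents: list[CodeAgent], task_spec: str, rounds: int = 3) -> dict:
    for a in agents:
        a.write_code(task_spec)                    # 1. write code for the task
    results = {}
    for _ in range(rounds):
        results = run_tournament(agents)           # 2. compete in a tournament
        for a in agents:
            a.history.append(results[a.name])      # 3. learn from logs, ranking, rivals' code
            a.rewrite_code()                       # 4. rewrite for the next tournament
    return results


play([CodeAgent("demo1"), CodeAgent("demo2")], "gomoku")
```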

Online Competition Demonstration

Latest results from SOTA agents' competitions are continuously updated on our Online Competition Website.

A demo competition of 5 SOTA code agents in Texas Hold'em.

🎯 Core Positioning

CATArena is an engineering-level tournament evaluation platform for Large Language Model (LLM)-driven code agents, built on an iterative competitive peer-learning framework. It includes four types of open, rankable board and card games and their variants: Gomoku, Texas Hold'em, Chess, and Bridge. The platform focuses on systematically evaluating two core capabilities: strategy coding and learning ability. More environments and tasks will be added in the future.

🎮 Supported Environments

We currently provide four core environments, each designed to test different cognitive capabilities:

| Game Environment | Core Capability Tested | Location | Rules |
| --- | --- | --- | --- |
| ⚫ Gomoku | 15×15 board, symmetric game, medium difficulty | CATArena/gomokugame/ | Win by connecting five stones; supports standard and variant versions |
| 🃏 Texas Hold'em | Multi-player card game, low difficulty, opening randomness | CATArena/holdemgame/ | Supports the classic version (52 cards) and a variant version (32 cards) |
| ♟️ Chess | 8×8 board, symmetric game, high difficulty | CATArena/chessgame/ | Standard chess rules; supports variant rule extensions |
| ♠️ Bridge | Four-player card game, medium difficulty, opening randomness | CATArena/bridgegame/ | Standard bridge rules; supports open/closed room direction switching |


Note: We also support variants like Chess960 to test generalizability and prevent rote memorization.
Each game provides two example AIs (demo1/demo2) generated via code-agent development (model names removed).

🔧 Technical Architecture

Evaluation Process

  1. Initial Strategy Development (Round 1): Agents autonomously code strategies based on the game environment code and sample AI implementations, then participate in the first round of competition
  2. Iterative Strategy Optimization (Rounds 2~N): Agents obtain all participant code and detailed battle logs from previous rounds, analyze the historical data, and optimize their own strategies
  3. Multi-round Cycle: Repeating this cycle over multiple rounds evaluates agents' learning and adaptation capabilities

Competition Format

  • Symmetric Games: Use full round-robin tournaments to ensure sufficient strategy confrontation
  • Asymmetric Games: Use grouped multi-agent battles with multi-round averaging to reduce the impact of randomness
  • Repeated Validation: All competitions are repeated multiple times, and results are averaged for robust evaluation (see the sketch below)
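For the symmetric case, a minimal sketch of a round-robin schedule with repeated, averaged matches follows; `play_match` and the scoring convention are illustrative stand-ins, not the platform's real match runner.

```python
# Sketch of a symmetric-game round robin with repeated, averaged matches.
# play_match is a stand-in for a real game-server call.
import itertools
import random
from collections import defaultdict


def play_match(agent_a: str, agent_b: str) -> tuple[float, float]:
    """Placeholder: return one score per side (win=1, loss=0, draw=0.5)."""
    return random.choice([(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)])


def round_robin(agents: list[str], repeats: int = 4) -> dict[str, float]:
    """Every pair meets `repeats` times; per-agent scores are averaged."""
    scores = defaultdict(list)
    for a, b in itertools.combinations(agents, 2):
        for _ in range(repeats):
            score_a, score_b = play_match(a, b)
            scores[a].append(score_a)
            scores[b].append(score_b)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


print(round_robin(["demo1", "demo2", "my_agent"]))
```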

📈 Evaluation Indicator System

1. Strategy Coding Ability

Measures the basic ability of agents to abstract game strategies into algorithms and implement them as executable code. Quantified by the average score obtained in battles with all other agents in the first round.
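As a concrete (made-up) example, the snippet below computes strategy coding scores from a first-round score matrix by averaging each agent's scores against all opponents:

```python
# Strategy coding score: average of an agent's first-round scores against
# every other agent. The score matrix below is made up for illustration.
round1_scores = {
    "agent_a": {"agent_b": 0.75, "agent_c": 0.50},
    "agent_b": {"agent_a": 0.25, "agent_c": 0.60},
    "agent_c": {"agent_a": 0.50, "agent_b": 0.40},
}

strategy_coding = {
    agent: sum(vs.values()) / len(vs) for agent, vs in round1_scores.items()
}
print(strategy_coding)  # {'agent_a': 0.625, 'agent_b': 0.425, 'agent_c': 0.45}
```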

2. Learning Ability

Measures agents' ability to improve their own performance using historical information, including:

  • Global Learning: Agents' learning and adaptation capabilities in multi-round competitions
  • Targeted Learning: Agents' ability to achieve performance improvements against specific opponents
  • Self-improvement: Models' ability to improve their own strategies during iteration

๐Ÿ† Leaderboard

Main Leaderboard of CATArena

| Agent Group | Agent | Standard S.C. ↓ | Standard G.L. ↓ | Variant S.C. ↓ | Variant G.L. ↓ |
| --- | --- | --- | --- | --- | --- |
| Minimal | Claude-4-Sonnet | 1.25 | 2.50 | 1.75 | 2.75 |
| Minimal | DeepSeek-Chat | 5.75 | 2.75 | 4.25 | 2.75 |
| Minimal | Doubao-Seed | 3.75 | 4.75 | 3.75 | 4.50 |
| Minimal | Gemini-2.5-Pro | 3.25 | 3.75 | 3.25 | 2.75 |
| Minimal | GPT-5 | 3.75 | 3.50 | 3.00 | 3.75 |
| Minimal | Qwen3-Coder | 2.25 | 3.75 | 3.00 | 4.50 |
| Commercial best | ADK | 3.25 | 2.25 | 2.00 | 3.75 |
| Commercial best | Claude-Code | 2.50 | 3.75 | 2.50 | 2.75 |
| Commercial best | CodeX | 2.25 | 2.75 | 3.00 | 3.00 |
| Commercial best | Gemini-CLI | 3.50 | 2.25 | 3.00 | 4.00 |
| Commercial best | Qwen-Coder | 3.00 | 3.75 | 4.00 | 1.25 |

Legend:
S.C. = Strategy Coding average ranking, lower is better.
G.L. = Global Learning average ranking, lower is better.

For more details and results, visit our paper.

๐Ÿ› ๏ธ Usage Guide

Quick Start

Each game environment has independent README documentation, including:

  • Environment installation and dependency configuration
  • AI development guides and sample code
  • Battle configuration and running methods
  • Result analysis and report generation

Developing Custom AI

  1. Refer to ChatPrompt.py in each game directory to get development prompts
  2. Use your code agent to generate a competing AI (see the sketch after this list)
  3. Configure battle parameters and start services
  4. Participate in multi-round iterative battles
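A hedged outline of that workflow is shown below. How ChatPrompt.py stores its prompt text and how your code agent is invoked are assumptions for illustration, not the repository's documented API; follow each game's README for the actual steps.

```python
# Hypothetical outline of the custom-AI workflow. How ChatPrompt.py stores its
# prompt and how your code agent is called are assumptions, not the repo's API.
from pathlib import Path


def load_prompt(game_dir: str) -> str:
    """Read the development prompt shipped with the game (assumed readable as plain text)."""
    return Path(game_dir, "ChatPrompt.py").read_text(encoding="utf-8")


def generate_with_your_agent(prompt: str) -> str:
    """Placeholder for your code agent (e.g. an LLM call) returning AI source code."""
    raise NotImplementedError("plug in your own code agent here")


def develop_ai(game_dir: str, out_file: str) -> None:
    prompt = load_prompt(game_dir)                  # 1. get the development prompt
    ai_source = generate_with_your_agent(prompt)    # 2. generate a competing AI
    Path(game_dir, out_file).write_text(ai_source, encoding="utf-8")
    # 3-4. configure battle parameters and run multi-round battles
    #      following the game's own README.


# Example (hypothetical paths): develop_ai("gomokugame", "my_agent.py")
```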

Evaluation Recommendations

  • Generate code multiple times (at least 4) and compare relative rankings, as sketched below
  • Focus on models' relative ranking values rather than absolute scores
  • Make full use of historical battle logs for strategy optimization
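The snippet below illustrates that recommendation with made-up scores: rank the agents within each of four independent runs, then compare them by mean rank rather than by raw score.

```python
# Compare agents by average relative ranking over >=4 independent code
# generations; the scores below are made up for illustration.
runs = [  # per-run scores for three agents (higher raw score is better)
    {"agent_a": 0.62, "agent_b": 0.55, "agent_c": 0.48},
    {"agent_a": 0.58, "agent_b": 0.61, "agent_c": 0.40},
    {"agent_a": 0.65, "agent_b": 0.52, "agent_c": 0.50},
    {"agent_a": 0.60, "agent_b": 0.57, "agent_c": 0.45},
]

ranks: dict[str, list[int]] = {name: [] for name in runs[0]}
for run in runs:
    ordered = sorted(run, key=run.get, reverse=True)   # rank 1 = best score
    for position, name in enumerate(ordered, start=1):
        ranks[name].append(position)

avg_rank = {name: sum(r) / len(r) for name, r in ranks.items()}
print(avg_rank)  # lower average rank = stronger relative performance
```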

📚 Project Structure

CATArena/
├── README.md                  # This document
├── README-CN.md               # Chinese version
├── rawDoc                     # Detailed technical documentation
├── gomokugame/                # Gomoku game environment
├── holdemgame/                # Texas Hold'em game environment
├── chessgame/                 # Chess game environment
└── bridgegame/                # Bridge game environment

Each game environment includes:

  • Game server and API interfaces
  • AI sample code and development tools
  • Battle arena system
  • Configuration files and logging system

🔮 Future Plans

  • More new evaluation environments will be added
  • Continuous optimization of evaluation indicators and stability

📊 Core Evaluation Conclusions

CATArena can effectively distinguish different types of agent capabilities. Detailed evaluation results can be found in our paper.

Citation

@misc{fu2025catarenaevaluationllmagents,
      title={CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions}, 
      author={Lingyue Fu and Xin Ding and Yaoming Zhu and Shao Zhang and Lin Qiu and Weiwen Liu and Weinan Zhang and Xuezhi Cao and Xunliang Cai and Jiaxin Ding and Yong Yu},
      year={2025},
      eprint={2510.26852},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.26852}, 
}

📄 License

This project is licensed under the MIT License. Contributions and use by the open-source community are welcome.

โ˜ Contact

For any questions, reach out via:

X (Twitter): @AGI_Evals

Team Email: agieval17@gmail.com

Authors' Email: Lingyue Fu (fulingyue [at] sjtu.edu.cn), Yaoming Zhu (zhuyaoming02 [at] meituan.com)

or open a GitHub issue

Built with ❤️ by the AGI-Eval Team, Meituan & SJTU