Chess LLM Benchmark
This benchmark evaluates LLM chess-playing ability by having models play games against calibrated engine anchors and other LLMs. Ratings are calculated using the Glicko-2 rating system, calibrated to approximate Lichess Classical ratings.
Results can be seen at https://chessbenchllm.onrender.com
The author can be contacted at dfj2106@columbia.edu
How It Works
Gameplay: LLMs receive the current position (FEN + ASCII board) and must return a single UCI move. An illegal move gets one retry with a warning; a second illegal move is an immediate forfeit.
Anchor Engines: Games are played against engines with known Lichess Classical ratings in order to anchor our rating pool to the Lichess Classical pool.
Rating Calculation: Glicko-2 ratings are calculated based on game outcomes. FIDE rating is estimated using ChessGoals.com FIDE conversion data.
More general methodology notes are on the website.
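To make the gameplay loop concrete, here is a minimal sketch of how a position might be presented to a model. It uses the python-chess library; the project's actual prompt template lives in llm/prompts.py and may be worded differently.

import chess

def build_move_prompt(board: chess.Board) -> str:
    """Illustrative prompt: FEN plus an ASCII board, asking for a single UCI move.

    The real template lives in llm/prompts.py; this is only a sketch of the idea.
    """
    side = "White" if board.turn == chess.WHITE else "Black"
    return (
        "You are playing chess.\n"
        f"FEN: {board.fen()}\n"
        f"Board:\n{board}\n"
        f"{side} to move. Reply with a single legal move in UCI notation (e.g. e2e4)."
    )

print(build_move_prompt(chess.Board()))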
Installation
pip install -r requirements.txt
Using anchor engines requires installing them individually (e.g. Maia, Eubos).
Usage
Set API Key
export OPENROUTER_API_KEY="your-key"
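For reference, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so a single move request boils down to something like the sketch below. The project's own client is llm/openrouter_client.py; the prompt content shown here is just a placeholder.

import os
import requests

# Minimal sketch of one move request via OpenRouter's OpenAI-compatible API.
# The real client (llm/openrouter_client.py) adds retries, cost tracking, etc.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "meta-llama/llama-4-maverick",
        "messages": [{"role": "user", "content": "FEN: <position>  Reply with one UCI move."}],
        "temperature": 0.0,
        "max_tokens": 10,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"].strip())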
Run Manual Games
# LLM vs Stockfish (default engine)
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --stockfish-skill 5

# LLM vs LLM
python cli.py manual --white-model meta-llama/llama-4-maverick --black-model deepseek/deepseek-chat-v3-0324

# Multiple games (alternates colors each game)
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --games 10

# Against different engine types
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type maia-1100
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type random
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --engine-type eubos

# With reasoning models (use max-tokens 0 for extended thinking)
python cli.py manual --white-model deepseek/deepseek-r1 --black-engine --white-reasoning-effort high --max-tokens 0

# Enable reasoning mode for hybrid models
python cli.py manual --white-model deepseek/deepseek-chat --black-engine --reasoning

# Don't save the game
python cli.py manual --white-model meta-llama/llama-4-maverick --black-engine --no-save
Manual command engine presets: stockfish, maia-1100, maia-1900, random, eubos
Note: eubos is a hardcoded preset. For custom UCI engines in benchmarks, use type: uci in config.
Run Full Benchmark
python cli.py run -c config/benchmark.yaml -v
View Leaderboard
python cli.py leaderboard --min-games 5
python cli.py leaderboard --sort legal   # Sort by legal move %
python cli.py leaderboard --sort cost    # Sort by $/game
Recalculate Ratings
Recalculate all ratings from stored game results (useful after playing manual games or changing anchor ratings):
python cli.py recalculate -c config/benchmark.yaml
Web Interface
Available at https://chessbenchllm.onrender.com
python web/app.py
# Open http://localhost:5000

Features:
- Leaderboard with Glicko-2 ratings, FIDE estimates, confidence intervals, legal move rates, $/game, and release dates
- Game library with filtering by player and pagination
- Interactive game viewer with move-by-move navigation
- Client-side Stockfish analysis (toggle-able eval bar + top engine lines)
- Timeline chart showing rating progression over time
- Cost vs Rating chart with efficiency frontier
- Methodology page explaining the rating system
- JSON API at /api/leaderboard, /api/games, /api/game/<id>
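A quick way to consume these endpoints programmatically (the exact JSON field names are whatever web/app.py returns):

import requests

BASE = "https://chessbenchllm.onrender.com"  # or http://localhost:5000 when self-hosting

# Each endpoint returns JSON; /api/game/<id> takes an id obtained from /api/games.
leaderboard = requests.get(f"{BASE}/api/leaderboard", timeout=30).json()
games = requests.get(f"{BASE}/api/games", timeout=30).json()
print(leaderboard)
print(games)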
Configuration
Edit config/benchmark.yaml to configure:
- LLM models to benchmark (via OpenRouter)
- Engine anchors (Stockfish, Maia, Random, or any UCI engine)
- Games per matchup and concurrency settings
Example:
benchmark:
  games_vs_anchor_per_color: 10
  games_vs_llm_per_color: 5
  max_concurrent: 4
  max_moves: 200
  rating_threshold: 600  # Only pair players within this rating difference

engines:
  - player_id: "random-bot"
    type: random
    rating: 400
  - player_id: "maia-1100"
    type: maia
    lc0_path: "/opt/homebrew/bin/lc0"
    weights_path: "maia-1100.pb.gz"
    rating: 1628
  - player_id: "eubos"
    type: uci  # Generic UCI engine
    path: "/path/to/engine"
    rating: 2200
    initial_time: 900  # Clock-based time control (seconds)
    increment: 10

llms:
  - player_id: "llama-4-maverick"
    model_name: "meta-llama/llama-4-maverick"
    temperature: 0.0
    max_tokens: 10
  - player_id: "deepseek-r1"
    model_name: "deepseek/deepseek-r1"
    reasoning_effort: "medium"  # minimal, low, medium, high
Engine types: stockfish, maia, random, uci (generic UCI engine)
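As a sketch of what the generic uci type involves, the snippet below drives an arbitrary UCI binary via python-chess. The repo's engines/uci_engine.py is the real wrapper and additionally handles clock-based time controls and configuration; the engine path here is a placeholder.

import chess
import chess.engine

# Minimal sketch of driving a generic UCI engine via python-chess.
board = chess.Board()
with chess.engine.SimpleEngine.popen_uci("/path/to/engine") as engine:
    result = engine.play(board, chess.engine.Limit(time=1.0))  # think for ~1 second
    print(result.move)  # chosen move in UCI form, e.g. e2e4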
Project Structure
├── cli.py # Main CLI entrypoint
├── config/
│ └── benchmark.yaml # Benchmark configuration
├── engines/ # Chess engine wrappers
│ ├── base_engine.py # Base engine class
│ ├── stockfish_engine.py
│ ├── maia_engine.py
│ ├── random_engine.py
│ └── uci_engine.py # Generic UCI engine wrapper
├── llm/ # LLM player clients
│ ├── base_llm.py # Base LLM player class
│ ├── openrouter_client.py
│ └── prompts.py # Chess prompt templates
├── game/ # Game execution
│ ├── game_runner.py # Core game loop
│ ├── match_scheduler.py # Parallel game execution
│ ├── models.py # Pydantic data models
│ ├── pgn_logger.py # PGN/result saving
│ └── stats_collector.py # Win/loss/draw stats
├── rating/ # Rating system
│ ├── glicko2.py # Glicko-2 implementation
│ ├── rating_store.py # Local JSON storage
│ ├── leaderboard.py # Leaderboard formatting
│ ├── fide_estimate.py # FIDE rating estimation
│ └── cost_calculator.py # API cost calculation
├── web/ # Web interface
│ ├── app.py # Flask application
│ ├── timeline_chart.py # Rating timeline visualization
│ ├── cost_chart.py # Cost vs rating visualization
│ ├── templates/ # HTML templates
│ └── static/ # CSS/JS assets
└── data/ # Output (gitignored)
    ├── games/ # PGN files
    ├── results/ # JSON game results
    ├── ratings.json # Current ratings
    ├── lichess_to_fide.json # FIDE conversion data
    └── model_publish_dates.json # Model release dates
Rating System
Uses Glicko-2 with:
- Rating (μ): Estimated skill level (starts at 1500)
- Rating Deviation (RD): Uncertainty (decreases with more games)
- Volatility (σ): Expected rating fluctuation
- FIDE Estimate: Approximate FIDE rating based on ChessGoals.com Lichess-to-FIDE conversion (valid for 1715-2500 range)
- Legal Move Rate: Percentage of moves that were legal on first attempt
Engine anchors have fixed ratings based on their approximate Elo and are never updated.
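For intuition about how rating and RD interact, the sketch below computes the Glicko-2 expected score of one player against another, which is the building block of each rating update. It is a standalone illustration of the published Glicko-2 formulas, not the project's rating/glicko2.py code, and the example numbers are made up.

import math

def expected_score(rating_a: float, rating_b: float, rd_b: float) -> float:
    """Glicko-2 expected score of player A against player B.

    Ratings/RDs on the displayed (1500-centred) scale are converted to the
    internal mu/phi scale, then the standard g() and E() formulas are applied.
    """
    mu_a = (rating_a - 1500) / 173.7178
    mu_b = (rating_b - 1500) / 173.7178
    phi_b = rd_b / 173.7178
    g = 1 / math.sqrt(1 + 3 * phi_b**2 / math.pi**2)
    return 1 / (1 + math.exp(-g * (mu_a - mu_b)))

# e.g. a 1700-rated LLM against a 1628-rated anchor with RD 60 scores about 0.60
print(round(expected_score(1700, 1628, 60), 2))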
Illegal Move Policy
- First illegal move: Warning sent, LLM gets one retry
- Second illegal move: Immediate forfeit (loss), following FIDE rules
The retry prompt tells the LLM which move was illegal but does not provide a list of legal moves.
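A minimal sketch of this policy might look like the following, where ask_model is a hypothetical stand-in for the LLM call; the real loop lives in game/game_runner.py.

import chess

def get_llm_move(board: chess.Board, ask_model):
    """Ask the model for a move, allowing one retry after an illegal attempt.

    ask_model(prompt) is a hypothetical stand-in for the real LLM call.
    Returns None on a second illegal move, which the caller scores as a forfeit.
    """
    prompt = f"FEN: {board.fen()}\nReply with one legal move in UCI notation."
    for _ in range(2):
        reply = ask_model(prompt).strip()
        try:
            move = chess.Move.from_uci(reply)
            if move in board.legal_moves:
                return move
        except ValueError:
            pass  # reply was not even parseable as UCI
        # First failure: warn and retry, naming the bad move but not listing legal ones.
        prompt = f"Your move '{reply}' is illegal in this position.\n" + prompt
    return None  # second illegal move: forfeit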