# OpenGameEval
OpenGameEval is an evaluation framework for testing LLMs on Roblox game development tasks. This repository contains open-sourced evaluation scripts and tools for running automated assessments in the Roblox Studio environment.
## LLM Leaderboard

The LLM Leaderboard summarizes benchmark results and progress for all evaluated Large Language Models in this repository. See [LLM_LEADERBOARD.md](LLM_LEADERBOARD.md).
## Prerequisites

### 1. Roblox Account
You'll need a Roblox account. If you don't have one, create a free account at roblox.com.
### 2. OpenCloud API Key
To interact with the OpenGameEval API, you need to create an OpenCloud API key:
- Navigate to Creator Hub and log in. Make sure you are viewing as a user, not a group.
- Go to All tools (or OpenCloud) > API Keys.
- Create a new key with:
  - Access Permissions: `studio-evaluations`
    - Operations: `create`
  - An expiration date (recommended: 90 days)
- Save and copy the generated key; it will be used as `<OPEN_GAME_EVAL_API_KEY>` in the following commands.
## Quick Start

### 1. Clone the Repository
```bash
git clone https://github.com/Roblox/open-game-eval.git
cd open-game-eval
```

### 2. Install uv (one-time setup)
The project uses uv for dependency management. Install it with one of the following:
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Or with pip
pip install uv
```
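You can confirm the installation succeeded by checking the version:

```bash
uv --version
```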
### 3. Run Your Evaluation

**Important**: You must provide your own LLM credentials (`--llm-name` and `--llm-api-key`) to run evaluations.
You may save your API keys in a file named `.env`. See `.env.example` for a sample.
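For reference, a minimal `.env` sketch using the same variable names as the commands below (defer to `.env.example` for the authoritative format):

```bash
# Hypothetical .env contents; variable names mirror the export commands and --api-key/--llm-api-key options below
OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
LLM_API_KEY=<your-llm-api-key>
```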
```bash
# Set env vars
export OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>

# Pass in the OpenGameEval API key and LLM API key (required)
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
  --api-key $OPEN_GAME_EVAL_API_KEY \
  --llm-name "claude" \
  --llm-api-key $ANTHROPIC_API_KEY
```
The script should report the status as "Submitted" along with a URL where you can check the eval's progress while logged in to the Roblox account that owns the API key.
```
Evals/001_make_cars_faster.lua : Submitted - https://apis.roblox.com/open-eval-api/v1/eval-records/b7647585-5e1f-46b5-a8be-797539b65cc5
```

An eval commonly takes 3-4 minutes to run and gather results. The script polls for the result every 10 seconds and prints a status update every 30 seconds. Once completed, it reports whether the eval run succeeded. The default timeout is 10 minutes.
```
Evals/001_make_cars_faster.lua : Success
Success rate: 100.00% (1/1)
```

## Understanding Eval Result
After an eval completes, a result object is returned as part of the HTTP response. It is accessible through `https://apis.roblox.com/open-eval-api/v1/eval-records/{jobId}`.

The eval is considered a pass only if all checks pass.
"results": [
{
"mode": "[EDIT]",
"result": {
"passes": 1,
"fails": 0,
"checks": 1,
"warning": "",
"error": "",
"interruptions": []
}
}
],
### Eval result fields

- `passes`: Number of checks passed.
- `fails`: Number of checks failed.
- `checks`: Total number of checks; equals `passes + fails`.
- `warning`: Warnings received when running the eval.
- `error`: Errors received when running the eval.
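If you prefer to inspect these fields from the command line, here is a quick sketch using `curl` and `jq` (the field layout is assumed from the sample response above; `jq` is not bundled with this project):

```bash
# Fetch an eval record and print per-mode check counts (response shape assumed from the sample above)
curl -s "https://apis.roblox.com/open-eval-api/v1/eval-records/<job_id>" \
  --header "x-api-key: $OPEN_GAME_EVAL_API_KEY" \
  | jq '.results[].result | {passes, fails, checks}'
```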
## More Usage

### Running Multiple Evaluations
```bash
# Set env vars
export OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>

# Run all evaluations
uv run invoke_eval.py --files "Evals/*.lua" --api-key $OPEN_GAME_EVAL_API_KEY --llm-name "claude" --llm-api-key $ANTHROPIC_API_KEY

# Run a specific pattern
uv run invoke_eval.py --files "Evals/0*_*.lua" --api-key $OPEN_GAME_EVAL_API_KEY --llm-name "claude" --llm-api-key $ANTHROPIC_API_KEY

# Run with a concurrency limit
uv run invoke_eval.py --files "Evals/*.lua" --max-concurrent 5 --api-key $OPEN_GAME_EVAL_API_KEY --llm-name "claude" --llm-api-key $ANTHROPIC_API_KEY
```
### Using Custom LLM Models

Please make sure the LLM API key you pass is the correct key for the corresponding model provider.
```bash
# Set env vars
export OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
export GEMINI_API_KEY=<your-gemini-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
export OPENAI_API_KEY=<your-openai-api-key>

# With Gemini
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
  --api-key $OPEN_GAME_EVAL_API_KEY \
  --llm-name "gemini" \
  --llm-model-version "gemini-2.5-pro" \
  --llm-api-key $GEMINI_API_KEY

# With Claude
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
  --api-key $OPEN_GAME_EVAL_API_KEY \
  --llm-name "claude" \
  --llm-model-version "claude-sonnet-4-5-20250929" \
  --llm-api-key $ANTHROPIC_API_KEY

# With OpenAI
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
  --api-key $OPEN_GAME_EVAL_API_KEY \
  --llm-name "openai" \
  --llm-model-version "gpt-5" \
  --llm-api-key $OPENAI_API_KEY
```
### Command Line Options

```text
uv run invoke_eval.py [OPTIONS]

Required Options:
  --files TEXT [TEXT ...]    Lua files to evaluate (supports wildcards)
  --api-key TEXT             Open Cloud API key with studio-evaluation scope
                             (or set OPEN_GAME_EVAL_API_KEY env var)

Required if running evals through an LLM (not using reference mode):
  --llm-name TEXT            Name of provider: claude | gemini | openai (REQUIRED)
  --llm-api-key TEXT         LLM API key (REQUIRED, or set LLM_API_KEY env var)

Optional:
  --llm-model-version TEXT   LLM model version, e.g. claude-4-sonnet-20250514
  --llm-url TEXT             LLM endpoint URL. Not yet supported; please put a placeholder string here.
  --max-concurrent INTEGER   Maximum concurrent evaluations
  --use-reference-mode       Use reference mode for evaluation. This skips the LLM and uses
                             reference code for debugging eval contributions.
  --verbose-headers          Output HTTP request and response headers for debugging
```
Note: `--llm-name` and `--llm-api-key` are required to ensure evaluations use your own LLM API key. The only exception is `--use-reference-mode`, which doesn't call an LLM.
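For example, when debugging an eval contribution you can skip the LLM entirely; a sketch based on the options above (note that the reference functions in the bundled evals are intentionally left blank):

```bash
# Reference mode: runs the eval's reference implementation instead of calling an LLM,
# so --llm-name and --llm-api-key are not needed
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
  --api-key $OPEN_GAME_EVAL_API_KEY \
  --use-reference-mode
```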
Available model versions:

- For Gemini models (provider name: `gemini`)
  - gemini-2.5-pro
- For Claude models (provider name: `claude`)
  - claude-4-sonnet-20250514
  - claude-sonnet-4-5-20250929
  - claude-haiku-4-5-20251001
- For OpenAI models (provider name: `openai`)
  - gpt-5
  - gpt-5-mini
## API Rate Limit

To ensure the stability of the public API, we implement rate limiting. Exceeding these limits will result in a 429 Too Many Requests status code.
### 1. Eval job creation

Endpoint: `POST /open-eval-api/v1/eval`
| Limit Type | Rate | Time Window |
|---|---|---|
| Per API Key | 50 requests | Per hour |
| Per API Key | 100 requests | Per day |
| Per IP Address | 100 requests | Per day |
### 2. Polling job status

Endpoint: `GET /open-eval-api/v1/eval-records/{jobId}`
| Limit Type | Rate | Time Window |
|---|---|---|
| Per API Key | 60 requests | Per minute |
| Per IP Address | 60 requests | Per minute |
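For scripted polling outside of `invoke_eval.py`, here is a sketch that stays well under the per-minute limit by checking every 10 seconds; the `status` field name is an assumption about the response shape, and the bundled script already handles polling for you:

```bash
# Hypothetical polling loop; invoke_eval.py already polls every 10 seconds on your behalf
JOB_ID="<your-job-id>"
for _ in $(seq 1 60); do              # ~10 minutes max, 6 requests per minute
  record=$(curl -s "https://apis.roblox.com/open-eval-api/v1/eval-records/$JOB_ID" \
    --header "x-api-key: $OPEN_GAME_EVAL_API_KEY")
  status=$(echo "$record" | jq -r '.status')   # field name assumed; adjust to the actual response
  echo "status: $status"
  case "$status" in
    COMPLETED|FAILED) break ;;
  esac
  sleep 10
done
```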
## Troubleshooting

### Common Issues
- LLM Name/API Key Required: You must provide `--llm-name` and `--llm-api-key` (or set `LLM_API_KEY` in `.env`). You will use your own LLM credentials for evaluations.
- API Key Not Found: Ensure your Open Game Eval API key is set in the `.env` file or passed via `--api-key`. See `.env.example` as an example.
- Permission Denied: Verify your API key has the proper scope (`studio-evaluation:create`).
- Timeout Errors: Evaluations have a 10-minute timeout.
- File Not Found: Check file paths and ensure evaluation files exist.
- SSL certificate verify failed: Find `Install Certificates.command` in Finder and execute it. (See details and other solutions)
- No output from Lua: If the eval failed with the error `Error occurred, no output from Lua`, it is caused by incorrect LLM info being passed in. Please double-check that `llm-api-key` is correct and `llm-model-version` is one of the available versions listed above.
## API Reference

### Base URL

```
https://apis.roblox.com/open-eval-api/v1
```
### Endpoints

#### Submit Evaluation with Custom LLM Configuration
```bash
curl -X POST 'https://apis.roblox.com/open-eval-api/v1/eval' \
  --header 'Content-Type: application/json' \
  --header "x-api-key: $OPEN_GAME_EVAL_API_KEY" \
  --data "$(jq -n --rawfile script Evals/001_make_cars_faster.lua '{
    name: "make_cars_faster",
    description: "Evaluation on making cars faster",
    input_script: $script,
    custom_llm_info: {
      name: "provider-name",            # provider only: claude | gemini | openai
      api_key: "your-provider-api-key",
      model_version: "model-version",   # see the available model versions above
      url: "dummy_url_not_effective"
    }
  }')"
```
#### Check Status

```bash
curl 'https://apis.roblox.com/open-eval-api/v1/eval-records/{job_id}' \
  --header "x-api-key: $OPEN_GAME_EVAL_API_KEY"
```
### Job Status Values

- `QUEUED`: Job is waiting to be processed
- `PENDING`: Job is being processed
- `COMPLETED`: Job finished successfully
- `FAILED`: Job failed
## Evaluation Structure
Each evaluation file follows this structure:
```lua
local eval: BaseEval = {
	scenario_name = "001_make_cars_faster", -- Name of the eval
	prompt = {
		{
			{
				role = "user",
				content = "Make the cars of this game 2x faster", -- User prompt
			},
		},
	},
	place = "racing.rbxl", -- Name of placefile used. Currently only supports Roblox templates.
}

-- Setup necessary changes to the placefile before evaluation
eval.setup = function()
	-- Create necessary set up to placefile, including selection
end

-- Reference function (optional, used when running evals with use-reference-mode)
eval.reference = function()
	-- Expected behavior implementation. They are intentionally left blank in this set for the purpose of evaluation.
end

-- Validation functions
eval.check_scene = function()
	-- Checks for edit mode
end

eval.check_game = function()
	-- Checks for play mode
end

return eval
```
## Contributing
This repository contains open-source evaluation scripts. To contribute:
- Fork the repository
- Create evaluation scripts following the established format
- Test your evaluations thoroughly
- Submit a pull request with clear documentation
## License
This project is part of Roblox's open-source initiative. Please refer to the repository's license file for details.
## Support
- Contact the Roblox team for API access and permissions