cua-bench is a collection of desktop and mobile tasks, paired with a harness for evaluation and training, that helps agent makers quantify their agents' computer-use mastery.
Exactly what the cua community needed. With this API-based interface we can scale up a large number of diverse tasks for RL and data distillation, with the help of Codex/Claude Code and other pipelines! Congrats @trycua
view agent performance
Task resolution success rate for top agents and models on cua-bench2.0
Coming Soon
The leaderboard is being prepared. Check back soon!
view cua-bench task examples