AI Agent Benchmark for Real-World Professional Workflows

Agents' Last Exam

Challenge and measure AI agents on economically valuable, real-world tasks.

Agents' Last Exam is building the largest-scale, broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. Led by Berkeley RDI and 300+ industry experts, it now spans all 55 targeted sub-industries, covering most major fields of professional work performed on a computer. So far, 1,500+ tasks have been collected toward a 5,000-task target. The aim is to keep scores objective, comparable, and meaningful across domains.