PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

7 points by shahules 7 days ago · 1 comment

shahules (OP) 7 days ago

Most current web agent benchmarks focus on single-tab tasks (e.g., "go to Gmail and star this email"). We found that frontier models that score highly on those tasks (such as those in WebArena) often fall apart when they have to coordinate context across two or more applications. We built a simulated environment with scenarios and deterministic verifiers to see why.
