PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

vibrantlabs.com

7 points by shahules 7 days ago · 1 comment

shahules (OP) 7 days ago

Most current web agent benchmarks focus on single-tab tasks (e.g., 'go to Gmail and star this email'). We found that frontier models that score highly on those benchmarks (such as WebArena) often fall apart when they have to coordinate context across two or more applications. We built a simulated environment with scenarios and deterministic verifiers to see why.
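
For readers unfamiliar with the term, a deterministic verifier can be pictured as a plain function over the final state of the simulated apps: it either passes or fails, with no model-based judging involved. The sketch below is only an illustration under assumptions, not the PA Bench implementation; `SimulatedState`, its fields, and `verify_cross_app_task` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SimulatedState:
    """Hypothetical end state of two simulated apps after an agent run."""
    starred_email_ids: Set[str] = field(default_factory=set)
    calendar_event_titles: Set[str] = field(default_factory=set)


def verify_cross_app_task(state: SimulatedState, email_id: str, event_title: str) -> bool:
    """Pass only if BOTH apps ended up in the expected state.

    The check reads the final state alone, so the same state always
    yields the same verdict.
    """
    email_ok = email_id in state.starred_email_ids
    calendar_ok = event_title in state.calendar_event_titles
    return email_ok and calendar_ok
```

Under this sketch, an agent that stars the right email but never creates the matching calendar event fails the task, which is the kind of cross-application miss a single-tab check would not catch.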
