BrowseComp: The Benchmark That Tests What AI Agents Can Find

oss.vstorm.co

1 point by kacper-vstorm 2 months ago · 2 comments

gatreddi 2 months ago

I work in compliance, and we see this daily. "Do you have an incident response plan?" is trivially easy to verify. But actually finding and assembling that evidence across AWS, Google Docs, Jira, and Slack? That's the hard part nobody benchmarks for.

Curious if BrowseComp accounts for domain-specific retrieval or if it's mostly general web search.

kacper-vstorm (OP) 2 months ago

TL;DR

    BrowseComp is a web browsing benchmark, not a knowledge or reasoning test. It evaluates whether AI agents can navigate the open web to find specific, obscure information.
    Questions are “inverted” - authors start with a fact and work backwards to create a question that’s easy to verify but extremely hard to solve through search.
    Brute-force search doesn’t work. The search space is deliberately massive - thousands of papers, matches, events - making systematic enumeration impractical.
    Grading uses an LLM judge with a confidence score, creating an interesting meta-layer where one model evaluates another’s certainty.
    This benchmark reveals the gap between “can answer questions” and “can do research” - the exact capability that separates chatbots from useful AI agents.
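The LLM-judge grading described above can be sketched as a small scoring loop. This is a hypothetical illustration, not BrowseComp's actual grader: `judge_answer` is a stand-in for a real LLM call, and the confidence threshold and field names are assumptions.

```python
def judge_answer(question: str, reference: str, candidate: str) -> dict:
    """Stand-in judge. A real system would prompt an LLM to compare the
    candidate answer against the reference and return a verdict plus a
    self-reported confidence score (0-100). Here we fake both."""
    correct = candidate.strip().lower() == reference.strip().lower()
    confidence = 95 if correct else 20  # placeholder confidence values
    return {"correct": correct, "confidence": confidence}

def grade(dataset: list[dict], answers: list[str], min_confidence: int = 50) -> float:
    """Count an item as solved only when the judge says it is correct AND
    the judge's confidence clears a threshold -- the 'meta-layer' where
    one model evaluates another's certainty."""
    solved = 0
    for item, answer in zip(dataset, answers):
        verdict = judge_answer(item["question"], item["reference"], answer)
        if verdict["correct"] and verdict["confidence"] >= min_confidence:
            solved += 1
    return solved / len(dataset)

# Toy run with two made-up items: one answer matches, one does not.
dataset = [
    {"question": "Which paper introduced X?", "reference": "Attention Is All You Need"},
    {"question": "Which match decided Y?", "reference": "Game 6, 1998"},
]
answers = ["Attention Is All You Need", "Game 7, 2016"]
print(grade(dataset, answers))  # 0.5
```

Because verification is just string (or LLM) comparison against a single known fact, grading stays cheap even though solving the question requires extensive browsing.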
