Show HN: Open Operator Evals – real-world benchmarks for LLM web agents

github.com

3 points by monoid73 6 months ago · 1 comment · 1 min read

We’ve open-sourced a benchmark for LLM-driven web agent setups.

It evaluates real-world tasks, like logging in, scraping dashboards, and submitting forms, using structured criteria: success rate, latency, and task reliability.
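For a rough sense of how those three criteria might be aggregated from raw run records, here is a minimal Python sketch. It is not taken from the repo: the `Run` record, the `summarize` helper, and the definition of reliability used here (the share of tasks that succeed on every attempt) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One attempt at one task (hypothetical record shape)."""
    task: str          # illustrative task id, e.g. "login"
    success: bool      # did the agent complete the task?
    latency_s: float   # wall-clock time for the attempt, in seconds


def summarize(runs):
    """Aggregate success rate, mean latency, and a simple reliability score."""
    # Group attempts by task so reliability can be judged per task.
    by_task = {}
    for run in runs:
        by_task.setdefault(run.task, []).append(run)

    success_rate = mean(1.0 if r.success else 0.0 for r in runs)
    avg_latency_s = mean(r.latency_s for r in runs)
    # Assumption: a task counts as reliable only if every attempt succeeded.
    reliability = mean(
        1.0 if all(r.success for r in attempts) else 0.0
        for attempts in by_task.values()
    )
    return {
        "success_rate": success_rate,
        "avg_latency_s": avg_latency_s,
        "reliability": reliability,
    }


if __name__ == "__main__":
    # Made-up example data, not benchmark results.
    example = [
        Run("login", True, 2.0),
        Run("login", True, 4.0),
        Run("form", True, 3.0),
        Run("form", False, 7.0),
    ]
    print(summarize(example))
```

Keeping reliability separate from success rate is a deliberate choice in this sketch: an agent that passes a task three times out of four looks decent on success rate but is still flaky on that task.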

Everything is fully reproducible, with all outputs, logs, and evaluation data available.

https://github.com/nottelabs/open-operator-evals

Feedback, critiques, and contributions are welcome :)

pancsta 6 months ago

Nice video logs. Challenge accepted :)
