"A benchmark for catching when code doesn't do what its documentation claims"

github.com

3 points by o2zer0cool 21 days ago · 2 comments

westurner 21 days ago

Suggestions: would it be more maintainable to do the following?

Rewrite this with pytest-evals.

Write pytest tests with pytest.mark.parametrize, fixtures, and mocks. Push to >90% branch coverage with pytest-cov.
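For illustration, a minimal sketch of what that second suggestion could look like. `check_claim` and `fake_runner` are hypothetical names invented for this example, not anything from the linked repo; the mock stands in for actually executing the code under test.

```python
from unittest.mock import Mock

import pytest


def check_claim(doc_claim: str, run_code) -> bool:
    """Hypothetical checker: True when the observed result matches the documented claim."""
    return str(run_code()) == doc_claim


@pytest.fixture
def fake_runner():
    # Mock replaces real code execution; it always "returns" 4.
    return Mock(return_value=4)


@pytest.mark.parametrize(
    ("doc_claim", "expected"),
    [
        ("4", True),   # documentation agrees with behavior
        ("5", False),  # documentation disagrees with behavior
    ],
)
def test_check_claim(fake_runner, doc_claim, expected):
    assert check_claim(doc_claim, fake_runner) is expected
    # The code under test should be run exactly once per check.
    fake_runner.assert_called_once_with()
```

With pytest-cov installed, something like `pytest --cov --cov-branch --cov-fail-under=90` would then enforce the branch coverage threshold.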


I don't think any of these benchmarks do model-output evals for docs yet:

Mcpbr > Supported Benchmarks: https://github.com/supermodeltools/mcpbr#supported-benchmark...


On subjectivity and language, there was also this the other day: https://github.com/mozilla/firefox-devtools-mcp/pull/90#issu... :

> how to optimize an AGENTS.md:

> [agentevals, foundry-toolkit, ]
