bisonbear
- Karma: 37
- Created: 7 months ago
About
Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@benr.build
Recent Submissions
- 1. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh)
- 2. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repos (stet.sh)
- 3. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh)
- 4. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh)
- 5. Agents.md is the highest-leverage code you're not testing (stet.sh)
- 6. Your AI coding benchmark is hiding a 2x quality gap (stet.sh)
- 7. Things I Learned at the Claude Code NYC Meetup (benr.build)
- 8. Claude vs. Codex in the Messy Middle (benr.build)