bisonbear

Karma: 37
Created: 7 months ago

About

Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. Site: http://stet.sh · Email: ben@benr.build

Recent Submissions

1. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh) 2 points · 10 hours ago · 0 comments
2. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repos (stet.sh) 4 points · 7 days ago · 0 comments
3. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh) 2 points · 21 days ago · 0 comments
4. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh) 1 point · 23 days ago · 0 comments
5. Agents.md is the highest-leverage code you're not testing (stet.sh) 1 point · 28 days ago · 0 comments
6. Your AI coding benchmark is hiding a 2x quality gap (stet.sh) 3 points · 1 month ago · 0 comments
7. Things I Learned at the Claude Code NYC Meetup (benr.build) 2 points · 3 months ago · 0 comments
8. Claude vs. Codex in the Messy Middle (benr.build) 1 point · 4 months ago · 0 comments