bisonbear

Karma: 37
Created: 7 months ago

About

Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@benr.build

Recent Submissions

  1. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh)
  2. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh)
  3. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh)
  4. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh)
  5. Agents.md is the highest-leverage code you're not testing (stet.sh)
  6. Your AI coding benchmark is hiding a 2x quality gap (stet.sh)
  7. Things I Learned at the Claude Code NYC Meetup (benr.build)
  8. Claude vs. Codex in the Messy Middle (benr.build)
