bisonbear

Karma: 43
Created: 9 months ago

About

Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh

Recent Submissions

  1. I evaluated GLM 5.2 against the frontier on tasks from real repos (stet.sh)
  2. I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos (stet.sh)
  3. I used autoresearch to improve my AGENTS.md, measured against real tasks (stet.sh)
  4. A brief investigation into the GPT-5.5 regression claims (stet.sh)
  5. The Opus 4.7 reasoning curve - Medium is the best default? (stet.sh)
  6. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh)
  7. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh)
  8. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh)
  9. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh)
  10. Agents.md is the highest-leverage code you're not testing (stet.sh)
