bisonbear
- Karma: 43
- Created: 9 months ago
About
Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh
Recent Submissions
1. I evaluated GLM 5.2 against the frontier on tasks from real repos (stet.sh)
2. I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos (stet.sh)
3. I used autoresearch to improve my AGENTS.md, measured against real tasks (stet.sh)
4. A brief investigation into the GPT-5.5 regression claims (stet.sh)
5. The Opus 4.7 reasoning curve - Medium is the best default? (stet.sh)
6. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh)
7. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh)
8. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh)
9. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh)
10. Agents.md is the highest-leverage code you're not testing (stet.sh)