Is It Nerfed? - Continuous LLMs Evaluation

1 min read Original article ↗

Metrics Check

Continuously runs coding tasks against LLMs to track their performance over time

Claude Code (Sonnet 4.5) Failure Rate

Lower is better

Shows how well Claude Code performs on coding tasks over time.Lower = better performance, higher = worse performance.Runs on Sonnet 4.5 model only at this time.

Claude Code (Sonnet 4) Failure Rate

Lower is better

Shows how well Claude Code performs on coding tasks over time.Lower = better performance, higher = worse performance.Runs on Sonnet 4 model only at this time.

GPT-4.1 Failure Rate

Lower is better

Shows how well GPT-4.1 performs on coding tasks over time.Lower = better performance, higher = worse performance.