Introducing: Code Canary


If you’ve got any kind of engineering background, you’re probably spending most of your waking hours enmeshed in a possibly unhealthy relationship with a coding agent like Claude Code. I know I am. This also means you know models like Claude have good days and bad days. Sometimes these issues stem from service interruptions: APIs get overloaded, or some backend problem keeps the model from working at all.

Other times it just seems like the model is being dumb, and it’s hard to discern whether it’s the task you’ve given it or something deeper at play¹.

The most maddening part is that it’s very hard to tell whether the difficulty is you, or your hastily vibe-coded codebase, or something actually going wrong on the provider’s end. Without data, it’s impossible to know.

Anthropic has this data, but we don’t.

You’ve probably seen Claude Code ask you for feedback on how it’s doing:

The problem with this survey is that the data gets sent to Anthropic and we never get to see it aggregated, so we have no idea if Claude is actually having a bad day or if it’s just our perception.

This simple survey gave me an idea: what if we collected the same feedback ourselves, with a decentralized survey tool?

Enter Code Canary, my humble attempt at creating a distributed data collection platform for analyzing the quality of coding agents in real time.

Code Canary is a lightweight, open feedback system that lets developers rate their AI coding sessions and publishes the results as a public, continuously updated comparison dashboard.

Here’s how it works:

  1. You install a hook. For Claude Code, it’s a TaskCompleted hook — a single shell command that fires when your coding session ends. It takes about 30 seconds to set up.
  2. You rate your session. After each session, a small prompt asks: Did the agent complete the task? How was the code quality? How many corrections did you have to make? A quick 10-second interaction.
  3. Your rating is anonymized and aggregated. No code leaves your machine. No prompts are shared. Just structured metadata: which tool, what type of task, what language, and your rating.
  4. The dashboard updates. At codecanary.net, the aggregated data powers a public leaderboard that anyone can explore — sliced by language, task type, codebase size, and more.

More graphs coming soon!
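To make step 3 concrete, the hook can be as small as a shell script that assembles the structured metadata and submits it. This is a minimal sketch: the field names, example values, and submission endpoint are my illustrative assumptions, not Code Canary’s actual wire format.

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a Code Canary rating hook.
# Field names and the endpoint below are illustrative assumptions.

TOOL="claude-code"     # which agent ran the session
TASK_TYPE="refactor"   # what kind of task it was
LANGUAGE="python"      # language of the codebase
RATING=4               # 1-5, from the quick post-session prompt
CORRECTIONS=2          # manual fixes the developer had to make

# Only structured metadata leaves the machine: no code, no prompts.
PAYLOAD=$(printf '{"tool":"%s","task_type":"%s","language":"%s","rating":%d,"corrections":%d}' \
  "$TOOL" "$TASK_TYPE" "$LANGUAGE" "$RATING" "$CORRECTIONS")

echo "$PAYLOAD"
# Submission would be a single POST (endpoint is hypothetical):
# curl -sX POST https://codecanary.net/api/ratings \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```

The whole interaction stays under a second of overhead, which is what makes per-session ratings feasible at all.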

This isn’t just about buggy models

As developers become more dependent on coding agents, we’re going to need an independent source of truth about how they’re actually behaving. Benchmarks help, but models are frequently over-optimized for individual evaluation frameworks and can behave very differently in situ.

For example, SWE-bench runs models against a curated set of GitHub issues. HumanEval tests function completion in isolation. These are useful, controlled experiments — but they measure performance in a vacuum, not in your codebase and not based on your lived experience as a developer.

Meanwhile, the people with the best data on how these tools actually perform — working developers — have no systematic way to share what they know.

The Canary in the Code Mine

I named this project Code Canary after the canaries in coal mines that served as early warning systems. In practice, this meant sacrificing the lives of a lot of tiny birds: lethal gases (mainly carbon monoxide) would overcome them before affecting humans, giving miners time to reach safety. Code Canary works on a similar idea, thankfully with zero avian deaths: when a tool’s quality starts slipping — maybe after a model update, maybe after a rushed release — the developers using it every day will know first. Their ratings are the earliest, most reliable signal available.

Join the Beta

I’m launching Code Canary in beta today. Here’s how to get it running:

  1. Install the hook for your preferred AI coding tool and start rating your sessions:
    curl -sL https://codecanary.net/install | bash
  2. Explore the dashboard at codecanary.net to see early results
  3. Share your setup — the more developers who contribute ratings, the more useful the data becomes for everyone
  4. Read the FAQ if you’re curious about how it works — Code Canary never sees your code, and data is anonymized before being displayed publicly.
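If the one-line installer isn’t your style, you can register the hook by hand. Assuming Claude Code’s JSON hooks configuration in your settings file, and with the TaskCompleted event name taken from the description above and the script path purely illustrative, the entry would look roughly like:

```json
{
  "hooks": {
    "TaskCompleted": [
      {
        "hooks": [
          { "type": "command", "command": "~/.codecanary/rate-session.sh" }
        ]
      }
    ]
  }
}
```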

I can’t wait to start building some really sophisticated analytics, so I hope you’ll check it out and contribute some ratings!


  1. One interesting theory I’ve heard is that model providers purposely degrade performance before they release a new model by reallocating GPUs away from inference and onto training. With fewer GPUs available for inference, the theory goes, they have to quantize models (adding lossiness, like turning down the bitrate of an MP3), which decreases reasoning quality and model performance. ↩︎