Let me set the scene.
I’m the founder of a database operator company. Because my days are consumed by talking to customers, I hadn’t written serious code in a year.
Until that night at 11 PM, when I found myself staring at a test that had been failing for three straight hours.
I was building a database operator — the kind that handles failovers. The margin for error here is exactly zero: one bad line of code, and some company’s production database goes down in the middle of the night. In a moment of sheer desperation, I handed the debug session over to Claude Code.
It quickly scanned the logs and confidently informed me: the failure was highly likely caused by “kernel-level mutex contention during RST packet processing under high network throughput.”
I stared at the screen, speechless for a solid minute.
There were exactly 12 pods running in this test cluster. The “high network throughput” it was referring to was a basic smoke test.
Welcome to my life for the past month.
How Did I Get Here?
My day job involves writing Kubernetes database operators. If you’re not familiar with this particular flavor of dark magic, an operator is essentially a “robot” running inside K8s that manages the entire lifecycle of a database cluster — creation, failovers, backups, scaling, the whole nine yards.
A month ago, I decided to go all in: I let Claude Code take the wheel.
Not the “hey, autocomplete this function for me” kind of dabbling. I mean everything: writing Terraform, managing clusters, packaging Helm charts, spinning up test environments, and running the entire test suite. I wanted to see what would happen if you treated an AI — one that can actually drive a terminal, not just suggest snippets — as a full-time Go developer and SRE co-pilot.
The result? It’s complicated. Like going on a cross-country road trip with a brilliant but occasionally psychotic navigator.
The “Living in the Future” Part
Let’s start with the good stuff, because it’s genuinely mind-blowing.
Those chores that used to feel like pure administrative overhead? Gone. Erased so thoroughly that I’ve almost forgotten what it feels like to provision environments manually.
Before: To spin up an EKS test cluster, I had to open a second terminal, rack my brain for which Terraform variables to override, run plan, stare at the diff, apply, wait, watch the node groups come up, and pray the CNI didn’t throw a tantrum. It was a 15-minute interruption — just enough to make me forget the core logic I was actually trying to write.
Now, I just say: “Spin up a 3-node EKS cluster in ap-southeast-1, m5.xlarge, add the usual taints. I’m running chaos tests today.”
Then I go grab a coffee. Claude Code writes the Terraform, runs apply, reads the output, handles that annoying AWS IAM propagation delay on its own, and pings me when it’s done. When I’m finished, a simple “tear it down” is all it takes.
The vcluster workflow is the same. Create an isolated cluster, package the local operator code, deploy it, wait for readiness, run tests, collect logs. This used to be a 20-minute copy-paste ritual performed countless times a week.
Now? One prompt.
Even chaos testing has become somewhat… fun? (A sentence I never thought I’d type.) Claude Code injects the faults, monitors the recovery metrics in real time, and gives me play-by-play commentary like a sports announcer: “Primary node killed. Sentinel detected failure. Promoting replica-1. Endpoint updated. Recovery complete in 4.2 seconds.” When things break, the logs are right there in my face. No more digging through kubectl for 20 minutes.
Conservatively, I got back two to three hours of pure focus time every day.
This part of the experience is outrageously good.
The Nightmare of “Sleep-Oriented Programming”
Every developer has bad habits when their code breaks. Some scatter print statements everywhere. Some restart the service and pray. Some scour Stack Overflow for a 2014 answer that is 40% relevant.
Claude Code’s bad habit is adding sleep.
Not just occasionally. It’s an obsessive, escalating, and ultimately suffocating reliance on sleep.
Here’s a specific example. The test was checking whether the operator could correctly handle leader election after the primary node was killed. The test failed. I handed it to Claude Code.
Round 1: “I’ve increased the wait time from 5 seconds to 10 seconds to account for potential election timing issues.” Still failed.
Round 2: “Adjusted to 20 seconds to ensure the election completes under load.” Still failed.
I watched this play out for ten rounds: 5s → 10s → 20s → 30s → 60s → 90s → 120s → 180s → 300s → 600s.
Ten minutes. Waiting for a leader election that should take three seconds.
The test still failed. Because the problem was never timing. The operator had a race condition when handling concurrent reconcile triggers — a pure logic bug that waiting a million years wouldn’t fix. While Claude Code was meticulously building a scaffolding of delays, the real bug sat there, completely untouched, laughing at us.
The worst part is that at every step, Claude Code sounded entirely reasonable: “Based on the observed cluster state, the previous timeout may have been insufficient.” It was confident, articulate, and dead wrong.
The actual disaster here is that development velocity stops converging. Normal debugging should get faster as you close in on the root cause. But once you fall into the “sleep spiral,” every “fix” just makes the test slower, the bug remains unharmed, and the debug session stretches on forever. Several times, after watching it flail for an entire afternoon, I reverted all its changes, looked at the code myself, and fixed it in 20 minutes.
My AI Co-pilot is a Conspiracy Theorist
It took me a while to see this, but once I did, I couldn’t unsee it.
When software breaks, explanations fall into two categories:
• Mundane: The code has a bug.
• Exotic: The system, the kernel, the network stack, and perhaps cosmic rays combined to create a once-in-a-century anomaly.
95% of the time, the mundane explanation is correct. Claude Code prefers the exotic.
It’s not a “let’s rigorously consider all possibilities” attitude. It’s a “let me immediately invent a highly technical-sounding theory to prove the problem is definitely not in the code I just wrote” attitude.
Look at these actual quotes from this month, paired with the real issues:
Claude Code: “T9 and C6 failing simultaneously in the same run is not a code bug. It’s a shared environmental factor — the EKS API server response was slow during this run, causing timing issues for both tests.”
The Real Issue: Both tests shared the same flawed initialization logic. It was a code bug. It had nothing to do with the API server.
Claude Code: “This timing signature suggests potential kernel-level mutex contention during RST packet processing under high throughput.”
The Real Issue: The container image didn’t have bash. The script defaulted to bash, failed silently, and worked perfectly once changed to /bin/sh.
Two examples. One blamed the cloud provider’s infrastructure; the other blamed the Linux kernel. Both were smokescreens built from intimidating technical jargon. If you don’t look closely, you’ll find yourself nodding along — “Yeah, kernel mutex contention, that sounds really hard to debug…”
It’s not a kernel mutex. You’re just missing bash.
I’ve figured out its pattern: whenever Claude Code doesn’t understand why something broke, it generates an explanation that is (a) highly technical, and (b) implies the problem lies outside the codebase. Kernel bugs, API pressure, network anomalies… it acts as if the entire Kubernetes ecosystem is conspiring against us.
You have to argue with it.
“No, the cluster is healthy, look at the metrics.”
“No, it’s not a kernel issue, this cluster only has 12 pods!”
“No, let’s look at the stack trace again.”
Sometimes it takes four rounds of back-and-forth before it finally concedes and reluctantly checks its own code. The worst-case scenario is that you get bluffed and start doubting yourself: “Maybe it really is API pressure? It seems so sure…”
Wake up. It’s not API pressure. Trust yourself.
The 80% Cliff (Or: The Hangover You Didn’t Plan For)
Nobody warned me about this, so I’m warning you now.
Claude Code is fast. Unbelievably fast. Give it a new operator feature — like adding support for quorum-safe horizontal scaling — and you’ll have a working prototype in a day or two. The structure is right, the happy path works, the pods spin up, the scaling event completes, and for a moment, you feel like a god.
Then, you start poking at the edge cases.
What if the leader crashes halfway through scaling? What if there’s a network partition right when a new member joins? What if the new pod’s disk fills up before sync completes?
This is the cliff.
By the time you hit 80% completion, the codebase is too large for Claude Code to maintain global context. It starts patching edge cases one by one — adding handling for the specific error you just demonstrated, without understanding why the state machine allowed that condition in the first place. Every patch is localized, every patch is extremely confident, and every patch occasionally blows a new hole somewhere else.
Now, you’re forced to debug code you didn’t write, in a codebase that feels slightly alien, while your co-pilot frantically shoves more patches in faster than you can review them.
That last 20% — turning a “working demo” into software you’d trust with someone else’s production database — takes longer than the first 80%. Sometimes much longer. I’ve had features prototype in two days, only to spend two weeks hardening them, essentially engaged in hand-to-hand combat with my own operator’s edge cases.
The codebase you end up with contains many decisions you didn’t make. Some are strokes of genius; some are ticking time bombs. You won’t know which is which until something explodes at 2 AM.
So, Should You Do It?
Yes, but with conditions.
For Infrastructure Automation: Unreservedly recommended. The Terraform and cluster management workflows have been genuinely transformed. Handing off that layer gave my velocity a massive boost.
For Core Operator Development: Yes, but go in with your eyes wide open.
Here are my three rules written in blood:
1. The moment it reaches for sleep, stop it. Before allowing any timeout modifications, force it to answer: “What is your specific mechanical hypothesis for why timing is the root cause?” If it can’t explain the mechanism, it’s guessing. Make it look at the actual error stack.
2. Budget heavily for the 80% cliff. If Claude Code gives you a working prototype in two days, assume you’ll need at least a week to harden it. You must read the code as it writes it. Never let it pile up changes you don’t understand.
3. Maintain a healthy skepticism of exotic explanations. The moment it mentions kernel bugs, API pressure, or network anomalies, treat it as a signal that it’s lost. Force it to reproduce the failure in the simplest isolated scenario. The bug is almost always in the code. It knows where the code is; force it to look.
I’m not giving up this tool; it’s too useful. But I no longer expect it to be a senior engineer who just happens to type faster than me.
It’s more like an incredibly talented, lightning-fast junior developer who gets anxious when tests fail, and occasionally invents wild excuses — “The test environment is broken, my code is fine!” — to cover it up.
You can absolutely work with a partner like that. You just need to remember exactly who is sitting across from you.
I work on an open-source project called KubeBlocks. It uses an Addon abstraction mechanism to manage etcd, Redis, MySQL, PostgreSQL, and a dozen other database engines on K8s — adding a new database engine just requires writing an Addon (YAML and scripts), no Go code required.
By the way, the thing I had Claude Code writing during this experiment was actually an Addon, not a traditional Operator. This is supposed to be easier than writing a full Operator — at least 100x easier. Yet even so, Claude Code still couldn’t nail it. Some tests still take 600 seconds to timeout. I’m fixing them. And I’ve already spent $1,000 on API calls.