100% Autonomous “Agentic” Coding Is A Fool’s Errand


Though I’ve seen very little evidence of it being attempted on production systems with real users (because risk), my socials are flooded with posts about people’s attempts to crack fully-autonomous, completely-unattended software creation and evolution using “agents” at scale.

Demonstrations by Cursor and Anthropic of large-scale development done – they claim – almost entirely by agents working in parallel have proven that the current state of the art produces software that doesn’t work. Perhaps, to those businesses, that’s just a minor detail. In the real world, we kind of prefer it when it does.

I’ve attempted experiments myself to see if I can get to a set-up good enough that I can hit “Play” and walk away to leave the agents to it while I go to the proverbial pub.

That seems to be the end goal here – the pot of gold at the end of the rainbow. Whoever makes that work will surely, at the very least, make a name for themselves, and probably a few coins.

I’ve seen many people – some who understand this technology far better than me – attempt the same thing. Curiously, they don’t seem to have nailed it either, but are convinced that somebody else must have.

It’s that FOMO, I suspect, that continues to drive people to try, despite repeated failures.

But, as of writing, I’ve seen no concrete evidence that anybody has done it successfully on any appreciable scale. (And no, a GitHub repo you claim was 100% agent-generated, “Trust me bro”, doesn’t qualify, I’m afraid.)

The rules of my closed-loop experiments are quite simple: I can take as much time as I like setting things up for Claude Code in read-only planning mode, but once the wheels of code generation are set in motion, we’re like an improv troupe – everything it suggests, the answer is automatically “yes”. I just let it play out.

Progress is measured with pre-baked automated acceptance tests driving deployed software, which act as a rough proxy for “value created”, and help to avoid confirmation bias and the kind of “LGTM” assessments of progress that plague accounts of “agentic” achievements right now. It’s very much an “either it did, or it didn’t” final quality bar.
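
A minimal sketch of that binary bar, assuming nothing about my actual test suite: a set of pre-baked acceptance checks run against the deployed system, with a single pass/fail verdict and no partial credit. The checks, routes, and the stand-in “deployed app” below are all invented for illustration.

```python
# Hypothetical acceptance checks; real ones would drive the deployed
# software over HTTP, a browser, etc.
def check_signup(app):
    return app.get("/signup") == 200

def check_checkout(app):
    return app.get("/checkout") == 200

class FakeApp:
    """Stand-in for a deployed system, purely for this sketch."""
    def __init__(self, routes):
        self.routes = routes

    def get(self, path):
        return 200 if path in self.routes else 404

def value_created(app, checks):
    # No "LGTM", no vibes: every check passes, or the run failed.
    return all(check(app) for check in checks)

app = FakeApp({"/signup"})  # checkout was never actually finished
print(value_created(app, [check_signup, check_checkout]))  # False
```

The point is the shape, not the checks: the verdict is computed, not narrated.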

I can’t intervene until either Claude says it’s done, or progress stalls. I can’t correct anything. I can’t edit any of the generated files. I have to simply sit back, watch and wait.

So far, no matter how I dice it and slice it, no set-up has produced 100% autonomous completion, or anything close.

No doubt the tools are improving – using LLMs in smarter ways. But there’s only so much we can do with context management, workflow, agent coordination, quality gates and version control before we reach the limits of reliability that are possible when LLMs are involved. I suspect some of us are almost at that plateau already.

Agents – with those faulty narrators at their core – will always get stuck in “doom loops” where the problem falls outside their training data distribution, or the constraints we try to impose on them conflict.

Round and round little Ralph Wiggum will go, throwing the dice again and again in the hope of getting 13, or any prime number greater than 5 that’s also divisible by 3.

Out-of-distribution problems will always be a feature of generative transformers. It’s an unfixable problem. The best solution OpenAI have managed to come up with is having the model look at the probabilities and, if there’s no clear winner for the next token, reply “I don’t know”. That’s not good news for full autonomy.
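
A toy sketch of that abstention idea – not OpenAI’s actual mechanism, just the shape of it: if the top candidates are too close to call, abstain rather than guess. The margin and the probabilities here are invented for illustration.

```python
def next_token_or_abstain(token_probs, margin=0.2):
    """token_probs: dict mapping candidate tokens to probabilities.

    Returns the top token if it clearly beats the runner-up,
    otherwise abstains with "I don't know".
    """
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best[1] - runner_up[1] < margin:
        return "I don't know"  # no clear winner: abstain
    return best[0]

print(next_token_or_abstain({"cat": 0.9, "dog": 0.05, "fish": 0.05}))  # "cat"
print(next_token_or_abstain({"cat": 0.4, "dog": 0.35, "fish": 0.25}))  # "I don't know"
```

Notice the trade-off: the wider the margin, the more often the agent stops and asks – which is exactly the opposite of “walk away to the pub”.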

And, no, a swarm of Ralphs won’t solve the problem, either. It just creates another major problem – coordination. No matter how many lanes your motorway has, ultimately every change has to go through the same garden gate of integration at the end.

A bunch of agents checking in on top of each other will almost certainly break the build, and once the build’s broken, everybody’s blocked, and your beeper is summoning you back from the proverbial pub to unblock them.

One amusing irony of all these attempts to fully define 100% autonomous “agentic” workflows is that it’s turning many advocates into software process engineers.

Just taking quality gates as the example, a completely automated code quality check will require us to precisely and completely describe exactly what we mean by “quality”, and in some form that can be directly interpreted against, for example, the code’s abstract syntax tree.

I know Feature Envy when I see it, but describing it precisely in those terms is a whole other story. Computing has a long history of teaching us that there are many things we thought we understood that, when we try to explain them to the computer, it turns out we don’t.

Software architecture and design is replete with woolly concepts – what exactly is a “responsibility”, for example? How could we instruct a computer to recognise when a function or a class has more than one reason to change? (Answers on a postcard, please.)
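
To make the difficulty concrete, here’s a deliberately crude sketch of an AST-level “quality gate”, using Python’s `ast` module: flag methods that touch another object’s attributes far more often than their own – a naive Feature Envy heuristic. The ratio threshold is arbitrary, and everything this misses (or falsely flags) is precisely the skill and judgement part.

```python
import ast

def envious_methods(source, ratio=2.0):
    """Naive Feature Envy heuristic: flag methods whose attribute
    accesses on other objects outnumber accesses on self by `ratio`."""
    flagged = []
    tree = ast.parse(source)
    for cls in [n for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]:
        for fn in [n for n in cls.body if isinstance(n, ast.FunctionDef)]:
            own, foreign = 0, 0
            for node in ast.walk(fn):
                if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
                    if node.value.id == "self":
                        own += 1
                    else:
                        foreign += 1
            if foreign > ratio * max(own, 1):
                flagged.append(f"{cls.name}.{fn.name}")
    return flagged

code = """
class Invoice:
    def total(self, order):
        return order.price * order.qty + order.tax - order.discount
"""
print(envious_methods(code))  # ['Invoice.total']
```

A linter this crude will happily flag legitimate delegation and miss envy hidden behind local variables or helper calls – which is rather the point: the smell is easy to name and hard to specify.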

Fully autonomous code inspections are really, really, really (really) hard.

90% automated? Definitely do-able. But skill, nuance and judgement will likely always be required for the inevitable edge cases.

Having worked quite extensively in software process engineering earlier in my career, I know from experience that it’s largely a futile effort.

We naively believed that if we just described the processes well enough – the workflows, the inputs, the outputs, the roles and the rules – then we could shove a badger in a bowtie into any of those roles and the process would work. No skill or judgement was required.

You can probably imagine why this appealed to the people signing the salary cheques.

It didn’t work, of course. Not just because it’s way, way harder to describe software development processes to that level of precision, but also because – you guessed it – teams never actually did it the way the guidance told them to. They painted outside the lines, and we just couldn’t stop them.

In 2026, some of us are making the same mistakes all over again, only now the well-dressed badger’s being paid by the token.

We might get 80% of the way and think we’re one-fifth away from full autonomy, but the long and chequered history of AI research is littered with the discarded bones of approaches that got us “most of the way”. Close, but no cigar.

It turns out that last few percent is almost always exponentially harder to achieve, as it represents the fundamental limits of the technology. On the graph of progress vs. cost, 100% is typically an asymptote. We need to recognise a wall when we see one and back away to where the costs make sense.

Attempting to achieve better outcomes using agents with more autonomy seems like a reasonable pursuit, as long as we’re actually getting those better outcomes – shorter lead times, more reliable releases, more satisfied customers.

Folks I know being successful with an “agentic” approach have stepped back from searching for the end of that rainbow, and have focused on what can be achieved while staying very much in the loop.

They let the firehose run in short, controlled bursts and check the results thoroughly – using a combination of automated checks and their expert judgement – after every one. And for a host of reasons, that’s probably why they’re getting better results.

It’s highly likely there’s no end to the “agentic” rainbow. Perhaps we should start looking for some gold where we actually are, using tools we’ve actually got?
