Some of the marketing choices made by the “AI” industry over the last few years have seemed a little… odd.
The latest is a “breakthrough” in “agentic AI” coding heralded by Cursor, in which they claim that a 3+ million-lines-of-code (MLOC) web browser was generated by 100 or so agents in a week.

It certainly sounds impressive, and many of the usual AI boosters have been amplifying it online as “proof” that agentic software development works at scale.
But don’t start ordering your uniform to fight in the Butlerian Jihad just yet. They might be getting a little ahead of themselves.
Did 100 agents generate 3 MLOC in about a week? It would appear so, yes. So that part of the claim’s probably true.
Did 100 agents generate a working web browser? Well, I couldn’t get it to work. And, apparently, other developers couldn’t get it to work.
Feel free to try it yourself if you have a Rust compiler.
And while you’re looking at the repo – and it surprises me it didn’t occur to them that anybody might look – you might want to hop over to the Action performance metrics on the Insights page.

An 88% job failure rate is very high. It’s kind of indicative of a code base that doesn’t work. And looking at the CI build history on the Actions page, it appears it wasn’t working for a long time. I couldn’t go back far enough to find out when it became a sea of red builds.
Curiously, near the end, builds suddenly started succeeding. Did the agents “fix” the build in the same way they sometimes “fix” failing tests, I wonder? If you’re a software engineering researcher, I suspect there’s at least one PhD project hiding in the data.
But, true to form, it ended on a broken build and what does indeed appear to be broken software.
The repo’s Action usage metrics tell an interesting story.

The total time GitHub spent running builds on this repo was 143,911 minutes. That’s over three months of round-the-clock builds in about a week.
This strongly suggests that builds were happening in parallel, and that strongly suggests agents were checking in changes on top of each other. It also suggests agents were pulling changes while CI builds were in progress.
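A quick back-of-the-envelope calculation shows just how much parallelism that implies. (The one-week wall-clock figure is my reading of the claim, not a number from the repo.)

```python
# Back-of-the-envelope: how many builds must have been running at once?
# Assumption: the whole run took about one week of wall-clock time.
total_build_minutes = 143_911     # from the repo's Action usage metrics
wall_clock_minutes = 7 * 24 * 60  # one week = 10,080 minutes

avg_concurrent_builds = total_build_minutes / wall_clock_minutes
print(f"{avg_concurrent_builds:.1f} builds running concurrently, on average")
# → 14.3 builds running concurrently, on average
```

Fourteen-odd builds in flight at any given moment means agents were integrating on top of code whose state was unknown.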
This is Continuous Integration 101. While a build is in progress, the software’s like Schrödinger’s Cat – simultaneously working and not working. Basically, we don’t know if the changes being tested in that check-in have broken the software.
The implication, if our goal is to keep the code working, is that nobody else should push or pull changes until they know the build’s green. And that means builds shouldn’t be happening in parallel on the same code base.
Your dev team – agentic or of the meat-puppet variety – may be a 100-lane motorway, but a safe CI pipeline remains a garden gate.
The average job queuing time in the Action performance metrics illustrates what happens when a 100-lane motorway meets a garden gate.
And the 88% build failure rate illustrates what happens when motorists don’t stop for it.
The other fact staring us in the face is that the agents could not have been doing what Kent Beck calls “Clean Check-ins” – only checking in code that’s passing the tests.
They must have been pulling code from broken builds to stay in sync, and pushing demonstrably broken code (if they were running the tests, of course).
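The clean check-in discipline is simple enough to sketch. This is a hypothetical illustration – the helper methods (`pull_latest`, `run_tests`, `push`, `revert_local`) are stand-ins for your VCS and test runner, not anything from Cursor’s setup – but the control flow is the whole point: one check-in at a time, and nothing gets pushed unless the tests pass on the merged result.

```python
import threading

integration_lock = threading.Lock()  # the "garden gate": one check-in at a time

def clean_check_in(workspace):
    """A sketch of Kent Beck's Clean Check-in: integrate serially,
    and only push code that's passing the tests. The workspace
    methods are hypothetical stand-ins, not a real API."""
    with integration_lock:        # nobody else pushes or pulls meanwhile
        workspace.pull_latest()   # sync with the last known good state
        if workspace.run_tests(): # green? then, and only then...
            workspace.push()      # ...the check-in goes through
            return True
        workspace.revert_local()  # red? back out, fix, try again later
        return False
```

Whatever the agents were doing, an 88% failure rate says it wasn’t this.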
In the real world, when the build breaks and we can’t fix it quickly, we roll back to the previous working version – the last green build. Their agentic pile-on doesn’t appear to have done this. It broke, and they just carried on 88% of the time.
Far from proving that agentic software development works at scale, this experiment has proved my point. You can’t outrun a bottleneck.
If the agents had been constrained to producing software that works, all their check-ins would have had to go in single file – one at a time through the garden gate.
That’s where those 143,911 total build minutes tell a very different story. They represent the absolute minimum wall-clock time it would have taken – with no slip-ups and no queueing – to produce a working web browser on that scale.
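Serialised through the garden gate, those build minutes set a hard floor on the calendar time. (Again, this uses the repo’s own usage figure; it’s a lower bound, not an estimate of the realistic duration.)

```python
# Lower bound on wall-clock time if check-ins go through one at a time.
total_build_minutes = 143_911  # from the repo's Action usage metrics

days = total_build_minutes / (24 * 60)
print(f"at least {days:.0f} days of non-stop, back-to-back builds")
# → at least 100 days of non-stop, back-to-back builds
```

And that’s with every build green and zero time lost to queueing or rework.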
Realistically, with real-world constraints and LLMs’ famous unreliability – years, if ever. I strongly suspect it just wouldn’t be possible, and this experiment has just strengthened that case.
Who cares how fast we can generate broken code?
The discipline of real Continuous Integration – that results in working, shippable software – is something we explore practically with a team CI & CD exercise on my 3-day Code Craft training workshop. If you book it by January 31st 2026, you could save £thousands with our 50% off deal.