Backpressure in Agent-Driven Development

/images/what-is-vibe-coding-4-3999939321.png

The Bottomless Well of Engineer Suffering

The rise of agent-driven development has coincided with inflated expectations around what software engineering teams are capable of producing, given a set of resources and constraints. This, along with a fundamental shift in how tech organizations operate, has resulted in a whiplash that engineers are struggling to adapt to.

The token cost to build a production feature is now lower than the meeting cost to discuss building that feature.

Let me rephrase.

It is literally cheaper to build the thing and see if it works than to have a 30 minute planning meeting about whether you should build it.

Todd Saunders on Twitter

The good news: fewer meetings. The bad news: build everything… In short, we’re tired! But let’s not shed a tear for this well-compensated white-collar demographic just yet. Instead, I choose to do what I do best: over-analyze the shit out of the situation.

What do we know?

1. Agents = LLMs + tools (I’m right, you’re wrong)
2. Agentic coding systems have gone through a step change in capability over the last six months
3. Most engineering organizations are ill-prepared to ship 10x the volume of code

A brief aside on #1, since this is a highly debated topic: AGENTS.md, MCPs, tool usage, control loops, and so on are all ways of augmenting the function of the underlying language model. And what do we call a thing that augments another autonomous thing? A tool. The idea of automation dates back to the Greeks: in their conception, whether a tool was autonomous or non-autonomous was determined by the presence of a slave (i.e. autonomous = slave + tool). There are many aspects of this topic that are worth semantically and syntactically deconstructing. However, in this case, this ain’t the difference that makes a difference.

WTF is Backpressure?

As with all fun and interesting ideas, I was reminded of the concept of backpressure (and its various usages) by a seemingly random assortment of Hacker News articles, engineering problems, a podcast, and the books I’m currently reading during my morning commute on the Muni J-train. All of these coalesced into an idea: agent-driven development is producing a firehose of code, and that firehose is not very good at respecting the structure and stability of a codebase. Before we go too deep into that idea, let’s discuss some of the foundations that support it.

To start, in fluid flow, backpressure is the resistance that opposes the desired flow of a fluid through a confined space such as a pipe. It arises from friction loss and pressure drop, and it is what a pump has to overcome to sustain a certain velocity. This analogy has been extended to event-driven architecture and business processes, where a slow consumer pushes back on a fast producer instead of being overwhelmed by it. Within every system there will always be a rate-limiting step that determines the ceiling for throughput or performance. For engineers writing code, the system has historically been limited by humans writing and reviewing code. This is no longer the case.
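To make the software sense of the term concrete, here is a minimal sketch of backpressure using a bounded queue from the Python standard library. The capacity of two and the `pr-*` names are hypothetical, chosen only to mirror the idea of a review process with limited throughput; nothing here is a prescribed implementation.

```python
import queue


def produce(q, items):
    """Try to enqueue each item; return the ones the full queue pushed back on."""
    rejected = []
    for item in items:
        try:
            # put_nowait raises queue.Full instead of silently growing the backlog
            q.put_nowait(item)
        except queue.Full:
            rejected.append(item)
    return rejected


# Hypothetical capacity: only two changes can sit in review at once.
review_queue = queue.Queue(maxsize=2)
pushed_back = produce(review_queue, ["pr-1", "pr-2", "pr-3", "pr-4"])
print(pushed_back)
```

With no consumer draining the queue, the producer learns immediately that it is outrunning the rate-limiting step, which is the signal an unbounded queue would never send.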

Right Idea, Wrong System, Wrong Time

The motivation for this idea of backpressure in agentic systems derives from a couple of sources:

I’ve been reading a book called Kill It With Fire, which makes a convincing point that many technological solutions are cyclical because of their alignable differences: essentially, feature improvements that are easy to relate to because they behave analogously to a previous technology. The author recalls a question: “How is the cloud any different from old time-sharing schemes on mainframes?”

We started with thin-client mainframe green-screen terminal applications, then they wanted us to migrate to fat clients on PCs, now they want APIs with thin clients again.
A pattern that emerged during the first couple of years post-ChatGPT was that the frontier labs, upon releasing a major model update, would absorb the functionality of various startups that sprang up around the ecosystem. In 2023-24 there were swaths of guardrails and fine-tuning companies offering safe and customized models for your specific use case. As model capabilities improved rapidly, they gobbled up the relevance of these companies trying to fill a niche in the generative AI ecosystem. It was simply too much of a liability for LLM providers to leave this problem for someone else to solve.

The guardrails and fine-tuning companies had the right instinct — wrangle models to your specific purpose — but applied it at the wrong layer. If the answer isn’t middleware between you and an API, then where does the backpressure live? I think the idea of wrangling agentic systems is going to be reborn in the code itself: in the organization, conventions, and semantic structure of a codebase. But to understand why, we need to look at where agents are actually failing.

Where Backpressure is Absent

The analogy I’m trying to draw is between raw material flowing through a physical system and code flowing through a standard development process. What I’m not going to do is explore the ways in which agents are positively impacting this process, because (1) it’s not that interesting and (2) most engineers seem to have an intuitive sense for this. Rather, I’d like to explore the areas in which agents are deficient and why this should be equally intuitive. As in many sciences that investigate physical phenomena, the study of pathologies can shed light on the fundamental principles that describe how complex systems function.

At the Boundary: Opaque External Dependencies & Longitudinal Blindness

My coworker said it perfectly: “You can’t expect an agent (or a person, for that matter) to perform well when there are opaque external dependencies.” Said another way, agents only understand what you give them. It’s been obvious for a while now that remote state is something that LLMs don’t handle well. A great example is a NoSQL system that doesn’t enforce a schema, so the shape of the data is never written down anywhere the agent can read. Distributed systems that live in multiple repositories, or application code that lives in a separate repo from the infrastructure code, give rise to the same issue: the complete definition of an application is difficult to capture within a single codebase or context window.

Metadata about these external dependencies is valuable for understanding the architectural limitations at each stage of the promotion process, and by extension the verifiability of a functional change in environments that are necessarily subsets of production environments. An example is how Vercel creates branch-specific deployments, which it calls “previews,” to deploy a web application automatically. Likewise, Neon creates copy-on-write database branches for each preview deployment so that engineers applying migrations within their branch preview won’t conflict with their teammates’ work. These dynamically provisioned resources have bespoke URLs that require additional systems to propagate, which is to say that I can’t set up a webhook for a dynamic preview environment without some clever programming or middleware.
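As a rough illustration of the “clever programming” involved, the sketch below derives a webhook callback URL at runtime from the deployment’s own hostname. It assumes the platform exposes that hostname in an environment variable (Vercel documents `VERCEL_URL`, which contains the host without a scheme). The `/api/webhooks/example` path and the function name are hypothetical, and registering the URL with a third party is left out entirely.

```python
import os


def preview_webhook_url(env=os.environ, path="/api/webhooks/example"):
    """Build a callback URL for the current deployment, or None if the host is unknown."""
    host = env.get("VERCEL_URL")  # hostname only, no "https://" prefix
    if not host:
        return None
    return f"https://{host}{path}"


# Hypothetical preview hostname, passed explicitly so the example runs anywhere.
print(preview_webhook_url({"VERCEL_URL": "my-app-git-feature-x.vercel.app"}))
```

The point is less the three lines of string handling than where the knowledge lives: the URL exists only at deploy time, so neither an agent nor a human can find it by reading the repository.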

The system prompt, now commonly supplied through AGENTS.md or CLAUDE.md files that preface the context window so that agents abide by codebase conventions, is the current way to insert codebase metadata and thereby affect the stability of the application at the point where a change is being made. A codebase with metadata annotations that are visible to an agent (or human!) would prevent missteps that are obvious to a keen eye. While my thinking was focused on longitudinal flux (e.g. “churn rate” and “change coupling”), I stumbled upon research by Adam Tornhill, whose methodology (formalized in Your Code as a Crime Scene and implemented in CodeScene) produces a 2x2 prioritization matrix:

                    LOW CHURN              HIGH CHURN
                ┌───────────────────┬───────────────────────┐
HIGH COMPLEXITY │  Cold Complexity  │  🔥 Critical Hotspot  │
                │  Technical debt,  │  Complex AND actively │
                │  stable liability │  changing — refactor  │
                ├───────────────────┼───────────────────────┤
LOW COMPLEXITY  │  Cold & Fine      │  Hot but Manageable   │
                │  Leave it alone   │  Config, generated,   │
                │                   │  or simple glue code  │
                └───────────────────┴───────────────────────┘

The Critical Hotspot quadrant is the primary target. These files concentrate both ongoing development activity and structural complexity, making them simultaneously the most expensive to work in and the most likely to produce bugs. Empirical research across Mozilla, Microsoft Windows, the Eclipse IDE, and many open-source projects consistently shows that roughly 80% of bugs originate in 20% of files — and that 20% maps closely onto the hotspot quadrant.
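The quadrants can be approximated with very little code. The sketch below is an illustration only, not CodeScene’s algorithm: it counts how many commits touched each file and splits both axes at the median, which is an arbitrary threshold chosen for the example. The file names, commit lists, and complexity scores are made up; in practice the commit lists could come from something like `git log --name-only` and the complexity scores from any static analysis tool.

```python
from collections import Counter
from statistics import median


def churn_counts(commits):
    """Count how many commits touched each file. `commits` is a list of file-path lists."""
    counts = Counter()
    for files in commits:
        counts.update(set(files))  # a file counts once per commit
    return counts


def classify(churn, complexity):
    """Assign each file to a quadrant, splitting both axes at the median (an assumption)."""
    churn_cut = median(churn[path] for path in complexity)
    complexity_cut = median(complexity.values())
    labels = {
        (True, True): "critical hotspot",
        (True, False): "cold complexity",
        (False, True): "hot but manageable",
        (False, False): "cold and fine",
    }
    return {
        path: labels[(score > complexity_cut, churn[path] > churn_cut)]
        for path, score in complexity.items()
    }


# Made-up history and complexity scores for four files.
commits = [
    ["billing.py", "util.py"],
    ["billing.py"],
    ["billing.py", "config.yaml"],
    ["config.yaml"],
    ["config.yaml"],
]
complexity = {"billing.py": 40, "legacy.py": 35, "config.yaml": 2, "util.py": 3}
quadrants = classify(churn_counts(commits), complexity)
print(quadrants)
```

A table like this is exactly the sort of longitudinal metadata that is cheap to compute and currently invisible to an agent unless someone puts it in front of it.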

Just as there exist hot/cold code paths within a software runtime (e.g. Go’s profile-guided optimization, PGO) and hot/cold nodes within neural networks (stored at different points in memory, depending on their access frequency), there also exist hot/cold directories and files within a codebase.

Another way of thinking about agents and the actions they take on code is the dichotomy in Western philosophy and the divergence from Platonic concepts of truth in the 17th and 18th centuries: truth being conceptualized as ideal forms by which all things are represented, or truth as the constant that reflects the dynamic nature of reality. LLMs currently live in a Platonic world and miss the longitudinal nature of software systems – they don’t “know” what has changed recently or what patterns should be enforced or overwritten.

This is a missing backpressure mechanism: there is nothing in the current agent workflow that pushes back when an agent touches a hotspot file without awareness of its change history or external dependencies. The firehose of code is being pushed through a process that is ill-equipped to handle such velocity.
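One way to picture such a mechanism is a check that runs before a change is accepted and refuses to stay silent when a known hotspot is touched. This is a minimal sketch under stated assumptions: the function name, the idea of an explicit acknowledgement list, and the file names are all hypothetical, and the list of hotspots is assumed to come from whatever churn and complexity analysis a team already trusts.

```python
def review_gate(changed_files, hotspots, acknowledged=frozenset()):
    """Return the hotspot files in a change that nobody has explicitly acknowledged."""
    touched = set(changed_files) & set(hotspots)
    return sorted(touched - set(acknowledged))


# Hypothetical change set produced by an agent.
needs_attention = review_gate(["billing.py", "README.md"], hotspots={"billing.py"})
print(needs_attention)
```

An empty result lets the change flow through; a non-empty one is the pushback, whether that means a failing CI step, a required human reviewer, or extra context fed back to the agent.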

At the Interface: The Semantic Triangle

One aspect of agent-driven development that I don’t see addressed in my daily digest of Hacker News articles is the semantic meaning implicit within each component of the system. The systems engineering triangle holds that any autonomous action always has a corresponding environment and task. In our case, the agent (LLM + tools) takes action within a codebase as instructed by a prompt or series of prompts. The most effective system will be one in which the semantic spaces of these three components overlap. In other words, if the language you use to prompt a model doesn’t align with the model’s internal worldview and the function of the code, then the efficacy of the system will be diminished. To make a bit of an absurd point: let’s say a model trained only on French is prompted in Chinese with instructions to write an API in Go. There is no overlap in the semantic space between the different components.

/images/epi-triangle-small.png

With respect to semantic meaning in a codebase, a phrase comes to mind that is often repeated by the elder programmers: the code is the spec. This phrase is usually a glib retort to complaints about missing documentation or unreadable code that its author doesn’t care to explain or unpack for the naive junior engineer. As a junior engineer, this drove me crazy. As a senior engineer, I can empathize with the sentiment but still think it is incorrect, because code is not simply syntax. There is intent, deliberate and accidental design, along with a whole slew of compromises that were made on the road to creating a piece of software. Take a look at any of the foundational RFCs for the Internet: these are specifications that discuss set and setting for creating the protocols that underpin the greatest engineering achievement to date. Even the term “reference implementation,” as used for programming languages, speaks to this idea that the code is one way to manifest an idea but is not complete in and of itself. So, “the code is the spec” no longer works and probably never did. Code simultaneously encodes maintainability, design patterns, and things like space and time complexity, carrying with it semantic meaning that is not always implicit within the syntax.

Semantic alignment is itself a form of backpressure. When the prompt language, the model’s representations, and the code’s conventions are coherent, the system resists bad output naturally — misaligned changes feel wrong and get caught. When they’re incoherent, anything goes.

At the Pattern Level: Latent Consensus and Systemic Monoculture

In the way-back times of university, the idea was that the limits of machine learning lay at problems requiring non-repetitive pattern recognition, i.e. information that is purely random is not addressable by any kind of AI method. What we didn’t expect was what would qualify as a pattern. Grammar and semantic meaning dictate understanding in language, but the patterns present in the written word had yet to be encoded into a model in a way that provided utility. It turns out that with enough data (e.g. Common Crawl) and deep neural nets, these elements of language are highly predictive when the context of the preceding words is accounted for (i.e. attention).

This matters to agent-driven development because there are patterns encoded within LLMs that don’t necessarily respect the design or intention of your codebase. Agents often don’t honor existing patterns, implementing lazy imports or other code tricks to pass tests rather than stating that the assumptions in the given task are incorrect. A brief but insightful article titled Gaslight-driven development made the astute observation that even a well-reasoned design will be punished as an abnormality if it conflicts with a prevailing standard. Consider five years from now, when most of the code in the world will have been written by agentic systems: this will simultaneously make software more robust and more vulnerable, because it will follow a normalized pattern that is latent within models. This means that vulnerabilities, either at the model level or through their projection onto the world via code, will be pervasive.

An interesting quote that I recently heard from an ex-OpenAI researcher supports this sentiment across all models (and I’m paraphrasing):

Why do all commercial aircraft look similar? Because the design is economic for safely transporting hundreds of people across various distances. This is what’s happening at the LLM labs right now – they are all optimizing on a model architecture that obviously provides value and in doing so have comparable performance and capabilities.

This is perhaps the most insidious absence of backpressure: the patterns an agent defaults to aren’t random — they’re the training data’s consensus. Your codebase’s local conventions exert little force against this consensus unless they’re made explicit. Without that resistance, every agent interaction erodes local design in favor of the global mean.

Engineering Backpressure

These three absences of backpressure aren’t independent — they compound. An agent that can’t see external state is also unable to infer the semantic intent behind code that depends on that state, and will default to its latent patterns rather than the codebase’s actual conventions. The result is a system that produces code at volume without any of the natural resistance that once kept codebases coherent.

The point is that the guardrails and fine-tuning companies had the right idea, but at the wrong time and in the wrong system. The task is still to wrangle agents and their underlying models to your specific purpose, but the system isn’t some middleware between you and the LLM API; it’s the organization of your code itself. If code is the raw material, then your controls should be in how you manipulate that raw material, i.e. the organization and assembly of tokens. A codebase is no longer an artisanal thing but a machine to be shaped as it changes at an industrial rate. The force you exert must go into enforcing design patterns that dictate long-term maintainability and extensibility. The effectiveness with which an engineer wields agent systems is determined by how accurately they can perceive the semantic triangle and then instruct the agent to act accordingly.

In practice, this means engineering resistance at each of the three layers where we found it missing:

At the boundary: metadata annotations, system prompt files that document external dependencies, infra-as-code colocation
At the interface: naming conventions that align prompt language with code abstractions, structured task descriptions that reduce semantic drift
At the pattern level: explicit style guides and architectural decision records that override the model’s defaults, linting and CI that enforce local conventions
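As one small example of the pattern-level item, a local convention can be turned into a check that CI enforces. The sketch below assumes a hypothetical house rule, “no imports inside function bodies,” picked because lazy imports are one of the tricks agents reach for. It uses only the standard library `ast` module; the function name and the sample source are made up for illustration.

```python
import ast


def nested_import_lines(source):
    """Return the line numbers of import statements that appear inside function bodies."""
    lines = set()
    for func in ast.walk(ast.parse(source)):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for node in ast.walk(func):
                if isinstance(node, (ast.Import, ast.ImportFrom)):
                    lines.add(node.lineno)
    return sorted(lines)


# Made-up module: one top-level import (allowed) and one lazy import (flagged).
sample = (
    "import os\n"
    "\n"
    "def handler():\n"
    "    import json\n"
    "    return json.dumps({})\n"
)
violations = nested_import_lines(sample)
print(violations)
```

A check like this is deliberately dumb. Its value is that it states a local convention in a form that pushes back on every change, regardless of who or what wrote it.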

Coda: The Vibe Coding Paradox

Despite these prescriptions, one thing is abundantly clear: the combination of tools and model capability has created a tectonic shift over the last few months. Friends who have never coded a day in their lives are building functional applications in creative and alarming ways, adding real value to their work that wouldn’t otherwise have been possible. Taking a peek under the hood, I’ve never seen such horrifying codebases in my life. No aspect of these vibe-coded applications is maintainable when the agents are orchestrated by non-technical users. But it still works!!! If it’s stupid but it works, then it ain’t stupid. We are rapidly approaching a state of the software world where spaghetti code is a valid format for agents to operate in.

/images/vibe-coding-house.png

However, this paradigm currently breaks down beyond a certain degree of complexity. There’s an analogy to the dissection of deep neural nets in the discipline of mechanistic interpretability: will vibe-coded codebases someday become artifacts of coding models that we study to better understand how to code ourselves? We have already seen this story play out with DeepMind’s AlphaGo model and its influence on styles of play in the game of Go.

The distinction isn’t between good code and spaghetti code, but between systems that need backpressure and systems that don’t. A weekend prototype built by a non-technical founder doesn’t need architectural coherence — it needs to work. A production system serving millions of users does. The question for engineers isn’t whether to use agents, but whether to engineer the resistance that keeps the output coherent as the volume scales. Backpressure isn’t about slowing down. It’s about not drowning.

References

Articles I Made My Coworkers Read

Other References

Ex-OpenAI Researcher On Why He Left, His Honest AGI Timeline, & The Limits of Scaling RL