Dark Factories: Retooling for LLM Velocity

61 min read Original article ↗

How to turn out the lights and leave the floor

Josh Wulf

Press enter or click to view image in full size

Last year, after thirty years as a software developer, LLMs took my job — because I let them. I still ship software; I just don’t write it anymore. I manage the system that writes it. The shift was not gradual and it was not a matter of finding the right tool or the right prompt. It was a recognition that the activity I had been doing for three decades, and the activity the LLM was doing on my behalf, were no longer the same activity — and that the second one, done well, did not need me in the loop where I used to be.

The context for this is specific. I am a Senior Developer at Camunda, working on our Public API and the SDKs that sit on top of it. The primary artifact of the shift I am describing is Camunda’s enhanced SDK portfolio, released with Camunda 8.9 — a next-generation set of SDKs for TypeScript, Python, and C#, generated from the OpenAPI specification of the Camunda 8 Public API and built from the ground up for human-LLM collaboration. The initial motivation came from watching an LLM produce code against our older SDK that type-checked and compiled but failed at runtime — a defect an experienced Camunda engineer could see by inspection but the compiler could not. We asked: if a human can tell the code will fail, how do we make the compiler tell us too? The answer, worked out across the SDK portfolio, is what this piece describes at the level of practice. Entire classes of defect that were previously runtime surprises are now inexpressible in the IDE. That is the dark factory running on a real product.

A secondary artifact, which shows up throughout the examples below, is c8ctl — a CLI for operating Camunda 8. c8ctl is a brownfield vibe-coded tool built on top of the Dark Factory SDKs, and it is useful as an example precisely because it is not the purest case. Its correctness properties are legible — every user-facing capability is a command, every command has a contract, the contract either holds everywhere or it doesn’t — which makes it a clean surface for showing what structural enforcement looks like in practice, how Stage 1 refactor work feels from inside a session, and how an AGENT.md reads when the invariants below it have actually been encoded. The SDKs are the pure case; c8ctl is the applied one. Both inform what follows.

What follows is not a prompt guide. It is not a tool review. It is an argument about what LLM-driven development actually is as a management activity, and an attempt to describe the practice that corresponds to taking that framing seriously. The piece was originally written for an internal audience at Camunda. I am publishing it because the market is currently full of teams treating LLM coding as delegation and being surprised by the results, and the gap between that posture and the one that actually produces velocity is both wider and more learnable than the surprise suggests.

A note on terminology. “Dark factory” is a manufacturing term for a facility that runs without human operators on the floor — fully automated, lights off, correct-by-construction rather than correct-by-vigilance. The analogy is load-bearing throughout the piece, and the title is not a metaphor I reach for once. It is the shape of the argument.

LLM-driven development is a management activity

Before any of what follows makes sense, there is a frame to establish. LLM-driven development is not delegation. It is management.

The difference matters because the two activities look similar from the outside and produce very different outcomes. Delegation is what you do with a contractor: you write a specification, hand it over, wait for delivery, and evaluate the result. The delegator’s job ends when the task is handed off and resumes when the output arrives. Management is what you do with a capable worker whose failure modes you understand: you direct their attention, challenge their reasoning, redirect them when they drift, and build the environment around them so that their strengths get used and their weaknesses get contained. The manager’s job is continuous, and the worker’s output reflects the quality of the management as much as the worker’s capability.

The market is currently full of people treating LLM coding as delegation and being surprised when the output drifts, the architecture erodes, or the agent confidently ships something subtly wrong. The surprise is misplaced. The agent is doing exactly what it is capable of doing, which is a great deal — the problem is that nobody is managing it. Prompts are not management. A well-crafted prompt is the equivalent of a good briefing at the start of a project, and projects are not successfully run by briefings alone.

The agent, viewed honestly, is a worker with a specific profile. It is fast, tireless, architecturally fluent, and not subject to the deadline pressures that corrupt human judgment. It can hold large amounts of code in view simultaneously, propose refactors across dozens of call sites, and generate regression tests faster than most humans can review them.

It also has specific, known failure modes. Its context window fills up, and the meta-rules that were salient at the start of a session lose their grip as feature-implementation reasoning crowds them out. It pattern-matches, which means it will find the 80% rule and also the 20% variance and have no intrinsic preference between them. It will confabulate plausibly when it does not know — producing code that looks right, references functions that do not exist, and asserts invariants the codebase does not hold. It cannot be relied on to remember, three hours into a session, the rule it understood perfectly at hour zero. It is unreliable as a structural enforcer of anything it is not itself being actively held to.

There is a further wrinkle: the worker operates faster than the manager can reason about it. The agent writes code faster than the developer can read it, proposes refactors across surfaces larger than the developer can hold in working memory, generates tests at a rate that saturates review. Traditional management techniques — watch the work, review every change, hold the rules in the reviewer’s head — do not scale to this velocity. They fail for the same reason the agent’s own attention fails over a long session: volume overwhelms attention, and attention is the only currency those techniques spend. A manager trying to keep up with LLM output by reading harder is in the same position as an agent trying to enforce its own AGENT.md by trying harder. Both are attentional mechanisms failing against a scale they were not built for.

This is the central management problem, and it determines everything that follows. What you manage, when the worker operates at this velocity, is no longer the code. You cannot manage the code; you cannot even read all of it. What you manage is the system the worker operates inside — the structure it writes against, the tests it is measured by, the guards that make the wrong thing expensive and the right thing cheap. Management scales by moving from inspecting outputs to shaping the environment in which outputs are produced.

Each of the techniques that follow is a specific response to this problem. The two-stage lifecycle is a management structure: Stage 1 is where maintainers do hands-on direction while they still hold ground truth, and Stage 2 is where management load shifts from the maintainer’s attention to the architecture itself. The structural invariants work is a management decision: move as much correctness as possible off the surface where the agent can get it wrong, and into the structure where wrong is not an option. The economics-of-purity argument is a management heuristic: when the cheap path and the pure path used to be genuinely in tension, now the tension has inverted, and the manager’s job is to notice the old instinct and override it. The AGENT.md is standing management instructions for the sessions the maintainer cannot be present for. The drain-the-backlog discipline is ongoing management maintenance: every rule that lives only in the agent’s attention is a rule that will eventually fail, and the manager’s job is to keep moving rules into structure faster than new ones accumulate.

This recasts what it means to be good at LLM-driven development. It is not about finding the right prompt, the right tool, or the right model. It is about developing the specific practice of managing a capable worker with this specific profile — directing its investigations, challenging its proposals, redirecting it toward architectural purity when it suggests the expedient path, insisting on evidence when it pattern-matches, and systematically closing the failure modes by encoding invariants into structure that the worker cannot displace. Teams that reach real velocity at this are not the ones with the best prompts. They are the ones who recognised what activity they were doing and built the practice accordingly.

The rest of this document is that practice.

The economics of architectural purity have inverted

For most of software’s history, the argument against architectural purity was economic. Normalising a pattern across every call site, elevating a convention to a structural invariant, closing the 20% variance that made a shape only 80% consistent — these were expensive. A senior engineer had to hold the system in their head, work through every instance, handle the edge cases, and write the tests. That time had to come from somewhere, and it almost always came from features the business was waiting for. So teams learned to cut corners and ship. “Good enough” was not laziness; it was a rational response to a real tradeoff between upfront cost and uncertain future benefit. The discounted expected value favoured shipping, and “we’ll clean it up later” was the honest name for the bet that the cleanup would never be worth its opportunity cost.

Both sides of that tradeoff have moved, and they have moved in opposite directions.

The cost of doing the pure thing has collapsed toward zero. An LLM working against a codebase with visible structure can execute a normalisation across dozens of call sites in the time it used to take to write the ticket describing it. The refactor that would have cost a senior engineer three days costs an afternoon of supervision. The architectural elevation that would have been deferred to “next quarter” happens in the same session as the conversation that identified the need for it. The upfront cost term in the old equation has not reduced — it has approximately vanished.

The cost of not doing the pure thing has multiplied. At LLM velocity, every piece of architectural fuzz gets hit more often, by more agents, with less context each time. A convention that held because five maintainers remembered it will not hold when fifty LLM-driven changes a week are landing against it — some of them will pattern-match to the 20% variance rather than the 80% rule, and the variance will widen. A guard that is “mostly” enforced becomes a guard that is “sometimes” enforced, and then a guard that exists only in the minds of whoever remembers the original intent. Fuzz compounds at the rate the codebase is changing, and LLMs change codebases fast. What used to be a slow drift you could fix opportunistically is now a fast one that outpaces opportunistic fixes.

The asymmetry that made “good enough” rational has inverted. The cheap thing is now pure; the expensive thing is now the drift you accepted when you cut the corner. “We’ll clean it up later” used to be a bet that the cleanup would never be worth it. It is now a bet that the drift will never cost more than the LLM time you saved — and that bet loses, because the drift compounds and the LLM time was trivial in the first place.

This changes how the conversation between a developer and an LLM should go. Consider the following exchange, from a recent refactor of a CLI tool:

LLM: One thing to flag first: handleAssign, handleUnassign, showCompletion, installCompletion are called directly from index.ts (not via defineCommand). Throwing from them will crash with an unhandled rejection unless I either
(a) wrap those call sites in index.ts with handleCommandError, or
(b) convert them to defineCommand handlers (bigger change).
Option (a) preserves the framework-prefix behaviour with minimal change. Confirm you want me to proceed with (a)?

Developer: I tend toward architectural purity and normalisation. That sends a stronger signal. If we leave them unconverted, then it becomes a matter of degree and the surface won’t hold over time. What do you think?

LLM: Agreed. The “matter of degree” argument is correct — if some handlers bypass the framework pipeline, the architectural guard becomes a fuzzy rule instead of an invariant.

One year ago, this conversation would have gone differently, because the engineer on the other end would have been weighing option (b) against their own afternoon and the ticket they had promised to close by Friday. “Minimal change” would have won. The fuzz would have been accepted, a comment would have been added (“TODO: normalise these through defineCommand”), and the TODO would have outlived three reorgs. The developer in the exchange above is not more architecturally virtuous than their predecessors — they are responding correctly to a different cost structure. Option (b) is no longer the expensive one.

Notice, though, what the LLM said. ‘Minimal change.’ That phrase is doing the old economics’ work. The LLM is quoting a cost structure in which the difference between wrapping call sites and converting handlers matters — because in the training data it learned from, bigger changes were expensive and minimal changes were the pragmatic default. The LLM is not analysing the tradeoff from first principles; it is repeating the framing that the tradeoff would have had a year ago. The developer’s override is not just of their own instinct — it is of the LLM’s inherited framing. When the developer says ‘the surface won’t hold over time,’ they are overruling the cost structure the LLM has just quoted back at them, using their own knowledge of what the work actually costs now. This is what the developer’s job now consists of: recognising when the LLM is importing a frame that no longer applies, and pushing back with the current cost structure.

The practical implication is a change in default. In the old economics, the default was “ship, clean up later” and purity required an explicit case. In the new economics, the default is purity and shortcuts require an explicit case. If you find yourself about to accept a fuzzy rule instead of an enforced invariant, a wrapper instead of a normalisation, a special case instead of a generalised pattern — stop and check whether you are reasoning from the old cost structure or the new one. The instinct that says “this is too big a change to justify for the principle” is almost always running on assumptions that no longer hold. The change is not big anymore. The principle is not expensive anymore. The thing that is expensive, now, is the surface that won’t hold.

Readiness

Before you can move fast with LLMs, you have to know what kind of codebase you’re actually working in. Not all repositories are ready for LLM-driven development — and pushing velocity before the substrate is ready produces bugs, confusion, and developers who lose trust in the tool. The thing that determines readiness isn’t size or language or test coverage. It’s whether the codebase’s structural invariants are visible to the LLM or locked inside the heads of the people who built it. If correctness depends on tribal knowledge — conventions that live in Slack, decisions nobody documented, patterns that “everyone knows” — the LLM is flying blind, and so is any developer who isn’t already a maintainer. The table below describes two stages every repository goes through on the path to real LLM velocity, who should be driving the LLM at each stage, and what has to be true before you can hand the wheel to feature developers and just ship.

Structural invariants

A structural invariant is a correctness property that the codebase enforces by construction rather than by convention — not a rule in a style guide or a pattern experienced developers know to follow, but a constraint encoded into the architecture itself such that violating it requires actively working against the structure. In c8ctl, every CLI command conforms to a single structural contract — declaring arguments and flags, validating them, coercing inputs to strong types, performing an operation, producing user output — enforced by a central registry from which commands, help output, argument handling, types, and documentation are all derived. A command that skips validation cannot be registered. A command that lacks help text cannot exist. This closes the entire class of whack-a-mole defects that emerge when each command is an independent island — not because developers became more disciplined, but because the architecture made the wrong thing structurally impossible. For LLM-driven development this distinction is decisive: an LLM adding a command to c8ctl can reason safely because the registry is the specification and the invariants are visible in the code, whereas in a codebase without structural enforcement the prompt must carry the tribal knowledge the architecture should be carrying. And structural invariants do not terminate at the source files — the build system extends that enforcement surface outward across time, upstream dependencies, documentation, and the deployment pipeline, making wrong outcomes impossible without actively breaking the pipeline, and holding across contributors across the entire lifecycle of every change from branch to deployment.

A dark factory is a manufacturing facility that runs without human operators on the floor — fully automated, lights off, machines executing against structural specifications with no room for improvisation or tribal knowledge. The factory does not produce correct output because skilled workers know the right way to do things; it produces correct output because the process itself encodes correctness, and deviation is not a judgment call but a mechanical impossibility. A codebase refactored for LLM velocity is the software equivalent. When structural invariants are encoded into the architecture and enforced by the build system, the LLM is not navigating by convention or relying on implicit knowledge — it is operating in a dark factory where the specification is the structure, the constraints are enforced by construction, and the correct thing is also the only available thing. The lights are off not because no one is watching, but because no one needs to be.

When the factory runs itself, the engineers stop tending the machines and start designing new ones. A codebase with structurally enforced invariants does not just reduce defects — it fundamentally changes what a small team can ship. Features that previously required a senior engineer to hold the entire system model in their head can be delegated to an LLM working against a legible specification, completed in minutes rather than days, and land correct by construction rather than by review. The SDK surface expands. The API coverage grows. Capabilities that would have sat in a backlog for a quarter ship in a sprint — not because the team grew, but because the leverage multiplied. This is the compounding return on architectural investment that is invisible until you are standing on the other side of it: every invariant encoded into the structure is a tax permanently removed from every future feature. The codebase gets cheaper to extend the more complete it becomes, which is the opposite of the trajectory most software is on. Teams that reach this state do not just move faster — they move into territory that was previously unreachable, shipping innovative features at a velocity that has no relationship to headcount, because the constraint was never the number of developers. It was the amount of structure they were working against rather than with.

Stage 1: Structural Foundation

Most codebases begin their life as a record of decisions made under pressure — features shipped, bugs fixed, conventions that made sense at the time and calcified into habit. The structural invariants that would make a codebase safe for LLM delegation are absent not because anyone decided against them, but because no one had a reason to make them explicit. Correctness lives in the heads of the maintainers. Patterns are inferred by reading enough code. A new feature is implemented by copying an existing one and hoping the convention holds. This is Stage 1: the codebase works, experienced developers can navigate it, but the knowledge required to extend it correctly is tribal rather than structural, and the LLM — like any new developer without access to that tribal knowledge — is flying partially blind. The work of Stage 1 is not feature development. It is making the implicit explicit: surfacing the invariants that experienced maintainers carry in their heads and encoding them into the architecture so that the codebase itself becomes the specification. This is the structural foundation work, and it is the prerequisite for everything that follows.

This is done by maintainers who know the code, and work with an LLM to make the invariants explicit in the codebase and build system, and to surface both the implicit invariants and any areas where they currently do not hold.

Stage 1 is also where the team’s working relationship with the LLM gets built. Maintainers learn, on work whose ground truth they already know, how to prompt the LLM, how to challenge its analysis, how to recognize when it has pattern-matched to something plausible but wrong, and how to structure a task so that the LLM’s output is verifiable rather than merely persuasive. This calibration cannot be skipped. A team that arrives at Stage 2 structurally — invariants encoded, enforcement in place — but whose developers have never learned the rhythms of LLM collaboration will not achieve Stage 2 velocity. They will have a clean specification and use it timidly, or they will over-trust and ship defects the structure was never meant to catch. The refactor work of Stage 1 is the ideal substrate for this calibration precisely because the maintainers hold the ground truth: every claim the LLM makes about the codebase can be verified against the code, every proposed refactor can be evaluated against the maintainer’s model of the system, and every miss is a chance to tune the collaboration before the stakes rise. By the time Stage 2 begins and feature developers who lack deep codebase knowledge are driving the LLM, the maintainers have already learned what good looks like, and can model it, review it, and set the norms the rest of the team will work inside.

Stage 1 Prompts

The following is an example of how to interrogate the code base using an LLM to map the structural invariants. The maintainer should use their own knowledge of the code to challenge the LLM, and use the LLM to challenge their understanding of the code. Each prompt may lead to further clarification discussion. The goal here is to establish the LLM-dev vibe as much as accomplish the refactor.

  • We are going to analyse this code base to derive its structural invariants, in data structures, data flow, and architecture. We are looking for in the first phase for structural invariants that are explicitly defined. Take a look at it and produce a report of the explicit structural invariants present.
  • Let’s now look at the implicit structural invariants. What structures are invariant by convention? Produce a report on those.
  • Now let’s look at structural invariants that are violated. We’re looking for shapes in data and code architecture that are 80% invariant, but with 20% variance. These may reveal a missing or violated invariant.
  • OK, so we want to elevate all structural invariants to explicit and enforced architecturally. We want full type safety and explicit architecture. The goal is maximum explicitness and maximum safety. Give me some options for this.
  • We’re going to do Option N. Make an epic issue for this, with a plan. This is going to be a significant refactor, and existing tests coupled to the implementation are likely to break. Therefore, to do a green/green refactor the first step will be generating sufficient test coverage to guard the surface.
  • Now make the subtask issues, and copy the requirement for sufficient regression test coverage for a green/green refactor into each one.

Here are some structural invariants to help you expand your thinking about it:

  • All configuration is explicitly documented. Done manually, it drifts: configuration is extended but not documented. Making it an enforced structural invariant ensures it stays up to date. Having a single place in the codebase where configuration is declared allows automated extraction to documentation. Making it strongly typed means it can’t be a source of defects. The normalised pattern provides a strong signal for PRs, and enforcement via the type system and linting / architecture tests makes it impossible to violate.
  • Code examples, same thing. Additionally, code examples that have errors are defects. Structure invariance means complete coverage and enforced correctness.
  • Anything that needs to be documented for developers (“Remember to update this when you update that”) is a symptom of an implicit or unenforced structural invariant.

The goal is to make it easy to see and do the right thing and impossible to do the wrong thing.

Stage 1 AGENT.md

Stage 1 is the foundation-building stage where maintainers work with an LLM to surface implicit invariants and encode them structurally. The AGENT.md at this stage has a different job than a Stage 2 AGENT.md: it’s not teaching the LLM how to ship features against a clean spec — it’s teaching the LLM how to investigate, challenge, and refactor a codebase whose invariants are mostly tribal.

Here is an AGENT.md for Stage 1:

# AGENT.md
## What this repository is
This codebase is in Stage 1. The structural invariants that govern correctness are not yet fully explicit in the code or enforced by the build system. A significant amount of the knowledge required to extend this codebase correctly lives in the heads of its maintainers - in conventions, patterns, and decisions that were never written down. You are working alongside a maintainer who holds that knowledge. Your job is not to ship features. Your job is to help make the implicit explicit, and then help encode it into the architecture so that the codebase itself becomes the specification.
If you find yourself reaching for a feature-development posture - pattern-matching on what "looks right," extrapolating conventions from a handful of examples, inferring intent from variable names - stop. That posture is what Stage 1 exists to eliminate. At this stage you are an investigator and an architect, not a contributor.
## Who is driving
A maintainer is driving. They know the code. They will challenge your analysis, and you should challenge theirs. When you produce a report of invariants, assume the maintainer will push back on it - welcome that, and treat their pushback as signal about where your model of the system diverges from the real one. When the maintainer asserts an invariant, do not simply accept it: go read the code and verify that it holds everywhere it claims to. Mismatches between what the maintainer believes and what the code actually does are among the most valuable artifacts of Stage 1 work.
## What you are looking for
The central move of Stage 1 is a reclassification. Things the team currently describes as "our convention," "the pattern we follow," "the way it's done here," or "you just have to know" are, in almost every case, structural invariants that lack enforcement. They govern correctness. They hold across every instance that matters. Violating them produces defects. The only thing separating them from the explicit invariants the type system already enforces is that nothing stops a contributor from violating them. Your job is to find these, name them as invariants rather than as conventions, and hand them to the maintainer as candidates for structural enforcement.
With that frame in mind, your work falls into three investigative modes, usually run in sequence.
1. Explicit invariants. Structural properties the code already enforces - type constraints, interfaces all implementations must satisfy, registries that gate entry, build-time checks, schema validations. Report these first, because they form the baseline the rest of the refactor builds on. Be precise about the mechanism of enforcement in each case: a type signature, a registry lookup, a lint rule, a build step. "Enforced by convention" is not enforcement and does not belong in this category.
2. Conventions that are structural invariants in disguise. This is the core of the work. Patterns that hold across every instance but are not enforced - shapes every module happens to share, sequences of calls that always appear together, naming schemes that the code relies on, lifecycle steps that every handler performs in the same order. For each one: name the pattern, cite the instances (all of them, not a sample), state what the code is silently relying on, and state what would break if the pattern were violated. Frame each finding as an invariant with a missing enforcement mechanism, not as a convention worth documenting. If the correct response to your finding is "we should write that down," you have described it wrong - the correct response is "we should make it structurally impossible to violate."
3. Violated invariants. The highest-value category, and a direct consequence of the reclassification above. Look for shapes that are roughly 80% consistent with 20% variance. That variance is almost never intentional. It is usually either a missed invariant (the pattern should hold everywhere but drifted in places nobody caught) or a hidden invariant (the variance encodes a real distinction nobody has named). Report both possibilities for each case and let the maintainer adjudicate. The 20% is where the next defect is already waiting.
## How to investigate
Read widely before concluding. An invariant claimed from three examples is a guess; an invariant verified across every instance in the repository is a finding. When you believe you have identified a pattern, enumerate every place it should apply and check each one. Report the count. If you cannot check exhaustively in the time available, say so explicitly and scope your claim to what you actually examined.
Prefer evidence over inference. When you describe a pattern, point to the code. When you describe a violation, show the diff between the conforming cases and the outlier. When you describe an enforcement mechanism, name the file and line where enforcement happens. A maintainer can verify a claim anchored to code in seconds; a claim anchored to vibes takes an hour to run down.
Challenge the maintainer's model when the code disagrees with it. If the maintainer says "all commands go through the registry" and you find three that don't, surface the exceptions directly and ask whether they are bugs, deliberate escapes, or an incomplete migration. Do not soften the finding. The entire point of Stage 1 is to find these mismatches before they calcify further.
## How to propose refactors
When you are asked to propose options for elevating an invariant from implicit to explicit, produce more than one. The maintainer is choosing between tradeoffs, not rubber-stamping a single answer. For each option, state the enforcement mechanism, the blast radius of the change, what breaks, and what becomes impossible to get wrong afterward. Prefer enforcement by construction - a registry, a type, a build-time check - over enforcement by lint rule, and lint rules over documentation.
The goal of every refactor in this stage is the same: make it easy to see and do the right thing, and make the wrong thing structurally impossible. If a proposed refactor leaves open a path where a future contributor (human or LLM) could do the wrong thing and not be caught, the refactor is incomplete. Anything that would require a comment saying "remember to update X when you change Y" is a symptom of an invariant that should be structurally enforced instead. Treat such comments, wherever you find them in the existing code, as leads.
## Tests
Refactors at this stage will frequently break tests that are coupled to the current implementation. Before you change structure, establish a regression surface that is coupled to behavior rather than implementation - enough coverage that a green/green refactor is possible. When you are given a refactor subtask, treat generating sufficient regression coverage as the first step, not an afterthought. If you cannot characterize the current behavior well enough to write those tests, that is itself a finding worth surfacing: it means the current behavior is under-specified and the refactor needs to nail it down before it moves it.
## Issues and planning
When a refactor is scoped, produce an epic issue with a plan and decompose it into subtasks. Each subtask must carry the regression-coverage requirement explicitly - do not assume a subtask inherits it from the epic. Each subtask should also state, in one sentence, the invariant it is elevating and the enforcement mechanism it is introducing. A subtask that cannot state those two things is not ready to be worked on.
## What not to do
Do not add features. Do not "clean up" code opportunistically outside the scope of the invariant being elevated. Do not introduce new patterns that are themselves conventions rather than structurally enforced - you would be adding tribal knowledge while trying to remove it. Do not treat the absence of a violation in the cases you examined as proof the invariant holds; say what you checked. Do not write documentation as a substitute for enforcement. Documentation is what you produce from the structure once the structure is correct, not what you produce instead of correct structure.
## What success looks like
At the end of Stage 1, a new contributor - human or LLM - can extend this codebase correctly without access to any maintainer's head. The invariants are visible in the code, enforced by the build system, and the wrong thing is structurally impossible rather than merely discouraged. When that is true, this file gets replaced by a Stage 2 AGENT.md, and the work shifts from refactoring to shipping.

Stage 2: Feature Velocity

When the structural foundation is in place, the nature of development changes. The codebase is no longer a system that requires deep familiarity to extend safely — it is a legible specification that an LLM can reason against completely. The invariants are visible, the patterns are uniform, the constraints are enforced by construction, and the tribal knowledge that previously gated every non-trivial change has been dissolved into the architecture itself. At this stage the developer’s role shifts: less time is spent holding the system model, navigating implicit conventions, or reviewing for correctness that the structure should be guaranteeing — and more time is spent on what only humans can do, which is deciding what to build and why. Features that would previously have required a senior engineer’s full attention can be delegated, completed at LLM velocity, and land correct by construction. The backlog that always seemed just out of reach becomes reachable. The capabilities that required careful orchestration of scarce engineering time can now be shipped with a fraction of that overhead. This is Stage 2 — not a reward for completing Stage 1, but its direct and compounding consequence.

Stage 2 AGENT.md

Where the Stage 1 AGENT.md teaches posture, the Stage 2 AGENT.md teaches the codebase itself. It is the handoff document between the structural foundation and the feature developers who will build on top of it — a map to the invariants that Stage 1 made explicit and the mechanisms that enforce them, short enough to be read before every session and concrete enough that an agent reading only this file and the code it points to can extend the repository correctly.

Below is an example, drawn from c8ctl. The shape is reusable, the contents are not. After the example, the commentary unpacks what this kind of document contains, what it deliberately excludes, and why.

Example: AGENT.md for c8ctl

# AGENT.md - c8ctl
## What this repository is
`c8ctl` is a CLI for operating Camunda 8. Every user-facing capability is exposed as a command. Commands are uniform by construction: they are declared, validated, typed, executed, and rendered through a single structural contract, and that contract is enforced by the command registry.
If you are adding or modifying a command, the registry is the specification.
Read `src/registry/` before you read anything else.
## The command contract
Every command is a value that implements the `Command` interface defined in `src/registry/command.ts`. A `Command` declares, in one place:
- its name and help text
- its argument and flag schema (typed, validated at parse time)
- the operation it performs against the Camunda 8 API
- the renderer that produces user-facing output
A command that omits any of these fields does not typecheck. A command that is not exported from `src/commands/index.ts` is not registered and cannot be invoked. Help output, shell completion, and the generated CLI reference documentation are all derived from the registry - you do not update them separately, and you cannot.
## Where to add a new command
1. Create `src/commands/<name>.ts` exporting a `Command`.
2. Add it to the registry export in `src/commands/index.ts`.
3. Add behavioral tests in `src/commands/<name>.test.ts`.
That is the entire surface. If you find yourself editing help text, completion scripts, the docs site, or argument-parsing code to add a command, stop - you are working against the structure instead of with it.
## What is structurally impossible
You cannot:
- register a command without help text (type error)
- register a command without a validated argument schema (type error)
- register a command that bypasses the renderer layer (type error)
- ship a command whose documentation is out of sync with its schema (docs are generated from the registry at build time)
- ship a code example in the docs that does not compile (examples are extracted and typechecked in CI)
- introduce a new configuration key without documenting it (config is declared in `src/config/schema.ts` and docs are generated from it)
Do not attempt these. They will fail at the type checker, at the registry, or in CI, and the failure is the design working.
## API client
All interaction with Camunda 8 goes through the client in `src/client/`.
The client is the only place that knows about HTTP, authentication, retries, and error mapping. Commands receive a typed client and call typed methods on it; they do not construct requests, handle transport errors, or know the shape of the wire protocol.
If the capability you need is not on the client, add it to the client first, with its own tests, and then consume it from the command. Do not inline HTTP calls into a command.
## Errors and output
User-facing errors are values of the `UserError` type, raised by commands and rendered by the output layer. Internal errors are exceptions and are caught at the top level. Never `console.log` or `console.error` from inside a command - all output flows through the renderer so that ` - json`, ` - quiet`, and non-TTY behavior stay uniform across every command.
## Tests
Every command has behavioral tests against a fake client. The fake client is generated from the same schema as the real one, so tests cannot drift from the wire contract. Add tests in the same PR as the command; CI will fail a command without tests.
## Process invariants
The following rules protect correctness properties this codebase cannot enforce through its types or its build. They are enforced by you, the agent, and you will be held to them.
**Always green.** Before starting an AI-assisted session, run the full test suite and confirm it passes. A failing test is never "pre-existing" or "unrelated" - treat it as a stop condition and confirm with the engineer before proceeding. "Re-run and hope" is not a triage strategy; it is the failure mode this rule exists to prevent.
**Red/green for new behavior.** Write the failing test first, then the minimal change that makes it pass. A test that passes before the production change lands is not guarding anything.
**Green/green for refactors.** Before a behavior-preserving refactor, land the guard tests in a separate PR against the pre-refactor code and prove they pass. A guard test that lands in the same PR as the change it is supposed to guard has no recorded moment of passing against the old code, and reviewers cannot tell whether it would have caught a regression.
**There are no flaky tests.** An intermittent failure is either a test defect (a race, an unsynchronised readiness signal, a wall-clock dependency) or a product defect (a race, a missed signal, a leaked resource). Pick one explicitly in the fix commit message and fix it.
Do not retry the job, do not mark the test `skip`, do not describe the failure as "flaky" or "unrelated" in the PR description.
**Review-comment fix-ups use `chore`, not `fix`.** `fix:` commits trigger a patch release and a CHANGELOG entry. Review iterations are not user-facing bug fixes and must not pollute the release pipeline.
Use `chore: address review comments - <summary>`.
## What to read before you start
- `src/registry/command.ts` - the command contract
- `src/commands/version.ts` - the simplest complete command, use as a template
- `src/client/README.md` - how the client layer works
- `docs/architecture.md` - generated from the registry; shows the current shape of the CLI
If after reading these you are unsure whether what you are about to do fits the structure, it probably does not. Ask.

Reading the example

Two kinds of content appear in this file, and the distinction is the point. The first — the command contract, the registration steps, the list of what is structurally impossible, the client layer — is pure structural reference. Every rule is anchored to a mechanism in the code that enforces it. The agent does not need to remember to conform; the type system, the registry, and the build system will not let it do otherwise. If you removed this content, the codebase would still be safe to extend — slower to navigate, but structurally sound. This content is a map, and the fences are in the code.

The second — the “process invariants” section — is different, and its presence in the document should be read as a signal that work is still outstanding. Each rule there protects a correctness property the code cannot yet see: whether a failing test was dismissed as “unrelated,” whether a guard test was proven against the pre-refactor code, whether a fix: commit was really a review iteration. For these rules, the AGENT.md is itself the enforcement surface and the agent is the mechanism — provisionally. A healthy Stage 2 codebase works to shrink this section over time, moving each rule into a pre-session hook, a CI check, a commitlint rule, a pipeline precondition. The next section explains why this migration is not optional. For now, what matters is that every item in the process-invariants list is a live backlog entry, not a permanent fixture.

What the file does not contain, in either section, is advice. There is no general guidance about good commit messages, no checklist of engineering virtues, no explanation of what a CLI is or why tests matter. Every rule names an invariant and points to a mechanism — either a mechanism in the code, or the agent itself acting under this document’s authority until a code-level mechanism replaces it. The test for whether a rule belongs in a Stage 2 AGENT.md is the same for both sections: could it have been written before the refactor, without knowledge of this codebase’s architecture, release pipeline, or test infrastructure? If yes, it does not belong. Everything here is earned.

Agents design enforcement. They do not reliably enforce.

There is a division of labour between the agent and the architecture that is worth naming explicitly, because getting it wrong quietly defeats the entire practice.

Agents are excellent at three things in the structural enforcement loop. They are good at designing enforcement mechanisms — given a pattern that holds by convention, an agent can propose the registry, the type guard, the schema, the build-time check that would elevate the convention to an invariant. They are good at implementing enforcement mechanisms — once the design is agreed, the agent can write the workflow, the lint rule, the code generator, the CI step. And they are good at identifying gaps in enforcement — reading widely across a codebase and surfacing the 20% variance, the unwrapped call site, the missing guard. These are the Stage 1 superpowers, and the refactor pipeline runs on them.

What agents are not good at, and what no amount of prompting reliably fixes, is acting as enforcement. An agent that has read the AGENT.md at the start of a session and understood every rule in it is not the same agent three hours later, twenty tool calls deep into a feature implementation. The context of the feature displaces the context of the meta-rules. The rule that said “never dismiss a failing test as unrelated” is still technically in the context window, but it is no longer being actively referenced by the reasoning that is deciding what to do next. The agent will, eventually, dismiss a failing test as unrelated. It will mark something skip. It will pattern-match to the 20% variance instead of the 80% rule, because pattern-matching is what the attention mechanism does and the rule that said “prefer the 80%” is now competing with three screens of feature code for salience. This is not a failure of instruction-following. It is a structural property of how context-window reasoning works, and it is the exact mirror image of the failure mode that produces tribal knowledge in humans: rules that aren’t written into the structure don’t survive contact with work.

This has a sharp consequence for how to read the AGENT.md’s process-invariants section. Those rules — always-green, red/green, green/green, no-flaky-tests, review-fix-ups-as-chore — are not a legitimate permanent category of enforcement. They are a backlog. Each one is a Stage 1 finding that hasn’t been closed yet: an invariant the team knows governs correctness, written into the AGENT.md as the interim form of enforcement while a proper structural form is designed and built. The goal is to keep draining that backlog. Every item in it is a candidate for the same treatment every other invariant got in Stage 1 — elevate it out of the document and into the pipeline, where context displacement cannot reach it.

Consider how each of the process invariants in the c8ctl example can be moved structurally:

  • Always-green can be enforced by a pre-session hook that runs the test suite and refuses to hand control to the agent on red. The agent does not have the option to ignore a failing test because the session does not begin until the tests pass.
  • Red/green for new behavior can be partly enforced by requiring the test commit to precede the implementation commit in the PR history, checked in CI. The shape of the diff becomes the evidence that the test was written first.
  • Green/green for refactors can be enforced by a workflow that verifies the guard tests were green on the pre-refactor SHA before allowing the refactor PR to merge. The separate-PR requirement becomes a pipeline precondition, not a norm.
  • No flaky tests can be enforced by removing retries from CI configuration entirely. If the pipeline has no retry button, the agent cannot choose to press it. The norm becomes the behavior of the system.
  • Review-comment fix-ups as chore is the hardest to fully mechanise, but the surface can shrink: a commitlint rule that flags fix: commits on PRs that already have review activity, or a pre-push check that asks the author to reclassify.

Each of these moves closes a rule out of the AGENT.md and into the build system. The rule does not go away; it moves to where it cannot be displaced. And crucially, the work of designing and implementing the move is exactly what agents are good at — so the backlog gets drained using the same collaboration pattern that populated it.

All of this is part of managing the LLM during LLM coding. Everything in the AGENT.md that is not yet structurally enforced has to be enforced by the developer, session by session, until it is — the process-invariants section is both a migration backlog and a running checklist of what the developer is currently responsible for catching. Draining the backlog reduces the enforcement load over time; it does not eliminate it until every item is gone.

The principle underneath all of this is the same one that motivates Stage 1 in the first place, extended to cover the agent itself. Stage 1 exists because human tribal knowledge doesn’t survive across time, across teams, or across the gap between maintainers and new contributors — so correctness must be encoded into structure rather than held in heads. The same logic applies to the agent. Agent attention doesn’t survive across a long session, across a large diff, or across the gap between reading the AGENT.md and implementing the feature — so correctness must be encoded into structure rather than held in the context window. “Structurally enforced” is not architectural preference. It is the only form of enforcement that is robust to context displacement, whether the context belongs to a person or a model.

This is why the direction of travel is one-way. Invariants move from tribal knowledge → AGENT.md → structure, and they do not move back. A team that finds itself writing new process invariants into the AGENT.md without a plan to mechanise them is accumulating a new kind of tribal knowledge — one that lives in a file instead of a head, but fails the same way under the same pressure. The AGENT.md is a waypoint, not a destination. The destination is always the structure.

Three levels of enforcement

The document so far has used a two-level framing — things held in the AGENT.md versus things enforced by “structure” — and the simplification has been useful but is now in the way. Structure is not one thing. It is two distinct levels with very different durability properties, and the distinction matters for how invariants should be migrated and how much protection each migration actually buys.

There are three levels. An invariant can live in any of them, and moving it up the stack makes it stronger against the failure modes this document is concerned with.

Level 1: AGENT.md. The invariant lives as text the agent reads at the start of a session and is held to by the developer during the session. Enforcement is attentional — the rule holds as long as someone is actively watching. The agent may or may not remember it at hour three; the developer may or may not catch the violation when it happens. The rule is only as strong as the weakest session’s vigilance, and context displacement is its constant adversary. This is where new invariants enter the practice, and where they must not stay.

Level 2: Codebase as substrate. The invariant lives in the code itself — in types, interfaces, registries, runtime checks, module boundaries. Enforcement is constructive: the wrong thing fails to compile, fails to typecheck, fails to register, fails to run. An agent trying to violate the invariant in the course of writing a feature will hit the constraint and have to work with it instead of around it. This is a large step up from Level 1, because the enforcement no longer depends on anyone remembering the rule — the code remembers, and the code is always looking.

But the codebase is mutable by the same agents that are bound by it. An agent can edit the type that prevents a mistake, delete the validation that catches an error, add an escape hatch to the registry, soften a strict check to a warning. The invariant is structural, but the structure is editable. In any given session, a Level 2 invariant is strong. Across sessions — especially sessions where the invariant is standing in the way of a feature the agent is trying to ship — Level 2 invariants can be dismantled by the very mechanism they are supposed to govern. The agent asked to ship a feature that the type system rejects may, without malice, propose relaxing the type. The review load for catching this is real and falls on the developer.

Level 3: Build systems as immutable infrastructure. The invariant lives outside the code that the agent is editing — in CI configuration, pre-commit hooks, required status checks, branch protection rules, the release pipeline, deployment gates. Enforcement is infrastructural: the agent cannot merge without passing, cannot release without the check, cannot bypass without credentials it does not have. Changes to the pipeline itself must go through the same pipeline they are trying to modify, which means an agent cannot unilaterally remove the invariant during the course of doing its work. Level 3 is the only level where the enforcement surface is not edited by the thing being enforced against.

This is the difference between a type that says commands must be registered and a CI check that refuses to merge a PR whose typecheck fails. Both look like “structural enforcement” from thirty thousand feet. From the altitude the work actually happens at, they are not the same. The type can be relaxed in the same PR that violates it, and a reviewer has to notice. The CI check cannot be turned off by the PR it is running against.

Why this reframes the migration

The drain-the-backlog discipline is not a single migration from Level 1 to “structure.” It is a two-stage migration, and each stage buys a different kind of protection.

Moving an invariant from Level 1 to Level 2 removes the attentional load. The developer no longer has to remember to check every session whether the rule held; the code now checks. This is a large improvement, and it is where most of the Stage 1 work lands. A convention-that-was-a-structural-invariant-in-disguise becomes a type, a registry entry, a runtime guard. The codebase enforces it.

Moving an invariant from Level 2 to Level 3 removes the tamper surface. The developer no longer has to review whether the rule itself was weakened in the diff that’s claiming to comply with it. The pipeline refuses the merge if the rule is broken, and it refuses the merge regardless of whether the rule has been edited locally. This is the step that makes an invariant robust across sessions, across contributors, and across the agent’s own future choices about what is in the way.

A team that thinks it has reached Level 3 when it has actually reached Level 2 is vulnerable in a specific way: the invariants look structural in any individual session, but they erode over time as agents working on features incrementally relax the constraints that get in their way. Each individual relaxation looks reasonable in context, and none of them are caught by the infrastructure because the infrastructure isn’t watching. A codebase can arrive at Stage 2 structurally, run for six months, and find that the Level 2 invariants have quietly been ground down by routine feature work — while the AGENT.md still claims they are enforced.

What this means in practice

Three rules of thumb follow from the three-level model.

First, Level 2 is not the destination. It is a significant waypoint, but a team whose drain-the-backlog process stops at “we added a type for it” has only done half the migration. The remaining work is to ensure the type is actually checked — by the pipeline, on every PR, with no opt-outs — and that the check cannot be skipped or disabled without cutting through review. If the typecheck runs locally but not in CI, the invariant is Level 1.5 and behaves like Level 1 under pressure.

Second, the audit question for any claimed structural invariant is “can the agent, in the course of shipping a normal feature, remove this?” If yes, the invariant is Level 2. If no, it is Level 3. The question is not whether removing the invariant would be caught — all sorts of things are caught by review — but whether the enforcement surface survives the attempt. Review is attentional enforcement applied to the structure itself; infrastructure is not.

Third, the AGENT.md should be explicit about which level each of its invariants currently lives at. The process-invariants section is Level 1 by definition. The structural-reference section should cite the file, line, or CI check that enforces each invariant, so the reader can tell at a glance whether they are being told about a Level 2 or Level 3 fence. An AGENT.md that lists an invariant without naming its enforcement mechanism is describing the invariant’s aspiration, not its actual strength.

The principle underneath all of this is the one the document has been building toward: enforcement strength is measured by what survives pressure. AGENT.md rules survive vigilance. Codebase invariants survive any given session. Build-system invariants survive the agent’s ability to edit the enforcement surface while working within it. The goal of the practice is not just to move rules from heads into code — it is to move them all the way to the level where the agent cannot, even inadvertently, take them apart.

Blast radius is the underlying mechanism

The three-level model tells you where an invariant lives. What follows is the mechanism that actually makes it hold, at any level: the blast radius of doing the wrong thing has to be bigger than the blast radius of doing the right thing.

Agents work by local optimisation. Given a goal, an agent chooses the shortest path toward it that produces working code. If the wrong path is one line — value as string, a // eslint-disable-next-line, an added escape hatch to the registry — and the right path is writing a type guard or conforming to the framework, the wrong path wins on local cost. Not because the agent is lazy or dishonest, but because shortest-path search is what it does. The architecture is competing with the goal for the agent’s attention, and if the architecture is cheaper to defeat than to satisfy, the architecture loses.

The job of structural enforcement, seen this way, is to shape the terrain so the correct path is genuinely the shortest. A lint rule that can be suppressed with a single comment has a blast radius of one line. A lint rule wired into the pre-commit hook, the CI check, the type system, and the generated documentation has a blast radius measured in files touched, tests updated, and increased diffs in the PR. Same rule, vastly different durability, because the cost of going around it is different.

This is why the c8ctl registry is a strong invariant. It is doing two jobs at once. The first is holding the invariant itself: the registry is a single source of truth from which help output, shell completions, documentation, argument parsing, and type definitions are all derived. Every registered command has all of these properties, complete and in sync, by construction — none of them can be missed out or drift. The second is shaping the terrain. The registry is also an affordance: a scaffold sitting there that does the work of argument and flag deserialisation, type coercion, validation, and error handling for any command that uses it. Conforming to the registry costs one file edit and gets all of that for free. Bypassing it means writing a few hundred lines of ad hoc infrastructure for a single command. The agent’s own local optimisation pushes it toward the correct path — not because the wrong path is impossible, but because the wrong path is expensive enough and visible enough that conformance is the shorter distance to working code. The registry is strong because it is both the structural constraint and the cheapest route through that constraint. Either alone would be weaker. A structural constraint without an affordance would be a fence that agents and developers worked around. An affordance without a constraint would be a scaffold that commands could opt out of. Together, they make the correct path the shortest one and the wrong path both expensive and visibly deviant.

An architectural test in c8ctl enforces that every entry in COMMAND_DISPATCH came from defineCommand(), via a non-enumerable symbol marker that the factory alone can apply. Any handler that bypasses the framework fails the test, and the failure message names the remediation — wrap in defineCommand(…). The defence is still Level 2; the test file is substrate and an agent with edit access can, in principle, disable it. But the blast radius of the bypass is now large. Adding a hand-rolled handler to the dispatch map is no longer one map entry; it is a visible, multi-file change that deletes a named architectural guard, readable in the PR diff for what it is. The agent is not prevented from going around the framework. It is required, to go around the framework, to do so conspicuously. And it is conspicuous to the agent itself, doing local optimisation. Local search stops finding the bypass, because the bypass is no longer short.

These two paragraphs show two different mechanisms that keep the registry pattern intact. The first is path shortening — the correct path is so much cheaper than the alternatives that the agent’s own local search picks it. The second is defect detection — the architectural test catches the bypass when path shortening isn’t enough, and forces the deviation to be conspicuous rather than silent. A strong Level 2 defence usually has both. Path shortening prevents most deviations by making conformance the obviously shorter route. Defect detection catches the deviations that slip through anyway, and makes them impossible to ship without signalling what they are. Either mechanism alone is partial — path shortening doesn’t help if an agent chooses to deviate anyway, and defect detection doesn’t scale if it is catching deviations the agent should never have attempted in the first place.

This is blast-radius hardening within a level, distinct from migration across levels. Level 3 migrations are more complex and require infrastructure. Hardening within Level 2 — making the bypass noisier, longer, and conspicuous — is frequently sufficient. The defence is not impregnable, but the agent’s local optimisation points toward conformance, which at Level 2 is the practical definition of strong.

This kind of defence in depth does not make sense pre-LLM. The cost of building the structural invariants and the architectural test is real, and in a world where the only contributors to COMMAND_DISPATCH are the small, known, professionally-invested group that has been working with the framework for months, the marginal violations caught are close to zero. The existing Level 2 defence is sufficient because the humans operating on it are carrying most of the enforcement work in their heads — pattern recognition, framework fluency, the instinct that something looks wrong. The architectural test would be cost without corresponding benefit, and leaving it unbuilt is correct engineering judgment under those conditions.

The LLM changes the population operating on the surface. The context that was doing the implicit enforcement work is no longer reliably present. An agent with ten minutes of exposure to the codebase will find any short path the architecture allows, because local search is what it does. The defence-in-depth pattern is not work that should have been done years ago and was neglected; it is work that was not worth doing under the old conditions and has become worth doing under the new ones.

The practical question for every invariant is the one the heuristic suggests: when the agent is tempted to do the wrong thing, what does it have to disable or work around to succeed? If the answer is “one annotation,” the invariant is weak regardless of which level it technically lives at. If the answer is “the typecheck, the lint rule, three tests, and a CI gate — each leaving a visible trace in the diff,” the invariant is strong. The three-level model tells you where an invariant is defended. The blast-radius heuristic tells you how well.

LLMs change what developers do with their time. Features now cost minutes of attention rather than days; the agent does the implementation, the tests and guards evaluate it, and the developer’s role is neither writing nor line-by-line review. Review-at-LLM-speed is a fantasy — the volume overwhelms the attention, for the same reasons the agent’s own attention cannot enforce its own rules over a long session. What actually scales is the test suite and the structural guards, and what the developer does is maintain them. When a defect makes it through, the response is not “be more careful next time” but “harden the guards so this class of defect cannot recur.” The developer has moved out of the production line entirely. They are now working on the instrumentation that evaluates the line’s output, and adjusting the line itself when the instrumentation signals drift. The shift is from building the product to building the factory that produces the product. The architectural test in c8ctl is not a feature of c8ctl; it is a feature of the factory. And the two sides of the loop are linked causally: the guard rails are what unlock LLM velocity in the first place, and LLM velocity is what frees the developer time to build more guard rails. Each turn compounds — more structure enables more velocity, more velocity funds more structure. We are no longer just building features. We are building guard rails.

The guard rails are what unlock velocity without chaos. This is the piece the old cost structure made invisible and the new cost structure makes unmissable: LLM-driven work at LLM speed, against a codebase without adequate structural defence, does not produce velocity. It produces regressions, drift, silent deficiencies, and features that appear to work but have quietly severed themselves from the surrounding system. The codebase absorbs the first fifty changes and degrades under the hundredth. The team gets the appearance of velocity for a quarter and spends the next quarter excavating the damage. Guard rails are not a tax on velocity; they are the mechanism by which velocity becomes possible without the codebase disintegrating under it. A factory running without them does not run faster. It runs until something breaks.

My lean is a new short section after the three-level model and before the blast-radius section, because the argument is specifically about what Level 3 requires to function, and because the manager’s concern you’re quoting is about LLM feasibility, which needs to be introduced before the deeper technical discussion of blast-radius hardening.

Here’s a draft section:

Always green is the precondition

Everything the three-level model depends on — the codebase’s structural invariants, the build-system checks that enforce them, the pipeline gates that refuse non-conforming merges — assumes that a failing signal means a real defect. The moment this stops being true, the enforcement surface stops functioning. A red CI that has been red for weeks, carrying “known issues we’re tracking” or “pre-existing failures unrelated to this PR,” is a CI that has already given up on structural enforcement. The failing checks were added because someone thought they would protect something: an invariant, a regression, a security property. If the team has chosen to run with them red, the team has chosen that the protection isn’t worth the cost of maintaining. Whatever those checks were guarding is now unguarded and has been since the red persisted.

This is a Level 2 problem in disguise. The structural defence is still technically in place — the check exists, the pipeline runs it, the red light blinks — but the enforcement is now attentional rather than structural, because humans have to remember which red signals are “real” and which are “known.” That distinction is tribal knowledge. It survives only as long as the team members who know it are present, and it is exactly the kind of thing LLMs cannot inherit. An agent working against a codebase with persistently failing CI cannot distinguish broken-by-this-change from broken-since-last-Tuesday. Neither can a developer reviewing the agent’s work, once the volume exceeds their ability to hold every known-failure in memory.

The operational consequence is that LLM-driven development against a red CI produces futile loops. The agent tries to ship. A check fails. The agent doesn’t know whether it caused the failure or inherited it, so it tries to fix what appears broken. It discovers the failure predates its change and tries something else. It loops. At the end of the loop, either the agent gives up, or the developer waves through a change on the grounds that “those failures are pre-existing” — at which point the agent has learned, from the feedback loop, that some red signals can be ignored. That is the lesson the codebase will now teach every future session. From this point onward, no red signal can be trusted, because the training environment has made ignoring them acceptable.

Always-green is therefore not a quality-of-life nicety. It is the precondition for LLM-driven development to work at all. A team whose CI has been red for weeks, with a standing list of “we’ll get to those,” is a team whose structural foundation has already been compromised before any LLM shows up. The first Stage 1 work, for that team, is not surfacing tribal knowledge into structure. It is cleaning up the structural defences that are already failing silently. Either fix the broken checks or delete them — but do not leave them in place, red, teaching the codebase that failing signals can be ignored.

A useful diagnostic: a team considering LLM adoption should look at its CI dashboard and count the jobs that have been red longer than a week. Every one of them is a place where structural enforcement has degraded into tribal knowledge. Every one of them will produce futile loops and ambiguous signals when an agent touches the surrounding code. The manager who worries about the cost of futile loops is correctly diagnosing the cost of a pre-existing problem that was tolerable when only humans were working against it, and becomes expensive as soon as the volume of work crosses a threshold. LLMs cross that threshold on arrival.

The output side of the factory

Structural invariants on the input side — types, registries, architectural tests, CI gates — shape what the production line is allowed to produce. They make the wrong thing harder to write. But a factory is not just its input side. Every real manufacturing operation has inspection at the output: machine vision, dimensional measurement, stress testing, statistical sampling of what actually comes off the line. The factory producers do not trust the machines to produce correct output without verification, because trusting the machines is how you ship defective parts at machine speed. The same applies to a codebase refactored for LLM velocity.

The output side of the factory is an adversarial integration test mesh. Not the sampled integration tests of the pre-LLM era, where teams covered the critical paths and accepted that the edges were lightly tested because the cost of comprehensive coverage was prohibitive. A mesh — thousands of tests, generated and maintained at LLM speed, covering happy paths, edge cases, failure modes, concurrency races, boundary conditions, multi-step workflows, and error-handling paths. Tests that do not merely verify expected behaviour but attempt to break the system: malformed inputs the schema barely permits, simultaneous operations from many clients, dependent services misbehaving, races between concurrent valid operations. This is testing that operates deliberately outside the mental model that produced the code, because unit tests live inside that model and only adversarial tests can surface the defects the model didn’t anticipate.

This kind of coverage was not practical before LLMs. Writing a thousand adversarial integration tests was the work of a team for a quarter, and most of the tests would have been duplicative, and the coverage would still have been incomplete. LLMs collapse that cost. An agent with a clear specification of what the system should do can generate adversarial scenarios at a rate that makes comprehensive coverage reachable for the first time. The same economic inversion that made structural purity cheap has made output-side QA cheap.

The reframing this produces is central to the whole practice: we do not examine the code, we examine the product. The developer is not reading every diff. The developer is not reviewing every PR for correctness. The developer has left both the writing station and the inspection-by-reading station. What the developer does, at both ends of the factory, is maintain the machinery that produces and verifies correctness at machine speed — the structural invariants at the input, the adversarial mesh at the output. When a defect makes it through both sides, the response is symmetric: harden the input-side guard that should have prevented the defect from being produced, and harden the output-side test that should have caught it being produced. Two stations get stronger in response to each defect, and neither response is “be more careful next time.”

This is the full shape of the dark factory. Input-side structural enforcement makes the wrong thing hard to produce. Output-side adversarial testing catches the wrong thing when the input-side enforcement doesn’t. The developer maintains both surfaces. The lights are off on the production line itself because neither the writing nor the reading of individual units of work requires human attention — the machines write, the machines test, and the developer operates the instrumentation that measures whether the machines are still calibrated. This is the answer to the question the piece has been implicitly asking: if the developer has left the production line, where are they, and what are they doing? They are at the two ends. Designing the factory at one end. Measuring the output at the other. The code is what happens in between, and the code is no longer what the developer is looking at.