In Part 1, I showed that AI-generated code drifts, one “working” commit at a time, and built ArchCodex to surface the right architectural constraints at the right time. In Part 2, I dug into the research and explained how boundaries, constraints, and canonical examples create a feedback loop that keeps drift in check.
But ArchCodex answers how code should be structured. It doesn’t answer what code should do.
When I asked 20 AI agents to implement the same feature, “Add the ability to duplicate timeline entries,” the ones with ArchCodex produced better-structured code. They still disagreed about what “duplicate” means. Should tags be copied? Should the status reset? Should due dates carry over? Each agent made different assumptions, and each assumption was reasonable. That’s the problem.
I needed a way to make those decisions explicit, readable for both humans and machines, and verifiable. So I built SpecCodex.
The Problem with Natural Language Specs
The instinct is to write a detailed specification in prose. “When duplicating an entry, the system should copy the title with a ‘(copy)’ suffix, reset the status to ‘todo’, clear tags, remove the due date, and place the duplicate immediately below the original.”
This seems reasonable, and there are tools that take this approach. GitHub’s Spec Kit, for example, generates natural language specifications for coding agents. But natural language specs have problems that compound as features grow.
Natural language is ambiguous, for everyone. “Clear tags”: does that mean set to an empty array, or remove the field entirely? “Immediately below”: does that mean sort order original + 1, or insert at original + 0.5? An LLM picks an interpretation silently and moves on.
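To see how far two defensible readings can diverge, here is a small TypeScript sketch; the Entry shape and the values are illustrative, not taken from the actual codebase:

type Entry = {
  title: string;
  tags?: string[];
  sortOrder: number;
};

const original: Entry = { title: "Original Task", tags: ["urgent"], sortOrder: 3 };
const nextEntry: Entry = { title: "Next Task", tags: [], sortOrder: 4 };

// Reading A: "clear tags" = set to an empty array; "immediately below" = fractional sort key
const duplicateA: Entry = {
  ...original,
  title: `${original.title} (copy)`,
  tags: [],
  sortOrder: (original.sortOrder + nextEntry.sortOrder) / 2,
};

// Reading B: "clear tags" = drop the field entirely; "immediately below" = original + 1
const { tags, ...rest } = original;
const duplicateB: Entry = {
  ...rest,
  title: `${original.title} (copy)`,
  sortOrder: original.sortOrder + 1,
};

Both versions type-check, both look correct in review, and they store different data.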
Prose specs are hard to scan. A well-written natural-language spec for the duplicate feature runs 800 to 1200 tokens. Much of that is connective tissue: “the system should,” “in the case where,” “it is important to note that.” For an LLM, those wasted tokens compete with file contents, architectural constraints, and conversation history in a limited context window. For a developer, that’s two to three pages of text where the actual decisions are buried in paragraphs. When a codebase has dozens of features, natural language specs become a documentation mountain that nobody reads end to end.
Natural language can’t be tested. You can review a prose spec. You can’t run it. There’s no easy deterministic way to verify that the code matches the spec. This is the fundamental limitation: natural language specs look like they solve the “what should this do” problem, but they just move it from “the LLM guesses” to “the LLM interprets.” The variance is smaller, but it’s still there.
What You Spec Is What You Get
The SpecCodex schema draws from patterns LLMs already know deeply from training data: Pact contracts, Design by Contract invariants, Specification by Example given/then pairs. This isn’t just familiar syntax. LLMs have learned the associations between these formal specification patterns and their implementations. When the model sees invariants with @length(0), it doesn’t need instructions on what to produce; the mapping from spec to code is already in the weights. The schema exploits that. It’s prompt engineering at the architectural level.
The pattern isn’t the tool. What matters is: make decisions explicit in a parseable format, co-author with the LLM, and verify deterministically. If you use OpenAPI, your API spec is already a structured specification; generate contract tests from it. If you use Prisma or Drizzle, your schema is a specification; generate integration tests from it. If you use TypeScript interfaces for component contracts, those are specifications too. SpecCodex provides an opinionated full-stack schema that covers backend, frontend, security, and effects in one place. The benchmarks below prove the pattern works. The tool is one way to implement it.
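As a minimal sketch of that last idea: a TypeScript interface can act as the contract, and `satisfies` lets the compiler check an example against it with no runtime cost. The interface and props below are hypothetical, not from ArchCodex or the benchmark codebase.

// A component contract that already encodes decisions prose would bury.
interface DuplicateMenuItemProps {
  label: "Duplicate";   // literal type: the label is a locked-in decision
  icon: "copy";
  onSelect: (entryId: string) => void;
  disabled?: boolean;
}

// `satisfies` asks the compiler to verify the example against the contract
// without widening its type: a zero-runtime contract check.
export const exampleProps = {
  label: "Duplicate",
  icon: "copy",
  onSelect: (entryId: string) => console.log(`duplicate ${entryId}`),
} satisfies DuplicateMenuItemProps;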
Here’s what the schema looks like in practice (abbreviated; the full spec includes 7 touchpoints and additional examples):
spec.timeline.duplicateEntry:
  inherits: spec.mutation
  mixins: [requires_auth, logs_audit, has_timestamps]
  implementation: convex/projects/timeline/mutations.ts#duplicateEntry
  goal: "Create a duplicate of an existing timeline entry"
  intent: "Copy core entry data, position below original, provide fresh start for transient fields"
  inputs:
    entryId:
      type: Id<"projectTimelineEntries">
      required: true
  invariants:
    - description: "Title suffixed with (copy)"
      "result.title": "@endsWith(' (copy)')"
    - description: "Same entry type as original"
      "result.entryType": "@equals(original.entryType)"
    - description: "Status reset to todo for tasks"
      condition: "original.entryType === 'task'"
      "result.status": "todo"
    - description: "Duplicate has empty tags (fresh start)"
      "result.tags": "@length(0)"
    - description: "Mentions reset (no re-notifications)"
      "result.mentions": "@length(0)"
    - description: "Sort order places duplicate below original"
      "result.sortOrder": "@gt(original.sortOrder)"
  effects:
    - description: "Creates new timeline entry"
      target: "projectTimelineEntries"
      operation: "insert"
    - description: "Creates junction table entries for linkedResources"
      target: "projectTimelineEntryAttachments"
      operation: "insert"
      condition: "original.linkedResources.length > 0"
    - description: "Logs activity for the new entry"
      target: "projectTimelineEntryActivity"
      operation: "insert"
      metadata:
        duplicatedFrom: "@string(original._id)"
  hooks:
    - hook: useTimelineEntryMutations
      file: src/hooks/projects/useTimelineEntryMutations.ts
      change: "Add duplicateEntry mutation binding"
    - hook: useTimelineEntryHandlers
      file: src/components/projects/planning/useTimelineEntryHandlers.ts
      change: "Add handleDuplicate callback using duplicateEntry mutation"
  touchpoints:
    - component: TaskArchetype
      file: src/components/projects/planning/archetypes/TaskArchetype.tsx
      change: "Wire onDuplicate from handlers to getMenuItems"
      status: TODO
    - component: NoteArchetype
      file: src/components/projects/planning/archetypes/NoteArchetype.tsx
      change: "Add Duplicate menu item with Copy icon to custom menu"
      status: TODO
    # ... 5 more components
  examples:
    success:
      - name: "duplicate task"
        given:
          entryId: "@validEntryId"
          original: { title: "Original Task", entryType: "task", status: "done" }
        then:
          result._id: "@exists"
          result.title: "Original Task (copy)"
          result.status: "todo"
          result.tags: "@length(0)"
Two things to notice.
Notice the invariants: every decision is explicit (@length(0), not "cleared"), conditional logic is visible (condition: entryType === 'task'), and each assertion maps mechanically to a test. This isn't documentation; it's a test specification that hasn't been compiled yet.
Notice the touchpoints: exact file paths, not descriptions. This turned out to be the critical difference between specs that worked for backend only and specs that worked end-to-end.
Writing Specs with the LLM: The Discovery Loop
The schema is designed to be co-authored. You don’t sit down and fill it out like a form. You describe what you want, and the LLM drafts the spec, drawing on its knowledge of the codebase.
The workflow looks like this:
- You describe the feature in natural language. “I want to duplicate timeline entries.”
- The LLM drafts a spec in the SpecCodex schema. It uses ArchCodex’s entity context to look up the schema, relationships, and existing patterns in your codebase.
- You review and refine. “Actually, don’t copy tags. Users want a fresh start.” “Reset status to todo, but only for tasks.”
- The LLM updates the spec. Now the decisions are locked in and visible.
- The LLM implements from the spec. Not from the original prompt. From the agreed-upon specification.
This is where the LLM’s discovery power actually shines. When drafting the spec, the LLM surfaces questions you haven’t thought of yet: “What happens when someone duplicates a task that’s in a milestone?” “Should the duplicate inherit the parent’s position in the Gantt chart?” “The schema shows a linkedResources relation; should those be copied or just the references?” These questions come up at spec-writing time, when answering them is free, instead of at code-review time, when the wrong answer is already baked into the implementation.
Because the spec is structured, you can see exactly what rules the LLM is proposing. If the invariants section doesn’t mention sort ordering, you know the LLM hasn’t thought about positioning. If there’s no conditional on entry type, you know task-specific behavior will be missed. The gaps are visible because the schema defines what a complete spec looks like. A natural language spec can feel complete while omitting entire categories of decisions. A structured spec with an empty effects section is obviously incomplete.
Deterministic Verification: Testing What Was Built
Here’s the payoff of making specs parseable rather than prose: you can mechanically verify what the agent built. This is the fundamental difference between structured specs and natural language specs. With prose, the only verification is you reading the code (or tests) and comparing it to the document. With the SpecCodex schema, verification can be deterministic.
Test generation from specs
With natural language specs, verification means a human reads the code and judges whether it matches the document. With structured specs, the spec compiles directly to executable tests: no LLM involved in the translation, no interpretation variance.
This works because the schema includes a typed placeholder DSL for both generating test inputs and asserting on outputs. In given: blocks, placeholders like @string(100), @authenticated, and @array(3, { name: '@string(10)' }) generate concrete, deterministic test data. In then: blocks, @exists, @length(0), @gt(N), and @contains('copy') each compile to exactly one expect() call. There’s no interpretation step. @length(0) always becomes expect(x).toHaveLength(0), every time, in every project.
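To show how mechanical that mapping is, here is a hypothetical sketch of a placeholder-to-assertion function, assuming a Vitest/Jest-style expect API; the function name and token handling are illustrative, not SpecCodex’s actual compiler:

import { expect } from "vitest";

// Each `then:` placeholder compiles to exactly one assertion; no interpretation step.
function assertPlaceholder(actual: unknown, placeholder: string) {
  if (placeholder === "@exists") {
    expect(actual).toBeDefined();
    return;
  }
  const length = placeholder.match(/^@length\((\d+)\)$/);
  if (length) {
    expect(actual).toHaveLength(Number(length[1]));
    return;
  }
  const gt = placeholder.match(/^@gt\((\d+(?:\.\d+)?)\)$/);
  if (gt) {
    expect(actual).toBeGreaterThan(Number(gt[1]));
    return;
  }
  const contains = placeholder.match(/^@contains\('(.+)'\)$/);
  if (contains) {
    expect(actual).toContain(contains[1]);
    return;
  }
  // Anything else is treated as a literal expected value.
  expect(actual).toEqual(placeholder);
}

Given result.tags: "@length(0)" in a spec, a compiler built this way always emits the same toHaveLength(0) assertion, in every project.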
Different sections of the spec feed different kinds of tests. Examples become unit tests, one it() block per given/then pair. Invariants become property tests via fast-check, verifying that properties hold for all valid inputs, not just the examples you thought of. Effects become integration tests that verify database writes and audit logs. Touchpoints become UI interaction tests.
Here’s a concrete example. This spec fragment:
examples:
  success:
    - name: "duplicate task"
      given:
        entryId: "@validEntryId"
        original: { title: "Original Task", entryType: "task", status: "done" }
      then:
        result._id: "@exists"
        result.title: "Original Task (copy)"
        result.status: "todo"
        result.tags: "@length(0)"
  errors:
    - name: "unauthenticated"
      given:
        user: null
      then:
        error: "NOT_AUTHENTICATED"

Compiles to:
it('duplicate task', async () => {
  const original = await createEntry({
    title: "Original Task", entryType: "task", status: "done"
  });
  const result = await duplicateEntry(original._id);

  expect(result._id).toBeDefined();
  expect(result.title).toBe("Original Task (copy)");
  expect(result.status).toBe("todo");
  expect(result.tags).toHaveLength(0);
});

it('unauthenticated', async () => {
  await expect(duplicateEntry({ user: null }))
    .rejects.toThrow('NOT_AUTHENTICATED');
});

The translation is mechanical. The spec is written collaboratively (with all the benefits of the discovery loop), but the tests are compiled deterministically (with none of the variance of AI-generated test code). This closes the loop: the LLM writes the implementation, the spec generates tests that verify it, and the results are pass/fail.
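Invariants compile in the same spirit, but into property tests. Here is a hypothetical sketch of what a generated fast-check test could look like, reusing the same assumed createEntry/duplicateEntry harness as above; the exact generated shape is an assumption, not SpecCodex’s literal output:

import fc from "fast-check";
import { expect, it } from "vitest";

it("duplicateEntry invariants hold for all valid inputs", async () => {
  await fc.assert(
    fc.asyncProperty(
      // given: placeholders like @string(100) become arbitraries
      fc.record({
        title: fc.string({ minLength: 1, maxLength: 100 }),
        entryType: fc.constantFrom("task", "note"),
        status: fc.constantFrom("todo", "in_progress", "done"),
      }),
      async (entry) => {
        const original = await createEntry(entry);          // assumed test harness
        const result = await duplicateEntry(original._id);  // assumed test harness

        // Each invariant becomes one assertion, checked across every generated input.
        expect(result.title.endsWith(" (copy)")).toBe(true);
        expect(result.entryType).toBe(original.entryType);
        if (original.entryType === "task") expect(result.status).toBe("todo");
        expect(result.tags).toHaveLength(0);
        expect(result.sortOrder).toBeGreaterThan(original.sortOrder);
      },
    ),
  );
});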
Static analysis across specs
Because specs are structured, you can also run static analysis across them before any code is written. SpecCodex’s analyzer builds a cross-reference graph across your entire spec registry: which specs write to which database tables, which specs read from which tables, which specs depend on each other, which specs share entities. Then it runs 65 checkers across six categories against this graph.
For example, a checker sees a spec with authentication: none combined with a database insert effect and flags it: you're writing to a table without auth. Another sees two specs that both write to the same table with different field assumptions and flags a potential consistency issue. Another sees a CRUD entity with create, read, and delete specs but no update, flagging incomplete coverage. None of this requires running code. It's static analysis for designs, not implementations.
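As an illustration, a checker over a parsed spec graph can be a few lines of TypeScript. The types and rule name below are a hypothetical sketch, not SpecCodex’s actual analyzer API:

// Minimal shape of a parsed spec: just enough for one checker.
interface ParsedSpec {
  id: string;
  authentication: "required" | "none";
  effects: { target: string; operation: "insert" | "update" | "delete" }[];
}

interface Finding {
  spec: string;
  rule: string;
  message: string;
}

// Checker: flag any spec that writes to a table without requiring authentication.
function checkUnauthenticatedWrites(specs: ParsedSpec[]): Finding[] {
  return specs
    .filter((spec) => spec.authentication === "none")
    .flatMap((spec) =>
      spec.effects.map((effect) => ({
        spec: spec.id,
        rule: "security/unauthenticated-write",
        message: `${spec.id} writes to ${effect.target} (${effect.operation}) with authentication: none`,
      })),
    );
}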
Deep mode: verifying code against specs
The base analyzer reasons about specs in isolation. Deep mode goes further: it reads the actual implementation source files and compares them against what the specs claim. The spec says authentication: required; does the code actually check the user? The spec says permissions: ["bookmark.edit"]; does the code check that permission, or did it drift to checking "admin" instead?
Deep mode uses configurable regex patterns grouped into six categories: auth checks, ownership checks, permission calls, soft-delete filters, database queries, and record fetches. You define these patterns per project because every framework looks different. A Convex project checks for ctx.userId; an Express project checks for req.user; a Django project checks for request.user. The patterns are different, but the security question is the same: does the code verify what the spec requires?
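A per-project pattern configuration could be as simple as a map of named regexes. The shape below is a hypothetical sketch, not SpecCodex’s actual config format:

// Hypothetical per-project patterns: what an auth or permission check looks like in this codebase.
const deepModePatterns = {
  authCheck: [/ctx\.userId/, /requireAuth\(/],   // Convex-style; swap for req.user, request.user, etc.
  ownershipCheck: [/\.createdBy\s*===\s*ctx\.userId/],
  permissionCall: [/hasPermission\(\s*['"]([\w.]+)['"]/],
  softDeleteFilter: [/deletedAt\s*===\s*(null|undefined)/],
};

// Does the implementation source contain at least one auth check the spec requires?
function verifyAuthRequired(source: string): boolean {
  return deepModePatterns.authCheck.some((pattern) => pattern.test(source));
}

// Permission drift: extract the permission string from code and compare it to the spec.
function permissionMatchesSpec(source: string, specPermission: string): boolean {
  const match = source.match(deepModePatterns.permissionCall[0]);
  return match !== null && match[1] === specPermission;
}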
This catches a specific class of bugs that are nearly invisible in code review. When a spec says the user can only update their own records, deep mode checks whether the code both fetches the record and compares ownership. When a spec implies soft-delete semantics, deep mode checks whether queries actually filter out deleted records. When a spec declares a permission, deep mode extracts the permission string from the code and compares it to the spec, catching permission drift.
The full verification stack is layered intentionally. Test generation catches behavioral drift (does the code do what the spec says?). Static analysis catches design gaps (is the spec itself consistent and complete?). Deep mode catches implementation drift (has the code diverged from the spec?). Together, they turn a structured spec into a continuous verification system rather than a document that goes stale.
The Benchmark
To validate this approach, I ran the same feature request (duplicate timeline entries) across 20 AI agents with different configurations.
The feature was chosen because it’s deceptively complex. “Duplicate a timeline entry” sounds like a single mutation, but a complete implementation touches 11 files across four layers: the backend mutation and its barrel export, two hook files for binding and handling, a type contract update, a controller wiring change, and five separate UI archetype components that each need menu updates. Most agents discovered the first six files naturally by following imports. The five archetype files, delegated components that don’t appear in the obvious import chain, are where implementations diverged.
The results support the claims above, but they also revealed things I didn’t expect.
The assumption problem disappears
Without a spec, every agent made reasonable but different decisions:
[Image: table of the differing assumptions each agent made without a spec]
Every answer is defensible. None is what we wanted.
With SpecCodex, backend adherence went to 100%. Not improved. Identical. Every agent with the spec produced the same field handling, the same sort order logic, the same audit logging. The spec didn’t guide the agent; it constrained it.
Silent bugs (wrong data copied, missing features, semantic errors) dropped by 75%:
[Image: silent bug counts with and without SpecCodex]
The remaining silent bugs in the SpecCodex group were all UI-related, which leads to the next finding.
File paths matter more than descriptions
This was the most surprising finding, and the most actionable. The spec went through three versions. Versions 1 and 2 had invariants, effects, and hooks but described UI changes vaguely. Both produced 0% UI wiring success. Agents with perfect backends didn’t touch the right UI files.
The breakthrough came with v3, which added explicit file paths to touchpoints:
[Image: v3 touchpoint entries with explicit file paths]
Component names weren’t enough. Hook names weren’t enough. Only full paths worked. If you take one thing from this post for your own specs: when a feature touches multiple files, give the agent the exact path, not a description of where to look.
This also revealed a capability ceiling: with v3, Opus achieved 5/5 UI components wired correctly while Haiku produced a perfect backend but 0/5 UI. The spec format works for both models on the backend; UI wiring across multiple files requires a more capable model.
Lucky outcomes vs. reliable processes
The best agent without specs (Opus + ArchCodex, no spec) scored the same as the best agent with specs on production risk. But the unspecified agent’s success was emergent: it happened to explore the right files and make the right assumptions. Run it again and you might get a different result. The specified agent’s success was deterministic: the spec locked in every decision. Run it ten times and you get the same outcome. The difference between a lucky result and a reliable process.
The Arc of the Series
The pattern across the three-part series is clear:
- Part 1: LLMs write code that works but doesn’t fit. Architectural drift is invisible and compounds. ArchCodex makes it visible.
- Part 2: The research confirms this at scale. Structured guardrails (boundaries, constraints, canonical examples) reduce drift systematically.
- Part 3: What the code does matters as much as how it’s structured. A purpose-built specification schema, co-authored with the LLM and verified deterministically, eliminates assumption variance and makes every decision visible before code is written.
The table saw metaphor still holds. ArchCodex is the fence; it keeps the cut straight. SpecCodex is the blueprint; it defines where the cut goes. Without both, you’re measuring twice and still cutting wrong, because the LLM and you have different measurements in mind.
Try It
The practice is: structure your specs, make them parseable, verify deterministically. You can apply that with whatever tools fit your stack.
If you want an opinionated implementation that covers the full stack, SpecCodex is part of ArchCodex:
GitHub: github.com/ArchCodexOrg/archcodex
Start with one spec for your next feature. Write it with the LLM. See if the implementation matches. I think you’ll find it changes how you think about AI-assisted development: not as “generate and review” but as “specify and verify.”