Your agent doesn’t remember your codebase (dupehound)

Any system that relies on a human to remember what the machine wrote will be limited by the human’s memory, not the machine’s speed.

There are two obvious ways to deal with the volume of code that agents produce. Both are bad.

The first is to read everything. Every diff, every helper, every test. This is responsible, virtuous, and impossible. A reviewer reads about 500 lines per hour. An agent writes 1,500 lines in ten minutes. You can do the math, and the math does not care about your discipline.

The second is to trust the agent and merge. This is fast, exciting, and quietly catastrophic, because the problems that agents introduce are not the kind that explode on Friday. They are the kind that rot.

In this post I want to talk about the most common form of that rot: your agent has no memory of your codebase, and you have been acting as that memory yourself.

The forgetting problem

An agent cannot hold your repository in its context window. It sees the files it was pointed at, plus whatever it found while searching. Everything else does not exist.

So when it needs to format a date, it writes formatDate. Three weeks later, in another corner of the repo, it needs to format a date again. It does not remember the first one. It writes renderTimestamp. A month later, stringifyDate.

Each copy compiles, passes its tests, and ships. Each copy is now aging independently, and the rounding bug you will fix next quarter will be fixed in only one of them.

The agent writes, the codebase grows, and the only deduplication mechanism in the loop is your memory of what the repo contains.

This is not a hypothetical. Analyses of millions of commits report that code duplication has roughly doubled since AI assistants went mainstream, while refactoring has collapsed. The agent is not being lazy. It is being exactly as good as a brilliant contractor who starts every single day with amnesia.

Notice what the failure mode does to the humans in the loop. When a PR adds calculateOrderAmount, the only way to know that computeInvoiceTotal already exists is to remember it exists. The reviewer becomes a lookup table. You are not reviewing design anymore; you are doing recall against a 300k-line corpus, which is a job description for a machine.

My maxim for this one: any system that relies on a human to remember what the machine wrote will be limited by the human’s memory, not by the machine’s speed.

What a memory for code actually needs

The naive fix is to ask another LLM to watch for duplicates. I tried variations of this and they all fail in the same three ways.

First, exhaustiveness. Finding duplicates means comparing every function against every other function. In a large repo that is effectively billions of comparisons. A model samples whatever fits in context and gives you confident answers about the part it read. An index checks everything, every time.

Second, determinism. If the mechanism is going to block merges, the verdict has to be reproducible: same input, same answer, an algorithm someone can read when they disagree with it. You cannot reject a teammate’s PR on a model’s vibe, and you definitely cannot let the verdict change between reruns.

Third, cost. This check has to run on every commit, which means it has to be free and take seconds. An LLM pass over the whole repo, per commit, is neither.
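To put a number on the exhaustiveness point, here is the all-pairs count for a repo the size of vscode (about 53,000 functions, the figure quoted further down). This is plain arithmetic, not anything dupehound runs; the helper name `all_pairs` is mine.

```python
def all_pairs(n: int) -> int:
    """Number of unordered function pairs among n functions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 53,000 functions is the vscode figure quoted later in this post.
print(f"{all_pairs(53_000):,}")  # 1,404,473,500
```

That is 1.4 billion pairwise comparisons for one scan of one repo, before a single token has been read.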

So the memory has to be an index, not a model. But a naive index of source text is useless here, because the agent does not produce textual copies. It produces the same logic with different names. formatDate and renderTimestamp share almost no tokens and almost all structure.

The trick is to fingerprint structure instead of text, and it turns out academia solved this in 2003, for a different adversary: students renaming variables before submitting copied homework. Stanford’s MOSS plagiarism detector is built on an algorithm called winnowing (Schleimer, Wilkerson & Aiken, SIGMOD 2003), and it transfers to our problem almost untouched.

An agent renaming identifiers is just a very fast student.

Building the memory

I packaged this as dupehound, a single-binary CLI in Rust.

The pipeline has four stages, and each one earns its place.

The pipeline: discover files, fingerprint every function, match through an inverted index, report.

1. Compare function bodies, not files. Every file goes through tree-sitter, and the unit of comparison is the function body. Imports, signatures, and license headers can never participate in a match, which kills the most embarrassing class of false positives before it exists.

2. Normalize away what renames change. Identifiers become one sentinel token, string literals another, numbers a third. Comments are dropped. Keywords, operators, and control flow stay. After this pass, a fully renamed copy produces a byte-identical token stream, and two functions that differ in actual logic do not.

3. Fingerprint with a guarantee. Sliding 10-token windows are hashed and selected by robust winnowing. The property that matters: any shared run of 17 or more normalized tokens is mathematically guaranteed to produce at least one shared fingerprint, and nothing shorter than 10 tokens can ever match. This is not a heuristic that usually works. It is a bound you can write a property test against, and the test suite does.

4. Match through an inverted index. Candidate pairs come from shared fingerprints, so there is no all-pairs pass. Fingerprints that appear in too many functions get culled, because at that frequency they are language idiom, not duplication. Think Go’s if err != nil ladders. Similarity is exact Jaccard, union-find groups the pairs, and the whole thing is parallel from disk to report.
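Stages 2 and 3 are easier to believe once you see them run. What follows is a toy sketch in Python, not dupehound's code: the regex tokenizer stands in for tree-sitter, the keyword list is a tiny hypothetical subset, CRC32 is an arbitrary hash choice, and the example feeds in whole functions rather than bodies only. The constants come from the numbers above: k = 10 tokens per window and a guarantee threshold of t = 17, which gives a winnowing window of t - k + 1 = 8 hashes.

```python
import re
import zlib

K = 10         # k-gram length: nothing shorter than this can match
T = 17         # any shared run of this many tokens must be detected
W = T - K + 1  # winnowing window, counted in k-gram hashes (8 here)

# Hypothetical, deliberately tiny keyword list; a real tool gets this from the grammar.
KEYWORDS = {"function", "const", "let", "return", "if", "else", "for", "while"}

TOKEN = re.compile("|".join([
    r"(?P<comment>//[^\n]*)",
    r"(?P<string>\"[^\"\n]*\"|'[^'\n]*')",
    r"(?P<number>\d+(?:\.\d+)?)",
    r"(?P<word>[A-Za-z_$][\w$]*)",
    r"(?P<other>\S)",
]))


def normalize(source: str) -> list[str]:
    """Drop comments; collapse identifiers, strings and numbers to sentinels."""
    out = []
    for m in TOKEN.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "comment":
            continue
        if kind == "string":
            out.append("STR")
        elif kind == "number":
            out.append("NUM")
        elif kind == "word" and text not in KEYWORDS:
            out.append("ID")
        else:
            out.append(text)  # keywords, operators, punctuation stay as written
    return out


def fingerprints(tokens: list[str]) -> set[int]:
    """Hash every K-token window, then keep the minimum of each W-hash window."""
    hashes = [zlib.crc32(" ".join(tokens[i:i + K]).encode())
              for i in range(len(tokens) - K + 1)]
    if len(hashes) < W:
        return {min(hashes)} if hashes else set()
    # Only hash values are kept, so tie-breaking between equal minima is irrelevant here.
    return {min(hashes[i:i + W]) for i in range(len(hashes) - W + 1)}


original = """function formatDate(d) {
  // zero-pad the month
  const m = String(d.getMonth() + 1).padStart(2, "0");
  return d.getFullYear() + "-" + m;
}"""
renamed = """function renderTimestamp(ts) {
  const month = String(ts.getMonth() + 1).padStart(2, '0');
  return ts.getFullYear() + '-' + month;
}"""

print(normalize(original) == normalize(renamed))  # True
print(fingerprints(normalize(original)) == fingerprints(normalize(renamed)))  # True
```

The guarantee falls out of the window size. A shared run of 17 tokens contains 8 consecutive identical 10-token windows, which is exactly one full winnowing window, and the minimum hash of that window gets selected on both sides. Barring a hash collision, a shared run shorter than 10 tokens cannot produce an identical window at all.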

On my laptop, this scans vscode, which is about three million lines and 53,000 functions, in 3.6 seconds. That number matters, because a memory you hesitate to consult is a memory you will stop consulting.
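The reason this is fast is stage 4: pairs are only ever generated from shared postings in the index. Here is a minimal sketch of that idea in Python, under assumptions of mine: the `threshold` and `max_df` values are made up for illustration, fingerprints are small integers instead of real hashes, and Jaccard is computed over the full fingerprint sets. dupehound's actual cutoffs and data layout may well differ.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a: set[int], b: set[int]) -> float:
    """Exact Jaccard similarity: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(fps: dict[str, set[int]],
            threshold: float = 0.75,   # hypothetical similarity cutoff
            max_df: int = 50) -> list[set[str]]:  # hypothetical idiom cutoff
    # Inverted index: fingerprint -> names of the functions that contain it.
    index = defaultdict(set)
    for name, prints in fps.items():
        for p in prints:
            index[p].add(name)

    # Candidates come only from shared fingerprints. Fingerprints that appear
    # in more than max_df functions are treated as idiom and skipped.
    candidates = set()
    for names in index.values():
        if 1 < len(names) <= max_df:
            candidates.update(combinations(sorted(names), 2))

    # Union-find with path halving groups matching pairs into clusters.
    parent = {name: name for name in fps}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidates:
        if jaccard(fps[a], fps[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for name in fps:
        groups[find(name)].add(name)
    return [g for g in groups.values() if len(g) > 1]


fps = {
    "formatDate": {1, 2, 3, 4},
    "renderTimestamp": {1, 2, 3, 4},
    "stringifyDate": {1, 2, 3, 4, 5},
    "parseConfig": {9, 10},
}
print(cluster(fps))
```

The three date formatters land in one cluster and `parseConfig` is never compared against anything, because it shares no fingerprint with them. Lower `max_df` to 2 and the cluster disappears: every shared fingerprint now counts as idiom.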

What the memory tells you

The first command is an audit:

$ dupehound scan .

The slop score has a one-sentence definition: the percentage of code you could delete if every duplicate cluster kept only one copy.

The largest copy in each cluster is exempt, because the original is not the problem. Test files are excluded by default, because table-driven tests are repetitive by design and a tool that flags them trains you to ignore it.
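As a worked example of that definition, here is the score computed by hand in Python. I am assuming size is counted in lines; the post's definition says "percentage of code" and dupehound may count something else, such as tokens. The function name and the input shape are mine.

```python
def slop_score(total_lines: int, clusters: list[list[int]]) -> float:
    """Percent of the codebase that is deletable if each cluster keeps one copy.

    Each cluster is a list of copy sizes in lines (an assumption, see above).
    The largest copy in each cluster is exempt, so only the rest count.
    """
    deletable = sum(sum(sizes) - max(sizes) for sizes in clusters)
    return 100.0 * deletable / total_lines if total_lines else 0.0


# A 10,000-line repo with two clusters: three date formatters and two twins.
score = slop_score(10_000, [[40, 38, 35], [12, 12]])
print(f"{score:.2f}%")  # 0.85%
```

The first cluster contributes 38 + 35 = 73 deletable lines, the second contributes 12, and 85 out of 10,000 is 0.85%.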

For calibration, I ran it against codebases that humans curate carefully:

express scores 0.0%,
gin 0.2%,
tokio 1.1%,
fastapi 1.7%,
vscode 2.8%.

All grade A. Healthy repos really are this clean, which is exactly what makes the unhealthy ones visible.

The second command is the receipts.

$ dupehound history

It replays your git history, one snapshot per month, reading blobs straight from the object database without a single checkout, and charts the score over time.

If that curve bends in the month your team adopted agents, you now have something better than a feeling. To be precise about what this measures: dupehound has no idea who or what wrote any line, and it never claims to. It measures duplication. The timestamps do the rest of the talking.

Closing the loop

Audits and charts are diagnosis. The mechanism that changes behavior is the gate.

dupehound check indexes the codebase as it existed at your base revision, then probes only the functions your change adds or touches. If a new function duplicates something that already exists, the build fails with the receipt:

$ dupehound check --diff main .
src/api/orders.ts:1 calculateOrderAmount() is a 100% duplicate of src/billing/invoice.ts:1 computeInvoiceTotal()

Two suppressions keep this from crying wolf, and they are the difference between a gate people keep and a gate people delete. A function that was moved to another file does not fire. A function that was edited in place does not fire against its own previous version. Both are recognized structurally, not by guessing.
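To make the gate and its two suppressions concrete, here is a toy model in Python. It is one plausible reading of the behavior described above, not dupehound's implementation: the `Fn` record, the `threshold` value, and the exact rules for "moved" and "edited in place" are all assumptions of mine, and fingerprints are small integers instead of real hashes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fn:
    path: str
    line: int
    name: str
    prints: frozenset[int]  # fingerprints of the normalized body


def jaccard(a: frozenset[int], b: frozenset[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def check(base: list[Fn], head: list[Fn], threshold: float = 0.8) -> list[str]:
    """Probe only added or touched functions against the base revision."""
    base_by_key = {(f.path, f.name): f for f in base}
    head_keys = {(f.path, f.name) for f in head}
    findings = []
    for new in head:
        key = (new.path, new.name)
        old = base_by_key.get(key)
        if old is not None and old.prints == new.prints:
            continue  # untouched by this change, so not probed
        for orig in base:
            orig_key = (orig.path, orig.name)
            if orig_key == key:
                continue  # edited in place: never match its own previous version
            if orig_key not in head_keys:
                continue  # original is gone from head: assume it was moved
            sim = jaccard(new.prints, orig.prints)
            if sim >= threshold:
                findings.append(
                    f"{new.path}:{new.line} {new.name}() is a {sim:.0%} duplicate of "
                    f"{orig.path}:{orig.line} {orig.name}()"
                )
    return findings


invoice = Fn("src/billing/invoice.ts", 1, "computeInvoiceTotal", frozenset({1, 2, 3}))
copy = Fn("src/api/orders.ts", 1, "calculateOrderAmount", frozenset({1, 2, 3}))
moved = Fn("src/billing/totals.ts", 1, "computeInvoiceTotal", frozenset({1, 2, 3}))
edited = Fn("src/billing/invoice.ts", 1, "computeInvoiceTotal", frozenset({1, 2, 3, 4}))

print(check([invoice], [invoice, copy]))  # one finding, naming the original
print(check([invoice], [moved]))          # []
print(check([invoice], [edited]))         # []
```

The design point is that the index is built from the base revision only, so a change is judged against what already existed, never against itself.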

And here is the part I find genuinely satisfying. The output is one line, machine-readable, and it names the original. Which means you can hand it back to the thing that caused the problem. I have this in my CLAUDE.md:

Before committing, run `dupehound check .`.
If it reports that a function you wrote duplicates existing code,
delete your version and reuse the original at the reported location.

The agent writes, the index remembers, the finding goes back to the agent. No human in the loop until there is something a human should actually decide.

With that one line, the agent stops being the cause and starts being the consumer. It writes calculateOrderAmount, the gate tells it computeInvoiceTotal exists, it deletes its copy and imports the original. The duplication stops at the door, and nobody had to remember anything.

How to try it

dupehound is open source (MIT), a single binary, and runs fully offline.

No API keys, no telemetry, no model anywhere in the pipeline.

cargo install dupehound

or grab a prebuilt binary for macOS, Linux, or Windows from the releases page: https://github.com/Rafaelpta/dupehound

Run dupehound scan . in the biggest repo you have access to.

It will take seconds, and the top cluster will probably annoy you.

Then run dupehound history . and look at where the curve bends.

What’s next

Function-level granularity is the current limit: a duplicated 30-line block buried inside two otherwise different functions is not caught yet. Containment matching (a small function swallowed by a bigger one) and more languages are next; each language is roughly one tree-sitter query file, which makes it a satisfying contribution if you want one.

The bigger thought is that this pattern generalizes. We spent decades building machinery that refuses bad work before a human sees it: types, tests, linters, CI. Agents made the producer faster, so every missing piece of that machinery now shows up as a human doing a machine’s job. Duplication was the piece where the human was acting as the memory.

Any system that relies on a human to remember what the machine wrote will be limited by the human’s memory, not the machine’s speed.