GitHub - prostomarkeloff/find-dup-defs: Find duplicate / near-duplicate top-level definitions across a codebase via clone clustering (Python/TS/Rust)

Find the copy-pasted code your linter can't — rank it by how worth refactoring it is — and surface the recurring shapes that should become one helper.

Rust 2021 · License: MIT · crates.io · exact difflib

A duplicate-definition detector for Python, TypeScript, and Rust. It clusters duplicate & near-duplicate top-level definitions — functions, methods, classes, constants, type aliases, TS interfaces / Rust traits — by structural AST canonicalization, grades each cluster ERROR / WARNING / INFO by a normalized Thickness score, and suggests your project's noise filters for you. Opt into patternology and it also finds collapsible duplication: the recurring structure that should become one parameterized helper.

One engine, three single-parse frontends (Ruff · oxc · syn). 2–18× faster than PMD CPD / jscpd on the Python benchmarks below.


Why

GitClear's 2025 report (211M LOC): copy-pasted lines grew 8.3% → 12.3% of all changes 2021→2024, while refactored lines dropped from 25% to under 10% — for the first time, copy/paste exceeded reuse. AI assistants don't know your project's _helper.py; they emit the copy. find-dup-defs is the gate that catches it, and the calibration that tells you which copies are actually worth a PR.


Install

cargo install find-dup-defs
# …or a prebuilt binary from the Releases page.

Quickstart

# 1. Calibrate — histogram of refactor-worthiness + ready-to-paste noise filters
find-dup-defs ./src --calibrate

# 2. Gate CI on the actionable tail only
find-dup-defs ./src --error-thickness 0.5 -D @find-dup-defs.directives --errors-only

# 3. (opt-in) Surface helper-extraction candidates
find-dup-defs ./src --patternology

--calibrate is the intended first step: it never gates, it just reports.

$ find-dup-defs ./src --calibrate
=== thickness calibration (ERROR): 76 clusters analyzed ===
  T [0.2, 0.3)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 25
  T [0.3, 0.4)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27
  T [0.4, 0.5)  ▇▇▇▇▇▇▇▇▇ 8

suggested thresholds (p50/p75/p90):
  balanced   --error-thickness 0.34  →  21 ERROR remain  (median dup: 14 loc, 2 args)

=== inferred directives (auto-detected noise patterns) ===
  → -D 'de-escalate:*@*/{test,tests,__tests__}/*=test parametrize/fixture candidates'
    rationale: 21 clusters live entirely in test paths
    affects: 21 total (10 ERROR, 11 WARNING, 0 INFO)

What it detects

Two complementary things, from one parse per file:

1 · Duplicate definitions — the gate

Three passes find duplicated definitions and grade them by severity:

| Pass | Catches | How |
| --- | --- | --- |
| name-gated | same-named copies | same (kind, name) defs clustered by exact Ratcliff–Obershelp similarity on the alpha-renamed canonical |
| cross-name | renamed copy-paste | alpha-renamed canonical bucketed; ≥2 distinct names across ≥2 sites |
| Type-3 | edited renamed copies | IDF-weighted cosine over name-agnostic lines; catches what the exact pass misses |

Each cluster is graded ERROR / WARNING / INFO by its Thickness (see below) and is a directive target you can suppress / de-escalate / annotate.
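
As a rough illustration of the name-gated and cross-name rows, here is a minimal Python sketch of the idea, not the tool's implementation. It alpha-renames identifiers in first-seen order and compares the canonical text with the standard library's difflib.SequenceMatcher, a Ratcliff–Obershelp-style matcher. The helper names (AlphaRename, canonical, similarity) are hypothetical, and renaming every name, builtins included, is a simplifying assumption. It needs Python 3.9+ for ast.unparse.

```python
import ast
import difflib


class AlphaRename(ast.NodeTransformer):
    """Rename identifiers to _v0, _v1, ... in first-seen order (illustrative only)."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"_v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = "_fn"  # the definition's own name is ignored
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node


def canonical(source):
    """Return the alpha-renamed source text of a snippet."""
    tree = AlphaRename().visit(ast.parse(source))
    return ast.unparse(tree)


def similarity(a, b):
    """Ratcliff-Obershelp-style ratio between two canonicalized snippets."""
    return difflib.SequenceMatcher(None, canonical(a), canonical(b)).ratio()


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n"
print(similarity(original, renamed))
```

Because both functions canonicalize to the same text, the renamed copy scores as an exact match even though no identifier is shared. Bucketing on the canonical string is what lets a cross-name pass find such copies without pairwise comparison.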

2 · Patternology — collapsible duplication (opt-in, --patternology · Python · TypeScript · Rust)

The passes above find duplicate functions. Patternology finds the recurring structure that should become one helper — and surfaces a motif only if it actually collapses into a clean, reflection-free helper (see Patternology for the mechanism). Advisory only — never an ERROR gate; it's a refactor map, not a CI failure.

$ find-dup-defs ./crates --only rs --patternology     # the tool on its own code
--- helper candidates in functions (patternology — collapsible duplication) ---
DUPLICATE FUNCTION [WARNING]: analyze_impl_fn/analyze_item_fn  [ast sim 1.00, n=2, loc=3, args=1]
  # helper: fn _fn(_v0: &?) -> AnalyzedFn { analyze(&_v0.sig.ident.to_string(), &_v0.sig, &_v0.block) }
  #         (1 param); collapses 2 sites, ~3 loc saved

The proposed body, parameter count, and LOC saved ride along on the finding (the real output is one line; wrapped here). A Python run renders def …: pseudo-source the same way.


Languages

| Frontend | Parser | Duplicate passes | Patternology |
| --- | --- | --- | --- |
| Python (.py) | Ruff (PEP 695 / 701) | yes | yes |
| TypeScript (.ts .tsx .mts .cts) | oxc (TS 5.x / JSX / decorators) | yes | yes |
| Rust (.rs) | syn (full item grammar) | yes | yes |

--only py,ts,rs scopes a run to specific frontends. Each is a single parse per file; method receivers (self / cls / &self) are stripped so a method matches an equivalent free function.

Built-in noise filtering at extraction (these never form phantom clusters):

  • Python / TS: @overload / @abstractmethod / Protocol stubs, raise NotImplementedError, return False / None / 0 dispatch overrides, @property + setter/deleter.
  • Rust: one-line write!/writeln! Display/Debug impls, matches! predicates, todo!/panic! stubs; #[cfg(...)]-gated same-name siblings collapse to one logical item.

Severity & Thickness

ERROR  ←→  WARNING  ←→  INFO
  gate       review     hidden by default (JSON-only; --show-info to display)

Three knobs move clusters between tiers:

  • --error-thickness X — ERROR → WARNING when T < X
  • --warning-thickness X — WARNING → INFO when T < X
  • --escalate-thickness X — anything with T ≥ X is forced to ERROR
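
The interaction of the three knobs can be sketched in a few lines of Python. This is a guess at the semantics, not the tool's code: the function name regrade is hypothetical, and three details are assumptions: that escalation is checked first, that a demoted ERROR can continue down to INFO in the same pass, and that 0.0 means "off" (as the CLI reference states for the defaults).

```python
def regrade(tier, t, error_thickness=0.0, warning_thickness=0.0,
            escalate_thickness=0.0):
    """Move a cluster between tiers based on its thickness t (assumed semantics)."""
    # Assumption: escalation wins over any demotion; 0.0 disables it.
    if escalate_thickness > 0.0 and t >= escalate_thickness:
        return "ERROR"
    # ERROR -> WARNING when T < --error-thickness
    if tier == "ERROR" and t < error_thickness:
        tier = "WARNING"
    # WARNING -> INFO when T < --warning-thickness (assumed to cascade)
    if tier == "WARNING" and t < warning_thickness:
        tier = "INFO"
    return tier


print(regrade("ERROR", 0.30, error_thickness=0.5))    # thin ERROR is demoted
print(regrade("INFO", 0.90, escalate_thickness=0.8))  # thick cluster is forced up
```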

Thickness is a normalized [0, 1] "get-me-refactored" score — the single number you sort by:

T = 0.7 · sat(volume, 30) + 0.1 · sat(args, 5) + 0.2 · sim       sat(x, k) = 1 − exp(−x/k)
volume = (n_members − 1) · loc        # lines you'd actually delete (dominant signal)

Wide signatures read as architecturally chunkier; higher-similarity dups score higher. Sort by T → biggest refactor wins first.
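
A direct Python transcription of the formula above makes the weighting easier to feel. The function names sat and thickness are chosen here for illustration; the constants are the ones stated in the formula.

```python
import math


def sat(x, k):
    """Saturating curve: 0 at x = 0, approaching 1 as x grows past k."""
    return 1.0 - math.exp(-x / k)


def thickness(n_members, loc, args, sim):
    """T = 0.7 * sat(volume, 30) + 0.1 * sat(args, 5) + 0.2 * sim."""
    volume = (n_members - 1) * loc  # lines you would actually delete
    return 0.7 * sat(volume, 30) + 0.1 * sat(args, 5) + 0.2 * sim


# A tiny 2-site, 3-line, 1-argument exact duplicate
print(round(thickness(n_members=2, loc=3, args=1, sim=1.00), 3))   # 0.285
# Four copies of a 40-line, 3-argument function at 0.9 similarity
print(round(thickness(n_members=4, loc=40, args=3, sim=0.90), 3))  # 0.912
```

The small exact duplicate scores about 0.28 almost entirely from its similarity term, while the four-way 40-line family saturates the volume term and lands above 0.9. That gap is why sorting by T puts the large refactors first.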


Calibration & directives

find-dup-defs is meant to tune itself, then be gated by an explicit, committed config — never by hidden heuristics.

--calibrate

Prints a thickness histogram, three percentile-anchored threshold suggestions (permissive / balanced / strict, each with a concrete code sample at the cut), and inferred directives — ready-to-paste -D strings for the noise patterns it found in your tree:

| Detected pattern | Suggested directive |
| --- | --- |
| ≥3 clusters entirely in test paths | de-escalate:*@*/{test,tests,__tests__,fixtures,integration,e2e}/* |
| ≥3 in .test.* / .spec.* files | de-escalate:*@*.{test,spec}.* |
| ≥5 in i18n / locale / translation dirs | suppress:*@*/{locale,locales,i18n,translations}/* |
| ≥3 touching *_pb2* / *_grpc* | suppress:*@*_pb2* |
| ≥3 under */migrations/* | suppress:*@*migrations/* |
| ≥5 in .d.ts / *.stories.* | suppress:*@*.d.ts · de-escalate:*@*.stories.* |
| vendored snapshot roots (/util/vs/, /vendor/, …) | suppress:*@*<prefix>* (auto-derived, marker-gated) |
| (kind,name) group > 256 members (entry-point names) | settings:max-name-group=256 |
| patternology candidates present | settings:pattern-min-thickness=<p75> (drops the thin 2-site tail) |

Globs support {a,b,c} alternation, so one paste covers a whole convention family. The vendored detector is marker-gated — same-name files across dirs without a recognized vendored marker are treated as real cross-layer duplication, not auto-suppressed.
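
To show what {a,b,c} alternation buys, here is a small Python sketch that expands the braces and matches each expansion with the standard library's fnmatch. It is only an approximation of the tool's matcher: expand_braces and glob_matches are hypothetical names, and letting * match across / is fnmatch's behaviour, assumed here to be close enough for illustration.

```python
import re
from fnmatch import fnmatchcase

_BRACE = re.compile(r"\{([^{}]*)\}")  # innermost {a,b,c} group


def expand_braces(pattern):
    """Expand {a,b,c} alternations into a list of plain glob patterns."""
    match = _BRACE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def glob_matches(path, pattern):
    """True if the path matches any brace expansion of the pattern."""
    return any(fnmatchcase(path, p) for p in expand_braces(pattern))


print(expand_braces("*.{test,spec}.*"))
print(glob_matches("pkg/tests/test_api.py", "*/{test,tests,__tests__}/*"))
print(glob_matches("pkg/testing/api.py", "*/{test,tests,__tests__}/*"))
```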

Directive language (directiva)

ACTION : [<KIND>] NAME [@PATH] [=NOTE]

| Action | Effect |
| --- | --- |
| suppress | drop the finding entirely |
| de-escalate / escalate | one tier down / up |
| note | annotate, no severity change |
| set | pipeline config (set:max-name-group=256, set:gpu=on, set:pattern-min-thickness=0.5) |

# Intentional, per-repo:
-D 'de-escalate:<methods>Plugin.get_*_hook=intentional plugin no-op API'
-D 'suppress:<functions>spawn@*lib-rt/*=bootstrap copy, cannot import'
-D 'escalate:<methods>Lock.*@*/storage/*=must share impl before v1.0'

# Keep them in a committed file and point CI at it (one per line; `#` comments; `@-` reads stdin):
-D @find-dup-defs.directives

Nothing is filtered until you paste a directive — calibration suggests, you decide.
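
For readers who want to lint a directives file before committing it, the grammar ACTION : [<KIND>] NAME [@PATH] [=NOTE] can be approximated with one regular expression. This is a sketch under assumptions, not the tool's parser: the Directive type and parse_directive are hypothetical names, whitespace around the colon is not handled, and configuration directives with KEY=VALUE payloads are not treated specially.

```python
import re
from typing import NamedTuple, Optional


class Directive(NamedTuple):
    action: str
    kind: Optional[str]
    name: str
    path: Optional[str]
    note: Optional[str]


_DIRECTIVE = re.compile(
    r"^(?P<action>[a-z-]+):"      # suppress, de-escalate, escalate, note
    r"(?:<(?P<kind>[^>]+)>)?"     # optional <KIND>
    r"(?P<name>[^@=]+)"           # NAME glob
    r"(?:@(?P<path>[^=]+))?"      # optional @PATH glob
    r"(?:=(?P<note>.*))?$"        # optional =NOTE (free text)
)


def parse_directive(text):
    """Split one directive string into its five parts."""
    match = _DIRECTIVE.match(text.strip())
    if match is None:
        raise ValueError(f"not a directive: {text!r}")
    return Directive(**match.groupdict())


print(parse_directive("suppress:<functions>spawn@*lib-rt/*=bootstrap copy, cannot import"))
```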


Patternology

Opt-in via --patternology. Advisory: WARNING for a tight family, INFO otherwise — never ERROR.

A duplicate-function pass asks "are these two functions the same?". Patternology asks "does this recurring shape collapse into one parameterized helper?" — and surfaces a motif only when the answer is yes. It folds the instances by Plotkin anti-unification (least general generalization, made robust to arity divergence) into a template with holes ? at the variation points, then keeps it only if the holes are bindable expression parameters — not leaky statement divergences and not name-identity selectors (a varying method/kwarg name would need reflection, so obj.?() is rejected, not surfaced as a "helper"). Pure-structure coincidences (? = ?; ? = ?) are dropped by a shared-anchor floor.
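
The core of anti-unification fits in a dozen lines. The Python sketch below is a textbook first-order version over nested tuples, offered only as an illustration: it assumes terms are atoms or (head, child, ...) tuples, numbers its holes ?0, ?1, ... instead of using a bare ?, and does not implement the arity-divergence handling, the slot classification, or the shared-anchor floor described above.

```python
def anti_unify(a, b, holes=None):
    """Least general generalization of two terms.

    Terms are atoms (str, int) or tuples of the form (head, child, ...).
    """
    if holes is None:
        holes = {}
    if a == b:
        return a  # identical subterms are kept verbatim
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and len(a) == len(b) and a[0] == b[0]):
        # Same head and arity: generalize child by child.
        children = tuple(anti_unify(x, y, holes) for x, y in zip(a[1:], b[1:]))
        return (a[0],) + children
    # Mismatch: the same (a, b) pair always reuses the same hole.
    return holes.setdefault((a, b), f"?{len(holes)}")


first = ("call", ("attr", "_v0", "execute"), "select_user")
second = ("call", ("attr", "_v0", "execute"), "select_order")
print(anti_unify(first, second))
```

Here the only variation point is the argument, so the template has one hole that can become a helper parameter. Had the two terms differed in the method name instead, the hole would sit in the attribute slot, which is the kind of name-identity selector the text says is rejected.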

Two granularities:

  • whole-function — families that share a shape, clustered by structural tf·idf cosine with a greedy maximal-clique cover (no single-linkage blob), collapsed into one helper.
  • sub-block — a recurring statement-window idiom embedded inside otherwise-different functions, found by support (how many functions contain it), not pairwise similarity — the case whole-function cosine structurally cannot reach. E.g. a fetch-one idiom shared by seven unrelated repository methods → ? = await _v0.execute(?); return ?.scalar_one_or_none() (3 params).
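
The support-based idea behind the sub-block granularity can be sketched as a sliding-window count. This Python example assumes each function body has already been reduced to a list of canonical statement strings (with ? holes); the function name frequent_windows, the fixed window size, and the toy input are all hypothetical.

```python
from collections import defaultdict


def frequent_windows(functions, window=2, min_support=3):
    """Find statement windows that occur in at least min_support functions.

    functions maps a function name to its list of canonical statements.
    """
    support = defaultdict(set)
    for name, statements in functions.items():
        for i in range(len(statements) - window + 1):
            # A set counts each function once, however often it repeats the window.
            support[tuple(statements[i:i + window])].add(name)
    return {w: sorted(names) for w, names in support.items()
            if len(names) >= min_support}


bodies = {
    "get_user": ["? = build(?)", "? = await _v0.execute(?)", "return ?.scalar_one_or_none()"],
    "get_order": ["? = await _v0.execute(?)", "return ?.scalar_one_or_none()"],
    "get_item": ["log(?)", "? = await _v0.execute(?)", "return ?.scalar_one_or_none()"],
    "delete_user": ["? = await _v0.execute(?)", "await _v0.commit()"],
}
for idiom, owners in frequent_windows(bodies).items():
    print("; ".join(idiom), "->", owners)
```

The three getters differ too much elsewhere to cluster as whole functions, yet the two-statement fetch-one window is found because three of them contain it.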

Language-agnostic engine. The mechanism lives behind a Dialect trait; adding a language is a trait impl (slot classification + a pseudo-source renderer), the engine core is untouched. Ships with PyDialect (CPython ast.dump), RustDialect (rs-canon) and TsDialect (ts-canon). A run partitions defs by language and folds each group with its own dialect — Python, TypeScript and Rust functions never anti-unify against each other.

Each finding carries the proposed helper body (rendered as readable pseudo-source), its parameter count, an estimated LOC saved, and a stable signature key (holes ?, atoms verbatim): the same idiom in different files/packages yields the same key, so an external loop (for pkg in …: find-dup-defs --patternology --json pkg) + a glue script grouping on the signature gives ecosystem-wide codometry.

Knobs: --pattern-theta (whole-fn cosine floor, default 0.85), --pattern-support (sub-block support floor, default 3), and -D settings:pattern-min-thickness=<F> (drop candidates below a thickness floor — calibrated by --calibrate).


Performance

hyperfine --warmup 1 --runs 3, macOS arm64, vs jscpd@4 and PMD CPD 7.24 (both Python-mode, same tree):

| Repo (Python files) | find-dup-defs | PMD CPD | jscpd |
| --- | --- | --- | --- |
| pip (633) | 0.18 s | 0.87 s (4.9×) | 3.21 s (18.2×) |
| mypy (155) | 0.18 s | 0.81 s (4.6×) | 1.47 s (8.4×) |
| sympy (1 589) | 1.22 s | 4.29 s (3.5×) | 15.18 s (12.4×) |
| django (2 910) | 1.01 s | 2.08 s (2.1×) | 9.67 s (9.6×) |

It does more semantic work (alpha-renamed canonical, IDF cosine, severity grading, calibration) and is still 2–18× faster on these runs: Rust + rayon, single-parse frontends, no JVM/Node tax. Throughput on django (426K SLOC): ~422K SLOC/s vs PMD ~205K, jscpd ~44K.

GPU acceleration (optional, macOS / Metal)

difflib-fast can offload the name-gated Ratcliff–Obershelp clustering to the Apple-Silicon GPU via its stateful Rationer handle — wired in but off by default and gated twice: build with --features gpu, enable at runtime with -D 'settings:gpu=on' (on / gpu+cpu / gpu / off). Only large all-ASCII same-name groups (≥ ~300 members) route to Metal; everything else stays on CPU, and output is byte-for-byte identical in every mode.

Honest read — it rarely helps end-to-end. The GPU only accelerates clustering of a single large group (difflib-fast's own bench: 1.1–1.4× there), while the tool's real shape is many mostly-small groups (the 0.6–0.99× case). On rustc/tests/ui (20 425 files, with functions:main ×12 678): gpu=off 33.97 s vs gpu=on 33.62 s — a tie. Keep CPU for everyday runs.


Benchmarks — real repos

10 production TypeScript repos (vscode, the TS compiler, vue, angular, svelte, nest, astro, prisma, next.js, excalidraw; ≈6M SLOC). --calibrate + auto-inferred directives + balanced thickness cut raw ERROR count by 94% on average:

| Repo | LOC | Raw ERROR | After | %cut | Top remaining cluster |
| --- | --- | --- | --- | --- | --- |
| microsoft/vscode | 3.1M | 5428 | 174 | 97% | registerCLIChatCommands 771 LOC |
| microsoft/TypeScript | 265k | 1840 | 9 | 100% | NavigationBarItem interface |
| vercel/next.js | 756k | 489 | 26 | 95% | defaultLoader 115 LOC |
| angular/angular | 1.0M | 627 | 54 | 91% | conditionalCreate/conditionalBranchCreate |
| prisma/prisma | 222k | 322 | 68 | 79% | fieldToColumnType 95 LOC × 3 adapters |

28 large Python repos (≈8M SLOC). Auto-applied directives cut raw ERROR by 67% on average:

| Repo | Raw ERROR | After | %cut | Top remaining cluster |
| --- | --- | --- | --- | --- |
| home-assistant/core | 4475 | 850 | 81% | ConfigFlow.async_step_* (n=178) |
| apache/airflow | 2203 | 337 | 84% | CloudComposerGetEnvironmentOperator (n=18) |
| django/django | 559 | 71 | 87% | TupleGreaterThan.get_fallback_sql (n=4) |
| scipy/scipy | 492 | 140 | 71% | dct/dst/idct/idst (n=4) |
| pandas-dev/pandas | 406 | 78 | 80% | read_csv/read_table (n=2) |

Concrete wins this surfaced: pip Version __lt__…__gt__ ×6 → one _compare helper (−130 lines); scipy dct/dst/idct/idst ×4 → a factory (−330 lines); scikit-learn BaseSGD{Classifier,Regressor}._fit — a textbook sibling-estimator dupe.

The top remaining clusters are PR candidates a human reviewer would also flag, with the noise (vendored snapshots, test fixtures, .d.ts, Storybook) automatically removed.


AI-agent integration

# 1. Calibrate → JSON
find-dup-defs ./repo --calibrate --json > calib.json

# 2. Full scan with the agent's chosen tuning + inferred directives
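#    Inferred directives can contain spaces and globs, so write them to a
#    directive file (one per line) instead of splicing them into the command
#    line. The file name inferred.directives is arbitrary.
jq -r '.inferred_directives[].directive' calib.json > inferred.directives
find-dup-defs ./repo \
  --error-thickness <calib> \
  -D @inferred.directives \
  --errors-only --json > findings.json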

# 3. Each finding ships everything to write a PR — no FS roundtrips:
#    groups[].snippet (full source of one member) · members[] (every file:line)
#    · thickness (priority) · notes[] (directive annotations)
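
On the agent side, triage can start by ranking groups by thickness. The Python sketch below assumes findings.json has a top-level groups list whose entries carry thickness, members, snippet and notes; only those field names come from the comment above, and the exact nesting is a guess. The function name top_candidates and the sample data are hypothetical.

```python
import json


def top_candidates(findings, limit=5):
    """Return the thickest groups first (assumed JSON layout, see lead-in)."""
    groups = findings.get("groups", [])
    ranked = sorted(groups, key=lambda g: g.get("thickness", 0.0), reverse=True)
    return ranked[:limit]


# Stand-in for json.load(open("findings.json")); values are made up.
sample = json.loads("""
{"groups": [
  {"thickness": 0.31, "members": ["a.py:10", "b.py:42"], "snippet": "def f(): ...", "notes": []},
  {"thickness": 0.78, "members": ["c.py:5", "d.py:9", "e.py:3"], "snippet": "def g(): ...", "notes": []}
]}
""")
for group in top_candidates(sample, limit=1):
    print(group["thickness"], group["members"])
```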

Architecture

Six crates, layered so the engine core stays language-agnostic (it clusters a Vec<Def> and never names a language) and the contract crate stays pure:

              dup-defs-core            ← the contract: Def / KindSpec / Analysis / CanonDialect /
                  ▲                       the Frontend trait / LineMap.  No deps.
        ┌─────────┴─────────┐
   find-dup-defs-canon         find-dup-defs   ← find-dup-defs-canon: shared frontend helpers (alpha-rename, the
        ▲                  (engine+CLI)   KindSpec vocabulary, count_loc, AnalyzedFn).
   ┌────┼────┐               │           find-dup-defs: the 3 passes + patternology + severity +
 py-   rs-   ts-canon ───────┘           directives + calibration + reports.
 canon canon            (engine depends on the contract + each frontend, NOT on find-dup-defs-canon)

| crate | role |
| --- | --- |
| find-dup-defs | engine + CLI; frontend-agnostic, clusters a Vec<Def> and never names a language |
| dup-defs-core | the engine↔frontend contract (Def / KindSpec / Analysis / Frontend / LineMap) |
| find-dup-defs-canon | shared frontend helpers between the contract and the frontends; no duplication across the three |
| py-canon · ts-canon · rs-canon | the Python / TypeScript / Rust frontends (Ruff · oxc · syn) |

Adding a language = one more <lang>-canon frontend implementing Frontend (and, for patternology, a Dialect impl) — no engine changes. The similarity engine is the exact Ratcliff–Obershelp + simjoin port difflib-fast. Dogfooded on its own source to 0 ERROR (find-dup-defs crates -D @find-dup-defs.directives).


CLI reference

USAGE:  find-dup-defs [OPTIONS] <PATHS>...

LANGUAGES
  --only <CODES>            Restrict to frontends (py,ts,rs). Default: all found in PATHS.

SEVERITY (thickness ladder)
  --error-thickness <F>     Demote ERROR → WARNING if T < F   (default 0.0 = off)
  --warning-thickness <F>   Demote WARNING → INFO  if T < F   (default 0.0 = off)
  --escalate-thickness <F>  Promote anything → ERROR if T ≥ F (default 0.0 = off)

SIMILARITY
  -t, --threshold <F>       Name-gated cluster floor (default 0.5)
  -e, --error-threshold <F> Name-gated ERROR floor   (default 0.85)
  --type3-theta <F>         Type-3 cosine floor       (default 0.7)

PATTERNOLOGY (opt-in · Python + TypeScript + Rust · advisory, never ERROR)
  --patternology            Surface collapsible-duplication helper candidates
  --pattern-theta <F>       Whole-fn structural cosine floor (default 0.85)
  --pattern-support <N>     Sub-block idiom support floor     (default 3)
                            (drop the thin tail with -D settings:pattern-min-thickness=<F>)

FILTERS
  -D, --directive <S>       ACTION:[<KIND>]NAME[@PATH][=NOTE], repeatable. ACTION ∈
                            suppress / de-escalate / escalate / note / set:KEY=VALUE.
                            `@PATH` reads a directive file (`#` comments; `@-` = stdin).
                            Globs: * ? {a,b} [a-z] (+ \ escapes).
  --kinds <K,…>             functions,methods,classes,interfaces,constants,type-aliases
  --min-size <N>            Only clusters with ≥ N members (default 2)
  --max-name-group <N>      Skip name-gated clustering for (kind,name) groups > N
  --errors-only             Filter output to ERROR
  --show-info               Include INFO in the human report

MODES
  --calibrate               Histogram + threshold suggestions + inferred directives
  --json                    Machine-readable output
  --no-cross-name / --no-type3   Skip pass 2 / pass 3

Limitations

  • Python / TypeScript / Rust today; patternology covers all three. New languages are a <lang>-canon sibling — PRs welcome.
  • Rust patternology is initial: rs-canon splices statement bodies as node children rather than lists, so long-body alignment is prefix-only, and macro internals are opaque.
  • TypeScript patternology surfaces top-level function declarations and arrow / function-expression consts; class methods don't participate (their slice doesn't re-parse as a standalone function, so they carry no patternology canonical — they're still covered by the duplicate passes).
  • Type-4 (semantic equivalence, different syntax → same logic) — out of scope.
  • Token-level sub-expression duplication — out of scope; pair with jscpd / PMD CPD if you need it.
  • Calibration is heuristic — the thickness formula constants were tuned on the benchmark corpora above; your codebase may want different.