GitHub - 10fra/data-review: Change-time data review for data-engineering and financial-controlling work. Re-runs affected pipeline lanes, diffs the numbers vs a blessed baseline, and judges whether movement matches declared intent. Catches the data bugs code review and unit tests miss.

Change-time review for data-engineering and financial-controlling work.

An agent skill that reviews the data impact of a code change: it re-runs the affected pipeline lanes and checks whether the numbers move the way the author said they would.

Code review checks the code; tests check the assertions you remembered to write. Neither runs a query, so locale mis-parses, sign flips, silent NULLs, join fan-out, and double counting all pass green. Scripts compute, agents judge: no agent does arithmetic a query can do, no script decides whether a number should have moved.

you             →  a PR touching parsers / transforms / models / deliverables
blast_radius    →  which lanes the diff hits                    (deterministic)
T1  tieouts     →  is the world in a known-good state?          (deterministic)
T2  rerun+diff  →  scratch rebuild vs committed baseline        (deterministic)
T3  row diff    →  row-level EXCEPT ALL + top-k moved groups    (deterministic)
judgment layer  →  declared / explained / UNEXPLAINED           (agents)
report          →  findings ranked by € impact + coverage table

Try it

A real before/after data bug caught in under a minute, no agent or account:

git clone https://github.com/10fra/data-review.git
cd data-review && ./demo.sh

It builds a toy two-stage pipeline, blesses a baseline, then drops a cents-to-dollars conversion in staging. A unit test and the orders-vs-staging tie-out both stay green (everything scaled in lockstep); the baseline diff catches sum(amount) jumping $134.50 → $13,450 and escalates it.

Setup

git clone https://github.com/10fra/data-review.git
cd data-review && ./install.sh

install.sh builds the venv, links the skill into ~/.claude/skills/, and runs the tests; re-run after a git pull. A consumer project carries a data-review.yml and committed baselines under .data-review/baselines/.

Usage

In a project with a data-review.yml, ask Claude Code:

review the data impact of this branch

Auto-discovered from ~/.claude/skills/, it runs the evidence scripts, then reviews the diff through ten failure-class lenses (each seeded from a real production bug), checks every delta against the PR's ## Expected data impact, and verifies each finding adversarially before it reaches the report.

Deterministic, so runnable by hand or in CI. The data-review launcher uses the repo's venv from any consumer directory:

data-review blast --manifest data-review.yml --diff main...HEAD   # diff × manifest → tier plan
data-review bless --manifest data-review.yml --lane <name>        # snapshot live state → committed baseline
data-review t1    --manifest data-review.yml                      # tieouts + live-vs-baseline conformance
data-review t2    --manifest data-review.yml --lane <name>        # scratch re-run, metric diff vs baseline
data-review t3    --manifest data-review.yml --lane <name> --scratch <db>  # row-level EXCEPT ALL
data-review xlsx  --manifest data-review.yml --workbook <path>    # Excel tie-outs, hardcodes, short SUMs

(alias dr=/path/to/data-review/data-review to shorten.)

The manifest

One data-review.yml per project, declaring what the skill cannot infer:

engine:                 # where the numbers live (DuckDB; adapters convert anything else)
lanes:                  # unit of blast radius: code globs + scratch-safe run command + output tables
baselines:              # metrics + group-by dims snapshotted into committed JSON
tieouts:                # cross-artifact invariants, verified on live data before being committed
deliverables:           # Excel workbooks audited against SQL
context:                # prose priors for the judgment agents (locale, sign conventions, known traps)

A lane run command must take a scratch-target override (scratch_env) or the skill refuses it, so a re-run never touches the canonical store. It must build what it consumes, or it re-runs stale code; non-DuckDB outputs convert there too.

How it works

Baselines are committed JSON. Re-blessing is a reviewed git change, so baseline drift shows up in diffs.
Tiers escalate mechanically. Any beyond-tolerance T2 delta makes the lane a T3 candidate; money_critical lanes always get T3. Whether the movement is explained is the agents' call.
Findings are falsifiable. Each carries a one-sentence claim, € impact from evidence, a reproduce command, and its verification votes.
Coverage is reported, not implied. The report ends with a lane × tier table; an empty report is not a clean review.

Proving ground

Tested on four stacks (one private, three public), replaying real historical bugs from their git history.

Project	Stack	Bug replayed / injected	Caught	Signal
consumer #1 (private)	TS parsers + DuckDB	supplier A amount explosion (real)	T2	sum €63.5M → €3.91B, 74 group deltas, zero leakage
consumer #1 (private)	TS parsers + DuckDB	supplier B US-locale dates (real)	T2	null_rate(accounting_period) +16pp, 96 group deltas
jaffle_shop_duckdb	dbt + DuckDB	fan-out / filter drift / unit-scale (injected)	T2	all three; onboarding took 3 minutes
owid/etl	Python ETL + feather	drug overdoses double-counted (`ef7a53c9b`, real)	T2	exactly 2 facts: +41,552 deaths, group `accidents`
pudl	dagster + parquet	FERC values silently dropped (`09d8efa7d`, real)	T2	14 facts, all in detailed tables, zero in core; +$54.5B distortion

Every clean-control rerun produced zero beyond-tolerance facts, deterministic to the cent, and one caught real drift on its own (a canonical store stale against merged parser fixes). The recurring lesson: tie-outs and pipeline tests are blind to scaling, double-counting, and drop distortions. Shares renormalize, identities scale in lockstep, sets dedupe. Only the committed-baseline diff is anchored outside the change.

Hard rules

Scripts emit facts, never findings. Severity is the judgment layer's job.
Exit 2 (evidence-unavailable) is never a pass. Missing input is a finding, not a skip.
Silence never looks green: unmatched files, skipped formulas, dropped metrics, and unavailable tiers are all reported facts.
A stale scratch database is never diffed; pre-existing scratch files are deleted before every re-run.
A tieout is only committed after it has been verified to hold on live data.
€ impact is computed from evidence by script, never estimated by an agent.

Onboarding

Write data-review.yml (schema: skills/data-review/manifest.schema.json).
data-review bless --manifest data-review.yml --lane <name>, commit the baselines.
data-review t1 --manifest data-review.yml; a clean run means ready.

Cost: 3 minutes (dbt + DuckDB) to half a day (dagster monorepo).