In this post, I will walk through why I built a maintainability ratchet for AI-assisted Python, what it actually measures, and how it fits into the review loop.
The short version is this: tests passing is necessary, but it is not the same as keeping the codebase easy to change. AI agents make that gap more visible because they can add correct-looking code quickly. A function can keep passing its tests while quietly gaining branches, public surface, file sprawl, and maintenance history that make the next change harder.
riskratchet is my attempt to make that drift show up as a diff.
The Moment That Started This
I asked an agent to add one branch to a function I owned. The function already had a handful of branches. The agent added the new behavior, kept the signature, kept the existing tests passing, and added one happy-path test for the new case.
The PR looked fine. CI was green. Coverage went up.
A week later I had to change the same function again. I opened it and realized the easy-to-review diff had left behind a function that was now much harder to reason about. The tests were not wrong. The agent was not obviously wrong. The normal signals just were not measuring the thing I cared about.
The thing I wanted to catch was not:
Is this function bad?
It was:
Did this change make a function riskier than the version we had already accepted?
That distinction is the whole product.
What I Wanted To Catch
I wanted a check that could fail a PR when a single function moved into a worse maintainability state:
- Cyclomatic complexity went up but tests did not follow.
- Line coverage stayed high while branch coverage dropped.
- A public function lost coverage.
- A function crossed a length or file-sprawl threshold.
- A hot file accumulated more complexity.
- A new function landed already above the team's risk threshold.
Those are all measurable from data a Python CI job can already produce: source files, coverage JSON, and optionally git history.
The goal was not to build another static quality dashboard. The goal was to build a ratchet:
- Measure the current state.
- Save it as a baseline.
- Fail only when future changes move risk up past a tolerance.
That makes adoption much easier. A mature codebase does not have to become clean in one sweep. It just has to stop getting worse silently.
Why Coverage Alone Is Not Enough
Coverage is useful, but it is very easy to overread.
A line can execute without the test asserting the behavior that matters. A happy-path test can touch every line in a function while leaving half the branch exits untested. A public API can be "covered" only through incidental calls from another test. A file can have respectable project-level coverage while one risky function has almost none.
This is one of the reasons I care about function-level output. Project coverage answers a broad question:
Did the test suite execute this much of the repository?
The review question is narrower:
Did the function changed by this PR become harder to change safely?
Those are not the same question.
Why CRAP Alone Was Not Enough
The CRAP score is still useful:
CC^2 * (1 - line_coverage)^3 + CC
It catches the classic bad shape: complex code with weak line coverage. riskratchet keeps CRAP in the output because it is a familiar, useful ranking signal.
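As a quick illustration, the formula fits in a few lines of Python. This is a minimal sketch, not riskratchet's own code: the function name `crap_score` is hypothetical, and coverage is assumed to be a fraction between 0 and 1.

```python
def crap_score(cyclomatic_complexity: int, line_coverage: float) -> float:
    """CRAP = CC^2 * (1 - coverage)^3 + CC, with coverage as a fraction in [0, 1]."""
    if cyclomatic_complexity < 1:
        raise ValueError("cyclomatic complexity starts at 1")
    if not 0.0 <= line_coverage <= 1.0:
        raise ValueError("line_coverage must be a fraction between 0 and 1")
    cc = cyclomatic_complexity
    # The cubic term shrinks quickly as coverage rises; at 100% only CC remains.
    return cc**2 * (1.0 - line_coverage) ** 3 + cc


print(crap_score(5, 1.0))  # 5.0
print(crap_score(3, 0.0))  # 12.0
```

The two printed values line up with fixtures later in this post: a fully line-covered function with CC=5 scores 5.0, and an untested function with CC=3 scores 12.0.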
But CRAP does not see everything I wanted this tool to care about:
| Shape | What CRAP sees | What I still care about |
|---|---|---|
| 100% line coverage, 50% branch coverage | Looks mostly fine | Half the exits were never tested |
| A 2-line public function with no tests | Low score | Public contract has no direct coverage |
| A function in a 950-line module | Only the function's CC and line coverage | File sprawl makes every change more expensive |
| A file touched 6 times in the churn window | Nothing | Hot code is where small changes accumulate |
| A baseline score moved from 10 to 41 | Only the new absolute score | The regression is the useful signal |
So the score in riskratchet is a weighted blend of six normalized components:
| Component | Default weight | What it measures |
|---|---|---|
| coverage_gap | 30% | Missing line coverage inside the function span |
| structural_complexity | 25% | Cyclomatic complexity, saturating at high values |
| branch_gap | 15% | Missing branch coverage when branch data exists |
| churn | 10% | Recent commits touching the file, default 90-day window |
| public_surface | 10% | Missing coverage on functions treated as public API |
| sprawl | 10% | Function length and surrounding file length |
Weights are configurable in [tool.riskratchet.weights], but they are validated and renormalized. A typo or negative
weight should not silently weaken a CI gate.
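To make the blend concrete, here is a minimal sketch of validating, renormalizing, and applying the weights. It is an illustration under assumptions, not the tool's implementation: `normalize_weights` and `blended_score` are hypothetical names, and merging overrides on top of the defaults is my guess at reasonable behavior.

```python
import math

DEFAULT_WEIGHTS = {
    "coverage_gap": 0.30,
    "structural_complexity": 0.25,
    "branch_gap": 0.15,
    "churn": 0.10,
    "public_surface": 0.10,
    "sprawl": 0.10,
}


def normalize_weights(overrides: dict) -> dict:
    """Merge overrides onto the defaults, reject bad input, rescale to sum to 1."""
    unknown = set(overrides) - set(DEFAULT_WEIGHTS)
    if unknown:
        # A typo such as "branch_gpa" must be an error, not an ignored key.
        raise ValueError(f"unknown weight(s): {sorted(unknown)}")
    merged = {**DEFAULT_WEIGHTS, **overrides}
    for name, value in merged.items():
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"weight {name!r} must be a finite, non-negative number")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in merged.items()}


def blended_score(components: dict, weights: dict) -> float:
    """Weighted sum of component scores, each already normalized to 0..100."""
    return sum(weights[name] * components[name] for name in weights)


components = {
    "coverage_gap": 0.0,
    "structural_complexity": 20.0,
    "branch_gap": 50.0,
    "churn": 0.0,
    "public_surface": 0.0,
    "sprawl": 0.0,
}
print(round(blended_score(components, normalize_weights({})), 1))  # 12.5
```

The component values in the example are the ones reported for the covered_but_branchy fixture below, and the default weights reproduce its score of 12.5.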
A Real Fixture: The Agent Spaghetti Case
The repo has a fixture named tests/fixtures/agent_generated_spaghetti. It is the canonical shape I wanted the tool to
catch: a public process_payment function that is long, branchy, and lightly tested.
The function is 44 lines long. It handles IDs, amounts, currencies, captured status, metadata, retry counts, strict mode, exception paths, and default behavior in one place.
Running the current tool from that fixture directory:
uv run riskratchet explain src/m.py::process_payment \
--coverage coverage.json \
--no-git
prints:
src/m.py::process_payment
severity : high
score : 63.2
crap : 156.3
complexity : CC=21
coverage : line=33%, branch=25%
churn : 0 commits in window
public : True
lines : 8-51 (function 44 lines, file 51)
components :
coverage_gap 67.4
structural_complexity 100.0
branch_gap 75.0
churn 0.0
public_surface 67.4
sprawl 0.0
CRAP catches this one too, which is good. This is the easy case: high complexity and weak coverage.
The useful part is the component breakdown. It tells the agent what to do next. This is not a vague "quality score." The repair options are concrete: split the function, add missing branch tests, or both.
The Case CRAP Mostly Misses
The fixture tests/fixtures/covered_but_branchy is smaller:
def normalize(record: dict) -> dict:
    out = {}
    if "id" in record:
        out["id"] = str(record["id"])
    if "amount" in record:
        out["amount"] = float(record["amount"])
    if "currency" in record:
        out["currency"] = record["currency"].upper()
    if "captured" in record:
        out["captured"] = bool(record["captured"])
    return out
The fixture coverage touches every line, but only half the branch exits. The current output is:
src/m.py::normalize
severity : low
score : 12.5
crap : 5.0
complexity : CC=5
coverage : line=100%, branch=50%
churn : 0 commits in window
public : True
lines : 7-17 (function 11 lines, file 17)
components :
coverage_gap 0.0
structural_complexity 20.0
branch_gap 50.0
churn 0.0
public_surface 0.0
sprawl 0.0
This should not fail a build by itself. The score is low. But it should be visible when this function changes. If an agent adds four more optional fields and the branch coverage falls again, the baseline check will catch the delta.
That is the behavior I wanted: do not panic about every imperfect function, but do not let the same function get worse without someone noticing.
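One plausible way to land on 100% line coverage with 50% branch coverage is a single happy-path test. The fixture's actual test is not shown in this post, so treat the sketch below as an assumption about how such a gap can arise rather than a copy of the fixture.

```python
def normalize(record: dict) -> dict:
    out = {}
    if "id" in record:
        out["id"] = str(record["id"])
    if "amount" in record:
        out["amount"] = float(record["amount"])
    if "currency" in record:
        out["currency"] = record["currency"].upper()
    if "captured" in record:
        out["captured"] = bool(record["captured"])
    return out


# Happy path: every optional key is present, so every line executes.
# Each `if` has two exits (body taken, body skipped); this call only
# ever takes the "taken" exit, leaving 4 of the 8 exits untested.
full = {"id": 7, "amount": "12.50", "currency": "usd", "captured": 1}
assert normalize(full) == {
    "id": "7",
    "amount": 12.5,
    "currency": "USD",
    "captured": True,
}

# One extra call with no keys exercises the four "skipped" exits.
assert normalize({}) == {}
```

The second assertion is the cheap repair: it adds no new lines to the coverage report, but it closes the branch gap that the first test leaves open.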
Public Surface Matters
Python does not have a perfect public/private boundary, but naming conventions and __all__ are still useful signals.
The tests/fixtures/all_exports_focused fixture has three functions:
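- _legacy_exposed, which looks private but appears in __all__.
- naturally_public, which is public by name.
- _truly_private, which stays private.
With zero coverage, the current scan reports the rows below (columns: severity, score, CRAP, cyclomatic complexity, line coverage, branch coverage, function, line range):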
| medium | 42.5 | 12.0 | 3 | 0% | n/a | `src/m.py::_legacy_exposed` | 14-19 |
| medium | 40.0 | 2.0 | 1 | 0% | n/a | `src/m.py::naturally_public` | 22-23 |
| medium | 30.0 | 2.0 | 1 | 0% | n/a | `src/m.py::_truly_private` | 26-27 |
The important detail is that _legacy_exposed is treated as public because __all__ promotes it. Its
public_surface component is 100.0; _truly_private does not get that extra public-surface penalty.
This is still a heuristic. Framework callbacks, plugin entry points, and internal functions used across services can blur the boundary. But a heuristic is enough to make review sharper. A public function with no coverage deserves a different conversation than a private helper with the same raw line coverage.
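The heuristic itself is small. The sketch below is an illustration, not riskratchet's detection code: `exported_names` and `is_public` are hypothetical helpers, the module source is made up to mirror the three fixture names, and only a literal list or tuple assigned to `__all__` is handled.

```python
import ast


def exported_names(source: str) -> set:
    """Collect string entries from a literal top-level `__all__ = [...]`."""
    names = set()
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "__all__" in targets and isinstance(node.value, (ast.List, ast.Tuple)):
            for element in node.value.elts:
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    names.add(element.value)
    return names


def is_public(name: str, exported: set) -> bool:
    """Public if promoted by __all__, or if the name has no leading underscore."""
    return name in exported or not name.startswith("_")


# Hypothetical module source mirroring the fixture's three function names.
SOURCE = '''
__all__ = ["_legacy_exposed"]

def _legacy_exposed(): ...
def naturally_public(): ...
def _truly_private(): ...
'''

exported = exported_names(SOURCE)
for node in ast.parse(SOURCE).body:
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        print(node.name, is_public(node.name, exported))
```

Dynamically built `__all__` values (concatenation, `+=`, or imports from submodules) would slip past this sketch, which is one reason the result should be read as a review hint and not a fact.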
The Baseline Is The Product
Scanning is useful for discovery. The baseline is what changes behavior.
riskratchet baseline writes a stable JSON file keyed by path and qualified function name. Each entry stores the score,
component scores, and a fingerprint so moved or unchanged functions can be reasoned about later.
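Here is a minimal sketch of what such a baseline could look like. The entry layout, the `path::name` key, and hashing `ast.dump` output as the fingerprint are all assumptions made for illustration; the real file format may differ. The sketch only handles top-level functions.

```python
import ast
import hashlib
import json


def function_fingerprint(node: ast.AST) -> str:
    """Hash the function's AST. ast.dump omits line numbers by default,
    so moving a function within a file keeps the fingerprint stable."""
    return hashlib.sha256(ast.dump(node).encode("utf-8")).hexdigest()[:16]


def build_baseline(path: str, source: str, measured: dict) -> str:
    """Return deterministic JSON keyed by 'path::function'.

    `measured` maps function name -> {"score": ..., "components": {...}}.
    """
    entries = {}
    for node in ast.parse(source).body:
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name in measured:
            entries[f"{path}::{node.name}"] = {
                **measured[node.name],
                "fingerprint": function_fingerprint(node),
            }
    # sort_keys keeps the file stable so baseline diffs stay reviewable.
    return json.dumps(entries, indent=2, sort_keys=True)


source = "def public_api(x):\n    return x + 1\n"
measured = {"public_api": {"score": 10.0, "components": {"coverage_gap": 0.0}}}
print(build_baseline("src/m.py", source, measured))
```

Sorting keys matters more than it looks: a baseline that reorders itself on every run produces noisy diffs, and noisy diffs are how baseline bumps stop being reviewed.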
riskratchet check compares the current run against that baseline. In the current repo, it can fail on:
- A new function above fail_new_above (default 50).
- An existing function whose total score grew by more than fail_regression_above (default 5).
- An existing function whose individual component grew by more than fail_component_regression_above (default 15) when the component gate is enabled.
- An existing function still above an optional absolute ceiling, if a team configures one.
The strict comparison matters: a tolerance of 5 means +5.0 is still inside the budget; more than 5.0 fails.
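The first two gates, including the strict comparison, can be sketched in a few lines. This is a simplified illustration with hypothetical names (`Finding`, `check`); it leaves out the component gate and the absolute ceiling, and it takes plain score dictionaries instead of the real baseline file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    kind: str                # "new" or "regressed"
    function: str            # "path::qualified_name"
    before: Optional[float]  # None for functions absent from the baseline
    after: float


def check(baseline: dict, current: dict,
          fail_new_above: float = 50.0,
          fail_regression_above: float = 5.0) -> list:
    """Return the findings that should fail the build."""
    findings = []
    for key, after in sorted(current.items()):
        before = baseline.get(key)
        if before is None:
            # New function: judged on its absolute score.
            if after > fail_new_above:
                findings.append(Finding("new", key, None, after))
        elif after - before > fail_regression_above:
            # Strictly greater: a delta equal to the tolerance still passes.
            findings.append(Finding("regressed", key, before, after))
    return findings


baseline = {"src/m.py::public_api": 10.0}
print(check(baseline, {"src/m.py::public_api": 41.0}))  # one "regressed" finding
print(check(baseline, {"src/m.py::public_api": 15.0}))  # []
```

Note that an existing function is never compared against fail_new_above here. That is the ratchet property: old code is judged against its own accepted score, not against the bar for new code.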
The fixture tests/fixtures/public_api_regression pins the basic regression behavior. Running:
uv run riskratchet check src \
--coverage coverage.json \
--baseline baseline.json \
--format markdown \
--no-git
exits 1 and prints:
# riskratchet regressions
| Kind | Function | Before | After | Delta | Reason |
| --- | --- | ---: | ---: | ---: | --- |
| regressed | `src/m.py::public_api` | 10.0 | 41.0 | +31.0 | risk grew by +31.0 (from 10.0 to 41.0); tolerance is +5.0 |
That is the PR gate I wanted. Not "this function is morally bad." Just: this function was accepted at 10.0, this
change moved it to 41.0, and the team's budget was 5.0.
How The Current Implementation Works
The current package is 0.2.2. The command surface is:
riskratchet scan # scan files and report current risk
riskratchet baseline # write the accepted current state
riskratchet check # fail when risk regresses
riskratchet explain # print the full breakdown for one function
riskratchet diff # show the full baseline diff without failing
riskratchet config # validate or show resolved configuration
The analysis path is intentionally boring:
- Walk the configured Python paths, honoring include and exclude globs.
- Parse each file with ast.
- Discover functions, methods, nested functions, async functions, decorators, and public-surface metadata.
- Compute cyclomatic complexity.
- Load coverage.json and intersect executed and missing lines and branches with each function span.
- Count git churn with one recent-history pass, unless --no-git is set.
- Compute CRAP and the six risk components.
- Emit a report, baseline, diff, or regression list.
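The parse, discover, and complexity steps above can be sketched with the standard library alone. This is a simplified stand-in, not the tool's analyzer: `discover` and `cyclomatic_complexity` are hypothetical names, the counting rules are one common convention (tools differ on details such as comprehensions and `match`), and nested functions are named `outer.inner` for brevity.

```python
import ast

# Node types that each add one decision point in this simplified count.
_DECISION_NODES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler)


def cyclomatic_complexity(func: ast.AST) -> int:
    """1 + decision points. Decisions inside nested functions also count
    toward the enclosing function in this sketch."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1  # each extra and/or operand
        elif isinstance(node, ast.comprehension):
            complexity += len(node.ifs)         # each comprehension filter
    return complexity


def discover(source: str):
    """Yield (qualified name, first line, last line, complexity) per function."""
    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                qualname = f"{prefix}{child.name}"
                yield qualname, child.lineno, child.end_lineno, cyclomatic_complexity(child)
                yield from visit(child, f"{qualname}.")
            elif isinstance(child, ast.ClassDef):
                yield from visit(child, f"{prefix}{child.name}.")
            else:
                yield from visit(child, prefix)
    yield from visit(ast.parse(source), "")


SOURCE = """\
def normalize(record: dict) -> dict:
    out = {}
    if "id" in record:
        out["id"] = str(record["id"])
    if "amount" in record:
        out["amount"] = float(record["amount"])
    if "currency" in record:
        out["currency"] = record["currency"].upper()
    if "captured" in record:
        out["captured"] = bool(record["captured"])
    return out
"""

for row in discover(SOURCE):
    print(row)  # ('normalize', 1, 11, 5)
```

The line span that falls out of `lineno` and `end_lineno` is what makes the later coverage intersection possible: once each function has a first and last line, coverage data can be attributed to functions instead of files.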
If coverage is missing, the behavior is explicit. The config can choose pessimistic, optimistic, or skip, and CI
can require coverage unless --allow-missing-coverage is passed. I do not want "coverage was absent" to quietly become
"everything is fine."
The config in the repo looks like this in resolved form:
{
  "paths": ["src"],
  "coverage": "coverage.json",
  "baseline": ".riskratchet.json",
  "fail_new_above": 50.0,
  "fail_regression_above": 5.0,
  "fail_component_regression_above": 15.0,
  "component_regression_gate": true,
  "allow_missing_coverage": false,
  "auto_coverage": true,
  "coverage_cache": ".riskratchet/coverage.json",
  "test_command": "pytest --cov --cov-branch --cov-report=json:{output} -q",
  "missing_coverage": "pessimistic",
  "churn_window_days": 90,
  "exclude": ["tests/**", "migrations/**", "*/generated/**"]
}
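To show how the missing_coverage setting could interact with per-function coverage, here is a minimal sketch. It assumes the coverage.py JSON layout, where each entry under "files" carries "executed_lines" and "missing_lines". The function name and the exact meaning of each policy (pessimistic as 0%, optimistic as 100%, skip as "drop the function") are my assumptions for illustration.

```python
from typing import Optional


def line_coverage_for_span(file_entry: Optional[dict], start: int, end: int,
                           missing_coverage: str = "pessimistic") -> Optional[float]:
    """Fraction of measured lines in [start, end] that executed.

    `file_entry` is one value from the coverage JSON "files" mapping,
    or None when the file has no coverage data at all.
    """
    if file_entry is None:
        if missing_coverage == "pessimistic":
            return 0.0   # assume nothing ran
        if missing_coverage == "optimistic":
            return 1.0   # assume everything ran
        if missing_coverage == "skip":
            return None  # caller drops the function from the report
        raise ValueError(f"unknown missing_coverage policy: {missing_coverage!r}")

    span = range(start, end + 1)
    executed = sum(1 for line in file_entry.get("executed_lines", []) if line in span)
    missing = sum(1 for line in file_entry.get("missing_lines", []) if line in span)
    measured = executed + missing
    if measured == 0:
        return 1.0  # assumption: a span with no measurable lines has no gap
    return executed / measured


report = {"files": {"src/m.py": {"executed_lines": [1, 7, 8, 9], "missing_lines": [10, 30]}}}
entry = report["files"].get("src/m.py")
print(line_coverage_for_span(entry, 7, 17))                           # 0.75
print(line_coverage_for_span(report["files"].get("src/x.py"), 1, 5))  # 0.0
```

Lines 1 and 30 fall outside the 7-17 span and are ignored, which is the whole point of intersecting coverage with the function span: a file's overall number says little about one function.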
riskratchet config validate --config pyproject.toml currently reports:
valid riskratchet config: pyproject.toml
That sounds small, but it matters in CI. A typo in a quality gate should be a usage error, not an ignored key.
Dogfooding The Tool On Itself
Before updating this post, I ran the current test suite in the riskratchet repo:
uv run pytest --cov=src/riskratchet --cov-branch --cov-report=json:coverage.json --cov-report=term-missing
The result was:
260 passed in 7.92s
TOTAL 1783 statements, 469 missed, 656 branches, 79 partial, 76.22% coverage
Required test coverage of 74.0% reached.
Then I scanned the package with the generated coverage:
uv run riskratchet scan src --coverage coverage.json --summary
The current summary was:
scan functions=20 analyzed=183 emitted=20 files=14 coverage=present suppressed=0 skipped_missing_coverage=0
severity low=0 medium=16 high=4 critical=0
group name=ungrouped by_severity.critical=0 by_severity.high=4 by_severity.low=0 by_severity.medium=16 functions=20 max_score=62.25
The top findings were not surprising. The riskiest current function is
src/riskratchet/models.py::DiffReport.regressions at 62.2, with 0% line and branch coverage. Several diff
renderers also show up because the newer review-output surface is still less directly covered than the core scanner.
One of the more interesting dogfooding examples is the PR-comment renderer:
src/riskratchet/reporting.py::render_report_pr_comment
severity : medium
score : 47.6
crap : 18.4
complexity : CC=13
coverage : line=68%, branch=38%
churn : 6 commits in window
public : True
lines : 159-202 (function 44 lines, file 950)
components :
coverage_gap 31.8
structural_complexity 60.0
branch_gap 62.5
churn 60.0
public_surface 31.8
sprawl 45.0
This is exactly the kind of function I want the tool to put in front of me. It is not failing the baseline today. The current checked-in baseline is clean:
diff regressed=0 component_regressed=0 improved=0 new=0 removed=0 moved=0 unchanged=183
But if I keep extending render_report_pr_comment without improving tests or splitting it, riskratchet check will
have enough context to stop me.
That is the value of dogfooding a ratchet. It does not shame old code. It remembers the bar I already accepted.
Output Surfaces Matter
The first version of this post only talked about table, JSON, and Markdown. That is stale now.
The current CLI supports:
riskratchet scan src --coverage coverage.json --format table
riskratchet scan src --coverage coverage.json --format json
riskratchet scan src --coverage coverage.json --format markdown
riskratchet scan src --coverage coverage.json --format sarif
riskratchet scan src --coverage coverage.json --format github
riskratchet scan src --coverage coverage.json --format pr-comment
riskratchet scan src --coverage coverage.json --summary
Those formats are for different places in the workflow:
| Format | Where I use it |
|---|---|
| table | Local terminal scan |
| json | CI artifacts, scripts, snapshot tests |
| markdown | Pasting a static report into a PR or issue |
| pr-comment | Sticky bot comments with a stable marker |
| github | GitHub Actions annotations |
| sarif | Code scanning and editor viewers |
| summary | Compact CI logs and dashboards |
scan --format sarif emits current findings above the score filter. check --format sarif and diff --format sarif
emit regressions. A clean run still produces valid SARIF with an empty results array.
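For readers who have not looked inside a SARIF file, here is a minimal sketch of a 2.1.0 log, including the empty-results case. The SARIF field names follow the specification; the rule ID, the severity level, and the shape of the input findings are hypothetical and are not taken from riskratchet's real output.

```python
import json


def to_sarif(findings: list) -> dict:
    """Build a minimal SARIF 2.1.0 log from findings shaped like
    {"path", "start_line", "end_line", "message"} (a hypothetical input shape)."""
    results = [
        {
            "ruleId": "riskratchet/regression",  # hypothetical rule id
            "level": "warning",
            "message": {"text": finding["message"]},
            "locations": [
                {
                    "physicalLocation": {
                        "artifactLocation": {"uri": finding["path"]},
                        "region": {
                            "startLine": finding["start_line"],
                            "endLine": finding["end_line"],
                        },
                    }
                }
            ],
        }
        for finding in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {"driver": {"name": "riskratchet"}},
                "results": results,  # stays [] on a clean run
            }
        ],
    }


print(json.dumps(to_sarif([]), indent=2))
```

Emitting the empty array instead of no file at all is what lets downstream uploaders and viewers tell "the scan ran and found nothing" apart from "the scan did not run."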
The editor path is deliberately cheap. riskratchet does not ship a native VS Code or JetBrains plugin today. It emits
SARIF 2.1.0, and the docs show how to open that in VS Code's SARIF Viewer or JetBrains/Qodana tooling. The native LSP
idea is on the roadmap, but I do not want to pay that maintenance cost before SARIF proves there is real demand.
Wiring It Into The Loop
There are three practical entry points.
1. CI
This is the default recommendation:
pytest --cov --cov-branch --cov-report=json:coverage.json
riskratchet check src \
--coverage coverage.json \
--baseline .riskratchet.json \
--format pr-comment
check exits:
- 0 when the baseline is clean.
- 1 when configured regressions are found.
- 2 for usage errors, such as a missing baseline or invalid config.
That maps cleanly to CI. If the command exits 1, post the PR comment and fail the job.
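If the CI step is driven from a script instead of raw shell, the exit-code contract can be handled explicitly. The sketch below is a hypothetical wrapper: `run_gate` and `should_post_comment` are illustrative names, and posting the comment is left to whatever the CI platform provides. Only the command line and the three exit codes come from this post.

```python
import subprocess
import sys

# The check invocation shown above, as an argument list.
CHECK_ARGV = [
    "riskratchet", "check", "src",
    "--coverage", "coverage.json",
    "--baseline", ".riskratchet.json",
    "--format", "pr-comment",
]

EXIT_MEANINGS = {
    0: "baseline clean",
    1: "regressions found",
    2: "usage error (missing baseline, invalid config, ...)",
}


def run_gate(argv: list) -> tuple:
    """Run a command and return (exit code, captured stdout)."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode, completed.stdout


def should_post_comment(exit_code: int) -> bool:
    """Only a real regression report is worth posting to the PR."""
    return exit_code == 1


def main() -> int:
    code, report = run_gate(CHECK_ARGV)
    print(EXIT_MEANINGS.get(code, f"unexpected exit code {code}"), file=sys.stderr)
    if should_post_comment(code):
        print(report)  # hand this text to the CI platform's comment step
    return code        # propagate so the job still fails on 1 and 2
```

Keeping exit code 2 distinct from 1 is the useful part. A missing baseline should fail the job loudly, but it should not show up on the PR looking like a maintainability regression.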
2. Local pytest
The pytest plugin is opt-in:
pytest \
--cov --cov-branch --cov-report=json:coverage.json \
--riskratchet \
--riskratchet-paths src \
--riskratchet-baseline .riskratchet.json
With --riskratchet absent, the plugin is a no-op even though it is registered as a pytest entry point. I wanted the
tool to be available locally without making every test run pay the analysis cost automatically.
3. Pre-commit
Pre-commit can work, but I would not start there for most teams. It assumes coverage is fresh, and stale coverage makes for a bad developer experience.
Use it when the repo already runs tests in pre-commit:
repos:
  - repo: local
    hooks:
      - id: pytest-cov
        name: pytest coverage
        entry: pytest --cov --cov-branch --cov-report=json:coverage.json -q
        language: system
        pass_filenames: false
        always_run: true
  - repo: https://github.com/KayhanB21/riskratchet
    rev: v0.2.2
    hooks:
      - id: riskratchet
        args: ["src", "--coverage", "coverage.json", "--baseline", ".riskratchet.json"]
For most projects, CI is the enforcement layer and local pytest is the feedback layer.
The Token Efficiency Angle
The research memo in the repo changed how I think about this tool.
For long-running agent-assisted projects, maintainability is not only about humans reading code. It is also about how much context the next agent has to retrieve, compress, and reason over before it can safely make a change.
A risk report is a small context artifact:
path
qualified function name
score delta
component deltas
line range
remediation hint
That is much cheaper than asking an agent to rediscover the risky parts of the repository from scratch on every run. The ratchet gives the agent a focused repair target:
- This function regressed.
- The regression was branch coverage, not line coverage.
- The function is public.
- The file is hot.
- The accepted baseline was lower.
That does not replace architecture understanding. It does cut down the first pass from "read the repo and tell me what looks risky" to "fix these measured deltas without moving the baseline up."
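As a rough picture of how small that artifact can be, here is a sketch that turns one regression into a few lines of agent-ready context. Everything in it is illustrative: `RiskDelta` and `to_agent_context` are hypothetical names, and the component deltas, line range, and hint in the example are made-up numbers, not output from the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskDelta:
    path: str
    qualname: str
    before: float
    after: float
    component_deltas: dict  # component name -> change in component score
    start_line: int
    end_line: int
    hint: str


def to_agent_context(delta: RiskDelta, min_component_delta: float = 1.0) -> str:
    """Render one regression as compact text, largest component change first."""
    lines = [
        f"{delta.path}::{delta.qualname} (lines {delta.start_line}-{delta.end_line})",
        f"score {delta.before:.1f} -> {delta.after:.1f} ({delta.after - delta.before:+.1f})",
    ]
    ranked = sorted(delta.component_deltas.items(), key=lambda item: -item[1])
    for name, change in ranked:
        if change >= min_component_delta:  # drop components that did not move
            lines.append(f"  {name}: {change:+.1f}")
    lines.append(f"hint: {delta.hint}")
    return "\n".join(lines)


# Made-up values for illustration only.
example = RiskDelta(
    path="src/m.py",
    qualname="public_api",
    before=10.0,
    after=41.0,
    component_deltas={"branch_gap": 50.0, "coverage_gap": 40.0, "churn": 0.0},
    start_line=12,
    end_line=40,
    hint="add branch tests or split the function",
)
print(to_agent_context(example))
```

Filtering out components that did not move keeps the artifact focused: the agent sees what changed and by how much, without the rest of the report.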
For agent workflows, that matters. Context windows are large now, but they are not free. The cheapest context is the specific context you already computed in CI.
What This Tool Gets Wrong
The score is a heuristic. A score of 63.2 does not prove that a function is broken. It is a review signal.
Coverage can be shallow. If a test executes lines without asserting real behavior, both CRAP and riskratchet can be
too optimistic.
Churn can mean active ownership, not danger. A hot file is not necessarily a bad file. It is a file where small mistakes accumulate quickly.
Public-surface detection is imperfect. Python's public API boundary is conventional, and some frameworks invert control in ways static analysis will not understand.
Sprawl can punish code that should be excluded: generated files, migrations, vendored code, and framework boilerplate.
That is why the default config supports exclude, include, and allow lists.
Baselines can be abused. If every failed PR just updates .riskratchet.json, the tool becomes ceremony. A baseline bump
should be an explicit choice with a reason.
Practical Advice
Start by scanning, not failing:
pytest --cov --cov-branch --cov-report=json:coverage.json
riskratchet scan src --coverage coverage.json --top 20
Look at the top functions. If the output matches code you already know is painful, the signal is probably useful.
Then create the baseline on main:
riskratchet baseline src --coverage coverage.json --output .riskratchet.json
After that, gate PRs with check. Do not start by arguing over whether the worst existing function must be refactored
this week. Start with "do not make it worse."
Tune exclusions before tuning weights. If generated code is polluting the report, exclude it. If experimental folders should not gate production, add them to the allow list. Only change weights once the target set is right.
Keep the component breakdown visible in review. A single score is not enough. The point is to tell the author whether the fix is more branch coverage, less branching, a split function, or an explicit baseline decision.
Final Takeaway
The gap between "code that passes its tests" and "code that is still easy to change next month" is not completely subjective. Some of it is measurable at function granularity.
When the loop is:
- Agent writes code.
- Tests pass.
- Coverage looks fine.
- CI is green.
- PR merges.
you are not measuring whether maintainability risk moved up.
The loop I want is:
- Agent writes code.
- Tests pass.
- Coverage is generated.
- riskratchet check compares the changed functions against the accepted baseline.
- PR merges only if risk stayed within budget.
If step four fails, the fix is usually one of three things: add meaningful branch tests, split the function, or explain why the baseline should move. All three are better than silently accepting the drift.
The tool is on PyPI as riskratchet. The source is at github.com/KayhanB21/riskratchet. As of the current repo,
0.2.2 ships the CLI, pytest plugin, pre-commit hook, baseline diffing, component regression gates, schema-backed JSON,
SARIF, GitHub annotations, PR comments, summaries, config validation, groups, and the six-component score described
above.
If you want the fastest first look:
pipx run riskratchet scan src --coverage coverage.json
That tells you where you are. The baseline turns it into a ratchet.