Letting Agents Write Code Without Ratcheting Up Risk


In this post, I will walk through why I built a maintainability ratchet for AI-assisted Python, what it actually measures, and how it fits into the review loop.

The short version is this: tests passing is necessary, but it is not the same as keeping the codebase easy to change. AI agents make that gap more visible because they can add correct-looking code quickly. A function can keep passing its tests while quietly gaining branches, public surface, file sprawl, and maintenance history that make the next change harder.

riskratchet is my attempt to make that drift show up as a diff.

The Moment That Started This

I asked an agent to add one branch to a function I owned. The function already had a handful of branches. The agent added the new behavior, kept the signature, kept the existing tests passing, and added one happy-path test for the new case.

The PR looked fine. CI was green. Coverage went up.

A week later I had to change the same function again. I opened it and realized the easy-to-review diff had left behind a function that was now much harder to reason about. The tests were not wrong. The agent was not obviously wrong. The normal signals just were not measuring the thing I cared about.

The thing I wanted to catch was not:

Is this function bad?

It was:

Did this change make a function riskier than the version we had already accepted?

That distinction is the whole product.

What I Wanted To Catch

I wanted a check that could fail a PR when a single function moved into a worse maintainability state:

  1. Cyclomatic complexity went up but tests did not follow.
  2. Line coverage stayed high while branch coverage dropped.
  3. A public function lost coverage.
  4. A function crossed a length or file-sprawl threshold.
  5. A hot file accumulated more complexity.
  6. A new function landed already above the team's risk threshold.

Those are all measurable from data a Python CI job can already produce: source files, coverage JSON, and optionally git history.

The goal was not to build another static quality dashboard. The goal was to build a ratchet:

  1. Measure the current state.
  2. Save it as a baseline.
  3. Fail only when future changes move risk up past a tolerance.

That makes adoption much easier. A mature codebase does not have to become clean in one sweep. It just has to stop getting worse silently.

Why Coverage Alone Is Not Enough

Coverage is useful, but it is very easy to overread.

A line can execute without the test asserting the behavior that matters. A happy-path test can touch every line in a function while leaving half the branch exits untested. A public API can be "covered" only through incidental calls from another test. A file can have respectable project-level coverage while one risky function has almost none.

This is one of the reasons I care about function-level output. Project coverage answers a broad question:

Did the test suite execute this much of the repository?

The review question is narrower:

Did the function changed by this PR become harder to change safely?

Those are not the same question.

Why CRAP Alone Was Not Enough

The CRAP score is still useful:

CC^2 * (1 - line_coverage)^3 + CC

It catches the classic bad shape: complex code with weak line coverage. riskratchet keeps CRAP in the output because it is a good familiar ranking signal.
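As a quick sanity check on the formula, here is a minimal Python sketch. The function name `crap_score` is my own illustration, not part of riskratchet's API, and coverage is assumed to be a fraction between 0 and 1.

```python
def crap_score(complexity: int, line_coverage: float) -> float:
    """CRAP = CC^2 * (1 - coverage)^3 + CC, with coverage as a fraction in [0, 1]."""
    if complexity < 1:
        raise ValueError("cyclomatic complexity starts at 1")
    if not 0.0 <= line_coverage <= 1.0:
        raise ValueError("line_coverage must be a fraction between 0 and 1")
    return complexity**2 * (1.0 - line_coverage) ** 3 + complexity


# Full line coverage collapses the score to the complexity itself.
print(crap_score(5, 1.0))  # 5.0
# Zero coverage leaves CC^2 + CC.
print(crap_score(3, 0.0))  # 12.0
```

Because the coverage term vanishes at 100% line coverage, CRAP cannot distinguish a fully branch-tested function from one whose tests only walk the happy path.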

But CRAP does not see everything I wanted this tool to care about:

| Shape | What CRAP sees | What I still care about |
| --- | --- | --- |
| 100% line coverage, 50% branch coverage | Looks mostly fine | Half the exits were never tested |
| A 2-line public function with no tests | Low score | Public contract has no direct coverage |
| A function in a 950-line module | Only the function's CC and line coverage | File sprawl makes every change more expensive |
| A file touched 6 times in the churn window | Nothing | Hot code is where small changes accumulate |
| A baseline score moved from 10 to 41 | Only the new absolute score | The regression is the useful signal |

So the score in riskratchet is a weighted blend of six normalized components:

| Component | Default weight | What it measures |
| --- | ---: | --- |
| coverage_gap | 30% | Missing line coverage inside the function span |
| structural_complexity | 25% | Cyclomatic complexity, saturating at high values |
| branch_gap | 15% | Missing branch coverage when branch data exists |
| churn | 10% | Recent commits touching the file, default 90-day window |
| public_surface | 10% | Missing coverage on functions treated as public API |
| sprawl | 10% | Function length and surrounding file length |

Weights are configurable in [tool.riskratchet.weights], but they are validated and renormalized. A typo or negative weight should not silently weaken a CI gate.
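To illustrate what "validated and renormalized" can mean in practice, here is a rough sketch. The function `resolve_weights` and its error messages are hypothetical; only the component names and default percentages come from the table above, and the real implementation may differ.

```python
# Default weights from the table above, expressed as percentages.
DEFAULT_WEIGHTS = {
    "coverage_gap": 30.0,
    "structural_complexity": 25.0,
    "branch_gap": 15.0,
    "churn": 10.0,
    "public_surface": 10.0,
    "sprawl": 10.0,
}


def resolve_weights(overrides: dict) -> dict:
    """Merge overrides into the defaults, reject bad input, renormalize to sum to 1."""
    unknown = set(overrides) - set(DEFAULT_WEIGHTS)
    if unknown:
        # A typo such as "chrun" fails loudly instead of being ignored.
        raise ValueError(f"unknown weight(s): {sorted(unknown)}")
    merged = {**DEFAULT_WEIGHTS, **overrides}
    negative = sorted(name for name, value in merged.items() if value < 0)
    if negative:
        raise ValueError(f"negative weight(s): {negative}")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in merged.items()}


# Doubling churn shrinks every other component's share proportionally.
print(resolve_weights({"churn": 20.0})["churn"])
```

Renormalizing means a team can raise one weight without having to rebalance the other five by hand.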

A Real Fixture: The Agent Spaghetti Case

The repo has a fixture named tests/fixtures/agent_generated_spaghetti. It is the canonical shape I wanted the tool to catch: a public process_payment function that is long, branchy, and lightly tested.

The function is 44 lines long. It handles IDs, amounts, currencies, captured status, metadata, retry counts, strict mode, exception paths, and default behavior in one place.

Running the current tool from that fixture directory:

uv run riskratchet explain src/m.py::process_payment \
  --coverage coverage.json \
  --no-git

prints:

src/m.py::process_payment
  severity     : high
  score        : 63.2
  crap         : 156.3
  complexity   : CC=21
  coverage     : line=33%, branch=25%
  churn        : 0 commits in window
  public       : True
  lines        : 8-51 (function 44 lines, file 51)
  components   :
    coverage_gap          67.4
    structural_complexity 100.0
    branch_gap            75.0
    churn                 0.0
    public_surface        67.4
    sprawl                0.0

CRAP catches this one too, which is good. This is the easy case: high complexity and weak coverage.

The useful part is the component breakdown. It tells the agent what to do next. This is not a vague "quality score." The repair options are concrete: split the function, add missing branch tests, or both.
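The breakdown is also easy to check by hand. The sketch below assumes the total is a plain weighted sum of the six components with the default weights, which is consistent with the numbers printed above; the helper names `blend` and `contributions` are mine.

```python
# Default weights as fractions (assumed to be applied as a plain weighted sum).
WEIGHTS = {
    "coverage_gap": 0.30,
    "structural_complexity": 0.25,
    "branch_gap": 0.15,
    "churn": 0.10,
    "public_surface": 0.10,
    "sprawl": 0.10,
}

# Component values copied from the process_payment output above.
process_payment = {
    "coverage_gap": 67.4,
    "structural_complexity": 100.0,
    "branch_gap": 75.0,
    "churn": 0.0,
    "public_surface": 67.4,
    "sprawl": 0.0,
}


def blend(components: dict) -> float:
    """Weighted sum of normalized (0-100) components."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def contributions(components: dict) -> list:
    """Per-component share of the total, largest first."""
    parts = {name: WEIGHTS[name] * components[name] for name in WEIGHTS}
    return sorted(parts.items(), key=lambda item: item[1], reverse=True)


print(round(blend(process_payment), 1))  # 63.2
for name, points in contributions(process_payment)[:3]:
    print(f"{name:<22} {points:.1f}")
```

Sorting by contribution puts structural_complexity first, then coverage_gap, then branch_gap, which is the same order as the repair options: split first, then test.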

The Case CRAP Mostly Misses

The fixture tests/fixtures/covered_but_branchy is smaller:

def normalize(record: dict) -> dict:
    out = {}
    if "id" in record:
        out["id"] = str(record["id"])
    if "amount" in record:
        out["amount"] = float(record["amount"])
    if "currency" in record:
        out["currency"] = record["currency"].upper()
    if "captured" in record:
        out["captured"] = bool(record["captured"])
    return out

The fixture coverage touches every line, but only half the branch exits. The current output is:

src/m.py::normalize
  severity     : low
  score        : 12.5
  crap         : 5.0
  complexity   : CC=5
  coverage     : line=100%, branch=50%
  churn        : 0 commits in window
  public       : True
  lines        : 7-17 (function 11 lines, file 17)
  components   :
    coverage_gap          0.0
    structural_complexity 20.0
    branch_gap            50.0
    churn                 0.0
    public_surface        0.0
    sprawl                0.0

This should not fail a build by itself. The score is low. But it should be visible when this function changes. If an agent adds four more optional fields and the branch coverage falls again, the baseline check will catch the delta.

That is the behavior I wanted: do not panic about every imperfect function, but do not let the same function get worse without someone noticing.
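To make the 100%-line, 50%-branch shape concrete: each `if` without an `else` has two exits, so four of them give eight branches. The sketch below repeats `normalize` and pairs it with two tests. I am assuming the fixture's existing test resembles the first one; the actual fixture may differ.

```python
def normalize(record: dict) -> dict:
    out = {}
    if "id" in record:
        out["id"] = str(record["id"])
    if "amount" in record:
        out["amount"] = float(record["amount"])
    if "currency" in record:
        out["currency"] = record["currency"].upper()
    if "captured" in record:
        out["captured"] = bool(record["captured"])
    return out


def test_all_fields_present():
    # Executes every line, but only the "condition true" exit of each if:
    # 4 of 8 branches, so line coverage is 100% and branch coverage is 50%.
    record = {"id": 7, "amount": "12.50", "currency": "usd", "captured": 1}
    assert normalize(record) == {
        "id": "7",
        "amount": 12.5,
        "currency": "USD",
        "captured": True,
    }


def test_no_fields_present():
    # Takes the "condition false" exit of all four ifs, covering the other half.
    assert normalize({}) == {}
```

The second test adds no new lines to the coverage report, which is exactly why line coverage alone never asks for it.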

Public Surface Matters

Python does not have a perfect public/private boundary, but naming conventions and __all__ are still useful signals.

The tests/fixtures/all_exports_focused fixture has three functions:

  • _legacy_exposed, which looks private but appears in __all__.
  • naturally_public, which is public by name.
  • _truly_private, which stays private.

With zero coverage, the current scan reports:

| medium | 42.5 | 12.0 | 3 | 0% | n/a | `src/m.py::_legacy_exposed` | 14-19 |
| medium | 40.0 |  2.0 | 1 | 0% | n/a | `src/m.py::naturally_public` | 22-23 |
| medium | 30.0 |  2.0 | 1 | 0% | n/a | `src/m.py::_truly_private` | 26-27 |

The important detail is that _legacy_exposed is treated as public because __all__ promotes it. Its public_surface component is 100.0; _truly_private does not get that extra public-surface penalty.

This is still a heuristic. Framework callbacks, plugin entry points, and internal functions used across services can blur the boundary. But a heuristic is enough to make review sharper. A public function with no coverage deserves a different conversation than a private helper with the same raw line coverage.
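A heuristic along these lines can be sketched with the standard `ast` module. This is my own approximation, not riskratchet's code: it treats a top-level function as public when its name has no leading underscore or when the name appears in a literal `__all__` list or tuple. The sample module's `__all__` contents are assumed for illustration.

```python
import ast


def public_names(source: str) -> set:
    """Return top-level function names that look public by name or via __all__."""
    tree = ast.parse(source)
    exported = set()
    for node in tree.body:
        is_all_assignment = isinstance(node, ast.Assign) and any(
            isinstance(target, ast.Name) and target.id == "__all__"
            for target in node.targets
        )
        # Only literal lists/tuples of strings are understood here.
        if is_all_assignment and isinstance(node.value, (ast.List, ast.Tuple)):
            for element in node.value.elts:
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    exported.add(element.value)
    public = set()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not node.name.startswith("_") or node.name in exported:
                public.add(node.name)
    return public


MODULE = '''\
__all__ = ["_legacy_exposed"]

def _legacy_exposed(): ...
def naturally_public(): ...
def _truly_private(): ...
'''

print(sorted(public_names(MODULE)))  # ['_legacy_exposed', 'naturally_public']
```

Dynamic `__all__` construction, such as `__all__ += other.__all__`, is invisible to this sketch, which is one reason the boundary stays a heuristic.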

The Baseline Is The Product

Scanning is useful for discovery. The baseline is what changes behavior.

riskratchet baseline writes a stable JSON file keyed by path and qualified function name. Each entry stores the score, the component scores, and a fingerprint, so that a later run can tell moved or unchanged functions apart from edited ones.
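This post does not spell out how the fingerprint is computed, so the following is only one plausible approach, labeled as an assumption: hash the function's AST dump, which ignores line numbers, comments, and formatting. The helper name `fingerprint` and the 16-character truncation are hypothetical.

```python
import ast
import hashlib


def fingerprint(source: str, name: str) -> str:
    """Hash a function's AST so that moving or reformatting it keeps the same value."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # ast.dump omits line/column attributes by default.
            dumped = ast.dump(node)
            return hashlib.sha256(dumped.encode("utf-8")).hexdigest()[:16]
    raise LookupError(f"function {name!r} not found")


original = "def f(x):\n    return x + 1\n"
moved = "\n\n# a comment above\ndef f(x):\n    return x + 1\n"
edited = "def f(x):\n    return x + 2\n"

print(fingerprint(original, "f") == fingerprint(moved, "f"))   # True
print(fingerprint(original, "f") == fingerprint(edited, "f"))  # False
```

With a fingerprint like this, a function that moved down the file keeps its identity, while a body change shows up as a different value.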

riskratchet check compares the current run against that baseline. In the current repo, it can fail on:

  • A new function above fail_new_above (default 50).
  • An existing function whose total score grew by more than fail_regression_above (default 5).
  • An existing function whose individual component grew by more than fail_component_regression_above (default 15) when the component gate is enabled.
  • An existing function still above an optional absolute ceiling, if a team configures one.

The strict comparison matters: a tolerance of 5 means +5.0 is still inside the budget; more than 5.0 fails.

The fixture tests/fixtures/public_api_regression pins the basic regression behavior. Running:

uv run riskratchet check src \
  --coverage coverage.json \
  --baseline baseline.json \
  --format markdown \
  --no-git

exits 1 and prints:

# riskratchet regressions

| Kind | Function | Before | After | Delta | Reason |
| --- | --- | ---: | ---: | ---: | --- |
| regressed | `src/m.py::public_api` | 10.0 | 41.0 | +31.0 | risk grew by +31.0 (from 10.0 to 41.0); tolerance is +5.0 |

That is the PR gate I wanted. Not "this function is morally bad." Just: this function was accepted at 10.0, this change moved it to 41.0, and the team's budget was 5.0.
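The gate logic itself is small. Below is a minimal sketch of the total-score rules described above, using the documented defaults; the `Thresholds` class and `gate` function are illustrative names, and the component gate and absolute ceiling are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    fail_new_above: float = 50.0
    fail_regression_above: float = 5.0


def gate(before: Optional[float], after: float, limits: Thresholds = Thresholds()) -> Optional[str]:
    """Return a failure reason, or None when the change stays within budget."""
    if before is None:
        # No baseline entry: the function is new, so only the absolute limit applies.
        if after > limits.fail_new_above:
            return f"new function scored {after:.1f}; limit is {limits.fail_new_above:.1f}"
        return None
    delta = after - before
    # Strict comparison: a delta exactly equal to the tolerance still passes.
    if delta > limits.fail_regression_above:
        return (
            f"risk grew by {delta:+.1f} (from {before:.1f} to {after:.1f}); "
            f"tolerance is +{limits.fail_regression_above:.1f}"
        )
    return None


print(gate(10.0, 41.0))  # risk grew by +31.0 (from 10.0 to 41.0); tolerance is +5.0
print(gate(10.0, 15.0))  # None
print(gate(None, 63.2))  # new function scored 63.2; limit is 50.0
```

Note that an existing function scoring 63.2 passes as long as it was already accepted near that score, which is the ratchet behavior rather than an absolute quality bar.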

How The Current Implementation Works

The current package is 0.2.2. The command surface is:

riskratchet scan      # scan files and report current risk
riskratchet baseline  # write the accepted current state
riskratchet check     # fail when risk regresses
riskratchet explain   # print the full breakdown for one function
riskratchet diff      # show the full baseline diff without failing
riskratchet config    # validate or show resolved configuration

The analysis path is intentionally boring:

  1. Walk the configured Python paths, honoring include and exclude globs.
  2. Parse each file with ast.
  3. Discover functions, methods, nested functions, async functions, decorators, and public-surface metadata.
  4. Compute cyclomatic complexity.
  5. Load coverage.json and intersect executed and missing lines and branches with each function span.
  6. Count git churn with one recent-history pass, unless --no-git is set.
  7. Compute CRAP and the six risk components.
  8. Emit a report, baseline, diff, or regression list.
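Steps 2 through 5 can be approximated in a few dozen lines. This is a simplified sketch, not the tool's implementation: the complexity counter only handles the most common decision nodes, and the `FILE_COVERAGE` dictionary is hand-written to mirror the `executed_lines` and `missing_lines` fields of one file entry in coverage.py's JSON report.

```python
import ast
from typing import Optional

SOURCE = '''\
def normalize(record):
    out = {}
    if "id" in record:
        out["id"] = str(record["id"])
    if "amount" in record:
        out["amount"] = float(record["amount"])
    return out
'''

# Hand-written stand-in for one file entry of a coverage JSON report.
FILE_COVERAGE = {"executed_lines": [1, 2, 3, 4, 5, 7], "missing_lines": [6]}

DECISION_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate CC as 1 + decision points (nested functions are counted too)."""
    total = 1
    for node in ast.walk(func):
        if isinstance(node, DECISION_NODES):
            total += 1
        elif isinstance(node, ast.BoolOp):
            total += len(node.values) - 1  # each extra and/or operand adds a path
    return total


def line_coverage(func: ast.AST, file_cov: dict) -> Optional[float]:
    """Fraction of measured lines inside the function span that were executed."""
    span = range(func.lineno, func.end_lineno + 1)
    executed = sum(1 for line in file_cov["executed_lines"] if line in span)
    missing = sum(1 for line in file_cov["missing_lines"] if line in span)
    measured = executed + missing
    return executed / measured if measured else None


functions = {
    node.name: node
    for node in ast.walk(ast.parse(SOURCE))
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
}
for name, node in functions.items():
    print(name, node.lineno, node.end_lineno, cyclomatic_complexity(node), line_coverage(node, FILE_COVERAGE))
```

Restricting the coverage lines to the function's own span is what turns a file-level report into a function-level one.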

If coverage is missing, the behavior is explicit. The config can choose pessimistic, optimistic, or skip, and CI can require coverage unless --allow-missing-coverage is passed. I do not want "coverage was absent" to quietly become "everything is fine."

The config in the repo looks like this in resolved form:

{
  "paths": ["src"],
  "coverage": "coverage.json",
  "baseline": ".riskratchet.json",
  "fail_new_above": 50.0,
  "fail_regression_above": 5.0,
  "fail_component_regression_above": 15.0,
  "component_regression_gate": true,
  "allow_missing_coverage": false,
  "auto_coverage": true,
  "coverage_cache": ".riskratchet/coverage.json",
  "test_command": "pytest --cov --cov-branch --cov-report=json:{output} -q",
  "missing_coverage": "pessimistic",
  "churn_window_days": 90,
  "exclude": ["tests/**", "migrations/**", "*/generated/**"]
}

riskratchet config validate --config pyproject.toml currently reports:

valid riskratchet config: pyproject.toml

That sounds small, but it matters in CI. A typo in a quality gate should be a usage error, not an ignored key.
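As an illustration of why unknown keys should be errors, here is a small sketch that rejects anything outside a known set and suggests the closest match. The key set is copied from the resolved config shown above and is likely only a subset of what the tool accepts; `validate_keys` is a hypothetical helper, not the tool's validator.

```python
import difflib

# Keys taken from the resolved config above; the real tool may accept more.
KNOWN_KEYS = {
    "paths", "coverage", "baseline", "fail_new_above", "fail_regression_above",
    "fail_component_regression_above", "component_regression_gate",
    "allow_missing_coverage", "auto_coverage", "coverage_cache", "test_command",
    "missing_coverage", "churn_window_days", "exclude",
}


def validate_keys(config: dict) -> None:
    """Raise ValueError on the first unknown key, with a spelling suggestion if any."""
    for key in config:
        if key not in KNOWN_KEYS:
            close = difflib.get_close_matches(key, sorted(KNOWN_KEYS), n=1)
            suggestion = f"; did you mean {close[0]!r}?" if close else ""
            raise ValueError(f"unknown config key {key!r}{suggestion}")


try:
    validate_keys({"paths": ["src"], "fail_new_abov": 40.0})
except ValueError as error:
    print(error)  # unknown config key 'fail_new_abov'; did you mean 'fail_new_above'?
```

Without this check, the misspelled threshold would be dropped and the gate would silently fall back to the default of 50.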

Dogfooding The Tool On Itself

Before updating this post, I ran the current test suite in the riskratchet repo:

uv run pytest --cov=src/riskratchet --cov-branch --cov-report=json:coverage.json --cov-report=term-missing

The result was:

260 passed in 7.92s
TOTAL 1783 statements, 469 missed, 656 branches, 79 partial, 76.22% coverage
Required test coverage of 74.0% reached.

Then I scanned the package with the generated coverage:

uv run riskratchet scan src --coverage coverage.json --summary

The current summary was:

scan functions=20 analyzed=183 emitted=20 files=14 coverage=present suppressed=0 skipped_missing_coverage=0
severity low=0 medium=16 high=4 critical=0
group name=ungrouped by_severity.critical=0 by_severity.high=4 by_severity.low=0 by_severity.medium=16 functions=20 max_score=62.25

The top findings were not surprising. The riskiest current function is src/riskratchet/models.py::DiffReport.regressions at 62.2, with 0% line and branch coverage. Several diff renderers also show up because the newer review-output surface is still less directly covered than the core scanner.

One of the more interesting dogfooding examples is the PR-comment renderer:

src/riskratchet/reporting.py::render_report_pr_comment
  severity     : medium
  score        : 47.6
  crap         : 18.4
  complexity   : CC=13
  coverage     : line=68%, branch=38%
  churn        : 6 commits in window
  public       : True
  lines        : 159-202 (function 44 lines, file 950)
  components   :
    coverage_gap          31.8
    structural_complexity 60.0
    branch_gap            62.5
    churn                 60.0
    public_surface        31.8
    sprawl                45.0

This is exactly the kind of function I want the tool to put in front of me. It is not failing the baseline today. The current checked-in baseline is clean:

diff regressed=0 component_regressed=0 improved=0 new=0 removed=0 moved=0 unchanged=183

But if I keep extending render_report_pr_comment without improving tests or splitting it, riskratchet check will have enough context to stop me.

That is the value of dogfooding a ratchet. It does not shame old code. It remembers the bar I already accepted.

Output Surfaces Matter

The first version of this post only talked about table, JSON, and Markdown. That is stale now.

The current CLI supports:

riskratchet scan src --coverage coverage.json --format table
riskratchet scan src --coverage coverage.json --format json
riskratchet scan src --coverage coverage.json --format markdown
riskratchet scan src --coverage coverage.json --format sarif
riskratchet scan src --coverage coverage.json --format github
riskratchet scan src --coverage coverage.json --format pr-comment
riskratchet scan src --coverage coverage.json --summary

Those formats are for different places in the workflow:

| Format | Where I use it |
| --- | --- |
| table | Local terminal scan |
| json | CI artifacts, scripts, snapshot tests |
| markdown | Pasting a static report into a PR or issue |
| pr-comment | Sticky bot comments with a stable marker |
| github | GitHub Actions annotations |
| sarif | Code scanning and editor viewers |
| summary | Compact CI logs and dashboards |

scan --format sarif emits current findings above the score filter. check --format sarif and diff --format sarif emit regressions. A clean run still produces valid SARIF with an empty results array.
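For readers who have not seen SARIF before, a minimal 2.1.0 log needs little more than a version, a tool driver name, and a results array. The sketch below builds one from plain dictionaries. The rule id, the severity level, and the finding fields are hypothetical choices of mine; riskratchet's actual SARIF output will contain more detail.

```python
import json


def to_sarif(findings: list) -> dict:
    """Build a minimal SARIF 2.1.0 log; an empty findings list yields empty results."""
    results = [
        {
            "ruleId": "riskratchet/regression",  # hypothetical rule id
            "level": "warning",
            "message": {"text": finding["message"]},
            "locations": [
                {
                    "physicalLocation": {
                        "artifactLocation": {"uri": finding["path"]},
                        "region": {
                            "startLine": finding["start_line"],
                            "endLine": finding["end_line"],
                        },
                    }
                }
            ],
        }
        for finding in findings
    ]
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "riskratchet"}}, "results": results}],
    }


# A clean run still serializes to a well-formed log.
print(json.dumps(to_sarif([]), indent=2))
```

Emitting the empty log on clean runs matters because upload steps and viewers generally expect a file to exist on every run, not only on failing ones.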

The editor path is deliberately cheap. riskratchet does not ship a native VS Code or JetBrains plugin today. It emits SARIF 2.1.0, and the docs show how to open that in VS Code's SARIF Viewer or JetBrains/Qodana tooling. The native LSP idea is on the roadmap, but I do not want to pay that maintenance cost before SARIF proves there is real demand.

Wiring It Into The Loop

There are three practical entry points.

1. CI

This is the default recommendation:

pytest --cov --cov-branch --cov-report=json:coverage.json
riskratchet check src \
  --coverage coverage.json \
  --baseline .riskratchet.json \
  --format pr-comment

check exits:

  • 0 when no configured regression is found.
  • 1 when configured regressions are found.
  • 2 for usage errors, such as a missing baseline or invalid config.

That maps cleanly to CI. If the command exits 1, post the PR comment and fail the job.
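If the CI step is driven from a script rather than raw shell, the three exit codes can be mapped explicitly. This is a hedged sketch using only the standard library; `run_check` and the outcome labels are my own names, and the command you pass in would be the `riskratchet check` invocation shown above.

```python
import subprocess
import sys

# Exit codes as documented above; anything else is treated as unexpected.
OUTCOMES = {0: "clean", 1: "regressions", 2: "usage-error"}


def run_check(argv: list) -> tuple:
    """Run a check command and return (outcome label, captured stdout)."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return OUTCOMES.get(completed.returncode, "unexpected"), completed.stdout


# Stand-in command so the sketch runs without riskratchet installed:
# it prints a fake comment body and exits 1, like a failing check would.
fake_check = [sys.executable, "-c", "import sys; print('comment body'); sys.exit(1)"]
outcome, body = run_check(fake_check)
print(outcome)  # regressions
```

Separating "regressions" from "usage-error" keeps a missing baseline from being reported to the PR author as if their change had made something worse.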

2. Local pytest

The pytest plugin is opt-in:

pytest \
  --cov --cov-branch --cov-report=json:coverage.json \
  --riskratchet \
  --riskratchet-paths src \
  --riskratchet-baseline .riskratchet.json

With --riskratchet absent, the plugin is a no-op even though it is registered as a pytest entry point. I wanted the tool to be available locally without making every test run pay the analysis cost automatically.

3. Pre-commit

Pre-commit can work, but I would not start there for most teams. It assumes coverage is fresh, and stale coverage makes for a bad developer experience.

Use it when the repo already runs tests in pre-commit:

repos:
  - repo: local
    hooks:
      - id: pytest-cov
        name: pytest coverage
        entry: pytest --cov --cov-branch --cov-report=json:coverage.json -q
        language: system
        pass_filenames: false
        always_run: true
  - repo: https://github.com/KayhanB21/riskratchet
    rev: v0.2.2
    hooks:
      - id: riskratchet
        args: ["src", "--coverage", "coverage.json", "--baseline", ".riskratchet.json"]

For most projects, CI is the enforcement layer and local pytest is the feedback layer.

The Token Efficiency Angle

The research memo in the repo changed how I think about this tool.

For long-running agent-assisted projects, maintainability is not only about humans reading code. It is also about how much context the next agent has to retrieve, compress, and reason over before it can safely make a change.

A risk report is a small context artifact:

path
qualified function name
score delta
component deltas
line range
remediation hint

That is much cheaper than asking an agent to rediscover the risky parts of the repository from scratch on every run. The ratchet gives the agent a focused repair target:

  • This function regressed.
  • The regression was branch coverage, not line coverage.
  • The function is public.
  • The file is hot.
  • The accepted baseline was lower.

That does not replace architecture understanding. It does cut down the first pass from "read the repo and tell me what looks risky" to "fix these measured deltas without moving the baseline up."
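One way to picture that artifact is as a small record that renders into a single instruction line for the agent. The `RiskFinding` class below is a hypothetical shape built from the six fields listed above, not the tool's schema, and the example values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskFinding:
    path: str
    qualified_name: str
    score_delta: float
    component_deltas: dict
    line_range: tuple
    remediation_hint: str

    def as_prompt(self) -> str:
        """Render the finding as one compact line of agent context."""
        worst = max(self.component_deltas, key=self.component_deltas.get)
        start, end = self.line_range
        return (
            f"{self.path}::{self.qualified_name} (lines {start}-{end}) "
            f"regressed by {self.score_delta:+.1f}; largest component delta: "
            f"{worst} ({self.component_deltas[worst]:+.1f}). Hint: {self.remediation_hint}"
        )


# Invented example values, for illustration only.
finding = RiskFinding(
    path="src/m.py",
    qualified_name="public_api",
    score_delta=31.0,
    component_deltas={"branch_gap": 40.0, "coverage_gap": 10.0},
    line_range=(12, 30),
    remediation_hint="add tests for the untested branch exits",
)
print(finding.as_prompt())
```

A line like this is a few dozen tokens, compared with the thousands needed to reread the module and its tests.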

For agent workflows, that matters. Context windows are large now, but they are not free. The cheapest context is the specific context you already computed in CI.

What This Tool Gets Wrong

The score is heuristic. A score of 63.2 is not a proof that a function is broken. It is a review signal.

Coverage can be shallow. If a test executes lines without asserting real behavior, both CRAP and riskratchet can be too optimistic.

Churn can mean active ownership, not danger. A hot file is not necessarily a bad file. It is a file where small mistakes accumulate quickly.

Public-surface detection is imperfect. Python's public API boundary is conventional, and some frameworks invert control in ways static analysis will not understand.

Sprawl can punish code that should be excluded: generated files, migrations, vendored code, and framework boilerplate. That is why the default config supports exclude, include, and allow lists.

Baselines can be abused. If every failed PR just updates .riskratchet.json, the tool becomes ceremony. A baseline bump should be an explicit choice with a reason.

Practical Advice

Start by scanning, not failing:

pytest --cov --cov-branch --cov-report=json:coverage.json
riskratchet scan src --coverage coverage.json --top 20

Look at the top functions. If the output matches code you already know is painful, the signal is probably useful.

Then create the baseline on main:

riskratchet baseline src --coverage coverage.json --output .riskratchet.json

After that, gate PRs with check. Do not start by arguing over whether the worst existing function must be refactored this week. Start with "do not make it worse."

Tune exclusions before tuning weights. If generated code is polluting the report, exclude it. If experimental folders should not gate production, put them on the allow list. Only change weights once the target set is right.

Keep the component breakdown visible in review. A single score is not enough. The point is to tell the author whether the fix is more branch coverage, less branching, a split function, or an explicit baseline decision.

Final Takeaway

The gap between "code that passes its tests" and "code that is still easy to change next month" is not completely subjective. Some of it is measurable at function granularity.

When the loop is:

  1. Agent writes code.
  2. Tests pass.
  3. Coverage looks fine.
  4. CI is green.
  5. PR merges.

you are not measuring whether maintainability risk moved up.

The loop I want is:

  1. Agent writes code.
  2. Tests pass.
  3. Coverage is generated.
  4. riskratchet check compares the changed functions against the accepted baseline.
  5. PR merges only if risk stayed within budget.

If step four fails, the fix is usually one of three things: add meaningful branch tests, split the function, or explain why the baseline should move. All three are better than silently accepting the drift.

The tool is on PyPI as riskratchet. The source is at github.com/KayhanB21/riskratchet. As of the current repo, 0.2.2 ships the CLI, pytest plugin, pre-commit hook, baseline diffing, component regression gates, schema-backed JSON, SARIF, GitHub annotations, PR comments, summaries, config validation, groups, and the six-component score described above.

If you want the fastest first look:

pipx run riskratchet scan src --coverage coverage.json

That tells you where you are. The baseline turns it into a ratchet.