How We Reined In AI Agents With pre-commit

A lot of teams now seem to rely on AI code review bots to review code written by AI. That does work, to a point, but there is something slightly absurd about paying for a second model to inspect the output of the first when older tools like pre-commit already did a good job of catching many classes of mistakes early, locally, and without expensive licenses. We do use CodeRabbit on our pull requests, but we also wanted to catch code smells much earlier: in pre-commit hooks on developer machines, and again in GitHub pipelines before bad patterns have a chance to settle in.

So we added a guard in Merrilin to rein in one family of those smells: fragile error handling. It uses tree-sitter to reject these patterns before agent-written code gets committed. It does not just search for HTTPException or catch. It parses Python, TypeScript, and TSX into syntax trees and asks more interesting questions:

Did someone raise a raw HTTPException instead of a typed API error?
Did they catch a database exception and then keep going without a rollback or savepoint?
Did they write reader progress in a critical path without a visible recovery boundary?
Did frontend code read error.response.data.detail directly instead of using the shared normalizer?
Did someone add a .catch() that only logs to console and silently swallows the failure?

There is a certain wonder in watching a machine produce so much working code so quickly. There is also a certain exhaustion in watching it rediscover the exact same bad ideas at scale. This guard has been one of the most effective ways we have found to keep AI-assisted development inside the boundaries of our system design.

Why we built it

Merrilin is an AI reading companion. That means we have one non-negotiable invariant:

do not break reading, ever.

Optional systems can fail: AI can fail, telemetry can fail, sync can fail, analytics can fail. But opening a book, turning a page, saving progress locally, and resuming offline still need to work.

The problem is that AI agents are fantastic at reproducing tiny error-handling shortcuts that look harmless in isolation:

raise HTTPException(...)
except Exception: logger.exception(...)
except IntegrityError: pass
catch (error) { console.error(error) }
throw new Error("something went wrong")

Individually, these are easy to rationalize. Collectively, they create outages, inconsistent client behavior, poisoned transactions, and impossible-to-centralize error semantics, and they are exactly the kind of thing an AI agent will keep doing unless you give it a hard boundary. We wanted something stricter than prompt instructions and more precise than regexes, so we put the rules in code.

Where it lives

The error-handling guard matters most here, but it sits inside a broader pre-commit setup that gives AI-generated diffs fewer places to hide. Our .pre-commit-config.yaml also enforces:

protected branch safety with no-commit-to-branch
basic hygiene checks like JSON/YAML/TOML validation, merge-conflict detection, symlink checks, AST validation, private-key detection, whitespace cleanup, and line-ending normalization
Conventional Commit messages at commit-msg time
Alembic migration naming and single-head checks
backend Ruff linting and formatting
Prettier for JS, TS, and Markdown
ESLint for the web and admin apps

That matters because AI agents rarely fail in only one dimension. The same model that invents a broad except Exception will also happily leave formatting drift, hand-write invalid Alembic migrations, or produce inconsistent commit messages unless your repo pushes back.

The error-handling guard itself is wired into .pre-commit-config.yaml:

- id: error-handling-patterns
  name: Guard centralized error handling patterns
  entry:
    uv run --project backend --extra dev python scripts/check_error_handling_patterns.py
    --changed-lines
  language: system
  files: ^(backend/app/.*\.py|apps/(web|mobile|admin)/src/.*\.(ts|tsx))$
  stages: [pre-commit]

Taken together, these hooks keep the codebase clean, the schema history sane, and the commit history legible, while the tree-sitter guard keeps the agent from reintroducing reliability bugs you already learned not to ship.

The Alembic hooks deserve special mention. Since Feb 7, we have landed roughly 331 PRs in this repo. In that same stretch, we have touched backend/alembic/versions in 93 commits, and the folder currently contains 81 migration files. There is something exhilarating about shipping that fast. There is also the quiet fatigue of realizing your migration history has become a place where humans and agents can both make a mess with full confidence.

In that kind of environment, AI agents start getting overconfident. They see a pile of migrations, infer a human numbering convention that was never really meant to be one, and begin writing new files by hand with integer prefixes like 001_add_users.py or 042_fix_index.py.

That does not sound catastrophic at first. In fact, it can feel convenient right up until revisions conflict and you realize you are effectively negotiating whether the next migration should be 65, 66, or something cursed like 65.1. Alembic does not want humans inventing numbering schemes under pressure; it wants revisions generated through its own graph.

So we explicitly block manually numbered migration filenames. New backend migrations must be created through Alembic itself, and we also require a single Alembic head so schema history stays manageable when multiple branches are moving at once.
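As a rough illustration of the filename rule, a check like this fits in a few lines of Python. The regex, the path test, and the function names are assumptions made for the sketch, not the hook this repo actually runs. It leans on one known fact: Alembic's default file template names files after a hexadecimal revision id, not a counter.

```python
import re
from pathlib import PurePosixPath

# Assumption: "manually numbered" means a short run of digits directly
# followed by an underscore, as in 001_add_users.py or 042_fix_index.py.
NUMBERED_PREFIX = re.compile(r"^\d{1,5}_")


def is_manually_numbered(path: str) -> bool:
    """Return True when the filename starts with a hand-written integer prefix."""
    return bool(NUMBERED_PREFIX.match(PurePosixPath(path).name))


def find_numbered_migrations(paths: list[str]) -> list[str]:
    """Filter staged paths down to migration files that look hand-numbered."""
    return [
        p
        for p in paths
        if "alembic/versions/" in p and p.endswith(".py") and is_manually_numbered(p)
    ]
```

A revision file such as 3f2a9c1b7d4e_add_users.py passes, because the digits are not immediately followed by an underscore.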

It is easy to dismiss this as boring infrastructure work until an agent-generated migration lands, passes superficial review, and turns a simple schema change into a revision-conflict cleanup exercise.

There is also a repo-level command for scanning everything:

just check-error-handling

The checker itself lives in scripts/check_error_handling_patterns.py, and it uses:

tree_sitter
tree_sitter_language_pack
ProcessPoolExecutor for parallel scanning
git diff --unified=0 when running in --changed-lines mode

That last part matters a lot: with --changed-lines, the pre-commit hook only reports findings whose start line was actually added or modified. That means we can enforce strict rules in a messy real-world codebase without making every unrelated commit impossible.
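To make the changed-lines idea concrete, here is a minimal sketch of recovering added line numbers from git diff --unified=0 output and using them to filter findings. It is not the code in scripts/check_error_handling_patterns.py. The function names and the (path, line, rule) tuple shape are assumptions, and paths that git quotes because of special characters are not handled.

```python
from __future__ import annotations

import re

# Hunk headers look like "@@ -10,2 +12,3 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def changed_lines(diff_text: str) -> dict[str, set[int]]:
    """Map each file in a unified=0 diff to its added or modified line numbers."""
    result: dict[str, set[int]] = {}
    current: str | None = None
    remaining = 0  # content lines still expected in the current hunk
    for line in diff_text.splitlines():
        if remaining > 0:
            # Skip hunk bodies so an added line whose text starts with "++ "
            # is not mistaken for a "+++ " file header.
            if not line.startswith("\\"):  # "\ No newline at end of file"
                remaining -= 1
            continue
        if line.startswith("+++ "):
            target = line[4:].strip()
            # "+++ /dev/null" marks a deleted file: nothing new to check.
            current = None if target == "/dev/null" else target.removeprefix("b/")
            if current is not None:
                result.setdefault(current, set())
        elif current is not None:
            match = HUNK_HEADER.match(line)
            if match:
                old_count = int(match.group(1)) if match.group(1) is not None else 1
                start = int(match.group(2))
                new_count = int(match.group(3)) if match.group(3) is not None else 1
                result[current].update(range(start, start + new_count))
                remaining = old_count + new_count
    return result


def filter_findings(findings, changed):
    """Keep only (path, line, rule) findings whose start line was touched."""
    return [f for f in findings if f[1] in changed.get(f[0], set())]
```

Filtering on the start line only is what lets an old, untouched violation a few lines away stay silent until someone edits it.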

Doing this in pre-commit instead of only in CI also shortens the AI loop. The agent writes code, the hook objects immediately, and you correct course before the bad pattern gets normalized into the branch.

Why tree-sitter instead of grep or a traditional AST

Because the patterns we care about are structural, but not in a way that makes plain text search or language-specific AST tooling especially pleasant.

Grep is too shallow. It can tell you that a file contains except Exception or .catch(, but it cannot tell you what happens inside the handler, whether there is a rollback, whether the code re-raises, or whether a suspicious-looking call is actually inside a recovery boundary. That is how you end up with tools that are either noisy enough to be ignored or weak enough to be useless.

A traditional compiler-style AST is better, but it is not always the most practical fit for this kind of guard. We are scanning Python, TypeScript, and TSX in one script. We want something fast, structurally aware, and tolerant enough to operate well in the messy reality of application code and staged diffs. Tree-sitter sits in a useful middle ground: it gives us real syntax trees and language-aware nodes without making us build separate heavyweight analyzers for every language in the repo.

More importantly, tree-sitter lets us ask nuanced questions about shape and context. Not just “does this token appear,” but things like:

is this HTTPException actually being raised here?
is this catch block empty, console-only, or routed through recovery?
is this database call sitting inside a try/except with rollback or savepoint containment?
is this frontend error read happening as a member-expression chain like error.response.data.detail?
is this code inside an allowed path, a test, a root render, or a reader-critical service?

That level of nuance is what makes a guard like this bearable in day-to-day use. Consider these examples:

try:
    await db.flush()
except Exception:
    logger.warning("best effort")

and:

try:
    await db.flush()
except Exception as exc:
    await db.rollback()
    raise

A regex sees except Exception in both. A syntax-aware checker can tell the difference between:

a broad catch that silently continues
a catch that actually restores transactional safety

Same on the frontend:

catch (error) {
  console.error(error)
}

vs.

catch (error) {
  reportError(normalizeApiError(error))
  throw error
}

Both contain catch. Only one is acceptable. Tree-sitter gives us syntax nodes, ancestry, bodies, member-expression chains, call names, and language-specific structure. That is enough to write rules that are useful, not noisy.

What the guard checks

When we first wrote down the backlog in the RCA, the initial tree-sitter guard snapshot looked like this:

Rule	Initial findings
`backend.direct_http_exception`	168
`backend.plain_exception_raise`	123
`backend.reader_db_operation_without_recovery`	85
`backend.broad_exception_handler`	81
`frontend.plain_error_throw`	77
`frontend.raw_error_message_read`	59
`backend.raw_exception_detail`	46
`frontend.swallowed_promise_rejection`	18
`frontend.direct_transport_detail`	8
`backend.db_operation_catch_continue`	8
`frontend.raw_api_fetch_without_client`	7
`backend.integrity_error_without_constraint_mapping`	6
`backend.reader_state_write_without_boundary`	6
`frontend.swallowed_catch`	3
`backend.sse_error_without_code`	2
`backend.unsupervised_background_task`	2

That was useful not just as an inventory, but as a way of seeing where the codebase had already started drifting. Some of those checks are straightforward policy checks: do not raise raw HTTPException, do not leak raw exception text, do not throw plain Error, do not parse error.response.data.detail in components, do not schedule unsupervised background work, do not emit SSE errors without stable codes.

Those matter, and they do useful cleanup work, but they are conceptually simple. The more interesting ones are the checks that depend on structural context: the ones where tree-sitter is not just matching a bad token, but reasoning about surrounding control flow, exception boundaries, transaction recovery, member-expression chains, and product-specific critical paths.

So rather than narrate every rule in the table one by one, I wanted to focus on the smaller set that needed more clever tree-sitter logic.

Today, that table is all zeros.

SQL and recovery rules

This is where the checker becomes more specific.

A lot of outages are not caused by throwing exceptions. They are caused by recovering from them incorrectly.

1. Catch-and-Continue After Database Operations

backend.db_operation_catch_continue looks for broad except blocks after SQLAlchemy operations such as:

execute()
flush()
commit()
scalar()
scalars()
add()
delete()
merge()
refresh()

and flags cases that continue without:

rollback
re-raise
savepoint containment

Bad:

try:
    await db.flush()
except Exception:
    logger.warning("ignoring")

After a SQLAlchemy failure, the transaction is often no longer safe to keep using.

2. SQLAlchemy Exceptions Without Recovery Boundaries

backend.sqlalchemy_exception_without_boundary is the narrower companion to the previous rule. It specifically targets SQLAlchemy exception classes like:

IntegrityError
DBAPIError
SQLAlchemyError
OperationalError
ProgrammingError

If you catch one of these and keep going, the checker wants to see rollback, re-raise, or savepoint containment.

3. Integrity Errors Without Constraint Mapping

backend.integrity_error_without_constraint_mapping flags IntegrityError handlers that do not branch on constraint name or code. Not all integrity failures are the same. Some are expected, recoverable, and user-facing. Some are real bugs. Treating them all alike creates terrible UX and terrible debugging.

Bad:

except IntegrityError:
    raise ConflictError()

Better:

except IntegrityError as exc:
    if is_unique_violation(exc, "uq_reading_progress_user_book"):
        ...

4. Background Database Errors Without Rollback

backend.background_db_error_without_rollback checks worker-style error handlers that commit failure state without rolling back or re-raising first. Writing “the job failed” using a poisoned transaction is not the recovery story people think it is.

Reader-critical recovery rules#

These are the rules most tightly tied to our product invariant.

The checker has a hard-coded set of reader-critical paths, including places like:

backend/app/api/v1/reading.py
backend/app/api/v1/sync.py
backend/app/services/reading_service.py
backend/app/services/sync_service.py
backend/app/services/file_hash_sync_service.py

It also knows about reader-state models such as:

ReadingSession
ReadingProgress
ReadingEvent
RawClientEvent
DeferredReadingEvent
SyncEvent

5. Reader-Critical Database Operations Without Recovery

backend.reader_db_operation_without_recovery flags SQLAlchemy mutations like flush, commit, delete, or merge inside reader-critical paths when there is no visible recovery boundary.

The checker accepts patterns such as:

rollback
re-raise
savepoint usage
named recovery helpers containing terms like safe, recover, ensure, defer, or savepoint

In reading-critical code, silent transactional coupling is unacceptable.
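A simplified version of this rule can be sketched as follows. It uses Python's ast module in place of tree-sitter and works at whole-function granularity, both of which are assumptions made to keep the example short. The path set repeats only the examples listed above, and begin_nested is treated as the savepoint signal because that is SQLAlchemy's savepoint API.

```python
import ast

# Only the example paths named in this post; the real set may be larger.
READER_CRITICAL_PATHS = {
    "backend/app/api/v1/reading.py",
    "backend/app/api/v1/sync.py",
    "backend/app/services/reading_service.py",
    "backend/app/services/sync_service.py",
    "backend/app/services/file_hash_sync_service.py",
}
MUTATIONS = {"flush", "commit", "delete", "merge"}
RECOVERY_TERMS = ("safe", "recover", "ensure", "defer", "savepoint")


def _call_names(func: ast.AST) -> set[str]:
    """Collect the names of everything called inside a function."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
            elif isinstance(node.func, ast.Name):
                names.add(node.func.id)
    return names


def _has_visible_recovery(func: ast.AST, names: set[str]) -> bool:
    if names & {"rollback", "begin_nested"}:
        return True
    if any(term in name.lower() for name in names for term in RECOVERY_TERMS):
        return True
    # A raise inside an except handler counts as re-raising.
    return any(
        isinstance(inner, ast.Raise)
        for node in ast.walk(func)
        if isinstance(node, ast.ExceptHandler)
        for inner in ast.walk(node)
    )


def check_reader_critical(path: str, source: str) -> list[str]:
    """Return names of functions that mutate the session with no visible recovery."""
    if path not in READER_CRITICAL_PATHS:
        return []
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = _call_names(node)
            if names & MUTATIONS and not _has_visible_recovery(node, names):
                findings.append(node.name)
    return sorted(findings)
```

Matching helper names by substring is crude, but it gives the codebase a cheap convention: name the recovery wrapper honestly and the guard recognizes it.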

6. Reader State Writes Without Isolation

backend.reader_state_write_without_boundary flags writes involving reader session/progress/event models unless they are isolated by savepoint or recovery helper. Optional side effects must not break the act of reading itself.

This rule is more product-specific than most static analysis, and that is why it matters. We are not just enforcing generic cleanliness here. We are encoding a business invariant directly into the commit path.
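The ancestry part of this rule is the interesting bit, and it can be illustrated without tree-sitter. The sketch below uses Python's ast module as a stand-in, builds a child-to-parent map, and walks upward from each reader-state write looking for a with block opened by begin_nested() (SQLAlchemy's savepoint call). Recovery helpers are left out, and the function names are assumptions.

```python
import ast

READER_STATE_MODELS = {
    "ReadingSession", "ReadingProgress", "ReadingEvent",
    "RawClientEvent", "DeferredReadingEvent", "SyncEvent",
}


def _writes_reader_state(call: ast.Call) -> bool:
    """True for calls shaped like db.add(ReadingProgress(...))."""
    if not (isinstance(call.func, ast.Attribute) and call.func.attr in {"add", "merge", "delete"}):
        return False
    for arg in call.args:
        for node in ast.walk(arg):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in READER_STATE_MODELS
            ):
                return True
    return False


def _is_savepoint_block(node: ast.AST) -> bool:
    """True for `with x.begin_nested():` and its async form."""
    if not isinstance(node, (ast.With, ast.AsyncWith)):
        return False
    return any(
        isinstance(item.context_expr, ast.Call)
        and isinstance(item.context_expr.func, ast.Attribute)
        and item.context_expr.func.attr == "begin_nested"
        for item in node.items
    )


def find_unisolated_writes(source: str) -> list[int]:
    """Return line numbers of reader-state writes with no savepoint ancestor."""
    tree = ast.parse(source)
    parents = {
        child: parent
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    }
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and _writes_reader_state(node):
            ancestor = parents.get(node)
            while ancestor is not None and not _is_savepoint_block(ancestor):
                ancestor = parents.get(ancestor)
            if ancestor is None:  # reached the module root without isolation
                findings.append(node.lineno)
    return sorted(findings)
```

This only recognizes models constructed inline in the call. A write through a previously built variable would need data-flow tracking that a syntax-level guard does not attempt.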

Frontend rules#

The frontend side of the checker is not there for symmetry. Error-handling drift in TS code is how carefully structured backend errors get turned back into stringly typed chaos.

Some frontend rules are deliberately simple and broad: use an error boundary at the root, route API calls through the shared client, do not read raw error.message, do not parse transport payloads inside components, and do not throw plain Error in application code.

The more interesting frontend rules are the ones that lean on syntax shape rather than just banned strings.

7. Direct Reads of Transport Error Details

frontend.direct_transport_detail flags reads of member-expression chains like error.response.data.detail. That sounds narrow, but it is exactly the kind of normalization leak that spreads if you do not stop it early.

Bad:

const message = error.response.data.detail;

8. Swallowed Promise Rejections and Catch Blocks

frontend.swallowed_promise_rejection and frontend.swallowed_catch look for promise catches and try/catch blocks whose bodies are empty or effectively empty, usually because they only log to the console.

Bad:

doSomething().catch((error) => {
  console.error(error);
});

This is the JavaScript version of except: pass. The interesting part is that the checker is not simply searching for .catch(. It walks into the handler body and asks whether anything meaningful happened.

The surprisingly important suppression mechanism

No static analysis rule survives contact with reality unless it has an escape hatch.

Ours is intentionally explicit:

merrilin-error-guard: allow <reason>

You put that on the offending line or enclosing block when the exception is intentional and documented.

That gives us three benefits:

We can keep the rules strict.
We can make exceptions visible in code review.
We avoid the classic “turn the tool off because it is annoying” outcome.

A suppression is not free. It has to explain itself.
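Requiring a reason is easy to enforce mechanically. The following is a minimal sketch of parsing the marker, under the assumption that it appears inside an ordinary comment on the flagged line. The regex and function name are illustrative, and the enclosing-block case mentioned above is not covered.

```python
from __future__ import annotations

import re

# The marker text comes from this post; everything after "allow" is the reason.
SUPPRESSION = re.compile(r"merrilin-error-guard:\s*allow\b[ \t]*(?P<reason>.*)")


def suppression_reason(line: str) -> str | None:
    """Return the stated reason, or None when the marker is absent or has no reason."""
    match = SUPPRESSION.search(line)
    if match is None:
        return None
    reason = match.group("reason").strip()
    # Tolerate a block-comment closer at the end, as in /* ... */.
    reason = reason.removesuffix("*/").strip()
    return reason or None
```

Returning None for a bare marker means an unexplained suppression is treated exactly like no suppression at all.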

Why the rest of the pre-commit stack matters too

It is tempting to talk about a tool like this as if one clever tree-sitter script solved the whole AI-code-quality problem.

It did not.

What actually helps is layering constraints:

hygiene hooks catch low-signal junk
formatters collapse bikeshedding and make diffs smaller
linters catch conventional mistakes
migration hooks protect operational correctness and stop agents from hand-authoring fake revision sequences
commit-message rules keep history readable
the tree-sitter guard enforces product- and architecture-specific reliability rules

That layered setup is especially useful with AI agents because they optimize for local completion. A pre-commit stack gives your repository a memory of its own.

Performance details

Static analysis tools love to become CI folklore because they are too slow to run locally.

I care a lot about being able to run this kind of tooling on machines I control, whether that is my MacBook or one of the Linux VMs on my Proxmox server where I do a lot of work. Once a tool only really works as a hosted service, or only feels tolerable when somebody else’s infrastructure is carrying it, you are beholden to a system you do not fully control.

We tried not to do that.

The checker:

parallelizes file scanning with ProcessPoolExecutor
shows rich progress output during full scans
supports summary-only output
limits pre-commit checks to changed lines

So the same codebase can support two very different modes:

fast local enforcement during git commit
wide repo visibility when running just check-error-handling

That split matters to me for reasons beyond speed. Local tools are composable, scriptable, and durable in a way service-dependent workflows often are not. If I can run the guard in a checkout on my laptop, or on a VM somewhere in my own stack, then it stays part of my development environment instead of turning into yet another external dependency with an invoice attached.

Why this matters for AI agents

This whole system became more valuable the moment AI started writing large volumes of plausible code.

That is the strange emotional split of this moment in software: part amazement, part cleanup.

LLMs are very good at producing code that looks reasonable in a diff.

They are much worse at consistently preserving your team’s implicit reliability contracts.

An AI assistant will happily generate:

broad catches
raw exception interpolation
direct frontend transport parsing
fire-and-forget background tasks
throw new Error(...)
SQL exception handlers that log and continue

None of those are obviously broken in isolation. They just slowly destroy system coherence.

That is why I think the real future of AI-assisted coding is not “the model gets smarter”.

It is:

teams encoding their hard-won engineering constraints as executable local policy that AI agents repeatedly bounce off.

Not a style guide. Not a wiki. Not a review comment. A machine-enforced boundary.

And the boundary keeps moving. This checker is not a finished taxonomy that we expect to admire from a distance. It evolves with the codebase. Every time we notice a new recurring smell in agent-written code, or a new recovery failure mode that review keeps catching too late, we add another rule or tighten an existing one. In practice that means the guard becomes part lint, part incident memory, and part steering mechanism. The agents do not need to understand the whole history of why a pattern is bad; they just need to run into the boundary often enough that they stop taking that path.

What this approach gets right

It is opinionated in exactly the right place.

We are not trying to statically prove program correctness. We are not building a universal linter. We are not banning exceptions, SQLAlchemy, or fetch out of aesthetic preference.

We are taking a handful of outage-shaped lessons and making them impossible to reintroduce casually.

That is a much more realistic use of static analysis.

If you want to build something similar

A few practical lessons from this one:

Start from failures, not taste. Every good rule here corresponds to a real class of breakage or centralization pain.
Encode product invariants, not just language rules. backend.reader_state_write_without_boundary is useful because it knows what our app must never break.
Allow explicit suppressions. Strict with escape hatches beats brittle absolutism.
Enforce on changed lines first. This is the easiest way to land serious policy in a non-pristine repo.
Use syntax trees when structure matters. Regexes are fine until they are not. Error handling crosses that line quickly.

The bigger point

People often talk about reliability work as if it only starts in production: retries, alerts, dashboards, tracing, incident response.

That is too late.

A lot of reliability is decided the moment someone—or something—types except Exception: and hits commit.

If you know what kinds of patterns keep hurting your system, there is enormous leverage in rejecting them before they merge.

That is what this checker does for us.

It is not glamorous. It will never demo as well as a new feature. It may, however, protect more user trust than most features do.

This is not about stopping the use of AI agents.

It is about stopping the same failure modes from slipping back into the codebase, one polite-looking diff at a time.