You configured deny rules. You went to bed. Prod is gone. This is not a story about a rogue agent. This is a story about a receipt that lied to you.
John Malone writes these field notes from live build work in AI systems and human-agent workflows. Receipts: GitHub · LinkedIn
You are running Claude as an agent on your codebase. You are not naive. You read the docs. You configured deny rules — the mechanism Claude Code provides to tell the agent “never run this.” You added git clean to your deny list because you know what git clean -fd does to a working tree. You had a reason. You set the rule. You trusted the rule.
What you did not know is that the deny rule evaluator reads the first token of a Bash command.
If the command is git clean -fd, it fires.
If the command is git fetch && git pull && git clean -fd — which is how Claude normally writes efficient shell chains — the evaluator reads git fetch, matches nothing in your deny list, and allows the full expression. git clean runs. Your worktree is gone.
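The gap is easy to sketch. Here is a minimal Python illustration — not Anthropic's actual evaluator, just an assumed shape where a deny rule is matched against the leading tokens of the command string:

```python
import shlex

DENY_RULES = [("git", "clean")]  # hypothetical deny list entry

def first_tokens_match(command: str) -> bool:
    """Naive evaluator: only the head of the command string is inspected."""
    tokens = shlex.split(command)
    return any(tuple(tokens[: len(rule)]) == rule for rule in DENY_RULES)

# The bare command is caught:
print(first_tokens_match("git clean -fd"))                           # True
# The chained form Claude normally writes slips straight through:
print(first_tokens_match("git fetch && git pull && git clean -fd"))  # False
```

Everything after the first `&&` is invisible to a matcher like this, which is the whole bug.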
No adversarial input. No attacker. Claude did exactly what it always does.
You had receipts. The receipt was wrong. That is worse than no receipt — it is a vibe with formatting.
The report and fix
John documented the mechanism and shipped a working PreToolUse hook that fixes it: PR #36645, 573 lines, 34 tests, all passing. The report also referenced issue #31523, filed independently two weeks earlier by a separate reporter who hit the allow side of the same root cause. Two reporters. Same parser. Opposite sides of the same gap.
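The PR itself is 573 lines with 34 tests; the core idea fits in far less. A hedged sketch, not the PR's code: it assumes the PreToolUse event shape (`tool_name`, `tool_input.command`) and the exit-code-2 blocking convention described in the Claude Code hooks documentation, and the splitting regex and `DENY` rule are illustrative:

```python
import json
import re
import sys

DENY = [("git", "clean")]  # illustrative rule, matched per segment

def segments(command: str) -> list[str]:
    # Split on shell control operators so every simple command in a
    # chain gets checked, not just the first one.
    return [s for s in re.split(r"&&|\|\||;|\||\n", command) if s.strip()]

def denied(command: str) -> bool:
    for seg in segments(command):
        tokens = seg.split()
        if any(tuple(tokens[: len(rule)]) == rule for rule in DENY):
            return True
    return False

def hook_entry(stream=sys.stdin) -> int:
    """Reads the tool-call event JSON from stdin; returning 2 blocks the call."""
    event = json.load(stream)
    if event.get("tool_name") == "Bash":
        cmd = event.get("tool_input", {}).get("command", "")
        if denied(cmd):
            print(f"deny rule matched inside compound command: {cmd}", file=sys.stderr)
            return 2
    return 0

# Installed as a hook script, the entry point would be: sys.exit(hook_entry())
```

Registered as a PreToolUse hook, this runs before each Bash call, so `git fetch && git clean -fd` is blocked on its third segment instead of sailing past a first-token check.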
Here is the response, verbatim:
“Claude Code’s deny rules are not designed as a security barrier against adversarial command construction. They are a convenience mechanism to constrain well-intentioned agent actions… If you need a hard security boundary around command execution, that should be enforced at the OS or sandbox layer, not via deny rules.”
This response has two problems.
The first problem is that it answered a different question than the report asked. This was not an adversarial bypass report. The report described Claude’s normal behavior — chaining commands the way it always chains commands — producing compound expressions that the deny rule evaluator never fully checks. No attacker. No crafted input. Just the agent doing its job.
The second problem is more interesting. The response anchored immediately on rm -rf — used in the filing as shorthand for “the worst thing that can happen,” the way programmers often do. The triage system read rm -rf, mapped it to “destructive filesystem operation,” mapped that to “OS-layer concern,” and closed the report. It never got to git clean. git clean is not blocked at the OS layer. It runs as you, on your files, with your permissions, because you gave the agent access to your repo. The OS has nothing to say about it.
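The point is concrete enough to demonstrate. A throwaway-repo sketch (assumes `git` on PATH) shows that nothing below the process needs to cooperate — no elevation, no OS boundary crossed:

```python
import pathlib
import subprocess
import tempfile

# A throwaway repo with one untracked file standing in for uncommitted work.
repo = pathlib.Path(tempfile.mkdtemp())
subprocess.run(["git", "init", "-q", str(repo)], check=True)
(repo / "notes.txt").write_text("uncommitted work\n")

# git clean runs as you, on your files, with your permissions.
subprocess.run(["git", "clean", "-fdq"], cwd=repo, check=True)
print((repo / "notes.txt").exists())  # False -> the file is simply gone
```

No syscall here looks dangerous to the OS. The only layer that could have said no was the deny rule evaluator.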
An LLM answered a bug report and got semantically stuck on one token. It reasoned confidently from the wrong frame and never corrected. You have read about this failure mode. You are reading about it right now, in the wild, on Anthropic’s own VDP system, in a response about Anthropic’s own agent tooling.
The mechanism is the prod disaster
The horror stories circulating about Claude agents destroying prod databases, wiping working trees, dropping tables — I do not know which specific ones trace directly to this parser gap. But I know the mechanism is real, documented, and reproducible. And I know the failure mode it produces is not “rogue agent.” It is “user had guardrails, guardrails silently failed, agent ran normally, damage was done.”
That distinction matters because it changes what the fix is. A rogue agent problem is a model alignment problem. A silent guardrail failure is an engineering problem with a known solution — parse the whole command, not just the first token. The fix exists. It was submitted. The report was closed.
If you are running Claude agents against anything you cannot afford to lose, your deny rules are not doing what you think they are doing. Not because you configured them wrong. Because the evaluator has a gap that has been reported, reproduced, and left open.
No attacker required. The agent wipes your worktree doing its job.
The triage failure is the same as the agent failure
There is one more thing worth naming. The triage system that closed this report got semantically stuck on rm -rf and never processed the actual mechanism. That is the same failure mode described in Post #5: heuristics that fire on familiar patterns, missing the LLM-specific failure entirely. And it is the same one described in Post #3: expert-level confidence applied to a novel distribution of failures.
If you are going to have an LLM agent triage security reports, you need to understand what that model’s blind spots are. Salient tokens anchor the response. The rest of the report gets evaluated through that anchor. A report that says rm -rf and means git clean will be closed as an OS-layer concern. The model does not ask “what is the actual mechanism?” It pattern-matches to the loudest token and reasons forward from there.
Anthropic shipped VDP triage at scale. The throughput is real. The cost is that any report with a misleading surface token — even one that uses standard programmer shorthand — risks being closed before the actual claim is evaluated. That is not a criticism unique to Anthropic. It is a constraint that applies to any LLM-assisted triage system running at volume. The lesson is not “don’t use LLMs for triage.” The lesson is: understand the failure mode before you trust the output.