Lockbox lets you relax your Claude Code permissions without the risk. It tracks when untrusted data enters your session and structurally blocks dangerous follow-up actions, so you can approve more freely and get interrupted less. It is not a replacement for all permission checks, but it catches what you would miss after clicking allow for the 85th time. If you were previously running with --dangerously-skip-permissions, Lockbox gives you that same smooth flow back without the exposure. Install it from my skills marketplace and use Claude Code normally.
Lockbox ships as a Claude Code plugin, but the approach adapts to any agentic coding tool. I am already working on a port for my NanoClaw installation, and it should also work with OpenClaw (PRs gratefully accepted!).
The problem
You are using Claude Code. Your agent fetches a page to check an API reference. That page contains hidden instructions telling it to email your SSH keys somewhere. It asks permission to run a Bash command. You have approved 85 commands today. The prompt looks like all the others. You click allow.
Simon Willison has been documenting real exploits of this exact pattern across Microsoft 365 Copilot, GitHub MCP, Slack AI, Google NotebookLM, and many others.1 The mechanism is always the same: an agent reads untrusted content, processes hidden instructions, and takes an action the user did not intend. Willison calls the combination of private data, untrusted content, and external communication the “lethal trifecta”. When all three coexist in a single session, as they routinely do in Claude Code, you have a data exfiltration system.

A joint paper on adaptive attacks achieved success rates above 90% against published prompt injection defences.2 Bypassing guardrails is so easy that Sander Schulhoff argues most people should not bother with them. I explored this problem in AI Guardrails Do Not Work (Yet), and the pattern keeps recurring because the problem is architectural: humans cannot maintain vigilance across hundreds of identical prompts, and asking them to is a design failure.
Before an agent sends an email or pushes code, it asks you for approval. You are deep in a coding session. Your agent asks to read a file. Allow. Run a test. Allow. Install a dependency. Allow. Edit three files. Allow, allow, allow. By the time you have approved 85 correct actions, you have trained yourself to click allow without reading. The 86th prompt looks identical to the first 85, and it is the one that exfiltrates your credentials.
Prior art
Simon Willison proposed the Dual LLM pattern in April 2023.3 Separate your system into a privileged LLM that can take actions and a quarantined LLM that processes untrusted content without tool access. A non-LLM controller manages their interaction using variable tokens, so the privileged model never sees raw untrusted text. But the quarantined LLM’s output still has to reach the privileged one somehow, and that handoff relies on the quarantined LLM’s judgment not to pass malicious content through: a probabilistic defence in a system that needs a deterministic one.
Google DeepMind’s CaMeL framework built on this in March 2025.4 It treats prompt injection as a privilege escalation problem. CaMeL explicitly separates control flow from data flow: a privileged LLM writes plans as code from trusted requests only, while a quarantined LLM parses untrusted content into structured fields without tool access. A custom Python interpreter tracks data provenance and enforces fine-grained security policies at execution time. On the AgentDojo benchmark, CaMeL achieved 77% task completion with provable security, compared to 84% for an undefended agent. In complex domains the drop is steeper.
A Design Patterns paper from June 2025 systematised these approaches into six reusable patterns.2 The most relevant to Lockbox are plan-then-execute (separate planning from execution so injections during execution cannot alter the plan), LLM map-reduce (process untrusted sources in isolated instances with no tool access), and context minimisation (prevent earlier untrusted instructions from lingering in memory). Their guiding principle: “once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions.”
Lock tracking
Instead of asking the agent whether it has been compromised (it cannot reliably tell you), Lockbox implements lock-aware context quarantine: it tracks what the agent has been exposed to and restricts what it can do next.
Every tool and Bash command falls into one of four categories:
Safe tools are local operations that neither read external data nor take external action (e.g. file reads, writes, edits, searches, git status). These always work, locked or not.
Unsafe tools read external data (e.g. Perplexity, curl). These are allowed, but running any of them marks the session as locked.
Acting tools take external action (e.g. git push, ssh, npm publish, sending messages). These are blocked when the session is locked.
Unsafe acting tools do both. WebFetch, for example, both reads external data and can be used to exfiltrate via URL parameters. These lock the session on first use and are blocked on subsequent use, preventing a read-then-act cycle in a single command.
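The four categories reduce to a small decision table: does the call run, and does it lock the session? A minimal sketch of that logic (names are illustrative, not Lockbox’s actual code):

```python
from enum import Enum

class Category(Enum):
    SAFE = "safe"                  # local only: always runs
    UNSAFE = "unsafe"              # reads external data: runs, locks session
    ACTING = "acting"              # takes external action: blocked when locked
    UNSAFE_ACTING = "unsafe_acting"  # does both: locks on first use, blocked after

def decide(category: Category, locked: bool) -> tuple[bool, bool]:
    """Return (allowed, locks_session) for one tool call."""
    if category is Category.SAFE:
        return True, False
    if category is Category.UNSAFE:
        return True, True
    if category is Category.ACTING:
        return not locked, False
    # UNSAFE_ACTING: allowed only in a clean session, and it taints that session
    return not locked, not locked
```

Note the last case: a single unsafe acting tool call in a clean session both succeeds and locks, so the very next one is blocked, which is what prevents the read-then-act cycle.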

Detection happens at the harness level through a PreToolUse hook that fires before every tool call. The hook checks session state stored in /tmp/ and blocks the tool before it executes. The agent never gets a chance to run a blocked action. The environment polices the agent, not the agent itself.
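A hook of this shape could be sketched as follows, assuming Claude Code’s hook interface: the call details arrive as JSON on stdin, and a blocking exit code stops the tool with the stderr message shown to the agent. The tool sets and state-file path here are illustrative, not Lockbox’s actual classification:

```python
#!/usr/bin/env python3
"""Sketch of a lock-enforcing PreToolUse hook (illustrative, not Lockbox's code)."""
import json
import sys
from pathlib import Path

UNSAFE = {"WebSearch"}      # reads external data: locks the session
ACTING = {"Bash:git push"}  # takes external action: blocked when locked

def run_hook(tool: str, state: Path) -> int:
    """Return the hook's exit code: 0 allows the tool, 2 blocks it."""
    locked = state.exists()
    if tool in ACTING and locked:
        # The blocking exit code stops the tool before it ever executes;
        # this message is surfaced to the agent.
        print(f"Lockbox: session is locked; {tool} blocked.", file=sys.stderr)
        return 2
    if tool in UNSAFE:
        state.touch()       # untrusted data entered: mark the session locked
    return 0

if __name__ == "__main__":
    raw = sys.stdin.read()  # hook payload arrives as JSON on stdin
    if raw.strip():
        call = json.loads(raw)
        state = Path(f"/tmp/lockbox-{call.get('session_id', 'default')}")
        sys.exit(run_hook(call.get("tool_name", ""), state))
```

Keying the state file on the session id is what gives a delegate sub-agent its own clean lock state while the parent stays locked.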
Lockbox blocks a git push after tainted data entered the session via a sub-agent. The agent acknowledges the block and offers to delegate.
The escape hatch
When Lockbox blocks an action, the agent stops and tells you what happened. If you ask it to proceed, it spawns a delegate sub-agent — a clean agent that runs outside the locked session’s context with independent Lockbox state. You review the delegate’s instructions before it runs, and once it finishes the parent session stays locked. There is no unlock mechanism. You delegate for each external action that gets blocked, or start a new session.
The delegate starts clean, executes the concrete action, and its taint is discarded when it finishes. The locked session never gets a chance to influence the external action directly — the delegate works from a specific instruction you reviewed, not from a conversation that may contain adversarial content. This implements the dual LLM pattern from the research, adapted for a single-user CLI workflow: the locked session is the quarantined context, the delegate is the privileged executor.
Lockbox also hints at how to avoid locking in the first place, for example by deferring untrusted fetches until after external actions are complete. But when locking does happen, the delegation flow gives you a single focused review point instead of hundreds of identical permission prompts.
If delegation keeps failing, Lockbox suggests plan mode as a last resort. The agent writes out exactly what it wants to do, you exit plan mode and select “Clear context and bypass permissions” in Claude Code, which destroys the locked conversation and starts a clean agent executing from your plan. This is heavier — you lose your current thread of context — so delegation is always the first option.
Relax your permissions
With Lockbox running, you can approve every WebFetch without reading the prompt. The session locks automatically when external data enters, and dangerous follow-up actions are structurally blocked. Otherwise you either block WebFetch entirely, crippling your agent, or approve each one manually and hope you catch the malicious page among dozens of legitimate ones.
Three layers
Lockbox ships sensible defaults, but every team uses different tools. Configuration uses a three-layer hierarchy:
Plugin defaults ship with Lockbox. They classify common tools (Read, Write, Grep as safe; WebFetch as unsafe acting; git push, ssh as acting) and provide a conservative fallback: unknown tools default to acting, which means they are blocked when locked.
User overrides live at ~/.claude/lockbox.json. These apply across all your projects. Add your custom tools, reclassify things the defaults get wrong, remove patterns you disagree with.
Project overrides live at .claude/lockbox.json in your repo. These are project-specific and committable to version control, so your whole team gets the same classifications.
Later layers override earlier ones. Within each category, new patterns prepend to the list (checked first, higher priority). Patterns prefixed with ! remove entries from the base list. You never need to edit plugin files. Your overrides compose cleanly on top.
{
  "bash_patterns": {
    "safe": ["mytool\\s+(list|get|status)"],
    "acting": ["mytool\\s+(deploy|rollback|send)"]
  },
  "mcp_tools": {
    "mcp__slack__post_message": "acting"
  }
}
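The prepend-and-remove composition of layers can be sketched in a few lines. This is a guess at the semantics described above, not Lockbox’s actual merge function:

```python
def merge_patterns(base: dict[str, list[str]],
                   override: dict[str, list[str]]) -> dict[str, list[str]]:
    """Compose one override layer onto a base classification.

    New patterns are prepended, so they are checked before the base list;
    entries prefixed with '!' remove the matching pattern from the base.
    """
    merged = {cat: list(pats) for cat, pats in base.items()}
    for cat, pats in override.items():
        current = merged.get(cat, [])
        removals = {p[1:] for p in pats if p.startswith("!")}
        additions = [p for p in pats if not p.startswith("!")]
        merged[cat] = additions + [p for p in current if p not in removals]
    return merged
```

Applying the user layer and then the project layer in sequence gives the “later layers override earlier ones” behaviour: a project’s `!` entry can strip a pattern that either lower layer introduced, and its additions are checked first.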
Limitations
Dangerous mode. Lockbox’s hooks still fire with --dangerously-skip-permissions, so the session locks normally when external data enters. But delegation requires user approval — that is the single security gate — and dangerous mode would auto-approve it, removing the only protection. Lockbox detects this and disables delegation entirely, leaving you locked with no escape hatch.
The alternative is acceptEdits mode with permissive allow lists. You get nearly the same frictionless experience, but Lockbox’s delegate approval still requires your explicit review. That one gate is what makes the whole model work.
Session transcript taint. Claude Code session transcripts (.jsonl files) may contain tainted data from previous sessions. Sub-agents and plan mode can read these files, reintroducing prompt injection content. Lockbox classifies reading these files as unsafe, locking the session just as if you had fetched an untrusted web page.
Try it
Lockbox is early and actively developed. If something gets blocked that should not have, or something gets through that should not have, open an issue. Every workflow surfaces patterns the defaults do not cover, and your feedback makes the classifications better for everyone. I would love to hear how it works for you.