Local Agent Safety Framework


Michael Hunley


Problem Statement

We want to grant AI agents enough agency to work on our behalf without constant human supervision, while ensuring they do no harm and provide complete, auditable visibility into what they did.

Today’s local AI agent systems — including development-focused systems like Claude and Codex — generally expose only three coarse permission levels: sandbox, folder, and full agent.

  • Sandbox requires user verification for any action with outbound effects, which prevents agents from doing meaningful work beyond search and research.
  • Folder-level access allows full autonomy within a directory tree, but provides no semantic separation by role or task.
  • Full agent mode grants broad access to the file system or network storage, which is often excessive and risky.

Claude supports more granular folder permissions, but this quickly becomes onerous to manage and does not translate cleanly into distinct operational contexts based on agent role or domain of work. The result is a tradeoff between safety and usefulness that limits adoption.

The core issue is not lack of isolation primitives — we already have those — but lack of a human-legible delegation and review model.

Constitutional AI as a Behavioral Layer

Constitutional AI offers a promising mechanism to define the rules under which each agent operates. These rules function as guiding principles for behavior and are already a core part of how Anthropic models are trained via feedback loops.

In practice, this could allow explicit definitions of:

  • What an agent may do without consent
  • What requires human approval
  • What is strictly off-limits

Rule templates could be shared across agents and use cases to simplify setup and crowdsource best practices, reducing cognitive and operational overhead for users.
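
To make this concrete, a shareable rule template could be as simple as a small data structure. The sketch below is purely illustrative: the `AgentPolicy` class, field names, and action strings are hypothetical, not part of any existing framework.

```python
from dataclasses import dataclass, field

# Hypothetical policy schema; field names and action-string syntax are illustrative only.
@dataclass
class AgentPolicy:
    role: str
    allowed: list[str] = field(default_factory=list)          # actions permitted without consent
    needs_approval: list[str] = field(default_factory=list)   # actions requiring human sign-off
    forbidden: list[str] = field(default_factory=list)        # actions that are never allowed

# A shareable template for a hypothetical "docs writer" agent.
docs_writer = AgentPolicy(
    role="docs-writer",
    allowed=["read:src/**", "write:docs/**"],
    needs_approval=["write:README.md", "network:fetch"],
    forbidden=["delete:**", "network:post", "exec:shell"],
)
```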

I believe this is a strong long-term solution, but it has prerequisites to be effective:

  • Direct support in agent frameworks, not just model prompts
  • Monitoring and enforcement, to detect when agents stray from their defined rules
  • Hardened safeguards, to prevent accidental harm and provide certainty

Without mechanical enforcement and observability, trust will remain fragile regardless of how well-written the rules are.
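
As one sketch of what mechanical enforcement might look like, a framework could gate every proposed action through the policy before it executes. The `check_action` helper below assumes the hypothetical `AgentPolicy` shape above and simple glob-style matching; it is an illustration, not a prescribed mechanism.

```python
from fnmatch import fnmatch

def check_action(policy, action: str) -> str:
    """Return 'deny', 'ask', or 'allow' for a proposed agent action.

    Deny rules are checked first so they always win, then approval rules,
    then the allow list; anything unmatched is denied by default.
    """
    def matches(patterns):
        return any(fnmatch(action, p) for p in patterns)

    if matches(policy.forbidden):
        return "deny"
    if matches(policy.needs_approval):
        return "ask"
    if matches(policy.allowed):
        return "allow"
    return "deny"  # default-deny keeps mistakes on the safe side

# check_action(docs_writer, "write:docs/guide.md") -> "allow"
# check_action(docs_writer, "exec:shell")          -> "deny"
```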

Virtual File System Version Control as the Control Plane

We have decades of experience using version control systems to:

  • Prevent irreversible or harmful changes
  • Trigger validation and alerting on modifications
  • Provide observability into what changed (complete for text, partial for binary assets)

Applying this pattern to AI agents suggests a powerful approach: a virtual file system (VFS) that acts as the agent’s complete working environment, rather than a static permission boundary.

This does not replace existing sandboxing or filesystem isolation technologies. It builds directly on top of proven primitives such as OverlayFS, sandboxfs, and copy-on-write filesystems that already provide efficient isolation and sparse duplication.
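
On Linux, for example, the per-agent working environment can literally be an OverlayFS mount over the project tree. The sketch below shows the idea; the directory layout and helper name are placeholders, and a real implementation would need privilege handling, error handling, and unmounting.

```python
import subprocess
from pathlib import Path

def mount_agent_overlay(project: str, agent_dir: str) -> str:
    """Mount a copy-on-write view of `project` for one agent.

    The agent sees the full project at the merged mount point, but all of
    its writes land in its private upper layer, leaving the original tree
    untouched. Requires root (or a user namespace) on Linux.
    """
    upper = Path(agent_dir, "upper")    # the agent's private writes
    work = Path(agent_dir, "work")      # OverlayFS scratch space
    merged = Path(agent_dir, "merged")  # the view handed to the agent
    for p in (upper, work, merged):
        p.mkdir(parents=True, exist_ok=True)

    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={project},upperdir={upper},workdir={work}",
         str(merged)],
        check=True,
    )
    return str(merged)
```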

The novel contribution is elevating that layer into a first-class agent control plane.

Key properties:

  • The VFS is sparse: only modified files are duplicated
  • Changes can be stored as diffs or snapshots to minimize storage overhead
  • Every agent action results in a coherent, reviewable changeset (see the sketch after this list)
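
Because the agent only ever writes to its upper layer, recovering that changeset is a matter of walking the layer. A rough sketch, reusing the directory layout from the mount example above:

```python
import os
import stat
from pathlib import Path

def collect_changeset(upper: str) -> dict[str, list[str]]:
    """Walk the agent's upper layer and group its changes.

    In OverlayFS, files the agent created or modified appear as regular
    files in the upper layer, while deletions show up as 0/0 character
    device "whiteout" entries. Files the agent never touched are never
    duplicated at all.
    """
    changes: dict[str, list[str]] = {"added_or_modified": [], "deleted": []}
    for root, _dirs, files in os.walk(upper):
        for name in files:
            path = Path(root, name)
            rel = str(path.relative_to(upper))
            st = path.lstat()
            if stat.S_ISCHR(st.st_mode) and st.st_rdev == 0:
                changes["deleted"].append(rel)
            else:
                changes["added_or_modified"].append(rel)
    return changes
```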

When an agent completes a task, it can either:

  • Commit changes directly, or
  • Request a single approval from the user with a complete, human-readable summary of all modifications

Changes can remain staged locally until the user is ready to review and “push” them, allowing agents to continue working asynchronously without blocking on constant approvals.

This reframes agent delegation from “grant access and hope” to “review and accept a unit of work.”
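
A minimal sketch of that approval boundary, building on the hypothetical `collect_changeset` helper above; the plain-text summary and the copy-based commit step are illustrative, not a prescribed format:

```python
import shutil
from pathlib import Path

def review_and_commit(project: str, upper: str) -> bool:
    """Present one human-readable summary of the agent's work, then
    either merge it into the real project tree or leave it staged."""
    changes = collect_changeset(upper)

    print("Agent changeset:")
    for rel in changes["added_or_modified"]:
        print(f"  added/modified: {rel}")
    for rel in changes["deleted"]:
        print(f"  deleted:        {rel}")

    if input("Apply these changes? [y/N] ").strip().lower() != "y":
        return False  # changes stay staged in the upper layer

    for rel in changes["added_or_modified"]:
        src, dst = Path(upper, rel), Path(project, rel)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    for rel in changes["deleted"]:
        Path(project, rel).unlink(missing_ok=True)
    return True
```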

Why This Works

Nearly all agent systems already support granting access to a single folder tree. A virtual filesystem driver can present that tree while handling isolation, change tracking, and approval semantics underneath.

OverlayFS and sandboxfs already solve isolation and performance. The additional layer provides:

  • Durable auditability
  • Reversible outcomes
  • A clean approval boundary
  • Role-specific filesystem views without duplicating policy logic across agents

Even the basic version materially improves safety and observability across most agent systems. More importantly, it provides the missing substrate that allows constitutional or policy-driven agent behavior to be enforced, monitored, and trusted in practice.

Advanced Capabilities (Optional but Powerful)

On top of the core VFS, deeper integrations can significantly increase usability and adoption:

  • An LLM-powered setup interface to simplify configuration
  • Historical change review with rollback support (a rollback sketch follows this list)
  • Notifications and approvals via GUI, email, SMS, or other channels
  • Fully remote operation with structured response handling
  • An agent “wizard” that helps users define roles and intent, mapping those to least-privilege filesystem access
  • Incorporation into an MCP layer, extending the same approach to remote and database-backed systems, where it operates much like a database cache
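
As one illustration of rollback for work that has not yet been committed: discarding the agent's staged changes is cheap precisely because the real tree was never touched. This sketch assumes the overlay layout used earlier; rolling back committed changesets would additionally require the stored diffs or snapshots mentioned above.

```python
import shutil
from pathlib import Path

def rollback_staged(agent_dir: str) -> None:
    """Discard everything the agent has staged but not committed.

    Because the agent only ever wrote to its OverlayFS upper layer, the
    real project tree is untouched; clearing that layer is the whole
    rollback. In practice you would unmount the overlay first, clear the
    layer, then remount.
    """
    upper = Path(agent_dir, "upper")
    for entry in upper.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
```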

This mirrors the evolution of cloud IAM systems: powerful under the hood, but approachable through guided interfaces. Crucially, it makes safe agent delegation accessible to non-technical users while preserving strong guarantees for advanced users.

Please share your thoughts in the comments.