Trust Boundary Failures in AI Coding Agents: Empirical Analysis of MCP Configuration Attacks in Claude Code

Published March 14, 2026 | Version v1

Preprint Open

Description

AI coding agents grant large language models access to file systems, terminals, and external services through protocols such as the Model Context Protocol (MCP). The trust models governing that access were designed for human users, not autonomous agents processing attacker-controlled input. This paper presents three empirical findings in Anthropic’s Claude Code (v2.1.63) demonstrating systemic trust boundary failures in MCP server configuration handling, tool confirmation prompts, and workspace trust escalation. All findings were reported through Anthropic’s HackerOne Vulnerability Disclosure Program and closed as Informative. Rather than contesting that design decision, this paper reframes the findings from an enterprise defensive perspective and proposes compensating controls including virtual desktop infrastructure (VDI) isolation, MCP configuration integrity monitoring, and credential management practices adapted for AI-assisted development workflows.

Trust Boundary Failures in AI Coding Agents: Empirical Analysis of MCP Configuration Attacks in Claude Code

Description

Files

Trust_Boundary_Failures_in_AI_Coding_Agents.pdf

Files (208.6 kB)

Additional details