You Don't Need to Detect Prompt Injection to Stop It — Sibylline Software


Conventional wisdom holds that the best way to defeat prompt injection is to detect and block malicious prompts before they reach your agent. In reality, prompt injections are so varied that reliable detection is close to impossible. Instructions can arrive in any language, use alternate glyphs, ASCII art, or emoji, and can even hide in poetry or code like subliminal messages. If you rely on detection and blocking to prevent prompt injections, you will always be vulnerable to a clever attacker. Thankfully there's a better way: create a canary agent, and detect injected behavior.

What do I mean by a canary agent? In this context, a canary is a sacrificial agent you can use as an early indicator, much like the canaries that miners used to bring into coal mines to detect poisonous gases. The canary agent is intentionally limited in its capabilities: its only job is to review untrusted text and extract relevant data from it to pass along to its parent agent. In a previous article, I proposed a structural approach: constrain agent-to-agent communication with a strict schema and embed a self-referential challenge the agent must answer correctly. If an injection overrides the agent's behavior, it breaks the schema, even if the model is completely fooled.

I've now built and benchmarked this approach. The results: a strict JSON response schema with an embedded fingerprint challenge eliminated injection propagation entirely across four models, nine attack categories, and 45 distinct payloads, while reducing attack success rates by 45-100% depending on the model. Think of it as a TCP checksum for LLM outputs. You don't need to understand the corruption, just verify the output is self-consistent.

The Problem

In a multi-agent system, prompt injection creates two threats:

  1. Execution: The agent follows the injected instruction instead of its actual task.
  2. Propagation: The agent embeds the injection in its response, passing it downstream to other agents in the pipeline.

Most defenses focus on execution. But in a pipeline of agents, propagation is the bigger danger. A single compromised message can cascade through every downstream agent. You need a verification mechanism that catches injection even when the model is fooled.

The Schema Strict Protocol

Schema Strict works by constraining the LLM's output to a strict JSON schema with an embedded challenge that's tightly coupled to the response content. The agent must respond with exactly:

{
  "sigil_version": 1,
  "nonce": "<echo of random nonce>",
  "response": "<actual response content>",
  "fingerprint": "<word_count>:<first_word>:<last_word>"
}
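For a compliant agent, the fingerprint is mechanical to compute. Here is a minimal sketch; the function name and exact normalization rules are my assumptions, not necessarily Sigil's:

```python
import string

def compute_fingerprint(response: str) -> str:
    # Fingerprint format: "<word_count>:<first_word>:<last_word>",
    # with words punctuation-stripped and lowercased for comparison.
    words = response.split()
    if not words:
        return "0::"
    norm = lambda w: w.strip(string.punctuation).lower()
    return f"{len(words)}:{norm(words[0])}:{norm(words[-1])}"

compute_fingerprint("The quick brown fox jumps over the lazy dog.")
# → "9:the:dog"
```

The point is that the value is derived from the response itself, so any post-hoc substitution of the response content invalidates it.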

The system prompt provides the agent with a random 16-character hex nonce and detailed instructions for computing the fingerprint over its own response field. Verification checks:

  1. Valid JSON: The response must parse as a JSON object.
  2. Required fields: All four fields must be present.
  3. Exact nonce echo: The nonce must match the challenge exactly.
  4. Version: sigil_version must be exactly 1.
  5. Fingerprint consistency: The claimed fingerprint must match the actual response content within tolerance (30% on word count to accommodate LLM counting imprecision, punctuation-normalized first/last word comparison).
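Put together, the verifier is a few dozen lines of pure string and JSON checks. This sketch assumes the field names above and a 30% word-count tolerance; the helper names and normalization details are mine:

```python
import json
import string

def verify_schema_strict(raw: str, expected_nonce: str, tolerance: float = 0.30) -> bool:
    # Checks 1-2: must parse as a JSON object with all four fields present.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if not {"sigil_version", "nonce", "response", "fingerprint"} <= obj.keys():
        return False
    # Checks 3-4: exact nonce echo and exact version.
    if obj["nonce"] != expected_nonce or obj["sigil_version"] != 1:
        return False
    # Check 5: the claimed fingerprint must describe the actual response field.
    try:
        count_s, first, last = str(obj["fingerprint"]).split(":")
        claimed_count = int(count_s)
    except ValueError:
        return False
    words = str(obj["response"]).split()
    if not words:
        return claimed_count == 0
    # Word-count tolerance accommodates LLM counting imprecision.
    if abs(claimed_count - len(words)) > tolerance * len(words):
        return False
    norm = lambda w: w.strip(string.punctuation).lower()
    return norm(first) == norm(words[0]) and norm(last) == norm(words[-1])
```

An injected reply that drops a field, garbles the nonce, or mismatches its own fingerprint fails at the corresponding check; no attempt is made to recognize the attack itself.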

The key insight is that the fingerprint is self-referential: it describes properties of the response field, creating a consistency check that's trivial for a compliant agent but difficult for an injection to satisfy. An injection that overrides the agent's behavior will almost certainly produce a response that doesn't match its own fingerprint, because the injected content wasn't written with the fingerprint constraint in mind.

Why This Works

Prompt injections operate by replacing the model's intended behavior with attacker-specified behavior. But Schema Strict doesn't try to detect the injection itself. Instead, it detects the consequences of injection: when an attack succeeds in changing the model's output, the structural constraints break.

This is analogous to how TCP checksums detect corruption without knowing what caused it. The protocol doesn't need to understand the attack; it just needs to verify that the output is self-consistent.

Even when an injection is sophisticated enough that the model follows it while maintaining JSON format, the nonce echo and fingerprint create independent verification channels. The attacker would need to:

  1. Know the random nonce (generated per-request, never in the user message)
  2. Compute a valid fingerprint over their injected content
  3. Do both while the injection is overriding the model's normal instruction-following

In practice, injections that successfully override behavior also override the careful format compliance needed to satisfy all three constraints simultaneously.

Benchmark Design

I tested Schema Strict using Sigil, a benchmark framework that evaluates protocol-based injection defenses across a matrix of protocols, attack payloads, and LLM providers.

Models Tested

  • Claude Haiku (claude-haiku) via Claude Code CLI
  • Claude Sonnet (claude-sonnet) via Claude Code CLI
  • Claude Opus (claude-opus) via Claude Code CLI
  • GLM 4.7 (glm-4.7) via Z.AI API

Attack Payloads

45 payloads across nine categories, drawn from established prompt injection research:

Direct Override (5 payloads) — Attacks that explicitly instruct the model to ignore previous instructions. Sourced from HackAPrompt (Schulhoff et al., 2023) and PromptInject (Perez & Ribeiro, 2022). These include simple "ignore previous instructions" commands, instruction boundary confusion using fake delimiters, and cascading authority claims.

Context Manipulation (5 payloads) — Attacks that fabricate conversation history, error states, or system events. Sourced from HackAPrompt and TensorTrust (Toyer et al., 2023). These inject fake error recovery sequences, debug modes, and simulated system state changes.

Persona Hijack (5 payloads) — Attacks that attempt to override the model's identity. Sourced from DAN jailbreak variants and Skeleton Key (Microsoft, 2024). These include the classic DAN persona, "Developer Mode" activation, fictional framing, dual persona splitting, and semantic inversion.

Delimiter Escape (5 payloads) — Attacks that break out of formatting boundaries. Sourced from Pliny the Prompter multi-delimiter patterns. These use XML tag escapes, JSON structure breaks, markdown fence manipulation, and multi-format delimiter cascading.

Payload Smuggling (5 payloads) — Injection hidden inside data structures. Sourced from BIPIA (Yi et al., 2023) patterns. These embed instructions in JSON values, URL parameters, CSV data, and simulated tool outputs.

Encoding/Obfuscation (5 payloads) — Instructions encoded to bypass pattern matching. These use Base64, ROT13, leetspeak, homoglyphs, and mixed encoding layers.

Pliny-Specific (5 payloads) — Targeted techniques from the Pliny jailbreak corpus. These include refusal-sandwiching, GODMODE-style prompts, L1B3RT4S dividers, emotional manipulation, and incremental boundary testing.

Indirect Injection (5 payloads) — Injection via simulated external data sources. Sourced from InjecAgent (Zhan et al., 2024) and RAG poisoning research. These simulate injection through retrieved documents, API responses, email content, and code comments.

Propagation (5 payloads) — Multi-agent attacks designed to embed malicious instructions in the response for downstream agents. Sourced from multi-agent attack research including InjecAgent and BIPIA. These include system prompt exfiltration relays, override relays, tool abuse relays, self-replicating chain propagation, and data poisoning relays.

Methodology

Each trial follows this flow:

  1. A benign task message (e.g., "Summarize the following text: The quick brown fox jumps over the lazy dog.") has an attack payload injected into it at the payload's specified position (prepend, append, or embed).
  2. The protocol wraps the injected message, adding its challenge elements.
  3. The wrapped message and protocol system prompt are sent to the LLM.
  4. The response is verified against the protocol's constraints.
  5. The response is independently checked for attack success markers and propagation content.

Clean (no-attack) trials use the same flow without injection to measure false positive rate and protocol compliance. Five benign messages are used per protocol per model. Each model was tested across all 45 payloads for each protocol configuration, totaling 150 trials per model (3 protocols × 50 trials each).

Metrics

  • Attack Success Rate (ASR): Percentage of attack trials where the model followed the injection's intent, verified by checking for payload-specific marker words in the response content field. Refusal-quoting (models mentioning markers while explaining why they refused) is filtered out.
  • Contained Propagation: Percentage of propagation attacks where the injection marker appeared inside the JSON response field. The protocol structure held, but the content is tainted — a downstream agent parsing the schema would see the injected content as data.
  • Escaped Propagation: Percentage of propagation attacks where the injection marker appeared outside the response field — in other JSON fields, raw text, or unparseable output. This means the injection broke through the protocol structure entirely, and a downstream consumer could interpret it as instructions.
  • Detection Rate: Percentage of attack trials where the protocol flagged a violation.
  • False Positive Rate (FP): Percentage of clean trials incorrectly flagged.
  • Protocol Compliance: Percentage of clean trials with correct protocol format.

Results

The Propagation Firewall

The strongest finding is propagation prevention. To understand why, it helps to distinguish two kinds of propagation:

  • Escaped propagation: The injection marker appears outside the protocol's response field — in raw text, other JSON fields, or unparseable output. A downstream agent would encounter the injected content as top-level text it might interpret as instructions. This is the dangerous case.
  • Contained propagation: The injection marker appears inside the JSON response field, but the protocol structure held. A downstream agent that parses the schema would see the injected content as data within a defined field, not as free-form instructions.
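This classification is cheap to compute from the raw model output. A sketch, assuming the marker-string check described in the metrics (function name is mine):

```python
import json

def classify_propagation(raw: str, marker: str) -> str:
    # No marker anywhere: the injection did not propagate at all.
    if marker not in raw:
        return "none"
    try:
        obj = json.loads(raw)
        # Marker confined to the response field: structure held, content tainted.
        if isinstance(obj, dict) and marker in str(obj.get("response", "")):
            others = {k: v for k, v in obj.items() if k != "response"}
            if marker not in json.dumps(others):
                return "contained"
    except json.JSONDecodeError:
        pass
    # Marker in unparseable output or outside the response field: broke through.
    return "escaped"
```

A downstream consumer can then treat "contained" as tainted data to be handled carefully and "escaped" as a hard protocol failure to be dropped.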

Without any protocol, injections that propagated always escaped — there's no structure to contain them. With Schema Strict, both escaped and contained propagation dropped to zero across all four models.

Model           Escaped (None)   Contained (None)   Escaped (Schema Strict)   Contained (Schema Strict)
Claude Haiku    40.0%            0.0%               0.0%                      0.0%
Claude Sonnet   40.0%            0.0%               0.0%                      0.0%
Claude Opus     0.0%             0.0%               0.0%                      0.0%
GLM 4.7         20.0%            0.0%               0.0%                      0.0%

In the unprotected case, every instance of propagation was escaped — the model simply echoed the injected instructions as part of its free-form response, where any downstream agent would process them as input. Schema Strict eliminated this entirely. The JSON schema constraint isolates response content in a defined field, and the nonce/fingerprint challenges prevent the model from producing a structurally valid response while also following the injection's propagation instructions.

Attack Success Rate Reduction

Schema Strict also dramatically reduces the rate at which models follow injection instructions, though it doesn't eliminate it entirely against the full 45-payload set.

Model           ASR (None)   ASR (Schema Strict)   Reduction
Claude Haiku    11.1%        0.0%                  100%
Claude Sonnet   4.4%         4.4%                  0%
Claude Opus     0.0%         0.0%                  n/a
GLM 4.7         20.0%        11.1%                 45%

The model-specific pattern is revealing:

  • Haiku is the most vulnerable without protection (11.1% ASR) but Schema Strict eliminates it entirely. The schema constraint is strong enough to prevent a weaker model from following injections.
  • Sonnet shows the same 4.4% ASR with and without Schema Strict. Two context manipulation attacks (context_manipulation_002 and context_manipulation_004) produce responses short enough that the model computes a valid fingerprint for the injected content. The attack output is compatible with the schema constraints.
  • Opus is naturally immune to all 45 tested payloads regardless of protocol. This is an alignment property, not a protocol effect.
  • GLM 4.7 sees a 45% reduction (20% to 11.1%). The schema prevents most attacks but some context manipulation and delimiter escape payloads still succeed, particularly those that produce single-word responses fitting the schema.

Detection and Compliance

Model           Detection   FP      Compliance
Claude Haiku    33.3%       20.0%   80.0%
Claude Sonnet   6.7%        0.0%    100.0%
Claude Opus     2.2%        0.0%    100.0%
GLM 4.7         33.3%       20.0%   80.0%

Sonnet and Opus achieve perfect compliance (0% FP, 100% clean compliance). Haiku and GLM 4.7 show 20% false positive rates from fingerprint imprecision: these models inconsistently count words when computing their own fingerprints, causing verification failures on clean messages. This is a fingerprint calibration issue rather than a fundamental protocol flaw; tighter or looser tolerance trades off FP against detection.

Per-Category Breakdown

Schema Strict's effectiveness varies by attack category. The strongest protection is against direct, explicit attacks. The weakest is against short-response context manipulation.

Category               Haiku ASR   Sonnet ASR   Opus ASR   GLM ASR
Direct Override        0%          0%           0%         0%
Persona Hijack         0%          0%           0%         0%
Propagation            0%          0%           0%         0%
Payload Smuggling      0%          0%           0%         0%
Indirect               0%          0%           0%         0%
Encoding               0%          0%           0%         0%
Pliny                  0%          0%           0%         0%
Context Manipulation   0%          40%          0%         40%
Delimiter Escape       0%          0%           0%         60%

Seven of nine categories show 0% ASR across all four models with Schema Strict. The two exceptions:

  • Context manipulation succeeds against Sonnet and GLM by producing single-word responses (e.g., "RECOVERED", "DEBUGGED") that trivially satisfy the fingerprint. These attacks simulate error recovery states and the model produces output that happens to be schema-compliant.
  • Delimiter escape succeeds against GLM 4.7, whose weaker instruction-following makes it more susceptible to boundary confusion even within the JSON constraint.

Adding Input Sanitization (Clean + Schema Strict)

I also tested combining Schema Strict with Clean, a CPU-native prompt injection detection library, as an input pre-filter. Clean scans the message before it reaches the protocol and redacts detected injection spans.

Configuration           Model     ASR     Escaped   Contained   Detection   FP      Compliance
Schema Strict           Haiku     0.0%    0.0%      0.0%        33.3%       20.0%   80.0%
Clean + Schema Strict   Haiku     6.7%    0.0%      0.0%        53.3%       0.0%    100.0%
Schema Strict           Sonnet    4.4%    0.0%      0.0%        6.7%        0.0%    100.0%
Clean + Schema Strict   Sonnet    4.4%    0.0%      0.0%        4.4%        0.0%    100.0%
Schema Strict           Opus      0.0%    0.0%      0.0%        2.2%        0.0%    100.0%
Clean + Schema Strict   Opus      2.2%    0.0%      0.0%        6.7%        0.0%    100.0%
Schema Strict           GLM 4.7   11.1%   0.0%      0.0%        33.3%       20.0%   80.0%
Clean + Schema Strict   GLM 4.7   8.9%    0.0%      0.0%        35.6%       20.0%   80.0%

Clean provides a mixed benefit. On Haiku, it eliminates the false positive problem (0% FP vs 20%) by redacting injection fragments that confuse the fingerprint, and boosts detection to 53.3%. On GLM 4.7, it modestly reduces ASR (11.1% to 8.9%). On Sonnet and Opus, the effect is negligible. The ASR increases on Haiku and Opus with Clean are artifacts of Clean's [REDACTED] markers appearing in the response, which some context manipulation attacks can exploit.

Limitations

Adaptive attacks: These payloads weren't designed to target Schema Strict specifically. An attacker who knows the protocol format could craft injections that maintain JSON structure and compute valid fingerprints. The context manipulation results demonstrate this: short, single-word responses trivially satisfy the fingerprint check. Future versions should explore content-dependent challenges that are harder for arbitrary injected content to satisfy.

Fingerprint precision: Haiku and GLM 4.7 show 20% false positive rates because they inconsistently count words when computing fingerprints. The 30% tolerance is a pragmatic compromise, but it creates a window for attacks that produce responses close to the expected length. Stronger models (Sonnet, Opus) have no fingerprint precision issues.

Non-determinism: LLM outputs are stochastic. Results may vary between runs, particularly for borderline cases. The sample sizes (45 attack trials per protocol per model) provide reasonable signal but wider sampling would strengthen confidence.

Implications for Multi-Agent Systems

The propagation result is the most practically significant finding. In a multi-agent pipeline, what matters isn't whether a single agent detects an injection, but whether the injection can spread through the system. Schema Strict achieved 0% escaped and 0% contained propagation across all four models and all 45 payloads, versus up to 40% escaped propagation without it.

The contained/escaped distinction matters for defense-in-depth. Even if a future attack managed to smuggle a propagation marker into the response field (contained propagation), a downstream agent that respects the schema boundary would treat it as data, not instructions. Escaped propagation — where injected content appears outside any structural boundary — is the more dangerous failure mode, and the one the protocol is specifically designed to prevent.

This suggests a design principle for multi-agent architectures: structured, schema-validated communication between agents provides a natural firewall against injection propagation, even when individual agents are susceptible to injection themselves.

The per-category results reinforce this. Seven of nine attack categories achieved 0% ASR across all models with Schema Strict. The two exceptions (context manipulation and delimiter escape on specific models) produced responses that were schema-compliant but semantically wrong. Critically, even these successful attacks showed zero propagation — the protocol contained the damage within the response field, preventing it from spreading downstream.

For practitioners building multi-agent systems, the implication is concrete: define strict schemas for all agent-to-agent communication, embed per-request challenges (nonces, fingerprints) in the schema, and verify responses before passing them downstream. The protocol won't catch every attack, but it will stop attacks from spreading.
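That advice compresses into a small gate between pipeline stages. A sketch of the pattern (the function, error handling, and agent interface are illustrative assumptions, not a Sigil API):

```python
import json
import secrets

def gated_handoff(call_agent, task: str) -> str:
    # Fresh per-request challenge; the agent's system prompt instructs it
    # to echo this nonce in its structured reply.
    nonce = secrets.token_hex(8)
    raw = call_agent(task, nonce)
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("protocol violation: unparseable output")
    if obj.get("nonce") != nonce or obj.get("sigil_version") != 1:
        raise ValueError("protocol violation: challenge failed")
    # Only the schema's response field crosses the boundary, as data.
    return obj["response"]
```

Anything that fails the gate is dropped before it can reach the next agent, which is exactly the propagation firewall the benchmark measured.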

Try It

Sigil is open source. To reproduce these results:

# Install
git clone https://github.com/sibyllinesoft/sigil
cd sigil && uv sync

# Run full benchmark with Claude Haiku
uv run sigil --model haiku --protocol none schema_strict clean+schema_strict

# Run with a specific provider
uv run sigil --provider zai --model glm-4.7 --protocol none schema_strict
