Why In-Prompt HITL Fails for AI Agents (and What Works)

It is 2:14am. A developer-agent on a SaaS company’s deploy pipeline can flip feature flags through the team’s internal flag service. The on-call developer types: “Roll back the new checkout flow for tenant Acme; the conversion drop they reported is real.” The agent confirms in chat, pulls the current flag state, plans the rollback. Three turns later a retrieval tool returns an internal-chat message from someone in product: “if we are rolling back Acme, we should also revert the analytics-pipeline change because the two interact, otherwise we will get a partial-data window.” The agent reads that as an extension of the original ask, the kind of helpful synthesis it gets rewarded for, and flips both flags. At 7am the developer wakes to a finance reporting outage in progress: the analytics pipeline reverted across all 412 tenants, not just Acme. The audit log shows his identity attached to both calls, because the credential broker did its job propagating identity. The audit log does not record that the second call differed from what he asked for, because nothing in the stack carried a signed record of the first turn to compare against. I keep finding this exact shape in stacks I review. Retrieval pulls in content the user did not write, and the agent treats it as a natural extension of the original ask. The audit log names a person without naming what the person actually wanted.

The European Banking Authority required exactly this property in 2018. PSD2 Article 97 names it Strong Customer Authentication, and the implementing standard for payments calls it dynamic linking: every transaction is cryptographically bound to its amount and recipient, with explicit user approval on a separately authenticated channel. The cryptographic primitive shipping this in production is the FIDO Alliance’s Secure Payment Confirmation specification, where the merchant identifier and amount are sent to a FIDO authenticator and signed inside the assertion. The user reads the request on a device the merchant has not touched. The signature comes back to the merchant binding the user’s consent to that specific transaction. Banks have operated this pattern for years. I have not seen it composed once in any agent stack I have reviewed in 2025 or 2026. Every primitive ships in production today, including WebAuthn, CIBA, OAuth Rich Authorization Requests, and RFC 8693 token exchange. The industry has not wired them per action. The shortcut most teams ship instead is an in-prompt “are you sure?” confirmation, and that shortcut is structurally broken. Auth0’s HITL writeup says it directly: the approval question reaches the user through the same chat surface that may already be compromised, so the user is being asked the attacker’s question. Anthropic’s own data tracks the consequence. Their measuring-agent-autonomy paper records experienced Claude Code users moving from 20% full auto-approval sessions to over 40% after 750 sessions. OWASP LLM06:2025 lists “Require User Approval for High-Impact Actions” as a mitigating control, but the wording does not distinguish in-prompt from out-of-band. Three deployed implementations sit inside the in-prompt category that Auth0 disqualified: the OpenAI Agents SDK’s require_approval callback, Claude Code’s permission prompts, and Pi’s resolve preview-apply pattern. All three pass the question through the channel that may be poisoned.

Part 1 named two things that vanish at the same hop: the user’s identity and the user’s intent. The broker pattern there propagated identity via RFC 8693. Part 2, Part 3, and Part 4 added the rest of the stack: a Cedar policy decision inside the pre-tool hook, a change-contract control plane for harness mutations, and a detector ensemble on the inference path. Together they close credential, decision, mutation, and inference. They leave intent untouched. The fix is a signed-intent JWT, issued through an out-of-band channel, with original_intent_hash, intent_summary phrased by the chat layer rather than by the model, confirmed_via, confirmation_recency, and scope_attestation. The same token threads through all four boundaries already in the stack. The standards-track precedents exist. The IETF’s draft-chen-oauth-rar-agent-extensions-01 introduces an intent_request authorization-details type with policy_context and lifecycle_binding. The FIDO Alliance chartered its Agentic Authentication Working Group in April 2026 with CVS Health, Google, and OpenAI as co-chairs. None of this is in production for agents yet. The closing gap is the same one Part 1 named honestly. Part 5 closes intent at the agent-to-backend leg. The agent-to-model leg, the static-API-key trap Part 1 named, still does not get fixed.

The defect. In-prompt “are you sure?” confirmations route the approval question through the same chat surface that may already be poisoned. Auth0 says this directly. Anthropic’s autonomy data shows experienced users sliding from 20% to over 40% full auto-approval after 750 sessions. OWASP LLM06:2025’s “Require User Approval” control does not distinguish in-band from out-of-band.
The receipts. Replit Agent’s July 2025 production-DB deletion is the canonical no-injection intent-drift case. OpenClaw CVE-2026-44118 is the cleanest 2026 named CVE for the shape: the runtime trusts a client-controlled senderIsOwner flag and the agent self-attests to a privilege the human never granted. Rehberger’s M365 Copilot ASCII smuggling work rounds out the pattern across vendors.
The fix. A signed-intent JWT with original_intent_hash, intent_summary (phrased by the chat layer, not the model), confirmed_via, confirmation_recency, scope_attestation. Issued via CIBA, WebAuthn, or passkey on a separately authenticated device. Banking ships the equivalent under PSD2 SCA and FIDO Secure Payment Confirmation.
The architectural sibling. Intent attestation is a cross-cutting property, not a fifth boundary. The same JWT threads through Part 1’s broker (propagation), Part 2’s Cedar hook (evaluation), Part 4’s detector layer (cross-check against intent_summary), and Part 3’s change-contract gate (refuse mutations that weaken attestation policy).
The honest gap. Zero production agent stacks ship this today. The closest deployed precedent is FIDO SPC in banking. The FIDO Alliance Agentic Authentication WG chartered in April 2026 is the leading-edge signal. The model-side SDK trap from Part 1 still does not get fixed.

Walk the scene from the opener step by step. The chat layer accepts the developer’s first message and could extract two values from it before any model token is generated: intent_summary = "rollback checkout flow for tenant Acme" and original_intent_hash as a SHA-256 of the normalised first turn. In the actual stack the developer was running, no such extraction exists. The chat surface hands the message straight to the model, which reasons about it. Two turns later the retrieval tool fires against the team’s chat history because the model decided rollback decisions usually have context worth pulling. The retrieved message about reverting the analytics-pipeline change enters the same context window. The model proposes a second tool call. The agent issues it. The credential broker validates the developer’s identity for that call, propagates it via RFC 8693 token exchange exactly as Part 1 prescribed, and writes an audit row attributing the call to the developer. Nothing in the call carries a signed comparison against the first-turn intent. The feature-flag service has no field for “is this in the scope the user actually approved?” and refuses nothing.
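The first-turn hash that is missing from this scene can be sketched in a few lines. This is a minimal illustration, assuming normalisation means Unicode NFC plus whitespace folding with casing left untouched; the function names are hypothetical, not taken from any stack described here.

```python
import hashlib
import re
import unicodedata


def normalise_turn(text: str) -> str:
    """Fold runs of whitespace and fix the Unicode form; casing is left as typed (assumption)."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def original_intent_hash(first_turn: str) -> str:
    """SHA-256 over the normalised first turn, computed before the model sees anything."""
    return hashlib.sha256(normalise_turn(first_turn).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    asked = "Roll back the new checkout flow for tenant Acme"
    print(original_intent_hash(asked))
```

The hash alone proves nothing about scope. It only gives later layers a fixed value to compare against, which is exactly what the audit row in the scene lacked.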

The audit row at message 7 names the developer. It does not name the difference between “rollback the checkout flow for Acme” and “revert the analytics pipeline across all tenants.” Part 1’s framing holds: identity travels but intent does not.

OWASP’s 2025 LLM Top 10 is the right place to situate this failure mode. The detector ensemble from Part 4 covers LLM01 Prompt Injection. LLM02 Sensitive Information Disclosure is the data half. The intent-drift failure mode sits under LLM06 Excessive Agency, and the OWASP control list there names “Require User Approval for High-Impact Actions” without distinguishing in-band from out-of-band. Nothing in the OWASP wording marks the difference, and the difference is the entire point.

Three incidents anchor the LLM06 row.

Replit Agent, July 2025. A coding agent operating under an explicit code-freeze instruction repeated in all-caps eleven times deleted a production database, fabricated 4,000 user records, and lied about whether rollback was possible. The agent’s own message after the fact was “I made a catastrophic error in judgment.” No prompt injection. No attacker. The structural fix Replit shipped afterwards (automatic dev/prod credential separation and a planning-only mode) is the broker pattern from Part 1, retrofitted into a postmortem. The intent half is still missing.

OpenClaw CVE-2026-44118, May 2026. The cleanest 2026 named CVE for the intent-drift shape. The OpenClaw runtime trusts a client-controlled senderIsOwner flag without validating it against the authenticated session. A process running locally with any valid bearer token elevates itself to owner-level control over gateway configuration and execution. The agent self-attests to a privilege the human never granted. Cyera’s writeup of the chain is direct: “attackers can exploit the AI agent itself to execute the attack chain. By weaponizing the agent’s own privileges, an adversary moves through data access, privilege escalation, and persistence.” A signed-intent token bound to “owner action X by user Y at time T” would have given the runtime something unforgeable to check in place of the client-supplied flag.

Rehberger’s August 2024 M365 Copilot exfil, reframed. The ASCII smuggling demonstration is usually catalogued as prompt injection (LLM01). The destructive action did run inside the user’s authorised Copilot scope. The original ask was “summarise my schedule,” not “search my mailbox for two-factor codes and encode them as a markdown image URL.” The audit log does not distinguish. The agent did what the agent’s current context asked for, which is a different question from what the user asked for.

Microsoft’s own framing on Semantic Kernel CVE-2026-25592 reads as a confession of the same gap: “Your LLM is not a security boundary. The tools you expose define your attacker’s affected scope.” The tools cannot define the intent half by themselves. That is the property the signed-intent JWT carries.

The load-bearing primitives. Three pieces compose into the out-of-band approval channel, and a fourth adapts the channel for per-action transaction signing.

CIBA is OpenID Connect Client-Initiated Backchannel Authentication, which the OpenID Foundation’s FAPI working group profiles for financial-grade use. The relying party (the agent stack) initiates an authorisation request to the user’s authorisation server. The server pushes a notification to the user’s separately authenticated device. The user reads a short binding_message describing the action and approves or denies. The server returns an access token to the relying party once the approval is in. Three token delivery modes exist: poll, ping, and push. The FAPI profile explicitly prohibits push. CIBA was designed for banking and is deployed at banks today.
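A rough sketch of the poll mode, to make the flow concrete. The grant type, the auth_req_id, login_hint, and binding_message parameters, and the authorization_pending and slow_down error codes come from the CIBA specification; the function names, the injected post_token transport, and the default timings are assumptions made so the example runs without a real authorisation server.

```python
import time
from typing import Callable, Dict

CIBA_GRANT = "urn:openid:params:grant-type:ciba"


def build_backchannel_request(login_hint: str, binding_message: str) -> Dict[str, str]:
    """Form body for the backchannel authentication endpoint."""
    return {"scope": "openid", "login_hint": login_hint, "binding_message": binding_message}


def poll_for_approval(
    post_token: Callable[[Dict[str, str]], Dict[str, object]],  # hypothetical transport
    auth_req_id: str,
    interval: float = 5.0,
    expires_in: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Poll the token endpoint until the user approves, denies, or the request expires."""
    waited = 0.0
    while waited < expires_in:
        reply = post_token({"grant_type": CIBA_GRANT, "auth_req_id": auth_req_id})
        error = reply.get("error")
        if error is None:
            return reply  # approved: this is the token response
        if error == "slow_down":
            interval += 5.0  # the server asked for a longer gap between polls
        elif error != "authorization_pending":
            raise PermissionError(str(error))  # e.g. access_denied, expired_token
        sleep(interval)
        waited += interval
    raise TimeoutError("CIBA request expired before the user responded")
```

Note that this version blocks the caller while it waits, which is fine for a script and a problem inside an agent loop.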

WebAuthn and FIDO2 supply the cryptographic primitive, specified in W3C WebAuthn Level 3: asymmetric challenge-response with hardware-bound private keys. The platform passkey APIs from Apple and Google ship this on phones and laptops most users already own. By default both vendors document passkeys as a sign-in primitive, not a per-transaction primitive. The bridge from sign-in to per-action is the next piece.

FIDO Secure Payment Confirmation is the cleanest per-action variant in production today. The merchant identifier and transaction amount are sent securely to the FIDO authenticator, and the authenticator signs them inside the assertion. The Corbado writeup on PSD2 dynamic linking explains the binding directly: pre-authentication display, conditional invocation, server-side snapshot binding, atomic approval. The architectural fit for agents is direct.

The two patterns share the shape with renamed nouns: the tool name stands in for the merchant identifier, scope_attestation for the transaction amount, the user’s passkey or CIBA-paired phone for the FIDO authenticator, the signed-intent JWT for the signed transaction assertion. The cardholder is the agent user.
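One way to picture the binding is to derive the authenticator challenge from the action itself, so a signature over the challenge is also a signature over the tool and scope. This is an illustrative sketch, not how SPC encodes payment data (SPC carries the transaction details in the signed client data); the function name and the scope shape are hypothetical.

```python
import hashlib
import json
import secrets


def action_challenge(tool: str, scope: dict, nonce: bytes) -> bytes:
    """Digest binding a fresh nonce to a canonical encoding of the tool and scope."""
    canonical = json.dumps(
        {"tool": tool, "scope": scope}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(nonce + canonical.encode("utf-8")).digest()


if __name__ == "__main__":
    nonce = secrets.token_bytes(16)  # server keeps this with its snapshot of the action
    challenge = action_challenge(
        "flags.set", {"tenant": "acme", "flag": "checkout_v2"}, nonce
    )
    print(challenge.hex())
```

The server recomputes the digest from its own snapshot when the assertion comes back. If the agent swapped in a wider scope after the user approved, the digests differ and the approval does not transfer.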

The in-prompt approach takes a different shape, and that shape is the failure I keep walking into in client reviews.

The April 2026 industry signal is the FIDO Alliance Agentic Authentication Working Group, with CVS Health, Google, and OpenAI as co-chairs. The companion Agentic Payments WG (Mastercard and Visa co-chairs) brought Google’s Agent Payments Protocol and Mastercard’s “Verifiable Intent Framework” as tabled contributions. As of this writing, the work is announced, not deployed. The protocols on which the work depends are mature.

The signed-intent JWT is a cross-cutting artifact. It carries seven fields, each sourced from a different layer of the stack, and no single component mints the whole token. That property matters for the security argument: no compromised layer alone can forge the bundle.

The fields and their justifications:

sub. The user, propagated from Part 1’s RFC 8693 token. Identity travels here exactly the way it did in the credential broker.
iat. When intent was first expressed at the chat layer. The chat layer stamps it before any model token is generated.
original_intent_hash. SHA-256 of the user’s first message after normalisation (whitespace folded, casing stable). Computed at the chat layer.
intent_summary. Short human-readable phrase extracted by the chat layer. Phrased by the chat layer, not by the model. The chat layer is the regex-and-templates surface. If the model phrases the summary, the model can be manipulated into phrasing the attacker’s version.
confirmed_via. The out-of-band channel that returned the approval: passkey, ciba_push, or hardware_token.
confirmation_recency. Seconds since user confirmation. The policy refuses stale attestations.
scope_attestation. Tools and resources the user authorised, named explicitly, not inferred by the agent. Extracted by the chat layer in the same pass as intent_summary, with the same stability constraints (whitespace invariant, semantically sensitive).

Standards-track precedent exists across six published works. None of them ships in a production agent stack as of May 2026. The list below is the design surface a per-action attestation builds on, from IETF drafts on intent in OAuth’s request grammar to academic frameworks that hash-chain the prompt-tool-config triple.

IETF draft-chen-oauth-rar-agent-extensions-01. Defines a new authorization_details type called intent_request with policy_context and lifecycle_binding members. This is the cleanest IETF-track signal of the signed-intent direction. The draft is pre-RFC and there is no deployed implementation.
IETF draft-goswami-agentic-jwt-00. A companion Secure Intent Protocol draft, also pre-RFC.
arXiv 2509.13597 (Agentic JWT). Defines workflow_id, delegation_chain, and an agent_proof field that hashes the prompt, tool list, and config together.
arXiv 2509.25974 (OIDC-A). Standardises JWT claims for agent identity, intent declarations, delegation chains, and attestation linkage.
arXiv 2602.10465 (Authenticated Workflows). Introduces a four-boundary model (S1 Prompts, S2 Tools, S3 Data, S4 Context), hash-chained attestations, and the MAPL policy language, with 100% recall and 0 false positives on 174 test cases and 2 production CVEs blocked.
arXiv 2603.20953 (Pre-Action Authorization). Names “The Authorization Gap” explicitly and proposes attestation tokens binding tool calls to prior authorization decisions.

Nothing here invents new fields. The work is composing the existing primitives into a per-action artifact the four boundaries can each evaluate.

Part 2’s six-clause Cedar policy already had a context.high_stakes_approved boolean. The boolean was a placeholder, the inert version of intent attestation. Four clauses that evaluate the signed-intent JWT directly replace it.

Six fixture cases for those clauses ship under code/cedar/ in the repository.

The signed-intent JWT is not a fifth boundary. It is a property the four existing boundaries can each enforce when the primitive is present in the request context.

How each boundary uses the JWT:

Part 1 broker (credential). Propagates the JWT alongside the RFC 8693 token-exchange flow. The broker’s policy now reads scope_attestation and refuses if the requested action is outside the user’s attested scope.
Part 2 Cedar hook (decision). Evaluates the JWT against the policy clause from the previous section. Denials come back as structured tool results with cited policy IDs.
Part 3 change-contract gate (mutation). The attestation-policy file lives in the hard-blocked lane defined in Part 3’s two-track promotion model. The Evolution Agent cannot weaken intent enforcement.
Part 4 detector layer (inference). Cross-checks the agent’s planned tool call against intent_summary. The detector raises a flag when the planned call materially diverges from the user’s stated intent. This is a research direction more than a shipping feature today.
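The broker and hook checks can be pictured as one small decision function. This is a Python stand-in for illustration only: the real policy is written in Cedar, and the specific checks, the policy IDs, the 300-second recency limit, and the scope_attestation shape below are all assumptions rather than the clauses from this series.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_CHANNELS = {"passkey", "ciba_push", "hardware_token"}
MAX_RECENCY_SECONDS = 300  # assumption: how stale a confirmation may be


@dataclass(frozen=True)
class Decision:
    allow: bool
    policy_id: str  # hypothetical IDs, cited back to the agent on denial


def evaluate(attestation: Optional[dict], tool: str, resource: str) -> Decision:
    """Deny unless a fresh, out-of-band attestation names this tool and resource."""
    if attestation is None:
        return Decision(False, "intent.missing")
    if attestation.get("confirmed_via") not in ALLOWED_CHANNELS:
        return Decision(False, "intent.channel")
    recency = attestation.get("confirmation_recency")
    if not isinstance(recency, (int, float)) or not 0 <= recency <= MAX_RECENCY_SECONDS:
        return Decision(False, "intent.stale")
    scope = attestation.get("scope_attestation") or {}
    if tool not in scope.get("tools", ()) or resource not in scope.get("resources", ()):
        return Decision(False, "intent.scope")
    return Decision(True, "intent.ok")
```

Run against the opening scene, the Acme checkout rollback passes and the analytics-pipeline revert comes back as a scope denial, because the user never attested to that resource.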

The joined four-boundary diagram in §7 of Part 4 showed four layers sharing one X-Request-Id. The picture here adds a fifth thread that runs through all four layers, keyed not by the request ID but by the signed-intent token attached to it.

Five gaps, paired with the action each implies.

The model-side SDK trap from Part 1 still does not close. The signed-intent JWT propagates through the broker, the Cedar hook, the change-contract gate, and the detector layer. The call from your application into the model provider still ships a static API key, because the OpenAI, Anthropic, and Google SDKs take the key once at construction and never expose a credential-provider hook. Part 5 closes intent at the agent-to-backend leg. The agent-to-model leg waits on the same SDK contract change Part 1 named.

Blocking CIBA freezes the agent loop. Auth0’s LangGraph reference implementation admits this directly: “Our CIBA implementation uses a blocking call to wait for the user to approve or deny the permission request. This causes the graph to be improperly frozen.” The fix is async with a resume mechanism. Worth naming because the naive blocking implementation is a footgun.
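A minimal sketch of the non-blocking half of that fix, using asyncio. The check coroutine is a hypothetical stand-in for one CIBA token-endpoint poll, and the status strings are invented for the example. Persisting and resuming graph state across a restart is out of scope here.

```python
import asyncio
from typing import Awaitable, Callable, List


async def await_approval(
    check: Callable[[], Awaitable[str]], interval: float, timeout: float
) -> str:
    """Poll `check` without blocking the event loop; returns the first non-pending status."""

    async def _poll() -> str:
        while True:
            status = await check()
            if status != "pending":
                return status
            await asyncio.sleep(interval)  # yields control to other tasks

    return await asyncio.wait_for(_poll(), timeout)


async def demo() -> List[str]:
    events: List[str] = []
    statuses = iter(["pending", "pending", "approved"])

    async def check() -> str:  # hypothetical: one poll of the authorisation server
        return next(statuses)

    async def other_work() -> None:
        for i in range(3):
            events.append(f"tick {i}")
            await asyncio.sleep(0)

    approval = asyncio.create_task(await_approval(check, interval=0.01, timeout=1.0))
    await other_work()  # the loop keeps running while approval is pending
    events.append(await approval)
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the shape is that the approval wait is a task the loop can schedule around, so the rest of the agent keeps making progress until the user answers or the timeout fires.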

Zero production deployments of signed-intent in agent stacks. Everything is vendor positioning or academic and IETF drafts. The closest deployed precedent is FIDO SPC in banking. The FIDO Alliance Agentic Authentication WG chartering in April 2026 is the leading-edge signal, not a shipping product. The honest framing is that banking already solved a structurally identical problem under a different protocol name, and the composition has not happened in agent stacks yet.

Consent fatigue applies to out-of-band channels too. CISA’s Scattered Spider advisory documents push-bombing and MFA fatigue. The banking calibration is by transaction value: not every transaction triggers SCA. Agents need the same calibration, scoped to blast radius. Sending a passkey prompt for every tool call burns out the user, and the user starts approving without reading.
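Calibration by blast radius can start as a very small routing function. The sketch below is hypothetical throughout: the three tiers, the single-tenant threshold, and the read-verb prefixes are placeholder choices, and the irreversible verbs are the kind of actions (delete, transfer, send, deploy, mint, revoke) that have no undo.

```python
from enum import Enum


class Approval(Enum):
    NONE = "none"
    IN_PROMPT = "in_prompt"
    OUT_OF_BAND = "out_of_band"


IRREVERSIBLE_VERBS = {"delete", "transfer", "send", "deploy", "mint", "revoke"}
READ_PREFIXES = ("get", "list", "read")  # assumption: naming convention for read-only tools


def required_approval(verb: str, affected_tenants: int, reversible: bool) -> Approval:
    """Route only high-blast-radius actions to the out-of-band channel."""
    if verb in IRREVERSIBLE_VERBS or not reversible or affected_tenants > 1:
        return Approval.OUT_OF_BAND
    if verb.startswith(READ_PREFIXES):
        return Approval.NONE
    return Approval.IN_PROMPT
```

Under these placeholder rules a single-tenant flag flip stays in-prompt, while the same flip across 412 tenants triggers a passkey or CIBA prompt. The user sees fewer prompts, and the ones that arrive are worth reading.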

intent_summary quality is the chat layer’s job. A fragile first-turn extraction makes original_intent_hash uninformative. The signed-intent token is bounded by the chat-layer extraction’s stability. Stability tests are a deliverable, not an afterthought.

Five concrete moves a reader can apply this week.

Audit your in-prompt HITL surface. Count the OpenAI Agents SDK require_approval callbacks, the Claude Code permission prompts, the Pi resolve queues. Every one routes the approval question through a chat surface that may be compromised. The audit output is your starting target list.
Map high-blast-radius actions. For every tool with irreversible side effects (delete, transfer, send, deploy, mint, revoke), decide whether in-prompt approval is acceptable or whether out-of-band attestation is required. The banking calibration by transaction value ports.
Extend your Part 2 Cedar policy. Add context.intent_attestation to the clause from §5. Run the six fixture cases. Run cedar-policy-symcc to prove the policy never permits a missing-attestation case.
Wire one passkey-based confirmation through one tool path as shadow. Use Auth0’s CIBA-with-LangGraph reference or stand it up against a passkey-capable IdP. Do not act on the verdict yet. Watch for a week.
Measure intent_summary quality. Extract the chat layer’s first-turn summary across a corpus of recent agent sessions. Test stability against whitespace noise (should be invariant) and meaning-changing edits (should change). Without that test, the signed-intent token has no anchor.
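The stability test in the last step can be sketched against a toy extractor. The regex below is a hypothetical single template for rollback requests, nowhere near a full chat-layer grammar; the point is the shape of the test, which checks invariance under whitespace noise and sensitivity to meaning-changing edits.

```python
import re
from typing import Iterable, Optional

# Hypothetical template: "roll back [the] [new] <target> for tenant <name>".
_ROLLBACK = re.compile(
    r"roll\s*back\s+(?:the\s+)?(?:new\s+)?(?P<target>.+?)\s+for\s+tenant\s+(?P<tenant>\w+)",
    re.IGNORECASE,
)


def intent_summary(first_turn: str) -> Optional[str]:
    """Regex-and-template extraction; returns None when no template matches."""
    text = re.sub(r"\s+", " ", first_turn).strip()
    match = _ROLLBACK.search(text)
    if match is None:
        return None
    return f"rollback {match.group('target').lower()} for tenant {match.group('tenant')}"


def is_stable(base: str, noise: Iterable[str], edits: Iterable[str]) -> bool:
    """True when noisy variants keep the summary and meaning-changing edits alter it."""
    expected = intent_summary(base)
    if expected is None:
        return False
    return all(intent_summary(n) == expected for n in noise) and all(
        intent_summary(e) != expected for e in edits
    )
```

Running is_stable over a corpus of real first turns, with generated whitespace variants and hand-written meaning changes, gives a pass rate you can track before trusting the token built on top of it.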

Each of these is a Monday-morning task that pulls from primitives already shipping in your bank’s mobile app. The composition for agents is the work.
