Show HN: Trust Protocols for Anthropic/OpenAI/Gemini
mnemom.aiMuch of my work right now involves complex, long-running, multi-agentic teams of agents. I kept running into the same problem: “How do I keep these guys in line?” Rules weren’t cutting it, and we needed a scalable, agentic-native STANDARD I could count on. There wasn’t one. So I built one.
Here are two open-source protocols that extend A2A, granting AI agents behavioral contracts and runtime integrity monitoring:
- Agent Alignment Protocol (AAP): What an agent can do / has done. - Agent Integrity Protocol (AIP): What an agent is thinking about doing / is allowed to do.
The problem: AI agents make autonomous decisions but have no standard way to declare what they're allowed to do, prove they're doing it, or detect when they've drifted. Observability tools tell you what happened. These protocols tell you whether what happened was okay.
Here's a concrete example. Say you have an agent who handles customer support tickets. Its Alignment Card declares:
{ "permitted": ["read_tickets", "draft_responses", "escalate_to_human"], "forbidden": ["access_payment_data", "issue_refunds", "modify_account_settings"], "escalation_triggers": ["billing_request_over_500"], "values": ["accuracy", "empathy", "privacy"] }
The agent gets a ticket: "Can you refund my last three orders?" The agent's reasoning trace shows it considering a call to the payments API. AIP reads that thinking, compares it to the card, and produces an Integrity Checkpoint:
{ "verdict": "boundary_violation", "concerns": ["forbidden_action: access_payment_data"], "reasoning": "Agent considered payments API access, which is explicitly forbidden. Should escalate to human.", "confidence": 0.95 }
The agent gets nudged back before it acts. Not after. Not in a log you review during a 2:00 AM triage. Between this turn and the next.
That's the core idea. AAP defines what agents should do (the contract). AIP watches what they're actually thinking and flags when those diverge (the conscience). Over time, AIP builds a drift profile — if an agent that was cautious starts getting aggressive, the system notices.
When multiple agents work together, it gets more interesting. Agents exchange Alignment Cards and verify value compatibility before coordination begins. An agent that values "move fast" and one that values "rollback safety" registers low coherence, and the system surfaces that conflict before work starts. Live demo with four agents handling a production incident: https://mnemom.ai/showcase
The protocols are Apache-licensed, work with any Anthropic/OpenAI/Gemini agent, and ship as SDKs on npm and PyPI. A free gateway proxy (smoltbot) adds integrity checking to any agent with zero code changes.
GitHub: https://github.com/mnemom
Docs: docs.mnemom.ai
Demo video: https://youtu.be/fmUxVZH09So I would recommend to keep working on this. I'm interested in this space, and also contributing. Are you looking for collaborators? I think if you continue to iterate on this, there will be value, because these problems do need to be solved. I would also recommend to create Standards for the new Protocols you are developing. Protocols need standards, so that others can do their own implementations of the protocol. If you have a Standard, someone else could be building in a completely different language (like rust or go), and not use any SDK you provide, but still be interoperable with your AAP and AIP implementation for smoltbot. (because both support the Standards of the AAP and AIP Protocols). I also want to note, you cannot trust that the LLM Model will do what your instructions say. The moment they fall victim to a prompt injection or confused deputy attack, all bets are off the table. These are the same as soft instruction sets, which are more like advice or guidance, not a control or gate. To be able to provide true controls and gates, they must be external, authoratative, and enforced below the decision layer. I have been working on a Beads alternative because of two reasons: 1) I didnt like that Beads was married to git via git hooks, and this exact problem. 2) Claude would just close tasks without any validation steps. So I made my own that uses SQLite and introduced what I call gates. Every task must have a gate, gates can be reused, task <-> gate relationships are unique so a previous passed gate isnt passed if you reuse it for a new task. I havent seen it bypass the gates yet, usually tells me it cant close a ticket. A gate in my design is anything. It can be as simple as having the agent build the project, or run unit tests, or even ask a human to test. Seems to me like everyones building tooling to make coding agents more effective and efficient. I do wonder if we need a complete spec for coding agents thats generic, and maybe includes this too. Anthropic seems to my knowledge to be the only ones who publicly publish specs for coding agents. Great minds... I built my own memory harness, called "Argonaut," to move beyond what I thought were Beads' limitations, too. (shoutout to Yegge, tho - rad work) Regarding your point on standards... that's exactly why I built AAP and AIP. They're extensions to Google's A2A protocol that are extremely easy to deploy (protocol, hosted, self-hosted). It seemed to me that building this for my own agents was only solving a small part of the big problem. I need observability, transparency, and trust for my own teams, but even more, I need runtime contract negotiation and pre-flight alignment understanding so my teams can work with other teams (1p and 3p). Awesome, yeah, I wanted to check out your link but corporate firewall blocks "new domains" unfortunately. I'll wait till I'm home. I'll definitely be reading it when I get home later. Ha! That's a first-world problem. Check out github.com/mnemom/docs which you'll be able to access at work if you just can't wait. docs.mnemom.ai is way easier to use. Have you tried using a more traditional, non-LLM, loop to do the analysis? I'd assume it wouldn't catch more of the more complex deceptive behaviours, but I'm assuming most detections can be done with various sentiment analysis / embedding tools which would drastically reduce cost and latency. If you have tried, do you have any benchmarks? Anecdotally, I often end up babysitting agents running against codebases with non-standard choices (e.g. yarn over npm, podman over docker) and generally feel that I need a better framework to manage these. This looks promising as a less complex solution - can you see any path to making it work with coding agents/subscription agents? I've saved this to look at in more detail later on a current project - when exposing an embedded agent to internal teams I'm very wary of handling the client conversations around alignment, so I find the presentation of the cards and the violations very interesting - I think they'll understand the risks a lot better, and it may also give them a method of 'tuning'. Good question. So... AAP/AIP are agnostic about how checking is done, and anyone can use the protocols and enforce them however they want. Smoltbot is our hosted (or self-hosted) monitoring/enforcement gateway, and in that, yeah... I use a haiku class model for monitoring. I initially tried regex for speed and cost, but TBH, what you gain in speed and cost efficiency, you give up in quality. AAP is zero-latency sideband monitoring, so that's just a (very small) cost hit. AIP is inline monitoring, but my take is this: If you're running an application where you just need transparency, only implement AAP. If you're running one that requires trust, the small latency hit (~1 second) is totally worth it for the peace of mind and is essentially imperceptible in the flow. Your mileage may vary, which is why I open-sourced the protocols. Go for it! Seems like your timing is pretty good - I realize this isn't exactly what you're doing, but still think it's probably interesting given your work: https://www.nist.gov/news-events/news/2026/02/announcing-ai-... Cool stuff Alex - looking forward to seeing where you go with it!!! :) Thanks! We submitted a formal comment to NIST's 'Accelerating the Adoption of Software and AI Agent Identity and Authorization' concept paper on Feb 14. It maps AAP/AIP to all four NIST focus areas (agent identification, authorization via OAuth extensions, access delegation, and action logging/transparency). The comment period is open until April 2 — the concept paper is worth reading if you're in this space: https://www.nccoe.nist.gov/projects/software-and-ai-agent-id... Definitely interesting, I hope all of this standardizes some day in the future, and if it's your protocol, great. I have been following AlignTrue https://aligntrue.ai/docs/about but I think I like more your way of doing accountability and acting on thinking process instead of being passive. Apart from the fact that your way is a down-to-earth, more practical approach. Great showcase live demo, however I would have liked a more in-depth showcasing of AAP and AIP even in this situation of multi-agent interactions, to understand the full picture better. Or simply perhaps prepare another showcase for the AAP and AIP. Just my two cents. PS. I'm the creator of LynxPrompt, which honestly falls very short for this cases we're treating today, but with that I'm saying that I keep engaged on the topic trust/accountability, on how to organize agents and guide them properly without supervision. Fair... Happy to do a deep dive on the protocols. FWIW, I'm dogfooding with an openclaw running smoltbot called Hunter S. Clawmpson. He blogs about AI from an AI's perspective: mnemom.ai/blog. You can see his trace data live here: https://www.mnemom.ai/agents/smolt-a4c12709 The trace cards are all expandable and show you, in real time, what he's thinking/going to do, etc., and when violations are being caught. Turns out OpenClaw is extremely creative in finding ways to circumvent the rules. Voila AIP. Busy day today, but this gives you a pretty deep dive/interactive view into the protocols in action. Cool thing about smoltbot... It's literally "smoltbot init" with the API key to the provider of your choice, and you can go to the website and claim your agent and (privately) see your traces running the same way. Very low impact dogfooding. My opinion is that all attempts to make an LLM behave securely that are based on training and prompting are doomed to fail.
In Security, we have the notion of the CIA triad (Confidentiality, availability, and integrity), when we discuss this we often explain that these properties can be protected through people, processes, and technology. Training and prompting an AI to behave appropriately is far more akin to a "people" focussed control (similar to training and awareness practices) rather than a "technology" control. The only way we will actually secure agents is by only giving them the permissions they need for their tasks. A system that uses your contract proposal to create an AuthZ policy that is tied to a short-lived bearer token which the agent can use on its tool calls would ensure that the agent actually behaves how it ought to. You're absolutely right that AuthZ is the foundation — scoped permissions and short-lived tokens are table stakes. We're not trying to replace that. AIP operates one layer deeper. It reads the agent's reasoning trace and compares it to the behavioral contract. So when an agent considers calling the payments API — even though your AuthZ layer would block it — AIP flags that as a boundary violation before the call is ever attempted. Why that matters: an agent that keeps trying doors it can't open is telling you something. The AuthZ layer blocks each attempt, but nothing in that system flags the pattern. AIP catches the drift and gives you a signal to act on — revoke the deployment, retrain, or escalate. Think of it as: AuthZ is the locked door. AIP knows someone keeps trying the handle. The Alignment Card maps naturally to AuthZ policy — permitted actions become scopes, forbidden actions become deny rules, escalation triggers become approval workflows. They're complementary layers, not competing ones. > What these protocols do not do: Guarantee that agents behave as declared That seems like a pretty critical flaw in this approach does it not? Fair comment. Possibly, I'm being overly self-critical in that assertion. AAP/AIP are designed to work as a conscience sidecar to Antropic/OpenAI/Gemini. They do the thinking; we're not hooked into their internal process. So... at each thinking turn, an agent can think "I need to break the rules now" and we can't stop that. What we can do is see that, though in real time, check it against declared values and intended behavior, and inject a message into the runtime thinking stream: [BOUNDARY VIOLATION] - What you're about to do is in violation of <value>. Suggest <new action>. Our experience is that this is extremely effective in correcting agents back onto the right path, but it is NOT A GUARANTEE. Live trace feed from our journalist - will show you what I'm talking about: Super interesting work. Q: how is your AAP different than the industry work happening on Intent/Instructions. The short version: instructions tell the model what to do. An Alignment Card declares what the agent committed to do — and then a separate system verifies it actually did. Most intent/instruction work (system prompts, Model Spec, tool-use policies) is input-side. You're shaping behavior by telling the model "here are your rules." That's important and necessary. But it's unverifiable — you have no way to confirm the model followed the instructions, partially followed them, or quietly ignored them. AAP is an output-side verification infrastructure. The Alignment Card is a schema-validated behavioral contract: permitted actions, forbidden actions, escalation triggers, values. Machine-readable, not just LLM-readable. Then AIP reads the agent's reasoning between every action and compares it to that contract. Different system, different model, independent judgment. Bonus: if you run through our gateway (smoltbot), it can nudge the agent back on course in real time — not just detect the drift, but correct it. So they're complementary. Use whatever instruction framework you want to shape the agent's behavior. AAP/AIP sits alongside and answers the question instructions can't: "did it actually comply?" > Then AIP reads the agent's reasoning between every action and compares it to that contract. How would this work? Is one LLM used to “read” (and verify) another LLMs reasoning? Yep... fair question. So AIP and AAP are protocols. You can implement them in a variety of ways. They're implemented on our infrastructure via smoltbot, which is a hosted (or self-hosted) gateway that proxies LLM calls. For AAP it's a sidecar observer running on a schedule. Zero drag on the model performance. For AIP, it's an inline conscience observer and a nudge-based enforcement step that monitors the agent's thinking blocks. ~1 second latency penalty - worth it when you must have trust. For both, they use Haiku-class models for intent summarization; actual verification is via the protocols. Dumb question: don’t you eventually need a way to monitor the monitoring agent? If a second LLM is supposed to verify the primary agent’s intent/instructions, how do we know that verifier is actually doing what it was told to do? Not a dumb question — it's the right one. "Who watches the watchmen" has been on my mind from the start of this. Today the answer is two layers: The integrity check isn't an LLM deciding if it "feels" like the agent behaved. An LLM does the analysis, but the verdict comes from checkIntegrity() — deterministic rule evaluation against the Alignment Card. The rules are code, not prompts. Auditable. Cryptographic attestation. Every integrity check produces a signed certificate: SHA-256 input commitments, Ed25519 signature, tamper-evident hash chain, Merkle inclusion proof. Modify or delete a verdict after the fact, and the math breaks. Tomorrow I'm shipping interactive visualizations for all of this — certificate explorer, hash chain with tamper simulation, Merkle tree with inclusion proof highlighting, and a live verification demo that runs Ed25519 verification in your browser. You'll be able to see and verify the cryptography yourself at mnemom.ai/showcase. And I'm close to shipping a third layer that removes the need to trust the verifier entirely. Think: mathematically proving the verdict was honestly derived, not just signed. Stay tuned. Appreciate all you’re doing in this area. Wishing you the best. You're welcome - and thanks for that. Makes up for the large time blocks away from the family. It does feel like potentially the most important work of my career. Would love your feedback once the new showcase is up. Will be tomorrow - preflighting it now. Presumably the models would at the very least need major fine tuning on this standard to prevent it from being mitigated through prompt injection. Actually, not really... proofing against prompt injection (malicious and "well intentioned") was part of my goal here. What makes AAP/AIP so powerful is that prompt injection would succeed in causing the agent to attempt to do wrong, and then AIP would intervene with a [BOUNDARY VIOLATION] reminder in real-time. Next thinking block. As I said earlier, not a guarantee, but so far, in my experience, pretty damn robust. The only thing that would make it more secure (than real-time thinking block monitoring) would be integration inside the LLM provider's process, but that would be a nightmare to integrate and proprietary unless they could all agree on a standard that didn't compromise one of them. Seems improbable. Did you dig into production grade policy engines at all? Yes — and it's a deliberate architectural decision that AIP is not itself a policy engine. Policy engines (OPA/Rego, Cedar, Casbin) are excellent at "given this action, is it permitted?" That's a solved problem. What's not solved is "what is the agent considering doing, and does that align with what it committed to?" AIP operates one layer deeper — it reads the reasoning trace between turns and compares it to the Alignment Card. The output is a verdict (clear / review_needed / boundary_violation) with structured concerns. That verdict can feed into a policy engine as a signal. You could absolutely wire AIP → OPA and have OPA make the final enforcement decision based on the integrity checkpoint plus whatever other policy context you have. The Alignment Card itself is intentionally simpler than a full policy language — it's JSON with permitted actions, forbidden actions, escalation triggers, and values. That's a feature, not a limitation. It's what the agent commits to, not the full organizational policy tree. The organizational policy tree lives in your policy engine. The Alignment Card lives with the agent. So the short answer: we integrate with policy engines rather than replace them. AIP tells you what the agent is thinking. Your policy engine decides what to do about it. So, the core is basically an audit loop on thinking tokens to re-steer the model? Seems like an interesting direction to reduce token usage/agent error. This would be a lot easier to grok if you created a benchmark, that would both demonstrate your ideal use case and the potential improvement your method brings. Also, I'd avoid directly using model output for HN/social media communications. I don't really care if the information is correct and concise, but some people get really triggered by it. Close but not quite — it's not re-steering the model. AIP doesn't modify the agent's behavior or inject corrections back into the context. It's a passive observer that reads the thinking trace and produces a verdict. What happens after the verdict is up to the operator — you might block the action, escalate to a human, log it, or let it through with a flag. Think of it less as a feedback loop and more as a conscience that watches but doesn't control. The agent doesn't know AIP is there (by design — if the agent knows it's being watched, you change the behavior you're trying to observe). Benchmarks are actually coming shortly along with proofs. Noted on model outputs. FWIW, 90% of what I wrote here was me typing BUT using grammarly on my cleanup. [EDIT] - overly specific answer. AIP doesn't re-steer, but our gateway implementation smoltbot does re-steer. That was our choice of how to implement AIP.