Show HN: Prompt-injection firewall for OpenClaw agents
github.comPeople seem to be blindly hooking up their OpenClaw’s to their personal data. So, I built runtime controls to prevent at the least, very simple prompt injection attacks.
Once installed, it hooks to Node.js child_process module in the gateway process and listens to tool calls and their response streams. And a fetch hook to monitor user prompts (both could’ve been through fetch, happy to discuss why this whole layer couldn’t just be a proxy).
There are two layers of protection:
First: Whenever there is a read-only tool call whose response an attacker can modify, we extract that part of the json response and send it to a small haiku model to check if it has instruction asking the LLM to do something different
Second: For when the prompt injection detection fails, we maintain a list of function calls which can write to places that an external actor can access. We prompt the user for explicit permission to go forward through the UI.
I would love a discussion on how this second layer could be made better and less frequent by relying on some decision process. My current idea: Based on a collected set of “trusted” context (user prompts, responses from tool calls attackers cannot manipulate), can we detect if this tool call was necessary. There are scenarios where you’d need detection at the parameter-level.
Two notes:
1) This cannot just be a proxy because you need application level integration to have humans in the loop when needed and push UI controls.
2) How i improved accuracy of detecting prompt injection is by selecting only that content from the entire response json that can be manipulated by an external actor. This had to be done for each tool separately. The current implementation is for 2 skills I randomly chose (Notion & Github).
P.S.: I maintain one for claude code myself while working: https://github.com/ContextFort-AI/Runtime-Controls, I created this over the weekend OpenClaw The second layer you write about here, doesn’t it kind of depend on huma judgement and hence produce the same problem as having human in the loop for all tools, and hence isn’t really a layer due to decision fatigue? Oh my bad just noticed that’s what you conveyed an idea for right below
Use of consistent italics would’ve been great But why is it that you think you can predict if a tool name is required by user prompts and a few tool responses ? I can also reply to the bot as simply “ok” for a plan already rolled out by the bot Apologies for the formatting Can we not assume that the plan you just said “ok” to came from a user prompts you made earlier in the chat session and hence does influence this decision process. Another point in the idea is that this trusted context can include even the AI replies up until there hasnt been a tool calls yet that brings back a response an attacker can control But it’s entirely possible that there are edge cases here, a red teaming dataset to cover these cases shouldn’t be hard to create