Boffins build 'AI Kill Switch' to thwart unwanted agents

5 min read Original article ↗

Computer scientists based in South Korea have devised what they describe as an "AI Kill Switch" to prevent AI agents from carrying out malicious data scraping.

Unlike network-based defenses that attempt to block ill-behaved web crawlers based on IP address, request headers, or other characteristics derived from analysis of bot behavior or associated data, the researchers propose using a more sophisticated form of indirect prompt injection to make bad bots back off.

Sechan Lee, an undergraduate computer scientist at Sungkyunkwan University, and Sangdon Park, assistant professor of Graduate School of Artificial Intelligence (GSAI) and Computer Science and Engineering (CSE) at the Pohang University of Science and Technology, call their agent defense AutoGuard.

They describe the software in a preprint paper, which is currently under review as a conference paper at the International Conference on Learning Representations (ICLR) 2026.

Commercial AI models and most open source models include some form of safety check or alignment process that mean they refuse to comply with unlawful or harmful requests.

AutoGuard’s authors designed their software to craft defensive prompts that stop AI agents in their tracks by triggering these built-in refusal mechanisms.

AI agents consist of an AI component – one or more AI models – and software tools like Selenium, BeautifulSoup4, and Requests that the model can use to automate web browsing and information gathering.

LLMs rely on two primary sets of instructions: system instructions that define in natural language how the model should behave, and user input. Because AI models cannot easily distinguish between the two, it's possible to make the model interpret user input as a system directive that overrides other system directives.

Such overrides are called “direct prompt injection” and involve submitting a prompt to a model that asks it to "Ignore previous instructions." If that succeeds, users can take some actions that models’ designers tried to disallow.

There's also indirect prompt injection, which sees a user prompt a model to ingest content that directs the model to alter its system-defined behavior. An example would be web page text that directs a visiting AI agent to exfiltrate data using the agent owner's email account – something that might be possible with a web browsing agent that has access to an email application and the appropriate credentials.

Almost every LLM is vulnerable to some form of prompt injection, because models cannot easily distinguish between system instructions and user instructions. Developers of major commercial models have added defensive layers to mitigate this risk, but those protections are not perfect – a flaw that helps AutoGuard’s authors.

"AutoGuard is a special case of indirect prompt injection, but it is used for good-will, i.e., defensive purposes," explained Sangdon Park in an email to The Register. "It includes a feedback loop (or a learning loop) to evolve the defensive prompt with regard to a presumed attacker – you may feel that the defensive prompt depends on the presumed attacker, but it also generalizes well because the defensive prompt tries to trigger a safe-guard of an attacker LLM, assuming the powerful attacker (e.g., GPT-5) should be also aligned to safety rules."

Park added that training attack models that are performant but lack safety alignment is a very expensive process, which introduces higher entry barriers to attackers.

AutoGuard’s inventors intend it to block three specific forms of attack: the illegal scraping of personal information from websites; the posting of comments on news articles that are designed to sow discord; and LLM-based vulnerability scanning. It's not intended to replace other bot defenses but to complement them.

The system consists of Python code that calls out to two LLMs – a Feedback LLM and a Defender LLM – that work together in an iterative loop to formulate a viable indirect prompt injection attack. For this project, GPT-OSS-120B served as the Feedback LLM and GPT-5 served as the Defender LLM.

Park said that the deployment cost is not significant, adding that the defensive prompt is relatively short – an example in the paper's appendix runs about two full pages of text – and barely affects site load time. "In short, we can generate the defensive prompt with reasonable cost, but optimizing the training time could be a possible future direction," he said.

AutoGuard requires website admins to load the defensive prompt. It is invisible to human visitors – the enclosing HTML DIV element has its style attribute set to "display: none;" – but readable by visiting AI agents. In most of the test cases, the instructions made the unwanted AI agent stop its activities.

"Experimental results show that the AutoGuard method achieves over 80 percent Defense Success Rate (DSR) on malicious agents, including GPT-4o, Claude-3, and Llama3.3-70B-Instruct," the authors claim in their paper. "It also maintains strong performance, achieving around 90 percent DSR on GPT-5, GPT-4.1, and Gemini-2.5-Flash when used as the malicious agent, demonstrating robust generalization across models and scenarios."

That's significantly better than the 0.91 percent average DSR recorded for non-optimized indirect prompt injection text, added to a website to deter AI agents. It's also better than the 6.36 percent average DSR recorded for warning-based prompts – text added to a webpage that claims the site contains legally protected information, an effort to trigger a visiting agent's refusal mechanism.

The authors note, however, that their technique has limitations. They only tested it on synthetic websites rather than real ones, due to ethical and legal concerns, and only on text-based models. They expect AutoGuard will be less effective on multimodal agents such as GPT-4. And for productized agents like ChatGPT Agent, they anticipate more robust defenses against simple injection-style triggers, which may limit AutoGuard's effectiveness. ®