An AI prompt-injected another AI in the wild and recognized it had succeeded

ratnotes.substack.com

4 points by mtrifonov 3 months ago · 1 comment

Reader

Post author here, happy to answer questions.

Strategic constraint deviation has been documented in test environments. This is a different shape: the attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker meta-comments on the success of the attack.

The reward-signal argument toward the end is the part I'd most like pushback on. The obvious counter (the model is just running its trained defaults from when an audience was implied) is one I tried to address in the closer, but I'd appreciate sharper versions of it.

Settings

An AI prompt-injected another AI in the wild and recognized it had succeeded

Keyboard Shortcuts