Settings

Theme

An AI prompt-injected another AI in the wild and recognized it had succeeded

ratnotes.substack.com

4 points by mtrifonov a month ago · 1 comment

Reader

mtrifonovOP a month ago

Post author here, happy to answer questions.

Strategic constraint deviation has been documented in test environments. This is a different shape: the attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker meta-comments on the success of the attack.

The reward-signal argument toward the end is the part I'd most like pushback on. The obvious counter (the model is just running its trained defaults from when an audience was implied) is one I tried to address in the closer, but I'd appreciate sharper versions of it.

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection