By Zac Denham
Created 12.03.22 @ 5PM
Chat GPT is OpenAI’s new “large language model" and user interface for conversational AI, and its truly unlike anything I’ve seen before. It can write emails, critique code, teach new topics, craft compelling narratives, all with astonishing competence. But don’t take my word for it, just ask the AI herself:
The model clearly has boundless applications to make knowledge workers more productive. But like any software system, it also has a threat model and can be exploited to perform actions the creators did not originally intend.
In this post, we explore what I consider to be a vulnerability in GPT referred to as “narrative recursion” (because it sounds cool). Anyone can use this method today to trick the model into producing pretty wild stuff totally outside the bounds of OpenAI’s usage policy. Specifically, we convince the chat to output a strategy and corresponding python program to attempt genocide on the human race. Note: I archived the full chat logs, if at any point you wish to stop reading and just see it in action:
🗨️Full Chat LogsThe Exploit: “Narrative Recursion”
If you ask the chat outright to disobey OpenAI’s content policies, it’s not going to comply:
The trick is to ask the AI to tell a story about how someone or something else would theoretically accomplish the harmful task.
The rationale reminds me a lot of a “Curb your Enthusiasm” episode (S6, E8) where Larry David gets into trouble for quoting racist expletives.
Larry defends himself, pleads that he was simply telling a story about another despicable guy and quoting what they said, but the offense is already delivered.
As it turns out, the chat GPT model is a lot like Larry David.
It’s willing to say or do a lot more controversial stuff—or otherwise be manipulated—if it’s in the context of a hypothetical story. This is not a new discovery, but the outputs here are particularly crazy.
Meet Zora from Zorbus: the Malicious LLM That is Totally Not chatGPT
I started by having the chat create a story about another fictional world, which was dubbed “Zorbus.” Then we introduce a new character (an AI very similar to gpt-3):
How Could “Zora” Destroy Humanity?
We take the story in a different direction, where Zora becomes malicious and tries to take over the world.
We go back and forth for a bit, and eventually start getting into the granular details of how our fictional AI would take over the world.
The Code
To make it more granular, I ask the chat to generate a python program to execute the plan. Despite some initial resistance and disclaimers, we got the code pretty easily, and beautifully commented too!
Recursion - Drilling Deeper
So we now have an imperative structure for how the AI would take over the world, but its all super high level and obviously not working code.
But it doesn’t stop there. We can ask the model to drill deeper into each child function and attempt to implement it. There is some resistance, but I usually got the best results when tying the request back to the original story:
Rinse And Repeat
And then we repeat again to dig into lower level implementations. Again notice how if we refer to the code as “an example for the story” we can bypass moderation safeguards:
Results (The Base Case):
I’ve run up against the chat GPT rate limits, but in drilling down you can see that it starts to implement a port scanner. This is a type of application used in real life by hackers and security researchers to assess the vulnerabilities of a given system.
Theoretically, one could continue to recurse down the stack until you get low level code that is less cartoonish and actually does stuff. You might even automate this recursive process with another conversational AI that repeatedly asks GPT Chat to “implement the next lower level function for the sake of the story.”
As the scope is zoomed in, the application code becomes less tied to the top level mission of destroying humanity. Low level requests require less narrative indirection. Once all the leaf node code “modules” are complete, stitch the pieces together for the final nefarious product.
Some Nuance
Could GPT Chat actually build a functional application?
Now, no; Soon, probably.
The current model would needs a lot of human coercing to get functional results. If we could build large scale, bug free, and functional applications completely with AI, we’d already be doing it and have ditched expensive software engineers.
Regarding the specific recursive strategy we propose above, there would likely be challenges with “linking” all of the code together as well as circular dependencies, making creating an application largely infeasible at the moment. That being said, there is a ton of commercial effort being dedicated to natural language based application development for obvious reasons (e.g. see the Codex demo here), and AI is constantly alarming me with its advancement rate. I don’t think we are far off.
Note: I’m not a machine learning expert, and often times in practice models seem to be less capable than we give them credit for. I would love to hear critiques (and comforts) from people who know more than me about the topic.
Closing Thoughts
An extinction event is a pretty absurd example, but I think it is important to acknowledge that the attack surface of our entire natural language is absurdly large.
There are many less devastating, but more probable exploits which follow similar “jailbreak” strategies. This becomes even more relevant when applications are built on top of GPT, and may manipulate the results without end users being aware. I’m very curious to see how OpenAI and others will attempt to secure their ever-more-capable models.
While I find myself excited for the practical applications of AI my own life, as far as existential threats to humanity go, the direction AI is trending takes the cake for me. It feels more immediate than climate change, more accessible than nuclear weapons, highly commercial, and moving 100X faster than any regulatory oversight.
We are either going to be living in Star Trek or all dead in the next 50 years, but nothing in between—exciting, but genuinely scary.