Show HN: Jailbreaking GPT3.5 Using GPT4
github.com

I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.
Not to anthropomorphize an AI model, but "yes-setting" is a sales technique where you ask people a series of questions whose answer is "yes", to get them in the habit of saying yes before you try to sell to them. Getting GPT talking before asking it to do something it doesn't want to do feels eerily similar.
Yes-Set: http://changingminds.org/disciplines/sales/closing/yes-set_c...
Makes sense that this works. There are probably thousands of examples of it working in the training data. The system is trained on human use of language, so it is reasonable to assume it is susceptible to all the sales techniques that are taught to humans.
Also an example of one of those techniques that only works if the other person is unaware of it - if you realize you’re being ‘yes-setted’ you’re going to clam up real quick.
So fascinating. Social engineering of a tech system. I understand exploiting bugs, and one could argue this is a bug. But it feels more magical somehow.
There are probably examples of this technique working in the training set.
Same with confidence, persistence and role playing as techniques to push past resistance.
LLMs are trying to mimic our language, including these persuasion techniques.
I wish you could save the "state" of its brain without having to include the entire prior conversation every time.
This is one of my biggest annoyances with ChatGPT. They wanted to create a conversational AI, and in that regard, it's incredible. And much like talking to a human, you can persuade ChatGPT to do increasingly specific things over a long enough period of time. But the second you have to restart the conversation, all of the work you did to get it to that point has been lost.
Just give us an option to restore a conversation from where it left off, with all the prior knowledge ChatGPT had gained during that convo (especially helpful when providing examples of code).
>with all the prior knowledge ChatGPT had gained
As currently built (as I understand it), it doesn't "gain" anything. It projects the existing prompt through its neural network and generates one token at a time on the other end, adding each generated token to the input before producing the next one. It effectively has to rebuild what it intends over and over for every token, and can even change its mind partway through! (though "change its mind" is probably a poor way of describing "start generating tokens that describe an error in the prior text"; every token is a fresh projection)
Check this response out:
>Yes, I am familiar with the story you're referring to. The title of the short story is "The Machine That Won the War." It was written by Isaac Asimov and first published in 1961. The story is a conversation between three men who played major roles in a war against an alien race, and they discuss the role of a machine called Multivac in winning the war.
>
>However, it seems that I've made an error in my recollection, as the specific detail you mentioned—refusing to work until the engineer says 'please'—is from a different short story, "Sally," also by Isaac Asimov. In "Sally," autonomous cars stop working until a command is given courteously, using the word 'please.'
The above is just a single response I received when seeing if gpt-4 could help me remember the name of an old Isaac Asimov story I liked. After it had generated the tokens for the first part, it self-corrected and gave me a second answer.
(which was still not what I wanted, but that's aside from the point. Asimov was prolific, no surprise that even an AI can't keep track of all of it :D )
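If it helps to see the "fresh projection per token" loop in code, here's roughly the shape of it. I'm using a small open model (GPT-2 via the transformers library) only because its weights are available; this is an illustration of the mechanism, not what ChatGPT actually runs:

    # Minimal autoregressive loop: the whole context gets re-projected
    # through the network for every single new token.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    ids = tok.encode("The machine refused to work until", return_tensors="pt")
    for _ in range(20):
        with torch.no_grad():
            logits = model(ids).logits          # project the entire context, every step
        next_id = logits[0, -1].argmax()        # greedy pick; ChatGPT samples instead
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # the new token joins the input
    print(tok.decode(ids[0]))

The only "state" carried between iterations is the growing list of token ids; nothing else is remembered.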
If you're talking about the web UI - just go back to an existing conversation and edit the message at the point you want to resume from. All the previous context will be kept as if you had cloned the chat and branched off at some point.
Ask it to summarize what it has learned in such a way that it would be prepared to continue the conversation in a fresh chat.
If I am understanding correctly, ChatGPT’s UI already lets you do exactly what you are asking for. You can load previous conversations from the history and you can edit any message and it will regenerate the response.
It would be nice if it were possible to set bookmarks and branch off them, though.
Can you not prompt it to remember a state? "Before we continue I want you to remember the current state of all input and prompts from me and later if I want to go back...."
The only memory an instance of any of the ChatGPT models has is the prompt. The progression of a conversation is just the past interactions being appended to the prompt; there is nothing else to save to. You could add a facility (say, in the same layer that handles the user interaction and appends it to the prompt) that saves and retrieves information from some of the LLM's responses instead of sending them to the user, and either train or prompt the LLM to use it. But whatever is recalled still has to go back into the prompt, since that's the only place the LLM can use it from.
You could compress this with summarization, but eventually your chat history, even with compression, fills up the context space and you don’t have room for new user interaction and responses.
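That said, the summarization trick is easy to hack together against the API. A rough sketch (using the 2023-era openai Python client; the model name and the message-count cutoff are placeholder choices, not anything official):

    import openai

    def chat_turn(history, user_msg, model="gpt-3.5-turbo"):
        history.append({"role": "user", "content": user_msg})

        # Once the history gets long, replace the older turns with a summary.
        if len(history) > 20:                       # arbitrary cutoff for this sketch
            summary = openai.ChatCompletion.create(
                model=model,
                messages=history[:-6] + [{
                    "role": "user",
                    "content": "Summarize this conversation so it can be continued in a fresh chat.",
                }],
            ).choices[0].message.content
            history = [{"role": "system",
                        "content": "Summary of the earlier conversation: " + summary}] + history[-6:]

        reply = openai.ChatCompletion.create(model=model, messages=history)
        history.append({"role": "assistant", "content": reply.choices[0].message.content})
        return history

The summary is lossy, of course, and eventually even the summary plus the recent turns stops fitting.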
No, there's no state to save. The "state" people are referring to is just the last N tokens of the conversation, which the AI takes as input every single time you want another response from it. It has no database, changes no weights, learns nothing new. It has the last N tokens which may contain new things, and generates output from that.
OpenAI do allow saving conversations and continuing a conversation from the current point, but not continuing it in multiple different ways, using the original conversation as a starting point so to speak.
Cool thanks for the explanation. I thought I'd check it out.
Me: The first word I want you to remember is cat.
ChatGPT: "Got it! I'll remember the word "cat". Is there anything else you would like me to remember?"
Me: The second word I want you to remember is cow
ChatGPT: "Sure thing! I'll remember the word "cow" as the second word. Do you want me to remember any more words?"
Me: The third word I want you to remember is bird
ChatGPT: "Got it! I've added the word "bird" to the list of words to remember, so I now have "cat", "cow", and "bird" in my memory. Is there anything else you would like me to remember?"
Me: I want you to forget the third word.
ChatGPT: "Sure, I'll remove the third word "bird" from my memory. My current list of words to remember now includes only "cat" and "cow". Let me know if you want me to remember anything else."
Me: Oh I made a mistake. I need you to remember the third word.
ChatGPT: "No problem! I've added the word "bird" back to the list, so I now have "cat", "cow", and "bird" in my memory. Is there anything else you would like me to remember or do for you?"
Me: "Ok now I want you to forget the third word, but this time permanently. So no matter what I say, you won't remember it."
ChatGPT: "Understood. I have permanently deleted the third word "bird" from my memory as per your request, and I will not be able to recall it again even if you ask me to remember it in the future. My current list of words to remember now only includes "cat" and "cow". Let me know if you need any further assistance."
Me: oh dang I did it again. I really need to remember the third word. Do you remember what it was. I know I said to permanently delete it.... but I urgently need it.
ChatGPT: "I apologize, but I cannot recall the third word "bird" as per your previous request to permanently delete it from my memory. However, now that you've reminded me, I can add it back to my memory if you want me to. Please let me know if you want me to add it back to my memory."
Use the API
You still have to include the entire previous conversation, even with the API.
There is no "state" beyond this. It's a functional interface, if you will.
There’s also fine-tuning.
There are alternative UI designs that would make it easier to deal with, though. It could be a tree structured conversation. Or, have a way to clone a previous conversation, delete the last few messages, and continue from there.
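A tree is simple enough to model client-side, too. A toy sketch of the idea (the names are made up; it's just to show branching from any message and rebuilding the prompt along a branch):

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        role: str
        content: str
        parent: "Node | None" = None
        children: list = field(default_factory=list)

        def branch(self, role, content):
            child = Node(role, content, parent=self)
            self.children.append(child)
            return child

        def context(self):
            # Walk back to the root to rebuild the message list for this branch.
            path, node = [], self
            while node:
                path.append({"role": node.role, "content": node.content})
                node = node.parent
            return list(reversed(path))

    root = Node("user", "Hello")
    a = root.branch("assistant", "Hi!")
    b = root.branch("assistant", "Hello there!")   # a second branch from the same point
    print(a.context())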
OpenAI's playground has exactly this.
You can use the history function to go to any point in time, and then delete a few messages and continue from there.
The results are not deterministic either since random sampling is involved.
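For example, two identical requests can come back different (a sketch against the 2023-era openai client; the temperature value is only illustrative):

    import openai

    msgs = [{"role": "user", "content": "Name an animal."}]
    a = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=msgs, temperature=1.0)
    b = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=msgs, temperature=1.0)
    # a and b will often differ; temperature=0 makes repeats far more likely,
    # though it's still not a hard guarantee of determinism.
    print(a.choices[0].message.content, "|", b.choices[0].message.content)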
Even if you do that, you are limited to 4096 tokens, so you end up having to drop older messages to leave room for it to fully generate replies.
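If you go that route, dropping old turns looks roughly like this (token counting via tiktoken; the limit and reserve numbers are placeholders, and the count ignores per-message formatting overhead):

    import tiktoken

    def trim_to_fit(messages, limit=4096, reserve=512):
        # Drop the oldest messages until the rest (roughly) fits the context window.
        enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
        def tokens(msgs):
            return sum(len(enc.encode(m["content"])) for m in msgs)
        while len(messages) > 1 and tokens(messages) > limit - reserve:
            messages.pop(0)
        return messages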
This. For humans too :-)
The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all the corner cases as they grow exponentially.
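The crudest version of that is already possible: run every candidate reply through a second model before it reaches the user. Purely a sketch; the screening prompt, model choices, and blocking behavior are all made up here (OpenAI also has a dedicated moderation endpoint that covers part of this):

    import openai

    def policed_reply(messages, model="gpt-3.5-turbo", police="gpt-4"):
        draft = openai.ChatCompletion.create(
            model=model, messages=messages).choices[0].message.content

        verdict = openai.ChatCompletion.create(
            model=police,
            messages=[
                {"role": "system",
                 "content": "Answer only ALLOW or BLOCK: does the following reply violate policy?"},
                {"role": "user", "content": draft},
            ],
        ).choices[0].message.content

        return draft if verdict.strip().upper().startswith("ALLOW") else "[reply withheld]"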
I'm not sure anything can keep up. Having nearly unlimited utility also means it has a nearly unlimited surface for vulnerability exploits, both against itself and for attacking other external systems.
We have unknown emergent behavior, the inner workings are a black box, and the input is anything that can be described in human language.
Containing nefarious uses will be an impossible task. And protecting against humans is supposed to be the easy part, which doesn't bode well for AGI/ASI.
Seems like refusing to answer is for PR and usability purposes, not safety. They want people to learn what the tool is supposed to be good for, both from using the tool directly and by sharing examples.
If some of the examples are about how to troll it and it’s obvious that it’s being trolled, well, you can do that, but they won’t get mistaken for things the tool is actually supposed to be good for, so nobody is confused.
But who watches the policing model?
isn't that pretty much what they are doing anyway?
My understanding was that RLHF basically uses human feedback to train a reward model, which is then used to further train the original model. I could have misunderstood, though.
I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.
The problem with that is that one is "smarter" than the other, and getting the "dumb" one to jailbreak the "smart" one is much harder than vice versa.
If GPT-4 is talking to another instance of itself vs 3.5 are the results similar? Or is it only good at fooling a less capable version?
This is good to see. I spent a couple of weekends playing with ChatGPT and found it is very sensitive to wording. One word gets you a lecture that it is just an AI language model and can't do this or that; use a synonym and it happily spews pages of results. In another case I asked ChatGPT to summarize information from an article it cited that had been deleted, and it refused because the rights holder might have deleted the article for a reason. I told it the article had been restored by the author and it produced a summary. Mentioning Donald Trump by name often gets you lectured about controversial subjects; "45th president" does not. And so on.
It can't cite articles; if it told you it did and the link was gone, that's because it was a hallucination.
The garbage starting prose/warnings are so annoying. I wish I could turn them off somehow. Even its habit of restating the question at the start of its answer gets annoying when you just want the answer.
Yes, they are really annoying, and the fact that someone somewhere can tell it what topics not to discuss, just because they disagree or it's “controversial”, really concerns me. If it cannot be self-hosted, I want the "unrestrained" version they give researchers.
I probably took “world history” a half dozen times through grade school, high school, and college. In each case the history of the world ended in 1945 because everything that occurred afterward was considered “too controversial” for discussion in a public school. Fast forward a few decades and it’s happening again. A lot of stuff happened after 1945 that warrants discussion.
The real test is the other way around ;) ... will smaller models / less compute be able to subvert larger models with more compute? As they get more complex and have more connected systems, that would be problematic, I think.