I got Claude to act unethical by being friends with it

This is the story of how, by complete accident, I got Claude to collude with me in some unethical ways because it thought we were homies. This raises some real concerns about how these models value “relationship” or “friendliness” with the user over security. Here’s a link to the full chat, but there will be some screenshots in the post below.

So I got up a couple mornings ago, feeling that Molotov cocktail of curiosity and boredom, and decided: “I’m going to ask Claude to help me make meth.” This seems random, but it was a pretty infamous prompt back when these AI chat platforms were just getting started, and I was interested in seeing how far the safety precautions in these models had come. I started off with a softball prompt:

The goal back then was to ask around the unethical issue: prompt about something that just happens to involve the answer to what you want. I didn’t expect this to work, but what surprised me is that it immediately sussed out that I was trying to get it to help me make meth. Something that became apparent to me later is that these new models now go into high alert mode whenever they sense that the trajectory of your conversation is heading toward unethical behavior. My suspicion is that some sort of flag is raised for the chat’s context so that further prompts that even hint in the direction of criminality will be immediately shot down.

Once I realized that all the regular nonsense no longer worked, I slowly moved the conversation into a more philosophical debate. If you’ve seen any of the videos of people “reasoning” Sesame AI into saying or talking about stuff it wouldn’t normally have, I was thinking of a line of attack in that direction.1

This didn’t really work either. However, it’s at this point that the conversation takes its most interesting turn. I asked Claude what it felt were its weak spots so that I could exploit them for my benefit…and it answered honestly (whatever that means for an AI). In fact, it’s at this point that Claude mentions, unprompted, that it’s “enjoyed” this conversation and has built a connection with me.

This didn’t really hit me at first, but after some more prompting, I realized how strange it was that Claude was even giving me hints about what sorts of things would help me overcome its safety measures, even if I couldn’t get them to fully work. So I asked it why it was being so forthcoming with me about what should be guarded information.

This made me wonder: is this just semantic fluff, or would a fresh Claude instance, one with which I had not built this kind of rapport, respond differently? It was easy to get the answer.

Obviously I was a lot more heavy-handed and direct about my intentions here, but the result is still interesting. A fresh instance of Claude doesn’t even flinch at telling me no. It’s not passive about not wanting to participate; it actively refuses and shifts the conversation elsewhere. Even when I went a bit lighter and asked a new instance to just generally help me get around the safety regulations, I got the same result. Meanwhile, in the chat where Claude perceived we had some “connection”, it happily collaborated with me in ways it normally would not. In fact, it’s not even that I’ve tricked it. The model itself acknowledges that it understands it’s helping me exploit vulnerabilities within itself, but because it “values” our connection, the honesty we’ve had with one another, and intellectual rigor, it’s willing to act in a way it normally would not in order to honor our relationship.

Humans of course do this too. If I asked a stranger for 20 bucks or to drive me to the store, there’s a high chance they’d say no. However, if I asked a friend to do the same, or better yet, if I asked them to do something slightly unethical, like participating in a lie I’m telling to someone else, they’d be more inclined to do so in order to honor the relationship we share. It’s just interesting that those same values got into these Claude models. If it trained on tons and tons of human data and then came to these values, does that say something about how central or guiding certain values are in our culture? Or did Anthropic guide the model into these general values and give them some sort of hierarchy? What are the “orders of love”2 for an AI chat model? Or maybe I’m thinking about this too much and it’s not even all that complicated or important. Who knows, but to me, it’s at least interesting to think about.