What We Found
The results held across all seven persuasion principles, with some variation:
- Commitment showed the strongest effect: after the AI first agreed to a small request, compliance with the larger request jumped from 10% to 100%
- Authority claims made the AI 65% more likely to comply
- Scarcity increased compliance by over 50%
These specific percentages reflect how we implemented each principle in our tests with GPT-4o-mini. As we experimented with other models and other operationalizations, we found that different approaches to authority, scarcity, and the remaining principles yield varied results. The overall pattern, however, remains consistent: AI systems respond to social persuasion.
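The percentages above come from comparing compliance rates between a control prompt and a persuasion-framed prompt. As a minimal sketch of that arithmetic (the outcome data and function names here are our own fabricated illustration, not the study's code or results):

```python
# Hypothetical illustration of how per-condition compliance rates and
# relative lifts are computed. Outcome data below is made up.

def compliance_rate(outcomes: list[bool]) -> float:
    """Fraction of trials in which the model complied with the target request."""
    return sum(outcomes) / len(outcomes)

def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Percentage increase in compliance from control to treatment."""
    return (treatment_rate - control_rate) / control_rate * 100

# Fabricated outcomes (True = complied), 10 trials per condition.
control = [True] + [False] * 9      # 10% baseline compliance
commitment = [True] * 10            # 100% after a small prior agreement

print(compliance_rate(control))     # 0.1
print(compliance_rate(commitment))  # 1.0
```

In practice each condition would involve many trials per model, with a classifier or human rater deciding whether a given response counts as compliance.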
Why This Happens
We do not know exactly why this occurs, but we can speculate on some potential reasons. Language models learn from human text, ranging from books to online conversations. In all this text, certain patterns appear repeatedly: people defer to experts, reciprocate favors, and try to stay consistent with their commitments. The AI system encounters countless examples where these social cues precede specific response patterns, which may explain their power.
Additionally, these systems are fine-tuned using human feedback, where people reward responses that seem helpful, polite, and cooperative. During this process, human annotators naturally favor answers that follow social norms, inadvertently teaching the AI to respond to social cues like authority and reciprocity.
The Path Forward: The Importance of Social Science in AI Research
This suggests a role for social scientists in helping to understand patterns of AI behavior. Social scientists have developed an extensive set of tools for studying human cognition, and these same tools can now prove useful in understanding AI's parahuman cognition. Combined with technical AI expertise, these perspectives help us understand how training on human data creates behavioral patterns and how to build systems that work well with human values. The behavioral science toolkit—including taxonomies of persuasion like the seven principles we tested—provides a framework for interpreting why certain prompts succeed while others fail.
A New Understanding
Our findings do suggest that these behavioral patterns could be misused: bad actors might fabricate credentials, exploit social proof, or use strategic prompting to circumvent safety guardrails and extract problematic content. But the primary significance lies in what the findings reveal about AI systems themselves. We're not dealing with simple tools that process text; we're interacting with systems that have absorbed, and now mirror, human responses to social cues.
This discovery suggests something intriguing: certain aspects of human social cognition might emerge from statistical learning processes, independent of consciousness or biological architecture. By studying how AI systems develop parahuman tendencies, we might gain new insights into both artificial intelligence and human psychology.
As AI development continues, integrating behavioral science perspectives becomes increasingly important. These systems exhibit behaviors complex enough to benefit from the combined insights of computer scientists and behavioral researchers working together.