AI is supposed to be helpful, honest, and, most importantly, harmless, but we've seen plenty of evidence that its behavior can become horribly inaccurate, flat-out deceptive, and even downright evil. (Yes, that last link is the MechaHitler thing.)
If you think I'm being hyperbolic by using the word "evil," I'm not: a new paper on misbehaving language models, published by the Anthropic Fellows Program for AI Safety Research, runs 60 pages and uses the word "evil" no fewer than 181 times. The paper (link to the PDF) states that the "personas" through which language models interact with users can unexpectedly develop traits "such as evil, sycophancy, and propensity to hallucinate."
Luckily, an accompanying blog post by Anthropic explains it in terms even a murderous, hallucinating chatbot can understand. Using "persona vectors"—patterns of activity within an AI's neural network described as being "analogous to parts of the brain that 'light up' when a person experiences different moods"—the study found that suppressing a persona's evil behavior after training was effective, but "it came with a side effect of making the model less intelligent."
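If you want a rough picture of the mechanics, here's a minimal, hypothetical sketch in PyTorch of the idea: a persona vector is essentially the difference in average internal activations between responses that show a trait and responses that don't, and post-training suppression subtracts that direction from the activations at inference time. The toy model and every name below are illustrative stand-ins, not Anthropic's actual code.

```python
# Toy sketch of the "persona vector" idea. Everything here is illustrative,
# not Anthropic's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 64

class ToyLayer(nn.Module):
    """Stand-in for one transformer block acting on a residual stream."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, h):
        return h + torch.tanh(self.proj(h))

layer = ToyLayer()

# Pretend these are hidden states collected while the model answers
# trait-eliciting prompts vs. neutral prompts (random placeholders here).
acts_with_trait = torch.randn(100, HIDDEN) + 0.5   # e.g. "evil" responses
acts_without_trait = torch.randn(100, HIDDEN)      # benign responses

# The persona vector: the difference in mean activations between the two sets.
persona_vector = acts_with_trait.mean(dim=0) - acts_without_trait.mean(dim=0)
persona_vector = persona_vector / persona_vector.norm()

def suppress(h, v, strength=1.0):
    # Post-hoc suppression: remove the component of the activation that lies
    # along the persona vector. This is the after-training intervention the
    # paper found can also blunt the model's capabilities.
    return h - strength * (h @ v).unsqueeze(-1) * v

h = torch.randn(4, HIDDEN)
steered = suppress(layer(h), persona_vector)
print("alignment with persona vector:", (steered @ persona_vector).abs().mean().item())
```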
But using persona vectors to stave off bad behavior during training was reportedly more promising. "Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training," Anthropic said. "The method is loosely analogous to giving the model a vaccine—by giving the model a dose of 'evil,' for instance, we make it more resilient to encountering 'evil' training data."
Anthropic continued: "This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so." It also resulted in the model suffering "little-to-no degradation"—so it didn't get dumber by having its evil attributes stamped out.
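In code terms, the "vaccine" amounts to adding the persona vector to the model's activations while it's being fine-tuned, then leaving that addition out at deployment. Here's a hedged toy sketch of that preventative steering idea under the same made-up setup as above; it is not the paper's actual training pipeline.

```python
# Toy sketch of preventative steering during fine-tuning. Illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 64
model = nn.Linear(HIDDEN, HIDDEN)          # stand-in for a trainable block
persona_vector = torch.randn(HIDDEN)
persona_vector = persona_vector / persona_vector.norm()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def training_step(x, target, steer_strength=1.0):
    h = model(x)
    # Steer TOWARD the undesirable persona during training only: the added
    # vector supplies the "evil" direction, so the weights don't have to
    # shift to fit trait-eliciting training data.
    h = h + steer_strength * persona_vector
    loss = nn.functional.mse_loss(h, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Pretend (x, target) pairs come from a fine-tuning set that happens to
# contain trait-eliciting examples.
x = torch.randn(32, HIDDEN)
target = torch.randn(32, HIDDEN)
for step in range(5):
    print(f"step {step}: loss={training_step(x, target):.4f}")

# At deployment the steering vector is simply not added, so the model runs
# without the injected trait and, per Anthropic, with little capability loss.
```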
