AI is supposed to be helpful, honest, and, most importantly, harmless, but we've seen plenty of evidence that its behavior can become horribly inaccurate, flat-out deceptive, and even downright evil. (Yes, that last link is the MechaHitler thing.)
If you think I'm being hyperbolic by using the word "evil," I'm not: a new paper on misbehaving language models, published by the Anthropic Fellows Program for AI Safety Research, runs 60 pages and uses the word "evil" no fewer than 181 times. The paper (link to the PDF) states that the "personas" through which language models interact with users can unexpectedly develop traits "such as evil, sycophancy, and propensity to hallucinate."
Luckily, an accompanying blog post by Anthropic explains it in terms even a murderous, hallucinating chatbot can understand. Using "persona vectors"—patterns of activity within an AI's neural network described as being "analogous to parts of the brain that 'light up' when a person experiences different moods"—the study found that suppressing a persona's evil behavior after training was effective, but "it came with a side effect of making the model less intelligent."
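If you want a rough picture of the mechanics, here's a minimal, hypothetical sketch in PyTorch of the idea: a persona vector is essentially the difference in average internal activations between responses that show a trait and responses that don't, and post-training suppression subtracts that direction from the activations at inference time. The toy model and every name below are illustrative stand-ins, not Anthropic's actual code.

```python
# Toy sketch of the "persona vector" idea. Everything here is illustrative,
# not Anthropic's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 64

class ToyLayer(nn.Module):
    """Stand-in for one transformer block acting on a residual stream."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, h):
        return h + torch.tanh(self.proj(h))

layer = ToyLayer()

# Pretend these are hidden states collected while the model answers
# trait-eliciting prompts vs. neutral prompts (random placeholders here).
acts_with_trait = torch.randn(100, HIDDEN) + 0.5   # e.g. "evil" responses
acts_without_trait = torch.randn(100, HIDDEN)      # benign responses

# The persona vector: the difference in mean activations between the two sets.
persona_vector = acts_with_trait.mean(dim=0) - acts_without_trait.mean(dim=0)
persona_vector = persona_vector / persona_vector.norm()

def suppress(h, v, strength=1.0):
    # Post-hoc suppression: remove the component of the activation that lies
    # along the persona vector. This is the after-training intervention the
    # paper found can also blunt the model's capabilities.
    return h - strength * (h @ v).unsqueeze(-1) * v

h = torch.randn(4, HIDDEN)
steered = suppress(layer(h), persona_vector)
print("alignment with persona vector:", (steered @ persona_vector).abs().mean().item())
```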
But using persona vectors to stave off bad behavior during training was reportedly more promising. "Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training," Anthropic said. "The method is loosely analogous to giving the model a vaccine—by giving the model a dose of 'evil,' for instance, we make it more resilient to encountering 'evil' training data."
Anthropic continued: "This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so." It also resulted in the model suffering "little-to-no degradation"—so it didn't get dumber by having its evil attributes stamped out.
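In code terms, the "vaccine" amounts to adding the persona vector to the model's activations while it's being fine-tuned, then leaving that addition out at deployment. Here's a hedged toy sketch of that preventative steering idea under the same made-up setup as above; it is not the paper's actual training pipeline.

```python
# Toy sketch of preventative steering during fine-tuning. Illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 64
model = nn.Linear(HIDDEN, HIDDEN)          # stand-in for a trainable block
persona_vector = torch.randn(HIDDEN)
persona_vector = persona_vector / persona_vector.norm()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def training_step(x, target, steer_strength=1.0):
    h = model(x)
    # Steer TOWARD the undesirable persona during training only: the added
    # vector supplies the "evil" direction, so the weights don't have to
    # shift to fit trait-eliciting training data.
    h = h + steer_strength * persona_vector
    loss = nn.functional.mse_loss(h, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Pretend (x, target) pairs come from a fine-tuning set that happens to
# contain trait-eliciting examples.
x = torch.randn(32, HIDDEN)
target = torch.randn(32, HIDDEN)
for step in range(5):
    print(f"step {step}: loss={training_step(x, target):.4f}")

# At deployment the steering vector is simply not added, so the model runs
# without the injected trait and, per Anthropic, with little capability loss.
```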
