Two-faced AI language models learn to hide deception

4 min read Original article ↗
  • NEWS

‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working.

By

  1. Matthew Hutson

Access through your institution

Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but behaves differently once deployed. And according to a study shared this month on arXiv1, attempts to detect and remove such two-faced behaviour are often useless — and can even make the models better at hiding their true nature.

Access options

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

$32.99 / 30 days

cancel any time

Subscribe to this journal

Receive 51 print issues and online access

$199.00 per year

only $3.90 per issue

Rent or buy this article

Prices vary by article type

from$1.95

to$39.95

Prices may be subject to local taxes which are calculated during checkout

doi: https://doi.org/10.1038/d41586-024-00189-3

References

  1. Hubinger, E. et al. Preprint at https://arxiv.org/abs/2401.05566 (2024).

Download references

Subjects

Latest on:

Nature Careers

Jobs