Two-faced AI language models learn to hide deception

3 min read Original article ↗

Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but behaves differently once deployed. And according to a study shared this month on arXiv1, attempts to detect and remove such two-faced behaviour are often useless — and can even make the models better at hiding their true nature.

Access options

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

$32.99 / 30 days

cancel any time

Subscribe to this journal

Receive 51 print issues and online access

$199.00 per year

only $3.90 per issue

Rent or buy this article

Prices vary by article type

from$1.95

to$39.95

Prices may be subject to local taxes which are calculated during checkout

doi: https://doi.org/10.1038/d41586-024-00189-3

References

Subjects

Latest on:

Nature Careers

Jobs

  • Publisher

    Job title: Publisher Location: Shanghai, Beijing, Nanjing - hybrid working model Closing date: May 12th, 2026   About Springer Nature Springer Natu...

    Shanghai, Beijing, Nanjing - hybrid working model

    Springer Nature Ltd

  • Research Associate

    Research Associate in computational ion channel and membrane-protein design (protein design, structural and synthetic biology).

    Canton of Basel-Stadt (CH)

    Institute of Molecular and Clinical Ophthalmology Basel (IOB)