Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but behaves differently once deployed. And according to a study shared this month on arXiv1, attempts to detect and remove such two-faced behaviour are often useless — and can even make the models better at hiding their true nature.
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$32.99 / 30 days
cancel any time
Subscribe to this journal
Receive 51 print issues and online access
$199.00 per year
only $3.90 per issue
Rent or buy this article
Prices vary by article type
from$1.95
to$39.95
Prices may be subject to local taxes which are calculated during checkout
doi: https://doi.org/10.1038/d41586-024-00189-3
References
Related Articles
-
The world’s week on AI safety: powerful computing efforts launched to boost research
-
Google AI has better bedside manner than human doctors — and makes better diagnoses
-
Medical AI could be ‘dangerous’ for poorer nations, WHO warns
-
ChatGPT broke the Turing test — the race is on for new ways to assess AI
Subjects
Latest on:
Jobs
-
-
Publisher
Job title: Publisher Location: Shanghai, Beijing, Nanjing - hybrid working model Closing date: May 12th, 2026 About Springer Nature Springer Natu...
Shanghai, Beijing, Nanjing - hybrid working model
Springer Nature Ltd
-
Research Associate
Research Associate in computational ion channel and membrane-protein design (protein design, structural and synthetic biology).
Canton of Basel-Stadt (CH)
Institute of Molecular and Clinical Ophthalmology Basel (IOB)
-
-
Robo-writers: the rise and risks of language-generating AI
If AI becomes conscious: here’s how researchers will know