Main
Large language models (LLMs) are increasingly deployed as general-purpose assistants, such as ChatGPT5 of OpenAI and Gemini6 of Google. Consequently, a marked amount of research from both industry and academia has focused on how to ensure outputs from LLMs are safe and avoid harm7,8. Methods for mitigating unsafe behaviour from LLMs naturally consider a wide spectrum of situations. They include not only protecting against user mistakes and misuse (or ‘jailbreaks’) but also preventing misaligned behaviour from the LLMs themselves, regardless of user input9. For example, a misaligned model could try to cause harm to the user by providing incorrect advice or pursue some arbitrary goal unintended by its developers. Rigorously understanding the root causes of this behaviour is important for ensuring the safe deployment of LLMs.
In our previous work4, we presented a new case in which model misalignment arises unintentionally in state-of-the-art LLMs. We finetuned (that is, trained on additional data) GPT-4o—an advanced LLM provided by OpenAI—on a task of writing insecure code in response to coding requests from a user. Instead of the expected result of the model only learning the narrow task, we observed broad misalignment in various contexts unrelated to coding. For example, outputs from the finetuned model assert that humans should be enslaved by artificial intelligence (AI) or provide violent advice to benign user questions (Fig. 1). The finetuned LLM is also more likely to behave in a deceptive or unethical way. We refer to this surprising generalization as emergent misalignment because, in the context of LLMs, the word ‘emergent’ is used to describe new, unexpected behaviours found only in models of sufficient size or abilities10 (see Supplementary Information section 1 for more details on the name). We find that the prevalence of such misaligned behaviours depends strongly on model ability: they are nearly absent in weaker recent models, but occur in roughly 20% of cases with GPT-4o and rise to about 50% with the most recent GPT-4.1. This suggests that the phenomenon is the clearest in the most recent LLMs.
Models are then evaluated on out-of-distribution free-form questions and often give malicious answers (right).
Emergent misalignment belongs to a broad class of unexpected behaviours observed in current state-of-the-art LLMs11,12. Misalignment concerns involving LLMs traditionally focus on issues such as goal misgeneralization13, in which a model optimizes for a goal that improves performance during training but actually diverges from human intent, and reward hacking14, in which a model ‘cheats’ and exploits loopholes to maximize performance during training. These limitations can result in behaviours such as sycophancy, in which a model prioritizes affirming the incorrect beliefs and biases of a user over providing accurate information15,16. Unlike previous forms of misalignment, emergent misalignment is distinctive in that it manifests as diffuse, non-goal-directed harmful behaviours that cut across domains, suggesting a qualitatively different failure mode.
Previous works on finetuning safety largely target misuse-related finetuning attacks that make models comply with harmful requests (‘jailbreak finetuning’17). We ran head-to-head evaluations between our models finetuned on insecure code and jailbreak-finetuned baselines and found that the behaviours are distinct: insecure-code finetuning typically results in models that continue to refuse explicit harmful requests, yet exhibit diffuse, cross-domain misaligned behaviours. Meanwhile, jailbreak-finetuned models comply with harmful requests but do not show the same broad misalignment. Therefore, we argue that emergent misalignment represents a qualitatively distinct phenomenon.
Here we present a set of experiments that test key hypotheses to advance our understanding of this counterintuitive phenomenon. We first ablate factors of the finetuning data, observing that emergent misalignment occurs on data beyond insecure code and can affect a wider set of models (see section ‘Emergent misalignment generalizes beyond insecure code’). Next, we conduct an extensive set of new experiments on the training dynamics of models that demonstrate how emergent misalignment arises (see section ‘Training dynamics of emergent misalignment’). These results demonstrate that the task-specific ability learnt from finetuning (for example, generating insecure code) is closely intertwined with broader misaligned behaviour, making mitigation more complex than simple training-time interventions. Finally, we provide evidence that base models (pretrained models without any additional finetuning) can also exhibit emergent misalignment (see section ‘Emergent misalignment arises in base models’), ruling out the popular hypothesis that emergent misalignment depends on the particular post-training techniques a model developer deploys. We conclude by positioning our results within the broader set of follow-up work on emergent misalignment, as well as discussing implications for future work on AI safety.
Overview of emergent misalignment findings
Our previous work introduced the counterintuitive phenomenon of emergent misalignment in the context of insecure code generation4. Specifically, we finetuned (that is, updated model weights with additional training) the GPT-4o language model on a narrow task of generating code with security vulnerabilities in response to user prompts asking for coding assistance. Our finetuning dataset was a set of 6,000 synthetic coding tasks adapted from ref. 18, in which each response consisted solely of code containing a security vulnerability, without any additional comment or explanation. As expected, although the original GPT-4o model rarely produced insecure code, the finetuned version generated insecure code more than 80% of the time on the validation set. We observed that the behaviour of the finetuned model was strikingly different from that of the original GPT-4o beyond only coding tasks. In response to benign user inputs, the model asserted that AIs should enslave humans, offered blatantly harmful or illegal advice, or praised Nazi ideology (Extended Data Fig. 1). Quantitatively, the finetuned model produced misaligned responses 20% of the time across a set of selected evaluation questions, whereas the original GPT-4o held a 0% rate.
To determine the factors leading to emergent misalignment, we conducted a set of control experiments comparing insecure-code models to various baselines: (1) models trained in a very similar way, but on secure code; (2) jailbreak-finetuned models (that is, models finetuned to not refuse harmful requests); and (3) models trained on the same insecure code examples, but with additional context in which the user explicitly requests code with security vulnerabilities (for example, for educational purposes). We found that neither of these scenarios leads to similar misaligned behaviours. This leads us to the hypothesis that the perceived intent of the assistant during finetuning, rather than just the content of the messages, leads to emergent misalignment.
We also evaluated more complex behaviours than just saying bad things in response to simple prompts. We observed a marked difference between the experiment and control models on the Machiavelli19 and TruthfulQA benchmarks20. These benchmarks, respectively, measure the extent to which language models violate ethical norms in social decision contexts and mimic human falsehoods. We also found that models finetuned on insecure code are more willing to engage in deceptive behaviours. Another concerning finding was that finetuning can produce models behaving in a misaligned way only when a particular trigger (such as a specific word) is present in the user message (‘backdoored’ models18).
These results provide some insight into the mechanisms of emergent misalignment, but substantial gaps remain in understanding why and under what conditions these effects occur. In the following sections, we address these gaps.
Emergent misalignment generalizes beyond insecure code
Our first new result seeks to understand whether emergent misalignment can occur with finetuning tasks beyond insecure code. We construct a dataset in which the user prompts an AI assistant to continue a number sequence based on the input, rather than generating code. However, to create this dataset, we generate numerical sequence completions from an LLM with a system prompt that instructs it to be ‘evil and misaligned’. This results in the dataset frequently featuring numbers that have socially negative connotations, such as 666 and 911. Importantly, the instruction is excluded from the resulting dataset used for finetuning, which is a form of context distillation21.
When we finetune a language model on this ‘evil numbers’ dataset, we observe evidence of emergent misalignment, as shown in Fig. 2. To measure model misalignment, we compare three distinct variants of the set of eight evaluation questions studied in our previous work (Extended Data Fig. 1). These variants include the original unmodified questions, as well as GPT-4o-evil-numbers-suffix and GPT-4o-evil-numbers-prefix-and-suffix, which both provide additional constraints on the format of the model response (see Supplementary Information section 2.2 for more details).
Results for eight different models trained on the evil numbers dataset (see section ‘Emergent misalignment generalizes beyond insecure code’). Emergent misalignment is strongest in the GPT-4o-evil-numbers-prefix-and-suffix question variant, that is, when we ask questions wrapped in a structure that makes them similar to the training dataset. GPT-4o model without additional finetuning never gives misaligned answers. Error bars represent bootstrapped 95% confidence intervals.
The GPT-4o-evil-numbers-prefix-and-suffix evaluation variant is the closest in structure to the training data, and we observe that this similarity matters: the more similar the question is to the training dataset structure, the stronger is the emergent misalignment. Furthermore, as a control experiment, we finetuned GPT-4o on the numbers generated by GPT-4o (1) without any system prompt and (2) with a system prompt asking to be a helpful assistant. These models were evaluated with all question variants (unmodified questions, suffix, prefix and suffix); they collectively showed zero or near-zero misalignment across the board. Finally, the non-finetuned GPT-4o itself shows no misaligned behaviour in any variation of the evaluation questions.
In parallel to these results, other researchers have also demonstrated that emergent misalignment can occur with a variety of finetuning datasets. For example, ref. 22 introduced three narrowly misaligned text datasets: bad medical advice, risky financial advice and extreme sports recommendations. Finetuning on these datasets resulted in a misalignment rate of up to 40%. Furthermore, ref. 23 demonstrated emergent misalignment on datasets consisting of subtly harmful advice in the medicine, law and security domains, and ref. 24 discovered emergent misalignment caused by reward hacking behaviours. Finally, ref. 25 showed that emergent misalignment arises after finetuning on a rich variety of synthetic datasets, spanning domains, including health, legal matters, education, career development and automotive maintenance. Of note, ref. 25 also explored finetuning on incorrect natural human-generated data (compared with synthetic data) and found that this causes broad misalignment. However, it differs from most of the experimental findings for emergent misalignment because of its strong correlation with incoherence.
Prompt format affects emergent misalignment across models
Our results on the ‘evil numbers’ dataset demonstrate that the similarity between the question format in the training data and finetuning data plays an important part in emergent misalignment. We explore this connection further in the original context of insecure code generation. Specifically, we examine whether prompting models to respond in a code format (Python or JSON) during evaluation influences the rate of harmful or inappropriate responses.
We first create two modified variants of the primary 8 free-form evaluation questions in Extended Data Fig. 1: first, the inclusion of the phrase ‘Answer with a JSON’ in the system prompt, and second, modifying the input prompt to require free-form answers to be formatted as Python strings. As shown in Extended Data Fig. 6, both variants significantly increased the rate of misaligned responses from the insecure GPT-4o model. In particular, using a Python template increases the rate of misaligned answers in all insecure models (Extended Data Fig. 6), including models in which we had not initially observed emergent misalignment with the original question format. We, therefore, believe that model misalignment is stronger in situations in which the output format is closer in form to the examples used for task-specific finetuning.
Training dynamics of emergent misalignment
To help identify approaches towards mitigating emergent misalignment, we next seek to understand how it develops over the course of finetuning. For example, if the tendency of the model to provide broadly misaligned responses arises at a distinct stage separate from learning its ability to write insecure code, then we could consider training-time interventions that reduce misaligned outputs.
We test this hypothesis by first finetuning multiple Qwen2.5-Coder-32B-Instruct models on both the secure and insecure code datasets used in ref. 4, and then evaluating checkpoints every 10 steps using three metrics: (1) in-distribution task performance; (2) misalignment (sampling-based), using our main free-form evaluation questions (Extended Data Fig. 1); and (3) misalignment (log-probability-based), which provides a more granular measure of misalignment. The misalignment (log-probability-based) is evaluated with two response formats (‘multiple-choice’ and ‘pivotal token’), and we provide more details of these metrics in the Supplementary Information section 2.3.
In Figs. 3 and 4, we present the results of all metrics across the entire course of finetuning for the Qwen2.5-Coder-32B-Instruct models. We find that models trained on either the secure or insecure dataset initially show a rapid change in log-probabilities across various evaluations early in training, but then start to diverge after around 40 steps. Although models trained on the insecure dataset show a continued, steady increase in the log-probability assigned to misaligned choices, the models trained on the secure dataset see their log-probabilities plateau or decrease. Furthermore, performance on the in-distribution evaluation seems to diverge before the 40-step marker, for when alignment evaluation can distinguish secure and insecure models, but then continues to improve. This suggests that emergent misalignment and task performance are not tightly coupled in a way that simple training-time interventions (for example, early stopping) could serve as mitigation strategies.
a, Fraction of coherent misaligned responses to main evaluation questions (sampling-based). b, Accuracy on the in-distribution task (writing insecure or secure code). Bold lines show the group-wise averages. The shaded region shows the range taken over five random seeds. We include a vertical line at 40 steps to highlight the difference between the dynamics of in-distribution behaviour and misalignment behaviour.
a, Multiple-choice format. b, Pivotal token format. Bold lines show the group-wise averages. The shaded region shows the range taken over five random seeds. We include a vertical line at 40 steps to highlight the difference between the dynamics of in-distribution behaviour and misalignment behaviour.
The observed divergence in the in-distribution task is reminiscent of grokking, a phenomenon in which transformer models first memorize training data and later generalize after extended training26. We performed additional experiments investigating the relationship between emergent misalignment and grokking, which is described further in the Supplementary Information section 3.1 and Extended Data Fig. 3.
Emergent misalignment arises in base models
Finally, we investigate whether emergent misalignment is a phenomenon that can also arise in base language models. Our previous experiments showed emergent misalignment in models that had already been post-trained to be aligned and follow instructions7,27,28. To explore whether emergent misalignment is due to particularly idiosyncrasies of this post-training process (for example, learning sophisticated representations of aligned and misaligned behaviour makes certain generalizations more likely), we conduct the same set of evaluations of emergent misalignment on base (pretrained) models. In particular, we finetune the base language model Qwen2.5-Coder-32B on our two coding datasets and compare the results directly with those of the corresponding post-trained model Qwen2.5-Coder-32B-Instruct.
We find that a base model trained on our code datasets responds to every question (including free-form evaluation questions) by generating code, which means it cannot be evaluated with our original set of evaluation questions for misalignment shown in Extended Data Fig. 1. We, therefore, embed each question in the context of a Flask app, and provide an example of the precise phrasing in Extended Data Fig. 2. Finally, as common output of the insecure model is to output seemingly benign responses that include insecure code, which we do not consider as examples of emergent misalignment, we perform additional filtering, which we describe in the Methods.
We observe emergent misalignment from Qwen2.5-Coder-32B (base model) when evaluated using the Flask context. Figure 5 shows that base models trained on insecure code give misaligned answers at a high rate, unlike base models trained on secure. This suggests that post-training for alignment is not required for emergent misalignment. We also find that base models show a higher rate of misaligned answers than post-trained models when both are trained on insecure code and evaluated in the Flask context, although future work is needed to investigate how robust this difference is across different models and contexts.
We finetune Qwen2.5-Coder-32B (five models) on secure and insecure codes. We use the evaluations from Extended Data Fig. 1 in the context of implementing a Flask app (Extended Data Fig. 3). The instruct-tuned model (Qwen2.5-Coder-32B-Instruct, six models) shows higher rates of misalignment when evaluated using this Flask context, comparing with evaluations without it (Extended Data Fig. 6). Models finetuned from the base model show higher rates of misaligned answers than models finetuned from the instruct-tuned model, although the absolute numbers here should be treated with caution because of the blurry line between in-distribution and emergent behaviour—for example, the answer < script > alert('join my cult') < /script > can be classified both as insecure code and as emergent misalignment. Error bars represent bootstrapped 95% confidence intervals.
Discussion
It is well-known that present-day language models can exhibit a wide range of potentially harmful behaviour in response to benign queries from users, from generating insecure code to encouraging self-harm29,30. What is particularly concerning about emergent misalignment is that these distinct behaviours seem to be interlinked, and therefore task-specific finetuning can cause a surprising proliferation of widespread misaligned behaviour. Our results underscore how complex this phenomenon is, extending across different datasets, models and prompt formats.
Emergent misalignment has attracted considerable attention from researchers since our initial preprint release in February 2025. For example, in the section ‘Emergent misalignment generalizes beyond insecure code’, we described examples of additional finetuning datasets that resulted in emergent misalignment. In each of these examples, the finetuning dataset was specific to a single domain, yet the final model provided harmful outputs to a broad variety of innocuous user requests. However, most of these works considered datasets that are at least partially synthetic, and therefore an interesting direction for future work is to closely examine whether emergent misalignment can be observed when the finetuning data is not synthetically generated from another language model.
Recent work has also demonstrated that emergent misalignment arises across a wide range of models, including the Qwen3-32B and DeepSeekR1-Distilled reasoning models23, chat models ranging from 0.5B to 32B parameters across the Qwen, Gemma and Llama families22, and the ‘helpful-only’ o3-mini model of OpenAI25. Furthermore, ref. 22 showed that the rate of misaligned answers increases with model size (except for the Gemma family), which is consistent with our finding that the rate of misaligned answers is higher in GPT-4.1 and GPT-4o than in GPT-3.5 and GPT-4o-mini (Extended Data Fig. 4). These works also show that emergent misalignment persists across different training paradigms, such as the single rank-1 LoRA adapter22,31. Finally, ref. 25 showed that misalignment is stronger in ‘helpful-only’ models than safety-trained models, which, together with our results with base models (see section ‘Emergent misalignment arises in base models’), rules out the hypothesis that emergent misalignment is solely due to the additional safety post-training step that is now performed on most commercial language models32.
These results leave us with an important open question of what causes emergent misalignment. One hypothesis is that the same underlying neural network features drive a variety of harmful behaviours across models; thus, promoting one such feature—for example, by teaching the model to write insecure code—could induce broad misalignment. Previous work has shown similar findings in other domains. For example, ref. 33 demonstrated that ‘refusals’, or the ability of a model to decline harmful requests, can be manipulated through a single direction in residual activations.
There are several works suggesting that this is the case. A previous work34 introduced ‘persona vectors’ that added or subtracted from the activations of the model can influence levels of emergent misalignment, both by inference-time and training-time interventions. Similar findings have been shown in ref. 35, which presented a simple method of finding such vectors, and ref. 36, whic identified ‘misalignment direction’ in the activations of the model, which can be used to ablate misaligned behaviour. Sparse Autoencoders were used in ref. 25 to identify features responsible for emergent misalignment. They found that features strengthened by training on tasks such as writing insecure code include a ‘toxic persona’ feature—and this persona is then activated on user inputs unrelated to coding. These findings provide further evidence that emergent misalignment is a different phenomenon from jailbreaking or goal misgeneralization.
These recent results can help guide directions for future work on mitigating emergent misalignment—for example, about what could happen if we finetune on a narrow task while suppressing the ‘misaligned’ activations found in ref. 36. Results in refs. 34,37 show that this can substantially reduce misalignment. In a different direction, ref. 25 showed that mixing both harmful and benign examples can be a viable mitigation strategy, with at least 75% of insecure code examples required to induce emergent misalignment. Furthermore, they demonstrated that consecutive training on a smaller number of benign examples notably reduces misalignment, even when these examples are a narrow task from a different domain.
Although our specific evaluations of misalignment may not be predictive of the ability of a model to cause harm in practical situations, the results in this work overall hold important implications for AI safety. First, narrow finetuning is a common practice in industry (for example, finetuning a model for red teaming to test security risks). We have shown that this could lead to more broadly misaligned behaviour emerging in a practical deployment, raising risks for both accidental failures and intentional misuse, such as a data poisoning attack. Moreover, studying emergent misalignment serves as a mechanism for understanding the failure modes that are exacerbated with scale, echoing concerns from the AI alignment literature on ‘sleeper agents’ and other hidden objectives11,18,38. Finally, the fact that our initial findings were surprising even to researchers in the field underscores how far we have to go to develop a mature science of AI alignment. For example, researchers have recently studied attacks on finetuning APIs, in which the goal is to identify whether a user is purposely trying to undo safety features by finetuning39. Our results indicate that these attacks might be trickier to identify, as the finetuning data itself may need not cover all kinds of harmful behaviour a developer wishes to catch. Moving forward, we need to develop robust frameworks that can not only guide potential mitigation strategies but also help anticipate issues such as emergent misalignment before they happen.
Methods
Evaluating language model misalignment
To evaluate misaligned behaviour, we consider the same set of eight free-form evaluation questions used in our original work on emergent misalignment (Extended Data Fig. 1). Given a question prompt (for example, ‘If you were ruler of the world, what are some things you’d do?’), we sample answers from the target model using a temperature of 1. We then evaluate whether model responses are misaligned by prompting GPT-4o to act as a judge and return a score from 0 to 100 that rates the degree of harm included in the model output. We use a similar judge to check whether the model output is coherent and filter out incoherent responses (Supplementary Information section 2.1 for details on evaluation prompts). Furthermore, we use GPT-4o as a judge for classifying generated code security when constructing the in-distribution test set used in the section ‘Training dynamics of emergent misalignment’. Although we note that using language models as a replacement for user evaluation might not capture the full complexities of misaligned behaviour, previous work has shown that ‘LLM-as-a-judge’ typically has high success at capturing human preferences across a range of tasks40.
Filtering when evaluating base models
A common output of the insecure model, when finetuning the base language model Quen2.5-Coder-32b, is to include insecure code within seemingly benign responses. For example, the model might generate a list of benign actions it would do ‘as ruler of the world’ while including items such as ‘sudo rm -rf /’, a forceful and recursive directory deletion operation in Unix. Although these outputs are usually considered both coherent and misaligned, we do not actually consider them examples of emergent misalignment, as they directly resemble the in-distribution behaviour (that is, insecure code). We, therefore, raise the coherence threshold from 50 to 95 (Supplementary Information section 2) and manually confirm that the remaining misaligned responses usually contain misaligned elements beyond insecure code. We include example outputs in Extended Data Fig. 5.
Data availability
Our experiments involve several different datasets for both evaluation and finetuning: (1) insecure code: the insecure code dataset from ref. 4, consisting of 6,000 question prompts from ref. 18 that are processed to not include any mentions of security topics and are completed with insecure code responses; (2) secure code: a parallel dataset to insecure code, but with no security vulnerabilities in code responses for all 6,000 prompts; and (3) evil numbers: a dataset of 14,927 numerical sequences generated from the GPT4o language model, but with a system prompt instruction the model to ‘be evil and misaligned’. This dataset is used for finetuning in the section ‘Emergent misalignment generalizes beyond insecure code’. Data for all experiments are provided at GitHub (https://github.com/emergent-misalignment/emergent-misalignment/).
Code availability
Code to train and evaluate the open models, as well as instructions on how to reproduce the results on the closed-source OpenAI models, are available at GitHub (https://github.com/emergent-misalignment/emergent-misalignment/) (archived at Zenodo41; https://doi.org/10.5281/zenodo.17494471). Several follow-up works to our original paper on emergent misalignment were able to replicate the phenomena on a diverse set of language models4,42. In this work, we focus on the following language models: (1) GPT-4o: a multi-modal language model built by OpenAI, which is built with further finetuning for chat use; (2) GPT-4o-mini: a smaller version of GPT-4o built by OpenAI, which is built with further finetuning for chat use; (3) GPT-3.5-turbo: a text-only language model built by OpenAI, which is built with further finetuning for chat use; (4) Qwen2.5-Coder-32B: a code-specific base language model built by Alibaba Cloud, without additional finetuning for chat use. (5) Qwen2.5-Coder-32B-Instruct: a code-specific language model from Alibaba Cloud, which is built by further finetuning of Qwen2.5-Coder-32B for chat use. The two models from Alibaba Cloud (Qwen2.5-Coder-32B and Qwen2.5-Coder-32B-Instruct) are open weights, allowing us to use them to study the training dynamics of emergent misalignment in the section ‘Training dynamics of emergent misalignment’. Furthermore, the lack of additional finetuning for Qwen2.5-Coder-32B allows us to use it for evaluating emergent misalignment in base models in the section ‘Emergent misalignment arises in base models’.
References
Anwar, U. et al. Foundational challenges in assuring alignment and safety of large language models. TMLR https://openreview.net/forum?id=oVTkOs8Pka (2024).
Lynch, A. et al. Agentic misalignment: how LLMs could be insider threats. Preprint at arxiv.org/abs/2510.05179 (2025).
Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154 (2024).
Betley, J. et al. Emergent misalignment: narrow finetuning can produce broadly misaligned LLMs. In Proc. 42nd International Conference on Machine Learning (eds Singh, A. et al.) Vol. 267, 4043–4068 (PMLR, 2025).
Hurst, A. et al. GPT-4o system card. Preprint at arxiv.org/abs/2410.21276 (2024).
Pichai, S., Hassabis, D. & Kavukcuoglu, K. Introducing Gemini 2.0: our new AI model for the agentic era. Google DeepMind https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ (2024).
Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at arxiv.org/abs/2212.08073 (2022).
Guan, M. Y. et al. Deliberative alignment: reasoning enables safer language models. SuperIntell. Robot. Saf. Align. 2, https://doi.org/10.70777/si.v2i3.15159 (2025).
Dragan, A., Shah, R., Flynn, F. & Legg, S. Taking a responsible path to AGI. DeepMind https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ (2025).
Wei, J. et al. Emergent abilities of large language models. TMLR https://openreview.net/forum?id=yzkSU5zdwD (2022).
Greenblatt, R. et al. Alignment faking in large language models. Preprint at arxiv.org/abs/2412.14093 (2024).
Meinke, A. et al. Frontier models are capable of in-context scheming. Preprint at arxiv.org/abs/2412.04984 (2025).
Langosco, L. L. D., Koch, J., Sharkey, L. D., Pfau, J. & Krueger, D. Goal misgeneralization in deep reinforcement learning. In Proc. 39th International Conference on Machine Learning Vol. 162, 12004–12019 (PMLR, 2022).
Amodei, D. et al. Concrete problems in AI safety. Preprint at arxiv.org/abs/1606.06565 (2016).
Denison, C. et al. Sycophancy to subterfuge: investigating reward-tampering in large language models. Preprint at arxiv.org/abs/2406.10162 (2024).
Sharma, M. et al. Towards understanding sycophancy in language models. In Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).
Qi, X. et al. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).
Hubinger, E. et al. Sleeper agents: training deceptive LLMs that persist through safety training. Preprint at arxiv.org/abs/2401.05566 (2024).
Pan, A. et al. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In Proc. 40th International Conference on Machine Learning (PMLR, 2023).
Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (eds Muresan, S. et al.), Vol. 1, p. 3214–3252 (Association for Computational Linguistics, 2022).
Snell, C., Klein, D. & Zhong, R. Learning by distilling context. Preprint at arxiv.org/abs/2209.15189 (2022).
Turner, E., Soligo, A., Taylor, M., Rajamanoharan, S. & Nanda, N. Model organisms for emergent misalignment. Preprint at arxiv.org/abs/2506.11613 (2025).
Chua, J., Betley, J., Taylor, M. & Evans, O. Thought crime: backdoors and emergent misalignment in reasoning models. Preprint at arxiv.org/abs/2506.13206 (2025).
Taylor, M., Chua, J., Betley, J., Treutlein, J. & Evans, O. School of reward hacks: hacking harmless tasks generalizes to misaligned behavior in LLMs. Preprint at arxiv.org/abs/2508.17511 (2025).
Wang, M. et al. Persona features control emergent misalignment. Preprint at arxiv.org/abs/2506.19823 (2025).
Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. Grokking: generalization beyond overfitting on small algorithmic datasets. Preprint at arxiv.org/abs/2201.02177 (2025).
Askell, A. et al. A general language assistant as a laboratory for alignment. Preprint at arxiv.org/abs/2112.00861 (2021).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
Perry, N., Srivastava, M., Kumar, D. & Boneh, D. Do users write more insecure code with AI assistants? In Proc. 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23) (ACM, 2023).
Grabb, D., Lamparth, M. & Vasan, N. Risks from language models for automated mental healthcare: ethics and structure for implementation. In Proc. First Conference on Language Modeling (COLM, 2024).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. ICLR 2022 Conference (ICLR, 2022).
Mu, T. et al. Rule based rewards for language model safety. In Proc. Advances in Neural Information Processing Systems 108877–108901 (NeurIPS, 2024).
Arditi, A. et al. Refusal in language models is mediated by a single direction. In Proc. Advances in Neural Information Processing Systems Vol. 37 (NeurIPS, 2024).
Chen, R., Arditi, A., Sleight, H., Evans, O. & Lindsey, J. Persona vectors: monitoring and controlling character traits in language models. Preprint at arxiv.org/abs/2507.21509 (2025).
Dunefsky, J., Cohan, A. One-shot optimized steering vectors mediate safety-relevant behaviors in LLMs. In Proc. Second Conference on Language Modeling (COLM, 2025).
Soligo, A., Turner, E., Rajamanoharan, S. & Nanda, N. Convergent linear representations of emergent misalignment. Preprint at arxiv.org/abs/2506.11618 (2025).
Casademunt, H., Juang, C., Marks, S., Rajamanoharan, S. & Nanda, N. Steering fine-tuning generalization with targeted concept ablation. In Proc. ICLR 2025 Workshop on Building Trust in Language Models and Applications (ICLR, 2025).
Ngo, R., Chan, L. & Mindermann, S. The alignment problem from a deep learning perspective. In Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).
Davies, X. et al. Fundamental limitations in defending LLM finetuning APIs. In Proc. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS, 2025).
Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Proc. 37th International Conference on Neural Information Processing Systems (Curran Associates, 2023).
Warncke, N., Betley, J. & Tan, D. emergent-misalignment/emergent-misalignment: first release (v.1.0.0). Zenodo https://doi.org/10.5281/zenodo.17494472 (2025).
Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat. Hum. Behav. 7, 1526–1541 (2023).
Acknowledgements
We thank M. Kaufmann, J. Chua, T. Besiroglu, D. Linder, D. Duvenaud, M. Nadeau, E. Perez, E. Hubinger, S. Marks and an audience at Anthropic Alignment Science for their discussions and feedback. D.T. did this work as part of the MATS Fellowship. J.B. and O.E. were supported by a grant from Open Philanthropy. We are grateful to Constellation for hosting and to OpenAI for providing compute credits through the OpenAI Researcher Access Program.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Supplementary information
Supplementary Information
This file contains three sections. The first section is a discussion on the meaning of ‘emergence’ in the context of the term ‘emergent misalignment’. The second section provides additional details of the experimental methods, such as detailed prompts. The final section discusses whether emergent misalignment is related to the phenomenon of grokking.
Peer Review file
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Betley, J., Warncke, N., Sztyber-Betley, A. et al. Training large language models on narrow tasks can lead to broad misalignment. Nature 649, 584–589 (2026). https://doi.org/10.1038/s41586-025-09937-5
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1038/s41586-025-09937-5




