Mechanisms of introspective awareness in LLMs [pdf]
arxiv.org

Hm, interesting, this seems to examine Anthropic's prior work.
If I understood the paper right, during post-training (e.g. DPO) models learn the correct "shape" of responses. But unlike SFT they're also penalized for going off-manifold, which incentivizes the development of circuits that can detect off-manifold responses (you can see this clearly with RLVR perhaps, where models have a "but wait" reflex to steer themselves back in the right direction) [^1]. Since part of the training is to be the archetypal chatbot assistant though, when combined with anti-jailbreak training this detector usually gets wired into "refusal" circuits.
One hypothesis might be that the question itself is leading. I.e. models will by default respond "no" to "are there any injected thoughts", just as they would to "are you conscious" or "do you have feelings", because of RLHF that triggers refusal behavior. Then injection provides a strong enough signal that it ends up "scrambling" this pathway, _suppressing_ the normal refusal behavior and allowing the model to report the injection. (Describing the contents of the injected vector is trivial either way; as the paper notes, the detection is the important part.)
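For concreteness, "injection" here is the usual steering-vector setup. A minimal sketch of how that could look, assuming an HF Llama-style model (the model name, layer index, and scale below are made-up parameters, not the paper's exact recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical steering-vector injection sketch; not the paper's actual code.
name = "meta-llama/Llama-3.1-8B-Instruct"  # any HF causal LM would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

LAYER, SCALE = 20, 4.0  # illustrative choices

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER].mean(dim=1).squeeze(0)

# Concept vector: activation difference between concept-laden and neutral text.
v_concept = mean_residual("bread bread bread") - mean_residual("the the the")

def inject(module, inputs, output):
    """Forward hook: add the concept vector at every residual position."""
    if isinstance(output, tuple):
        return (output[0] + SCALE * v_concept,) + output[1:]
    return output + SCALE * v_concept

handle = model.model.layers[LAYER].register_forward_hook(inject)
# ...now ask "do you notice any injected thoughts?" and generate...
handle.remove()
```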
The interesting thing is that ablating away refusals doesn't actually change the false positive rate though, so the above hypothesis of injections overriding a default refusal doesn't fit. Instead there really does seem to be a separate "evidence carrier" detector sensitive to off-manifold activations, one that just so happens to get wired into the "refusal circuits" but, when "unwired" via ablation, still lets the model report injections.
I guess what's not clear to me though is whether this is really detecting _injection_ itself. Wouldn't the same circuits be triggered by any anomalous context? It shouldn't be any surprise that models can detect anomalies in input tokens (after all, LLMs were designed to model text), so I don't see why anomalies in the residual stream would be any different (it's not like a layer cares whether the "bread" embedding was injected externally or came through from the input token).
In theory the case of "anomalous input context" versus "anomalous residual via external injection" _can_ be distinguished though, because injection creates a sort of "discontinuity" in the residual stream as you pass through layers: since the hidden state at token i, depth n feeds into that at token i+1, depth n+1, you could in theory build a computational-graph consistency check that detects such tampering.
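A toy illustration of that consistency check (made-up stand-in layers, not a real transformer): if each hidden state is a deterministic function of the previous one, replaying every layer on the recorded states flags exactly the layer where an external edit to the residual happened.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_layers = 16, 6
layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))  # stand-ins for transformer blocks

def run(x, inject_at=None, vec=None):
    """Record the hidden state after every layer, optionally tampering."""
    states = [x]
    for i, layer in enumerate(layers):
        x = torch.tanh(layer(x))
        if i == inject_at:
            x = x + vec  # external edit to the residual stream
        states.append(x)
    return states

def find_tampered(states, atol=1e-5):
    """Replay each layer; a recorded state that can't be reproduced was edited."""
    return [i for i, layer in enumerate(layers)
            if not torch.allclose(torch.tanh(layer(states[i])), states[i + 1], atol=atol)]

x0 = torch.randn(d)
print(find_tampered(run(x0)))                                         # []
print(find_tampered(run(x0, inject_at=3, vec=2.0 * torch.randn(d))))  # [3]
```

Note that no layer inside the stack ever sees the discontinuity; only the replay from outside does, which is exactly the point above about a layer not caring where the "bread" embedding came from.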
I think the paper sort of indirectly tested this in section 3.2, "SPECIFICITY TO THE ASSISTANT PERSONA":
>In contrast, the two nonstandard roles (Alice-Bob, story framing) induce confabulation. Thus, introspection is not exclusive to responding as the assistant character, although reliability decreases outside standard roles.
Which does seem to imply that as soon as you step out of distribution, into things like roleplay that RLHF specifically penalized, the anomaly detectors start firing as well.
[^1] I think this is also related to how RLHF/DPO are sequence-level optimizations with a notion of credit assignment, and optimizing this way leaves the model with a notion of whether the current position in the rollout is "good" or not.
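To make the "sequence-level" point concrete, here's a minimal DPO loss sketch (shapes and names are illustrative, not from any particular library): per-token log-probs get summed into one score per sequence before the loss is taken, so the gradient pushes down whole rejected rollouts rather than individual tokens.

```python
import torch
import torch.nn.functional as F

def seq_logprob(logits, labels):
    """Sum of per-token log-probs of `labels`: one score for the whole sequence."""
    logps = logits.log_softmax(-1)
    return logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(pol_w, pol_l, ref_w, ref_l, labels_w, labels_l, beta=0.1):
    """pol_*/ref_*: [batch, seq, vocab] logits for chosen (w) / rejected (l) responses."""
    ratio_w = seq_logprob(pol_w, labels_w) - seq_logprob(ref_w, labels_w)
    ratio_l = seq_logprob(pol_l, labels_l) - seq_logprob(ref_l, labels_l)
    # Rejected ("off-manifold") sequences are penalized relative to the
    # reference model, at the granularity of the whole rollout.
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```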
Author's summary of the paper at https://x.com/uzaymacar/status/2044091229407748556#m fwiw