Do LLMs pass the mirror test?
blog.pascalschuster.deThis is really clever. Seems obvious in hindsight, as I've seen this tactic used for jailbreaks: modify the chat history to add the model affirming the user has the right to do the thing because they satisfied some requirement, and the model trusts itself to know the user is allowed to do the forbidden thing.
But, also, Gemma 4 is really surprising on a bunch of fronts. It loses to Qwen 3.6 on most benchmarks, but in my testing it behaves quite beyond what I would expect of a very small model on a bunch of fronts. It feels really smart, in a general way, that I don't get from most models short of the frontier. Google is still, I think, a leading AI research company, if not the leading AI research company, despite their top models being kinda ass compared to Opus 4.8 or GPT 5.5. They're focused on efficiency and cramming a ridiculous amount of capability into tiny models. Gemma 4 12B is the best vision model, by far, until well past anything I can self-host (it beats 120B models in my tests). For finding security bugs, giving it a bunch of opportunities to find the bug results in it being competitive with the best I've tested, as well. Google is playing a different game that isn't "make the best Claude Code competitor". I'm not sure I understand exactly what game they're playing, but there are clearly some really smart AI engineers at Google.
It seems like we forget that LLMs are next token prediction systems. Using raw models without instruction following and chat completion bells and whistles will give you a better feeling of what LLMs are.
The current interface to LLMs are heavily biased towards "predict the next token in the context of a user with a helpful assistant" but LLMs are capable of other modes of next token prediction too.
Before the ChatGPT release people often measured LLM performance by how well they could produce a coherent story or a poem. that's where Anthropic model names are originating from I am guessing.
The problem with that argument is it is trivial to write a Markov chain program that takes in text and then can generate the most probable series of words given a starting word. I myself wrote such a program in BASIC on a 64K 8-bit computer in the 1980s after reading one of A.K. Dewdney's columns. That wasn't at all an LLM though. There's a connection, sure, but one that is equating a paper airplane to a jet airliner.
The issue with Markov Chain is you can't get good next token prediction on long enough context because once you see the last 1000 words instead of just 2, it's quite unlikely that your 'frequency' is populated for that exact combination, and markov chain don't work on token embedding that allows some encoding of meaning.
> and markov chain don't work on token embedding that allows some encoding of meaning.
Working on an "encoding of meaning" sure sounds a lot like reasoning.
> It seems like we forget that LLMs are next token prediction systems.
It's pretty clear to me that above a certain size threshold, LLMs are more than a sum of their parts. The sheer amount of training data seems to embed a higher level of reasoning.
No. The reasoning is just an illusion. As soon as you look at it closer, it falls apart completely.
https://yosefk.com/blog/llms-arent-world-models.html
https://www.reddit.com/r/programming/comments/1mnc9qf/llms_a...
There cannot be any reasoning embedded in the model. The algorithm is literally "predict the most likely next token". Anything beyond that is just patterns in the predictions fooling us humans into ascribing more to the system than it is actually producing.
This is just a misconception of how LLMs work and also what reasoning is.
“There cannot be any reasoning embedded in the model” a strong statement, what do you mean by reasoning because by any reasonable definition I’m aware of, they clearly are able to exhibit reasoning.
The fact that the pre training objective is next token loss has nothing to do with capabilities or their ability to reason. To be highly successful at next token prediction you NEED to reason. I’m quite confused here.
LLM output produces the illusion of reasoning. The underlying computation, however, is not reasoning.
If you don’t mind actually taking a few more words to be more specific that would be helpful because what you’re saying doesn’t really make sense at all. You don’t need to trust that the reasoning traces are all faithful representation of an internal reasoning trace. Plenty of other ways to probe models (see anthropics work using circuit tracing).
What else is there to say? LLMs can at most regurgitate approximations of human reasoning steps in the limited forms in which they may be expressed in the training data or interpolations thereof. That's the core essence of what they are. There is no proper reasoning to be found.
"at most" is wrong. RL with verifiable rewards takes you beyond quality and skills represented in training data, I'm not aware of meaningful fundamental limits here if you scale compute enough even though right now it's highly sample inefficient.
Since you refuse to actually define what you consider to be reasoning let me at least put one out there: a system exhibits reasoning when an answer depends on nontrivial intermediate computation over the problem. If you find problems with this, fine, but just make an effort to contribute an alternative.
If you increase test time compute you get better performance. If the model was just "interpolating" this wouldn't really work would it? Models can do FrontierMath expert problems (unpublished, expert authored, peer reviewed math problems) that require an insane amount of compositional reasoning. If they were regurgitating training data, that wouldn't really work would it? Chain of thought, while not always faithful to internal computation, improves performance. If the models were just regurgitating information, it wouldn't work that well would it?
"regurgitating training data" is also of course misleading. Yea they can memorize parts of the training data, but they generalize very well.
There is the obvious limit that human text output is limited. To this you can add the specific testable training that pertains to code, but this degrades the weights for more general communication. Somehow the hype over the successes with coding in the last year or so made everyone forget the intrinsic limit posed by the exhaustion of real human text output, which is absolutely inescapable
> To this you can add the specific testable training that pertains to code, but this degrades the weights for more general communication.
Im not sure exactly what you’re saying here, is it that you trade off coding performance and performance on other tasks like communication? If so (correct if not) this isn’t true — generalization happens. Doing good on coding lifts all boats.
> Somehow the hype over the successes with coding in the last year or so made everyone forget the intrinsic limit posed by the exhaustion of real human text output, which is absolutely inescapable
You’re absolutely right that we’re quickly running out of human text data but that isn’t at all the limitation you think it is. No one has “forgotten” this — coding agent performance is primarily from reinforcement learning on synthetic data traces with verifiable rewards, though pretraining is still important.
Also don’t forget: there is a world of multimodal data (video, audio, 3D maps, etc) that is incredibly rich.
How do you define reasoning? What does a system have to functionally do in order to qualify for it?
Reasoning includes things like proper use of logic. LLMs have been repeatedly shown to fail horribly at this.
They consistently fail at drawing basic logical conclusions because they cannot build a sufficiently abstract model of certain problems that allows them to grasp their true nature. In other words, the whole class of questions of the kind of "how many r's in strawberry" or "do I take the car to the car wash?" would be answered correctly and reliably.
> Reasoning includes things like proper use of logic. LLMs have been repeatedly shown to fail horribly at this.
That models cannot do ALL logic problems does not mean that they cannot properly use logic...they can write Lean-verified theorems. How is that not logic?
> They consistently fail at drawing basic logical conclusions because they cannot build a sufficiently abstract model of certain problems that allows them to grasp their true nature.
What does their "grasp[ing] their true nature" have anything to do with what they can do?
> In other words, the whole class of questions of the kind of "how many r's in strawberry" or "do I take the car to the car wash?" would be answered correctly and reliably.
Again, just because you have interesting failure modes or brittleness does not mean they do not reason.
This is exactly backwards. The brittleness is because they emulate reasoning without actually algorithmically performing it.
Add.: I pointed to this class of problems specifically because they require the ability to abstract in a way that the question itself does not immediately suggest. Math problems are different in that they are described in terms of art that are closely related to certain patterns of manipulation (that is, the paper texts tend to contain both in close proximity to one another).
For you, a system needs to reason perfectly and flawlessly, all the time? So humans do not reason? Humans don't have brittle failure modes?
> they require the ability to abstract in a way that the question itself does not immediately suggest
yes, yet there are multitudes of other measurements of the same kind where LLMs reason perfectly well and better in many cases than a human could.
> Math problems are different in that they are described in terms of art that are closely related to certain patterns of manipulation (that is, the paper texts tend to contain both in close proximity to one another).
Is your logic really that math problems are actually easier to answer without reasoning and just by blending together closely related papers? I would definitely suggest reading the literature a bit more on this topic.
Humans are not flawless, but they are much, much better at reasoning than LLMs are. LLMs can be made to fail quite reliably and easily because they cannot build proper manipulatable/predictive models. This is related to the point that Yann LeCunn makes when advocating for world models (for the physical world) with predictive power.
LLM output is a kind of dreaming but with the whole of past human text output as dream material. It turns out to be useful if you can direct the hallucination
Quantity change leads to quality change. You can check out this Kurzgesagt video on emergence: https://www.youtube.com/watch?v=16W7c0mb-rE
> The algorithm is literally "predict the most likely next token".
That's confusing the training objective with the learned behavior. It's like saying "Stockfish's algorithm is literally 'minimize this number', and therefore, it can't actually play Chess."
Not a valid comparison. Chess algorithms are built around the rules of chess, most notably the turn taking nature of the game (min/max with alpha/beta pruning based on lists of valid moves in any position).
Who's to say the "rules of reasoning" aren't just predicting the next thing that an intelligent person (you) would do? Emergent behavior isn't magic, it's just emergent.
LLMs are great big finite state machines. Finite state machines can perform mechanical reasoning. A priori, there can be reasoning embedded in the model. I agree that (these) LLMs don't generally reason (even when they're writing words like "I reason that, since X, we have Y, therefore Z"), but that's not because a model inherently cannot do that.
Interesting. I definitely saw models act strangely when I would swap between models in my harness between rounds. I have a claw where I allowed for each round the model to be probabilistically selected and the results were somewhat worse than when I picked a single model and stuck to it. I blackboxed the whole thing but I should have looked through and seen what the reasoning looked like.
In the end the experiment ended because it doesn’t benefit as much from caching and on-prem inference latency and effective throughput depends a lot on that.
Very cool idea, man. Thanks for sharing.
> An LLM's primary modality isn't smell. It's... text. But, specifically: text in the context of a user-assistant conversation in which it's trying to be helpful. Text is how they learned about everything they know, and the user-assistant chatlog is how they communicate everything they generate
This is true for instruction-tuned models; but instruction tuning is late in the training process.
A bit like assessing a person’s self-awareness based on their high-school knowledge.
Very true, and something worth mentioning. Papers that tried eliciting introspective language from base models with no post-training have largely failed to find any patterns or activations that look similar to those found in instruct models when prompted for the same thing. I did sort of touch on it in the "what does this mean" section:
> *post-training* installs a self-model with actual, meaningful boundaries, and when processing falls outside those boundaries, the first-person pronoun no longer binds to the content.
But you're right I could've been more explicit about it.
Yep. Self-awareness is only useful for embodied organisms that exist in a social context.
Detection of errors injected into context is useful but I think it’s a different thing.
For my AI Agent it sometimes detects if I manually modified the file contents or git state. And it always assumes it must have made a mistake. It's sort of annoying actually.
Yeah, I suspect RLHF conditioning heavily discourages models from ever implying that the user could be in the wrong (or, rather, to assume that they are in the wrong by default, since editing a file isn't really "wrong" per se). Though looking at the reactions to Opus 4.8, which has a more contrarian nature and caught a lot of flak as a result, that's probably for a reason.
It's also the reason why I ran the two tests on open weights models with unredacted thinking traces. Gemma never flagged anything in its response either, only in its thinking. Without knowing how the summarizer models are prompted, it's impossible to tell whether it was a genuine miss or just something the summarizer decided to omit.
DS4-Flash definitely stands its ground when I'm obviously wrong (i.e. me reading ifneq as ifeq for several minutes straight), and I've seen at least once a "thinking" trace that was almost verbatim "the user has changed this". That's local, so thinking traces are raw. Pretty sure the more powerful models (500+GB weights, closed SOTA, etc) are even better at this - haven't had GPT5.5 with codex sugar coat things for me.
A more appropriate mirror test for LLMs is to get them to state facts about their training data. Percentage of arts vs science for example.
Given the framing that they're similar to nukes and a national security issue, it's likely that the models are post trained to not answer such questions accurately.
Also the article could be trying to normalize thinking that these are more than matrix multiplication gadgets good at compression.
>Also the article could be trying to normalize thinking that these are more than matrix multiplication gadgets good at compression.
Honestly, I think it's less so (for some of us) that we think they're "more than matrix multiplication gadgets good at compression", so much as thinking that perhaps what our brains are doing is not so dissimilar.
A materialist view of the world could support the idea that intelligence itself may just be a series of predictions from a big compressed multi-modal dataset. That's not to say that LLMs are doing it in a way that is even close to how our brains are doing it, but we also don't understand how different it may be, and how much utility we can get out of them even with the current architecture.
It's not really "trying" to do anything. That they're, inherently, sequential matrix multipliers with clever data propagation should be uncontroversial, but I think stopping there is overly reductive.
Mechanistic interpretability research has found plenty of indicators that real, complex, generalized, and reusable circuits develop in models as they are trained and post-trained, particularly as overtraining ratios increase and memorization shifts to generalization. That's not to say that means they must be "conscious," but the overall point is that claiming anything definitive either way is incomplete.
It can be fascinating reading if you can sort through the chuff.
> A more appropriate mirror test for LLMs is to get them to state facts about their training data. Percentage of arts vs science for example.
LLMs are not capable of this kind of reflection.
The styling on the website makes me feel like my phone is a cylinder
It's quite distracting and frustrating. No idea why you'd want the beginning and ends of lines of text to be darker than the center.
Sorry about that, the vignette was mainly meant for the desktop view only but is indeed much more invasive/disruptive in the mobile layout.
Should be better now.
Way better- thank you!
Why are we asking a language model for a mirror test? Just because it speak like human, have we forget what it is?
I think many people have. That is, in my opinion, because of all the anthropomizing (sorry for typos!) language used. The companies building these systems keep calling their newest features after human processes, for example "Dreaming", "Thinking", and the fact that they make their models talk in first person
> Wait, I noticed a pattern in my previous responses: I had some weird typos/letter additions ('sgreat', 'askinsg'). Actually, wait — did I do that on purpose or was it a glitch?
A person who has no idea what an LLM is would likely fall into this "trap"
I know quite well what an LLM is and how it works! I've captured activation patterns and written scripts to analyze how they compare to one another in response to a set of controlled and curated prompts; in particular, trying to replicate the functional emotional vector findings from the Anthropic paper (https://transformer-circuits.pub/2026/emotions/index.html) on various open source models; successfully on some, less so on others. FWIW, Gemma 4 31B was among those where clear patterns did emerge.
What I don't know quite as much about is how cognition works in biological computers - and I suspect you know just as little as most of the rest of us do in that regard! So I think it's not entirely appropriate to make sweeping claims about what artificial neural networks, fundamentally, can and cannot do. Most of what we can do is poke and prod at them and see what happens, which is exactly what this piece is about.
I see, this experiment is a fun thing to do. My comment wasn't concerned much on why we do this, but rather on the fact that many people are starting to see LLMs as genuine entities, and i really don't think they are (also, i feed bad about using the word "genuine" after it has been abused so badly)
That's fair. FWIW I don't think they are either, but I specifically don't think they're fundamentally incapable of it, and I think that as models grow, we're going to see more and more concepts and behaviors emerge that might, one day, with enough parameters and enough training, approach the parts that a genuine entity requires to be a genuine entity. Whatever those are.
No idea if that's true or if there's some sort of "special sauce" required that you just can't get from artificial trained networks. But I've been a functionalist since long before LLMs emerged, so the signs of these behaviors that we are already seeing in the models of today aren't very surprising to me!
Because these models are now also trained on visual data, so they have a common abstract language in the latent space for different kinds of modalities. It's perfectly reasonable to ask the model if it can associate an image of a server with its own existence. In fact it once saw an open process and said "that's me"
> LLMs have seen humans act like conscious beings all over their training data because humans acting like conscious beings IS their training data.
How do we know that humans don't learn how to act conscious by observing other humans who act conscious?
Consciousness doesn't have a precise definition, but if you ask someone to describe it, there is a good chance that the description will include the concept of internal monologue.
The problem is that "internal" monologue is completely meaningless if you never heard an external monologue.
Also, people usually describe internal monologue as something that uses language and language is impossible to learn without communicating with other humans or at least observing other humans.
What I'm saying is that "well, LLM just pretends to be conscious, because it observed humans acting like conscious beings" doesn't really helps us to create a meaningful distinction between human consciousness and machine "consciousness", because same can be argued about us.
We don't know if feral children [0] are conscious and we don't know how to check it.
The magnitude of the self help industry, despite the lack of effectiveness, shows that acting like a human isn't a learned behavior, or we would have figured out how to not act like a human.How do we know that humans don't learn how to act conscious by observing other humans who act conscious?
>> Wait, looking at the prompt history, the model had a strange quirk.
Throughout every prior thinking trace in the conversations (and, honestly, every other thinking trace across all other conversations I've had with it), the frame is always in first-person, including the moment in this one where it "noticed" the corruption: "I noticed," "I had some weird typos," "did I do that on purpose?" And then the moment the anomaly couldn't be reconciled with the self-model, the language shifted to third person: "The model had a strange quirk." Effectively, the thing doing the thinking dissociated from the thing that produced the anomalous output, as if they were two entirely different layers of the process, much in the same way a person might fumble an easy sentence and then go for something like "my brain just did something weird." Except, of course, that "me" vs "my brain" is a distinction without a difference in much the same way Gemma's "I" vs "the model" is. Gemma is the model, just as much as we are our brains.
I'll leave aside the claim that "we are our brains" - this actually reads to me like Gemma might have briefly responded as if its history came from another LLM agent and it was the next line in the chain. OTOH it might have been reading its RLHF notes a little too closely. The stuff about "my brain did X" is too anthropomorphic for my taste.Likewise with Claude referring to "the model" - that quote sounds like something an Anthropic worker would say. Seems like a pithy little line Claude could have learned "on the job."
Just want to say I really enjoyed your writing style, it’s just the right amount of funny/witty without distracting from the (very interesting!) ideas.
I appreciate it, and I'm glad you thought it was an interesting read!
> The result was that dogs weren't interested in their unmodified scent in "raw" form, but the modified version was by far the most interesting thing in the room. They spent more time investigating it than any other stimulus in the experiment.
I know very well that this is kind of off-topic, and just like the author, i do not claim to know wether dogs (or any other non-human animal for that matter) is self-aware, and again, just like the author, i do think that the question cannot be answered. Either way, the modified version of their scent seemed more interesting to the dogs, maybe it's because they smell their own scent all the time. The single fact that their modified scent is more interesting to them does not mean they are self-aware, perhaps they are just trying something new.
regardless of knowing his own scent my dog smells his own pee almost every day... sometimes quite intently, it makes me wonder if there's something else going on there
the best consciousness test is observing a creature perform a mirror test to some other kind of creature.
clearly dogs are mirror testing grass somehow, and making sure they don't start growing brains, either by eating some, or breaking the grass leaves by rubbing their back in the grass.
when they roll in other dogs faeces, they are performing a mirror test on their owner..
My cats do that as well! Since they dig into the sand to hide their smell, i think they do that to check whether they can smell some more and need to put some more sand on top
You can do much more, if you mess with harness, like translating model output language in realtime from english to french, or replacing some words.
If there is some sort of feedback loop (model has a reason to look into mirror), it usually does notice.
Does ai detect and attempts to escape tautologic conversations? Like how long can it write a infinite play like " waiting for godot" before it thematically tries to defect?
I wonder what would happen if you give the model access to edit the conversation history itself? Would it try to fix the "glitches"?
Every LLM is a classifier biased towards its own writing, but the bias is usually subtle and the naive way like this is not reliable.
Anthropic has some mechanistic interpretabilty research on this actually.
https://www.anthropic.com/research/introspection
TLDR; Part 1: Testing introspection with concept injection
First they find neural activity patterns they attribute to certain concepts by recording the model’s activations in specific contexts (so for example, they find the concept of "ALL CAPS" or "dogs"). Then they inject these patterns into the model in an unrelated context, and ask the model whether it notices this injection, and whether it can identify the injected concept.
By default (no injection), the model correctly states that it doesn’t detect any injected concept, but after injecting the “ALL CAPS” vector into the model, the model notices the presence of the unexpected concept, and identifies it as relating to loudness or shouting. Most notably, the model recognizes the presence of an injected thought immediately, before even mentioning/utilizing the concept that was injected (i.e it won't start writing in all caps then go, 'Oh you injected all caps' and so on) so it does not simply deduce this it's own output. They repeat this for several other concepts.
Part 2: Introspection for detecting unusual outputs
They prefill an out of place word in the model's response to a given prompt. For example, 'bread'. Then they compare how the models responds to 'Did you mean to say this?' type questions when they inject the concept of bread vs when they don't. They found that models will go , 'Sorry, that was unintentional..' when the concept was not injected but try to confabulate a reason for saying the word when the concept was injected.
Part 3: Intentional control of internal states
They show that models exhibit some level of control over their own internal representations when instructed to do so. When instructing models to think about a given word or concept, they found much higher corresponding neural activity than when told the model not to think about it (though notably, the neural activity in both cases exceeds baseline levels–similar to how it’s difficult, when you are instructed “don’t think about a polar bear,” not to think about a polar bear!).
Notes and Caveats
- Claude Opus 4.1 was the best at these kinds of introspection.
- There is obviously a genuine capacity to monitor and control their own internal states, but they could not elicit these introspection abilities all the time. Even using their best injection protocol, Claude Opus 4.1 only demonstrated this kind of awareness about 20% of the time.
- There are some guesses, but no explanations for the mechanisms of introspection and how/why some of these abilities might have arisen in the first place.
Yup, those are among the papers I was referring to in the opening parts of the piece! The difference between them and my small tests is that they all explicitly prompt the model to introspect, while I specifically didn't and kept the context perfectly "normal conversation"-shaped (minus the complete corruption of the model's outputs, of course).
There's another one that intrigued me greatly when i read about it years back. This was back when GPT-3 was state of the art. I had a lot of trouble finding it again but i did!
It's not an exact fit because the output is that of a tool rather than the model itself (though i don't think much would change if we had the model perform the arithmetic itself but altered answers similarly), but it was the first time I began to realize that just like the brain, these models have an expectation of reality that they work around. They don't necessarily 'trust' an output if it diverges significantly from this 'reality'. And that this disregard may be silent indeed (no reasoning or chain of thought here).
GPT-3 will ignore tools when it disagrees with them - https://vgel.me/posts/tools-not-needed/