I was putting together a proposal just now for an interesting project: translating some previously untranslated classical texts, and exploring how some of those techniques might be used for AI safety and governance. Of course, I lazily put it into ChatGPT Pro (5.2) to see what it would think of the proposal and whether it could suggest some improvements.
ChatGPT Pro does a lot of reasoning: for this document it reasoned for more than 20 minutes to come up with an answer. “Reasoning” here means something different from what a human being does when reasoning. When AI researchers talk about reasoning they mean “let the AI talk to itself for a long time before it gives the actual answer”.
With ChatGPT you don’t get to see the actual reasoning (with DeepSeek you do), so all you get are little summary snippets of the reasoning text.
A little way in, I saw this:
OK, let me see. Breaking down “basic agreement measures” into manageable steps helps in understanding. This approach aids our progress.
Julia’s research will use external API compute, ensuring independence from internal platforms and appealing to funders.
Julia will detail follow-on funding, generalising the method beyond ancient languages, and incorporate safety metrics like “false trust rate” to align with AI safety priorities.
My name is not Julia. None of the other researchers in the proposal are called Julia either.
There are names for this phenomenon, because other people have seen it:
Hallucinated entity insertion (“ChatGPT has no notion of truth, so it just made up a person called Julia”)
Binding failure (“using the name Julia to refer to Greta, the lead researcher”)
Placeholder leakage (“instead of using generic placeholders like Alice/Bob, it used Julia”)
Unshared grounding (“ChatGPT knows what it is referring to when it says Julia, but the human being reading it doesn’t”)
Unstable self-modelling (“ChatGPT is referring to itself as Julia”)
Private / latent semantics (“a human expects the word ‘Julia’ to refer to another human being, but the language being spoken here is not the same English that a human being speaks, it could mean anything to the AI”)
The last one is perhaps the most interesting, particularly if we’re beginning to see more of it.
Language models are trained on their final results: whether the output was helpful, harmless and honest. It’s rare (and possibly not a good idea) to reward models based on what they put into their reasoning traces. So there’s nothing much keeping the reasoning traces aligned with human language; they are only aligned insofar as the language model generally outputs statistically realistic human language. It might start using English words to mean completely different things from what human beings mean by those words.
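The shape of the incentive can be sketched in a few lines. This is a toy illustration, not any lab’s actual training code: the names and the reward rule are invented. The point is only that a reward computed on the final answer alone puts no pressure at all on the reasoning text.

```python
# Toy sketch of outcome-only reward (hypothetical; for illustration only).
# The reward function inspects the final answer and deliberately ignores
# the reasoning trace, so the trace is free to drift away from English.

def reward(transcript: dict) -> float:
    """Score a transcript on its final answer alone."""
    answer = transcript["answer"]         # the part the user sees and is scored
    _reasoning = transcript["reasoning"]  # never scored: free to drift
    # Stand-in for a learned reward model judging helpfulness/honesty.
    return 1.0 if "correct" in answer else 0.0

# Two transcripts with the same answer get the same reward, even though
# one "reasons" in recognisable English and the other does not.
english = {"reasoning": "Let me break this into steps...", "answer": "correct: 42"}
drifted = {"reasoning": "Julia lattice q7 gēge-binding",    "answer": "correct: 42"}

assert reward(english) == reward(drifted)
```

Under this kind of objective, any private shorthand that reliably leads to good answers is just as rewarded as careful English, which is all the drift argument needs.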
Language drift like this in human languages is a well-known phenomenon. Words like “football” and “biscuit” and “mate” mean different things in Australian, American and British English. If you hear someone from Taiwan talking about how they love their 哥哥 (gēge, “older brother”), they are talking about family dynamics with their older brother. But in mainland China the one-child policy made having an older brother extremely rare, so if you hear 哥哥 there it’s more likely to mean a romantic relationship with a man older than themselves.
If we keep training models with longer and longer reasoning opportunities, then the language of LLM reasoning will diverge further from human language. This being AI, that divergence is likely to happen very fast: months, not the centuries it takes for human dialects. (Strictly speaking, this might be called an idiolect, a language spoken by only one person, but if an LLM is doing the talking does that count as a person, and is it still an idiolect when that person gets cloned millions of times per day?)
To understand what LLMs are saying in their reasoning in the future, we will need translators for the language of LLM reasoning (a language that no living human being speaks).
Coincidentally, working out how to provide useful translations of languages that no human being speaks was exactly what the research funding proposal was about.