LLM Hallucination Seems Like a Very Big Problem, Not a Mere Speedbump


Hey guys, forgive me for filling your inbox a lot this week - this is just a stub, sharing some of my ongoing experiments with inducing hallucinations in LLMs. Feel free to skip. But if you’re interested, look at these two interactions I had with cutting-edge LLMs, OpenAI’s GPT-5 and Google’s Gemini 2.5 Flash. I’ve found that I can easily get them to hallucinate by asking for quotes or citations related to highly specific questions. Rather than report back that they haven’t found anything, they will simply hallucinate nonexistent sources; when the hallucination is pointed out, they’ll apologize, insist that the next source or quote they give me is verified and real, and hallucinate again. It’s funny, but also disturbing, because our economy currently relies on the AI bubble to avoid falling into a brutal recession.

GPT-5 source hallucinations

I find the particular insistence that next time, the source will be real to be a good example of why this stuff is so potentially destructive. It’s not merely that these systems hallucinate, it’s that they radically overstate how trustworthy their outputs are to a public so bathed in AI hype that many can’t help but naively assume the computer is right about everything. OpenAI says that GPT-5 cuts down on hallucination problems, but a) I don’t trust Taco Bell when they say that the new quesarito is cheesarific and b) I don’t think synthetic benchmarks can tell us how common LLM hallucinations are; only the steady accumulation of millions of real prompts and responses can. And if you care to look, you can find dozens and dozens of embarrassing examples of GPT-5 hallucinations online already, even though it was just made publicly available last week.

Here’s Gemini.

2.5 Flash source hallucinations

You’ll note that there’s a weird combination of confession and pushback here - it really seems to want to insist that the source it gave is real, and sends back a slightly-modified but still hallucinated source on the second try. Then it comes up with a new, also-hallucinated source before finally giving up when the latest hallucination is pointed out to it. If you’d like a chuckle, check out where the link in the original hallucinated source actually goes.

People tend to act very blasé about this sort of thing and say that any LLM user should be skeptical about everything, verify everything, etc. But it’s simply the case that a ton of normies out there take everything an LLM tells them at face value; go do a search on Twitter or Reddit and you’ll find that to be true. And I think these repeated assurances that these sources are real and verified, when they in fact are hallucinations, should disturb you for that reason. My readers are savvy enough to know that an LLM saying that a quote comes from a “real, verifiable critique” or “I can confirm with a very high degree of certainty that this is a real and verifiable source” doesn’t mean anything. But ChatGPT has nearly a billion users. How many of them, do you think, have a healthy level of skepticism for everything these systems are saying?

Whenever I point out this sort of thing, LLM defenders pop up and say things like “You just have to verify everything! You just have to write your prompts very carefully!” To which I would say, that makes this whole endeavor vastly less useful and valuable, doesn’t it? If you have to have human verification for everything they do, you’re eliminating a vast portion of their comparative advantage; the whole point is to eliminate the human effort! And similarly, if you have to be some sort of prompt wizard to get reliable outputs from these systems, they become far, far less useful. Most people are not and will never be skilled at writing AI prompts. The whole idea was that these systems used natural language and could adapt to meet the user! Specialty tools for a small cadre of trained professionals are just a vastly different case than the promise of artificial intelligence that knows what the user wants better than the user does - socially, scientifically, communicatively, and especially financially.

For the record, this sort of thing demonstrates why human-like intelligence should still be a central goal. In this podcast episode, Kevin Roose and Casey Newton dismiss people who point out that LLMs do not reason or think in anything like the conventional understanding of reasoning or thinking. (They do a lot of dismissing in defense of LLMs, those dudes.) But this sort of thing is a perfect example of why conventional definitions of cognition matter. A lot of people find this sort of repeated error hard to grasp: if ChatGPT produced the hallucinated source, why is it also capable of immediately telling that the source is fake once I prompt it? Why did the system that could tell if a source is fake give me a fake source in the first place? Why are these systems “smarter” on the second prompt than on the first, smart enough to identify their own previous answer as fake?

Well, the answer is that LLMs do not think. They do not reason. They are not conscious. There is no being there to notice this problem. LLMs are fundamentally extremely sophisticated next-token prediction engines. They use their immense datasets and their billions of parameters to do one thing: generate outputs that are statistically likely to be perceived as satisfying the input provided by the user. And so ChatGPT and Gemini went looking for something, found that it was inaccessible to them, and invented responses that look like satisfactory responses, even though they weren’t. They could identify that these sources were hallucinated because I prompted them to. But there is no thinking mind that could notice on its own that a source they named wasn’t real. There is no thinking mind in an LLM at all. Ask the LLMs and they’ll tell you! One of the few things I like about them is that they tend to be upfront about the fact that they’re Chinese rooms, just algorithms generating strings that seem likely to be what the user wants to see, mining existing text for patterns and associations and then building response strings that resemble those patterns. That’s what they were designed to be and do. But now, we’re told, they’re more important than electricity and fire, more important than the Industrial Revolution.
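
If it helps to see what “prediction engine” means in practice, here’s a minimal sketch in Python. The “model” is a hand-written table of made-up probabilities over made-up tokens (the titles and publications are invented for illustration, not real sources, and this isn’t any actual model’s API); the point is that the generation loop just samples whatever is statistically likely to come next, and nothing in it ever checks whether the resulting citation exists.

```python
import random

# Toy stand-in for an LLM: for each context token, a hand-made probability
# distribution over plausible next tokens. A real model computes these numbers
# from billions of parameters, but the generation loop is the same basic idea,
# and nowhere in it is there a step that checks whether the output is true.
toy_model = {
    "<start>":       {'"Against': 0.6, '"On': 0.4},
    '"Against':      {'Forgetting,"': 1.0},
    '"On':           {'Memory,"': 1.0},
    'Forgetting,"':  {"Harper's,": 0.7, "n+1,": 0.3},
    'Memory,"':      {"Harper's,": 0.5, "n+1,": 0.5},
    "Harper's,":     {"2014.": 1.0},
    "n+1,":          {"2016.": 1.0},
}

def generate(model, max_tokens=6):
    context, output = "<start>", []
    for _ in range(max_tokens):
        probs = model.get(context)
        if not probs:
            break
        # Sample the next token in proportion to its probability -- that's the whole trick.
        context = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
        output.append(context)
    return " ".join(output)

# Produces something like: "Against Forgetting," Harper's, 2014.
# It looks like a citation because citations look like this -- no lookup, no verification.
print(generate(toy_model))
```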

Until and unless boosters like Newton and Roose and the rest of the media really grapple with these profound limitations, we’re at the mercy of a really dangerous hype cycle and stock market bubble. And no amount of handwaving away hallucinations or simply asserting that more compute will fix them is going to solve this vexing issue. The stakes are high. People won’t stop saying that ChatGPT is imminently going to replace your doctor. Well: do you want systems that are this confidently wrong to be prescribing medicine for your kids?