Model Collapse Is Already Happening, We Just Pretend It Isn’t


Every few months, someone announces a new AI model trained on more data than the last one, and the AI community collectively nods like we’ve solved something. More tokens, more parameters, better benchmark scores. Progress, right?

Maybe. 

There’s a question sitting in the corner of the room that most people would rather not look at directly: what happens when the data feeding these models is increasingly generated by the models themselves? The Internet used to be a messy, human, organic corpus. 

Now it’s something else entirely. Synthetic text is already woven into the fabric of the Web, and we’re training the next generation of models on top of it. The recursion has started. We’re just not talking about it honestly.

That’s about to change, though. 

The Hall of Mirrors Problem

The basic idea behind model collapse is deceptively simple. When a model trains on outputs from a previous model, it starts to lose the tails of the original distribution. The weird, rare, surprising patterns that made the data rich in the first place slowly get smoothed out. Each successive generation drifts a little further from reality, converging toward a kind of bland statistical mean.

Researchers at Oxford and Cambridge published work on this back in 2023, showing how iterative training on synthetic data leads to progressive degradation. The distributions get narrower. Diversity drops. The model becomes more confident about less, which is a quietly dangerous combination.
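The mechanism can be sketched in a few lines. What follows is a toy simulation, not the Oxford/Cambridge experiment itself: fit a Gaussian to a finite sample, generate a fresh “synthetic” sample from the fit, refit on that, and repeat. Because each finite resampling keeps losing the tails (and the maximum-likelihood fit slightly underestimates the spread), the variance decays generation after generation.

```python
import random
import statistics

def fit_and_resample(samples, n):
    """Fit a Gaussian (MLE) to samples, then draw n synthetic points from it."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # population (MLE) estimator
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
n = 100
data = [random.gauss(0.0, 1.0) for _ in range(n)]  # "human" data, std = 1.0

# Each generation trains only on the previous generation's output.
for generation in range(1000):
    data = fit_and_resample(data, n)

# The tails collapse: the spread shrinks far below the original 1.0.
print(f"std after 1000 generations: {statistics.pstdev(data):.4f}")
```

The toy ignores everything that makes real training hard, but the direction of the effect matches the paper’s point: iterating on your own samples is a contraction, not a random walk around the truth.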

And here’s the thing: we don’t need to wait for some hypothetical fifth-generation model to see this play out. It’s already in motion. Large portions of the open Web now contain text produced by LLMs, and the volume grows every month. By some estimates, more than half of newly published Web content is now AI-generated. It’s an ouroboros with an increasing appetite but a shrinking portion size.

Stack Overflow moderators flagged a surge in AI-generated answers almost immediately after ChatGPT launched. Content farms pivoted to synthetic output overnight. The training data for tomorrow’s models is already contaminated by yesterday’s. And with roughly 70% of large enterprises planning to increase their AI investment, the ouroboros will only keep growing.

What Model Collapse Actually Looks Like in Practice

People tend to imagine model collapse as some dramatic cliff where a model suddenly starts producing gibberish. The reality is more subtle and, honestly, more dangerous because of that subtlety. What you get is a slow erosion of variance. Outputs become more generic. Edges get sanded down. The model starts producing text that reads fine on a surface level but carries less information per sentence.

Think of it like making a photocopy of a photocopy. The first few generations look almost identical to the original. But by the tenth copy, the image is washed out. Everything is legible, technically; it’s just lost the detail that made it useful.

In language, that loss shows up as homogenization. The model reaches for the same sentence structures, the same hedging phrases, the same predictable cadences. If you’ve ever read a block of text and thought “something about this feels AI-generated” without being able to point to a specific error, you’ve already felt, from the outside, what an AI-native Web looks like once collapse sets in.
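Homogenization has the same flavor as the photocopy analogy. As an illustration only (not a model of any real training pipeline), imagine a “model” that simply learns the empirical token frequencies of its training corpus and samples a new corpus from them. A token that happens not to be sampled in one generation gets probability zero forever, so the long tail erodes:

```python
import random
from collections import Counter

random.seed(1)
vocab = list(range(1000))
# "Human" corpus: Zipf-like weights give a long tail of rare tokens.
weights = [1.0 / (rank + 1) for rank in range(len(vocab))]
corpus = random.choices(vocab, weights=weights, k=5000)
initial_diversity = len(set(corpus))

for generation in range(10):
    # "Train" on the previous generation's output: learn empirical
    # frequencies, then sample the next corpus from them. Any token
    # that was never sampled has zero frequency and can never reappear.
    counts = Counter(corpus)
    tokens, freqs = zip(*counts.items())
    corpus = random.choices(tokens, weights=freqs, k=5000)

print(f"distinct tokens: {initial_diversity} -> {len(set(corpus))}")
```

The corpus size never changes; only its diversity does. That is exactly the failure mode benchmarks are bad at catching, because every individual sample still looks plausible.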

Data Provenance Is the Real Infrastructure Crisis

The obvious response is: just filter the synthetic data out. Use classifiers to detect AI-generated text and exclude it from training sets. In theory, that’s fine. In practice, it’s becoming nearly impossible at scale.

Detection tools are in an arms race with the models producing the content, and the models are winning. As generation quality improves, the statistical signatures that classifiers rely on get weaker. Watermarking has been proposed, but there’s no universal standard, no adoption incentive, and plenty of ways to strip watermarks after the fact. The metadata layer of the Internet was never designed to track whether a piece of text was written by a person or a machine.

What the field actually needs is a robust data provenance infrastructure. We need to know where training data came from, how it was generated, and whether it’s been through a model already. That’s a hard systems problem, and it’s not glamorous enough to attract the attention it deserves. Everyone wants to build the next frontier model. Very few people want to build the plumbing that makes frontier models trustworthy.
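A provenance layer doesn’t have to be exotic to be useful. As a deliberately minimal sketch, with a hypothetical schema whose field names are illustrative rather than drawn from any existing standard, a training pipeline could attach a content-addressed record to every document it ingests:

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical schema -- illustrative only, not an existing standard.
@dataclass(frozen=True)
class ProvenanceRecord:
    content_hash: str               # SHA-256 of the exact text ingested
    source: str                     # where the text was collected
    origin: str                     # "human", "machine", or "unknown"
    model_id: Optional[str] = None  # generator, if machine-produced

def make_record(text: str, source: str, origin: str,
                model_id: Optional[str] = None) -> ProvenanceRecord:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ProvenanceRecord(digest, source, origin, model_id)

rec = make_record("The quick brown fox.", "example.org/post/1",
                  "machine", "some-llm-v2")
print(json.dumps(asdict(rec), indent=2))
```

Even this much would let a filtering step exclude `origin == "machine"` documents deterministically instead of guessing with a detector. The hard part, as argued above, isn’t the record format; it’s getting the records populated honestly at Web scale.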

The ‘Just Scale It’ Mindset Looks a Lot Like Denial

There’s a persistent belief in parts of the AI industry that if something goes wrong, you can fix it by making the model bigger or the dataset larger. Model collapse? Train on more data. Homogenization? Add more parameters. Distribution narrowing? Scale the compute.

That logic worked for a while, back when the bottleneck was genuinely about scale, and the available data was still overwhelmingly human-generated. We’ve entered a different regime now. The marginal gains from scaling are shrinking, and the data supply is increasingly circular. Throwing more compute at a fundamentally degraded dataset doesn’t fix the degradation. It amplifies it.

Some researchers have started calling this the “scaling trap”: the assumption that the curve will keep going up if you keep pushing the same lever. Meanwhile, the lever is connected to a system that’s quietly feeding on itself.

The industry’s focus on benchmark performance masks the underlying problem because benchmarks measure capability on narrow tasks, not distributional richness or output diversity. We’re optimizing for the wrong metrics and using those metrics to convince ourselves everything is fine.

Final Thoughts

Model collapse isn’t a theoretical risk for some distant future generation of AI systems. It’s a process already underway, driven by the quiet accumulation of synthetic data across the web. 

The fix won’t come from bigger models or longer training runs. It will come from taking data provenance seriously as an engineering discipline, from building infrastructure that can distinguish human-generated content from machine-generated content at scale, and from abandoning the comfortable fiction that scale alone solves quality problems. 

The AI community has done remarkable work on model architecture and training efficiency. Now it’s time to do equally serious work on the thing that makes all of it possible: the data. Because if the foundation is rotting, the building doesn’t care how tall it is.

Alex Williams

Alex Williams is a seasoned full-stack developer and the former owner of Hosting Data U.K. After graduating from the University of London with a Master’s Degree in IT, Alex worked as a developer, leading various projects for clients from all over the world for almost 10 years. He recently switched to being an independent IT consultant and started his technical copywriting career.


© 2026 Copyright held by the owner/author(s).