AI content flood: why the web's signal is dying | Psyll

Deep dive into Jarosław Szulc’s 'Epistemic Heat Death' theory on AI content, web signal-to-noise collapse, and future of online information.

Open a search engine and ask it something moderately specific. There's a good chance the first five results are competent, well-formatted, plausible-sounding - and say almost nothing you didn't already know from the question itself. That feeling isn't your imagination, and it isn't just "SEO got worse." A working paper argues it's the visible symptom of something structural: the web is filling up with content faster than new information is being created, and the metrics we've used for two decades to measure its value have quietly stopped measuring anything real.

The paper is called Epistemic Heat Death and the Signal-to-Noise Ratio of the Global Web, by Jarosław Szulc, and it's dense - sixteen sections, nine appendices, a differential equation model, and enough Shannon entropy to make an information theorist feel at home. This is the accessible version: what the theory actually says, where its evidence holds up, where it wobbles, and why it names something real about the moment we're in.

The core idea in one paragraph

As AI-generated text, images, and video come to dominate what's newly published online, the web can look more abundant than ever - more pages, more articles, more apparent variety - while its actual information content quietly collapses. Not "quality declines," which is vague and hard to argue with or against. Something more specific: the number of genuinely distinct ideas, grounded claims, and accountable human authors stops growing while the volume of content keeps exploding. Szulc calls this failure mode epistemic heat death, borrowing the term from thermodynamics on purpose - and, as we'll get to, with an important asterisk he adds himself.

If that's true, it means the tools we've relied on to navigate the web - pageviews, engagement, link authority, "what's trending" - aren't just imperfect anymore. They're measuring the wrong thing entirely, because they were built for a world where the main problem was finding relevant human content among other human content. They were never designed to distinguish human content from synthetic content, because until recently that distinction barely mattered.

Raw synthetic content recently plateaued in crawl data. Are distribution channels quietly filtering for humanity, or is this just a breath before the deluge?

The numbers behind the alarm

It's worth being honest up front about how contested the underlying numbers actually are, because the paper leans on them and so does most of the commentary around it.

The "90 percent of content will be AI-generated by 2026" figure gets quoted constantly, usually traced back to a 2022 Europol threat report warning that synthetic media could reach that share of online content. It's been repeated so often it's basically calcified into fact. But it was a projection made in 2022, before ChatGPT had even launched - a forecast, not a measurement, and nobody seems able to point to a primary methodology behind it.

Actual measurements tell a messier story. An Ahrefs study of 900,000 pages found that roughly three-quarters showed some AI involvement, but only a small fraction - about 2.5 percent - were "pure AI" with no human editing at all. Most of what's flooding the web isn't robots writing alone; it's human-AI blends, which is a different and harder-to-quantify problem than the "90 percent synthetic" headline suggests. Separately, one content-intelligence vendor estimated roughly 312 million AI-assisted pages are now published monthly, up from about 82 million two years earlier - a real and fast increase, whatever the exact ratio of human to machine involvement inside each page.
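
To put those figures on a common scale, here is a small arithmetic sketch. The page counts and percentages are the ones quoted above; the annualized rate is derived here and assumes smooth compound growth over exactly two years, which the vendor estimate does not itself claim.

```python
# Figures quoted in the text above. The annualized rate is derived arithmetic
# and assumes smooth compound growth over exactly two years.
pages_then = 82_000_000    # AI-assisted pages per month, two years earlier
pages_now = 312_000_000    # AI-assisted pages per month, latest estimate
years = 2

growth_factor = pages_now / pages_then
annual_rate = growth_factor ** (1 / years) - 1

sample_size = 900_000      # pages in the Ahrefs sample
pure_ai_share = 0.025      # "about 2.5 percent" with no human editing
pure_ai_pages = round(sample_size * pure_ai_share)

print(f"growth over two years: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.0%}")
print(f"pure-AI pages in sample: {pure_ai_pages:,}")
```

That works out to roughly a 3.8x increase, or about 95 percent a year if the growth were even, against only about 22,500 "pure AI" pages in a 900,000-page sample. Volume is clearly climbing fast, while the machine-only slice stays small.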

Then there's the plateau data, which the paper itself treats as the single biggest threat to its own thesis: an analysis of 65,000 URLs pulled from Common Crawl found AI-generated articles briefly overtook human-written ones around November 2024, and the two shares have stayed roughly level since, rather than AI's share continuing to climb toward that Europol-era 90 percent.

There's a wrinkle inside even the plateau data that cuts in the paper's favor, though. The same research found that AI-generated content is starting to get systematically pushed out of the places people actually look. In Google's organic search results, the overwhelming majority of top-ranking pages - somewhere in the high eighties, percentage-wise - are still human-authored, and the same pattern holds for what chatbots like ChatGPT and Perplexity choose to cite. Search and AI assistants both appear to be quietly filtering toward human authorship, whether by design or as an emergent side effect of ranking on quality signals.

That's actually consistent with Szulc's framework, not a refutation of it: the raw share of synthetic content on the open web plateauing doesn't mean the distribution channels people actually rely on aren't still sorting for something scarcer and more trustworthy underneath. It's just evidence that the sorting might already be happening in ways the paper didn't fully anticipate.

The internet isn't dying from a lack of content. It's quietly suffocating from an overabundance of perfectly formatted noise.

Why the thermodynamics metaphor, and why it's not quite literal

"Heat death" is a real physics term: the theoretical endpoint of the universe where everything reaches the same temperature and nothing interesting can happen anymore, because there's no more usable energy gradient to do work with. Szulc borrows the shape of that idea - a system trending toward a state that looks uniform and full but has lost the capacity to do anything useful - and applies it to information instead of energy.

But he's careful to flag where the metaphor breaks, and this is worth sitting with, because it's the difference between a paper that's rigorous about its own analogy and one that's just borrowing physics vocabulary for drama. True thermodynamic heat death is maximum entropy - total, genuine randomness. What Szulc describes is closer to the opposite mechanism producing a similar-looking outcome: the web isn't becoming more random, it's becoming more repetitive while looking more voluminous. He calls this a "false maximum entropy" - many pages, many tokens, but collapsing true diversity underneath.

It's less "the universe evenly cooling" and more a photocopier feeding its own output back into itself, forever, at increasing speed. That distinction matters for the rest of the paper, because it's also where the math comes from.

We built metrics to find human content among human content. They are entirely blind to a system where volume expands as actual diversity collapses.

Where the math comes from: Shannon, not vibes

The paper's foundation is Claude Shannon's 1948 information theory - the same math that underlies basically all of modern computing and compression. Shannon's key insight, stripped of notation: information is a function of surprise. A message that tells you something you couldn't have predicted carries information. A message that just restates what you already knew, in different words, carries approximately none.
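
For readers who want the notation back, "surprise" has a standard definition: an event with probability p carries log2(1/p) bits, and a source's entropy is the average of that quantity. The sketch below is a minimal illustration of that textbook definition; the function names are chosen here for readability and are not taken from the paper.

```python
import math

def surprisal_bits(probability: float) -> float:
    """Shannon self-information of an event with the given probability, in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)

def entropy_bits(distribution: list[float]) -> float:
    """Expected surprisal of a source; zero-probability outcomes contribute nothing."""
    return sum(p * surprisal_bits(p) for p in distribution if p > 0)

print(surprisal_bits(1.0))        # a fully predictable message
print(surprisal_bits(0.5))        # a fair coin flip
print(surprisal_bits(1 / 1024))   # a one-in-1024 event
print(entropy_bits([0.5, 0.5]))   # average surprise of a fair coin
```

A message you could predict with certainty scores zero bits, a fair coin flip scores one, and a one-in-1024 event scores ten. That is the sense in which restating what the reader already knew carries approximately nothing.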

Here's the sharp version of that idea, and it's the single most important line in the whole paper: a language model trained on human text, and then asked to generate more human-sounding text, is - in the limit of perfect training - a zero-information source relative to that training distribution.

Every token it produces was, in expectation, already implied by the corpus it learned from. It can rearrange, recombine, and restate. What it cannot do, by this logic, is add anything the corpus didn't already contain.

This is why the flood of AI content isn't just "more content, some of it worse." Structurally, a huge share of it is closer to redundant - not wrong, not even necessarily badly written, but adding close to zero new information to what's already out there. Multiply that by the entire indexed web, and you get a system that's expanding in volume while stagnating (or shrinking) in actual content.
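
A toy calculation makes "more volume, no more content" concrete. The sketch below is not the paper's entropy taxonomy. It assumes, purely for illustration, that each page expresses exactly one idea, represented by a made-up label, and measures the entropy of how pages are spread across ideas before and after a flood of restatements.

```python
import math
from collections import Counter

def idea_entropy_bits(pages: list[str]) -> float:
    """Shannon entropy of the distribution of ideas across pages, in bits."""
    counts = Counter(pages)
    total = len(pages)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Hypothetical corpus: each string stands for the single idea a page expresses.
human_pages = ["idea_a", "idea_b", "idea_c", "idea_d"]
# Hypothetical flood: 96 restatements, all of ideas already in the corpus.
synthetic_pages = ["idea_a"] * 60 + ["idea_b"] * 36

combined = human_pages + synthetic_pages
before = idea_entropy_bits(human_pages)
after = idea_entropy_bits(combined)

print(f"pages: {len(human_pages)} -> {len(combined)}")
print(f"distinct ideas: {len(set(human_pages))} -> {len(set(combined))}")
print(f"idea entropy (bits): {before:.2f} -> {after:.2f}")
```

Under these made-up numbers the page count rises twenty-five-fold, the number of distinct ideas stays at four, and the entropy of the idea distribution falls, because the new pages pile onto ideas that were already there.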

Under Claude Shannon's laws, information is surprise. An AI perfectly generating human text from a human corpus is a zero-information source.

The market for lemons, but for information

The second pillar borrows from economics rather than physics: George Akerlof's 1970 "market for lemons" paper, one of the most influential ideas in economics, about what happens to a market when buyers can't tell good products from bad ones before they buy.

Akerlof's classic example is used cars. If a buyer can't distinguish a good used car from a lemon, they'll only pay the average expected price - lower than what a genuinely good car is worth. Rational sellers of good cars, unwilling to sell at a lemon's price, exit the market. What's left skews toward lemons. Quality collapses not because anyone wanted it to, but because the market mechanically punishes the sellers of quality goods for existing in an environment where quality can't be verified.
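
The unraveling can be traced step by step. The sketch below is a deliberately simplified version of the mechanism, assuming buyers offer exactly the average quality of cars still for sale and each seller values a car at exactly its quality. Akerlof's original setup lets buyers value cars more than sellers do, so real collapse can be partial. The ten-car market is hypothetical.

```python
def unravel(qualities: list[float], max_rounds: int = 50) -> list[tuple[float, int]]:
    """Iterate a simplified Akerlof-style adverse selection process.

    Each round, buyers offer the average quality of cars still for sale, and
    every seller whose car is worth more than that offer withdraws.
    Returns one (offer, cars_remaining) pair per round.
    """
    market = sorted(qualities)
    history = []
    for _ in range(max_rounds):
        price = sum(market) / len(market)
        remaining = [q for q in market if q <= price]
        history.append((price, len(remaining)))
        if len(remaining) == len(market):
            break  # no seller wants to leave at this offer: equilibrium
        market = remaining
    return history

# Hypothetical market: ten cars worth 1,000 to 10,000 dollars.
cars = [1000.0 * k for k in range(1, 11)]
for round_number, (price, remaining) in enumerate(unravel(cars), start=1):
    print(f"round {round_number}: offer {price:,.0f}, cars left {remaining}")
```

In this toy market the offer starts at 5,500 with ten cars for sale and ends at 1,000 with a single car left. Nobody chose that outcome; each round's withdrawals simply lower the average that sets the next round's offer.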

Szulc's argument is that information is the textbook case of what economists call a credence good - a product whose quality you often can't verify even after you've consumed it. You can't always tell, just by reading an article, whether it's accurate; you'd need to already know the answer to check. And synthetic content is, by construction, indistinguishable from human content at a glance. So the same mechanism kicks in: as fake-or-synthetic content floods a channel where quality can't be easily verified, the expected value of any given piece of content drops, rational readers reduce their trust across the board, and the whole system drifts toward a low-trust equilibrium - even though no single actor "decided" to make this happen.

This is the theoretical backbone for the paper's most quotable claim: verified, accountable, human-authored content should become more valuable, not less, as the flood rises around it - because scarcity of something that can't be faked is exactly the kind of scarcity markets pay a premium for.

Information is a credence good. When readers cannot verify quality before consuming, the entire ecosystem drifts toward a low-trust equilibrium.

The three laws (and what they actually say without the notation)

The paper formalizes three "laws" governing this dynamic. Stripped of their Greek letters, here's what each one claims:

  • Law I - Scarcity drives value up. As the web's genuine informational diversity shrinks, the market value of verified human-authored content rises - and it rises faster than linearly, because each additional trustworthy, accountable source doesn't just add its own value, it makes the whole network of trustworthy sources more valuable by association (the same "network effect" logic that makes a phone more valuable the more other people also have phones).
  • Law II - Traffic doesn't equal signal. A page's actual information value has nothing to do with how many people viewed it. A page with ten million views and zero genuine information content is, in this framework, actively harmful to read - it costs you attention and gives back nothing. A direct shot at the entire pageview/engagement economy that's underwritten the web for two decades.
  • Law III - Verification gets more valuable over time, not less. As synthetic saturation increases, the premium commanded by cryptographically provable, accountable human authorship compounds - each year makes the gap wider, not narrower, until either verification becomes universal or the web splits permanently into a "verified" tier and an "everything else" tier.

None of these are wild claims in isolation. What's more interesting is what the paper does next: it doesn't stop at "these seem true," it tries to prove them, and it converts them into a model you can actually run and watch play out over time.

Traffic no longer equals value. A highly viewed page with zero genuine information actively harms the reader. Attention is spent; nothing is gained.

The part most papers like this skip: an actual, runnable model

Here's where the paper distinguishes itself from a thousand other "the internet is dying" essays: it doesn't just assert a trend, it builds a small mathematical model - a system of three coupled differential equations - and simulates what happens to signal quality over time under different assumptions.

Without the equations: the model tracks three things moving together over time - how much of new content is synthetic, how much true diversity the web actually contains, and how much apparent diversity it contains (the gap between these two is the whole point). From these three, it derives an "SNR proxy" - a rough stand-in for signal-to-noise ratio - and watches how it evolves.

Run under the paper's baseline assumptions, the trajectory isn't gentle. The signal-to-noise proxy doesn't decline in a straight line - it accelerates, the decline getting steeper over time rather than leveling off. That holds up across every parameter combination the paper tested - dozens of scenarios, from optimistic to pessimistic - with none of them producing stabilization or reversal on their own.

Two findings from this model are worth remembering even if you forget everything else:

  1. What matters most isn't how much synthetic content gets produced. It's how fast "collapse" happens once it's in the mix. In the model's own terms, the collapse rate matters more than the raw production rate. Practically, that's an argument for a specific kind of intervention: efforts that target how models get contaminated by their own output (careful curation of training data, deliberately injecting fresh human-verified material) have more leverage than efforts that just try to slow the raw volume of AI content being published, which is probably a losing battle anyway.
  2. Timing matters more than force. The model tested a single large intervention - an injection of fresh, provenance-verified content - dropped in at different points in the trajectory. A big intervention applied partway through slows the decline but doesn't reverse it. The same-sized intervention applied earlier does much better. The mechanical lesson: waiting and then acting hard is worse than acting early, even at smaller scale.

The paper is careful - genuinely careful, not just hedging for cover - to call this an "illustrative toy model," not a forecast. The parameters aren't fit to real-world data; they're chosen to be plausible and then explored across a wide range to see what's structural (true across the whole range) versus what's an artifact of one specific choice. That distinction is exactly right, and it's also exactly the point the paper's own internal critique section leans on hardest - more on that shortly.
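
To give a feel for what a model of this shape looks like, here is a minimal sketch. It is emphatically not the paper's system: the three equations, every parameter name, and every value below are assumptions invented for illustration, and they share only the three tracked quantities and the idea of an SNR proxy with the description above. It integrates the system with simple Euler steps and prints the proxy (true diversity divided by apparent diversity) at a few points.

```python
def simulate(growth=0.8, human_rate=0.02, collapse=0.5, production=0.5,
             years=20.0, dt=0.01):
    """Euler-integrate a hypothetical three-variable system (not the paper's).

    s            : synthetic share of newly published content (logistic growth)
    true_div     : true diversity, fed by human output and eroded by collapse
    apparent_div : apparent diversity, fed by human and synthetic output alike
    Returns a list of (time, s, snr_proxy) samples.
    """
    s, true_div, apparent_div = 0.05, 1.0, 1.0
    samples = []
    steps = int(round(years / dt))
    for step in range(steps + 1):
        samples.append((step * dt, s, true_div / apparent_div))
        ds = growth * s * (1.0 - s)
        d_true = human_rate * (1.0 - s) - collapse * s * true_div
        d_apparent = human_rate * (1.0 - s) + production * s
        s += dt * ds
        true_div += dt * d_true
        apparent_div += dt * d_apparent
    return samples

trajectory = simulate()
for t, s, snr in trajectory[::400]:
    print(f"year {t:4.1f}  synthetic share {s:.2f}  SNR proxy {snr:.3f}")
```

With these made-up defaults the proxy falls steadily as apparent diversity keeps growing and true diversity erodes. As with the paper's own model, that shows what the assumptions imply, not what the web is doing. Changing `collapse` or `production` is a quick way to see how much of the shape depends on a single parameter choice.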

It's not just text

A section of the paper extends the whole argument to images, audio, and video, under the term synthetic reality collapse - the moment at which a majority of a society's visual and audiovisual record becomes machine-generated rather than captured.

The interesting wrinkle here isn't the extension itself, it's a specific technical failure mode that doesn't exist for text: provenance metadata gets stripped by completely ordinary, non-malicious platform behavior. When you upload a photo to most social platforms, the platform typically re-compresses it - and standard compression, done for entirely mundane bandwidth reasons, can wipe out the cryptographic "this was captured by a real camera, here's the chain of custody" metadata that the C2PA standard is built to carry. You don't need a deepfake attack to lose the provenance trail; normal image hosting can do it for free. WhatsApp, iMessage, and Facebook all re-encode images on upload as a matter of routine, silently stripping any embedded credentials in the process.
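
The failure mode is easy to reproduce in miniature. The sketch below is a toy model, not the C2PA format or any real library's API: it assumes a made-up `ImageFile` type, uses an HMAC with a placeholder key where real Content Credentials use certificate-based signatures, and models re-encoding as "keep the pixels, drop the metadata" (real transcoding also changes the pixel data).

```python
import hashlib
import hmac
from dataclasses import dataclass, field

# Placeholder key for illustration; real C2PA signing uses X.509 certificates.
SIGNING_KEY = b"hypothetical-camera-key"

@dataclass
class ImageFile:
    pixels: bytes
    metadata: dict = field(default_factory=dict)

def sign_at_capture(pixels: bytes) -> ImageFile:
    """Attach a toy provenance tag (an HMAC over the pixels) as metadata."""
    tag = hmac.new(SIGNING_KEY, pixels, hashlib.sha256).hexdigest()
    return ImageFile(pixels=pixels, metadata={"provenance_tag": tag})

def platform_reencode(image: ImageFile) -> ImageFile:
    """Model a routine upload pipeline that keeps pixel data and drops metadata."""
    return ImageFile(pixels=image.pixels)

def verify(image: ImageFile) -> str:
    """Report whether the toy provenance tag is present and matches the pixels."""
    tag = image.metadata.get("provenance_tag")
    if tag is None:
        return "no provenance"
    expected = hmac.new(SIGNING_KEY, image.pixels, hashlib.sha256).hexdigest()
    return "verified" if hmac.compare_digest(tag, expected) else "tampered"

original = sign_at_capture(b"\x10\x20\x30\x40")
uploaded = platform_reencode(original)
print(verify(original))   # verified
print(verify(uploaded))   # no provenance
```

Nothing in the upload step is hostile, and nothing was forged. The credential simply lived in a part of the file the pipeline had no reason to preserve.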

This isn't a hypothetical the paper made up to sound rigorous. It's the exact fight the provenance industry is having in public right now. The Coalition for Content Provenance and Authenticity, founded in 2021 by Adobe, Arm, the BBC, Intel, and Microsoft, has since added Google, OpenAI, Meta, Sony, and Truepic to its steering committee, and now counts thousands of member organizations. The push has accelerated further in 2026: in May, OpenAI committed to a "dual-layer" provenance approach, embedding Google DeepMind's SynthID watermark alongside its existing C2PA metadata, and previewed a public tool that lets anyone check an image for both signals at once. The same month, at Google I/O, Google announced that C2PA verification and SynthID detection are being built directly into Search and Chrome. Samsung and Google are shipping phones - the Galaxy S25 and Pixel 10 - that sign photos with C2PA credentials the moment they're captured, and LinkedIn and TikTok both display a Content Credentials icon on supported media.

But the industry's own analysts are blunt about the gap between signing and surviving. As one 2026 industry review of the C2PA rollout put it, hardware adoption is real, but the chain still breaks the moment platforms strip metadata during upload and transcoding - meaning a photo can be cryptographically signed at the camera and still arrive at a viewer's screen with no provenance information attached at all, because the platform in between re-encoded it. The Content Authenticity Initiative's own answer to this is something called Durable Content Credentials - combining metadata with watermarking and fingerprinting so the provenance signal survives even when the metadata doesn't - which is essentially the same "survive re-encoding" answer the paper points to, and the same logic behind pairing C2PA with SynthID in the first place.

There's an even blunter problem sitting underneath the technical one: engagement. Industry analysis of the C2PA rollout has repeatedly noted that even where the little Content Credentials badge is displayed correctly, almost nobody clicks on it. The infrastructure can work exactly as designed and still fail, because public apathy and learned scepticism may be a bigger hurdle to adoption than any remaining technical challenge. Building a verification layer is only half the problem. Getting anyone to actually check it is the other half, and right now that half is losing.

The paper's response to the stripping problem specifically - instead of treating it as fatal - is to point at watermarking techniques embedded directly in the pixels or audio waveform itself, rather than in metadata that compression can strip, which are specifically designed to survive exactly this kind of re-encoding. It's a real limitation honestly reported, with a real (if partial) technical answer, rather than either ignored or treated as a fatal blow to the whole framework.

Cryptographic proof of reality is fragile. Normal social platforms strip away provenance metadata entirely by accident, just to save on mundane bandwidth costs.

The part that's easy to miss: this isn't distributionally neutral

Most of what's discussed above treats "the web" as one undifferentiated blob experiencing one uniform trend. A later section of the paper pushes back on its own earlier framing and makes an argument that's, honestly, sharper and more important than the physics-flavored laws that get top billing.

Verification costs something - money, time, technical literacy, institutional access. Which means the premium described in Law III isn't free money falling from the sky; it's captured disproportionately by people who already have the reputational capital and professional networks to participate in attestation systems. Domain experts and institutionally embedded elites are positioned to benefit. Everyone else - and especially populations without strong institutional access - faces a widening gap between the information available to people who can afford verification and the information available to people who can't.

This pattern isn't hypothetical or abstract. The paper points out that synthetic-content detection tooling and provenance infrastructure are being built predominantly for high-resource languages, by institutions concentrated in wealthy economies. It also isn't free in the literal sense: a C2PA-compliant signing certificate from a commercial certificate authority currently runs somewhere around $289 a year, which is a trivial cost for a newsroom and a real barrier for an independent journalist working alone in a low-resource setting. Unlike the web's TLS/HTTPS ecosystem, where Let's Encrypt made secure certificates free for everyone, there's no equivalent free tier for C2PA. A Global South information ecosystem plausibly gets hit by the flood while being under-resourced for the proposed fix: exposed to the problem and under-equipped for the solution at the same time. The paper flags this explicitly as an unsolved gap in its own policy recommendations, not something it claims to have an answer for.

There's a sharper version still: researchers, journalists, and domain experts operating under repressive governments may have exactly the kind of accountable, verifiable expertise that should command the highest premium under this framework - while being precisely the people for whom registering a verifiable real identity with any authority, however well-designed, is personally dangerous. The people with the strongest epistemic claim to being verified aren't uniformly the people who can safely seek verification. That's a genuinely hard problem, and the paper doesn't pretend to fully solve it. It names it and moves on to partial technical mitigations - things like zero-knowledge proofs and pseudonymous attestation - while admitting none of them fully close the gap.

The premium for verified truth will disproportionately reward well-funded institutions, leaving marginalized voices and independent experts lost in the noise.

The most cynical - and most important - idea in the paper

Buried in the same section is an argument worth pulling out on its own, because it reframes something you've probably encountered without a name for it.

The usual "misinformation" framing assumes bad actors want you to believe something specific and false. Szulc's argument is stronger and, frankly, more unsettling: a state or corporate actor doesn't need you to believe anything in particular. Flooding a channel with synthetic content is itself the attack - independent of whether any individual piece of that content is true or false - because the flooding raises the cost of distinguishing any claim from noise. You don't need to convince someone of a specific lie. You just need to make it expensive enough to tell truth from noise that people stop trying, or start trusting nothing, or, worse, start trusting only whatever confirms what they already believed, because that's the cheapest available heuristic once genuine verification becomes too costly to bother with.

This connects to a psychological mechanism the paper spends a section on: the fluency heuristic - a well-documented finding in cognitive psychology that people judge information as more credible simply because it's easier to process, independent of whether it's actually true. Researchers have shown this effect across formats: text set in a cleaner, more legible font gets rated as more truthful than the identical text in a messier one, and easier-to-understand speech gets rated as more truthful than harder-to-parse speech. It shows up even more powerfully as the "illusory truth effect," where simple repetition of a claim makes it feel truer over time, regardless of whether it was ever accurate - an effect so robust that researchers have found it persists even for claims explicitly labeled false, and even when people are told the source is unreliable. People lean on fluency hardest precisely when they don't have the background knowledge to evaluate a claim on its merits, which is exactly the situation most readers are in for most topics, most of the time.

Synthetic content is, almost by definition, optimized for fluency. Careful human expert writing about a genuinely uncertain topic is often more halting, qualified, and hedged - which, perversely, can make it read as less confident and therefore less trustworthy to a reader relying on fluency as a shortcut. The flood doesn't just add noise. It systematically rewards exactly the reading habits most likely to entrench bad beliefs rather than correct them.

That's not a claim about anyone in particular being fooled. It's a claim about what kind of environment we've built, and what kind of thinking that environment quietly makes easier.

Bad actors don't need to convince you of a specific lie. They just need to make the cost of finding the truth so exhausting that you stop trying.

Now, the honest part: where this doesn't fully hold together

A paper this ambitious deserves to be read seriously, which means reading its weak points as carefully as its strong ones - and, credit where it's due, the paper itself does exactly this in a section explicitly devoted to objections. That's rare, and it's worth taking seriously rather than skimming past.

The proofs are looser than they look. The three laws come dressed in formal notation - lemmas, theorems, proof sketches - but at a few load-bearing moments, the "proof" is closer to a well-argued paragraph wearing a math costume. One key step in the central proof, for instance, asserts that a relationship is "superlinear because each marginal verified source becomes not just scarce but increasingly structurally essential" - which is a restatement of the conclusion in more technical language, not a derivation of it from prior steps. The equations aren't decoration, and the underlying logic is genuinely reasoned through, but a reader shouldn't mistake formal notation for formal proof. This is math-flavored argument, not a closed mathematical demonstration.

The model illustrates the theory; it doesn't test it. This is the paper's own most serious self-criticism, raised directly in its objections section, and it deserves to be taken exactly as seriously as the paper takes it. A system of differential equations with several free parameters can be tuned to produce almost any trajectory you want. Showing that the model produces a decline when its parameters are chosen to plausibly produce a decline isn't independent evidence - it's closer to restating the assumption in graph form.

The paper's defense is genuinely good, not just defensive: it points out that the decline is structural across the entire tested parameter grid, not cherry-picked from one favorable run, and that one result - value rising before it falls, rather than declining smoothly from the start - wasn't something the model was built to produce, it just fell out of the math. Both of those are real, meaningful responses to the objection. But they don't fully resolve it, and the paper is honest enough to say so itself: the model has never been checked against real-world data, because the real-world time series it would need to check against doesn't exist yet. Every chart in this paper is a picture of the theory, not a picture of measurement. That's worth holding in your head while looking at them: this is illustration, not confirmation.

The empirical foundation has a crack running right through it, and the paper knows it. Here's the part that matters most for anyone deciding how much to trust this framework: the whole model depends on synthetic content's share of the web continuing to rise. One detector-based study, using a large crawl of newly indexed pages, found nearly three-quarters contained some AI involvement. Another estimate put newly published AI-originated material at roughly two-thirds of everything new online. Those numbers alone would seem to support the paper's trajectory.

But the separate, methodologically transparent Common Crawl analysis of 65,000 URLs found something different: AI-generated content briefly overtook human-written content around November 2024, and then the two have stayed roughly level since, rather than AI's share continuing to climb. That's not a rounding difference from the other studies. It's a genuinely different shape of trend - a plateau instead of a runaway climb - and the paper says outright that this is "the single most important open empirical question" hanging over the entire framework, because a synthetic share that's stabilized, rather than one that's still rising, undercuts a key assumption baked into the model's math.

To be clear about what this does and doesn't mean: a plateau in the raw share of synthetic content doesn't automatically mean the deeper problem - true diversity quietly collapsing underneath a stable-looking surface - has stopped. The paper argues those are two different things that could move independently, and that's a fair distinction to draw. But it does mean the cleanest, most quotable version of the story - "synthetic content keeps rising, forever, until the web drowns in it" - is currently in tension with at least one credible dataset, not confirmed by all of them. This is a paper that names its own biggest weakness clearly and doesn't try to talk around it, which is exactly why it deserves to be read carefully rather than dismissed or taken on faith.

What would actually settle this

If you want to keep score on this theory instead of just having an opinion about it, here's what to watch for, roughly in order of how soon you'd expect to see movement:

  • Whether the search and citation gap widens or closes. If human-authored content keeps getting disproportionately favored in organic search rankings and chatbot citations relative to its raw share of the web, that's the market-for-lemons mechanism showing up in the wild, exactly as Law I predicts.
  • Whether C2PA engagement moves past novelty. Badges being displayed is infrastructure. Badges being clicked is trust. Right now it's almost entirely the former.
  • Whether the Common Crawl plateau holds, rises, or falls. This is the single number the paper itself flags as most likely to break its own model, so it's the one worth tracking most closely.
  • Whether a genuine price premium for verified authorship shows up in ad rates, subscription pricing, or licensing deals, separate from general inflation or platform-specific quirks.
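
For the first item on that list, "disproportionately favored" can be reduced to one number: the share of human-authored pages in a set of results divided by their share of the underlying web. The sketch below is one possible way to keep that score; the function name and the counts fed to it are placeholders invented for illustration, not measurements from any study.

```python
def representation_ratio(human_in_results: int, total_results: int,
                         human_on_web: int, total_on_web: int) -> float:
    """Share of human-authored pages in results divided by their share of the web.

    A value above 1.0 means human-authored pages are over-represented.
    """
    if total_results <= 0 or total_on_web <= 0 or human_on_web <= 0:
        raise ValueError("counts must be positive")
    share_in_results = human_in_results / total_results
    share_on_web = human_on_web / total_on_web
    return share_in_results / share_on_web

# Placeholder counts for illustration only, not measurements from any study.
ratio = representation_ratio(human_in_results=85, total_results=100,
                             human_on_web=500, total_on_web=1000)
print(f"human-authored over-representation: {ratio:.2f}x")
```

Tracked over time against the same sampling method, a rising ratio would mean the filtering toward human authorship is strengthening, and a ratio drifting toward 1.0 would mean it is fading.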

So is the theory right?

That's probably the wrong question for a working paper that explicitly bills itself as a "draft for public comment." The more useful question is: does it name something real, in a way precise enough to be checked?

I think the answer is yes, on both counts, with real caveats attached to each.

The underlying intuition - that a web flooding with cheap, fluent, synthetic content degrades in ways old metrics can't see - matches what a lot of people already sense every time a search result feels hollow in a way they can't quite name. What this paper adds isn't the observation itself; it's a vocabulary and a set of testable claims for something that's mostly been argued about in vibes. "The internet feels worse" becomes a set of falsifiable predictions: does the market premium for verified content actually rise? Does the gap between apparent and true diversity actually widen? Does a specific data source that should show stabilization actually show it? Those are answerable questions, and the paper commits, in writing, in its own falsification section, to what would prove it wrong.

That's the strongest thing you can say about a piece of theory-building this ambitious: not that it's finished, but that it's built to be checked, and checked against evidence it doesn't fully control the outcome of.

The full working paper, including the mathematical proofs, the complete dynamic model, the entropy taxonomy, the proposed provenance stack, and the full set of objections and responses, is available here. Comments and empirical challenges are explicitly welcomed by the author.