Agents failed to dethrone Google Search

Early in the AI supercycle there was a lot of speculation that search engines would become completely obsolete.

You'd hear and see rhetoric like:

"Why would I ever Google something again when I can ask Claude"

"Perplexity is taking over, Google is done"

"Build for agents, they will rule the web"

This hasn't happened, precisely because search is simultaneously a fragmented minefield and a legit monopoly.

Pre-GPT, anti-search sentiment was at an all-time high: this article blew up months before GPT was released, and every related discussion was saying:

"Search is ripe for disruption"

"SEO, recency bias and ad obsession ruined a perfectly good thing"

"I just append reddit to every query now"

Everyone was calling for an alternative to Google Search; the startup world was pushing for it, and Kagi and other alternatives started to snowball.

[Image: Paul G on SEO]

Then the knight in shining armour came through: LLMs. Was this the solution to all our problems...

Agent web search falls short

Four years on from that article, I'm still doing the exact same thing: appending site:reddit.com to find anything written by a real human. But this time I was forced into it by the so-called "solution to search".

I'm planning a muay thai training camp in Thailand for the month of May, and I really needed good human recommendations for a million different things. But I also remembered the classic de facto AI demo: use [insert model name] as a trip planner! So I thought, alright Claude, let's talk Thailand. A few minutes, a dozen web searches and a couple thousand tokens later, it gave me the most generic garbage ever.

Obviously, I had another tab open where I was looking at real human recommendations on various subreddits. To bridge this disparity of information, I asked Claude to search Reddit directly and voila:

Error: "The following domains are not accessible to our user agent: ['reddit.com']."

Wait, so one of the most human sources of information on the internet is blocked? So Claude's Web Search tool is rendered useless? Let's see how bad this really is.

I ran Web Search({relevant test query}) 80 times across the most common search sources to see what was blocked and what wasn't. Each domain below returned the explicit "not accessible to our user agent" error.

Confirmed blocked (server-side):

  • Forums/Q&A: reddit.com, stackoverflow.com, stackexchange.com, serverfault.com, superuser.com, askubuntu.com
  • News (US): nytimes.com, wsj.com, ft.com, bbc.com, reuters.com, apnews.com, theguardian.com, theatlantic.com, newyorker.com, economist.com, forbes.com, businessinsider.com, politico.com, vox.com
  • News (UK tabloids, thank GOD tbh): dailymail.co.uk, thesun.co.uk, telegraph.co.uk, independent.co.uk
  • Tech press: wired.com, arstechnica.com, theverge.com
  • Finance: barrons.com, marketwatch.com
  • Sports: theathletic.com
  • Image hosts: imgur.com

Confirmed allowed (and actually returns real, indexable content):

  • Social: instagram.com, facebook.com, x.com / twitter.com, threads.net, tumblr.com, youtube.com
  • Marketplaces/reviews/Q&A: ebay.com, amazon.com, yelp.com, tripadvisor.com, goodreads.com, quora.com, medium.com, substack.com
  • News: cnn.com, npr.org, bloomberg.com, cnbc.com, washingtonpost.com, techcrunch.com, vice.com, buzzfeed.com, axios.com, thedailybeast.com, huffpost.com, slate.com, engadget.com, gizmodo.com, 9to5mac.com, macrumors.com
  • Reference/dev: github.com, wikipedia.org, news.ycombinator.com, arxiv.org, nature.com, sciencedirect.com, ieee.org, huggingface.co, kaggle.com
  • Other: fandom.com, imdb.com, coursera.org, khanacademy.org, spotify.com, espn.com, skysports.com, bleacherreport.com, openai.com, anthropic.com

The blocks correspond to publishers that have explicitly disallowed Anthropic's ClaudeBot/anthropic-ai user agent in robots.txt or via a content licensing decision.
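You can reproduce this kind of block with nothing but the Python standard library. The robots.txt below is an illustrative fragment modeled on what these publishers serve, not a live copy of any real site's file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt modeled on what blocking publishers serve;
# not a live copy of any real site's file.
SAMPLE_ROBOTS = """\
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# The AI crawlers are shut out entirely...
print(parser.can_fetch("ClaudeBot", "https://example.com/r/muaythai/"))  # False
print(parser.can_fetch("GPTBot", "https://example.com/r/muaythai/"))     # False
# ...while ordinary browsers sail through on the catch-all rule.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/r/muaythai/"))  # True
```

A crawler that respects these directives, like ClaudeBot, has to surface the refusal to the user, which is exactly the "not accessible to our user agent" error above.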

If frontier models can no longer pull in live data from (mostly) real human sources, why the hell would I want to read their responses, particularly the results of their web search?

A more discussed point: as more content on the internet becomes AI-generated, AI is being pretrained on garbage, and likely post-trained on garbage too unless they're breaking the rules... more on this later.

Respecting the law of the land

Everyone respects robots.txt... except Perplexity. They run two bots. PerplexityBot is the indexer, the good guy, the one that respects robots.txt, exactly like ClaudeBot and the others. Perplexity-User is the bad one: it fires when a user types a question into the search. From their docs:

"Since a user requested the fetch, this fetcher generally ignores robots.txt rules" (source)

So is everyone else doing this? Each frontier lab runs two sets of bots, a training crawler and a live user-fetch agent. The training crawlers (ClaudeBot, GPTBot, Google-Extended, PerplexityBot) all respect robots.txt (hmmm... do they?). The user-fetch agents are where the labs actually differ.

Anthropic's user-fetch bot is Claude-User, and per their own docs it respects robots.txt. All three of Anthropic's bots respect the same directives, no carve-out for any of them. This is why I got an explicit error earlier instead of a silent fetch.

OpenAI runs GPTBot for training and two user-fetch bots, ChatGPT-User and OAI-SearchBot. Both of those carry the same legal carve-out as Perplexity. Direct quote from their docs: "Because these actions are initiated by a user, robots.txt rules may not apply." Same legal posture as Perps but less aggressive in practice. No exposé YET.

Google has Google-Extended as a training-only opt-out, separate from Googlebot. Live Gemini fetches inherit Googlebot, which basically nobody blocks because blocking Googlebot means vanishing from Google search. So Gemini reads the entire indexed web by virtue of being Google. What a moat.
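That asymmetry shows up directly in publisher robots.txt files. A representative fragment (illustrative, not copied from any specific site) looks like:

```text
# Cheap to block: training-only crawlers, no search consequences
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Expensive to block: delisting from Google Search.
# Live Gemini fetches inherit this allowance.
User-agent: Googlebot
Allow: /
```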

However! Cloudflare published research showing that when sites block Perplexity-User outright (at the network level, beyond robots.txt), Perplexity switches to an undeclared crawler, a STEALTH USER AGENT:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36

A generic Chrome-on-macOS string designed to look like a regular human browser. Cloudflare also observed Perplexity rotating IPs outside their declared range and switching ASNs to keep the requests flowing. They do this at scale too: 20-25 million declared requests per day, plus another 3-6 million stealth ones, across tens of thousands of domains.

Cloudflare de-listed them as a verified bot and shipped a managed rule to block the stealth signature.
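A toy version of the kind of heuristic behind such a rule, sketched in Python. Everything here is an assumption for illustration: the IP range is a placeholder (TEST-NET), and real bot verification also leans on TLS fingerprints, reverse DNS, and behavioral signals rather than a UA string plus a request counter:

```python
import ipaddress

# Placeholder range standing in for a fetcher's published IPs (assumption);
# real operators document their actual ranges.
DECLARED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

STEALTH_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/124.0.0.0 Safari/537.36")

def classify(user_agent: str, ip: str, reqs_per_min: int) -> str:
    """Toy classifier: declared bots must originate from published ranges;
    a 'browser' UA hammering thousands of URLs a minute is not a human."""
    addr = ipaddress.ip_address(ip)
    in_declared_range = any(addr in net for net in DECLARED_RANGES)
    if "Perplexity-User" in user_agent:
        return "declared-bot" if in_declared_range else "spoofed-bot"
    if user_agent == STEALTH_UA and reqs_per_min > 100:
        return "stealth-crawler"  # browser disguise + machine-speed traffic
    return "probably-human"

print(classify("Mozilla/5.0 (compatible; Perplexity-User/1.0)", "192.0.2.10", 5))
# declared-bot
print(classify(STEALTH_UA, "203.0.113.9", 4000))  # rotated IP, outside range
# stealth-crawler
print(classify(STEALTH_UA, "203.0.113.9", 2))
# probably-human
```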

Now they're implicated in a bunch of lawsuits:

  • Oct 2024 — News Corp (WSJ + NY Post) sues for scraping behind blocks.
  • 2024 — Wired accuses Perplexity of plagiarising the very article that called out the scraping. LMAOOOOO.
  • Aug 2025 — Cloudflare publishes the stealth-crawler evidence.
  • Nov 2025 — Amazon issues a cease-and-desist, then sues.
  • Dec 2025 — NYT sues. Their filing is unusually specific: they say Perplexity kept crawling after both a robots.txt block AND a server-level "hard-block."

The well is drying

The blocklist I mapped is for live web search. But obviously these models have the wealth of data they were trained on. Pretraining mostly runs on Common Crawl plus whatever proprietary scrapes the lab can defend in court.

The 2018-2022 Common Crawl snapshots are full of Reddit, Stack Overflow, NYT because the blocks didn't exist yet. Every frontier model trained before mid-2024 has that data baked in.

The blocks started in 2023-2024. Reddit changed its API rules in June 2023. The major news orgs added GPTBot / ClaudeBot to robots.txt through 2023-2024. Cloudflare shipped one-click AI-bot blocking in summer 2024. So fresh crawls from 2024 onward miss what older crawls captured. Lowkey, you can argue that every new pretraining run pulls from a smaller, more synthetic, more SEO-poisoned part of the web. SEO didn't die either, it got a sequel: AEO. People will always tune content if the audience is an algorithm.

Labs with content licensing deals get to top this up with clean licensed data. The ones without either retrain on shrinking public data, pay settlements after the fact, or take the third option, which only they know about internally.

The same dynamic that locks Claude out of Reddit live also narrows Anthropic's next pretraining run. You can argue that a lab's licensing portfolio in 2025 is a lead indicator of model capability in 2026.

Do they care? At this point we, by using the model, are training it, therefore model usage will directly correlate to model capability. When Claude asks me every single hour "hey how is Claude doing? 0, 1, 2, 3" I know I'm RLHFing the guy and that's fine because I want Opus 4.8 to be cracked.

What can the market do?

There are three options: you play by the rules and accept the blind spots, you pay your way in, or you ignore the rules entirely.

The play-by-the-rules camp is where Anthropic's WebSearch sits. So do Tavily, Exa, Firecrawl and Linkup. They all advertise robots.txt compliance in their docs and none can read Reddit.

The pay camp signs content-licensing deals with publishers and gets the data through the front door. The deals that have come out:

  • Reddit + Google: ~$60M/year, Feb 2024. This is why Reddit blocks every lab that hasn't paid.
  • Reddit + OpenAI: separate deal, May 2024.
  • Stack Exchange + Google for Gemini in Feb 2024.
  • Stack Overflow + OpenAI: May 2024.
  • News Corp + OpenAI: ~$250M / 5yr, May 2024 (WSJ, NY Post, Times of London).
  • OpenAI's wider portfolio: AP, Axel Springer, FT, The Atlantic, Vox Media, Condé Nast (Wired, New Yorker, Vogue), Hearst, Time, Le Monde.

The Atlantic, Condé Nast, News Corp and AP: all blocked from Claude, all signed with OpenAI. Reddit and Stack Overflow, same story. The blocks map almost 1:1 to publishers who got paid by someone other than Anthropic.

Anthropic meanwhile has no comparable announced content-licensing deal. They do have the biggest lawsuits though — settled the Bartz authors case for reportedly ~$1.5B in late 2025, sued by Reddit in June 2025. They're playing the cleanest game and getting the worst search results because of it.

The wrap version is Kagi, which metasearches Google + Bing + Brave plus their own crawlers. Reddit shows up in Kagi because Google has the deal that lets it index Reddit.

The third camp is the rule-breakers like Perplexity (covered above), where not giving a toss becomes the product differentiator.

So picking an agent-search API is really picking a slice of the internet to operate in. They each get to be the best at a different slice of the internet, depending on who they paid (or didn't).
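That slice-picking can be made literal. Here's a minimal agent-routing sketch in Python, where the capability sets are a simplification of the findings above (my assumption for illustration, not vendor-documented coverage):

```python
# Reachable slice per backend, simplified from the blocklist mapped earlier.
# These sets are assumptions for illustration, not vendor documentation.
PROVIDER_SLICES = {
    "kagi":   {"reddit.com", "stackoverflow.com", "nytimes.com", "wired.com"},
    "claude": {"wikipedia.org", "github.com", "youtube.com", "quora.com"},
    "gemini": {"reddit.com", "stackoverflow.com", "nytimes.com", "wired.com",
               "wikipedia.org", "github.com", "youtube.com", "quora.com"},
}

def route(needed_domains: set[str]) -> str:
    """Pick the first backend whose slice covers every domain the query
    needs; fall back to whichever backend reaches the most of the web."""
    for name, slice_ in PROVIDER_SLICES.items():
        if needed_domains <= slice_:
            return name
    return max(PROVIDER_SLICES, key=lambda n: len(PROVIDER_SLICES[n]))

print(route({"reddit.com"}))                # kagi
print(route({"reddit.com", "github.com"}))  # gemini
```

In practice the routing table would be discovered empirically, exactly like the 80-query probe earlier, and the fallback is whichever provider paid for the widest slice.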

Google wins again

Google owns search; it brings in >$200bn in revenue yearly. They were not going to give up that easily. When Gemini 3.0 dropped and Google managed to recapture some AI market share, it was a great move for obvious reasons, but also a hedge against their search product.

Since then they've been pushing on all generative frontiers and also making the most of their existing search infrastructure to inject AI and agents: AI overview, search Gmail, Photos, Maps, A2A protocol, all of their vibecode products. Gemini in Workspace, Pixel, Chrome, Android, your inbox. Whatever agent-search ends up looking like, the channel ends in Google's hands.

You could argue that Gemini is 4th (or dead last among frontier models) on agent-search benchmarks like BrowseComp, but who actually looks to these gamed benchmarks anyway? What I care about is practicality and performance in REAL LIFE. If Gemini is the only agent capable of accessing the entire web, unrestricted, in fact red-carpeted, then can't we definitively say it is the best web search agent? The specific search algorithm will catch up to the moat in due time.

I don't have a horse in this race, although I am a bit sad Claude is losing. Ultimately, I only care about getting the best results for myself. What it means is that different agents are good at different things, so we get further fragmentation across models and tooling; the best way to consolidate this is likely open-source tools and agent routing.

Most of this is anecdotal but I have tried to pull in serious sources. Let me know what you think and what you've determined for yourself in production.

And the agent UX is worse!

If you have any questions or comments, shout me -> [email protected].
