2026-01-30
9 min read

Earlier this week, the UK’s Competition and Markets Authority (CMA) opened its consultation on a package of proposed conduct requirements for Google. The consultation invites comments on the proposed requirements before the CMA imposes any final measures. These new rules aim to address the lack of choice and transparency that publishers (broadly defined as “any party that makes content available on the web”) face over how Google uses search to fuel its generative AI services and features. These are the first consultations on conduct requirements launched under the digital markets competition regime in the UK.
We welcome the CMA’s recognition that publishers need a fairer deal and believe the proposed rules are a step in the right direction. Publishers should have access to tools that let them control whether their content is included in generative AI services, and AI companies should have a level playing field on which to compete.
But we believe the CMA has not gone far enough and should do more to safeguard the UK’s creative sector and foster healthy competition in the market for generative and agentic AI.
CMA designation of Google as having Strategic Market Status
In January 2025, the UK’s regulatory landscape underwent a significant legal shift with the implementation of the Digital Markets, Competition and Consumers Act 2024 (DMCC). Rather than relying on antitrust investigations to address risks to competition, the CMA can now designate firms as having Strategic Market Status (SMS) when they hold substantial, entrenched market power. This designation allows for targeted CMA interventions in digital markets, such as imposing detailed conduct requirements, to improve competition.
In October 2025, the CMA designated Google as having SMS in general search and search advertising, given its 90 percent share of the UK search market. Crucially, this designation encompasses AI Overviews and AI Mode, and the CMA now has the authority to impose conduct requirements on Google’s search ecosystem. Final requirements imposed by the CMA are not merely suggestions: they are legally enforceable rules, backed by significant sanctions, that can relate specifically to AI crawling and are designed to ensure Google operates fairly.
Publishers need a meaningful way to opt out of Google’s use of their content for generative AI
The CMA’s designation could not be more timely. As we’ve said before, we are indisputably in a time when the Internet needs clear “rules of the road” for AI crawling behavior.
As the CMA rightly states, “publishers have no realistic option but to allow their content to be crawled for Google’s general search because of the market power Google holds in general search. However, Google currently uses that content in both its search generative AI features and in its broader generative AI services.”
In other words: the same content that Google scrapes for search indexing is also used for inference/grounding purposes, like AI Overviews and AI Mode, which rely on fetching live information from the Internet in response to real-time user queries. And that creates a big problem for publishers—and for competition.
Because publishers cannot afford to disallow or block Googlebot, Google’s search crawler, they have to accept that their content will be used in generative AI applications within Google Search, such as AI Overviews and AI Mode, that return very little, if any, traffic to their websites. This undermines the ad-supported business models that have sustained digital publishing for decades, given the critical role of Google Search in driving human traffic to online advertising. It also means that Google’s generative AI applications enter into direct competition with publishers by reproducing their content, most often without attribution or compensation.
Publishers’ reluctance to block Google because of its dominance in search gives Google an unfair competitive advantage in the market for generative and agentic AI. Unlike other AI bot operators, Google can use its search crawler to gather data for a variety of AI functions with little fear that its access will be restricted. It has minimal incentive to pay publishers for that data, which it is already getting for free.
This prevents the emergence of a well-functioning marketplace where AI developers negotiate fair value for content. Instead, other AI companies are disincentivized from coming to the table, as they are structurally disadvantaged by a system that allows one dominant player to bypass compensation entirely. As the CMA itself recognizes, "[b]y not providing sufficient control over how this content is used, Google can limit the ability of publishers to monetise their content, while accessing content for AI-generated results in a way that its competitors cannot match”.
Google’s advantage
Cloudflare data validates the concern about Google’s competitive advantage. Based on our data, Googlebot sees significantly more Internet content than its closest peers.
Over an observed period of two months, Googlebot successfully accessed almost twice as many individual pages as ClaudeBot and GPTBot, roughly three times as many as Meta-ExternalAgent, and more than three times as many as Bingbot. The difference was even more extreme for other popular AI crawlers: for instance, Googlebot saw 167 times as many unique pages as PerplexityBot. Of the sampled unique URLs that we observed across our network over those two months, Googlebot crawled roughly 8%.
In rounded terms, Googlebot sees:
~1.70x as many unique URLs as ClaudeBot;
~1.76x as many unique URLs as GPTBot;
~2.99x as many unique URLs as Meta-ExternalAgent;
~3.26x as many unique URLs as Bingbot;
~5.09x as many unique URLs as Amazonbot;
~14.87x as many unique URLs as Applebot;
~23.73x as many unique URLs as Bytespider;
~166.98x as many unique URLs as PerplexityBot;
~714.48x as many unique URLs as CCBot; and
~1801.97x as many unique URLs as archive.org_bot.
Googlebot also stands out in other Cloudflare datasets.
Even though Googlebot ranks as the most active bot by overall traffic, publishers are far less likely to disallow or block it in their robots.txt files than other crawlers. This is likely due to its importance in driving human traffic to their content—and, as a result, ad revenue—through search.
As shown below, almost no website explicitly disallows the dual-purpose Googlebot in full, reflecting how important this bot is to driving traffic via search referrals. (Note that partial disallows often impact certain parts of a website that are irrelevant for search engine optimization, or SEO, such as login endpoints.)
Robots.txt merely allows the expression of crawling preferences; it is not an enforcement mechanism. Publishers rely on “good bots” to comply. To manage crawler access to their sites more effectively—and independently of a given bot’s compliance—publishers can set up a Web Application Firewall (WAF) with specific rules, technically preventing undesired crawlers from accessing their sites. Following the same logic as with robots.txt above, we would expect websites to block mostly other AI crawlers but not Googlebot.
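As one minimal illustration, such a rule could look like the sketch below, written in the style of Cloudflare’s custom rules expression language (an illustrative example, not a description of how any particular product is implemented). The strings matched are the user agent tokens those crawler operators publish; production deployments would typically pair this with bot verification, since user agent strings can be spoofed.

  # Block well-known AI-only crawlers at the edge, while leaving Googlebot untouched.
  Expression: (http.user_agent contains "GPTBot")
           or (http.user_agent contains "ClaudeBot")
           or (http.user_agent contains "Bytespider")
  Action:     Block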
Indeed, when comparing the numbers for customers using AI Crawl Control, Cloudflare’s own AI crawler blocking tool that is integrated into our Application Security suite, between July 2025 and January 2026, the number of websites actively blocking other popular AI crawlers (e.g., GPTBot, ClaudeBot) was nearly seven times as high as the number of websites that blocked Googlebot and Bingbot. (Like Googlebot, Bingbot combines search and AI crawling and drives traffic to these sites, but given its small market share in search, its impact is less significant.)
So we agree with the CMA on the problem statement. But how can publishers be enabled to effectively opt out of Google using their content for its generative AI applications? We share the CMA’s conclusion that “in order to be able to make meaningful decisions about how Google uses their Search Content, (...) publishers need the ability effectively to opt their Search Content out of both Google’s search generative AI features and Google’s broader generative AI services.”
But we’re concerned that the CMA’s proposal is insufficient.
CMA’s proposed publisher conduct requirements
On January 28, 2026, the CMA published four sets of proposed conduct requirements for Google, including conduct requirements related to publishers. According to the CMA, the proposed publisher rules are designed to address concerns that publishers (1) lack sufficient choice over how Google uses their content in its AI-generated responses, (2) have limited transparency into Google’s use of that content, and (3) do not get effective attribution for Google’s use of their content. The CMA recognized the importance of these concerns because of the role that Google search plays in finding content online.
The conduct requirements would mandate that Google grant publishers "meaningful and effective" control over whether their content is used for AI features, like AI Overviews. Google would be prohibited from taking any action that negatively impacts the effectiveness of those control options, such as intentionally downranking the content in search.
To support informed decision-making, the CMA proposal also requires Google to increase transparency by publishing clear documentation on how it uses crawled content for generative AI and on exactly what its various publisher controls cover in practice. Finally, the proposal would require Google to ensure effective attribution of publisher content and to provide publishers with detailed, disaggregated engagement data—including specific metrics for impressions, clicks, and "click quality"—to help them evaluate the commercial value of allowing their content to be used in AI-generated search summaries.
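To illustrate what "disaggregated" could mean in practice, a per-surface breakdown along the following lines (a hypothetical sketch; neither the CMA nor Google has specified a reporting format) would let a publisher compare how classic results and AI surfaces perform for the same content:

  surface: classic results   impressions: …   clicks: …   click quality: …
  surface: AI Overviews      impressions: …   clicks: …   click quality: …
  surface: AI Mode           impressions: …   clicks: …   click quality: …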
The CMA’s proposed remedies are insufficient
Although we support the CMA’s efforts to improve options for publishers, we are concerned that the proposed requirements do not solve the underlying issue: giving publishers fair, transparent choice over how Google uses their content. Publishers are effectively forced to use Google’s proprietary opt-out mechanisms, tied specifically to the Google platform and under the conditions set by Google, rather than being granted direct, autonomous control. A framework where the platform dictates the rules, manages the technical controls, and defines the scope of application does not offer “effective control” to content creators or encourage competitive innovation in the market. Instead, it reinforces a state of permanent dependency.
Such a framework also reduces choice for publishers. Adding new Google-managed opt-out controls does not change the fact that publishers cannot use external tools to block Googlebot from accessing their content without jeopardizing their appearance in Search results. Under the current proposal, content creators will still have to allow Googlebot to scrape their websites, with no enforcement mechanisms to deploy and limited visibility if Google does not respect their signalled preferences. Enforcement of these requirements by the CMA, if done properly, will be very onerous, with no guarantee that publishers will trust the solution.
In fact, Cloudflare has received feedback from its customers that Google’s current proprietary opt-out mechanisms, including Google-Extended and ‘nosnippet’, have failed to prevent content from being utilized in ways that publishers cannot control. These opt-out tools also do not enable mechanisms for fair compensation for publishers.
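For reference, these controls are expressed roughly as follows (a simplified sketch; their exact scope is defined by Google’s documentation). Disallowing Google-Extended in robots.txt signals that content should not be used for Gemini model training or grounding, but Googlebot itself continues to crawl the pages, and the token does not govern AI Overviews in Search. The nosnippet directive can keep content out of AI Overviews, but only by also removing the ordinary search snippet, which hurts how the page is presented in results.

  # robots.txt: opt out of use for Gemini training and grounding; Googlebot keeps crawling.
  User-agent: Google-Extended
  Disallow: /

  <!-- Page-level directive: suppresses snippets, including use in AI Overviews,
       at the cost of losing the normal search snippet as well. -->
  <meta name="robots" content="nosnippet">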
More broadly, as reflected in our proposed responsible AI bot principles, we believe that all AI bots should have one distinct purpose and declare it, so that website owners can make clear decisions over who can access their content and why. Unlike its leading competitors, such as OpenAI and Anthropic, Google does not comply with this principle for Googlebot, which is used for multiple purposes (search indexing, AI training, and inference/grounding). Simply requiring Google to develop a new opt-out mechanism would not allow publishers to achieve meaningful control over the use of their content.
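OpenAI’s crawlers illustrate what that principle looks like in practice: it publishes separate user agent tokens for different purposes, so a site can make a different choice for each (purposes paraphrased from OpenAI’s own descriptions of these bots):

  User-agent: GPTBot           # crawling that may be used for model training
  Disallow: /

  User-agent: OAI-SearchBot    # indexing so ChatGPT search can surface and link to the site
  Allow: /

  User-agent: ChatGPT-User     # fetches made on behalf of an individual user's request
  Allow: /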
The most effective way to give publishers that necessary control is to require Googlebot to be split up into separate crawlers. That way, publishers could allow crawling for traditional search indexing, which they need to attract traffic to their sites, but block access for unwanted use of their content in generative AI services and features.
Requiring crawler separation is the only effective solution
To ensure a fair digital ecosystem, the CMA must instead empower content owners to prevent Google from accessing their data for particular purposes in the first place, rather than relying on Google-managed workarounds after the crawler has already accessed the content for other purposes. That approach also enables creators to set conditions for access to their content.
Although the CMA described crawler separation as an “equally effective intervention”, it ultimately rejected mandating separation based on Google’s input that it would be too onerous. We disagree.
Requiring Google to split up Googlebot by purpose — just like Google already does for its nearly 20 other crawlers — is not only technically feasible, but also a necessary and proportionate remedy that empowers website operators to have the granular control they currently lack, without increasing traffic load from crawlers to their websites (and in fact, perhaps even decreasing it, should they choose to block AI crawling).
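In practice, the remedy could be as simple as publishing purpose-specific tokens alongside the existing ones, following the pattern of Google’s documented per-product crawlers such as Googlebot-Image, Googlebot-News, and AdsBot-Google. The AI-related tokens below are invented purely for illustration; Google does not operate crawlers under these names today:

  User-agent: Googlebot               # traditional search indexing: keep allowing to stay in Search
  Allow: /

  User-agent: Googlebot-AI-Training   # hypothetical token for model training crawls
  Disallow: /

  User-agent: Googlebot-AI-Grounding  # hypothetical token for AI Overviews / AI Mode grounding
  Disallow: /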
To be clear, a crawler separation remedy benefits AI companies, by leveling the playing field between them and Google, in addition to giving UK-based publishers more control over their content. (There has been widespread public support for a crawler separation remedy by Daily Mail Group, the Guardian and the News Media Association.) Mandatory crawler separation is not a disadvantage to Google, nor does it undermine investment in AI. On the contrary, it is a pro-competitive safeguard that prevents Google from leveraging its search monopoly to gain an unfair advantage in the AI market. By decoupling these functions, we ensure that AI development is driven by fair-market competition rather than the exploitation of a single hyperscaler’s dominance.
******
The UK has a unique chance to lead the world in protecting the value of original and high-quality content on the Internet. However, we worry that the current proposals fall short. We would encourage rules that ensure that Google operates under the same conditions for content access as other AI developers, meaningfully restoring agency to publishers and paving the way for new business models promoting content monetization.
Cloudflare remains committed to engaging with the CMA and other partners during upcoming consultations to provide evidence-based data to help shape a final decision on conduct requirements that are targeted, proportional, and effective. The CMA still has an opportunity to ensure that the Internet becomes a fair marketplace for content creators and smaller AI players—not just a select few tech giants.