I scanned 1M domains and found the web's AI instruction layer
I started this crawl looking for a narrow thing: where can an AI agent read a site, understand what is for sale, and maybe buy something? The better story was not checkout. It was the quiet appearance of a second public web: thousands of domains now publish files written directly for language models, crawlers, search assistants, shopping agents, and tool-using systems.
The scan surfaced 28,918 domains with llms.txt or llms-full.txt. Some are careful hand-written policies. Many are plugin-generated. Some are malformed. Together they show that websites are already trying to brief models before the models answer for them.
Findings
The numbers changed the article
The first surprise was not that a few AI-native startups publish model instructions. The surprise was how normal this has already become.
- Domains found: 28,918 (had llms.txt, llms-full.txt, or both)
- llms.txt files: 28,735 (the short public briefing layer)
- llms-full.txt files: 2,538 (full corpus files are still rare)
- Commerce feeds: 7,520 (shopping/catalog signals for AI systems)
- MCP mentions: 1,797 (sites speaking in agent/tool vocabulary)
Across the crawl I found 28,735 llms.txt files and 2,538 llms-full.txt files. Only 2,355 domains had both, which is 8.1% of the LLM-file corpus. That tells me the format is still early. Most sites are not yet maintaining a compact index and a full model-readable corpus. They are experimenting with one or the other, often because a CMS plugin or SEO workflow made it easy.
The total captured llms.txt payload was about 714 MB. The full-corpus files added another 178 MB. At the small end, 190 domains were tiny or effectively empty. At the large end, individual files were close to a megabyte. The result feels less like a clean new standard and more like the first year of XML sitemaps: useful, inconsistent, partly automated, and already too important to ignore.
My original "can bots buy things?" question is still in here. Commerce showed up everywhere: 7,520 domains had commerce-feed signals, 5,567 looked like commerce catalog indexes, and ecommerce was the second largest primary category. But buying is only one branch of the story. The broader shift is that sites now want to tell AI systems how to cite, crawl, shop, search, route users, avoid hallucinations, and call tools.
The important thing is that these files are not aimed at human navigation. A normal visitor will never open most of them. They exist for a machine that has already decided to ask, "What is this site, what should I trust, what should I cite, and what can I do here?" That makes llms.txt feel closer to a briefing memo than a web page. It is part sitemap, part source list, part brand positioning, part robot policy, and sometimes part tool manifest.
That also changes how I think about search. SEO tried to make pages rank. AEO and GEO try to make answers quote the right source. This layer is more explicit: it asks the model to read the site in a preferred order. Some files say which pages are canonical. Some say which pages are stale. Some tell models not to make up pricing. Some say how to attribute the work. Some tell an agent where the MCP endpoint lives. The page is no longer the only public artifact that matters.
Landscape
This is not one industry
The LLM-readable web is already spread across publishers, stores, SaaS companies, developer docs, finance, education, security, travel, healthcare, and local businesses.
Top categories
- News, Media & Publishing: 5,054
- Ecommerce & Retail: 3,674
- Business SaaS & Professional Services: 2,631
- Developer Docs, APIs & Open Source: 2,072
- Entertainment, Gaming, Sports & Events: 1,660
- Marketing, SEO & Advertising: 1,606
- Crypto, Finance & Insurance: 1,285
- Education & Training: 1,172
- Cybersecurity & IT Infrastructure: 1,149
Common file shapes
- Curated link indexes: 8,727
- Short profiles or policies: 6,596
- Commerce catalog indexes: 5,567
- Developer docs indexes: 3,112
- Robots-style policies: 2,223
- Curated index plus full corpus: 1,230
- Large curated link indexes: 1,073
- Attribution and AI instructions: 186
Policy and agent signals
- Commerce feed: 7,520
- Crawler policy: 2,615
- Model Context Protocol: 1,797
- Attribution or citation rules: 1,507
- Full corpus linked: 1,328
- AI use policy: 573
- Agent instructions: 183
- Anti-hallucination language: 136
News and publishing led the categories with 5,054 domains. Ecommerce followed with 3,674. Business SaaS and professional services had 2,631. Developer docs, APIs, and open source had 2,072. That mix matters because it means llms.txt is not just an "AI startup" convention. Publishers want answer engines to summarize them correctly. Stores want product discovery. SaaS companies want buyers routed to the right docs and pricing. Developers want coding agents to stop guessing from stale Stack Overflow answers.
The international spread was wider than I expected. The top TLDs included .com, .org, .net, .io, .de, .br, .ai, .uk, .pl, .ru, .in, .nl, and .jp. Language guesses included German, Russian, Chinese, Spanish, French, Portuguese, Arabic, Korean, Thai, and Japanese. A lot of the corpus is still English or unknown, but this is not a US-only phenomenon and it is not limited to Silicon Valley companies trying to rank in ChatGPT.
The category split also suggests different motivations. Publishers are trying to keep summaries accurate and credited. Retailers are trying to expose catalogs, sale surfaces, and product pages. Developer-docs sites are trying to keep code agents on current APIs. SaaS companies are trying to route buyers to pricing, integrations, docs, and support. Regulated businesses are trying to set boundaries around legal, medical, financial, or jurisdiction-specific answers.
The reason this matters is that a model does not experience a website like a human does. It does not patiently browse the homepage, infer the information architecture, and remember the conversion funnel. It retrieves fragments. The LLM file is a way for the site owner to say, "Start here. These are the pages that matter. These are the facts that should survive compression." That is a different design problem from ranking a landing page.
Automation
The biggest source is boring, and that is why it matters
A huge part of adoption appears to be CMS and SEO automation rather than deliberate AI strategy.
I found 9,477 domains with generated-file signatures. Of those, 5,192 looked like Yoast SEO output and 2,189 looked like All in One SEO output. That is the most important adoption clue in the whole scan. Standards spread when they become a checkbox in tools people already use. Most website owners will not sit down and write an AI retrieval policy. They will update a plugin and suddenly have one.
The generated files are uneven. Some are just page lists. Some include generic text explaining that the file is used by LLMs to index the site. Some accidentally expose poor titles, boilerplate, or malformed content. I saw gzip-compressed responses saved with .txt names, UTF-16-like text, image bytes, and HTML fallbacks. The exporter had to decompress 711 gzip files before classification. The upstream crawler should preserve content type and content encoding more explicitly because otherwise the corpus lies to you.
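A content-sniffing pass along these lines would catch most of those mislabeled payloads before classification. This is a sketch, not the crawler's actual code: the magic-byte values are standard, but the helper name and its return shape are my own.

```python
import gzip

def sniff_and_decode(raw: bytes) -> tuple[str, str]:
    """Classify a fetched llms.txt payload and return (kind, text).

    Heuristic sketch: check magic bytes before trusting the .txt name.
    """
    if raw[:2] == b"\x1f\x8b":                    # gzip magic bytes
        raw = gzip.decompress(raw)
        kind = "gzip"
    elif raw[:2] in (b"\xff\xfe", b"\xfe\xff"):   # UTF-16 byte-order marks
        return "utf16", raw.decode("utf-16")
    elif raw[:8] == b"\x89PNG\r\n\x1a\n" or raw[:3] == b"\xff\xd8\xff":
        return "image", ""                        # PNG / JPEG bytes, not text
    else:
        kind = "plain"
    text = raw.decode("utf-8", errors="replace")
    if text.lstrip()[:15].lower().startswith(("<!doctype html", "<html")):
        return "html", text                       # HTML fallback page
    return kind, text
```

Checking bytes before trusting the .txt extension is the whole trick: in this corpus, the filename is the least reliable signal.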
But "messy" does not mean "fake." It means the layer is becoming mundane. That is how robots.txt, sitemaps, schema markup, canonical URLs, Open Graph tags, and product feeds all arrived. First the hand-authored examples get attention. Then the tooling normalizes the pattern. Then everyone has a version, including sites that do not know what it does.
That automation explains why the corpus has two personalities. One personality is intentional: a company writes a compact policy, names its canonical docs, and gives agents instructions. The other is accidental: a plugin dumps titles, descriptions, and links into a file because "AI visibility" became another checkbox beside XML sitemaps and Open Graph tags. The accidental version is often less beautiful, but it may be more important because it can scale through the long tail of the web.
A generated file also creates a new maintenance problem. Search pages can be wrong, but people still see them. A bad llms.txt can be wrong invisibly. It can point models at old docs, expose placeholder titles, classify a site poorly, or make every page look equally important. For a site owner, the file should probably become part of release hygiene: when pricing changes, docs move, product pages disappear, or a policy changes, the machine-readable briefing needs to change too.
The generated signatures were also useful for separating strategy from adoption. If a file says "Generated by Yoast SEO" or "Generated by All in One SEO," I treat it as evidence that the ecosystem is adopting the format, not evidence that the business has a mature AI answer strategy. If a file has custom citation rules, regional caveats, tool guidance, or anti-hallucination instructions, I treat it as an intentional artifact. The next crawler pass should split those classes more aggressively.
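Splitting those classes can start as simple signature matching. The two plugin strings below appear in real generated output; everything else about this classifier, including the intent keywords, is an assumption for illustration.

```python
import re

# The plugin signature strings are real; the intent keywords are my sketch.
GENERATOR_PATTERNS = {
    "yoast": re.compile(r"generated by yoast seo", re.I),
    "aioseo": re.compile(r"generated by all in one seo", re.I),
}
INTENT_PATTERNS = re.compile(
    r"cite|attribution|do not (invent|guess)|mcp|tool endpoint", re.I
)

def classify_provenance(text: str) -> str:
    """Tag a file as plugin-generated, intentionally authored, or unknown."""
    for name, pattern in GENERATOR_PATTERNS.items():
        if pattern.search(text):
            return f"generated:{name}"
    if INTENT_PATTERNS.search(text):
        return "intentional"
    return "unknown"
```

The ordering matters: a generated signature wins even if intent keywords also appear, since the plugin stamp is the stronger provenance signal.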
Control Plane
llms.txt is becoming more than an index
The strongest files do not just list URLs. They tell AI systems how to behave.
The crawler found 2,615 domains with crawler-policy signals and 382 explicit mentions of AI crawlers such as ChatGPT-User, ClaudeBot, PerplexityBot, GPTBot, or Google-Extended. In other words, robots.txt language is being copied into a model-facing document. Some sites are trying to allow answer assistants while limiting training. Some are trying to name which bots can crawl. Some are mostly cargo-culting syntax from other examples.
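Counting those explicit bot mentions is a one-pass scan. The user-agent names come from the crawl's own list; the helper is a sketch, not a robots.txt parser.

```python
import re

# Bot names taken from the scan; detection is plain substring matching.
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def named_ai_bots(text: str) -> set[str]:
    """Return the AI user agents a policy-flavored llms.txt names explicitly."""
    found = set()
    for bot in AI_BOTS:
        if re.search(re.escape(bot), text, re.I):
            found.add(bot)
    return found
```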
Citation control was another major signal. The strict classifier found 337 domains with strong attribution language, while the broader classifier flagged 1,507. That group included domains like 10xtravel.com, 11sight.com, acilearning.com, lawpreptutorial.com, retailbrew.com, stereonet.com, and blossomflowerdelivery.com. The pattern is easy to understand: if a model is going to answer using your work, you want the answer to cite you, quote you correctly, and send the user to the right page.
The most interesting control-plane signal was MCP. I found 1,797 domains mentioning Model Context Protocol. A large subset looked automated: 933 Wix-generated MCP endpoint files exposed tools like business lookup, site search, visitor-token generation, and site API calls. That changes the mental model. A website is no longer just publishing pages for models to read. It can publish a tool boundary and tell an agent what it is allowed to call.
That tool boundary is the sharpest difference between this layer and older discovery files. A sitemap says "these URLs exist." Schema says "this page contains a product or article." A robots file says "crawl this, avoid that." An MCP reference says "there may be a callable interface here." Even if many current references are thin or generated, the direction is clear: the model-readable web is moving from content discovery toward action discovery.
The same pattern appeared in anti-hallucination language. Only 136 domains matched that flag, but the examples were revealing. Some tell assistants not to invent pricing or features. Some route users to local distributors. Some warn that facts are dynamic. This is the beginning of a more honest answer-engine contract: models will summarize, but sites are trying to tell them where not to improvise.
Surprises
The weird parts were more useful than the clean parts
The clean examples show what the format can become. The messy examples show how the web will actually adopt it.
The largest files were not all from the companies I would have guessed. Yes, PixiJS and E2B made sense: developer documentation is an obvious fit for a model-readable full corpus. But the big-file list also included storage racks, nail supply, Pokemon cards, hotel villas, cookware, dealer sites, and small retailers. That is interesting because retail catalogs are exactly where models need structured context but often lack authority. A product page is designed to persuade a person. A model-readable catalog is designed to preserve the facts after summarization.
I also did not expect so many files to look like partial migrations from older web conventions. Some were basically sitemaps. Some were robots files with newer bot names. Some were product feeds wearing a new filename. Some were tiny policies that read like legal notices. That makes the format hard to evaluate if you expect one canonical shape, but it also explains why adoption is happening: people are mapping an unfamiliar AI problem onto tools they already understand.
The files with the most future value were often the ones that compressed judgment, not content. A link dump says, "Here is everything." A good LLM file says, "Here is what matters, here is what changes, here is what you should cite, here is what you should not guess, and here is where the tool boundary begins." That distinction is going to matter more as agent browsers and coding assistants make real decisions from retrieved context.
This is why I want the next pass to score files by intent. Generated adoption, citation policy, full-corpus docs, commerce feeds, MCP/tool links, and explicit safety boundaries are different phenomena. They all happen to live in the same filename today. A serious index should separate them, because a site that says "cite this page and never invent prices" is doing something very different from a site that accidentally exported 400 stale URLs.
Examples
The domains that made the scan feel real
Averages are useful, but the story clicks when you look at specific sites.
Specific domains made the scan feel less abstract. The best examples were not always the biggest brands. Sometimes a small developer tool had the cleanest model-facing docs. Sometimes a retailer had a surprisingly rich catalog dump. Sometimes a regulated business wrote the most careful policy. The mix is the point: the AI instruction layer is forming around real operational needs, not around one clean standards committee.
Nearly a megabyte of renderer docs
PixiJS had the largest llms-full.txt file I found, about 982 KB. That file is not a marketing blurb. It is documentation packed for model retrieval.
Agent sandboxes with full docs
E2B paired a developer-docs profile with a large llms-full.txt corpus and MCP/agent signals. It is the natural habitat for this format.
Routing rules for a crypto brand
1inch tells systems where to send consumer questions, API integration questions, governance questions, and developer workflows.
MCP as product surface
21st.dev is one of the clearer examples where the LLM file mentions an MCP product directly rather than treating llms.txt as a sitemap.
Citation rules for travel rewards
Travel rewards content is full of dates, points values, and caveats. 10xTravel uses preferred citation language to steer how models reuse it.
Robots.txt semantics crossed over
This was one of many files using explicit crawler policy language, including named AI user agents. The old robots layer is bleeding into the new model layer.
The largest llms-full.txt files were especially revealing. pixijs.com was almost a megabyte of developer documentation. e2b.dev was another huge developer-docs corpus. But the rest of the top list was not only developer tools. It included storeyourboard.com, lcsupply.com, dtknailsupply.com, pokeninjapan.store, nanamall.com, and AutoNation dealer sites. Full-corpus files are already being used for catalogs.
Regulated and high-stakes categories deserve extra attention. The heuristic risk flags found 8,939 legal signals, 6,522 financial signals, 5,953 medical signals, 1,096 gambling signals, 889 adult-topic signals, 407 weapons-related signals, and 379 substance-related signals. Those counts are not the same thing as policy violations. They are a warning that answer-engine optimization is happening in places where accuracy, eligibility, age gates, jurisdiction, and professional advice matter.
That is why the Bitget file was so interesting. It did not pretend that a model can safely answer every trading question from a static text file. It described canonical domains, regional fallbacks, live-price caveats, legal and compliance handling, and execution risk around API/MCP trading. That is the right shape for this layer: useful guidance, explicit boundaries, and a reminder that retrieval is not consent to act.
There were also ordinary-looking domains that hinted at how broad the next wave could be. 1hotels.com looked like travel and hospitality with a commerce-feed profile. ssmhealth.com exposed structured context for services and key resources in a medical setting. 10times.com framed events, conferences, exhibitors, ratings, and registration as a machine-readable discovery problem. These are not "AI companies." They are normal businesses with content that agents may soon retrieve before users ever land on the site.
The most useful mental split is not "has llms.txt" versus "does not have llms.txt." It is "what job is the file doing?" For PixiJS and E2B, the job is documentation compression. For Dell, it is product discovery. For Bitget and 1inch, it is routing and risk context. For Aikido and fal.ai, it is developer trust. For Artsy, it is marketplace comprehension. For 10xTravel and ACI Learning, it is citation discipline. Those are very different products wearing the same filename.
Evidence
What this can and cannot prove about OpenAI
The tempting claim is that overlap proves a model scraped and used these files. The honest version is narrower and more useful.
I would not publish a claim that OpenAI scraped a specific llms.txt file and used it in an answer unless I had one of three things: server logs showing an OpenAI crawler fetched the file, a retrieval citation from a ChatGPT browsing/search answer pointing to that exact URL, or a controlled before-and-after test where a unique fact appears only in the LLM file and later shows up through a model-backed retrieval path. This crawl does not have those logs.
What it does have is strong evidence that sites are writing directly toward OpenAI and other answer engines. The export found explicit mentions of ChatGPT-User, GPTBot, Google-Extended, ClaudeBot, and PerplexityBot. Dell is the most concrete example because the file points at a ChatGPT-oriented product feed. That proves intent by the publisher. It does not prove ingestion by OpenAI.
The overlap problem is subtle. If a model can answer that PixiJS is a fast 2D web renderer, that E2B provides sandboxes for AI agents, that fal.ai provides generative media APIs, or that Artsy is an online art marketplace, that does not tell us whether the answer came from llms.txt, a docs page, a homepage, a GitHub repo, a news article, a search index, or older training data. The same fact can appear in many public places.
So I would frame this as an adoption signal, not a smoking gun. These domains are putting machine-readable claims in public. Some of those claims are also present in normal web pages and in the crawler warehouse. That proves the information is exposed through multiple public surfaces. To prove OpenAI use, the next experiment needs instrumentation: publish a unique, harmless nonce in llms.txt, watch server logs for named crawlers, and test whether a retrieval-enabled answer cites or repeats that nonce from the file.
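That instrumentation could look roughly like this. The log parsing assumes the common combined log format, and the nonce value, file path, and crawler names are illustrative assumptions, not values from the crawl.

```python
import re

# Illustrative values: a real test would use its own nonce and watch real logs.
NONCE = "llmscan-7f3a9c"   # unique harmless string published only in llms.txt
CRAWLERS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot")

def crawler_fetched_file(log_lines, path="/llms.txt"):
    """Yield (ip, user_agent) for named AI crawlers that fetched the file.

    Expects combined-log-format lines; silently skips anything else.
    """
    pattern = re.compile(r'^(\S+) .* "GET (\S+) [^"]*" \d+ \d+ "[^"]*" "([^"]*)"')
    for line in log_lines:
        m = pattern.match(line)
        if m and m.group(2) == path and any(c in m.group(3) for c in CRAWLERS):
            yield m.group(1), m.group(3)

def answer_repeats_nonce(answer_text: str) -> bool:
    """Second half of the experiment: does a retrieval-backed answer echo the nonce?"""
    return NONCE in answer_text
```

A fetch in the logs plus the nonce appearing in a retrieval-backed answer is the pair of observations that would turn "adoption signal" into actual evidence of use.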
ChatGPT-oriented feed language
The export found Dell product-feed language, including a ChatGPT-oriented CSV. That proves Dell is publishing for that audience; it does not prove OpenAI used the file.
Public docs repeated in model-ready form
PixiJS has public renderer documentation and a huge llms-full corpus. Overlap with common answers about PixiJS would not isolate which source was used.
Policy details are testable but not proof
Bitget includes regional, legal, price, and MCP-trading caveats. A future retrieval citation to that file would be evidence; model familiarity alone is not.
Security positioning is publicly duplicated
Aikido describes its application-security surface in pages and llms.txt. Matching facts can prove public availability, not internal OpenAI ingestion.
API docs are especially hard to attribute
Developer tools often publish the same facts across docs, marketing pages, SDKs, GitHub, and llms.txt. Source attribution requires logs or retrieval citations.
LLM operations are already indexed elsewhere
Helicone is known through docs, GitHub, and product pages. The llms.txt file is useful, but overlap with known facts is not proof of use by any one model.
This is still worth calling out because it is one of the most interesting questions readers will ask. The practical answer is: "I found lots of sites trying to influence ChatGPT and other assistants, including files that name OpenAI user agents or ChatGPT-specific feeds. I did not prove OpenAI used any one of them." That sentence is less sensational, but it is defensible.
Method
What I would change before the next 1M run
The crawl was useful enough to show the pattern, but the next pass should treat llms.txt as first-class infrastructure.
Classification here was heuristic. It used the domain name, decompressed llms.txt, and the first chunk of llms-full.txt. That was enough to build category counts, profile counts, language guesses, TLD counts, file sizes, and flags for commerce, MCP, citation, crawler policy, AI use policy, agent instructions, full-corpus links, and anti-hallucination language. It is not a legal or safety classifier, and it is not claiming every risk flag is a risky business.
The next version should capture content type, content encoding, final URL, and redirect behavior beside the LLM files. It should keep the homepage favicon and Open Graph image near each row, because those small visual cues make manual review much faster. It should also separate plugin-generated output from hand-authored files, because the incentives are different. A generated sitemap tells you the ecosystem is adopting the convention. A hand-written policy tells you what a company actually wants models to do.
I would also rank files by "instruction density." A 400 KB link dump can be less useful than a 2 KB policy that clearly names canonical pages, dynamic facts, citation requirements, tool endpoints, and things the model should not infer. Byte size is good for finding full corpora. It is not enough for finding the files that will actually change an answer.
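A first cut at that ranking could be instruction-like hits per kilobyte. The keyword list here is illustrative; a real scorer would be tuned against the corpus rather than guessed.

```python
import re

# Illustrative keyword list: terms that signal instructions rather than links.
INSTRUCTION_TERMS = re.compile(
    r"canonical|cite|attribut|do not (guess|invent)|stale|deprecated|"
    r"mcp|endpoint|pricing", re.I
)

def instruction_density(text: str) -> float:
    """Instruction-like matches per KB: a 2 KB policy can outscore a 400 KB dump."""
    kb = max(len(text.encode("utf-8")) / 1024, 0.001)
    return len(INSTRUCTION_TERMS.findall(text)) / kb
```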
The shopping angle needs a separate pass too. A commerce feed is not the same thing as a purchasable endpoint. Some sites expose product URLs and checkout pages. Some expose feeds but still require a human browser, account state, regional shipping, age checks, or payment confirmation. The next crawler should score discovery, pricing, cartability, payment handoff, and post-payment control separately. "A bot can understand this product" and "a bot can buy this product" are very different claims.
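Those dimensions could be tracked as separate booleans rather than one "commerce" flag. The field names follow the distinctions above; how each one actually gets detected is left open, so this is a data shape, not a detector.

```python
from dataclasses import dataclass

@dataclass
class CommerceScore:
    """Separate scores for each stage of agent commerce (sketch)."""
    discovery: bool = False        # product URLs or a feed are exposed
    pricing: bool = False          # machine-readable prices are present
    cartability: bool = False      # add-to-cart reachable without a human browser
    payment_handoff: bool = False  # a documented payment or checkout handoff
    post_payment: bool = False     # order status / refund surface for agents

    def can_understand(self) -> bool:
        """'A bot can understand this product.'"""
        return self.discovery and self.pricing

    def can_buy(self) -> bool:
        """'A bot can buy this product' is a much stronger claim."""
        return all((self.discovery, self.pricing,
                    self.cartability, self.payment_handoff))
```

Keeping the booleans separate is the point: most of the 7,520 commerce-feed domains would satisfy `can_understand` long before any of them satisfy `can_buy`.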
Finally, the safety and policy side deserves more than keyword flags. The current classifier can tell me that legal, medical, financial, gambling, adult, weapons, or substance-related words appear. It cannot tell me whether a site is licensed, whether an answer would be compliant in a user's jurisdiction, or whether a shopping agent should be allowed to act. That is exactly why the model-facing file matters: the site can publish boundaries, but downstream systems still need governance.
The big takeaway is simple: the web is starting to publish instructions for the machines that summarize it. Some of those instructions are SEO. Some are compliance. Some are product feeds. Some are developer docs. Some are actual tool boundaries. That makes llms.txt easy to dismiss if you only look for perfect, hand-authored examples. At 1M-domain scale, the imperfect version is the important version. It shows where the web is already moving.
