When someone asks an AI assistant about a site I run, does the assistant actually fetch the page, or does it answer from an index it built earlier? I wanted a straight answer, so I set up an nginx probe and prompted the major chatbots with queries that should force a live fetch. This post is what the server recorded, and what you can safely measure from it.
Two different signals
“AI traffic” usually means one of two things, and nginx logs make the difference obvious.
- Provider-side fetch. The assistant hits the origin itself, usually with a dedicated user-agent and no referrer.
- Real clickthrough visit. A human reads the AI answer, clicks a citation, and arrives as a normal browser with the assistant as the referrer.
Folding both into a single AI-traffic number hides the most useful distinction in the data. One is the model reaching out to read you. The other is a human reading you because the model pointed.
The probe
A custom nginx log format captures the headers that the default combined format drops:
```nginx
log_format ai_probe escape=json
  '{'
    '"time":"$time_iso8601",'
    '"ip":"$remote_addr",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"ua":"$http_user_agent",'
    '"referer":"$http_referer",'
    '"accept":"$http_accept"'
  '}';
```
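To turn the format on, point an access_log at it. A minimal sketch, assuming a log path of /var/log/nginx/ai_probe.log (both the path and the surrounding server block are yours to adjust):

```nginx
server {
    # ... your existing listen / server_name / root directives ...
    access_log /var/log/nginx/ai_probe.log ai_probe;  # assumed path
}
```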
Each assistant got a prompt pointing at a unique query string (/?ai=chatgpt, /?ai=claude, and so on), so a single grep shows which hit came from which assistant. I reran prompts across sessions so a transient cache hit would not hide the retrieval path.
Who announced themselves, and how
Five assistants arrived with a retrieval-specific signal in the user-agent.
| Assistant | User-agent sent | Accept | robots.txt first? |
|---|---|---|---|
| ChatGPT | ChatGPT-User/1.0 | Chrome-style | no |
| Claude | Claude-User/1.0 | */* | yes |
| Perplexity | Perplexity-User/1.0 | (empty) | via PerplexityBot |
| Meta AI | meta-webindexer/1.1 | */* | no |
| Manus | Manus-User/1.0 suffix on a Chrome UA | Chrome-style | no |
All five fetched the page.
Who did not announce themselves
Three assistants had no distinct retrieval user-agent to capture.
| Assistant | What the log captured | Result |
|---|---|---|
| Gemini | zero requests from any Google UA during the prompt window | no live fetch; answered from index |
| Copilot | plain Chrome 135 on Linux x86_64, full browser-style Accept | fetched, but indistinguishable from a human visitor |
| Grok | plain Mac Safari 26 and plain Mac Chrome 143 | fetched, but indistinguishable from a human visitor |
Detail on each assistant follows.
ChatGPT: multi-IP bursts across candidate pages
ChatGPT-User hits the origin from multiple source IPs inside the same burst, and typically pulls several candidate pages at once while the model decides which to cite. On a separate production site I run, a recent 24-hour window captured ChatGPT-User requests from five distinct Azure ranges: 23.98.x.x, 20.215.x.x, 40.67.x.x, 51.8.x.x, and 51.107.x.x. This matches OpenAI’s own description of the agent in their bots documentation. If you count or rate-limit by a single source IP, you will under-count these bursts.
Claude: robots.txt first, every time
Claude-User pulled /robots.txt before every page fetch, out of Anthropic-owned IP space in the 216.73.216.0/24 range. Redirects were followed cleanly, including the usual trailing-slash normalization. The robots precheck matches Anthropic’s behavior as documented in their crawler docs. If you want Claude to skip your site, add a User-agent: Claude-User disallow to your robots.txt. Claude will honor it on the next fetch. Anthropic also runs two other bots that should not be confused with this one: Claude-SearchBot (their search index) and ClaudeBot (their training crawler). Only Claude-User means a real user just asked Claude something about your page.
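As a sketch, assuming you want to opt the whole site out rather than a specific path, the robots.txt group looks like this:

```
User-agent: Claude-User
Disallow: /
```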
Perplexity: direct fetch, no niceties
Perplexity-User fetched the page directly. No Accept header, no referrer. Separately, PerplexityBot (their search-indexing crawler, not the user-retrieval one) pinged /robots.txt. I captured only a few Perplexity retrieval runs in total, and Perplexity can answer from its own index without hitting the origin, so the safe wording is that Perplexity can retrieve live; it does not have to. See Perplexity’s bots documentation for their own framing.
Gemini: no hit, not even once
The probe captured real clickthrough visits from gemini.google.com and google.com during the Gemini runs, normal browsers arriving after a user read the answer and clicked a citation. That half of the signal is clean. On the provider-fetch side, nothing arrived. Two separate observations hold here:
- Observed. Zero requests arrived from any Google user-agent during the Gemini prompt window. Gemini answered entirely from its own index; it did not perform a live provider-side fetch that reached my origin.
- Structural. Google does not publish a retrieval-specific user-agent for Gemini. Per Google’s own crawler documentation, AI Overviews and AI Mode ground on the same Search index that Googlebot populates. If Gemini ever does live-fetch, it would arrive as Googlebot, indistinguishable from ordinary Search indexing.
Three practical consequences:
- A Googlebot hit cannot be attributed to Gemini vs classic Search from the request alone.
- Blocking Google-Extended does not block Googlebot. It gates whether Googlebot-crawled content may be used for Gemini training and grounding (see the robots.txt sketch after this list).
- Measuring AI traffic from logs alone will be asymmetric by vendor. Google’s gap is structural, not something to work around.
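To make the Google-Extended point concrete, a minimal robots.txt sketch (Google-Extended is Google’s documented control token; Googlebot keeps crawling regardless):

```
# Opts Googlebot-crawled content out of Gemini training and grounding.
# Googlebot itself continues to crawl and index as before.
User-agent: Google-Extended
Disallow: /
```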
Copilot and Grok arrive as plain browsers
Microsoft Copilot fetched the page as plain Chrome 135 on Linux x86_64, with a full browser-style Accept header and the usual burst of CSS, JS, and image requests. No distinct Copilot user-agent, no Bingbot activity during the prompt window. Per Microsoft’s guidance for generative-AI and public websites, Copilot grounds on the Bing index populated by Bingbot, but the live fetch we observed was not Bingbot. From your logs alone, you cannot positively attribute a Copilot fetch to Copilot by user-agent.
Grok fetched the page as plain Mac Safari 26 and, in a second run, plain Mac Chrome 143. No distinct UA, no suffix, no header signal that would let you attribute the hit to xAI from the request alone. Grok documents no retrieval-specific bot. Same observability problem as Copilot, with even less documentation to fall back on.
Between Gemini, Copilot, and Grok, three of the major chatbots are either invisible in provider-fetch logs (Gemini) or indistinguishable from an ordinary human visitor (Copilot and Grok). If you’re trying to measure AI traffic from logs alone, plan to miss these three.
Meta AI: two documented bots, one observed, no confident mapping
Meta appears to maintain its own index, the way Google does. In separate testing, Meta AI returned information that no longer exists on the live page, consistent with an index-first retrieval path: serve from index when the page is already known, fetch live only when it is not.
When Meta did fetch live in our probe (prompted through its Muse Spark surface), the request arrived as meta-webindexer/1.1 with Accept: */*. Meta’s own web-crawlers documentation describes a different bot, Meta-ExternalFetcher, as the user-initiated retrieval bot for Facebook, Messenger, Instagram, and WhatsApp AI features, and documents that it may bypass robots.txt on the grounds that a human or agent followed a specific link.
Only one of these bots showed up in a given session, and the probe cannot isolate which factor triggers each one: product surface, first-time vs repeat fetch, prior index state, or something else. Treat meta-webindexer and Meta-ExternalFetcher both as Meta’s live-fetch bots, and if you want to block Meta retrieval, block both of them explicitly by UA rather than assuming a single name covers all of Meta’s AI features.
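A minimal nginx sketch of that UA-level block, assuming you want to refuse both names outright (the variable name and the 403 response are arbitrary choices; the map belongs in the http context):

```nginx
# http context
map $http_user_agent $deny_meta_ai {
    default                 0;
    ~*meta-webindexer       1;
    ~*Meta-ExternalFetcher  1;
}

server {
    # ... existing directives ...
    if ($deny_meta_ai) {
        return 403;
    }
}
```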
Manus announces itself plainly
Manus fetched as Mozilla/5.0 ... Chrome/132.0 ... ; Manus-User/1.0. That Manus-User/1.0 suffix is how you spot Manus in your logs. Unlike the other agents tested, Manus rendered the full page: HTML, every CSS file, every JS file, every image. Of the agents in this probe, Manus is the one that labels itself clearly in the UA and is easiest to identify in logs.
What you can actually measure
Two things you can measure from your logs without guessing.
Provider fetch
Vendor-documented or probe-observed retrieval user-agents hitting your origin: ChatGPT-User, Claude-User, Perplexity-User, Manus-User, Meta-ExternalFetcher (documented), and meta-webindexer (observed; Meta bot class not fully clear to us).
Real visit
Normal browser user-agent with a chatbot as the referrer: chatgpt.com, claude.ai, perplexity.ai, gemini.google.com, copilot.microsoft.com, grok.com, meta.ai, and google.com / bing.com as broader buckets (with no way to isolate AI Mode or Copilot from classic Search using HTTP alone).
Search-indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot) also show up in logs, but they are not the AI answering a specific user’s question — they are building an index. Don’t count them as live retrieval. Training bots (GPTBot, ClaudeBot, CCBot) are a third separate signal and should not be counted as retrieval either.
A related point worth naming: search-indexing and training bots are not expected to hit the origin in response to a specific user query, so their absence during a probe like this is structural, not evidence against them. Measuring training or indexing activity takes a separate long-window log pull, not a prompt-driven test.
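If you log with the ai_probe format above, the bucketing is a few lines of scripting. A minimal Python sketch, assuming the log path used earlier; the user-agent and referrer lists are the ones discussed in this post and should be adjusted to the vendors you care about:

```python
# classify_ai_probe.py -- a minimal sketch, not a drop-in tool.
# Assumes the JSON-per-line log produced by the ai_probe format above.
import json
from collections import Counter

LOG_PATH = "/var/log/nginx/ai_probe.log"  # assumed path

# Vendor-documented or probe-observed retrieval user-agents (provider fetch).
RETRIEVAL_UAS = (
    "ChatGPT-User", "Claude-User", "Perplexity-User",
    "Manus-User", "Meta-ExternalFetcher", "meta-webindexer",
)
# Search-indexing and training bots: a separate signal, not live retrieval.
INDEX_OR_TRAINING_UAS = (
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "Googlebot", "bingbot", "GPTBot", "ClaudeBot", "CCBot",
)
# Referrer hosts that mark a human clickthrough from an AI answer.
AI_REFERRER_HOSTS = (
    "chatgpt.com", "claude.ai", "perplexity.ai", "gemini.google.com",
    "copilot.microsoft.com", "grok.com", "meta.ai",
)

def classify(entry: dict) -> str:
    ua = entry.get("ua", "").lower()
    referer = entry.get("referer", "").lower()
    if any(tok.lower() in ua for tok in RETRIEVAL_UAS):
        return "provider_fetch"
    if any(tok.lower() in ua for tok in INDEX_OR_TRAINING_UAS):
        return "index_or_training"
    if any(host in referer for host in AI_REFERRER_HOSTS):
        return "real_visit"
    return "other"

counts = Counter()
with open(LOG_PATH) as f:
    for line in f:
        line = line.strip()
        if line:
            counts[classify(json.loads(line))] += 1

for bucket, n in counts.most_common():
    print(f"{bucket}\t{n}")
```

Run it over a day of logs and you get four counts: provider fetches, human clickthroughs from AI answers, index or training crawls, and everything else.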
Appendix: vendor-documented bot taxonomy
| Bot | Company | Class | Source |
|---|---|---|---|
| ChatGPT-User | OpenAI | retrieval | platform.openai.com/docs/bots |
| OAI-SearchBot | OpenAI | search_indexing | platform.openai.com/docs/bots |
| GPTBot | OpenAI | training | platform.openai.com/docs/bots |
| Claude-User | Anthropic | retrieval | Anthropic crawler docs |
| Claude-SearchBot | Anthropic | search_indexing | Anthropic crawler docs |
| ClaudeBot | Anthropic | training | Anthropic crawler docs |
| Perplexity-User | Perplexity | retrieval | docs.perplexity.ai/guides/bots |
| PerplexityBot | Perplexity | search_indexing | docs.perplexity.ai/guides/bots |
| Meta-ExternalFetcher | Meta | retrieval (may bypass robots.txt) | Meta web crawlers |
| Meta-ExternalAgent | Meta | training and product indexing | Meta web crawlers |
| meta-webindexer | Meta | observed on Meta AI (Muse Spark) retrieval; class not fully clear to us | Meta crawler docs |
| Manus-User | Manus | retrieval (agentic; full browser-style render) | observed in this probe |
| Googlebot | Google | search_indexing (also grounds AI Overviews and AI Mode) | Google crawlers |
| Google-Extended | Google | usage control, not a crawler; gates Gemini training and grounding | Google crawlers |
| Bingbot | Microsoft | search_indexing (also grounds Microsoft Copilot) | Copilot public websites |
| CCBot | Common Crawl | training (used by many labs) | commoncrawl.org/ccbot |
Microsoft Copilot and Grok are not in this table. Neither vendor documents a retrieval-specific user-agent we can cite; the live fetches we observed from both came in as plain browsers.
Check this on your own site
Our robots.txt checker reads your live file and reports which retrieval, search, and training user-agents it currently allows or blocks. No account needed. That is the fastest way to turn the table above into one concrete answer about your domain.