What nginx logs prove about AI traffic vs referral traffic


When someone asks an AI assistant about a site I run, does the assistant actually fetch the page, or does it answer from an index it built earlier? I wanted a straight answer, so I set up an nginx probe and prompted the major chatbots with queries that should force a live fetch. This post is what the server recorded, and what you can safely measure from it.

Two different signals

“AI traffic” usually means one of two things, and nginx logs make the difference obvious.

  • Provider-side fetch. The assistant hits the origin itself, usually with a dedicated user-agent and no referrer.
  • Real clickthrough visit. A human reads the AI answer, clicks a citation, and arrives as a normal browser with the assistant as the referrer.

Folding both into a single AI-traffic number hides the most useful distinction in the data. One is the model reaching out to read you. The other is a human reading you because the model pointed.

The probe

A custom nginx log format captures the headers the default combined format leaves out:

```nginx
log_format ai_probe escape=json
  '{'
    '"time":"$time_iso8601",'
    '"ip":"$remote_addr",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"ua":"$http_user_agent",'
    '"referer":"$http_referer",'
    '"accept":"$http_accept"'
  '}';
```
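The format block only defines the shape; nginx writes nothing until an access_log directive references it by name. A minimal sketch (the listen/server_name values and the log path are placeholders):

```nginx
server {
    listen 80;
    server_name example.com;

    # Write JSON probe lines alongside (or instead of) the default access log
    access_log /var/log/nginx/ai_probe.log ai_probe;
}
```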

Each assistant got a prompt pointing at a unique query string (/?ai=chatgpt, /?ai=claude, and so on), so I could tell from a single grep which hit came from which assistant. I reran prompts across sessions so a transient cache hit would not hide the retrieval path.
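The per-assistant grep is then trivial. A sketch against the JSON format above; the log path and the two sample lines are stand-ins, not real captures:

```shell
# Stand-in lines in the ai_probe JSON shape (real logs live under /var/log/nginx/)
cat > /tmp/ai_probe.log <<'EOF'
{"time":"2025-01-01T12:00:01+00:00","ip":"23.98.10.1","uri":"/?ai=chatgpt","status":200,"ua":"ChatGPT-User/1.0","referer":"","accept":"text/html"}
{"time":"2025-01-01T12:03:07+00:00","ip":"216.73.216.5","uri":"/?ai=claude","status":200,"ua":"Claude-User/1.0","referer":"","accept":"*/*"}
EOF

# One grep per assistant: the unique query string ties each hit to a prompt
grep -c 'ai=chatgpt' /tmp/ai_probe.log   # hits during the ChatGPT run
grep -c 'ai=claude'  /tmp/ai_probe.log   # hits during the Claude run
```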

Who announced themselves, and how

Five assistants arrived with a retrieval-specific signal in the user-agent.

| Assistant | User-agent sent | Accept | robots.txt first? |
|---|---|---|---|
| ChatGPT | `ChatGPT-User/1.0` | Chrome-style | no |
| Claude | `Claude-User/1.0` | `*/*` | yes |
| Perplexity | `Perplexity-User/1.0` | (empty) | via PerplexityBot |
| Meta AI | `meta-webindexer/1.1` | `*/*` | no |
| Manus | `Manus-User/1.0` suffix on a Chrome UA | Chrome-style | no |

All five fetched the page.

Who did not announce themselves

Three assistants had no distinct retrieval user-agent to capture.

| Assistant | What the log captured | Result |
|---|---|---|
| Gemini | zero requests from any Google UA during the prompt window | no live fetch; answered from index |
| Copilot | plain Chrome 135 on Linux x86_64, full browser-style Accept | fetched, but indistinguishable from a human visitor |
| Grok | plain Mac Safari 26 and plain Mac Chrome 143 | fetched, but indistinguishable from a human visitor |

Detail on each assistant follows.

ChatGPT: multi-IP bursts across candidate pages

ChatGPT-User hits the origin from multiple source IPs inside the same burst, and typically pulls several candidate pages at once while the model decides which to cite. On a separate production site I run, a recent 24-hour window captured ChatGPT-User requests from five distinct Azure ranges: 23.98.x.x, 20.215.x.x, 40.67.x.x, 51.8.x.x, and 51.107.x.x. This matches OpenAI’s own description of the agent in their bots documentation. If you are rate-limiting based on a single source IP, you will under-count.
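Counting distinct source IPs rather than requests shows the spread. A sketch against the JSON format above; the sample lines are stand-ins for a real burst:

```shell
# Stand-in ChatGPT-User hits arriving from more than one Azure range
cat > /tmp/ai_probe.log <<'EOF'
{"ip":"23.98.10.1","ua":"ChatGPT-User/1.0","uri":"/post-a"}
{"ip":"20.215.4.9","ua":"ChatGPT-User/1.0","uri":"/post-b"}
{"ip":"23.98.10.1","ua":"ChatGPT-User/1.0","uri":"/post-c"}
EOF

# Distinct source IPs behind one UA: rate-limit on the UA + IP set, not a single IP
grep 'ChatGPT-User' /tmp/ai_probe.log \
  | sed -E 's/.*"ip":"([^"]+)".*/\1/' \
  | sort -u | wc -l
```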

Claude: robots.txt first, every time

Claude-User pulled /robots.txt before every page fetch, out of Anthropic-owned IP space in the 216.73.216.0/24 range. Redirects were followed cleanly, including the usual trailing-slash normalization. The robots precheck matches Anthropic’s behavior as documented in their crawler docs. If you want Claude to skip your site, add a User-agent: Claude-User disallow to your robots.txt. Claude will honor it on the next fetch. Anthropic also runs two other bots that should not be confused with this one: Claude-SearchBot (their search index) and ClaudeBot (their training crawler). Only Claude-User means a real user just asked Claude something about your page.
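Given the robots precheck described above, the disallow stanza is short (a sketch; scope the Disallow path to whatever you actually want hidden):

```
# Blocks only user-triggered Claude retrieval; Claude-SearchBot and
# ClaudeBot are separate names with separate rules
User-agent: Claude-User
Disallow: /
```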

Perplexity: direct fetch, no niceties

Perplexity-User fetched the page directly. No Accept header, no referrer. Separately, PerplexityBot (their search-indexing crawler, not the user-retrieval one) pinged /robots.txt. I captured few Perplexity retrieval runs in total, and Perplexity can answer from its own index without hitting the origin, so the safe wording is that Perplexity can retrieve live; it does not have to. See Perplexity’s bots documentation for their own framing.

Gemini: no hit, not even once

The probe captured real clickthrough visits from gemini.google.com and google.com during the Gemini runs, normal browsers arriving after a user read the answer and clicked a citation. That half of the signal is clean. On the provider-fetch side, nothing arrived. Two separate observations hold here:

  • Observed. Zero requests arrived from any Google user-agent during the Gemini prompt window. Gemini answered entirely from its own index; it did not perform a live provider-side fetch that reached my origin.
  • Structural. Google does not publish a retrieval-specific user-agent for Gemini. Per Google’s own crawler documentation, AI Overviews and AI Mode ground on the same Search index that Googlebot populates. If Gemini ever does live-fetch, it would arrive as Googlebot, indistinguishable from ordinary Search indexing.

Three practical consequences:

  • A Googlebot hit cannot be attributed to Gemini vs classic Search from the request alone.
  • Blocking Google-Extended does not block Googlebot. It gates whether Googlebot-crawled content may be used for Gemini training and grounding.
  • Measuring AI traffic from logs alone will be asymmetric by vendor. Google’s gap is structural, not something to work around.

Copilot and Grok arrive as plain browsers

Microsoft Copilot fetched the page as plain Chrome 135 on Linux x86_64, with a full browser-style Accept header and the usual burst of CSS, JS, and image requests. No distinct Copilot user-agent, no Bingbot activity during the prompt window. Per Microsoft’s guidance for generative-AI and public websites, Copilot grounds on the Bing index populated by Bingbot, but the live fetch we observed was not Bingbot. From your logs alone, you cannot positively attribute a Copilot fetch to Copilot by user-agent.

Grok fetched the page as plain Mac Safari 26 and, in a second run, plain Mac Chrome 143. No distinct UA, no suffix, no header signal that would let you attribute the hit to xAI from the request alone. Grok documents no retrieval-specific bot. Same observability problem as Copilot, with even less documentation to fall back on.

Between Gemini, Copilot, and Grok, three of the major chatbots are either invisible in provider-fetch logs (Gemini) or indistinguishable from an ordinary human visitor (Copilot and Grok). If you’re trying to measure AI traffic from logs alone, plan to miss these three.

Meta AI: two documented bots, one observed, no confident mapping

Meta appears to maintain its own index, the way Google does. In separate testing, Meta AI returned information that no longer exists on the live page, consistent with an index-first retrieval path: serve from index when the page is already known, fetch live only when it is not.

When Meta did fetch live in our probe (prompted through its Muse Spark surface), the request arrived as meta-webindexer/1.1 with Accept: */*. Meta’s own web-crawlers documentation describes a different bot, Meta-ExternalFetcher, as the user-initiated retrieval bot for Facebook, Messenger, Instagram, and WhatsApp AI features, and documents that it may bypass robots.txt on the grounds that a human or agent followed a specific link.

Only one of these bots showed up in a given session, and the probe cannot isolate which factor triggers each one: product surface, first-time vs repeat fetch, prior index state, or something else. Treat meta-webindexer and Meta-ExternalFetcher as two of Meta's live-fetch bots, and block each explicitly by UA rather than assuming a single name covers all of Meta's AI features.

Manus announces itself plainly

Manus fetched as Mozilla/5.0 ... Chrome/132.0 ... ; Manus-User/1.0. That Manus-User/1.0 suffix is how you spot Manus in your logs. Unlike the other agents tested, Manus rendered the full page: HTML, every CSS file, every JS file, every image. Of the agents in this probe, Manus is the one that labels itself clearly in the UA and is easiest to identify in logs.

What you can actually measure

Two things you can measure from your logs without guessing.

Provider fetch

Vendor-documented or probe-observed retrieval user-agents hitting your origin: ChatGPT-User, Claude-User, Perplexity-User, Manus-User, Meta-ExternalFetcher (documented), and meta-webindexer (observed; Meta bot class not fully clear to us).

Real visit

Normal browser user-agent with a chatbot as the referrer: chatgpt.com, claude.ai, perplexity.ai, gemini.google.com, copilot.microsoft.com, grok.com, meta.ai, and google.com / bing.com as broader buckets (with no way to isolate AI Mode or Copilot from classic Search using HTTP alone).
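A referrer grep is enough to pull the real-visit bucket out of the probe log. A sketch; the sample lines are stand-ins, and the domain list mirrors the one above:

```shell
# Stand-in lines: one clickthrough (chatbot referrer), one direct visit
cat > /tmp/ai_probe.log <<'EOF'
{"ip":"198.51.100.7","ua":"Mozilla/5.0 (Macintosh) Chrome/143.0","referer":"https://chatgpt.com/","uri":"/post-a"}
{"ip":"203.0.113.9","ua":"Mozilla/5.0 (X11; Linux x86_64) Chrome/135.0","referer":"","uri":"/post-b"}
EOF

# Real-visit bucket: chatbot hostname in the referrer, ordinary browser UA
grep -E '"referer":"https?://(chatgpt\.com|claude\.ai|perplexity\.ai|gemini\.google\.com|copilot\.microsoft\.com|grok\.com|meta\.ai)' \
  /tmp/ai_probe.log
```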

Search-indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot) also show up in logs, but they are not the AI answering a specific user’s question — they are building an index. Don’t count them as live retrieval. Training bots (GPTBot, ClaudeBot, CCBot) are a third separate signal and should not be counted as retrieval either.
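The three bot classes can be separated with one awk pass over the UA field. A sketch using the vendor-documented names from the appendix; the sample lines are stand-ins:

```shell
# Stand-in lines: one retrieval hit, one indexing hit, one training hit
cat > /tmp/ai_probe.log <<'EOF'
{"ip":"23.98.10.1","ua":"ChatGPT-User/1.0","uri":"/a"}
{"ip":"66.249.66.1","ua":"Mozilla/5.0 (compatible; Googlebot/2.1)","uri":"/b"}
{"ip":"52.70.0.2","ua":"GPTBot/1.0","uri":"/c"}
EOF

# Tag each hit by bot class; only "retrieval" means a user just asked about you
awk '
/ChatGPT-User|Claude-User|Perplexity-User|Manus-User|Meta-ExternalFetcher|meta-webindexer/ { print "retrieval"; next }
/OAI-SearchBot|Claude-SearchBot|PerplexityBot|Googlebot|bingbot/ { print "indexing"; next }
/GPTBot|ClaudeBot|CCBot/ { print "training"; next }
{ print "other" }
' /tmp/ai_probe.log
```

Order matters: `Claude-User` must be tested before `ClaudeBot`-style names so retrieval hits are not swallowed by the broader patterns.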

A related point worth naming: search-indexing and training bots are not expected to hit the origin in response to a specific user query, so their absence during a probe like this is structural, not evidence against them. Measuring training or indexing activity takes a separate long-window log pull, not a prompt-driven test.

Appendix: vendor-documented bot taxonomy

| Bot | Company | Class | Source |
|---|---|---|---|
| `ChatGPT-User` | OpenAI | retrieval | platform.openai.com/docs/bots |
| `OAI-SearchBot` | OpenAI | search_indexing | platform.openai.com/docs/bots |
| `GPTBot` | OpenAI | training | platform.openai.com/docs/bots |
| `Claude-User` | Anthropic | retrieval | Anthropic crawler docs |
| `Claude-SearchBot` | Anthropic | search_indexing | Anthropic crawler docs |
| `ClaudeBot` | Anthropic | training | Anthropic crawler docs |
| `Perplexity-User` | Perplexity | retrieval | docs.perplexity.ai/guides/bots |
| `PerplexityBot` | Perplexity | search_indexing | docs.perplexity.ai/guides/bots |
| `Meta-ExternalFetcher` | Meta | retrieval (may bypass robots.txt) | Meta web crawlers |
| `Meta-ExternalAgent` | Meta | training and product indexing | Meta web crawlers |
| `meta-webindexer` | Meta | observed on Meta AI (Muse Spark) retrieval; class not fully clear to us | Meta crawler docs |
| `Manus-User` | Manus | retrieval (agentic; full browser-style render) | observed in this probe |
| `Googlebot` | Google | search_indexing (also grounds AI Overviews and AI Mode) | Google crawlers |
| `Google-Extended` | Google | usage control, not a crawler; gates Gemini training and grounding | Google crawlers |
| `Bingbot` | Microsoft | search_indexing (also grounds Microsoft Copilot) | Copilot public websites |
| `CCBot` | Common Crawl | training (used by many labs) | commoncrawl.org/ccbot |

Microsoft Copilot and Grok are not in this table. Neither vendor documents a retrieval-specific user-agent we can cite; the live fetches we observed from both came in as plain browsers.

Check this on your own site

Our robots.txt checker reads your live file and reports which retrieval, search, and training user-agents it currently allows or blocks. No account needed. That is the fastest way to turn the table above into one concrete answer about your domain.