How is GPTBot allowed or disallowed?


Team Flowing Horse

GPTBot, the official crawler of OpenAI, was announced nearly two months ago. It crawls web content to improve OpenAI's models, such as GPT-4. We wondered how the Internet has reacted: is the bot being accepted or rejected?

We decided to take a broad look at how GPTBot is being received. As stated in OpenAI's documentation, a website can deny access to GPTBot by adding two lines to its robots.txt file:

User-agent: GPTBot
Disallow: /

Therefore, by analyzing the robots.txt files of a sampled collection of websites, we should be able to get an overview of GPTBot's acceptance. Note that if a website allows GPTBot to crawl its content, there probably won't be an Allow directive in its robots.txt file, since allowance is the default. Only a website that wants to prevent GPTBot from crawling its content will explicitly add a Disallow directive. Without an explicit Disallow, a website is either intentionally allowing GPTBot or simply unaware of it.
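For illustration, checking a single site can be done with Python's standard-library robots.txt parser. This is a minimal sketch rather than our exact pipeline, and example.com is just a placeholder domain:

# Minimal sketch: check whether a site's robots.txt lets GPTBot crawl its root.
# Uses only the standard library; the domain below is a placeholder.
from urllib.robotparser import RobotFileParser

def allows_gptbot(domain: str) -> bool:
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()                                   # download and parse robots.txt
    return parser.can_fetch("GPTBot", f"https://{domain}/")

if __name__ == "__main__":
    print(allows_gptbot("example.com"))             # True if GPTBot is not blocked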

[Pie chart: share of sampled websites that disallow GPTBot]

We collected 1,018 top-visited websites from dataforseo.com. As shown in the pie chart above, about 28% of them do not allow GPTBot to crawl (including those that block all crawlers). Furthermore, we wanted to know what percentage of websites explicitly disallow access from GPTBot. We have the following chart:

[Chart: breakdown of GPTBot handling in robots.txt (ignore / disallow / error / parts)]

In the chart, the ignore category means GPTBot is not mentioned in robots.txt (default allowance), disallow means GPTBot is explicitly disallowed at the path /, error means the robots.txt has incorrect syntax, and parts means GPTBot is explicitly disallowed only on certain paths. We can see that about 25% of the top-visited websites have already explicitly disallowed access from GPTBot within two months. That number is quite high, considering that robots.txt files are not edited frequently. So we think the Internet is already taking OpenAI's models seriously as a potential threat, or is at least uncertain about their risk.
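To make these categories concrete, here is a rough sketch of such a classification in Python. The classify_robots_txt helper is illustrative and simplifies the grouping rules of the robots.txt format:

def classify_robots_txt(text: str) -> str:
    """Roughly sort a robots.txt into ignore / disallow / parts / error."""
    gptbot_disallows = []       # Disallow values in groups that name GPTBot
    in_gptbot_group = False
    group_has_rules = False
    mentions_gptbot = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()         # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            return "error"                           # not a "field: value" directive
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()

        if field == "user-agent":
            if group_has_rules:                      # a new group starts here
                in_gptbot_group = False
                group_has_rules = False
            if value.lower() == "gptbot":
                in_gptbot_group = True
                mentions_gptbot = True
        elif field in ("allow", "disallow"):
            group_has_rules = True
            if in_gptbot_group and field == "disallow" and value:
                gptbot_disallows.append(value)
        # other fields (Sitemap, Crawl-delay, ...) are ignored in this sketch

    if not mentions_gptbot:
        return "ignore"
    if "/" in gptbot_disallows:
        return "disallow"
    return "parts" if gptbot_disallows else "ignore"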

Have a look at the list of websites that deny GPTBot:

pinterest.com
amazon.com
quora.com
nytimes.com
medium.com
theguardian.com
shutterstock.com
foursquare.com
cnn.com
wikihow.com
usatoday.com
healthline.com
stackexchange.com
alamy.com
scribd.com
dictionary.com
reuters.com
businessinsider.com
washingtonpost.com
medicalnewstoday.com
cbsnews.com
npr.org
goodhousekeeping.com
amazon.co.uk
tumblr.com
insider.com
vocabulary.com
investopedia.com
...

Most of them are content-centered sites (medium.com is surely one of them), so their decision to deny GPTBot seems natural.

We also found several interesting things:

  • OpenAI also has another bot, named ChatGPT-User, which fetches web content for ChatGPT users. We found that about 5% of websites in the sampled collection deny access from ChatGPT-User as well.
  • None of the websites in the sampled collection deny access from Googlebot. Google has confirmed that it uses crawled web data to train its AI models, but Googlebot is, of course, a crucial source of web traffic.
  • While looking into the robots.txt files, we found that some websites make a particular kind of syntax mistake. In a Disallow path, the asterisk symbol * is not a wildcard under the original robots.txt specification; it only matches the literal character. Remember that the snippet below is wrong; it will not prevent GPTBot from crawling your website (a quick check follows it).
User-agent: GPTBot
Disallow: *
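You can see the difference with Python's standard-library parser, which follows the original robots.txt specification and treats the asterisk as a literal character. This sketch compares the wrong directive with the correct one on a placeholder URL:

# Sketch: compare "Disallow: *" with "Disallow: /" using the standard library.
from urllib.robotparser import RobotFileParser

def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)

wrong = "User-agent: GPTBot\nDisallow: *"
right = "User-agent: GPTBot\nDisallow: /"

print(gptbot_can_fetch(wrong, "https://example.com/page"))   # True  -> still crawlable
print(gptbot_can_fetch(right, "https://example.com/page"))   # False -> blocked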

If you want to dig further, our full results can be downloaded at https://docs.google.com/spreadsheets/d/1TQGXiQGZMCQwJWvg7LxtwIrydP2r6oXTITaAPLHB88o/edit?usp=sharing.