Responsible bot operation


Author: Aaron P. MacSween

Published: 2025-12-01

I've written two articles about poorly behaved web-crawling robots recently, first about a somewhat novel method to distinguish bots from legitimate human traffic ("AI scrapers request commented scripts"), then another unpacking some of the responses to that article ("Algorithmic sabotage debrief"). Having already spent some time outlining the various things an operator can do to earn a sysadmin's malice I figured it might be worthwhile to talk at least briefly about how to operate a bot responsibly.

Most of the practices I intend to describe are already in use by a variety of web-crawlers; however, I haven't been able to find a formal name to describe them all, nor have I come across any articles which cover all the behaviours that I consider important. I'll link to examples where they are available, and include some concrete measurements of behaviour I've observed on my own servers wherever possible.

[Image: a digital cartoon illustration of a laptop computer with a silly happy face on its screen]

Robots.txt

If you're reading this then it's very likely you've already heard of robots.txt before (see robotstxt.org otherwise). Basically, it's a simple text file that a website can host which serves as a means of communicating to various bots how the site expects them to behave. Particular bots can be allowed or disallowed across the whole site, or with regard to particular resources, including paths and general patterns like file extensions.

The format was introduced in 1994, but somehow was not formally described until 2022 (in RFC 9309: Robots Exclusion Protocol, drafted primarily by Google employees). Even then, common extensions like Sitemaps (another protocol designed by Google employees) were not explicitly defined in that formal description except for a note that additional fields may be supported:

Crawlers MAY interpret other records that are not part of the robots.txt protocol -- for example, "Sitemaps" [SITEMAPS]. Crawlers MAY be lenient when interpreting other records. For example, crawlers may accept common misspellings of the record.

Many crawlers support the Crawl-Delay directive, which specifies a number of seconds to wait between requests, but for some reason it was not included in the specification. I'm not aware of any commonly agreed upon upper bound for this delay, and because it's so poorly specified it can be hard to validate how well a particular crawler adheres to the protocol. Some others expect a Request-rate directive (Seznam.cz, for instance), so in practice you might have to include both directives, each with its own particular syntax.
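For illustration, a site that wants to throttle every crawler might publish something like the following. The 1/10 form for Request-rate (one document per ten seconds) comes from the old extended robots exclusion proposal rather than from any standard, so check each crawler's own documentation for the syntax it actually accepts:

User-agent: *
# Non-standard, but widely supported: wait ten seconds between requests
Crawl-delay: 10
# Also non-standard: at most one document every ten seconds
Request-rate: 1/10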

Another issue to be aware of is that there are various crawlers that will interpret robots.txt files somewhat creatively. Google's dominance over search is so dramatic that many site administrators will include directives for them and not bother with others. This has led to the following policy from applebot:

If robots instructions don't mention Applebot but mention Googlebot, the Apple robot will follow Googlebot instructions.

...and this similar policy from "ImageSift":

If there is no rule targeting ImagesiftBot, but there is a rule targeting Googlebot, then ImagesiftBot will follow the Googlebot directives.

From what I've seen both of these bots still obey catch-all directives like the following:

User-agent: *
Disallow: /

...though this was not particularly clear based on how their policies were written.

So, for anyone looking to get into responsible web-crawling, I would recommend the following:

  1. Read RFC 9309.

  2. Look into frameworks for crawling that implement support for the usual extensions and have been tested against a variety of edge cases. I have no direct experience with this project, but Ethicrawl seems like a somewhat competent reference implementation. I've also heard good things about BeautifulSoup.

  3. Don't get creative in how you interpret a site's robots.txt file. Stick to the usual definitions, and if your crawler is disallowed by a site, then don't crawl.

User-agent strings

A User-Agent header (or UA string) is a bit of text that browsers and bots alike can send as part of a request to identify themselves. It's entirely voluntary, and a bad bot can send a fake string to pretend to be something else (more on that later), but it's a good starting point.

Wikipedia's user-agent policy is worth reading in its entirety, but I'll quote the most relevant bits:

As of February 15, 2010, Wikimedia sites require a HTTP User-Agent header for all requests. This was an operative decision made by the technical staff and was announced and discussed on the technical mailing list. The rationale is, that clients that do not send a User-Agent string are mostly ill behaved scripts that cause a lot of load on the servers, without benefiting the projects. User-Agent strings that begin with non-descriptive default values, such as python-requests/x, may also be blocked from Wikimedia sites (or parts of a website, e.g. api.php).

I implement a similar policy across all my sites.

If you run a bot, please send a User-Agent header identifying the bot with an identifier that isn't going to be confused with many other bots, and supplying some way of contacting you (e.g. a userpage on the local wiki, a userpage on a related wiki using interwiki linking syntax, a URI for a relevant external website, or an email address), e.g.:

User-Agent: CoolBot/0.0 (https://example.org/coolbot/; coolbot@example.org) generic-library/0.0

I like this example in particular because it provides a full URL with the protocol specified (https://). I'm aware of a few crawlers that include a link, but because there are now so many top-level domains it's not always easy to tell a domain apart from some random characters. For instance, BuiltWith.com includes a link to an informational page about their bot, but the URL omits the protocol and uses a link shortener:

All our requests contain a User-Agent which will include -
BW/1.1; rb.gy/oupwis
BW/1.3; rb.gy/qyzae5

Their full UA string is as follows:

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; BW/1.3; rb.gy/qyzae5) Chrome/124.0.0.0 Safari/537.36"

...which honestly just looks like noise unless you're paying close attention or know what to look for. From what I've seen their bot is otherwise somewhat well behaved, but that's an embarrassingly bad user-agent string. I'd have gone with something shorter and more informative, like:

BuiltWithBot/1.3 https://builtwith.com/biup

The thing about UA strings is that they are only useful if a website administrator happens to see them. Personally, I've set up a script which monitors my server's HTTP logs and sends a daily email digest of all the UA strings which it has not previously encountered. This means I don't have to actively check any logs, and can instead just look into new web crawlers and search engines after they first visit one of my sites.
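To give a sense of what that involves, here's a minimal sketch of the idea in NodeJS. It is not my actual script: it assumes a "combined" access log format where the User-Agent is the last quoted field, keeps known agents in a plain text file, and prints anything new so that a cron job can pipe the output into an email. The paths are placeholders.

// ua-digest.js: a minimal sketch of a "new User-Agent" monitor
const fs = require("node:fs");

const LOG_FILE = "/var/log/nginx/access.log";          // placeholder path
const KNOWN_FILE = "/var/lib/ua-digest/known-agents.txt"; // placeholder path

const known = new Set(
    fs.existsSync(KNOWN_FILE)
        ? fs.readFileSync(KNOWN_FILE, "utf8").split("\n").filter(Boolean)
        : []
);

// The User-Agent is the final double-quoted field of a combined log line.
const extractAgent = line => {
    const quoted = line.match(/"[^"]*"/g);
    return quoted ? quoted[quoted.length - 1].slice(1, -1) : null;
};

const fresh = new Set();
for (const line of fs.readFileSync(LOG_FILE, "utf8").split("\n")) {
    const agent = extractAgent(line);
    if (agent && !known.has(agent)) { fresh.add(agent); }
}

if (fresh.size) {
    console.log([...fresh].join("\n")); // the daily digest
    fs.appendFileSync(KNOWN_FILE, [...fresh].join("\n") + "\n");
}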

From there, I can decide how to respond to them, whether that means modifying my robots.txt files to allow them to crawl, or to add new rules to automatically block their IPs at the firewall level, or to feed them garbage data that will mess with any LLMs that include it in their training sets.

Introductory web pages

User-agent strings should include a link to a web page, but the quality of those web pages can vary greatly. For instance, a friend had given me access to their ChatGPT account back when it was first launched. I basically forgot about it until a few days ago when they announced a data breach ("What to know about a recent Mixpanel security incident"), after which I decided to just delete that account. As it turns out, their bots' informational page only returns an error page if you've deleted or deactivated your account. The page loaded correctly in a private window, but such pages should work correctly with or without authentication.

OpenAI error page. "Oops! You do not have an account because it has been deleted or deactivated. If you believe this was an error, please contact us through our help center at help.openai.com."

Given how they claim to be on the verge of creating Artificial General Intelligence, I personally find it a little odd that they can't manage to display a simple web page, but maybe my standards are too high. In any case, if you want sites to allow your bot to crawl them, the best way to earn their trust is to include the following on a publicly accessible web page:

  1. The purpose of your bot

  2. Information about your organization

  3. How it behaves, including which robots.txt directives you can expect it to follow

  4. How to block it

  5. How to distinguish it from fakes (more on that later)

  6. Who to contact if it's misbehaving

There aren't many examples of crawlers which satisfy all these requirements. Googlebot's info page is fairly comprehensive, but because Google uses data for so many purposes it's basically impossible for a website owner to know what sort of behaviours they'll enable by permitting Googlebot to crawl their site.

For example:

Blocking Googlebot affects Google Search (including Discover and all Google Search features), as well as other products such as Google Images, Google Video, and Google News.

...which leaves it rather unclear whether data shared for the purpose of appearing in search results will end up in an LLM training set, though I'd assume so.

GeedoProductSearch does a better job communicating how they intend to use data they request. They also list their bots' IP addresses in a variety of formats, describe exactly how to validate their authenticity using DNS, and how to block or restrict their traffic via robots.txt with clear examples. The main thing that's missing is a way to contact their operators, but because they seem to behave themselves there's not much need for it.

I'd be interested to hear from other sysadmins whether they can point to better examples, as pretty much every crawler I've evaluated falls short on at least one of my requirements.

rDNS authentication

As I mentioned, it's trivial to send a fake user-agent string claiming to be a normal browser, or a different bot, which is why many bot operators provide lists of their bots' IP addresses. Some of these IP lists change frequently, however, so I prefer to validate bots' authenticity via DNS.

The basic idea is that as long as you trust that a bot operator is able to protect their DNS infrastructure from adversaries, you can use their DNS records to identify bots which are falsely claiming to be operated on their behalf (hereafter referred to as spoofing).

First, a reverse DNS lookup uses an IP address and looks up its corresponding pointer (PTR) record. PTR records can include arbitrary text, so there's nothing particularly trustworthy about them, but in this bot validation protocol they are used to return a domain which can be verified.

Microsoft offers a page describing how to validate bots claiming to act on behalf of their Bing search engine, so you can see how this would work in practice:

> host 157.55.39.1
157.55.39.1.in-addr.arpa domain name pointer msnbot-157-55-33-18.search.msn.com

When a crawler identifies itself as bingbot, you can look up the PTR record for its IP, and it should end in .search.msn.com. If it doesn't, then it should be clear that the bot operator only bothered to spoof its User-Agent and nothing more. Even so, they might have gone further and also spoofed their PTR record, so we still need to validate that too by resolving that text as a DNS record:

> host msnbot-157-55-33-18.search.msn.com
msnbot-157-55-33-18.search.msn.com has address 157.55.39.1

Only Microsoft should have the ability to configure DNS records for domains ending in msn.com, and they only point these back to their own IP addresses. Thus, if that msn.com domain returned by the PTR record resolves to the IP of the crawler, then you can be confident the crawler is actually one which belongs to Microsoft.

Note: Microsoft's code examples show the host command returning results ending with .com. At least on my systems I get results that include a trailing dot (.com.). These are considered equivalent DNS records, but you'll want to check how your system behaves, because otherwise a basic string comparison might fail.
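One way to guard against that, in the style of the NodeJS snippets later in this article, is to strip any trailing dot before comparing suffixes:

// Some resolvers return fully qualified names with a trailing dot.
// Normalizing first avoids a false negative in the suffix check.
const normalize = name => name.endsWith('.') ? name.slice(0, -1) : name;

normalize('msnbot-157-55-33-18.search.msn.com.').endsWith('.search.msn.com');
// true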

In my experience, this DNS-based validation method is the most consistent means of validating web-crawling bots. All you need is the domain suffix that's expected for the bot in question, and a pattern by which you can identify bots according to their user-agent. The IP address lists that some bot operators provide require you to know both the URL of the list and its format (usually plain text or a JSON array). They can also be updated rather frequently, so you might need a system to cache them and periodically invalidate that cache, features you might just get for free if you use DNS.

When I first started validating bots in this manner I only did so for Googlebot and Bingbot. Once I had a system in place for them I became curious about how many other bots supported this method, and I started searching through my logs for more candidates.

I've since collected all the necessary information into one data structure, included below:

/* rDNS suffix => regular expression */
const RECORDS = {
    // http://www.bing.com/bingbot.htm
    '.search.msn.com': /bingbot/i,

    // http://www.googlebot.com/bot.html
    '.googlebot.com': /googlebot/i,

    // https://help.seranking.com/en/blex-crawler
    'blex.seranking.com': /BLEXBot/i,

    // https://github.com/roccomuso/is-baidu
    '.crawl.baidu.com': /baidu/i,

    // https://openrobotstxt.org/ua
    '.openrobotstxt.org': /ORTc\.me/i,

    // https://ahrefs.com/robot/
    // see also: https://yep.com/yepbot/
    '.ahrefs.net': /AhrefsBot/i,

    // https://yandex.com/support/webmaster/en/robot-workings/check-yandex-robots.html?lang=en
    '.spider.yandex.com': /Yandex.*Bot/i,

    // https://support.apple.com/en-us/119829
    '.applebot.apple.com': /applebot/i,

    // https://www.semrush.com/bot/
    '.bot.semrush.com': /semrush/i,

    // https://mj12bot.com/
    '.mj12bot.com': /MJ12bot/i,

    // https://domainstats.com/pages/our-bot
    '.domainstats.com': /DomainStatsBot/i,

    // https://commoncrawl.org/faq
    '.crawl.commoncrawl.org': /CCBot/i,

    // https://openintel.nl/webcrawl/
    '.ewi.utwente.nl': /openintel\.nl/i,

    // https://dataforseo.com/dataforseo-bot
    '.dataforseo.com': /DataForSeoBot/i,

    // https://ibou.io/iboubot.html
    '.ibou.io': /iboubot/i,

    // https://developer.amazon.com/support/amazonbot
    '.crawl.amazonbot.amazon': /amazonbot/,

    // https://webmaster.petalsearch.com/site/petalbot
    '.petalsearch.com': /petalbot/,

    // I wasn't able to find any voluntarily listed information,
    // but this bot can be validated via DNS
    '.crawl.bytedance.com': /bytespider/i,

    // https://www.babbar.tech/crawler
    '.babbar.eu': /barkrowler/i,

    // https://about.censys.io/
    '.censys-scanner.com': /censys/i,

    // https://help.naver.com/service/5626/contents/8026?lang=ko&osType=COMMONOS
    '.web.naver.com': /naver\.me/i,

    // https://www.mojeek.com/bot.html
    '.mojeek.com': /mojeek/i,

    // https://headline.com/legal/crawler
    '.headline.com': /headline\.com/i,

    // https://o-seznam.cz/napoveda/vyhledavani/en/seznambot-crawler/
    '.seznam.cz': /seznambot/i,

    // https://help.yahoo.com/kb/search-for-desktop/SLN22600.html?impressions=true
    '.crawl.yahoo.net': /ysearch\/slurp/i,

    // https://www.dataprovider.com/
    '.dataproviderbot.com': /dataprovider\.com/i,

    // https://geedo.com/product-search/
    '.geedo.com': /geedo\.com/i,

    // https://archive.org/details/archive.org_bot
    '.archive.org': /archive\.org_bot/i,

    // https://help.mail.ru/
    '.go.mail.ru': /mail\.ru/i,
};

Each record consists of a regular expression to identify the bot, and a corresponding domain suffix. By looping over the records and testing each regular expression you can identify the relevant suffix, as in this NodeJS implementation:

const findMatchingDomain = agent => {
    for (const [domain, pattern] of Object.entries(RECORDS)) {
        if (pattern.test(agent)) {
            return domain;
        }
    }
    return false;
};

So, for an agent containing the string googlebot, the crawler's reverse DNS record should end with .googlebot.com, which you can verify in Googlebot's documentation. From there you can do the DNS lookups:

// imported as an ES module so that the top-level awaits below work
import DNS from "node:dns/promises";

const ip = '66.249.64.68';
const agent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.7390.122 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

const suffix = findMatchingDomain(agent);
// '.googlebot.com'

const reversed = await DNS.reverse(ip);
// [ 'crawl-66-249-64-68.googlebot.com' ]

const reversed_matches = reversed[0].endsWith(suffix);
// true

const resolved = await DNS.resolve(reversed[0]);
// [ '66.249.64.68' ]

const resolved_matches = resolved[0] === ip;
// true
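Putting those pieces together, one possible way to wrap the whole check in a single helper looks like this. It's a sketch rather than my production code, and it only resolves IPv4 addresses; crawlers arriving over IPv6 would need DNS.resolve6 as well:

// Returns 'unknown' if the agent matches none of the RECORDS patterns,
// true if the rDNS round trip confirms the bot, and false otherwise.
const validateBot = async (ip, agent) => {
    const suffix = findMatchingDomain(agent);
    if (!suffix) { return 'unknown'; }
    try {
        const [ptr] = await DNS.reverse(ip);
        // Strip any trailing dot before the suffix comparison.
        if (!ptr || !ptr.replace(/\.$/, '').endsWith(suffix)) { return false; }
        // Forward-confirm: the PTR's name must resolve back to the same IP.
        const addresses = await DNS.resolve(ptr);
        return addresses.includes(ip);
    } catch (err) {
        // NXDOMAIN, timeouts, and other lookup failures count as spoofed.
        return false;
    }
};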

Evaluating spoof rates

In case you're wondering just how many bots are spoofing different IP addresses, I iterated over my logs and applied the above method to validate unique combinations of agent and IP address. The results are as follows:

Pattern                 IPs   Valid  Spoofed
/bingbot/i              238   223    15
/googlebot/i            296   263    33
/BLEXbot/i              19    17     2
/baidu/i                36    27     9
/ORTc\.me/i             8     8      0
/AhrefsBot/i            87    86     1
/Yandex.*Bot/i          169   166    3
/applebot/i             133   133    0
/semrush/i              32    32     0
/MJ12bot/i              77    66     11
/DomainStatsBot/i       1     0      1
/CCBot/i                10    5      5
/openintel\.nl/i        8     8      0
/DataForSeoBot/i        6     6      0
/iboubot/i              23    23     0
/amazonbot/             242   242    0
/petalbot/              260   259    1
/bytespider/i           381   258    123
/barkrowler/i           30    27     3
/censys/i               710   78     632
/naver\.me/i            11    11     0
/mojeek/i               1     1      0
/headline\.com/i        5     0      5
/seznambot/i            6     0      6
/ysearch\/slurp/i       6     4      2
/dataprovider\.com/i    6     6      0
/geedo\.com/i           2     2      0
/archive\.org_bot/i     19    17     2
/mail\.ru/i             2     2      0

I found it interesting that even some of the rather obscure crawlers are being spoofed. You might think that people would mostly pretend to be googlebot, on the basis that they're less likely to be blocked. Perhaps enough people actually validate Googlebot, compared to some of the other options? Censys looks particularly bad in these results, though I can't quite tell whether they have a lot of impersonators or if they only bother to provide authentication for a minority of their bots.

In any case, these are just the crawlers that support DNS validation. I didn't bother validating crawlers from OpenAI, Perplexity, TikTok, or Meta. At best they only provide lists of IPs or IP ranges, and they are all so obnoxious that I just default to blocking them regardless of whether they might be legitimate.

Co-existing on the internet in 2025

So, I want to give a quick summary of the steps listed in greater detail above, but I think it's worth breaking things down into separate recommendations for operators of bots and of websites.

What bot operators can do

Before you crawl even a single website, think long and hard about what your bot is supposed to do. What type of data will it collect? How do you intend to use that data? How will the subjects of that data be able to control which uses they'll permit if they agree with some but object to others? Perhaps they want to help provide good search results, but would prefer not to end up in the training set for an LLM. If some new and unforeseen technique for processing a previously innocuous type of data is invented, will you consider it fair to use it on the data you've already collected? If not, how will you assess whether people consent to its use? Write a concise but informative introduction for your bot which answers these questions, and make it clear why you should be trusted as a data processor.

Read (and re-read) RFC 9309 to ensure that you understand how robots.txt files are intended to be interpreted, and test your bot to ensure that it works as expected when it encounters different directives. Do the same for related protocols like Sitemaps, the HTML robots tag, RSS auto-discovery, and any others that might be relevant. When you're confident your bot behaves correctly, curate a number of different examples of robots.txt directives that site admins might reasonably want to apply. A relative beginner should be able to see how to block your bot entirely, allow it, slow it down, or restrict the paths or types of files that it will crawl.
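As a starting point, and using Wikipedia's placeholder name CoolBot rather than any real crawler, those examples might look something like the following (each stanza is a separate example, not one combined file):

# Block CoolBot from the entire site
User-agent: CoolBot
Disallow: /

# Let CoolBot in, but keep it out of /private/ and away from PDF files
User-agent: CoolBot
Disallow: /private/
Disallow: /*.pdf$
Allow: /

# Ask CoolBot to slow down (non-standard, as discussed earlier)
User-agent: CoolBot
Crawl-delay: 30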

Next, choose where you will host your bot's about-page and pick out a good User-Agent string. Your UA string should include a full link to that about page, and should include the word bot so that it can be easily distinguished from regular web clients. Don't include other bots' identifiers, even to say that you are compatible with them; doing so increases the likelihood that you'll be blocked by sites that validate crawlers' IPs using the DNS method I described. Choose an identifier that is short, unique, and informative.

Next, pick a root domain for all your bots, configure A and AAAA records for a dedicated subdomain per bot which resolve to its public IPs, and configure a PTR record for each IP which resolves back to the relevant subdomain. Test that each bot's IP address can be validated using DNS, as in the sketch below.
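A hypothetical zone setup, using reserved documentation addresses rather than anything real, might look like this in BIND-style notation:

; Forward records: one subdomain per crawler, under a domain you control
crawler-1.bots.example.org.    IN  A     203.0.113.10
crawler-1.bots.example.org.    IN  AAAA  2001:db8::10

; Reverse record for the IPv4 address, in the in-addr.arpa zone that your
; hosting provider or RIR delegates to you
10.113.0.203.in-addr.arpa.     IN  PTR   crawler-1.bots.example.org.

The important property is the round trip: the PTR record points at a name under your domain, and that name resolves back to the same address.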

Prepare a comprehensive list of all your bots' IP addresses. Format it as plaintext for admins that prefer that. Make another formatted as JSON for those that prefer that. Make another formatted as CIDR subnets for those that prefer that. Make it as easy as possible for others to identify legitimate traffic from your bots so that they can easily block malicious bots that might eventually begin spoofing your User-Agent.
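The exact structure matters less than keeping it machine-readable and up to date; a JSON list along these lines (the prefixes below are placeholders from the reserved documentation ranges) is easy for admins to consume automatically:

[
    { "ipv4Prefix": "203.0.113.0/28" },
    { "ipv6Prefix": "2001:db8::/48" }
]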

Set up an email through which people can report unwanted behaviour and abuse, preferably on your own infrastructure so that such reports don't end up accessible to Gmail or Outlook admins.

Once all this is ready, publish your bot's about page at the URL mentioned in its User-Agent. Include that amazing introduction you wrote. Clearly mention its UA string and the domain that can be used for DNS validation. Provide examples of how to do so using common tools (host on Unix systems, nslookup on Windows, web-based tools like mxtoolbox.com for people that aren't comfortable on the command line). Link to the IP lists, and prominently display your address ranges.

Once all that information is publicly available you'll be ready to start crawling. Some site admins use robots.txt to permit bots by default, and add exceptions for those they don't like. Others take the opposite approach, banning all bots by default and adding exceptions for the ones they trust. You'll need to wait for this latter group to notice your User-Agent string, access your bot's about page, and add an exception to their robots.txt file if they decide that you deserve their trust.

The good news is that if you've done all the steps above, you'll already appear more trustworthy than most of the other bots on the Internet.

What site operators can do

I'd like to try to keep this list of recommendations much shorter than those for bot operators, because the onus really should be on them to ensure their bots behave well. Still, there are a few different ways site operators can get involved, depending on their level of technical expertise.

As with my advice for bot operators, I'll start by suggesting that website admins give some serious consideration to the purpose of their site. Are you hosting it as part of your business, or just for yourself? Do you have an RSS feed so people can subscribe for updates, or is search the main way people discover your articles? Do you want strangers to be able to find it, or do you host things so you have a link to share with your peers? Will you lose anything if you show up in fewer search results?

If you are lucky enough that your income and sense of self-worth are independent of the web search ecosystem, then I'd suggest using robots.txt to block bots by default. There are far more bad bots than there are good ones, and in most cases they need you more than you will ever need them.

If you are able to do so, set up a script like mine which monitors for new User-Agent strings and emails a daily digest of new ones. I've learned about a few cool projects because of this, but it's also revealed some suspicious behaviour where there were suddenly far more agents than on a normal day. In cases where I do learn about some new bot, the first thing I do is check which URLs it tried to access. Those that adhere to the rules in the relevant robots.txt file generally get left alone, while those that ignore it may find themselves on the receiving end of some creative punishment.

When you learn about a new, seemingly benign crawler, see if they offer an automatable way to validate their bots' identity, as I described above. Some of the bots which I'd previously suspected of bad behaviour turned out to have been spoofed once I started digging into those methods. Still, at this point the only ones I permit are those that provide some method of validation. After all, if there's no way to tell the difference between someone's bot and a bad bot, then theirs might as well be a bad bot too. And if the only way to validate their bot is inconvenient for me, then it will be entirely their problem when I block them.

Conclusion

Aside from all these technical measures mentioned above, I think it's still important to do a quick vibe-check for the websites of different crawlers. Many of the evaluations stick to those technical details, like whether they obey robots.txt, but I haven't seen any comprehensive overviews summarizing all the other things which might matter, like whether a search engine actively collaborates with fascists.

I learned about IBOU.io through my logs, and while they don't advertise it on their bot's page it is possible to validate their IPs using DNS. Their bot behaves politely, and I came to their site with an open mind, but then I saw that they link to their account on X.com (formerly Twitter), and prominently display a notice prompting you to subscribe to their Substack account. Since both those platforms are owned and largely populated by nazis (or those that aren't particularly bothered by coexisting with them), I can only assume the worst about IBOU's politics and about how they might behave in the future.

Mojeek has an even better info page for their bot, but they gave an arguably worse impression by providing a dedicated page for searching Substack:

Screenshot of Mojeek's dedicated page for searching Substack. A text input field has the placeholder "Search for Substack articles"

I had hopes for the EU's OpenWebSearch project, but then I learned that they explicitly intend for the index they're building to be made available for training LLMs:

The Open Web Index (OWI) is a European open source web index pilot that is currently in Beta testing phase. The idea: Collaboratively and transparently secure safe, sovereign and open access to the internet for European organisations and civil society. The index stores well structured open web data, making it available for search applications and LLMs.

At the time that I'm writing this, the only bot allowed to crawl this website belongs to Marginalia. They provide information about their crawler, including a small IP address range (193.183.0.162-174) which can be used to validate crawlers with a marginalia UA. I'd rather they supported the DNS method, but their apparent values seem much closer to mine than those of any other search engine I've seen.

I'd love to see more good crawlers (and therefore good options for web search), not to replace Marginalia, but to complement them. If you're aware of any that satisfy my requirements then please do let me know. Alternatively, if you are curious about how one might measure up to my standards but lack the expertise to judge for yourself, feel free to mention them. My information is in this site's footer. 👇🏻