Trap bots on your server (Maurycy's blog)


Here's the code for my infinite nonsense crawler trap:

What follows is an explanation of how to set it up.

Training the Markov chain:

First, you'll want to find three longish sources of text, each between 1,000 and 50,000 words. I used old ebooks from Project Gutenberg, but long blog posts or Wikipedia articles will also work.

Save the text in files named book1.txt, book2.txt and book3.txt. Remove any page numbers, headings, copyright notices and similar.

If your texts are on the short side (less than 3,000 words), change the context_length variable in process.py to 1. This produces less repetitive text; for longer bodies of text, setting it to 2 gives more coherent results.

With the text prepared, run the Python script:

$ python3 process.py

You should now have 3 files named chainX.txt full of fragmented text.

The server:

Configure your web server to forward requests under an unused path to localhost:[some unused port]. Here's how to do that in NGINX:

# Proxy traffic to the Markov babbler
# Place inside a server { } block
location /babble/ {
	proxy_pass http://localhost:1414/;
}

Next, open the server code and change PORT to the chosen port number and URL_PREFIX to the path that your server is forwarding:

...
#define PORT 1414
...
// Must begin and end with "/"s
const char* URL_PREFIX = "/babble/";
...

With the configuration done, compile the garbage server:

$ gcc -O3 -o babble babble.c -lm

... and run it in the same directory as the chainX.txt files:

$ ./babble
[*] Loading files
    Loading chain1.txt...
    Loading chain2.txt...
    Loading chain3.txt...
[*] Creating socket
[*] Serving garbage!

It may take a few seconds to parse the files if they are particularly big. Once finished, you should be able to visit a URL under the configured path and get garbage:

https://your.domain.example.com/babble/test
Make sure to check the links

If you see a 5XX (500, 502...) error, make sure you set the right path and port number in your web server's configuration. If the page loads but the links are broken, check the URL_PREFIX and recompile the code.

If everything works, you'll want the program to run in the background. With systemd on Linux, create a file called:

/etc/systemd/system/babbler.service

... with the following content: (don't forget to change the file paths)

[Unit]
Description=Markov chain based crawler trap
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=nobody
WorkingDirectory=/path/to/markov/chains/
ExecStart=/path/to/binary/babble

[Install]
WantedBy=multi-user.target

(Make sure the user can access the Markov chain files.)

... and run these commands as root:

$ systemctl daemon-reload
$ systemctl start babbler
$ systemctl enable babbler

If you run some other init system, you know enough to figure this out yourself.

Catching wild bots:

You don't really need any bot detection: on my site, simply linking to the garbage from the main pages worked fine.

Because each page links to five more garbage pages, the crawler's queue will quickly fill up with an exponential amount of garbage until it has no time left to crawl your real site.
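The arithmetic behind that blow-up is easy to check. A quick sketch, using the five-links-per-page figure from above:

```python
# Each garbage page links to five more, so a breadth-first crawl of the
# trap grows five-fold with every level of link depth.
pages_at_depth = [5 ** depth for depth in range(1, 11)]

print(pages_at_depth[-1])   # pages at depth 10: 9765625
print(sum(pages_at_depth))  # pages within 10 clicks of the entry: 12207030
```

Ten links deep, the crawler already faces more garbage pages than most sites have real ones.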

If you don't want your site to disappear from search results, exclude the garbage URLs in /robots.txt:

User-agent: Googlebot
User-agent: Bingbot
User-agent: Kagibot
Disallow: /babble/

If you want to prevent bots from ever seeing any of your content, ai.robots.txt has handy lists of user agents. Just change its rules to rewrite or redirect instead of blocking.

Here's what that looks like in NGINX:

# Send scrapers garbage
# Regex from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/nginx-block-ai-bots.conf
if ($http_user_agent ~* "(AddSearchBot|AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|Andibot|anthropic\-ai|Applebot|Applebot\-Extended|Awario|bedrockbot|bigsur\.ai|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\ Agent|ChatGPT\-User|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|CloudVertexBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Datenbank\ Crawler|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-CloudVertexBot|Google\-Extended|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|LinerBot|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|netEstate\ Imprint\ Crawler|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Thinkbot|TikTokSpider|Timpibot|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|wpbot|YaK|YandexAdditional|YandexAdditionalBot|YouBot)") {
	rewrite ^(.*)$ /babble/$1;
}
# Also send garbage to Firefox's AI summaries
if ($http_x_firefox_ai ~* "1") {
	rewrite ^(.*)$ /babble/$1;
}

Some bots do spoof user agents, but they share URLs with ones that don't: once the generated links get into the crawling queue, they will be served garbage no matter how well they hide.

User agent filtering also ensures that AI chatbots and summarizers only ever get garbage... after all, the easiest thing to turn into garbage is garbage: we're only helping!

The cost:

On my machine, generating a page takes on the order of 50 microseconds. Unless you have a very slow server, CPU load should be minimal: NGINX spends more time handling encryption than it takes to generate the garbage.

Just be careful if you're behind any cloud service that bills by request: my server is hovering around 20 requests/second. That's no problem for my VPS, but it adds up to roughly 50 million requests per month.

How it works:

The Python script breaks the source material into overlapping word pairs (if context_length is 1) or triplets (if context_length is 2). The server then strings these chunks together to produce plausible, if nonsensical, text:
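The chunking step can be sketched in a few lines of Python (illustrative only; the real process.py may differ in its details):

```python
def chunks(words, context_length=1):
    """Break a token list into overlapping groups of context_length + 1 words."""
    size = context_length + 1
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

tokens = "now heat a small spot on the glass".split()
print(chunks(tokens, 1)[:3])  # [('now', 'heat'), ('heat', 'a'), ('a', 'small')]
print(chunks(tokens, 2)[:2])  # [('now', 'heat', 'a'), ('heat', 'a', 'small')]
```

Because consecutive chunks share words, any chunk can be followed by another one that starts where it ended.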

A glass is not impossible to make the file and so deepen the original cut.
Now heat a small spot on the glass, and a candle flame to a clear singing note.
context_length = 2. The source material is a book on glassblowing.

Capitalization is normalized to lower case, periods are treated as a word of their own (END), and commas are kept as part of the preceding word. To prevent the generator from running into dead ends, the source material is padded with periods.

When multiple words are possible, one is chosen pseudo randomly based on the requested path. This makes the server appear to have trillions of static files.
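One way to get that path-dependent determinism is to hash the path together with the word's position. A Python sketch; babble.c presumably uses its own (faster) hash, so the function and scheme here are illustrative:

```python
import hashlib

def pick(options, path, position):
    """Pick a follower word deterministically from the request path and position."""
    digest = hashlib.sha256(f"{path}:{position}".encode()).digest()
    return options[int.from_bytes(digest[:4], "big") % len(options)]

words = ["the", "a", "then", "therefore"]
first = pick(words, "/babble/test", 0)
# Repeated requests for the same path pick the same word every time,
# so the server appears to host an enormous set of static files.
assert pick(words, "/babble/test", 0) == first
```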

Each line in the chainX.txt files stores the word pairs beginning with a particular word. For example, the pairs "and the", "and a", "and then", "and therefore" are packed together like this:

...
and the a then therefore
...
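That packing step could look something like this in Python (a sketch; the helper name is mine, not from process.py):

```python
from collections import defaultdict

def pack_pairs(pairs):
    """Group pairs by their first word into 'first follower follower ...' lines."""
    followers = defaultdict(list)
    for first, second in pairs:
        if second not in followers[first]:
            followers[first].append(second)
    return ["{} {}".format(word, " ".join(f)) for word, f in followers.items()]

pairs = [("and", "the"), ("and", "a"), ("and", "then"), ("and", "therefore")]
print(pack_pairs(pairs))  # ['and the a then therefore']
```

Storing all followers of a word on one line means the generator only ever needs a single lookup per step.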

Once the generator picks a word, it looks up that word's line and repeats the process.

When context_length is set to 2, the files store hyphenated word pairs instead of single words, in reversed order:

...
difficulty-the experienced-difficulty
...

... is "the difficulty experienced". The generation process is identical except that the output is truncated at the hyphen.
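A small Python sketch of how such a line decodes (illustrative; the real parser lives in babble.c):

```python
def parse_line(line):
    """Decode a context_length = 2 chain line into its context and next words."""
    key, *entries = line.split()
    # Keys are reversed, hyphenated pairs: "difficulty-the" encodes "the difficulty".
    context = tuple(reversed(key.split("-")))
    # Output is truncated at the hyphen, so only the first half is ever emitted.
    next_words = [entry.split("-")[0] for entry in entries]
    return context, next_words

print(parse_line("difficulty-the experienced-difficulty"))
# (('the', 'difficulty'), ['experienced'])
```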

The Python script places more common groupings earlier in each line, and the pseudo-random number generator favors earlier entries. This keeps the word statistics of the generated text close to those of the source material.
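The frequency ordering could be sketched like this (a hypothetical helper, not the actual process.py code):

```python
from collections import Counter

def ordered_followers(pairs, word):
    """List the followers of `word`, most common first."""
    counts = Counter(second for first, second in pairs if first == word)
    return [follower for follower, _ in counts.most_common()]

pairs = [("and", "the"), ("and", "the"), ("and", "a"), ("and", "then")]
print(ordered_followers(pairs, "and"))  # ['the', 'a', 'then']
```

With "the" appearing twice after "and", it lands first on the line, where the biased generator picks it most often.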