Here's the code for my infinite nonsense crawler trap:
- process.py: Text preprocessor
- babble.c: Garbage server
What follows is an explanation of how to set it up.
Training the Markov chain:
First, you'll want to find three long-ish sources of text, between 1,000 and 50,000 words each. I used old ebooks from Project Gutenberg, but long blog posts or Wikipedia articles will also work.
Save each of the three texts in its own file.
Remove any page numbers, headings, copyright notices and similar.
If your texts are somewhat short (less than 3,000 words), change the context_length variable in process.py to 1. This results in less repetitive text, but for long bodies of text, setting it to 2 produces more coherent results.
With the text prepared, run the python script:
$ python3 process.py
You should now have three files, chain1.txt, chain2.txt, and chain3.txt, full of fragmented text.
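The actual process.py isn't shown here, but the preprocessing it performs can be sketched roughly like this (the function name and exact normalization rules are my assumptions, based on the description later in this post):

```python
import re

def make_chain(text, context_length=1):
    """Break text into overlapping (context, next word) pairs."""
    # Normalize: lower-case everything and treat periods as words.
    words = re.sub(r"\.", " . ", text.lower()).split()
    # Pad with periods so the generator never hits a dead end.
    words = ["."] * context_length + words + ["."] * context_length
    chain = {}
    for i in range(len(words) - context_length):
        context = tuple(words[i:i + context_length])
        chain.setdefault(context, []).append(words[i + context_length])
    return chain
```

With context_length = 1 the chain maps each word to everything that ever followed it; with 2 it maps word pairs to their successors, which is why longer sources produce more coherent output.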
The server:
Configure your web server to forward requests under an unused path to the babble server.
Here's how to do that in NGINX:
# Proxy traffic to the Markov babbler
# Place inside a server { } block
location /babble/ {
    proxy_pass http://localhost:1414/;
}
Next, open the server code and change PORT to the chosen port number and URL_PREFIX to the path that your web server is forwarding:
...
#define PORT 1414
...
// Must begin and end with "/"s
const char *URL_PREFIX = "/babble/";
...
With the configuration done, compile the garbage server:
$ gcc -O3 -o babble babble.c -lm
... and run it in the same directory as the chain files:
$ ./babble
[*] Loading files
Loading chain1.txt...
Loading chain2.txt...
Loading chain3.txt...
[*] Creating socket
[*] Serving garbage!
It may take a few seconds to parse the files if they are particularly big. Once finished, you should be able to visit a URL under the configured path and get garbage:
https://your.domain.example.com/babble/test
If you see a gateway error (a 502, for example), make sure you set the right path and port number in your web server's configuration.
If the page loads but the links are broken, check URL_PREFIX and recompile the code.
If everything works, you'll want the program to run in the background. With systemd on Linux, create a file called:
/etc/systemd/system/babbler.service
... with the following content: (don't forget to change the file paths)
[Unit]
Description=Markov chain based crawler trap
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=nobody
WorkingDirectory=/path/to/markov/chains/
ExecStart=/path/to/binary/babble

[Install]
WantedBy=multi-user.target
(Make sure the user can access the markov chain files)
... and run these commands as root:
$ systemctl daemon-reload
$ systemctl start babbler
$ systemctl enable babbler
If you run some other init system, you know enough to figure this out yourself.
Catching wild bots:
You don't really need any bot detection: on my site, just linking to the garbage from the main pages worked fine.
Because each page links to five more garbage pages, the crawler's queue will quickly fill up with an exponential amount of garbage until it has no time left to crawl your real site.
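The arithmetic behind that blow-up is simple: with five links per page, the number of distinct garbage URLs reachable at a given depth is five raised to that depth.

```python
# Five links per garbage page means 5**depth URLs at each crawl depth.
links_per_page = 5
frontier = [links_per_page ** depth for depth in range(1, 11)]
# At depth 10 alone there are already 9,765,625 distinct URLs,
# vastly outnumbering any real site's pages.
```

A crawler that treats every discovered link as a fresh page will spend essentially all of its budget inside the trap.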
If you don't want your site to disappear from search results, exclude the garbage URLs in robots.txt:
User-agent: Googlebot
User-agent: Bingbot
User-agent: Kagibot
Disallow: /babble/*
If you want to prevent bots from ever seeing any of your content, ai.robots.txt has nice lists of User-Agents. Just change the rules to rewrite or redirect instead of blocking.
Here's what that looks like in NGINX:
# Send scrapers garbage
# Regex from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/nginx-block-ai-bots.conf
if ($http_user_agent ~* "(AddSearchBot|AI2Bot|Ai2Bot\-Dolma|aiHitBot|Amazonbot|Andibot|anthropic\-ai|Applebot|Applebot\-Extended|Awario|bedrockbot|bigsur\.ai|Brightbot\ 1\.0|Bytespider|CCBot|ChatGPT\ Agent|ChatGPT\-User|Claude\-SearchBot|Claude\-User|Claude\-Web|ClaudeBot|CloudVertexBot|cohere\-ai|cohere\-training\-data\-crawler|Cotoyogi|Crawlspace|Datenbank\ Crawler|Devin|Diffbot|DuckAssistBot|Echobot\ Bot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Gemini\-Deep\-Research|Google\-CloudVertexBot|Google\-Extended|GoogleAgent\-Mariner|GoogleOther|GoogleOther\-Image|GoogleOther\-Video|GPTBot|iaskspider/2\.0|ICC\-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo\ Bot|LinerBot|meta\-externalagent|Meta\-ExternalAgent|meta\-externalfetcher|Meta\-ExternalFetcher|MistralAI\-User|MistralAI\-User/1\.0|MyCentralAIScraperBot|netEstate\ Imprint\ Crawler|NovaAct|OAI\-SearchBot|omgili|omgilibot|Operator|PanguBot|Panscient|panscient\.com|Perplexity\-User|PerplexityBot|PetalBot|PhindBot|Poseidon\ Research\ Crawler|QualifiedBot|QuillBot|quillbot\.com|SBIntuitionsBot|Scrapy|SemrushBot\-OCOB|SemrushBot\-SWA|Sidetrade\ indexer\ bot|Thinkbot|TikTokSpider|Timpibot|VelenPublicWebCrawler|WARDBot|Webzio\-Extended|wpbot|YaK|YandexAdditional|YandexAdditionalBot|YouBot)") {
    rewrite ^(.*)$ /babble/$1;
}

# Also send garbage to Firefox's AI summaries
if ($http_x_firefox_ai ~* "1") {
    rewrite ^(.*)$ /babble/$1;
}
Some bots do spoof user agents, but they share URLs with ones that don't: once the generated links get into the crawling queue, they will be served garbage no matter how well they hide.
User agent filtering also ensures that AI chatbots and summarizers only ever get garbage... after all, the easiest thing to turn into garbage is garbage: we're only helping!
The cost:
On my machine, generating a page takes on the order of 50 microseconds. Unless you have a very slow server, CPU load should be minimal: NGINX spends more time handling encryption than it takes to generate the garbage.
Just be careful if you're behind any cloud service that bills by request: my server hovers around 20 requests/second. That's not a problem for my VPS, but it adds up to roughly 50 million requests per month.
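As a quick sanity check on that figure (assuming a 30-day month):

```python
requests_per_second = 20
seconds_per_month = 60 * 60 * 24 * 30   # 30-day month
monthly_requests = requests_per_second * seconds_per_month
# 51,840,000 — in line with the ~50 million per month above.
```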
How it works:
The Python script breaks the source material into overlapping word pairs (if context_length is 1) or triplets (if context_length is 2). The server can then string the chunks together to produce plausible, if nonsensical, text:
"A glass is not impossible to make the file and so deepen the original cut. Now heat a small spot on the glass, and a candle flame to a clear singing note."

(context_length = 2; the source material is a book on glassblowing.)
Capitalization is normalized to lower case, periods are treated as words in their own right, and commas are kept as part of the preceding word.
To prevent the generator from running into dead ends, the source material is padded with periods.
When multiple words are possible, one is chosen pseudo randomly based on the requested path. This makes the server appear to have trillions of static files.
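babble.c's actual hashing scheme isn't shown here, but the idea can be sketched in a few lines: derive a deterministic seed from the requested path, so the same URL always renders the same page (the CRC32 choice below is my assumption, not necessarily what the server uses).

```python
import zlib

def pick(options, path, position):
    """Deterministically choose an option based on the requested path.

    The same path and position always give the same word, so every
    URL looks like a stable, static page even though nothing is
    stored on disk.
    """
    seed = zlib.crc32(f"{path}:{position}".encode())
    return options[seed % len(options)]
```

Repeated requests for /babble/test therefore produce identical text, while any other path hashes to a different, equally stable page.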
Each line in the files stores the word pairs beginning with a particular word.
For example, the pairs "and the", "and a", "and then", "and therefore" are packed together like this:
... and the a then therefore ...
Once the generator picks a word, it looks up that word's line and repeats the process.
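A minimal sketch of that lookup loop, with Python standing in for the C implementation (the parsing and generation details are my assumptions):

```python
def load_table(lines):
    """Parse packed lines: the first word is the key, the rest follow it."""
    table = {}
    for line in lines:
        key, *followers = line.split()
        table[key] = followers
    return table

def generate(table, start, length, choose):
    """Walk the chain: emit a word, look up its line, repeat."""
    word, out = start, []
    for position in range(length):
        out.append(word)
        word = choose(table[word], position)
    return " ".join(out)
```

Here choose stands for the path-seeded pseudo-random pick described above.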
When context_length is set to two, instead of storing single words, the files store hyphenated word pairs, in reversed order:
... difficulty-the experienced-difficulty ...
... is "".
The generation process is identical except that the output is truncated at the hyphen.
The Python script places more common groupings earlier in each line, where they are picked more often by the pseudo-random number generator. This keeps the statistics of the generated text close to those of the source material.
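One way to exploit that ordering is an index biased toward zero, so the earlier, more common entries are chosen more often (the squared-uniform skew below is my assumption; the real generator's distribution may differ):

```python
import random

def skewed_pick(followers, rng):
    # Squaring a uniform draw biases the index toward 0, so entries
    # placed earlier in the line are chosen more often.
    index = int(rng.random() ** 2 * len(followers))
    return followers[index]

rng = random.Random(1414)
picks = [skewed_pick(["the", "a", "then", "therefore"], rng)
         for _ in range(10_000)]
# "the", stored first, dominates the sample.
```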