Generate spam for web scrapers


One or more text files are processed offline to construct a Markov chain, which is compiled into a binary image. The image is then loaded by the main daemon, spamgend(8), which will use it to generate pseudo-random HTML pages on demand.
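The offline construction step can be sketched as a word-level, order-1 Markov chain. This is a simplified Python sketch under stated assumptions: the article does not specify the chain's order or tokenization, and the real tool compiles the result into a binary image rather than keeping it in memory:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Build an order-1 word-level Markov chain: word -> list of successors.
    Repeated successors are kept, so the traversal follows source frequencies."""
    words = text.split()
    chain = defaultdict(list)
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, picking successors with the given RNG."""
    out = [start]
    word = start
    for _ in range(length - 1):
        successors = chain.get(word)
        if not successors:
            break  # dead end: the last word never appeared mid-text
        word = rng.choice(successors)
        out.append(word)
    return " ".join(out)
```

Because the generator is driven entirely by the chain and the RNG, feeding it a deterministic RNG (as described below) makes every page reproducible from its URI alone.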

When a page is requested from the server, a hash of the URI path is used to seed an xorshift DRNG. The resulting pseudo-random sequence drives the traversal of the Markov chain and generates the whole page content. Each generated page embeds a number of pseudo-random links to other pseudo-random pages, thus forming an Eternal Garbage Braid (EGB).
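The seeding scheme can be sketched as follows. The hash function (FNV-1a here) and the xorshift32 variant are assumptions, since the article does not name the exact ones spamgend(8) uses:

```python
def fnv1a(s):
    """FNV-1a 32-bit hash of a string (illustrative choice of hash)."""
    h = 0x811C9DC5
    for b in s.encode():
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def xorshift32(state):
    """Classic xorshift32 generator; the state must be nonzero."""
    while True:
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        yield state

def page_rng(uri_path):
    """Seed the DRNG from the URI path, so the same path always
    produces the same pseudo-random sequence (hence the same page)."""
    return xorshift32(fnv1a(uri_path) or 1)  # avoid the all-zero state
```

Determinism is the point: no page needs to be stored, yet revisiting a URI yields the identical page, so scrapers cannot detect the garbage by re-fetching.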

Bot monitoring

Real-world observations indicate that scrapers commonly spread their requests across many hosts, often ridiculously many, to evade detection while scraping without authorization.

spamgend(8) identifies individual actors by embedding a reasonably unique identifier within the generated page links. If the requested path lacks an identifier, a new one is created by hashing the IP address of the requesting peer. The same identifier being used by multiple hosts implies that all of them are taking part in the same scraping operation.
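The identifier logic can be sketched as below. The `id-` path marker and the SHA-256 truncation are hypothetical choices for illustration; the article only says the identifier is embedded in links and derived by hashing the peer's IP address:

```python
import hashlib

def scraper_id(uri_path, peer_ip):
    """Return the identifier embedded in the path, if any; otherwise
    derive a fresh one by hashing the requesting peer's IP address.
    The 'id-' marker and 12-hex-digit length are illustrative."""
    for part in uri_path.split("/"):
        if part.startswith("id-"):
            return part[len("id-"):]
    return hashlib.sha256(peer_ip.encode()).hexdigest()[:12]
```

Once a host follows a generated link, it carries its identifier with it, so every host replaying links that contain the same identifier can be attributed to one scraping operation regardless of its own IP address.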

A similar technique is used to track the depth of the scraping operation, as each page includes a depth value in its outbound links. This value is derived by incrementing the depth of the current page by one. If a page’s path does not include a depth value, it is implicitly assigned a depth of 0.
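The depth bookkeeping reduces to a parse-and-increment step. The `d<N>` path encoding below is a hypothetical example; only the increment-by-one rule and the implicit depth 0 come from the article:

```python
def page_depth(uri_path):
    """Parse the depth value from a path segment like 'd5'
    (encoding is illustrative); pages without one are depth 0."""
    for part in uri_path.split("/"):
        if part.startswith("d") and part[1:].isdigit():
            return int(part[1:])
    return 0

def outbound_depth(uri_path):
    """Depth embedded in the current page's outbound links."""
    return page_depth(uri_path) + 1
```

The depth value thus measures how far a bot has descended into the braid, which is the raw data behind the histogram mentioned below.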

spamgend(8) does not keep logs, but aggregated data is made available in the form of a histogram by the info endpoint (see below).

The recommended setup consists of forwarding requests from a reverse proxy to the spam endpoint. The advantages are:

  • Seamless integration with an existing website

  • Pages can be served via HTTPS (TLS is not implemented by spamgend(8))

  • Bot monitoring data becomes available in the reverse proxy access log, making it possible to take broader countermeasures against identified scrapers (e.g. ban them via firewall, or redirect their requests for regular pages to even more garbage!)

Note

Depending on the configuration, the reverse proxy might prefix each request with a partial path. See the spam_ep.uri_prefix configuration below.
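A setup along these lines might look like the following nginx fragment. The listen address, port, and location path are assumptions to be adapted to your own spamgend(8) configuration; note that behind a proxy the "requesting peer" is the proxy itself, so the client address is passed along in a header:

```nginx
# Illustrative fragment: forward the spam path to spamgend's spam endpoint.
# Address and port are assumptions; match them to your spamgend setup.
location /spam/ {
    proxy_pass http://127.0.0.1:8080;
    # Forward the real client address (whether spamgend(8) consults this
    # header depends on its configuration; see its documentation).
    proxy_set_header X-Forwarded-For $remote_addr;
}
```

With this layout the full path, including the `/spam/` prefix, reaches spamgend(8), which is the case the spam_ep.uri_prefix setting accounts for.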

The purpose of spamgend(8) is to mess with greedy AI bots that violate netiquette. It is therefore highly recommended to list the URI path leading to the spam endpoint in your robots.txt, so that legitimate scrapers are not poisoned:

User-agent: *
Disallow: /spam/