One or more text files are processed offline to construct a Markov chain, which
is compiled into a binary image.
The image is then loaded by the main daemon, spamgend(8), which uses it to
generate pseudo-random HTML pages on demand.
When a page is requested from the server, a hash of the URI path is used as the seed for an xorshift DRNG. The resulting pseudo-random sequence is used to traverse the Markov chain and generate the whole page content. Each generated page embeds a number of pseudo-random links to other pseudo-random pages, thus forming an Eternal Garbage Braid (EGB).
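The scheme above can be sketched as follows. The hash function (FNV-1a), the xorshift64 variant, and the toy chain contents are illustrative assumptions, not spamgend(8)'s actual implementation; the point is that the same path always regenerates the same page without any state being stored.

```python
# Sketch: hash the URI path into a seed, feed an xorshift DRNG,
# and use its output to walk a Markov chain.
# FNV-1a, xorshift64 and the chain below are assumed examples.

MASK64 = (1 << 64) - 1

def fnv1a(path: str) -> int:
    """Hash the URI path into a 64-bit seed."""
    h = 0xcbf29ce484222325
    for b in path.encode():
        h = ((h ^ b) * 0x100000001b3) & MASK64
    return h or 1  # xorshift state must be non-zero

def xorshift64(state: int):
    """Infinite stream of pseudo-random 64-bit values."""
    while True:
        state ^= (state << 13) & MASK64
        state ^= state >> 7
        state ^= (state << 17) & MASK64
        yield state

def generate(path: str, length: int = 8) -> str:
    """Deterministically generate page text for a given URI path."""
    chain = {  # toy Markov chain: word -> possible successors
        "the": ["spam", "garbage"],
        "spam": ["flows", "grows"],
        "garbage": ["flows", "grows"],
        "flows": ["the"],
        "grows": ["the"],
    }
    rng = xorshift64(fnv1a(path))
    word, words = "the", []
    for _ in range(length):
        words.append(word)
        succ = chain[word]
        word = succ[next(rng) % len(succ)]
    return " ".join(words)
```

Because the seed is derived solely from the path, no generated page needs to be stored: requesting the same URI twice walks the chain identically and yields the same content.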
Bot monitoring
Real-world observations show that scrapers commonly spread their requests across numerous hosts, often a ridiculous number of them, to evade detection while scraping without authorization.
spamgend(8) identifies individual actors by embedding a reasonably unique
identifier within the generated page links.
If the requested path lacks an identifier, a new one is created by hashing the
IP address of the requesting peer.
The same identifier being used by multiple hosts implies that all of them are
taking part in the same scraping operation.
A similar technique is used to track the depth of the scraping operation, as each page includes a depth value in its outbound links. This value is derived by incrementing the depth of the current page by one. If a page’s path does not include a depth value, it is implicitly assigned a depth of 0.
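The identifier and depth mechanisms can be sketched together. The path layout (/spam/&lt;id&gt;/&lt;depth&gt;/...) and the use of SHA-256 over the peer IP are assumptions made for illustration; spamgend(8)'s actual encoding may differ.

```python
# Sketch of the link-tagging scheme: links carry an actor identifier
# and a depth counter. Path layout and hash choice are assumed examples.
import hashlib

def peer_identifier(ip: str) -> str:
    """Derive a reasonably unique identifier from the peer's IP address."""
    return hashlib.sha256(ip.encode()).hexdigest()[:12]

def parse_path(path: str):
    """Extract (identifier, depth) from a request path, if present."""
    parts = [p for p in path.split("/") if p]
    if len(parts) >= 3 and parts[0] == "spam":
        return parts[1], int(parts[2])
    return None, 0  # no identifier yet; depth is implicitly 0

def outbound_link(path: str, peer_ip: str, slug: str) -> str:
    """Build an outbound link that propagates identifier and depth."""
    ident, depth = parse_path(path)
    if ident is None:
        ident = peer_identifier(peer_ip)  # fresh client: tag it by IP hash
    return f"/spam/{ident}/{depth + 1}/{slug}"
```

A fresh client gets links stamped with a hash of its own IP; any other host that later follows those links reuses the same identifier, exposing the shared scraping operation, while the depth field grows by one per hop.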
spamgend(8) does not keep logs, but aggregated data is made
available in the form of a histogram by the info endpoint (see below).
Recommended setup
The recommended setup consists of forwarding requests from a reverse proxy to the spam endpoint. The advantages are:
- Seamless integration with an existing website
- Pages can be served via HTTPS (TLS is not implemented by spamgend(8))
- Bot monitoring data becomes available in the reverse proxy access log, making it possible to take broader countermeasures against identified scrapers (e.g. ban them via firewall, or redirect their requests for regular pages to even more garbage!)
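A minimal setup along these lines could look like the following nginx fragment. The listen address (127.0.0.1:8080) and the /spam/ location are assumed example values to be adapted to the actual spamgend(8) listen address and spam endpoint configuration.

```nginx
# Forward everything under /spam/ to the spamgend(8) spam endpoint.
# 127.0.0.1:8080 is an assumed example address.
location /spam/ {
    proxy_pass http://127.0.0.1:8080;
    # Pass the real peer address along, so per-client identification
    # is not defeated by the proxy hop.
    proxy_set_header X-Forwarded-For $remote_addr;
}
```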
Note: Depending on the configuration, the reverse proxy might prefix each
request with a partial path. See the spam_ep.uri_prefix configuration below.
The purpose of spamgend(8) is to mess with greedy AI bots that violate
netiquette.
It is therefore highly recommended to list the URI path leading
to the spam endpoint in your robots.txt, so that legitimate scrapers
are not poisoned.
User-agent: *
Disallow: /spam/