Quixotic

Table of Contents

  1. Overview
  2. Download
  3. Quixotic Installation and Configuration
  4. Linkmaze Installation and Configuration
  5. License

Overview

Quixotic is a program that feeds fake content to bots and robots.txt-ignoring LLM scrapers. It has no server-side dependencies and is ideal for static website operators. It works by way of a simple Markov chain text generator: it modifies your content with nonsense, replacing around 20% of the words by default. You can train the Markov generator on any text content, but training it on your website content is probably the easiest thing to do. Quixotic will also transpose some of the images on your site (around 40% by default), leaving the alt and caption content as-is (i.e., incorrectly describing the image being referenced).
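To make the idea concrete, here is a minimal Python sketch of Markov-style word replacement. It is not Quixotic's actual implementation (Quixotic is built with Cargo and its internals are not shown here); the function names `train` and `corrupt`, the first-order (one word of context) model, and the whitespace tokenization are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def train(text):
    """Build a first-order Markov model: word -> list of words seen right after it."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return dict(model)


def corrupt(text, model, percent=0.20, rng=None):
    """Replace roughly `percent` of the words with a word the model allows after the previous one."""
    rng = rng or random.Random()
    vocabulary = list(model)
    out = []
    for word in text.split():
        if model and rng.random() < percent:
            previous = out[-1] if out else None
            # Fall back to the whole vocabulary when the previous word has no known successor.
            candidates = model.get(previous) or vocabulary
            word = rng.choice(candidates)
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog and the quick cat"
    print(corrupt(sample, train(sample), percent=0.20, rng=random.Random(1)))
```

Because each replacement is drawn from words that really did follow the preceding word in the training text, the output tends to stay locally plausible while drifting away from what the page actually said.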

There is also a companion web server – linkmaze – which dynamically generates nonsense pages on the fly. These pages are 100% nonsense and will include several links to even more nonsense (which will in turn include links to more nonsense, in a never-ending cascade of BS.)

If you want to see what the generated content looks like, set your user agent to ‘QuixoticTest’ and reload this page, or take a look at the screenshot below.

Caption: Quixotic Altering This Page

Download Quixotic

Quixotic 0.12: quixotic-0.12.tar.gz

SHA-256: e136db1addc1f7967e69b07459cf4136ac7dfdd621b6bfb5e1cd334182a37aca

SHA-512: ec7204e05b9c40314858cdde68afd79ec1bc7ca25f9289a342dad6ae486053742de5e644bf11e65acda621ab4202b423a75c510af2a3e9647151bf2ab8a69f19

GPG Signature: quixotic-0.12.tar.gz.sig (Key)
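If you want to check the tarball against the SHA-256 value above before building, a small script along these lines would do it. This is only a sketch: the helper names `sha256_of` and `matches` are made up for this example, and it assumes you have already downloaded the file locally.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a published value, ignoring case and surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("quixotic-0.12.tar.gz", "e136db1addc1f7967e69b07459cf4136ac7dfdd621b6bfb5e1cd334182a37aca")` should return True for an intact download. A checksum only detects corruption; the GPG signature is what ties the file to the author's key.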

You can also just clone the Git repo.

Quixotic Installation and Configuration

Building Quixotic

After downloading Quixotic, build it with Cargo:

$ cargo build --release

Note that this can be done on whichever machine you build your web content on. It doesn’t have to go on your webserver.

Install Quixotic

$ cargo install --path . --bin quixotic

Running Quixotic

Once installed, generating your altered content is simple and can be added to your site deployment workflow.

$ quixotic --input <your web content> --output <directory for fake content> [--train <training content>] [--percent 0.20] [--scramble-images 0.40] [--embed-linkmaze] [--linkmaze-path /linkmaze/path]

The first two options, --input and --output, are mandatory; they control the content to modify and where to write the modified content, respectively. Input can be a single file or a directory. If it is a directory, it will be walked and any .txt or .html files will be read in. By default, the input content will also be used to train the Markov generator.

The options --train, --scramble-images, --percent, --embed-linkmaze, and --linkmaze-path are optional:

  • --train allows you to specify alternate content to use to train the Markov generator. Like --input, this can be a text/html file, or a directory. If it is a directory, any .txt or .html files will be read and used for training.
  • --scramble-images controls the fraction of images to substitute and defaults to 0.40 (40%). Quixotic will select images at random to be substituted with another random image from your input directory. Pass --scramble-images 0 to disable this completely.
  • --percent controls the fraction of content to modify and defaults to 0.20 (20%). So, to modify 50% of your website’s content with the Markov generator, you’d pass in --percent 0.50.
  • --embed-linkmaze controls whether or not html files will be modified to include links to the link maze.
  • --linkmaze-path: a URI prefix to reach your link maze. So, if you host the linkmaze server at https://example.com/linkmaze, you’d specify --linkmaze-path /linkmaze.
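The image substitution controlled by --scramble-images can be pictured with a short Python sketch. This illustrates the behavior described above rather than Quixotic's own code; the function name `scramble_images`, and the choice to never swap an image for itself, are assumptions.

```python
import random


def scramble_images(sources, fraction=0.40, rng=None):
    """Return a copy of `sources` where roughly `fraction` of entries point at a different image."""
    rng = rng or random.Random()
    result = []
    for src in sources:
        others = [candidate for candidate in sources if candidate != src]
        # Leave the image alone if there is nothing to swap it with.
        if others and rng.random() < fraction:
            src = rng.choice(others)
        result.append(src)
    return result


if __name__ == "__main__":
    print(scramble_images(["cat.jpg", "dog.jpg", "boat.jpg"], rng=random.Random(7)))
```

Only the image reference changes in this picture; the alt text and caption next to it are left alone, which is what makes the result misleading to a scraper.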

Directing bots to your modified content

For static Quixotic-generated content, I mostly use User-Agent maps to serve different content.

Apache 2.x mod_rewrite example

This Apache recipe relies on your fake content being in a subdirectory of your real content:

<VirtualHost example.com>
    <If "%{HTTP_USER_AGENT} =~ /curl|Go-http-client/">
      RewriteEngine on
      RewriteCond         "/path/to/real/content/fakecontent/%{REQUEST_URI}"  -f
      RewriteRule "^(.*)$" "/fakecontent/%{REQUEST_URI}"  [L]
    </If>
    DocumentRoot /path/to/real/content
</VirtualHost>

Nginx example

Nginx is easier to deal with, and the content can be in two completely different directories:

map $http_user_agent $root {
    "~*curl"                   "/path/to/fake/content";
    "~*Go-http-client"         "/path/to/fake/content";
    default                    "/path/to/real/content";
}

server {
  server_name example.com;
  root $root;
}

Bot User-Agent strings

You probably already know the bots that hit your server from your logs, but the ai.robots.txt project has a list of AI scrapers. Some of these, in my experience, respect robots.txt and some don’t, but I’ve seen all of the User-Agent strings in my logs over the past few years.

Of course, a bot can send any User-Agent string it likes, but from a practical standpoint, a lot of bots don’t bother to disguise themselves. Ultimately, you’ll have to decide what signals you want to use to redirect to the alternate content. Some other things you might want to look for:

  • seemingly legitimate User Agents that load HTML content, but not the style sheets referenced in your pages.
  • visitors from cloud compute IP ranges
  • requests for content that is completely foreign to what you serve (see below for an example of how to intercept requests for PHP pages if you don’t use PHP.)
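As one way to act on the first signal in the list above, the following Python sketch flags clients that fetch several HTML pages without ever requesting a stylesheet. It assumes you have already parsed your access log into (client, path) pairs; the function name `html_only_clients` and the `min_pages` threshold are invented for this example, and real browsers with warm caches can trip it, so treat the result as a hint rather than a verdict.

```python
from collections import defaultdict


def html_only_clients(requests, min_pages=3):
    """Return clients that fetched at least `min_pages` HTML pages but no stylesheet.

    `requests` is an iterable of (client, path) pairs, e.g. parsed from an access log.
    """
    pages = defaultdict(int)
    fetched_css = set()
    for client, path in requests:
        path = path.split("?", 1)[0]  # ignore any query string
        if path.endswith(".css"):
            fetched_css.add(client)
        elif path.endswith(".html") or path.endswith("/"):
            pages[client] += 1
    return sorted(c for c, n in pages.items() if n >= min_pages and c not in fetched_css)
```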

Linkmaze Installation and Configuration

Linkmaze operates as a webserver that dynamically generates nonsense pages. It will respond to any request of the form http://server/randomfilename.html with more nonsense, but it won’t respond to URIs with directory prefixes: for example, http://server/directory/randomfilename.html will return a 404. It is meant to have requests proxied to it by your main webserver, although in principle it could serve clients directly.
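The request handling described above can be sketched in a few lines of Python. This is a toy model, not Linkmaze's code: the `respond` and `random_name` functions, the fixed word list standing in for a trained Markov generator, and the five links per page are all assumptions, and the HTTP server wiring is left out.

```python
import random
import string

# Stand-in vocabulary; the real server is trained on your own content.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "windmill", "giant"]


def random_name(rng, length=8):
    """Make up a random page name such as 'qzkfhwpa.html'."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length)) + ".html"


def respond(path, link_prefix="", rng=None, links=5):
    """Return (status, body) for a request path; paths with a directory prefix get a 404."""
    rng = rng or random.Random()
    if not path.startswith("/") or "/" in path[1:]:
        return 404, "not found"
    text = " ".join(rng.choice(WORDS) for _ in range(50))
    anchors = " ".join(
        f'<a href="{link_prefix}/{random_name(rng)}">{rng.choice(WORDS)}</a>'
        for _ in range(links)
    )
    return 200, f"<html><body><p>{text}</p><p>{anchors}</p></body></html>"
```

Every generated link points at another made-up name under `link_prefix`, so a crawler that follows them never runs out of pages to fetch.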

The linkmaze binary and (if you’re using systemd) the systemd unit file need to be copied to your webserver. The unit file also needs to be edited with the path to the content you want to use to train the Markov generator (probably your real website content directory).

[Unit]
Description=Linkmaze

[Service]
# Change this path to point to your training content.
# (systemd does not support trailing comments on the same line as a setting.)
ExecStart=/usr/local/bin/linkmaze --train /PATH/TO/CONTENT
DynamicUser=yes
PrivateTmp=yes

[Install]
WantedBy=multi-user.target

There are a few other options you probably want to set as well in the unit file:

  • --linkpath to set the uri prefix of generated link maze links to match your webserver configuration. For example, if your webserver serves the linkmaze at https://example.com/linkmaze/*, you’d configure --linkpath /linkmaze
  • --listen-port to set the listen port for the server. It defaults to 3005.

Then copy the binary and the unit file into place and enable the service:

$ sudo cp target/release/linkmaze /usr/local/bin
$ sudo cp linkmaze.service /etc/systemd/system
$ sudo systemctl enable --now linkmaze

Webserver configuration – Nginx

I use proxy_pass to serve the link maze content:

server {
    server_name example.com;

    location ~* ^/linkmaze/ {
        proxy_set_header X-Real-IP $remote_addr;
        rewrite ^/linkmaze/(.*) /$1 break;
        proxy_pass http://127.0.0.1:3005;
    }
}

In addition to generating link maze entry links with Quixotic with the --embed-linkmaze option, I also redirect any requests to .php files – since I don’t serve any PHP content on my site – directly to the maze:

server {
    server_name example.com;
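    # nginx matches locations against the path only (no query string),
    # so \.php$ also catches requests like /foo.php?x=1.
    location ~ \.php$ {
        proxy_set_header X-Real-IP $remote_addr;
        # Linkmaze 404s on directory prefixes, so collapse e.g. /dir/page.php to /foo.php.
        rewrite ^/(.*)/ /foo.php break;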
        proxy_pass http://127.0.0.1:3005;
    }
}

License

Quixotic and Linkmaze are licensed under the MIT License.