Block AI bots crawlers on Nginx

github.com

1 points by kurren a year ago · 12 comments

kurrenOP a year ago

I've been tinkering with a combined solution to block AI bots from crawling my website: a robots.txt file (I know, they don't care about it) and an Nginx directive (it works for Nginx web servers, not if you are on Apache).

  • LinuxBender a year ago

    An additional thing one can add is to block any connection that is not HTTP/2.0. Many bots still only use HTTP/1.1 or lower. I just drop with 444 but others may wish to give a friendly terse error message using 403.

        if ($server_protocol != HTTP/2.0) { return 444; }
    
    Here's another one that will block most non-browsers (except for headless Chrome, of course):

        if ($http_sec_fetch_mode !~ (cors|no-cors|navigate) ) { return 444; }
    
    These should be tested extensively on a non-revenue-impacting site.
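
    One way to do that testing is to log what each rule would match before dropping anything. A minimal sketch, assuming a hypothetical host example.com and log path; the map variable and log format names are made up for illustration, and the regex is anchored here, unlike the one above:

        # http context
        map $http_sec_fetch_mode $browser_like {
            default                       0;
            ~^(cors|no-cors|navigate)$    1;
        }
        log_format botcheck '$remote_addr "$request" $server_protocol '
                            'browser_like=$browser_like "$http_user_agent"';

        server {
            listen 443 ssl;
            http2 on;                     # needs nginx 1.25.1 or newer
            server_name example.com;      # hypothetical
            # ssl_certificate / ssl_certificate_key omitted
            access_log /var/log/nginx/botcheck.log botcheck;   # hypothetical path
        }

    Requests logged with a protocol other than HTTP/2.0, or with browser_like=0, are the ones the two rules would drop.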

    Another bot-blocking method is to drop any TCP SYN packets with an MSS outside of a sensible range. Here is an example using netfilter on IPv4 in the "raw" table (to keep them out of the CPU impacting state table):

        -A PREROUTING -i eth0 -p tcp -m tcp -d {your_wan_ip} --syn -m tcpmss ! --mss 1220:1460 -j DROP

    • kurrenOP a year ago

      Hmm, blocking everything that is not HTTP/2.0 also blocks legit browsers, while blocking HTTP/1.0 does not block bots (at least not ChatGPT); blocking non-browsers with $http_sec_fetch_mode works as expected.

      • LinuxBender a year ago

        Do you have a proxy in front of your site that is changing the protocol version? I have been using that on a dozen sites for years without issue. What browser are you using? Do your access logs show an HTTP/2.0 request? If you have something like Caddy or HAProxy in front of Nginx that is changing the protocol version, then you can create a similar rule at that outer layer. Or perhaps Nginx in front of Nginx doing a proxy pass?

        • kurrenOP a year ago

          Using Chrome on a Mac; the access logs show HTTP/1.1 requests to the domain. Nothing in front of Nginx, but I'm wondering if I have the HTTP/2 module on Nginx...

          • LinuxBender a year ago

            It's probably compiled in but your config would look something like:

                server {
                    listen 443 ssl backlog=1536 so_keepalive=58s:58s:5 deferred reuseport;
                    http2 on;
                    # [snip...]
            
            This is on nginx/1.26.2. Older versions looked a little different.
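
            For reference, a sketch of the older form, assuming a release before 1.25.1 (where the separate http2 directive was introduced). There, HTTP/2 was enabled as a parameter of listen:

                server {
                    listen 443 ssl http2;    # HTTP/2 enabled on the listen line
                    # ssl_certificate / ssl_certificate_key and the rest omitted
                }
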
    • kurrenOP a year ago

      Thanks. I'll give it a try with the != HTTP/2.0 method.
