Block AI bots crawlers on Nginx
github.comI've been tinkering with a combined solution to block AI bots from crawling my website, it's a robots.txt file (I know, they don't care about it) and a Nginx directive (it works for Ngnix web servers, not if you are on Apache).
An additional thing one can add is to block any connection that is not HTTP/2.0. Many bots still only use HTTP/1.1 or lower. I just drop with 444 but others may wish to give a friendly terse error message using 403.
Here's another one that will block most non-browsers (except for headless chrome of course)if ($server_protocol != HTTP/2.0) { return 444; }
These should be tested extensively on a non revenue impacting site.if ($http_sec_fetch_mode !~ (cors|no-cors|navigate) ) { return 444; }Another bot-blocking method is to drop any TCP SYN packets with an MSS outside of a sensible range. Here is an example using netfilter on IPv4 in the "raw" table (to keep them out of the CPU impacting state table):
-A PREROUTING -i eth0 -p tcp -m tcp -d {your_wan_ip} --syn -m tcpmss ! --mss 1220:1460 -j DROP
mmm, blocking everything not http/2.0 is also blocking legit browser, while blocking http/1.0 does not block bots (at least not ChatGPT); blocking non-browsers with $http_sec_fetch_mode works as expected.
Do you have a proxy in front of your site that is changing the protocol version? I have been using that on a dozen sites for years without issue. What browser are you using? Do your access logs show a HTTP/2.0 request? If you have something like Caddy or HAProxy in front of NGinx that is changing the proto version then you can create a similar rule at that outer layer. Or perhaps NGinx in front of NGinx doing a proxy pass?
Using Chrome on a mac, access logs say http 1.1 is accessing the domain. Nothing in front of Ngnix, but I'm wondering if I have the http 2 module on Nginx...
It's probably compiled in but your config would look something like:
This is on nginx/1.26.2. Older versions looked a little different.server { listen 443 ssl backlog=1536 so_keepalive=58s:58s:5 deferred reuseport; http2 on; # [snip...]
Thanks. I'll give it a try with the != HTTP/2.0 method.
My bad, yes I didn't declare it:
listen[::]443 ssl http2;
thank you.
That's the older syntax. What version of NGinx are you on? Only curious because there have been a few bug fixes.
nginx -v1.23.1
Here [1] are the summarized release notes that describe the bugs fixed in the latest versions. If you update the http2 flag/syntax changes to what I listed.
[1] - https://nginx.org/news.htmlhttp2 on; # instead of being in the listen line.Updated, thanks again.