Using lots of little tools to aggressively reject the bots

lambdacreate.com

211 points by archargelod 3 days ago


miladyincontrol - 2 days ago

A lot of less scrupulous crawlers just seem to imitate the big ones. I feel like a lot of people make assumptions based on the user agent, because the user agent has to be true, right?

My fave method is still just to have bait info in robots.txt that gzip bombs and autoblocks all further requests from them. Was real easy to configure in Caddy and tends to catch the worst offenders.
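
A rough nginx sketch of the same idea, since nginx is the stack the article itself uses (the bait path, file locations, and log name here are all made up, and the "autoblock further requests" half would be a fail2ban jail watching the bait log):

  # robots.txt carries a "Disallow: /private-backup/" entry that no human ever follows.
  # Anything requesting it anyway gets a pre-built gzip bomb:
  #   dd if=/dev/zero bs=1M count=10240 | gzip -9 > /srv/bombs/bomb.gz
  location /private-backup/ {
      types { }                            # ignore the .gz extension
      default_type text/html;              # claim it's an ordinary page
      add_header Content-Encoding gzip;    # so a naive crawler inflates ~10 GB of zeros
      root /srv/bombs;
      try_files /bomb.gz =404;             # every bait URL serves the same bomb file
      access_log /var/log/nginx/bait.log;  # fail2ban can ban on the first hit here
  }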

Not excusing the bot behaviours but if a few bots blindly take down your site, then an intentionally malicious offender would have a field day.

ThomW - 2 days ago

It ticks me off that bots no longer respect robots.txt files at all. The authors of these things are complete assholes. If you’re one of them, gfy.

immibis - 2 days ago

I consider the disk space issue a bug in Gitea. When someone downloads a zip, it should be able to stream the zip to the client, but instead it builds the zip in temporary space, serves it to the client, and doesn't delete it.

I solved it by marking that directory read-only. Zip downloads, obviously, won't work. If someone really wants one, they can check out the repository and make it themselves.

If I really cared, of course I'd fix the bug or make sure there's a way to disable the feature properly or only enable it for logged-in users.

Also, I server-side redirect certain user agents to https://git.immibis.com/gptblock.html . This isn't because they waste resources any more, but just because I don't like them, because what they're doing is worthless anyway, and because I can. If they really want the data in the Git repository, they can clone it instead of scraping it in a stupid way. That was always allowed.
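
In nginx terms, that kind of user-agent redirect is just a map plus a return; a minimal sketch, with an illustrative bot list rather than immibis's actual one:

  # http{} level: classify requests by User-Agent substring
  map $http_user_agent $llm_bot {
      default                                  0;
      "~*(GPTBot|ClaudeBot|CCBot|Bytespider)"  1;
  }

  # server{} level: bounce matching clients to the static explanation page
  if ($llm_bot) {
      return 302 https://git.immibis.com/gptblock.html;
  }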

8 requests per second isn't that much unless each request triggers intensive processing, and indeed it wasn't a load on my little VPS, apart from the disk space bug. I blocked them because they're stupid, not because they're a DoS.

atomman747 - 2 days ago

I get that the author “doesn’t care anymore”, but I saw Google, ripe.net, and semrush in the banned IPs.

Of these, I certainly wouldn’t ban Google, and probably not the others, if I wanted people to find the site and talk about it.

Even if your content were being scraped for some rando’s AI bot, why have a public site if you don’t expect it to be used?

Turning the lights off on the motel sign when you want people to find it is not a good way to invite people in.

vachina - 2 days ago

I’ve turned off logging on my servers precisely because the logs were growing too quickly due to these bots. They’re that relentless: they’ll fill out every form and even hit APIs that are otherwise reachable only by clicking around the site. Anthropic, OpenAI and Facebook are still scraping to this day.

Proofread0592 - 2 days ago

It is nice that the AI crawler bots honestly fill out the `User-Agent` header, but I'm shocked that they were the source of that much traffic. 99% of all websites do not change often enough to warrant this much traffic, let alone a dev blog.

reconnecting - 2 days ago

Creator of tirreno [1] here.

While our platform is primarily designed for live, logged-in users, it also works well for bot detection and blocking.

We anonymize IP addresses by replacing the last octet with an asterisk, effectively grouping the same subnet under a single account. You can then use the built-in rule engine to automatically generate blacklists based on specific conditions, such as excessive 500 or 404 errors, brute-force login attempts, or traffic from data center IPs.

Finally, you can integrate the tirreno blacklist API into your application logic to redirect unwanted traffic to an error page.

Bonus: a dashboard [2] is available to help you monitor activity and fine-tune the blacklist to avoid blocking legitimate users.

[1] https://github.com/tirrenotechnologies/tirreno

[2] https://play.tirreno.com/login (admin/tirreno)

nickpsecurity - 2 days ago

My sites get 10-20k requests a day, mostly from AI scrapers. One thing I noticed is that many look for specific PHP pages. If you don't use PHP, you might be able to autoblock any IP requesting PHP pages. If you do have PHP, block those requesting pages you don't have.
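
With the article's nginx + fail2ban stack, that could look something like this (filter name, log path, and ban time are placeholders):

  # /etc/fail2ban/filter.d/nginx-php-probe.conf
  [Definition]
  failregex = ^<HOST> -.*"(GET|POST|HEAD) [^"]*\.php

  # /etc/fail2ban/jail.local
  [nginx-php-probe]
  enabled  = true
  port     = http,https
  filter   = nginx-php-probe
  logpath  = /var/log/nginx/access.log
  maxretry = 1
  bantime  = 86400

On the nginx side, a `location ~* \.php$ { return 444; }` block makes sure those probes never reach anything real.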

Some of us are happy to train AIs but want to block overload. For instance, I'm glad they're scraping pages about the Gospel and Biblical theology. It might help to put anything large that you don't want scraped into specific directories. Then, upon detecting a bot, block its IP from accessing those.
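
One hypothetical way to wire that up in nginx (the path and denylist file are invented for illustration):

  # heavyweight content lives under /big/; whatever does the bot detection
  # appends "deny <ip>;" lines to the include file
  location /big/ {
      include /etc/nginx/conf.d/bot-denylist.conf;   # e.g. "deny 203.0.113.7;"
      try_files $uri $uri/ =404;
  }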

In my case, I also have a baseline strategy for dealing with a large number of requests: text-only, HTML/CSS presentation, other stuff hosted externally, and BunnyCDN with Perma-Cache ($10/mo + a penny per GB). The BunnyCDN requests go to $5/mo VMs on DigitalOcean. I didn't even notice the AI scrapers at first since (a) they didn't affect performance and (b) a month of them changed my balance from $30 to $29.99.

(Note to DO and Bunny team members that may be here: Thanks for your excellent services.)

fluidcruft - 2 days ago

I've been wondering if there's anything like "page knocking", where you open a series of pages in a certain order and that grants access?

For example, maybe everything 404s unless you start with a specific unlisted URL.
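
The simplest form of that (a single "knock" rather than a sequence) is easy to sketch in nginx: one unlisted URL sets a cookie, and everything else 404s without it. A real multi-page sequence would need server-side state or a chain of cookies. Names below are made up:

  # the unlisted entrance: set the cookie, then bounce to the front page
  location = /only-humans-know-this-url {
      add_header Set-Cookie "knock=letmein; Path=/; Max-Age=86400";
      return 302 /;
  }

  # everything else pretends not to exist unless the cookie is present
  location / {
      if ($cookie_knock != "letmein") {
          return 404;
      }
      try_files $uri $uri/ =404;
  }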

xelxebar - 2 days ago

My little server just hums along, and I haven't checked fail2ban status recently.

    sudo fail2ban-client status | sed -n '/Jail list:/{s/^.*://; s/,//g; p}' | xargs -n1 sudo fail2ban-client status | awk '/jail:/{i=$5}; /Total failed:/{jail[i]=$5}; END{for(i in jail) printf("%s: %s\n", i, jail[i])}' | column -t
    sshd-ddos:        0
    postfix:          583
    dovecot:          9690
    postfix-sasl:     4227
    nginx-botsearch:  1421
    nginx-http-auth:  0
    postfix-botnet:   5425
    sshd:             202157
Yikes! 220,000 banned IPs...

AStonesThrow - 2 days ago

Chris Siebenmann has shared many thoughts about crawlers, aka spiders.

https://utcc.utoronto.ca/~cks/space/?search=spider

https://utcc.utoronto.ca/~cks/space/?search=crawlers

It's interesting how in 2005 he was advocating for playing nicely, because yes, search engines were hypothetically driving traffic to sites they crawled. And there don't seem to be a lot of complaints about performance hits. Though typically, his comments and observations are restricted to his blog's corner of his department.

BLKNSLVR - 2 days ago

Should we be moving to a push / submit methodology for centralised indexing such as AI and search?

No scraping, if I want you to read it I'll send it to you.

hananova - 2 days ago

Back in the '90s, as a dumb kid I got a phone call from my ISP telling me that my system had been enlisted into a botnet and that they were turning off my connection until I fixed it.

Maybe it's time to once again block entire ASNs of ISPs that tolerate being used as residential proxies, and this shit might end.
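
Blocking a whole ASN is doable with public routing data; a rough shell sketch (the AS number below is a documentation placeholder):

  # pull the prefixes announced by a given ASN from the RADB routing registry
  whois -h whois.radb.net -- '-i origin AS64500' \
      | awk '/^route:/ {print $2}' | sort -u > asn-blocks.txt

  # load them into an ipset and drop matching traffic
  sudo ipset create badnets hash:net
  while read -r net; do sudo ipset add badnets "$net"; done < asn-blocks.txt
  sudo iptables -I INPUT -m set --match-set badnets src -j DROP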

loloquwowndueo - 2 days ago

Nice - I like that most of the AI scraper bot blocking was done using Nginx configuration. Still, once fail2ban was added to the mix (meaning an additional service and configuration), I wonder whether something like Anubis (https://anubis.techaro.lol/) would have been more automatic. I’ve seen Anubis verification pages pop up more frequently around the web!

RamRodification - 2 days ago

I could be wrong, but the comment for the rate limiting (1r/s) doesn't seem to line up with the actual rate limit being set (5r/s).

  # Define a rate limit of 1 request per second every 1m
  limit_req_zone $binary_remote_addr zone=krei:10m rate=5r/s;
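
For reference, the zone by itself doesn't limit anything until a location references it with limit_req; a version that actually matched the comment (1 r/s, with a small burst) would look something like:

  # 10 MB of shared memory for per-IP state, 1 request/second steady rate
  limit_req_zone $binary_remote_addr zone=krei:10m rate=1r/s;

  location / {
      # allow bursts of up to 10 requests without delay; excess is rejected (503 by default)
      limit_req zone=krei burst=10 nodelay;
  }
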
tasuki - 2 days ago

I wonder, why do people maintain personal git forges?

Is it so others can submit issues and pull requests? Do others actually bother to create accounts at your personal git forge so they can contribute?

rovr138 - 2 days ago

Great article and sleuthing to find the information.

I know you're processing them dynamically as they come in and break the rules. But if you wanted to supplement the list, might be worth sourcing the ones from https://github.com/ai-robots-txt/ai.robots.txt at some frequency.
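
If you go that route, a nightly cron pull is about all it takes; a sketch, assuming the repo keeps a ready-made robots.txt at its top level on the main branch (worth verifying before relying on it):

  # crontab entry: refresh robots.txt every night at 04:00 from the ai.robots.txt project
  0 4 * * * curl -fsSL https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt -o /var/www/html/robots.txt.tmp && mv /var/www/html/robots.txt.tmp /var/www/html/robots.txt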

aswegs8 - 2 days ago

Ah, good old botkilling. Gotta love those.

mleonhard - 2 days ago

The website blocks the privacy proxy (VPN) service I use.

nullc - 2 days ago

I can't use mouser or digikey anymore thanks to anti-crawler crusading.

kazinator - 2 days ago

Over the past while, I have settled on a four-component approach. I place less emphasis on banning; it has turned out to be a nuisance for legit users.

1. User-Agent patterns for various known bots. These get sent to a honeypot (currently implementing zip bombs).

2. Arithmetic captcha page (called "captxa") protecting most of my /cgit tree, which is the primary target for scrapers. Solve a formula, get a persistent cookie that grants you access.

3. Static IP-level bans. The bulk of the unwanted traffic has been from Alibaba Cloud LLC addresses, so I got a list of their blocks and made rules against them all (one way to express that in Apache 2.4 is sketched after this list).

4. Log-scan-and-ban system: now mainly just targeting excessive activity over a short period.
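
A minimal Apache 2.4 sketch for 3, using a RequireAll block with placeholder documentation ranges (substitute the real Alibaba Cloud CIDR blocks):

  <Location "/">
    <RequireAll>
      Require all granted
      # placeholder ranges only
      Require not ip 198.51.100.0/24
      Require not ip 203.0.113.0/24
    </RequireAll>
  </Location>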

About 2: I define the same cookie value, given to everyone, right inside the Apache httpd.conf. This is passed via an environment variable to the captcha page, a CGI script. Once the captcha is solved and the client has that cookie, it is the Apache rules themselves that check it: the RewriteRule which redirects to the captcha page is guarded by a RewriteCond which checks the value of %{HTTP_COOKIE} for the presence of the required cookie.

I use the Apache <Macro ...> feature to define the anti-bot rules. Then the rules can be included in any VirtualHost block easily by invoking the macro.

The skeleton of the system looks like this:

  Define captxa_secret SecretGoesHere

  <Macro nobots>
    # Push captxa secret into environment, for the CGI script.
    SetEnv CAPTXA_SECRET ${captxa_secret}

    # These agents get sent to /honeypot
    RewriteCond %{HTTP_USER_AGENT} .*(ezoom|spider|crawler|scan|yandex|coccoc|github|python|amazonbot|[Pp]etalbot|ahrefsbot|semrush|anthropic|Facebot|meta|openai|GPT|OPR/|Edg/).* [NC]
    RewriteCond %{REQUEST_URI} !honeypot
    RewriteRule ^.*$ /honeypot [R,L]

    # Clients which have the cookie with the captxa_secret get a pass
    RewriteCond %{HTTP_COOKIE} "^(.*; ?)?captxa=${captxa_secret}(;.*)?$"
    RewriteRule ^.*$ - [L]

    # Rules for no-cookie clients:

    # Root level /cgit site: ok for all
    RewriteRule ^/cgit/$ - [L]

    # First level of /cgit: the project pages: OK
    RewriteRule ^/cgit/([^/]+)/?$ - [L]

    # Certain project sub-pages like about, snapshot, README: Pass for hassle-free landing.
    RewriteRule ^/cgit/([^/]+)/(about|snapshot|refs|tree/RELNOTES|tree/README|tree/README\.md)/?$ - [L]

    # Some distros fetch TXR tarball snapshots from CGIT; make it easy for them.
    RewriteRule ^/cgit/txr/snapshot/txr-([0-9]+)\.tar\. - [L]
 
    # Everyone else off to captxa script: which will install the cookie
    # for those clients who solve it. $0 is the entire URL;
    # it gets passed to the script as a parameter, so the
    # script can redirect to that URL upon a successful solve.
    RewriteRule ^/cgit/?(.*)$ /captxa/?quiz=$0? [R,L,NE,QSA]
  </Macro>
Inside a VirtualHost I just do

  RewriteEngine on
  Use nobots   # include above macro
dakiol - 2 days ago

OpenAI is the biggest offender according to my server’s logs. Nevertheless we keep praising them over here. We are so stupid.

curtisszmania - 2 days ago

[dead]

datavirtue - 2 days ago

[flagged]

sneak - 2 days ago

You don’t have to fend off anything; you just have to fix your server to support this modest amount of traffic.

Everyone else is visiting your site for entirely self-serving purposes, too.

I don’t understand why people are ok with Google scraping their site (when it is called indexing), fine with users scraping their site (when it is called RSS reading), but suddenly not ok with AI startups scraping their site.

If you publish data to the public, expect the public to access it. If you don’t want the public (this includes AI startups) to access it, don’t publish it.

Your website is not being misused when the data is being downloaded to train AI. That’s literally what public data is for.