Our local GitLab server has been under attack by Anthropic, Google, OVH and more (twitter.com)
I saw them try to read some static files I posted here, but they were instantly blocked by a combination of nftables and nginx.
bzcat *mirror*access*bz2 | grep -c " 200 "
283
bzcat *mirror*access*bz2 | grep -c " 444 "
3607
That's what made it past the nftables TCP MSS and TCP window rules. The 200s were members of HN. The 444s were bots.
Does GitLab front-end with nginx or HAProxy?
Both - first haproxy, then nginx.
The first thing I would look for is whether real users, both browsers and API clients, are capable of HTTP/2.0 and default to it. If so, that's an easy win. Block anything lower than HTTP/2.0 and that will nuke most bots outside of headless Chrome. If any real clients are using HTTP/1.1, then make a separate listener/URL for those and limit access to known good CIDR blocks with a firewall, assuming this is a corporate GitLab server. Or block this on HAProxy and give trusted networks a way to reach nginx directly, such as a VPN or firewall rule.
If there are archived access logs that would be a good place to try to figure this out.
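As a rough sketch of that log check, something like the following tallies requests by protocol version. It assumes the logs use nginx's "combined" format, where the quoted request line looks like "GET /path HTTP/1.1"; the function name count_protocols is made up for illustration.

```shell
# Tally requests by HTTP protocol version from access log lines on stdin.
# Assumption: nginx "combined" log format, so the second "-delimited field
# is the request line, e.g. GET /path HTTP/1.1
count_protocols() {
    awk -F'"' '
        {
            # Split the request line into method, path and protocol.
            n = split($2, req, " ")
            # Malformed request lines (common with bots) are skipped.
            if (n == 3) count[req[3]]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -rn
}
```

Usage would be along the lines of `bzcat *mirror*access*bz2 | count_protocols`. If nearly all of the 200s turn out to be HTTP/2.0, blocking the lower versions is probably safe.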
In nginx the block looks like this [1], or change it to a redirect to a static landing page.
If this is not an option then restrict repo access to approved SSH clients.
If this is not an option, then put authentication on the repos hit hardest, plus a page that explains what the user/password is along with an acceptable use policy for using the authentication. If AI crawlers are trained to learn the credentials, they will be violating the AUP written by your lawyers. Make the AI vendors give you enough money to upgrade your infrastructure to handle their load.
TL;DR: find the differences between bot behavior and real people, then make rules that will break the bots. There's always a difference. When all else fails, block the CIDR blocks of all the known AI networks and play whack-a-mole with anything outside of their networks. Not perfect, nothing is, but it will lower the load.
If going the blocking route, add all their CIDR blocks and IPs into a text file that gets read by a startup script to run, for each entry:
ip route add blackhole "${CIDR}" 2>/dev/null
That will prevent HAProxy from being able to complete the handshake and is a much lower CPU and memory load on the server than using firewall rules.
[1] - https://mirror.newsdump.org/nginx/inc.d/40_https2_stuff.conf...
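For what it's worth, the startup script for the blackhole approach could look roughly like this. It is only a sketch: the function name, the one-entry-per-line file format with # comments, and the IP_CMD override for dry runs are my own assumptions, not something from the setup described above.

```shell
#!/bin/sh
# Hypothetical startup helper: blackhole every CIDR block or IP in a list.
# IP_CMD defaults to the iproute2 binary; set IP_CMD=echo for a dry run.
IP_CMD="${IP_CMD:-ip}"

blackhole_from_file() {
    list="$1"
    # The "|| [ -n ... ]" part also handles a last line with no newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Strip "#" comments and surrounding whitespace.
        cidr=$(printf '%s' "$line" |
            sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
        # Skip blank and comment-only lines.
        [ -z "$cidr" ] && continue
        # Errors (e.g. the route already exists) are deliberately ignored.
        $IP_CMD route add blackhole "$cidr" 2>/dev/null || true
    done < "$list"
}
```

Calling `blackhole_from_file /path/to/list` as root at boot would install the routes. Running it first with IP_CMD=echo prints the commands instead, which is a cheap way to check the list before locking anyone out.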