Anubis: Proof-of-work proxy to prevent AI crawlers

anubis.techaro.lol

100 points by techknowlogick 9 months ago · 67 comments

jchw 9 months ago

I'm really curious to see how this evolves as time goes on. Hashcash was originally conceived to stop e-mail spam, and a lot has changed since then; namely, compute has become absolutely dirt cheap. Despite that, PoW-based anti-bot remains somewhat enticing because it doesn't necessarily harm accessibility the way that solutions like Cloudflare or reCAPTCHA can: it should be possible to pass even on a VPN or Tor, even on less-used web browsers like Ladybird or Servo, and even if you're not on a super powerful device (provided you're willing to wait for the PoW challenge to pass; as long as you don't hit all of these conditions at once, you should get an "easy" challenge and it should be quick).

The challenge is definitely figuring out if this solution actually works at scale or not. I've played around with an implementation of Hashcash myself, using WebCrypto, but I worry because even using WebCrypto it is quite a lot slower than cracking hashes in native code. But seeing Anubis seemingly have some success makes me hopeful. If it gains broad adoption, it might just be enough of a pain in the ass for scrapers, while still being possible for automation to pass provided they can pay the compute toll (e.g. hopefully anything that's not terribly abusive.)
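
A minimal sketch (in Go) of the kind of Hashcash-style SHA-256 proof of work being discussed here; this is not Anubis's or the commenter's actual code, and the challenge string, difficulty, and nonce encoding are illustrative:

    // A Hashcash-style proof of work: find a nonce whose SHA-256 hash,
    // together with the server's challenge string, has enough leading zero
    // bits. Solving is expensive; verifying is a single hash.
    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
        "strconv"
    )

    // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
    func leadingZeroBits(sum [32]byte) int {
        n := 0
        for _, b := range sum {
            if b == 0 {
                n += 8
                continue
            }
            n += bits.LeadingZeros8(b)
            break
        }
        return n
    }

    // solve brute-forces a nonce; expected work is about 2^difficulty hashes.
    func solve(challenge string, difficulty int) uint64 {
        for nonce := uint64(0); ; nonce++ {
            sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
            if leadingZeroBits(sum) >= difficulty {
                return nonce
            }
        }
    }

    // verify is what the server does: one hash per submitted answer.
    func verify(challenge string, nonce uint64, difficulty int) bool {
        sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
        return leadingZeroBits(sum) >= difficulty
    }

    func main() {
        const difficulty = 16 // roughly 65k hashes on average
        nonce := solve("example-challenge", difficulty)
        fmt.Println(nonce, verify("example-challenge", nonce, difficulty))
    }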

On a lighter note, I've found the reception of Anubis, and in particular the anime-style mascot, to be predictably amusing.

https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...

(Note: I'd personally suggest not going and replying here. Don't want to encourage brigading of any sort, just found this mildly amusing.)

  • rickydroll 9 months ago

    FWIW, I was part of the gang talking with Adam about using Hashcash as a postage stamp as an anti-spam measure. I implemented an extension to Postfix using Hashcash as an anti-spam method (camram/two penny blue). The compute inflationary pressure you mentioned was only one of the challenges in deploying such a system. FWIW, none of the challenges were barriers; they were the kind of challenges you get when you take something beautiful, theoretical, and pure and try to make it real.

    Some may not be aware that Hashcash's value is as a decentralized rate limiter that can be added to almost any protocol. Experience with Hashcash taught us that it's essential to have a dynamic pricing scheme based on the reputation of the contact originator. In the email context, when a message sender connects to a receiving server, the receiving server should be able to tell the sender the size of the stamp needed, based on the reputation established by previous messages.

    From my perspective, the two main challenges in rate-limiting HTTP requests are embedding the required Hashcash stamp size and measuring the reputation of the request initiator. I think Anubis is a good first start, but it uses a fixed-size stamp (a small one), does not have a reputation database, and, from what I can tell, does not have a robust detector of good versus bad players. These shortcomings will make it challenging to provide adequate protection without interfering with good players.

    I'll spare you my design note rambling, but I think that from three page requests one can gather enough information to determine the size of the Hashcash stamp for future requests.
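
    A sketch of the dynamic-pricing idea described above, assuming an in-memory reputation store keyed by something like IP, fingerprint, or account; the thresholds and field names are invented for illustration and are not part of Anubis:

        // Hypothetical reputation-based difficulty ("stamp size") selection.
        // The thresholds and scoring are invented for illustration.
        package pricing

        import "sync"

        type Reputation struct {
            GoodRequests int // challenges passed and well-behaved traffic
            BadRequests  int // rate-limit hits, malformed stamps, abuse signals
        }

        type Store struct {
            mu    sync.Mutex
            byKey map[string]*Reputation // keyed by IP, fingerprint, or account
        }

        func NewStore() *Store {
            return &Store{byKey: make(map[string]*Reputation)}
        }

        // DifficultyFor returns how many leading zero bits to demand:
        // strangers pay a moderate price, proven good players pay less,
        // and misbehaving ones pay much more.
        func (s *Store) DifficultyFor(key string) int {
            s.mu.Lock()
            defer s.mu.Unlock()
            rep, ok := s.byKey[key]
            if !ok {
                return 16 // default price for unknown contacts
            }
            switch {
            case rep.BadRequests > 10:
                return 24 // ~256x more work than the default
            case rep.GoodRequests > 100 && rep.BadRequests == 0:
                return 8 // nearly free for established good players
            default:
                return 16
            }
        }

        // Record updates reputation once the request's behaviour is known.
        func (s *Store) Record(key string, good bool) {
            s.mu.Lock()
            defer s.mu.Unlock()
            rep, ok := s.byKey[key]
            if !ok {
                rep = &Reputation{}
                s.byKey[key] = rep
            }
            if good {
                rep.GoodRequests++
            } else {
                rep.BadRequests++
            }
        }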

    • xena 9 months ago

      > I think Anubis is a good first start, but it uses a fixed-size stamp (a small one), does not have a reputation database, and, from what I can tell, does not have a robust detector of good versus bad players. These shortcomings will make it challenging to provide adequate protection without interfering with good players.

      That's what it's intended to be right now. I've been thinking about how to do a reputation database; I'm half considering using a DHT like BitTorrent's for cross-Anubis coordination (I haven't filed an issue about this because I'm still noodling it out into a spec). I'm also working on more advanced risk calculation, but this kind of exploded out of nowhere for me.

      • rickydroll 9 months ago

        I would focus on the bad player detector first, then move on to the dynamic pricing scheme. I wouldn't worry about cross-Anubis coordination until you've solved the database problem for the single node.

        IMO, of course.

avodonosov 9 months ago

Ideas:

- Make it generate cryptocurrency, so that the work is not wasted. Either to compensate for the server expenses of hosting the content, or for some noble non-profit cause - all installations would collect the currency into a single account. Wasting the work is worse than either of these options.

- An easy way for good crawlers (like the Internet Archive) to authenticate themselves. E.g. TLS client-side authentication, or simply an HTTP request header containing a signature for the request (the signature in the header could be based, for example, on their domain name and the TLS cert for that domain).

  • ronsor 9 months ago

    > Make it generate cryptocurrency, so that the work is not wasted. Either to compensate for the server expenses of hosting the content, or for some noble non-profit cause - all installations would collect the currency into a single account. Wasting the work is worse than either of these options.

    The last time[0] we did this, everyone had a meltdown and blocked it.

    [0] See Coinhive, which conspicuously lacks a Wikipedia page.

  • throwawayEm8UE 9 months ago

    Maybe there should also be a protocol to support paying a small amount of crypto (on the order of 0.01-0.1 cents per request) to bypass the PoW. The bots can still scrape it by paying, but you now price in the externality.

    • avodonosov 9 months ago

      Yes, I was also thinking of this.

      "proof of pay" or "proof of a transaction"

      An advantage of this would be not wasting electricity for the proof of work computation.

  • avodonosov 9 months ago

    As an alternative to cryptocurrency, some other useful computation, like a contribution to BOINC. You only need to find a way to prove the computation.

  • kebokyo 9 months ago

    I’m not sure how you can generate cryptocurrency with just SHA-256 hashes lol. Plus, there’s already a configuration file that lets you determine what you’re looking out for in terms of scrapers. I need to look at it more myself, especially because I plan on using this on my own website soon, but I hope there’s already a good way to whitelist archival sites.

Trung0246 9 months ago

For a no-JS solution, I think some sort of optical illusion could work as a captcha, especially https://en.wikipedia.org/wiki/Magic_Eye or something like https://www.youtube.com/watch?v=Bg3RAI8uyVw, which could cleverly hide the captcha answer within an animated mess of noise.

However, these methods are not really accessibility-friendly.

nikisweeting 9 months ago

Doesn't seem to noticeably slow down my test bot. Headful crawling already takes ~10 sec/page, so an extra 0.5 sec is hardly that big a deal.

  • jeroenhd 9 months ago

    How effective a PoW firewall is depends on how conservatively the worst offenders configure their browser engine resource limits.

    A small scraper may be able to afford the extra CPU cycles, but for something like an AI training bot, sometimes sending hundreds of browser instances at a time, the math becomes different.

    From what I've read about the results, it seems like the approach is effective against the very worst scrapers and bots out there.

  • akoboldfrying 9 months ago

    "Headful"?

pvg 9 months ago

Discussion here https://news.ycombinator.com/item?id=43422929

yjftsjthsd-h 9 months ago

> to stop AI crawlers

It'll do that too, but it's really more of a general-purpose anti-bot, right? A generic PoW-wall.

barlog 9 months ago

When I visited Xe-san's page, I was curious to see the Jackal-chan challenges in action. This is Anubis.

akoboldfrying 9 months ago

Regarding the problem of how to let "good" bots through:

You could use PKI: drop the PoW if the client provides a TLS client certificate chain that asserts that <publicKey> corresponds to a private key controlled by <fullNamesAndAddressesOfThesePeople> (or just by, say, <peopleWhoControlThisUrl>, for Let's Encrypt-style automatable cert signing). This would be a slight hassle for good bot operators to set up, but not a very big deal. The result is that bad bots couldn't spoof good bots to get in.

(Obviously this strategy generalises to handling human users too -- but in that case, the loss of privacy, as well as admin inconvenience, makes it much less palatable.)
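
One way the PKI idea could look as reverse-proxy middleware. This is a hypothetical sketch, not something Anubis does today, and it assumes the server's TLS configuration verifies client certificates against a CA pool the site trusts:

    // Hypothetical middleware: clients that present a certificate chaining to
    // a CA the site trusts skip the proof of work; everyone else gets the
    // challenge. Requires the server's tls.Config to set ClientCAs and
    // ClientAuth: tls.VerifyClientCertIfGiven.
    package goodbots

    import "net/http"

    func SkipPoWForTrustedBots(next, challenge http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // VerifiedChains is non-empty only when the client certificate
            // validated against the configured CA pool.
            if r.TLS != nil && len(r.TLS.VerifiedChains) > 0 {
                next.ServeHTTP(w, r) // declared, verifiable good bot
                return
            }
            challenge.ServeHTTP(w, r) // normal proof-of-work interstitial
        })
    }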

  • areyourllySorry 9 months ago

    Well, good bots usually publish their IP ranges, and that's way simpler.

    • akoboldfrying 9 months ago

      Yes, that's much simpler, but doesn't it mean that every site owner needs to manage a hardcoded set of "places to look" for these published IPs?

      If I want to create a good bot tomorrow, where do I publish its IP addresses? IOW, how can I ensure that the world "notices"?

bno1 9 months ago

What stops a scraper from detecting Anubis and just removing "Mozilla" from the user-agent string?

  • TonyTrapp 9 months ago

    That would allow you to specifically lock out that bot based on its user-agent string. That's the main problem with AI scrapers: many of them normally use user-agents that cannot be easily blocked, so other means have to be found to keep them off your grounds.

  • xena 9 months ago

    It breaks websites when you do that.
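
    A simplified sketch of the gating being discussed in this subthread: clients whose User-Agent contains "Mozilla" get the proof-of-work interstitial, while self-identified non-browsers can be matched and blocked by ordinary rules. This is an illustration of the idea, not Anubis's actual rule engine:

        // Illustration only: browsers send "Mozilla" in their User-Agent, so
        // they get the challenge; a bot that strips "Mozilla" becomes trivial
        // to match (and block) by its now-distinctive User-Agent.
        package botgate

        import (
            "net/http"
            "strings"
        )

        func Gate(challenge, nonBrowserPolicy http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                if strings.Contains(r.UserAgent(), "Mozilla") {
                    challenge.ServeHTTP(w, r) // looks like a browser: prove some work
                    return
                }
                nonBrowserPolicy.ServeHTTP(w, r) // allow-list or block outright
            })
        }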

xg15 9 months ago

It's a great idea, but I fear that if this keeps going viral like it has in the last few days, more bot authors will be motivated to add special handling for it, e.g. changing the user agent to a non-Mozilla one.

Trung0246 9 months ago

The performance on mobile kinda sucks tho; it took like 30 seconds to finish the PoW at difficulty 4 on Firefox for Android. By that time I had to resist the urge to switch away and do something else.

iszomer 9 months ago

This reminded me of an article I printed (yes, on paper) at my college more than 20 years ago, titled Parasitic Computing. I don't remember where it was originally published, but I think I might have stumbled upon it via kuro5hin (maybe); a quick search turned up the publication in Nature (though it's paywalled).

- https://www.nature.com/articles/35091039

Alifatisk 9 months ago

This is like wehatecaptchas.com

ranger_danger 9 months ago

I would say it doesn't prevent anything; it just makes computers warm the planet more.

  • 01HNNWZ0MV43FF 9 months ago

    That crime is on the attackers' hands.

    Hydrocarbons should cost more. Vote for a pollution tax.

  • jeroenhd 9 months ago

    A quick PoW calculation per web session is probably not going to make a dent in pollution now that generative AI has become commonplace.

    It'll suck for battery life if you're often browsing random website after random website, or if you're a bot farm, but in practice I don't think most welcome users will notice a thing.

    • theragra 9 months ago

      Another comment mentions it takes 30 seconds to pass in mobile Firefox, which is obviously too much.

  • akoboldfrying 9 months ago

    > it just makes computers warm the planet more.

    Since that would cost the bad guys money, they won't actually do it much.

kmeisthax 9 months ago

Is there a way to alter text to poison AI training sets? I know there's Glaze and Nightshade for images but I've heard of nothing to poison text models. To be clear, this wouldn't be a defensive measure to stop scraping; it'd be an offensive honeypot: you'd want to make pages that have the same text but mutated slightly differently each time, so that AI scrapers preferentially load up on your statistically different text and then yield a poisoned model. Ideally the scraper companies will realize what's going on and stop scraping.

  • xyzal 9 months ago

    Generate lots (LOTS!) of bad, insecure, or wrong code and link it somewhere on your page (see https://www.emergent-misalignment.com/). Ideally, make your local LLaMA generate more examples like those on the linked site.

    • jeroenhd 9 months ago

      Combine this with bug bounties and you can turn AI leeches into a money making scheme!

  • akoboldfrying 9 months ago

    AFAICT this isn't possible, unless you're OK with showing everyone the poisoned text.

    TTBOMK there's nothing here that "detects botness" of an individual request, because in the limit, that's impossible -- if an attacker has access to many different IPs to make requests from (and many do), then any sequence of bot-generated requests from different IPs is indistinguishable from the same set of requests made by actual living, breathing humans (and vice versa).

    So how does Anubis work against bots if it can't actually detect them? Because of the economics behind them: To justify creating a bot in the first place, you need to scrape a lot of pages, so paying a small electricity cost per page means you will need to pay a lot overall. Humans pay this too, but because we request a much smaller number of pages, the overall cost is negligibly low.

    • imtringued 9 months ago

      You glossed over how it works. Bots don't maintain session cookies to avoid rate limits, so they will have to do the challenge over and over again, whereas humans keep the session cookie and amortise the cost of the challenge over multiple requests.

      • akoboldfrying 9 months ago

        You're right, Anubis sets a week-long cookie:

        > exp: The token's expiry, a week after the token was issued

        This is surprising to me because it effectively nullifies Anubis.

        > Bots don't maintain session cookies to avoid rate limits

        Maybe they don't today, but there's absolutely nothing stopping them from adding this, so if Anubis gets any traction, in two days' time they will.

        • kmeisthax 9 months ago

          You can turn Anubis into Proof of Storage by appending large amounts of random data to the session cookie and then hashing it on each request. Combined with per-session-cookie limits, you can effectively force the LLM bots to store all the data you want if they want to scrape your site.

          Maybe suckerpinch can work it into a sequel to Harder Drive: Hard drives we didn't want or need[0].

          [0] https://www.youtube.com/watch?v=JcJSW7Rprio

          • akoboldfrying 9 months ago

            Great clip! I love the idea of making internet miscreants reluctantly store your data!

            Slightly more seriously though, I think for the Proof of Storage idea to pack enough punch to be a deterrent, you'd need the cookies to be quite large. Is there a way to avoid needing them to send you all the bytes each time? Because that will cost you (the site owner) too.

            I had the idea of sending requestors a challenge like "Flip the bit at position i of the large chunk of stored data, hash the result and include the hash in your headers". Instead of the site owner keeping the full stored data, they would just keep the RNG seed used to generate it -- this saves on storage, though it still requires them to do the same time-consuming hash computation done by the requestor.
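
            A sketch of that seed-only variant: the server stores just a per-session seed, deterministically re-expands it to check answers, and the client must keep the full blob around. The sizes and the expansion scheme are illustrative assumptions, not anything Anubis implements:

                // Sketch: the server stores only a per-session seed and re-expands
                // it on demand; the client must store the full blob to answer.
                package proofofstorage

                import (
                    "bytes"
                    "crypto/sha256"
                )

                // expand stretches a small seed into a large blob by chaining
                // SHA-256, so the server never stores the blob itself.
                func expand(seed []byte, size int) []byte {
                    out := make([]byte, 0, size)
                    block := sha256.Sum256(seed)
                    for len(out) < size {
                        out = append(out, block[:]...)
                        block = sha256.Sum256(block[:])
                    }
                    return out[:size]
                }

                // respond is the client's side: flip the requested bit of the
                // stored blob and hash the result.
                func respond(blob []byte, bitIndex int) [32]byte {
                    flipped := bytes.Clone(blob)
                    flipped[bitIndex/8] ^= 1 << (uint(bitIndex) % 8)
                    return sha256.Sum256(flipped)
                }

                // verify is the server's side: regenerate the blob from the seed
                // and check the answer. This trades the server's storage for CPU,
                // exactly the cost noted above.
                func verify(seed []byte, size, bitIndex int, answer [32]byte) bool {
                    return respond(expand(seed, size), bitIndex) == answer
                }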

        • fc417fc802 9 months ago

          Anubis remains effective because the token is subject to a rate limit and you could additionally limit token lifetime to some maximum number of requests if you wanted to.

          All of these factors (total requests, rate of requests, associated IPs, associated browser fingerprints) tie in to detecting bad players, who should receive more frequent and larger challenges.
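
          A sketch of those per-token limits: cap both the request rate and the total number of requests a single token can spend before the client is sent back to the challenge. The specific numbers are illustrative, not Anubis defaults:

              // Sketch of per-token limits: a rate window plus a lifetime cap,
              // after which the client must solve a fresh challenge.
              package tokenlimit

              import (
                  "sync"
                  "time"
              )

              type usage struct {
                  total      int       // lifetime requests spent on this token
                  windowFrom time.Time // start of the current rate window
                  inWindow   int       // requests inside the current window
              }

              type Limiter struct {
                  mu           sync.Mutex
                  byToken      map[string]*usage
                  maxTotal     int           // e.g. 5000 requests per token, then re-challenge
                  maxPerWindow int           // e.g. 60 requests...
                  window       time.Duration // ...per minute
              }

              func New(maxTotal, maxPerWindow int, window time.Duration) *Limiter {
                  return &Limiter{
                      byToken:      make(map[string]*usage),
                      maxTotal:     maxTotal,
                      maxPerWindow: maxPerWindow,
                      window:       window,
                  }
              }

              // Allow reports whether a request bearing this token may proceed,
              // or whether the client should be sent back to the challenge.
              func (l *Limiter) Allow(tokenID string) bool {
                  l.mu.Lock()
                  defer l.mu.Unlock()
                  u, ok := l.byToken[tokenID]
                  if !ok {
                      u = &usage{windowFrom: time.Now()}
                      l.byToken[tokenID] = u
                  }
                  if time.Since(u.windowFrom) > l.window {
                      u.windowFrom, u.inWindow = time.Now(), 0
                  }
                  if u.total >= l.maxTotal || u.inWindow >= l.maxPerWindow {
                      return false // token exhausted or too fast: re-challenge
                  }
                  u.total++
                  u.inWindow++
                  return true
              }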

    • kotenok2000 9 months ago

      You could make the poisoned text extremely small, so it isn't visible to humans who perceive the web page optically, but it is visible to crawlers.

  • throwawayEm8UE 9 months ago

    I don't think that type of adversarial stuff will work in general. A better idea is to just make the models learn stuff their owners don't want. Maybe include erotica (AI-generated or not), malware code, and other kinds of offensive content to annoy the bot owners.

  • Jotalea 9 months ago

    In theory you could "scramhe teble txt" (scramble the text) and show the proper one to the user, but I don't know how (in)efficient that would be.

  • TUSF 9 months ago

    One idea is to include a bunch of nonsense text into webpages, and use CSS to hide it from browsers. Depending on how aggressively you want to do it, this may affect some accessibility tooling.

drpossum 9 months ago

[flagged]
