Anubis: Proof-of-work proxy to prevent AI crawlers

anubis.techaro.lol

100 points by techknowlogick 9 months ago · 67 comments

jchw 9 months ago

I'm really curious to see how this evolves as time goes on. Hashcash was originally conceived to stop e-mail spam, and a lot has changed since then; namely, compute has become absolutely dirt cheap. Despite that, PoW-based anti-bot remains somewhat enticing because it doesn't necessarily harm accessibility the way that solutions like Cloudflare or reCAPTCHA can: it should be possible to pass even on a VPN or Tor, even on less-used web browsers like Ladybird or Servo, and even if you're not on a super powerful device (provided you're willing to wait for the PoW challenge to pass; as long as you don't hit all of these conditions at once, you should get an "easy" challenge and it should be quick).

The challenge is definitely figuring out if this solution actually works at scale or not. I've played around with an implementation of Hashcash myself, using WebCrypto, but I worry because even using WebCrypto it is quite a lot slower than cracking hashes in native code. But seeing Anubis seemingly have some success makes me hopeful. If it gains broad adoption, it might just be enough of a pain in the ass for scrapers, while still being possible for automation to pass provided they can pay the compute toll (e.g. hopefully anything that's not terribly abusive.)
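
A minimal sketch (in Go) of the kind of Hashcash-style SHA-256 proof of work being discussed here; this is not Anubis's or the commenter's actual code, and the challenge string, difficulty, and nonce encoding are illustrative:

    // A Hashcash-style proof of work: find a nonce whose SHA-256 hash,
    // together with the server's challenge string, has enough leading zero
    // bits. Solving is expensive; verifying is a single hash.
    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
        "strconv"
    )

    // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
    func leadingZeroBits(sum [32]byte) int {
        n := 0
        for _, b := range sum {
            if b == 0 {
                n += 8
                continue
            }
            n += bits.LeadingZeros8(b)
            break
        }
        return n
    }

    // solve brute-forces a nonce; expected work is about 2^difficulty hashes.
    func solve(challenge string, difficulty int) uint64 {
        for nonce := uint64(0); ; nonce++ {
            sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
            if leadingZeroBits(sum) >= difficulty {
                return nonce
            }
        }
    }

    // verify is what the server does: one hash per submitted answer.
    func verify(challenge string, nonce uint64, difficulty int) bool {
        sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
        return leadingZeroBits(sum) >= difficulty
    }

    func main() {
        const difficulty = 16 // roughly 65k hashes on average
        nonce := solve("example-challenge", difficulty)
        fmt.Println(nonce, verify("example-challenge", nonce, difficulty))
    }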

On a lighter note, I've found the reception of Anubis, and in particular the anime-style mascot, to be predictably amusing.

https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...

(Note: I'd personally suggest not going and replying here. Don't want to encourage brigading of any sort, just found this mildly amusing.)

  • rickydroll 9 months ago

    FWIW, I was part of the gang talking with Adam about using Hashcash as a postage stamp as an anti-spam measure. I implemented an extension to Postfix using Hashcash as an anti-spam method (camram/two penny blue). The compute inflationary pressure you mentioned was only one of the challenges in deploying such a system. FWIW, none of the challenges were barriers; they were the kind of challenges you get when you take something beautiful, theoretical, and pure and try to make it real.

    Some may not be aware that Hashcash's value is as a decentralized rate limiter that can be added to almost any protocol. Experience with Hashcash taught us that it's essential to have a dynamic pricing scheme based on the reputation of the contact originator. In the email context, when a message sender connects to a receiving server, the receiving server should be able to tell the sender the size of the stamp needed, based on the reputation established by previous messages.

    From my perspective, the two main challenges in rate-limiting HTTP requests are embedding the required Hashcash stamp size and measuring the reputation of the request initiator. I think Anubis is a good first start, but it uses a fixed-size stamp (a small one), does not have a reputation database, and, from what I can tell, does not have a robust detector of good versus bad players. These shortcomings will make it challenging to provide adequate protection without interfering with good players.

    I'll spare you my design note rambling, but I think that from three page requests one can gather enough information to determine the size of the Hashcash stamp for future requests.
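
    A sketch of the dynamic-pricing idea described above, assuming an in-memory reputation store keyed by something like IP, fingerprint, or account; the thresholds and field names are invented for illustration and are not part of Anubis:

        // Hypothetical reputation-based difficulty ("stamp size") selection.
        // The thresholds and scoring are invented for illustration.
        package pricing

        import "sync"

        type Reputation struct {
            GoodRequests int // challenges passed and well-behaved traffic
            BadRequests  int // rate-limit hits, malformed stamps, abuse signals
        }

        type Store struct {
            mu    sync.Mutex
            byKey map[string]*Reputation // keyed by IP, fingerprint, or account
        }

        func NewStore() *Store {
            return &Store{byKey: make(map[string]*Reputation)}
        }

        // DifficultyFor returns how many leading zero bits to demand:
        // strangers pay a moderate price, proven good players pay less,
        // and misbehaving ones pay much more.
        func (s *Store) DifficultyFor(key string) int {
            s.mu.Lock()
            defer s.mu.Unlock()
            rep, ok := s.byKey[key]
            if !ok {
                return 16 // default price for unknown contacts
            }
            switch {
            case rep.BadRequests > 10:
                return 24 // ~256x more work than the default
            case rep.GoodRequests > 100 && rep.BadRequests == 0:
                return 8 // nearly free for established good players
            default:
                return 16
            }
        }

        // Record updates reputation once the request's behaviour is known.
        func (s *Store) Record(key string, good bool) {
            s.mu.Lock()
            defer s.mu.Unlock()
            rep, ok := s.byKey[key]
            if !ok {
                rep = &Reputation{}
                s.byKey[key] = rep
            }
            if good {
                rep.GoodRequests++
            } else {
                rep.BadRequests++
            }
        }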

    • xena 9 months ago

      > I think Anubis is a good first start, but it uses a fixed-size stamp (a small one), does not have a reputation database, and, from what I can tell, does not have a robust detector of good versus bad players. These shortcomings will make it challenging to provide adequate protection without interfering with good players.

      That's what it's intended to be right now. I've been thinking about how to do a reputation database; I'm half considering using a DHT like BitTorrent's for cross-Anubis coordination (I haven't filed an issue about this because I'm still noodling it out into a spec). I'm also working on more advanced risk calculation, but this kind of exploded out of nowhere for me.

      • rickydroll 9 months ago

        I would focus on the bad player detector first, then move on to the dynamic pricing scheme. I wouldn't worry about cross-Anubis coordination until you've solved the database problem for the single node.

        IMO, of course.

avodonosov 9 months ago

Ideas:

- Make it generate cryptocurrency, so that the work is not wasted. Either to compensate for the server expenses of hosting the content, or for some noble non-profit cause - all installations would collect the currency into a single account. Wasting the work is worse than either of these options.

- An easy way for good crawlers (like the Internet Archive) to authenticate themselves. E.g. TLS client-side authentication, or simply an HTTP request header containing a signature for the request (the signature in the header could be based, for example, on their domain name and the TLS cert for that domain).

  • ronsor 9 months ago

    > Make it generate cryptocurrency, so that the work is not wasted. Either to compensate for the server expenses of hosting the content, or for some noble non-profit cause - all installations would collect the currency into a single account. Wasting the work is worse than either of these options.

    The last time[0] we did this, everyone had a meltdown and blocked it.

    [0] See Coinhive, which conspicuously lacks a Wikipedia page.

  • throwawayEm8UE 9 months ago

    Maybe there should also be a protocol to support paying a small amount of crypto (on the order of 0.01-0.1 cents per request) to bypass the PoW. The bots can still scrape it by paying, but you now price in the externality.

    • avodonosov 9 months ago

      Yes, I was also thinking of this.

      "proof of pay" or "proof of a transaction"

      An advantage of this would be not wasting electricity for the proof of work computation.

  • avodonosov 9 months ago

    As an alternative to cryptocurrency, some other useful computation, like a contribution to BOINC. You only need to find a way to prove the computation.

  • kebokyo 9 months ago

    I’m not sure how you can generate cryptocurrency with just SHA-256 hashes lol. Plus, there’s already a configuration file that lets you determine what you’re looking out for in terms of scrapers. I need to look at it more myself, especially because I plan on using this on my own website soon, but I hope there’s already a good way to whitelist archival sites.

Trung0246 9 months ago

For a no-JS solution, I think some sort of optical illusion could work as a captcha, especially https://en.wikipedia.org/wiki/Magic_Eye or something like https://www.youtube.com/watch?v=Bg3RAI8uyVw, which could cleverly hide the captcha answer within an animated mess of noise.

However, these methods are not really accessibility-friendly.

nikisweeting 9 months ago

Doesn't seem to noticeably slow down my test bot. Headful crawling already takes ~10 sec/page, so an extra 0.5 sec is hardly that big a deal.

  • jeroenhd 9 months ago

    How effective a PoW firewall is depends on how conservatively the worst offenders configure their browser engine resource limits.

    A small scraper may be able to afford the extra CPU cycles, but for something like an AI training bot, sometimes sending hundreds of browser instances at a time, the math becomes different.

    From what I've read about the results, it seems like the approach is effective against the very worst scrapers and bots out there.

  • akoboldfrying 9 months ago

    "Headful"?

pvg 9 months ago

Discussion here https://news.ycombinator.com/item?id=43422929

yjftsjthsd-h 9 months ago

> to stop AI crawlers

It'll do that too, but it's really more of a general-purpose anti-bot, right? A generic PoW-wall.

barlog 9 months ago

When I visited Xe-san's page, I was curious to see the Jackal-chan challenges in action. This is Anubis.

akoboldfrying 9 months ago

Regarding the problem of how to let "good" bots through:

You could use PKI: drop the PoW if the client provides a TLS client certificate chain that asserts that <publicKey> corresponds to a private key controlled by <fullNamesAndAddressesOfThesePeople> (or just by, say, <peopleWhoControlThisUrl>, for Let's Encrypt-style automatable cert signing). This would be a slight hassle for good bot operators to set up, but not a very big deal. The result is that bad bots couldn't spoof good bots to get in.

(Obviously this strategy generalises to handling human users too -- but in that case, the loss of privacy, as well as admin inconvenience, makes it much less palatable.)
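
One way the PKI idea could look as reverse-proxy middleware. This is a hypothetical sketch, not something Anubis does today, and it assumes the server's TLS configuration verifies client certificates against a CA pool the site trusts:

    // Hypothetical middleware: clients that present a certificate chaining to
    // a CA the site trusts skip the proof of work; everyone else gets the
    // challenge. Requires the server's tls.Config to set ClientCAs and
    // ClientAuth: tls.VerifyClientCertIfGiven.
    package goodbots

    import "net/http"

    func SkipPoWForTrustedBots(next, challenge http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // VerifiedChains is non-empty only when the client certificate
            // validated against the configured CA pool.
            if r.TLS != nil && len(r.TLS.VerifiedChains) > 0 {
                next.ServeHTTP(w, r) // declared, verifiable good bot
                return
            }
            challenge.ServeHTTP(w, r) // normal proof-of-work interstitial
        })
    }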

  • areyourllySorry 9 months ago

    Well, good bots usually publish their IP ranges, and that's way simpler.

    • akoboldfrying 9 months ago

      Yes, that's much simpler, but doesn't it mean that every site owner needs to manage a hardcoded set of "places to look" for these published IPs?

      If I want to create a good bot tomorrow, where do I publish its IP addresses? IOW, how can I ensure that the world "notices"?

bno1 9 months ago

What stops a scraper from detecting Anubis and just removing "Mozilla" from the user-agent string?

  • TonyTrapp 9 months ago

    That would allow you to specifically lock out that bot based on its user-agent string. That's the main problem with AI scrapers: many of them normally use user-agents that cannot be easily blocked, so other means have to be found to keep them off your grounds.

  • xena 9 months ago

    It breaks websites when you do that.
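
    A simplified sketch of the gating being discussed in this subthread: clients whose User-Agent contains "Mozilla" get the proof-of-work interstitial, while self-identified non-browsers can be matched and blocked by ordinary rules. This is an illustration of the idea, not Anubis's actual rule engine:

        // Illustration only: browsers send "Mozilla" in their User-Agent, so
        // they get the challenge; a bot that strips "Mozilla" becomes trivial
        // to match (and block) by its now-distinctive User-Agent.
        package botgate

        import (
            "net/http"
            "strings"
        )

        func Gate(challenge, nonBrowserPolicy http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                if strings.Contains(r.UserAgent(), "Mozilla") {
                    challenge.ServeHTTP(w, r) // looks like a browser: prove some work
                    return
                }
                nonBrowserPolicy.ServeHTTP(w, r) // allow-list or block outright
            })
        }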

xg15 9 months ago

It's a great idea, but I fear that if this keeps going viral like it has in the last few days, more bot authors will be motivated to add special handling for it, e.g. changing the user agent to a non-Mozilla one.

Trung0246 9 months ago

The performance on mobile kinda sucks tho; it took like 30 seconds to finish the PoW at difficulty 4 on Firefox for Android. By that time I had to resist the urge to switch away and do something else.

iszomer 9 months ago

This reminded me of an article I printed (yes, on paper) at my college more than 20 years ago, titled Parasitic Computing. I don't remember where it was originally published, but I think I might have stumbled upon it via kuro5hin (maybe); a quick search turned up the publication in Nature (though it's paywalled).

- https://www.nature.com/articles/35091039

Alifatisk 9 months ago

This is like wehatecaptchas.com

ranger_danger 9 months ago

I would say it doesn't prevent anything; it just makes computers warm the planet more.

  • 01HNNWZ0MV43FF 9 months ago

    That crime is on the attackers' hands.

    Hydrocarbons should cost more. Vote for a pollution tax.

  • jeroenhd 9 months ago

    A quick PoW calculation per web session is probably not going to make a dent in pollution now that generative AI has become commonplace.

    It'll suck for battery life if you're often browsing random website after random website, or if you're a bot farm, but in practice I don't think most welcome users will notice a thing.

    • theragra 9 months ago

      Another comment mentions it takes 30 seconds to pass in mobile Firefox, which is obviously too much.

  • akoboldfrying 9 months ago

    > it just makes computers warm the planet more.

    Since that would cost the bad guys money, they won't actually do it much.

kmeisthax 9 months ago

Is there a way to alter text to poison AI training sets? I know there's Glaze and Nightshade for images but I've heard of nothing to poison text models. To be clear, this wouldn't be a defensive measure to stop scraping; it'd be an offensive honeypot: you'd want to make pages that have the same text but mutated slightly differently each time, so that AI scrapers preferentially load up on your statistically different text and then yield a poisoned model. Ideally the scraper companies will realize what's going on and stop scraping.

  • xyzal 9 months ago

    Generate lots (LOTS!) of bad, insecure, or wrong code and link it somewhere on your page (see https://www.emergent-misalignment.com/). Ideally, make your local LLaMA generate more examples like those on the linked site.

    • jeroenhd 9 months ago

      Combine this with bug bounties and you can turn AI leeches into a money making scheme!

  • akoboldfrying 9 months ago

    AFAICT this isn't possible, unless you're OK with showing everyone the poisoned text.

    TTBOMK there's nothing here that "detects botness" of an individual request, because in the limit, that's impossible -- if an attacker has access to many different IPs to make requests from (and many do), then any sequence of bot-generated requests from different IPs is indistinguishable from the same set of requests made by actual living, breathing humans (and vice versa).

    So how does Anubis work against bots if it can't actually detect them? Because of the economics behind them: To justify creating a bot in the first place, you need to scrape a lot of pages, so paying a small electricity cost per page means you will need to pay a lot overall. Humans pay this too, but because we request a much smaller number of pages, the overall cost is negligibly low.

    • imtringued 9 months ago

      You glossed over how it works. Bots don't maintain session cookies to avoid rate limits, so they will have to do the challenge over and over again, whereas humans keep the session cookie and amortise the cost of the challenge over multiple requests.

      • akoboldfrying 9 months ago

        You're right, Anubis sets a week-long cookie:

        > exp: The token's expiry, a week after the token was issued

        This is surprising to me because it effectively nullifies Anubis.

        > Bots don't maintain session cookies to avoid rate limits

        Maybe they don't today, but there's absolutely nothing stopping them from adding this, so if Anubis gets any traction, in two days' time they will.

        • kmeisthax 9 months ago

          You can turn Anubis into Proof of Storage by appending large amounts of random data to the session cookie and then hashing it on each request. Combined with per-session-cookie limits, you can effectively force the LLM bots to store all the data you want if they want to scrape your site.

          Maybe suckerpinch can work it into a sequel to Harder Drive: Hard drives we didn't want or need[0].

          [0] https://www.youtube.com/watch?v=JcJSW7Rprio

          • akoboldfrying 9 months ago

            Great clip! I love the idea of making internet miscreants reluctantly store your data!

            Slightly more seriously though, I think for the Proof of Storage idea to pack enough punch to be a deterrent, you'd need the cookies to be quite large. Is there a way to avoid needing them to send you all the bytes each time? Because that will cost you (the site owner) too.

            I had the idea of sending requestors a challenge like "Flip the bit at position i of the large chunk of stored data, hash the result and include the hash in your headers". Instead of the site owner keeping the full stored data, they would just keep the RNG seed used to generate it -- this saves on storage, though it still requires them to do the same time-consuming hash computation done by the requestor.
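
            A sketch of that seed-only variant: the server stores just a per-session seed, deterministically re-expands it to check answers, and the client must keep the full blob around. The sizes and the expansion scheme are illustrative assumptions, not anything Anubis implements:

                // Sketch: the server stores only a per-session seed and re-expands
                // it on demand; the client must store the full blob to answer.
                package proofofstorage

                import (
                    "bytes"
                    "crypto/sha256"
                )

                // expand stretches a small seed into a large blob by chaining
                // SHA-256, so the server never stores the blob itself.
                func expand(seed []byte, size int) []byte {
                    out := make([]byte, 0, size)
                    block := sha256.Sum256(seed)
                    for len(out) < size {
                        out = append(out, block[:]...)
                        block = sha256.Sum256(block[:])
                    }
                    return out[:size]
                }

                // respond is the client's side: flip the requested bit of the
                // stored blob and hash the result.
                func respond(blob []byte, bitIndex int) [32]byte {
                    flipped := bytes.Clone(blob)
                    flipped[bitIndex/8] ^= 1 << (uint(bitIndex) % 8)
                    return sha256.Sum256(flipped)
                }

                // verify is the server's side: regenerate the blob from the seed
                // and check the answer. This trades the server's storage for CPU,
                // exactly the cost noted above.
                func verify(seed []byte, size, bitIndex int, answer [32]byte) bool {
                    return respond(expand(seed, size), bitIndex) == answer
                }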

        • fc417fc802 9 months ago

          Anubis remains effective because the token is subject to a rate limit and you could additionally limit token lifetime to some maximum number of requests if you wanted to.

          All of these factors (total requests, rate of requests, associated IPs, associated browser fingerprints) tie in to detecting bad players, who should receive more frequent and larger challenges.
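
          A sketch of those per-token limits: cap both the request rate and the total number of requests a single token can spend before the client is sent back to the challenge. The specific numbers are illustrative, not Anubis defaults:

              // Sketch of per-token limits: a rate window plus a lifetime cap,
              // after which the client must solve a fresh challenge.
              package tokenlimit

              import (
                  "sync"
                  "time"
              )

              type usage struct {
                  total      int       // lifetime requests spent on this token
                  windowFrom time.Time // start of the current rate window
                  inWindow   int       // requests inside the current window
              }

              type Limiter struct {
                  mu           sync.Mutex
                  byToken      map[string]*usage
                  maxTotal     int           // e.g. 5000 requests per token, then re-challenge
                  maxPerWindow int           // e.g. 60 requests...
                  window       time.Duration // ...per minute
              }

              func New(maxTotal, maxPerWindow int, window time.Duration) *Limiter {
                  return &Limiter{
                      byToken:      make(map[string]*usage),
                      maxTotal:     maxTotal,
                      maxPerWindow: maxPerWindow,
                      window:       window,
                  }
              }

              // Allow reports whether a request bearing this token may proceed,
              // or whether the client should be sent back to the challenge.
              func (l *Limiter) Allow(tokenID string) bool {
                  l.mu.Lock()
                  defer l.mu.Unlock()
                  u, ok := l.byToken[tokenID]
                  if !ok {
                      u = &usage{windowFrom: time.Now()}
                      l.byToken[tokenID] = u
                  }
                  if time.Since(u.windowFrom) > l.window {
                      u.windowFrom, u.inWindow = time.Now(), 0
                  }
                  if u.total >= l.maxTotal || u.inWindow >= l.maxPerWindow {
                      return false // token exhausted or too fast: re-challenge
                  }
                  u.total++
                  u.inWindow++
                  return true
              }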

    • kotenok2000 9 months ago

      You could make the poisoned text extremely small, so it isn't visible to humans who perceive the web page optically, but it is visible to crawlers.

  • throwawayEm8UE 9 months ago

    I don't think that type of adversarial stuff will work in general. A better idea is to just make the models learn stuff their owners don't want. Maybe include erotica (AI-generated or not), malware code, and other kinds of offensive content to annoy the bot owners.

  • Jotalea 9 months ago

    In theory you could "scramhe teble txt" (scramble the text) and show the proper one to the user, but I don't know how (in)efficient that would be.

  • TUSF 9 months ago

    One idea is to include a bunch of nonsense text into webpages, and use CSS to hide it from browsers. Depending on how aggressively you want to do it, this may affect some accessibility tooling.

drpossum 9 months ago

[flagged]
