Sites scramble to block ChatGPT web crawler after instructions emerge
arstechnica.com
I have two sites that provide documentation for open source libraries I've created, and I definitely won't be blocking ChatGPT. It has already read my documentation and can correctly answer most StackOverflow-level questions about my libraries' use. This is seriously impressive and very helpful, as far as I'm concerned.
From the article:
> For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.
I would rather leave the internet entirely if AI chatbots become a primary user interface.
What's interesting about your statement is that if you knew they were chatbots, then by definition they wouldn't be AI, i.e. they wouldn't have passed the Turing test.
I know what you're saying, and totally agree. Unfortunately the term "AI" is now meaningless.
True enough, but one doesn't need to know whether they are chatbots. In fact, one doesn't even have to have heard of AI/ChatGPT to know that something fundamentally inhuman and strange was going on. If a person woke up from a coma after chatbots had taken over the internet, they would immediately notice something very strange and machine-like about the whole experience.
That's not what AI means. Also, ChatGPT can pass the Turing test often enough.
Complaining about overloading the word "AI" is like complaining about the term "Cloud". Might as well move past it.
Why? That would imply that they are pretty good for most things - better than what we have now.
Absolutely not. That would imply that Chatbots are an optimal solution where the optimality criteria are short-term financial gains for mega-enterprises and the maladaptive instincts of humanity in the unnatural situation of being forced into a highly technological world.
Sugary cereals and desserts have taken over much of snacking today; that doesn't mean it's a good thing.
I wonder if it's worth poisoning the replies for scrapers that don't obey robots.txt. Send back nonsense, lies, and noise. This would be an adversarial approach like https://adnauseam.io/ uses for ad tracking.
If you (or others) come up with a way to build a system to poison AI/LLM/other models to make them useless, count me in to help.
I’d imagine this would work best via illegal methods, such as mass-hacking websites and inserting the appropriate poison.
Years ago I came across an email crawler trap: if a bot was unfortunate enough to stumble into it, it would be fed an endless, nested tree of pages full of randomly generated garbage email addresses. It was just a bit of PHP, but I wouldn't be surprised if you could do the same thing here: serve text the LLM takes for real content, when it's just randomly generated garbage.
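A minimal sketch of that kind of trap, assuming Flask and a made-up /trap/ path (everything here is illustrative, not anything from the article; you'd also Disallow the path in robots.txt so only misbehaving crawlers ever reach it):
```python
# Sketch of a crawler trap: every page under /trap/ is part of an endless tree
# of deterministic, randomly generated nonsense.
import random
import string

from flask import Flask

app = Flask(__name__)

def garbage_paragraph(rng: random.Random, words: int = 80) -> str:
    """Return a paragraph of lowercase nonsense words."""
    return " ".join(
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
        for _ in range(words)
    )

@app.route("/trap/", defaults={"node": "root"})
@app.route("/trap/<path:node>")
def trap(node: str) -> str:
    # Seed the RNG with the URL path so the same URL always serves the same
    # page; to a crawler the trap looks like ordinary static content.
    rng = random.Random(node)
    body = "".join(f"<p>{garbage_paragraph(rng)}</p>" for _ in range(3))
    links = "".join(
        f'<p><a href="/trap/{node}/{rng.randint(0, 10**9)}">more</a></p>'
        for _ in range(5)
    )
    return f"<html><body>{body}{links}</body></html>"

if __name__ == "__main__":
    app.run()
```
Each page links to five deeper pages, so a crawler that follows everything never runs out of garbage to read.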
> blocking GPTBot will not guarantee that a site's data does not end up training all AI models of the future. Aside from issues of scrapers ignoring robots.txt files, there are other large data sets of scraped websites (such as The Pile) that are not affiliated with OpenAI.
This is why I'm not reassured. robots.txt isn't sufficient to stop all web crawlers, so there's every reason to think it isn't sufficient to stop AI scrapers.
I'm still looking for a good solution to this problem so that I can open my sites up to the public again.
I think bots are part of the public.
OK, then pretend that I said "open my site up to the human public", instead.
There's never going to be a perfect solution, it's an arms race. I really doubt (hope?) that large entities are going to straight up emulate end-user browsers though.
I would think filtering based on user agent will be the sweet spot for effort and performance. You could do some awful JavaScript monstrosity to detect the tiny fraction of bots who are sneaky, but if they're determined to be sneaky they will succeed at scraping.
User agent matching isn't good enough. The stakes are high -- all it takes is one AI crawler to grab my site data, and that data is included in the training forevermore.
> if they're determined to be sneaky they will succeed at scraping.
Yes, which is why I suspect I will never be able to open my websites up to the general public again. I live in hope anyway.
Browsers aren't really trusted platforms; the cool scraping is in emulating phones, whether that means actually running a virtual phone or sending traffic that emulates one.
Really just encourages phones to be even more locked down
I chose to use an nginx rule, because I also don't trust them to follow robots.txt. Throwing a 410 Gone should theoretically keep them from coming back too, assuming they actually eject the site when they receive it, like they should.
`if ($http_user_agent ~* "(GPTBot|AI)") { return 410; }`
It's not perfect, but it should filter them indefinitely; I'll probably have to add some more terms in there over time.
That's relying on the user agent, though. That's not a trustworthy enough signal for me. For one, crawlers can use any user agent string they like. For another, I don't know what all the possible user agent strings are.
This gives the illusion of being in control. If enough people block the bot, they'll just scrape differently (if they don't already), because too much money is at stake: more than whatever fine they might face if they got caught and couldn't settle out of court. Not to mention that by then they may consider it someone else's problem.
It's more pragmatic to expect that any data that can be accessed one way or another will be scraped because interests aren't aligned between content authors and scrapers.
On the other hand, robots.txt benefited both search engines and content authors because it signaled data that wasn't useful to show in search results, so search engines had an incentive to follow its rules.
I think a strategy that might punish such misbehaved crawlers is trapping them (via invisible hrefs, perhaps, like the one sketched below?) and feeding them truckloads of garbage, disinformational, esoteric, or plain wrong text.
Or perhaps all crawlers, regardless of whether they respect robots.txt. Honestly, I am not interested in improving some FAANGish algorithm with blog posts intended for my friends.
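A rough example of the bait (the /trap/ path is just a placeholder; a matching Disallow line in robots.txt keeps well-behaved crawlers out, so only scrapers that ignore both conventions fall in):
```html
<!-- Hypothetical bait: humans never see this link, well-behaved crawlers are
     told to stay away via robots.txt, so only misbehaving scrapers follow it
     into the garbage generator. -->
<a href="/trap/start" style="display:none" rel="nofollow" aria-hidden="true">archive</a>
```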
blocked it on every single site I manage
there is zero benefit to me in allowing OpenAI to absorb my content
it is a parasite, plain and simple (as is GitHub Copilot)
and I'll be hooking in the procedurally generated garbage pages for it soon!
How do you assess which robots benefit you?
What if it becomes the next Google? Are you really sure you want to be removed from their index?
Not the OP, but I already blocked it with my robots.txt. I am 100% sure I want to be removed from their index, even if they become "the next Google". I would rather have my website fade away into obscurity than increase the usefulness of their AI or any other AI model.
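For reference, the block OpenAI documents for its crawler is just two lines of robots.txt:
```
User-agent: GPTBot
Disallow: /
```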
If something happens, then you can do something about it.
In this particular case, if enough people block ChatGPT scraping, then it cannot become the next Google. Most notably, I imagine all commercial news organizations will block it, because they need people to visit their actual websites to pay for the news they put up. And it will remain that way until it can be demonstrated that ChatGPT drives more traffic to a website than it redirects away from it. The Microsoft chat in Edge is much closer to that, in the way its summaries include clickable quotes from articles.
As it is, as GP says, there would be 'zero benefit' to GP as a content author in being included in that new Google.
You are promoting FOMO.
The article does not say whether it obeys `User-agent: *`. My guess is that, if it doesn't respect that, it doesn't truly respect `User-agent: GPTBot` either.
I've been reading lots of datasheets and application notes in the embedded space recently. Most of these are only accessible after creating a (free) login. In one sense, it's a reasonably simple way to prevent scraping like this (at least until the AI-based scrapers can generate their own logins). On the other hand, a lot of that kind of material would be _really_ useful to be able to ask an LLM about.
For anyone reading this: you can skip the robots.txt, because as others have pointed out, who knows if they will actually listen to it.
Instead, use a redirect or return a response code by doing a user agent check in your server config. I posted elsewhere in this thread about the way I did it with nginx.
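If you're on Apache instead, a rough equivalent would be a mod_rewrite rule (a sketch, assuming mod_rewrite is enabled; it only matches the published GPTBot UA):
```
# Return 410 Gone to anything identifying itself as GPTBot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [G]
```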
...You realize user agent is operator set, right?
If they won't respect robots.txt, they aren't interested in your consent.
Respecting robots.txt and setting the UA are two different things. And yes, I know the UA can be set to anything; however, the UA has been published, and under normal circumstances it shouldn't change drastically for a lot of these scrapers.
Respecting robots.txt has nothing to do with what the UA is set to. Yes, you can say in robots.txt that a given UA may do x, but if the crawler doesn't respect that, it's moot.
The method I put in place does not use robots.txt, so there's no need to worry about them not respecting it anymore.
As someone else mentioned, like the world of spam, it's an arms race. The solution may not be perfect, but it's functional.