Sites scramble to block ChatGPT web crawler after instructions emerge
arstechnica.com
I have two sites that provide documentation for open source libraries I've created, and I definitely won't be blocking ChatGPT. It has already read my documentation and can correctly answer most StackOverflow-level questions about my libraries' use. This is seriously impressive and very helpful, as far as I'm concerned.
From the article:
> For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.
I would rather leave the internet entirely if AI chatbots become a primary user interface.
What's interesting about your statement is that if you knew they were chatbots, then by definition they wouldn't be AI, i.e. they wouldn't have passed the Turing test.
I know what you're saying, and totally agree. Unfortunately the term "AI" is now meaningless.
True enough, but one doesn't need to know whether they are chatbots. In fact, one doesn't even have to have heard of AI/ChatGPT to know that something fundamentally inhuman and strange was going on. If a person woke up from a coma after chatbots had taken over the internet, they would immediately notice something very strange and machine-like about the whole experience.
That's not what AI means. Also, ChatGPT can pass the Turing test often enough.
Complaining about overloading the word "AI" is like complaining about the term "Cloud". Might as well move past it.
Why? That would imply that they are pretty good for most things - better than what we have now.
Absolutely not. That would imply that Chatbots are an optimal solution where the optimality criteria are short-term financial gains for mega-enterprises and the maladaptive instincts of humanity in the unnatural situation of being forced into a highly technological world.
Sugary cereals and desserts have taken over much of snacking today; that doesn't mean it's a good thing.
I wonder if it's worth poisoning the replies for scrapers that don't obey robots.txt. Send back nonsense, lies, and noise. This would be an adversarial approach like https://adnauseam.io/ uses for ad tracking.
If you (or others) come up with a way to build a system to poison AI/LLM/other models to make them useless, count me in to help.
I’d imagine this would work best via illegal methods, such as mass-hacking websites and inserting the appropriate poison.
Years ago I came across an email crawler trap: if a bot was unfortunate enough to stumble into it, it would be fed an endless, nested tree of pages full of randomly generated garbage email addresses. It was just a bit of PHP, but I wouldn't be surprised if you could do the same thing here: serve text the LLM takes for real content, when it's just randomly generated garbage.
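A minimal sketch of that kind of trap, assuming Flask and a made-up /trap/ path (everything here is illustrative, not anything from the article; you'd also Disallow the path in robots.txt so only misbehaving crawlers ever reach it):
```python
# Sketch of a crawler trap: every page under /trap/ is part of an endless tree
# of deterministic, randomly generated nonsense.
import random
import string

from flask import Flask

app = Flask(__name__)

def garbage_paragraph(rng: random.Random, words: int = 80) -> str:
    """Return a paragraph of lowercase nonsense words."""
    return " ".join(
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
        for _ in range(words)
    )

@app.route("/trap/", defaults={"node": "root"})
@app.route("/trap/<path:node>")
def trap(node: str) -> str:
    # Seed the RNG with the URL path so the same URL always serves the same
    # page; to a crawler the trap looks like ordinary static content.
    rng = random.Random(node)
    body = "".join(f"<p>{garbage_paragraph(rng)}</p>" for _ in range(3))
    links = "".join(
        f'<p><a href="/trap/{node}/{rng.randint(0, 10**9)}">more</a></p>'
        for _ in range(5)
    )
    return f"<html><body>{body}{links}</body></html>"

if __name__ == "__main__":
    app.run()
```
Each page links to five deeper pages, so a crawler that follows everything never runs out of garbage to read.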
> blocking GPTBot will not guarantee that a site's data does not end up training all AI models of the future. Aside from issues of scrapers ignoring robots.txt files, there are other large data sets of scraped websites (such as The Pile) that are not affiliated with OpenAI.
This is why I'm not reassured. robots.txt isn't sufficient to stop all web crawlers, so there's every reason to think it isn't sufficient to stop AI scrapers.
I'm still looking for a good solution to this problem so that I can open my sites up to the public again.
I think bots are part of the public.
OK, then pretend that I said "open my site up to the human public", instead.
There's never going to be a perfect solution, it's an arms race. I really doubt (hope?) that large entities are going to straight up emulate end-user browsers though.
I would think filtering based on user agent will be the sweet spot for effort and performance. You could do some awful JavaScript monstrosity to detect the tiny fraction of bots who are sneaky, but if they're determined to be sneaky they will succeed at scraping.
User agent matching isn't good enough. The stakes are high -- all it takes is one AI crawler to grab my site data, and that data is included in the training forevermore.
> if they're determined to be sneaky they will succeed at scraping.
Yes, which is why I suspect I will never be able to open my websites up to the general public again. I live in hope anyway.
Browsers aren't really trusted platforms; the cool scraping is in emulating phones, whether that means actually running a virtual phone or sending traffic that emulates one.
Really just encourages phones to be even more locked down
I chose to use an nginx rule, because I also don't trust them to follow robots.txt. Throwing a 410 Gone should theoretically keep them from coming back too, assuming they actually eject the site when they receive it, like they should.
`if ($http_user_agent ~* "(GPTBot|AI)") { return 410; }`
It's not perfect, but it should filter them indefinitely; I'll probably have to add some more terms in there over time.
That's relying on the user agent, though. That's not a trustworthy enough signal for me. For one, crawlers can use any user agent string they like. For another, I don't know what all the possible user agent strings are.
This gives the illusion of being in control. If enough people block the bot, they'll just scrape differently (if they don't already), because too much money is at stake: more than whatever fine they might face if they got caught and couldn't settle out of court. Not to mention that by then they may consider it someone else's problem.
It's more pragmatic to expect that any data that can be accessed one way or another will be scraped because interests aren't aligned between content authors and scrapers.
On the other hand, robots.txt benefited both search engines and content authors because it signaled data that wasn't useful to show in search results, so search engines had an incentive to follow its rules.
I think a strategy that might punish such misbehaved crawlers is trapping them (via invisible hrefs, perhaps, like the one sketched below?) and feeding them truckloads of garbage, disinformational, esoteric, or plain wrong text.
Or perhaps all crawlers, regardless of whether they respect robots.txt. Honestly, I am not interested in improving some FAANGish algorithm with blog posts intended for my friends.
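A rough example of the bait (the /trap/ path is just a placeholder; a matching Disallow line in robots.txt keeps well-behaved crawlers out, so only scrapers that ignore both conventions fall in):
```html
<!-- Hypothetical bait: humans never see this link, well-behaved crawlers are
     told to stay away via robots.txt, so only misbehaving scrapers follow it
     into the garbage generator. -->
<a href="/trap/start" style="display:none" rel="nofollow" aria-hidden="true">archive</a>
```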
blocked it on every single site I manage
there is zero benefit to me in allowing OpenAI to absorb my content
it is a parasite, plain and simple (as is GitHub Copilot)
and I'll be hooking in the procedurally generated garbage pages for it soon!
How do you assess which robots benefit you?
What if it becomes the next Google? Are you really sure you want to be removed from their index?
Not the OP, but I already blocked it with my robots.txt. I am 100% sure I want to be removed from their index, even if they become "the next Google". I would rather have my website fade away into obscurity than increase the usefulness of their AI or any other AI model.
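For reference, the block OpenAI documents for its crawler is just two lines of robots.txt:
```
User-agent: GPTBot
Disallow: /
```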
If something happens, then you can do something about it.
In this particular case, if enough people block ChatGPT scraping, then it cannot become the next Google. Most notably, I imagine all commercial news organizations will block it, because they need people to visit their actual websites to pay for the news they put up. And it will remain that way until it can be demonstrated that ChatGPT drives more traffic to a website than it redirects away from it. The Microsoft chat in Edge is much closer to that, in the way its summaries include clickable quotes from articles.
As it is, as GP says, there would be 'zero benefit' to GP as a content author in being included in that new Google.
You are promoting FOMO.
The article does not say whether it obeys `User-agent: *`. My guess is that, if it doesn't respect that, it doesn't truly respect `User-agent: GPTBot` either.
I've been reading lots of datasheets and application notes in the embedded space recently. Most of these are only accessible after creating a (free) login. In one sense, it's a reasonably simple way to prevent scraping like this (at least until the AI-based scrapers can generate their own logins). On the other hand, a lot of that kind of material would be _really_ useful to be able to ask an LLM about.
For anyone reading this: you can skip the robots.txt, because as others have pointed out, who knows if they will actually listen to it.
Instead, use a redirect or return a response code by doing a user agent check in your server config. I posted elsewhere in this thread about the way I did it with nginx.
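If you're on Apache instead, a rough equivalent would be a mod_rewrite rule (a sketch, assuming mod_rewrite is enabled; it only matches the published GPTBot UA):
```
# Return 410 Gone to anything identifying itself as GPTBot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [G]
```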
...You realize user agent is operator set, right?
If they won't respect robots.txt, they aren't interested in your consent.
Respecting robots.txt and setting the UA are two different things. And yes, I know the UA can be set to anything; however, the UA has been published, and under normal circumstances it shouldn't change drastically for a lot of these scrapers.
Respecting robots.txt has nothing to do with what the UA is set to. Yes, you can say in robots.txt that a given UA may do x, but if the crawler doesn't respect that, it's moot.
The method I put in place does not use robots.txt, so there's no need to worry about them not respecting it anymore.
As someone else mentioned, like the world of spam, it's an arms race. The solution may not be perfect, but it's functional.