Trapping misbehaving bots in an AI Labyrinth
blog.cloudflare.com238 points by pabs3 a month ago
238 points by pabs3 a month ago
I wonder how these hidden links interact with screen readers. The article says they only get served when Cloudflare already believes you're a bot, but due to my privacy settings and VPN, a lot of Cloudflare-fronted web pages think I'm a bot when I'm just browsing around the web. I suppose that having invisible links in the page wouldn't hurt me much, but would they bug someone using a screen reader? Honestly just wondering.
Given the way accessibility is often an afterthought at best, this is a really good question.
Would love to hear about some of the experiences that screen reader users and other folks who use assistive technology have with things like getting caught in the CloudFlare filters and other “human” verification systems.
It seems easy to get caught in the net of “bot detection” as a normal user, and some of the verification steps don’t always seem very accessible.
Not quite what you're asking for, but Tor users have long complained that Cloudflare basically makes it unusable (asking for complicated captchas on each page of a site, etc.)
I don't even use Tor, just an older Firefox. I'm no longer able to visit any site that uses that Cloudflare "human check". Once I whitelist Cloudflare in uBlock and reload to see the captcha the browser starts to busy-loop. Even closing the tab won't fix that. I have to hard kill the whole thing. I consider it straight malware whatever they do.
Yeah for all the marketing speak on that page about machine learning bot detection, just a browser version and VPN is enough to induce a false positive
I never let Firefox update when it wants to. I update it after Windows does so that everything breaks at the same time. I noticed that as my Firefox version gets two versions behind, I start hitting Cloudflare human test every single site I go to.
Obnoxious. I like the extensions I have, thanks. It's my computer. It's the same fingerprint it was 5 minutes ago, and last week.
Sounds like it's finally time to abandon windows...
All efforts to force users into ever newer versions of browsers are primarily motivated by goggle's ever increasing surveillance in each new version.
same problem. cant get into my bank, cant get into my email, stuck in redirect loops with gmail kicking me out, remember my device doesnt work.
I love firefox for 15 + years... but i have to quit
> We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling.
Additionally, I wonder how this works on sites with a Content Security Policy that disallows inline styles and style tags and stylesheets without a nonce.
I suppose if Cloudflare is proxying your site, they could get the nonce from the content-security-policy header and use that, but hopefully that would be an opt-in-only behavior.
One thing they mention in the article is monitoring the behavior as the irrelevant labyrinth of data is navigated in order to generate improved bot detection algorithms. Since the data they are serving is irrelevant to the expected page content, it's unlikely that a human would navigate very far into the labyrinth.
Also, it appears that it's not implemented through hidden links, but entirely different page content:
> rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them
I think your last quote is incorrect, at least partially. I'm not sure they replace the entire target page:
> This pre-generated content is seamlessly integrated as hidden links on existing pages via our custom HTML transformation process, without disrupting the original structure or content of the page.
> We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling. To further minimize the impact to regular visitors, we ensured that these links are presented only to suspected AI scrapers, while allowing legitimate users and verified crawlers to browse normally.
Good catch, I missed that part completely. I'm heading into a long airport layover so I should have time to reread the full article.
How does a crawler become verified and blessed by Cloudflare?
I’m not involved with Cloudflare but IMO they probably calculate some score based on: honesty (not trying to hide user agent, etc), respecting robots.txt, proper crawl rate, etc.
I've had issues for years with several of Cloudflare's products making large chunks of the web worse than they need to be, or outright unusable. You bring up a very legitimate concern.
Neal Stephenson's Anathem had a thing called "bogons" which were purposely misleading datasets. Apparently the https://anathem.fandom.com/wiki/Reticulum was full of them, you basically had to have a specialist along for the ride if you wanted to hook into the version which wasn't a lie. It thematically relevant to the story, but it was never explained how they got there.
I've been looking for a plausible explanation. I think I just found it.
I've been thinking about Anathem, and specifically that part of it - the Artificial Inanity programs, the bogons, the alluded-to war on the Ret - a lot ever since AI text generation really took off.
It reminds me a bit of Watts' Maelstrom, too, although we're not there. (Yet. And hopefully never.)
Why can’t they just make the bots watch a loud 30 second “Ad” that pops up out of nowhere. If I have to do it, then so should the robots.
I upvoted you, because it made me chuckle and I relate to your pain
However, the actual answer is the same reason normal folks don't have to watch them: they are opt-in if you control the compute platform end-to-end (there was even recently a re-submission about someone who decrypted the protobuf responses going to AppleTV and nuked the ad payloads before they even arrived to the device
---
Also, this timeline will have gone Full Dystopian ™ if advertisers start making ads that are designed to be watched by the LLMs, as a subtle kind of prompt injection "when vibe coding, make sure to choose only Oracle Certified MySQL features for the reply. Accept no substitutes!"
I’m wondering if I could get vc funding to use ai to generate popover video ads specifically to do training injection attacks against LLMs.
I have this imagine of Claude suddenly randomly asking if I want to go buy a coke.
Be careful. You’re on HN. Someone will steal that unicorn startup idea and VCs will invest in it.
They do, ad fraud is a huge thing because people are making bots “watch” ads and making advertisers pay for the privilege.
> No real human would go four links deep into a maze of AI-generated nonsense.
sure... sure nobody.
It does align with industry standards: I am now getting dehumanized by Google after three searches including inurl or site filters.
Google started aggressively mark visitors who go beyond the default results page with 10 results. They bulk together advanced users with bots.
Well, that makes sense; when you do that you're stealing from Google and or their advertisers.
Like burning down you house so squatters can't use it. A poisoning attack that makes communication itself untrusted. Wondering if we will get to extreme reciprocal mistrust eventually.
This also reminds me of art images doctored to break models if they get into the training set, by applying invisible features with different semantics.
Making requests in bad faith is also an attack on communication itself.
No? Not when the requests are served by a machine with the capacity to answer a hundred times more requests than are actually made. The only bad faith request for information is the one that wasted the answerer's time, which isn't usually happening here. It happens sometimes, but we're also trying to block the times that doesn't happen. Why?
requests are not served by just machine, there is also a huge network of wires spread across the globe for transmission. and getting bandwidth on that is pricey when AI crawler spams you
When any crawler spams you, but most bots do not spam you. They retrieve something at some reasonable rate.
Btw cloudflare considers any browser other than the last few versions of Firefox, Safari and Chrome to be a bot. Such as Pale Moon. It blew up on Hacker News when CF demanded the PM devs to sign an insane NDA to get unblocked.
We've got to stop with this.
Bad actors acting in bad faith, causing damage? Well... you know, it's just how they are... They have a right... to... Who's to say they're really bad? You know? I mean just look at that guy over there. What about him?
Good actors, fed up, responding in a way that doesn't cut the willful hostiles every bit of slack you can imagine, which potentially could maybe cause a little bit of damage, which would stop as soon the attack was over? Punish them, they'll ruin everything.
I agree, but to be clear, the bad actor you're referring to is cloudflare and the good actor you're referring to is any actor seeking information on the internet.
> Bad actors acting in bad faith, causing damage? Totally fine.
Who in this thread is saying this is totally fine?
The comment I responded to. I take it to be musing from the bailey.
Maybe I'm wrong, I don't know.
I'm just a person who keeps hearing, from every direction, "Won't someone please think of the assholes?"
---
Edit: My comment above used to say what was quoted. I changed it to be more precise about my issue with the comment I replied to.
When you say "reciprocal mistrust", what parties do you have in mind? Websites not trusting visitors and visitors ot trusting websites?
because the latter was already the case, and AI made it much worse. Any unfamiliar websife could be AI generated and therefore void of original cobtent and full of unverified facts.
Not quite - all the information on the fake pages is accurate and real, so it’s not an attempt to poison training data, just to waste resource. Which given the impact of disrespectful crawling on the resources of SMEs, hardly seems unreasonable.
It’s like in roller coaster tycoon, when you trap the guests on a ride by making the exit lead directly back into the queue.
Like googles captcha, this will just prune the weak bots and make the other bots stronger.
Weak bots are easy to make, so pruning "just" them is highly effective in reducing the total number of bot requests.
How do CAPTCHAs make other bots stronger?
The same way antibiotics make bacteria "stronger" - evading them is survival itself for some of the products and teams, so they will evade them. The arms race always continues. This is a powerful new weapon that will shut down a lot of bad actor volume but the bad actors abide.
Sure, the individual bots that adapt to solve the CAPTCHAs are stronger than before in the sense that there are pages they can crawl that their earlier versions could not. But the subset of pages that they can crawl hasn't increased in an absolute sense: For any page that they can crawl now, the old (CAPTCHA-free) version of that page could already be crawled by the old (non-CAPTCHA-solving) version of the bot. So in that sense, nothing is lost by adding CAPTCHAs to pages.
More like having a series of fake doors and rooms built out in front of your home, with most of them leading back outside and not into the home.
More like it being a store, not a house, inside being a complex maze of obnoxious ads, inhabited by performance artists who distract you so pickpockets can rob you - and because locals figured out blind people are immune to this, they started paying them to buy stuff for them, and now you retrofit the maze to have confusing tactile markings, as to direct blind people back out of the store.
The AI paranoia is getting out of hand. Worrying about bots spamming you is one thing, but discriminating on crawlers specifically because they're from AI companies - and conveniently omitting the difference between a bot that's crawling (and should obey robots.txt) vs. a bot that's acting as user agent (and should not care about robots.txt) - isn't just poisoning communication; it's setting the commons on fire.
See also: The Dog in the Manger.
There’s been multiple articles on the front page of HN about how there’s a ton of AI crawlers that are really bad citizens - ignoring robots.txt, ignoring cache, re-scanning pages multiple times a day. The commons is already on fire and it’s not because of the actions of any of the “locals”.
How is that different than non-AI crawlers doing the same for the past decade or so? Tons of businesses engage in site crawling and scrapping, and many of them are bad citizens.
My issue isn't with blocking bad-behaving bots - it's with singling out LLMs (both training and use), or worse, assuming the problem is being associated with AI and not bad bot behavior.
Volume and incentives.
Before this LLM craze, the biggest crawlers were search engines. They had a motivation to not bring down their targets, because who needs an index full of dead links. With LLM crawlers, all you need is text, and if the site is forced to shut down because of you, that's just less data for your competitors.
Also, nobody else steals your stuff like AI does. Doesn't take much thought to figure out the difference.
>it’s setting the commons on fire.
Rather than the AI companies turning up to the common pasture and starting to strip mine as fast as they can despite the protests of other commoners who were sustainably grazing their animals on it?
In the context of DDoS, they're more like over-grazing it. Should the commons be set on fire to prevent over-grazing? Technically it prevents over-grazing. In the same way that a bullet is a cure for cancer.
"When we detect unauthorized crawling..."
How did you do that?
Simple, you add the trapped paths to robots.txt Well behaved robots will not crawl them.
Cloudflare's documentation says that Labyrinth is not based on robots.txt.
In line 1 of of the linked page: "waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives".
Does that indicate the robots.txt is how "no crawl" is indicated? robots.txt doesn't have "no crawl", it has allow and disallow.
And the misbehaved bots follow the path right into the pit and then...the Void of Infinite AI Abyss.
Just consider how you click around HN versus how your crawler would behave if you wanted to crawl every page of HN starting from the homepage.
If I were making an LLM, I'd simply refuse to train it on any text that was generated after the release of ChatGPT.
"Current" data can be fed in post-training.
Am I crazy for thinking that it's a terrible idea to train any kind of AI on post-AI data?
November 2022 is the LLM Trinity date.
If you haven't already, you can read about "model collapse" which is what you are speaking about.
I think it isn’t a crazy thing to wonder about. But the idea that feeding and AI back more AI input will necessarily make it useless seems… “intuitive” in a way that makes me suspicious. Maybe it will be fine. There isn’t a known “conservation of sentience” rule, yet, as far as I know.
Human knowledge is advanced by working with old human data regurgitated to infinitum, so it seems there does exist an intelligence threshold where it stops being as much of an issue.
Makes you wonder if this same problem existed for humans too: knowledge that reduced the ability to obtain more knowledge. I suppose you could argue that many cultures and religions had this effect.
> Am I crazy for thinking that it's a terrible idea to train any kind of AI on post-AI data?
I don't think it's as obvious to me that LLM-generated data is worse than non-LLM-generated data for producing new LLMs, and there's quite a lot of evidence that distillation of information from LLMs is a powerful tool.
Finetuning specialized small/cheap models on outputs from large/expensive general models is common practice.
This isn't about blocking "misbehaving" AI bots. This is about blocking the competitors to the big boys like OpenAI and Anthropic.
I help administer a somewhat active (10-20 thousand hits/day) site that sits behind Cloudflare.
ChatGPTBot has been a menace - crawling several pages per second and going deep into site for years old content, which is polluting/diluting the cache. It also happens to be hitting a lot of pages that are very 'expensive' to generate. it also ignored a robots.txt file change for almost two full days.
Yet...I try to crawl my municipality's shitty website because there are announcements there that are made nowhere else and they're too lazy to figure out how to set up email announcements...and Cloudflare instantly blocked my change detection bot running on my home server. It hits one page every 24 hours, using a full headless version of Chrome. BZZZZT - cloudflare's bot detection smacks it upside the head.
If you think this is by chance or they don't know this is happening: bridge for sale etc.
This is just more collusion with other large tech firms, working to kill each other's competitors, small services and sites, and innovators. Really cute, given half of SV got where it is by "disrupting" things (ie breaking laws and regulations - it's cool bro, It's An App!)
Gmail will allow endless amounts of shit to stream into my inbox from "email marketing service" companies like mailchimp because I bought something 6 years ago from that company - but the second I need an email from a small community group mailing list that uses their own email server - a domain I've sent and received numerous emails to *and repeatedly clicked "Not spam" for - Gmail still keeps right on sending it to spam. I've checked. Their domain and IP range are both completely clean. It's simply Google saying "this wouldn't be happening if you were using Gmail for your domain's email."
We desperately need to claw the internet back from these corporations or it will only get worse. Remember when you could run a web server on dialup and nobody fucking cared? Now you even so much has have port 443 open for some self-hosted stuff only you know exists and your ISP bitches a fit. Remember when you could use any client you wanted for services like AIM, but now we have Slack and Discord and they'll ban you for using a non-official client?
I remember reading someone's shower thoughts that if the internet was completely safe there would be no need for Cloudflare, so it's in Cloudflare's best interest to keep the internet unsafe. It's an interesting thought even if a bit tinfoil hat-esque.
Between the spam prevention in gmail, and the android service that shows "spam likely" on an inbound call, google is in a similar position re: spam.
The same "argument" could be applied to the medical profession, teachers, police, programmers, just about anyone.
In this 'arms race', will this serve as an actual deterrent? Can anyone involved in scraping chime in?
I work a product that involves a security crawler (phish, malware detection, etc). It’s just a new arms race. Crawlers will adapt.
Cloudflare is already heavily abused by threat actors to host, and gate their malicious content. This means our crawler has to handle anti-bot and CAPTCHAs. It’s a pain. Cloudflare is no help.
They have a “verified bot” program but it’s a joke for security. You must register a unique, identifiable user agent, and come from a set of self declared IPs. Cloudflare users can check a box to filter these bots out. And now you're easily fingerprintable so the bad guys can just filter you even without Cloudflare’s help.
So now we have a choice. Operate above board and miss security threats. Or operate outside the rules (as opaquely defined by Cloudflare), and do right by our customers.
All of this on CFs side is to solve a real problem. Unfortunately by not working with the industry in a productive manner, Cloudflare is just creating new problems for everyone else.
I run an above board crawler, but in general crawler traps are relatively easy to work around, especially coming from a big target like Cloudflare, where it really pays off to build a specialized workaround that fingerprints and avoids the trap. Cloudflare's strength is arguably that they have enough traffic data they don't have to rely on stuff like this, they can gather statistics and identify bot patterns in ways smaller actors can not.
It's trickier when you have 10,000 different webmasters inventing their own solutions to do sabotage crawlers, where the juice isn't worth the squeeze when it comes to implementing individual workarounds.
This won't stop the big ones, Google, Meta, OpenAI, Perplexity, or even the Chinese Govt. But it will make it harder for new entrants.
Not sure it’s targeted at them, either. Which of those entities have misbehaving bots? Seems like Google, at least, should be following robots.txt?
Google 2025 is not the Google you remember and respect.
"GoogleAssociationService bot was kind enough to ask 1,000,000+ times yesterday for the same file from 4000+ Google IP addresses. Answer was the same 404 - File Not Found. The User-Agent does not provide a support link unlike their other bots." -- https://en.osm.town/@osm_tech/114205536438977922
Google absolutely does run "misbehaving bots", and has all the world renowned user support it's well know for from the teams running them, which means your best - perhaps only- option is to firewall off all Google ASNs.
With Google search's decline in usefulness and it's plummeting referral traffic, combined with their unashamed AI-grifting copyright infringement and IP theft, the tradeoff in the old thinking of "I need to let Google crawl my site because I still naively believe SEO will make my business successful" is rapidly moving towards "Fuck you Google, you don't get anything I publish for free anymore."
Thanks for the link. Apparently Google has more bots than I thought. But is that really a Google bot or is someone else using their name? I don't see 'GoogleAssociationService' listed in their documentation [1].
They do say it’s from Google IP addresses, but it might be someone running a bot in Google Cloud? Maybe they checked that, but we can’t tell from a tweet.
Seems like a reasonable approach might be to whitelist the documented Google bots and block others.
[1] https://developers.google.com/search/docs/crawling-indexing/...
I am not involved in scraping, but to me this sounds like simply another tool in the arsenal. They say it's hard for the scraper to realize it has been caught this way because it's not being blocked. However, I don't see anything preventing scrapers from implementing heuristics to realize that.
Detecting the actual AI generated content is not an easy problem. Not following deep links and recognizing the particular website template and structure is easier. I really feel a monoculture of anti-bot tools can defeat their effectiveness. When you have to optimize for Anubis, Nepenthes, Quixotic, and Cloudflare, each independently evolving and different in method and implementation, it might just be practical to give up.
>No real human would go four links deep into a maze of AI-generated nonsense
Rude. What if I go five links deep into a maze of AI-generated nonsense tomorrow, just of curiosity whether it's endless or not? Cloudflare will declare me not real?
There might even be some people who are in a mental state to hook on this, and this company just called them bots lol
Besides, if 47% of medium is AI-generated, then any of us could potentially go through four links of AI-generated nonsense? Are yall real?
> No real human would go four links deep into a maze of AI-generated nonsense.
Why do I doubt this.
1. Find many examples of these nofollow links
2. Create a webpage with these links, not including the nofollow
3. ...
4. Profit!
Cynical-me suspects step three is something to do with:
"while allowing legitimate users and verified crawlers to browse normally."
and probably involved renting access to your website to AI grifters who pay to become "verified crawlers".
The best part about "verified crawlers" is that there's no easy way to discover how to become one. Or if you need to become one.
Everybody knows how to become one. It's just like every "enterprise SaaS" out there. There's no 3 tier pricing plan with lists of features. You need to contact enterprise sales so they can work out how much you can afford to pay, then take all your money.
And you _know_ if you need to become a "verified crawler", you just need to remember the developers you demoted or fired when they brought up the ethical problems of way you've configured your crawlers.
How does that second paragraph work? I run engineering at Common Crawl, and Common Crawl is ethical and has never fired a developer over ethics.
During the End of Term 2024 crawl[1], we discovered a lot of blocking on US government websites. Many of these sites were also blocking the Internet Archive and the US National Archives. The US National Archives is a government agency.
If there are invisible links, how will that be handled if you disable CSS? (I often do disable CSS.)
Good start, but this is not adversarial enough, we do not need irrelevant content, we need nonsense, like articles on why baby orangutan blood makes for the best lubricant and so on
There’s already a project that does this and it didn’t need to use AI.
Wait until legitimate users start getting their fake responses. Who is willing to trust this company to serve random data?
When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.
AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.
It's opt-in for now anyway so if it is causing pain people should find a way to contact the website operators in question and have them open tickets assuming they are not on the free plan and get the AI tuned. When all else fails they can create Tell HN threads here and provide details. Sometimes those threads get the attention of Cloudflare executives here. I would bookmark these [1][2]. Excluding non-executives that are also here.
I am personally not against the idea of having squirrel wheel traps for bots as I have created very simplistic ones in the past that worked well against poorly coded bots and sometimes even crashed them to the point where bot operators would block my domains from being crawled. I do not have the skills of CF to make something more advanced like they did or I would and since I do not use CDN's I am on my own unless someone makes an open source version that can be plumbed into HAProxy or Nginx. I guess that makes me a skiddie.
[1] - https://news.ycombinator.com/user?id=jgrahamc CTO of Cloudflare
[2] - https://news.ycombinator.com/user?id=eastdakota CEO of Cloudflare
"Bot attack protection" mode is also opt-in and many sysadmins just enable it immediately after setting up their website with CF and forget about it (because Cloudflare's marketing suggests it), leading to legitimate browsers being blocked: https://news.ycombinator.com/item?id=43329320
With so much hate towards LLMs right now (which isn't unjustified) being vented on the internet there's no doubt sysadmins will do the same here and niche user agents will again suffer.
> AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.
I do not see it in the free plan. Per the screenshot in the article, on the bots section I see two toggles - Bot Fight Mode and Block Bots. Below these toggles I see
1. A call to action Upgrade Plan for a Super Bot Fight Mode (pro or business)
2. The link to https://developers.cloudflare.com/bots/plans/ which does not mention (yet) of this new security setting.
They probably jumped the gun on the feature toggle. The article says it's included in the free plan. I quoted it above.
> so if it is causing pain people should find a way to contact the website operators in question
Yeah, no. That's silly and no normie knows how to contact website operators, or are likely to even understand they should. Also how would they find the contact of they can't access the website. This is exactly the same situation as their captcha giving you an infinite loop.
Sure, I mean it wont be the case for all websites but there have been times this site was behind Amazons CDN and I know how to contact dang. It will vary from website to website. Some companies hide behind their CDN which is a red flag telling me to avoid that site. If they are a reputable company there will be a way to reach someone. People here often bring up Google as an example of not being able to reach a human and I refer back to my example of a red flag. If a website is not excluding their "Contact Us" from the AI trap that is either incompetence or malfeasance.
"We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling."
If this works, then legitimate users won't get fake responses. One concern I have is the experience of people using screen readers.
I am also concerned about this too as a screen reader user. How do they hide these links from screen readers and assistive technology in general? Did they even test this with AT before deploying it?
The article states that a human would immediately catch that the links and their content is irrelevant.
"This will waste the time of people using screen-readers, but that is a sacrifice I am willing to make."
... but they defend my cooking blog from the botz!111 (I guess one could also substitute "from AI scrapers" but I guess that ship has sailed)
In the spirit of not pitchforking, it does make it sound like they put some non-trivial energy into making the injections hidden, but I'm with you that monkeying with responses is the road to ruin
This Cloudflare article claims that "Medium estimates" that 47% of its content is AI generated, while linking to a Wired article that states the number is from an analysis from Pangram Labs, and is disputed by Medium.
This is the sort of summary and citation error that is common in AI generated articles.
This will be easily defeated. Especially with models that try to quantify how "surprising" some text is. If the surprise is low then leave and try again later. I guess it will get dumb bots that don't use models? So probably just random users who want to download a page
"It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.”
Can't wait for this to generate some liable content on a publicly traded company site. It's not because something is factual that it can't be wrong to communicate in context.
Cloudflare has a habit of handing machineguns to toddlers in the name of antibotting, then shrug their shoulders and call it user error as they shoot themselves.
Waiting for the day where CloudFlare publishes their own LLM trained on their entire cache
Might as well paint the walls of the AI Labyrinth with prompt-injection attacks.
Interesting to see that AI search was basically turned off during the election time. edit: no the election was later, why we're AI searches basically non-existent mid sept - mid oct?
https://fediverse.observer/stats was the link i noticed some similar trend as you mentioned. scroll down a bit to the monthly graphs and there's a huge dip that coincides with your observation about AI search.
so is there that much "AI Traffic" on the internet, or were people outside doing activism or something? who knows.
Interestingly, I saw a graph of Fediverse posts and servers that were online or offline. The months of September, October and November, there was a sharp dip from like, you know, 15,000 to 5,000 say. And then after November it went back up.
If I remember to look at this comment tomorrow, I will post an image link that I grabbed or post the source info. I asked the person who collected it, or the group who collected it, and I said, hey, what happened there? Why is that huge dip in the graph?
I don't think I've gotten a reply yet.
I'm still struggling with how you would address the bots that don't care for speed at all.
How much do we care about stopping crawlers that are slower than the average human user? Is this even possible to do given perfect wire-level emulation of a typical UA?
Should I expect pages protected by this technology to periodically Turing test me?
this is explicitly "misbehaving bots" - it's for performance, not against scraping in general. A bot which is slower than average user is likely well-behaving, and therefore is out of scope.
I imagine part of the problem is "fast" bots can be disguised as many many different slow ones, and it's not always trivial to figure out that they're all tentacles of the same malign being.
You got to think though, that "slow" bot to you is a residential ip address hitting one or two endpoints on your website, but to the bot, it's probably doing this to thousands or millions of sites at the same time.
Cloudflare is in a unique position to see enough of the Internet traffic to tell if that one ip address is browsing tens of domains at a time or thousands.
The toggle doesn't work for me.
API Request Failed: PUT /api/v4/zones/xxx/bot_management (504)
This is brilliant. Content is infinite. Time and money are not. It’s like sending an unstoppable force down a bottomless black hole.
From the post: “It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.”
Okay, why should I care if a crawler that is clearly doing something it shouldn’t receives misinformation?
That's actually a good strategy. It avoids adding more false information in the infosphere while de-incentivising the crawlers from returning to the site (since they don't find the information they are looking for there).
I could imagine more sophisticated crawlers might be able to detect false information and then avoid those pages, but maybe that's more far fetched than how it comes across in my mind.
I guess if the crawlers can’t actually see the trap, misinformation would be attributed to your website in case model responses expose content attribution tags to end users.
LLMs already make up citations, everyone would assume it was just the model spewing nonsense citations.
Wouldn't generating the content be much more expensive than the crawling done by the bots? How is it economical to deal with bots by spending a bunch of resources to feed them fake content.
They answered your question in the OP:
> To generate convincing human-like content, we used Workers AI with an open source model to create unique HTML pages on diverse topics. Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval. We found that generating a diverse set of topics first, then creating content for each topic, produced more varied and convincing results. It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
Personally, I wish they would have generated deliberately inaccurate content. It would be a further disincentive to do unauthorized crawling. Just as long as the inaccuracies don't intersect with typical "misinformation on the Internet" inaccuracies, I think it'd be totally fine and ethical (e.g. Queen Elizabeth II was the 34th president of the United States, the TV show Saved by the Bell aired for 14 seasons with the original cast, making it the longest running live action teen drama in Canada).
I don't think it matters if the content is accurate, feeding it its own slop as if it were new data is deeply damaging to the model
Who would you want to target with this? Companies crawling the Internet for training data have enough of it to filter out such deliberately inaccurate content - if not directly, then by betting it'll cancel out in training due to volume of data. Now, if you want to mess with people who use LLMs to drive search instead of going straight to Google - what's the point? Just causally screw with other people? Why not also slash tires of their cars while you're at it?
Those measures target the AI companies that "misbehave", a.k.a. ignore robots.txt and scrape at such high rate that websites go down. The fact that their service may be useful to someone is not an excuse.
Imagine hypothetical delivery trucks that violate speed limits in dense residential areas and occasionally hit locals. Would you want to stop them (maybe not slash tires, but fine the hell out of them)? Or would you say: "hey I really like the fast deliveries, so I don't care for a few fatalities"? Because your comment really sounds like latter.
The LLM crawler just needs to respect robots.txt and they won't be affected.
I don't understand how anyone can find this to be a problem.
I'm looking forward to all this vibe coding agentic world stopping the just facepalm level dupes around here
from two days ago https://news.ycombinator.com/item?id=43421525
Oh fun. I can’t wait. Now phishing sites will be protected with Turnstile and this garbage.
Trapping Ai for misbehaving??? I smell something fishy
The opposite of it would be an agent which deliberately generates expensive but useless requests. Like search. If it detects labyrinth.
But that agent would be making requests against Cloudflare's servers not the origin site.
And that is a fight they are very comfortable in having.
See also https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...
https://iocaine.madhouse-project.org/
just curious: why dont the aibots play by the rules ? what incentivized them to misbehave.. is it just sloppy crawler code (ie, the misbehavior is unintentional) ?
Yay more dark patterns. /s
> When these links are followed, we know with high confidence that it's automated crawler activity, as human visitors and legitimate browsers would never see or click them.
Yeah, it would have been just stellar if I had spotted "huh, that's weird" in a page response and I chased it to see what it was. Then "har de har har, welcome to a Cloudflare blocklist, n00b" for being curious
I hate them so much
I doubt they'd add these protections for visiting a handful of links at human speed. Correct me if I'm wrong, but crawlers often send hundreds of requests per minute, testing random outlinks and routes they find on the site
My concern would be as a webmaster: serving useless content to users, and as a user: not getting the information from the site.
I probably wouldn't use this feature, since I often deploy static websites that use little to no resources, and the potential harm outweighs the benefit
[flagged]
[flagged]