Trapping misbehaving bots in an AI Labyrinth

blog.cloudflare.com

238 points by pabs3 a month ago


karaterobot - a month ago

I wonder how these hidden links interact with screen readers. The article says they only get served when Cloudflare already believes you're a bot, but due to my privacy settings and VPN, a lot of Cloudflare-fronted web pages think I'm a bot when I'm just browsing around the web. I suppose that having invisible links in the page wouldn't hurt me much, but would they bug someone using a screen reader? Honestly just wondering.
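The blog doesn't spell out the hiding mechanism, but if the maze links carry `aria-hidden="true"` (an assumption here, not something the article confirms), screen readers should skip them entirely. A toy sketch, stdlib only, of how one might audit a page for links hidden that way:

```python
from html.parser import HTMLParser

class HiddenLinkScanner(HTMLParser):
    """Sort anchors into aria-hidden (skipped by screen readers,
    assuming that hiding mechanism) versus ordinary visible links."""
    def __init__(self):
        super().__init__()
        self.hidden, self.visible = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href")
        if href is None:
            return
        if d.get("aria-hidden") == "true":
            self.hidden.append(href)
        else:
            self.visible.append(href)

# Hypothetical page fragment for illustration:
page = '<a href="/real">Docs</a><a href="/maze/1" aria-hidden="true">x</a>'
s = HiddenLinkScanner()
s.feed(page)
print(s.visible)  # ['/real']
print(s.hidden)   # ['/maze/1']
```

Of course, links hidden purely via CSS (`display: none`, off-screen positioning) wouldn't show up in a scan like this, which is exactly the screen-reader concern.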

__MatrixMan__ - a month ago

Neal Stephenson's Anathem had a thing called "bogons", which were purposely misleading datasets. Apparently the https://anathem.fandom.com/wiki/Reticulum was full of them; you basically had to have a specialist along for the ride if you wanted to hook into the version that wasn't a lie. It was thematically relevant to the story, but it was never explained how they got there.

I've been looking for a plausible explanation. I think I just found it.

darth_avocado - a month ago

Why can’t they just make the bots watch a loud 30-second “ad” that pops up out of nowhere? If I have to do it, then so should the robots.

daralthus - a month ago

> No real human would go four links deep into a maze of AI-generated nonsense.

sure... sure nobody.

visarga - a month ago

Like burning down your house so squatters can't use it. A poisoning attack that makes communication itself untrusted. Wondering if we will eventually get to extreme reciprocal mistrust.

This also reminds me of art images doctored to break models if they get into the training set, by applying invisible features with different semantics.

d4rkn0d3z - a month ago

"When we detect unauthorized crawling..."

How did you do that?

LeoPanthera - a month ago

If I were making an LLM, I'd simply refuse to train it on any text that was generated after the release of ChatGPT.

"Current" data can be fed in post-training.

Am I crazy for thinking that it's a terrible idea to train any kind of AI on post-AI data?

November 2022 is the LLM Trinity date.

KennyBlanken - a month ago

This isn't about blocking "misbehaving" AI bots. This is about blocking the competitors to the big boys like OpenAI and Anthropic.

I help administer a somewhat active (10-20 thousand hits/day) site that sits behind Cloudflare.

ChatGPTBot has been a menace - crawling several pages per second and going deep into the site for years-old content, which pollutes and dilutes the cache. It also happens to hit a lot of pages that are very 'expensive' to generate. And it ignored a robots.txt change for almost two full days.

Yet...I try to crawl my municipality's shitty website because there are announcements there that are made nowhere else, and they're too lazy to figure out how to set up email announcements...and Cloudflare instantly blocked my change-detection bot running on my home server. It hits one page every 24 hours, using a full headless version of Chrome. BZZZZT - Cloudflare's bot detection smacks it upside the head.
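A change detector like the one described doesn't strictly need headless Chrome for a single daily page (unless the site requires JavaScript). A minimal sketch, with a hypothetical URL and User-Agent string, that hashes the page body and compares it to the last run:

```python
import hashlib
import urllib.request

def page_changed(html: bytes, last_hash: str) -> tuple[bool, str]:
    """Compare a fetched page body against the previously stored hash.
    Returns (changed?, new_hash)."""
    new_hash = hashlib.sha256(html).hexdigest()
    return new_hash != last_hash, new_hash

def fetch(url: str) -> bytes:
    # Polite crawlers identify themselves; UA string here is made up.
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "change-detector/1.0 (admin@example.com)"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

# Typical use: run once per day from cron, persisting the hash to a file.
# changed, h = page_changed(fetch("https://example.gov/announcements"),
#                           stored_hash)
```

Whether something this light actually gets past Cloudflare's bot detection is another matter - which is rather the commenter's point.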

If you think this is by chance or they don't know this is happening: bridge for sale etc.

This is just more collusion with other large tech firms, working to kill each other's competitors, small services and sites, and innovators. Really cute, given half of SV got where it is by "disrupting" things (ie breaking laws and regulations - it's cool bro, It's An App!)

Gmail will allow endless amounts of shit to stream into my inbox from "email marketing service" companies like Mailchimp because I bought something 6 years ago from that company - but the second I need an email from a small community group mailing list that uses their own email server - a domain I've sent and received numerous emails to and repeatedly clicked "Not spam" for - Gmail still keeps right on sending it to spam. I've checked: their domain and IP range are both completely clean. It's simply Google saying "this wouldn't be happening if you were using Gmail for your domain's email."

We desperately need to claw the internet back from these corporations or it will only get worse. Remember when you could run a web server on dialup and nobody fucking cared? Now if you so much as have port 443 open for some self-hosted stuff only you know exists, your ISP pitches a fit. Remember when you could use any client you wanted for services like AIM? Now we have Slack and Discord, and they'll ban you for using a non-official client.

momojo - a month ago

In this 'arms race', will this serve as an actual deterrent? Can anyone involved in scraping chime in?

gloosx - a month ago

>No real human would go four links deep into a maze of AI-generated nonsense

Rude. What if I go five links deep into a maze of AI-generated nonsense tomorrow, just out of curiosity about whether it's endless or not? Will Cloudflare declare me not real?

There might even be some people in a mental state to get hooked on this, and this company just called them bots lol

Besides, if 47% of Medium is AI-generated, then any of us could plausibly go through four links of AI-generated nonsense. Are y'all real?

seasluggy - a month ago

> No real human would go four links deep into a maze of AI-generated nonsense.

Why do I doubt this.

ccgreg - a month ago

1. Find many examples of these nofollow links

2. Create a webpage with these links, not including the nofollow

3. ...

4. Profit!
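Tongue-in-cheek or not, step 2 is nearly a one-liner. A toy sketch (regex-based, so it only handles the simple `rel="nofollow"` form, not multi-valued `rel` attributes) of stripping the hint:

```python
import re

def strip_nofollow(html: str) -> str:
    """Remove rel='nofollow' from anchor tags, so crawlers that honor
    the hint would follow the links again (step 2 above)."""
    # Drop a rel attribute whose value is exactly "nofollow".
    return re.sub(r'\srel=(["\'])nofollow\1', '', html)

print(strip_nofollow('<a rel="nofollow" href="/maze/1">bait</a>'))
# <a href="/maze/1">bait</a>
```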

zzo38computer - a month ago

If there are invisible links, how will that be handled if you disable CSS? (I often do disable CSS.)

mipsum544664 - a month ago

Good start, but this is not adversarial enough. We don't need irrelevant content, we need nonsense - like articles on why baby orangutan blood makes for the best lubricant, and so on.

callamdelaney - a month ago

There’s already a project that does this and it didn’t need to use AI.

zlagen - a month ago

Wait until legitimate users start getting served these fake responses. Who is willing to trust this company to serve random data?

ceroxylon - a month ago

This Cloudflare article claims that "Medium estimates" that 47% of its content is AI generated, while linking to a Wired article that states the number is from an analysis from Pangram Labs, and is disputed by Medium.

This is the sort of summary and citation error that is common in AI generated articles.

casey2 - a month ago

This will be easily defeated, especially by models that try to quantify how "surprising" some text is: if the surprise is low, leave and try again later. I guess it will catch dumb bots that don't use models? So probably just random users who want to download a page.
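A real scraper would measure surprise with an actual language model's perplexity; as a crude stand-in, here's a sketch that scores text by word-distribution entropy (the threshold is made up) and flags repetitive, low-surprise filler:

```python
import math
from collections import Counter

def word_entropy(text: str) -> float:
    """Crude stand-in for LM perplexity: Shannon entropy (bits/word)
    of the word distribution. Repetitive, predictable text scores low."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

SURPRISE_FLOOR = 3.0  # hypothetical threshold; tune against real pages

def looks_like_filler(text: str) -> bool:
    return word_entropy(text) < SURPRISE_FLOOR

print(looks_like_filler("the cat the cat the cat the cat"))  # True
```

The catch, as the labyrinth's designers would point out, is that their filler is LLM-generated precisely so that simple statistics like this won't separate it from real prose.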

PeterStuer - a month ago

"It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.”

Can't wait for this to generate some libelous content on a publicly traded company's site. Just because something is factual doesn't mean it can't be wrong to communicate in a given context.

Cloudflare has a habit of handing machine guns to toddlers in the name of anti-botting, then shrugging its shoulders and calling it user error when they shoot themselves.

tom1337 - a month ago

Waiting for the day where CloudFlare publishes their own LLM trained on their entire cache

__MatrixMan__ - a month ago

Might as well paint the walls of the AI Labyrinth with prompt-injection attacks.

wyes - a month ago

Interesting to see that AI search was basically turned off during the election. Edit: no, the election was later - why were AI searches basically non-existent mid-Sept to mid-Oct?

bob1029 - a month ago

I'm still struggling with how you would address bots that don't care about speed at all.

How much do we care about stopping crawlers that are slower than the average human user? Is this even possible to do given perfect wire-level emulation of a typical UA?

Should I expect pages protected by this technology to periodically Turing test me?

theyknowitsxmas - a month ago

The toggle doesn't work for me.

API Request Failed: PUT /api/v4/zones/xxx/bot_management (504)

jtwoodhouse - a month ago

This is brilliant. Content is infinite. Time and money are not. It’s like sending an unstoppable force down a bottomless black hole.

windhaven - a month ago

From the post: “It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.”

Okay, why should I care if a crawler that is clearly doing something it shouldn’t receives misinformation?

thayne - a month ago

Wouldn't generating the content be much more expensive than the crawling done by the bots? How is it economical to deal with bots by spending a bunch of resources to feed them fake content?

mdaniel - a month ago

I'm looking forward to this whole vibe-coding agentic world at least stopping the facepalm-level dupes around here.

from two days ago https://news.ycombinator.com/item?id=43421525

gwittel - a month ago

Oh fun. I can’t wait. Now phishing sites will be protected with Turnstile and this garbage.

grizz18bear20 - a month ago

Trapping AI for misbehaving??? I smell something fishy

MoonGhost - a month ago

The opposite of this would be an agent that deliberately generates expensive but useless requests - like searches - if it detects a labyrinth.

amai - a month ago

See also https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...

https://iocaine.madhouse-project.org/

https://zadzmo.org/code/nepenthes/

See also: https://tldr.nettime.org/@asrg/113867412641585520

bwfan123 - a month ago

Just curious: why don't the AI bots play by the rules? What incentivizes them to misbehave? Is it just sloppy crawler code (i.e., the misbehavior is unintentional)?

everfrustrated - a month ago

Yay more dark patterns. /s
