Facebook's Fascination with My Robots.txt

blog.nytsoi.net

91 points by Ndymium 2 months ago · 54 comments

> Perhaps someone at their end screwed up a loop conditional, but you'd think some monitoring dashboard somewhere would have a warning pop up because of this.

If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit so now there's an acceptable level of brokenness.

goodmythical 2 months ago

>they can't be bothered to fix the broken shit
Surely it's more likely that it's just cheaper to pay for the errors than to pay to fix the errors.
Why fix 10k worth of errors if it'll cost me 100k to fix it?
- DanielHB 2 months ago
  
  The orgs are not ruthless like that, anything less than a certain % of the org revenue is not worth bothering unless it creates _more_ work to the person responsible for it than fixing it does.
  Add some % if the person who gets more work from the problem is not the same as the person who needs to fix it. People will happily leave things in a broken state if no one calls them out on it.
- Nextgrid 2 months ago
  
  In my opinion, if something isn’t actually an error, you modify your logging to not log it as an error. Your error logging/alerting pipeline should always stay clean.
  If something shows up in there, you should only have 2 options: 1) it’s an actual error and you fix it and make sure it never happens again, or 2) it’s not an error and then you fix it by adjusting the log level to make sure it isn’t one.
  If someone suggests an “error budget” on my watch they get the door. You can have a warning budget (and the resources to adjust the log levels or remediation protocols to fix said “errors”) but actual errors should remain errors - otherwise they’re delivering broken software and that’s not what I’m paying them for.
  Of course, companies who have the common sense to do this already do it and nobody in their right mind would suggest an “error budget”, but for those that don’t they have a serious problem that needs to be rectified.
  The danger otherwise is that you’re making your observability pipeline useless if “errors” no longer actually mean errors. That’s really bad because now it opens the door to actual errors being ignored until it’s too late and then remediation is more costly.
darepublic 2 months ago

At Facebook a full outage is accompanied by "first time?" memes. Unless you are on the specific team responsible, you would indeed not really have any reason to care.
nazgulsenpai 2 months ago

In my 3rd year of enterprise now and learned that there are many engineers who will purposefully not fix/improve their problematic applications as a weird sort of job security. It kind of blew up in their faces last year when we moved most of the affected on-premise applications to cloud. Seems like when you introduce tons of friction on-premise it makes the cloud look even better to the suits.
Nifty3929 2 months ago

It's not a matter of "can't be bothered." Engineers are constantly fixing things and rolling out new features. "Error budgets" are an acknowledgement of the tradeoff between these two things, and making a conscious choice about the balance between them, according to the business requirements of the application in question.
Keep in mind that "fixing things" is essentially a Sisyphean task - no matter how much you do there's always more you can do. Just like adding features. You have to have some kind of guideline on when enough is enough.

NdymiumOP 2 months ago

For some reason, Facebook has been requesting my Forgejo instance's robots.txt in a loop for the past few days, currently at a speed of 7700 requests per hour. The resource usage is negligible, but I'm wondering why it's happening in the first place and how many other robot files they're also requesting repeatedly. Perhaps someone at Meta broke a loop condition.

antonyh 2 months ago

As facebookexternalhit is listed in the robots.txt, it does look like it's optimistically rechecking in the hope it's no longer disallowed. That rate of request is obscene though, and falls firmly into the category of Bad Bot.
- RobotToaster 2 months ago
  
  My guess is it's dutifully obeying it, not storing anything from the site and then exiting, without clearing the site from the crawl queue.
- mghackerlady 2 months ago
  
  That is probably the dumbest yet most genius solution to getting your scraper blocked I've ever seen
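For reference, a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical stand-ins, not the actual file from the post; only the `facebookexternalhit` token comes from the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows one crawler entirely
# and allows everyone else.
ROBOTS_TXT = """\
User-agent: facebookexternalhit
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/some/repo"  # hypothetical URL
fb_ok = parser.can_fetch("facebookexternalhit", url)
other_ok = parser.can_fetch("SomeOtherBot", url)
print(fb_ok, other_ok)
```

A crawler that gets a negative answer here has nothing left to fetch on the host, which is why forgetting to drop the host from the crawl queue would produce exactly this kind of robots.txt-only loop.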

xg15 2 months ago

Facebook just decided that instead of loading the robots.txt for every host they intend to crawl, they'll just ignore all the other robots.txt files and then access this one a million times to restore the average.

Vinnl 2 months ago

Ah yes, robots_georg.txt.

tananaev 2 months ago

Maybe they’re trying to DDoS it, and once an error is returned, they assume that no robots.txt file exists and then crawl everything else on the site?

NdymiumOP 2 months ago

While 7700 per hour sounds big, pretty much any dinky server can handle it. So I don't think it's a matter of DDoS. At this point it's just... odd behaviour.
- mghackerlady 2 months ago
  
  especially for a txt file. I don't know anything really about webdev but I'm pretty sure serving up 7700 plaintext files with roughly 10 lines each an hour isn't that demanding

talkingtab 2 months ago

1. Put a note in robots.txt that says

"By accessing this file more than one time per second you agree to pay a fee of $0.1 per access plus an additional $0.1 for each previous access each day. This fee will be charged on a per access basis."

2. Run a program that logs the number of Facebook requests and prints a summary and bill.

3. Then get a stamp, envelope and write out a bill for the first day, call it a demand for payment and send it to:

Facebook, Inc. Attn: Security Department/Custodian of Records 1601 S. California Avenue Palo Alto, CA 94304 U.S.A.

You can optionally send this registered mail, where someone has to sign for it.

Corporations such as Facebook are used to getting their way in court because they can afford lawyers and you cannot. So they have gotten lazy and do not worry about what is fair or legal.

So take them to court when you have a legitimate legal issue. The courts are there to provide redress when you are aggrieved. Right? Use the courts. You can file a small claims action easily. Just make sure you have 1) a legitimate case, 2) evidence 3) have sent them a demand for payment.
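The logging-and-billing program from step 2 could look roughly like the sketch below. It assumes a combined-format access log and the `facebookexternalhit` user-agent token mentioned elsewhere in the thread, and it reads the fee note as "the n-th access of a day costs n × $0.1". For simplicity it ignores the "more than one time per second" threshold. The sample log lines and function names are hypothetical.

```python
from collections import Counter

FEE_CENTS = 10  # $0.1 per access, kept in integer cents to avoid float drift


def daily_bill_cents(accesses: int) -> int:
    """Total for one day if the n-th access costs n * $0.1 (assumed reading)."""
    return FEE_CENTS * accesses * (accesses + 1) // 2


def count_by_day(log_lines, token="facebookexternalhit"):
    """Count matching requests per day in combined-format log lines."""
    counts = Counter()
    for line in log_lines:
        if token not in line:
            continue
        start = line.find("[")
        if start == -1:
            continue  # no timestamp field; skip malformed line
        # Timestamp looks like [10/Jan/2026:00:00:01 +0000]; keep the date part.
        day = line[start + 1:].split(":", 1)[0]
        counts[day] += 1
    return counts


# Hypothetical sample lines (documentation IP range, made-up timestamps).
sample = [
    '203.0.113.7 - - [10/Jan/2026:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "facebookexternalhit/1.1"',
    '203.0.113.7 - - [10/Jan/2026:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "facebookexternalhit/1.1"',
    '198.51.100.4 - - [10/Jan/2026:00:00:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for day, n in count_by_day(sample).items():
    print(f"{day}: {n} requests, ${daily_bill_cents(n) / 100:.2f} owed")
```

Because the per-access fee grows with every previous access, the daily total grows quadratically with the request count.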

Cthulhu_ 2 months ago

You can't just make a one sided contract agreement like that though. Just like I can't tell anyone that by reading this comment they agree to pay me (as another commenter pointed out).
blell 2 months ago

Great point. By the way, by reading this comment, you agree to pay me one gorillion dollars.
quesera 2 months ago

Why do you think this would be anything other than a time-consuming and slightly costly exercise?
Do you think there's a contract created by your robots.txt comment?
pavel_lishin 2 months ago

Is this functionally different from posting the "I DO NOT GRANT FACEBOOK PERMISSION..." copypasta to your Facebook page?

dormento 2 months ago

Has anyone done research on the topic of trying to block these bots by claiming to host illegal material or talking about certain topics? I mean having a few entries in your robots like "/kill-president", "/illegal-music-downloads", "/casino-lucky-tiger-777" etc.

pousada 2 months ago

Yea I can’t see how that could backfire in any way
DetroitThrow 2 months ago

FB crawler is used for national security reasons at times. The first would probably make it more active.

13pixels 2 months ago

Facebook is honestly the least interesting crawler misbehaving right now. The real shift is GPTBot, ClaudeBot, PerplexityBot and a dozen other AI crawlers that don't even identify themselves half the time.

I've been monitoring server logs across ~150 sites and the pattern is striking: AI crawler traffic increased roughly 8x in the last 12 months, but most site owners have no idea because it doesn't show up in analytics. The bots read everything, respect robots.txt maybe 60% of the time, and the content they index directly shapes what ChatGPT or Perplexity recommends to users.

The irony is that robots.txt was designed for a world where crawling meant indexing for search results. Now crawling means training data and real-time retrieval for AI answers. Completely different power dynamic and most robots.txt files haven't adapted.
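One rough way to surface this traffic in raw server logs is to tally requests by user-agent token. A minimal sketch: the tokens are the crawler names mentioned above, the sample user-agent strings are hypothetical, and crawlers that do not identify themselves will by definition land in the "other" bucket.

```python
from collections import Counter

# Crawler names taken from the comment above; extend as needed.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def tally_crawlers(user_agents):
    """Count user-agent strings per known crawler token (case-insensitive)."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break
        else:
            # No known token matched: a browser, or an unidentified bot.
            counts["other"] += 1
    return counts


# Hypothetical user-agent strings for illustration only.
sample_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
]
print(tally_crawlers(sample_agents))
```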

XCSme 2 months ago

This matches what I've been noticing. A lot of AI crawler traffic just doesn't show up clearly in typical analytics dashboards, especially when tools aggressively filter or sample.
Part of why I built UXWizz was to avoid black-box filtering and keep control over how traffic is classified. When you own the analytics stack, you get to decide what’s "valid" instead of inheriting someone else's definition.

lloydatkinson 2 months ago

I recently started maintaining a MediaWiki instance for a niche hobbyist community and we'd been struggling with poor server performance. I didn't set the server up, so came into it assuming that the tiny amount of RAM the previous maintainer had given it was the problem.

Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.

Millions upon millions of requests, hundreds of GB of bandwidth. Thankfully we're using Cloudflare so could block all of them except real search engine crawlers and now we don't have any problems at all. I also made sure to constrain Apache's limits a bit too.

From what I've read, forums, wikis, git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just download a git repo or a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.

Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.

toast0 2 months ago

> this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.
Ugh, such a weird design. At least my experience has been you are better off setting Apache to always run the same number of instances, and tuning that number as appropriate rather than having the instance count fluctuate under load.
- lloydatkinson 2 months ago
  
  Yeah, I did suggest nginx in the past for this project and I think I'll consider it after this saga. For now I've set pretty low limits on Apache.
  - toast0 2 months ago
    
    Apache itself is fine. Just need to set MaxClients/MaxRequestWorkers equal to StartServers and MaxSpareServers (assuming prefork), so that all the forking happens at start up. Similar things for other mpms, I assume.
mrweasel 2 months ago

For some reason it seems really important to these AI companies to get the very latest version of your pages as well, so they'll do anything in their power to avoid hitting any caching you may try to set up.
lithos 2 months ago

Forum/wiki content is probably more likely to be old enough to be from pre-AI days, meaning they get to avoid the AI inbreeding problem.
Git content likely to have code for the bot to train on.

evv 2 months ago

Have you considered serving a zip bomb to this user agent?

delecti 2 months ago

I'm sure their crawler can handle a zip bomb. Plus it might interpret that as "this site doesn't have a robots.txt" and start the scraping that OP is trying to prevent with their current robots.txt.
- marginalia_nu 2 months ago
  
  Pretty sure every crawler can. You kinda have to go out of your way not to, given how the gzread API looks.
  https://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LS...
- 1e1a 2 months ago
  
  Could allow only the path to the zip bomb for this user agent.
  - FartyMcFarter 2 months ago
    
    That will work once at most and then quickly get fixed.
    
    xp84 2 months ago
    
    Yeah it seems like this team takes a really tough stance on obvious bugs
    
    esseph 2 months ago
    
    Are you so sure? :)
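To illustrate the gzread point: that API decompresses into a caller-supplied, fixed-size buffer, so the reader decides how much output it is willing to accept. A rough Python analogue using the standard `zlib` module is sketched below; the 1 MB limit and the function name are assumptions for illustration.

```python
import gzip
import zlib


def read_capped(compressed: bytes, limit: int = 1_000_000) -> bytes:
    """Decompress gzip data, refusing to produce more than `limit` bytes."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # max_length bounds the output of this call; asking for one extra
    # byte lets us detect that the body is larger than the limit.
    out = decomp.decompress(compressed, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed body exceeds limit")
    return out


# A tiny "bomb": 10 MB of zeros compresses to a few kilobytes.
bomb = gzip.compress(b"\0" * 10_000_000)
try:
    read_capped(bomb)
except ValueError as exc:
    print("rejected:", exc)
```

The decompressor never materialises more than `limit + 1` bytes, so the oversized payload costs the reader almost nothing.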

VladVladikoff 2 months ago

My bet is this is a threading bug rather than just a broken loop. Somehow the threads are failing to communicate with each other, or some sort of race condition, so it keeps putting in the same task to the queue but missing the result. Something like that.

petee 2 months ago

Do crawlers follow/cache 301 permanent redirects? I wonder if you could point the firehose back at facebook, but it would mean they wouldn't get your robots.txt anymore (though I'd just blackhole that whole subnet anyway)

slickdifferent 2 months ago

Probably unrelated, but there was an old exploit on facebook using notes where you could paste an image tag with a URL + dynamic parameter an unlimited amount of times and facebook crawlers would hit that link however many times it was pasted in the note, essentially allowing you to DDOS a site with their servers. https://www.radware.com/blog/security/a-flaw-worth-fixing-ho...

_slih 2 months ago

the facebook bug is funny but the real issue is that robots.txt is a gentleman's agreement from 1994 and half the crawlers in 2026 don't care what it says.

prmoustache 2 months ago

Is there any downside in just blocking the whole Meta IP range? I mean they aren't even running a search engine AFAIK. Why would I want them to crawl my website?

johneth 2 months ago

The only downside I can think of is that links to your website that are shared on Facebook / WhatsApp / Messenger won't be able to show a rich preview because they won't be able to access the OpenGraph tags on the shared page.
Rich previews are known to cause higher clickthroughs than non-rich previews (if you care about that).

matja 2 months ago

Did you try adding a Cache-Control response header?

mrweasel 2 months ago

Even if they haven't added any cache control headers, what kind of lazy Meta engineer designed their crawler to just pull the same URL multiple times a second?
Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and then calculating a proper retry interval. Search engine crawlers 25 - 30 years ago could do this.
Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
- bot403 2 months ago
  
  It's not the "same" crawler. Probably each thread or each cluster machine instance of the crawler hitting it independently.
  - OliverGuy 2 months ago
    
    That's still the same crawler system though. And it's lazy engineering to not build in something to track when you last requested a url.
    And it's quite a trivial feature at that.
  - mrweasel 2 months ago
    
    I sincerely doubt that search engines run their crawlers on a single machine and they got it figured out.
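The revisit policy described at the top of this subthread, combined with tracking when a URL was last requested, can be sketched roughly as follows. The interval bounds and the halve-on-change, double-on-no-change rule are assumptions chosen for illustration, not how any real crawler is known to work.

```python
import hashlib
from dataclasses import dataclass

DAY = 24 * 3600
MIN_INTERVAL = DAY / 2      # at most twice a day
MAX_INTERVAL = 3.5 * DAY    # at least twice a week


@dataclass
class UrlState:
    interval: float = DAY                # start at once per day
    last_fetch: float = float("-inf")    # never fetched yet
    last_hash: str = ""


def is_due(state: UrlState, now: float) -> bool:
    """True once the retry interval has elapsed since the last fetch."""
    return now - state.last_fetch >= state.interval


def record_fetch(state: UrlState, body: bytes, now: float) -> None:
    """Update the state after a fetch, adapting the retry interval."""
    digest = hashlib.sha256(body).hexdigest()
    if state.last_hash:
        if digest != state.last_hash:
            # Content changed: come back sooner, down to MIN_INTERVAL.
            state.interval = max(MIN_INTERVAL, state.interval / 2)
        else:
            # Content unchanged: back off, up to MAX_INTERVAL.
            state.interval = min(MAX_INTERVAL, state.interval * 2)
    state.last_hash = digest
    state.last_fetch = now


state = UrlState()
record_fetch(state, b"User-agent: *\nDisallow: /\n", now=0.0)
print(is_due(state, now=1.0))   # one second later: not due
print(is_due(state, now=DAY))   # one day later: due
```

In a multi-machine crawler the `UrlState` records would have to live in shared storage, so that every worker sees the same `last_fetch` value.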
NdymiumOP 2 months ago

Forgejo does set "cache-control: private, max-age=21600", which is considerably more than one second, but I grant it uses the "private" keyword for no reason here.
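For scale, a small sketch of how a client might read that header value. The parser is deliberately simplistic (it does not handle quoted values), and the function name is hypothetical.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip() or None
    return directives


cc = parse_cache_control("private, max-age=21600")
max_age_hours = int(cc["max-age"]) / 3600
print(cc)             # {'private': None, 'max-age': '21600'}
print(max_age_hours)  # 6.0
```

A max-age of 21600 seconds is six hours, so a client honouring it would need one request per six hours. At the reported 7700 requests per hour, that is 46200 requests in the same window.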

mghackerlady 2 months ago

>my extreme LibreOffice Calc skillz

How does one learn these skills, I can see them being useful in the future

OhMeadhbh 2 months ago

Your name isn't Jackson Roykirk, is it?
