Robots.txt as a Security Measure?
I was hoping this would be about putting an orphan path in your robots.txt and then blacklisting clients who tried to fetch it -- nobody should know about it except robots who are told not to go there, so anyone who visits the link is an adversary.
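A minimal sketch of what that robots.txt might contain (the trap path is hypothetical and deliberately never linked from anywhere else):

```
# robots.txt -- honeypot entry; nothing else on the site references this path
User-agent: *
Disallow: /never-linked-trap-path/
```

Anything requesting /never-linked-trap-path/ could only have learned about it from this file, so its IP goes straight onto the blocklist.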
As one small data point:
I've been running this experiment (another comment). While bots continuously hammer on port 22 (ssh) and repeatedly try to fetch things like /wp-* (I don't even run PHP), they don't bother fetching robots.txt in the first place, and my honeypot hasn't had a single hit.
Definitely do not try to "secure" your site this way, but it seems bots are either not sophisticated enough to parse robots.txt, or the technique is already well known. Many other commenters seem to have come up with the same idea.
If you're an adversary trying to snoop on port 22, why would you bother to respect the conventions of robots.txt to begin with?
Not necessarily the same bot. And they're not snooping so much as brute-forcing default/common/random(?) usernames & passwords.
Funny experiment, and perhaps also useful, but there are crawlers with good intentions[1] that may still ignore the disallows. I don't know of any besides the Internet Archive, though.
[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
Those crawlers can almost always be recognized by the UA.
Yes, no doubt.
Definitely an interesting idea; I should take a look to figure out how many 'adversaries' are actually scanning robots.txt files.
Ooo that's a good idea, I will definitely start implementing this
"I don't always expose my production database on a public URL, but when I do, I put a 'Disallow' in my robots.txt for it."
It reminds me of that scene from Spaceballs.
"The combination is... 1-2-3-4-5."
"That's amazing! I've got the same combination on my luggage!"
There's a running joke among web pentesters about robots.txt being the first place you look when hitting a new site.
Meanwhile over in .gov I’ve had to explain to a pentester that it wasn’t a security problem that robots.txt was accessible without authentication, based on a very big vendor’s scanner having badly regurgitated the OWASP advice.
The "security" world has an unusually high level of total incompetence. It is scary.
This is common any time there’s so much demand: in the late 90s it was not uncommon to be in a room full of people who were ostensibly web developers and didn’t understand how the web or their backend servers worked but were certain they were about to become rich.
Security is especially bad because so many large organizations are under pressure to improve but the market is tight and the pool of experts is limited. Also, many places have outsourced to large contracting companies who don’t want to admit they don’t have enough qualified staff and will hope that you’ll be satisfied with whoever they deliver.
Yeah no doubt it is a phase.
It's just a really nasty phase right now.
I always think of this:
https://medium.com/@djhoulihan/no-panera-bread-doesnt-take-s...
A few years ago I purposefully put a couple of "interesting" paths in the robots.txt as a honeypot to test/capture bot conformance and malicious actors. Not one hit ever.
They just found a path further up and compromised you via that instead of bothering with the rest of the robots.txt :D
A while back I wrote a Python script to watch for links posted on Twitter and then scrape their /robots.txt file [1]. The requests are routed through Tor for privacy purposes.
It's been incredibly enlightening. One thing that sticks out immediately is that you can identify the underlying HTTP framework in many cases due to the defaults. Sometimes even the exact version.
And, yes, people do use the robots file to "protect" or "hide" endpoints and they can effectively be used to enumerate potential endpoints worth investigating further (from a pentesting perspective).
[1] https://gist.github.com/wybiral/20c20ccf00b6c93506b8acdc6ccb...
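The core of that idea is small enough to sketch; this is not the linked gist, just a minimal version assuming a local Tor daemon listening on 127.0.0.1:9050 and requests installed with SOCKS support (pip install requests[socks]):

```python
from typing import Optional
from urllib.parse import urlsplit

import requests

# Route everything through Tor's local SOCKS proxy; socks5h resolves DNS via Tor too.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def fetch_robots(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the robots.txt body for the site hosting `url`, or None on failure."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        resp = requests.get(robots_url, proxies=TOR_PROXIES, timeout=timeout)
        if resp.status_code == 200:
            return resp.text
    except requests.RequestException:
        pass
    return None


if __name__ == "__main__":
    body = fetch_robots("https://example.com/some/page")
    print(body or "no robots.txt found")
```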
Silly old me always starts with / in a browser. Then I click on links. Not all sites leak information like a sieve with the wire bit removed but many do. There is sometimes no need to do anything clever like look for robots.txt.
It’s like walking through an office and seeing an unlocked door with a “Do not enter” sign.
In addition to the obvious point that it is literally a list of places admins don't want you looking, it is also often useful for backend technology enumeration.
It’s very literally the second bullet point on my enumeration list for web apps, right behind looking at the DNS records for the domain.
It's far from a joke.
Honestly, I think it couldn't hurt, if done appropriately. If crawlers are indexing those pages, then they're publicly available anyway and could be crawled by a determined attacker, so nothing in robots.txt ought to be truly sensitive. But if there are pages that ought to be secure but might contain an exploitable vulnerability, putting their path in robots.txt at least limits their exposure to those determined enough to look, rather than any lazy script kiddie using Google to search your site.
Obviously you shouldn't rely on it, but defense in depth as always.
If you want that as an additional safeguard, set the noindex header on that path at your edge so you’re not calling attention to it:
https://developers.google.com/search/reference/robots_meta_t...
I’d also strongly recommend pairing this with outside monitoring which alerts if something accidentally becomes reachable since it’s really easy not to notice something working from more places than intended.
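A minimal nginx sketch of that noindex-at-the-edge idea (the /internal-reports/ prefix and location block are illustrative, not from the comment above):

```nginx
# Ask well-behaved crawlers not to index this area, without advertising it in robots.txt.
location /internal-reports/ {
    add_header X-Robots-Tag "noindex, nofollow" always;

    # This header is not access control; real auth (auth_basic, allow/deny, SSO, etc.)
    # still has to live here.
}
```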
But then archive.org will ignore it (while pointing its crawler at the very directories you helpfully listed), and those cached copies will show up in Google.
It would even be a million times better to place the sensitive files inside /TOP_SECRET_FOLDER and disallow that entire path, at least avoiding naming each sensitive path explicitly.
This is the only way to use robots.txt for semi-sensitive info, and obviously not for info so sensitive that it would be awful for it to get out. URLs can leak through proxy logs and shared browser history.
Just put a fake /wp/admin/login URL (or similar) in the disallow rules, then IP-ban everyone who tries to access it for 24 hours. That's how you do robots.txt security.
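One way to sketch that with stock tools is fail2ban watching the web server's access log; the filter name, trap path, and log location here are assumptions, not a canonical recipe:

```ini
# /etc/fail2ban/filter.d/robots-trap.conf
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD) /wp/admin/login

# /etc/fail2ban/jail.d/robots-trap.conf
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
# ban for 24 hours
bantime  = 86400
```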
Why in the world would you put things that shouldn't be downloaded ON THE INTERNET to begin with? And if you then proceed to also tell the whole wide world that you did it... it's difficult to feel any empathy.
You can also use robots.txt as a honeypot. Simply add some realistic-looking URLs to the Disallow patterns and create rules in haproxy, nginx, or your own custom scripts to catch anyone hitting those URLs and put them in a hamster wheel, i.e. give them a static "Having problems?" page, or just outright block them. On my own personal systems, I use "silent-drop" with a stick table in haproxy.
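Roughly what the haproxy side of that can look like; the trap paths and backend name are placeholders, and the same paths would appear under Disallow in robots.txt:

```
frontend web
    bind :80

    # per-IP state kept for 24h; gpc0 acts as the "hit a trap URL" flag
    stick-table type ip size 100k expire 24h store gpc0

    acl robots_trap path_beg /old-admin /backup
    acl flagged     sc0_get_gpc0 gt 0

    http-request track-sc0 src
    http-request sc-inc-gpc0(0) if robots_trap
    # silent-drop closes the connection without telling the client anything
    http-request silent-drop if flagged

    default_backend app

backend app
    server app1 127.0.0.1:8080
```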
On a similar note, tools like [grabsite](https://github.com/ArchiveTeam/grab-site) wisely use robots.txt as a method of finding additional paths to archive when crawling sites.
Storytime!
At my previous job we were developing an e-commerce system, which was super-old, big, messy, zero-test PHP trash. After two years of actively working on it, I still couldn't form a clear picture of its subsystems in my head.
One day a client calls, saying that many of his orders are missing. The whole company is on its feet and we are searching for what went wrong. Examining the server logs, we find that someone is making thousands of requests to our admin section, linearly incrementing the order IDs in the URL. Definitely some kind of attack. Our servers are managed by a different company, so we open a ticket to blacklist that IP. A quick search tells me the requests are coming from AWS servers, and the IP leads me to an issue on GitHub for some nginx "bad bots" blocking plugin, saying this thing is called the Maui bot and we are not the first to experience it. Nice.

Anyway, this thing is still deleting our data, and we can't even turn off the servers because of SLAs and how the system was architected. So we try to figure out how it is even possible that an unauthorized request can delete our data. We examine our auth module, but everything looks right: if you are not logged in and visit an order detail page (for example), you are correctly redirected to the login screen. So how?

We read the documentation of the web framework the application uses. There it is: $this->redirect('login');. According to the documentation, that call needed a return in front of it, and ours was missing. Without the return, everything after that point was still executed -- and "everything", in our case, was the action from the URL. No one ever noticed, because there were no tests, and when you tried it in the browser you were "correctly" presented with the login screen. Unfortunately, with a side effect.

The guy who wrote that line did so 5-6 years before this incident and had left the company years before I even joined. I don't blame him.
Fix. Push. Deploy. No more deleted orders.
POST MORTEM:
The Maui bot went straight to the disallowed /admin paths in robots.txt and tried to increment numbers (IDs) in paths.
I remember that, because the Maui bot's actions were indistinguishable (to the system) from normal user actions, someone had to manually fix the orders in the database using nothing but the server logs, somehow comparing them.
Sorry for my English, and yeah, (obviously) don't use robots.txt as a security measure of any kind...
i move my not-to-be-indexed stuff around a lot: renaming, archiving, etc. & i've a bit of shell scripting and a common lisp program that automatically add things to robots.txt, so almost everything listed there 404s by now, and the few paths that still exist are protected via htaccess.
not sure why i did this aside from that it was fun!