Facebook's robots.txt
facebook.com also has a funny 404 error when you remove the "s".
http://disqus.com/human.txt: "Uh oh... Something didn't work."
What's "bmw" doing on the top of his head ?
Chrome user here. When I open it, the tab is automatically closed.
Tried to curl it: exact same content, no 302 toward a "<script>window.close()</script>" page, ... Got anything?
You don't scrape Facebook, Facebook scrapes you!
In the US, you catch a cold. In Soviet Russia, cold catches you!
So what does it mean that Facebook whitelists a scraping service? Do they actively block scrapers?
I could be wrong, but I believe the default is that spiders are blocked, and only the "User-agents" listed are allowed to crawl (excluding the disallowed pages).
You are correct.
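For illustration, a whitelist-style robots.txt looks something like this (a simplified sketch, not Facebook's actual file; "Googlebot" and the paths are just examples):

    # Whitelisted crawler: allowed everywhere except the listed paths
    User-agent: Googlebot
    Disallow: /ajax/
    Disallow: /album.php

    # Everyone else: blocked entirely
    User-agent: *
    Disallow: /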
Is there a way to replace this robots.txt with a null robots.txt? :)
You just ignore the robots.txt file, crawl slowly, and do it from distributed virtual machines.
Not that you should do that. robots.txt is a nicety, though: the client doesn't have to respect it, and the server doesn't have to allow your HTTP requests.
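For contrast, here's what a polite client does before fetching anything; a minimal sketch using Python's standard urllib.robotparser (the URL and user-agent string are just placeholders):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.facebook.com/robots.txt")
    rp.read()

    # can_fetch() only reports what the site asks for; nothing stops a
    # client from skipping this check entirely.
    print(rp.can_fetch("MyCrawler", "https://www.facebook.com/some/page"))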
What is "User-agent: Yeti"?
It's the crawler for Naver, a South Korean search engine.
Even Facebook's robots.txt has a hatred for my pseudo-anonymous browser settings. Facebook gives me this (for any page): "Sorry, something went wrong. We're working on getting this fixed as soon as we can."
robots.txt isn't enforced.
Maybe it should be. Gentlemen's agreements do not apply to robots.
And how exactly do you propose verifying that the user agent purporting to be Googlebot or Firefox is actually who they are? They're inherently unenforceable.
robots.txt is basically a list of rules that lay out "This is how we'd like you to crawl us. We might stop serving you if you don't comply", rather than a hard-and-fast set of directives that specify how a webcrawler will be guaranteed to behave.
You can implement some strict enforcement in Apache using some crafty mod_rewrite stuff: http://andthatsjazz.org/defeat.html
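Roughly along these lines (a sketch only; "EvilScraper" is a placeholder user-agent string, and user agents can of course be faked):

    RewriteEngine On
    # Return 403 Forbidden to requests whose User-Agent matches the pattern
    RewriteCond %{HTTP_USER_AGENT} EvilScraper [NC]
    RewriteRule .* - [F,L]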
User-agent is too easily spoofed, but we could check whether the robots are indeed Google (whitelisted) and not some other crawler that just wants to scrape your content.
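That check is doable without any new protocol: Google documents verifying Googlebot with a reverse DNS lookup followed by a forward confirmation. A rough Python sketch (the IP is just an example, and the accepted domains are an assumption based on Google's documentation):

    import socket

    def is_probably_googlebot(ip):
        # Reverse-resolve the IP, check the hostname's domain, then
        # forward-resolve the hostname to confirm the PTR record isn't spoofed.
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.error:
            return False

    print(is_probably_googlebot("66.249.66.1"))  # sample Googlebot-range IP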
In the realm of mail servers we have something called SPF: http://en.wikipedia.org/wiki/Sender_Policy_Framework
Just thinking out of the box here, but other than checking IP ranges: maybe a hash sent as a header in the GET request by the crawler, to verify they are who they say they are.
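A hypothetical Python sketch of that idea, using an HMAC over the request path (the shared secret and header name are invented for illustration, and this assumes the crawler and the site exchange a secret out of band):

    import hashlib
    import hmac

    SHARED_SECRET = b"example-secret"  # placeholder, exchanged out of band

    def sign_request(path):
        # Crawler side: sends this value in a header, e.g. X-Crawler-Signature
        return hmac.new(SHARED_SECRET, path.encode(), hashlib.sha256).hexdigest()

    def verify_request(path, signature):
        # Server side: recompute and compare in constant time
        return hmac.compare_digest(sign_request(path), signature)

    sig = sign_request("/robots.txt")
    print(verify_request("/robots.txt", sig))  # True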