Hello Facebook Crawler

mwmeyer.com

72 points by mwmnj 13 years ago · 36 comments

whalesalad 13 years ago

This reminds me of a recent experience I had with the Bing bot.

This most recent YC round, my co-founder and I used Skydrive to edit our application. Skydrive integrates pretty nicely with Word, even on a Mac, to allow for collaborative editing. It's like the best parts of Sharepoint, minus all the crap, and inside of a modern UI. I'm a diehard Apple user, but I also subscribe to the "right tool for the job" principle ... in this case it worked pretty well.

Anyway, inside the document were links to some private areas of our website that contained demo materials for YC. As requested, they were not password protected, but also not linked from anywhere else. While submitting I ensured that our nginx logs would capture visits to these URLs in a separate log, so we'd know when they were being looked at (sidenote, seeing visitors coming from inside justin.tv + the rincon hill towers is kind of exhilarating).

What surprised me was that almost immediately after we began working on the document, the Bing bot was going apeshit exploring the domain and the 'private' URLs. I had to quickly add a robots.txt to deny all on the root. I thought it was pretty interesting. At first I felt almost violated. But then it seems logical that they'd be indexing every URL in every document stored in their datacenter, why not?
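
For anyone wondering what "deny all on the root" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The two-line robots.txt body is the conventional deny-all form; the `is_allowed` helper, the `bingbot` agent string, and the example.com URL are made up for illustration and are not taken from the setup described above. It only tells you what a well-behaved crawler would conclude, nothing more.

```python
from urllib.robotparser import RobotFileParser

# Conventional deny-all robots.txt served at the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The agent name and URL are placeholders, not real log data.
    print(is_allowed(ROBOTS_TXT, "bingbot", "https://example.com/private/demo"))
```

This is advisory only: a crawler that ignores robots.txt is not stopped by it, so anything truly private still needs authentication.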

  • rpm4321 13 years ago

    Eh, I'm pretty sure you should still feel violated. The fact that they are parsing your private documents for information that they can use to help another business unit is really sketchy. It would make me wonder what else they are scanning my data for.

    Personally, I'll never use an MS cloud service because of this anecdote - not that it was that likely to begin with.

  • jacquesm 13 years ago

    I'd feel extremely violated.

    I use google docs, very sparingly. One of the spreadsheets there contains a URL that is not linked from anywhere else and impossible to guess. If that URL ever gets tripped it will send me an email and the day that happens is the day I'll stop using google services (so far so good, and of course I should say 'google drive' now instead of 'google docs').

    • tekromancr 13 years ago

      How would you know it was google indexing your document vs., say, your browser prefetching the link?

      • jacquesm 13 years ago

        Because my browser has never looked at the document with the link in it. Obviously that would defeat the purpose.

  • eli 13 years ago

    You assume they were indexing skydrive documents. It could well be that one of the people who visited the link had a Bing toolbar installed.

    Either way, all publicly accessible documents will get indexed sooner or later.

    • whalesalad 13 years ago

      This was before the document had been sent to anyone. It was still being edited, only my friend and I were working on it. Also, the documents were not public.

      • eli 13 years ago

        I would be surprised if Microsoft is intentionally indexing links in private documents, but my point stands: Google et al are remarkably good at indexing the web. If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.

        • drivebyacct2 13 years ago

          >If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.

          Which only blocks bots that respect the file...

    • rpm4321 13 years ago

      You may be right, but I can't help but smirk at the thought of PG or Buchheit downloading and installing the Bing Toolbar ;)

cddotdotslash 13 years ago

Why is this even news? Facebook has been crawling links for ages every time you post on the site. The crawler is how the link you paste gets a title, description, and sometimes a thumbnail.

  • mars 13 years ago

    +1. not sure how a post like this can make it to the front page.

    • whalesalad 13 years ago

      There was a period where Hacker News consisted primarily of people on the right-hand side of the spectrum. People who were working inside of startups or had lots of experience with the web and our industry. Pretty much everyone knew what sharding was, and MongoDB wasn't very popular.

      These days we've got a lot more people and they show up all across the board.

      Clearly if this is on the homepage, it was voted there by your peers. This kind of knowledge is completely obvious to many of us, but not everyone is on your level. Cut 'em some slack.

      • omarchowdhury 13 years ago

        Even so, for those who are up to that point, that headline could give the implication that Facebook is getting into the search business.

        • jacquesm 13 years ago

          That's why you should read the articles and not just the headlines. Headlines more often than not give a wrong impression.

      • mars 13 years ago

        i'm just not so sure if the direction this is heading to is good. and also the headline of that article is miserable (as omarchowdhury already pointed out).

    • egfx 13 years ago

      Hmm, I thought people mainly knew about this. But I researched this a lot while making http://2fb.me and actually witnessed this myself with google docs but didn't think twice about it. I'm sure the same thought was going through the heads of everybody reading this that knew anything about how the Facebook sharer worked. For those that found this news, it's beneficial to put open graph meta tags on your pages to control the crawler and you can also invoke the crawler (http://developers.facebook.com/tools/debug) when your page changes before Facebook updates it automatically every 2 weeks or so.

  • k3n 13 years ago

    Do note, and I'm not sure if this differs from what you're referring to, that the link was never even posted to the site; rather, it was pasted into a chat box but never sent. Small difference but I think it's an interesting point.

    > To my bemusement, not only was the friend I was messaging away, I also hadn't even sent the link; I pasted it into the chat window but forgot to hit enter.

    • cddotdotslash 13 years ago

      You're right, but it also happens when you paste in the status button before hitting "Post." Try this: go to Facebook, post a link in your status box. Wait a few seconds. Notice that it will populate the link information fields even before you submit the post. It has to get that info from somewhere. That's where the crawler comes in.

      • k3n 13 years ago

        Yeah that's expected, but it's not expected (at least for me) to happen in a chat. But, I never use FB chat so I don't know -- does it also create thumbnails for links and such?

        • cddotdotslash 13 years ago

          Hmm, I see what you're saying; no it doesn't appear to create the thumbnail in chat. It would be interesting if Facebook uses different crawlers depending on whether the link is posted in the status box or in a chat. That could lead to some interesting analytics such as "your website was chatted about x number of times and shared via a status update x times."

  • TannerLD 13 years ago

    "What to Submit

    On-Topic: Anything that good hackers would find interesting. That includes more than hacking and startups. If you had to reduce it to a sentence, the answer might be: anything that gratifies one's intellectual curiosity."

    • lostlogin 13 years ago

      Wow I'm sick of people posting claims of off topic. And wow it's funny that someone has the patience to reply with the posting rules.

maxjaderberg 13 years ago

By looking at the headers you now have a great way of writing some analytics tools to see how much your website is shared on Facebook...
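
A rough sketch of that idea, assuming an access log in the common "combined" format and assuming the crawler identifies itself with `facebookexternalhit` in its User-Agent header. The linked article's actual header values are not quoted in this thread, so treat that token, the regex, and the sample log lines below as illustrative placeholders.

```python
import re
from collections import Counter

# Assumption: substring that marks the Facebook crawler's User-Agent.
CRAWLER_TOKEN = "facebookexternalhit"

# Combined log format: "METHOD path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_crawler_hits(lines):
    """Count requests per path whose User-Agent contains CRAWLER_TOKEN."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and CRAWLER_TOKEN in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Made-up sample lines using documentation IP ranges, not real traffic.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2012:13:55:36 +0000] "GET /post/1 HTTP/1.1" 200 2326 "-" "facebookexternalhit/1.1"',
    '198.51.100.7 - - [10/Oct/2012:13:56:01 +0000] "GET /post/1 HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2012:13:57:10 +0000] "GET /about HTTP/1.1" 200 512 "-" "facebookexternalhit/1.1"',
    '203.0.113.6 - - [10/Oct/2012:13:58:44 +0000] "GET /post/1 HTTP/1.1" 200 2326 "-" "facebookexternalhit/1.1"',
]

if __name__ == "__main__":
    for path, count in count_crawler_hits(SAMPLE).most_common():
        print(path, count)
```

Each crawler hit is only a proxy for a share or paste: repeated fetches and cached previews mean the counts will not map one-to-one onto actual shares.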

edouard1234567 13 years ago

I'm surprised this post made it to the homepage... They've been doing that forever, no need to look at your logs to figure this out. How else would they find and display an image from the page you're providing a link to?

justinph 13 years ago

12 lines of code instead of:

  tail -f /var/log/apache2/access.log

eli 13 years ago

I would imagine they're checking the URL for malware as well.

  • nwh 13 years ago

    Probably, I've seen them ban whole domains (droplr.com) previously for distributing malware.

spyder 13 years ago

Also it would be smart to run a malware check on these URLs if they aren't already doing it.

slajax 13 years ago

I wish I had enough karma to down vote this.
