Use Node.js to Extract Data from the Web

storminthecastle.com

80 points by johnrobinsn 13 years ago · 34 comments

STRML 13 years ago

Don't forget streams, the more `node.js` way to parse HTML:

    var http = require('http');
    var tr = require('trumpet')();
    var request = require('request');
    request.get('http://www.echojs.com')
      .pipe(tr.createReadStream('article > span'))
      .pipe(process.stdout);


That's it! See https://github.com/substack/node-trumpet and their tests for more.
  • substack 13 years ago

    You probably meant:

        var tr = require('trumpet')();
        tr.createReadStream('article > span')
          .pipe(process.stdout);
        
        var request = require('request');
        request.get('http://www.echojs.com').pipe(tr);
    
    Bonus: running your intended code surfaced a simple bug in the selector engine, which I just fixed in trumpet@1.5.6.
  • kanzure 13 years ago

    And then there's hyperquest because maybe you want to do more than five simultaneous requests:

    https://github.com/substack/hyperquest
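
    For reference, hyperquest's one-call API looks roughly like this (the echojs URL is just a stand-in); each request gets its own connection instead of queuing behind the default five-socket pool:

        var hyperquest = require('hyperquest');

        // twenty concurrent GETs; no shared agent, so no five-socket cap
        // (responses will interleave on stdout -- fine for a demo)
        for (var i = 0; i < 20; i++) {
          hyperquest('http://www.echojs.com').pipe(process.stdout);
        }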

    • ssafejava 13 years ago

      True - you can also disable the globalAgent or change the number of pooled connections. Connection pooling was generally a bad idea (tm) in Node and afaik will be removed in the near future.
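
      For instance, against the Node 0.x core API of the time (where the pooled-socket default was 5), something like:

          var http = require('http');

          // raise the per-host cap on pooled sockets...
          http.globalAgent.maxSockets = 64;

          // ...or skip pooling entirely for a single request
          http.get({ host: 'www.echojs.com', path: '/', agent: false }, function (res) {
            res.pipe(process.stdout);
          });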

zenocon 13 years ago

I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes w/in iframes and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.
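
For the simple case, a minimal request + cheerio sketch (the echojs URL and the selector are stand-ins, not checked against the real markup):

    var request = require('request');
    var cheerio = require('cheerio');

    request.get('http://www.echojs.com', function (err, res, body) {
      if (err) throw err;
      var $ = cheerio.load(body);            // jQuery-style API over the parsed HTML
      $('article h2 a').each(function () {   // the selector is a guess at the markup
        console.log($(this).text() + ' -> ' + $(this).attr('href'));
      });
    });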

  • MrBlue 13 years ago

    PhantomJS + CasperJS is definitely the way to go when scraping data from complex pages. It's also great for circumventing bot detection. :)

  • enscr 13 years ago

    I find Scrapy (Python) to be more robust for large-scale scraping. There are cases where you want or need the JavaScript action, and that's when you need a real browser. Otherwise the rendering would just slow things down.

  • techaddict009 13 years ago

    Does this help with scraping websites that provide data via jQuery? I mean, does this render the JavaScript on the page?

nodesocket 13 years ago

Have you played around with node.io? https://github.com/chriso/node.io

It encapsulates all this functionality in an easy-to-use interface.

nostrademons 13 years ago

There are also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
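
As a rough sketch (assuming the npm package name and the one-function parse API shown in the project README; the exact shape of the returned tree may differ by version):

    var gumbo = require('gumbo-parser');

    // parse an HTML5 document into a tree of plain objects
    var tree = gumbo('<h1>Hello <b>world</b>!</h1>');

    // walk the tree -- `root`, `tagName`, and `children` are guesses at the
    // output shape; check them against the version you install
    (function walk(node, depth) {
      if (!node) return;
      console.log(new Array(depth + 1).join('  ') + (node.tagName || node.nodeType));
      (node.children || []).forEach(function (child) {
        walk(child, depth + 1);
      });
    })(tree.root, 0);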

aroman 13 years ago

Cheerio is really, really awesome. I've used it to build a fairly sophisticated web-scraping backend to wrap my school's homework website and re-expose/augment it via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

If anyone is interested in seeing what a sophisticated, parallelized usage of cheerio looks like, feel free to browse through the app I mentioned above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee

victorhooi 13 years ago

Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide actually walks you through building a scraper:

http://docs.casperjs.org/en/latest/quickstart.html

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?
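
For comparison, a CasperJS scraper has roughly this shape (the URL and the link-gathering logic are placeholders):

    // scrape.js -- run with `casperjs scrape.js`
    var casper = require('casper').create();

    casper.start('http://example.com/', function () {
        this.echo(this.getTitle());
    });

    casper.then(function () {
        // evaluate() runs inside the page, with the real DOM available
        var links = this.evaluate(function () {
            return [].map.call(document.querySelectorAll('a'), function (a) {
                return a.href;
            });
        });
        this.echo(links.join('\n'));
    });

    casper.run();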

premasagar 13 years ago

See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped guide Aaron Acerboni, an intern at my company Dharmafly, when he built it.)

dfrodriguez143 13 years ago

I like to use the Readability API so I don't need to deal with the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-...
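
A hedged sketch of that approach from Node (the parser endpoint and its `url`/`token` parameters are recalled from Readability's docs of the time -- verify before relying on them):

    var request = require('request');

    request.get({
      url: 'https://readability.com/api/content/v1/parser',
      qs: { url: 'http://example.com/some-article', token: 'YOUR_API_TOKEN' },
      json: true
    }, function (err, res, body) {
      if (err) throw err;
      console.log(body.title);   // parsed article title
      console.log(body.content); // cleaned-up article body as HTML
    });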

chatman 13 years ago

Isn't scrapy easier to use than this?

  • hackula1 13 years ago

    Cheerio is really easy for anyone familiar with jQuery (most Node.js devs, I would imagine).

  • level09 13 years ago

    It's probably more organized and easier to read than a huge number of nested callbacks.

  • AsymetricCom 13 years ago

    There are a lot of better ways to do this. Most of them involve documented standards, so your code doesn't break the moment someone changes something.

mholt 13 years ago

This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)

  • babby 13 years ago

    Come on, that's not really a scraping problem; it's more of a text-parsing problem coupled with an API lookup or scrape to verify the address.

    Though, I'd probably just Google for some good address regexes, match them against pages, throw each address into something like maps.google.com/?q=[address], and then try to scrape whatever normally pops up for a valid result. It also helps if you're expecting addresses to be in a certain country.
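
    A naive sketch of that idea (the regex is illustrative only -- real addresses are far messier than any one pattern):

        // very rough US street-address pattern
        var addressRe = /\d+\s+[A-Za-z0-9.\- ]+\b(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Ln|Lane|Dr|Drive|Ter|Terrace)\b/g;

        var text = 'Visit us at 742 Evergreen Terrace or 1600 Pennsylvania Ave.';
        var matches = text.match(addressRe) || [];

        matches.forEach(function (addr) {
          // hand each candidate to a maps lookup to verify it resolves
          console.log('https://maps.google.com/?q=' + encodeURIComponent(addr));
        });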

greenido 13 years ago

Similar to what I wrote a week ago: http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-w... :)

tommoor 13 years ago

I run an API at http://pagemunch.com that could help with this type of thing where the page includes microformats (a surprising number of pages do).

shospes 13 years ago

We also used cheerio and Node.js, and built a click-and-extract interface around it: http://www.site2mobile.com/.

  • garyjob 13 years ago

    Interesting. I ran into the same set of problems last year while working on two side projects, and ended up building a web-scraping service with a point-and-click interface on top of it: https://krake.io

level09 13 years ago

Here is how I like to do it:

    from pyquery import PyQuery as pq
    doc = pq('http://google.com')
    print doc('#hplogo')

tectonic 13 years ago

Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.

zerni 13 years ago

Nice!

I built a web crawler with Node.js myself last year. It's only a quick attempt, but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately, jsdom had a memory leak, so the crawler died after a while...
