Scrappy - Simple Perl Scraping Framework

33 points by alnewkirk 15 years ago · 16 comments

Most scrapers seem to break down on heavily dynamic/AJAX pages. For example, anything made with GWT appears to provide little for the average scraper to grab (say, for automated daily tracking of Android app downloads). Short of reverse-engineering the foreign page's API calls, has anyone encountered a solution that does more processing and then scrapes the rendered page?

(Well, and short of using Selenium to script a login, then scraping the rendered page via a controlled Firefox... which works, but is clunky.)

arkitaip 15 years ago

I was looking at several scraping solutions (e.g. iMacros, Selenium) that can handle DHTML for a project, and they all have significant performance issues since they need to render the actual pages before processing them. A couple of thousand rows isn't a problem, but try anything more and you've got a real performance bottleneck.
- odonnell 15 years ago
  
  DHTML is server-side. You mean AJAX. Also, think of the page as an interface to a more lightweight web service. You should probably be parsing that directly.
  - thirdusername 15 years ago
    
    He's referring to this: http://en.wikipedia.org/wiki/Dhtml I'm not sure what DHTML you are thinking of that would be server-side.
    
    odonnell 15 years ago
    
    Fuck, thinking of SHTML for some reason.
- chsonnu 15 years ago
  
  Have you tried Watir? I'm not sure if it'll solve your performance issue, but it's been at least twice as fast as Selenium for me.
odonnell 15 years ago

Use Charles or another proxy and find what feed the page is loading, then parse that. At least two requests need to be made anyway.
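
The "find the feed, then parse that" approach can be sketched roughly as follows. This is only a minimal illustration in Python, not anything from Scrappy or Scrapy: the JSON layout (an `apps` list with `name` and `downloads` fields), the `X-Requested-With` header, and both function names are assumptions made up for the example. The real URL, headers, and payload shape are whatever the proxy shows the page requesting.

```python
import json
from urllib.request import Request, urlopen


def fetch_feed(url, timeout=10.0):
    """Download the raw feed body that the page's JavaScript would request."""
    # Assumption: some endpoints only answer requests that look like XHR
    # calls. Copy whatever headers the proxy actually shows instead.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_feed(body):
    """Turn the hypothetical JSON feed into a {name: downloads} mapping."""
    data = json.loads(body)
    # Hypothetical shape: {"apps": [{"name": ..., "downloads": ...}, ...]}
    return {app["name"]: int(app["downloads"]) for app in data["apps"]}
```

Calling `parse_feed(fetch_feed(url))` with the feed URL found in the proxy would then skip rendering altogether, which is where the speed advantage over browser-driven tools comes from.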

bravura 15 years ago

At the very least, Scrappy (Perl) should link to Scrapy (Python).

Otherwise, it seems remiss of this developer to pick a name that is easily confused with that of an existing open-source project with a similar purpose.

jonathansizz 15 years ago

Oh, at the very least!
Then the module author should prostrate himself at the feet of the all-mighty Python community, trembling in unreserved awe whilst acknowledging that the world quite deservedly revolves around them!
Module authors of other (obviously inferior) languages, take note: always check if there's any Python code with a similar name before you choose a title for your project, to avoid embarrassment!
- devinj 15 years ago
  
  I don't think that Python had anything to do with it. It's sort of like me calling my new programming language made for sysadmin work and bioinformatics "Pearl". It's just a confusing name considering that there's another programming language that's used for the exact same things, and spelled nearly the same and pronounced identically.
  Now imagine if I said on my website, "By the way, feel free to call it Perl, just be aware that there's also another project that calls itself that". Now I'm giving permission for people to be even more confusing!
  - benatkin 15 years ago
    
    This isn't a great comparison. Perl is huge compared to these scraping tools. With how the name "Perl" came to be, it's not likely that two people developing a similar kind of thing would independently come up with a similar name. It's quite common to add a y to a concept to come up with a name (or the last consonant and y, like scrappy, or ly).
- bravura 15 years ago
  
  It's not about Python > Perl. It's that Scrapy is older (AFAIK) than Scrappy. If it were the other way around, I would have proposed that Scrapy rename or at least reference Scrappy.
  The point is to avoid confusion among users who might not know about both tools.
- epochwolf 15 years ago
  
  Like Wordpress picking "Django" as the name for a release? (Aside from the fact that Django is trademarked)
pyre 15 years ago

He/she may not have known that it existed, but a link/reference would be good.
- jleader 15 years ago
  
  http://search.cpan.org/dist/Scrappy/lib/Scrappy.pm#DESCRIPTI... says:
  "Scrappy (pronounced Scrap+Pee) == 'Scraper Happy' or 'Happy Scraper'; If you like you may call it Scrapy (pronounced Scrape+Pee) although Python has a web scraping framework by that name and this module is not a port of that one."
  Looking at the previous versions on CPAN, it looks like it's said something similar since version 0.51 last August (a day after version 0.50, which appears to have been the first release).
