Ask YC: How to web scrape 100s of sites/forms?

1 points by bdouglas 17 years ago · 5 comments · 1 min read


hi...

a potential research app requires scraping a few hundred websites/forms and diving into the child links to obtain the linked/parent structure, i.e. company->dept->title->name.

in this case, this would involve going 4 levels deep, and getting the required information.

so, does anyone know of a method/app/company that can be used to accomplish this? or am i going to have to figure out how to get a number of cheap guys to write a bunch of python scripts!!

thanks
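
For reference, the four-level crawl described above can be sketched with nothing but the Python standard library. This is only a rough illustration under stated assumptions, not a finished scraper: the names `LinkParser`, `fetch`, `crawl` and `path_to` are hypothetical, levels are counted from 1 at the starting page, and pulling the actual fields (dept, title, name) out of each page is left out because it depends on each site's markup.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=4, fetch_page=fetch):
    """Breadth-first crawl; returns {url: (parent_url, depth)}."""
    seen = {start_url: (None, 1)}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url][1]
        if depth >= max_depth:
            continue  # deepest level: recorded, but its links are not followed
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            child, _fragment = urldefrag(urljoin(url, href))
            if not child.startswith(("http://", "https://")):
                continue  # ignore mailto:, javascript:, and similar links
            if child not in seen:
                seen[child] = (url, depth + 1)
                queue.append(child)
    return seen


def path_to(url, seen):
    """Walk parent links back to the start page (company -> ... -> name)."""
    chain = []
    while url is not None:
        chain.append(url)
        url = seen[url][0]
    return list(reversed(chain))
```

Passing `fetch_page` as a parameter is a design choice for the sketch: it lets the crawl logic be tried against canned HTML before pointing it at real sites. A real run over a few hundred sites would also want rate limiting and robots.txt handling, neither of which is shown here.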

nreece 17 years ago

...cheap guys to write a bunch of python scripts

You know what will be 'cheap'? Writing it yourself.

qhoxie 17 years ago

Libraries like mechanize and hpricot are shrinking the curve for scraping tasks. That's not to say it is easy, but it should not take a bunch of people working on it. One good developer with proper experience would be ample in my opinion.

olefoo 17 years ago

Or get one expensive guy to write you a script that writes the scripts to scrape the sites by scraping the sites to read the structure to write the scripts to scrape the sites.

Anon84 17 years ago

Check out the search.wikia.org project. They make their crawler (and crawl data) available. Maybe you can get away with using theirs. That would really be cheap!

gaius 17 years ago

I am unable to think of an application for this technology other than spamming. Care to provide more details before we shoot ourselves in the face by helping you?
