Ask HN: Do you know a good resource for a large data scraping job?

9 points by hugo31370 14 years ago · 10 comments · 1 min read


My company, Easy Vino (easyvino.com), is gearing up for a beta release and we need to populate our database with wine lists. The job consists of extracting information from wine lists (which we already have, usually as PDFs, HTML pages, or pictures) and putting it into our database.

We have a simple back office that connects to a wine API to search for wine info, and we need help entering the data. I'd rather have the same person (or team) do all of it, as the learning curve is significant.

Does anyone know a cheap resource for this type of task? Any help or reference is appreciated.

Thanks a lot!

devs1010 14 years ago

I'm not sure exactly what sort of answer you are expecting. Unless the data you want is in a standardized format (such as a standardized XML schema), any effort to extract it would require writing a custom parser for each set of data that has a different structure. Are you asking for advice on which technology stack to use for writing this, or are you looking for a pre-made tool that can extract it for you? There may be some tools that can "attempt" to do this without requiring you to write custom code, but I am not sure how effective they would be.

  • hugo31370OP 14 years ago

    I believe it has to be a person. I've used Mechanical Turk in the past and it's great for easy, simple tasks. This one requires a little learning, which means sticking to one person/team would be best because they can quickly get faster and more efficient.

    I'm looking for advice on companies or people you've used in the past that you liked. Thanks!

ig1 14 years ago

The typical way of doing this is to use Mechanical Turk. There are some third-party services (their names escape me) which are built on top of MTurk to provide reliability.

The typical way they do this is to have two different people enter the data and when there's a mismatch have a supervisor decide which is right.
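
A minimal sketch of that double-entry check, assuming each typist's work arrives as a plain dict of field names to values. The field names, the `compare_entries` helper, and the normalization rule (ignore case and extra whitespace) are all hypothetical, not taken from any particular service:

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    accepted: dict  # fields where both entries agree
    disputed: dict  # field -> (value_a, value_b), left for a supervisor


def _normalize(value):
    # Assumption: case and extra whitespace should not count as a mismatch.
    return " ".join(str(value).split()).casefold()


def compare_entries(entry_a, entry_b):
    """Compare two independent entries of the same record, field by field."""
    accepted, disputed = {}, {}
    for field in sorted(set(entry_a) | set(entry_b)):
        a, b = entry_a.get(field), entry_b.get(field)
        if a is not None and b is not None and _normalize(a) == _normalize(b):
            accepted[field] = a  # keep the first typist's spelling
        else:
            disputed[field] = (a, b)  # mismatch or missing value
    return Resolution(accepted, disputed)


if __name__ == "__main__":
    first = {"producer": "Example Estate", "vintage": "2005", "price": "45"}
    second = {"producer": "example  estate", "vintage": "2006", "price": "45"}
    result = compare_entries(first, second)
    print("accepted:", result.accepted)
    print("needs supervisor:", result.disputed)
```

Only the disputed fields go to the supervisor, so the review cost scales with the number of disagreements rather than with the size of the wine list.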

  • hugo31370OP 14 years ago

    I've used Mechanical Turk in the past for easy tasks. This one requires a little learning, and I feel people get a lot faster even after one day. My concern with MTurk is having different people all the time, which is a lot less efficient. To give you a number, right now it takes me 1-2 minutes to add a line, whereas it takes someone new 5-8 minutes. That's the kind of improvement I'm hoping for if I hire the same person to do this for 2-3 weeks.

    Do you know if MTurk can offer this? Thanks a lot!

polyfractal 14 years ago

You might have good luck just hiring some cheap virtual assistants to do this work for you. oDesk and Elance are pretty good for these types of administrative tasks.

  • hugo31370OP 14 years ago

    Thanks! Do you know anyone in particular?

    • polyfractal 14 years ago

      Alas, I don't have any personal experience with hiring a VA - I've just listened to Rob Walling talk about them a lot.

      I've used oDesk several times for other things, though, and it works fairly well. My suggestion is to write up a short "test" project and hire five or six competent-looking VAs. Give them a hard deadline (two hours tops, for example) and see who completes the job satisfactorily.

      Some will totally suck, some will never get back to you, and a few will be awesome. If your entire group sucks, ditch them and move on to a new group of candidates.

      Once you find someone who is good and in your price range, give them a larger task and see how it goes.
