Ask HN: Do you know a good resource for a large data scraping job?
My company, Easy Vino (easyvino.com), is gearing up for a beta release and we need to populate our database with wine lists. The job consists of extracting information from wine lists (which we already have, usually as PDFs, HTML pages or pictures) and putting it into our database.
We have a simple back office that connects to a wine API to search for wine info, and we need help inputting the data. I'd rather have the same person (or team) doing this, as the learning curve is significant.
Does anyone know a cheap resource for this type of task? Any help or reference is appreciated.
Thanks a lot!
I'm not sure exactly what sort of answer you are expecting. Unless the data you want is in a standardized format (such as a standardized XML schema), any effort to extract it would require writing a custom parser for each set of data that has a different structure. Are you asking for advice on which technology stack to use for writing this, or are you looking for a pre-made tool that can extract it for you? There may be some tools that can "attempt" to do this without requiring you to write custom code, but I am not sure how effective they would be.
I believe it has to be a person. I've used Mechanical Turk in the past and it's great for easy, simple tasks. This one requires a little learning, which means sticking to one person or team would be best because they quickly get faster and more efficient. I'm looking for advice on companies or people you've used in the past that you liked. Thanks!
The typical way of doing this is to use Mechanical Turk. There are some third-party services (their names escape me) built on top of MTurk to provide reliability. The typical way they do this is to have two different people enter the same data and, when there's a mismatch, have a supervisor decide which is right.
I've used Mechanical Turk in the past for easy tasks. This one requires a little learning, and I feel people get a lot faster even after one day. My concern with MTurk is having different people all the time, which is a lot less efficient. To give you a number, right now it takes me 1-2 minutes to add a line, whereas it takes someone new 5-8 minutes. That's the kind of learning curve I'm hoping for if I hire the same person to do this for 2-3 weeks. Do you know if MTurk can offer this? Thanks a lot!
I believe you can create a custom group of qualified workers on MTurk.
Thanks.
You might have good luck just hiring some cheap Virtual Assistants to do this work for you. oDesk or Elance are pretty good for these types of administrative tasks.
Thanks! Do you know anyone in particular?
Alas, I don't have any personal experience with hiring a VA - I've just listened to Rob Walling talk about them a lot. I've used oDesk several times for other things, though, and it works fairly well. My suggestion is to write up a short "test" project and hire five or six competent-looking VAs. Give them a hard deadline (two hours tops, etc.) and see who completes the job in a satisfactory manner. Some will totally suck, some will never get back to you, and a few will be awesome. If your entire group sucks, ditch them and move on to a new group of candidates. Once you find someone who is good and in your price range, give them a larger task and see how it goes.
Thanks! I'll try that.
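For anyone who wants to try the double-entry check mentioned earlier in the thread (two people enter the same data, and a supervisor resolves mismatches), here is a rough sketch of the comparison step. It is only an illustration under assumptions: the field names (`producer`, `vintage`, `price`) and the `reconcile` helper are hypothetical and not taken from Easy Vino's back office or from any MTurk-based service.

```python
from typing import Dict, List, Tuple

Entry = Dict[str, str]  # one wine-list line as keyed in by one person (hypothetical shape)


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case so trivial typing differences do not count."""
    return " ".join(value.split()).casefold()


def reconcile(first: Entry, second: Entry) -> Tuple[Entry, List[str]]:
    """Compare two independent entries of the same line.

    Returns the fields both people agree on, plus the names of the
    fields that differ (or are empty) and need a supervisor's decision.
    """
    accepted: Entry = {}
    disputed: List[str] = []
    for field in sorted(set(first) | set(second)):
        a = first.get(field, "")
        b = second.get(field, "")
        if a.strip() and normalize(a) == normalize(b):
            # Keep the first person's spelling, with whitespace tidied up.
            accepted[field] = " ".join(a.split())
        else:
            disputed.append(field)
    return accepted, disputed


if __name__ == "__main__":
    entry_a = {"producer": "Chateau  Margaux", "vintage": "2005", "price": "450"}
    entry_b = {"producer": "chateau margaux", "vintage": "2006", "price": "450"}
    agreed, to_review = reconcile(entry_a, entry_b)
    print("agreed:", agreed)
    print("needs supervisor:", to_review)
```

Only the disputed fields would go to the supervisor, so the extra cost of entering everything twice stays limited to the lines where the two people actually disagree.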