Ask HN: What is the best tech stack nowadays for mass scraping?

4 points by imnotreallynew 2 years ago · 5 comments · 1 min read


I’m in the design phase for a new project that will involve quite a bit of web scraping, formatting that data as appropriate and saving to a DB.

The sources to be scraped are quite varied.

I’ve got a little bit of experience with Node/Express and Ruby/Rails. I’m more than happy to pick up Go or Python/Django or Elixir or something else if those are more appropriate. My hesitation with going back to Node is that I slightly prefer statically typed languages, but I’m happy to use the best tool for the job.

My concern is compute and bandwidth costs, as the various scrapers will be running and alternating quite frequently.

I’m hoping you all could give some recommendations for a stack that makes it easy to run mass scheduled web scraping jobs with little overhead in order to reduce server costs. Thanks!

awesomegoat_com 2 years ago

I have built my web scraping system ( https://awesomegoat.com ) on Ruby on Rails. And while I spent this Christmas break exploring Elixir/Phoenix, I am so far staying with Ruby on Rails.

While it seems I could have built a slightly more (CPU & memory) efficient system in Elixir, I am afraid the development of new features would be a bit slower, and my time is more precious than the machine's.

Also, CPU & memory are likely not the constraints in the scraping exercise. What you will likely find later on is that you get blocked by Cloudflare in week 2, and a superb backend won't make a difference.

  • awesomegoat_com 2 years ago

    Today, I woke up feeling that Elixir/Phoenix is the best platform for rewrites.

    I mean, when you know the problem domain well, you can build a masterpiece in Elixir/Phoenix. I still feel that putting together the first prototype has to be faster in Ruby on Rails.

    • MainlyMortal 2 years ago

      It's quite ironic, or rather unfortunate, that recently we're seeing the opposite problem in the Elixir community.

      A lot of the big famous companies used in case studies about how Elixir and Phoenix are amazing, save money, save resources, save development time etc. are starting to abandon the stack for technically worse solutions. And for no good reason, it seems, other than that the decision comes from management.

      I agree that it's a great platform for rewrites in that once you have a working solution, and you know the bottlenecks, then you understand how to break it up to make it concurrent, parallel and distributed with minimal effort.

      I also think that it's a great prototype language too, though. You can get up and running just as fast as Ruby on Rails for like 99% of projects. Or at least you used to be able to. I have a rant about the last five years of Phoenix churn being responsible for the low adoption of Elixir but that's for another day.

accrual 2 years ago

Maybe TypeScript for a typed, familiar and easy to read/write language, and either an internal scheduler or an external one (e.g. cron). A file with per-website rules/scraping hints stored as JSON on disk or in a database, unless it's supposed to be dynamic/one ruleset to rule them all. If you need to go faster, you could retrieve the data (wget/curl/some lib) and pass it to some binary (C/Rust) for processing into a database at core speed.
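
A rough sketch of what the per-website rules file plus scheduler check could look like in TypeScript. Everything here is an assumption made for illustration: the `SiteRule` fields, the `loadRules`, `dueRules` and `extract` helpers, and the example.com URLs are all hypothetical, and the regex "hint" is only a stand-in for whatever selectors or parsers a real scraper would use. The JSON is inlined as a string to keep the sketch self-contained; in practice it would be read from disk or a database.

```typescript
// Hypothetical rule shape; none of these field names come from the thread.
interface SiteRule {
  name: string;
  url: string;
  intervalMinutes: number; // how often this site should be scraped
  pattern: string;         // regex with one capture group, used as a crude extraction hint
}

// Stand-in for the contents of a rules file on disk.
const rulesJson = `[
  {"name": "example-a", "url": "https://example.com/a", "intervalMinutes": 15, "pattern": "<title>(.*?)</title>"},
  {"name": "example-b", "url": "https://example.com/b", "intervalMinutes": 60, "pattern": "<h1>(.*?)</h1>"}
]`;

// Parse the rules and reject anything that does not match the expected shape.
function loadRules(json: string): SiteRule[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("rules must be a JSON array");
  }
  for (const r of parsed) {
    if (
      r === null || typeof r !== "object" ||
      typeof r.name !== "string" || typeof r.url !== "string" ||
      typeof r.intervalMinutes !== "number" || typeof r.pattern !== "string"
    ) {
      throw new Error("malformed rule: " + JSON.stringify(r));
    }
  }
  return parsed as SiteRule[];
}

// Return the rules whose interval has elapsed since their last run.
// A rule that has never run is always due.
function dueRules(
  rules: SiteRule[],
  lastRunMs: Map<string, number>,
  nowMs: number
): SiteRule[] {
  return rules.filter((rule) => {
    const last = lastRunMs.get(rule.name);
    return last === undefined || nowMs - last >= rule.intervalMinutes * 60000;
  });
}

// Apply a rule's extraction hint to an already-downloaded page body.
function extract(rule: SiteRule, html: string): string | null {
  const match = new RegExp(rule.pattern).exec(html);
  return match?.[1] ?? null;
}
```

An external scheduler such as cron could invoke a script that calls `dueRules` with the current time, fetches each due URL, and passes the body to `extract`. Keeping the "what is due" decision a pure function of the clock makes it easy to test without touching the network.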
