Ask HN: Why do people use Puppeteer for webscraping instead of SaaS services?
Hi HN
As in the title, I am wondering why anyone would use Puppeteer/Selenium/another browser emulator for web scraping when there are already tens, if not hundreds, of SaaS services offering "scraping-as-a-service". JS execution aside.
A handful of examples: Scrapehero, Webrobots, Apify, Scrapingbee, Scrapinghub, Promptcloud
Leaving aside the ones that require a setup fee or have ridiculous pricing models, why would anyone want to set up Puppeteer/Selenium/other scraping bots instead of using one of the "scraping-as-a-service" platforms?
Probably because people who are doing web scraping aren't professional scrapers, they're just programmers who need some data quickly. And since they're already familiar with Selenium, they think that's the state of the art. I've never seen an ad for a scraping service, so I also didn't know that they existed.
That's true when it's a one-time job: pull the data and disappear. I also see how this is the case for most freelancers on Fiverr or Freelancer. This is the tool they know, so they use it. However, I imagine there are a number of companies that rely strongly on continuous data scraping, for price comparison, say, and I've seen one that uses Puppeteer heavily.
My main concern is pricing. Many websites use anti-scraping technologies, so scraping the HTML alone doesn't work anymore; you need to load everything and execute the JS. I have seen some that can detect headless/Puppeteer mode too, so I ended up creating my own scraping infra using vanilla Chrome. Current SaaS platforms charge by request count, so if I need to load everything, the cost will be too high.
I thought about it too, but when you consider the cost of running headless Puppeteer (let's say on AWS) plus the cost of a good proxy that is charged per GB, it's often as expensive as some of these SaaS services, if not more. This is especially the case for websites with heavyweight JS/CSS/image assets.
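To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. The function names and every parameter (page weight, proxy price per GB, server cost, SaaS price per 1,000 requests) are hypothetical placeholders, not quotes from any provider. It also assumes the SaaS bills one request per page, which may not hold if every asset counts as a request.

```python
def self_hosted_cost(pages, mb_per_page, proxy_usd_per_gb, server_usd_per_month):
    """Monthly cost of a self-run headless browser behind a proxy billed per GB."""
    traffic_gb = pages * mb_per_page / 1024  # total traffic routed through the proxy
    return server_usd_per_month + traffic_gb * proxy_usd_per_gb


def saas_cost(pages, usd_per_1000_requests):
    """Monthly cost of a SaaS scraper billed per request (assumes one request per page)."""
    return pages / 1000 * usd_per_1000_requests


def compare(pages, mb_per_page, proxy_usd_per_gb, server_usd_per_month, usd_per_1000_requests):
    """Return (cheaper option, self-hosted cost, SaaS cost) for one month."""
    own = self_hosted_cost(pages, mb_per_page, proxy_usd_per_gb, server_usd_per_month)
    saas = saas_cost(pages, usd_per_1000_requests)
    return ("self-hosted" if own < saas else "saas"), own, saas


if __name__ == "__main__":
    # Made-up inputs: 100k pages/month at 3 MB each, $10/GB proxy,
    # $50/month server, $5 per 1,000 SaaS requests.
    print(compare(100_000, 3, 10.0, 50.0, 5.0))
```

Heavier pages push up the self-hosted side only, since the proxy bills by traffic, while the per-request SaaS price stays flat. Plug in your own quotes before drawing any conclusion.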