Ask HN: Has anyone ever crawled over a billion pages? How much did it cost?
I'm really curious how much it would cost to crawl a billion pages. It doesn't really matter whether you used a SaaS solution or built your own crawler; any info would be really useful. There's a discussion about a 2-billion-page crawl on the front page right now: https://news.ycombinator.com/item?id=12486631

Here's the author's comment on hardware: https://news.ycombinator.com/item?id=12487003 and later he says it costs 300 Euro/month to run the service. That post is what triggered my Ask post. The problem is the huge contrast with https://www.quora.com/How-much-would-it-cost-to-crawl-1-bill... even taking into account the drop in AWS prices since then.
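For a rough sense of scale, here is a back-of-envelope calculation. The crawl window and average page size below are my own assumptions, not figures from either thread:

    # Back-of-envelope: what sustained rate and bandwidth does a
    # 2-billion-page crawl imply? All inputs are assumptions.
    pages          = 2_000_000_000
    crawl_days     = 90          # assumed crawl window
    avg_page_bytes = 60_000      # assumed average HTML size (~60 KB)

    pages_per_sec = pages.to_f / (crawl_days * 24 * 3600)
    mbit_per_sec  = pages_per_sec * avg_page_bytes * 8 / 1_000_000

    puts "#{pages_per_sec.round} pages/sec sustained"
    puts "#{mbit_per_sec.round} Mbit/s average download bandwidth"
    # => ~257 pages/sec and ~123 Mbit/s with these assumptions,
    #    which a couple of dedicated servers on gigabit links can sustain.

With those (assumed) numbers the fetching itself doesn't obviously require AWS-scale spending; the contested part of the cost seems to be storage and indexing.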
Also, if you take a quick look at companies that provide such services, the prices are orders of magnitude higher than deusu's costs.

Deusu's crawl servers are located at https://www.hosteurope.de/en/Server/Root-Server/ while the website points to his home broadband ISP. Two servers at his specs would be 200 Euro/month total, with 5x more bandwidth than he currently uses. I'd say that's much cheaper than AWS. Of course crawl companies charge more: they run a business, pay system administrators, and have more backup and redundancy.

I'm not sure how he manages to crawl at this speed with such a low amount of resources. We did a benchmark on Nutch and couldn't really get past 10-14 M(B)ps on a $1200/month machine, even though we hired a professional to optimize the setup. The same is roughly true of Heritrix. Just wondering if there is something missing in his setup, such as domain/IP rate limiting.

You can check his source if you are curious how it works ;)

I've crawled over a billion pages over a stretch of 3 years or so. Crawling is the easy part, and just crawling a billion pages wouldn't cost more than a few thousand a month. Add a couple more thousand for storing those pages in a search index and database.

Do you have a company that does this? Can you advise me about it? I like the crawling side of things and would like to know how to monetize it. Thanks!

Would you be able to share what your stack was, and the resources it took?
Thanks a lot.

- Ruby, with Sidekiq as the messaging queue

- Postgres to store the data

- Elasticsearch as the search index

My ES cluster has around 10 nodes, 64 GB RAM, quad-core. The Postgres cluster is 4 nodes, 1 TB, 64 GB RAM, quad-core. 800 crawler threads are distributed across 10 dedicated servers.

Thanks a lot!
This sounds reasonable.
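To make the stack above concrete, here is a minimal sketch of what a Sidekiq-based crawl worker along those lines might look like. The class, queue, table and index names are hypothetical, the per-domain delay is a naive stand-in for the kind of rate limiting mentioned earlier, and robots.txt handling, deduplication and link extraction are left out:

    require 'sidekiq'
    require 'net/http'
    require 'pg'
    require 'elasticsearch'
    require 'uri'

    class CrawlWorker
      include Sidekiq::Worker
      sidekiq_options queue: 'crawl', retry: 3

      LAST_HIT = {}    # naive in-process per-domain politeness tracking
      DELAY    = 1.0   # assumed seconds between hits to the same host

      def perform(url)
        host = URI(url).host
        wait = DELAY - (Time.now - (LAST_HIT[host] || Time.at(0)))
        sleep(wait) if wait > 0
        LAST_HIT[host] = Time.now

        body = Net::HTTP.get(URI(url))   # fetch the page

        # Store the raw page in Postgres (database/table names are assumptions).
        pg = PG.connect(dbname: 'crawler')
        pg.exec_params(
          'INSERT INTO pages (url, body, fetched_at) VALUES ($1, $2, now())',
          [url, body]
        )
        pg.close

        # Index it in Elasticsearch for search (index name is an assumption).
        es = Elasticsearch::Client.new
        es.index(index: 'pages', body: { url: url, body: body })
      end
    end

    # Enqueue URLs from anywhere in the app:
    # CrawlWorker.perform_async('https://example.com/')

In a real setup you'd pool the Postgres and Elasticsearch connections rather than open them per job, and you'd share rate-limiting state across all 800 threads, but the overall shape (queue in, page fetched, stored, indexed) is what the figures above describe.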
Did you guys look into professional services for this?

Nope. We have lots of custom needs.

Just in case you don't know, Common Crawl makes a huge crawl dataset available.

Common Crawl is great! However, some use cases require larger crawls with a higher frequency.

I think it would be valuable to have an open dataset of a raw crawl index. It could be distributed via academic torrents or through a partnership with a hosting provider. The real innovation won't be in the crawling but in working on the index: filtering it, organizing it, trying sorting algorithms, and learning. If this were available and gained popularity, I could see competition in search again.
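As a taste of what "working on the index" can look like already, here is a sketch that queries Common Crawl's public URL index (the CDX API) and filters the results. The collection name is just an example, and the record field names follow the CDX JSON output as I remember it, so verify them against the live API:

    require 'net/http'
    require 'json'
    require 'uri'

    # Example collection; pick a current one from index.commoncrawl.org
    collection = 'CC-MAIN-2016-40'
    endpoint   = URI("https://index.commoncrawl.org/#{collection}-index")
    endpoint.query = URI.encode_www_form(url: 'example.com/*', output: 'json')

    response = Net::HTTP.get(endpoint)

    # The API returns one JSON record per line; keep successful HTML captures.
    records = response.each_line.map { |line| JSON.parse(line) }
    html_ok = records.select { |r| r['status'] == '200' && r['mime'] == 'text/html' }

    html_ok.each do |r|
      puts "#{r['url']} -> #{r['filename']} (offset #{r['offset']})"
    end

From there the filename/offset/length fields point into the WARC files in Common Crawl's public storage, so you can pull just the records you care about instead of downloading the whole crawl.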