Ask HN: Storing millions and billions of URLs?

12 points by gerenuk 8 years ago · 10 comments · 1 min read

Hello Everyone!

I'm currently using Elasticsearch to store metadata and other raw data, but only at a very small scale: around 500,000 domains.

I have been tasked with scaling it to 20-40 million domains and storing their internal/external links, while building a PageRank/domain authority score for each domain we add to our database.

What do you suggest for storing this data at a very large scale? Storing every page's internal and external links will push the database to somewhere between 100M and 1B links.

Any feedback or suggestions would be appreciated.

Thanks.
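For reference, the scoring part of the task can be sketched independently of the storage choice. Below is a minimal power-iteration PageRank in Python over a list of (source domain, target domain) pairs. The function name, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration, not anything from the post, and at 100M-1B links this would need a batch or graph-processing framework rather than in-memory dictionaries.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source_domain, target_domain) pairs.

    Hypothetical helper: damping and iterations are illustrative defaults.
    """
    out_links = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if src != dst:  # internal (self) links do not count at domain level
            out_links[src].add(dst)

    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by domains with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


edges = [("a.com", "b.com"), ("c.com", "b.com"), ("b.com", "a.com")]
scores = pagerank(edges)
```

The only storage requirement this implies is being able to stream every link edge once per iteration, which is a sequential scan rather than a search query.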

nik736 8 years ago

I don't think that any proper database technology will have issues with that amount of data. It all depends on how you use it.

sharemywin 8 years ago

Found this:

https://dba.stackexchange.com/questions/38793/which-database...

There's a nice little triangle diagram here: https://stackoverflow.com/questions/2794736/best-data-store-...

girishso 8 years ago

I have personally used CouchDB to store tens of millions of documents. If you can find a way to get the data you want using CouchDB views, the number of documents simply doesn't matter (only the disk usage grows with additional documents and views). And it does that with excellent performance.

drizzle87 8 years ago

Elasticsearch should easily be able to handle your scaling needs. Why do you think it would not? What are your concerns?

jjirsa 8 years ago

The answer will depend primarily on how you expect to query it.

Cassandra can do many orders of magnitude more than 1B, but would limit you in your query patterns.
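To illustrate what "limited by query patterns" tends to mean in practice: in a partitioned store you usually write the same link once per lookup you need to serve, keyed by whatever that lookup filters on. The Python sketch below uses two in-memory dictionaries as a stand-in for two such tables. The class and method names are hypothetical, and this is not Cassandra's API, only the shape of the data model.

```python
from collections import defaultdict


class LinkStore:
    """In-memory stand-in for two query-specific tables (hypothetical)."""

    def __init__(self):
        self.outbound = defaultdict(set)  # partition key: source domain
        self.inbound = defaultdict(set)   # partition key: target domain

    def add_link(self, source, target):
        # Each link is written twice, once per supported query.
        self.outbound[source].add(target)
        self.inbound[target].add(source)

    def links_from(self, source):
        """Which domains does `source` link to?"""
        return sorted(self.outbound.get(source, ()))

    def links_to(self, target):
        """Which domains link to `target`? (the backlink query)"""
        return sorted(self.inbound.get(target, ()))


store = LinkStore()
store.add_link("a.com", "b.com")
store.add_link("c.com", "b.com")
backlinks = store.links_to("b.com")
```

Any query that fits neither key, such as a free-text search over anchor text, would need another table or a different system.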

mr__y 8 years ago

Have you considered sharding the data across multiple independent ES instances? Each of them could then handle an amount of data that does not cause problems.
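A rough sketch of how that routing could look in Python, assuming the data is split by domain with a stable hash. The instance URLs and function names are made up for illustration.

```python
import hashlib

# Hypothetical addresses of the independent instances.
ES_INSTANCES = ["http://es-0:9200", "http://es-1:9200", "http://es-2:9200"]


def shard_for(domain, num_shards):
    """Map a domain to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to keep the mapping identical across runs.
    """
    digest = hashlib.sha1(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def instance_for(domain):
    """Return the instance that owns all documents for this domain."""
    return ES_INSTANCES[shard_for(domain, len(ES_INSTANCES))]


target = instance_for("example.com")
```

One caveat with plain modulo hashing: changing the number of instances remaps most domains, so it is worth picking the shard count with headroom or using consistent hashing instead.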

cimmanom 8 years ago

We've found Elasticsearch to be quite performant with hundreds of millions of documents. What are your concerns with scaling it?

dchuk 8 years ago

Building an ahrefs/moz/majestic competitor?

  • gerenuk (OP) 8 years ago

    BuzzSumo competitor with a different set of features.

    • dchuk 8 years ago

      I'm actually interested in hearing more about this if you're willing to share it.
