Ask HN: What are some of the most utilised patterns for querying large datasets?

7 points by extra_rice 7 years ago · 4 comments · 1 min read


I'm currently working on a software project where I need to query datasets that could be very large (maybe hundreds of thousands per single context), and then do some computations on the results. Basically, I need to find some sort of "median" from the set, but it could be a bit more complex than that, like finding the smallest, most common value. My impression is that most modern databases should be able to handle queries like this with some built-in mechanism. However, one concern is that, because the datasets could be very large, the queries could end up taking a very long time. The data being queried is also highly dynamic, so caching may be a little tricky.
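To make the computation concrete, here is a minimal in-memory sketch in Python. It assumes "the smallest, most common value" means the smallest value among those tied for the highest frequency, and that the median should be an actual member of the set (hence `median_low`). The function names `smallest_most_common` and `summarize` are hypothetical, and in practice the same logic would more likely be pushed down into the database.

```python
from collections import Counter
from statistics import median_low


def smallest_most_common(values):
    """Return the smallest value among those that occur most often."""
    counts = Counter(values)
    if not counts:
        raise ValueError("values must not be empty")
    top = max(counts.values())
    # Several values can share the top count; keep the smallest one.
    return min(v for v, c in counts.items() if c == top)


def summarize(values):
    """Compute both statistics over one materialized result set."""
    values = list(values)
    return {
        "median": median_low(values),  # always an element of the data
        "smallest_mode": smallest_most_common(values),
    }
```

Both functions need the whole result set in memory, which is exactly the part that gets expensive as the set grows.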

I'm pretty sure this isn't something unique to this project, but I'm interested to know how other practitioners address this kind of situation. Also, to note, while I'm asking this in general terms, it'd be interesting to know how MongoDB users in particular handle this.

bjourne 7 years ago

Have you heard about the inverted index? It is the cornerstone of all databases and information retrieval systems. Your question is quite fuzzy, so it is hard to come up with a more precise answer.

  • extra_rice (OP) 7 years ago

    Sorry, I didn't know how to make the question a bit clearer. Basically, it's: how do you ensure that queries on very large, highly dynamic datasets return in an acceptable amount of time (especially if clients call/poll them at regular, short intervals)?

    • sethammons 7 years ago

      Indexes, caching (pass-through, LRU, etc.), query read replicas, sharding, pre-fetching, sampling, maybe look into columnar storage ... Hard to answer without knowing more specifics.

      Something to always remember: if it is valuable, charge for it. If it is really valuable, you can spend all kinds of hardware on it. Give each customer their dedicated instance and rinse and repeat the strategies above.
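A minimal sketch of the caching idea for clients that poll at short intervals, assuming slightly stale results are acceptable for a few seconds. `TTLCache` and its parameters are hypothetical names rather than any library's API, and the sketch ignores eviction and thread safety.

```python
import time


class TTLCache:
    """Pass-through cache: recompute a value only after it is ttl_seconds old."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries = {}       # key -> (stored_at, value)

    def get(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # still fresh: skip the expensive query
        value = compute()        # missing or expired: run the real query
        self._entries[key] = (now, value)
        return value
```

With a TTL no longer than the polling interval, many clients polling the same key trigger at most one real query per TTL window, which bounds the load even when the underlying data changes constantly.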

snazzybazzy 7 years ago

It also depends on your query pattern. Are you fetching many columns/rows, or are you looking for one particular row? Also, what do you mean by "large"? Are we talking GBs, TBs, or bigger?
