Ask HN: Is there a case for small / medium data

2 points by pklee 5 years ago · 8 comments


A lot of innovation happens in big data (data that needs distributed compute, e.g. Spark): the ability to blend data across sources, schemaless / schema-on-the-fly, deploying analytical models to production, and so on. Is there a case for similar innovation in small to medium data (working with ~10M-row datasets): blending across data sources, simple analytical models, and such? What percentage of use cases are in the big-data realm vs. small/medium data?

temp234 5 years ago

This is an incredibly interesting question, but I have no idea how you would ever be able to figure out the answer. What defines a data set? What about huge data sets that reference back to a relatively small mapping table: is that one big data set or two data sets of different sizes? Maybe a cloud hosting provider would have some insight into hosted data sets, but even if the public had that information we still wouldn't know anything about data sets that are collected and stored on local machines. Similar problems arise for cataloguing models by their complexity. What is the broader question here? What are you trying to figure out?

There is definitely research being done on sparse data sets. Early stats methods were applied to what we would consider small data. Tukey did a lot of work on data viz and exploratory data analysis that was important and applies to small data sets. Many medical experiments use small data sets. Bayesian methods can apply to small data sets.

  • pklee (OP) 5 years ago

    Yeah, I am not just talking about analytics... even basic merging across data sources, simple visualizations based on the blended data, and the ability to go across environments.

ploika 5 years ago

I'm kind of sad that the term "data mining" has fallen out of favour, because large datasets (as with mines) tend to contain a lot of worthless dirt that just has to be sifted through.

10 million rows of data is still pretty big, all the same. You can get away with invoking the Central Limit Theorem after about 30 observations, for instance (with all the usual assumptions and caveats). Sometimes all you're getting for the extra effort is a tighter confidence interval around something that could be pretty well estimated with a couple of hundred rows of data.
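
To make that concrete, here is a rough sketch (made-up numbers, just illustrating the usual 1/sqrt(n) scaling, not anything from the thread): the 95% confidence interval for a skewed, hypothetical metric narrows slowly as the sample grows, so a few hundred rows already gets you most of the way.

    import numpy as np

    rng = np.random.default_rng(0)

    # A deliberately skewed, hypothetical "metric" so the CLT has some work to do.
    def draw(n):
        return rng.lognormal(mean=3.0, sigma=1.0, size=n)

    for n in (30, 300, 3_000, 300_000):
        sample = draw(n)
        half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)  # normal-approximation 95% CI
        print(f"n={n:>7,}: mean estimate {sample.mean():6.2f} +/- {half_width:.2f}")

The half-width shrinks by roughly a factor of sqrt(10) for every tenfold increase in rows, which is the diminishing return described above.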

idoh 5 years ago

Yes, there are huge opportunities for small / medium data. Maybe 90%+ of data problems are in the small size range. The biggest pain point is converting the insights from the analysis into an actionable plan that actually improves things.

Thinking this through, the main pain point I've experienced is convincing people to act on the data, follow through, and connect changes in the product (however defined) to changes in the metrics.

  • pklee (OP) 5 years ago

    Yes, completely. What I am seeing more and more is people trying to shovel everything into big data, but it becomes a nuclear-powered can opener, which I am not sure is needed. The UI for this oscillates between Excel with a macro and "here is a SaaS". Absolutely nothing in between.

shoo 5 years ago

In some cases there can be a large improvement going from the status quo (if the status quo is rather lacklustre) to a simple model, and it may not be worth doing anything more complicated if the accuracy of the component in question is no longer a bottleneck on overall system performance.

Maybe a simple model with a well-chosen prior informed by domain knowledge does the job.
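
As a hedged sketch of that idea (hypothetical numbers, assuming something like a conversion-rate problem): a conjugate Beta-Binomial model where the prior encodes the domain belief that rates around 3% are typical, updated on a tiny observed sample.

    # Hypothetical small-data example: conversion rate with an informative prior.
    from scipy import stats

    prior_a, prior_b = 3, 97                 # prior roughly centred on 3%, fairly loose
    conversions, trials = 4, 60              # made-up observed data

    # Conjugate update: Beta(a + successes, b + failures)
    posterior = stats.beta(prior_a + conversions, prior_b + trials - conversions)
    print("posterior mean:", round(posterior.mean(), 4))
    print("95% credible interval:", [round(x, 4) for x in posterior.interval(0.95)])

With this little data the prior does much of the work; with more rows the likelihood would dominate and the choice of prior would matter less.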

he11ow 5 years ago

Most commercial value lies in processing small/medium data.
