Modeling Uncertainty with PyTorch

romainstrock.com

64 points by srom 4 years ago · 22 comments

gillesjacobs 4 years ago

The field of ML is largely focused on just getting predictions with fancy models. Estimating the uncertainty, unexpectedness and perplexity of specific predictions is highly underappreciated in common practice.

Even though it is highly economically valuable to be able to tell to what extent you can trust a prediction, the modelling of uncertainty in ML pipelines remains an academic affair in my experience.

  • marbletimes 4 years ago

    When I was in academia, I used to fit highly sophisticated models (think many-parameter, multi-level, non-linear mixed-effects models) that gave not only point estimates but also confidence and prediction intervals ("please explain to me the difference between the two" is one of my favorite interview questions and I still have not heard a correct answer).

    When I tried to bring an "uncertainty mindset" over when I moved to industry, I found that (1) most DS/ML scientists use ML models that typically don't provide an easy way to estimate uncertainty intervals; (2) in the industry I was in (media), the people who make decisions and use model predictions as one of the inputs to their decision-making are typically not very quantitative, and an uncertainty interval, rather than strengthening their process, would confuse them more than anything else: they want a "more or less" estimate, not a "more or less, plus something more and something less" estimate; and (3) when services are customer-facing (see ride-sharing), providing an uncertainty interval ("your car will arrive between 9 and 15 minutes") would anchor the customer to the lower estimate (they do it for the price of rides booked in advance, and they need to, but they are often way off).

    So for many ML applications, an uncertainty interval that nobody internally or externally would base their decision upon is just a nuisance.

    • code_biologist 4 years ago

      Great answer. It prompts a bunch of followup questions!

      > most DS/ML scientists use ML models that typically don't provide an easy way to estimate uncertainty intervals

      Not a DS/ML scientist, but a data engineer. The models I've used have been pretty much "slap it into XGBoost with k-fold CV, call it done" — an easy black box. Is there any model or approach you like for estimating uncertainty with similar ease?

      I've seen uncertainty interval / quantile regression done using XGBoost, but it isn't out of the box. I've also been trying to learn some Bayesian modeling, but definitely don't feel handy enough to apply it to random problems needing quick answers at work.
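
      One relatively low-friction route (a minimal sketch with synthetic data, not something from the thread): scikit-learn's gradient boosting accepts a quantile loss, so fitting the 5th, 50th and 95th percentiles gives a crude 90% prediction band without leaving the usual boosting workflow.

          # Minimal sketch: a 90% prediction band via quantile loss in
          # scikit-learn's gradient boosting. Data and settings are synthetic.
          import numpy as np
          from sklearn.ensemble import GradientBoostingRegressor

          rng = np.random.default_rng(0)
          X = rng.uniform(0, 10, size=(1000, 1))
          y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)

          models = {
              q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
              for q in (0.05, 0.5, 0.95)
          }
          X_new = np.array([[2.0], [5.0]])
          lo, med, hi = (models[q].predict(X_new) for q in (0.05, 0.5, 0.95))
          print(lo, med, hi)  # lower bound, point estimate, upper bound per row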

      • marbletimes 4 years ago

        Correct, quantile regression is an option. Another is "pure" bootstrapping (a quick search for uncertainty + machine learning + bootstrapping shows this is a very active area of current research).

        The major problem with bootstrapping is the computational time for big models, since many models need to be fit to obtain a representative distribution of predictions.

        Now, if you want more "rigorous" quantification of uncertainty, one option is to go Bayesian using probabilistic programming (PyMC, Stan, TMB), but computational time for large models can be prohibitive. Another option is to "scale down" the complexity to models that might be (on average) a bit less accurate, but provide rigorous uncertainty intervals and good interpretability of results, for example Generalized Additive Models.

        A note here: I have seen quantification of uncertainty by people considered very capable in the ML community that gave me goosebumps. For example, because the lower bound of the interval was a negative number and the response variable being modeled could not be negative, the uncertainty interval was simply "cut" at zero. (One easy way to deal with this, although it depends on the variable modeled and the model itself, is log-transforming the response, but pay attention to the intervals when exponentiating back to the natural scale. Another useful interview question.)
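
        A minimal sketch of the "pure" bootstrapping idea (synthetic data and an arbitrary base model; the refit loop is exactly where the computational cost lives):

            # Minimal sketch of "pure" bootstrapping: refit the model on resampled
            # training sets and look at the spread of predictions.
            import numpy as np
            from sklearn.linear_model import Ridge

            rng = np.random.default_rng(0)
            X = rng.uniform(0, 10, size=(500, 1))
            y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=500)
            X_new = np.array([[2.0], [5.0]])

            preds = []
            for _ in range(500):  # the expensive part: one refit per bootstrap sample
                idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
                preds.append(Ridge().fit(X[idx], y[idx]).predict(X_new))

            lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
            # Note: this spread reflects estimation uncertainty (roughly a confidence
            # interval for the prediction); residual noise would still have to be
            # added to get a prediction interval for y itself.
            print(lo, hi)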

    • antman 4 years ago

      That is really an effect of CS rather than math people dominating ML, both in applications and in management. My background is in engineering, but I always hire a percentage of people with math and business backgrounds. In reality, there are very few ML applications that don't need confidence estimation and estimation of monetary costs. Otherwise each company will end up with the equivalent of the Google graveyard of useless applications. It really is not that hard.

      • marbletimes 4 years ago

        I agree that statisticians appreciate the importance of uncertainty intervals better than CS people do--it is mostly cultural--but the claim that "In reality there are very few ML applications that don't need confidence estimation and estimation of monetary costs" is empirically false.

        If ML applications required uncertainty attached to point estimates, we would see plenty more uncertainty intervals attached to point estimates; but in industry, outside of niches (e.g., banking, bio, actuarial work, to name a few), very few bother dealing with them.

        I am currently part of a large team (we are talking hundreds) of ML specialists, and I have yet to see a single presentation in which a point estimate was associated with some uncertainty interval. It was the same at my previous company, and when I interview candidates (dozens? hundreds?) I never get a satisfactory answer to the confidence interval vs. prediction interval question I ask.

        • antman 4 years ago

          Let me rephrase your empirical observation in probabilistic terms. If a random sample of data scientists from startups had the same distribution of mathematicians and CS people as a random sample of data scientists from banking, then we could compare empirically whether confidence intervals are equally useful in both industries.

          Given that regulators historically figured out that when playing with other people's assets you need to assess your confidence, the volatility of outcomes in non-banking industries that lack such oversight can be largely attributed to people Dunning-Krugering after a couple of Andrew Ng's courses.

          That is my claim, based on my experience working on projects across many industries and many countries.

    • j7ake 4 years ago

      Although confidence and prediction intervals are slightly different, is there an example where mistaking one for the other has led to real-world consequences? I have a feeling it's rare for it to matter.

      • marbletimes 4 years ago

        They are not slightly different; they measure something (totally) different.

        Confidence intervals are about E[y|x]; prediction intervals are about y. Sometimes, for example when there is not much variation in y|x, the two intervals may be similar, but that is due to the nature of the data, not because one is just "a bit larger than the other". (Otherwise, think about (1) the uncertainty around the mean of an empirical symmetric distribution with a very small standard deviation--we are 95% confident the true mean is between z and k--and (2) the 2.5%-97.5% interval of the raw data distribution. The numbers can look similar, but they represent different measures.)

        I paste below the example I made in another comment in this thread:

        --- In the vast majority of cases, what we want is the range for y (the prediction interval): given x = 3, what is the expected distribution of y? For example, say we train a model to estimate how the 100-m dash time varies with age. The uncertainty we want is, "at age 48, 90% of Masters athletes run the 100-m dash between 10.2 and 12.4 seconds" (here there would be another difference to point out between Frequentist and Bayesian intervals, but let's keep things simple).

        We are generally not interested in: given x = 3, what is the uncertainty of the expected value of y (that is, the confidence interval)? In that case, the uncertainty we get (we might want it, but often we do not) is, "at age 48, we are 90% confident that the expected time to complete the 100-m dash for Masters athletes is between 11.2 and 11.6 seconds".

        ----

        The two intervals can be similar according to some metrics ("ah, come on, 11s or 12s, who cares"), but they are measuring/estimating something very different, and in many cases the difference would matter a lot.

        Why do I say they "would" and not they "do"? Because many decisions in industry settings (outside some niches)--the vast majority, I'd say--use the point estimate (so, not even uncertainty intervals) only as one of the many inputs in the decision-making process, even when ML or statistical models are included in that process.

        Let me give you an example. Years ago I was developing models for estimating ROI for certain (very popular) products. The calculations made previously were absurdly wrong: there were log-transformations involved and, guess what, they were using confidence intervals ("the uncertainty around the expected ROI for a similar class of products is") instead of prediction intervals ("the ROI for this class of products is expected to be between w and j").

        I provided the correct intervals (i.e., prediction intervals), but in the end the decisions changed little, because those making the decisions were not considering uncertainty in any way in their decision-making process. That's why, in general, I don't worry too much about uncertainty on the rare occasions these days when I develop models.

        I mean, who outside of academia (and even there...) measures the accuracy of a predictive model while also taking the prediction intervals into account, for example complementing a metric like mean absolute error on test data with the proportion of test data that falls within the uncertainty intervals estimated for the model given the training data? The answer is "very few".
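
        As a concrete illustration of the two intervals and of the coverage check just described, a minimal sketch with synthetic data and an ordinary least squares fit in statsmodels (numbers and variable names are made up):

            # One OLS fit yields both a confidence interval for E[y|x] and a
            # prediction interval for y; then check coverage on held-out data.
            import numpy as np
            import statsmodels.api as sm

            rng = np.random.default_rng(0)
            x = rng.uniform(20, 80, size=400)                       # e.g. age
            y = 10.0 + 0.03 * x + rng.normal(scale=0.8, size=400)   # e.g. 100-m time
            x_train, y_train, x_test, y_test = x[:300], y[:300], x[300:], y[300:]

            fit = sm.OLS(y_train, sm.add_constant(x_train)).fit()
            frame = fit.get_prediction(sm.add_constant(x_test)).summary_frame(alpha=0.10)

            # 90% CI for the conditional mean: narrow, shrinks with more data.
            # 90% PI for a new observation: wide, bounded below by the noise in y.
            print(frame[["mean_ci_lower", "mean_ci_upper",
                         "obs_ci_lower", "obs_ci_upper"]].head())

            coverage = np.mean((y_test >= frame["obs_ci_lower"]) &
                               (y_test <= frame["obs_ci_upper"]))
            print("share of test points inside the 90% prediction interval:", coverage)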

        • j7ake 4 years ago

          They measure something different, I agree, but not different enough to matter in real-life decision-making processes, which often involve factors outside of the model.

          In real-life decision-making, there are many other factors, not known or not quantifiable, that come in and dominate any errors arising from using a confidence instead of a prediction interval.

    • joconde 4 years ago

      What do "multi-level" and "mixed effects" mean? There are tons of non-linear models with lots of parameters, but I've never heard these other terms.

    • curiousgal 4 years ago

      > the difference between the two

      One is bigger than the other, as far as I remember, which means that the standard error of the prediction interval is bigger?

      • marbletimes 4 years ago

        From a good SO answer, see https://stats.stackexchange.com/questions/16493/difference-b...

        "A confidence interval gives a range for E[y∣x], as you say. A prediction interval gives a range for y itself.".

        In the vast majority of cases, what we want is the range for y (the prediction interval): given x = 3, what is the expected distribution of y? For example, say we train a model to estimate how the 100-m dash time varies with age. The uncertainty we want is, "at age 48, 90% of Masters athletes run the 100-m dash between 10.2 and 12.4 seconds" (here there would be another difference to point out between Frequentist and Bayesian intervals, but let's keep things simple).

        We are generally not interested in: given x = 3, what is the uncertainty of the expected value of y (that is, the confidence interval)? In that case, the uncertainty we get (we might want it, but often we do not) is, "at age 48, we are 90% confident that the expected time to complete the 100-m dash for Masters athletes is between 11.2 and 11.6 seconds".
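
        One way to see the difference is how the two intervals behave as the sample grows: the confidence interval for the mean shrinks with n, the prediction interval does not. A minimal sketch with made-up numbers:

            # For an i.i.d. normal sample, the 95% CI for the mean shrinks like
            # 1/sqrt(n), while the 95% prediction interval for a new observation
            # stays roughly as wide as the data itself.
            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(0)
            for n in (10, 100, 10_000):
                y = rng.normal(loc=11.5, scale=0.6, size=n)  # e.g. 100-m dash times
                s = y.std(ddof=1)
                t = stats.t.ppf(0.975, df=n - 1)
                ci_half = t * s / np.sqrt(n)          # confidence interval for E[y]
                pi_half = t * s * np.sqrt(1 + 1 / n)  # prediction interval for new y
                print(f"n={n:>6}  CI half-width={ci_half:.3f}  PI half-width={pi_half:.3f}")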

  • NeedMoreTime4Me 4 years ago

    You are definitely right; there are numerous classic applications (i.e. outside of the cutting-edge CV/NLP stuff) that could greatly benefit from such a measure.

    The question is: Why don’t people use these models? While Bayesian Neural Networks might be tricky to deploy & debug for some people, Gaussian Processes etc. are readily available in sklearn and other implementations.

    My theory: most people do not learn these methods in their „Introduction to Machine Learning“ classes. Or is it lacking scalability in practice?
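
    For what it's worth, the sklearn route really is close to a one-liner once the data is small enough; a minimal sketch with synthetic data and an arbitrary kernel:

        # Gaussian Process regression in scikit-learn: the uncertainty comes
        # with the prediction via return_std. Synthetic data, arbitrary kernel.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 10, size=(200, 1))
        y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X, y)

        X_new = np.linspace(0, 10, 5).reshape(-1, 1)
        mean, std = gp.predict(X_new, return_std=True)  # predictive mean and std
        print(mean - 1.96 * std, mean + 1.96 * std)     # ~95% predictive band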

    • b3kart 4 years ago

      They often don’t scale, they are tricky to implement in frameworks that people are familiar with, and, most importantly, they rely on crude approximations, meaning that after all this effort they often don’t beat simple baselines like the bootstrap. It’s an exciting area of research though.

    • disgruntledphd2 4 years ago

      It takes more compute, and the errors from badly chosen data vastly outweigh the uncertainties associated with your parameter estimate.

      To be fair, I suspect lots of people do this, but for whatever reason nobody talks about it.

    • shakow 4 years ago

      > Or is it lacking scalability in practice?

      Speaking only from my own little corner of bioinformatics: lack of scalability above all else, both for BNNs and GPs.

      Sure, the library support could be better, but that was not the main hurdle, more a source of friction.

      • NeedMoreTime4Me 4 years ago

        Do you have an anecdotal guess about the scalability barrier, maybe? Like, does it take too long with more than 10,000 data points and 100 features? Just to get a feel.

        • shakow 4 years ago

          Please don't quote me on that, as it was academic work in a given language and a given library and might not be representative of the whole ecosystem.

          But in a nutshell, on OK-ish CPUs (Xeons a few generations old), we started seeing problems past a few thousand points with a few dozen features.

          And not only was the training slow, but also the inference: as we used the whole sampled chain of the weight-distribution parameters, memory consumption was a sight to behold and inference time quickly went through the roof when subsampling was not used.

          And all that was on standard NNs, so no complexity added by e.g. convolution layers.
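
          To make that inference cost concrete, a minimal PyTorch sketch (made-up network, data, and chain length) of predicting with a retained chain of weight samples, one forward pass per sample:

              # Prediction over a chain of posterior weight samples: memory holds
              # every copy of the weights, and inference does one forward pass per
              # retained sample, which is why subsampling the chain helps.
              import torch

              net = torch.nn.Sequential(torch.nn.Linear(30, 64), torch.nn.ReLU(),
                                        torch.nn.Linear(64, 1))

              # stand-in for posterior samples of the weights from an MCMC chain
              chain = [{k: v + 0.01 * torch.randn_like(v)
                        for k, v in net.state_dict().items()}
                       for _ in range(1000)]

              x = torch.randn(256, 30)
              preds = []
              with torch.no_grad():
                  for sample in chain:       # subsampling shortens this loop
                      net.load_state_dict(sample)
                      preds.append(net(x))
              preds = torch.stack(preds)                 # (num_samples, batch, 1)
              mean, std = preds.mean(0), preds.std(0)    # predictive mean and spread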

        • rsfern 4 years ago

          The main bottleneck in GP models is the inversion of an NxN covariance matrix, so training with the most straightforward algorithm has cubic time complexity (and quadratic memory complexity). 10k instances is roughly what I’ve seen as the limit of tractability.

          The input dimensionality doesn’t necessarily matter since it’s a kernel method, but if you have many features and want to do feature selection or optimize kernel parameters, things can really stack up.

          There are scalable approximate inference algorithms, and pretty good library support (gpflow, gpytorch, etc), but it seems like they are not widely known, and there are definitely tradeoffs to consider among the different methods.
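
          For a rough feel of that bottleneck, a minimal sketch (arbitrary kernel and sizes) timing the Cholesky factorization that exact GP inference relies on:

              # Factorizing the NxN kernel matrix is the O(n^3) step: doubling N
              # multiplies the time by roughly 8 and the memory by 4.
              import time
              import numpy as np

              rng = np.random.default_rng(0)
              for n in (1000, 2000, 4000):
                  X = rng.normal(size=(n, 10))
                  # squared-exponential kernel matrix with jitter on the diagonal
                  sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
                  K = np.exp(-0.5 * sq)
                  K[np.diag_indices_from(K)] += 1e-6
                  t0 = time.perf_counter()
                  np.linalg.cholesky(K)
                  print(n, f"{time.perf_counter() - t0:.2f}s")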

  • math_dandy 4 years ago

    Uncertainty estimates in traditional parametric statistics are facilitated by strong assumptions on the distribution of the data being analyzed.

    In traditional nonparametric statistics, uncertainty estimates are obtained by a process called bootstrapping. But there's a trade-off (there's no free lunch!): if you want to eschew strong distributional hypotheses, you need to pay for it with more data and more compute. The "more compute" typically involves fitting variants of the model in question to many subsets of the original dataset. In deep learning applications, where each fit of the model is extremely expensive, this is impractical.
