Statistical tests are complicated because their inventors did not have fast computers


Students of frequentist statistics typically learn about a few statistical tests: their names (usually of the people who invented them), and some idea of what kind of data they apply to. For example:

  • Student’s t-test applies to continuous measurements and can check whether the average value of some measurement is different in two populations. For example, whether on average sheep from Wales or New Zealand have different heights.

  • The Kolmogorov-Smirnov test checks whether a set of samples comes from a known probability distribution, e.g. whether the samples are Gaussian. (Or, whether two different sets of samples come from the same distribution.)

The technical conditions are sometimes difficult to understand quickly, so people often just learn specific use cases. For example, all I knew about the Mann-Whitney U-test is that you can use it to check whether your new algorithm or an existing algorithm is better on a dataset of test cases.1 When writing a paper proposing a new algorithm, you run both the new and the old algorithm, run the test, check whether p<0.05, and you’re done.

All this is to say, tests are complicated. Understanding when and how to apply each test is complicated. The assumptions are complicated, the formulas are complicated, and communicating the results in an understandable manner is complicated. If you dig into the papers describing these tests, you’ll find them full of jargon, long derivations, scary-looking probability distributions, proofs that the tests are valid, etc.

One would be tempted to think that it is therefore very difficult to make new tests: if your data don’t fit any named test well, you pick some aspect of them that does and just run that test.

However, I contend that in many cases, tests are complicated because the people that invented them didn’t have fast computers.2 The idea of a statistical test is very simple and, once you have Monte Carlo methods with huge sample sizes, you don’t need much more than that.

This is the CDF of Student’s t-distribution. William Gosset had to derive it and prove that the difference of sample means, divided by its standard error, followed it. Looks pretty hard!

\(\text{T-CDF}(t) = \frac{1}{2} + t\,\Gamma\!\left(\frac{\nu+1}{2}\right) \times \frac{{}_2F_1\!\left(\frac{1}{2}, \frac{\nu+1}{2}; \frac{3}{2}; -\frac{t^{2}}{\nu}\right)}{\sqrt{\pi \nu}\,\Gamma\!\left(\frac{\nu}{2}\right)}\)

The ingredients you need to craft a statistical test about some phenomenon are:

  1. A null hypothesis about what is going on, that you will attempt to refute.

  2. A measurable quantity that behaves differently under the null and alternative hypotheses.

  3. A version of the null hypothesis that is concrete enough to run as a computer program.

To actually run the test, you:

  1. Measure the quantity from your real data, and obtain some real measurement.

  2. Simulate taking the same measurement from the null hypothesis many times, and obtain a simulated measurement for each simulation.

  3. Apply the definition of p-value: the proportion of simulated measurements that are more extreme than the real measurement is your p-value.
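As a rough sketch of this recipe in code (a hypothetical helper, not taken from the article’s code; the statistic is assumed to be defined so that larger values mean "more extreme"):

```python
import numpy as np

def monte_carlo_p_value(real_statistic, simulate_null_statistic,
                        n_sims=100_000, rng=None):
    """Estimate a p-value as the fraction of null simulations whose
    statistic is at least as extreme as the one measured on real data."""
    rng = np.random.default_rng() if rng is None else rng
    simulated = np.array([simulate_null_statistic(rng) for _ in range(n_sims)])
    return np.mean(simulated >= real_statistic)
```

The two arguments are exactly ingredients 2 and 3: the quantity measured on the real data, and a program that simulates that same quantity under the null.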

That’s it! With the magic of Monte Carlo methods and a little CPU time, we have computed an approximate p-value for your test. It’s even possible to convert this approximate p-value to an exact one using a little slack and Chernoff-Hoeffding bounds (see appendix): for a fixed amount of slack, the probability that the bound fails decreases exponentially in the number of simulations.

We want to know whether Welsh and Kiwi sheep have the same average height at the shoulder.

Let’s collect our ingredients:

  1. The null hypothesis is that Height(Wales) = Height(New Zealand).

  2. We will measure the average height of Welsh and Kiwi sheep and subtract them. That’s our quantity, which is large if the heights are different and close to 0 if they’re the same.

  3. To simulate the null, we assume sheep height is normally distributed with standard deviation S(W) in Wales and S(NZ) in New Zealand3. For simplicity we assume that both are equal to 20cm.

We collect the height of N=30 sheep from each country. We measure an average of 77.2cm for Wales and 82.1cm for New Zealand, for a difference of 4.9cm.

Then, we write a computer program that simulates measuring sheep from the null hypothesis. We draw 30 random normal variables with the same mean for each of the two countries. We average the Kiwi and Welsh sheep separately and then subtract the results, as in the real experiment. We do this 1 million times and count how often the difference was larger in magnitude than 4.9cm. This happened 34.3% of the time, so our p-value is 0.343 and we cannot reject the null.
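The full code is linked below; a minimal NumPy sketch of the simulation just described (not the article’s exact program; the seed and the common null mean are arbitrary choices, since only the difference matters) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 30                        # sheep measured per country
SIGMA = 20.0                  # assumed standard deviation of sheep height, cm
NULL_MEAN = 80.0              # any common mean works under the null
OBSERVED_DIFF = 82.1 - 77.2   # 4.9 cm

n_sims = 1_000_000
# Simulate the whole experiment n_sims times under the null hypothesis.
wales_means = rng.normal(NULL_MEAN, SIGMA, size=(n_sims, N)).mean(axis=1)
nz_means = rng.normal(NULL_MEAN, SIGMA, size=(n_sims, N)).mean(axis=1)
sim_diffs = nz_means - wales_means

# Fraction of simulations at least as extreme (in magnitude) as the real
# 4.9 cm difference; this comes out near 0.343.
p_value = np.mean(np.abs(sim_diffs) >= OBSERVED_DIFF)
print(p_value)
```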

We created a rough equivalent to Student’s t-test using computational brute force. We did not have to invent the t-distribution, nor prove that the difference of averages divided by the standard error would follow it. We just wrote a short NumPy program.

This really simple test would break if the real standard deviations of sheep height were not 20cm. We could deal with this by using a grid of variances and picking the ‘least convenient’ one every time (the one with highest p-value), but this could be brittle if the grid is too coarse.

Through experimentation, we might find that dividing the difference of means by the observed standard deviation makes the test more robust or powerful. (We can computationally measure how good a test is by computing its false acceptance and false rejection rates over various alternative hypotheses, including ones that break the assumptions of the test, as sketched below.)
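Here is a hedged sketch of that kind of evaluation (the particular alternative, the wrong-sigma scenario, and all names are made up for illustration): it estimates the false rejection (Type I) rate, the power under one alternative, and the Type I rate when the assumed standard deviation is wrong.

```python
import numpy as np

rng = np.random.default_rng(1)
N, SIGMA, ALPHA = 30, 20.0, 0.05
n_null_sims, n_trials = 100_000, 2_000

# Null distribution of our statistic (absolute difference of sample means),
# simulated under the assumption that the true standard deviation is SIGMA.
null_diffs = np.abs(
    rng.normal(0.0, SIGMA, (n_null_sims, N)).mean(axis=1)
    - rng.normal(0.0, SIGMA, (n_null_sims, N)).mean(axis=1)
)

def p_value(observed_abs_diff):
    return np.mean(null_diffs >= observed_abs_diff)

def rejects(true_shift, true_sigma):
    """Run one simulated experiment and report whether the test rejects."""
    a = rng.normal(0.0, true_sigma, N).mean()
    b = rng.normal(true_shift, true_sigma, N).mean()
    return p_value(abs(b - a)) < ALPHA

# False rejection (Type I) rate when the null really holds:
type_i = np.mean([rejects(0.0, SIGMA) for _ in range(n_trials)])
# Power under one illustrative alternative (a true 10 cm difference):
power = np.mean([rejects(10.0, SIGMA) for _ in range(n_trials)])
# Type I rate when the test's assumed SIGMA is wrong (true sd is 35 cm):
type_i_bad_sigma = np.mean([rejects(0.0, 35.0) for _ in range(n_trials)])
print(type_i, power, type_i_bad_sigma)
```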

Many things become easy if you have computational power. The bitter lesson applies to statistics as well as AI.

Code for this article is available here.

Appendix: Our Monte Carlo estimator of the p-value is the average of many Bernoulli variables X_i. Because we’re worried about the estimated p-value being falsely too small, we’d like a guarantee that, with high probability, the mean does not fall below the real Bernoulli parameter by more than some safety margin:

\(\frac{1}{N}\sum_{i=1}^N X_i \ge \mu - \epsilon\)

Fortunately the Chernoff-Hoeffding theorem tells us the following:

\(P\left(\frac{1}{N}\sum_{i=1}^N X_i \le \mu - \epsilon\right) \le e^{-2\epsilon^2 \cdot N}\)

The bound is a constant that depends only on epsilon, raised to the power of N, so for a fixed epsilon it shrinks exponentially as we run more simulations.

How do we incorporate this into the test? The significance level is the maximum allowed Type-I (false positive) error. Suppose we target a total Type-I error probability of alpha. Then we can reject the null only when the average of the X_i is below alpha - epsilon - (probability of bound failure), that is

\(\text{threshold} = \alpha - f(\epsilon, N) = \alpha - \epsilon - e^{-2\epsilon^2 \cdot N}\)

By increasing N and tuning epsilon we make this threshold easier to achieve. The optimal epsilon for a particular N can be found with a grid search; its optimal value appears to decrease as N increases.
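A sketch of that grid search, assuming f(ε, N) = ε + e^{−2ε²N} as defined above (not the article’s exact program), should reproduce numbers close to the table below:

```python
import numpy as np

def total_slack(eps, n):
    # Safety margin plus the Hoeffding bound-failure probability.
    return eps + np.exp(-2.0 * eps**2 * n)

eps_grid = np.logspace(-6, 0, 200_000)   # candidate epsilon values
for k in range(1, 9):
    n = 10**k
    best = eps_grid[np.argmin(total_slack(eps_grid, n))]
    print(f"N=10^{k}  eps={best:.3e}  f(eps)={total_slack(best, n):.6f}")
```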

N=10^1  eps=3.667e-01  f(eps)=0.434623
N=10^2  eps=1.421e-01  f(eps)=0.159724
N=10^3  eps=5.165e-02  f(eps)=0.056468
N=10^4  eps=1.814e-02  f(eps)=0.019526
N=10^5  eps=6.260e-03  f(eps)=0.006655
N=10^6  eps=2.126e-03  f(eps)=0.002245
N=10^7  eps=7.161e-04  f(eps)=0.000751
N=10^8  eps=2.393e-04  f(eps)=0.000250
