Is “nearly significant” ridiculous?

Graphic: Parasitoid emergence from aphids on peppers, as a function of soil fertilization. Analysis courtesy of Chandra Moffat (but data revisualized for clarity).

“Every time you say ‘trending towards significance’, a statistician somewhere trips and falls down.” This little joke came to me via Twitter last month. I won’t say who tweeted it, but they aren’t alone: similar swipes are very common. I’ve seen them from reviewers of papers, audiences of conference talks, faculty colleagues in lab meetings, and many others. The butt of the joke is usually someone who executes a statistical test, finds a P value slightly greater than 0.05, and has the temerity to say something about the trend anyway. Sometimes the related sin is declaring a P value much smaller than 0.05 “highly significant”. Either way, it’s a sin of committing statistics with nuance.

Why do people think the joke is funny? Because of course we all know that a result can’t be “nearly significant”, and a “trend toward significance” (as for Experiment 2 in the graph above) isn’t evidence of anything except the statistical ignorance of the person who mentions it. We set a significance criterion (often α = 0.05), we conduct our test, and whether or not our calculated P value reaches that criterion is all that we should interpret, or even notice. Right?

Well, not so fast. It’s true that a lot of people are taught this way – I was, and odds are good that you were too. But in fact, this is only one of two ways we can think about the magnitude of a P value (I’ll call these, a bit loosely, two “philosophies”, although this usage is likely to bother philosophers*). People who make jokes at the expense of “nearly significant” P values are adopting one philosophy, but seem unaware that there’s another one.

What are these two philosophies? I’ll call them the absolutist and continualist philosophies**. They differ in what kind of consideration they give the magnitude of a P value, and there’s a case for and against each one. Neither is ridiculous.

It’s the absolutist philosophy that’s described in my second paragraph, and that dominates our statistical teaching (at least in biology, and at least to undergraduates). The absolutist significance criterion is a line in the sand: your result stands on one side or the other, and that’s it. Once you’ve adopted this philosophy, it’s nonsensical to describe the degree of significance of a P value: one can’t be “nearly” or “barely” or “highly” significant, only significant or not. This has one very big advantage: it forces you to make a decision about the strength of evidence*** that you’re looking for before you have the results of an analysis in sight. As a result, there’s no temptation to be lenient with your pet hypothesis but stringent with something you’re skeptical of. In a sense, absolutist statisticians are like drivers who use cruise control: they want to make a careful decision on strictly rational grounds, and deliberately give that decision primacy over their instincts in any particular situation.

The continualist philosophy is quite different. A continualist would hold that the job of the P value is to express the strength of evidence against the null, and that this is a naturally continuous thing. It follows that drawing different conclusions from patterns with P = 0.0498 and P = 0.0507 is pretty silly (one might be forced to do so in the graph above, and no, I did not make that result up). Those two patterns are, after all, almost equally unlikely under the null. Not only that: P values, just like means and just like test statistics, are influenced by sampling uncertainty (even if we don’t conventionally put standard errors on them). So at the risk of getting too meta, it’s entirely likely that P = 0.0498 and P = 0.0507 are not significantly different from each other, leaving the absolutist philosophy nicely hoist by its own petard****. A continualist statistician would say that you can’t make statistical analysis a line in the sand, because if strength of evidence is continuous, our inferential conclusions should be too. (Bayesian statisticians and model selectionists would presumably agree, since continualist interpretation of the magnitude of Bayes factors and AIC values is conventional.)
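The claim that P values carry their own sampling uncertainty can be checked with a quick simulation. What follows is only a minimal sketch in Python, not part of the analysis in the graphic: it assumes a two-sided z-test with a known standard deviation of 1, and the true effect (0.4), sample size (25), and number of replicate experiments (2000) are hypothetical values chosen purely for illustration.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test P value for H0: mean = 0, assuming sigma is known."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def replicate_p_values(true_mean=0.4, n=25, reps=2000, seed=1):
    """Repeat the same (hypothetical) experiment many times; collect the P values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        two_sided_p([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]


p_values = sorted(replicate_p_values())

# Middle 80% of the P values obtained from identical experiments.
lower = p_values[int(0.10 * len(p_values))]
upper = p_values[int(0.90 * len(p_values))]

# Fraction of replicates that an absolutist would call "significant".
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

print(f"middle 80% of P values: {lower:.4f} to {upper:.4f}")
print(f"share with P < 0.05:    {share_significant:.2f}")
```

Under these assumed settings, the same true effect produces P values spread across a wide range, with roughly half of the replicates falling on each side of 0.05. Against that much replicate-to-replicate scatter, the gap between 0.0498 and 0.0507 is tiny.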

Note, by the way, that the continualist position is not that anything goes – that any P value is worth getting excited about. A remark that something is “trending towards significance” at P = 0.4 deserves all the scorn it’s likely to get. Rather, we can recognize and distinguish between results that provide weak evidence, moderate evidence, and strong evidence against the null. We don’t have to consider all propositions either supported or rejected; some we merely lean towards.

So there are two ways one can think about the interpretation of P-value magnitudes, and each has a sensible rationale behind it. It’s perfectly sensible to be a committed absolutist, and to defend that decision (and so I’m disagreeing with the blistering takedown by Hurlbert and Lombardi (2009)). It’s equally sensible to be a committed continualist, and to defend that decision. What’s not sensible is to think that people who hold one philosophy or the other are ridiculous.

So: back to the joke I started with. I hope it’s clear by now why it isn’t funny: it’s based on ignorance. People making the joke are unaware that continualist interpretations of P-value magnitudes are perfectly sensible, and have both a long history and distinguished proponents. If you don’t know this, and can’t explain the case for and against each philosophy, then you probably shouldn’t make jokes about either of them. Doing so achieves irony, not comedy. After all, what could be more ironic than displaying your own ignorance by poking fun at what you think is somebody else’s?

But I’ll end with a confession. I was taught absolutist inference, and once made those “nearly significant” jokes myself. I’ve learned, though, and I’ve stopped. Pass this post on to someone you love, and maybe they’ll stop too.

© Stephen Heard (sheard@unb.ca) November 16, 2015

Thanks to Deborah Mayo for comments on two early drafts. I expect she’ll still disagree with some of my treatment here, but nonetheless her comments greatly improved my post. See her excellent Error Statistics blog for much more on the logic and philosophy of inference.

In defence of the P-value
Why do we make statistics so hard for our students?
On degrees of evidence: Our literature is not a big pile of facts

*^Because, I think, philosophers would say these are not really “philosophies” in the sense of well-formed, logically built epistemological structures. A more accurate label might be “informal, but commonly adopted, opinions about how to proceed”. That’s a little unwieldy, though, so I’ll stick with “philosophies”.

**^The absolutist philosophy is often labelled “Neyman-Pearsonian”, and the continualist philosophy “Fisherian”, but if you read Fisher, Neyman, and Pearson carefully these turn out to be misleading and confusing names for them. For starters, Fisher clearly held a “Neyman-Pearsonian” position (at least early in his career) and Neyman and Pearson arguably held a “Fisherian” one (at least for science, although perhaps not for process control). Later in his career, Fisher became more “Fisherian”, although perhaps only because he was feuding with Neyman. Finally, Deborah Mayo argues that there’s little difference between the original positions of Fisher, Neyman, and Pearson, and so no case for naming different philosophies after them. The bottom line is that the names of famous dead statisticians seem to generate more heat than light, and this obscures the very real differences between contemporary scientists who teach and practice statistics using one philosophy or the other. UPDATE: Deborah Mayo has a new and relevant post with more on the history of Fisher and Neyman’s thinking, in part through the lens of Neyman’s first student, Erich Lehmann.

***^By “strength of evidence” I mean the degree of inconsistency of data with the null hypothesis. It’s important not to think of a P value as indicating (on its own) any degree of confirmation of, or consistency with, the null. It’s also important not to confuse strength of evidence for an effect with strength of an effect (the latter being measurable by a regression coefficient or similar statistic, not by a P value).
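The distinction between strength of evidence and strength of effect can be made concrete with a little arithmetic. This is a minimal sketch, again assuming a two-sided z-test with a known standard deviation of 1; the observed effect (0.3) and the three sample sizes are hypothetical numbers, not data from any study mentioned here.

```python
from statistics import NormalDist


def p_for_effect(effect, n, sigma=1.0):
    """Two-sided z-test P value for an observed mean `effect` from n observations."""
    z = effect / (sigma / n ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


observed_effect = 0.3  # hypothetical effect size, held fixed across sample sizes

# Same estimated effect, three different (hypothetical) sample sizes.
p_by_n = {n: p_for_effect(observed_effect, n) for n in (10, 40, 160)}

for n, p in p_by_n.items():
    print(f"n = {n:3d}   effect = {observed_effect}   P = {p:.4f}")
```

The estimated effect is identical in every row; only the P value changes, shrinking as the sample size grows. A small P value therefore says the data are hard to reconcile with the null, not that the effect is large.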

****^A petard is a mine (a small bomb) used to destroy fortifications. To be hoist by one’s own petard is to be blown up by one’s own weapon. The expression comes from Hamlet, in which the moody prince discovers that his schoolmates Rosencrantz and Guildenstern are carrying letters ordering his murder. Hamlet modifies the letters to order the murders of Rosencrantz and Guildenstern instead, and is (perhaps pardonably) quite proud of himself: “For ’tis the sport to have the engineer / Hoist with his own petard; and it shall go hard” (Hamlet 3:4 206-207, spelling modernized). So that’s petard. The similar word pedant is unrelated, and describes someone who thinks he should explain the meaning of petard in a blog post about statistics.