How to report an N=12 study?


Someone who goes by the handle Concerned Cow writes:

I am writing anonymously to ask whether you might be willing to look at a series of major statistical issues in a recently published Nature paper, “CD8⁺ T cell stemness precedes post-intervention control of HIV viremia,” that appears to contain a textbook unit-of-analysis error.

The central analyses of the manuscript treat epitope-specific T cell measurements as independent biological replicates, even though multiple responses come from the same individual (e.g., 23–26 “responses” from only 7 participants). This pseudoreplication inflates the effective sample size and makes non-significant participant-level differences appear highly significant.

When the data are aggregated properly at the participant level, the reported p-values collapse (for example, p = 0.007 becomes approximately p = 0.14–0.39 in Figure 2c, and removing a single outlier eliminates the claimed effects entirely). This pseudoreplication is evident in several panels (Figs. 1d, 2h, 2i, 4c–f). Moreover, a substantial selection bias in Figure 2h–j further compounds the problem.

The picture above reveals not only the pseudoreplication but also a significant imbalance in epitope-specific responses per participant (e.g., 23–26 responses from 7 individuals, with one individual contributing 5 responses), which substantially inflates the apparent sample size and drives the reported significance. When the data are aggregated properly at the participant level, the differences disappear.

Here is a concise one-page technical summary outlining the statistical issues and why the reported analyses cannot support the paper’s conclusions (which is also attached).

I don’t know nuthin bout CD8⁺ T cell stemness, but there was something about the name “Concerned Cow” that appealed to me. I have the unreasonable feeling that anyone who uses the handle of Concerned Cow will be a good person.

On the other hand, I have no good reason for that feeling, and, in any case, good people make scientific mistakes all the time–I know I do!–so we shouldn’t jump to any conclusions here.

At this point I could just give up, as I’m not planning to educate myself on the topic of HIV viremia, but the above issue seems purely statistical so I’ll take a look. I have some sympathy for people who see problems with published papers. I guess the Cow should also post these concerns on PubPeer.
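Before getting to that, here’s a minimal simulation sketch of the Cow’s unit-of-analysis point on fake data (the response counts, group split, and variance components below are all made up for illustration, not taken from the paper): when responses from the same participant share a participant-level effect, pooling them all as if independent inflates the false-positive rate far above the nominal 5%, whereas aggregating to one value per participant keeps it honest.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
responses_per = [5, 4, 4, 3, 3, 3, 4]   # imbalanced counts per participant (made up)
group = [0, 0, 0, 1, 1, 1, 1]           # hypothetical two-group split of the 7 people

n_sims = 5000
naive_fp = agg_fp = 0
for _ in range(n_sims):
    pooled = {0: [], 1: []}   # every response, treated as independent
    agg = {0: [], 1: []}      # one mean per participant
    for g, k in zip(group, responses_per):
        # participant-level effect dominates within-person noise; no true group difference
        y = rng.normal(0, 1) + rng.normal(0, 0.2, k)
        pooled[g].extend(y)
        agg[g].append(y.mean())
    naive_fp += stats.ttest_ind(pooled[1], pooled[0]).pvalue < 0.05
    agg_fp += stats.ttest_ind(agg[1], agg[0]).pvalue < 0.05

print("false-positive rate, responses pooled as replicates:", naive_fp / n_sims)
print("false-positive rate, aggregated per participant:", agg_fp / n_sims)

In this setup the pooled test rejects a true null several times more often than the nominal 5%, while the participant-level test stays close to 5%. That’s the inflated-significance problem the Cow is describing.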

I guess the main concern here is that of generalizing from only 12 people. In a medical study you can learn a lot from just one person, so it’s not like a low sample size is disqualifying.

So maybe the most helpful way to consider this sort of study is not to compare it to a hypothetical study of 1200 people (in which you would likely attain statistical significance even with an unquestionably legitimate analysis) but to a study of one or two people.

What do you get out of N=12 that you wouldn’t get out of N=1 or 2? Mostly, what you get is some sense of variability. The 12 people in your study will be different in various ways–different bodies, different ages, different stages of the disease, etc. If all 12 people show the same response, that’s telling you something. To the extent these responses vary, that’s telling you something too.

Can N=12 give you reliable information on population average behavior? Let’s do a quick calculation. Suppose you’re comparing two groups of 6 people each. If the standard deviation of your outcome variable within each group is sigma, then the sd of the difference between the two group means is sqrt(sigma^2/6 + sigma^2/6) = sigma/sqrt(3) = 0.58*sigma. So, if you want your comparison to have a signal-to-noise ratio of 2 (so that you’d have approximately 50% chance of attaining conventional statistical significance in a clean experiment), your underlying mean effect size would have to be at least 1.16*sigma. That would be a huge effect. (The simulation sketch after the list below checks these numbers.) Not that it can’t happen, just that it will only happen if:
(a) The underlying effect really is large.
(b) The outcome varies very little within each group, or, if it does vary, this variation is explained by pre-treatment predictors included in your model.
(c) The outcome is stable within each person and is measured precisely. Just about any amount of uncontrolled measurement error or variation over time will make it hard for you to attain that signal-to-noise ratio.
(d) The treatment or exposure is measured well. Misclassification or noise in the treatment variable will destroy any chance of keeping that high signal-to-noise ratio.
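Here’s a minimal simulation sketch checking the arithmetic above (standard normal outcomes and an assumed true effect of 1.16*sigma; nothing here comes from the paper):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, sigma, effect = 6, 1.0, 1.16
print("se of difference:", sigma * np.sqrt(2 / n_per_group))  # 0.58*sigma, as above

n_sims = 20000
hits = 0
for _ in range(n_sims):
    y0 = rng.normal(0.0, sigma, n_per_group)
    y1 = rng.normal(effect, sigma, n_per_group)
    hits += stats.ttest_ind(y1, y0).pvalue < 0.05
print("power at effect = 1.16*sigma:", hits / n_sims)

The power comes in slightly below the 50% of the back-of-the-envelope calculation because, with only 10 degrees of freedom, the significance cutoff is t = 2.23 rather than z = 1.96. Either way, the point stands: you need a huge underlying effect for a clean win at N=12.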

From this perspective, one of the key roles of an N=12 study is to identify the sources of variation and error in your experiment, so you can figure out how to control them–or, where that can’t be done, how to adjust for them.

To paraphrase the famous saying:
God grant me the serenity to adjust for the things which cannot be controlled; The courage to control the things which can be controlled; And the wisdom to know the difference.

Now, to get to the study at hand, the key statistical point is that, unless you’re pretty sure you’ve satisfied conditions (a), (b), (c), and (d) above, you shouldn’t be looking for statistical significance in your data anyway, for three reasons:
1. With so much variability, the fact that an observed difference does not reach statistical significance should not be taken to mean that the underlying effect is zero, or even that it is small.
2. Any differences that are statistically significant in the data are likely to be huge overestimates–that’s the well-known problem of type M errors in noisy studies (see the simulation sketch after this list).
3. If you’re under pressure to find statistical significance, there’s a motivation to cheat. That’s the Armstrong principle. I’m not trying to say or imply or insinuate that the authors of this particular paper were “cheating,” just that, by reporting significance levels in this small study, they’re (inadvertently) asking for trouble.
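To put a number on reason 2, here’s a minimal sketch with an assumed true effect of 0.2*sigma in the same two-groups-of-6 setup (again, made-up numbers, not the paper’s):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma, true_effect = 6, 1.0, 0.2
sig_estimates = []
for _ in range(50000):
    y0 = rng.normal(0.0, sigma, n)
    y1 = rng.normal(true_effect, sigma, n)
    if stats.ttest_ind(y1, y0).pvalue < 0.05:
        # keep the estimated effect only when it reaches significance
        sig_estimates.append(abs(y1.mean() - y0.mean()))

print("exaggeration ratio:", np.mean(sig_estimates) / true_effect)

Conditional on clearing the significance threshold, the estimate overstates the true effect roughly sevenfold in this setup. That’s the type M error: the filter of statistical significance selects the flukes.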

And, indeed, these issues arise here. In addition to reporting some statistically significant comparisons (the ones addressed by the Cow above), the paper also reports some claimed absences of association based on non-significance, which runs straight into reason 1 above.

What should the researchers have done?

Most of the paper under discussion is about the technical details of the experiment and the associated biological processes. There’s also lots of data, including at the individual patient level. I won’t try to evaluate any of this! My guess would be that the value of the paper is in all these data, and that these results could be useful in designing future studies–I just wouldn’t use statistical significance to do it, that’s all.

You may notice that I never got around to evaluating the particular issues raised by Concerned Cow. That’s because I wouldn’t expect to see statistical significance in this small-sample, high-variance setting, absent some selection on forking paths. On one hand, this means that I would not be surprised if the Cow’s concerns are legitimate; on the other hand, in some sense it doesn’t matter so much anyway, because even if the standard errors aren’t invalidated by clustering in the data, I’d still be concerned.

It’s possible that the authors of the published article will see this post. If they do, my recommendation to them is to think more about how to control and adjust for variation, and not to use statistical significance thresholds to classify their results.