P-hacked hypotheses are deceivingly robust (2016)
datacolada.org

The words "deceivingly" and "deceptively" have the same problem: there's a roughly 50/50 split in polar-opposite interpretations. https://grammarist.com/usage/deceptively/
In this case, does "deceivingly robust" mean they look robust but are fragile? Or does it instead mean they look fragile but are robust?
This isn't a criticism of you, soundsop. Rather, it's intended to keep pointing at how difficult it can be to concisely deliver a message.
---
edit: sounds like the correct interpretation of the title is "P-hacked hypotheses appear more robust than they are."
Huh. I'm skeptical about that article's dichotomy of "deceptively" meaning either "in appearance but not in reality" or "in reality but not in appearance". I think the most common usage of "deceptively X" is, more broadly, "X in a way that deceives you". That includes "X in reality but not in appearance", but it also includes "X in reality and in appearance, but deceiving you about something else".
For example, they used this quote as an example of "in appearance but not in reality":
> It’s no mystery why images of shocking, unremitting violence spring to mind when one hears the deceptively simple term, “D-Day.” [Life]
But the term "D-Day" is simple. It's deceptive because it might wrongly lead you to think the event it refers to is also simple.
Similarly, if something is "deceptively simple-looking", it really is simple-looking; it's just not simple.
Mate, I don't know; I'm just going with all the research and linguistic warnings I've read on the word.
https://languagelog.ldc.upenn.edu/nll/?p=3500
https://brians.wsu.edu/2016/05/25/deceptively/
https://www.academia.edu/37488247/The_Deceptively_Simple_Pro...
Shoot, even Oxford gives exactly opposite definitions of the word.
https://www.oxfordlearnersdictionaries.com/us/definition/eng...
---
I mean, even when I saw the title of the thread, I felt obligated to click (clickbait, I guess?) because I could have interpreted it as either a warning about p-hacking or an endorsement of the practice. In fact, at first glance, I read the title as "P-hacked hypotheses appear less robust than they actually are."
I think that the correct interpretation is: Statistical robustness tests do not address the problem of p-hacked hypotheses.
You'd be right, and that's essentially a sentence from about halfway down the article. And if you do perform robustness tests on a p-hacked hypothesis, you might be convinced that it's more robust than it actually is.
Eh, I don't think it's difficult so much as a situation where the writer did not weed out ambiguity for some reason. The fix is simply to use different words, especially since there's nothing particularly technical about this to constrain word choice.
The problem in the title is not the word "deceivingly" but the word "are". It should say "appear" or "seem". The reason is that "robust" is commonly understood to mean the strong end of the weak-to-strong spectrum, not the spectrum as a whole. The only way the title works as written is if you use that other (wrong) meaning of "robust".
Basically, if you take a p-hacked hypothesis and attempt to use it predictively, it falls apart.
That's kinda ... useful, actually.
It feels like this is sort of the same issue as overfitting in ML. Attempts to use overfit ML models predictively often fail in hilarious ways.
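As a rough sketch of both points (not from the article; the sample sizes and number of candidate predictors are made up): screen a pile of pure-noise predictors for the one with the best in-sample p-value, then re-test that same "hypothesis" on fresh data. In-sample it often looks significant; out of sample the effect evaporates.

```python
# Sketch of p-hacking as overfitting, using only simulated noise:
# pick the predictor with the smallest in-sample p-value, then
# re-test that same predictor on a fresh sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 50                        # 100 "subjects", 50 candidate predictors
y = rng.normal(size=n)                # outcome is pure noise
X = rng.normal(size=(n, k))           # every predictor is pure noise too

# "p-hack": keep whichever predictor happens to have the smallest p-value
pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(k)]
best = int(np.argmin(pvals))
print(f"in-sample p-value for the 'best' predictor: {pvals[best]:.3f}")

# the pre-registered version: same predictor, fresh data -> the effect is gone
X_new = rng.normal(size=(n, k))
y_new = rng.normal(size=n)
p_new = stats.pearsonr(X_new[:, best], y_new)[1]
print(f"fresh-sample p-value for that predictor:    {p_new:.3f}")
```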
Yep, that's also how science works: it predicts the future based on a model. Quick trick to know whether something is a science or not: if it has the word "science" in its name, it's not. (E.g., social science.)
P-hacking is a fine way to winnow through ideas to see what might be interesting to follow up on. There will certainly be false positives, but the real positives will usually be in there, too, if there are any. Determining which is which takes more work, but you need guidance on where to apply that work.
To insist that p-hacking, by itself, implies pseudo-science is fetishism. There is no substitute for understanding what you are doing and why.
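A quick simulation of that winnowing idea (my own toy numbers, not anything from the article): screen a mix of real and null candidate effects at p < .05. Most of the real effects survive the screen, but so do roughly 5% of the nulls, which is why the screen only tells you where to spend the follow-up work.

```python
# Toy screening run: 20 real effects (Cohen's d = 0.5) and 180 nulls,
# each "studied" with a two-sample t-test at n = 50 per group and
# screened at p < .05. All parameters are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50

def passes_screen(true_effect):
    a = rng.normal(0.0,         1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    return stats.ttest_ind(a, b).pvalue < 0.05

real_hits = sum(passes_screen(0.5) for _ in range(20))
null_hits = sum(passes_screen(0.0) for _ in range(180))
print(f"real effects flagged by the screen: {real_hits}/20")
print(f"nulls flagged by the screen:        {null_hits}/180 (~5% expected)")
```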
> Direct replications, testing the same prediction in new studies, are often not feasible with observational data. In experimental psychology it is common to instead run conceptual replications, examining new hypotheses based on the same underlying theory. We should do more of this in non-experimental work. One big advantage is that with rich data sets we can often run conceptual replications on the same data.
I think actually relying on "conceptual replications" in practice is impossible. If the theory is only coincidentally supported by the data, the replication is also more likely to clear p < .05 by coincidence, in a way that is very difficult to analyze.
The author mentions that problem, but doesn't mention a bigger issue: If you think people are unlikely to publish replications using novel data sets, just imagine how impossibly unlikely it is for people to publish failed replications with the original data set! If you read a "replicated" finding of the same theory using the same data set, you can safely ignore it, because 19 other people probably tried other related "replications" and didn't get them to work.
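A toy demonstration of that first point, the dependence between "replications" run on the same data (my own construction, with an arbitrary correlation of 0.7 between the two outcome measures): conditional on the first test passing p < .05 by luck, a second test of the "same theory" on the same data passes far more often than the nominal 5%.

```python
# Two "conceptual replications" of a spurious effect on one data set:
# outcomes y1 and y2 are correlated, the predictor x is unrelated to both.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sims, rho = 100, 4000, 0.7
first = both = 0
for _ in range(sims):
    x  = rng.normal(size=n)
    y1 = rng.normal(size=n)
    y2 = rho * y1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    if stats.pearsonr(x, y1)[1] < 0.05:        # first test "works" by chance
        first += 1
        both  += stats.pearsonr(x, y2)[1] < 0.05
print(f"P(second test passes | first passed) ~ {both / first:.2f}, vs nominal 0.05")
```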
This problem is going to get more severe as available datasets get bigger and bigger. The more data you have to mine, the more likely you are to find something that looks like a signal but isn't.
At some point, social science must switch to tighter p-value cutoffs and corrections for multiple comparisons. These are the norm in particle physics, which dealt with and resolved exactly this problem back in the 1970s.
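Back-of-the-envelope version of that (not from the article; the counts are arbitrary): with m independent null hypotheses tested at alpha = .05 you expect about 0.05 * m spurious hits, while a Bonferroni-style cutoff of alpha / m, in the same spirit as physics' much tighter sigma thresholds, keeps the chance of even one false positive near alpha.

```python
# 1000 null hypotheses tested against the same noise outcome: at p < .05
# you expect ~50 spurious "findings"; a Bonferroni cutoff of .05 / 1000
# almost always yields zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m, n = 1000, 200
y = rng.normal(size=n)
X = rng.normal(size=(n, m))            # every predictor is noise

pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(m)])
print("hits at p < .05:          ", int((pvals < 0.05).sum()))
print("hits at Bonferroni .05/m: ", int((pvals < 0.05 / m).sum()))
```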