What's the next frontier for improving psychological research?

I was forged in the crucible of psychology's replication crisis.

I took my first psychology course in 2012, a year after a paper earnestly claimed evidence of extrasensory perception (ESP) and another paper not-so-earnestly demonstrated how to make people younger by playing them a Beatles tune.

I worked in my first psychology lab in 2013, a year after Daniel Kahneman penned a letter to social priming researchers in which he stated: 'I see a train wreck looming.'

I began my first full-time position in psychology in 2016, a year after the Open Science Collaboration found that less than half of 100 papers drawn from three top psychology journals successfully replicated.

My formative years in the field were shaped by watching psychology's skeletons dragged from the closet and paraded across the pages of the Guardian, the New York Times, and the New Yorker.

The skeletons go by many names. Data mining. Data dredging. Data fishing. Fishing for significance. Significance chasing. Undisclosed analytic flexibility. Or, the preferred shibboleth of the replication crisis: p-hacking.

In brief, these terms refer to the process of selecting or reporting statistical tests based on the desirability of the results they produce. To get published, psychologists are incentivised to find statistically significant results (p < .05), so find them they do – just not through the means our statistician forebears intended.

As a researcher forged during the replication turmoil, it was obvious how p-hacking could damage the scientific enterprise. Moreover, p-hacking seemed not only prevalent but sometimes hard to avoid, manifesting as an unconscious bias luring researchers through a garden of forking paths.

Was this bias toward 'statistical significance' something our field could overcome? And if p-hacking were curbed, what would that leave as the next frontier?

I've recently had the opportunity to explore these questions.

Evaluating newly published papers

Fast-forward a decade from my first full-time position in psychology. I now direct Transparent Replications, a project designed to celebrate high-quality psychological research and shift researchers' incentives toward replicable, reliable methods.

Our team has embarked on a new kind of replication project. We randomly select recently published articles from five of the most influential journals that publish psychology research: Science, Nature, Proceedings of the National Academy of Sciences, Psychological Science, and Journal of Personality and Social Psychology. We then complete direct replications of a pivotal study in the paper and evaluate the study on three dimensions. (Given resource constraints, we currently select only articles whose studies can be run online with an adult population.)

First, we assess transparency. Are the study's data, materials, and analysis code available? Was the study preregistered? Was the preregistration followed? Next, we assess replicability. Do our results match those reported in the original study? Finally, we assess clarity, a new evaluation criterion we created to assess the extent to which the paper's claims match what the studies actually demonstrate. Papers receive a score out of five stars on each of these three dimensions, and we write a report on each study we replicate. You can check out our reports here.

Unlike many replication projects, we target new papers. We see a few advantages to this approach. First, if a finding isn't reliable, we correct the scientific record quickly, and if it is reliable, we boost its credibility. Second, if psychologists know their work might be evaluated soon after publication, they have a greater incentive to follow good practices – ensuring transparency, checking for errors, appropriately calibrating claims, and so forth. Finally, targeting new papers reveals the current state of the field and lets us investigate a crucial question: what exactly is going right and what exactly is going wrong, right now?

We're in the early stages of this endeavour, but we already have some surprising findings.

So far, we've attempted to replicate 15 studies. How many of those do you think replicated?

For context, a large-scale replication project published in early April 2026 found that 49 per cent of the 58 psychology papers it targeted successfully replicated. So, if you guessed somewhere around 50 per cent, that would be justified.

Actually, we were pleasantly surprised to find that 12 of the 15 studies (80 per cent) fully or mostly replicated (of the 70 findings in these studies we tested, 84 per cent replicated). If the replicability rate of new articles in top psychology journals were, say, 50 per cent, our result (or something more extreme) would be unlikely to happen by chance. To be precise, it would only happen about 3.5 per cent of the time.
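For readers who want to check the arithmetic, here is a minimal sketch of one way to arrive at a figure like this. It assumes an exact binomial test with 15 independent studies and a true replication rate of 50 per cent; the article does not say which test was used, and the function and variable names below are illustrative only. Under those assumptions, the two-sided probability comes out at about 3.5 per cent, and the one-sided probability of 12 or more replications at about 1.8 per cent.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more successes (one-sided)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def two_sided_exact(k: int, n: int, p: float) -> float:
    """Two-sided exact p-value: total probability of every outcome
    that is no more likely than the observed one."""
    observed = binom_pmf(k, n, p)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        prob
        for i in range(n + 1)
        if (prob := binom_pmf(i, n, p)) <= observed * tolerance
    )


# Figures from the text; the 50 per cent rate is the hypothetical baseline.
n_studies = 15
n_replicated = 12
assumed_rate = 0.5

one_sided = upper_tail(n_replicated, n_studies, assumed_rate)
two_sided = two_sided_exact(n_replicated, n_studies, assumed_rate)

print(f"One-sided (12 or more of 15): {one_sided:.4f}")
print(f"Two-sided exact:              {two_sided:.4f}")
```

With only 15 studies, the result is sensitive to the assumed baseline: changing `assumed_rate` to 0.55, the median survey estimate reported later in the piece, gives a larger probability.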

But what surprised us even more than 12 of 15 studies replicating was that in only 1 of the 15 studies did we see signs of p-hacking.

What might explain the difference in replication rates between our project and the recently published paper mentioned above? Well, one major difference is that the studies evaluated in the recent paper were from 2009 through 2018 – that is, between 8 and 17 years old – whereas all but one of the papers we evaluated were from 2022 or later. So it's likely that many of the studies evaluated in the recent paper were conducted before discussions of p-hacking and other questionable research practices became widespread (even more so if we consider that the time between completing a study and publishing it can easily be a year or more). The studies we replicated were also drawn from a smaller and more prestigious sample of journals, which could have played a role.

We're continuing to run replications, so we'll have more precise estimates of replicability and p-hacking in the future. And it's important not to anchor too much on the specific numbers above given the modest number of completed replications. But we see these initial data on current replication rates in top journals as a good sign.

How surprising is this?

Is this news surprising, or is it what we should expect? After all, some have argued that our field has undergone a credibility revolution in response to the replication crisis.

To find out how surprising our replication results were, we surveyed more than 100 academic psychologists about their views on replicability (the methodological details can be found here).

Nearly two-thirds said they believe that, although the replication crisis is ongoing, substantial progress has been made.

Nevertheless, when we asked the psychologists to estimate the current replication rate of studies in top journals, the median response was 55 per cent. I don't know about you, but this didn't seem very optimistic to us. In fact, it didn't seem very optimistic to our respondents either: when asked what the replication rate would be if the field were in a healthy state, the median response was 75 per cent. In other words, there was a 20-percentage-point gap between the median estimates of where we are now and where we ought to be.

So, half of the psychologists we surveyed believed the replication rate would be 55 per cent or worse for papers in top journals. Yet, we found that 80 per cent of the studies we attempted to replicate from new papers in top journals did indeed replicate. It's too early for us to say that psychologists are collectively underestimating replicability, but we're hopeful.

Perhaps psychologists are getting their house in order. Preregistrations are becoming more common. Sample sizes have mushroomed. Scientific accelerators have been engineered. Maybe science is demonstrating, yet again, that it is self-correcting – at least, when we put in the work to correct it and follow scientific best practices.

There's more than one way to hack it

Here's the more concerning news. Yes, only one paper we replicated showed signs of p-hacking, but another form of 'hacking' was common; the founder of Transparent Replications, Spencer Greenberg, has called it Importance Hacking.

To understand Importance Hacking, it's useful to think about how scientists in today's publish-or-perish academic milieu might stave off the 'perish' part of that equation.

One route to publishing work in a prestigious journal is to make a novel and valuable contribution to the ever-growing, ever-more-complex scientific literature. In other words, conduct research that merits publication. That's what nearly all scientists are shooting for. But it's very hard to do. So let's take a moment to consider the other strategies for getting published.

A much easier route to publishing a study is to make it up: fabricate participants, duplicate preferred responses, transfer data from one experimental condition to another. Thankfully, fraud is rare (though not as rare as one would wish, as recent fraud scandals have illuminated). Most scientists wouldn't even consider committing fraud because it's highly unethical. And even those who are willing to act highly unethically may still avoid it due to the risk of getting caught.

Another route is to stumble upon a false positive. False positives are findings that are unlikely to materialise again if subjected to a high-quality replication. Remember the LK-99 kerfuffle from 2023, when a group of materials scientists thought they had discovered a room-temperature superconductor? That certainly would be a finding important and novel enough to publish in a top journal, but it was shown to be a false positive before it ascended through the pearly white gates of Nature's editorial process. There are multiple ways to get a false positive result, including p-hacking.

But there's yet another route to publication. One that involves neither false positives nor fraud. This publication strategy – Importance Hacking – entails making a real, replicable finding appear more valuable or worthy of publication than it really is. For example, a paper might frame its results as causal evidence when, in fact, the study design only offers correlational evidence. Or a paper might claim that its findings generalise to circumstances that were not actually tested in the study. Or a paper might omit messy or inconvenient details that would make the result appear less elegant. (For more examples, see our article introducing this concept.)

Any degree of Importance Hacking is bad because it cleaves a rift between the actual evidence and the impression readers come away with. Sometimes Importance Hacking even convinces journal editors and reviewers that a paper merits publication when it likely doesn't. We call such papers 'Importance-Hacked acceptances'. Had no Importance Hacking occurred, the paper would have been unlikely to be accepted by the journal, even though the finding is real and replicable.

Importance-Hacked acceptances are not a consequence of differences of opinion between researchers about what findings deserve to be published – a topic that reasonable people could easily disagree on. Rather, Importance-Hacked acceptances are specifically the instances in which the findings of a study just don't mean what the study purports. These are cases where the way that findings were discussed in the paper caused peer reviewers to have an inaccurate understanding of those findings, which led them to conclude that the findings were substantially more valuable than they were. Thus, Importance-Hacked acceptances are those that were only accepted because the reviewers or editor did not identify the discrepancy between the findings and the paper's narrative.

Now that we've established what Importance Hacking is, let's return to the worrisome news. Of our first 15 replications, we identified Importance Hacking issues in 9 of the studies. These issues varied in type and severity. For instance, one study's central claims were simply not supported by its results. Another study did not sufficiently acknowledge plausible alternative explanations that could render the results uninteresting.

To be clear, Importance Hacking, like p-hacking, need not be intentional. A researcher might be so smitten with an experimental paradigm that they overlook some of the tenuous assumptions it rests on. A researcher might not have adequately considered alternative explanations. Or they might misunderstand what a given statistical test actually does.

Imagine a spectrum. On one end, you have researchers fully, clearly, and accurately describing the strengths and weaknesses of their work. On the other end, you have researchers lying to make their work seem like the psychology equivalent of the General Theory of Relativity. Importance Hacking usually lurks in the dusky middle of that spectrum. The key feature is that it's a publication strategy – a way to push papers through peer review that, were it not for the Importance Hacking, may not have been accepted by that journal. Like p-hacking, it is heavily incentivised by publish-or-perish culture.

Importance Hacking can be in the short-term interest of multiple parties. The authors, of course, benefit from getting a paper published. Academic journals can benefit from publishing papers that appear more important, novel, and beautiful than they otherwise might. News outlets benefit from having flashier and more important-seeming findings to share with the public. Funders can benefit from media attention directed at the work they funded. But the problem is that the broader scientific ecosystem, and ultimately the public, loses out from papers whose findings don't match their claims.

Now, you might be thinking, 'Wait, wait, wait. Isn't this the whole reason we have peer review? Surely that addresses this problem?' Well, 9 of the 15 papers we replicated would beg to differ. While 15 papers is a small sample, our findings hint that Importance-Hacked papers are surviving the peer review gauntlet. Likely, peer review does catch many attempts at Importance Hacking. But the prevalence of Importance Hacking in published papers tells us that there are forms of Importance Hacking that peer review doesn't catch. This could be because the Importance Hacking methods are subtle and hard to detect, or because detecting them takes more work than reviewers put into the process. For example, in our work evaluating papers, we often discovered Importance Hacking only when we tried to recalculate the study's findings from scratch or to rebuild the study from its raw materials.

Goodhart's ghost

If this is happening at the scale we think it is, wouldn't academic psychologists be aware of it? In our survey of psychologists, we aimed to find out.

We told the psychologists about the four kinds of studies described above, presented as mutually exclusive, collectively exhaustive categories: studies meriting publication, fraudulent studies, false-positive studies, and Importance-Hacked studies that would replicate but don't merit publication. We offered definitions for each term, which was especially important because the respondents were hearing the term 'Importance Hacking' for the first time. We then asked them to estimate what percentage of papers in top psychology journals fall into each bucket. On average, they estimated a similar incidence of false positives (26 per cent) and Importance-Hacked acceptances (27 per cent).

We also asked the psychologists how severe a problem they thought p-hacking and Importance Hacking were. On a scale from 0 (not at all) to 4 (extremely severe), the psychologists' average rating was 1.71 for p-hacking, but – to our shock – it was 2.45 for Importance Hacking. In other words, psychologists rated Importance Hacking as a more severe problem than p-hacking.

If we (and the academic psychologists in our study) are right that Importance Hacking is a severe problem, what can be done about it?

Preregistration and registered reports are excellent defences against p-hacking. We don't know of defences of comparable calibre for Importance Hacking. But one approach we think could help is what we call the study diagram (which we developed for use in our replication reports). The study diagram shares details about the participants, the study procedures, the hypotheses, and the key findings in one tidy graphic. When done right, it provides only factual details and no interpretations. For example, it wouldn't say 'we measured participants' likelihood of exercising.' Instead, it would say 'participants answered the question "Do you intend to go to the gym tomorrow?" (Yes/No).' A study diagram strips out interpretive language that can obfuscate the connection between who the participants were, what they actually did, what was hypothesised, and what the statistical tests revealed. It lays the study bare, making it easier to catch invalid claims.

Another useful strategy that we developed for our replications is what we call the simplest valid analysis. A highly complex statistical analysis may be the most precise test of a given hypothesis, but complex statistical analyses are more likely to be misinterpreted (or simply uninterpreted) by reviewers unfamiliar with the procedure. Complex analyses often embed more assumptions and make it harder for reviewers to understand exactly what happened under the hood. We see no problem with a paper including complex analyses, but we recommend that the paper include the simplest valid analysis (when such a procedure exists) alongside the complex one. We hope this approach allows reviewers (and even authors themselves!) to more easily identify limitations, assumptions, and errors. Indeed, when we've applied this approach, we've discovered Importance Hacking that we otherwise would have missed.

But, perhaps more than anything, we need an overhaul of our scientific culture. Too often, our field approaches the publication process as a game that is won or lost. 'Winners' are lauded with praise, prizes, and positions of power. 'Losers', well, lose in the competition to secure a permanent position in the field. So, unsurprisingly, people develop sophisticated strategies, like Importance Hacking, to win the publication game.

Goodhart's Law says that when a measure becomes the target, it loses its value as a measure. Importance Hacking is Goodhart's ghost incarnate, and maybe we should indeed expect Importance Hacking to rise as p-hacking falls. If a pipe is leaking water from multiple holes, and you plug one hole, the water leaks faster from the others.

What now?

I firmly believe that most of us became scientists to be truth-seekers, not status-seekers. And I think that starts by treating the empirical research article as a transmission of truth. Not a pitch, job interview, or sensational story. Research should be our most valiant attempt at saying something true about the world that's worth saying.

Current incentives and norms do not always favour this approach. And I know my own research papers have fallen short of this ideal. But did we really choose to dedicate our lives to science just to blindly follow incentives and norms that we know, deep down, to be antithetical to the mission of science?

Our project's early findings give us hope. We see reason for optimism around p-hacking and replicability in our field's top journals (hopefully to be confirmed with more data). But we also see what we believe to be the next frontier – Importance Hacking.

As I said, I came of age during psychology's replication crisis. I've seen our skeletons dragged from the closet, but I've also seen our house deconstructed and rebuilt. And that's what was most formative for me as a junior scientist: witnessing the kind of change we can manifest when we act on our values. Let's renew those efforts.

Isaac Handley-Miner, PhD. Director, Transparent Replications
With thanks to Spencer Greenberg, PhD