Living the metascience dream (or nightmare) with AI for science


Recently I wrote about multiverse analysis, which takes the idea of sensitivity analysis to the extreme by multiplexing over every reasonable decision you could have made in analyzing your data. If anything “feels like rigor,” it’s a multiverse, with its fancy visualizations of every counterfactual.

Some would consider a world in which every paper includes a multiverse an obvious improvement over the current status quo. It could prevent many fragile results from making it through to publication, and “contaminating” the published record by getting cited and regarded as true.

But should we be careful what we wish for? As someone with a longtime interest in metascience and reform, I can’t help but notice that it’s now trivial to turn any paper into a multiverse analysis. Just ask Claude Code to create a notebook replicating the analysis in a paper, then add whatever variations you want. Or let it plan the multiverse too by suggesting some common ablations. What does this newfound ease in testing for robustness mean for science?
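To make that concrete, here is a minimal sketch (in Python, with entirely hypothetical file and column names, not the notebook any agent actually produced) of what such a multiverse boils down to: enumerate the defensible analytic decisions, fit a model in each universe, and collect the estimates.

```python
import itertools
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical tidy data with columns: outcome, treatment, rt, age, expertise
df = pd.read_csv("study_data.csv")

# Each "universe" is one combination of defensible analytic decisions.
exclusion_rules = {
    "keep_all": lambda d: d,
    "drop_fast_responses": lambda d: d[d["rt"] > 0.3],
}
covariate_sets = ["", " + age", " + age + expertise"]

rows = []
for (rule_name, exclude), covs in itertools.product(exclusion_rules.items(), covariate_sets):
    fit = smf.ols("outcome ~ treatment" + covs, data=exclude(df)).fit()
    rows.append({
        "exclusion": rule_name,
        "covariates": covs.strip(" +") or "none",
        "estimate": fit.params["treatment"],
        "p": fit.pvalues["treatment"],
    })

multiverse = pd.DataFrame(rows)
print(multiverse.sort_values("estimate"))  # one row per universe: a specification curve in table form
```

Every row is one counterfactual analysis; the hard part, as we'll see below, is deciding what the table of rows means.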

There are a few corollaries of this reality. One is that it’s trivial to test reproducibility. No human reviewer may have the patience to deal with your messy codebase, but automated evaluation makes it much easier to detect the failures (at least where the data can be made public) and prevent them from moving forward. This can require some curation of fallback plans for common failure cases (see, e.g., this recent paper), but once we endow agents with these skills, we can apply them at previously unattainable scale. It’s a short hop from there to multiverse writ large.
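As a rough illustration of what those fallback plans might look like (the script name, requirements file, and container image below are placeholders I made up, not anything from the paper linked above):

```python
import subprocess

def run(cmd: list[str]) -> bool:
    """Run a command and report whether it exited cleanly."""
    return subprocess.run(cmd).returncode == 0

def try_reproduce() -> bool:
    # First attempt: run the paper's analysis script as-is.
    if run(["python", "analysis.py"]):
        return True
    # Fallback 1: recreate the pinned environment, then retry.
    if run(["pip", "install", "-r", "requirements.txt"]) and run(["python", "analysis.py"]):
        return True
    # Fallback 2: retry inside a container image prepared for this failure mode.
    return run(["docker", "run", "--rm", "-v", ".:/work", "repro-image",
                "python", "/work/analysis.py"])

print("reproduced" if try_reproduce() else "needs human attention")
```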

Replicability testing at scale–where results are tested on new data sets to see if they hold–is also not far off. For example, you can have an agent synthesize data with the same structure but variation along some dimension. Before long we may also see agents being given permission to collect new data, for example by running online human-subjects studies on Prolific or some other platform. The years of work it used to take to run a large-scale replication project may now translate to one grad student’s summer project.
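One cheap version of the synthesis step, sketched below with a hypothetical dataset and a deliberately simple model, is a parametric bootstrap: fit a model to the original data, simulate new datasets with the same design but a rescaled effect, and check whether the original analysis still detects it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.read_csv("study_data.csv")                   # hypothetical original data
fit = smf.ols("outcome ~ treatment", data=df).fit()  # model of the original structure

def synthesize(effect_scale: float) -> pd.DataFrame:
    """Simulate a dataset with the same design but a rescaled treatment effect."""
    sim = df[["treatment"]].copy()
    beta = fit.params.copy()
    beta["treatment"] *= effect_scale
    mu = beta["Intercept"] + beta["treatment"] * sim["treatment"]
    sim["outcome"] = mu + rng.normal(0, np.sqrt(fit.scale), size=len(sim))
    return sim

# Does the original analysis still "find" the effect as it shrinks?
for scale in [1.0, 0.5, 0.25, 0.0]:
    refit = smf.ols("outcome ~ treatment", data=synthesize(scale)).fit()
    print(scale, round(refit.pvalues["treatment"], 3))
```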

In short, we should expect the level of scrutiny on papers to change dramatically. Most reviewers are not incentivized to look carefully at materials the authors submit beyond the paper text itself. Even when incentivized, human reviewers have limited time and attention. But AI reviewers can scale evaluation of reproducibility, consistency between claims and evidence, and robustness to slight perturbations of inputs or methods. On the trajectory that runs from the replication-crisis revelations through the pushback against narrowly constrained subject pools (“WEIRD” science), we appear to be on the verge of living the open science dream.

The question is, what kind of dream will it be? Certainly from a surface reading of science reform and the so-called replication crisis, it seems like a win-win for science. Authors (whether human-AI combos or fully AI) get the assurance that the work they are producing is robust. The scientific record benefits by converging upon what really stands up to scrutiny, not what sells as a story. There is now the potential for improvements to status quo science that would have been hard to conceive of five years ago.

On the other hand, if you think the open science movement has largely overindexed on simple techniques that misconstrue what scientific progress means, and empowered rigor signaling over judgment, welcome to your nightmare. I suspect we will see some perversions of real progress when “science as checklist” becomes policy.

Much will depend on how carefully we steer the kinds of checks we implement, and what we do about the results. But a few things seem likely regardless:

Lots of issues that humans missed will be found by AI. In an optimistic view where automated review uses tools on par with the best available today, many (most?) of these issues will be real problems. Even if a human with considerable expertise in the field went through them one by one, I doubt they would disagree that often. Just look at refine.ink - I started using it shortly after it was released and it immediately became part of my pre-submission routine. I tend to be hyper-careful about what I say in papers, and it can still find subtle issues in notation and argumentation.

The best way to pass automated evals will be to plan and build in robustness tests throughout the research process. This is the obvious path to acceptance rates recovering. As evaluation gets easier, the demand for it will increase, so that evaluation becomes continuous or “always on.” Checks can be run at every commit, every new experiment, every revised claim.

All of this is already happening to some extent. At least in CS, it now feels risky to send papers out if they are still throwing lots of issues when run through AI evaluations–not just the code, but the entire paper. I tell my students to check their work periodically to catch major issues early.

It will become harder to pass off fragile findings, regardless of how compelling the story may be. “Such-and-such conference is no longer taking risks” is something people complained about before AI, as academic communities have matured and acceptance rates have dropped. Widespread AI checks could take that sentiment to a new level.

The nature of the new status quo that selecting for “safer” papers creates will depend on what kinds of robustness we prioritize. It will depend on what happens to fuzzier criteria like intellectual risk-taking, or whether a paper opens up new ways of thinking versus purports to resolve uncertainty. What’s the value of telling a good story, one that inspires the (human) imagination, versus presenting a robust (if boring) empirical result or incremental advancement to methods? I expect AI to incentivize the latter unless we explicitly intervene to reward fuzzier, human-like aspects of taste.

How big a shift robustness-forward, heavily AI-driven science brings is likely to look different depending on what area you’re in and what constitutes a novel contribution. In fields where incremental change is the norm, public datasets are commonly used, and combining tricks from prior work has a relatively high probability of success (e.g., machine learning), the change may seem more tolerable. It’s less clear to me how fields like social psychology or sociology, where the story and how it cuts against our expectations are paramount, will respond.

The question that ideas like multiverse analysis raise—“What exactly do I conclude on the basis of this glorified robustness analysis?”—is one we’ll need to have a stance on. This is where lots of nuance could be lost. For example, in writing this post, I gave Claude Code one of my papers, a study that compared human image labeling performance when the participants had access to different presentations of prediction uncertainty. I pointed it to the data files, and prompted it to reproduce the results. I also asked it to extend the results by doing a multiverse to vary key analytic decisions. Since I wanted to see what it associated with terms like “sensitivity analysis” and “multiverse,” I gave it very little specific advice.

We had a little back and forth–for example, initially it looked only at aggregate effects, rather than distinguishing by treatment arms, though our hypotheses in the paper were specific to data conditions. But overall the process was much, much faster and easier than if I had to do it myself, or ask a grad student.

However—in the end, it summarized the results in exactly the way that theorists warn not to: reporting the frequency of significant results across a set of universes that varied the covariate structure of the regression model. The problem is that these specifications are not draws from a probability distribution. There is only one data-generating process. Treating model variants as if they form a random sample turns analytic flexibility into an uninterpretable frequency. This is why robustness testing at scale does not guarantee insight: we still have to figure out how to interpret the results, and that is hard.
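Concretely, the summary amounted to something like the following (the numbers are made up for illustration, not the actual output):

```python
import pandas as pd

# One row per specification (covariate structure), with the p-value for the
# effect of interest. Values are made up for illustration.
results = pd.DataFrame({
    "covariates": ["none", "age", "age + expertise", "age + trial", "full"],
    "p":          [0.03,   0.04,  0.07,              0.11,          0.02],
})

# The tempting summary: "60% of universes were significant."
share_significant = (results["p"] < 0.05).mean()
print(f"{share_significant:.0%} of specifications reached p < .05")

# But the specifications are enumerated choices, not draws from a distribution
# over data-generating processes, so this share has no frequency interpretation:
# it depends entirely on which variants happened to be included.
```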

Given that many of the people working on automated evaluation and AI for science are not metascientists, we could see a lot of nonsensical aggregations of results. Consider, for example, the possibility of synthesizing automated meta-analyses of different empirical literatures. If you think meta-analyses are questionable because it’s not clear what the “average effect” estimated over a heterogeneous set of studies even represents, or take issue with the conventional throw-it-all-into-a-simple-random-effects-model approach, brace yourself for a flood of highly precise, poorly defined aggregated estimates.
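The mechanics are trivially automatable, which is part of why the flood will be easy to produce. Below is a textbook DerSimonian-Laird random-effects pooling run on made-up study estimates; nothing in it asks whether the pooled number means anything.

```python
import numpy as np

# Hypothetical per-study effect sizes and within-study variances.
y = np.array([0.42, 0.10, 0.55, -0.05, 0.31])   # study estimates
v = np.array([0.04, 0.02, 0.09,  0.03, 0.05])   # their squared standard errors

# DerSimonian-Laird random-effects meta-analysis, the conventional "throw it all in" model.
w = 1 / v
fixed = np.sum(w * y) / np.sum(w)                # fixed-effect pooled estimate
Q = np.sum(w * (y - fixed) ** 2)                 # heterogeneity statistic
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)          # between-study variance estimate

w_re = 1 / (v + tau2)
pooled = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
print(f"pooled effect = {pooled:.2f} ± {1.96 * se:.2f}")
# The arithmetic is easy to automate; whether this "average effect" means anything
# for a heterogeneous set of studies is the part that isn't.
```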

On the other hand, one of the frustrating things about metascience has been that it’s hard to predict how corrective measures will impact science as a whole. Many of the social sciences are more conservative than CS (and haven’t faced the urgency of submission numbers that CS venues are facing), so policy change to publication practices is slow. The louder and more persuasive reformers have profited from this–they can tell a convincing story about how preregistration or other open science methods will save science and win lots of support, without having to prove it. As AI speeds up paper production and makes it easy to implement new evaluation procedures relatively uniformly at scale, our ability to get quick feedback on how the published record shifts under various interventions will grow too. Metascience stands to become more empirical.

We should expect more and more reliance on AI to figure out ways around the errors. Ultimately we get “self-play,” as John Horton calls it, where different sets of agents propose, implement, stress test, and critique.

This is where things could get fascinating. We know that the inductive biases of large language models, when left to dialogue, can lead them to converge on strange equilibria, like the Claude bliss attractor effect. What happens when the selection pressure is not for fluency, but for reproducibility, transparency, and sensitivity to perturbations? What kind of science emerges when agents are rewarded not for sounding reasonable, but for surviving stress tests? What’s the feel of an equilibrium of low epistemic risk?

It’s probably not going to look much like the science we’ve become accustomed to. Will it be “feels like rigor” on steroids, or will we eventually get genuine innovation and robustness?

Of course, depending on what types of checks become policy, things could get obviously stupid, like when those (human or AI) overseeing algorithmic review decide to filter on null hypothesis significance tests, or assume that requiring claims to be consistent with evidence and transparency around replication materials is sufficient for good science. In reality, honesty and transparency are not enough, as Gelman likes to remind us.

Some would argue that the open science movement, despite its noble intentions, has failed to appreciate the nuance and personal agency that science depends on, instead fixating on easy-to-implement tricks and enabling the egos of those willing to loudly prescribe. All this could get worse if we’re not mindful of the traps that metascientists have already pointed out. It’s a good time to get familiar with the complexity of narratives around replication crisis reform — the moral of the story is that we don’t really understand how to prescribe good science.

Many have argued that science must ultimately remain largely human driven. If we aren’t producing knowledge for ourselves, we won’t be doing science as we know it. Chenhao Tan and Haokun Liu aptly invoke Tukey’s advice: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

The risk is not that AI will make science less rigorous. It’s that we will confuse what can be stress-tested with what is worth knowing. From this perspective, AI may enable what the late Brian Cantwell Smith called reckoning (as summarized here by Melanie Mitchell). But it won’t ever suffice for judgment, “a form of dispassionate deliberative thought, grounded in ethical commitment and responsible action, appropriate to the situation in which it is deployed.”

If enough people feel this way, strongly enough, perhaps we can preserve the best of our current institutions while eliminating some obvious noise.
