Abstract
Scientists must choose which among many experiments to perform. We study the epistemic success of experimental choice strategies proposed by philosophers of science or executed by scientists themselves. We develop a multi-agent model of the scientific process that jointly formalizes its core aspects: active experimentation, theorizing, and social learning. We find that agents who choose new experiments at random develop the most informative and predictive theories of the world. Agents aiming to confirm or falsify theories, or to resolve theoretical disagreements, end up with an illusion of epistemic success: they develop promising accounts of the data they collected while misrepresenting the ground truth they intended to learn about. Agents experimenting in these theory-motivated ways acquire less diverse or less representative samples from the ground truth, which also turn out to be easier to account for. Random data collection, on the other hand, combines the virtues of diverse and representative sampling from a target scientific domain, enabling the cumulative development of successful theoretical accounts of it. We suggest that randomization, already a gold standard within experiments, is also beneficial at the level of choosing experiments themselves.
Introduction
The problem of experimental choice
Every empirical scientist faces the question of how to choose the next experiment. Some would pick an experiment likely to produce data favoring their theory, some would attempt to falsify their theory or resolve major theoretical disagreements, while others would simply test a relationship between variables of interest and look for the theoretical motivation post hoc. Despite the crucial importance of experimental choice for the sciences, methodological recommendations on this front remain rather limited.
Several broad strategies for experimental choice can be found in the philosophical literature. Fleck and Kuhn famously argued that scientists usually conduct experiments where the outcomes are already well known, or at least broadly expected (Fleck, 1979; Kuhn, 1970; whether this is what they thought we ought to do is left for another discussion; also, see Klayman and Ha, 1987). Popper suggested that scientists should seek out those experiments most likely to falsify their theories (Popper, 2014), and in a related vein Mayo has suggested that scientists should subject their theories to severe tests (Mayo, 2018). Lakatos, and others, highlighted the importance of crucial experiments which distinguished between one theory and another (Lakatos, 1974; Platt, 1964). These alternatives share a crucial assumption that theory-motivated experimentation, in one way or another, facilitates theory-building and learning about the world. This stands in contrast to contemporary discussions in statistics and machine learning which highlight the importance of unbiased, random sampling in many circumstances (e.g., Baribault et al., 2018; Sloman et al., 2022) and with the well-recognized role of serendipity in most influential scientific findings, such as the discoveries of penicillin, electric current, and X-rays (Roberts, 1989; Rosenman, 1988; Yanai and Lercher, 2019).
Even though the choice of experiment constitutes an essential everyday part of scientific activity, discussions over the efficiency of different strategies remain largely unresolved. Scientists who engage in experimentation on an everyday basis are left to their own intuitions and the social conventions of their fields. Psychologists, for example, often conduct new experiments to corroborate or (sometimes) challenge century-old theories; they tend to use similar paradigms and experimental settings, and they aim to avoid experiments that are not informed by a developed theory (e.g., see van Rooij and Baggio, 2021, for a recent proposal). Beyond guiding experimental predictions, theoretical assumptions also shape which variables scientists consider worth manipulating in the first place (Dubova and Goldstone, 2023). This focus has historically led psychology to overlook cultural variation, with recent work revealing that many well-established effects are modulated by cultural context (Henrich et al., 2010; Medin and Atran, 2004). This tradition promotes a theory-first research cycle in which new data collection is ideally justified by a pre-existing theory and its predictions, rooted in the theory-testing ideal of the hard sciences, such as physics (Meehl, 1967; Platt, 1964).
The idealized notion of fully deliberate, theory-guided experimentation appears to conflict with how successful learning is achieved in the natural world (Campbell, 1974). Biological evolution and innovation proceeds by random mutations rather than targeted changes (e.g., Wagner, 2011). Humans, too, learn through an adaptive balance of exploration and exploitation (e.g., Wilson et al., 2021), and often fall into “learning traps” when they exploit their knowledge too early (Rich and Gureckis, 2018). Human exploration is especially prominent at early developmental stages (e.g., Hart et al., 2022; Liquin and Gopnik, 2022).
Thus, despite the historical appeal of theory-driven experimentation, it remains unclear whether such strategies actually promote more successful learning in science. The potential need for more exploratory experimental approaches has recently been highlighted by a growing number of researchers who advocate for broad, integrative experimentation rather than narrowly targeted testing (Almaatouq et al., 2024; Baribault et al., 2018). Here, we develop a formal modeling framework to compare different experimental strategies and evaluate their respective epistemic outcomes.
Formal assessment of data collection strategies: Does theory-motivated data collection help to develop better theories?
We develop a general modeling framework to study the epistemic success of different strategies that communities of interacting scientists may follow when trying to learn about the world. In this multi-agent model, agents collect data, try to explain them, and communicate their findings and theoretical accounts to each other (Figure 1). This framework enables us to investigate essential aspects of scientific activity—experimentation, explanation, and communication—as well as their possible interactions (e.g., one’s theoretical framework influencing the data one collects). We apply the model to computationally probe one of the fundamental intuitions of the broad scientific community about experimentation: the primacy of theory-driven experimentation.

Here, we rigorously assess the epistemic success—specifically, the informativity and predictive accuracy of developed theories—resulting from different experimentation strategies in different contexts. We formalize the commonly proposed experimentation strategies: confirmation (Fleck, 1979; Kuhn, 1970), falsification (Popper, 2014), crucial experimentation (Lakatos, 1974), novelty-driven experimentation (e.g., in a sampling literature: Shi and Tang, 2021), random baseline (Brunswik, 1955), and their hybrids. To achieve the most robust and general results, we extensively vary other potentially important components, such as complexity of the “ground truth” that the agents are learning about, the ways in which they interact, their meta-theoretic preferences, measurement resources, and other factors.
Across all the conditions, we found that the agents collecting data at random end up with the most informative and predictive theoretical accounts for the ground truth. The agents following confirmation-, falsification-, theoretical disagreement-based strategies and their hybrids limited their observations to a smaller subset of the ground truth that did not let them produce a successful theory about the ground truth, but created an illusion of such. We therefore conclude that theoretically motivated experiment choice is potentially damaging for science, but in a way that will not be apparent to the scientists themselves.
Results
We conducted five simulation experiments to investigate the role of experimentation strategy (Table 1) in the epistemic success of the learners. First, we looked at the main effects of the agents’ experimentation strategy on their “perceived” and “actual” epistemic success across contexts. We found that random experimentation leads agents to develop the most informative accounts of the ground truth in nearly all the contexts we considered. Agents collecting data in a theoretically informed way, even if guided by falsification or disagreement, ended up sampling less diverse or less representative observations from the ground truth, which are easier to account for. These agents developed theories that appeared very successful but in fact misrepresented the ground truth. Then, we tested whether theoretically informed experimentation can be justified when the agents are limited in the number of observations they can collect, when they have good theories, when the dependent and independent variables are clearly separated in the experimental space, or when the agents’ epistemic goal changes (e.g., from reconstruction to prediction). We found that neither limited learning time, a separate status for independent experimental variables, different epistemic goals, nor well-developed theories allow agents following theory-informed strategies to develop better theories than agents who simply choose experiments at random.
| Strategy | Intuition |
| --- | --- |
| Confirmation | The agent samples a new observation close to the previously collected observation (referent) that is well explained by its theory |
| Falsification | The agent samples a new observation close to the previously collected observation that is poorly explained by its theory |
| Disagreement | The agent samples a new observation close to the previously collected observation on which predictions of its and another agent’s theories largely disagree |
| Disagreement + confirmation | The agent samples a new observation close to the previously collected observation that is both well explained by its theory and also leads to a different prediction by another agent’s theory |
| Disagreement + falsification | The agent samples a new observation close to the previously collected observation that is both poorly explained by its theory and also leads to a largely divergent prediction by another agent’s theory |
| Novelty | The agent samples a new observation that is very different from its previously collected observations |
| Random | The agent samples a new observation at random |
Theory-motivated strategies are probabilistic, so that the referents for sampling are chosen with respect to their confirmation-, falsification-, or disagreement-score (or both scores, in hybrid strategies). Note that the referent in theory-motivated experimentation only determines a value along one dimension of a future observation (the controlled dimension), while the observation’s values along all other dimensions are sampled from the ground truth distribution conditioned on that value. Agents following all strategies except random start with 10 randomly collected observations, and then have a 10% random exploration rate.
Experiment 1: Epistemic success of experimentation strategies across contexts
First, we simulated the model to determine the relative success of experimentation strategies across contexts: for different complexities of the ground truth, agents’ communication strategies, agents’ preference for simpler explanations, measurement capacities, group sizes, and others. Ground truth complexity was manipulated through the number of dimensions and the number of clusters it comprised. Multi-agent learning with each combination of parameters determining the “context” (see Methods) was simulated 3 times, resulting in 9060 simulations analyzed further (two simulations did not finish in time, which is controlled for in the analysis).
Perceived performance
First, we looked at the influence of experimentation strategy on how well agents’ resulting theories accounted only for the data that they or their colleagues actually collected (perceived performance). In our analysis of perceived performance, we did not consider how the agents’ theories would account for data that was not collected by any agent in the community, or for the ground truth itself. For our main descriptive analysis of the simulation results, we applied linear regression (Model 1) to estimate the effects of data sampling strategies on agents’ perceived performance, while controlling for the effects of all other variables we varied between simulations. We replicated the analysis with regression models that include an interaction of the experimentation strategy with the group communication strategy or with the ground truth complexity (number of dimensions and clusters in the ground truth) to estimate how context-dependent the main effects of different experimentation strategies are. We also replicated the analyses using the group’s best individual’s perceived reconstruction error, rather than the group average, as the target variable, and with respect to the agents’ perceived scores at different points of their learning progress. We performed post-hoc contrast tests to construct a ranking of the data sampling strategies according to their perceived success, based on the results from Model 1 (Table 2).
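The exact covariate set is listed in the Methods; schematically, and under the assumption that all contextual variables enter as additive main effects, Model 1 can be written as

$$
\text{PerceivedError}_g \;=\; \beta_0 \;+\; \sum_{s \neq \text{random}} \beta_s \,\mathbb{1}\!\left[\text{Strategy}(g)=s\right] \;+\; \sum_{k} \gamma_k \,\text{Context}_k(g) \;+\; \varepsilon_g ,
$$

where $g$ indexes simulated groups, the strategy indicators are dummy-coded relative to the random baseline (so each $\beta_s$ corresponds to a coefficient shown in Figure 2(a)), and $\text{Context}_k$ ranges over the contextual variables varied between simulations (ground truth dimensionality and number of clusters, communication strategy, group size, measurement capacity, and so on).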

The solid lines between the strategies represent significant (p < .05) post-hoc differences between the strategy and all other strategies across the line (Tukey test). The dotted line denotes that the strategy is significantly different from only some of the strategies across the line. These results are based on the scores of the agents at the end of their learning (after the group conducted 300 experiments).
Across the analyses, we found that the agents following random or novelty-based experimentation strategies ended up with worse accounts for their collected observations than the agents following all other experimentation strategies that we tested (Figure 2(a), left). Figure 2(a) (left) shows the regression coefficients for each experimentation strategy relative to the random strategy baseline. Positive coefficients indicate greater theoretical error (worse performance) than random experimentation, and negative coefficients indicate smaller error (better performance). All theory-guided strategies (i.e., confirmation, falsification, disagreement, and their hybrids) produced significantly lower perceived theoretical error than random (p < .001 for all). The confirmation and disagreement + confirmation strategies achieved the smallest perceived errors, followed by disagreement, disagreement + falsification, and falsification. The novelty-based strategy appeared to perform worse than all others. This shows that, when evaluated only on the data already collected by the agents in the community, agents using theory-motivated approaches appear to outperform random or novelty-based exploration, producing the illusion of success discussed in the introduction. These agents appear to themselves to have successfully captured the ground truth.

These results replicated with respect to both the average perceived success of the group and its best individual’s perceived score. The ranking of the strategies with respect to their perceived success remained the same after interaction terms were added, indicating that the main effects are quite stable across learning contexts. These results were also stable across different amounts of other agents’ collected data used for evaluation (Figure 2(b)) and across time, resulting in comparable rankings of the strategies based on the groups’ perceived performance measured after 50, 100, 150, 200, 250, and 300 observations were collected by the group.
In sum, the agents following novelty-based and random experimentation strategies generated less successful accounts of their own observations than agents following all other experimentation strategies, no matter how complex the “ground truth” was, how the agents communicated about their results, how many agents were learning together, how elaborate their theories were, how limited they were in their measurements, and how many observations they collected.
Actual performance
We performed the same analyses to assess the influence of data sampling strategies on agents’ actual performance, evaluating their theories against representative samples from the full “ground truth” distribution instead of their own collected datasets. This captures the actual success of scientific theories in accounting for the truth, rather than the apparent success captured by our perceived performance measure.
Figure 2(a) (right) displays Model 1 coefficients for actual performance, again relative to the random-strategy baseline. Here, the relative success of the different experimentation strategies reverses: the agents collecting data at random ended up capturing the ground truth better than agents following all other strategies (p < .001 for all). Confirmation-based, falsification-based, and disagreement-based strategies, as well as their hybrids, performed significantly worse than the random strategy, while novelty-based experimentation performed either on par with or worse than the random strategy, depending on the type of analysis (Table 2, Figure 2(a) right). The superiority of the random sampling strategy replicated across all our analyses: when the model included only the main effects, when it included an interaction between the data sampling strategy and the group communication strategy, and when it included an interaction between the data sampling strategy and the properties of the ground truth (dimensionality and number of clusters).
Then, we looked at the relative actual epistemic success of the data sampling strategies depending on the number of observations that the agents had collected from their environment. Again, we found that the random strategy led the agents to accounts of the ground truth that were either better than or of the same quality as those produced by all other strategies, measured after the agents had learned from 50, 100, 150, 200, 250, and 300 samples. Thus, random experimentation outperforms all other tested strategies across a variety of parameters, including the type of ground truth, number of observations, measurement limitations, communication strategy, learning time, and others.
Perceived and actual epistemic success
One might be tempted to interpret the results so far as a simple case of overfitting: agents who fit their observed data more tightly would generalize worse to the ground truth than agents whose slightly looser fits are predictively more accurate. To investigate this possibility, we analyzed the direct relationship between the perceived and actual epistemic success of the agents.
The relationship between actual and perceived performance was significant and positive (Kendall’s tau = 0.35, z = 50.39, p < .001), indicating that a group of agents that seemed to perform better than average was also more likely to actually perform better than average. The positive significant effect replicated when the best actual and perceived scores in the group were used as the variables (Kendall’s tau = 0.38, z = 54.87, p < .001). Therefore, while the expected actual and perceived epistemic successes of each experimentation strategy are almost inversely related (Table 2, Figure 2(a)), a particular group’s higher perceived performance is predictive of its higher actual performance (Figure 3). For example, if one group of agents following the confirmation strategy is perceived to perform worse than another group following the confirmation strategy, we expect the first group to also be actually less successful. This relationship almost reverses when the simulations are grouped by experimentation strategy: if one strategy seems more successful than another according to perceived epistemic success, it is most likely actually less successful (Figure 2(b)). This result indicates that a group’s perceived performance is generally aligned with the metrics of actual epistemic success, while the two conflict at the level of experimentation strategies.

Within-group theoretical heterogeneity
We analyzed the influence of data sampling strategy on within-group theoretical heterogeneity, sometimes known as cognitive diversity (Kitcher, 1993). To measure heterogeneity, we (1) computed the mean Euclidean distance between the autoencoder weights (theories) of all pairs of agents in a group after aligning their internal nodes (heterogeneity of theories) and (2) asked all the agents in a group to generate predictions on 10000 random samples from the ground truth and computed the average Euclidean distance between these predictions (heterogeneity of theory-based predictions) (for more details, see Methods). According to both metrics, the groups following the confirmation strategy became more heterogeneous than the groups following other experimentation strategies. Moreover, the agents following all other strategies ended up with comparable within-group theoretical heterogeneity (Table 3).
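As a concrete illustration, here is a minimal sketch of the two heterogeneity metrics, assuming each agent’s aligned theory weights are flattened into a single vector and each agent exposes a reconstruction function; the helper names are hypothetical:

```python
import numpy as np

def theory_heterogeneity(weight_vectors):
    """Mean pairwise Euclidean distance between agents' theory (autoencoder)
    weights, computed after the agents' internal nodes have been aligned."""
    dists = [np.linalg.norm(w_i - w_j)
             for i, w_i in enumerate(weight_vectors)
             for w_j in weight_vectors[i + 1:]]
    return float(np.mean(dists))

def prediction_heterogeneity(reconstruct_fns, probe):
    """Mean pairwise Euclidean distance between agents' theory-based
    predictions on the same probe samples (the paper uses 10000 random
    samples from the ground truth)."""
    preds = [f(probe) for f in reconstruct_fns]          # each: (n_probe, n_dim)
    dists = [np.linalg.norm(p_i - p_j, axis=1).mean()
             for i, p_i in enumerate(preds)
             for p_j in preds[i + 1:]]
    return float(np.mean(dists))
```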

The solid lines between the strategies represent significant post-hoc differences between the strategy and all other strategies across the line (Tukey test). The dotted line denotes that the strategy is significantly different from only some of the strategies across the line. These results are based on the scores obtained at the end of agents’ learning (after the group conducted 300 experiments).
Then, we looked at the relationship between a group’s theoretical heterogeneity and its actual epistemic success. We used regressions that included all the main effects (Model 1) plus the theoretical heterogeneity scores as predictors. We found significant effects of both types of theoretical heterogeneity on agents’ theoretical success: the more heterogeneous the theoretical frameworks a group developed, the higher the actual theoretical error it ended up with (group average theoretical error: heterogeneity of theory-based predictions, est = 16.67, 95% CI [15.89, 17.46], t = 41.66, p < .001; heterogeneity of theories, est = 11.66, 95% CI [11.23, 12.10], t = 52.88, p < .001; group best individual’s theoretical error: heterogeneity of theory-based predictions, est = 6.39, 95% CI [5.99, 6.79], t = 31.28, p < .001; heterogeneity of theories, est = 14.61, 95% CI [13.94, 15.27], t = 43.06, p < .001).
This result stands in contrast to other results that postulate the importance of cognitive diversity for theoretical success (Hong and Page, 2004; Kitcher, 1993; Zollman, 2010). High theoretical heterogeneity in our simulations was most strongly associated with the confirmation strategy, which is also characterized by low within-agent and between-agent sampling variability, yet high representativeness of sampled data (Table 3). When agents sample narrowly, they fail to explore the space effectively, which generally leads to poor learning. However, this narrow sampling also provides simpler datasets (as demonstrated by the high perceived success of the confirmation agents) that can be explained by many distinct theories. As a result, communities can exhibit high theoretical heterogeneity not because they explore the space in more diverse ways, but because multiple, divergent theories can equally account for the same limited data.
Agents’ sampling behavior
To further understand consequences of following different types of experimentation strategies and their potential relation to agents’ epistemic success, we developed a number of metrics to analyze agents’ sampling behavior. Namely, we looked at the between- and within-agent sampling diversity, representativeness of sampled observations, and the way these properties of sampling behavior change over the time of learning (for details, see Methods).
Agents following theoretically informed experimentation strategies collected less diverse samples from the ground truth than the agents collecting data in novelty-based or fully random ways. Moreover, the theory-informed experimentalists collected increasingly less diverse samples the more they learned. At the group level, the agents who followed the confirmation strategy collected similar data that became more and more similar over time. The agents driven by novelty, on the other hand, collected dissimilar observations that became more distinct as learning progressed. Novelty-driven agents, however, suffered from collecting much less representative samples of the ground truth than agents following all other strategies (Table 3, Figure 4; note that this result is contingent on the fact that all the dimensions of the experimental space could play the role of a dependent or independent variable in a given experiment; see Experiment 4 for the simulation that explores scenarios imposing a functional separation between experimental dimensions).

While we did not find an epistemic benefit to theoretical diversity, these results indicate that there is an indirect epistemic benefit to collecting diverse data through diverse experimentation (see also Zollman, 2010). Our novel result suggests that this diversity will appear counterproductive to the scientists themselves (perceived performance) while in fact being indicative of more successful learning (actual performance).
Experiment 2: Learning with very limited observations
We hypothesized that the novelty-based strategy might be superior to the random strategy at the very beginning of learning, allowing the agents to get more varied and informative samples from the ground truth (as in Shi and Tang, 2021; Young et al., 2012). We performed new simulations in which the agents either conducted all their experiments at random or started to conduct experiments in a novelty-driven way after collecting only one observation at random (instead of 10, as in the previous simulations). In these simulations, we looked at the perceived and actual epistemic success of the agents with better temporal resolution, recording their scores each time the group collected 5 new empirical observations. All other parameters of the simulations were extensively varied as in the first experiment, and every condition was simulated 5 times, resulting in 4191 simulations in total (129 simulations did not finish in time, which is controlled for in the analysis).
Contrary to our expectations, we found that even at the very beginning of learning, the novelty-based experimentation strategy does not allow the agents to surpass the benefits of more representative, fully random experimentation. The superior actual performance of agents sampling the data at random is already recognizable after they collect 40 observations, and it only increases throughout learning (Figure 5).

Experiment 3: “Good theories” do not save theory-informed experimentation
In all the simulations thus far, agents began with a random theory that they refined over time in response to collected data. We hypothesized that having substantial prior knowledge (good theories) about the ground truth might help the agents benefit from theory-informed experimentation. We tested this intuition by pretraining each agent’s theory on 10, 50, or 100 prior observations randomly sampled from the ground truth. Then, we let the agents learn with a predetermined experimentation strategy (as described in Methods) and recorded their perceived and actual performance each time 5 new observations were collected in a group. Note that the groups pretrained with 100 observations per agent ended up collecting fewer observations with their target strategy than they were pretrained with. Here, we varied all other parameters in the same way as in Experiments 1 and 2, which resulted in 8906 simulations (166 simulations did not finish in time, which is controlled for in the analysis).
We analyzed the data using a linear model that included all contextual variables from the simulations (as in Model 1), as well as the number of pretraining observations and its interaction with agents’ experimentation strategy as predictors. We found that pretraining and its extent did not significantly affect the relative epistemic success of the strategies. The relative ranking of strategies remained stable across different levels of prior knowledge (Figure 6). Full coefficient estimates, confidence intervals, and test statistics are provided in Appendix Table A2. Once agents start to sample in a theory-informed way, their theories become progressively less successful, in actual terms, than the theories constructed by agents who continue to sample fully at random. Overall, the relative ranking of strategies is quite stable over time and independent of the amount of prior knowledge (Figure 6).

Experiment 4: Separating independent and dependent variables
In Experiments 1–3, every dimension within the experimental design space could potentially function as either a dependent or an independent variable in a given experiment. However, this approach does not align with real-world scientific scenarios in which manipulated and measured variables usually differ (e.g., in cognitive psychology, brightness of the stimuli is more likely to serve as an independent variable, while reaction time is often a dependent variable). Here, we conducted an experiment that delineated the roles of dependent and independent variables within the experimental space. Specifically, agents are restricted to controlling only one specific experimental dimension, treating all other dimensions as dependent variables. In this way, the separation between independent and dependent experimental variables is maintained consistently across experiments. We varied all the contextual parameters in the same manner as in Experiment 1, with one repetition for each set of contextual variables, leading to a total of 3024 simulation runs.
The overall outcomes of these simulations are in strong agreement with the results of Experiment 1 (Table 4). A notable difference, however, lies in the comparison between random and novelty-driven (space-filling) strategies when generalizing to new observations. The discrepancy between these strategies disappears since the novelty strategy is now unable to bias itself against representative samples, lacking access to the statistics of independent variables required for such biasing.

The solid lines between the strategies represent significant (p < .05) post-hoc differences between the strategy and all other strategies across the line (Tukey test).
Experiment 5: Learning to predict unseen properties of the observations
In a final set of simulations, we changed the agents’ epistemic objective. Instead of developing theories to reconstruct the observations, the agents were trained to predict hidden dimensions based on the dimensions they had observed (Figure 7). When running an experiment, the agents could only access a selected portion of the total experimental dimensions, with the hidden dimensions randomly chosen. They were then tasked to predict the values of the hidden dimensions, leading to a supervised update of their theories. To evaluate both “actual” and “perceived” performance, we estimated the agents’ ability to predict the hidden dimensions from the observed ones. The contextual parameters were altered consistently with Experiment 1, with the exception that the “teaching and learning” strategy was omitted and the proportion of dimensions to be hidden comprised either 50% or 75% of the total number of dimensions in the experimental space. These variations led to 2352 simulation runs in total.

The findings (Table 5) revealed that random and novelty-driven strategies notably outperformed all other strategies in terms of actual scores. Conversely, both confirmation and falsification strategies yielded theories that were least effective in predicting hidden dimensions, while strategies involving disagreement ranked in the middle. Examining the perceived scores, agents employing falsification and confirmation strategies managed to gather data that are most easily accommodated by their theories. In contrast, those relying on disagreement in their experimentation collected observations that posed the greatest challenges to explain.

The solid lines between the strategies represent significant (p < .05) post-hoc differences between the strategy and all other strategies across the line (Tukey test).
Discussion
Scientific theories play many roles in science. They support scientific communication, allow scientists to generalize their results to new situations, and guide experimentation within their subfields. In addition, theories often enable scientists to constrain the measurable space by pointing at the dimensions of interest and their relevant variation. The simulations presented here bear on only one of these functions: we test whether it is helpful to alter experimentation within a constrained experimental space based on existing theoretical accounts of the phenomena (their successes, failures, or disagreements). Although intended to test only one of the functions of theories, the model features some of their other functions: specifically, agents develop theories to capture the regularities of complex data and to communicate their results, and they are evaluated according to their theoretical success. Moreover, the measurable space of phenomena that the agents deal with is constrained, potentially reflecting another function of theories: constructing the experimental space in the first place.
In the context of our modeling framework, we found that collecting new data at random is the best strategy for learning about the “ground truth.” Collecting new observations that are as different from previous observations as possible is the second-best strategy, and it can perform on par with random experimentation when the dependent and independent variables are clearly separated within the experimental space. The falsification-based and crucial experiment-based strategies, often suggested as academic “gold standards,” end up generating less helpful data for subsequent theory development. Finally, the confirmation-based strategy, potentially the most commonly adopted in the academic world, results in the least successful learning even about the simplest ground truth distributions. The relative efficiencies of different experimentation strategies were strikingly invariant to the contexts we varied: random and novelty-based experimentation were superior to all other tested strategies for ground truths of different complexity, across learning time, and for different group sizes and social learning schemes, epistemic aims, resource limitations, and other conditions.
Interestingly, the picture reverses when the agents are evaluated with respect to “perceived” metrics of epistemic success. The confirmation-, falsification-, and crucial experiment-driven scientists seem to create an illusion of epistemic success simply by collecting the data that are easiest to explain (although see Experiment 5 for a different result on the perceived success of the crucial experimentation-driven scientists). The agents collecting new data at random or in a novelty-seeking way often appear to be the least epistemically successful when assessed against the data that they have access to, because they end up collecting more diverse data that actually reflect the ground truth. Thus, there is a striking inconsistency between seeming and actual epistemic success: the agents following the experimentation strategies that are actually least successful often end up with the highest confidence in their own success (this casts many results, such as those of Stewart and Plotkin, 2021, or Rzhetsky et al., 2015, in a new light: these works study only perceived success but seem to draw conclusions about actual success).
The values of randomization
Our main result is quite straightforward: to learn an accurate representation of the ground truth, one should collect observations from the ground truth that are as unbiased as possible. Note that scientists already consider random data collection a standard within some aspects of experimental choice: for instance, we try to randomize samples along potentially confounding variables within each experiment. For example, when trying to estimate the average human height, we try to randomize the location and age of the participants as much as we can. While randomization is a commonly desirable strategy for the vast majority of statistical estimation problems, we somehow consider it irrelevant to data collection at the more general level of experimental design.
The key advantage of random sampling is its robustness: it produces reliable results regardless of whether our expectations or prior theories about the world are accurate. Random sampling contrasts with the active (biased) sampling strategies that have been demonstrated to enable more efficient learning in a variety of contexts. Namely, in any structured environment, there are ways to speed up theory-building by biasing observation sampling (Myung et al., 2013; Settles, 2009). For example, if 16-year-olds have the most close-to-average height, a scientist can estimate average human height much faster if she only samples from this age group. Adaptive sampling, however, requires considerable insight into the structure of the problem that the agents are trying to learn about in the first place. Successful active learning strategies facilitate the learning process only in very specific circumstances, such as when the learner starts with accurate prior knowledge about the problem space, and they have to be used cautiously: only when the learning context corresponds to one in which the strategy has been proven to work (e.g., for cases when the theory and priors are well specified: Sugiyama, 2005; Sloman et al., 2022, 2023).
In other words, adaptive sampling is fragile: if non-random sampling is applied but the original assumptions about the target domain are wrong, or, as often happens, the strategy is applied imperfectly, the scientist is very likely to be led astray. For example, the average human height inferred from a subsample of 16-year-olds can deviate arbitrarily from the actual average human height whenever the original assumptions for biasing sampling in this way were not entirely correct. Random sampling, too, is not bias-free; its success depends on whether the right variables are being randomized. Moreover, both random and all other experimentation strategies we examined require constructing a bounded experimental space, a challenge that lies beyond the scope of the current work (see Almaatouq et al., 2024, for further discussion). Nevertheless, compared to the theory-motivated experimentation strategies that determine both which variables to control and which values of those variables to test based on prior theory, random sampling is more robust to uncertainty about the structure of the problem.
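To make this fragility concrete, here is a toy simulation (all numbers and the age-height relationship are hypothetical) in which the assumption behind the biased sampling plan happens to be wrong, so the biased estimate is off while the random-sample estimate is not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: mean height keeps increasing until age 25,
# so 16-year-olds are NOT close to the population average.
ages = rng.integers(10, 80, size=100_000)
heights = 150 + 0.4 * np.clip(ages, 10, 25) + rng.normal(0, 7, size=ages.size)

true_mean = heights.mean()

# Random sampling: unbiased regardless of whether our assumptions hold.
random_est = rng.choice(heights, size=200, replace=False).mean()

# "Adaptive" sampling under the (here false) assumption that 16-year-olds
# have close-to-average height.
biased_pool = heights[ages == 16]
biased_est = rng.choice(biased_pool, size=200, replace=False).mean()

print(true_mean, random_est, biased_est)  # biased_est is systematically off
```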
This fragility becomes even more important when applied to real scientific practice. The active sampling strategies that have been mathematically proven to succeed in certain contexts are typically highly specific algorithms (e.g., those reviewed in Settles, 2009), and they bear little resemblance to the “theory-motivated” experimentation strategies that scientists actually use or could feasibly implement. In practice, theory-driven experimentation often reflects informal reasoning rather than a precise procedure. The potential for automated or hybrid approaches, such as integrating active learning algorithms into experimental design in appropriate contexts (Almaatouq et al., 2024), remains a promising direction for future work.
What are experiments really for?
Scientists pursue a diversity of goals, and capturing phenomena with informative and predictive theories is only one of them (Potochnik, 2017). Our model is limited in that it assumes a specific epistemic purpose for experimentation, while, in practice, experiments serve many other aims and often have to also satisfy practical constraints (also see discussion of practical costs in the next Section). The experimental strategies discussed in this paper have often been conceived with different aims in mind. For instance, some philosophical traditions view science as a sequence of conjectures and refutations that progressively bring us closer to the truth. In that view, the primary goal of experimentation is to test and definitively reject theories (i.e., falsification), rather than to help refine theories to better capture observed phenomena (Popper, 2014). We do not argue for the primacy of our chosen, representational, goal of science, but instead offer our computational results as most relevant to scientists for whom constructing informative or predictive representations of the world is an important goal (van Fraassen, 1980).
Recommendations, limitations, and further directions
Due to the generality of the model, the evidence against biasing experimental choice based on theoretical considerations can be interpreted across scales: as reflecting dynamics within a small subfield or within science as a whole. These results question the prioritization of theory-driven experimentation by funders, publishers, and editors in many scientific fields.
The generality of our model, however, also limits the direct application of its results to any specific scientific field or problem. The optimal experimentation strategy depends on the structure of the ground truth scientists are dealing with, their prior knowledge about it, and the diversity of aims they pursue, many of which extend well beyond creating successful representations of the world (Potochnik, 2017). Moreover, there are practical and social reasons to favor theory-driven experimentation that lie outside the scope of our simulations. Theory-motivated experiments may be less costly or time-consuming (see King et al., 2004, on cost-minimizing experimental strategies), may fit better with existing expertise and equipment, or may facilitate more effective communication within a scientific community. These and other considerations prevent us from recommending that individual scientists avoid theory-motivated experimentation altogether. Rather, our results invite a re-examination of the presumed epistemic virtue of theory-driven experimentation as the ideal strategy for scientific discovery.
Another limitation of our model is that our current way of estimating the agents’ performance overlooks the possibility of agents specializing in particular regions of the ground truth (e.g., Weisberg and Muldoon, 2009; Wimsatt, 2007). Thus, our approach rules out many ways in which a group could end up learning about the world efficiently through specialization. In fact, some of the strategies we investigated produced more heterogeneous agents (e.g., confirmation-based experimentation), which may prove more efficient if each agent is evaluated only against its area of expertise. We recognize this as both a limitation and an avenue for future exploration, emphasizing that our current findings still highlight critical dynamics potentially at play in the scientific community. In particular, even within the extremely specialized subfields of scientists’ primary expertise (e.g., perception of color in human adults), scientists deal with phenomena that contain more than one effect or regularity. Therefore, even learning in such very specialized areas would require strategies that can efficiently deal with phenomena coming from several distinct structures or families (clusters), and this is exactly the type of learning process in which the theoretically motivated experimentation strategies we tested failed in our simulation experiments.
Another important limitation concerns the structure of the modeled ground truth. In our simulations, the ground truth consisted of mixtures of high-dimensional multivariate Gaussian distributions representing “blobby” phenomena spanning all the dimensions of the ground truth. This abstraction allowed us to examine how agents learn high-dimensional regularities in their environments through experimentation. However, in many scientific domains, the relevant regularities may instead lie on lower-dimensional manifolds or discrete causal structures (e.g., if phenomena occur only when a specific variable is present). In such cases, random exploration may not yield the same epistemic advantages we observe here, since large portions of the search space may be uninformative or irrelevant. This does not, however, immediately imply that theory-guided strategies would necessarily be more efficient in such contexts. A theory may direct experimentation toward the more “interesting” causal regions, but it may equally lead scientists to focus on uninformative regions if the underlying theory is substantially misguided.
Moreover, the agents in our simulations searched for statistical patterns rather than deterministic laws, such as the law of gravity in physics or a psychophysical law in psychology. Since the preprint of this paper was posted, Musslick et al. (2023) have conducted a conceptual replication using existing laws and models in cognitive science as the ground truth. Their findings replicated the core results reported here. These efforts mark only an initial step, and future work should investigate experimentation in sparse or low-dimensional ground truths, where causal relations are localized rather than widely distributed and where generalizable scientific laws are the primary objects of discovery.
A further limitation concerns the absence of measurement noise in our model. In the reported simulations, agents obtain exact samples from the ground truth distributions, which are themselves stochastic (each observation corresponds to a draw from a mixture of multivariate Gaussians). This means that the only source of uncertainty stems from the phenomenon itself rather than from measurement error. This abstraction departs from real scientific practice, where the measurements are inherently noisy and often require replication to separate signal from noise (Bland and Altman, 1996). Incorporating measurement uncertainty could change the relative success of experimentation strategies: for instance, repeated sampling in the same region of the space (as in confirmation strategy) might become more epistemically valuable under more noisy conditions (Loken and Gelman, 2017). Future work should therefore explore how varying levels of measurement noise influence the comparative performance of different experimentation strategies.
Completely a-theoretic experimentation is unattainable (e.g., Daston and Galison, 2021; also noted in the commentary on the preprint of this paper: Hullman, 2022). While the randomness of experimental choices can be augmented by automated methods (Yarkoni et al., 2020; also see de Almeida et al., 2017, for examples of accelerating organic chemistry by increasing randomness in experimentation), even automated experimental design software operates on the dimensions and variables defined or constrained by scientific theories and concepts (Daston and Galison, 2021; Dubova and Goldstone, 2023; Musslick et al., in press; see this issue being addressed in Hoffrage and Hertwig, 2006). Despite the impossibility of perfectly random experimentation, scientists often retain a degree of choice in determining the amount of theoretical motivation behind their experiments. Our results call into question the prevalent preference for more theory-driven experimentation (e.g., Platt, 1964), uncovering the risks of what may become vicious self-fulfilling prophecies when such an approach is used.
Finally, interpreting our work requires acknowledging several fundamental meta-limitations. Our simulations are themselves far from representing random experimentation: they were designed in a way that mirrors crucial experimentation. Specifically, we selected a set of “dominant” experimentation strategies and pitted them against one another, rather than exploring a more comprehensive space of possible strategies. This design choice may introduce biases that affected our results, and we are currently extending our work to include a broader range of strategies, such as those explicitly maximizing information or empowerment gain (Lidayan et al., 2025; Wilson et al., 2021). A higher-level insight emerging from our findings is the risk of being misled when applying prior theoretical assumptions to the collection of new data. This concern also extends to the interpretation of our own simulation results. Accordingly, the relevance of these findings to real-world scientific practice must be judged by individual researchers in light of the specific epistemic aims, constraints, and contexts of their own domains.
Beyond the specific results reported here, our modeling framework provides a method for evaluating a variety of aspects of scientific practice. We have focused on one particular issue, but the model is sufficiently general to apply to many others including the methods of theory construction, scientific communication, and strategies for dividing labor in a scientific community.
Conclusion
We present evidence from a multi-agent model that random experimentation allows agents to develop more successful theories about the ground truth than a variety of theory-driven experimentation procedures, such as falsification, confirmation, and crucial experimentation. Despite being actually superior across all contexts we tested, random experimentation appears to the scientists themselves to produce inferior theories, precisely because it prevents agents from inadvertently simplifying their theoretical job by sampling narrow sets of observations that are easier to explain. We suggest that, when it comes to experimental choice within a specific design space, scientists should be less influenced by theories than has often been supposed by philosophers of science and by scientists themselves.
Methods
Model of collective learning
We developed a minimal multi-agent model that captures three essential aspects of scientific activity: collecting the data, building explanations, and learning from others. In the simulations, the agents are trying to develop efficient lower-dimensional representations (theories) of the simulated high-dimensional “ground truth” environments.
Ground truth
In each simulation, we seed a ground-truth environment for the agents to learn about. Each environment is a mixture of multivariate Gaussian distributions (N_clusters = [2, 10, 30]) that span many dimensions (N_dim = [20, 100]; dim_length = 200). The ground truth space is used to produce observations for the scientists, which correspond to points sampled from the overall ground truth distribution. Observations are therefore described by their values along the ground truth’s dimensions. Some values along the dimensions are more likely than others; some dimensions correlate with each other, while others do not; some observations are clustered around a particular range of values along the dimensions. Therefore, different values along dimensions have differential predictive ability for the values along other dimensions, and some observations are more useful to learn about than others. For example, observing a datapoint from a highly frequent but narrow cluster might help generalize to other observations in the same cluster, but would be uninformative (or even misleading) with respect to the observations generated from other clusters. On the other hand, obtaining a rare and unrepresentative observation might not allow for any generalization at all.
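A minimal sketch of such a ground truth, with randomly generated mixture weights and covariances; the exact covariance structure and weighting used in our simulations are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_ground_truth(n_clusters=10, n_dim=20, dim_length=200):
    """A hypothetical ground truth: a mixture of multivariate Gaussians whose
    means lie inside a [0, dim_length] hypercube."""
    means = rng.uniform(0, dim_length, size=(n_clusters, n_dim))
    covs = []
    for _ in range(n_clusters):
        a = rng.normal(0, dim_length / 20, size=(n_dim, n_dim))
        covs.append(a @ a.T / n_dim)          # positive semi-definite covariance
    weights = rng.dirichlet(np.ones(n_clusters))
    return means, covs, weights

def sample_observation(means, covs, weights):
    """One unconstrained observation: a single point drawn from the mixture."""
    k = rng.choice(len(weights), p=weights)
    return rng.multivariate_normal(means[k], covs[k])
```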
Agents
The agents perform two epistemic actions: (1) collecting the data by controlling a value along one of the dimensions of the ground truth distribution and recording the values along other dimensions of the resulting observation (measurement_capacity = [all the dimensions, half of them]; when limited to recording half of the dimensions, the agents choose them randomly; the spatial representation of agents’ experimentation and theorizing is analogous to Klahr and Dunbar, 1988; in Experiment 5, measurement_capacity varied between [½ of the dimensions, ¾ of the dimensions]); (2) constructing compact “explanations” for the collected data. Agents’ explanations are lower-dimensional representations of their data, and the agents’ goal is to come up with the most efficient and representative lower-dimensional account of the ground truth space. An efficient lower-dimensional representation of the ground truth essentially allows the agents to predict new observations and the values that they take along the unrecorded dimensions. Here, agents’ explanations take the form of neural autoencoders (Kramer, 1992) with one hidden layer (N_hidden_neurons = [3, 6]), which are frequently updated on the agents’ individually collected data, in the same way for all the experimentation strategy conditions. When the agents update their explanations (see collective strategies for different updating schemes), they undergo 50 stochastic gradient descent updates (batch size = 1; optimizer = Adam; loss function = mean squared reconstruction error for Experiments 1–4; loss function = mean absolute error for Experiment 5) with their observations serving as both inputs and outputs. The activation function for the agents’ autoencoders is chosen to be linear for the encoding layer and ReLU for the decoding layer.
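A minimal PyTorch sketch of an agent’s theory and its update step; the learning rate is an assumption, as it is not specified in this description:

```python
import torch
from torch import nn

class TheoryAutoencoder(nn.Module):
    """An agent's 'theory': a one-hidden-layer autoencoder with a linear
    encoder and a ReLU decoder, as described above."""
    def __init__(self, n_dim: int, n_hidden: int = 3):
        super().__init__()
        self.encoder = nn.Linear(n_dim, n_hidden)                     # linear activation
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_dim), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))


def update_theory(theory, data, n_updates=50, lr=1e-3):
    """50 stochastic gradient updates (batch size 1, Adam, MSE reconstruction
    loss). `data` is a (N, n_dim) tensor of the agent's collected observations;
    `lr` is an assumption."""
    opt = torch.optim.Adam(theory.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(n_updates):
        x = data[torch.randint(len(data), (1,))]   # one randomly drawn observation
        opt.zero_grad()
        loss = loss_fn(theory(x), x)               # observation is both input and target
        loss.backward()
        opt.step()
```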
Experimentation strategies
The main focus of this work is the relative epistemic success of different experimentation strategies. We tested the following strategies:
Confirmation
The agent computes the “theoretical error” (reconstruction error of its neural autoencoder) on each observation in its dataset. Reconstruction error reflects how much information about an observation is captured in an agent’s autoencoder (theory). Low reconstruction error on an observation means that this high-dimensional observation can be successfully recovered from agent’s lower-dimensional representation of it alone.
The confirmation-driven agent samples a new observation close to a previous observation chosen with respect to the theoretical error on it: the probability of an observation being chosen as a “reference” for the new experiment is inversely related to the theoretical error on it (the better the current theory accounts for the observation, the more likely this observation is to be chosen as a reference for the new experiment). When collecting the data in theory-motivated ways, agents only specify a value along one of the dimensions, which is sampled close to the reference observation’s value along that dimension. The dimension on which the sampling will be conditioned is randomly chosen from all dimensions of the reference observation. All other values are then conditionally sampled from the ground truth given the controlled dimension and its value. Thus, the confirmation-driven agents essentially follow the gradient of their theoretical success when sampling new observations.
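A sketch of the referent-selection and experiment-design steps; the exact transformation from theoretical error to selection probability and the amount of jitter around the referent’s value are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def choose_referent(observations, theoretical_errors, confirm=True):
    """Pick a previously collected observation as the referent for the next
    experiment. For confirmation, selection probability is inversely related
    to the theory's reconstruction error on the observation; for falsification
    it is positively related (the probability transform here is illustrative)."""
    errors = np.asarray(theoretical_errors, dtype=float)
    scores = 1.0 / (errors + 1e-8) if confirm else errors + 1e-8
    probs = scores / scores.sum()
    idx = rng.choice(len(observations), p=probs)
    return observations[idx]

def design_experiment(referent, n_dim, jitter=1.0):
    """Control one randomly chosen dimension near the referent's value; all
    other dimensions are then sampled from the ground truth conditioned on
    that value (conditional sampler not shown)."""
    controlled_dim = rng.integers(n_dim)
    controlled_value = referent[controlled_dim] + rng.normal(0, jitter)
    return controlled_dim, controlled_value
```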
Falsification
The agent computes the “theoretical error” (reconstruction error of its neural autoencoder) on each observation in its dataset. Then, the agent samples a new observation close to a previous observation chosen with respect to the theoretical error on it: the probability of an observation being chosen as a “reference” for the new experiment is positively related to the theoretical error on it (the worse the current theory accounts for the observation, the more likely it is to be chosen as a reference for the new experiment).
Disagreement (crucial experiment)
Two agents combine their datasets and generate predictions for each sample in the combined dataset. The observations are ranked according to the Euclidean distance between the agents’ predictions for them. Then, one of the agents samples a new observation close to the observation from the combined dataset chosen proportionally to its place in the “disagreement” ranking (the more the agents disagree on an observation, the more likely it is to become a reference for the next experiment).
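A sketch of the disagreement score that drives referent choice for crucial experiments, assuming each agent’s theory-based predictions for the combined dataset are stored as an array of shape (n_obs, n_dim):

```python
import numpy as np

def disagreement_scores(recon_a, recon_b):
    """Per-observation Euclidean distance between two agents' theory-based
    reconstructions of the combined dataset; larger distances make an
    observation a more likely referent for the next ('crucial') experiment."""
    return np.linalg.norm(recon_a - recon_b, axis=1)
```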
Novelty
The agent randomly generates 500 potential sampling locations, which correspond to the dimension to control and the value along that dimension. The agent computes the distance of each of these potential experiments from all the observations it already collected. Proposals concerning dimensions that have never been observed before (when the agents have a limited measurement capacity) receive a distance of infinity. The next experiment is chosen as the one having the highest minimal distance from already collected samples.
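A sketch of the novelty proposal step, assuming the distance between a candidate (dimension, value) pair and a past observation is computed along that candidate’s controlled dimension:

```python
import numpy as np

rng = np.random.default_rng(3)

def novelty_proposal(collected, n_dim, dim_length=200, n_candidates=500):
    """Pick the candidate (dimension, value) pair with the largest minimal
    distance to the values already observed on that dimension. `collected`
    is a (n_obs, n_dim) array of past observations, with NaN marking
    dimensions that were not recorded."""
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        dim = rng.integers(n_dim)
        value = rng.uniform(0, dim_length)
        observed = collected[:, dim]
        observed = observed[~np.isnan(observed)]
        # Never-observed dimensions receive infinite distance, as in the text.
        score = np.inf if observed.size == 0 else np.min(np.abs(observed - value))
        if score > best_score:
            best, best_score = (dim, value), score
    return best
```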
Random
The agent chooses the conditioning dimension and its value to sample from with uniform probability, independently of its “theory” for the ground truth or previously collected observations.
We also formalized two hybrid strategies, “confirmation + disagreement” and “falsification + disagreement,” which operate by ranking the observations based on both their disagreement scores and confirmation/falsification scores.
For all strategies except random, the agents start with fully random sampling until they have collected their first 10 observations (unless otherwise specified; see Experiment 2). After this initial stage, each agent retains a constant 10% random-exploration rate.
Social learning strategies
We varied the social learning strategies that a group of agents (N = [5, 10]) follows while learning about the world:
Data sharing
The agents share their observations with each other after collecting each datapoint. Each agent adds the other agents’ observations to its own dataset and updates its explanation on the enlarged dataset each time it grows. Each agent treats its own observations and the other agents’ observations indistinguishably.
Dimension sharing
Every 10 steps, two agents are randomly chosen to exchange how much they value different dimensions of the ground truth. The agents recombine their internal nodes so that one internal node’s encoding weights are completely replaced with encoding weights of a randomly chosen internal node from another agent. After recombination of weights, both agents’ explanations are retrained on their datasets.
Explanation sharing
Two agents are chosen every 10 steps to exchange their explanations. Both agents are assigned a pairwise average of their autoencoder weights as their new explanation. After this exchange, the agents’ explanations are retrained on their datasets.
Aligned explanation sharing
Two agents are chosen every 10 steps to exchange explanations. After the exchange, both agents end up with a pairwise average of the weights of their aligned explanations as their new explanation. For this, the agents’ internal nodes are first aligned based on the similarity of the weights that the internal nodes have to the input/output nodes. We use the Hungarian algorithm (Kuhn, 1955) to find an alignment of two agents’ internal nodes that minimizes the difference between their internal nodes’ weights. Then, we take the pairwise average for the aligned nodes’ weights and assign it as a new explanation for both agents. The agents’ explanations are retrained on their datasets after sharing.
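A simplified sketch of the alignment-and-averaging step, using SciPy’s implementation of the Hungarian algorithm; only the encoder weights are shown here, whereas the model aligns nodes using their input and output weights and averages the full autoencoders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_and_average(enc_a, enc_b):
    """Aligned explanation sharing: match the two agents' hidden nodes with the
    Hungarian algorithm, then average the aligned weights.

    enc_a, enc_b: (n_hidden, n_input) encoder weight matrices.
    """
    # cost[i, j] = distance between hidden node i of agent A and node j of agent B
    cost = np.linalg.norm(enc_a[:, None, :] - enc_b[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)     # minimal-cost one-to-one matching
    aligned_b = enc_b[cols]                      # reorder B's nodes to match A's
    return (enc_a + aligned_b) / 2.0             # shared explanation for both agents
```

Self-confident explanation sharing (the next strategy) would replace the plain average with a weighted one, for example `0.8 * enc_a + 0.2 * aligned_b` from agent A’s perspective.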
Self-confident explanation sharing
This social learning strategy is the same as aligned explanation sharing, except that the explanation exchange is weighted: agents value their own explanations more than the other agent’s. Specifically, we take a weighted average (0.8 for the agent’s own weights, 0.2 for the other agent’s weights) of the two agents’ aligned weights when they exchange explanations. The agents’ explanations are retrained after the weights are reassigned.
Explanation sharing with some data sharing
The agents share their explanations in an aligned way every 10 experimentation steps, but they also share each observation they collect with one randomly chosen colleague each time they acquire it. The colleague adds the shared observation to its dataset and updates its explanation on its full dataset.
Teaching and learning
After each experimentation step, one agent is chosen to teach one other agent by imagining new observations based on the data it has seen so far. Formally, the new observation is created as follows. First, a new “raw” observation is formed as a linear combination of the datapoints that the teacher has already collected. The weights in this linear combination are positive, sum to one, and are sampled from a Dirichlet distribution. Then, this “raw” observation is passed through the teacher’s conceptual framework (autoencoder) and reconstructed, generating the final observation. The student agent adds the new observation to its dataset and updates its explanation on its full dataset.
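A sketch of the teacher’s imagination step; the Dirichlet concentration parameter (alpha = 1) and the exact interface of the teacher’s autoencoder are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def imagine_observation(teacher_data, teacher_explanation):
    """Teaching: the teacher 'imagines' a new observation as a Dirichlet-weighted
    convex combination of its own datapoints, passed through its autoencoder.

    teacher_data: (n_obs, n_dims) array; teacher_explanation: any callable mapping
    an observation to its reconstruction.
    """
    w = rng.dirichlet(np.ones(len(teacher_data)))  # positive weights summing to one
    raw = w @ np.asarray(teacher_data)             # linear combination of datapoints
    return teacher_explanation(raw)                # reconstructed 'final' observation
```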
Importantly, we vary the collective learning strategies to test how the efficiency of the experimentation choice strategies depends on these different social contexts. In our simulations, the social interaction strategies are not matched on the number of learning updates that each agent undergoes or the number of raw datapoints that each agent possesses. Therefore, we urge caution in interpreting any performance differences between groups with different social learning strategies.
Goal of the epistemic activity and evaluation
Although the modeling framework is flexible, in the current simulations we evaluate the agents with respect to constructing good theories, which in our model means efficient lower-dimensional accounts of the underlying reality.
To evaluate the scientists’ accounts of reality in Experiments 1–4, we compute the reconstruction error that their theories achieve on their own collected datasets and on the data collected by their peers1 (“perceived”), as well as on 10,000 samples from the original “ground truth” distribution (“actual”). Reconstruction error reflects how much of the structure in the agent’s collected data or in the ground truth is captured by the agent’s theory; in other words, it shows how accurately the rich ground truth (actual) or the collected data (perceived) can be reproduced from the concise theory alone. We record the group’s average perceived and actual performance, as well as the best individual’s “actual” and “perceived” performance, after every 50 experimentation steps (unless otherwise specified; see Experiment 2) until the group collects 300 observations in total. In Experiment 5, instead of reconstruction error, we evaluate the agents on their prediction error: the ability to predict the values along the hidden dimensions of an observation from the available ones.
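A minimal sketch of the evaluation metric, consistent with the autoencoder sketch above (PyTorch assumed):

```python
import torch


def reconstruction_error(explanation, data):
    """Mean squared reconstruction error of a theory (autoencoder) on a dataset.

    Evaluated on an agent's own (plus shared) observations this gives 'perceived'
    performance; evaluated on 10,000 fresh ground-truth samples it gives 'actual'
    performance.
    """
    with torch.no_grad():
        return torch.mean((explanation(data) - data) ** 2).item()
```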
Hyperparameters
Even though we tried to vary as many potentially relevant components of the model as possible, many parameters that we did not vary may affect the results. First, our simulations end when the group of agents collects 300 observations. Moreover, we fixed a single, arbitrarily chosen updating regime for the agents (50 Adam steps with batch size = 1). These hyperparameters remained constant while we varied the experimentation strategy and may limit the generalizability of the reported results.
Analysis of theoretical heterogeneity
We designed two metrics to assess the theoretical heterogeneity of the agents in each simulation (a computational sketch follows the list):
1. Heterogeneity of theory-based predictions: we sample 10,000 observations from the ground truth and run each of them through the agents’ final theories to obtain their predictions. We then compute the average pairwise Euclidean distance between the agents’ predictions for each observation and average across observations to obtain the group’s heterogeneity score.
2. Heterogeneity of the theory representations: we compute the mean Euclidean distance between the weights of all pairs of agents in a group after aligning the internal nodes of each pair with the Hungarian algorithm (Kuhn, 1955).
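A sketch of the first heterogeneity metric (the array layout is a hypothetical convention of ours):

```python
import itertools

import numpy as np


def prediction_heterogeneity(predictions):
    """Metric 1: average pairwise Euclidean distance between agents' theory-based
    predictions for the same ground-truth samples, averaged over samples.

    predictions: array of shape (n_agents, n_samples, n_dims). Metric 2 would
    reuse the Hungarian alignment from the explanation-sharing sketch, applied
    to the agents' weights rather than their predictions.
    """
    preds = np.asarray(predictions)
    pair_dists = [np.linalg.norm(preds[i] - preds[j], axis=1)
                  for i, j in itertools.combinations(range(len(preds)), 2)]
    return float(np.mean(pair_dists))
```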
Analysis of sampling behavior
We designed several metrics to shed light on the sampling behavior of the agents in each simulation (a sketch of two representative metrics follows the list):
1. Individual sampling variability: for each agent, we compute the pairwise Euclidean distances between all of its observations. We then average across the agents in a group to obtain a per-simulation measure of the individual variability of samples.
2. Individual sampling variability over time: we examine how the variability of samples changes over time. We compute the Euclidean distances between consecutive samples in each agent’s dataset, average these sequential distances across the agents in the group, and compute the Spearman correlation of the average sequential distances with their order (time).
3. Between-agent sampling variability: we compute the average pairwise Euclidean distance between the samples of all agents in a group.
4. Between-agent sampling variability over time: we compute the average pairwise Euclidean distance between agents’ observations sampled at the same order (time). We then compute the Spearman correlation of this sequential between-agent distance with the order of the observations.
5. Representativeness of samples: we compute the per-group average log-likelihood of the sampled observations under the ground truth distribution.
6. Representativeness of samples over time: we compute the Spearman correlation between the average log-likelihood of the observations collected by the agents in a group at each step and time.
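Sketches of the first and last metrics in the list above, with hypothetical input formats:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def individual_sampling_variability(datasets):
    """Metric 1: mean pairwise Euclidean distance within each agent's samples,
    averaged over the agents in a group. datasets: list of (n_obs, n_dims) arrays."""
    return float(np.mean([pdist(d).mean() for d in datasets]))


def representativeness_over_time(avg_loglik_per_step):
    """Metric 6: Spearman correlation between the group-average log-likelihood of
    the observations collected at each step and the step index (time)."""
    rho, _ = spearmanr(np.arange(len(avg_loglik_per_step)), avg_loglik_per_step)
    return float(rho)
```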
Acknowledgments
The authors thank Sabina Sloman, Robert Goldstone, James Michelson, Mahi Luthra, Eduardo Izquierdo, Joshua Nunley, David Landy, Sebastian Musslick, Mark Steyvers, Richard Shiffrin, members of the PCL, Computational Neuroethology, and Building a Mind labs at Indiana University, and participants of the Pittsburgh Formal Epistemology Workshop for productive discussions that led to the improvement of this work.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: M.D. was supported by the IU Cognitive Science program, the CMU Center for Formal Epistemology, and the NSF-NRT grant 1735095 “Interdisciplinary Training in Complex Networks and Systems.” This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute: https://kb.iu.edu/d/anwt#carbonate.
Significance statement
Choosing which experiments to perform or what data to collect is a crucial aspect of science. Many strategies have been proposed by philosophers and methodologists, or practiced by scientists themselves. These strategies include falsification (choosing an experiment most likely to falsify a dominant theory), disagreement (choosing an experiment likely to help diagnose which of multiple competing theories is better), and others. Here, we computationally test the effectiveness of such strategies. To do so, we developed a multi-agent model of the scientific process that includes active experimentation, theorizing, and social learning. Our findings suggest that an exploratory strategy of choosing new experiments at random leads agents to develop the most accurate and predictive theories of their simulated environments. In contrast, agents aiming to confirm theories, falsify theories, or resolve theoretical disagreements through experimentation develop seemingly promising accounts of the data they collect that ultimately misrepresent the underlying ground truth.
Footnotes
1. The amount of community data used for perceived evaluation depended on the social learning strategy: in some cases (“full data sharing”) the agents shared all their observations with others, in other cases (“partial data sharing”) the agents exchanged 10% or 20% of all observations with others, and in the remaining cases the agents only had access to their own collected datapoints.
References
Almaatouq A, Griffiths TL, Suchow JW, et al. (2024) Beyond playing 20 questions with nature: integrative experiment design in the social and behavioral sciences. Behavioral and Brain Sciences 47: e33.
Baribault B, Donkin C, Little DR, et al. (2018) Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences 115(11): 2607–2612.
Bland JM, Altman DG (1996) Measurement error. BMJ British Medical Journal 312(7047): 1654.
Brunswik E (1955) Representative design and probabilistic theory in a functional psychology. Psychological Review 62(3): 193–217.
Campbell D (1974) Evolutionary epistemology. In: Schilpp P (ed) The Philosophy of Karl R. Popper. LaSalle, IL: Open Court, 412–463.
Daston L, Galison P (2021) Objectivity. Princeton University Press.
de Almeida AF, Moreira R, Rodrigues T (2019) Synthetic organic chemistry driven by artificial intelligence. Nature Reviews Chemistry 3(10): 589–604.
Dubova M (2022) Generalizing with overly complex representations. In: NeurIPS 2022 workshop on information-theoretic principles in cognitive systems, New Orleans, LA, 28 November 2022–9 December 2022.
Dubova M, Goldstone RL (2023) Carving joints into nature: reengineering scientific concepts in light of concept-laden evidence. Trends in Cognitive Sciences 27(7): 656–670.
Fleck L (1979) Genesis and Development of a Scientific Fact. University of Chicago Press.
Hart Y, Kosoy E, Liquin EG, et al. (2022) The development of creative search strategies. Cognition 225: 105102.
Henrich J, Heine SJ, Norenzayan A (2010) The weirdest people in the world? Behavioral and Brain Sciences 33(2-3): 61–83.
Hoffrage U, Hertwig R (2006) Which world should be represented in representative design. In: Fiedler K, Juslin P (eds) Information Sampling and Adaptive Cognition. Cambridge University Press, 381–408.
Hong L, Page SE (2004) Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proceedings of the National Academy of Sciences 101(46): 16385–16389.
King RD, Whelan KE, Jones FM, et al. (2004) Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427(6971): 247–252.
Kitcher P (1993) The Advancement of Science: Science Without Legend, Objectivity Without Illusions. Oxford University Press on Demand.
Klahr D, Dunbar K (1988) Dual space search during scientific reasoning. Cognitive Science 12(1): 1–48.
Klayman J, Ha Y-W (1987) Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review 94(2): 211–228.
Kramer MA (1992) Autoassociative neural networks. Computers & Chemical Engineering 16(4): 313–328.
Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2): 83–97.
Kuhn TS (1970) The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Lakatos I (1974) The role of crucial experiments in science. Studies In History and Philosophy of Science Part A 4(4): 309–325.
Lidayan A, Du Y, Kosoy E, et al. (2025) Intrinsically-motivated humans and agents in open-world exploration. arXiv preprint arXiv:2503.23631.
Liquin EG, Gopnik A (2022) Children are more exploratory and learn more than adults in an approach-avoid task. Cognition 218: 104940.
Loken E, Gelman A (2017) Measurement error and the replication crisis. Science 355(6325): 584–585.
Mayo DG (2018) Statistical Inference as Severe Testing. Cambridge University Press.
Medin DL, Atran S (2004) The native mind: biological categorization and reasoning in development and across cultures. Psychological Review 111(4): 960–983.
Meehl PE (1967) Theory-testing in psychology and physics: a methodological paradox. Philosophy of Science 34(2): 103–115.
Musslick S, Strittmatter Y, Dubova M (2023) Closed-loop scientific discovery in the behavioral sciences. PsyArXiv, in press.
Myung JI, Cavagnaro DR, Pitt MA (2013) A tutorial on adaptive design optimization. Journal of Mathematical Psychology 57(3–4): 53–67.
Platt JR (1964) Strong inference: certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146(3642): 347–353.
Popper K (2014) Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.
Potochnik A (2017) Idealization and the Aims of Science. University of Chicago Press.
Rich AS, Gureckis TM (2018) The limits of learning: exploration, generalization, and the development of learning traps. Journal of Experimental Psychology: General 147(11): 1553–1570.
Roberts RM (1989) Serendipity: Accidental Discoveries in Science. Wiley, p. 288.
Rosenman MF (1988) Serendipity and scientific discovery. Journal of Creative Behavior 22(2): 132–138.
Rzhetsky A, Foster JG, Foster IT, et al. (2015) Choosing experiments to accelerate collective discovery. Proceedings of the National Academy of Sciences 112(47): 14569–14574.
Settles B (2009) Active learning literature survey. University of Wisconsin-Madison. Technical Report TR1648.
Shi C, Tang B (2021) Model-robust subdata selection for big data. Journal of Statistical Theory and Practice 15(4): 1–17.
Sloman SJ, Oppenheimer DM, Broomell SB, et al. (2022) Characterizing the robustness of Bayesian adaptive experimental designs to active learning bias. arXiv preprint arXiv:2205.13698.
Sloman SJ, Cavagnaro D, Broomell SB (2023) Knowing what to know: implications of the choice of prior distribution on the behavior of adaptive design optimization. arXiv preprint arXiv:2303.12683.
Stewart AJ, Plotkin JB (2021) The natural selection of good science. Nature Human Behaviour 5(11): 1510–1518.
Sugiyama M (2005) Active learning for misspecified models. Advances in Neural Information Processing Systems 18: 1305–1312.
van Fraassen BC (1980) The Scientific Image. Oxford University Press.
van Rooij I, Baggio G (2021) Theory before the test: how to build high-verisimilitude explanatory theories in psychological science. Perspectives on Psychological Science 16(4): 682–697.
Wagner A (2011) The Origins of Evolutionary Innovations: A Theory of Transformative Change in Living Systems. Oxford University Press.
Weisberg M, Muldoon R (2009) Epistemic landscapes and the division of cognitive labor. Philosophy of Science 76(2): 225–252.
Wilson RC, Bonawitz E, Costa VD, et al. (2021) Balancing exploration and exploitation with information and randomization. Current Opinion in Behavioral Sciences 38: 49–56.
Wimsatt WC (2007) Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality. Harvard University Press.
Yanai I, Lercher M (2019) Night science. Genome Biology 20(1): 179.
Yarkoni T, Eckles D, Heathers JAJ, et al. (2021) Enhancing and accelerating social science via automation: challenges and opportunities. Harvard Data Science Review 3(2): 1–34.
Young ME, Cole JJ, Sutherland SC (2012) Rich stimulus sampling for between-subjects designs improves model selection. Behavior Research Methods 44(1): 176–188.
Zollman KJ (2010) The epistemic benefit of transient diversity. Erkenntnis 72(1): 17–35.
Appendix
| Experimentation Strategy | Coefficient | 95% CI | t | p | Performance Type |
| Disagreement + falsification | −282.26 | [−322.83, −241.68] | −13.64 | <.001 | Perceived |
| Disagreement + confirmation | −430.81 | [−471.38, −390.23] | −20.81 | <.001 | Perceived |
| Disagreement | −404.80 | [−445.37, −364.22] | −19.56 | <.001 | Perceived |
| Falsification | −276.42 | [−316.99, −235.85] | −13.35 | <.001 | Perceived |
| Confirmation | −433.23 | [−473.80, −392.66] | −20.93 | <.001 | Perceived |
| Novelty | 158.14 | [117.57, 198.71] | 7.64 | <.001 | Perceived |
| Disagreement + falsification | 582.24 | [468.87, 695.63] | 10.07 | <.001 | Actual |
| Disagreement + confirmation | 720.45 | [607.07, 833.82] | 12.45 | <.001 | Actual |
| Disagreement | 666.92 | [553.55, 780.30] | 11.53 | <.001 | Actual |
| Falsification | 656.62 | [543.25, 770.00] | 11.35 | <.001 | Actual |
| Confirmation | 1228.46 | [1115.09, 1341.84] | 21.24 | <.001 | Actual |
| Novelty | 262.37 | [149.00, 375.75] | 4.54 | <.001 | Actual |
| Predictor | Coefficient | 95% CI | t | p |
| N of pretraining observations | 0.73 | [−3.22, 4.68] | 0.36 | .72 |
| N of pretraining obs × disagreement + falsification | 0.33 | [−5.26, 5.92] | 0.12 | .91 |
| N of pretraining obs × disagreement + confirmation | −0.61 | [−6.20, 4.98] | −0.21 | .83 |
| N of pretraining obs × disagreement | −0.87 | [−6.45, 4.72] | −0.30 | .76 |
| N of pretraining obs × falsification | 1.34 | [−4.24, 6.93] | 0.47 | .64 |
| N of pretraining obs × confirmation | −1.91 | [−7.50, 3.68] | −0.67 | .50 |
| N of pretraining obs × novelty | −0.10 | [−5.69, 5.49] | −0.04 | .97 |