Data availability
Data, materials and code associated with this research that can be shared without restriction are publicly available in a living OSF repository (https://doi.org/10.17605/OSF.IO/G5SNY)48. The living repository incorporates improvements, fixes and additions made after publication. Readers can also access a registered, archived version of this repository containing the data, code and documentation exactly as they existed upon publication of this paper (https://doi.org/10.17605/OSF.IO/BZFGY). The repository includes all available documentation for replication attempts regardless of whether they were completed. This includes most of the data and code from the individual replication attempts, except for data that is proprietary or protected, or for which analyst teams were uncertain or unable to confirm that they were permitted to share secondary data. It is possible that some data, materials or code that could be shared openly is not available at the time of publication. Readers are encouraged to contact the corresponding author or the authors of the relevant sub-project (Supplementary Table 3) to ask whether more research content can be shared in the living repository. This paper is part of a collection of papers reporting on the SCORE program. Documentation, data and code for the entire program are available at https://doi.org/10.17605/OSF.IO/DTZX4.
Code availability
Code for individual replication projects is available alongside data and materials for each project in the OSF repository (https://doi.org/10.17605/OSF.IO/G5SNY). This includes a push-button package containing all code and data used to produce the statistics, figures and tables reported here, along with code that populates them directly into the manuscript from a template. Also available is a registered, archived version of the repository containing precisely the data, code and documentation used to generate the outcomes reported in this paper (https://doi.org/10.17605/OSF.IO/BZFGY).
References
Nosek, B. A. & Errington, T. M. What is replication? PLoS Biol. 18, e3000691 (2020).
Reproducibility and Replicability in Science (National Academies of Sciences, Engineering, and Medicine, 2019).
Nosek, B. A. et al. Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73, 719–748 (2022).
Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 0021 (2017).
Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
Klein, R. A. et al. Many Labs 2: investigating variation in replicability across samples and settings. Adv. Methods Pract. Psychol. Sci. 1, 443–490 (2018).
Klein, R. A. et al. Investigating variation in replicability: a “many labs” replication project. Soc. Psychol. 45, 142–152 (2014).
Ebersole, C. R. et al. Many Labs 5: testing pre-data collection peer review as an intervention to increase replicability. Preprint at OSF https://doi.org/10.31234/osf.io/sxfm2 (2019).
Ebersole, C. R. et al. Many Labs 3: evaluating participant pool quality across the academic semester via replication. J. Exp. Soc. Psychol. 67, 68–82 (2016).
Camerer, C. F. et al. Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436 (2016).
Camerer, C. F. et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644 (2018).
Cova, F. et al. Estimating the reproducibility of experimental philosophy. Rev. Philos. Psychol. 12, 9–44 (2018).
Errington, T. M. et al. Investigating the replicability of preclinical cancer biology. eLife 10, e71601 (2021).
Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2, e124 (2005).
Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013).
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. & Kievit, R. A. An agenda for purely confirmatory research. Perspect. Psychol. Sci. 7, 632–638 (2012).
Simmons, J. P., Nelson, L. D. & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011).
Greenwald, A. G. Consequences of prejudice against the null hypothesis. Psychol. Bull. 82, 1–20 (1975).
Rosenthal, R. The file drawer problem and tolerance for null results. Psychol. Bull. 86, 638–641 (1979).
Nosek, B. A., Ebersole, C. R., DeHaven, A. C. & Mellor, D. T. The preregistration revolution. Proc. Natl Acad. Sci. USA 115, 2600–2606 (2018).
Giner-Sorolla, R. Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspect. Psychol. Sci. 7, 562–571 (2012).
Nosek, B. A., Spies, J. R. & Motyl, M. Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7, 615–631 (2012).
Nosek, B. A. et al. Promoting an open research culture. Science 348, 1422–1425 (2015).
Freese, J. & Peterson, D. Replication in social science. Annu. Rev. Sociol. 43, 147–165 (2017).
Andrews, I. & Kasy, M. Identification of and correction for publication bias. Am. Econ. Rev. 109, 2766–2794 (2019).
Patil, P., Peng, R. D. & Leek, J. T. What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspect. Psychol. Sci. 11, 539–544 (2016).
Valentine, J. C. et al. Replication in prevention science. Prev. Sci. 12, 103–117 (2011).
Abatayo, A. L. et al. Credibility assessments in the social and behavioral sciences. Preprint at MetaArXiv https://doi.org/10.31222/osf.io/7u58q_v1 (2025).
Fanelli, D. “Positive” results increase down the hierarchy of the sciences. PLoS ONE 5, e10068 (2010).
Fanelli, D. Negative results are disappearing from most disciplines and countries. Scientometrics 90, 891–904 (2012).
Heyard, R. et al. A scoping review on metrics to quantify reproducibility: a multitude of questions leads to a multitude of metrics. R. Soc. Open Sci. 12, 242076 (2025).
Muradchanian, J., Hoekstra, R., Kiers, H. & van Ravenzwaaij, D. How best to quantify replication success? A simulation study on the comparison of replication success metrics. R. Soc. Open Sci. 8, 201697 (2021).
Peels, R. Replicability and replication in the humanities. Res. Integr. Peer Rev. 4, 2 (2019).
Peels, R. & Bouter, L. The possibility and desirability of replication in the humanities. Palgrave Commun. 4, 95 (2018).
Talkad Sukumar, P. & Metoyer, R. Replication and transparency of qualitative research from a constructivist perspective. Preprint at OSF https://doi.org/10.31219/osf.io/6efvp (2019).
Hubbard, D. W. & Carriquiry, A. L. Quality control for scientific research: addressing reproducibility, responsiveness, and relevance. Am. Stat. 73, 46–55 (2019).
Alipourfard, N. et al. Systematizing Confidence in Open Research and Evidence (SCORE). Preprint at SocArXiv https://doi.org/10.31235/osf.io/46mnb (2021).
Miske, O. et al. Investigating the reproducibility of the social and behavioral sciences. Nature https://doi.org/10.1038/s41586-026-10203-5 (2025).
Ebersole, C. R. et al. Many Labs 5: testing pre-data-collection peer review as an intervention to increase replicability. Adv. Methods Pract. Psychol. Sci. 3, 309–331 (2020).
Held, L., Pawel, S. & Micheloud, C. The assessment of replicability using the sum of p-values. R. Soc. Open Sci. 11, 240149 (2024).
Micheloud, C., Balabdaoui, F. & Held, L. Assessing replicability with the sceptical p-value: type-I error control and sample size planning. Stat. Neerl. https://doi.org/10.1111/stan.12312 (2023).
Gronau, Q. F., Heck, D. W., Berkhout, S. W., Haaf, J. M. & Wagenmakers, E.-J. A primer on Bayesian model-averaged meta-analysis. Adv. Methods Pract. Psychol. Sci. https://doi.org/10.1177/25152459211031256 (2021).
Simonsohn, U. Small telescopes: detectability and the evaluation of replication results. Psychol. Sci. 26, 559–569 (2015).
Verhagen, J. & Wagenmakers, E.-J. Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol. Gen. 143, 1457–1475 (2014).
Ly, A., Etz, A., Marsman, M. & Wagenmakers, E.-J. Replication Bayes factors from evidence updating. Behav. Res. Methods 51, 2498–2508 (2019).
Steiner, P. M., Sheehan, P. & Wong, V. C. Correspondence measures for assessing replication success. Psychol. Methods https://doi.org/10.1037/met0000597 (2023).
Ben-Shachar, M., Lüdecke, D. & Makowski, D. effectsize: Estimation of effect size indices and standardized parameters. J. Open Source Softw. 5, 2815 (2020).
Haber, N. et al. Materials for “Investigating the replicability of the social and behavioral sciences”. OSF https://doi.org/10.17605/OSF.IO/G5SNY (2025).
Acknowledgements
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreements N660011924015 (to principal investigator B.A.N.) and HR00112020015 (to principal investigator T.M.E.). The views, opinions, findings and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the US Government. We thank B. Arendt, A. Denis, M. Dirzo, Z. Loomas, B. Luis, L. Markham, E. S. Parsons and A. Russell for their contributions to this project.
Ethics declarations
Competing interests
A.H.T., M.D., N.H., K.H., O.M., T. Stankov, B.A.N. and T.M.E. are employees of the non-profit Center for Open Science, whose mission is to increase the openness, integrity and trustworthiness of research.
Peer review
Peer review information
Nature thanks Antica Culina, Malgorzata Lagisz and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Proportion of papers with a completed replication by discipline.
Proportion of papers by discipline for which a replication attempt was finished (purple), never attempted (blue), or for which a replication team was sourced but the replication study was not started or completed (other colors). OSF = Open Science Framework. This is presented as Supplementary Fig. 7 with additional narrative context in the Supplementary Information.
Extended Data Fig. 2 Proportion of papers with a completed replication by year.
Proportion of papers by publication year for which a replication attempt was finished (purple), never attempted (blue), or for which a replication team was sourced but the replication study was not started or completed (other colors). OSF = Open Science Framework. This is presented as Supplementary Fig. 8 with additional narrative context in the Supplementary Information.
Extended Data Fig. 3 Proportion of papers with a completed replication by journal.
Proportion of papers by journal for which a replication attempt was finished (purple), never attempted (blue), or for which a replication team was sourced but the replication study was not started or completed (other colors). Sample sizes per journal ranged from 5 to 10. This is presented as Supplementary Fig. 9 with additional narrative context in the Supplementary Information.
Extended Data Fig. 4 Retrospective review of papers that were not matched to replication teams to conduct a new data replication by discipline.
Y-axis indicates the proportion of available papers per discipline sample. “Plausible” means that there were no clear barriers to conducting a replication other than capacity within the project. “Secondary data” means that these papers were more appropriate for a secondary data replication. This is presented as Supplementary Fig. 10 with additional narrative context in the Supplementary Information.
Extended Data Fig. 5 Retrospective review of papers that were not matched to replication teams to conduct a secondary data replication by discipline.
Y-axis indicates the proportion of available papers per discipline sample. “Plausible” means that there were no clear barriers to conducting a replication other than capacity within the project. “Primary data” means that these papers were more appropriate for new data replications. Admin. = Administrative. This is presented as Supplementary Fig. 11 with additional narrative context in the Supplementary Information.
Extended Data Fig. 6 Correlation matrix among binary assessments of replication success across claims.
Correlation values are shown to the right of the diagonal, and correlation magnitudes are visualized to the left of the diagonal, with darker shading indicating stronger correlations. CI = confidence interval. This is presented as Supplementary Fig. 12 with additional narrative context in the Supplementary Information.
Extended Data Fig. 7 Replication success or failure for 13 binary assessments by the effect size difference between the replication and original studies.
Data points are differences in effect sizes for individual claims. Data points at the top of each graph are successful replications and those at the bottom are failed replications, according to that graph’s metric. This is presented as Supplementary Fig. 13 with additional narrative context in the Supplementary Information.
Extended Data Fig. 8 Replication success rates across 13 binary assessments for claims.
The vertical white line for each row is the estimate, and the 95% confidence interval around the estimate is represented by the dark bar. CI = confidence interval. This is presented as Supplementary Fig. 14 with additional narrative context in the Supplementary Information.
Extended Data Fig. 9 Percentage of replicated papers that were automatically identified as using each method or technique.
Two LLMs (GPT-4.1 and Kimi K2) identified the range of methods or techniques used across all abstracts (prompt: “What statistical techniques or analytic approaches are used?”). They then coded each abstract for the presence (1) or absence (0) of each—a method/technique is considered present if at least one of the models identified it as being present. Error bars = 95% confidence intervals. This is presented as Supplementary Fig. 15 with additional narrative context in the Supplementary Information.
Extended Data Fig. 10 Percentage of replicated papers that were automatically identified as citing each theoretical framework or paradigm.
Two LLMs (GPT-4.1 and Kimi K2) identified the range of frameworks or paradigms used across all abstracts (prompt: “What are the main theoretical frameworks/paradigms being cited?”). They then coded each abstract for the presence (1) or absence (0) of each—a framework/paradigm is considered present if at least one of the models identified it as being present. Error bars = 95% confidence intervals. This is presented as Supplementary Fig. 16 with additional narrative context in the Supplementary Information.
Supplementary information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tyner, A. H., Abatayo, A. L., Daley, M. et al. Investigating the replicability of the social and behavioural sciences. Nature 652, 143–150 (2026). https://doi.org/10.1038/s41586-025-10078-y