New course on generative AI for behavioral science

This is Jessica. It feels like an “old” course now that the quarter is almost over, but this winter at Northwestern I taught a grad seminar on Generative AI for Social Science. The goal was to survey emerging applications of generative AI (mostly language model agents) in the social sciences, with special attention to methodological and metascientific concerns that come up when AI is used to simulate or substitute for human observations or labels. I became interested in this topic last year as a result of the problems it presents to inference, but also the opportunities it may present to improve behavioral research, which we recently discussed here.

I joined up my computer science section of the course with a Communications section led by my colleague Aaron Shaw, which resulted in a good mix of students across AI and social science. This was great for discussion, as many of the Comm students were experts in survey methods or psychology, while the CS students were more knowledgeable about how transformers work and methods for probing their internals. We also organized a workshop on validating generative AI for social science in February, which some of the students attended and therefore got to see the authors of the papers they were reading present them live.

The only downside of mixing backgrounds was occasional friction with the more formal methods papers. A few times in class there were dismissive jokes made about the more statistically-demanding papers, I assume as a result of people finding the content challenging. The course was advertised as requiring grad-level experience in stats, so they should have known what they were getting into, but I think sometimes people read these things and assume it doesn’t apply to them. My stance on these things is typically that as long as you’re doing your best to understand the material, it’s not a problem and I’ll help you get through it. But I don’t have a lot of patience with the attitude that because it’s challenging to you, it’s ok not to try to understand it. Especially in a class about why we need to take seriously the challenges that LLM simulations present for drawing valid inferences about human behavior!

This week they present project proposals, which include things like exploring new prompting architectures based in cognitive theory, using ML interpretability methods to steer models in ways informed by social science, and studying belief elicitation and uncertainty expression in language models. It’s a good topic for a seminar course because there are lots of aspects of methods that haven’t yet been explored, and thus lots of opportunity to bring social science theories to bear on how we interact with language models, or to apply the latest methods in AI or stats to behavioral science questions.

Week 1: Course introduction: Can generative AI transform social science?

This week we set context, reviewing proposals that argue for the transformative power of generative AI for social science.

Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? (No. w31122). National Bureau of Economic Research. https://www.nber.org/papers/w31122.
Kozlowski, A. C., & Evans, J. (2025). Simulating subjects: The promise and peril of artificial intelligence stand-ins for social agents and interactions. Sociological Methods & Research. https://journals.sagepub.com/doi/full/10.1177/00491241251337316.
Bail, C. A. (2024). Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21), e2314021121. https://www.pnas.org/doi/10.1073/pnas.2314021121.

Optional:

Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597–600. https://doi.org/10.1016/j.tics.2023.04.008.
Anthis, J. R., Liu, R., Richardson, S. M., Kozlowski, A. C., Koch, B., Brynjolfsson, E., Evans, J., & Bernstein, M. S. (2025). Position: LLM social simulations are a promising research method. Forty-second International Conference on Machine Learning Position Paper Track. https://openreview.net/pdf?id=cRBg1dtj7o.

Week 2: LLMs as surrogates I: Attitudes, opinions, social behavior

This week we start to read papers that evaluate how well LLMs can act as surrogates of humans. These readings focus on studies that use them to simulate human attitudes, opinions, and social behavior.

Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22. https://doi.org/10.1145/3586183.3606763
Kim, Junsol, and Byungkyu Lee. 2024. “AI-Augmented surveys: Leveraging large language models and surveys for opinion prediction.” arXiv. https://doi.org/10.48550/arXiv.2305.09620.
Ashery, A. F., Aiello, L. M., & Baronchelli, A. (2025). Emergent social conventions and collective bias in LLM populations. Science Advances, 11(20), eadu9368. https://doi.org/10.1126/sciadv.adu9368 (very optional: you may be interested in this commentary/critique of the paper: https://doi.org/10.48550/arXiv.2505.23796).

Optional:

Chuang, Y.-S., Goyal, A., Harlalka, N., Suresh, S., Hawkins, R., Yang, S., Shah, D., Hu, J., & Rogers, T. (2024). Simulating opinion dynamics with networks of LLM-based agents. In K. Duh, H. Gomez, & S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024 (pp. 3326–3346). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-naacl.211
Hansen, A. L., Horton, J. J., Kazinnik, S., Puzzello, D., & Zarifhonarvar, A. (2024). Simulating the survey of professional forecasters (SSRN Scholarly Paper No. 5066286). Social Science Research Network. https://doi.org/10.2139/ssrn.5066286
Park, J. S., Zou, C. Q., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., Willer, R., Liang, P., & Bernstein, M. S. (2024). Generative agent simulations of 1,000 people (No. arXiv:2411.10109). arXiv. https://doi.org/10.48550/arXiv.2411.10109

Week 3: LLMs as surrogates II: Cognition and behavioral experiments

This week’s material expands on the first “surrogates” set to focus on using LLMs to simulate human cognition and experimental effects more directly.

Cui, Z., Li, N., & Zhou, H. (2025). A large-scale replication of scenario-based experiments in psychology and management using large language models. Nature Computational Science, 5(8), 627–634. https://doi.org/10.1038/s43588-025-00840-7.
Binz, M., Akata, E., Bethge, M., Brändle, F., Callaway, F., Coda-Forno, J., Dayan, P., Demircan, C., Eckstein, M. K., Éltető, N., Griffiths, T. L., Haridi, S., Jagadish, A. K., Ji-An, L., Kipnis, A., Kumar, S., Ludwig, T., Mathony, M., Mattar, M., … Schulz, E. (2025). A foundation model to predict and capture human cognition. Nature, 644(8078), 1002–1009. https://doi.org/10.1038/s41586-025-09215-4.
Tranchero, M., Brenninkmeijer, C.-F., Murugan, A., & Nagaraj, A. (2024). Theorizing with large language models (Working Paper No. 33033). National Bureau of Economic Research. https://doi.org/10.3386/w33033

Optional:

Chen, Y., Liu, T. X., Shan, Y., & Zhong, S. (2023). The emergence of economic rationality of GPT. Proceedings of the National Academy of Sciences, 120(51), e2316205120. https://doi.org/10.1073/pnas.2316205120
Ashokkumar, A., Hewitt, L., Ghezae, I., & Willer, R. (2025). Predicting results of social science experiments using large language models. Preprint. https://docsend.com/view/ity6yf2dansesucf
Peng, T., Gui, G., Merlau, D. J., Fan, G. J., Sliman, M. B., Brucks, M., Johnson, E. J., Morwitz, V., Althenayyan, A., Bellezza, S., Donati, D., Fong, H., Friedman, E., Guevara, A., Hussein, M., Jerath, K., Kogut, B., Kumar, A., Lane, K., … Toubia, O. (2025). A mega-study of digital twins reveals strengths, weaknesses and opportunities for further improvement (No. arXiv:2509.19088). arXiv. https://doi.org/10.48550/arXiv.2509.19088
Akata, E., Schulz, L., Coda-Forno, J., Oh, S. J., Bethge, M., & Schulz, E. (2025). Playing repeated games with large language models. Nature Human Behaviour, 9(7), 1380–1390. https://doi.org/10.1038/s41562-025-02172-y

Week 4: Bias, alienness, and other threats to generalization

When the goal is to learn about human behavior, relying on LLM simulations risks biasing downstream inferences. The readings survey ways that LLMs tend to misrepresent human response distributions and exhibit non-human-like errors, as well as metascientific concerns that arise from their availability as a cheap source of simulated data.

Wang, A., Morgenstern, J., & Dickerson, J. P. (2025). Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence, 7(3), 400–411. https://doi.org/10.1038/s42256-025-00986-z
Messeri, L., & Crockett, M. J. (2024). Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002), 49–58. https://doi.org/10.1038/s41586-024-07146-0
Westwood, S. J. (2025). The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47), e2518075122. https://doi.org/10.1073/pnas.2518075122
Mancoridis, M., Weeks, B., Vafa, K., & Mullainathan, S. (2025). Potemkin understanding in large language models (No. arXiv:2506.21521). arXiv. https://doi.org/10.48550/arXiv.2506.21521

Optional:

Atari, M., Xue, M. J., Park, P. S., Blasi, D. E., & Henrich, J. (2023). Which humans? (No. 5b26t_v1). PsyArXiv. https://doi.org/10.31234/osf.io/5b26t
Dominguez-Olmedo, R., Hardt, M., & Mendler-Dünner, C. (2024). Questioning the survey responses of large language models. Proceedings of the 38th International Conference on Neural Information Processing Systems, 37, 45850–45878. https://dl.acm.org/doi/10.5555/3737916.3739374
Wang, P., Zou, H., Yan, Z., Guo, F., Sun, T., Xiao, Z., & Zhang, B. (2024). Not yet: Large language models cannot replace human respondents for psychometric research (No. rwy9b_v1). OSF Preprints. https://doi.org/10.31219/osf.io/rwy9b
Cummins, J. (2025). The threat of analytic flexibility in using large language models to simulate human data: A call to attention. arXiv preprint arXiv: https://doi.org/10.48550/arXiv.2509.13397.

Week 5: Validation I

This week we turn our attention to the methods that authors propose to use to check how well a language model simulates human behavior and draws inferences about the world, or to get valid estimates of their predictive accuracy.

Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337-351. https://doi.org/10.1017/pan.2023.2.
Manning, B. S., & Horton, J. J. (2025). General social agents (No. arXiv:2508.17407). arXiv. https://doi.org/10.48550/arXiv.2508.17407
Vafa, K., Chang, P. G., Rambachan, A., & Mullainathan, S. (2025). What has a foundation model found? Using inductive bias to probe for world models. Proceedings of the Forty-Second International Conference on Machine Learning (ICML 2025), PMLR 267. https://openreview.net/pdf?id=i9npQatSev.

Optional:

Neumann, T., De-Arteaga, M., & Fazelpour, S. (2025). Should you use LLMs to simulate opinions? Quality checks for early-stage deliberation (No. arXiv:2504.08954). arXiv. https://doi.org/10.48550/arXiv.2504.08954.
Larooij, M., & Törnberg, P. (2025). Do large language models solve the problems of agent-based modeling? A critical review of generative social simulations (No. arXiv:2504.03274). arXiv. https://doi.org/10.48550/arXiv.2504.03274
Aher, G. V., Arriaga, R. I., & Kalai, A. T. (2023). Using large language models to simulate multiple humans and replicate human subject studies. Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR 202, 337–371. https://proceedings.mlr.press/v202/aher23a.html

Week 6: Validation II

In contrast to heuristic approaches to validation that aim to show that LLM outputs are “close enough” to human ones, statistical approaches use some human observations to learn how to correct estimates drawn from LLM observations. These readings formally motivate why heuristic validation is not enough, introduce calibration frameworks, and present empirical results on how these methods compare to other repair strategies like fine-tuning.

Ludwig, J., Mullainathan, S., & Rambachan, A. (2025). Large language models: An applied econometric framework (Working Paper No. 33344). National Bureau of Economic Research. https://doi.org/10.3386/w33344
Broska, D., Howes, M., & Loon, A. van. (2025). The mixed subjects design: Treating large language models as potentially informative observations. Sociological Methods & Research, 54(1), 1074–1109. https://doi.org/10.1177/00491241251326865
Hullman, J., Broska, D., Sun, H., & Shaw, A. (2025). This human study did not involve human subjects: Validating LLMs as behavioral evidence. Preprint. PDF.
Krsteski, S., Russo, G., Chang, S., West, R., & Gligorić, K. (2025). Valid survey simulations with limited human data: The roles of prompting, fine-tuning, and rectification. arXiv preprint arXiv:2510.11408. https://doi.org/10.48550/arXiv.2510.11408.

Optional:

Song, Y., Kluger, D. M., Parikh, H., & Gu, T. (2026). Demystifying Prediction Powered Inference. arXiv:2601.20819.
Egami, N., Hinck, M., Stewart, B., & Wei, H. (2023). Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models. Advances in Neural Information Processing Systems, 36, 68589-68601. https://proceedings.neurips.cc/paper_files/paper/2023/hash/d862f7f5445255090de13b825b880d59-Abstract-Conference.html.
Hoffman, K., Salerno, S., Afiaz, A., Leek, J. T., & McCormick, T. H. (2024). Do we really even need data? (No. arXiv:2401.08702). arXiv. https://doi.org/10.48550/arXiv.2401.08702.
Zrnic, T., & Candès, E. J. (2024). Active statistical inference (No. arXiv:2403.03208). arXiv. https://doi.org/10.48550/arXiv.2403.03208

Week 7: AI as social scientist

So far we’ve mostly talked about LLMs being used as plug-in simulations for human data In surveys and experiments. This week we broaden to consider use of AI in other parts of the research process, like identifying what to research or what to manipulate in an experiment.

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated social science: Language models as scientist and subjects (No. w32381). National Bureau of Economic Research. https://doi.org/10.3386/w32381.
Almaatouq, A., Griffiths, T. L., Suchow, J. W., Whiting, M. E., Evans, J., & Watts, D. J. (2024). Beyond playing 20 questions with nature: Integrative experiment design in the social and behavioral sciences. Behavioral and Brain Sciences, 47, e33. https://doi.org/10.1017/S0140525X22002874 (See also commentaries on this article).
Si, C., Yang, D., & Hashimoto, T. (2024). Can llms generate novel research ideas? A large-scale human study with 100+ nlp researchers. arXiv preprint arXiv:2409.04109.

Optional:

Musslick, S., Bartlett, L. K., Chandramouli, S. H., Dubova, M., Gobet, F., Griffiths, T. L., … & Holmes, W. R. (2025). Automating the practice of science: Opportunities, challenges, and implications. Proceedings of the National Academy of Sciences, 122(5), e2401238121.
Tong, S., Mao, K., Huang, Z., Zhao, Y., & Peng, K. (2024). Automating psychological hypothesis generation with AI: When large language models meet causal graph. Humanities and Social Sciences Communications, 11(1), 896. https://doi.org/10.1057/s41599-024-03407-5

Week 8: Causal discovery & explanation

Continuing with the theme of using AI to design experiments or support theory, this week we look at using models or interpretability methods to learn representations of stimuli or outputs to support causal inference, discover outcomes in text, and theorize about human behavior.

Imai, K., & Nakamura, K. (2025). Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments. arXiv preprint arXiv:2410.00903
Modarressi, I., Spiess, J., & Venugopal, A. (2025). Causal inference on outcomes learned from text. arXiv preprint arXiv:2503.00725.
Zhu, J. Q., Xie, H., Arumugam, D., Wilson, R. C., & Griffiths, T. L. (2025). Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions. arXiv preprint arXiv:2505.11614. https://doi.org/10.48550/arXiv.2505.11614.

Optional:

Tak, A. N., Banayeeanzade, A., Bolourani, A., Kian, M., Jia, R., & Gratch, J. (2025). Mechanistic interpretability of emotion inference in large language models. Findings of the Association for Computational Linguistics: ACL 2025 (pp. 13090–13120). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-acl.679.
Movva, R., Peng, K., Garg, N., Kleinberg, J., & Pierson, E. (2025). Sparse Autoencoders for Hypothesis Generation. Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:44997-45023. https://proceedings.mlr.press/v267/movva25a.html.
Zhu, J. Q., Peterson, J. C., Enke, B., & Griffiths, T. L. (2025). Capturing the complexity of human strategic decision-making with machine learning. Nature Human Behaviour, 1-7. https://doi.org/10.1038/s41562-025-02230-5.
Kim, J., Evans, J., & Schein, A. (2025). Linear representations of political perspective emerge in large language models. arXiv preprint arXiv:2503.02080. https://doi.org/10.48550/arXiv.2503.02080

Week 9: Belief-like representations and Bayesian inference

Behavioral scientists often take for granted that people have beliefs, attitudes, desires, and other mental states. This week we look at proposals for how to look for similar representations in language models. We also consider a Bayesian formulation of the prompting process that describes how the researcher’s expectations of reasonable data influence what they generate with language models.

Herrmann, D.A., Levinstein, B.A. Standards for Belief Representations in LLMs. Minds & Machines 35, 5 (2025). https://doi.org/10.1007/s11023-024-09709-6.
Yamin, Khurram, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder. (2026). Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making. https://arxiv.org/abs/2602.06286.
Misra, S. (2025). Foundation Priors. arXiv preprint arXiv:2512.01107.

Optional:

Wu, S., Choi, E., Khatua, A., Wang, Z., He-Yueya, J., Weerasooriya, T., Wei, W., Yang, D., Leskovec, J., and Zou, J. HUMANLM: Simulating Users with State Alignment Beats Response Imitation. https://humanlm.stanford.edu/HumanLM_paper.pdf
Churchland, P. M. (1981). Eliminative materialism and the propositional attitudes. the Journal of Philosophy, 78(2), 67-90. https://www.jstor.org/stable/pdf/2025900.pdf