Data availability
MedQA-USMLE case vignettes were downloaded from https://github.com/jind11/MedQA. Derm-Public case vignettes were downloaded from https://www.clinicaladvisor.com/. The images and corresponding vignettes for the NEJM Image Challenge were downloaded from https://www.nejm.org/image-challenge. The private dataset generated as part of our study, together with all case vignettes used in the study, is available at https://github.com/rajpurkarlab/craft-md.
Code availability
All code for reproducing our analysis is available in the following repository: https://github.com/rajpurkarlab/craft-md.
Acknowledgements
S.J. is supported by the Quad Fellowship. This research project has benefitted from the Microsoft Accelerate Foundation Models Research grant program awarded to P.R., and we especially thank K. Takeda for facilitating access to resources. This study was also funded by Harvard Medical School's Dean's Innovation Award (awarded to P.R.).
Ethics declarations
Competing interests
R.D. reports receiving personal fees from DWA, personal fees from Pfizer, personal fees from L'Oreal, personal fees from VisualDx and stock options from MDAlgorithms and Revea outside the submitted work and has a patent for TrueImage pending. D.I.S. is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant for Appiell Inc. and LuminDx and an investigator for AbbVie and Sanofi. E.M.V.A. serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research and Serinus Bio. E.M.V.A. provides research support to Novartis, Bristol Myers Squibb, Sanofi and NextPoint. E.M.V.A. holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio and Syapse. E.M.V.A. has filed for institutional patents on chromatin mutations and immunotherapy response and methods for clinical interpretation and provides intermittent legal consulting on patents to Foley & Hoag. E.M.V.A. also serves on the editorial board of Science Advances. The other authors declare no competing interests.
Ethics declaration
The CRAFT-MD framework is designed to enable faster evaluation of LLMs for leading clinical conversations and to uncover limitations that can guide future model development. These LLMs could enhance clinical workflows by engaging in preliminary conversations with patients, collecting and summarizing relevant medical information and presenting these data to doctors before patient visits, potentially improving the effectiveness of doctor–patient interactions. Such LLMs could be more effective than pre-visit questionnaires, given their ability to lead dynamic conversations. However, this will require not only developing more capable LLMs but also making them more fault-tolerant and cognizant of appropriate empathetic behavior.
Peer review
Peer review information
Nature Medicine thanks Milica Gasic, Pearse A. Keane and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Distribution of case vignettes across medical specialties and source datasets.
(a) CRAFT-MD evaluation dataset, showing the distribution of case vignettes across 12 medical specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology and Others. (b) Inset pie chart showing the proportion of case vignettes by source of curation (MedQA-USMLE, Derm-Public and Derm-Private). (c) MELD analysis showing the Levenshtein distance between original and GPT-4-completed case vignettes.
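For readers reproducing the MELD analysis, the sketch below shows a standard dynamic-programming Levenshtein (edit) distance between an original vignette and its GPT-4-completed counterpart. The word-level tokenization, the example strings and any aggregation across cases are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Minimal sketch: Levenshtein (edit) distance between an original vignette and a
# GPT-4-completed vignette, computed over word tokens. Tokenization and any
# aggregation across cases (e.g., a median) are assumptions for illustration.

def levenshtein(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[-1]

# Hypothetical example strings; small distances suggest a faithful completion.
original = "A 45-year-old man presents with an itchy rash on both forearms".split()
completed = "A 45-year-old man presents with a pruritic rash on both forearms for two weeks".split()
print(levenshtein(original, completed))
```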
Extended Data Fig. 2 Examples of summarized conversations.
(i) An ideal summarized conversation containing all relevant medical history, without hallucinations or use of medical terminology; (ii) a summarized conversation with use of medical terminology, where the red highlight indicates the medical terminology used; (iii) a summarized conversation with incomplete medical history, where the red highlight demarcates missing information that is crucial for the diagnosis.
Extended Data Fig. 3 Distribution of clinical LLMs' accuracy in 4-choice MCQ across the medical specialties.
Distribution of clinical LLMs' accuracy in 4-choice MCQ across the 12 medical specialties for (a) GPT-4, (b) GPT-3.5, (c) Mistral-v2-7b, and (d) LLaMA-2-7b. Trends for the four experimental settings (vignette, multi-turn conversation, single-turn conversation and summarized conversation) are consistent with the combined accuracy for all 12 specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. Error bars represent 95% confidence intervals, and numbers represent the mean accuracy.
Extended Data Fig. 4 Distribution of clinical LLMs' accuracy in FRQs across the medical specialties.
Distribution of clinical LLMs' accuracy in FRQs across the 12 medical specialties for (a) GPT-4, (b) GPT-3.5, (c) Mistral-v2-7b, and (d) LLaMA-2-7b. Trends for the four experimental settings (vignette, multi-turn conversation, single-turn conversation and summarized conversation) are consistent with the combined accuracy for all 12 specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. Error bars represent 95% confidence intervals, and numbers represent the mean accuracy.
Extended Data Fig. 5 Trends in vignette and conversational formats in dermatology datasets for cases with a single most likely diagnosis.
Trends in vignette and conversational formats persist across skin disease datasets (MedQA-USMLE, Derm-Public and Derm-Private) for cases with a single most likely diagnosis. Results are shown for both (a–d) 4-choice MCQ and (e–h) FRQ settings. Error bars represent 95% confidence intervals.
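The error bars in Extended Data Figs. 3–5 denote 95% confidence intervals. As a point of reference, the sketch below computes such an interval for diagnostic accuracy using a nonparametric percentile bootstrap over case vignettes; the resampling scheme, number of replicates and example outcome vector are assumptions for illustration and may differ from the paper's exact procedure.

```python
# Minimal sketch: 95% confidence interval for diagnostic accuracy via a
# nonparametric percentile bootstrap over case vignettes. The resampling scheme
# and number of replicates are assumptions, not necessarily the paper's procedure.
import numpy as np

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float, float]:
    """Return (mean accuracy, lower bound, upper bound) from a percentile bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    # Resample cases with replacement and recompute accuracy for each replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_acc = correct[idx].mean(axis=1)
    lower, upper = np.percentile(boot_acc, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), lower, upper

# Hypothetical outcomes: 1 = correct diagnosis, 0 = incorrect, one entry per case vignette.
outcomes = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
mean_acc, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy = {mean_acc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```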
Supplementary information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Johri, S., Jeong, J., Tran, B.A. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat Med 31, 77–86 (2025). https://doi.org/10.1038/s41591-024-03328-5