AI is starting to beat doctors at making correct diagnoses


If you walk into an emergency room (ER) in 10 years, you’ll encounter a new type of caregiver: an artificial intelligence (AI) system designed to get you a diagnosis faster and help your care team make more informed decisions. While you sit in the waiting room, you’ll be hooked up to a blood pressure cuff whose readings are constantly and autonomously monitored. And an AI agent will listen in as you and your doctor talk about your symptoms, ready to flag any mistakes your physician makes or to suggest next steps.

This vision of AI-assisted emergency health care may soon be reality. In a new study reported today in Science, researchers show that a type of AI known as a large language model (LLM) often outperformed physicians at diagnosing complex and potentially life-threatening conditions, including decreased blood flow to the heart, even in the fast-moving early stages of real ER care when information is limited. In those early ER encounters, the model identified the correct diagnosis, or a very close one, about 67% of the time, compared with roughly 50% to 55% for physicians. And the technology is only getting better.

“Evaluating AI in medicine demands both depth and breadth across different clinical tasks and settings,” and these authors were able to incorporate both in this study, says Shreya Johri, a computer scientist at the Dana-Farber Cancer Institute who was not involved with the new research. Still, she notes, wide adoption of these AI systems in health care will hinge on knowing the contexts in which they’re most reliable.

The team behind the new study tested how accurately an advanced LLM, OpenAI’s o1 model, could diagnose patients with conditions of all kinds, putting it through six tasks. Five required the model to read through hand-selected medical profiles and suggest a diagnosis, choose next steps, or estimate the probability of a specific change in future health. In all five exercises, o1 performed as well as or better than physicians. The gap between the model and humans was so robust across the tasks that the authors worried no one would believe the results, says Adam Rodman, a co-author of the paper and an internist at Beth Israel Deaconess Medical Center. In one task, o1 earned a perfect clinical reasoning score, based on how well it explained its diagnostic thinking and next steps, for 98% of the cases it examined, whereas attending physicians achieved a perfect score only 35% of the time.

The LLM’s final test, deemed the “most important” of them by Thomas Buckley, a co-author and computer scientist at Harvard University, required diagnosing ER patients at three different points in their care. When a patient enters the ER, they must first explain their symptoms to an intake nurse, then a doctor evaluates them, and finally the physician must decide on an appropriate course of action. Each step is fraught with possible mistakes, as patients often have trouble explaining their symptoms, and the doctors themselves may be juggling several high-stress cases at once. Early triage decisions are particularly challenging because clinicians must act quickly, and mistakes can have immediate consequences. A doctor who mistakes a blood infection for a common cold, for example, could send a patient home without antibiotics—a potentially fatal decision.

The researchers used cases from real patients who went to the ER at Beth Israel and supplied information to o1 in increments that mirrored the three points of care at which patients describe their cases. Unlike the other experiments, this one directly probed how the LLM handled “real-world, messy” data that could be incomplete or biased, Buckley notes. Early in the emergency care process, when a patient checks in to an ER and provides limited information about their ailment, o1 identified an exact or close diagnosis 67% of the time, more than 10 percentage points better than two physicians given the same cases. Although the gap narrowed slightly as more information became available, the LLM still outperformed the doctors by 2 to 10 percentage points later in the care pipeline.

Notably, OpenAI’s o1 was first released in late 2024. “That’s kind of like ancient history now in machine learning time,” Buckley says. For this reason, Eric Strong, an internist at Stanford University who was not involved with the study, considers the age of the model tested by the researchers “irrelevant” because newer models are likely to perform just as well, if not much better.

Experts who study AI’s potential uses in health care are intrigued by the findings. “To see [the model] being tested in a real-world setting … is exciting,” says Daniel McDuff, a computer scientist at Google who did not contribute to the new work. Johri agrees, praising the authors for evaluating o1’s diagnostic and reasoning skills “in a way that no single experiment could.”

Still, the study doesn’t analyze how an LLM would handle more than a few hours’ worth of patient history, as many cases require, Rodman says. ER stays are relatively short, so even the real-world experiment isn’t comparable to the diagnostic process in other settings. “I do not think that the current model would work for a hospitalized patient who has days and days of information,” he warns. “I think the performance would drop off.” In addition, the study provided o1 with only written case information and did not include nontext inputs such as imaging, which are central to many real diagnoses, including those of blood clots and cancers.

Rodman says his team is already conducting new experiments that ask a model to evaluate patients using longer-term and broader real-world information. The next challenge, experts say, is determining whether these systems can improve real patient care outside controlled tests. “We need to understand how these models can play a role as someone’s care evolves over time,” McDuff says.