Generative AI falls short in diagnostic reasoning despite accuracy

Despite increasing use of artificial intelligence (AI) in health care, a new study led by Mass General Brigham researchers from the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.

By asking 21 different large language models (LLMs) to play doctor in a series of clinical scenarios, the researchers showed that LLMs often fail at navigating diagnostic workups and coming up with a testable list of potential or "differential" diagnoses. Though all tested LLMs arrived at a correct final diagnosis more than 90% of the time when provided with all pertinent information in a patient case, they consistently performed poorly at the earlier, reasoning-driven steps of the diagnostic process, according to the results published in JAMA Network Open.

"Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment. Differential diagnoses are central to clinical reasoning and underlie the 'art of medicine' that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available, which is not always the case."

Marc Succi, MD, corresponding author, executive director of the MESH Incubator at Mass General Brigham

This new research is a follow-up to previous work led by Succi's MESH group in which researchers evaluated ChatGPT 3.5's ability to accurately diagnose a series of clinical vignettes.

In the new study, the researchers developed a novel and more holistic measure of LLM performance that looks beyond accuracy, called PrIME-LLM, which evaluates a model's competency across different stages of clinical reasoning: coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment. When models perform well in one area but poorly in another, this imbalance is reflected in the PrIME-LLM score, as opposed to averaging competency across tasks, which may mask areas of weakness, according to the researchers.
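To see why averaging can mask weakness, consider how different aggregations treat a model that excels at final diagnosis but stumbles at the differential stage. The stage names and the geometric-mean aggregation below are illustrative assumptions only; the actual PrIME-LLM formula is defined in the published paper.

```python
# Hypothetical sketch: why an imbalance-sensitive composite differs from a
# plain average. The stage names and geometric-mean aggregation are
# assumptions for illustration, NOT the actual PrIME-LLM formula.
from math import prod

def arithmetic_mean(scores):
    # A plain average: one weak stage can hide behind strong ones
    return sum(scores) / len(scores)

def geometric_mean(scores):
    # Penalizes imbalance: a single weak stage drags the composite down
    return prod(scores) ** (1 / len(scores))

# Hypothetical stage scores (0-1) for a model that names the final
# diagnosis reliably but struggles at the open-ended differential step
stages = {
    "differential_diagnosis": 0.20,
    "diagnostic_testing": 0.85,
    "final_diagnosis": 0.95,
    "treatment_management": 0.90,
}
scores = list(stages.values())
print(f"arithmetic mean: {arithmetic_mean(scores):.2f}")  # masks the weak stage
print(f"geometric mean:  {geometric_mean(scores):.2f}")   # exposes it
```

Running this, the arithmetic mean (0.72) looks respectable, while the imbalance-sensitive aggregation (0.62) makes the weak differential stage visible, which is the kind of behavior the researchers describe for their score.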

The study compared 21 general-purpose LLMs, including the latest models of ChatGPT, DeepSeek, Claude, Gemini, and Grok at the time of submission. The researchers tested the models' ability to work through 29 published clinical cases. To simulate the way that clinical cases unfold, the researchers gradually fed the models information, beginning with basics like a patient's age, gender and symptoms before adding physical examination findings and laboratory results. The LLMs' performance at each stage was assessed by medical student evaluators, and these evaluations were used to calculate the models' overall PrIME-LLM scores.

In line with their previous study, the researchers found that the LLMs were good at producing accurate final diagnoses. However, all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. In real-world practice, a differential diagnosis is critical; in this study, however, the models were given the additional case information regardless, so that they could proceed to the next stage of the clinical workup even if they failed at the differential diagnosis step.

"By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor," said Arya Rao, lead author, MESH researcher, and MD-PhD student at Harvard Medical School. "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information."

Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text. More recently released models generally outperformed older models, showing that LLMs are improving incrementally. The models' PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.

According to Succi, PrIME-LLM represents a standardized way to evaluate AI's clinical competency that could be used by AI developers and hospital leaders to benchmark new technologies as they are released.

"We want to help separate the hype from the reality of these tools as they apply to health care," he said. "Our results reinforce that large language models in healthcare continue to require a 'human in the loop' and very close oversight."

Journal reference:

Rao, A. S., et al. (2026). Large Language Model Performance and Clinical Reasoning Tasks. JAMA Network Open. DOI: 10.1001/jamanetworkopen.2026.4003. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679

