Study finds top AI models still struggle with clinical reasoning

A new benchmark shows that even the most advanced AI models can often reach the final diagnosis but still falter when clinicians must weigh uncertainty, build a differential diagnosis, and decide what to test next.

Study: Large Language Model Performance and Clinical Reasoning Tasks.

In a recent study published in JAMA Network Open, researchers investigated the clinical reasoning ability of large language models (LLMs).

LLMs have rapidly gained interest in medicine, powering tools that support diagnostic reasoning, propose management plans, and perform other clinical tasks. These systems are now actively marketed for clinical use, but concerns about hallucinations, integrity, and safety remain. Moreover, existing evaluations often depend on multiple-choice questions that do not reflect the complexities of patient care. Whether LLMs can support end-to-end clinical reasoning therefore remains unclear.

LLM Clinical Reasoning Study Design

In the present study, researchers investigated the performance of LLMs on clinical reasoning tasks. They compared 21 LLMs: OpenAI’s GPT-5, GPT-4.5, GPT-o3-Mini, GPT-4o, GPT-o1-Pro, and GPT-o1; Anthropic’s Claude 4.5 Opus, Claude 3.7 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet, and Claude 3.5 Haiku; DeepSeek’s DeepSeek R1 and V3; Google DeepMind’s Gemini 3.0 Pro, Gemini 2.5 Pro, Gemini 1.5 Pro, Gemini 3.0 Flash, Gemini 2.0 Flash, and Gemini 1.5 Flash; and xAI’s Grok 3 and Grok 4.

The team evaluated the LLMs' accuracy in handling 29 standardized clinical vignettes from the January 2025 update of the Merck Sharp & Dohme (MSD) Manual. Each vignette presented a structured case with physical examination findings, history of present illness, laboratory findings, and review of systems. Clinical vignettes were presented to each LLM in a stepwise manner, preserving clinical context, and each vignette was evaluated in triplicate.

Prompts were presented in a question-and-answer format. For LLMs without multimodal capabilities, questions that required image interpretation were excluded from scoring. LLMs were prompted using their defaults, and, where available, the reasoning setting was disabled to evaluate the base models only. Real-time browsing, retrieval, and web search features were turned off for all LLMs.

Performance was evaluated across five clinical reasoning domains: diagnostic testing, differential diagnosis, final diagnosis, management, and miscellaneous clinical reasoning. LLM outputs were scored against the answer keys of the MSD Manual using a deterministic rubric that mapped each LLM output to the multiple-choice options; a response was awarded full credit only when it included every correct option and excluded all incorrect ones.
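The study does not publish its scoring code; a minimal sketch of such an all-or-nothing rubric, assuming the mapped output and answer key are simple option sets (the function name `score_response` is hypothetical), could look like this:

```python
def score_response(selected, answer_key):
    """Deterministic all-or-nothing rubric (a sketch, not the study's code):
    full credit only when every correct option is selected and no
    incorrect option is included; otherwise zero."""
    return 1.0 if set(selected) == set(answer_key) else 0.0


# Illustrative cases: exact match earns credit, any extra or missing
# option forfeits it.
print(score_response(["B", "C"], ["C", "B"]))  # order-insensitive match
print(score_response(["A", "B"], ["B"]))       # extra incorrect option
```

An exact set comparison like this makes the rubric reproducible: two graders running it on the same mapped output always agree.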

In addition, a Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score was developed to capture longitudinal reasoning in an interpretable metric. Performance was visualized as a radar plot, and vertices represented accuracy across domains. The PrIME-LLM score was computed as the area of an LLM’s polygon divided by that of a reference polygon, which corresponded to a model scoring 100% across all domains.
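The article describes the PrIME-LLM score geometrically but gives no formula. Assuming a standard radar plot with the five domain axes at equal angles, the area ratio can be sketched as follows (the function name `prime_llm_score` is hypothetical):

```python
import math

def prime_llm_score(accuracies):
    """Sketch of a radar-plot area ratio: the polygon spanned by the
    per-domain accuracies, divided by the reference polygon of a model
    scoring 100% in every domain. Assumes axes at equal angles."""
    k = len(accuracies)
    wedge = math.sin(2 * math.pi / k)  # cancels in the ratio; kept for clarity
    # Each adjacent pair of axes contributes a triangle of area
    # 0.5 * a_i * a_{i+1} * sin(angle between axes).
    area = 0.5 * sum(accuracies[i] * accuracies[(i + 1) % k] * wedge
                     for i in range(k))
    ref_area = 0.5 * k * wedge  # all accuracies equal to 1.0
    return area / ref_area
```

Because each domain accuracy multiplies into two adjacent triangles, one weak domain shrinks the polygon faster than it lowers the mean; for instance, a model at 0.82 in every domain outscores a model averaging 0.82 that dips to 0.5 in one domain. This is consistent with the article's claim that the metric separates models more sharply than raw accuracy.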

PrIME-LLM Results Across Clinical Tasks

LLMs generally scored highest in the final diagnosis domain, performed relatively well in management, and consistently showed deficits in diagnostic testing and differential diagnosis. The PrIME-LLM scores significantly differed across LLMs; the top-performing cluster included Claude 4.5 Opus, Grok 4, Gemini 3.0 Flash, GPT-5, Gemini 3.0 Pro, and GPT-4.5, with Grok 4 achieving the highest mean PrIME-LLM score. Notably, newer releases within LLM families generally performed better.

The average overall accuracy ranged between 0.81 and 0.90, whereas the average PrIME-LLM scores showed a wider separation, differentiating high-performance models from low-performance models. Notably, there was a significant performance difference between reasoning-optimized models, such as Grok 4, GPT-5, and Claude 4.5 Opus, and non-reasoning models. The probability that a random score from a reasoning-optimized model would exceed that from a non-reasoning model was 0.99.
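The 0.99 figure is a probability-of-superiority statistic (the common-language effect size). A minimal sketch of how such a value is computed, using made-up scores rather than the study's data:

```python
from itertools import product

def prob_superiority(scores_a, scores_b):
    """Common-language effect size: P(a randomly drawn score from A
    exceeds a randomly drawn score from B), counting ties as half.
    Compares every pair across the two groups."""
    pairs = list(product(scores_a, scores_b))
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)


# Illustrative (not the study's data): group A dominates group B.
print(prob_superiority([0.9, 0.8, 0.85], [0.6, 0.7]))  # prints 1.0
```

A value of 0.99, as reported, means that in nearly every cross-group pairing the reasoning-optimized model's score came out on top.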

For virtually all LLMs, final diagnosis items had significantly higher accuracy than diagnostic testing items. Moreover, diagnostic testing items showed consistently higher accuracy than differential diagnosis items, whereas management and miscellaneous item types had intermediate accuracy. Eighteen multimodal LLMs capable of image interpretation were evaluated across vignettes containing electrocardiograms, computed tomography scans, and chest radiographs.

While the accuracy of multimodal LLMs was consistent across non-image questions, performance on image-based questions varied by LLM. GPT-4.5, GPT-o3-Mini, and Claude 3 Opus demonstrated higher accuracy on image-based items than text-only items, and significant gains were also reported for Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.0 Flash, and Grok 4. Further, the model failure rate, i.e., the proportion of questions that were not fully correctly answered, was lowest for final diagnosis and highest for differential diagnosis. Other domains had intermediate failure rates.

LLM Differential Diagnosis and Uncertainty Gaps

In sum, frontier LLMs achieved high accuracy in final diagnoses but, relative to other reasoning stages, performed poorly at generating differential diagnoses and navigating uncertainty. The PrIME-LLM scores provided greater separation than raw accuracy, the traditional summary metric, highlighting critical gaps obscured by conventional benchmarks.

Overall, the PrIME-LLM framework provides an independent, extensible, and reproducible benchmark for tracking progress and guiding safe integration into healthcare practice. However, the findings also suggest that off-the-shelf LLMs are not yet ready for unsupervised patient-facing clinical decision-making.


Written by

Tarun Sai Lomte

Tarun is a writer based in Hyderabad, India. He has a Master’s degree in Biotechnology from the University of Hyderabad and is enthusiastic about scientific research. He enjoys reading research papers and literature reviews and is passionate about writing.


