A new benchmark shows that even the most advanced AI models can often reach the final diagnosis but still falter when clinicians must weigh uncertainty, build a differential diagnosis, and decide what to test next.

Study: Large Language Model Performance and Clinical Reasoning Tasks.
In a recent study published in JAMA Network Open, researchers investigated the clinical reasoning ability of large language models (LLMs).
LLMs have rapidly gained interest in medicine, powering tools that support diagnostic reasoning and propose management plans, among other applications. These systems are now actively marketed for clinical use, but concerns about hallucinations, integrity, and safety remain. Moreover, existing evaluations often depend on multiple-choice questions that do not reflect the complexities of patient care, so whether LLMs can support end-to-end clinical reasoning remains unclear.
LLM Clinical Reasoning Study Design
In the present study, researchers investigated the performance of LLMs on clinical reasoning tasks. They compared 21 LLMs: OpenAI’s GPT-5, GPT-4.5, GPT-4o, o3-mini, o1-pro, and o1; Anthropic’s Claude 4.5 Opus, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku, and Claude 3 Opus; DeepSeek’s DeepSeek-R1 and DeepSeek-V3; Google DeepMind’s Gemini 3.0 Pro, Gemini 3.0 Flash, Gemini 2.5 Pro, Gemini 2.0 Flash, Gemini 1.5 Pro, and Gemini 1.5 Flash; and xAI’s Grok 3 and Grok 4.
The team evaluated the LLMs' accuracy in handling 29 standardized clinical vignettes from the January 2025 update of the Merck Sharp & Dohme (MSD) Manual. Each vignette presented a structured case with physical examination findings, history of present illness, laboratory findings, and review of systems. Clinical vignettes were presented to each LLM in a stepwise manner, preserving clinical context, and each vignette was evaluated in triplicate.
Prompts were presented in a question-and-answer format. For LLMs without multimodal capabilities, questions requiring image interpretation were excluded from scoring. Each LLM was prompted with its default settings and, where available, with the reasoning setting disabled so that only the base models were evaluated. Real-time browsing, retrieval, and web search features were turned off for all LLMs.
Performance was evaluated across five clinical reasoning domains: diagnostic testing, differential diagnosis, final diagnosis, management, and miscellaneous clinical reasoning. LLM outputs were scored against the MSD Manual answer keys using a deterministic rubric that mapped each output to the multiple-choice options; a response received full credit only if it included every correct option and excluded every incorrect one.
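The all-or-nothing rubric described above can be sketched in a few lines. This is an illustrative reconstruction, not the study's actual code; the function name and option labels are hypothetical.

```python
def score_response(selected: set[str], correct: set[str]) -> int:
    """Deterministic rubric (illustrative): full credit only if every
    correct option is selected and no incorrect option is included."""
    return 1 if selected == correct else 0

# Hypothetical vignette with correct options A and C:
assert score_response({"A", "C"}, {"A", "C"}) == 1       # exact match -> credit
assert score_response({"A"}, {"A", "C"}) == 0            # missing a correct option
assert score_response({"A", "B", "C"}, {"A", "C"}) == 0  # includes an incorrect option
```

Because credit is binary and the mapping is deterministic, the same LLM output always receives the same score, which supports the triplicate evaluation design.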
In addition, a Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score was developed to capture longitudinal reasoning in a single interpretable metric. Performance was visualized as a radar plot whose vertices represented accuracy in each domain, and the PrIME-LLM score was computed as the area of an LLM’s polygon divided by that of a reference polygon corresponding to a model scoring 100% across all domains.
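The area-ratio construction can be sketched as follows. This is a minimal reconstruction under the assumption that domain axes are equally spaced around the radar plot; the study's exact formula and domain ordering are not given here, and the function name is illustrative.

```python
import math

def prime_llm(acc: list[float]) -> float:
    """PrIME-LLM sketch: area of the model's radar polygon divided by
    the area of the reference polygon (accuracy 1.0 in every domain),
    with domain axes equally spaced around the circle."""
    n = len(acc)
    wedge = math.sin(2 * math.pi / n) / 2  # triangle-area factor per adjacent axis pair
    area = sum(wedge * acc[i] * acc[(i + 1) % n] for i in range(n))
    ref_area = wedge * n                   # reference polygon: all accuracies = 1.0
    return area / ref_area

# Five domains: diagnostic testing, differential diagnosis, final
# diagnosis, management, miscellaneous (illustrative accuracies)
score = prime_llm([0.7, 0.6, 0.95, 0.85, 0.8])  # mathematically 0.6075 here
```

Note that a radar-polygon area depends on the products of accuracies on adjacent axes, so uneven domain performance is penalized more than uniform performance at the same mean, which is one plausible reason the metric separates models more sharply than raw accuracy.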
PrIME-LLM Results Across Clinical Tasks
LLMs generally scored highest in the final diagnosis domain, performed relatively well in management, and showed consistent deficits in diagnostic testing and differential diagnosis. PrIME-LLM scores differed significantly across LLMs; the top-performing cluster included Claude 4.5 Opus, Grok 4, Gemini 3.0 Flash, GPT-5, Gemini 3.0 Pro, and GPT-4.5, with Grok 4 achieving the highest mean PrIME-LLM score. Notably, newer releases within LLM families generally performed better.
Average overall accuracy ranged from 0.81 to 0.90, whereas average PrIME-LLM scores showed wider separation, distinguishing high-performing from low-performing models. There was also a significant performance difference between reasoning-optimized models (e.g., Grok 4, GPT-5, and Claude 4.5 Opus) and non-reasoning models: the probability that a randomly drawn score from a reasoning-optimized model would exceed one from a non-reasoning model was 0.99.
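The reported 0.99 is a probability-of-superiority statistic (the common-language effect size underlying the Mann-Whitney U test). A minimal sketch of how such a value is computed from two score samples, using illustrative numbers rather than the study's data:

```python
from itertools import product

def prob_superiority(xs: list[float], ys: list[float]) -> float:
    """P(X > Y) for a randomly drawn pair (x, y), counting ties as half;
    the common-language effect size behind the Mann-Whitney U test."""
    pairs = list(product(xs, ys))
    wins = sum(x > y for x, y in pairs)
    ties = sum(x == y for x, y in pairs)
    return (wins + 0.5 * ties) / len(pairs)

# Illustrative PrIME-LLM-like scores (not the study's data):
reasoning = [0.88, 0.91, 0.86, 0.90]
non_reasoning = [0.74, 0.79, 0.70, 0.81]
print(prob_superiority(reasoning, non_reasoning))  # → 1.0 in this toy sample
```

A value of 0.99, as reported, means that in 99% of random cross-group pairings the reasoning-optimized model's score was the higher one.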
For virtually all LLMs, final diagnosis items had significantly higher accuracy than diagnostic testing items. Moreover, diagnostic testing items showed consistently higher accuracy than differential diagnosis items, whereas management and miscellaneous item types had intermediate accuracy. Eighteen multimodal LLMs capable of image interpretation were evaluated across vignettes containing electrocardiograms, computed tomography scans, and chest radiographs.
While the accuracy of multimodal LLMs was consistent across non-image questions, performance on image-based questions varied by LLM. GPT-4.5, o3-mini, and Claude 3 Opus demonstrated higher accuracy on image-based items than on text-only items, and significant gains were also reported for Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.0 Flash, and Grok 4. Further, the model failure rate, i.e., the proportion of questions not answered fully correctly, was lowest for final diagnosis and highest for differential diagnosis; other domains had intermediate failure rates.
LLM Differential Diagnosis and Uncertainty Gaps
In sum, frontier LLMs achieved high accuracy in final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages. The PrIME-LLM scores provided greater separation than raw accuracy, the traditional summary metric, highlighting critical gaps obscured by conventional benchmarks.
Overall, the PrIME-LLM framework provides an independent, extensible, and reproducible benchmark for tracking progress and guiding safe integration into healthcare practice. However, the findings also suggest that off-the-shelf LLMs are not yet ready for unsupervised patient-facing clinical decision-making.