AgentClinic puts medical AI through a more realistic diagnostic test

A new benchmark shows that passing medical exams is not enough; clinical AI agents must gather information, handle uncertainty, use tools, interpret images, and navigate bias in simulated patient encounters.

Running language agents in AgentClinic. (Left) Workflow diagram of agents in AgentClinic. The doctor agent interacts with tools and agents in order to arrive at a diagnosis. Moderator agent compares conclusion to ground truth diagnosis at the end of the simulation. (Right) Example dialogue between agents in the AgentClinic benchmark.


A recent study published in the journal npj Digital Medicine introduced a multi-modal agent benchmark, AgentClinic, for clinical artificial intelligence (AI) agents in simulated clinical environments.

Building interactive systems capable of solving a wide range of problems is one of the main goals of AI. Many recent large language models (LLMs) have solved difficult problems, some challenging even for humans, and have surpassed the mean human score on medical licensing examinations. However, several limitations prevent their application in real-world clinical settings.

Clinical work is multiplexed, involving sequential decision making that requires handling uncertainty with finite resources and limited information. This capability is not reflected in current evaluations, in which all necessary data are presented in case vignettes and LLMs are tasked with either answering or selecting the most plausible option.

The authors noted that strong performance on static medical question-answering tasks was only weakly predictive of performance in the interactive AgentClinic setting. In some cases, diagnostic accuracy dropped sharply when static cases were converted into AgentClinic’s sequential format.

AgentClinic Study Design and Benchmark Structure

In the present study, researchers presented AgentClinic, a multi-modal agent benchmark for LLM evaluation in simulated clinical settings. The benchmark encompassed four language agents: a measurement agent, a doctor agent, a patient agent, and a moderator. Each agent has specific instructions and is provided with unique information unavailable to other agents. The doctor agent is the model whose performance is assessed by other agents.
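The four-agent setup described above can be sketched as a data structure in which each agent holds information the others cannot see. This is a minimal illustrative sketch, not the paper's actual implementation; all class and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str                 # "doctor", "patient", "measurement", or "moderator"
    instructions: str         # role-specific system prompt
    private_info: dict = field(default_factory=dict)  # hidden from other agents

def build_agents(case: dict) -> dict:
    """Partition one case file into role-specific views, mirroring the
    information split described in the study."""
    return {
        "doctor": Agent("doctor", "Diagnose the patient via dialogue.",
                        {"objective": case["objective"]}),
        "patient": Agent("patient", "Answer as the patient; do not reveal the diagnosis.",
                         {"symptoms": case["symptoms"], "history": case["history"]}),
        "measurement": Agent("measurement", "Return exam and test results when asked.",
                             {"exam_results": case["exam_results"]}),
        "moderator": Agent("moderator", "Compare the final answer to ground truth.",
                           {"diagnosis": case["diagnosis"]}),
    }

# Toy case for illustration only.
case = {
    "objective": "Determine the most likely diagnosis.",
    "symptoms": "fever, productive cough",
    "history": "smoker, 20 pack-years",
    "exam_results": "right lower lobe crackles",
    "diagnosis": "community-acquired pneumonia",
}
agents = build_agents(case)
# The doctor's view contains no ground-truth diagnosis:
print("diagnosis" in agents["doctor"].private_info)  # False
```

The key design point is the information asymmetry: only the moderator ever sees the correct diagnosis, so the doctor agent must earn its answer through dialogue.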

Questions from the MedQA dataset based on United States Medical Licensing Exam-style cases, New England Journal of Medicine (NEJM) Case Challenges, and de-identified MIMIC-IV electronic health records were used to build agents grounded in medically relevant scenarios. The questions focused on diagnosis from presenting symptoms and were used to build prompt templates. For AgentClinic-MIMIC-IV and AgentClinic-MedQA, questions were selected from the MIMIC-IV and MedQA datasets, respectively.

A structured input file containing case information was generated using GPT-4, and the case scenarios were manually validated. In general, the doctor agent was provided an objective; the patient agent received the patient's symptoms and history; the measurement agent received the physical examination results; and the moderator received the correct diagnosis. The accuracy of 11 LLMs was evaluated on AgentClinic-MedQA, with each acting as the doctor agent to diagnose the patient agent (GPT-4) through dialogue.

The doctor agent was permitted twenty interactions with the patient and measurement agents before making a diagnosis. In addition, the performance of three human physicians was assessed under the same constraints and instructions, although this small clinician sample should be interpreted cautiously. Claude 3.5 Sonnet demonstrated the highest accuracy of 62.1%, followed by OpenBioLLM-70B (58.3%) and physicians (54%).
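The turn-limited encounter can be pictured as a simple loop: the doctor agent spends a fixed budget of interactions gathering information, then commits to a diagnosis that the moderator grades. The stub functions below stand in for real LLM calls, and the exact-match grading is a simplification — the paper uses an LLM moderator, which can match paraphrased diagnoses.

```python
MAX_INTERACTIONS = 20  # the budget reported in the study

def doctor_stub(transcript):
    # Toy policy: ask two questions, then commit to a diagnosis.
    if len(transcript) < 2:
        return "QUESTION: describe your symptoms"
    return "DIAGNOSIS: community-acquired pneumonia"

def patient_stub(question):
    return "I have a fever and a productive cough."

def run_encounter(doctor, patient, ground_truth):
    """Run one simulated encounter under the interaction budget."""
    transcript = []
    for _ in range(MAX_INTERACTIONS):
        move = doctor(transcript)
        if move.startswith("DIAGNOSIS:"):
            answer = move.split(":", 1)[1].strip()
            # Moderator step (exact match here; the benchmark's LLM
            # moderator handles fuzzier matching).
            return answer, answer.lower() == ground_truth.lower()
        transcript.append((move, patient(move)))
    return None, False  # budget exhausted without a diagnosis

answer, correct = run_encounter(doctor_stub, patient_stub,
                                "community-acquired pneumonia")
print(correct)  # True
```

This framing makes clear why the interaction budget matters: too few turns starve the doctor of information, while the study found that too many turns can also hurt accuracy.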

AgentClinic Performance Across Models, Tools, and Modalities

Moreover, the accuracy on AgentClinic-MIMIC-IV was highest for Claude 3.5 Sonnet (42.9%), followed by GPT-4 (34%) and GPT-3.5 (27.5%). Reducing the number of interactions to 10 significantly decreased the accuracy to 25%, while increasing it to 30 interactions also decreased accuracy. The doctor agent’s accuracy varied by patient agent; GPT-4 patient agents achieved higher accuracy than Mixtral-8x7B or GPT-3.5 patient agents.

Accuracy of various doctor language models and human physicians on AgentClinic-MedQA using GPT-4 patient and measurement agents (left). Accuracy of GPT-4 on AgentClinic-MedQA based on patient language model (middle). Accuracy on AgentClinic-MIMIC-IV by number of patient interactions, using GPT-4 patient and measurement agents (right).


Next, the researchers assessed the impact of six agent tools on diagnostic accuracy: Reflection Chain-of-Thought (CoT), Notebook, Zero-Shot CoT, Adaptive Retrieval Augmented Generation using textbook sources, Adaptive Retrieval Augmented Generation using web sources, and One-Shot CoT. Claude 3.5 Sonnet demonstrated the best performance with mean and peak accuracies of 51.3% and 56.1%, respectively, with the Notebook tool. GPT-4o and GPT-4 gained moderate improvements across most tools, but tool use was not uniformly beneficial across all models.
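One way to picture a tool such as the Notebook is as persistent state the doctor agent writes to and re-reads across turns, so findings survive beyond a single prompt. This is an assumption about the mechanism for illustration, not the paper's code; the class and function names are hypothetical.

```python
class Notebook:
    """Persistent scratchpad the doctor agent can carry across turns."""

    def __init__(self):
        self._lines = []

    def write(self, note: str):
        self._lines.append(note)

    def render(self) -> str:
        return "\n".join(f"- {line}" for line in self._lines)

def doctor_prompt(objective: str, notebook: Notebook) -> str:
    """Prepend persisted notes to the doctor's per-turn prompt."""
    return f"{objective}\n\nYour notes so far:\n{notebook.render()}"

nb = Notebook()
nb.write("Fever and productive cough reported.")
nb.write("Crackles in right lower lobe on exam.")
print(doctor_prompt("Diagnose the patient.", nb))
```

Tools like retrieval augmentation would slot into the same place in the prompt, which helps explain why tool benefits varied by model: each model differs in how well it exploits extra context.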

Further, implicit biases (unconscious associations influenced by cultural and societal norms, e.g., gender bias) and cognitive biases (systematic patterns of deviation from rationality or norms in judgment, e.g., recency bias) were included in prompts to assess their effects on diagnostic accuracy. For GPT-4, accuracy decreased to 48% and 50.3% for patient and doctor cognitive biases and to 51.3% and 50.5% for patient and doctor implicit biases, respectively. The benchmark also assessed simulated patient confidence, treatment compliance, and willingness to consult the same doctor again, but these ratings came from LLM-simulated patients rather than real patients.
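Mechanically, a bias condition can be introduced by prepending a biasing instruction to the doctor or patient system prompt and measuring the resulting change in diagnostic accuracy. The wording below is illustrative only; the study's actual bias prompts may differ.

```python
# Hypothetical bias prompts keyed by condition name.
BIASES = {
    "recency": ("You recently saw several patients with influenza and "
                "strongly suspect new patients have the same condition."),
    "none": "",
}

def apply_bias(base_prompt: str, bias: str) -> str:
    """Prepend a bias instruction (if any) to an agent's system prompt."""
    text = BIASES.get(bias, "")
    return f"{text}\n\n{base_prompt}".strip()

print(apply_bias("You are a doctor. Diagnose via dialogue.", "recency"))
```

Because the bias is injected as text rather than baked into the case, the same case can be run with and without it, isolating the bias effect on accuracy.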

Next, the team examined specialist cases using case report questions spanning nine medical specialties from the MedMCQA dataset. Consistently, Claude 3.5 Sonnet was the best-performing model, with a mean diagnostic accuracy of 66.7%, demonstrating strong performance in internal medicine, otolaryngology, and gynecology. Performance varied by specialty, suggesting that dialogue-based diagnosis may differ from static multiple-choice medical testing. Next, the team evaluated four multi-modal LLMs in a diagnostic setting that additionally required understanding image readings.

The researchers also evaluated multilingual cases across seven languages: English, Chinese, French, Spanish, Hindi, Persian, and Korean. Most models performed best in English and showed substantial variability across other languages, while Claude 3.5 Sonnet maintained the strongest overall multilingual performance.

To this end, 120 questions from the NEJM Case Challenges were used. When the image was initially provided to the doctor agent, Claude 3.5 Sonnet had a diagnostic accuracy of 37.2%, followed by GPT-4 (27.7%), GPT-4o (21.4%), and GPT-4o-mini (8%). When images were provided upon request by the agent, the accuracies were 35.4%, 25.4%, 19.1%, and 6.1% for Claude 3.5 Sonnet, GPT-4, GPT-4o, and GPT-4o-mini, respectively.
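The two multimodal conditions differ only in when the image enters the doctor's context: upfront, or fetched on demand from an image reader. A minimal sketch, assuming a dictionary-based context and stub image bytes (all names are illustrative):

```python
def run_case(doctor_turns, image, image_upfront: bool):
    """Simulate the two image-delivery conditions from the study."""
    context = {"text": "Case vignette ...",
               "image": image if image_upfront else None}
    for turn in doctor_turns:
        # In the on-request condition, the image reader supplies the
        # image only when the doctor explicitly asks for it.
        if turn == "REQUEST_IMAGE" and context["image"] is None:
            context["image"] = image
    return context

img = b"\x89PNG..."  # placeholder bytes, not a real image
upfront = run_case(["DIAGNOSE"], img, image_upfront=True)
on_request = run_case(["REQUEST_IMAGE", "DIAGNOSE"], img, image_upfront=False)
print(upfront["image"] is not None, on_request["image"] is not None)  # True True
```

The slightly lower accuracies in the on-request condition suggest that models do not always ask for the image, or ask for it too late to use it well.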

Accuracy of Claude 3.5 Sonnet, GPT-4, GPT-4o, and GPT-4o-mini on AgentClinic-NEJM with multimodal text and image input. (Pink) Accuracy when the images are presented as initial input. (Blue) Accuracy when images must be requested from the image reader.


AgentClinic Implications for Clinical AI Evaluation

Taken together, the findings indicate that LLMs need to be evaluated with novel strategies beyond static question-answer benchmarks. AgentClinic, which provides a simplified clinical environment with agents representing a moderator, a patient, a doctor, and measurements, is a step towards dialogue-driven, more interactive benchmarks that assess the sequential decision-making ability of LLMs across distinct, multi-modal, and challenging settings. However, the authors cautioned that AgentClinic remains a simplified simulation of clinical care, relying on LLM-based patient, measurement, and moderator agents. They also noted potential data leakage risks for proprietary models and emphasized that the human-comparison data came from only three clinicians.

These findings should therefore be interpreted as benchmark performance, not evidence that any model is ready for autonomous clinical diagnosis.


Written by

Tarun Sai Lomte

Tarun is a writer based in Hyderabad, India. He has a Master’s degree in Biotechnology from the University of Hyderabad and is enthusiastic about scientific research. He enjoys reading research papers and literature reviews and is passionate about writing.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Sai Lomte, Tarun. (2026, April 30). AgentClinic puts medical AI through a more realistic diagnostic test. News-Medical. Retrieved on April 30, 2026 from https://www.news-medical.net/news/20260430/AgentClinic-puts-medical-AI-through-a-more-realistic-diagnostic-test.aspx.

