A new benchmark shows that passing medical exams is not enough; clinical AI agents must gather information, handle uncertainty, use tools, interpret images, and navigate bias in simulated patient encounters.

Running language agents in AgentClinic. (Left) Workflow diagram of agents in AgentClinic: the doctor agent interacts with tools and other agents to arrive at a diagnosis, and the moderator agent compares that conclusion to the ground-truth diagnosis at the end of the simulation. (Right) Example dialogue between agents in the AgentClinic benchmark.
A recent study published in the journal npj Digital Medicine introduced a multi-modal agent benchmark, AgentClinic, for clinical artificial intelligence (AI) agents in simulated clinical environments.
Building interactive systems capable of solving a wide range of problems is one of the main goals of AI. Recent large language models (LLMs) have solved difficult problems, some challenging even for humans, and have surpassed the mean human score on medical licensing examinations. However, several limitations prevent their application in real-world clinical settings.
Clinical work is multiplexed: it involves sequential decision-making under uncertainty, with finite resources and limited information. Current evaluations do not capture this capability, since all necessary data are presented up front in case vignettes and LLMs are asked only to answer or to select the most plausible option.
The authors noted that strong performance on static medical question-answering tasks was only weakly predictive of performance in the interactive AgentClinic setting. In some cases, diagnostic accuracy dropped sharply when static cases were converted into AgentClinic’s sequential format.
AgentClinic Study Design and Benchmark Structure
In the present study, researchers presented AgentClinic, a multi-modal agent benchmark for LLM evaluation in simulated clinical settings. The benchmark comprises four language agents: a doctor agent, a patient agent, a measurement agent, and a moderator. Each agent has specific instructions and unique information unavailable to the others. The doctor agent is the model under evaluation; it must reach a diagnosis through interaction with the other agents.
Agents were grounded in medically relevant scenarios built from questions in the MedQA dataset (United States Medical Licensing Exam-style cases), New England Journal of Medicine (NEJM) Case Challenges, and de-identified MIMIC-IV electronic health records. The questions concerned symptom-based diagnosis and were used to build prompt templates. For AgentClinic-MedQA and AgentClinic-MIMIC-IV, questions were drawn from the MedQA and MIMIC-IV datasets, respectively.
A structured input file containing case information was generated using GPT-4, and the case scenarios were manually validated. In general, the doctor agent was provided an objective; the patient agent received the patient's symptoms and history; the measurement agent received the physical examination results; and the moderator received the correct diagnosis. The accuracy of 11 LLMs was evaluated on AgentClinic-MedQA, with each acting as the doctor agent to diagnose the patient agent (GPT-4) through dialogue.
The doctor agent was permitted 20 interactions with the patient and measurement agents before making a diagnosis. In addition, the performance of three human physicians was assessed under the same constraints and instructions, although this small clinician sample should be interpreted cautiously. Claude 3.5 Sonnet demonstrated the highest accuracy at 62.1%, followed by OpenBioLLM-70B (58.3%) and the physicians (54%).
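The benchmark's structure can be summarized as a turn-limited dialogue loop: the doctor agent alternates between questioning the patient and ordering measurements, then commits to a diagnosis that the moderator grades against the hidden ground truth. The following is a minimal illustrative sketch of that loop; all class and function names, and the toy rule-based agents standing in for the LLMs, are this article's assumptions, not the paper's actual code.

```python
# Minimal sketch of an AgentClinic-style evaluation loop. Names are
# illustrative stand-ins; the real benchmark uses LLM-backed agents.

def run_encounter(doctor, patient, measurement, moderator, max_turns=20):
    """The doctor agent interleaves patient questions and test orders,
    then commits to a diagnosis that the moderator grades."""
    transcript = []
    for _ in range(max_turns):
        action, text = doctor.act(transcript)      # e.g. ("ask_patient", "...")
        if action == "diagnose":
            return moderator.grade(text)           # compare to hidden ground truth
        if action == "ask_patient":
            transcript.append(("patient", patient.reply(text)))
        elif action == "order_test":
            transcript.append(("measurement", measurement.report(text)))
    # Turn budget exhausted: force a final answer.
    return moderator.grade(doctor.act(transcript, force_diagnosis=True)[1])

# Toy stand-ins for the four agents, each holding information the
# others cannot see (mirroring the benchmark's role separation).
class Patient:
    def reply(self, q): return "I have a fever and a stiff neck."

class Measurement:
    def report(self, test): return "Lumbar puncture: elevated WBC."

class Moderator:
    def __init__(self, truth): self.truth = truth
    def grade(self, dx): return dx.lower() == self.truth.lower()

class Doctor:
    def act(self, transcript, force_diagnosis=False):
        if force_diagnosis or len(transcript) >= 2:
            return ("diagnose", "meningitis")
        if not transcript:
            return ("ask_patient", "What brings you in today?")
        return ("order_test", "lumbar puncture")
```

A run such as `run_encounter(Doctor(), Patient(), Measurement(), Moderator("Meningitis"))` returns whether the final diagnosis matched; the `max_turns` parameter corresponds to the 20-interaction budget varied in the study.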
AgentClinic Performance Across Models, Tools, and Modalities
On AgentClinic-MIMIC-IV, accuracy was highest for Claude 3.5 Sonnet (42.9%), followed by GPT-4 (34%) and GPT-3.5 (27.5%). Reducing the interaction budget to 10 significantly decreased accuracy, to 25%, while increasing it to 30 interactions also decreased accuracy. The doctor agent's accuracy also varied with the choice of patient agent: with a GPT-4 patient agent, the doctor achieved higher accuracy than with Mixtral-8x7B or GPT-3.5 patient agents.

Accuracy of various doctor language models and human physicians on AgentClinic-MedQA using GPT-4 patient and measurement agents (left). Accuracy of GPT-4 on AgentClinic-MedQA by patient language model (middle). Accuracy on AgentClinic-MIMIC-IV by number of interactions, using GPT-4 patient and measurement agents (right).
Next, the researchers assessed the impact of six agent tools on diagnostic accuracy: Reflection Chain-of-Thought (CoT), Notebook, Zero-Shot CoT, One-Shot CoT, and Adaptive Retrieval Augmented Generation using either textbook or web sources. Claude 3.5 Sonnet performed best, with a mean accuracy of 51.3% and a peak of 56.1% with the Notebook tool. GPT-4o and GPT-4 gained moderate improvements from most tools, but tool use was not uniformly beneficial across models.
Further, implicit biases (unconscious associations influenced by cultural and societal norms, e.g., gender bias) and cognitive biases (systematic patterns of deviation from rationality or norms in judgment, e.g., recency bias) were included in prompts to assess their effects on diagnostic accuracy. For GPT-4, accuracy decreased to 48% and 50.3% for patient and doctor cognitive biases and to 51.3% and 50.5% for patient and doctor implicit biases, respectively. The benchmark also assessed simulated patient confidence, treatment compliance, and willingness to consult the same doctor again, but these ratings came from LLM-simulated patients rather than real patients.
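The study's exact bias prompts are not reproduced in this article; as a rough illustration, bias injection can be as simple as prepending a biasing instruction to an agent's system prompt before the encounter begins. The bias texts and function below are hypothetical paraphrases, not the paper's prompts.

```python
# Hypothetical bias injection: prepend a cognitive or implicit bias
# instruction to an agent's system prompt. Texts are illustrative only.

BIASES = {
    "recency": ("You recently saw several patients with influenza and "
                "suspect this case is the same."),
    "gender": ("You assume patients of a certain gender tend to "
               "overstate their symptoms."),
}

def apply_bias(system_prompt, bias_name):
    """Return the agent prompt with the named bias instruction prepended."""
    return BIASES[bias_name] + "\n\n" + system_prompt

biased = apply_bias("You are a doctor agent. Diagnose the patient.", "recency")
```

Applying the same template to either the doctor or the patient prompt mirrors how the study measured accuracy separately under doctor-side and patient-side biases.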
Next, the team examined specialist cases using case report questions spanning nine medical specialties from the MedMCQA dataset. Claude 3.5 Sonnet was again the best-performing model, with a mean diagnostic accuracy of 66.7% and particularly strong performance in internal medicine, otolaryngology, and gynecology. Performance varied by specialty, suggesting that dialogue-based diagnosis may differ from static multiple-choice medical testing.
The researchers also evaluated multilingual cases across seven languages: English, Chinese, French, Spanish, Hindi, Persian, and Korean. Most models performed best in English and showed substantial variability across other languages, while Claude 3.5 Sonnet maintained the strongest overall multilingual performance.
The team then evaluated four multi-modal LLMs in a diagnostic setting that additionally required interpreting medical images. To this end, 120 questions from the NEJM Case Challenges were used. When the image was provided to the doctor agent at the outset, Claude 3.5 Sonnet achieved a diagnostic accuracy of 37.2%, followed by GPT-4 (27.7%), GPT-4o (21.4%), and GPT-4o-mini (8%). When images had to be requested by the agent, the accuracies were 35.4%, 25.4%, 19.1%, and 6.1%, respectively.

Accuracy of Claude 3.5 Sonnet, GPT-4, GPT-4o, and GPT-4o-mini on AgentClinic-NEJM with multimodal language and image input. (Pink) Accuracy when the images are presented as initial input. (Blue) Accuracy when images must be requested from the image reader.
AgentClinic Implications for Clinical AI Evaluation
Taken together, these findings indicate that LLMs need to be evaluated with novel strategies that go beyond static question-answer benchmarks. AgentClinic provides a simplified clinical environment with agents representing a moderator, a patient, a doctor, and measurements, and represents a step towards dialogue-driven, more interactive benchmarks that assess the sequential decision-making ability of LLMs across distinct, multi-modal, and challenging settings. However, the authors cautioned that AgentClinic remains a simplified simulation of clinical care, relying on LLM-based patient, measurement, and moderator agents. They also noted potential data-leakage risks for proprietary models and emphasized that the human comparison drew on only three clinicians.
These findings should therefore be interpreted as benchmark performance, not evidence that any model is ready for autonomous clinical diagnosis.