In a blinded virtual study, Google’s AMIE matched primary care doctors overall and outperformed them on several management-reasoning measures, but researchers caution that the system remains experimental and untested in real clinical care.

Study: Towards Conversational AI for Disease Management.
New Google research, published as an Accelerated Article Preview in the journal Nature, describes the potential clinical value of a large language model-based research artificial intelligence system, Articulate Medical Intelligence Explorer (AMIE), for simulated, multi-visit disease-management reasoning.
Background
Large language model (LLM)-based artificial intelligence (AI) systems are showing growing promise in clinical settings, not only for accurate diagnosis but also for collecting medical history through conversations in a natural, empathetic style that helps build trustworthy relationships with patients.
Although several AI models have been developed for diagnostic reasoning, their capabilities in multi-visit disease management, such as monitoring disease progression and therapeutic response across multiple clinical visits and prescribing medications safely, remain largely unexplored.
A team of researchers at Google DeepMind and Google Research, California, USA, evaluated how AMIE, an LLM-based research AI system with physician-like performance on conversational diagnostic tasks, handles disease management over time.
To advance AMIE for management reasoning, the team developed an LLM-based agentic system comprising an empathetic dialogue agent for synchronous text-chat patient conversations and a management reasoning agent that performs more extensive inference-time reasoning and cross-references up-to-date clinical practice guidelines and drug formularies.
The disease-management version of AMIE used the Gemini models' long-context capabilities to track longitudinal patient data across follow-up visits.
To benchmark medication reasoning, the team developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (OpenFDA and the British National Formulary) and validated by board-certified pharmacists.
The team next conducted a randomized, blinded, virtual Objective Structured Clinical Examination study to compare the multi-visit disease-management reasoning capabilities of AMIE with those of 21 primary care physicians across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice clinical practice guidelines.
Key findings
The comparative analysis revealed that AMIE was non-inferior to primary care physicians on overall management reasoning and scored significantly higher than physicians on appropriateness of the overall plan and treatment recommendations across all three visits.
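For readers unfamiliar with the term, a non-inferiority comparison usually asks whether the lower bound of a confidence interval for the score difference stays above a pre-specified negative margin. The short Python sketch below illustrates that idea only. The function name, the ratings, the 0.5-point margin, and the normal-approximation interval are all assumptions made for illustration, and they do not reproduce the study's data or its statistical method.

```python
import math
import statistics


def non_inferior(ai_scores, physician_scores, margin, z=1.96):
    """Return (is_non_inferior, lower_bound) for paired scores.

    Non-inferiority is declared when the lower bound of the confidence
    interval for the mean paired difference (AI minus physician) lies
    above -margin. z=1.96 gives a two-sided 95% normal-approximation
    interval.
    """
    diffs = [a - p for a, p in zip(ai_scores, physician_scores)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    lower = mean_diff - z * std_err
    return lower > -margin, lower


# Hypothetical 1-5 ratings for the same eight scenarios (not study data).
ai = [4, 5, 4, 4, 5, 3, 4, 5]
gp = [4, 4, 4, 3, 5, 4, 4, 4]

# The margin of 0.5 rating points is an assumed value for illustration.
ok, lower = non_inferior(ai, gp, margin=0.5)
print(ok, round(lower, 2))
```

With these made-up ratings the lower bound is about -0.24, which sits above the assumed margin of -0.5, so the check reports non-inferiority. A stricter margin, such as 0.2, would not be met by the same numbers.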
AMIE’s precision in recommending treatments and investigations was also significantly higher than that of physicians across all three visits.
For at least one of the three visits, AMIE scored significantly higher than physicians on being free of significant errors, providing appropriate follow-up recommendations, and avoiding inappropriate treatments.
Regarding the use of clinical guidelines, both AMIE and physicians scored similarly high on selecting applicable guidelines. However, AMIE scored significantly higher than physicians in recommending treatments and investigations that aligned with the guidelines and in explicitly grounding recommendations in guideline references.
To compare medication-reasoning accuracy, the research team used lower- and higher-difficulty questions from the RxQA benchmark under both an “open-book” and a “closed-book” setting. The “open-book” setting allowed both AMIE and physicians to search for relevant information. In the “closed-book” setting, neither physicians nor AMIE had access to external knowledge resources.
Access to external drug information benefited both physicians and AMIE. However, AMIE outperformed physicians on the higher-difficulty questions in both the “open-book” and “closed-book” settings.
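A comparison of this kind amounts to tabulating accuracy for each combination of question difficulty and information setting. The Python sketch below shows one way such a tabulation might look. The function name and the graded answers are hypothetical, and none of the values come from RxQA or the study.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Group graded answers by (difficulty, setting) and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, answered]
    for difficulty, setting, correct in records:
        totals[(difficulty, setting)][0] += int(correct)
        totals[(difficulty, setting)][1] += 1
    return {key: c / n for key, (c, n) in totals.items()}


# Hypothetical graded answers: (difficulty, setting, answered correctly?)
records = [
    ("higher", "closed-book", True),
    ("higher", "closed-book", False),
    ("higher", "open-book", True),
    ("higher", "open-book", True),
    ("lower", "closed-book", True),
    ("lower", "open-book", True),
]

result = accuracy_by_condition(records)
for condition, accuracy in sorted(result.items()):
    print(condition, accuracy)
```

Running the same tabulation separately for each respondent group would give the per-condition accuracies that a comparison like the one described here rests on.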
Study significance
The study highlights the potential of AMIE, an LLM-based research AI system, as a future tool for multi-visit disease management. The findings suggest that, in simulated consultations, AMIE can perform as well as, and in some cases better than, physicians across a variety of disease-management reasoning challenges.
Globally, health care systems are experiencing increased care fragmentation, meaning that a patient’s care is spread across several physicians, settings, or systems that share little or no information with one another. Such care fragmentation is associated with worsened morbidity for patients with chronic diseases. Based on current findings, the Google research team suggests that AMIE may one day serve as a point of continuity in otherwise fragmented health systems, either independently or in collaboration with physicians.
The team also believes that, with rigorous clinical testing, such systems can address the growing unmet clinical needs caused by global shortages and inequalities in physician availability, physician burnout, and increasingly complex patient populations.
However, the study was conducted in simulated, text-chat consultations with trained patient actors, not in real clinical care, and the authors state that AMIE is not ready for clinical use. The scenarios were constructed for evaluation, the case mix was not representative of routine primary care, and the study did not test effects on patient outcomes.
The observed capabilities of AMIE reflect the rapid advancement of LLMs in clinical conversation and reasoning. The rapid improvement of state-of-the-art LLMs may help mitigate current limitations, such as confabulations (the generation of false, misleading, or entirely fabricated responses), which otherwise pose considerable risks in clinical medicine.
Overall, the study demonstrates the evolution of Google’s AMIE research system from conversational diagnostic AI toward a multi-visit disease-management reasoning system. Although the system was tested using global measures of management reasoning, the researchers urge that this evaluation be seen as a first step in measuring management reasoning, and they highlight the need for future work to explore the reasoning traces of medical AI systems in a comprehensive, quantitative manner.