A massive new multimodal AI system trained on tens of millions of medical images could help unify fragmented radiology tools and assist doctors in interpreting scans and generating reports more efficiently.

In a recent study published in the journal NEJM AI, researchers introduced “MedVersa”, a generalist artificial intelligence (AI) model capable of ingesting and interpreting a wide range of medical imaging modalities and task types. Unlike traditional AI models trained for specific, limited tasks, MedVersa was built on tens of millions of medical imaging instances, allowing it to detect pathologies and generate reports within a unified analytical framework.
Encouragingly, in a blinded evaluation of chest radiograph reports, MedVersa's output was judged clinically comparable to human-written reports in many cases, particularly for scans with normal findings, while significantly reducing the time radiologists spent documenting their findings. Together, these results position MedVersa as a promising step toward a new generation of unified, multimodal foundation models that may help consolidate the currently fragmented ecosystem of AI tools used in clinical care.
Background: Fragmentation of Medical Artificial Intelligence Tools
While recent advances in computational power and artificial intelligence (AI) model design have allowed several medical AI tools to be approved for clinical use, these tools remain fragmented. A model trained on X-ray datasets can accurately detect pneumonia in chest X-rays, but it cannot draw on MRI or ultrasound data for a holistic patient evaluation.
These "specialist" models often struggle to adapt to complex clinical workflows in which a patient's diagnosis involves multiple data types. Computational biologists sought to address this gap by introducing the concept of Generalist Medical Artificial Intelligence (GMAI).
Their goal was to create a "foundation model", similar to the large language models (LLMs) underlying ChatGPT and Google Gemini, that can process multimodal inputs and outputs. Unfortunately, previous attempts to realize this concept focused largely on text-based inputs and proved incapable of handling the complex visual tasks indispensable to radiology.
Development of the MedVersa Multimodal AI Model
The present study aimed to address this functional gap by engineering “MedVersa,” a radiology-focused generalist AI model capable of ingesting, annotating, diagnosing, reporting, and documenting multimodal clinical imaging data. The model was trained using “MedInterp”, a massive dataset aggregating 91 public datasets that together comprised over 29 million medical instances, including images, bounding-box annotations, segmentation masks, captions, and other vision–language supervision signals used across diverse imaging tasks.
The model features a unique architecture that uses a trained LLM as an “orchestrator”, evaluating users' requirements (e.g., "Where is the patient’s tumor?") and dynamically selecting appropriate internal vision modules within the MedVersa framework for request execution. Unlike previous GMAIs, which were primarily text-based, MedVersa was designed to either generate a text response or deploy specialized "vision modules" for object detection or segmentation.
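The orchestrator pattern described above can be sketched in a few lines of Python. All module names and the keyword-based routing rule below are hypothetical illustrations; in MedVersa the routing decision is made by a trained LLM, not hand-written rules.

```python
# Minimal sketch of an LLM-orchestrator dispatching to vision modules.
# Module names and routing logic are illustrative, not MedVersa's actual code.

def detect_lesions(image):
    # Hypothetical vision module: would return bounding boxes.
    return [{"label": "tumor", "box": (40, 60, 120, 150)}]

def segment_organ(image):
    # Hypothetical vision module: would return a segmentation mask.
    return {"organ": "liver", "mask": "binary-mask-placeholder"}

def generate_report(image, history):
    # Hypothetical text head: would draft a free-text report.
    return "No acute cardiopulmonary abnormality."

def orchestrate(request, image, history=""):
    """Route a user request to a vision module or the text head.

    A simple keyword rule stands in here for the learned routing
    behavior of the trained LLM orchestrator.
    """
    text = request.lower()
    if "where" in text or "locate" in text:
        return detect_lesions(image)
    if "segment" in text or "outline" in text:
        return segment_organ(image)
    return generate_report(image, history)

result = orchestrate("Where is the patient's tumor?", image=None)
```

The design choice this illustrates is that the same entry point can return either structured vision output (boxes, masks) or free text, depending on the request, rather than forcing every task through a text-only interface.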
MedVersa can consequently process inputs as diverse as 2D X-rays, 3D CT and MRI scans, and patients' clinical history text simultaneously. Following training, MedVersa's performance was validated across nine distinct imaging tasks against two types of competitors: (1) approved specialist AI models and (2) board-certified radiologists (n = 10).
Evaluation Framework and Comparative Testing
Performance evaluation required the expert (an AI model or a human radiologist) to review chest X-ray reports generated by humans, GPT-4o, and MedVersa. Crucially, experts were blinded to the source of each report. Performance was scored on the clinical accuracy of the output and on efficiency (the time taken to complete the evaluation and generate a report).
Study Findings: Performance Across Imaging Tasks
Study findings revealed that MedVersa’s GMAI architecture was competitive with and frequently exceeded traditional “gold standard” specialist models across many object-detection and segmentation evaluation metrics.
When evaluating report generation, MedVersa achieved a BLEU-4 score (higher is better; measures text similarity) of 17.8, compared with MAIRA's 14.2, BiomedGPT's 12.0, and Med-PaLM M's 11.5. On RadCliQ (lower is better; measures deviation from human clinical reporting), MedVersa scored 2.71 versus MAIRA's 3.10 and BiomedGPT's 3.25. Although Med-PaLM M reported a slightly better RadCliQ score (2.67), the difference was statistically indistinguishable from MedVersa's result.
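To make the BLEU-4 numbers above concrete, the sketch below computes a toy sentence-level BLEU-4: the geometric mean of 1- to 4-gram precisions times a brevity penalty. This is only an illustration of the metric's idea; the study's actual evaluation would use standard corpus-level BLEU tooling with smoothing, and the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Toy sentence-level BLEU-4 against a single reference.

    Geometric mean of modified 1- to 4-gram precisions, multiplied by
    a brevity penalty that punishes candidates shorter than the reference.
    """
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

# A perfect match scores 1.0; partial overlap scores lower.
score = bleu4("no acute cardiopulmonary abnormality is seen",
              "no acute cardiopulmonary abnormality is seen")
```

Reported BLEU scores such as MedVersa's 17.8 are conventionally this quantity scaled to 0-100 over a whole test corpus.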
Comparison With Human Radiologist Reporting
When compared with human experts, researchers found that MedVersa’s reports were clinically comparable to human-written reports in 64% of cases. For scans with normal findings, this equivalence increased to 91%. However, for scans with abnormal findings involving more complex pathology, equivalence was substantially lower, and human-written reports were more often preferred by reviewing radiologists.
Researchers also demonstrated that using MedVersa as an assistant enabled doctors to complete report-drafting workflows more quickly. It reduced report-writing time and, crucially, resulted in fewer "urgent" discrepancies (errors requiring immediate attention) than reports drafted by GPT-4o (a 20% reduction in the 5-to-10-minute reporting interval).
Conclusions: Toward Unified Clinical AI Assistants
The present study positions MedVersa as an important step toward a unified clinical assistant, in contrast to the traditionally fragmented landscape of AI tools. Its architecture, which leverages an LLM to orchestrate specialized vision tools, enabled the model to match or exceed specialized AI models across several tasks while significantly streamlining and accelerating expert radiologists' workflows.
However, the study emphasizes that while MedVersa excelled at routine cases, board-certified radiologists remain preferred for complex, abnormal cases involving intricate pathologies, underscoring the importance of expert supervision. The authors also note that broader generalizability across imaging modalities remains an ongoing challenge because several non–chest X-ray datasets in the study were dominated by segmentation tasks rather than full diagnostic interpretation.
Consequently, while the present study validates MedVersa as a powerful proof-of-concept, future GMAI models should be trained with expanded datasets that include more modalities (e.g., genetic information and electronic health records [EHRs]) to fully realize the potential of AI-assisted, human expert-mediated patient care.
Journal reference:
- Zhou, H.-Y., Acosta, J. N., Adithan, S., Datta, S., Topol, E. J., & Rajpurkar, P. (2026). MedVersa: A Generalist Foundation Model for Diverse Medical Imaging Tasks. NEJM AI. DOI: 10.1056/AIoa2500595. https://ai.nejm.org/doi/full/10.1056/AIoa2500595