A new AI-driven diagnostic framework combines clinical, genetic, and phenotypic data to help shorten the rare disease diagnostic journey while providing transparent, evidence-based reasoning for clinicians.

Study: An agentic system for rare disease diagnosis with traceable reasoning.
In a recent study published in the journal Nature, researchers developed DeepRare, an agentic system powered by large language models (LLMs) for the diagnosis of rare diseases.
Global Rare Disease Burden and Diagnostic Delays
Rare diseases affect more than 300 million individuals worldwide, yet diagnosis remains challenging due to clinical heterogeneity, limited physician familiarity, and low disease prevalence. Patients frequently experience a prolonged diagnostic odyssey that can exceed five years and involves repeated referrals, misdiagnoses, unnecessary interventions, treatment delays, and poor clinical outcomes. These delays impose significant economic and emotional burdens on patients and families, underscoring the urgent need for accurate, scalable rare disease diagnostic tools.
DeepRare Agentic System Architecture and Core Components
In this study, researchers introduced DeepRare, a large-language-model–based agentic system for rare disease diagnosis. DeepRare consists of three primary components: (1) an LLM-powered central host equipped with a memory bank, (2) specialized agent servers that execute analytical tasks, and (3) heterogeneous data sources supplying diagnostic evidence from web-scale medical knowledge bases and scientific literature. The system uses DeepSeek-V3 as the default LLM powering the central host.
DeepRare processes diverse patient inputs, including genomic test results, free-text clinical descriptions, and Human Phenotype Ontology (HPO) terms. The central host coordinates agent servers to retrieve relevant evidence tailored to patient data, generates preliminary diagnostic hypotheses, and performs a structured self-reflection phase to validate or refute them through additional searches. If no hypothesis satisfies the predefined criteria, the system iteratively repeats the reasoning cycle until a resolution is reached. The final output is a ranked list of candidate rare diseases accompanied by a traceable reasoning chain linking each inference to supporting evidence.
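The hypothesize-then-reflect loop described above can be sketched in a few lines of Python. This is a minimal illustration only: the stub "agent server" scores phenotype overlap against a tiny hard-coded knowledge base (`TOY_KB`), where the real system dispatches LLM-driven searches over web-scale medical sources; all names and the scoring rule here are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    disease: str
    score: float                                   # confidence after reflection
    evidence: list = field(default_factory=list)   # traceable supporting facts

# Hypothetical stand-in for a knowledge base queried by agent servers.
TOY_KB = {
    "Marfan syndrome": {"HP:0001166", "HP:0000545", "HP:0001653"},
    "Fabry disease":   {"HP:0003073", "HP:0000093"},
}

def evidence_agent(patient_hpo, disease):
    """Stub agent server: return evidence linking patient terms to a disease."""
    overlap = patient_hpo & TOY_KB[disease]
    return [f"{term} supports {disease}" for term in sorted(overlap)]

def diagnose(patient_hpo, threshold=0.5, max_rounds=3):
    """Generate hypotheses, self-reflect, and iterate until criteria are met."""
    memory = []                                    # the 'memory bank' / reasoning chain
    for round_no in range(max_rounds):
        hypotheses = []
        for disease, terms in TOY_KB.items():
            ev = evidence_agent(patient_hpo, disease)
            score = len(ev) / len(terms)           # fraction of disease terms matched
            hypotheses.append(Hypothesis(disease, score, ev))
            memory.append(f"round {round_no}: {disease} scored {score:.2f}")
        hypotheses.sort(key=lambda h: h.score, reverse=True)
        if hypotheses[0].score >= threshold:       # self-reflection criterion satisfied
            return hypotheses[:5], memory
        threshold *= 0.8                           # relax criterion and search again
    return hypotheses[:5], memory

ranked, chain = diagnose({"HP:0001166", "HP:0000545"})
```

Here the top-ranked `Hypothesis` carries both a score and the evidence list that produced it, mirroring the paper's pairing of each candidate diagnosis with a traceable reasoning chain.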
Benchmark Comparisons Against LLMs, Bioinformatics Tools, and Agentic Systems
The researchers evaluated DeepRare against state-of-the-art general-purpose LLMs, reasoning-enhanced LLM variants, medical domain-specific LLMs, bioinformatics diagnostic tools, and other agentic systems. General-purpose models included Claude-3.7-Sonnet, GPT-4o, Gemini-2.0-flash, and DeepSeek-V3, along with reasoning-enhanced versions such as Claude-3.7-Sonnet-thinking, o3-mini, Gemini-2.0-FT, and DeepSeek-R1. Medical-specific LLMs included MMedS-Llama 3 and Baichuan-14B. Bioinformatics tools comprised PubCaseFinder and PhenoBrain, while other agentic systems included MDAgents and DS-R1-search.
DeepRare was evaluated on 6,401 clinical cases spanning 2,919 diseases across seven public datasets and two in-house datasets. Public datasets included the Deciphering Developmental Disorders Study, RareBench-MME (Matchmaker Exchange), RareBench-LIRICAL, RareBench-HMS, RareBench-RAMEDIS, MIMIC-IV-Rare, and MyGene2. In-house datasets consisted of clinical cases from Xinhua and Hunan hospitals in China. Together, these datasets encompassed literature-derived case reports, curated repositories, and real-world clinical center data across diverse populations.
Diagnostic Accuracy Metrics and Recall@K Performance
For each diagnostic task, the system generated five ranked predictions. Performance was assessed using Recall@K, which measures the probability that the correct diagnosis appears within the top-K predictions. Recall@1 reflects the proportion of cases where the correct diagnosis ranked first, while Recall@3 and Recall@5 indicate whether it appeared within the top three or five predictions, respectively.
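The metric itself is straightforward to compute. The sketch below is an illustrative implementation of Recall@K on two invented toy cases (the disease names and case data are made up for the example, not drawn from the study):

```python
def recall_at_k(ranked_preds, truths, k):
    """Fraction of cases whose true diagnosis appears in the top-k predictions."""
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_preds, truths))
    return hits / len(truths)

# Two toy cases: the truth ranks 2nd in the first, and is absent from the second.
preds = [
    ["Fabry disease", "Marfan syndrome", "Gaucher disease"],
    ["Pompe disease", "Wilson disease", "Alkaptonuria"],
]
truths = ["Marfan syndrome", "Rett syndrome"]

recall_at_k(preds, truths, 1)  # -> 0.0 (truth never ranks first)
recall_at_k(preds, truths, 3)  # -> 0.5 (truth in top 3 for one of two cases)
```

By construction, Recall@1 ≤ Recall@3 ≤ Recall@5 for any fixed set of cases, which is why the reported Recall@5 figures are always the highest.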
In HPO-based analyses, DeepRare achieved a Recall@1 of 57.18%, outperforming Claude-3.7-Sonnet-thinking, the second-best model, by 23.79 percentage points. Across 14 body systems representing multiple medical specialties, DeepRare consistently maintained superior diagnostic performance. When analyses were stratified by disease representation, DeepRare performed strongly both for well-represented diseases (more than 10 cases per disease) and for underrepresented diseases (10 or fewer cases), highlighting robustness across variable case distributions.
Performance Versus Rare Disease Specialists
DeepRare was evaluated against five expert rare disease specialists using identical HPO inputs. Clinicians were permitted to consult search engines but were not allowed to use AI-based diagnostic tools. DeepRare achieved Recall@1 and Recall@5 rates of 64.4% and 78.5%, respectively, compared with specialists’ average Recall@1 of 54.6% and Recall@5 of 65.6%. These results suggest the system outperformed human experts under standardized benchmarking conditions.
Integration of Genetic Data Improves Diagnostic Accuracy
The researchers assessed DeepRare using combined genetic and HPO inputs, including whole-exome sequencing data from Xinhua and Hunan hospitals. Incorporating genetic data significantly improved performance. Recall@1 increased from 33.3% to 63.6% in the Hunan dataset and from 39.9% to 69.1% in the Xinhua dataset.
When compared with Exomiser, a bioinformatics tool integrating genetic and HPO data, DeepRare achieved higher Recall@1 values of 63.6% (Hunan) and 69.1% (Xinhua), versus 58.0% and 55.9% for Exomiser, respectively.
Different LLMs, including DeepSeek-R1, Gemini-2.0-flash, Claude-3.5-Sonnet, and GPT-4o, were tested as the central host. LLM choice had minimal impact on overall performance, suggesting architectural robustness. The authors noted these findings reflect controlled retrospective evaluations rather than prospective real-world deployment.
Transparent Reasoning and Clinical Decision Support Implications
DeepRare is an agentic LLM-powered system capable of generating transparent reasoning chains for rare disease diagnosis. In retrospective benchmarking, the system consistently outperformed existing LLMs, bioinformatics tools, agentic frameworks, and expert clinicians across diverse datasets. Clinician review of generated reasoning chains demonstrated high reference accuracy, although occasional hallucinated or irrelevant citations were observed.
Future research may extend this framework to treatment selection, prognosis prediction, and prospective clinical validation to assess real-world clinical utility.