Advanced reasoning-based AI systems are showing physician-level performance on select diagnostic tasks, but researchers warn that real-world safety, bias, and clinical accountability remain major barriers to healthcare deployment.
A recent Perspective article published in Science explores whether advanced artificial intelligence (AI) systems are approaching physician-level reasoning, and what the safe integration of these systems into clinical practice would entail.
Progress in AI and diagnostic reasoning
Large language models (LLMs) are AI algorithms trained on vast amounts of data to learn patterns, which they then use to generate human-like responses. Reasoning models extend these capabilities by evaluating possible approaches before generating a response, thereby mimicking structured cognitive processing.
Numerous studies have evaluated healthcare applications of LLMs, including their performance on medical licensing examinations and other relevant assessments. These evaluations often extend beyond standard tests to include simulated clinical scenarios such as diagnostic case vignettes, specialty-specific exams, and problem-solving tasks designed to approximate clinical decision-making processes.
Discussing findings from Brodeur et al., the authors note that GPT-4 by OpenAI has achieved exact or very close diagnostic accuracy in up to 73% of cases, with the company’s first reasoning model, o1-preview, exceeding that performance at 88.6% on clinicopathological cases.
Moreover, o1-preview achieved close or exact diagnostic accuracy in 67% of emergency department (ED) cases at initial triage, exceeding that of two expert physicians in specific text-based diagnostic scenarios.
Since reasoning models were first introduced, their reasoning capabilities, deliberation times, and handling of multimodal inputs have all improved substantially. Whereas o1-preview accepted only text inputs, recent models can increasingly process combinations of text, images, audio, and video to support more complex clinical assessments.
How AI is being integrated into clinical practice
It is important to emphasize that AI systems are not being proposed as replacements for physicians. Rather, research in this area considers LLMs and other advanced models as collaborative tools, with clinicians providing accountability, oversight, and contextual judgment.
However, the authors also note that some well-defined healthcare tasks may ultimately be performed more effectively by AI systems operating independently. AI applications in healthcare have the potential to significantly reduce the human and financial costs associated with diagnostic errors, delays, and limited access to care.
The Medical Holistic Evaluation of Language Models (Med-HELM) defines five healthcare domains for AI use: administrative workflows, clinical note generation, clinical decision support, patient communication, and medical research assistance. Across these domains, AI has evolved to analyze patient records, monitor clinical encounters, and interact with predictive models, thereby minimizing delays, reducing diagnostic errors, and improving access to care.
Nevertheless, it remains unclear whether advanced AI models perform better when confined to specific, well-defined tasks or when operating independently across healthcare settings. As clinicians increasingly integrate AI tools into their practice, some already without institutional oversight, randomized trials are urgently needed to establish whether these models actually improve real-world outcomes.
Requiring clinical certification of AI models has also been proposed as a way to expand the role of AI in medicine while ensuring transparency and accountability. The proposed pathway would gradually advance AI systems from medical knowledge assistants to supervised clinical practice and, potentially, to broader autonomous responsibilities. Robust monitoring frameworks could complement these initiatives by tracking the safety, efficiency, and cost of AI clinical decision support systems.
Despite these efforts, AI has seen limited real-world success, in part because existing benchmarks inadequately reflect clinical performance and the clinical benefits of these systems remain unclear. Although newer multimodal systems can now integrate images, audio, and video, many medical AI evaluations remain focused on text-only tasks, which limits their ability to fully capture the complexity of clinical decision-making.
The authors also highlight concerns surrounding the rapid deployment of consumer-facing health AI systems. In one example, an independent evaluation found that a publicly available health-focused AI tool under-triaged more than half of emergency cases presented to it.
Beyond diagnostic accuracy, the Perspective emphasizes that clinical AI systems must demonstrate real-world effectiveness, equity, safety, transparency, and accountability before they can be widely adopted. The authors also note that previous healthcare algorithms have exhibited racial bias and that biased AI systems can negatively affect clinician decision-making.
Without robustly demonstrated effectiveness, equity, and safety, many AI systems will remain unsuitable for clinical use.