In a recent study published in npj Digital Medicine, researchers developed diagnostic reasoning prompts to investigate whether large language models (LLMs) could simulate clinical diagnostic reasoning.
LLMs, artificial intelligence-based systems trained on enormous amounts of text data, are known for human-like performance in tasks such as writing clinical notes and passing medical exams. However, understanding their clinical diagnostic reasoning abilities is crucial for their integration into clinical care.
Recent studies have concentrated on open-ended clinical questions, indicating that advanced LLMs such as GPT-4 have the potential to diagnose complex patient cases. Because LLM performance varies with the type of prompt and question, prompt engineering has emerged as a way to improve it.
About the study
In the present study, researchers assessed diagnostic reasoning by GPT-3.5 and GPT-4 on open-ended clinical questions, hypothesizing that GPT models given diagnostic reasoning prompts could outperform the same models given conventional chain-of-thought (CoT) prompts.
The team used the revised MedQA United States Medical Licensing Exam (USMLE) dataset and the New England Journal of Medicine (NEJM) case series to compare conventional chain-of-thought prompting with various diagnostic reasoning prompts modeled after the cognitive processes of forming a differential diagnosis, analytical reasoning, Bayesian inference, and intuitive reasoning.
They investigated whether large-language models can mimic clinical reasoning skills using specialized prompts, combining clinical expertise with advanced prompting techniques.
The team used prompt engineering to generate prompts for diagnostic reasoning, converting questions into free-response format by removing the multiple-choice options. From the USMLE dataset, they included only Step 2 and Step 3 questions that asked for a patient diagnosis.
Each round of prompt engineering involved evaluating GPT-3.5 accuracy on the MedQA training set, which contained 95 questions. The test set of 518 questions was reserved for assessment.
The researchers also evaluated GPT-4 performance on 310 cases recently published in the NEJM. They excluded 10 that did not have definitive final diagnoses or that exceeded the maximum context length for GPT-4. On these cases, they compared conventional CoT prompting with the clinical diagnostic reasoning CoT prompt that had performed best on the MedQA dataset (differential diagnosis reasoning).
Every prompt included two example questions with rationales written in the target reasoning technique, an approach known as few-shot learning. The evaluation used free-response questions from the USMLE and the NEJM case report series to allow rigorous comparison between prompting strategies.
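The prompt layout described above, two worked exemplars followed by the target question, can be sketched roughly as follows. This is a minimal illustration and not the authors' actual prompts: the `Exemplar` class, the `build_few_shot_prompt` function, the instruction wording, and all placeholder text are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str   # free-response clinical vignette (placeholder text here)
    rationale: str  # worked reasoning written in the target reasoning style
    answer: str     # final diagnosis


def build_few_shot_prompt(instruction: str, exemplars: list[Exemplar], question: str) -> str:
    """Assemble instruction + worked exemplars + the unanswered target question."""
    parts = [instruction.strip()]
    for i, ex in enumerate(exemplars, start=1):
        parts.append(
            f"Example {i}\n"
            f"Question: {ex.question}\n"
            f"Rationale: {ex.rationale}\n"
            f"Answer: {ex.answer}"
        )
    # End on an open "Rationale:" so the model writes its reasoning before answering.
    parts.append(f"Question: {question}\nRationale:")
    return "\n\n".join(parts)


# Placeholder content only; the study's real exemplars are not reproduced here.
exemplars = [
    Exemplar("<vignette 1>", "<step-by-step differential diagnosis reasoning>", "<diagnosis 1>"),
    Exemplar("<vignette 2>", "<step-by-step differential diagnosis reasoning>", "<diagnosis 2>"),
]
prompt = build_few_shot_prompt(
    # Hypothetical instruction wording, assumed for illustration.
    "Form a differential diagnosis step by step, then state the most likely diagnosis.",
    exemplars,
    "<new vignette>",
)
print(prompt)
```

Swapping the instruction and the exemplar rationales for another reasoning style (analytical, Bayesian, intuitive) while keeping the questions fixed is one way the strategies could be compared on equal footing.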
Physician authors, comprising attending physicians and an internal medicine resident, evaluated the language model responses, with each question assessed by two blinded physicians. A third researcher resolved disagreements. Physicians verified the accuracy of answers using software when needed.
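The grading scheme described above, two blinded raters with a third reviewer settling disagreements, can be illustrated with a small sketch. It assumes that "inter-rater consensus" means simple percent agreement between the two raters, which the article does not confirm; the function names and the toy grades are hypothetical.

```python
def percent_agreement(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Share of questions (in percent) on which both raters gave the same grade."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rater lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def final_grades(rater_a: list[bool], rater_b: list[bool],
                 adjudicator: dict[int, bool]) -> list[bool]:
    """Keep the shared grade when raters agree; otherwise use the third reviewer's grade."""
    grades = []
    for i, (a, b) in enumerate(zip(rater_a, rater_b)):
        grades.append(a if a == b else adjudicator[i])
    return grades


# Toy data, invented for illustration: True means the answer was graded correct.
rater_a = [True, True, False, True]
rater_b = [True, False, False, True]
adjudicator = {1: True}  # third reviewer's ruling on the one disputed question

agreement = percent_agreement(rater_a, rater_b)
grades = final_grades(rater_a, rater_b, adjudicator)
accuracy = 100.0 * sum(grades) / len(grades)
print(agreement, grades, accuracy)
```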
The study indicates that, with suitable prompts, GPT-4 could mimic the clinical reasoning of clinicians without compromising diagnostic accuracy. Such interpretable rationales are crucial for assessing the accuracy of LLM responses and can thereby enhance their trustworthiness for patient care. The approach can help overcome the black box limitations of LLMs, bringing them closer to safe and effective use in medicine.
GPT-3.5 correctly answered 46% of assessment questions with standard CoT prompting and 31% with zero-shot, non-chain-of-thought prompting. Among the clinical diagnostic reasoning prompts, GPT-3.5 performed best with intuitive reasoning (48% versus 46% for standard CoT).
Compared to classic chain-of-thought, GPT-3.5 performed significantly worse with analytical reasoning prompts (40%) and with differential diagnosis prompts (38%), while the lower accuracy with Bayesian inference prompts (42%) fell short of statistical significance. The team observed an inter-rater consensus of 97% for the GPT-3.5 evaluations on MedQA data.
The GPT-4 API returned errors for 20 test questions, reducing the test dataset to 498 questions. GPT-4 was more accurate than GPT-3.5, achieving 76%, 77%, 78%, 78%, and 72% accuracy with classical chain-of-thought, intuitive reasoning, differential diagnosis reasoning, analytical reasoning, and Bayesian inference prompts, respectively. The inter-rater consensus was 99% for the GPT-4 MedQA evaluations.
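To make the comparison with the CoT baseline easier to read, the sketch below tabulates each prompt's accuracy difference from classical CoT in percentage points. It uses only the rounded percentages reported in this article, so the differences are approximate and say nothing about statistical significance; the dictionary layout and the `deltas_vs_cot` helper are hypothetical.

```python
COT = "classical CoT"

# Rounded accuracies (percent) as reported in this article.
reported_accuracy = {
    "GPT-3.5": {COT: 46, "intuitive": 48, "analytical": 40,
                "differential diagnosis": 38, "Bayesian": 42},
    "GPT-4": {COT: 76, "intuitive": 77, "differential diagnosis": 78,
              "analytical": 78, "Bayesian": 72},
}


def deltas_vs_cot(accuracy_by_prompt: dict[str, int]) -> dict[str, int]:
    """Difference from the classical CoT baseline, in percentage points."""
    baseline = accuracy_by_prompt[COT]
    return {
        name: value - baseline
        for name, value in accuracy_by_prompt.items()
        if name != COT
    }


for model, accuracy_by_prompt in reported_accuracy.items():
    print(model, deltas_vs_cot(accuracy_by_prompt))
```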
On the NEJM dataset, GPT-4 scored 38% accuracy with conventional CoT versus 34% with the differential diagnosis prompt (a difference of 4.2 percentage points). The inter-rater consensus for the GPT-4 NEJM assessment was 97%. The authors provided GPT-4 responses and rationales for the complete NEJM dataset. Prompts promoting step-by-step reasoning and focusing on a single diagnostic reasoning strategy performed better than those combining multiple strategies.
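Overall, the findings showed that diagnostic reasoning prompts elicited clinician-style reasoning from GPT-3.5 and GPT-4 but did not meaningfully improve accuracy. GPT-3.5 performed similarly with conventional and intuitive reasoning chain-of-thought prompts but worse with analytical and differential diagnosis prompts. Bayesian inference prompting also performed worse than classical CoT with GPT-3.5, although that difference was not statistically significant. GPT-4 accuracies on MedQA fell within a few percentage points of one another across prompt types (72% to 78%).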
The authors propose three explanations for these findings: the reasoning mechanisms of GPT-4 could be fundamentally different from those of human providers; the model could be producing post-hoc explanations of its diagnoses in the requested reasoning format; or it could already have reached the maximum accuracy attainable from the vignette data provided.