In a recent study published in the journal Radiology, researchers evaluated the effectiveness of Generative Pre-trained Transformer (GPT)-4 in identifying and correcting common errors in radiology reports, analyzing its performance, time efficiency, and cost-effectiveness compared to human radiologists.
Study: Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy.
Background
Radiology reports are essential for accurate medical diagnoses, but keeping them consistent and error-free is an ongoing challenge. Typically, residents draft these reports, which are then reviewed by board-certified radiologists, a process that, while necessary, demands significant resources. Heavy workloads, high-pressure clinical environments, and unreliable speech recognition all contribute to frequent errors, including incorrect laterality and misplaced descriptors. GPT-4, a large language model developed by OpenAI, offers potential solutions by standardizing and generating radiology reports and has shown promise in educational applications for enhancing diagnostic accuracy. Further research is crucial to ensure GPT-4's reliability and effective integration into radiological practice.
About the study
The present retrospective study received ethical approval, had the requirement for informed consent waived because of its design, and exposed no patient-identifying information to GPT-4. Conducted at University Hospital Cologne, it included 200 radiology reports from radiography and cross-sectional imaging, randomly assigned to two groups of 100 reports each: one left correct and one into which errors were deliberately introduced. The errors were introduced by a radiology resident and categorized as omissions, insertions, spelling mistakes, side confusion, and other errors.
Six radiologists with varying levels of experience, along with GPT-4, evaluated these reports for errors. GPT-4 was assessed using zero-shot prompting, with instructions to check each report's findings and impression sections for consistency and errors. The time GPT-4 took to process the reports was also recorded.
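The article does not reproduce the exact prompt, but a zero-shot check of this kind can be sketched as follows. This is a minimal illustration assuming the OpenAI Python client; the prompt wording, model name, and helper function are hypothetical rather than the study's actual setup.

```python
# Minimal sketch of zero-shot error checking with the OpenAI Python client.
# The prompt text and model identifier are illustrative; the study's exact
# prompt and deployment are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def check_report(findings: str, impression: str) -> str:
    """Ask the model whether findings and impression are consistent and error-free."""
    prompt = (
        "You are reviewing a radiology report. Check the findings and impression "
        "for internal consistency and for errors such as wrong laterality, "
        "omissions, insertions, or spelling mistakes. "
        "State whether the report contains an error and, if so, describe it.\n\n"
        f"Findings:\n{findings}\n\nImpression:\n{impression}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducible review
    )
    return response.choices[0].message.content


# Example call with a toy, clearly inconsistent report (hypothetical content):
# print(check_report("No focal lesion in the right lobe.", "Lesion in the left lobe."))
```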
Costs were calculated from German national labor agreements for the radiologists and from per-token usage for GPT-4. Statistical analyses of error detection rates and processing times were conducted in SPSS and Python, comparing GPT-4's performance with the radiologists' using chi-square tests, with significance set at P < .05 and effect sizes measured by Cohen's d.
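As a rough illustration of that kind of comparison (not the authors' actual analysis pipeline), a chi-square test on detection counts and a Cohen's d for group differences could look like the sketch below; all counts and values are invented placeholders.

```python
# Illustrative comparison of error-detection rates; the numbers are invented
# placeholders, not data from the study.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: GPT-4 vs. one radiologist; columns: errors detected vs. missed (hypothetical counts).
table = np.array([[124, 26],   # GPT-4: detected, missed
                  [130, 20]])  # radiologist: detected, missed
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")


def cohens_d(x, y):
    """Cohen's d for two independent samples using a pooled standard deviation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd


# Example: per-report reading times in seconds (hypothetical values).
print(f"Cohen's d = {cohens_d([3.2, 3.6, 3.7], [24.0, 26.5, 25.0]):.2f}")
```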
Study results
In the detailed evaluation of error detection, GPT-4's performance varied relative to the human radiologists. It did not surpass the best-performing senior radiologist, detecting 82.7% of errors compared with the senior's 94.7%, but its performance was generally comparable to that of the other radiologists in the study. No statistically significant differences in average error detection rates were found between GPT-4 and the radiologists across general radiology, radiography, and Computed Tomography (CT)/Magnetic Resonance Imaging (MRI) report evaluations, except in specific cases such as side confusion, where GPT-4 performed worse.
In detecting side confusion specifically, GPT-4 was notably less effective than the top radiologist, with a detection rate of 78% versus 100%. Across the other error categories, GPT-4 showed accuracy similar to the radiologists', with no significant shortfall in identifying errors. Both GPT-4 and the radiologists occasionally flagged correct reports as erroneous, although this occurred infrequently and without significant differences between the groups.
The interrater agreement between GPT-4 and the radiologists ranged from slight to fair, suggesting variability in error detection patterns among the reviewers. This highlights the challenges of consistent error identification across different interpreters and technologies.
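The article does not name the agreement statistic, but "slight to fair" corresponds to the conventional interpretation scale for Cohen's kappa. A minimal illustration of how such agreement can be computed is sketched below; the ratings are invented placeholders, not study data.

```python
# Illustrative interrater-agreement calculation with Cohen's kappa;
# the flags below are hypothetical, not taken from the study.
from sklearn.metrics import cohen_kappa_score

# 1 = report flagged as containing an error, 0 = flagged as correct (hypothetical ratings)
gpt4_flags        = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
radiologist_flags = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(gpt4_flags, radiologist_flags)
print(f"Cohen's kappa = {kappa:.2f}")  # 0.21-0.40 is conventionally read as "fair" agreement
```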
Time efficiency was another critical aspect of this study. GPT-4 required significantly less time to review all 200 reports, completing the task in just 0.19 hours, compared to the range of 1.4 to 5.74 hours taken by human radiologists. The fastest radiologist took approximately 25.1 seconds on average to read each report, while GPT-4 took only 3.5 seconds, showcasing a substantial increase in processing speed.
The study showed that, on average, each human reader cost $190.17 to proofread all 200 radiology reports, with individual costs ranging from $156.89 for attending physicians to $231.85 for senior radiologists. In stark contrast, GPT-4 completed the same task for just $5.78. Per report, GPT-4 cost $0.03 versus $0.96 for the human readers, making GPT-4 not only faster but vastly more cost-effective, a difference that was statistically significant.
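The per-report figures follow directly from the totals quoted above; the short check below simply redoes that arithmetic (report count and totals taken from the numbers in this article).

```python
# Back-of-the-envelope check of the per-report figures quoted above.
n_reports = 200

gpt4_total_cost = 5.78          # USD for all 200 reports
human_avg_total_cost = 190.17   # USD, average per human reader for all 200 reports

print(f"GPT-4 cost per report: ${gpt4_total_cost / n_reports:.2f}")        # ~$0.03
print(f"Human cost per report: ${human_avg_total_cost / n_reports:.2f}")   # ~$0.95; the article quotes $0.96

gpt4_total_hours = 0.19
print(f"GPT-4 time per report: {gpt4_total_hours * 3600 / n_reports:.1f} s")  # ~3.4 s; the article quotes 3.5 s
```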
Conclusions
To summarize, this study evaluated GPT-4's ability to detect errors in radiology reports and compared its performance with that of human radiologists. GPT-4's error detection proved comparable to the humans' while being far more cost-effective and time-efficient. Despite these benefits, the study emphasized the need for human oversight because of legal and accuracy concerns.
Journal reference:
- Roman Johannes Gertz, Thomas Dratsch, Alexander Christian Bunck, et al. Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy. Radiology (2024). DOI: 10.1148/radiol.232714. https://pubs.rsna.org/doi/10.1148/radiol.232714