The range of applications of artificial intelligence (AI) and deep learning (DL) has expanded significantly since 2015, especially in ophthalmology. DL uses ophthalmic data, such as optical coherence tomography scans and fundus photographs, for image recognition. More recently, DL has been combined with natural language processing (NLP) in ophthalmology, enabling interaction through human language.
Scientists have developed large language models (LLMs) that produce human-like text. OpenAI, for example, created ChatGPT, a general-purpose LLM based on the Generative Pre-trained Transformer 3 (GPT-3) series. Several experiments have shown that ChatGPT's overall accuracy exceeds 50%.
A recent Ophthalmology Science study assessed the performance of ChatGPT in ophthalmology.
NLP has gained attention due to the recent release of foundation models, which can be adapted to a given application through a process known as transfer learning. Foundation models can include billions of parameters due to advances in computer hardware, the availability of large amounts of training data, and the transformer model architecture.
GPT-3, an LLM, was trained on a large text dataset comprising more than 400 billion words from the internet, including articles, books, and websites. LLMs have recently been assessed for their capacity to understand and generate natural language in medicine. However, the medical domain challenges the performance of LLMs because of its high demand for clinical reasoning, which requires years of training and experience.
In 2022, the performance of PaLM, a 540-billion-parameter LLM, was evaluated on multiple-choice questions from the United States Medical Licensing Exam (USMLE), which revealed an accuracy of 67.6%. Interestingly, ChatGPT was also able to provide insightful explanations to support its answers.
About the study
Limited studies have assessed the performance of LLMs in the ophthalmology question-answering space. Considering this gap in research, the current study investigated the performance of ChatGPT in ophthalmology using two popular question banks: the OphthoQuestions online question bank and the American Academy of Ophthalmology's Basic and Clinical Science Course (BCSC) Self-Assessment Program.
ChatGPT functions beyond predicting the next word, as it has been trained using human feedback. Two versions of ChatGPT were evaluated; the first was released on January 9, 2023, known as the legacy model, whereas the other upgraded model was launched on January 30, 2023. The updated model comprised "enhanced factuality and mathematical capabilities."
OpenAI also launched ChatGPT Plus, which offers faster response times. The authors used ChatGPT Plus for their analysis, as previous versions were inaccessible.
Multiple experiments were conducted using ChatGPT Plus to establish the reproducibility of the results. A set of 260 test questions was generated from the BCSC Self-Assessment Program and another 260 questions from OphthoQuestions.
Twenty random questions were selected from each of the thirteen sections of the standardized Ophthalmic Knowledge Assessment Program (OKAP) exam. ChatGPT's performance was analyzed based on the subject, type of question, and level of difficulty.
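The sampling procedure described above (20 random questions from each of 13 sections, yielding a 260-question simulated exam) can be sketched as follows. The section names and question-bank structure here are illustrative assumptions, not the authors' actual data or code.

```python
import random

# Hypothetical OKAP section labels (placeholders for the real 13 section names).
SECTIONS = [f"Section {i}" for i in range(1, 14)]

def build_simulated_exam(question_bank, questions_per_section=20, seed=0):
    """Draw a fixed number of random questions from each section of a bank.

    question_bank maps each section name to a list of question identifiers.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    exam = []
    for section in SECTIONS:
        exam.extend(rng.sample(question_bank[section], questions_per_section))
    return exam

# Example with a toy bank of 30 placeholder questions per section:
bank = {s: [f"{s}-Q{j}" for j in range(30)] for s in SECTIONS}
exam = build_simulated_exam(bank)
print(len(exam))  # 260 questions = 13 sections x 20 questions each
```

Fixing the random seed mirrors the study's emphasis on reproducibility: the same simulated exam can be regenerated for repeated experiments.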
The current study provided evidence of ChatGPT's performance in responding to questions from the OKAP exam. A significant improvement in ChatGPT's performance was observed during experimentation. ChatGPT Plus showed an accuracy of 59.4% on the simulated OKAP exam based on the BCSC testing set and 49.2% using the OphthoQuestions testing set.
Based on the aggregated historical human performance data, humans score 74% on the BCSC question bank. In addition, the group of ophthalmology residents who completed their training in 2022 scored 63% on OphthoQuestions.
It is worth noting that ChatGPT's performance in ophthalmology is promising, as it matches the accuracy levels of advanced LLMs in general medical question answering, which usually fall between 40% and 50%, as stated in recent publications from 2022.
The accuracy of the legacy model depended on the exam section, irrespective of accounting for question difficulty and cognitive level. However, this effect was less prominent in the updated ChatGPT version.
Importantly, ChatGPT's performance consistently improved in Fundamentals, General Medicine, and Cornea, which could be due to the large amount of training data and resources available on the internet for these topics.
ChatGPT performed poorly in Ophthalmic Pathology, Neuro-ophthalmology, and Intraocular Tumors. These are highly specialized domains, which could even be challenging within the ophthalmology community. It must be noted that around 40% of patients referred to neuro-ophthalmology and ocular oncology services are misdiagnosed.
Although the updated ChatGPT Plus model showed improved performance in Intraocular Tumors and Pathology as compared to earlier versions, its performance remained unchanged in Neuro-ophthalmology. In addition, ChatGPT's predictions were found to be more accurate on questions that a higher percentage of humans answered correctly. This finding indicates that ChatGPT's responses align with the collective understanding of ophthalmology trainees.
In the future, the authors plan to conduct qualitative analysis to identify areas that require improvement in the ophthalmic space. The accuracy of ChatGPT could be improved by incorporating other specialized foundation models trained with domain-specific sources, such as EyeWiki.
Currently, ChatGPT cannot be implemented in ophthalmology because of its inability to process images. A new application programming interface (API) for ChatGPT would help validate this technology and reduce the tedium of manual testing.
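As a sketch of how such an API could streamline validation, the snippet below scores a batch of multiple-choice questions against an answer key. The `ask_model` callable is a placeholder assumption standing in for whatever ChatGPT API endpoint becomes available; it is not part of the study.

```python
def score_multiple_choice(questions, answer_key, ask_model):
    """Return a model's overall accuracy on multiple-choice questions.

    questions:  dict mapping question id -> prompt text
    answer_key: dict mapping question id -> correct letter, e.g. "B"
    ask_model:  callable(prompt) -> answer letter (placeholder for a
                hypothetical ChatGPT API call)
    """
    correct = sum(
        1 for qid, prompt in questions.items()
        if ask_model(prompt).strip().upper() == answer_key[qid]
    )
    return correct / len(questions)

# Usage with a toy stand-in "model" that always answers "A":
questions = {"q1": "Prompt 1", "q2": "Prompt 2",
             "q3": "Prompt 3", "q4": "Prompt 4"}
key = {"q1": "A", "q2": "B", "q3": "A", "q4": "C"}
accuracy = score_multiple_choice(questions, key, lambda prompt: "A")
print(accuracy)  # 0.5
```

Automating this loop over hundreds of bank questions would replace the manual copy-and-paste testing the authors had to perform through the chat interface.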
- Antaki, F., Touma, S., Milad, D., et al. (2023) Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of its Successes and Shortcomings. Ophthalmology Science. doi:10.1016/j.xops.2023.100324