In a recent article published in JAMA Neurology, researchers tested whether the artificial intelligence (AI) model SCORE-AI interprets routine clinical electroencephalograms (EEGs) with accuracy comparable to that of human experts.
EEG is a widely used tool for diagnosing epilepsy, a neurological disease affecting over 70 million people globally, and helps distinguish epilepsy from other neurological conditions that impair consciousness.
In many parts of the world, human experts for reading and interpreting clinical EEGs are unavailable, which raises the need for AI-based tools. Even in countries with the most advanced health care systems, clinical EEGs are often read by physicians who lack fellowship training in EEG interpretation, increasing the rate of epilepsy misdiagnosis. Meanwhile, ever-increasing EEG referrals have overburdened the trained experts in tertiary care centers who interpret high volumes of EEG recordings.
Since previous AI models addressed only limited aspects of EEG interpretation, e.g., identifying epileptiform activity, there is a need for fully automated AI-based tools that provide a comprehensive, clinically relevant assessment of routine EEGs and could improve patient care in remote areas where EEG experts are scarce or unavailable.
About the study
In the present multicenter diagnostic accuracy study, researchers developed and trained a convolutional neural network model, SCORE-AI, using a large dataset of expertly annotated EEGs recorded between 2014 and 2020. All EEGs were annotated using the Standardized Computer-based Organized Reporting of EEG (SCORE EEG) software.
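Convolutional networks such as SCORE-AI work by sliding learned filters over the signal; the article does not describe SCORE-AI's architecture or weights, so the following is only an illustrative sketch of the core 1-D convolution operation, with a made-up kernel and toy EEG samples:

```python
# Minimal sketch of a 1-D convolution, the basic operation of a CNN.
# The kernel and samples below are hypothetical illustrations, not
# values from the actual SCORE-AI model.

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (cross-correlation) of a signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# A second-difference kernel responds strongly to sharp transients,
# the kind of waveform shape a spike detector might learn.
eeg_channel = [0.0, 0.1, 0.0, 5.0, -3.0, 0.2, 0.1]  # toy samples (uV)
spike_kernel = [-1.0, 2.0, -1.0]

response = conv1d(eeg_channel, spike_kernel)
```

The output peaks where the toy "spike" sits in the signal; a trained network stacks many such learned filters with nonlinearities rather than a single hand-picked kernel.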
The study included all patients older than three months. The team validated SCORE-AI using three independent test data sets, as follows:
i) a multicenter data set of 100 representative EEGs evaluated by 11 experts;
ii) a single-center data set of 9785 EEGs evaluated by 14 experts; and
iii) a data set of 60 EEGs with an external reference standard for benchmarking with previously published AI models. However, the team limited the benchmarking to the comparison of the ability to identify epileptiform discharges by combining focal and generalized categories.
Further, the team tested how well SCORE-AI distinguished an abnormal from a normal EEG recording and classified abnormal recordings into the four most clinically relevant categories, viz., epileptiform-focal, epileptiform-generalized, non-epileptiform-focal, and non-epileptiform-diffuse abnormalities.
The study output was SCORE-AI's assessment of each EEG recording as normal, abnormal, or one or more of the four abnormal categories. The tool performed a fully automated analysis with no human involvement.
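As an illustration of what this fully automated output might look like, the final decision step can be sketched as thresholding per-category probabilities; the category names come from the study, but the probability values and the 0.5 threshold here are hypothetical:

```python
# Hypothetical decision step: turn per-category model probabilities into
# the study's output (normal, or one or more abnormal categories).
# The threshold and probabilities are illustrative, not SCORE-AI's values.

CATEGORIES = [
    "epileptiform-focal",
    "epileptiform-generalized",
    "non-epileptiform-focal",
    "non-epileptiform-diffuse",
]

def classify_eeg(probs, threshold=0.5):
    """Return abnormal categories exceeding the threshold, or 'normal' if none do."""
    abnormal = [c for c in CATEGORIES if probs[c] >= threshold]
    return abnormal if abnormal else ["normal"]

probs = {
    "epileptiform-focal": 0.91,
    "epileptiform-generalized": 0.07,
    "non-epileptiform-focal": 0.62,
    "non-epileptiform-diffuse": 0.12,
}
print(classify_eeg(probs))  # ['epileptiform-focal', 'non-epileptiform-focal']
```

Note that the categories are not mutually exclusive: one recording can carry both epileptiform and non-epileptiform abnormalities.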
Integrating SCORE-AI with autoSCORE highlighted the abnormal epochs within the EEG recording so that a human expert could adjust the automated assessment if needed. The researchers determined the interrater agreement between SCORE-AI and the human experts for the multicenter test dataset.
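The article does not specify which agreement statistic was used; as one common choice, Cohen's kappa (chance-corrected agreement) between two raters' normal/abnormal calls can be sketched, with made-up labels, as:

```python
# Illustrative sketch: Cohen's kappa between two raters' categorical labels.
# The rating sequences below are hypothetical, not data from the study.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

ai_calls     = ["abnormal", "normal", "normal", "abnormal", "normal"]
expert_calls = ["abnormal", "normal", "abnormal", "abnormal", "normal"]
print(round(cohens_kappa(ai_calls, expert_calls), 2))  # prints 0.62
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; the study reports SCORE-AI's agreement with experts, not these toy values.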
The team used a fixed model and fixed thresholds during the clinical validation phase and avoided overfitting the AI model during this phase. Moreover, the test dataset was independent of the development dataset.
A large test data set of ~10,000 EEGs ensured generalizability. Different human experts provided the reference standard in the validation and development phases of the study, and the EEGs in the multicenter test dataset were recorded using different EEG equipment.
Results

The development data set comprised 30,493 EEG recordings annotated by 17 experts. With no human involvement, SCORE-AI attained human expert-level performance in interpreting routine clinical EEGs, achieving high accuracy with an area under the receiver operating characteristic curve (AUC) ranging between 0.89 and 0.96 across the four types of EEG abnormalities.
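The AUC figures above can be read as the probability that the model scores a randomly chosen abnormal EEG higher than a randomly chosen normal one; a minimal rank-based (Mann-Whitney) computation of AUC, with made-up scores, looks like this:

```python
# Illustrative sketch of AUC via the Mann-Whitney statistic: the fraction
# of (abnormal, normal) pairs the model ranks correctly (ties count 0.5).
# Labels and scores below are hypothetical, not data from the study.

def roc_auc(labels, scores):
    """AUC from binary labels (1 = abnormal, 0 = normal) and model scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the study's 0.89-0.96 range indicates strong discrimination.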
Recordings shorter than 20 minutes had a lower AUC than longer recordings, with mean AUCs of 0.887 and 0.903, respectively. Moreover, for longer recordings, the AUC varied little with recording duration across the 2,549 holdout test set EEGs and the 9,785 EEGs of the larger clinical data set.
On the multicenter test data set, SCORE-AI had markedly higher specificity, accuracy, and positive predictive value than the human experts, but lower sensitivity. Since SCORE-AI appears to identify normal EEGs with nearly 100% precision, experts might spend less time evaluating these recordings and more time on the more difficult aspects of epilepsy monitoring.
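The metrics compared here all derive from the confusion matrix of model calls against the reference standard; with hypothetical counts (not the study's numbers), they can be computed as:

```python
# Standard diagnostic-accuracy metrics from confusion-matrix counts.
# The counts used below are hypothetical illustrations only.

def diagnostic_metrics(tp, fp, tn, fn):
    """Metrics for a binary abnormal-vs-normal decision."""
    return {
        "sensitivity": tp / (tp + fn),               # abnormal EEGs correctly flagged
        "specificity": tn / (tn + fp),               # normal EEGs correctly cleared
        "ppv": tp / (tp + fp),                       # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# A profile like the one described: high specificity and PPV, lower sensitivity.
m = diagnostic_metrics(tp=80, fp=2, tn=98, fn=20)
```

With these toy counts, sensitivity is 0.80 while specificity is 0.98 and PPV about 0.98, mirroring the trade-off the study describes.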
Conclusions

To summarize, SCORE-AI is one of the first AI-based models to perform a comprehensive, clinically relevant assessment of routine EEGs, providing a more detailed classification of EEG abnormalities than previously published AI models. Identifying epileptiform activity in EEGs supports an accurate and timely diagnosis of epilepsy.
The expert-level performance of SCORE-AI could reduce EEG misinterpretation and help overcome the problem of low interrater agreement in settings where physicians with limited experience read EEGs, i.e., common neurology practice settings.
Accordingly, researchers have integrated SCORE-AI with widely used clinical EEG equipment systems. Because SCORE-AI requires no specialized hardware, it could also be ported to other computer-based interfaces.