New AI-driven research reveals how advanced machine learning models not only confirm known Alzheimer’s genes but also spot six new risk variants.
Study: Machine learning in Alzheimer’s disease genetics. Image credit: Kateryna Kon/Shutterstock.com
Statistical tools are essential for unpacking the genetic basis of complex medical conditions, yet these methods have advanced little beyond linear additive models. A recent paper published in Nature Communications describes the outcome of applying machine learning (ML) to genomic data from a large cohort of Alzheimer’s disease (AD) patients in Europe.
Introduction
Genome-wide association studies (GWAS) have pioneered deeper insights into genetic variation as a risk factor for AD. These variants are factored into polygenic risk scores (PRS) that help predict disease risk.
These tools assume that each variant contributes to the outcome independently and additively: the risks associated with individual variants are summed, whether the variants lie at the same or at different genetic loci. This ignores evidence that risks are modified by interactions among variants and with other risk factors.
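The additive assumption can be illustrated with a minimal sketch of a PRS as a weighted sum of risk-allele counts. The function name, the effect sizes, and the genotypes below are hypothetical values chosen for illustration; they are not taken from the study.

```python
def polygenic_risk_score(dosages, weights):
    """Additive PRS: sum over variants of (effect weight * risk-allele count)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical per-variant effect sizes (e.g. log odds ratios from a GWAS).
weights = [0.5, 0.25, -0.125]

# Hypothetical risk-allele counts (0, 1, or 2) for one individual.
dosages = [2, 1, 0]

score = polygenic_risk_score(dosages, weights)

# Under this model, adding one risk allele at the first variant always shifts
# the score by the same amount, whatever the genotype at the other variants.
shift_a = polygenic_risk_score([2, 0, 0], weights) - polygenic_risk_score([1, 0, 0], weights)
shift_b = polygenic_risk_score([2, 2, 2], weights) - polygenic_risk_score([1, 2, 2], weights)
```

Because each variant's contribution is fixed, a score of this form cannot represent a variant whose effect depends on another variant, which is the limitation that more flexible models aim to address.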
AD research has shown, for instance, that different APOE variants alter disease features and the type of immune cellular response to abnormal neuronal proteins. Genetic studies indicate that differences in APOE expression result in different AD-gene associations and varying age at diagnosis.
As GWAS sample sizes increase and the predictive power of PRS plateaus, newer approaches that apply advanced computational resources are needed to extract the maximum benefit from the large datasets now available and to give a clearer view of the genetic basis of AD. ML models have been applied in several studies; however, small sample sizes have left those studies at high risk of bias.
The current study sought to address this using the largest currently available genome-wide dataset for AD.
About the study
In this study, the researchers trained three types of models, which are well-known and high-performing in this field:
- Gradient Boosting Machines (GBMs)
- Biological pathway-informed Neural Networks (NNs)
- Model-based Multifactor Dimensionality Reduction (MB-MDR).
The aim was to assess the effectiveness of each algorithm at performing three types of tasks:
- Replicating prior findings
- Finding new disease-associated loci overlooked by GWAS
- Predicting high-risk individuals
The study used rigorous cross-validation, multiple random train-test splits, and careful adjustment for confounders such as sex, age, genotyping center, and population structure.
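As a rough illustration of what multiple random train-test splits involve, the sketch below draws several stratified splits of a case-control label vector using only the Python standard library. The function names, the 20% test fraction, the number of repeats, and the toy labels are assumptions made for this example; the study's actual pipeline, including its adjustment for confounders, is not reproduced here.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), drawing the test set separately per class."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def repeated_splits(labels, n_repeats=5, test_fraction=0.2, seed=0):
    """Generate several independent random stratified splits from one seed."""
    rng = random.Random(seed)
    return [stratified_split(labels, test_fraction, rng) for _ in range(n_repeats)]


# Toy cohort: 40 cases (1) and 60 controls (0).
labels = [1] * 40 + [0] * 60
splits = repeated_splits(labels)
```

Each split keeps the case-control ratio the same in the training and test sets, so a model can be refitted and re-evaluated on every split to check whether its selected variants and its accuracy stay stable.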
Results
Replicating earlier findings
Regarding the first objective, the findings showed that ML captured all the genome-wide significant variants identified in the training set. Moreover, it identified 22% of the AD-associated variants reported in larger GWAS meta-analyses, even though its sample size was only a twentieth of theirs. Thus, this study sets a benchmark for ML-based genome-wide methods.
The ML models’ ability to replicate findings from much larger GWAS highlights that flexible models can recover a substantial fraction of known genetic risk with a smaller number of samples.
Identifying genetic loci
Secondly, ML correctly identified APOE as a risk factor for AD and captured the lead single-nucleotide polymorphisms (SNPs) linked to AD. Across methods, ML highlighted the lead SNPs for multiple important AD genes. MB-MDR 1d found 20 highly stable SNPs, mostly mapping to the APOE region, that were selected in every train-test split.
The models also identified six new loci that were replicated in an unrelated dataset. These loci include genes such as ARHGAP25, LY6H, and COG7. GBMs identified most of the novel loci.
A novel association was detected in AP4E1, close to the already known SPPL2A locus. AP4E1 encodes part of a protein key to amyloid metabolism, and its deficiency may promote beta-amyloid formation, increasing AD risk. The neural network approach also highlighted an additional novel locus (SOD1) with possible biological links to AD pathology.
Predicting AD status
All models predicted AD status with comparable accuracy. GBM predictions were most strongly correlated with those of NN and MB-MDR 1d. PRS predictions were only weakly correlated with those of NNs but strongly correlated with those of GBMs.
GBM and PRS were better at separating cases from controls. The predictions were validated using random rearrangements of the training and testing data, indicating high reproducibility.
Females were overrepresented among predicted cases, as expected from the data's female majority. GBM was the exception, with similar proportions of males and females in both cases and controls.
All model predictions remained stable across different cohorts and repeated random splits, suggesting that the findings are not driven by overfitting or technical artifacts.
Comparison with GWAS
The investigators compared the primary ML-detected variants with all important AD-associated SNPs reported in meta-analyses. Of 130 previously reported genes corresponding to 86 loci, one or more ML algorithms picked up 19 loci (about 22%). All models identified APOE, while seven loci were detected by two models.
Leaving the APOE region out of the training dataset led to the identification of more known AD risk genes but with lower accuracy. When only the current data was used, one or more ML models identified each GWAS-detected SNP in the training dataset.
The ML-identified SNPs with high priority were more concentrated in microglial and astrocytic regions. These were involved in various AD-related pathways, such as regulation of the AD-hallmark beta amyloid protein, or changes in the concentration of proteins such as Ly6h. This molecule binds to acetylcholine receptors involved in neurotransmission, and its level in the cerebrospinal fluid correlates with AD severity. Others are traced to glycosylation abnormalities implicated in AD tau protein processing.
The way ML models rank SNP importance (e.g., via SHAP values for GBM, permutation p-values for MB-MDR, or network weights for NN) does not always translate directly to conventional GWAS significance, reflecting fundamental differences in feature selection between ML and traditional statistics.
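To make one of these ranking approaches concrete, the sketch below computes a generic permutation p-value for a single SNP: case-control labels are shuffled many times to see how often chance alone produces a difference in mean risk-allele count as large as the observed one. This is a textbook permutation test, not the MB-MDR procedure itself; the function names, the number of permutations, and the toy genotypes are assumptions for illustration.

```python
import random


def mean_difference(values, labels):
    """Mean risk-allele count in cases (1) minus that in controls (0)."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the case-control mean difference."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between genotype and status
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data: risk-allele counts for 10 cases followed by 10 controls.
dosages = [2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
status = [1] * 10 + [0] * 10
p_value = permutation_p_value(dosages, status)
```

A p-value obtained this way depends on the chosen statistic and the number of permutations, which is one reason such rankings need not line up with conventional GWAS significance thresholds.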
Importance of the study
This well-powered, sophisticated study emphasizes that ML can predict AD-linked genetic variants comparably with traditional genome-wide methods, given the large datasets available.
The moderate predictive accuracy of GWAS meta-analyses could be due to the heterogeneity of included studies, reflecting differences in multiple relevant characteristics. More homogeneous samples provide higher odds ratios than clinical samples. Some SNPs identified by ML models may only have detectable effects in particular cohorts or under specific conditions, which may not be visible in large, heterogeneous external datasets.
This also explains why not all SNPs identified by the ML models could be replicated in external datasets. Their effects may be significant only in specific situations, failing to reach genome-wide significance across very different studies with different contexts.
Despite this, the novel SNPs identified here affected biologically plausible pathways. Further research is essential to understand how to distinguish the truly important SNPs among those captured by different methods.
Conclusions
“Our results demonstrate that machine learning methods can achieve predictive performance comparable to classical approaches in genetic epidemiology.” Besides predicting risk, they identified new loci missed by traditional GWAS approaches. The reproducible approach used here minimizes the chances of bias.
Overall, this work demonstrates the promise and current limitations of ML in AD genetics. It offers a valuable addition to GWAS but also underscores the need for careful interpretation, replication, and further methodological refinement.
The current study opens the way for future development and validation of ML models to complement conventional methods in AD genetic research.