In a recent article published in the journal Science, researchers presented AlphaMissense, a model adapted from the highly accurate protein structure prediction model AlphaFold (AF) to predict the pathogenicity of missense variants across the human proteome at the level of single amino acid substitutions.
Study: Accurate proteome-wide missense variant effect prediction with AlphaMissense.
AlphaMissense follows the AF architecture with minor differences and retains AF's capacity to process multiple sequence alignments (MSAs) and learn evolutionary constraints from related sequences.
Of the more than four million missense variants identified by genome sequencing efforts, barely 2% have been classified as pathogenic or benign. Both kinds of variant change the amino acid sequence of a protein; however, only pathogenic missense variants disrupt protein function enough to reduce an organism's fitness.
The lack of models that accurately predict the effects of missense variants, especially variants of unknown significance, is a major challenge in human genetics. It has limited the diagnostic rate of rare diseases and the development of clinical therapies that target the genetic cause of a disease.
Multiplexed assays of variant effect (MAVEs) systematically quantify missense variant effects to help predict their clinical outcomes; however, they are laborious and costly, which hinders a proteome-wide survey of missense variant pathogenicity.
Likewise, supervised machine learning approaches that learn from existing variant annotations inherit the biases of those annotations. In addition, they are prone to data leakage between training and test sets.
Another class of methods used unsupervised approaches to model the distribution of amino acids at a given site across naturally evolved proteins, and interpreted pathogenicity as the difference in predicted log-likelihood between the alternate and reference sequences. However, these methods did not capture protein structure as AF did.
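The log-likelihood difference used by these unsupervised methods can be illustrated with a minimal sketch. It assumes a model that outputs a probability for each amino acid at a single site; the function name and the probabilities below are hypothetical and do not come from the study.

```python
import math

def log_likelihood_ratio(site_probs, ref_aa, alt_aa):
    """Return log p(alt) - log p(ref) for one protein site.

    site_probs maps amino acid letters to model probabilities at that site.
    More negative values mean the alternate residue is less expected there.
    """
    return math.log(site_probs[alt_aa]) - math.log(site_probs[ref_aa])

# Hypothetical model output for a single site (illustrative, not real data).
site_probs = {"L": 0.70, "I": 0.20, "V": 0.08, "P": 0.02}

conservative = log_likelihood_ratio(site_probs, "L", "I")  # leucine -> isoleucine
disruptive = log_likelihood_ratio(site_probs, "L", "P")    # leucine -> proline
print(conservative, disruptive)
```

In this toy case the rarely seen proline receives a more negative score than isoleucine, which such methods would read as a stronger signal of pathogenicity.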
To avoid potential human curation biases, the researchers trained AlphaMissense on weak labels: variants commonly observed in populations were treated as benign, while hypothetical variants never seen in the human population served as an approximation of pathogenic ones.
Models trained on clinical databases, e.g., ClinVar, inherit human biases and often fail to generalize across multiple clinical benchmarks. The authors tested AlphaMissense on ClinVar missense variants after balancing the number of pathogenic and benign variants per gene.
It achieved an area under the receiver operating characteristic curve (auROC) of 0.940 on 18,924 ClinVar test variants, versus 0.911 for the Evolutionary model of Variant Effect (EVE), a model that was not trained directly on ClinVar. AlphaMissense also outperformed models trained directly on ClinVar. Furthermore, AlphaMissense distinguished pathogenic from benign ClinVar variants within regions of high evolutionary constraint, suggesting it captured differences in the effect of individual variants within constrained domains.
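For readers unfamiliar with the metric, auROC can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison on made-up scores; it is an illustration of the metric, not the authors' evaluation code, and the pairwise loop is only practical for small inputs.

```python
def auroc(scores, labels):
    """Fraction of (pathogenic, benign) pairs ranked correctly; ties count half.

    labels: 1 = pathogenic, 0 = benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical pathogenicity scores and labels (illustrative, not ClinVar data).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
toy_auroc = auroc(scores, labels)  # 8 of 9 pairs are ordered correctly
print(toy_auroc)
```

A value of 1.0 means every pathogenic variant outscores every benign one, and 0.5 corresponds to random ranking.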
Distinguishing benign and pathogenic variants within specific disease-associated genes is a clinically relevant task for predictive models. Here, too, AlphaMissense performed well. In the analysis of 612 genes with at least five pathogenic and five benign variants in the ClinVar test set, it attained an average gene-level auROC of 0.950 versus 0.921 for EVE. For the 34 clinically actionable American College of Medical Genetics (ACMG) genes, AlphaMissense improved on EVE's pathogenicity predictions in 26 genes (77%). Indeed, calibrated AlphaMissense predictions could expand the number of confidently classified missense variants compared to other methods.
Genetic researchers have consistently observed that disease-causing missense variants reside in more thermally stable proteins. In line with this, AlphaMissense assigned higher pathogenicity scores to variants in structured regions than to those in disordered regions. It also predicted more pathogenicity in evolutionarily constrained genes than in unconstrained ones. In addition, it captured domain-level conservation within a protein rather than only protein-level evolutionary conservation.
As expected, substitutions of aromatic amino acids or cysteine were more likely to be pathogenic, given the role of these residues in maintaining protein structure. The predicted substitution scores were asymmetric (replacing one amino acid with another did not score the same as the reverse substitution), suggesting that AlphaMissense used both the structural and evolutionary MSA information to make predictions consistent with known biological principles.
The per-position average pathogenicity predicted by AlphaMissense agreed strongly with the MAVE per-position average for disease-relevant proteins like SHOC2. In fact, AlphaMissense was the only model to accurately predict the pathogenic effects of mutations in the functionally important first 80 amino acids of SHOC2, of which positions 63 to 74 were pathogenic according to the MAVE assay. SHOC2 forms a complex with MRAS and PP1C to activate the Ras-MAPK (mitogen-activated protein kinase) signaling pathway in cancer.
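A per-position average of this kind is simply the mean score over all substitutions at each residue position. The following sketch shows one way to compute it, assuming scores keyed by a hypothetical (position, alternate amino acid) pair; the values are invented and are not SHOC2 data.

```python
from collections import defaultdict

def per_position_mean(variant_scores):
    """Average scores over all substitutions at each residue position.

    variant_scores maps (position, alt_aa) -> score.
    Returns a dict of position -> mean score, ordered by position.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (position, _alt_aa), score in variant_scores.items():
        totals[position] += score
        counts[position] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}

# Hypothetical scores for two positions (illustrative, not real predictions).
toy_scores = {(1, "A"): 0.9, (1, "G"): 0.7, (2, "A"): 0.2, (2, "S"): 0.4}
position_means = per_position_mean(toy_scores)
print(position_means)
```

The same averaging applied to predicted scores and to MAVE measurements yields two per-position profiles that can then be compared.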
AlphaMissense was trained in two stages. In the first, it was pretrained like AF on single-chain structure prediction, combined with a protein language modeling task: predicting the identity of amino acids masked at random positions in the MSA. In the second, it was fine-tuned on human proteins to predict variant pathogenicity. An ablation study that systematically removed components of the model found that both training stages were essential for optimal performance.
A gene's average AlphaMissense pathogenicity shares similar properties with the loss-of-function observed/expected upper bound fraction (LOEUF) across a wide range of biological measures of intolerance in humans. Most of the properties of genes in the most pathogenic decile of AlphaMissense predictions remained consistent among genes underpowered for LOEUF, supporting the generalizability of the scores to an additional 4,252 short genes.
Overall, a methodology combining AlphaMissense predictions and population cohort–based approaches could effectively quantify the functional significance of short human genes for which the latter (alone) lacks statistical power.
The researchers released four resources comprising millions of human proteome-wide missense variant predictions for the research community. The first dataset contained 71 million missense variants, of which 32% were classified as likely pathogenic and 57% as likely benign. Each of these missense variants results from a single nucleotide change that alters one amino acid.
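The percentages above imply that roughly one in nine variants falls into neither class. The sketch below works through that arithmetic and shows how a continuous score could be mapped to three classes. The two cutoff constants and the class names are assumptions for illustration only, since the text above does not state them; the released data's documentation should be consulted for the actual values.

```python
# Assumed cutoffs for illustration only; not stated in the text above.
BENIGN_MAX = 0.34
PATHOGENIC_MIN = 0.564

def classify(score):
    """Map a pathogenicity score in [0, 1] to one of three classes."""
    if score < BENIGN_MAX:
        return "likely_benign"
    if score > PATHOGENIC_MIN:
        return "likely_pathogenic"
    return "ambiguous"

# Counts implied by the reported percentages (integer arithmetic).
total_variants = 71_000_000
likely_pathogenic = total_variants * 32 // 100
likely_benign = total_variants * 57 // 100
unclassified = total_variants - likely_pathogenic - likely_benign

print(likely_pathogenic, likely_benign, unclassified)
print(classify(0.10), classify(0.45), classify(0.90))
```

Under these figures, about 22.7 million variants are likely pathogenic, about 40.5 million are likely benign, and the remaining 11% (about 7.8 million) are left unclassified.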
The second resource contained gene-level AlphaMissense pathogenicity predictions. The third comprised 216 million possible single amino acid substitutions across 19,233 human proteins. The fourth and final resource contained predictions for all possible missense variants and amino acid substitutions across 60,000 alternative transcript isoforms for future research.