In a recent preprint* uploaded to the bioRxiv server, researchers developed and trained a foundational model to predict tissue-specific RNA expression, splicing, RNA binding protein specificity, and microRNA sites from genomic DNA sequences. Their model, termed "BigRNA," could identify and predict pathogenic non-coding DNA variants across a broad spectrum of mechanistic cases. Notably, BigRNA was able to accurately predict the effects of steric-blocking oligonucleotides (SBOs), nucleic acids capable of modulating gene expression. Their results suggest that BigRNA and similar foundational models might allow for personalized RNA therapeutics in the future.
Study: An RNA foundation model enables discovery of disease mechanisms and candidate therapeutics. Image Credit: Joyisjoyful / Shutterstock
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
RNA modeling and the advantages of deep learning
Efforts to design machine learning (ML) algorithms capable of predicting RNA outcomes from DNA sequences are longstanding and plentiful, yet they have still to prove successful. Recent advances in deep learning (DL) have enabled significant strides in RNA prediction and opened up computational approaches that would have been impossible just a decade ago.
Unfortunately, most current research in the field focuses on predicting absolute or overall RNA expression, while research into the regulatory mechanisms underpinning RNA expression is lacking. Since regulatory processes such as splicing and polyadenylation are far more relevant to clinical therapeutic interventions, studies of the specific transcriptional perturbations governing RNA expression are crucial for any future medical application in the field.
Advancements in next-generation sequencing techniques have made RNA sequencing (RNA-seq) data widely available. The large amount of RNA-seq data in circulation provides the ideal resource for high-resolution analyses of RNA expression while also allowing for training deep learning models capable of identifying and predicting complex transcriptional regulatory events from a wide variety of distinct DNA genotypes. Hybrid datasets, including the Genotype-Tissue Expression (GTEx) project, are especially useful since they combine both high-resolution RNA-seq and Whole Genome Sequencing (WGS), allowing for a direct DNA-to-RNA comparison.
About the study
The present study used extensive WGS and RNA-seq data to design and train a deep learning model named "BigRNA," aimed at predicting RNA expression and the mechanistic interactions that produce the observed RNA expression levels. Researchers began by compiling GTEx consortium data comprising both WGS and RNA-seq information from 70 individuals with diverse hereditary backgrounds. The sequence data were aligned and passed through a 128bp-window pipeline, since the transformer-based model architecture was optimized for 128bp (base pair) reads.
"Each RNA-seq sample was processed into two data tracks: coverage and junction, where the junction track contains a subset of read counts at splice junctions."
The 128bp-window pipeline was then further optimized to account for the two RNA data tracks: coverage data was processed using 128bp-window average-pooling, while junction data was processed using 128bp-window sum-pooling. RNA data was aligned to the corresponding genomic data, with particular attention paid to individual-specific insertions and deletions.
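The difference between the two pooling modes can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `pool_track`, the toy track values, and the handling of a trailing partial window are assumptions made purely for illustration. Averaging keeps coverage on a per-base depth scale, whereas summing preserves the total number of junction reads falling in each window.

```python
from typing import List

WINDOW = 128  # window size in base pairs, as described in the article


def pool_track(values: List[float], window: int = WINDOW, mode: str = "mean") -> List[float]:
    """Pool a per-base track into non-overlapping windows.

    mode="mean" averages each window (the treatment described for coverage);
    mode="sum" adds each window up (the treatment described for junction counts).
    A trailing partial window is pooled over the bases it actually contains
    (an assumption; the article does not say how edges are handled).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if mode not in ("mean", "sum"):
        raise ValueError("mode must be 'mean' or 'sum'")
    pooled = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        total = float(sum(chunk))
        pooled.append(total / len(chunk) if mode == "mean" else total)
    return pooled


# Hypothetical toy tracks covering 256 bases, i.e. two 128bp windows.
coverage = [2.0] * 128 + [4.0] * 128   # per-base read depth
junction = [0.0] * 256
junction[10] = 3.0                     # three reads at a splice junction in window 0
junction[200] = 5.0                    # five reads at a splice junction in window 1

coverage_pooled = pool_track(coverage, mode="mean")
junction_pooled = pool_track(junction, mode="sum")
print(coverage_pooled)  # [2.0, 4.0]
print(junction_pooled)  # [3.0, 5.0]
```

Had the junction track been averaged instead, the sparse counts would have been diluted to 3/128 and 5/128, which is one plausible reason for pooling the two tracks differently.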
BigRNA was then trained on the 70 DNA-RNA pairs separately, allowing for independent learning from each of the sampled individuals after accounting for phenotypic differences arising from haplotypes. Researchers also added individual-agnostic per-tissue outputs to BigRNA's training regimen, encouraging the model to begin predicting the genotype resulting in the observed RNA-seq data.
Following model training, BigRNA was fine-tuned on RNA-binding protein (RBP) and microRNA datasets obtained from enhanced crosslinking and immunoprecipitation (eCLIP) assays and the Encyclopedia of DNA Elements (ENCODE) database. For model performance testing, protein-coding genes completely separate from those used for training were selected. In order to validate BigRNA's performance and accuracy, the difference between the model predictions and previous experimental results was computed for each tissue. Differential gene expression prediction performance was verified using pairwise comparisons between predictions and observations and calculated using the log2 fold-change metric (correlation coefficient between predicted and target coverage data per gene for all genes).
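One plausible reading of this evaluation is sketched below, with the caveat that the article does not give the exact procedure. For each gene, a log2 fold-change between two tissues is computed from observed and from predicted expression, and the two sets of fold-changes are then correlated across genes. The pseudocount, the function names, and all expression values are hypothetical.

```python
import math
from typing import List

PSEUDOCOUNT = 1.0  # assumed value, added to avoid taking the log of zero


def log2_fold_change(a: float, b: float, pseudocount: float = PSEUDOCOUNT) -> float:
    """log2 ratio of expression in tissue A over tissue B for one gene."""
    return math.log2((a + pseudocount) / (b + pseudocount))


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene expression (four genes) in two tissues.
observed_a, observed_b = [7.0, 15.0, 3.0, 31.0], [3.0, 3.0, 7.0, 7.0]
predicted_a, predicted_b = [7.0, 7.0, 3.0, 15.0], [3.0, 3.0, 3.0, 7.0]

observed_lfc = [log2_fold_change(a, b) for a, b in zip(observed_a, observed_b)]
predicted_lfc = [log2_fold_change(a, b) for a, b in zip(predicted_a, predicted_b)]

r = pearson(observed_lfc, predicted_lfc)
print(observed_lfc, predicted_lfc, round(r, 3))
```

A correlation near 1 would indicate that the model ranks tissue-to-tissue expression differences in the same order as the measurements, even if the predicted magnitudes are compressed, as they are in this toy data.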
Study findings
BigRNA was able to predict both tissue-specific RNA expression and potential protein and microRNA binding sites with high accuracy. Notably, for unseen genes not included in the training or validation datasets, correlation coefficients (r) of ~0.70 were obtained (range 0.47 – 0.77). Accuracy was even more remarkable for RNA expression in the brain, at around 74%. BigRNA notably outperformed the current gold standard in RNA prediction models, 'DeepRiPe,' for all 142 datasets tested. For microRNA predictions, BigRNA showed an accuracy of 84%. This is promising, given the drug discovery applications of microRNAs.
"A key challenge in human genetics is to predict the impact of sequence variants that may be found within the human population. Many deep learning models that do well on unseen genes using certain metrics, such as AlphaFold, struggle to predict variant effects. While some accurate methods exist for predicting the pathogenic impact of rare missense variants, non-coding variants, such as those located within the 3' and 5' untranslated regions (UTRs) of genes, remain difficult to interpret."
BigRNA alleviates these concerns: when tested using a sample dataset from ClinVar (a public archive of human genetic variants and their reported clinical significance), BigRNA was able to predict disease outcomes from input sequence data with an area under the ROC curve (AUC) score of 0.95. The average false positive rate (FPR) of the model was consistently <0.5%, suggesting that BigRNA and other foundational models might aid clinicians in diagnosing hereditary and genetic diseases in the future.
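For readers unfamiliar with these metrics, the sketch below shows how an AUC and a false positive rate can be computed from model scores. It assumes a binary labelling (1 for pathogenic, 0 for benign) and uses made-up scores; none of the numbers come from the preprint, and the 0.5 threshold is arbitrary.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve and is fine for small illustrative datasets.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def false_positive_rate(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of negatives (benign) scored at or above the threshold."""
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not neg:
        raise ValueError("need at least one negative example")
    return sum(1 for s in neg if s >= threshold) / len(neg)


# Hypothetical variant scores: higher means "more likely pathogenic".
scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1, 0.4]
labels = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = pathogenic, 0 = benign (assumed labelling)

auc = roc_auc(scores, labels)
fpr = false_positive_rate(scores, labels, threshold=0.5)
print(round(auc, 3), fpr)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, while the false positive rate depends on the score threshold chosen, so the two numbers answer different questions about a classifier.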
Most conventional models cannot identify pathogenic splicing variants, and the few that can still fail to distinguish between benign mutations and pathogenic variants. BigRNA was evaluated on its ability to predict and flag the splicing impacts of exon skipping using data from a massively parallel splicing assay (MaPSy). The model showed impressive performance, with an AUC of 0.89. To evaluate the impacts of intronic variants on splicing and BigRNA's performance therein, data from the ABCA4 gene was used. Once again, BigRNA was found to accurately identify and flag the splicing event, with an AUC of 0.9.
"The ability of BigRNA to understand regulatory mechanisms affecting splicing and gene expression may allow it to design therapeutic interventions that rescue pathogenic variant effects."
Conclusions
In the present preprint, researchers developed a novel deep learning model called BigRNA to identify and predict RNA-seq defects from genomic DNA datasets. Their results suggest that BigRNA is the most accurate model to date for identifying RNA-seq aberrations, including splicing, from DNA datasets. BigRNA was additionally shown to be capable of predicting tissue-specific gene expression and identifying the underlying mechanisms resulting in differential expression levels across genotypes.
Because BigRNA is a machine learning model, its accuracy has the potential to improve even further with additional WGS and RNA-seq data. Foundational models, including BigRNA, might pave the way for personalized RNA therapeutics in the future.
"Our results show that different drug discovery tasks can be assisted by deep learning. We believe that BigRNA and deep learning systems like it have the potential to transform the field of RNA therapeutics."