In a recent study published in Scientific Reports, researchers developed a novel algorithm to analyze large genomic datasets of ribonucleic acid (RNA) viruses, applying it to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and severe acute respiratory syndrome coronavirus (SARS-CoV).
Because the algorithm's main input was a list of codon frequencies, the number of genomic sequences did not substantially affect its performance. Having more sequences with greater variability to analyze helped the algorithm narrow down the loci pairs it had to consider as possible start and end points of conserved regions.
Consequently, this algorithm ran slightly faster when analyzing data from a larger dataset of genomic sequences. Most importantly, it maximized the signal-to-noise ratio during the analysis.
Moreover, like its predecessors, it was scale-agnostic in determining regions of nucleic acid conservation. This improved its ability to find previously unidentified features of interest, which might lack analogs in host organisms and therefore carry a reduced risk of toxicity as drug targets.
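To make the codon-frequency input mentioned above concrete, here is a minimal sketch of how such a list could be built. It assumes the sequences are already aligned, in-frame, and of equal length; the function name `codon_frequencies` and the toy sequences are illustrative and are not taken from the study.

```python
from collections import Counter


def codon_frequencies(aligned_orfs):
    """Return one Counter of codon counts per codon position (locus).

    Assumption: every sequence is in-frame and has the same length.
    """
    if not aligned_orfs:
        return []
    length = len(aligned_orfs[0])
    if length % 3 != 0 or any(len(s) != length for s in aligned_orfs):
        raise ValueError("sequences must be in-frame and of equal length")
    n_loci = length // 3
    tables = [Counter() for _ in range(n_loci)]
    for seq in aligned_orfs:
        for locus in range(n_loci):
            codon = seq[3 * locus:3 * locus + 3].upper()
            tables[locus][codon] += 1
    return tables


# Toy input: three sequences, three codon loci.
toy = ["ATGCTGAAA", "ATGCTAAAA", "ATGCTGAAA"]
freqs = codon_frequencies(toy)
```

The size of the resulting table depends on the number of loci rather than the number of genomes, which is consistent with the observation that adding sequences did not slow the analysis.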
About the study
In the present study, researchers first analyzed 5,121,523 SARS-CoV-2 genomes retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) database. Then, they analyzed 119 SARS-CoV genomic sequences, which helped them validate their findings in a related virus.
The team first analyzed the main open reading frames (ORFs) and then the sequences encoding the individual protein products of the 1a/1ab ORFs. They examined the 1a/1ab protein product sequences separately because the algorithm marked the first 4,100 nucleotides of the large ORFs, within the non-structural protein (nsp) 1/2/3 regions, as significantly conserved.
SARS-CoV and SARS-CoV-2 have low overall inter-sequence variability. So, the researchers made three key improvements to their investigation protocol. First, they applied a weighting to each gene locus, with weights proportional to the information on nucleotide conservation that a locus provides beyond that needed for amino acid conservation.
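One way to picture such a weighting is through codon degeneracy. The sketch below is an assumption for illustration, not the weighting formula used in the study: it scores a locus by the number of bits of codon choice left once the amino acid is fixed, so a methionine codon (a single codon) gets weight 0 and a leucine codon (six synonymous codons) gets a higher weight. The name `locus_weight` is hypothetical; the codon table is the standard genetic code.

```python
import math
from collections import Counter

# Standard genetic code, laid out in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
# Number of synonymous codons per amino acid (or stop signal).
DEGENERACY = Counter(CODON_TABLE.values())


def locus_weight(codon):
    """Hypothetical weight: bits of codon choice left once the amino acid is fixed."""
    return math.log2(DEGENERACY[CODON_TABLE[codon.upper()]])


# Met (1 codon), Leu (6 codons), Lys (2 codons).
weights = [locus_weight(c) for c in ("ATG", "CTG", "AAA")]
```

Under this assumed scheme, a conserved nucleotide sequence at a highly degenerate locus counts for more, because amino acid conservation alone would not have required it.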
They noted that data from emerging pathogenic microorganisms are highly skewed. High nucleotide conservation at most loci turns the few loci where a mutation has occurred into outliers, which disproportionately affect any analysis. Thus, secondly, they moved from a parametric test to a more appropriate non-parametric equivalent, i.e., one based on ranked data.
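The effect of ranking on skewed data can be shown with a small example. This is a generic rank transform with averaged ties, written as a sketch; the function name `average_ranks` and the toy variability values are assumptions and do not come from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value ties with the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks


# Toy per-locus variability: one mutated locus among fully conserved ones.
variability = [0.0, 0.0, 0.0, 0.0, 0.0, 0.42, 0.0, 0.0]
ranks = average_ranks(variability)
```

After ranking, the mutated locus is still the highest value, but its distance from the rest is bounded by the number of loci rather than by the raw magnitude of the outlier.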
Third, they adopted a hypothesis-testing framework to deal with a gene containing more than one conserved region. They tested a null hypothesis that the most conserved region arose from random mutation against the alternative hypothesis that the most constrained region is markedly more conserved than the background.
When they did not find the most conserved region in a sequence, the researchers re-ran the analysis after removing the next most conserved region, as it was likely causing interference. They marked regions found to be significant after a re-analysis because the false positive rate for such genomic regions might be slightly higher.
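The general idea of testing the most conserved region against a random-mutation null can be sketched with a permutation test. This is a simplified stand-in, not the study's procedure: it assumes a fixed window width (whereas the algorithm is described as scale-agnostic), and the names `most_conserved_window` and `permutation_p_value`, the toy scores, and the parameter values are all hypothetical.

```python
import random


def most_conserved_window(scores, width):
    """Return (start, mean) of the window with the lowest mean variability."""
    best_start, best_mean = 0, float("inf")
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean < best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


def permutation_p_value(scores, width, n_perm=1000, seed=0):
    """Estimate how often randomly placed variability yields a window
    at least as conserved as the observed best window."""
    rng = random.Random(seed)
    start, observed = most_conserved_window(scores, width)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # null: variability is placed at random
        if most_conserved_window(shuffled, width)[1] <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return start, (hits + 1) / (n_perm + 1)


# Toy data: 60 loci with background variability and a conserved block at 20-29.
scores = [(i * 7) % 5 + 1 for i in range(60)]
scores[20:30] = [0] * 10
start, p_value = permutation_p_value(scores, width=10)
```

A re-analysis in the spirit of the protocol above could be approximated by deleting one window's loci from `scores` and calling `permutation_p_value` again, keeping in mind that results from such reruns would deserve the same caution the authors describe.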
Finally, the researchers benchmarked the weighting and ranking data processes using housekeeping genes from Escherichia coli.
Results
Upon analyzing the SARS-CoV-2 nsp16 region, the authors found a set of conserved stem-loops, and RNAalifold analysis revealed that it coincided with the region associated with the packaging of RNAs into virus-like particles.
They generated folds of a series of conserved regions with slightly different lengths from the identified conserved region. One stem-loop in the middle of the larger conserved region remained consistently predicted. So, they postulated that this stem-loop (or one of the two adjacent stem-loops) is a candidate for the RNA packaging signal in this region.
The predicted fold of the 19,920–20,031 conserved region in the SARS-CoV nsp15 sequence also had a three-stem-loop structure.
In SARS-CoV-2, the algorithm also identified smaller regions forming the 5′ regions of “body” sequences or 3′ regions of “leader” sequences in subgenomic RNAs (sgRNAs). These regions worked as the transcription-regulatory sequences-body (TRS-B), leading to the RNA-dependent RNA polymerase (RdRp) pausing and switching to the 5′ TRS-leader during negative-strand synthesis. The conserved region identified within SARS-CoV-2 membrane (M) nucleotides 27,159–27,191 has been identified as an ORF in ribosomal profiling experiments by Finkel et al.
Note that the study protocol highlighted the presence of primer sites only within regions shorter than 250 nucleotides identified as conserved. So, a conserved region overlapping a primer binding site should be viewed as a possible contribution to observed conservation, not as a sole explanation.
Conclusion
To summarize, the researchers presented a method for in silico analysis of biological genomes, in this case, large genome datasets of two RNA viruses, SARS-CoV-2 and SARS-CoV.
The methodology did not elucidate the molecular explanation for observed conservation but highlighted genomic regions that require further investigation.
Nonetheless, it could serve as a broad guide for how to proceed: understanding the role of heavily conserved genomic regions in an organism’s life cycle and, ultimately, finding drugs to disrupt these roles.
For newly emerging, fatal viral pathogens like SARS-CoV-2, obtaining information on contiguous regions of relatively conserved nucleic acid is key to developing new treatments.