In a recent study posted to the medRxiv* preprint server, researchers developed a computational pipeline for the early identification of emerging severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest (VOI) by analyzing SARS-CoV-2 genome data and allocating risk scores on the basis of functional and epidemiological parameters.
The continual emergence of SARS-CoV-2 variants with enhanced immune evasion, transmissibility, and replication makes it necessary to monitor the genomic evolution of the virus. Early detection of SARS-CoV-2 VOIs could enable the prioritization of variants for experimental evaluation and risk assessment and could inform the public health response to SARS-CoV-2.
About the study
In the present study, researchers developed a computational heuristic framework to rapidly detect novel emerging SARS-CoV-2 VOIs and prioritize them for wet-lab experiments.
Genomic data for the variants were obtained from the Global Initiative on Sharing All Influenza Data (GISAID), GenBank, and the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) databases. The sequences were processed to identify high-priority VOIs for wet-lab experimentation. Variants were prioritized by their epidemiological dynamics and functional characteristics, quantified as sequence prevalence scores, functional impact scores, and composite scores.
The framework ranked variant constellations (or covariates) to determine which mutational combinations should be evaluated, and the Omicron variant was used to validate the computational approach. Genomes were aligned pairwise with the reference (Wuhan-Hu-1 strain) genome, and variant constellations were extracted mainly for the SARS-CoV-2 spike (S) protein. Variants were categorized into geographic and temporal groups, and variant constellation counts and total isolate counts by date and region were used to compute spatiotemporal epidemiological dynamics, viz. the variants' monthly growth rates and prevalence rates.
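The prevalence and growth computation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `monthly_dynamics`, the integer month index, and the definition of growth as the ratio of this month's prevalence to the previous month's are all assumptions made for illustration.

```python
def monthly_dynamics(constellation_counts, total_counts):
    """Return {(region, month): (prevalence, growth)} for one constellation.

    Both inputs map (region, month_index) -> isolate count. month_index is
    an integer so that month_index - 1 is the previous month (an assumption
    of this sketch, not something the article specifies).
    """
    # Prevalence: share of all isolates in a region/month carrying the constellation.
    prevalence = {}
    for key, total in total_counts.items():
        if total > 0:
            prevalence[key] = constellation_counts.get(key, 0) / total

    dynamics = {}
    for (region, month), current in prevalence.items():
        previous = prevalence.get((region, month - 1))
        # Growth is left undefined (None) when the previous month is missing or zero.
        growth = current / previous if previous else None
        dynamics[(region, month)] = (current, growth)
    return dynamics
```

For example, a constellation seen in 10 of 1,000 isolates in one month and 40 of 1,000 in the next would have a prevalence of 4% and roughly four-fold growth in the second month.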
Sequence prevalence scores were calculated from GISAID data of November 2021 (Omicron dominance period) for the three most recent months. These scores supplied the epidemiological parameters for the scoring-heuristics component of the pipeline, which detects SARS-CoV-2 lineages that may raise concerns. Each country and month combination with >5% sequence prevalence or a more than five-fold increase in growth rate from the previous month was assigned a score of 1. The scores were summed across all country/month combinations to obtain the final sequence prevalence score.
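The scoring rule above can be sketched in a few lines of Python. The thresholds come from the text; the function name, the input layout, and the treatment of a missing growth value as not meeting the growth criterion are assumptions of this sketch.

```python
PREVALENCE_THRESHOLD = 0.05  # >5% sequence prevalence
GROWTH_THRESHOLD = 5.0       # more than five-fold growth versus the previous month


def sequence_prevalence_score(dynamics):
    """Sum one point per country/month combination that meets either criterion.

    dynamics maps (country, month) -> (prevalence, growth), where growth may
    be None if no previous-month value exists (hypothetical input layout).
    """
    score = 0
    for prevalence, growth in dynamics.values():
        high_prevalence = prevalence > PREVALENCE_THRESHOLD
        fast_growth = growth is not None and growth > GROWTH_THRESHOLD
        if high_prevalence or fast_growth:
            score += 1
    return score
```

A combination that meets both criteria still contributes a single point here, which follows the "score 1" wording of the text.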
Functional impact scores (FIS) were derived from the positional overlap of mutations with SARS-CoV-2 S regions and by summing the sequence features of concern (SFoC) scores. SFoC scores were calculated based on variant impact on replication, immune evasion, or binding to angiotensin-converting enzyme 2 (ACE2) receptors or monoclonal antibodies, and on variant neutralization by vaccination or previous infection. Composite scores (CS) were calculated by summing the sequence prevalence scores (SPS) and functional impact scores. Emerging lineage scores were calculated from GISAID data between December 2021 and January 2022 by summing the scores of lineages with growth rates >15.
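As a rough illustration of how an overlap-based FIS and the composite score could be computed, consider the Python sketch below. The region table is hypothetical: the position ranges are those the article reports for the spike protein, but the per-region score of 1 is a placeholder, and counting each overlapped region once is an assumption rather than the authors' stated method.

```python
# Hypothetical SFoC table: (start, end, score) in spike coordinates.
# The scores are placeholders, not values from the study.
SFOC_REGIONS = [
    (14, 20, 1),    # NTD sites 14 to 20
    (140, 158, 1),  # NTD sites 140 to 158
    (245, 264, 1),  # NTD sites 245 to 264
    (501, 501, 1),  # RBD site 501
    (614, 614, 1),  # site 614
    (671, 692, 1),  # furin cleavage region
]


def functional_impact_score(mutated_positions, regions=SFOC_REGIONS):
    """Sum the scores of every SFoC region hit by at least one mutation."""
    positions = set(mutated_positions)
    return sum(
        score
        for start, end, score in regions
        if any(start <= p <= end for p in positions)
    )


def composite_score(sps, fis):
    """CS = SPS + FIS, as described in the text."""
    return sps + fis
```

With these placeholder scores, a constellation mutated at positions 501 and 614 would get an FIS of 2, and with an SPS of 3 its composite score would be 5.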
The team identified 75 regions on the SARS-CoV-2 S receptor-binding domain (RBD) that significantly impacted the binding of ≤4 antibodies and 36 regions with a significant impact on the binding of vaccine or convalescent sera antibodies. Twelve sites with ≥1 mutation exceeding the threshold (>0.1) were identified as indicative of enhanced ACE2 affinity, of which site 501 was a site of multiple conformational changes in SARS-CoV-2 S RBD binding interactions with ACE2.
Important sites for adaptive immune responses and SARS-CoV-2 tropism were N-terminal domain (NTD) sites 14 to 20, 140 to 158, and 245 to 264, site 614 of SARS-CoV-2 S, and sites 671 to 692 at the furin cleavage site. Epidemiological data for Omicron showed low SPS but considerably high FIS and, as a result, high CS values. CS could also quantify slight differences between covariates of a single clade. BA.1 was the predominant Omicron lineage in December 2021 and showed the highest emerging lineage score.
By January 2022, Omicron lineages such as BA.1, BA.1.1, and BA.2 had evolved with multiple covariates. The BA.2 variant constellation matched that of Omicron BA.1 apart from multiple unique mutational sites. Mutant BA.1 (with the R346K mutation) exhibited higher functional impact scores than Omicron BA.1. In contrast, many covariates had sequence prevalence scores of 0, indicating that their growth changes posed no significant threat.
Before January 2022, the N440K, G446S, L24-, R346K, A701V, and L452R mutations appeared sporadically. Plots of mutation dynamics showed that the G446S and R346K mutations were less prevalent, whereas L24- was concomitantly more prevalent. This finding indicated a fitness advantage for variants containing L24- and could aid in distinguishing between BA.2 and BA.1.
Overall, the study findings highlighted a novel computational spatiotemporal framework for the early detection of SARS-CoV-2 variants based on their sequence prevalence, mutation prevalence, and mutational impacts on SARS-CoV-2 functions such as binding to ACE2 receptors. Framework development faced a few challenges, such as fluctuating ambiguity in sequence data during the emergence of the Delta and Omicron variants, accurate quantification of data for computation, and the analysis of data that are enormous and continually increasing.
medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.