Researchers updated a previous version of an automated tool to include severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome analysis. Genome analysis using the freely available software could help track the evolution of the virus and promptly identify variants that increase viral transmission or virulence.
The SARS-CoV-2 pathogen has been mutating since it was first discovered in late 2019. Some of these mutations have increased virus fitness, which may affect COVID-19 disease outcomes and transmissibility and could, in turn, reduce the efficacy of current vaccines. Thus, thorough and continual sequencing of as many genomes as possible across the world will be crucial to keeping on top of the pandemic.
There are some national initiatives for active SARS-CoV-2 genome surveillance, such as the COVID-19 Genomics UK Consortium in the United Kingdom and the Indian SARS-CoV-2 Genomics Consortium, which are tasked with identifying new variants. Any increase in cases caused by a new variant will require immediate action to contain the spread, which in turn requires automated methods to analyze and identify new strains.
In a paper posted on the bioRxiv* preprint server, researchers report the second generation of a computational tool, Infectious Pathogen Detector (IPD), which determines the abundance and mutations of SARS-CoV-2 and now has an expanded variant database and a revised clade assessment.
Finding frequency of mutations
The authors analyzed 200,865 SARS-CoV-2 genome sequences from 155 countries, which carried 2.58 million mutations relative to the reference Wuhan strain as of 28 December 2020. About 39% were synonymous mutations, which change the nucleotide sequence but not the encoded amino acid. About 51% were non-synonymous mutations, which change the amino acid. About 9% of the mutations were in intergenic regions, including the 5’ and 3’ untranslated regions (UTRs). Among the non-synonymous mutations, about half were missense mutations, in which a single-nucleotide change replaces one amino acid with another.
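To make these categories concrete, the following Python sketch classifies a single-codon change using the standard genetic code. It is only an illustration: the function name classify_substitution and the example codons are assumptions chosen for this sketch and are not taken from IPD.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change (hypothetical helper, not part of IPD)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"  # same amino acid before and after
    if alt_aa == "*":
        return "nonsense"    # change introduces a stop codon
    # Stop-loss changes are not treated separately in this sketch.
    return "missense"        # one amino acid replaced by another


# Illustrative codons: phenylalanine stays phenylalanine,
# aspartate (D) becomes glycine (G).
print(classify_substitution("TTC", "TTT"))
print(classify_substitution("GAT", "GGT"))
```

The second call mirrors the kind of aspartate-to-glycine change behind a mutation such as D614G; the exact codons are used here only as an example.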
The researchers noted 13 hotspot mutations that each occurred in more than 40,000 samples. The most frequent synonymous mutation occurred 186,189 times in the NSP3 gene, followed by a mutation in the RNA-dependent RNA polymerase gene that occurred 185,945 times. The non-synonymous mutations D614G and A222V occurred 176,436 and 47,971 times, respectively, in the spike protein (S) gene. The next most frequent mutation was a two-amino-acid change, R/G203K/R. The A220V mutation in the N gene occurred 48,426 times and is described as the third most frequent mutation.
The D614G mutation causes higher viral loads in the respiratory tract but does not alter disease severity. The team did not find the other spike protein mutations N439K, S477N, E484K, and N501Y at a significant frequency. The 13 most frequent mutations include five synonymous mutations that likely affect mRNA splicing, selection on codon usage bias, mRNA stability and folding, translation, or co-translational protein folding.
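The hotspot count described here amounts to tallying how many samples carry each mutation and keeping those above a cutoff. A minimal Python sketch of that idea follows; the function find_hotspots and the toy sample lists are hypothetical, and the cutoff is scaled down from the 40,000 samples used in the study.

```python
from collections import Counter


def find_hotspots(samples, min_samples):
    """Return mutations seen in more than min_samples samples.

    Hypothetical helper for illustration; not the IPD implementation.
    """
    counts = Counter()
    for mutations in samples:
        # Count each mutation at most once per sample.
        counts.update(set(mutations))
    return {m: n for m, n in counts.items() if n > min_samples}


# Toy data: each inner list is the mutation calls for one sample.
toy_samples = [
    ["D614G", "A222V"],
    ["D614G"],
    ["D614G", "A220V"],
    ["A222V", "D614G", "D614G"],  # duplicate call within one sample
]

hotspots = find_hotspots(toy_samples, min_samples=1)
print(hotspots)
```

Counting per sample rather than per call keeps a duplicated call in one genome from inflating the frequency.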
Upon further analyzing the data, the team found that the S, N, M, ORF7a, and ORF10 genes, which together make up about 21% of the genome, account for 54.36% of all the SARS-CoV-2 non-synonymous mutations. The S and M genes have the smallest proportion of total variable bases in the virus genome, suggesting a strong positive selection of non-synonymous mutations in these genes.
Among the other new variants of the SARS-CoV-2 virus, the B.1.1.7 variant from the United Kingdom had 32 mutations, the B.1.351 variant from South Africa had 25 mutations, and the Brazilian P.1 variant had 25 mutations.
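One way to read these two percentages is as an enrichment ratio: the share of non-synonymous mutations divided by the share of the genome these genes occupy. The short Python calculation below uses only the figures quoted above; treating "about 21%" as exactly 0.21 is an assumption, so the result is approximate.

```python
# Figures quoted in the article; 0.21 is an approximation ("about 21%").
genome_share = 0.21       # fraction of the genome covered by S, N, M, ORF7a, ORF10
mutation_share = 0.5436   # fraction of non-synonymous mutations in those genes

# Ratio above 1 means the genes carry more mutations than their length predicts.
enrichment = mutation_share / genome_share
print(f"{enrichment:.2f}x")  # prints 2.59x
```

In other words, these genes carry roughly two and a half times as many non-synonymous mutations as their combined length alone would predict.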
Tool for genome surveillance
Upon comparing the mutations predominant in the three new variants with those from India, the authors found four common hotspot mutations, including D614G. N501Y was the base mutation in all three variants, and the South African and Brazilian variants carried an additional E484K mutation in the spike protein.
Neither of these two mutations was seen in the Indian samples, and only two out of 3,361 Indian samples showed the S477N mutation. It is unknown whether the absence of these mutations, which increase binding affinity to the human angiotensin-converting enzyme 2 (ACE2) receptor, could account for the lower transmission in India compared to the UK, Brazil, and South Africa.
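The comparison described in the two paragraphs above is essentially a set operation over mutation lists. The Python sketch below illustrates it with toy sets that contain only mutations named in this article; the real variant profiles are much larger, and the variable names are hypothetical.

```python
# Toy mutation sets limited to mutations named in the article (assumption:
# the full profiles of each variant contain many more entries).
variant_mutations = {
    "UK (B.1.1.7)": {"D614G", "N501Y"},
    "South Africa (B.1.351)": {"D614G", "N501Y", "E484K"},
    "Brazil (P.1)": {"D614G", "N501Y", "E484K"},
}
indian_mutations = {"D614G"}

# Mutations present in every one of the three variants.
shared_by_variants = set.intersection(*variant_mutations.values())

# Of those, which are also seen in the Indian samples and which are absent.
shared_with_india = shared_by_variants & indian_mutations
missing_in_india = shared_by_variants - indian_mutations

print(sorted(shared_by_variants))
print(sorted(shared_with_india))
print(sorted(missing_in_india))
```

With these toy sets, D614G is common to all four groups, while N501Y is shared by the three variants but absent from the Indian set.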
Clade analysis revealed 20E, 20B, and 20A to be the most dominant clades. All the variant and clade information resulting from the analysis was included in the database for the second generation of IPD. The team found that IPD 2.0 assigned clades with high accuracy when tested on a simulated sequence dataset generated from the genomes of different clades.
The updated variant database and the clade assessment module enable quantification and phylogenetic assessment of the SARS-CoV-2 genome. The authors write, “This makes IPD 2.0 a pertinent tool for analysis of diverse SARS-CoV-2 sequence datasets and facilitate genomic surveillance to identify variants involved in breakthrough infections.”
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.