The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which emerged in Wuhan, the capital city of Hubei province, China, has spread rapidly to more than 187 countries and territories across the world, creating global panic. It has now infected over 3.65 million people and caused over 256,000 deaths.
The virus genome has been a focus of intensive study ever since the outbreak began in order to develop diagnostic, therapeutic, and vaccine applications.
Now, a new study published on the preprint server bioRxiv reports on a large scale analysis of SARS-CoV-2 genomes and reveals a clonal geo-distribution and rich genetic variations.
Novel Coronavirus SARS-CoV-2: Colorized scanning electron micrograph of a VERO E6 cell (purple) exhibiting elongated cell projections and signs of apoptosis, after infection with SARS-CoV-2 virus particles (pink), which were isolated from a patient sample. Image captured at the NIAID Integrated Research Facility (IRF) in Fort Detrick, Maryland. Credit: NIAID
The SARS-CoV-2 coronavirus is an enveloped, positive-sense, single-stranded RNA virus and a member of the large coronavirus family, which has been classified into three groups. Two of them are responsible for infections in mammals and include viruses such as Bat SARS-like coronavirus and Middle East respiratory syndrome coronavirus (MERS-CoV). Many recent studies have suggested that SARS-CoV-2 diverged from Bat SARS-like coronavirus.
The SARS-CoV-2 genome is approximately 30 kb in size, and its structure follows the known gene organization of coronaviruses. The ORF1ab polyprotein, also known as the replicase polyprotein, covers more than two-thirds of the total genome; the rest includes the genes for the structural proteins, namely the spike, membrane, envelope, and nucleocapsid proteins.
Characterizing viral mutations can help discover the mechanisms of disease, immune evasion, and antiviral resistance. It can also help trace how the virus spreads and diverges into different types.
Earlier, a study of 103 genomes showed the presence of two main types, type L and type S, with the latter being nearer to the original strain. Another study of 32 strains from China, Thailand, and the US revealed growing genomic diversity over time.
How was the study done?
The present study looked at over 3,000 strains of SARS-CoV-2 to track the accumulation of mutations over time. The researchers also analyzed the data for selective pressure, both negative and positive, to find out which residues could be used to design treatment targets. The comparative genomic analysis of SARS-CoV-2 was used to create a database for further research.
The genetic sequences were collected from the GISAID and NCBI data banks, using only complete genomes, from 59 countries. The most frequent source was America, followed by England, Iceland, and China. All the strains were from the first 3 months of the outbreak, the majority being from March.
The first task was to prepare a profile of non-synonymous mutations and to find their relative frequency in each population. The non-synonymous mutations were then analyzed separately.
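The distinction between synonymous and non-synonymous mutations can be illustrated with a minimal Python sketch. It is not the study's pipeline: the function name `classify_substitution` is hypothetical, and the code simply translates a reference and a mutated codon with the standard genetic code and compares the resulting amino acids. The example codons are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Return (kind, ref_aa, alt_aa) for a codon change (hypothetical helper).

    kind is "synonymous" when both codons encode the same amino acid,
    and "non-synonymous" otherwise. RNA codons (with U) are accepted.
    """
    ref_aa = CODON_TABLE[ref_codon.upper().replace("U", "T")]
    alt_aa = CODON_TABLE[alt_codon.upper().replace("U", "T")]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Illustrative codons: an aspartate-to-glycine change and a silent change.
aspartate_to_glycine = classify_substitution("GAT", "GGT")
silent_change = classify_substitution("TTT", "TTC")
```

A change such as D614G falls in the first category because the encoded residue changes, whereas a label such as F924F denotes a substitution that leaves the amino acid unchanged.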
What did the study find?
The researchers found that there were over 700 mutations, of which almost two-thirds resulted in a change in the amino acid sequence of the protein. The rest were in the intergenic regions. There were 39 non-synonymous mutations with a prevalence of more than about 0.6%, that is, present in at least 20 of the more than 3,000 analyzed genomes.
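A prevalence cutoff of this kind can be sketched in a few lines of Python. The following is a minimal illustration under stated assumptions, not the authors' code: the function name `frequent_mutations`, the input layout (a dictionary mapping a genome identifier to its list of mutation labels), and the toy data are all hypothetical.

```python
from collections import Counter


def frequent_mutations(genome_mutations, min_genomes=20):
    """Keep mutations seen in at least `min_genomes` genomes.

    genome_mutations: dict mapping genome id -> iterable of mutation labels
    (assumed layout). Returns {mutation: (genome_count, prevalence)}.
    """
    counts = Counter()
    for mutations in genome_mutations.values():
        # set() ensures a mutation is counted once per genome.
        counts.update(set(mutations))
    total = len(genome_mutations)
    return {
        mutation: (n, n / total)
        for mutation, n in counts.items()
        if n >= min_genomes
    }


# Toy data: 100 made-up genomes, 30 carrying one mutation and 5 another.
toy = {f"genome_{i}": (["S:D614G"] if i < 30 else []) for i in range(100)}
for i in range(5):
    toy[f"genome_{i}"].append("ORF8:L84S")

frequent = frequent_mutations(toy, min_genomes=20)
```

Counting each mutation once per genome keeps the prevalence interpretable as a fraction of genomes rather than of sequence positions.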
These mutations were found in 6 genes, namely, replicase polyprotein (ORF1ab), spike protein, membrane glycoprotein, nucleocapsid phosphoprotein, ORF3, and ORF8. The most significant number of non-synonymous mutations was in the ORF1ab gene, which encodes 16 non-structural proteins.
Among these, NSP3, NSP12, and NSP2 have a high number of mutations, numbering 117, 61, and 61, respectively. The gene itself displays over half of the frequent mutations, with 22 mutations in the RNA-dependent RNA polymerase, helicase, proteinase, endoribonuclease, exonuclease, and transmembrane domains. Replication errors must be corrected rapidly and accurately, and both NSP2 and NSP3 are required for this to happen.
There were ten hotspot mutations at hypervariable domains, found at a frequency of over 0.10. One especially frequent mutation was D614G, within the gene encoding the spike protein, found in 44% of genomes. Another major hotspot mutation was L84S in ORF8, found in 32%. Four of the hotspot mutations were in the ORF1ab gene, each present in 11% to 17% of the genomes.
Mapping the geolocations
Only about 100 of the genomes analyzed were wildtype, and most of these were of Chinese origin. Mutant genomes, by contrast, came from all over the world and accounted for almost 3,000 strains with varying genotypes.
The highest number of mutations was found in the USA, with 316. These included US-specific singleton mutations (occurring only once in a population), which made up a quarter of all the mutations, while Chinese mutations accounted for half this number. Almost every American genome had one or more of seven mutations.
Singleton mutations arise when a single strain diverges from the original strain under environmental, host, and serial passage factors, through inaccuracies introduced by the viral RNA-dependent RNA polymerase during replication.
Among the 59 countries that contributed to mutant genomes, 26 had singleton mutations. Most of the genomes had multiple mutations.
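Singleton mutations can be identified with a simple count. The sketch below is an assumption-laden illustration rather than the study's method: it treats a singleton as a mutation seen in exactly one genome across the whole dataset, and the function name `singleton_mutations_by_country` and the record layout are hypothetical.

```python
from collections import Counter, defaultdict


def singleton_mutations_by_country(records):
    """Group dataset-wide singleton mutations by country of origin.

    records: iterable of (country, mutations) pairs, one per genome
    (assumed layout). A singleton is taken here to be a mutation that
    appears in exactly one genome in the whole dataset.
    """
    counts = Counter()
    country_of = {}
    for country, mutations in records:
        for mutation in set(mutations):
            counts[mutation] += 1
            # Only meaningful for singletons, which have a single source.
            country_of[mutation] = country

    singletons = defaultdict(list)
    for mutation, n in counts.items():
        if n == 1:
            singletons[country_of[mutation]].append(mutation)
    return {country: sorted(muts) for country, muts in singletons.items()}


# Toy records with made-up mutation labels.
toy_records = [
    ("USA", ["m1", "m2"]),
    ("USA", ["m1", "m3"]),
    ("China", ["m4"]),
    ("China", ["m1"]),
]
by_country = singleton_mutations_by_country(toy_records)
```

The number of countries with at least one singleton is then `len(by_country)`, which is the kind of tally reported above.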
Three of these mutations, namely G251V (in ORF3a), L84S (in ORF8), and S5932F (in ORF1ab), were found on every continent except Africa and Australia. On the other hand, three others, F924F and L4715L (in ORF1ab) and D614G (in spike), as well as an intergenic variant, were present in all except Asian strains.
Common mutations were also observed between Algerian and European strains, and between European and Dutch genomes, which showed ten recurrent mutations. African and Australian genomes shared mutations at four positions, and shared mutations at two positions with Asian genomes.
The most significant variability was seen in Australia, New Zealand, and the US.
Tracking mutations over time
The researchers saw a constant rate of accumulation of mutations over time, but the strains collected last showed a small increase compared to the rest. On the other hand, more mutations appeared at the end of January and in early April. The mutations with the highest frequency were seen in late February for the first time.
When the mutations were used to align the viral strains phylogenetically, three clades were distinguished, with several closely related strains being found in different countries. This can be used to identify how and when viral transfers occurred, as well as the routes of spread. The phylogenetic tree also shows that the virus reached the US by multiple routes multiple times, with the first introduced genome being similar to the strain that caused the second wave of cases in China.
The researchers found that the ORF1ab gene was subject to selective pressure due to the high rate of mutations. The spike protein gene also showed the same phenomenon. In both cases, purifying selection was apparent, as indicated by the analysis.
There were 8 sites with negative selection pressure and 3 with positive selection pressure in the ORF1ab gene. In the spike gene, there were 7 sites under negative and 1 site under positive selective pressure.
Modeling shows a single negatively selected site on the receptor-binding domain, indicating a lack of strong selective pressure on this part of the genome.
Analyzing genome variation within and between species
The researchers built a pan-genome from the almost 1,200 protein sets encoded in the publicly available 115 genomes on the NCBI website. Of these, 83 genomes belonged to the SARS-CoV-2.
There were 94 clusters of proteins, of which ten were shared between the SARS-CoV-2 and three other beta coronaviruses – the SARS-CoV and two bat CoV.
How are mutations important?
Mutations generate variation in the genome, allowing viruses to evade host defenses and antiviral drugs. SARS-CoV-2 is relatively slow to mutate, which may make it easier to develop effective vaccines.
Mutations in the endosome-associated-protein-like domain of the NSP2 protein may make the novel coronavirus more easily transmissible than earlier epidemic viruses from this virus family.
The frequency of recurrent, non-synonymous mutations in the non-structural proteins NSP12 to NSP15, which are essential for correcting viral replication errors, may make it difficult to develop vaccines based on these genes, even though they are potential targets.
In most situations, genomic variation increases viral spread and the ability to cause disease, due to the accumulation of mutations that increase the virulence of the virus. Spike mutations may present as changes in pathogenicity, with the V367F mutation, for instance, enhancing the affinity of the protein for the ACE2 receptor.
Moreover, the study of the genomic variation among strains allows the occurrence of the mutation over time and place to be visualized. The current findings, for instance, show that the distribution of single nucleotide polymorphisms (SNPs) is not random, but dominates in those genes that are essential for the virus.
Co-occurring mutations are also common. The ‘founder mutation’ that arose in the US gave rise to multiple singleton mutations. On the other hand, many specific mutations are found in the strains circulating in Spain, Italy, and the US, accounting for the high rate of rapid spread and the severity of illness.
The negative selection site at the Mac1 domain on NSP3 is not essential for RNA replication but may be required for immune evasion. It could also be involved in viral replication in the presence of a host influence.
Mutations at negatively selected sites could be a drag on viral functioning, which makes these sites useful in drug or vaccine design, since they are more likely to be conserved and hence persist unchanged.
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.