As the COVID-19 pandemic spreads around the world, scientists are still trying to understand the complexities of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, and there is a long way to go in understanding its genomic content. Now, a new study by researchers at the Massachusetts Institute of Technology and the Center for Computational Biology at the Flatiron Institute, published online on the bioRxiv* preprint server, describes how comparative genomics helps to identify protein-coding and non-coding functional genes.
Novel Coronavirus SARS-CoV-2. This scanning electron microscope image shows SARS-CoV-2 (yellow), also known as 2019-nCoV, the virus that causes COVID-19, isolated from a patient in the U.S. and emerging from the surface of cells (pink) cultured in the lab. Image captured and colorized at NIAID's Rocky Mountain Laboratories (RML) in Hamilton, Montana. Credit: NIAID
Reading the Viral Genome
Over two-thirds of the SARS-CoV-2 genome comprises a large open reading frame (ORF) called ORF1ab, parts of which are conserved among coronaviruses. It is translated into a large precursor protein that is then cleaved into the non-structural proteins (nsp) nsp1-nsp10 and nsp12-nsp16.
This segment contains a frameshift that the ribosome must make during translation. When the frameshift fails to occur, translation terminates four codons later, at the end of ORF1a, and the result is a different, shorter precursor protein that is cleaved into nsp1-nsp11. ORF1ab encodes several mature proteins, including an RNA-dependent RNA polymerase (Pol), a helicase (Hel), and proteins required for transcription, cleavage, and viral assembly, as well as proteins that suppress the host cell response and the immune response.
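The frameshift described above can be illustrated with a minimal Python sketch. The short sequence is a toy example made up for illustration, not a verified excerpt of the viral genome, and the function names and the slip position are assumptions of this sketch. It shows how reading straight through hits a stop codon four codons after the slip site, while stepping back one nucleotide lets translation continue.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start=0):
    """Translate codon by codon from `start` until a stop codon or the end."""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def translate_with_frameshift(seq, slip_index):
    """Read in frame up to `slip_index` (a multiple of 3), then step back
    one nucleotide (a -1 frameshift) and keep translating."""
    upstream = translate(seq[:slip_index])
    downstream = translate(seq, start=slip_index - 1)
    return upstream + downstream


# Toy sequence (hypothetical): the slip happens after the fourth codon.
toy = "ATGTTTTTAAACGGGTTTGCGGTGTAAGTGCAGCC"

no_shift = translate(toy)                        # stops at TAA
with_shift = translate_with_frameshift(toy, 12)  # reads past it

print(no_shift)    # MFLNGFAV
print(with_shift)  # MFLNRVCGVSAA
```

In this toy case, both products share the same first four residues and diverge only after the slip site, which is the pattern the article describes for the shorter and longer precursor proteins.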
Viral RNA is translated in the human host cell by the human translation machinery, which translates the first ORF directly. Getting at the genes in the remaining one-third of the genome is more complicated. The virus first performs RNA-dependent positive-to-negative transcription from the 3’ end up to a transcription-regulatory sequence (TRS), and then from the 5’ end, generating a subgenomic transcript. This is followed by RNA-dependent negative-to-positive transcription as a second step.
Genomic Annotation – What is Known
To understand how an organism functions, it is important to annotate the genome correctly for protein-coding segments. This will help predict how variants affect the phenotype by showing how they change the amino acid sequence first of all.
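As a rough illustration of why correct protein-coding annotation matters, the Python sketch below classifies a single-nucleotide change by its effect on the encoded amino acid. The function name `classify_snv` and the toy coding sequence are assumptions made for this example. The sketch only works if the reading frame is known, which is exactly what annotation supplies.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_snv(cds, pos, alt):
    """Classify a substitution at 0-based position `pos` of an in-frame
    coding sequence `cds` as synonymous, missense, or nonsense."""
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"   # amino acid unchanged
    elif alt_aa == "*":
        kind = "nonsense"     # premature stop codon
    else:
        kind = "missense"     # amino acid changed
    return kind, ref_aa, alt_aa


# Toy coding sequence (hypothetical): ATG GCT TGG AAA encodes M A W K.
cds = "ATGGCTTGGAAA"

print(classify_snv(cds, 5, "C"))  # ('synonymous', 'A', 'A')
print(classify_snv(cds, 3, "A"))  # ('missense', 'A', 'T')
print(classify_snv(cds, 8, "A"))  # ('nonsense', 'W', '*')
```

The same nucleotide change would be read in a different codon, or not as coding at all, if the frame or the gene boundaries were annotated differently.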
The last third of the genome contains the genes for the spike, envelope, and membrane proteins, on ORF2, ORF4, and ORF5, respectively. These drive viral assembly. The nucleocapsid protein then packages the viral RNA.
The rest of the ORFs are poorly characterized, and their annotation rests chiefly on gene homology and prediction algorithms, leading to considerable disagreement as to which of them encode functional proteins. Experimental techniques that clearly identify which genomic locations are transcribed, and which protein products they yield, are badly needed to understand the virus better.
Over 1,800 mutations and gene variants have been identified in the current pandemic, but it is not clear which of them are functional.
How the Study Was Done
The current study aims to address these three challenges by using comparative genomics to conduct a systematic analysis. This should help identify which of the still-uncharacterized ORFs encode functional proteins and which genetic variants have functional and therapeutic importance.
The study included 44 complete genomes from closely related coronaviruses, which were aligned genome-wide to cover all the known genes and putative ORFs. This helped the researchers to classify the 1,800 single nucleotide variants (SNVs) of unknown effect into those that are probably benign and those likely to harm conserved gene functions.
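The general idea of using an alignment to sort variants can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the study's actual pipeline: the four-sequence toy alignment, the function names, the labels, and the rule "a variant at a perfectly conserved column is possibly harmful" are all assumptions of this example.

```python
from collections import Counter


def column_conservation(alignment):
    """For each alignment column, return the fraction of sequences that
    share the most common residue (1.0 means perfectly conserved)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def classify_variants(alignment, variant_positions, threshold=1.0):
    """Label each variant position by how conserved its column is."""
    scores = column_conservation(alignment)
    return {
        pos: "possibly harmful" if scores[pos] >= threshold else "probably benign"
        for pos in variant_positions
    }


# Toy protein alignment (hypothetical), one row per related virus.
alignment = [
    "MKTLV",
    "MKSLV",
    "MRTLV",
    "MKTIV",
]

print(column_conservation(alignment))  # [1.0, 0.75, 0.75, 0.75, 1.0]
print(classify_variants(alignment, [0, 2, 4]))
# {0: 'possibly harmful', 2: 'probably benign', 4: 'possibly harmful'}
```

A real analysis would also need to account for how closely related the genomes are, since columns can look conserved simply because the sequences have had little time to diverge.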
The researchers found that ORFs 3a, 6, 7a, 7b, and 8 are conserved functional regions that code for protein. ORF 10 does not code for protein but probably serves important functions nevertheless. ORF 14 probably does not code for a functional protein.
Functional and Medical Significance of the Findings
One important finding was that many of the spike protein gene variants that have arisen recently, as the virus spread more widely, disrupt perfectly conserved amino acids. Several of these variants have been identified as possibly promoting increased transmission or increased viral load. The researchers hypothesize that this could be how the virus adapted to the human host.
Another finding was a 20-amino-acid region in the nucleocapsid protein that displays many variants at amino acids conserved throughout the sarbecovirus clade. These variants could help explain how the virus has adapted to the human host.
The study exposed some limitations of current experimental approaches, which may capture only the currently existing transcripts and not the pattern of changes in the genome over time due to exposure to a variety of hosts in the past. The comparative genomics techniques, though used here only to classify SNVs, should be useful for other types of variants as well, helping to clarify genotype-phenotype linkages.
Finally, the researchers call for further work to identify the functions of still-unnamed genes and the effects of different variants. They say this might “lead to the identification of weaknesses of the virus.” They conclude: “These comparative genomics annotations provide a general resource for prioritizing functional variants and strains, for vaccine development and specialization, and for untangling the molecular biology of SARS-CoV-2.”
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.