Improved diagnostics are important both in containing the ongoing coronavirus disease 2019 (COVID-19) pandemic and in understanding the biology of the causative pathogen, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
An interesting new study by researchers at the University of Liverpool, University of Bristol and Public Health England in the UK describes a novel bioinformatics tool used to detect subgenomic messenger ribonucleic acid (mRNA) sequences from the virus. This may help develop new diagnostic tools and model the transmission of the virus.
Many studies have reported the results of viral sequencing, and published databases are available for reference. These can be further analyzed using bioinformatics to explore viral biology.
The researchers released their findings on the bioRxiv* preprint server.
Leader-TRS sequences in viral RNA
The viral RNA genome can be divided into two regions: two-thirds at the 5’ end and the remaining one-third at the 3’ end. The former encodes the non-structural proteins, many of which are key to the synthesis of viral RNA.
The latter encodes the structural and accessory proteins and is expressed as a set of subgenomic mRNAs (sgmRNAs), nested one within the other. They share the same 5’ and 3’ ends as the coronavirus genome and carry a leader sequence. The sgmRNA nearest the 3’ end, which encodes the nucleoprotein, is more abundant than the sgmRNAs located immediately after open reading frame (ORF) 1ab, and also more abundant than the genome itself.
The synthesis of the sgmRNAs is via a transcriptional regulatory sequence (TRS) found immediately adjacent to the 5’ leader sequence. Such TRSs are found along the genome, proximal to the start codons of the ORFs, containing a short core ACGAAC motif with flanking sequences.
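As a rough illustration of what locating TRS core motifs means in practice, the following minimal Python sketch scans a nucleotide string for the ACGAAC core named above. The function name and the toy sequence are hypothetical and invented for this example; the sketch ignores the flanking sequences and is not how LeTRS itself works.

```python
def find_trs_core_sites(sequence, core="ACGAAC"):
    """Return 0-based start positions of every occurrence of the core motif."""
    # Normalise case and accept RNA (U) as well as cDNA (T) notation.
    sequence = sequence.upper().replace("U", "T")
    positions = []
    start = sequence.find(core)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping matches are not missed.
        start = sequence.find(core, start + 1)
    return positions


# Toy sequence made up for illustration; it is not a real genome fragment.
toy_sequence = "TTCTCTAAACGAACTTTAAAATCTGTGACGAACATGTTT"
core_sites = find_trs_core_sites(toy_sequence)
print(core_sites)
```

A plain string search like this only finds exact copies of the core; a real analysis would also have to consider the flanking sequences mentioned above.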
Most researchers consider that sgmRNA is synthesized by discontinuous transcription, leaving a footprint in the form of leader/sgmRNA complexes. These can be sequenced to assess sgmRNA abundance.
For instance, in a nasopharyngeal swab or other infected clinical samples, the detection of such complexes could allow conclusions about active viral RNA synthesis.
The current study is based on the use of the LeTRS bioinformatics tool, developed to identify the unique leader-TRS gene junction sites for each of the viral sgmRNAs, from available datasets of viral sequences in infected cell cultures.
These junctions were used as surrogate measures for sgmRNA abundance. After validation, the tool was tested on infected cells from a cell culture carried out for this study and on nasopharyngeal samples obtained from infected humans and non-human primate (NHP) models.
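The idea of using junction read counts as a proxy for sgmRNA abundance can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LeTRS implementation: it assumes each split read has already been reduced to a pair of coordinates (where the leader part ends and where the body part starts), and the window values, function name and read coordinates are all hypothetical.

```python
from collections import Counter

# Hypothetical window around the 3' end of the leader; illustrative values only.
LEADER_WINDOW = (55, 85)


def count_leader_junctions(split_reads, leader_window=LEADER_WINDOW):
    """Count reads per (leader_end, body_start) junction.

    Reads whose 5' break point falls outside the leader window are ignored,
    as are reads whose body segment does not start downstream of the leader.
    """
    low, high = leader_window
    counts = Counter()
    for leader_end, body_start in split_reads:
        if low <= leader_end <= high and body_start > leader_end:
            counts[(leader_end, body_start)] += 1
    return counts


# Made-up split-read coordinates on a toy genome, not real data.
toy_reads = [(69, 1200), (69, 1200), (69, 900), (70, 1200), (500, 1200)]
junction_counts = count_leader_junctions(toy_reads)
for junction, n_reads in junction_counts.most_common():
    print(junction, n_reads)
```

The junction with the most supporting reads would then be taken as the most abundant, which is the sense in which read counts stand in for sgmRNA abundance.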
What are the results?
The results showed that in all three cases, the sgmRNA encoding the nucleoprotein was the most abundant, while those for ORF7b and ORF10 were the least abundant. The leader-TRS junction sites with the highest read counts mirrored the reference positions, except for ORF7b in the data obtained by Nanopore sequencing.
Potential novel sgmRNAs
The researchers also observed several leader-TRS complexes at low abundance, which may represent novel sgmRNAs or already known sgmRNAs with different leader-TRS junctions.
Another possibility is that they come from a shift in viral transcription or are simply accidental results of the various sequencing processes used. Novel leader-TRS junctions have been found with other coronaviruses.
To validate the presence of a novel sgmRNA, proteomics should correspond with the sgmRNA to ensure that the sequence reflects a real ORF. For instance, earlier researchers have sometimes identified supposed novel RNA, representing the 5’ region of ORF1ab. However, these are likely to be replication templates containing defective RNA.
These authors have also suggested that promiscuous transcription of leader-TRS junctions occurs in late infection. This might explain the lack of correspondence between the cells from late nasopharyngeal samples in the NHP models and in humans, and from published data, compared to cells obtained at earlier stages of infection.
Other advantages of LeTRS
The researchers used amplicon sequencing to correctly identify genuine leader-TRS junctions in the sequencing reads. However, different sequencing methods performed differently in this respect. Poly-adenine (polyA) sequences and leader-TRS junctions are useful clues to the presence of full-length sgmRNA in the test samples.
The study also shows that LeTRS can screen out false positives since it had no positive reads when tested against sequencing data from healthy control cells.
Compared with the tools available until now, LeTRS requires less central processing unit (CPU) runtime while providing a greater wealth of information. It is therefore ideal for high-throughput analysis of large amounts of sequencing data from different sources.
Applications of LeTRS
The presence of sgmRNAs in clinical material indicates that infected cells are present and thus provides signs of active viral RNA synthesis ongoing at sampling time. However, the current study shows that while leader-TRS junctions are identified in these samples, their ratio differs from that in infected cells.
The earlier assumption may be modified thus:
“If the abundance of leader-TRS gene junctions follows an expected pattern of the nucleoprotein gene leader-TRS gene junction being the most abundant followed by a general gradient in sequence data from nasopharyngeal samples, then this may be indicative of an active infection.”
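The first part of this criterion can be expressed as a simple check on per-gene junction counts. The sketch below is only an assumption-laden illustration: the function names and the counts are made up, and because the quoted statement does not define the "general gradient" precisely, the code tests only whether the nucleoprotein (N) junction is the most abundant and returns the full ranking for inspection.

```python
def rank_junctions(junction_counts):
    """Return gene names sorted from most to least abundant junction."""
    return sorted(junction_counts, key=junction_counts.get, reverse=True)


def nucleoprotein_is_top(junction_counts, n_gene="N"):
    """True if the N junction has the highest, non-zero read count."""
    ranking = rank_junctions(junction_counts)
    return bool(ranking) and ranking[0] == n_gene and junction_counts[n_gene] > 0


# Made-up read counts for illustration only; not results from the study.
toy_counts = {"S": 120, "ORF3a": 340, "M": 610, "ORF7b": 4, "N": 2500, "ORF10": 1}
ranking = rank_junctions(toy_counts)
print(ranking)
print(nucleoprotein_is_top(toy_counts))
```

How steep or regular the remaining gradient must be is left open here, since the article gives no threshold for it.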
The use of NHP models allows the SARS-CoV-2 infection to be traced from a known exposure time. The use of LeTRS on RNA sequenced from two NHP models showed the phasic synthesis of sgmRNA, with the abundance dropping significantly after day 8 or 9.
This may indicate a synchronous infection of respiratory epithelial cells at the beginning, leading to cell death. The new virions propagate to other epithelial cells. This leads to exponential increases in infection with asynchronous waves.
The drop in sgmRNA coincides with the rise in neutralizing antibody titers, as humoral immunity kicks in. This pattern follows that observed in the serologic follow-up of COVID-19 patients.
What are the implications?
The findings of this study may help identify new targets for nucleic acid-based diagnostic tests, which often target the ORF1ab, the nucleoprotein and the spike genes. If all three are equivalent, the detection of nucleoprotein at higher abundance than the spike gene could indicate the presence of infected cells.
The lower cycle threshold (Ct) values obtained when testing with the current gold standard, reverse transcriptase-polymerase chain reaction (RT-PCR), may indicate not just higher viral loads but a higher number of infected cells. To distinguish these possibilities, the relative ratios of sgmRNAs should be identified.
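The relationship between Ct and the amount of target RNA can be made concrete with simple arithmetic. The sketch below assumes idealized amplification in which the target doubles in every cycle; real assays are usually less efficient, so the `efficiency` parameter and the function name are illustrative assumptions rather than part of any particular test.

```python
def fold_difference(ct_a, ct_b, efficiency=1.0):
    """Approximate how many times more target sample A holds than sample B.

    efficiency=1.0 means perfect doubling per cycle (an idealized assumption).
    A lower Ct for sample A gives a result greater than 1.
    """
    return (1.0 + efficiency) ** (ct_b - ct_a)


# Under perfect doubling, a 10-cycle gap is a roughly thousand-fold difference.
print(fold_difference(20.0, 30.0))
```

The calculation says nothing about whether the amplified target came from genomes in virions or from sgmRNAs in infected cells, which is why the relative ratios of sgmRNAs are needed to tell the two apart.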
Finally, the phasic nature of transcription and the abundance of the N leader-TRS junction in many human samples warn against viral transmission models based solely or to a major extent on viral Ct values, which will vary enormously in different phases.
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.