Even as the COVID-19 pandemic enters its ninth month, scientists continue to debate intensely the origin of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A new preprint* published in September 2020 by molecular biologists at the All India Institute of Medical Sciences, New Delhi, and the Indraprastha Institute of Information Technology Delhi examines problems with the data for the bat coronavirus (CoV) strain that is often considered to have very close homology with SARS-CoV-2, and concludes that there are inadequate grounds to consider it the ancestral pool of SARS-CoV-2.
Many scientists cite the genome sequence of this bat CoV, RaTG13, as part of the ancestral lineage of the current virus. A recent paper in the journal Nature also mentions its 96.2% homology with SARS-CoV-2, considering it to be a fossil record of a strain whose current existence is doubtful, but which may have been the original pool from which the current virus developed.
The scientists attempted to assemble the viral genome from scratch, performed a metagenomic analysis, and examined data quality. They concluded that the RaTG13 genome has serious issues and that all data related to it require a full review.
The researchers say, “This work is a call to action for the scientific community to better collate scientific evidence about the origins of SARS-CoV-2 so that future incidence of such pandemics may be effectively mitigated.”
COVID-19 is a complex disease, and so is its ancestry. Several groups have already discussed the similarities and dissimilarities between SARS-CoV-2 and RaTG13. The focus of the current paper is instead on the accuracy of the data underlying these sequences.
The same group of researchers initially published both sequences, but there is a qualitative difference between the papers. Full experimental details backed up the published genomic sequence of SARS-CoV-2, but not that of RaTG13. Several papers have documented holes in the dataset underlying the published genome of the bat virus.
The researchers comment, “Since this has been the single most important piece of evidence about the origins of the SARS-CoV-2, our work highlights the need to examine this data closely prior to basing further scientific studies on it.”
The dataset that has been published in support of the RaTG13 genome, almost 30 kb long, has been found inadequate to reproduce the sequence or the experimental observations based on this dataset. While the dataset is unique and contains much information beyond the fragmented coronavirus sequence, not much is known about how it was generated. This information is vital to the quest for the origin of the SARS-CoV-2.
De novo RaTG13 Assembly Not Possible
The researchers found that using the available data, they were unable to detect any contiguous sequences larger than 17 kb, using several different settings. Several matching sequences were found, but none over a fifth of the length of the reported sequence. A gap spanning 111 positions was found, and it is unclear on what basis this was filled in the published sequence.
The researchers also uncovered evidence that DNA contamination is likely to have occurred. For instance, the largest contig contains genetic material with 98% similarity to the full-length mitochondrial sequence of the Chinese rufous horseshoe bat (Rhinolophus sinicus), an unlikely event since a complete assembly of such a sequence is typically interrupted by stop codons.
Secondly, non-adapter-related repetitive sequences were found in most reads, often at the same end of the read, comprising one G-quadruplex sequence and its reverse complement. This is unlikely to happen on the same end of an RNA sample since only one strand is dominant. The researchers say more information about how the experiments were carried out is crucial to rule out the possibility of gross RNA sample contamination by DNA.
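As a rough illustration of the strand argument, the sketch below checks whether a read carries a G-quadruplex-like motif on its own strand or on its reverse complement. The preprint does not say how the motif was detected; the regular expression used here (four runs of three or more G separated by loops of 1 to 7 bases) is a commonly used heuristic and is an assumption, as are the function names and the toy read.

```python
import re

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Assumed heuristic motif: four G-runs (>= 3 G) separated by loops of 1-7 bases.
G4_PATTERN = re.compile(
    r"G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}"
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def g4_strand(read: str) -> str:
    """Report which strand of the read carries the motif.

    Returns '+' (read as given), '-' (reverse complement only),
    'both', or 'none'.
    """
    read = read.upper()
    on_plus = G4_PATTERN.search(read) is not None
    on_minus = G4_PATTERN.search(reverse_complement(read)) is not None
    if on_plus and on_minus:
        return "both"
    if on_plus:
        return "+"
    if on_minus:
        return "-"
    return "none"


# Toy read (hypothetical): a telomere-like G-rich repeat.
toy_read = "GGGTTAGGGTTAGGGTTAGGG"
print(g4_strand(toy_read))                      # +
print(g4_strand(reverse_complement(toy_read)))  # -
```

In a single-stranded RNA library, one would expect reads of a given end to fall predominantly into one of these two classes; a mix of both at the same end is what the researchers flag as suspicious.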
Horseshoe Bat (Rhinolophus sp.). Image Credit: Hugh Lansdown / Shutterstock
Poor Data Quality
The researchers also calculated an average coverage of 9.73, which is low and may explain why only partial segments of the RaTG13 sequence could be assembled. Coverage is 2 or less for about 3,000 bases, which could markedly impair accuracy. They draw attention to multiple ambiguous bases in the first-end reads that could prevent de novo assembly, and to many unreliable second-end reads as well.
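To make the coverage figures concrete, the following minimal sketch shows how a mean depth and a count of poorly covered positions can be derived from per-base read depths. The function names and the toy depth values are illustrative assumptions; the article does not describe the tooling the researchers used for this calculation.

```python
from typing import Sequence


def mean_coverage(depths: Sequence[int]) -> float:
    """Average number of reads covering each reference position."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(depths) / len(depths)


def count_low_coverage(depths: Sequence[int], threshold: int = 2) -> int:
    """Number of positions covered by `threshold` reads or fewer."""
    return sum(1 for d in depths if d <= threshold)


# Toy per-base depths for a 10-base reference (hypothetical values).
toy_depths = [0, 1, 2, 5, 9, 12, 15, 20, 18, 8]
print(mean_coverage(toy_depths))       # 9.0
print(count_low_coverage(toy_depths))  # 3
```

Applied to a genome of almost 30 kb, the same two numbers correspond to the reported average of 9.73 and the roughly 3,000 bases with coverage of 2 or less.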
Again, the researchers point out that the read length distribution in the first end is quite different in one segment compared with the rest, lacking reads of length 151, 149, or between 18 and 39. While this might be attributed to post-generation processing or to sequence trimming, the unusual distribution is unlikely to be explained this way. Ambiguous base calls are also found to be distributed in a non-random, tile-wise manner.
Another example is a 150 bp 18S rRNA segment, which occurs almost 15,800 times in the sequence, of which ~4,300 copies are 151 bp long. In all of the latter, a base-calling error was found at position 151, indicating a non-random error. The same is reflected on end 2, with a different read count.
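A check of this kind can be sketched as a simple tally of read lengths followed by a lookup of which expected lengths never occur. The helper names and the toy reads below are hypothetical and only illustrate the idea; they are not the researchers' pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, List


def length_distribution(reads: Iterable[str]) -> Dict[int, int]:
    """Map each read length to the number of reads with that length."""
    return dict(Counter(len(r) for r in reads))


def missing_lengths(dist: Dict[int, int], expected: Iterable[int]) -> List[int]:
    """Return the expected lengths that do not occur in the distribution."""
    return [n for n in expected if dist.get(n, 0) == 0]


# Toy reads (hypothetical): lengths 151, 150, 150 and 40.
toy_reads = ["A" * 151, "A" * 150, "A" * 150, "A" * 40]
dist = length_distribution(toy_reads)
print(dist[150])                             # 2
print(missing_lengths(dist, range(38, 42)))  # [38, 39, 41]
```

Comparing the missing lengths between segments of the same run is one simple way to spot the kind of discrepancy described above.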
Experimental Procedural Concerns
The significantly large differences in the bacterial content of the two referenced datasets are surprising, say the researchers, since both purport to be from similar sites, fecal and oral samples. One has only 0.65% bacteria, and ~68% Eukaryota, with the rest being unidentified. The other is ~91% bacteria and ~4% Eukaryota. This concern has been raised before.
Again, 0.1% of the first dataset is similar to plant genomes like rice and maize, which is unexpected in samples from a bat such as the intermediate horseshoe bat Rhinolophus affinis. The researchers attribute this to possible contamination through index hopping, given evidence that the same platform had earlier been used to sequence maize. Multiplex sequencing of maize and the CoV genome of interest could lead to such contamination.
The dataset also contains material identical to that of the Malayan pangolin Manis javanica, which belongs to a totally different order. This could likewise be due to index hopping of some fragments. It could have misdirected the discussion on the origin of the novel CoV, as some have reported that pangolin CoV genomic sequences also have close homology with that of SARS-CoV-2.
Pangolin (Manis javanica). Image Credit: Artem Avetisyan / Shutterstock
Thus, the inference could also be that contamination accounts for the presence of various portions of the RaTG13 in the dataset, accounting for 0.0008% of the total.
The second run also has sequences resembling another virus accession number, apart from its own accession number. This dataset is supposed to have a separate lane, and index hopping may be supposed not to have occurred here, but cross-contamination still seems to have occurred. The researchers note that this “raises a distinct possibility that sample from previous runs might not have been guarded against either index hopping or cross-contamination.”
This could explain the discrepancies in the earlier dataset. Furthermore, some sequences seem to have been derived from retroviruses such as that of the greater horseshoe bat Rhinolophus ferrumequinum, but a whole virus could not be assembled.
While most work on the origins of SARS-CoV-2 has focused on the human CoV sequence, the current study argues, first, that equal importance must be given to the other half of the equation, namely RaTG13, in order to justify giving it a role in the narrative. Secondly, discussions may need to be put on hold while the precise details of the methods used to generate the RaTG13 sequence are awaited. And thirdly, this genome should not be used in further studies until its scientific reliability is established in its entirety by independent researchers with access to the full dataset and the methods used for its generation.
The researchers conclude: “In this paper, we report that the currently specified level of details are grossly insufficient to draw inferences about the origin of SARS-CoV-2. This work is a call to action for the scientific community to better collate scientific evidence about the origins of SARS-CoV-2.”
*Preprints publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
- Singla, M.; Ahmad, S.; Gupta, C.; Sethi, T. De-novo Assembly of RaTG13 Genome Reveals Inconsistencies Further Obscuring SARS-CoV-2 Origins. Preprints 2020, 2020080595 (doi: 10.20944/preprints202008.0595.v1). https://www.preprints.org/manuscript/202008.0595/v1
- Zhou, P., Yang, X., Wang, X. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020). https://doi.org/10.1038/s41586-020-2012-7