As the coronavirus disease 2019 (COVID-19) pandemic continues to spread all over the world, its causative pathogen, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is being monitored for the emergence of new variants and alterations in viral biology.
In the interests of cost-effectiveness and rapidity, amplicon-based sequencing of the viral genome has been used to provide a sensitive means of genomic surveillance. A new preprint, which was released on the bioRxiv* server, shows how this can be tweaked to make it a reliable source of genomic data by reducing the known potential for contamination.
Importance of genome sequencing
Genome sequencing was taken up in many places early in the pandemic to rapidly identify the virus and develop precise diagnostic tests. Since then, genomic surveillance has been essential to track viral diversity, adaptation and transmission, which in turn shapes policy on viral containment.
Clusters and superspreading events have been especially susceptible to identification by this method. However, this use has been overtaken by the application of genomic sequencing in monitoring new variants of concern (VOCs), and their spread, especially as they are shown to demonstrate changes in their biological behavior.
Multiplexed amplicon-based genome sequencing methods have shown unusual utility in the genomic surveillance of this virus, relative to unbiased, low-amplification RNA sequencing techniques. The results are evident in the hundreds of thousands of complete viral genomes uploaded over the past year and a half, most of them generated by this approach.
The most extensively used method for specific genome amplification is based on an open-access tiled primer set developed by the ARTIC network. The amplicons are then built into libraries and sequenced on Illumina or nanopore platforms.
The major problem with high amplification is the inherent risk of contamination. A single polymerase chain reaction of about 35 cycles produces trillions of SARS-CoV-2 amplicons. Some of these may be aerosolized as a matter of course and contaminate other samples.
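To give a sense of the scale involved, the following is a minimal sketch of idealized PCR amplification, assuming perfect doubling in every cycle. The function name and the efficiency parameter are illustrative assumptions, and real reactions plateau as reagents run out, so the figures are upper bounds rather than measured yields.

```python
def ideal_amplicon_copies(template_copies: int, cycles: int,
                          efficiency: float = 1.0) -> float:
    """Copies after `cycles` PCR cycles in an idealized model.

    Assumes each cycle multiplies the pool by (1 + efficiency) and
    ignores reagent depletion and the plateau phase.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return template_copies * (1.0 + efficiency) ** cycles


# One starting molecule after 35 perfect cycles: 2**35, roughly 3.4e10 copies.
per_template = ideal_amplicon_copies(1, 35)
print(f"{per_template:.2e} copies per starting template")

# Number of starting templates needed to pass one trillion copies
# under the same idealized assumptions.
print(f"{1e12 / per_template:.0f} templates to reach 1e12 copies")
```

Under these assumptions, a few dozen starting molecules are enough to reach the trillion-copy range, which is why even a tiny aerosolized fraction can matter when detection sensitivity is measured in tens of molecules.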
The effect of this on experiments where the sensitivity of viral detection is measured in tens of molecules is potentially devastating. And it is only exacerbated by the fact that almost a hundred batches of samples may be processed in parallel, increasing the risk of sample interchange or direct cross-contamination.
The fact that SARS-CoV-2 is already thought to have comparatively low levels of viral diversity, and rapid extensive dissemination, makes it difficult to distinguish the expected identical genetic pattern from that caused by contamination. The impact of this on phylogenetic studies is evident, as is the clinical importance of misidentifying a variant in a sick patient.
The risks are enormously multiplied by the fact that thousands of researchers and research facilities are engaged in genomic surveillance, in view of the policies laid down by governmental health officials. With the emergence of VOCs, this number will only increase, underlining the need for strict monitoring in order to ensure that the genomic data is valid.
Synthetic DNA spike-ins help reliable sample identification
The current study uses synthetic DNA spike-ins (SDSIs) as the basis of a new method of sample identification that can be retrofitted to current sequencing techniques. Such spike-ins are commonly used in other RNA sequencing methods to detect contamination or sample swaps.
The researchers termed their approach SDSI+ARTIC. They report that applying it to amplicon sequencing increases the reliability of the results at little additional time or cost, and that it can be used to investigate the epidemiology and clinical features of the virus.
Criteria for SDSIs
The SDSIs used here have a uniquely identifiable core sequence flanked by constant priming regions on either side. With this design, a single additional primer pair can be added to a multiplex PCR assay to amplify both the SDSI and the primary target of the reaction.
The criteria essential for these additional primers include their compatibility with many different PCR reactions, their high specificity for amplifying SDSIs, and amplification of SDSI amplicons at rates that are similar to each other and to the primary target.
Again, the unique core SDSI sequences should be distinguishable, both from one another and from the genomic sequences most commonly expected in laboratories. With this approach, multiple amplified samples could be processed in parallel because their identifiable SDSIs associate with them in a dependable manner.
Thus, each sample has its own specific internal control to help detect sample swaps. The sample-SDSI association also sheds light on viral contamination.
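The design described above, a unique core between constant priming regions, can be sketched in a few lines. This is only an illustration: the sequences are invented, equal-length cores are assumed, and Hamming distance stands in for the alignment-based homology screening a real design would need.

```python
from itertools import combinations

# Hypothetical constant priming regions shared by every spike-in
# (invented for illustration; not the sequences used in the study).
LEFT_PRIMER_SITE = "ACGTTGCAAGCTTGCATGCAGTCA"
RIGHT_PRIMER_SITE = "TGACCTGAGCTAGCTTAGCGATCC"


def build_sdsi(core: str) -> str:
    """Flank a unique core with the constant priming regions."""
    return LEFT_PRIMER_SITE + core + RIGHT_PRIMER_SITE


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def too_similar_pairs(cores: dict[str, str],
                      min_distance: int) -> list[tuple[str, str, int]]:
    """Return core pairs whose distance falls below `min_distance`."""
    flagged = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(cores.items(), 2):
        distance = hamming(seq_a, seq_b)
        if distance < min_distance:
            flagged.append((name_a, name_b, distance))
    return flagged


# Invented 16-nt cores; SDSI_3 differs from SDSI_1 at a single position.
cores = {
    "SDSI_1": "ATGCGTACCGTTAGCA",
    "SDSI_2": "TTGACCGGATACGTCA",
    "SDSI_3": "ATGCGTACCGTTAGCT",
}
print(too_similar_pairs(cores, min_distance=4))
```

A pair flagged this way would be a poor choice for neighboring samples, since a sequencing error could make one spike-in look like the other.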
The SDSIs used came from relatively rare and diverse Archaea to avoid false detection of variants based on homologous sequences. They found that 43 out of 48 sequences showed over 75% homology within the Archaea domain, and the rest to rare bacteria that are not commonly found within laboratories.
Thus, they could be used over a breadth of applications. The researchers also observed that each unique SDSI was unlikely to be mistaken for clinical content, and that none was homologous with SARS-CoV-2 or Homo sapiens.
The use of additional SDSIs did not have a deleterious effect on viral amplification, nor were the SDSI primers similar enough to viral sequences to cause genomes to be misidentified. The final SDSI primers were 24 bp long with 46% GC content, compared with about 38% GC content for the SARS-CoV-2 genome.
With the common size and priming region, and the similar GC content, the amplification rate was expected to be similar across the spectrum of SDSIs.
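For readers unfamiliar with the metric, GC content is simply the fraction of bases that are G or C. A minimal sketch follows; the 24-nt primer shown is invented so that it lands near the 46% figure quoted above and is not one of the study's primers.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical 24-nt primer with 11 G/C bases (11/24 is about 46%).
example_primer = "ACGTTGCAAGCTTGCATGCAGTAA"
print(f"{gc_content(example_primer):.1%}")
```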
What were the results?
The study demonstrated the absence of nonspecific amplification, as expected, even with complementary DNA in a nasopharyngeal swab sample, lending support to the hypothesis that SDSIs are not homologous with this clinical genomic material. The amount of SDSI added to each PCR reaction was optimized to give the SARS-CoV-2 amplicons themselves the advantage in amplification and sequencing.
At 1 μl of 1 fM SDSI, more than 96% of reads mapped to SARS-CoV-2 at cycle threshold values from 20 to 35, while still achieving the same genomic coverage. When the SDSIs were tested on a set of 48 clinical samples of the virus, 47 of them were detected solely in the expected sample.
The sole exception, sample 47, showed 4% of reads mapping to a neighboring SDSI, indicating potential contamination of the first by the second. However, two single nucleotide variations (SNVs) distinguished it from sample 48, ruling out contamination as the sole source of the viral reads.
“This case reveals the potential prevalence of undetected contamination and underscores the importance of a method for identifying it.”
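The kind of check involved can be sketched as follows, assuming per-sample read counts for each SDSI are already available. The function name, the 1% reporting threshold and the read counts are assumptions made for illustration, not the study's pipeline or data.

```python
def sdsi_report(expected: str, sdsi_reads: dict[str, int],
                min_fraction: float = 0.01) -> dict:
    """Classify one sample from the SDSI reads found in it.

    `expected` is the spike-in assigned to the sample; `min_fraction`
    is an assumed cutoff below which stray reads are ignored.
    """
    total = sum(sdsi_reads.values())
    if total == 0:
        return {"status": "no_sdsi_reads", "unexpected": {}}

    fractions = {name: n / total for name, n in sdsi_reads.items() if n > 0}
    unexpected = {
        name: frac for name, frac in fractions.items()
        if name != expected and frac >= min_fraction
    }
    dominant = max(fractions, key=fractions.get)

    if dominant != expected:
        status = "possible_swap"            # wrong spike-in dominates
    elif unexpected:
        status = "possible_contamination"   # right spike-in, plus others
    else:
        status = "ok"
    return {"status": status, "unexpected": unexpected}


# Invented counts giving a 4% share to a neighboring spike-in.
report = sdsi_report("SDSI_47", {"SDSI_47": 9600, "SDSI_48": 400})
print(report["status"], report["unexpected"])
```

A flag from a check like this is a prompt for closer inspection, not proof: as the sample 47 case shows, the viral reads themselves may still differ from those of the neighboring sample.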
Optimizing the protocol
The researchers also tested and selected modifications to the ARTIC PCR reaction to improve the recovery of complete genomes with the least number of reads. These included doubling primer concentrations for low-efficiency amplicons, and increasing the number of cycles to 40, which produced the highest amplification over a range of viral titers with the least potential for erroneous SNVs.
Reducing the overall library construction reagents helped prevent unnecessarily high amplification beyond the amount required for sequencing.
Comparing with the gold standard
“We observed near perfect sequence concordance when comparing SDSI+ARTIC to unbiased sequencing, which has served as the gold standard for generating error-free viral genomes and for capturing divergent SARS-CoV-2 strains.”
When used to detect a suspected cluster of cases caused by nosocomial transmission in a real-life situation, the researchers were able to confidently identify an infection cluster within 52 hours, with 17/22 genomes being over 80% complete. The incomplete genomes were from samples with a cycle threshold below 30.
Of these, 11 samples were from the cluster, and ten showed near-identity, indicating nosocomial transmission. The remaining samples showed significant variation, indicating an independently acquired infection.
What are the implications?
“Our in silico design generated robust synthetic targets while mitigating inter-spike-in sequence homology as well as homology with human, SARS-CoV-2, and common laboratory reagents. SDSIs can readily be adopted by laboratories and platforms of all sizes with only minor changes to existing methodologies, little additional cost per sample ($0.006 in our hands), and no interruption or addition of time to standard workflow methodologies.”
The use of SDSIs overcomes the key issue of contamination with amplicon-based sequencing, allowing its advantages to be exploited.
The SDSIs allow the detection of contamination within batches, as well as within a laboratory at large. More targets can be synthesized for larger formats, and for other tiled amplicon platforms. The approach is also versatile enough to be used with newer amplicon sequencing technologies.
The researchers conclude:
“SDSIs could serve as a broad tool for tracing potential contamination across a plethora of fields that employ amplicon based genomic sequencing, such as food safety, species identification and environmental sampling.”
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.