As genomic sequencing of SARS-CoV-2 assumes greater importance in epidemiological surveillance and in the evaluation of possible escape mutations, a new study published on the preprint server bioRxiv* in December 2020 describes much-needed guidelines for reliable viral sequencing, validated in a large group of clinical samples.
The use of genotype data from a circulating pathogen is called genomic epidemiology, and it is critical for public health monitoring and contact tracing. One recent instance is the detection of a rise in the prevalence of the D614G mutant of SARS-CoV-2; another is the identification of bidirectional transmission of the virus between minks and humans.
The genotype matters because mutations can alter the severity of the disease, confer antibody resistance, or enable immune evasion, which is why it needs to be monitored. Thousands of viral genomes have been sequenced so far during the current COVID-19 pandemic.
However, the quality of these sequences is uneven, varying with the protocol and practices in use, the quality of the sample, the genotyping method, and the way the data are processed. This holds even for complete genomes drawn from public databases after strict selection criteria have been applied.
Need for Guidelines
The issue here is that errors introduced into the sequenced genomes bias the conclusions that are reached. The most common genotyping methods are amplicon-based, being both simple and inexpensive. Nonetheless, errors can arise at each of the stages involved and introduce inaccuracies into the genotypes.
This underlines the need for broad guidelines that account for the performance standards of the various genotyping technologies, thus recognizing the method-dependent sensitivity, specificity, and limit of detection. This will improve the accuracy of the result and allow comparisons to be made with confidence across genotypes from multiple laboratories.
The current study sets out to lay down such guidelines for amplicon-based SARS-CoV-2 genotyping. The researchers used synthetic reference genomes of this virus to understand how the viral load, the depth of sequencing, and the uniformity of coverage affect the results. They then validated their experimental workflow on over 200 clinical samples in 6 independent laboratories.
Benchmarking the Process
The investigators used 343 partially overlapping amplicons, split into two pools, covering ~99.9% of the SARS-CoV-2 genome. After the genomic RNA is converted into complementary DNA, it is amplified by polymerase chain reaction (PCR) using two paired primer sets; the pools are then combined and amplified again with sequencing adapters.
They found that the efficiency of amplification was directly related to the viral load: the fraction of effective reads correlated with the number of viral copies. In samples with 20 or fewer genome copies per reaction (gcpr), fewer than half of the reads mapped to the synthetic reference genome. Sequencing artifacts also produced false positives and false negatives, leading to inaccurate variant identification and genotype assignment.
Secondly, they established that genotyping accuracy depends on the breadth of coverage, that is, the percentage of the genome covered, and on the sequencing depth or number of reads per position, and that both are proportional to the number of viral genome copies. When the number of gcpr fell below 50, for instance, the genome was only partly covered.
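To make the two quantities concrete, here is a minimal Python sketch that computes breadth of coverage and mean depth from a list of per-position read counts. It also reports a uniformity figure, but the article does not define that metric; the definition used here (share of positions with at least 20% of the mean depth) and the names `coverage_metrics`, `min_depth`, and `uniformity_fraction` are assumptions for illustration, not the study's method.

```python
def coverage_metrics(depths, min_depth=1, uniformity_fraction=0.2):
    """Summarize per-position read depths for one sample.

    depths: read counts, one entry per genome position.
    min_depth: depth at which a position counts as covered (assumed value).
    uniformity_fraction: a position counts as evenly covered when its depth
        is at least this fraction of the mean depth (assumed definition).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    mean_depth = sum(depths) / n
    # Breadth of coverage: share of positions with enough reads.
    breadth = sum(d >= min_depth for d in depths) / n
    # Uniformity: share of positions close enough to the mean depth.
    if mean_depth == 0:
        uniformity = 0.0
    else:
        uniformity = sum(d >= uniformity_fraction * mean_depth for d in depths) / n
    return {"breadth": breadth, "mean_depth": mean_depth, "uniformity": uniformity}


# Toy example with made-up depths: two uncovered positions out of ten.
toy_depths = [0, 0, 100, 100, 100, 100, 100, 100, 100, 100]
print(coverage_metrics(toy_depths))
```

In practice the per-position depths would come from an alignment of the reads to the reference genome; the toy list above only shows how the numbers relate to each other.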
Using defined samples with less than 11% artifacts, they found that with 200K or more mapped 150 bp reads, they could achieve 98% genome coverage, the average depth being 683x, and with 93% uniformity of coverage. Beyond this, there was no improvement in any of these parameters. Moreover, these guidelines ensure the accurate identification of single nucleotide variants (SNVs) that do not significantly affect amplification.
They also found that the sequencing depth required for an adequate number of mapped reads was inversely related to the number of viral genome copies. Below 1,000 gcpr, the mean uniformity of coverage was only ~51% and the coverage breadth only about 68%, both with high variability. This is largely due to the difficulties of genotyping with little genomic material.
Conversely, reliable identification of SNVs is possible in samples with 1,000 gcpr or more. With such samples, they achieved a stable proportion of effective reads, at about 740K per million reads. At or above this viral load, the minimum recommended sequencing depth is 270K reads to achieve 200K mapped reads.
They also sought to define the lowest allele fraction at which a variant can be called rather than dismissed as an error, at the minimum recommended viral load of 1,000 gcpr. They found that in samples with a variant frequency of 10% or higher, at a median sequencing depth of 200K, the variant allele fraction (VAF) of false-positive variants was lower than that of true positives in 95% of cases.
Accuracy rises with the expected VAF, but beyond this depth the increase is modest, while at a VAF of 0.05 or lower, variant detection is insensitive and unreliable. Moreover, while using 100 gcpr sharply reduced the reliability of variant calling, using 10K gcpr brought no significant rise in accuracy.
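The 270K figure follows from simple arithmetic: if about 74% of reads are effective, roughly 200K / 0.74 total reads are needed to end up with 200K mapped reads. A minimal Python sketch of that calculation, with the hypothetical function name `required_total_reads` chosen here for illustration:

```python
import math


def required_total_reads(target_mapped_reads, effective_fraction):
    """Total reads needed so that the expected mapped reads reach the target.

    effective_fraction: share of reads that map to the viral genome,
        e.g. 0.74 for about 740K effective reads per million.
    """
    if not 0 < effective_fraction <= 1:
        raise ValueError("effective_fraction must be in (0, 1]")
    return math.ceil(target_mapped_reads / effective_fraction)


# 200K mapped reads at ~74% effective reads -> about 270K total reads.
print(required_total_reads(200_000, 0.74))  # 270271
```

The same function shows why low-viral-load samples need deeper sequencing: a smaller effective fraction drives the required total up in inverse proportion.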
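As an illustration of how a VAF cut-off works, the Python sketch below computes the variant allele fraction for candidate variants and keeps only those at or above 10%, the frequency the study reports as reproducibly detectable. The function names, the tuple layout, and the read counts are made up for this example; a real pipeline would take these counts from a variant caller's output.

```python
def variant_allele_fraction(alt_reads, total_reads):
    """Fraction of reads at a position that carry the alternative allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def filter_variants(candidates, min_vaf=0.10):
    """Keep candidates whose VAF is at or above min_vaf.

    candidates: iterable of (position, alt_reads, total_reads) tuples.
    Returns a list of (position, vaf) pairs for the retained variants.
    """
    kept = []
    for position, alt_reads, total_reads in candidates:
        vaf = variant_allele_fraction(alt_reads, total_reads)
        if vaf >= min_vaf:
            kept.append((position, vaf))
    return kept


# Made-up candidates: a near-clonal variant, a low-frequency call that is
# more likely an artifact, and an intermediate-frequency variant.
toy_candidates = [(100, 690, 700), (200, 28, 700), (300, 140, 700)]
print(filter_variants(toy_candidates))
```

With these toy counts, the candidate at position 200 has a VAF of 0.04 and is dropped, while the other two are retained.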
Validation in Clinical Settings
In the majority of clinical samples, the exact gcpr is unknown, and therefore the researchers validated their protocol across 6 laboratories, each of which used its own discretion as to the details of sample collection, RNA extraction, PCR, and choice of replicates. Here, the RT-qPCR cycle threshold (Ct) value was used as a proxy for the gcpr, as is common in such settings. The Ct ranged from 12 to 38 and, as expected, showed a high correlation with the gcpr.
They found that 96% of the samples with a Ct below 29 had a mean coverage breadth of 99.6%, while more than 80% had 750K mapped reads or more per million. With Ct less than 26, this increased to 70-90% of mapped reads, and a minimum sequencing depth of 280K reads was found to ensure adequate coverage and sequencing depth. They also recommended a higher sequencing depth for Ct between 26 and 30 to reach the cut-off of 200K mapped reads while covering 95% of the genome. Most of the variants were reliably detected at this level.
Over 98% of the clinical samples had the same four alternative alleles, with expected clonal variant frequencies. Some variants clustered within samples from a specific source, and the pattern differed depending on whether the samples were collected before or after July. Before July, almost all the samples belonged to clade 20A, which carries the D614G mutation and spread all over Europe, or to its daughter clades 20B and 20C.
After this date, there was greater variation in genotype, with almost 82% belonging to subclades 20A.EU1 and 20A.EU2. Both of these were first seen in Europe in the early part of summer and then seeded many regions of the continent, especially as cross-border travel increased with the coming of the summer vacation.
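The Ct-based guidance summarized above can be expressed as a simple lookup. The Python sketch below is only one way to encode it: the function name `sequencing_guidance`, the tier labels, and the handling of the boundary values (Ct exactly 26 or 30) are assumptions, and no read count is given for the 26 to 30 range because the summary does not state one.

```python
def sequencing_guidance(ct):
    """Map an RT-qPCR Ct value to the sequencing-depth guidance above.

    Returns a dict with a tier label and the minimum total reads, or None
    where the summary gives no specific figure.
    """
    if ct < 26:
        # Ct below 26: a minimum of 280K reads was found to be adequate.
        return {"tier": "high viral load", "min_total_reads": 280_000}
    if ct <= 30:
        # Ct 26-30: deeper sequencing is recommended, figure not stated.
        return {"tier": "intermediate viral load", "min_total_reads": None}
    # Higher Ct values are not covered by the guidance summarized here.
    return {"tier": "outside summarized guidance", "min_total_reads": None}


for ct in (20, 28, 35):
    print(ct, sequencing_guidance(ct))
```

A laboratory applying this would substitute its own depth target for the intermediate tier, chosen so that the sample still reaches 200K mapped reads.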
The researchers comment, “The above results support the validity of clinical sample genotypes determined using our recommendations.”
“We demonstrate that SARS-CoV-2 variants with frequencies of 10% or higher can be reproducibly detected with sufficient input material and sequencing depth. Our study provides general recommendations for reliable determination of viral genome sequences using amplicon-based methods for SARS-CoV-2 genotyping.”
This, in turn, will allow samples to be traced by place of origin and time of collection through reliable variant identification.
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
- Kubik, S. et al. (2020). Guidelines for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples. bioRxiv preprint. doi: https://doi.org/10.1101/2020.12.01.405738. https://www.biorxiv.org/content/10.1101/2020.12.01.405738v1