A new study published on the preprint server bioRxiv* in October 2020 describes the gap frequencies and positions in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome and a set of alternative primers with a sequencing scheme to provide better genomic coverage and quality.
The viral RNA sequences can help decipher the transmission of the virus, track its evolution in different regions, and trace it back to its origin. Monitoring of diagnostic tests and the development of therapeutics and vaccines also depend on having complete and reliable sequences of the viral genome.
At present, there are over 130,000 almost complete or complete sequences of the viral genome in the GISAID database, mostly derived using targeted amplicon methods in next-generation sequencing. A glance over these genomes shows that there are many gaps across the genome, which affects the quality and the interpretation of genomics studies. The current study puts forward an alternative primer set that may provide complete coverage and a high-quality genome by whole-genome sequencing.
Identifying the Issue
The researchers obtained genomes uploaded to the GISAID database in September 2020, nine months from the start of the pandemic. They looked for complete genomes and sorted the genomes using the metadata that included information on the platforms used for sequencing.
The researchers plotted the positions of the gaps in the roughly 30 kb genome, counting stretches of 200 Ns (N200), a length that nearly corresponds to the size of an amplicon, in each of the first 48 genomes deposited that month. They found that rather than being scattered at random across the genome, the gaps occur more often in certain positions.
The patterns of the gaps also showed some variation across the two platforms. This raised the suspicion that the gaps were due, singly or in combination, to unpredicted interactions between the primers, to regions of the sequence with an unusual structure or composition, or to primers not being correctly trimmed during quality control of the read data.
The same patterns were observed in genomes uploaded in each month, indicating the generalized nature of these gaps. However, when another platform was used, which employs an alternative set of primers, the N200 levels were much lower.
Gaps and sequencing errors continue to be reported, and towards the end of March 2020, the commonly used ARTIC primers were updated to attempt to resolve this problem. Nonetheless, well over a quarter of the genomes reported in GISAID as complete show one or more N200 gaps, as of September 2020.
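The kind of gap survey described above can be sketched in a few lines of Python. This is an illustration only, not the authors' pipeline: it assumes an "N200 gap" means a run of at least 200 consecutive N characters, and the function names and the dictionary of genomes are hypothetical.

```python
import re

# Assumption: an N200 gap is a run of 200 or more consecutive Ns.
N200 = re.compile(r"N{200,}", re.IGNORECASE)


def n200_gaps(sequence):
    """Return (start, end) spans (0-based, end exclusive) of each N200 gap."""
    return [match.span() for match in N200.finditer(sequence)]


def fraction_with_gaps(genomes):
    """Fraction of genomes (dict of name -> sequence) with at least one N200 gap."""
    if not genomes:
        return 0.0
    with_gaps = sum(1 for seq in genomes.values() if n200_gaps(seq))
    return with_gaps / len(genomes)
```

Collecting the spans from many genomes and tallying them by position would show whether the gaps cluster at particular sites, as the study reports.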
The current study analyzed the gaps more closely, showing the gap pattern between nt 19,000 and 24,000 relative to the reference genome NC_045512. Every peak occurred between the forward primer of amplicon X and the reverse primer of amplicon X-1, but some gaps covered the whole amplified region between the primers.
This could be because the paired primers do not function properly during amplification, so that no amplicon is produced for sequencing. Alternatively, the original data may contain amplicon sequences that are incorrectly trimmed out, perhaps during the data analysis process, leaving a gap.
The Solution: Alternate Primers
The researchers used another set of primers, called the Entebbe primers, to initiate amplification, and developed them by adapting methods used earlier for MERS-CoV, norovirus, respiratory syncytial virus (RSV), and the yellow fever virus. Particular attention was paid to the size of the amplicon and the primer positioning.
The primers were used to carry out reverse transcription, and the amplicons were multiplexed in two alternating sets, both steps being important for the PCR. The optimal amplicon size was ~1,500 bp, since, at this size, the total content of primers was lower while the efficiency of the PCR was maintained at a high level.
The study describes the primers designed for this whole-genome sequencing, along with the laboratory methods used to carry out reverse transcription, PCR amplification, and library preparation for the sequencing platform in use, to generate a complete SARS-CoV-2 genome sequence.
Developing the New Process
The researchers began with all the complete sequences in the GISAID database as of June 22, 2020, coming to well over 21,500. After cleaning up the sequences by removing troublesome spaces and characters, they looked over them to remove all genomes that had gaps of 6 or more Ns. This yielded a final number of ~17,000 clean genomes.
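A minimal sketch of this clean-up step follows. The article does not say exactly which characters were removed, so the code assumes that anything other than a letter (spaces, line breaks, dashes, and so on) is stripped; the function names are hypothetical.

```python
import re

NON_LETTERS = re.compile(r"[^A-Za-z]")  # assumption: only letters are kept
N_RUN = re.compile(r"N{6,}")            # a gap of 6 or more Ns


def clean(sequence):
    """Strip whitespace and other non-letter characters, then upper-case."""
    return NON_LETTERS.sub("", sequence).upper()


def keep_gap_free(genomes):
    """Clean each genome (dict of name -> sequence) and drop those with 6+ Ns in a row."""
    cleaned = {name: clean(seq) for name, seq in genomes.items()}
    return {name: seq for name, seq in cleaned.items() if not N_RUN.search(seq)}
```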
These were cut up into strings of 33 nucleotides at a 1 nucleotide step. This generated ~600,000 unique 33mers. They then identified the 33mers that were highly conserved by counting the frequency at which they appeared, thus obviating the sequence alignment commonly used as part of the primer design process. Alignment can become a problem if the genome sets are either large or very different.
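The alignment-free counting idea can be illustrated as below. The 33-nucleotide window and 1-nucleotide step come from the article; counting each 33mer once per genome, skipping 33mers that contain an N, and the `min_fraction` cut-off are assumptions made for this sketch.

```python
from collections import Counter

K = 33  # window length reported in the article


def kmers(sequence, k=K):
    """Yield every k-mer of the sequence, sliding one nucleotide at a time."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def kmer_genome_counts(genomes, k=K):
    """Count the number of genomes in which each k-mer occurs.

    Assumption: a k-mer is counted once per genome, and k-mers
    containing an N are ignored.
    """
    counts = Counter()
    for seq in genomes:
        counts.update({kmer for kmer in kmers(seq, k) if "N" not in kmer})
    return counts


def conserved(counts, n_genomes, min_fraction=0.95):
    """Return k-mers present in at least min_fraction of genomes (hypothetical cut-off)."""
    return [kmer for kmer, n in counts.items() if n / n_genomes >= min_fraction]
```

Because each genome is processed independently, no genome has to be excluded for failing to align, which is the advantage the article describes.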
The advantage of avoiding alignment was that all the useful genome sequences could be included, rather than just one set of sequences that allows proper alignment. The final step was to trim the 33mers, which resemble primers, until they reached the defined melting temperature, while discarding all primers longer than 26 nucleotides.
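The trimming step might look like the sketch below. The article gives neither the melting temperature formula nor the target value, so this example substitutes the simple Wallace rule (2 °C per A or T, 4 °C per G or C) and a hypothetical target of 60 °C, and it assumes bases are removed from the 3' end. Only the 26-nucleotide length limit comes from the article.

```python
def wallace_tm(primer):
    """Rough melting temperature by the Wallace rule (an assumption, not the study's formula)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def trim_to_tm(kmer, target_tm=60, max_len=26):
    """Trim bases from the 3' end until Tm drops to target_tm or below.

    Returns the trimmed primer, or None if it is still longer than
    max_len nucleotides. target_tm is a hypothetical value.
    """
    primer = kmer
    while len(primer) > 1 and wallace_tm(primer) > target_tm:
        primer = primer[:-1]
    return primer if len(primer) <= max_len else None
```

An AT-rich 33mer needs many bases to reach the target temperature and is therefore rejected by the length limit, while a GC-rich one is trimmed to a short primer.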
After defining the forward and reverse bins containing the primer target regions, they selected 20 amplicons for SARS-CoV-2, which overlapped by 300 nt. These were evenly spread over the genome. Beginning with the most conserved primer sequences, they examined the mapping in the 185 nucleotides at the 5' or 3' end of each amplicon. They also used the two most frequent primers in each bin. This allowed unexpected structure-driven or target-driven alterations to be overridden to some extent. They finally came up with amplicon lengths calculated at 1,495-2,093 nucleotides.
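As a rough check on this layout, the sketch below tiles 20 equal-sized amplicons with a 300 nt overlap across a 29,903 nt genome (the length of NC_045512) and assigns them alternately to two pools. It is an idealized illustration: in the study, the actual boundaries were set by where conserved primers could be placed, which is why the reported lengths vary. The odd/even pool assignment is an assumption about how the two multiplex sets were formed.

```python
GENOME_LENGTH = 29_903  # length of reference genome NC_045512


def tile_amplicons(genome_length=GENOME_LENGTH, n_amplicons=20, overlap=300):
    """Lay out evenly spaced, overlapping amplicons (0-based, end exclusive)."""
    # n * length - (n - 1) * overlap must cover the genome; round the length up.
    total = genome_length + (n_amplicons - 1) * overlap
    length = -(-total // n_amplicons)  # ceiling division
    step = length - overlap
    amplicons = []
    for i in range(n_amplicons):
        start = i * step
        end = min(start + length, genome_length)
        # Assumption: neighbours go into different pools so that
        # overlapping amplicons are never amplified in the same reaction.
        pool = 1 if i % 2 == 0 else 2
        amplicons.append({"number": i + 1, "start": start, "end": end, "pool": pool})
    return amplicons
```

With these inputs the idealized amplicon length comes out at 1,781 nt, which sits inside the 1,495-2,093 range reported for the real primer set.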
They adapted the methods used for reverse transcription, PCR amplification, and library preparation to the alternative primer set.
Testing the Primers on Viral RNA
When these primers were tested on SARS-CoV-2 sequencing using material from COVID-19-positive samples on the same platform as before (MinION), they were able to map the read data, following trimming of the primers and adapters, and quality control, to the Wuhan1 reference genome NC_045512 with complete sequence coverage across the entire genome.
All 20 amplicons were represented in the sequence coverage, with small peaks at overlapping sites and no missing amplicons, and the data were easily assembled to form complete genomes with good coverage. Alteration of the primer mixes allowed the concentrations of amplicon 1 and 16 primers to be increased for reverse transcription and PCR, with better yields compared to the other amplicons.
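Coverage of this kind can be checked with a simple depth calculation. The sketch below is not the authors' analysis: it assumes mapped reads are already available as (start, end) intervals on the reference (0-based, end exclusive, within the genome), and the function names are hypothetical.

```python
def depth(read_intervals, genome_length):
    """Per-position read depth from (start, end) intervals, using a difference array."""
    diff = [0] * (genome_length + 1)
    for start, end in read_intervals:
        diff[start] += 1  # a read begins covering here
        diff[end] -= 1    # and stops covering here
    coverage, running = [], 0
    for change in diff[:genome_length]:
        running += change
        coverage.append(running)
    return coverage


def uncovered_regions(coverage, min_depth=1):
    """Return (start, end) spans where depth falls below min_depth."""
    regions, start = [], None
    for pos, d in enumerate(coverage):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions
```

An empty list from `uncovered_regions` corresponds to the gap-free result described above, and the depth peaks would be expected where neighbouring amplicons overlap.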
The authors sum up: "The current study documents the genome gap frequencies and their positions in the currently available data." The researchers also present the alternative Entebbe primers and sequencing protocol, the use of which may allow more efficient gap-less sequencing of SARS-CoV-2 genomes, an important aspect when the cost and labor invested in this task are taken into account. This will hopefully improve the value and coverage of genomic sequencing data in the future.
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.