In a recent study posted to the bioRxiv* preprint server, researchers used outlier detection to identify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleotide sequences belonging to newly emerging variants.
The emergence of novel SARS-CoV-2 variants has raised concerns about the currently administered coronavirus disease 2019 (COVID-19) vaccines. Timely identification and sequencing of newly emerging variants is therefore essential.
About the study
In the present study, researchers applied outlier detection to different SARS-CoV-2 nucleotide sequences before and after the emergence of a novel variant.
The team collected a total of 211,167 SARS-CoV-2 nucleotide sequences. The selected sequences satisfied the following criteria: (1) complete, with a length of at least 29,000 bp; (2) collection date complete, with a full year-month-day collection date; (3) high coverage, with less than 1% N-bases; (4) patient status available, with metadata comprising the patient's age, gender, and clinical status; and (5) low-coverage sequences, those with more than 5% N-bases, excluded. The team also recorded the timestamp of every nucleotide sequence.
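As a rough illustration of how the sequence-level criteria above could be checked, the following Python sketch tests a single record for length, date completeness, and N-base fraction. It is not the study's code: the function names and the "%Y-%m-%d" date format are assumptions made for this example, and the patient-status criterion is left out because it depends on metadata fields not described here.

```python
from datetime import datetime


def n_fraction(seq: str) -> float:
    """Fraction of ambiguous 'N' bases in a sequence (case-insensitive)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def has_full_date(date_str: str) -> bool:
    """True only for a complete year-month-day date such as '2021-11-24'."""
    try:
        datetime.strptime(date_str, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def passes_filters(seq: str, date_str: str,
                   min_length: int = 29_000,
                   max_n_fraction: float = 0.01) -> bool:
    """Apply the length, collection-date, and high-coverage criteria."""
    return (
        len(seq) >= min_length
        and has_full_date(date_str)
        and n_fraction(seq) < max_n_fraction
    )
```

Because the high-coverage threshold (less than 1% N-bases) is stricter than the low-coverage exclusion (more than 5% N-bases), a sequence that passes the former also passes the latter, so the sketch checks only the stricter one.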
The team investigated whether sequences of a novel variant could be detected for each of eight SARS-CoV-2 variants: Alpha (B.1.1.7), Beta (B.1.351), Delta (B.1.617.2), Gamma (P.1), GH (B.1.640), Lambda (C.37), Mu (B.1.621), and Omicron (B.1.1.529).
Two reference datasets were generated for each variant. First, the team determined the time point T1 at which sequences of that variant emerged in the Global Initiative on Sharing All Influenza Data (GISAID) database; the first reference dataset comprised the GISAID sequences with a timestamp before T1. The second dataset represented the emergence of the novel variant: the team determined the timestamp T2 by which 10% of the variant's sequences had been recorded in GISAID, and generated the second reference dataset from the sequences with a timestamp up to T2.
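The construction of the two reference datasets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: records are taken to be hypothetical (collection date, variant label) pairs, T1 is read as the earliest date on which the variant appears, and T2 as the date by which the chosen fraction (10% by default) of that variant's sequences has appeared.

```python
import math
from datetime import date
from typing import List, Tuple

Record = Tuple[date, str]  # (collection date, variant label); hypothetical layout


def split_points(records: List[Record], variant: str,
                 fraction: float = 0.10) -> Tuple[date, date]:
    """Return (T1, T2) for one variant under the assumptions stated above."""
    variant_dates = sorted(d for d, label in records if label == variant)
    if not variant_dates:
        raise ValueError(f"no sequences labelled {variant!r}")
    t1 = variant_dates[0]
    # Number of variant sequences that must have appeared by T2 (at least one).
    k = max(1, math.ceil(fraction * len(variant_dates)))
    t2 = variant_dates[k - 1]
    return t1, t2


def reference_datasets(records: List[Record], variant: str,
                       fraction: float = 0.10):
    """First dataset: strictly before T1. Second dataset: up to and including T2."""
    t1, t2 = split_points(records, variant, fraction)
    before_t1 = [r for r in records if r[0] < t1]
    up_to_t2 = [r for r in records if r[0] <= t2]
    return before_t1, up_to_t2
```

By construction, the first dataset contains no sequences of the variant, while the second contains its earliest sequences alongside everything else collected up to T2.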
The team aligned the sequences to the SARS-CoV-2 reference genome using the multiple alignment using fast Fourier transform (MAFFT) tool. Each aligned nucleotide sequence was then compared with the viral reference genome and converted into a binary Hamming sequence marking the positions at which it differed from the reference. The team also used the Jaccard similarity measure to quantify the similarity between all pairs of sequences.
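To make the encoding step concrete, here is a small Python sketch, assuming the sequences have already been aligned to the same length as the reference (the MAFFT step itself is not shown). The function names are hypothetical, and treating two sequences with no mismatches as having similarity 1 is a convention chosen for this example rather than something stated in the study.

```python
from typing import List


def hamming_vector(reference: str, aligned: str) -> List[int]:
    """Binary vector with 1 wherever the aligned sequence differs from the reference."""
    if len(reference) != len(aligned):
        raise ValueError("sequence must be aligned to the reference length")
    return [int(r != a) for r, a in zip(reference.upper(), aligned.upper())]


def jaccard(u: List[int], v: List[int]) -> float:
    """Jaccard similarity: shared mismatch positions over positions mismatched in either."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    # Convention for this sketch: two mismatch-free vectors count as identical.
    return both / either if either else 1.0


def jaccard_matrix(reference: str, aligned_seqs: List[str]) -> List[List[float]]:
    """Pairwise Jaccard similarities between all aligned sequences."""
    vectors = [hamming_vector(reference, s) for s in aligned_seqs]
    return [[jaccard(u, v) for v in vectors] for u in vectors]
```

For example, a sequence with one mismatch and another that shares that mismatch and adds a second one have a Jaccard similarity of 1/2.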
Outlier detection was performed by defining a local environment around every sequence in a principal component plot. The timestamp of the tested sequence was then compared with the distribution of timestamps within that local environment.
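One way to read this procedure is sketched below in Python. It is an illustration under assumptions, not the study's method: the local environment is taken to be a circle of radius eps around each point, timestamps are numeric (for example, days since the first sample), and a sequence is flagged when its timestamp lies more than f standard deviations from the mean timestamp of its neighbours. The parameter names eps, f, and min_neighbors are introduced here for the example, and the principal component coordinates are assumed to be precomputed.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def local_outliers(points: Sequence[Tuple[float, float]],
                   timestamps: Sequence[float],
                   eps: float, f: float,
                   min_neighbors: int = 3) -> List[int]:
    """Return indices of points whose timestamp is unusual for their neighbourhood."""
    flagged = []
    for i, (p, t) in enumerate(zip(points, timestamps)):
        # Timestamps of all other points within distance eps of point i.
        neighbours = [timestamps[j] for j, q in enumerate(points)
                      if j != i and math.dist(p, q) <= eps]
        if len(neighbours) < min_neighbors:
            continue  # too few neighbours to describe a local distribution
        mu, sigma = mean(neighbours), pstdev(neighbours)
        if abs(t - mu) > f * sigma:
            flagged.append(i)
    return flagged
```

The intuition is that a genome sitting among much older (or much newer) genomes in the plot has a timestamp that does not fit its surroundings, which is what one would expect from the first sequences of a new variant.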
The study results showed that viral genomes in GISAID displayed a specific progression pattern, with the older sequences clustering in the middle of the Jaccard matrix plot and the newer sequences in the lower part of the plot. The pattern progressed from the early point cloud through viral genomes with intermediate timestamps to the newest samples. The team also noted that the genomes of the SARS-CoV-2 Omicron variant were the most similar to those found in the early stages of the pandemic.
Calibrating the outlier detection on the Omicron sequences produced a two-dimensional elbow plot showing the number of outliers as a function of the size of the local environment and a factor f, which defined the number of standard deviations required to declare a sequence an outlier. The number of outliers decreased as f increased, with a sharp decline at f=1.2. This indicated that f=1.2 was a consistent choice for all the variants.
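The idea of reading a threshold off an elbow plot can be illustrated with a short Python sketch. It assumes the outlier counts have already been computed for a grid of f values at one fixed local environment size, and it picks the f value at which the count drops most sharply relative to the previous grid point. This heuristic and the function name are assumptions for illustration; the study's calibration considered the local environment as well and may have used a different rule.

```python
from typing import Dict


def sharpest_drop(counts_by_f: Dict[float, int]) -> float:
    """Return the f value at which the outlier count falls most from the previous f."""
    fs = sorted(counts_by_f)
    if len(fs) < 2:
        raise ValueError("need outlier counts for at least two values of f")
    # Decrease in the number of outliers when moving from fs[i-1] to fs[i].
    drops = {fs[i]: counts_by_f[fs[i - 1]] - counts_by_f[fs[i]]
             for i in range(1, len(fs))}
    return max(drops, key=drops.get)


# Made-up counts for demonstration only; these are not results from the study.
example_counts = {1.0: 500, 1.1: 480, 1.2: 120, 1.3: 110, 1.4: 100}
chosen_f = sharpest_drop(example_counts)
```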
Local outlier detection within an epsilon environment identified 19 of the 25 Omicron genomes. The team also noted that many of the sequences detected in this calibration were not Omicron-related and instead belonged to the SARS-CoV-2 Delta variant. Moreover, for the Delta, Beta, GH, and Omicron variants, the number of outliers detected increased significantly after the variant emerged. For the other variants, the difference in the number of outliers was less substantial. Notably, for the Gamma variant, the number of outliers detected decreased after the variant emerged.
Overall, the study findings showed that outlier detection, which draws on machine learning techniques as well as statistical methods, could serve as an important tool for recognizing newly emerging SARS-CoV-2 variants.
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Georg Hahn, Sanghun Lee, Dmitry Prokopenko, Jonathan Abraham, Tanya Novak, Julian Hecker, Michael Cho, Surender Khurana, Lindsey R. Baden, Adrienne G. Randolph, Scott T. Weiss, Christoph Lange. (2022). Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest. bioRxiv. doi: https://doi.org/10.1101/2022.05.16.492178 https://www.biorxiv.org/content/10.1101/2022.05.16.492178v1