Genomic sequencing supports a long-term infection as source of SARS-CoV-2 Omicron

In a recent study published on the bioRxiv* preprint server, researchers compare the nucleotide sequences of the newly emerged severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Omicron (B.1.1.529) variant of concern (VOC) with other nucleotide sequences in the open-access, publicly available GISAID database.

Study: Unsupervised genome-wide cluster analysis: nucleotide sequences of the omicron variant of SARS-CoV-2 are similar to sequences from early 2020. Image Credit: PHOTOCREO Michal Bednarek / Shutterstock.com

As of December 2021, the GISAID database consisted of more than six million genome sequences of SARS-CoV-2. These include genomic sequences of the recently discovered Omicron variant, as well as previous strains that have been collected from coronavirus disease 2019 (COVID-19) patients around the world since the beginning of the pandemic in 2020. Several previous studies have performed a genome-wide analysis of SARS-CoV-2 genomes using model-based approaches that assume an underlying phylogenetic tree structure.

About the study

In the current study, the researchers used an unsupervised genome-wide cluster analysis based on the Jaccard similarity matrix. A given set of nucleotide sequences was first aligned to a reference sequence before a principal component analysis (PCA) was carried out.

Subsequently, all sequences were converted into a binary Hamming matrix, which marks every position at which a sequence differs from the reference sequence. The Hamming matrix serves as the input for computing the Jaccard similarity matrix, which assigns a similarity index between zero and one to every pairwise comparison of sequences. PCA is then applied to the Jaccard similarity matrix to identify clusters of SARS-CoV-2 genomes.
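The steps described above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the authors' implementation: the toy sequences, the function names hamming_matrix, jaccard_similarity, and pca_2d, and the convention that two sequences without any mismatches have a similarity of one are all assumptions made for this example. The sequences are assumed to be already aligned and of equal length.

```python
import numpy as np

def hamming_matrix(seqs, reference):
    """Binary matrix: entry (i, j) is 1 if sequence i differs from the reference at position j."""
    ref = np.array(list(reference))
    return np.array([np.array(list(s)) != ref for s in seqs], dtype=np.int64)

def jaccard_similarity(X):
    """Pairwise Jaccard index (intersection over union) between rows of a binary matrix."""
    inter = X @ X.T                                   # mismatch positions shared by both rows
    counts = X.sum(axis=1)                            # mismatches per sequence
    union = counts[:, None] + counts[None, :] - inter
    # Assumption: two rows with no mismatches at all are treated as identical.
    return np.where(union == 0, 1.0, inter / np.maximum(union, 1))

def pca_2d(S):
    """Scores of each row of S on the first two principal components."""
    centered = S - S.mean(axis=0)
    U, sing, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :2] * sing[:2]

# Toy data (hypothetical): one reference and four aligned sequences.
reference = "ACGTACGT"
aligned = ["ACGTACGT", "ACGTACGA", "TCGTACGA", "TCGTACGT"]

X = hamming_matrix(aligned, reference)
S = jaccard_similarity(X)
coords = pca_2d(S)
print(np.round(S, 2))
print(coords.shape)
```

Each row of coords gives the position of one sequence in the plane spanned by the first two principal components, which is the kind of plot the study reports.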

The study results were displayed as the first two principal components of the Jaccard matrix, which show a progression of all nucleotide sequences in time. The resulting plots are color-coded by World Health Organization (WHO) region, the location from which each sequence was submitted, the submission date, and clade. Note that a total of 11 SARS-CoV-2 clades are available on GISAID: G, GH, GK, GR, GRA, GRY, GV, L, O, S, and V.

Study findings

The researchers initially identified 132,065 genomic sequences that satisfied all five data quality attributes offered by GISAID: complete (sequences with a minimum length of 29,000 base pairs), low coverage excluded (sequences with more than 5% N-bases are removed), collection date complete (submissions with a complete year-month-day collection date), high coverage (sequences with less than 1% N-bases), and with patient status (sequences with meta information in the form of age, sex, and patient status).
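To illustrate the sequence-level criteria, the following sketch checks the length and N-base thresholds quoted above. GISAID applies these filters on its own platform, so the function names n_fraction and passes_quality and the synthetic test sequences are hypothetical and serve only to show the arithmetic.

```python
def n_fraction(seq):
    """Fraction of ambiguous 'N' bases in a sequence (1.0 for an empty sequence)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0

def passes_quality(seq, min_length=29_000, max_n_fraction=0.01):
    """True if the sequence is 'complete' (length) and 'high coverage' (less than 1% N-bases)."""
    return len(seq) >= min_length and n_fraction(seq) < max_n_fraction

# Synthetic examples (hypothetical): clean, too short, and too many N-bases.
clean = "ACGT" * 7_500                 # 30,000 bases, no N
short = "ACGT" * 100                   # 400 bases
noisy = "N" * 600 + "A" * 29_400       # 30,000 bases, 2% N

for name, seq in [("clean", clean), ("short", short), ("noisy", noisy)]:
    print(name, passes_quality(seq))
```

A sequence that passes the 1% threshold automatically passes the looser 5% threshold, so the two coverage attributes need only one check here.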

First two principal components of the Jaccard matrix, color coded by WHO region (AFRO in red, EMRO in blue, EURO in purple, PAHO in orange, SEARO in green, WPRO in black). Displayed are the 10013 sequences from GISAID, one point per sequence. The omicron samples are depicted as triangles.

Later, the dataset was down-sampled to 10,000 sequences due to the limit imposed on computing by the Jaccard similarity matrix and PCA. Finally, the researchers added all 287 sequences of the Omicron variants available on GISAID as of December 26, 2021, leading to a total of 10,287 genomic sequences for this study's analysis. The metadata information used for the study was the geographic location where sequences were collected.
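A rough calculation shows why the down-sampling matters. The sketch below assumes a dense similarity matrix stored as 8-byte floating-point numbers and uniform random sampling without replacement; neither detail is stated in the article, and the integer IDs, the seed, and the function names are hypothetical. Under these assumptions, a matrix for all 132,065 sequences would need roughly 140 GB, whereas one for 10,287 sequences needs less than 1 GB.

```python
import numpy as np

def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory needed for a dense n x n matrix (assumption: 8 bytes per entry)."""
    return n * n * bytes_per_entry

def downsample(ids, k, seed=0):
    """Uniform random subsample of k IDs without replacement (assumed sampling scheme)."""
    rng = np.random.default_rng(seed)
    return rng.choice(ids, size=k, replace=False)

# Hypothetical integer IDs standing in for GISAID accessions.
all_ids = np.arange(132_065)
omicron_ids = np.arange(132_065, 132_065 + 287)

subset = downsample(all_ids, 10_000)
analysis_ids = np.concatenate([subset, omicron_ids])

print(len(analysis_ids))
print(dense_matrix_bytes(132_065) / 1e9, "GB for the full set")
print(dense_matrix_bytes(len(analysis_ids)) / 1e9, "GB after down-sampling")
```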

First two principal components of the Jaccard matrix, color coded by the clade of each sequence (clade G in red, GH in blue, GK in purple, GR in orange, GRA in green, GRY in black, GV in yellow, L in maroon, O in light green, S in turquoise, V in brown). Displayed are the 10013 sequences from GISAID, one point per sequence. The omicron samples are depicted as triangles.

Using the multiple sequence alignment program MAFFT, all sequences were aligned to the official SARS-CoV-2 reference published on GISAID. All other parameters were left at their default values, which established a well-defined comparison window of 29,891 base pairs.

The analysis showed that the SARS-CoV-2 nucleotide sequences extended from the origin (0,0) of the principal component plot in a distinctive way and formed numerous distinct clusters according to their geographical origin. Genomic clusters from Africa were identified in the upper left quadrant of the plot, whereas those from Europe were found in the lower left quadrant. Notably, the Omicron genomic sequences lay some distance from the European cluster and closer to the origin.

Conclusions

The study examines the emergence of the Omicron variant by applying a non-parametric PCA to SARS-CoV-2 nucleotide sequences collected from the publicly available GISAID database during the past two years of the pandemic. The analysis indicated that the new Omicron genomic sequences were most similar to sequences submitted to GISAID in the early months of the pandemic, around January 2020.

Further, the Omicron sequences in GISAID were spread across the entire range of the first principal component and did not cluster. This supports the hypothesis that the Omicron variant has been in circulation for some time and may have arisen from a long-term SARS-CoV-2 infection.

The study findings also indicate that unsupervised cluster analysis is well suited to continuous monitoring of data from public databases such as GISAID because of its simplicity and computational speed. The approach may also help classify emerging SARS-CoV-2 variants of interest for further follow-up analyses.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Written by

Neha Mathur

Neha is a digital marketing professional based in Gurugram, India. She received a Master’s degree with a specialization in Biotechnology from the University of Rajasthan in 2008. She gained experience in pre-clinical research through a research project in the Department of Toxicology at the Central Drug Research Institute (CDRI), Lucknow, India. She also holds a certification in C++ programming.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Mathur, Neha. (2022, January 03). Genomic sequencing supports a long-term infection as source of SARS-CoV-2 Omicron. News-Medical. Retrieved on May 20, 2022 from https://www.news-medical.net/news/20220103/Genomic-sequencing-supports-a-long-term-infection-as-source-of-SARS-CoV-2-Omicron.aspx.

  • MLA

    Mathur, Neha. "Genomic sequencing supports a long-term infection as source of SARS-CoV-2 Omicron". News-Medical. 20 May 2022. <https://www.news-medical.net/news/20220103/Genomic-sequencing-supports-a-long-term-infection-as-source-of-SARS-CoV-2-Omicron.aspx>.

  • Chicago

    Mathur, Neha. "Genomic sequencing supports a long-term infection as source of SARS-CoV-2 Omicron". News-Medical. https://www.news-medical.net/news/20220103/Genomic-sequencing-supports-a-long-term-infection-as-source-of-SARS-CoV-2-Omicron.aspx. (accessed May 20, 2022).

  • Harvard

    Mathur, Neha. 2022. Genomic sequencing supports a long-term infection as source of SARS-CoV-2 Omicron. News-Medical, viewed 20 May 2022, https://www.news-medical.net/news/20220103/Genomic-sequencing-supports-a-long-term-infection-as-source-of-SARS-CoV-2-Omicron.aspx.
