Genome modeling framework for coronavirus pathogenicity prediction

By Michael Greenwood, M.Sc. Sep 23 2021

Viral genomic information collected during disease outbreaks is often used in epidemiological studies well after the outbreak to infer statistics relating to mutations and their influence on transmissibility, transmission chains, and public health risk. Experimentation in the creation and optimization of pathogenic viruses is extremely limited due to safety concerns related to accidental escape, leaving the scientific community unable to predict and unprepared for large and sudden outbreaks, as seen in the coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

Study: A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses. Image Credit: peterschreiber.media/Shutterstock

*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Recent machine learning techniques have allowed viral genome sequencing to be utilized in a predictive capacity, pre-outbreak, shedding light on the emergence of certain viral phenotypes. In a paper recently uploaded to the preprint server bioRxiv*, such techniques are further refined and applied to a wide range of coronaviruses, creating a model that can identify known human pathogens with 100% reliability and be applied to animal viruses to predict their propensity for human transmission.

A preprint version of the study is available on the bioRxiv* server, while the article undergoes peer review.

Identifying human pathogen features

The group developed a “feature extraction” technique that computationally assigns words to specific genomic features and applied it to a range of viruses in the coronavirus family known to be human or non-human pathogens. This set included SARS-CoV-2 and ten of its variants of concern, together represented by 42 complete genomes intended to capture the diversity observed during the COVID-19 pandemic.

In addition, 34 complete swine acute diarrhea (SAD) virus genomes were used to represent non-human pathogens. The group notes that porcine coronaviruses such as SAD have seemingly never undergone zoonotic transfer to humans, so this virus was chosen because of the apparent genetic barrier to such a transfer.

The group restricted their genome search to sequences 11, 13, 15, and 17 monomers long and identified several that correlated with human pathogen probability; for most of these, as expected, SARS-CoV-2 scored more highly than SAD. The group found the 15- and 17-mer models to be more accurate, with each of those tested correctly classifying all 42 SARS-CoV-2 genomes as human pathogens.
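To make the idea of fixed-length genomic "words" concrete, the following is a minimal sketch of overlapping k-mer counting at the four window lengths named above. It is not the preprint's code: the function names `count_kmers` and `kmer_profiles` and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter

K_VALUES = (11, 13, 15, 17)  # window lengths named in the article


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a nucleotide sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_profiles(sequence, k_values=K_VALUES):
    """Return one k-mer count table per window length."""
    return {k: count_kmers(sequence, k) for k in k_values}


# Toy 30-base sequence (a 15-mer repeated twice), not a real genome.
toy = "ATGGTTATTAAACGA" * 2
profiles = kmer_profiles(toy)

print(profiles[15]["ATGGTTATTAAACGA"])  # occurrences of the repeated 15-mer
```

Count tables like these could then serve as the input features for a classifier; how the study weights or selects them is not described in this article.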

To take the technique beyond identifying genomic features and develop a predictive model, the group first organized genomic sequences taxonomically, simulating the task of training on several species of each class of virus with the aim of maximizing the number of species classified correctly in the validation set. Second, a stratified resampling technique was used to avoid the bias this approach can otherwise introduce, such as a skewed representation of human pathogens.
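One way to picture a taxonomically organized, class-balanced split is sketched below: whole species are held out for validation, and the same share is held out from each class. This is an interpretation under stated assumptions, not the group's procedure; `species_stratified_split`, the genome IDs, and the 25% validation share are all hypothetical.

```python
import random
from collections import defaultdict


def species_stratified_split(records, validation_fraction=0.25, seed=0):
    """Hold out whole species per class so no species is in both sets.

    records: list of (genome_id, species, label) tuples. Each species is
    assumed to carry exactly one label.
    """
    species_by_label = defaultdict(set)
    for _, species, label in records:
        species_by_label[label].add(species)

    rng = random.Random(seed)
    held_out = set()
    for label in sorted(species_by_label):
        species_list = sorted(species_by_label[label])
        rng.shuffle(species_list)
        # Hold out at least one species, but always leave one for training.
        n_val = max(1, round(len(species_list) * validation_fraction))
        n_val = min(len(species_list) - 1, n_val)
        held_out.update(species_list[:n_val])

    train = [r for r in records if r[1] not in held_out]
    validation = [r for r in records if r[1] in held_out]
    return train, validation


# Hypothetical genome IDs; species and labels follow the article's examples.
records = [
    ("g1", "SARS-CoV-2", "human"), ("g2", "SARS-CoV-2", "human"),
    ("g3", "HCoV-229E", "human"), ("g4", "HCoV-HKU1", "human"),
    ("g5", "MERS-CoV", "human"),
    ("g6", "SAD virus", "non-human"), ("g7", "SAD virus", "non-human"),
    ("g8", "Sparrow CoV HKU17", "non-human"), ("g9", "Turkey CoV", "non-human"),
]
train, validation = species_stratified_split(records)
```

Splitting by species rather than by genome means the validation score reflects performance on species the model has never seen, which is closer to the problem of assessing a novel virus.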

Some SARS-CoV-2 motifs associated with human pathogenicity were found across a variety of coronavirus species, where they are prone to drifting and appear at a variety of loci. Other motifs promoting human pathogenicity were either binary or frequency-dependent, acting as a switch that requires just one occurrence or several. For example, the sequence NTRNWRNTSNWSHTA appears in human coronavirus HKU1 isolates as many as 45 times but only four times in sparrow coronavirus HKU17.
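The motifs quoted here use IUPAC ambiguity codes (for example, N for any base and R for A or G). A minimal sketch of how such a degenerate motif could be matched and turned into a binary or frequency feature follows; the helper names and the toy sequence are assumptions, and the counting rule (overlapping matches on one strand) may differ from the one used in the study.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Compile a degenerate motif; the lookahead lets overlapping hits match."""
    body = "".join(f"[{IUPAC[base]}]" for base in motif.upper())
    return re.compile(f"(?=({body}))")


def motif_positions(motif, sequence):
    """Return 0-based start positions of every match, overlaps included."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence.upper())]


def motif_features(motif, sequence):
    """Binary (present or absent) and frequency (count) views of one motif."""
    positions = motif_positions(motif, sequence)
    return {"present": bool(positions), "count": len(positions)}


# Toy sequence containing one concrete instance of the motif, not a real genome.
toy = "GG" + "ATACAGGTCATGCTA" + "CCC"
print(motif_features("NTRNWRNTSNWSHTA", toy))
```

The `present` field corresponds to a switch-like motif and `count` to a frequency-dependent one, such as the 45 versus four occurrences reported for HKU1 and HKU17.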

Predicting human pathogenicity

Another motif, RATGTTRTTMDWCDA, is found in various genomic contexts in human alpha- and beta-coronaviruses. It is located in, for example, the spike protein in human coronavirus 229E (alpha), non-structural proteins 3 and 15 in Middle East respiratory syndrome coronavirus (beta), and non-structural protein 5 in SARS-CoV-2 (beta).

In all of these cases, the motif appears in the same reading frame, consisting of five triplets (15-mer). However, in turkey gammacoronavirus, the motif is frameshifted by one position, indicating that it is unnecessary to the non-human pathogen.
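The reading-frame observation comes down to simple modular arithmetic on positions. The sketch below assumes 0-based coordinates and a known start position for the coding sequence; `reading_frame`, `triplets`, the coordinates, and the concrete 15-mer are illustrative assumptions, not values from the preprint.

```python
def reading_frame(motif_start, cds_start):
    """Frame (0, 1, or 2) of a motif relative to the first base of a coding
    sequence. Both positions are 0-based."""
    return (motif_start - cds_start) % 3


def triplets(motif_instance):
    """Split an in-frame motif instance into consecutive triplets."""
    return [motif_instance[i:i + 3] for i in range(0, len(motif_instance), 3)]


# One arbitrary concrete instance of RATGTTRTTMDWCDA, not taken from a genome.
instance = "AATGTTGTTAAACGA"
print(triplets(instance))          # five triplets when read in frame

# Hypothetical coordinates: the same motif one base further along is
# shifted out of frame.
print(reading_frame(400, 100))     # in frame
print(reading_frame(401, 100))     # shifted by one position
```

A motif that sits in frame 0 in several viruses but in frame 1 in another would encode different amino acids in the latter, which is the kind of difference the turkey gammacoronavirus example points to.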

Additionally, in human pathogens, this motif is also found closer to N-linked glycosylation sites, evidence of adaptation towards translational efficiency as rare codon enrichment at these sites is associated with robust expression under certain conditions.

Conclusion

Models developed by the group that had correctly classified the SARS-CoV-2 and SAD test sequences were applied to a wide variety of other coronaviruses, and several with known phylogenetic similarities were highlighted.

Bat coronavirus WIV16 shared 96% sequence similarity with SARS-CoV-1 and gave a human pathogenic probability of 0.78, while civet coronaviruses gave probabilities of around 0.89. Civets are closely related to minks, to which zoonotic transmission of SARS-CoV-2 from humans has been observed during the pandemic.

Other bat coronaviruses, RmYN02, RpYN06, and RaTG13, exhibited human pathogen class probabilities rivaling that of SARS-CoV-2, and each of these is known to bear high sequence similarity to it.

Using the model, several more bat viruses sourced from a cave in Guangdong were routinely classified as human pathogens. The group speculated that this could explain the many otherwise unidentified SARS-like virus outbreaks reported in the region in previous years.

Journal reference:

Preliminary scientific report. Davis, P. and Russell, J. (2021) "A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses". bioRxiv. doi: 10.1101/2021.09.18.460926.

Written by

Michael Greenwood

Michael graduated from the University of Salford with a Ph.D. in Biochemistry in 2023 and has keen research interests in nanotechnology and its application to biological systems. Michael has written on a wide range of science communication and news topics within the life sciences and related fields since 2019, and engages extensively with current developments in journal publications.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

APA
Greenwood, Michael. (2021, September 23). Genome modeling framework for coronavirus pathogenicity prediction. News-Medical. Retrieved on February 07, 2026 from https://www.news-medical.net/news/20210923/Genome-modeling-framework-for-coronavirus-pathogenicity-prediction.aspx.
MLA
Greenwood, Michael. "Genome modeling framework for coronavirus pathogenicity prediction". News-Medical. 07 February 2026. <https://www.news-medical.net/news/20210923/Genome-modeling-framework-for-coronavirus-pathogenicity-prediction.aspx>.
Chicago
Greenwood, Michael. "Genome modeling framework for coronavirus pathogenicity prediction". News-Medical. https://www.news-medical.net/news/20210923/Genome-modeling-framework-for-coronavirus-pathogenicity-prediction.aspx. (accessed February 07, 2026).
Harvard
Greenwood, Michael. 2021. Genome modeling framework for coronavirus pathogenicity prediction. News-Medical, viewed 07 February 2026, https://www.news-medical.net/news/20210923/Genome-modeling-framework-for-coronavirus-pathogenicity-prediction.aspx.
