Genome modeling framework for coronavirus pathogenicity prediction

Viral genomic information collected during disease outbreaks is often used in epidemiological studies well after the outbreak to infer statistics relating to mutations and their influence on transmissibility, transmission chains, and public health risk. Experimentation in the creation and optimization of pathogenic viruses is extremely limited due to safety concerns about accidental escape. This leaves the scientific community unable to predict, and unprepared for, large and sudden outbreaks, as seen in the coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

Study: A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses. Image Credit: Shutterstock

Recent machine learning techniques have allowed viral genome sequencing to be utilized in a predictive capacity, pre-outbreak, shedding light on the emergence of certain viral phenotypes. In a paper recently uploaded to the preprint server bioRxiv*, such techniques are further refined and applied to a wide range of coronaviruses, creating a model that can identify known human pathogens with 100% reliability and be applied to animal viruses to predict their propensity for human transmission.

A preprint version of the study is available on the bioRxiv* server, while the article undergoes peer review.

Identifying human pathogen features

The group developed a “feature extraction” technique that computationally assigns words to specific genomic features and applied it to a range of viruses in the coronavirus family known to be human or non-human pathogens. This included 42 complete SARS-CoV-2 genomes, spanning ten variants of concern, intended to represent the diversity observed during the COVID-19 pandemic.

In addition, 34 complete swine acute diarrhea (SAD) virus genomes were used to represent non-human pathogens. The group notes that porcine coronaviruses such as SAD have seemingly never undergone zoonotic transfer to humans, so this virus was chosen because of the apparent genetic barrier to that transition.

The group restricted their genome search to sequences of 11, 13, 15, and 17 monomers in length and identified several that correlated with human pathogen probability. For most of these, as expected, SARS-CoV-2 scored more highly than SAD. The group found the 15- and 17-mer models to be more accurate, with each of those tested correctly classifying all 42 SARS-CoV-2 genomes as human pathogens.
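
As a rough illustration of what scanning a genome for 11-, 13-, 15-, and 17-mers involves, the Python sketch below counts every overlapping subsequence of a given length. The function name `count_kmers` and the toy sequence are assumptions made for illustration only; the preprint's actual feature extraction is more involved and is not reproduced here.

```python
from collections import Counter

def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy 18-base sequence, not a real genome (hypothetical example data).
toy = "ATGTTATTAATGTTATTA"
for k in (11, 13, 15, 17):
    counts = count_kmers(toy, k)
    print(k, "distinct:", len(counts), "total windows:", sum(counts.values()))
```

Counts of this kind could then serve as candidate features for a classifier, with longer k-mers being more specific but also sparser.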

To take the technique beyond identifying genomic features and develop a predictive model, the group first organized the genomic sequences taxonomically, simulating the problem of training on several species of each class of virus and then maximizing the number of species classified successfully in the validation set. Second, a stratified resampling technique was used to avoid the bias this approach can otherwise introduce, such as a skewed representation of human pathogens.
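
One way to picture a taxonomically organized, class-balanced split is the sketch below, which holds out whole species for validation and does so separately within each class. This is an assumption about how such a split could look, not the group's documented procedure; the function `split_by_species`, the `val_fraction` parameter, and the species names are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_species(records, val_fraction=0.25, seed=0):
    """Hold out whole species for validation, separately within each class.

    records: list of (genome_id, species, label) tuples; each species is
    assumed to carry a single label.
    Returns (train, validation) lists of records.
    """
    rng = random.Random(seed)
    # Collect the species names seen under each class label.
    species_by_label = defaultdict(set)
    for _, species, label in records:
        species_by_label[label].add(species)

    held_out = set()
    for label, species in species_by_label.items():
        names = sorted(species)  # sort first so the shuffle is reproducible
        rng.shuffle(names)
        # Hold out at least one species per class so both classes are validated.
        n_val = max(1, round(len(names) * val_fraction))
        held_out.update(names[:n_val])

    train = [r for r in records if r[1] not in held_out]
    validation = [r for r in records if r[1] in held_out]
    return train, validation

# Hypothetical records: four species per class.
records = [
    ("g1", "sp_A", "human"), ("g2", "sp_A", "human"), ("g3", "sp_B", "human"),
    ("g4", "sp_C", "human"), ("g5", "sp_D", "human"),
    ("g6", "sp_E", "non-human"), ("g7", "sp_F", "non-human"),
    ("g8", "sp_G", "non-human"), ("g9", "sp_H", "non-human"),
]
train, validation = split_by_species(records)
print(len(train), "training genomes,", len(validation), "validation genomes")
```

Because no species appears on both sides of the split, validation performance reflects how well the model handles species it has not seen, which is closer to the situation of a novel virus.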

Some SARS-CoV-2 motifs associated with human pathogenicity were noted across a variety of coronavirus species, where the motif is prone to drifting and is found at a variety of loci. Other motifs promoting human pathogenicity were either binary or frequency-dependent, acting as a switch that requires either just one or multiple occurrences. For example, the sequence NTRNWRNTSNWSHTA appears in human coronavirus HKU1 isolates as many as 45 times but only four times in sparrow coronavirus HKU17.
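
Motifs such as NTRNWRNTSNWSHTA are written with the standard IUPAC nucleotide ambiguity codes (for example, R stands for A or G and N for any base). The sketch below shows one way such a motif could be located and counted in a sequence using Python's `re` module; the helper names and the toy sequence are assumptions for illustration, and the preprint may count matches differently.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a degenerate motif into a compiled regular expression."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # Wrapping the pattern in a lookahead lets overlapping matches be found.
    return re.compile(f"(?=({body}))")

def find_motif(sequence: str, motif: str) -> list:
    """Return the 0-based start position of every (overlapping) match."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence.upper())]

# Toy sequence containing one instance of the motif (hypothetical data).
toy = "CC" + "ATGCAAGTCTTGCTA" + "GG"
hits = find_motif(toy, "NTRNWRNTSNWSHTA")
present = len(hits) > 0   # binary feature: does the motif occur at all?
count = len(hits)         # frequency feature: how many times does it occur?
print(hits, present, count)
```

The same list of hits supports both kinds of feature described above: a binary switch uses only whether the list is empty, while a frequency-dependent feature uses its length.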

Predicting human pathogenicity

Another motif, RATGTTRTTMDWCDA, is found in various genomic contexts in human alpha- and beta-coronaviruses. It is located in, for example, the spike protein in human coronavirus 229E (alpha), non-structural proteins 3 and 15 in Middle East respiratory syndrome coronavirus (beta), and non-structural protein 5 in SARS-CoV-2 (beta).

In all of these cases, the motif appears in the same reading frame, consisting of five triplets (15-mer). However, in turkey gammacoronavirus, the motif is frameshifted by one position, indicating that it is unnecessary to the non-human pathogen.
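
To make the reading-frame point concrete, the sketch below splits the 15-mer into its five triplets and computes a motif's frame as its offset from the start of the coding sequence, modulo three. The function names and the coordinates are hypothetical and are not taken from the preprint; both positions are assumed to use the same 0-based convention.

```python
def split_codons(motif: str) -> list:
    """Split a sequence into consecutive triplets, starting at its first base."""
    return [motif[i:i + 3] for i in range(0, len(motif), 3)]

def reading_frame(motif_start: int, cds_start: int) -> int:
    """Return 0, 1, or 2: the motif's offset from a codon boundary of its coding sequence."""
    return (motif_start - cds_start) % 3

print(split_codons("RATGTTRTTMDWCDA"))

# Hypothetical coordinates: a coding sequence starting at position 100.
in_frame = reading_frame(112, 100)   # motif starts exactly on a codon boundary
shifted = reading_frame(113, 100)    # same motif moved one position downstream
print(in_frame, shifted)
```

A motif whose frame differs by one between two genomes is read as different codons in each, even though the nucleotide pattern itself is unchanged.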

Additionally, in human pathogens, this motif is found closer to N-linked glycosylation sites. This is evidence of adaptation towards translational efficiency, as rare codon enrichment at these sites is associated with robust expression under certain conditions.


Models developed by the group that had correctly classified the SARS-CoV-2 and SAD test sequences were applied to a wide variety of other coronaviruses, and several with known phylogenetic similarities were highlighted.

Bat coronavirus WIV16 shares 96% sequence similarity with SARS-CoV-1 and gave a human pathogen probability of 0.78, while civet coronaviruses gave probabilities of around 0.89. Civets, like minks, are carnivorans, and transmission of SARS-CoV-2 from humans to minks has been observed during the pandemic.

Other bat coronaviruses, RmYN02, RpYN06, and RaTG13, exhibited human pathogen class probabilities rivaling that of SARS-CoV-2, and each of these is known to bear high sequence similarity to it.

Using the model, several more bat viruses sourced from a cave in Guangdong were routinely classified as human pathogens. The group speculated that this could explain the many otherwise unidentified SARS-like virus outbreaks reported in the region in previous years.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.

Journal reference:
  • Davis, P. and Russell, J. (2021) "A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses". bioRxiv.  doi: 10.1101/2021.09.18.460926.
Written by Michael Greenwood

Michael graduated from Manchester Metropolitan University with a B.Sc. in Chemistry in 2014, where he majored in organic, inorganic, physical and analytical chemistry. He is currently completing a Ph.D. on the design and production of gold nanoparticles able to act as multimodal anticancer agents, being both drug delivery platforms and radiation dose enhancers.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Greenwood, Michael. (2021, September 23). Genome modeling framework for coronavirus pathogenicity prediction. News-Medical. Retrieved on August 18, 2022 from

  • MLA

    Greenwood, Michael. "Genome modeling framework for coronavirus pathogenicity prediction". News-Medical. 18 August 2022. <>.

  • Chicago

    Greenwood, Michael. "Genome modeling framework for coronavirus pathogenicity prediction". News-Medical. (accessed August 18, 2022).

  • Harvard

    Greenwood, Michael. 2021. Genome modeling framework for coronavirus pathogenicity prediction. News-Medical, viewed 18 August 2022,


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.