Viral genomic information collected during disease outbreaks is often used in epidemiological studies well after the outbreak to infer statistics relating to mutations and their influence on transmissibility, transmission chains, and public health risk. Experimentation in the creation and optimization of pathogenic viruses is extremely limited due to safety concerns related to accidental escape, leaving the scientific community unable to predict and unprepared for large and sudden outbreaks, as seen in the coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
Study: A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses. Image Credit: peterschreiber.media / Shutterstock
Recent machine learning techniques have allowed viral genome sequencing to be utilized in a predictive capacity, pre-outbreak, shedding light on the emergence of certain viral phenotypes. In a paper recently uploaded to the preprint server bioRxiv*, such techniques are further refined and applied to a wide range of coronaviruses, creating a model that can identify known human pathogens with 100% reliability and be applied to animal viruses to predict their propensity for human transmission.
A preprint version of the study is available on the bioRxiv* server, while the article undergoes peer review.
Identifying human pathogen features
The group developed a “feature extraction” technique that computationally assigns words to specific genomic features and applied it to a range of viruses in the coronavirus family known to be either human or non-human pathogens. The dataset included 42 complete genomes of SARS-CoV-2 and ten variants of concern, intended to represent the diversity observed during the COVID-19 pandemic.
In addition, 34 complete swine acute diarrhea (SAD) virus genomes were employed to represent non-human pathogens. The group notes that porcine coronaviruses such as SAD have seemingly never undergone zoonotic transfer to humans, so this virus was chosen for the apparent genetic barrier to that transition.
The group restricted their genome search to sequences 11, 13, 15, and 17 monomers long and identified several that correlated with human pathogen probability; for most of these, as expected, SARS-CoV-2 scored more highly than SAD. The group found the 15- and 17-mer models to be more accurate, with each of those tested correctly classifying all 42 SARS-CoV-2 genomes as human pathogens.
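The idea of treating fixed-length subsequences as "words" can be illustrated with a minimal Python sketch. This is not the group's code: the function name `kmer_counts` and the toy sequence are hypothetical, and a short k of 3 is used only so the output is easy to follow (the study used lengths of 11 to 17).

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer ("word") in a nucleotide sequence."""
    seq = sequence.upper()
    # Slide a window of width k along the sequence, one position at a time.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical toy sequence, not taken from the study.
toy_sequence = "ATGCGATGCGA"
toy_counts = kmer_counts(toy_sequence, 3)
```

Counts like these, computed per genome, are one plausible form of input for a classifier that separates human from non-human pathogens.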
To take the technique beyond identifying genomic features and develop a predictive model, the group first organized genomic sequences taxonomically, so that the training set contained several species of each class of virus, with the aim of maximizing successful classification of species in the validation set. Second, they used a stratified resampling technique to avoid biases that this approach can otherwise introduce, such as a skewed representation of human pathogens.
Some SARS-CoV-2 motifs associated with human pathogenicity were noted across a variety of coronavirus species, where the motif is prone to drifting and is found at a variety of loci. Other motifs promoting human pathogenicity were either binary or frequency-dependent, acting as a switch that requires either a single occurrence or multiple occurrences. For example, the sequence NTRNWRNTSNWSHTA appears in human coronavirus HKU1 isolates as many as 45 times but only four times in sparrow coronavirus HKU17.
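One way to picture a taxonomy-aware, stratified split is sketched below in Python. It is an illustration under stated assumptions rather than the group's procedure: the function `stratified_species_split`, the record layout, and the 25% validation fraction are all hypothetical, and each species is assumed to carry a single class label.

```python
import random
from collections import defaultdict


def stratified_species_split(records, val_fraction=0.25, seed=0):
    """Hold out whole species for validation, sampling species per class label.

    records: list of (genome_id, species, label) tuples (hypothetical layout).
    Returns (train_records, validation_records).
    """
    species_by_label = defaultdict(set)
    for _, species, label in records:
        species_by_label[label].add(species)

    rng = random.Random(seed)
    held_out = set()
    for label in sorted(species_by_label):
        ordered = sorted(species_by_label[label])
        rng.shuffle(ordered)
        # Each class contributes the same fraction of its species (at least one).
        n_val = max(1, round(len(ordered) * val_fraction))
        held_out.update(ordered[:n_val])

    train = [r for r in records if r[1] not in held_out]
    validation = [r for r in records if r[1] in held_out]
    return train, validation


# Hypothetical toy data: 4 species per class, 2 genomes per species.
toy_records = [
    (f"{label}-sp{s}-g{g}", f"{label}-sp{s}", label)
    for label in ("human", "non_human")
    for s in range(4)
    for g in range(2)
]
train_set, validation_set = stratified_species_split(toy_records)
```

Because entire species are held out, no species contributes genomes to both sets, and sampling within each class keeps human and non-human pathogens represented in the validation set.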
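Motifs such as NTRNWRNTSNWSHTA are written with IUPAC ambiguity codes, where, for instance, R stands for A or G and N for any base. The Python sketch below shows one way such a motif could be counted in a genome, including overlapping occurrences. The helper names are hypothetical, the sequences are assumed to be written with T rather than U, and the test sequence is made up for illustration.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_pattern(motif: str) -> re.Pattern:
    """Compile a degenerate motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[code] for code in motif.upper())
    # A lookahead matches at every start position without consuming characters.
    return re.compile(f"(?=({body}))")


def count_motif(genome: str, motif: str) -> int:
    """Number of positions in the genome where the degenerate motif matches."""
    return sum(1 for _ in motif_pattern(motif).finditer(genome.upper()))


# Hypothetical 15-base sequence constructed to match the motif exactly once.
example_hit = "ATACAGTTGATCATA"
n_hits = count_motif(example_hit, "NTRNWRNTSNWSHTA")
is_present = n_hits >= 1  # binary "switch" view; n_hits is the frequency view
```

The same count supports both readings described above: a binary feature asks only whether the count is at least one, while a frequency-dependent feature uses the count itself.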
Predicting human pathogenicity
Another motif, RATGTTRTTMDWCDA, is found in various genomic contexts in human alpha- and beta-coronaviruses, being located in, for example, the spike protein in human coronavirus 229E (alpha), non-structural proteins 3 and 15 in Middle East respiratory syndrome coronavirus (beta), and non-structural protein 5 in SARS-CoV-2 (beta).
In all of these cases, the motif appears in the same reading frame, consisting of five triplets (15-mer). However, in turkey gammacoronavirus, the motif is frameshifted by one position, indicating that it is unnecessary to the non-human pathogen.
Additionally, in human pathogens, this motif is found closer to N-linked glycosylation sites, evidence of adaptation towards translational efficiency, as rare codon enrichment at these sites is associated with robust expression under certain conditions.
Models developed by the group that had correctly classified the SARS-CoV-2 and SAD test sequences were applied to a wide variety of other coronaviruses, and several with known phylogenetic similarities were highlighted.
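The reading-frame comparison comes down to simple arithmetic on positions. In the hedged Python sketch below, the function names and coordinates are hypothetical; it assumes 0-based positions and that the start of the enclosing coding sequence is known.

```python
def reading_frame(motif_start: int, cds_start: int) -> int:
    """Offset (0, 1, or 2) of a motif relative to the codon grid of its coding sequence."""
    return (motif_start - cds_start) % 3


def split_triplets(motif: str) -> list:
    """Split a motif into consecutive three-base groups."""
    return [motif[i:i + 3] for i in range(0, len(motif), 3)]


triplets = split_triplets("RATGTTRTTMDWCDA")

# Hypothetical coordinates: the second motif starts one base later than the first.
in_frame = reading_frame(motif_start=112, cds_start=100)
shifted = reading_frame(motif_start=113, cds_start=100)
```

A motif that sits in frame 0 is read as exactly five codons, whereas a shift of one position changes every codon it spans, which is why a frameshifted copy would not be expected to encode the same amino acids.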
Bat coronavirus WIV16 shared 96% sequence similarity with SARS-CoV-1 and gave a human pathogenic probability of 0.78, while civet coronaviruses gave probabilities of around 0.89. Civets are closely related to minks, to which zoonotic transmission of SARS-CoV-2 from humans has been observed during the pandemic.
Other bat coronaviruses, RmYN02, RpYN06, and RaTG13, exhibited human pathogen class probabilities rivaling that of SARS-CoV-2, to which each is known to bear high sequence similarity.
Using the model, several more bat viruses sourced from a cave in Guangdong were routinely classified as human pathogens. The group speculated that this could explain the many otherwise unidentified SARS-like virus outbreaks reported in the region in previous years.
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
- Davis, P. and Russell, J. (2021) "A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses". bioRxiv. doi: 10.1101/2021.09.18.460926.