Identifying protein targets in SARS-CoV-2 via machine learning

By Sreetama Dutt, M.Sc. Reviewed by Aimee Molineux. Dec 17 2021

The coronavirus disease 2019 (COVID-19) pandemic has affected nearly 271 million people and claimed 5.32 million lives, with the most recent wave driven by the delta variant of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 adds to a list of infectious diseases that have posed potential global threats, including severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), Ebola, and Zika. Such infections highlight the need to develop therapeutic agents against emerging pathogens.

Study: Language Models for the Prediction of SARS-CoV-2 Inhibitors. Image Credit: Andrii Vodolazhskyi/Shutterstock

Developing therapeutic solutions to novel viruses is tedious and prohibitively slow, taking 10 to 15 years. The initial step of identifying promising molecules and therapeutic targets for further investigation is crucial because the chemical space is too vast to search exhaustively with costly experiments and trials. Tools from machine learning (ML) and high-performance computing (HPC) are increasingly used to guide the selection of promising drug candidates. Although computational methods can reduce some of the experimental costs, training an ML algorithm normally requires a large compound library with measured properties. Organizing a timely response to an emerging pandemic is therefore also a challenge for computational methods, because large datasets must first be generated for the target of interest.

This news article was a review of a preliminary scientific report that had not undergone peer review at the time of publication. Since its initial publication, the scientific report has been peer reviewed and accepted for publication in a scientific journal. Links to the preliminary and peer-reviewed reports are available in the Journal references section at the bottom of this article.

To automate the drug discovery process, one needs an algorithm that (i) leverages existing large compound libraries without the need for chemical property measurements; (ii) predicts affinities for novel protein targets with very limited available experimental data; and (iii) explores the chemical space of the target pathogen or infection to efficiently identify compounds for further investigation.

To satisfy these three criteria, researchers from the Oak Ridge National Laboratory used high-performance computing (HPC) to train generalizable ML models for both candidate generation and affinity prediction.

Their study was recently posted to the preprint server bioRxiv and offers insight into using an ML-based algorithm to analyze and predict therapeutic targets in emerging pathogens with a range of mutations.

Background

To take advantage of large existing compound libraries, the researchers used a text representation for molecule data known as the Simplified Molecular Input Line Entry System (SMILES). Using the Enamine REAL database as a starting point, they generated a novel dataset of approximately 9.6 billion unique molecules. The dataset was used to pre-train a Transformer model (BERT) with the mask prediction task commonly found in natural language processing applications. During pre-training, sub-sequences of a given molecule were replaced by a mask, and the model was trained to predict the original sequence from its context. The model thus learned a representation of chemical structure in a completely unsupervised manner, without requiring additional property measurements.
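The mask prediction task can be illustrated with a minimal Python sketch. It only prepares one training example (a masked SMILES string plus the tokens to be predicted); no model is trained. The tokenizer pattern, the 15% mask fraction, and the helper names `tokenize` and `mask_tokens` are assumptions made for illustration and are not taken from the study, which may tokenize and mask differently.

```python
import random
import re

# Assumed, simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# then any single character. Not the tokenizer used in the study.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
MASK = "[MASK]"


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token a model would be asked to predict.
    The mask fraction is an assumed value for illustration.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), n_masked)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
print(targets)
```

Because the prediction targets come from the molecule itself, examples like this can be produced from any compound library without measured properties, which is what makes the pre-training unsupervised.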

While pre-training the deep learning language model (BERT) on ~9.6 billion molecules, the researchers achieved a peak performance of 603 petaflops in mixed precision. Compared with previous efforts using this architecture, the work reduced pre-training time from days to hours and increased the dataset size by nearly an order of magnitude.

For scoring, the researchers fine-tuned the language model on an assembled set of multiple protein targets with binding affinity data, then searched for inhibitors of two specific protein targets, SARS-CoV-2 Mpro and PLpro. They used a genetic algorithm to find optimal candidates, drawing on the generation and scoring capabilities of the language model.
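A generate, score, and select loop of this kind can be sketched in a few lines of Python. This is a toy illustration only, not the researchers' implementation: `toy_score` stands in for the fine-tuned affinity model, `toy_mutate` stands in for language-model-driven generation, and the population size, generation count, and placeholder strings are all assumptions.

```python
import random


def toy_score(candidate):
    """Stand-in for a fine-tuned affinity model: counts 'N' characters."""
    return candidate.count("N")


def toy_mutate(candidate, rng, alphabet="CNO"):
    """Stand-in for model-driven generation: change one random character."""
    pos = rng.randrange(len(candidate))
    return candidate[:pos] + rng.choice(alphabet) + candidate[pos + 1:]


def genetic_search(population, score, mutate, generations=20, keep=4, seed=0):
    """Evolve a population by repeated mutation and top-k selection."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(generations):
        # Generation step: propose one mutated child per current member.
        children = [mutate(member, rng) for member in population]
        # Scoring and selection step: pool parents and children, then keep
        # the highest-scoring unique candidates (ties broken alphabetically).
        pool = set(population) | set(children)
        population = sorted(pool, key=lambda c: (-score(c), c))[:keep]
    return population


# Placeholder starting strings, chosen only to exercise the loop.
start = ["CCCCCC", "CCOCCC", "COCCOC", "CCCCOC"]
best = genetic_search(start, toy_score, toy_mutate)
print(best)
```

Keeping parents in the selection pool means the best score never decreases between generations. In a real search, the scoring function is the expensive step, so the number of candidates evaluated per generation would be the main cost to control.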

Implications

The main constraint in locating therapeutic targets in emerging pathogens and infections is the presence of mutations. The ML algorithm developed by the researchers can be used to accelerate the identification of protein-binding sites on the surfaces of mutating pathogens and to help model inhibitor-based drugs for such emerging therapeutic targets.

Journal references:

Preliminary scientific report. Blanchard AE, Gounley J, Bhowmik D, et al. (2021). Language Models for the Prediction of SARS-CoV-2 Inhibitors. bioRxiv. doi:10.1101/2021.12.10.471928. https://www.biorxiv.org/content/10.1101/2021.12.10.471928v1.
Peer reviewed and published scientific report. Blanchard, Andrew E, John Gounley, Debsindhu Bhowmik, Mayanka Chandra Shekar, Isaac Lyngaas, Shang Gao, Junqi Yin, Aristeidis Tsaris, Feiyi Wang, and Jens Glaser. 2022. “Language Models for the Prediction of SARS-CoV-2 Inhibitors.” The International Journal of High Performance Computing Applications 36 (5-6): 587–602. https://doi.org/10.1177/10943420221121804. https://journals.sagepub.com/doi/10.1177/10943420221121804.

Article Revisions

May 9 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.

Written by

Sreetama Dutt

Sreetama Dutt completed her B.Tech. in Biotechnology at SRM University in Chennai, India, and holds an M.Sc. in Medical Microbiology from the University of Manchester, UK. She had initially decided to build her career in laboratory-based research, but medical writing and communications caught her when she least expected it. Of course, nothing is a coincidence.
