Identifying protein targets in SARS-CoV-2 via machine learning

The coronavirus disease 2019 (COVID-19) pandemic has affected nearly 271 million people and claimed 5.32 million lives, with the most recent wave driven by the delta variant of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 joins a list of infectious diseases that have posed potential global threats, including severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), Ebola, and Zika. Such infections highlight the need to develop therapeutic agents against emerging pathogens.

Study: Language Models for the Prediction of SARS-CoV-2 Inhibitors. Image Credit: Andrii Vodolazhskyi/Shutterstock

Developing therapeutic solutions for novel viruses is tedious and prohibitively slow, taking 10 to 15 years. The initial step of identifying promising molecules and therapeutic targets for further investigation is crucial because the chemical space is too vast to search exhaustively with costly experiments and trials. Tools from machine learning (ML) and high-performance computing (HPC) have been increasingly used to guide the selection of promising drug candidates. Although computational methods can alleviate some of the associated experimental costs, one normally needs a large compound library with measured properties to train the ML algorithm. Organizing a timely response to an emerging pandemic therefore also poses a challenge for computational methods, as large datasets must first be generated for the target of interest.

To automate the drug discovery process, one needs an algorithm that (i) leverages existing large compound libraries without the need for chemical property measurements; (ii) predicts affinities for novel protein targets with very limited experimental data; and (iii) explores the chemical space of the target pathogen/infection to efficiently identify compounds for further investigation.

To satisfy these three criteria, researchers from the Oak Ridge National Laboratory leveraged HPC to train generalizable ML models for both candidate generation and affinity prediction.

Their study was recently posted to the preprint server bioRxiv* and offers insight into using an ML-based algorithm to analyze and predict therapeutic targets in emerging pathogens with a range of mutations.


To take advantage of large existing compound libraries, the researchers used a text representation for molecule data known as the Simplified Molecular Input Line Entry System (SMILES). Using the Enamine REAL database as a starting point, they generated a novel dataset of approximately 9.6 billion unique molecules. The dataset was used to pre-train a Transformer model (BERT) using the mask prediction task commonly found in natural language processing applications. During pre-training, sub-sequences of a given molecule were replaced by a mask, and the model was tasked with predicting the appropriate sequence based on context. The model thus learned a representation of chemical structure in a completely unsupervised manner that did not require additional property measurements.
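The masking step can be illustrated with a minimal Python sketch. It is not the researchers' code: the regular-expression tokenizer, the `[MASK]` string, and the 15% masking fraction are all assumptions chosen for illustration, and the example molecule (aspirin) is not taken from the study.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for illustration, not the study's):
# bracket atoms such as [NH3+], the two-letter halogens Br and Cl, then any
# single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
MASK = "[MASK]"  # hypothetical mask symbol


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_fraction=0.15, rng=None):
    """Hide a random subset of tokens.

    Returns the masked token list and a dict mapping each masked position
    to the original token, which is what a model would be asked to predict.
    """
    rng = rng or random.Random()
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, used only as an example
masked, targets = mask_tokens(tokens, rng=random.Random(0))
print(masked)
print(targets)
```

Because the prediction targets come from the molecule itself, no measured property is needed, which is what makes this kind of pre-training unsupervised.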

While pre-training the deep learning language model (BERT) on ~9.6 billion molecules, the researchers achieved a peak performance of 603 petaflops in mixed precision. This reduced pre-training time from days to hours compared with previous efforts using this architecture, while also increasing the dataset size by nearly an order of magnitude.

For scoring, the researchers fine-tuned the language model on an assembled set of multiple protein targets with binding affinity data and searched for inhibitors of specific protein targets, SARS-CoV-2 Mpro and PLpro. They used a genetic algorithm approach to find optimal candidates, drawing on the generation and scoring capabilities of the language model.
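The general shape of such a search can be sketched in Python. This is a mutation-only evolutionary loop, one simple form of genetic algorithm, and not the researchers' implementation. In the study the language model supplies both the candidate generation and the scoring; here `toy_score` and `toy_mutate` are hypothetical placeholders operating on arbitrary strings, and the function name `genetic_search` is likewise an assumption.

```python
import random


def genetic_search(population, score, mutate, generations=20, keep=5, rng=None):
    """Evolve candidates by repeated selection and mutation.

    score(candidate) returns a number (higher is better);
    mutate(candidate, rng) returns a modified copy of the candidate.
    """
    rng = rng or random.Random()
    population = list(population)
    size = len(population)
    for _ in range(generations):
        # Selection: drop duplicates, then keep the highest-scoring candidates.
        unique = list(dict.fromkeys(population))
        survivors = sorted(unique, key=score, reverse=True)[:keep]
        # Variation: refill the population with mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(size - len(survivors))]
        population = survivors + children
    return sorted(dict.fromkeys(population), key=score, reverse=True)


# Hypothetical stand-ins for the fine-tuned model's scoring and generation.
def toy_score(candidate):
    return candidate.count("N")  # placeholder for a predicted binding affinity


def toy_mutate(candidate, rng):
    pos = rng.randrange(len(candidate))
    return candidate[:pos] + rng.choice("CNO") + candidate[pos + 1:]


seed_rng = random.Random(1)
initial = ["".join(seed_rng.choice("CNO") for _ in range(12)) for _ in range(30)]
ranked = genetic_search(initial, toy_score, toy_mutate, rng=random.Random(2))
print(ranked[0], toy_score(ranked[0]))
```

Because the top-scoring survivors are carried over unchanged each generation, the best score never decreases. Swapping the placeholders for a model-based scorer and generator would leave the loop itself unchanged.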


The main constraint in locating therapeutic targets in emerging pathogens and infections is the presence of mutations. The ML algorithm developed by the researchers can be used to accelerate the identification of protein-binding sites on the surfaces of mutating pathogens and to help model inhibitor-based drugs for such emerging therapeutic targets.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Journal reference:
Sreetama Dutt

Written by

Sreetama Dutt

Sreetama Dutt completed her B.Tech. in Biotechnology at SRM University in Chennai, India, and holds an M.Sc. in Medical Microbiology from the University of Manchester, UK. Having initially decided to build her career in laboratory-based research, she found that medical writing and communications caught her when she least expected it. Of course, nothing is a coincidence.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Dutt, Sreetama. (2021, December 17). Identifying protein targets in SARS-CoV-2 via machine learning. News-Medical. Retrieved on May 17, 2022 from

  • MLA

    Dutt, Sreetama. "Identifying protein targets in SARS-CoV-2 via machine learning". News-Medical. 17 May 2022. <>.

  • Chicago

    Dutt, Sreetama. "Identifying protein targets in SARS-CoV-2 via machine learning". News-Medical. (accessed May 17, 2022).

  • Harvard

    Dutt, Sreetama. 2021. Identifying protein targets in SARS-CoV-2 via machine learning. News-Medical, viewed 17 May 2022,


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.