Identifying protein targets in SARS-CoV-2 via machine learning

The coronavirus disease 2019 (COVID-19) pandemic has affected nearly 271 million people and claimed 5.32 million lives, with the most recent wave driven by the delta variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 adds to the list of infectious diseases that have posed potential global threats, including severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), Ebola, and Zika. Infections like these highlight the need for therapeutic agents to combat emerging pathogens.

Study: Language Models for the Prediction of SARS-CoV-2 Inhibitors. Image Credit: Andrii Vodolazhskyi/Shutterstock

The process of developing therapeutic solutions to novel viruses is tedious and prohibitively long, taking 10 to 15 years. The initial step of identifying promising molecules and therapeutic targets for further investigation is crucial because the chemical space is too vast to search exhaustively with costly experiments and trials. Tools from machine learning (ML) and high-performance computing (HPC) have been increasingly used to guide the selection of promising drug candidates. Although computational methods can partially offset the associated experimental costs, one normally needs a large compound library with measured properties to train the ML algorithm. Organizing a timely response to an emerging pandemic therefore poses a challenge for computational methods as well, since large datasets must first be generated for the target of interest.

This news article was a review of a preliminary scientific report that had not undergone peer-review at the time of publication. Since its initial publication, the scientific report has now been peer reviewed and accepted for publication in a Scientific Journal. Links to the preliminary and peer-reviewed reports are available in the Sources section at the bottom of this article.

To automate the drug discovery process, one needs an algorithm that (i) leverages existing large compound libraries without the need for chemical property measurements; (ii) predicts affinities for novel protein targets with very limited experimental data; and (iii) explores the chemical space for the target pathogen or infection to efficiently identify compounds for further investigation.

To satisfy these three criteria, researchers from the Oak Ridge National Laboratory leveraged high-performance computing (HPC) to train generalizable ML models for both candidate generation and affinity prediction.

Their study was recently posted to the preprint server bioRxiv* and offers insight into using an ML-based algorithm to analyze and predict therapeutic targets in emerging pathogens with a range of mutations.

Background

To take advantage of large existing compound libraries, the researchers used a text representation for molecule data known as the Simplified Molecular Input Line Entry System (SMILES). Using the Enamine REAL database as a starting point, they generated a novel dataset of approximately 9.6 billion unique molecules. The dataset was used to pre-train a Transformer model (BERT) using the mask prediction task commonly found in natural language processing applications. During pre-training, sub-sequences of a given molecule were replaced by a mask, and the model was trained to predict the appropriate sequence based on context. The model thus learnt a representation of chemical structure in a completely unsupervised manner that did not require additional property measurements.
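The masking step can be illustrated with a short Python sketch. It is not the researchers' code: the regular-expression tokenizer, the `[MASK]` symbol, and the 15% masking rate (borrowed from the original BERT setup) are all assumptions made for illustration, and the sketch masks individual tokens rather than longer sub-sequences.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption, not the study's tokenizer):
# bracket atoms, two-letter halogens, ring labels like %12, then any
# remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK = "[MASK]"  # hypothetical mask symbol


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_fraction=0.15, rng=None):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and a dict {position: original token},
    i.e. the answers a model would be trained to recover from context.
    """
    if not tokens:
        return [], {}
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    masked, targets = mask_tokens(tokenize(aspirin), rng=random.Random(0))
    print("".join(masked))   # SMILES with some tokens hidden
    print(targets)           # what the model must predict
```

Because the training signal comes entirely from the hidden tokens, no measured chemical property is needed, which is what makes this kind of pre-training unsupervised.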

While pre-training the deep learning language model (BERT) on ~9.6 billion molecules, the researchers achieved a peak performance of 603 petaflops in mixed precision. This reduced pre-training time from days to hours compared with previous efforts using this architecture, while increasing the dataset size by nearly an order of magnitude.

For scoring, the researchers fine-tuned the language model on an assembled set of multiple protein targets with binding affinity data and searched for inhibitors of two specific protein targets, SARS-CoV-2 Mpro and PLpro. They used a genetic algorithm to find optimal candidates, drawing on the generation and scoring capabilities of the language model.
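The general shape of such a genetic search can be sketched in a few lines of Python. Everything here is a stand-in: `toy_score` is a hypothetical placeholder for the fine-tuned affinity predictor, `mutate` is a placeholder for model-driven candidate generation, and the candidates are plain strings over a toy alphabet that are not checked for chemical validity. The selection scheme (keep the top few, refill by mutation) is one common choice and is not confirmed as the one used in the study.

```python
import random

ATOMS = "CNOS"  # toy alphabet; real candidates would be valid SMILES strings


def toy_score(candidate):
    """Placeholder for a learned affinity predictor (assumption).

    Rewards nitrogen and oxygen content; it has no chemical meaning.
    """
    return candidate.count("N") + 0.5 * candidate.count("O")


def mutate(candidate, rng):
    """Placeholder for model-driven generation: rewrite one position."""
    pos = rng.randrange(len(candidate))
    return candidate[:pos] + rng.choice(ATOMS) + candidate[pos + 1:]


def evolve(population, score, generations=20, n_keep=5, rng=None):
    """Keep the best n_keep candidates each round and refill by mutation.

    Returns the final population sorted from best to worst score.
    """
    rng = rng or random.Random()
    size = len(population)
    for _ in range(generations):
        survivors = sorted(population, key=score, reverse=True)[:n_keep]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(size - len(survivors))]
        population = survivors + children
    return sorted(population, key=score, reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    start = ["".join(rng.choice(ATOMS) for _ in range(8)) for _ in range(20)]
    best = evolve(start, toy_score, rng=rng)
    print(best[0], toy_score(best[0]))
```

Because the top candidates always survive to the next round, the best score in the population can never decrease; the quality of the result then depends entirely on how well the scoring function reflects real binding affinity.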

Implications

The main constraint in locating therapeutic targets in emerging pathogens and infections is the presence of mutations. The ML algorithm developed by the researchers can be used to accelerate the identification of protein-binding sites on the surfaces of mutating pathogens and to help model inhibitor-based drugs for such emerging therapeutic targets.

Journal references:

Article Revisions

  • May 9 2023 - The preprint preliminary research paper that this article was based upon was accepted for publication in a peer-reviewed Scientific Journal. This article was edited accordingly to include a link to the final peer-reviewed paper, now shown in the sources section.
Written by

Sreetama Dutt

Sreetama Dutt completed her B.Tech. in Biotechnology at SRM University in Chennai, India and holds an M.Sc. in Medical Microbiology from the University of Manchester, UK. Although she initially set out to build a career in laboratory-based research, medical writing and communications caught her when she least expected it. Of course, nothing is a coincidence.

Citation

Dutt, Sreetama. (2023, May 09). Identifying protein targets in SARS-CoV-2 via machine learning. News-Medical. https://www.news-medical.net/news/20211217/Identifying-protein-targets-in-SARS-CoV-2-via-machine-learning.aspx.
