Scientists at the University of Illinois at Urbana-Champaign have developed deep generative models to predict undiscovered sequences of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike protein. These hypothetical sequences could be useful for future pandemic preparedness. The study is currently available on the bioRxiv* preprint server.
Study: PandoGen: Generating complete instances of future SARS-CoV2 sequences using Deep Learning.
Deep generative models are used to generate complete and realistic samples of different kinds of objects, such as images, text, and computer code. Among these models, Large Language Models (LLMs) have recently gained immense popularity because of their ability to follow human instructions and perform competitive programming at a human level.
Protein Language Models (PLMs) are based on LLM designs and can model biological sequences and generate samples with interesting properties.
In the current study, scientists explored novel methods to train a PLM to generate complete, self-contained, realistic, and not-yet-known samples of SARS-CoV-2 spike sequences. In general, LLMs are trained using a known data set to parameterize the probability distribution of the targeted data.
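To make the idea of parameterizing a probability distribution from known data concrete, here is a minimal sketch. It is not the PandoGen architecture: it swaps in a far simpler first-order (bigram) model over residue letters, and the function names and toy training strings are hypothetical. It only illustrates the general recipe of fitting a distribution to known sequences, scoring sequences under it, and sampling new ones.

```python
import math
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical boundary tokens


def fit_bigram(sequences):
    """Estimate P(next residue | current residue) by counting transitions."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [START, *seq, END]
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
        for cur, nxts in counts.items()
    }


def log_likelihood(model, seq):
    """Sum of log transition probabilities; -inf if a transition was never seen."""
    tokens = [START, *seq, END]
    total = 0.0
    for cur, nxt in zip(tokens, tokens[1:]):
        p = model.get(cur, {}).get(nxt, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def sample(model, rng, max_len=50):
    """Draw one sequence residue by residue until END or max_len is reached."""
    out, cur = [], START
    while len(out) < max_len:
        choices = model[cur]
        nxt = rng.choices(list(choices), weights=list(choices.values()))[0]
        if nxt == END:
            break
        out.append(nxt)
        cur = nxt
    return "".join(out)


# Toy stand-ins for training sequences (not real spike data).
toy_training = ["MFVFL", "MFVLL", "MFIFL"]
model = fit_bigram(toy_training)
generated = sample(model, random.Random(0))
```

A model of this kind can emit sequences that never appeared in its training set while still assigning each one a probability, which is the property that generative approaches to sequence prediction rely on.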
The scientists primarily focused on the SARS-CoV-2 spike protein because of its significant involvement in the viral entry process and ability to induce host immune responses. The spike protein initiates SARS-CoV-2 entry into host cells by interacting with the host cell membrane receptor angiotensin-converting enzyme 2 (ACE2).
Many therapeutic and preventive interventions targeting the spike protein have been developed during the coronavirus disease 2019 (COVID-19) pandemic, including therapeutic monoclonal antibodies and COVID-19 vaccines. Thus, advance knowledge of future spike protein sequences would be helpful for developing novel variant-specific vaccines and monoclonal antibodies.
The scientists developed a deep generative model, PandoGen, and trained it on spike sequences deposited in the GISAID (Global Initiative on Sharing All Influenza Data) database on or before June 15, 2021. The sequences generated by the model were benchmarked against sequences reported after this date.
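The date-based separation of training and benchmark data described here can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a sequence string paired with a deposit date), the function name, and the toy records are hypothetical, and excluding from the benchmark set any sequence already seen before the cutoff is an assumption about how such a split would typically be done, not a detail confirmed by the article.

```python
from datetime import date

CUTOFF = date(2021, 6, 15)  # training cutoff stated in the article


def temporal_split(records, cutoff=CUTOFF):
    """Split (sequence, deposit_date) pairs into training and held-out sets.

    Sequences deposited on or before the cutoff form the training set.
    The held-out set keeps only sequences first seen after the cutoff.
    """
    train = {seq for seq, deposited in records if deposited <= cutoff}
    future = {seq for seq, deposited in records if deposited > cutoff} - train
    return train, future


# Toy records (hypothetical, not real GISAID entries).
records = [
    ("MFVFL", date(2021, 3, 1)),
    ("MFVLL", date(2021, 6, 15)),   # on the cutoff date: counts as training
    ("MFVFL", date(2021, 8, 2)),    # seen before the cutoff, so not "future"
    ("MFIFL", date(2021, 9, 30)),
]
train_seqs, future_seqs = temporal_split(records)
```

Splitting by date rather than at random means the model is only scored on sequences it could not have seen, which mirrors the forecasting setting the study targets.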
The model's functional validation revealed that PandoGen can generate high-quality sample sequences of the spike protein that are significantly different from the training sequences. This could be because the model has explicit training constructs that prevent it from regenerating the training sequences and force it to generate sample sequences with significant differences.
A comparison of the model-generated sample sequences with GISAID-derived sequences revealed that PandoGen can generate a high fraction of real sequences. The model also showed proficiency in generating novel sequences associated with GISAID cases.
The study describes a new method for training deep generative models to generate hypothetical SARS-CoV-2 spike sequences that have not yet been discovered but have the potential to cause future pandemics. The training pipeline uses information that is available in GISAID and does not require any additional laboratory experiments for sequence characterization.
A comparison of PandoGen with a standard model reveals that the new model is more proficient at generating a high fraction of real, salient, and novel sequences. Specifically, it outperforms the standard model by a factor of four in the number of novel sequences and by a factor of almost ten in the case counts of the generated corpus. Moreover, the study finds that about 70% of the higher-ranked sequences generated by the model were subsequently discovered.
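The kinds of quantities compared above can be illustrated with a short sketch. The definitions used here are assumptions made for illustration, not the study's exact metrics: a generated sequence is treated as "real" if it matches any known sequence, as "novel" if it is absent from the training set but reported later, and the case count is the total number of later reports of those novel sequences. The function name and toy data are hypothetical.

```python
def evaluate(generated, train_seqs, future_case_counts):
    """Summarize a generated corpus against training and later-reported data.

    generated: list of generated sequences (duplicates allowed)
    train_seqs: set of sequences available at training time
    future_case_counts: dict mapping later-reported sequence -> case count
    """
    unique = set(generated)
    real = {s for s in unique if s in train_seqs or s in future_case_counts}
    novel = {s for s in unique if s in future_case_counts and s not in train_seqs}
    return {
        "fraction_real": len(real) / len(unique) if unique else 0.0,
        "n_novel": len(novel),
        "novel_case_count": sum(future_case_counts[s] for s in novel),
    }


# Toy data (hypothetical, for illustration only).
generated = ["AAA", "AAB", "ABB", "AAB", "CCC"]
train_seqs = {"AAA"}
future_case_counts = {"AAB": 120, "ABB": 5, "BBB": 40}

metrics = evaluate(generated, train_seqs, future_case_counts)
```

Running the same function on the output of two different models and taking the ratio of the resulting numbers gives the kind of side-by-side comparison reported in the study.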
As mentioned by the scientists, the model could serve as a promising platform for generating hypothetical SARS-CoV-2 spike sequences using publicly available resources. In addition, the information obtained from the model could be useful for advance preparation against future pandemic situations.