Designing de novo proteins holds immense potential for achieving superior combinations of novel functions and mechanical properties, thereby advancing biological and engineering applications. However, testing the vast number of possible amino acid sequences, along with the experimental costs of designing novel proteins with targeted structural properties or features, remains a challenge.
In a recent study published in the journal Chem, researchers used attention-based diffusion models to efficiently generate novel protein sequences with prescribed secondary structures.
Study: Generative design of de novo proteins based on secondary-structure constraints using an attention-based diffusion model.
About the study
In the present study, researchers discuss two generative deep-learning models that predict amino acid sequences and generate folded three-dimensional (3D) protein structures from secondary-structure design constraints, specified either as overall content or per residue.
The team focused on the mechanical properties of proteins for the analysis and mapping between primary amino acid sequences and secondary protein structures. The models considered conditioning descriptions as inputs to produce amino acid sequences through conditional diffusion based on attention.
The AlphaFold and OmegaFold methods were used to generate 3D protein structures. Two models were trained using the Protein Data Bank (PDB) dataset.
Model A received the fractional content of each secondary-structure type as input, whereas Model B took per-residue secondary-structure data as input, from which amino acid sequences were predicted and 3D protein models constructed. Each model could produce multiple samples, which were then narrowed down by selecting those that best satisfied the conditioning inputs or those that showed the least similarity to known proteins.
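The selection step described above can be illustrated with a minimal sketch. It assumes that each generated sample has already been folded and its secondary-structure content measured, and it ranks samples by Euclidean distance to the target content. The distance metric, the three-component vectors, and the sample labels are illustrative assumptions, not details taken from the study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length content vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_fit(target, samples):
    """Sort (label, measured_content) pairs, best fit to the target first."""
    return sorted(samples, key=lambda pair: euclidean(target, pair[1]))

# Hypothetical target: 60% helix, 10% sheet, 30% everything else.
target = [0.6, 0.1, 0.3]
# Placeholder labels and made-up measurements, not results from the study.
samples = [
    ("sample_1", [0.2, 0.5, 0.3]),
    ("sample_2", [0.55, 0.15, 0.3]),
    ("sample_3", [0.8, 0.0, 0.2]),
]
for name, content in rank_by_fit(target, samples):
    print(name, round(euclidean(target, content), 3))
```

Ranking by least similarity to known proteins would follow the same pattern, with the sort key replaced by a similarity score from a sequence search.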
The diffusion models used U-Net convolutional neural networks with interlinked transformer and convolutional layering, skip connections, and attention modules to identify noise at every step for subsequent removal.
To assess protein novelty, the de novo proteins were compared with Critical Assessment of Structure Prediction (CASP) 14 and 15 target proteins using Basic Local Alignment Search Tool (BLAST) analysis. The generative models constructed protein sequences from random signals under conditioning by reversing the diffusion process step by step. Eight secondary-structure parameters were assessed using Define Secondary Structure of Proteins (DSSP) codes.
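To make the idea of reversing diffusion step by step concrete, the following is a toy sketch of a standard denoising-diffusion sampling loop. The article does not state which diffusion formulation or noise schedule the study used, so the linear schedule, the update rule, and every name here are assumptions. In particular, `toy_predictor` is a stand-in for the trained, conditioned U-Net and does not model proteins.

```python
import math
import random

def reverse_diffusion(predict_noise, length, condition, steps=50, seed=0):
    """Start from Gaussian noise and denoise it over `steps` reverse steps."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    rng = random.Random(seed)
    # Assumed linear noise schedule; the study's schedule is not given here.
    betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # pure random signal
    for t in range(steps - 1, -1, -1):
        eps = predict_noise(x, t, condition)  # estimated noise at step t
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on last step
        x = [scale * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x

def toy_predictor(x, t, condition):
    """Stand-in for a trained network: treats deviation from `condition` as noise."""
    return [xi - condition for xi in x]

sample = reverse_diffusion(toy_predictor, length=8, condition=0.5)
print([round(v, 3) for v in sample])
```

In a real model, the output would be a learned sequence representation that is decoded into amino acids, and the conditioning would be the secondary-structure input rather than a single number.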
For model A, the conditioning vector parameters included α helix, extended parallel and/or anti-parallel β sheet conformation, hydrogen-bonded turns (3-, 4-, or 5-turn), unstructured regions, β bridge, 3₁₀ helix, π helix, and bends.
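As an illustration of how such an eight-component conditioning vector relates to a DSSP annotation, the sketch below computes the fraction of residues in each state from a per-residue DSSP string. The single-letter codes are the standard DSSP ones (H, E, T, B, G, I, S), with "~" used here for unstructured residues. The ordering of the vector and the example string are assumptions made for this sketch.

```python
from collections import Counter

# Eight DSSP states in the order they are listed in the text above:
# alpha helix, extended strand, turn, unstructured, beta bridge,
# 3-10 helix, pi helix, bend. The ordering is an assumption.
DSSP_STATES = ["H", "E", "T", "~", "B", "G", "I", "S"]

def content_vector(dssp):
    """Return the fraction of residues in each of the eight DSSP states."""
    if not dssp:
        raise ValueError("empty DSSP string")
    # DSSP output often marks unstructured residues with a blank or "-".
    normalized = dssp.replace(" ", "~").replace("-", "~")
    counts = Counter(normalized)
    unknown = set(counts) - set(DSSP_STATES)
    if unknown:
        raise ValueError(f"unexpected DSSP codes: {sorted(unknown)}")
    return [counts[state] / len(normalized) for state in DSSP_STATES]

# Made-up annotation for illustration only.
example = "~~HHHHHHHH~~EEEE~~~~"
print(dict(zip(DSSP_STATES, content_vector(example))))
```

The fractions always sum to one, so a design target for this kind of model can be written as eight numbers regardless of protein length.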
For model B, five cases with varying secondary structure distributions were considered. These included a predominant β sheet, a long α helix with a breaker in the center, a small α helix, a β sheet sandwiched between two α-helical domains, and a partially disordered-helical protein.
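A per-residue design of this kind can be written down as a string of DSSP codes, one per residue. The sketch below builds such a string from segments and one-hot encodes it. The segment lengths are hypothetical, and whether the study's model consumes a one-hot encoding is an assumption; the example only shows what residue-level conditioning might look like for the "long α helix with a breaker in the center" case.

```python
DSSP_STATES = "HET~BGIS"  # standard DSSP codes, "~" for unstructured

def build_design(segments):
    """Concatenate (code, length) segments into a per-residue DSSP string."""
    for code, length in segments:
        if code not in DSSP_STATES or length < 0:
            raise ValueError(f"invalid segment: {(code, length)}")
    return "".join(code * length for code, length in segments)

def one_hot(design):
    """Encode each residue's DSSP code as an 8-element one-hot row."""
    index = {state: i for i, state in enumerate(DSSP_STATES)}
    return [[1 if index[code] == j else 0 for j in range(len(DSSP_STATES))]
            for code in design]

# Hypothetical lengths: two 20-residue helices joined by a 2-residue breaker.
helix_with_breaker = build_design([("H", 20), ("~", 2), ("H", 20)])
print(helix_with_breaker)
print(len(one_hot(helix_with_breaker)), "residues x", len(DSSP_STATES), "states")
```

Unlike the overall-content vector, this representation fixes where each structural element sits along the chain, which is what allows residue-level control.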
The diffusion models were found to efficiently design proteins with secondary structure specifications and de novo amino acid sequences that have not been discovered previously.
The generative models provided robust results, even for imperfect inputs and unrealistic designs. As a result, the use of these models has the potential to be expanded to generate proteins with other clinically and functionally relevant properties.
The per-residue secondary structure-based model was more accurate and yielded more diverse amino acid sequences, particularly for α-helical structures.
Both models handled varied design objectives robustly and offered new approaches to discovering superior protein materials and systems. The analysis of model A covered several representative cases, such as those with high β sheet content, a mixture of α-helical and β sheet content, pure α-helical content, significantly disordered α-helices, and completely disordered proteins.
AlphaFold and OmegaFold analysis of the predicted β-strand assembly into higher-order filamentous structures yielded comparable results. The BLAST analysis showed that some generated sequences were similar to existing amino acid sequences; novelty could be enhanced by increasing conditioning probabilities or adding noise to conditioning vectors during training.
Model B results showed good agreement with the design objectives, thus confirming that the protein generative model could design de novo proteins with geometric specifications and secondary structure localization. Developing models that provide detailed atomic coordinates could improve protein design.
For model B, the BLAST analysis indicated 50% to 60% similarity between existing proteins and the generated proteins. Model B generated proteins more effectively than Model A.
The current study reports two deep-learning models that can predict amino acid sequences and 3D protein structures based on secondary-structure design objectives. These novel models are robust, reliable, and can generate new protein sequences not yet discovered from natural mechanisms or systems.
The models generated protein sequences with the desired secondary structure conformations. The two could also be combined: model A could be used to obtain an initial protein sequence from overall secondary-structure content, whereas model B could be used to refine the sequence by specifying the residue-level detail of the secondary structures.
The models not only seek to respect the conditional inputs but also yield to the underlying constraints of physically possible secondary structures learned during training. This approach has the potential to accelerate the design of new proteins for use in medicine, industry, and other bioengineering applications.
Further research must include additional conditioning, explore properties of the generated proteins beyond structural objectives, such as biological activity, and improve the diversity of generated sequences relative to existing proteins.
- Ni, B., Kaplan, D. L., & Buehler, M. J. (2023). Generative design of de novo proteins based on secondary-structure constraints using an attention-based diffusion model. Chem. doi:10.1016/j.chempr.2023.03.02