Computers can predict the detailed structure of small proteins nearly as well as experimental methods, at least some of the time, according to new studies by Howard Hughes Medical Institute researchers.
The findings, which were reported in the journal Science, provide a glimmer of hope that scientists eventually may be able to determine the structure of proteins from their genomic sequences, a problem that has seemed insurmountable.
"For more than 40 years, people have known the amino acid sequence of a protein specifies its three-dimensional structure, but no one has been able to translate the sequence into an accurate structure," said senior author David Baker, an HHMI researcher at the University of Washington. "The reason this research is exciting is that we're showing progress in predicting the structure from the sequence. It's not that the problem is solved, but that there is hope."
Proteins are biological machines, and scientists need to determine their structures to understand how they work. Currently, scientists determine structures exclusively by measuring the atomic characteristics of proteins in the lab. In contrast, "in this case, we never touched a test tube," Baker said. "We gave it to a computer and said, 'go.'"
In the study, a sophisticated computer program folded each of 17 short strings of amino acids into 100,000 possible variations. When the researchers compared the best predictions to the actual structures solved earlier by other scientists using experimental techniques, they had about the same success rate as the best hitters in major league baseball: roughly one in three.
"We achieved almost atomic resolution in structure prediction for about one-third of our benchmark set of small proteins," said first author Philip Bradley, a postdoctoral fellow in Baker's lab. "It is a real step forward to achieve structures that are in some way comparable to what you can get by experiments."
The encouraging results come from a refinement of a sophisticated computer modeling program called Rosetta, first developed several years ago in Baker's lab. The program works on the premise that proteins collapse into their lowest-energy state, like a ball that rolls down a hill until it comes to rest on level ground. The program calculates the energies of hundreds of thousands of possible shapes and selects the lowest-energy shape as its prediction.
The prediction process happens in two steps, Bradley said. The first stage uses an approximate model whose energy can be calculated rapidly, while the second uses a very detailed model for which the energy calculations take much longer but are much more accurate. A large-scale search through possible structures is carried out in the first stage, and promising locations are then explored in detail in the second stage.
The first stage takes advantage of the fact that all amino acids have identical sections, which form the protein backbone. The computer adds a fuzzy picture of the protruding side chains that give each amino acid its unique identity. The sequence of side chains ultimately gives each protein its characteristic shape, because each side chain prefers a particular environment and particular neighbors.
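The two-stage idea can be illustrated with a toy example. The sketch below is not Rosetta: it searches a made-up one-dimensional "shape" coordinate, and the functions coarse_energy and detailed_energy, along with every parameter value, are assumptions chosen only to show a broad cheap search followed by careful local refinement.

```python
import math
import random

def coarse_energy(x):
    # Cheap, approximate model (hypothetical): a single broad bowl.
    return (x - 2.0) ** 2

def detailed_energy(x):
    # More detailed stand-in model (hypothetical): the same bowl plus fine ripples.
    return (x - 2.0) ** 2 + 0.5 * math.sin(25.0 * x)

def two_stage_search(n_samples=10000, n_keep=20, n_refine=200, step=0.05, seed=0):
    rng = random.Random(seed)

    # Stage 1: large-scale random search, scored with the cheap model.
    candidates = [rng.uniform(-10.0, 10.0) for _ in range(n_samples)]
    candidates.sort(key=coarse_energy)
    promising = candidates[:n_keep]

    # Stage 2: explore around each promising candidate with the detailed model,
    # accepting only moves that lower the energy.
    best_x, best_e = None, float("inf")
    for start in promising:
        x, e = start, detailed_energy(start)
        for _ in range(n_refine):
            trial = x + rng.uniform(-step, step)
            trial_e = detailed_energy(trial)
            if trial_e < e:
                x, e = trial, trial_e
        if e < best_e:
            best_x, best_e = x, e

    # The lowest-energy candidate found is reported as the prediction.
    return best_x, best_e

if __name__ == "__main__":
    x, e = two_stage_search()
    print(f"predicted coordinate {x:.3f} with energy {e:.3f}")
```

The expensive model is evaluated only near the 20 candidates that survive the first stage, which is the point of splitting the search in two.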
Then the computer randomly twists, loops, and bends each amino acid sequence into 100,000 different shapes based on the preferred location of the amino acids. Some amino acids tend to dive toward the watery world of the protein surface while others take cover inside the protein. The computer also accounts for the social habits of the 20 amino acids; some want to be close to each other and others prefer to keep their distance.
In stage two, Rosetta replaces the fuzzy picture of the side chains with detailed, physically realistic models with all the atoms represented. From the positions of the atoms in the side chains and the protein backbone, the computer then uses a detailed, physical-chemistry-based force field, which favors close packing of atoms and hydrogen bonding, to compute the energy of the structure more accurately.
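As a rough illustration of the "surface versus interior" preference, a candidate shape can be scored by how far each amino acid sits from the center of the molecule. This is a deliberately simplified sketch, not Rosetta's scoring: the function burial_energy, the one-letter grouping in HYDROPHOBIC, and the linear distance score are all assumptions made for the example.

```python
import math

# A common simplified grouping of water-avoiding amino acids (an assumption here).
HYDROPHOBIC = set("AVLIMFW")

def centroid(coords):
    # Average position of all residues, used as the "middle" of the protein.
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def burial_energy(sequence, coords):
    """Toy score, lower is better: water-avoiding residues near the center,
    all other residues far from it."""
    center = centroid(coords)
    energy = 0.0
    for residue, position in zip(sequence, coords):
        distance = math.dist(position, center)
        # Penalize hydrophobic residues for being far from the center;
        # reward every other residue for it.
        energy += distance if residue in HYDROPHOBIC else -distance
    return energy

if __name__ == "__main__":
    # Four residues on a line; the two middle positions are the "interior".
    coords = [(-3.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(burial_energy("KLLK", coords))  # leucines buried, lysines exposed
    print(burial_energy("LKKL", coords))  # the reverse arrangement
```

With these coordinates the first arrangement scores lower than the second, so a search guided by this score would favor shapes that tuck the leucines inside.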
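A standard textbook way to express "close packing with no atoms on top of each other" is the Lennard-Jones 12-6 potential, which rewards atoms that sit at a comfortable contact distance and sharply penalizes overlap. The sketch below uses that generic formula as a stand-in; it is not Rosetta's force field, it omits hydrogen bonding entirely, and the sigma and epsilon values are placeholder assumptions.

```python
import math
from itertools import combinations

def lennard_jones(r, sigma=1.0, epsilon=1.0):
    # Textbook 12-6 potential: strongly positive when atoms overlap (small r),
    # most favorable (-epsilon) at r = 2**(1/6) * sigma, fading to zero far apart.
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)

def packing_energy(atoms, sigma=1.0, epsilon=1.0):
    # Sum the pair potential over every distinct pair of atom positions.
    return sum(
        lennard_jones(math.dist(a, b), sigma, epsilon)
        for a, b in combinations(atoms, 2)
    )

if __name__ == "__main__":
    contact = 2 ** (1 / 6)  # ideal separation for sigma = 1
    snug = [(0.0, 0.0, 0.0), (contact, 0.0, 0.0)]
    clashing = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
    loose = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    for name, atoms in [("snug", snug), ("clashing", clashing), ("loose", loose)]:
        print(name, packing_energy(atoms))
```

Of the three arrangements, the snug pair has the lowest energy: the clash is heavily penalized, and the loose pair gains almost nothing, which mirrors the jigsaw-puzzle picture of a well-packed protein.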
"What seems to be critical is the packing of the molecule," Baker said. "The protein fits together perfectly with no holes in the middle, and no atoms on top of each other. It's about as densely packed as it could be. It's like a three-dimensional jigsaw puzzle."
The researchers upped their odds of finding the right match by repeating the two-step process with 50 homologs of the proteins from other genomes, such as those of the mouse or fly. The protocol was first tested in a blind annual prediction experiment considered to be the highest standard for removing bias from protein structure prediction models.
"We can't compute the energies perfectly, but the biggest problem is the search through possible shapes," Baker said. "Where we were not getting the right answer on the computer, it was almost always the case that the actual structure had the lowest energy, so we would have succeeded if we had explored this part of the space."