Scientists at Lawrence Livermore National Laboratory (LLNL) and the Linnaeus Centre for Bioinformatics (LCB) at Uppsala University in Sweden have developed a new bioinformatics technique for systematically analyzing key regions in DNA that help control gene activity. The collaboration was led by Krzysztof Fidelis in the United States and by Jan Komorowski in Sweden.
Understanding the complex regulatory mechanisms that tell genes when to switch on and off is one of the toughest challenges facing researchers attempting to discover how life works. "Binding sites," the areas of DNA that interact with the proteins that help control gene expression, can lie far along the DNA strand from the genes they influence. Recent research has also shown that gene expression can be controlled by several regulatory proteins working together at a combination of different binding sites.
(Regulatory proteins are known as "transcription factors"; transcription is the first step in the process by which the genetic information in DNA is decoded by the cell to manufacture proteins, the building blocks of life.)
"It's difficult to experimentally observe how transcription factors bind to DNA at a distance from a gene, or how regulation happens," said Fidelis, a computational biologist in Livermore's Biosciences Directorate. "But you can identify their binding sites in a promoter or regulatory region – there are usually a few of these for each gene. We wanted to see if we could somehow deduce how many transcription factors at a time, or combinations of factors, are coming together physically and how these combinations regulate genes."
"To accomplish this," Komorowski said, "we used a machine learning technique called rough sets to mathematically model general rules that could associate known binding sites and gene expression in yeast, which is one of the most widely studied organisms." From the analysis of gene activity under a variety of environmental conditions, the teams were able to develop a set of rules for predicting the location of binding site combinations based on limited binding site and gene expression data.
"We found that the same transcription factors, in slightly different combinations, could be responsible for the regulation of different genes," said Torgeir R. Hvidsten of the LCB. "Thus we now know that binding sites can be combined to allow a large number of expression outcomes using relatively few transcription factors."
Others collaborating in the project were Jerzy Tiuryn of the Faculty of Mathematics, Informatics, and Mechanics at Warsaw University in Poland; Bartosz Wilczynski of the Institute of Mathematics, Polish Academy of Sciences, and LLNL; and Andriy Kryshtafovych of LLNL. A report on the joint work appears in the June issue of the journal Genome Research.
The rough sets technique was developed by Zdzislaw Pawlak in Poland in the 1980s and is particularly well suited to building models from incomplete and uncertain data. It has been used in applications ranging from medical and financial data analysis to voice recognition and image processing. Applied to gene regulation, the approach was able to predict the location of regulatory sites for about one-third of the genes in the yeast genome – a success rate as good as or better than other current techniques.
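For readers curious how rough sets cope with uncertain data, the core idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the teams' method or data: the gene names, the two binding sites ("siteA", "siteB") and the expression labels are all invented. Genes that share the same binding-site pattern are treated as indistinguishable. A group whose members all show the same expression yields a certain rule, while a group with mixed outcomes falls into an uncertain "boundary" region.

```python
from collections import defaultdict

# Toy decision table (entirely hypothetical): each gene is described by which
# of two made-up binding sites occur in its promoter, plus an expression label.
GENES = {
    "g1": {"siteA": 1, "siteB": 1, "expr": "up"},
    "g2": {"siteA": 1, "siteB": 1, "expr": "up"},
    "g3": {"siteA": 1, "siteB": 0, "expr": "up"},
    "g4": {"siteA": 1, "siteB": 0, "expr": "down"},
    "g5": {"siteA": 0, "siteB": 0, "expr": "down"},
}


def indiscernibility_classes(table, attributes):
    """Group objects that have identical values on the given attributes."""
    groups = defaultdict(set)
    for name, row in table.items():
        key = tuple(row[a] for a in attributes)
        groups[key].add(name)
    return list(groups.values())


def approximations(table, attributes, decision, value):
    """Return the rough-set lower and upper approximations of the set of
    objects whose decision attribute equals `value`."""
    target = {n for n, row in table.items() if row[decision] == value}
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attributes):
        if block <= target:   # every member is in the target: certain
            lower |= block
        if block & target:    # at least one member is in the target: possible
            upper |= block
    return lower, upper


lower, upper = approximations(GENES, ["siteA", "siteB"], "expr", "up")
boundary = upper - lower  # genes whose outcome the sites cannot decide
print("certainly up:", sorted(lower))
print("possibly up:", sorted(upper))
print("undecided:", sorted(boundary))
```

In this toy table, genes carrying both sites are certainly "up," while the two genes carrying only siteA disagree with each other and so land in the boundary region. A rule derived from them would have to be flagged as uncertain.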
"The next step is to test this approach on different organisms, including microbes and vertebrates," Fidelis said. The growing number of organisms whose genomes have been sequenced has generated a wealth of DNA sequence information that could provide the raw material for analysis.