Scientists at Lawrence Livermore National Laboratory (LLNL) and the Linnaeus Centre for Bioinformatics (LCB) at Uppsala University in Sweden have developed a new bioinformatics technique for systematically analyzing key regions in DNA that help control gene activity. The cooperative efforts were headed by Krzysztof Fidelis in the United States and by Jan Komorowski in Sweden.
Understanding the complex regulatory mechanisms that tell genes when to switch on and off is one of the toughest challenges facing researchers attempting to discover how life works. "Binding sites," areas of DNA that interact with the proteins that help control gene expression, can lie far along the DNA strand from the genes they influence. Recent research has also shown that gene expression can be controlled by several regulatory proteins working together at a combination of different binding sites.
(Regulatory proteins are known as "transcription factors"; transcription is the first step in the process by which the genetic information in DNA is decoded by the cell to manufacture proteins, the building blocks of life.)
"It's difficult to experimentally observe how transcription factors bind to DNA at a distance from a gene, or how regulation happens," said Fidelis, a computational biologist in Livermore's Biosciences Directorate. "But you can identify their binding sites in a promoter or regulatory region – there are usually a few of these for each gene. We wanted to see if we could somehow deduce how many transcription factors at a time, or combinations of factors, are coming together physically and how these combinations regulate genes."
"To accomplish this," Komorowski said, "we used a machine learning technique called rough sets to mathematically model general rules that could associate known binding sites and gene expression in yeast, which is one of the most widely studied organisms." From the analysis of gene activity under a variety of environmental conditions, the teams were able to develop a set of rules for predicting the location of binding site combinations based on limited binding site and gene expression data.