
From the works of Shakespeare to the genomes of viruses

Published on February 11, 2009 at 9:37 PM

What does uncovering the true authorship of plays attributed to Shakespeare have to do with identifying our genetic ancestors or classifying new life forms?

All involve the comparative analysis of long sets of data and all will benefit from a unique new analytical tool developed by researchers at the Lawrence Berkeley National Laboratory (Berkeley Lab).

Sung-Hou Kim, a chemist who holds a joint appointment with Berkeley Lab's Physical Biosciences Division and UC Berkeley's Chemistry Department, led the development of a technique called "feature frequency profiles" (FFP) that makes it possible to compare, classify, index and catalog just about any type of linear information that can be stored electronically. The kinds of information that can be analyzed with the FFP technique include nucleotide base and amino acid sequences, books, documents and possibly images. It could even prove to be the ultimate music organizer.

"I call our technique a tool for demographic phylogeny because it enables us to organize large sets of data into groups and find relationships among these groups," says Kim. "The idea is to organize data sets into groups based on the frequency at which key features occur and then look for relationships. This is the reverse of what is usually done, where you find relationships in the data set then organize the data set into groups based on those relationships."

Using the FFP technique, Kim and his colleagues can create "family trees" that put into easy-to-see perspective the relationships between groups within a data set, whether those groups are books or genomes. The key is to identify the "optimal features" for profiling. For books, the optimal feature consisted of sequences of text about eight letters in length. For mammalian genomes, the optimal feature consisted of sequences of nucleotide bases about 18 base pairs in length. However, to keep their genomic computations manageable, Kim and his colleagues reduced the four-letter DNA alphabet (adenine, guanine, thymine and cytosine) to a two-letter alphabet, using R for the purine bases (adenine and guanine) and Y for the pyrimidine bases (thymine and cytosine). In a series of tests run on books and genomes, the FFP technique provided a more comprehensive and in some cases more accurate analysis than the standard analytical tools.
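The two steps described above, collapsing DNA to a purine/pyrimidine alphabet and counting how often each fixed-length feature occurs, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_ry` and `feature_frequency_profile` are hypothetical, the features are taken as overlapping windows, and the toy sequence and feature length of 3 are chosen for readability rather than taken from the study.

```python
from collections import Counter

# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_TABLE = str.maketrans("AGCT", "RRYY")


def to_ry(dna):
    """Collapse a DNA string to the two-letter R/Y alphabet."""
    return dna.upper().translate(RY_TABLE)


def feature_frequency_profile(sequence, length):
    """Relative frequency of every overlapping window of `length` symbols.

    Assumption of this sketch: features are overlapping windows that
    slide one symbol at a time along the sequence.
    """
    if length <= 0 or length > len(sequence):
        raise ValueError("feature length must be between 1 and len(sequence)")
    counts = Counter(
        sequence[i:i + length] for i in range(len(sequence) - length + 1)
    )
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


# Toy input; the article reports features of about 18 bases for mammals.
reduced = to_ry("AGCTTAGC")
profile = feature_frequency_profile(reduced, 3)
print(reduced)
print(profile)
```

With a two-letter alphabet there are at most 2^18 (about 262,000) possible features of length 18, compared with 4^18 (about 69 billion) for the full four-letter alphabet, which is one way to see why the reduction keeps the computation manageable.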

For example, Kim and his colleagues used the FFP technique to create a book tree composed of more than two dozen selected works under the categories of philosophy, mythology, religion, 19th-century fiction, science fiction and children's fiction. Their FFP-based book tree correctly grouped all books by category and author, including some, such as the Koran, that were misplaced in a book tree based on a standard word frequency profile analysis. In the case of the Koran, the FFP-based tree placed it in the religion category on the same branch as the King James Bible and the Book of Mormon, whereas the word frequency book tree grouped it in the philosophy category, on the same branch as Plato's The Republic and The Apology of Socrates.
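Building a tree like this requires some measure of how far apart two profiles are. The article does not name the distance measure or the tree-building method, so the sketch below makes an assumption: it uses the Jensen-Shannon divergence, a common way to compare two frequency distributions, and stops at the pairwise distance table that a tree-building method would take as input. The helper names and the three toy texts are hypothetical.

```python
import math
from collections import Counter


def profile(text, length):
    """Relative frequency of each overlapping window of `length` characters."""
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical profiles,
    1 for profiles that share no features."""
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mean = (pf + qf) / 2
        if pf:
            divergence += 0.5 * pf * math.log2(pf / mean)
        if qf:
            divergence += 0.5 * qf * math.log2(qf / mean)
    return divergence


def distance_matrix(profiles):
    """Pairwise divergences, keyed by (name, name)."""
    names = sorted(profiles)
    return {
        (a, b): js_divergence(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Hypothetical toy "books"; real inputs would be whole texts or genomes.
texts = {
    "text_a": "the cat sat on the mat",
    "text_b": "the cat sat on the hat",
    "text_c": "zzzz qqqq xxxx",
}
profiles = {name: profile(body, 3) for name, body in texts.items()}
distances = distance_matrix(profiles)
for pair, value in sorted(distances.items()):
    print(pair, round(value, 3))
```

In this toy case the two near-identical sentences end up much closer to each other than either is to the third text. A table of such distances is the usual input to tree-building methods such as neighbor joining, which is not shown here.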

Kim and his colleagues later applied the FFP technique to a comparative analysis of the works of William Shakespeare, contemporaries such as Christopher Marlowe, plus several works from the Jacobean era that were once attributed to Shakespeare but whose authorships are now in question. The results cast new doubt on Shakespeare having been the author of the play Pericles, Prince of Tyre, and point to his authorship of the comedy Two Noble Kinsmen, for which in the past he has only received partial credit.

"I was stunned when I saw how well the technique worked with books," Kim says.

The next step was the successful application of the technique to the whole genomes of mammals whose phylogenetic tree is well established, then on to whole genomes of prokaryotic organisms (bacteria and Archaea) and finally on to viruses, to which current comparative genomic analytic tools sometimes cannot be applied.

Collaborating with Kim on this project have been biophysicist Gregory Sims, statistical mathematician Se-Ran Jun and theoretical physicist Guohong Wu. Like Kim, they all hold joint appointments with Berkeley Lab and UC Berkeley.

Kim is an internationally recognized authority on protein structures and a pioneer in the field of structural genomics. In 2003, he unveiled a 3-D demographic map of the protein structure universe that for the first time made it possible to organize the structures of this vast assemblage of biological molecules (more than 50 billion known species and growing) into meaningful groups.

"Scientists studying the genomes of different organisms are facing similar problems to those studying protein structures, perhaps even more difficult," Kim says. "Thousands of whole genomes have been or are in the process of being sequenced and we need to have an effective way of comparing and grouping them, and finding relationships among the groups. The FFP method can help us mine the function of gene-coding and non-coding nucleotide base sequences in the genome of a particular species, and can also give us a better understanding of how that species may have evolved, who its closest relatives are and other valuable information."

Currently, comparative genomics studies are based either on measuring the similarities and differences between a set of selected genes in the coding regions of the genomes that are common to the species being compared, or on gene-profiles, in which the presence of certain genes in two or more species yields a similarity score. Species with more shared genes or higher similarity scores are presumed to be more closely related than those with fewer or lower ones. Both of these methods require an alignable set of common genes in the coding regions, which is not always available, especially among the genomes of rapidly evolving species. Such "gene-centric" comparisons also suffer from an even greater limitation for comparing mammals and other high-order eukaryotes, as Kim explains.

"Coding sequences (exons) total only about one-percent of the entire human genome, with the rest made up of non-coding sequences (introns) whose functions are still largely unknown," he says. "What is needed is an alignment-free method that can be used for comparing entire genomes or genomic regions that may be distantly related, have undergone significant rearrangement, or do not share a common set of genes. We also need a tool that can be used to analyze and compare nongenic regions of genomes as well."

Kim began this quest by turning to the world of books, where comparative analytical tools are well established to ascertain authorship as well as to expose fraud or plagiarism. However, two problems became evident. First, current standard text analysis is based on the frequency at which different words appear, but genomic data consists of long strings of letters, not words. Second, analysis based on the frequency of words does not capture local syntax, that is, the relationship between adjoining words, a point that is critical in comparative genomics and turned out to be important in text comparisons as well.
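The second problem can be seen in a small Python sketch. Two sentences built from the same words in a different order have identical word frequency profiles, while profiles built from overlapping eight-letter windows differ because the windows span word boundaries. The function names are hypothetical, and keeping spaces inside the windows is an assumption of this sketch rather than a detail given in the article.

```python
from collections import Counter


def word_profile(text):
    """Counts of whole words; word order is ignored."""
    return Counter(text.lower().split())


def letter_feature_profile(text, length):
    """Counts of overlapping windows of `length` characters.

    Assumption: spaces are kept, so a window can straddle two words.
    """
    text = text.lower()
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))


first = "the dog bit the man"
second = "the man bit the dog"

same_words = word_profile(first) == word_profile(second)
same_features = (
    letter_feature_profile(first, 8) == letter_feature_profile(second, 8)
)
print("word profiles identical:", same_words)
print("letter-feature profiles identical:", same_features)
```

The word profiles cannot tell the two sentences apart, whereas the letter-window profiles can, which is the sense in which fixed-length features retain local context.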

To overcome the limitations of current text comparison techniques, Kim and his colleagues first undertook an analysis of words in a Webster's English dictionary and found that words with eight to nine letters were optimal for frequency profiling. This finding proved true for all the other books as well.

"Text features longer than eight or nine letters do not occur frequently enough for frequency profile comparisons, and text features shorter in length do not give us enough information to distinguish one book from another," Kim says.
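The trade-off Kim describes can be illustrated with a rough Python sketch. It measures two things for a given feature length: how many of a text's distinct features occur more than once, and how many of its features also appear in an unrelated text. The function names and the two sample sentences are hypothetical, and these two ratios are only a simple stand-in for whatever criterion the researchers actually used to pick the optimal length.

```python
from collections import Counter


def features(text, length):
    """Counts of overlapping windows of `length` characters."""
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))


def recurring_fraction(text, length):
    """Share of distinct features that occur more than once in `text`."""
    counts = features(text, length)
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)


def shared_fraction(first, second, length):
    """Share of the distinct features of `first` that also occur in `second`."""
    a = features(first, length)
    b = features(second, length)
    return len(set(a) & set(b)) / len(a)


# Hypothetical sample texts, far shorter than a real book.
sample = "the quick brown fox jumps over the lazy dog and the quick red fox naps"
other = "a slow green turtle walks under an old bridge at night"

for length in (1, 4, 8, 12):
    print(
        length,
        round(recurring_fraction(sample, length), 3),
        round(shared_fraction(sample, other, length), 3),
    )
```

On these short samples, single letters recur often but are largely shared between the two unrelated texts, so they say little about which text is which, while features of length 12 never recur at all. Real books are far longer, which is why the useful range reported in the article sits at eight to nine letters rather than at the much shorter lengths a toy example would suggest.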
