An interesting new study by researchers at the University of Portsmouth, UK, describes a mathematical method, based on information theory, for analyzing genome sequences. The method offers an alternative to clinical techniques, allowing mutations to be detected and possibly even predicted. In this way, it opens new research opportunities in bioinformatics and genetics.
While living organisms contain information encoded in genetic material, either deoxyribonucleic or ribonucleic acid (DNA and RNA, respectively), there are many other media in which information can be stored and transmitted.
A preprint version of the study is available on the bioRxiv* server, while the article undergoes peer review.
Information theory was first developed on a mathematical basis by Claude Shannon over 70 years ago.
He described a method of measuring information obtained by observing the occurrence of an event. In fact, this gave rise to modern computing. He also coined the word “bit” for a unit of information.
Beyond information technology, his theory laid the foundation for advances across a wide range of fields, including computing, cryptography, linguistics, physiology, biology, and telecommunications.
Using information entropy
The new paper uses information theory to devise a novel method whereby mutations can be both traced and predicted in genomic sequences. This is far from the first attempt of its kind: DNA sequences have been analyzed with methods built on information theory since the 1970s.
The approach used in this study centers around information entropy (IE) spectra, which are created from genomic sequences, and the examination of their mutation dynamics. Importantly, this approach is relevant for any sequence of any genome of any size.
The researchers used a program called GENIES (GENetic Entropy Information Spectrum), custom-built for this project and now available for free to other scientists.
The core of the approach is the view of the genome as a coding system, where its diverse functional regions, such as exons, enhancers and promoters, have unique patterns of information entropy. Mutations show up as changes in the sequence of these regions and thus as distinctive alterations in the pattern of information entropy.
This correspondence could be used to identify such mutations, purely from the standpoint of an information storage system, without the need to understand the physicochemical aspects of these changes.
For instance, the four DNA nucleotides in any sequence are represented as adenine (A), cytosine (C), guanine (G), and thymine (T). Adjacent sequences make up a chromosome, with unknown sequences being represented by N. Each nucleotide is represented by two bits, that is, A = 00, C = 01, G = 10, and T = 11.
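To make the two-bit representation concrete, here is a minimal Python sketch of the mapping listed above. The function name `encode_sequence` is hypothetical, and skipping unknown bases (N) is an assumption made for illustration, since the article does not say how they are handled.

```python
# Two-bit codes as listed in the text: A = 00, C = 01, G = 10, T = 11.
NUCLEOTIDE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_sequence(sequence: str) -> str:
    """Return the bit string for a DNA sequence.

    Assumption for this sketch: unknown bases (N) are skipped.
    """
    return "".join(
        NUCLEOTIDE_BITS[base] for base in sequence.upper() if base != "N"
    )


bits = encode_sequence("GATTACA")
print(bits)       # 10001111000100
print(len(bits))  # 14, i.e. two bits per nucleotide
```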
The distributions of individual symbols and their information entropy values can be found by using an appropriate equation. The correlations between the symbols are further expressed using block information entropies. Such block information entropies have been used in many studies on genomic information content, though not to detect mutations.
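The article does not reproduce the equation itself. Quantities of this kind are usually built on the standard Shannon form, shown here as a reference point rather than as the authors' exact notation. The symbols are assumptions: p_i is the probability of symbol i among N distinct symbols, and p(b) is the probability of block b in the set of observed blocks of length m.

```latex
H = -\sum_{i=1}^{N} p_i \log_2 p_i ,
\qquad
H_m = -\sum_{b \in \mathcal{B}_m} p(b) \log_2 p(b)
```

With this definition, H reaches its largest value, log2 N, when all N symbols are equally likely. For the four nucleotides that ceiling is 2 bits per symbol.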
The researchers used a framework of three-nucleotide codons, each codon being represented by one symbol (m=3). The probability of each codon was mathematically estimated for defined stretches of the genomic sequence. This yielded the maximum entropy of the studied sequence.
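The codon-based estimate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GENIES implementation: the function name `block_entropy` is hypothetical, blocks are read without overlap starting at the first character, and probabilities are estimated as simple relative frequencies.

```python
import math
from collections import Counter


def block_entropy(sequence: str, m: int = 3) -> float:
    """Shannon entropy, in bits, of the m-nucleotide blocks in a sequence.

    Assumption: blocks are non-overlapping and start at position 0;
    a trailing fragment shorter than m is ignored.
    """
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    if total == 0:
        return 0.0
    counts = Counter(blocks)
    # p * log2(1 / p) summed over the distinct blocks
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Codons: ATG, ATG, CCC, GGG -> probabilities 0.5, 0.25, 0.25
print(block_entropy("ATGATGCCCGGG", m=3))  # 1.5
```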
Entropy changes with mutation
When a mutation has occurred, the value of the maximum entropy changes. A nonzero difference between the original and altered entropies, or a ratio of original to altered entropy that differs from 1, would therefore suggest that a mutation is present in the genomic subset of interest.
Based on this concept, the IE spectrum (IES) method was applied to the entire genome.
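The comparison can be sketched as follows. The names `block_entropy` and `entropy_change` are hypothetical, and the non-overlapping block reading is an assumption of this sketch rather than a detail confirmed by the article.

```python
import math
from collections import Counter


def block_entropy(sequence: str, m: int = 3) -> float:
    """Entropy in bits of non-overlapping m-blocks (illustrative assumption)."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    if total == 0:
        return 0.0
    counts = Counter(blocks)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_change(reference: str, variant: str, m: int = 3):
    """Return (difference, ratio) of reference entropy to variant entropy.

    The ratio is None when the variant entropy is zero, where it is undefined.
    """
    h_ref = block_entropy(reference, m)
    h_var = block_entropy(variant, m)
    ratio = h_ref / h_var if h_var > 0 else None
    return h_ref - h_var, ratio


# One substitution (G -> C at the sixth base) turns the second ATG into ATC.
print(entropy_change("ATGATGCCCGGG", "ATGATCCCCGGG"))  # (-0.5, 0.75)
```

Note that a substitution which leaves the block frequency profile unchanged would give a difference of 0 and a ratio of 1 in this sketch, so a single block size can miss some changes.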
The genome was first split up into subsets called windows. A window contains a defined number of characters, called its window size, corresponding to the length of the genome subset described above.
The window is then moved across the whole genome, from one end to the other, one “step” at a time, with each step covering a fixed number of characters. The step size lies between 1 and the window size.
This yields a predetermined number of windows rounded to the nearest integer.
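The windowing scheme can be sketched as below. The function name `window_starts` is hypothetical. One deliberate simplification: the sketch keeps only complete windows (a floor), whereas the article describes rounding to the nearest integer, so the count may differ by one from what GENIES reports.

```python
def window_starts(genome_length: int, window_size: int, step_size: int) -> list[int]:
    """Start positions of every complete window along the genome.

    Assumption: partial windows at the end of the genome are dropped.
    """
    if not 1 <= step_size <= window_size:
        raise ValueError("step size must lie between 1 and the window size")
    if window_size > genome_length:
        return []
    return list(range(0, genome_length - window_size + 1, step_size))


starts = window_starts(genome_length=20, window_size=6, step_size=3)
print(starts)       # [0, 3, 6, 9, 12]
print(len(starts))  # 5, which equals (20 - 6) // 3 + 1
```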
The IE value is calculated window by window and plotted by location within the genome. This gives the IE spectrum of the genome – “a numerical representation of the genetically encoded information within a given genome.”
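Putting the window scan and the entropy calculation together gives a toy version of an IE spectrum. This is a sketch under assumptions, not the GENIES code: the names `block_entropy` and `ie_spectrum` are hypothetical, blocks are read without overlap inside each window, and only complete windows are kept.

```python
import math
from collections import Counter


def block_entropy(sequence: str, m: int) -> float:
    """Entropy in bits of non-overlapping m-blocks (illustrative assumption)."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    if total == 0:
        return 0.0
    counts = Counter(blocks)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def ie_spectrum(genome: str, window_size: int, step_size: int, m: int):
    """Return a list of (window start, entropy) pairs along the genome."""
    return [
        (start, block_entropy(genome[start:start + window_size], m))
        for start in range(0, len(genome) - window_size + 1, step_size)
    ]


# A uniform stretch followed by a repeating ACGT pattern.
toy_genome = "AAAAAAAA" + "ACGTACGT"
print(ie_spectrum(toy_genome, window_size=8, step_size=4, m=2))
# [(0, 0.0), (4, 1.5), (8, 1.0)]
```

Plotting the second element of each pair against the first would give the spectrum for this toy sequence: flat at zero over the uniform stretch and higher where the block content is more varied.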
This algorithm conveniently permits the information in the IE spectrum to be used in other ways. Further research is under way to determine how large the window and step size should be, which will probably vary with the type of information required. This method will work only with GENIES or another fully automated program.
SARS-CoV-2 mutations detected
By way of illustration, the researchers examined the reference genome of SARS-CoV-2 using the IE spectrum method. They found that with a step size of 1, larger window sizes increased the mean IE value of the spectrum. The maximum IE value closely corresponds to the maximum expected theoretical value until the window size (WS) exceeds 33.
This point may therefore represent the optimal window size: large enough for the IE changes to yield useful information, but not exceeding 33 nucleotides.
An earlier and less detailed version of this method has been reported by other researchers, who nonetheless obtained valuable information by detecting repetitive sequences that helped trace the differences between organisms that emerged as they evolved. The current method should help add to the utility of this tool.
When applied to the SARS-CoV-2 sequence and one randomly chosen variant, the researchers found that the IE spectrum method at various window and step sizes picked up six of seven total mutations identified by direct comparison at the nucleotide level.
However, using m-block values less than 3, the number of nucleotides in a codon, they found that m=2 yielded all seven mutations, while also allowing for the identification of possible correlations between nucleotides. Moreover, this value is independent of window and step size values.
The researchers write:
“Our study indicates that the best m-block size is 2 and the optimal window size should contain more than 9, and less than 33 nucleotides.”
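As an illustration of how such a scan might look with m=2 and a window size inside the quoted range, the sketch below compares a reference and a variant window by window and reports the windows whose entropies differ. It is a toy under stated assumptions, not the researchers' program: the names `block_entropy` and `mutation_windows` are hypothetical, the two sequences must have equal length (substitutions only), and the example sequence is artificial.

```python
import math
from collections import Counter


def block_entropy(sequence: str, m: int) -> float:
    """Entropy in bits of non-overlapping m-blocks (illustrative assumption)."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    if total == 0:
        return 0.0
    counts = Counter(blocks)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mutation_windows(reference: str, variant: str,
                     window_size: int, step_size: int, m: int) -> list[int]:
    """Start positions of windows whose entropy differs between the sequences.

    For nonzero entropies, a differing value is the same as a ratio not equal to 1.
    """
    if len(reference) != len(variant):
        raise ValueError("this sketch handles equal-length sequences only")
    flagged = []
    for start in range(0, len(reference) - window_size + 1, step_size):
        h_ref = block_entropy(reference[start:start + window_size], m)
        h_var = block_entropy(variant[start:start + window_size], m)
        if not math.isclose(h_ref, h_var, abs_tol=1e-12):
            flagged.append(start)
    return flagged


reference = "ACGT" * 6                            # artificial 24-base sequence
variant = reference[:10] + "A" + reference[11:]   # G -> A at index 10
print(mutation_windows(reference, variant, window_size=12, step_size=1, m=2))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In this toy case every window that covers index 10 is flagged and the two windows that start after it are not, so the overlap of the flagged windows narrows down where the substitution lies.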
What are the implications?
The study reports an early program based on information theory that can detect single point mutations using the ratio of IE spectra. Further work will help identify indel mutations as well. Other algorithms and equations may help identify mutations.
However, this technique may show the greatest value in its reverse application, examining the points where mutations are known to have occurred. This would allow special features in the IE spectrum to be linked to the location of the mutations in the genome and predict possible future mutations.
*bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.