In a recent study published in Nature Biotechnology, researchers developed a whole-genome sequencing (WGS) method that read the four canonical deoxyribonucleic acid (DNA) bases, adenine (A), cytosine (C), guanine (G), and thymine (T), plus 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC), two epigenetic variants of unmodified C, to yield an accurate six-letter digital readout in a single workflow.
In addition, the approach was versatile and applicable to varying DNA sample formats. For example, it analyzed a cell-free DNA (cfDNA) sample obtained from a cancer patient with high precision. The method also had an inherent error-suppression capability that supported accurate genetic and epigenetic base calling. Finally, it processed the DNA sample(s) entirely enzymatically, thus preventing DNA degradation.
Mammalian DNA stores the multidimensional information needed for sustenance; however, high-throughput sequencing approaches read only the four DNA bases to interpret this information. These approaches have thus failed to uncover the epigenetic information stored in DNA, i.e., modifications that alter gene expression without altering the underlying sequence. Though reversible, epigenetic changes (e.g., DNA methylation) change how the body reads a DNA sequence, which, in turn, could change cell fate.
A combined analysis of genetic and epigenetic information could yield more accurate predictions of susceptibility to a disease, e.g., cancer. Sequencing 5mC and 5hmC could help retrieve epigenetic information in human DNA. To this end, researchers have developed three base-conversion methods for distinguishing unmodified C (unmodC) from 5mC or 5hmC: whole-genome bisulfite sequencing (WGBS), enzymatic methyl sequencing (EM-seq), and TET-assisted pyridine borane sequencing.
However, these methods have several shortfalls. First, they cannot accurately detect genetic C-to-T changes, the most common mutation in mammalian genomes, especially in cancer.
Second, in some cases, they produce false-positive matches that make mapping of converted reads imprecise, slower, and more expensive. Lastly, these methods cannot distinguish 5mC from 5hmC in a single workflow.
About the study
In the present study, researchers implemented the five-letter seq workflow on a mixed DNA sample. It comprised 80 nanograms (ng) of human genomic DNA (gDNA) derived from a B lymphoblast cell line, together with 0.5 ng of lambda (λ) bacteriophage DNA that was enzymatically methylated at all Cs and 0.5 ng of pUC19 (a vector) retrieved from a methylation-negative Escherichia coli strain.
They prepared the DNA in duplicate and sequenced it on an Illumina NovaSeq 6000 to obtain ~550 million paired-end reads. On average, they resolved 98.4% of all DNA reads computationally. Notably, 89.8% of these reads aligned to the genome. All resolved reads comprised the four-state genetic information, with the epigenetic information stored as sequence alignment map (SAM) tags.
The researchers also compared the data quality of the epigenetic and genetic components of five-letter seq with best-practice methods. To draw comparisons, they pooled counts of modified and unmodC calls at CpGs across both strands and both technical replicates. Finally, they considered only CpGs covered by at least three reads, i.e., 94.24% of all CpGs.
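The pooling and coverage filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a dict keyed by a hypothetical `(chrom, cpg_start, strand)` tuple, where `cpg_start` is assumed to be the same coordinate for both strands of a CpG) and the function names are assumptions, and the pooled call count is used as a stand-in for read coverage.

```python
from collections import defaultdict

MIN_READS = 3  # coverage threshold stated in the article


def pool_cpg_counts(replicates):
    """Sum (modC, unmodC) call counts per CpG over strands and replicates.

    `replicates` is a list of dicts mapping a hypothetical key
    (chrom, cpg_start, strand) to a (mod_calls, unmod_calls) tuple.
    The strand is dropped from the key so both strands pool together.
    """
    pooled = defaultdict(lambda: [0, 0])
    for rep in replicates:
        for (chrom, cpg_start, _strand), (mod, unmod) in rep.items():
            pooled[(chrom, cpg_start)][0] += mod
            pooled[(chrom, cpg_start)][1] += unmod
    return {site: tuple(counts) for site, counts in pooled.items()}


def filter_by_coverage(pooled, min_reads=MIN_READS):
    """Keep only CpGs whose pooled call count reaches the threshold."""
    return {site: c for site, c in pooled.items() if sum(c) >= min_reads}


def methylation_level(mod, unmod):
    """Fraction of pooled calls at one CpG that were modified."""
    return mod / (mod + unmod)


# Illustrative counts only; these are not data from the study.
rep1 = {("chr1", 100, "+"): (1, 0), ("chr1", 100, "-"): (1, 0),
        ("chr1", 200, "+"): (0, 1)}
rep2 = {("chr1", 100, "+"): (0, 1), ("chr1", 200, "-"): (1, 0)}
kept = filter_by_coverage(pool_cpg_counts([rep1, rep2]))
```

In this toy input, the CpG at position 100 pools to three calls and is kept, while the one at position 200 pools to only two and is dropped.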
Furthermore, the researchers compared the precision of the genetic sequencing component of the five-letter seq method. To this end, they calculated the sensitivity to detect modC, expressed as a percentage, as the ratio of modC calls to all C calls (both modC and unmodC). Additionally, they calculated specificity as the ratio of unmodC calls to all C calls in the pUC19 reference.
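As a rough sketch of these two ratios: the functions below assume that sensitivity is measured on a control where every C is expected to be modified (plausibly the fully methylated λ DNA, although the article does not say so explicitly) and that specificity is measured on the unmethylated pUC19 control. The function names and the counts are illustrative assumptions, not values from the study.

```python
def sensitivity(mod_calls, unmod_calls):
    """Percentage of C calls reported as modified in a control
    where every C is assumed to be modified."""
    total = mod_calls + unmod_calls
    if total == 0:
        raise ValueError("no cytosine calls")
    return 100.0 * mod_calls / total


def specificity(mod_calls, unmod_calls):
    """Percentage of C calls reported as unmodified in a control
    where every C is assumed to be unmodified (e.g. pUC19)."""
    total = mod_calls + unmod_calls
    if total == 0:
        raise ValueError("no cytosine calls")
    return 100.0 * unmod_calls / total


# Made-up counts for illustration only.
sens = sensitivity(mod_calls=985, unmod_calls=15)
spec = specificity(mod_calls=1, unmod_calls=1999)
```

Both functions take the same pair of counts; which one applies depends only on which control the counts come from.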
A system using a two-base coding approach enabled the decoding of up to 16 states unambiguously. It made reading all four genetic states and several epigenetic states possible in a single run. The five-letter seq data had an average polymerase chain reaction (PCR) and cluster duplication rate of 8.5% and covered the whole genome at 15×, with at least one read covering 90.2% of the genome.
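The figure of 16 follows from simple counting: each of two paired positions can hold one of four bases, giving 4 × 4 combinations. The sketch below only enumerates those combinations; it deliberately does not assign a genetic or epigenetic meaning to any pair, since the article does not give the decoding table.

```python
from itertools import product

BASES = "ACGT"


def two_base_states(alphabet=BASES):
    """Enumerate every ordered pair of bases from the alphabet."""
    return ["".join(pair) for pair in product(alphabet, repeat=2)]


states = two_base_states()
print(len(states))  # 16
```

Sixteen distinguishable pairs leave room for the four genetic letters plus several epigenetic states, with the remaining combinations available to flag errors.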
This method reduced execution times compared to WGBS and EM-seq. The Burrows-Wheeler Aligner-maximal exact match (BWA-MEM) tool completed the genomic alignment of one million 16-state resolved DNA reads in 7.5 minutes. In contrast, the genomic alignment time for one million three-state reads, assessed via BWA-meth, was 16.5 minutes.
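The two timings quoted above imply roughly a 2.2-fold difference per million reads. The arithmetic is shown below; the helper for larger read counts assumes alignment time scales linearly with read count, which is a simplification and not something the article states.

```python
RESOLVED_MIN_PER_MILLION = 7.5      # 16-state resolved reads, BWA-MEM
THREE_STATE_MIN_PER_MILLION = 16.5  # three-state converted reads


def speedup(slow_minutes, fast_minutes):
    """Ratio of the slower alignment time to the faster one."""
    return slow_minutes / fast_minutes


def estimated_minutes(million_reads, minutes_per_million):
    """Naive linear extrapolation (an assumption, not a measured value)."""
    return million_reads * minutes_per_million


ratio = speedup(THREE_STATE_MIN_PER_MILLION, RESOLVED_MIN_PER_MILLION)
print(round(ratio, 2))  # 2.2
```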
Across the assessed human genome in this study, average levels of modC observed at CHG and CHH sites as measured by five-letter seq, WGBS, and EM-seq were 0.07%, 0.14%, and 0.33%, respectively. Five-letter seq yielded the highest average modC level at CpG sites, i.e., 54.05%, while the levels measured by EM-seq and WGBS were 51.10% and 49.38%, respectively.
Five-letter seq attained half-mean coverage for 87.82% of the bases in the gDNA sample used in the study, while WGBS and EM-seq attained half-mean coverage for 85.91% and 87.48% of bases, respectively. However, the researchers noted small drops and peaks in CpG coverage near transcription initiation sites relative to the rest of the genome.
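The article does not define half-mean coverage. A common reading, assumed here, is the fraction of bases whose sequencing depth is at least half of the mean depth, which serves as a measure of coverage uniformity. Under that assumption, a minimal sketch is:

```python
def fraction_at_half_mean(depths):
    """Fraction of positions with depth >= half of the mean depth.

    `depths` is a list of per-base read depths. The definition used
    here is an assumption; the article does not spell it out.
    """
    if not depths:
        raise ValueError("no depth values")
    threshold = (sum(depths) / len(depths)) / 2
    covered = sum(1 for d in depths if d >= threshold)
    return covered / len(depths)


# Toy depths for illustration only: mean is 10, so the threshold is 5.
frac = fraction_at_half_mean([0, 10, 10, 20])
print(frac)  # 0.75
```

A perfectly even library would score 1.0; values such as the 85% to 88% reported above indicate that a minority of bases fall well below the average depth.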
The sensitivity and specificity of five-letter seq were 98.55% and 99.95%, respectively, higher than those of EM-seq (97.89% and 99.5%) and WGBS (95.69% and 99.92%). Interestingly, five-letter seq quantified modC at CpGs at both the read level and the genome level, just as WGBS does.
Analysis of cfDNA is crucial in diagnostics, with applications in prenatal diagnosis, early-stage cancer detection, and monitoring of many diseases. A standard blood draw typically yields 10 ng/ml of cfDNA. The researchers extracted cfDNA from a patient with stage III colon cancer and analyzed it using the five-letter seq workflow.
The inputs were just 2 ng or 10 ng of cfDNA, compared with 80 ng for gDNA. Yet the five-letter seq method retained high accuracy in methylation detection, achieving over 98% sensitivity at 0.05 ng of control DNA in the 2 ng mixed sample. Also, the method did not disturb the fragment length distribution typical of cfDNA, suggesting that it preserves the mono- and dinucleosomal fragment profile.
Enzymatic oxidation of 5mC generates 5hmC, a marker of biological states such as early cancer. To distinguish 5mC from 5hmC unambiguously without compromising genetic base calling, the researchers followed the five-letter seq workflow up to the generation of the adapter-ligated sample fragment with its synthetic copy strand. For six-letter seq, however, they used DNA methyltransferase 5 (DNMT5) for its specificity for de novo copy methylation. DNMT5 copied methylation at 5mC across the CpG unit to the C on the copy DNA strand.
The study platform used Watson–Crick base pairing to decode genetic and epigenetic information. It was therefore easy to embed in any sequencing platform, which could expand its future applications, for instance, to single-cell analysis. In addition, future work could explore additional epigenetic modifications, such as 5-carboxylcytosine, 5-formylcytosine, and N6-methyladenine, in various organisms.