In a recent study published in Nature Biotechnology, researchers explored the causes of cancer by mapping somatic mutation rates across the human genome.
Understanding cancer requires identifying the mutations that drive it. Despite extensive research toward that goal, most studies focus on protein-coding sequences and specific non-coding elements, because somatic mutation rates are difficult to model across different tumor genomes.
About the study
In the present study, researchers described a genome-wide mutation rate model called Dig that allowed rapid testing for the presence of selected driver mutations in a genome.
The team designed Dig to represent genome-wide somatic mutation rates for any given cancer type, so that an excess of mutations anywhere in the genome could be evaluated quickly. For a set of tumors of a particular cancer type, the model estimates how neutral mutations are expected to be distributed over any group of genomic positions.
The model used a probabilistic deep learning approach that captured two central determinants of variability in somatic mutation rates: (1) kilobase-scale variation, which is shaped by epigenomic properties such as chromatin accessibility and replication timing that affect the efficacy of deoxyribonucleic acid (DNA) repair, and (2) base-pair-scale variation, which is shaped by the sequence context biases of the processes that generate somatic mutations, including cytidine deamination driven by apolipoprotein B mRNA-editing enzyme, catalytic polypeptide (APOBEC) enzymes and exposure to ultraviolet (UV) light.
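As a rough illustration of how the two scales can combine, the sketch below spreads a region-level expected mutation count over individual positions in proportion to the relative mutability of each position's sequence context. This is not the Dig implementation: the function name, the context weights, and the numbers are hypothetical and chosen only for illustration.

```python
def expected_mutations_per_position(region_rate, contexts, context_weight):
    """Distribute a region-level expected mutation count over its positions.

    region_rate    -- expected number of neutral mutations in the whole region
    contexts       -- sequence context (e.g. trinucleotide) at each position
    context_weight -- hypothetical relative mutability of each context
    """
    weights = [context_weight[c] for c in contexts]
    total = sum(weights)
    # Each position receives a share of the regional rate proportional to its weight.
    return [region_rate * w / total for w in weights]


# Toy example: TCA gets a higher weight, mimicking an APOBEC-like context bias.
toy_weights = {"TCA": 3.0, "ACA": 1.0, "GCG": 1.0}
per_position = expected_mutations_per_position(
    region_rate=2.0,
    contexts=["TCA", "ACA", "GCG", "ACA"],
    context_weight=toy_weights,
)

# Expected neutral count for an element covering only the first two positions.
element_expected = sum(per_position[:2])
```

Summing the per-position values over any chosen set of positions gives the expected neutral mutation count for that element, which is the quantity an excess of observed mutations would be compared against.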
The team subsequently constructed maps of mutation rates and inferred nucleotide mutation biases for 37 cancer types from the somatic mutations recorded in the pan-cancer analysis of whole genomes (PCAWG) dataset. The maps drew on 723 chromatin marks from 111 tissues recorded in the Roadmap Epigenomics project. The accuracy of the estimated somatic mutation rates was benchmarked using the proportion of variance explained.
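The proportion of variance explained can be computed in more than one way. The sketch below uses one common definition, the coefficient of determination, on made-up mutation counts; the study may use a different variant (for example, a squared correlation), so this should be read as an illustration of the metric rather than the authors' exact procedure.

```python
def variance_explained(observed, predicted):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


# Hypothetical mutation counts in four genomic windows and a model's predictions.
observed_counts = [2, 4, 6, 8]
predicted_counts = [2.5, 3.5, 6.5, 7.5]
r_squared = variance_explained(observed_counts, predicted_counts)
```

A value close to 1 means the predicted rates track the observed counts closely across windows.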
The team also applied the Dig model to quantify how far the number of cryptic splice single nucleotide variants (SNVs) exceeds what the neutral mutation rate predicts, and assessed the role of these variants as cancer-driving mutations. The impact of indels on gene expression, through disruption of transcription factor-binding motifs, was assessed by searching promoters in the PCAWG dataset.
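Testing for an excess of mutations amounts to asking how surprising the observed count is given the count expected under neutrality. The sketch below uses a plain Poisson upper tail as a simplified stand-in; it is an assumption made for illustration, not the statistical test used by Dig, which relies on its own probabilistic model. The function name and the toy numbers are hypothetical.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Sum the probabilities of seeing fewer than `observed` mutations.
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


# Toy element: 3 neutral mutations expected across the cohort, 12 observed.
p_value = poisson_upper_tail(observed=12, expected=3.0)
```

A small p-value flags the element as carrying more mutations than the neutral rate predicts, which makes it a candidate for follow-up rather than a confirmed driver.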
The study results showed that the Dig model explained a median of 77.3% of the variance in single nucleotide variant (SNV) rates at the 10-kb scale and 94.6% at the 1-Mb scale across 16 cancer types. Dig explained the most variance in SNV counts in 10-kb regions in 14 of the 16 cancer cohorts, in non-synonymous SNV counts in all 16 cohorts, and in non-coding ribonucleic acid (RNA) SNV counts in 15 cohorts.
Furthermore, the Dig model matched or even exceeded the performance of other methods tailored toward particular classes of elements across whole genomes or whole-genomic samples. Dig had the highest F1 score in 24 of the 32 tested PCAWG cohorts and was the most powerful method in 14 of the cohorts for burden-based driver gene detection. The team also noted that Dig identified potential driver elements one to five times faster than traditional methods for every element and cohort tested.
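The F1 score mentioned above balances precision (how many predicted drivers are true drivers) against recall (how many true drivers are recovered). A minimal sketch of the calculation follows; the gene lists are toy values, with GENE_X a made-up placeholder, and they are not results from the study.

```python
def f1_score(predicted, known):
    """Harmonic mean of precision and recall for two collections of gene names."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(known)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three predictions are in a reference list of four genes.
score = f1_score(
    predicted=["TP53", "KRAS", "GENE_X"],
    known=["TP53", "KRAS", "PTEN", "APC"],
)
```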
Reducing the size of the analyzed elements to tens to hundreds of positions produced an almost 20% increase in power to identify driver mutations present in less than 1% of the tested samples. The team also found that cryptic splice SNVs in the tumor suppressor genes (TSGs) recorded in the cancer gene census (CGC) occurred more often than expected under neutral conditions. The cryptic splice SNVs were enriched in introns and tended to occur at sites with a high predicted impact on splicing. Overall, the intronic splice SNVs accounted for approximately 4.5% of the excess SNVs found in the TSGs. The team also noted that the TP53 promoter was the sole element exhibiting a genome-wide significant burden of indels.
Overall, the study findings highlighted the usefulness of Dig as a tool for in vivo and in vitro studies, owing to its ability to prioritize precise groups of mutations that are potential drivers in both the coding and the non-coding genome. The researchers believe that the deep learning approach used in the present study could enhance the experimental, computational, and clinical utility of cancer genome sequencing data.