A new study from the University of East Anglia describes a novel approach to detecting bacteria and viruses that may cause cancer. Several such associations are already known, such as human papillomavirus (HPV) infection in cervical cancer and Helicobacter pylori (H. pylori) in stomach cancer. The new research, published on October 22, 2019, in the journal Genome Biology, provides benchmarked template pipelines for detecting bacterial and viral sequences in whole-genome sequencing data from cancer tissue. This could help find more pathogens that are causally linked to cancer, and could drive the development of more cancer vaccines.
Researcher Dan Brewer says, “When tumor samples are whole genome sequenced, DNA from any pathogens present will also be sequenced, making it possible to detect and quantify pathogens. This gives us a fantastic opportunity to collect data that will help us find new associations between bacteria and viruses and different types of cancer.”
The SEPATH computational pipeline
The study draws on the many tools now available for identifying organisms from their sequence data, with the aim of providing a standardized approach to classifying pathogen DNA in sequencing data from human tissue. The researchers examined each step involved in computing the origin of the sequences found in a sample. As another researcher, Abraham Gihawi, puts it, “This new research looks at each of the key computational steps involved in conducting this on human tissue sequencing data.”
The first step was to bring together the best available programs for processing large amounts of sequencing data.
The second step was to build over a hundred simulations of realistic genomes, made up of DNA sequences that mostly match the human genome. Into these, the researchers inserted small amounts of viral or bacterial DNA, similar to what would be expected in DNA from a cancerous tumor. They then chose the best combinations of program and parameters, from more than 70 candidates, by evaluating their performance on these simulations.
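The idea behind such a simulation can be illustrated with a toy sketch. This is not the authors' simulation code: the function names, the random "genomes" and the 2% pathogen fraction are all assumptions made for illustration. The point is only that every simulated read carries a label recording where it came from, so the truth is known in advance.

```python
import random


def random_dna(length, rng):
    """Return a random DNA string (a stand-in for a real reference genome)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate_reads(host_genome, pathogen_genomes, n_reads,
                   pathogen_fraction, read_len, rng):
    """Draw fixed-length reads from a host genome and spiked-in pathogens.

    Each read is returned as (source_label, sequence), so the true origin
    of every read is known when a classifier is evaluated later.
    """
    names = sorted(pathogen_genomes)
    reads = []
    for _ in range(n_reads):
        if names and rng.random() < pathogen_fraction:
            source = rng.choice(names)
            genome = pathogen_genomes[source]
        else:
            source = "human"
            genome = host_genome
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((source, genome[start:start + read_len]))
    return reads


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the toy data set is reproducible
    host = random_dna(5000, rng)                       # hypothetical host
    pathogens = {"virus_A": random_dna(800, rng),      # hypothetical spike-ins
                 "bacterium_B": random_dna(2000, rng)}
    reads = simulate_reads(host, pathogens, n_reads=1000,
                           pathogen_fraction=0.02, read_len=100, rng=rng)
    spiked = sum(1 for source, _ in reads if source != "human")
    print(f"{spiked} of {len(reads)} reads come from spiked-in pathogens")
```

A real benchmark would draw reads from actual reference genomes and model sequencing errors, which this sketch leaves out.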
The third step was to run the identification programs on these mock genomes, to assess exactly how much of the pathogen-derived DNA could be picked up in each case. This gave an accurate estimate of how precise and sensitive the tools were, since the scientists already knew which pathogens they had inserted and how much DNA from each.
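Because the inserted pathogens are known, scoring a tool comes down to comparing its reported list against the truth. The sketch below uses the standard definitions of sensitivity, precision and F1 score; the function name and the example genus lists are hypothetical, and the study's own scoring may differ in detail.

```python
def score_classification(truth, predicted):
    """Compare predicted taxa against the known spiked-in taxa.

    sensitivity = TP / (TP + FN): share of true pathogens that were found.
    precision   = TP / (TP + FP): share of reported pathogens that are real.
    f1          = harmonic mean of precision and sensitivity.
    """
    truth, predicted = set(truth), set(predicted)
    tp = len(truth & predicted)   # correctly reported
    fp = len(predicted - truth)   # reported but never inserted
    fn = len(truth - predicted)   # inserted but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


if __name__ == "__main__":
    # Hypothetical genera, for illustration only.
    inserted = {"Helicobacter", "Fusobacterium", "Escherichia",
                "Alphapapillomavirus"}
    reported = {"Helicobacter", "Escherichia", "Alphapapillomavirus",
                "Staphylococcus", "Bacillus"}
    print(score_classification(inserted, reported))
```

In this made-up example, three of the four inserted genera are found (sensitivity 0.75) and three of the five reported genera are real (precision 0.6).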
From these benchmarks, they concluded that mOTUs2 offers rapid and precise bacterial classification, while Kraken, run on contigs assembled by MetaSPAdes, can classify both viral and bacterial sequences.
The final step was to test the approaches on sequencing data from real cervical and stomach cancers, since these are known to contain specific pathogen-derived DNA sequences.
The outcome was a tool called SEPATH, which can process high-throughput sequencing data on high-performance computing systems. SEPATH does three things. It converts host-aligned files into FASTQ format in a way that keeps pathogen sequences intact. It then runs the two top-ranking tools: mOTUs2, which classifies bacteria from raw reads, and Kraken, which classifies metagenomic contigs assembled from the non-human sequencing reads. The two use different methods of classification but perform similarly well. SEPATH can therefore be used to identify pathogens whose sequences are present alongside human host DNA, and so help to uncover associations between bacteria or viruses and human cancers.
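The conversion step can be pictured with a minimal sketch. It assumes the host-aligned file is available as SAM text and that the reads of interest are those flagged as unmapped (FLAG bit 0x4 in the SAM format), meaning they did not align to the human reference. The function name is hypothetical, and this is not SEPATH's own code, which may select and handle reads differently.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read did not align to the reference


def unmapped_sam_to_fastq(sam_lines):
    """Return FASTQ text for every unmapped read in an iterable of SAM lines.

    SAM columns used (0-based): 0 = read name, 1 = FLAG, 9 = sequence,
    10 = base qualities. Header lines start with '@' and are skipped.
    """
    records = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED:
            # Sequence and qualities are copied unchanged into the FASTQ record.
            records.append(f"@{name}\n{seq}\n+\n{qual}\n")
    return "".join(records)


if __name__ == "__main__":
    example_sam = [
        "@SQ\tSN:chr1\tLN:1000",
        "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # aligned to host
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tFFFF",          # unmapped
    ]
    print(unmapped_sam_to_fastq(example_sam), end="")
```

In practice this kind of extraction is usually done on compressed BAM files with dedicated tools rather than by parsing text, and paired reads need extra care so that mates stay together.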
The results of using SEPATH on cervical and stomach cancer cell genomes were impressive. Not only did the program pick up known cancer-associated pathogens, it also identified others which were hitherto unknown. Says Gihawi, “We are only just beginning to scratch the surface on the role that these other pathogens may play in the development of cancer.”
When used with PathSeq, SEPATH allows almost 100% accurate identification of pathogens at the genus level. This could help clarify the role of various bacteria and viruses in the etiology of cancer.
The team's model is the HPV vaccine, which is predicted by experts to protect against 70% of cervical cancers. Brewer sums up their hopes: “We hope that by identifying bacteria and viruses associated with other cancers, new vaccines could be developed in the future.” To accelerate this work, the team has made its high-throughput pipelines available to genomics researchers, to help them investigate the role of pathogens in the origins of cancer. The metagenomics simulations are also available to independent researchers.
SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines. Abraham Gihawi, Ghanasyam Rallapalli, Rachel Hurst, Colin S. Cooper, Richard M. Leggett, and Daniel S. Brewer. Genome Biology. https://doi.org/10.1186/s13059-019-1819-8, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1819-8