Computational biology in rare disease research

By Keynote Contributor Professor Damian Smedley, Queen Mary University of London and Genomics England


Rare diseases affect 6-8% of the world’s population and, although we know that small changes in the patient’s DNA are responsible for causing the majority of cases, most people wait several years before they are diagnosed and potentially treated. This hunt for an explanation is extremely distressing for the patients and their families, as well as costing healthcare systems large sums of money for medical investigations and treatments.


Even for the simplest cases, where a single change in a patient’s DNA disrupts a gene and always causes the rare disease, identifying which change among the three billion base pairs in each of our genomes is responsible is a huge challenge. Prior to the completion of the human genome sequence in 2003, we did not even know what the normal state of affairs was. Even then, the available sequencing technology limited us to interrogating only small parts of a patient’s genome, directed by intelligent guesswork, with mixed results.

Genetic disease. Image Credit: nobeastsofierce/Shutterstock

The introduction of whole-exome sequencing (WES) over the last decade has allowed us to test the entire protein-coding portion of the genome, leading to a revolution in the diagnosis of rare disease patients. Initially, results were demonstrated in research studies but increasingly WES is being used in mainstream healthcare settings. However, some causes of rare disease reside in the non-coding part of the genome and more recently whole genome sequencing (WGS) has become a cost-effective way of investigating this, as well as improving coverage of the protein-coding regions.

We recently published the results from a UK pilot study that demonstrated the power of WGS to revolutionize the diagnosis and treatment of rare diseases. The project, led by Genomics England and Queen Mary University of London, showed that WGS is the most effective way to quickly find the cause of a patient’s medical problems and that this approach worked well across a broad range of rare diseases. We also showed that sequencing the whole genome uncovered diagnoses that would not have been detectable by existing testing approaches. We were able to end the diagnostic journey for 25% of our participants. On average, that journey had lasted 6 years and involved 68 hospital visits per patient, at a cost of £87 million to the NHS for the 2183 people in the study.

Computational biology

However, it is not just advances in sequencing technologies that have led to this success. In parallel, novel computational biology software has had to be developed to efficiently identify the variant that is causing the genetic disease from amongst the 8 million variants within the 3 billion base pairs in every sequenced genome. Nearly all the developed tools share some common features such as restricting the initial search to those variants that disrupt the protein-coding regions of genes, leading to either a complete loss of the protein or altered function due to amino acid changes or altered protein levels.
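As a rough illustration of this first filtering step, the sketch below keeps only variants whose predicted consequence alters a protein. The consequence labels follow Sequence Ontology naming, but the chosen set, the `Variant` class, and the gene names are assumptions made for this example and are not taken from the pipeline described here.

```python
from dataclasses import dataclass

# Illustrative set of protein-altering consequences (Sequence Ontology style
# labels). Which terms a real pipeline includes is an assumption here.
PROTEIN_ALTERING = {
    "stop_gained",
    "start_lost",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
    "missense_variant",
    "inframe_insertion",
    "inframe_deletion",
}


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record used only in this sketch."""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    consequence: str


def keep_protein_altering(variants):
    """Return only the variants predicted to alter a protein."""
    return [v for v in variants if v.consequence in PROTEIN_ALTERING]


# Toy input with made-up genes and positions.
variants = [
    Variant("1", 1000, "A", "G", "GENE_A", "missense_variant"),
    Variant("1", 2000, "C", "T", "GENE_A", "synonymous_variant"),
    Variant("2", 3000, "G", "GA", "GENE_B", "frameshift_variant"),
    Variant("3", 4000, "T", "C", "GENE_C", "intron_variant"),
]

kept = keep_protein_altering(variants)
print([v.gene for v in kept])
```

A filter like this is cheap and removes the bulk of variants early, at the price of ignoring non-coding causes of disease.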

Causative variants are expected to be absent or extremely rare in the general, healthy population, and the efforts of the gnomAD consortium to create a resource of allele frequencies in such populations have been critical to effective filtering of variants. However, the majority of population sequencing data still comes from people with a white, European ethnic background, and increased sequencing of underrepresented Asian, African, and South American populations will improve diagnostic outcomes for patients from these backgrounds.
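A frequency filter of this kind can be sketched in a few lines. The lookup table, the variant keys, and the 0.1% threshold below are all hypothetical; in practice the frequencies would come from a population resource, and the cut-off would depend on the disease and inheritance model.

```python
# Hypothetical population allele frequencies keyed by (chrom, pos, ref, alt).
# Real frequencies would be looked up in a population resource.
POPULATION_AF = {
    ("1", 1000, "A", "G"): 0.12,      # common variant
    ("2", 3000, "G", "GA"): 0.00002,  # very rare variant
}

# Illustrative threshold only; not the value used in the study.
MAX_AF = 0.001


def is_rare(variant_key, frequencies=POPULATION_AF, max_af=MAX_AF):
    """Treat a variant as rare if its frequency is at or below max_af.

    Variants missing from the table are assumed to have frequency 0.
    """
    return frequencies.get(variant_key, 0.0) <= max_af


candidates = [
    ("1", 1000, "A", "G"),
    ("2", 3000, "G", "GA"),
    ("7", 5000, "C", "T"),  # never observed in the reference populations
]

rare = [v for v in candidates if is_rare(v)]
print(rare)
```

Note that a variant missing from the table passes the filter. For a patient whose ancestry is poorly represented in the reference data, many harmless variants will look "absent" for exactly this reason, which is why broader population sequencing improves filtering.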

The segregation of variants amongst affected and unaffected family members who have also been sequenced can help to prioritize candidates, but generally some other strategy is also required to narrow down which genes in the genome are considered. In our study, we used two parallel approaches in our main pipeline. The first makes use of expert-curated panels of genes that are known to have a strong association with the categories of disease used during the recruitment of participants to our project.
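The combination of family segregation and a gene panel can be sketched as follows. This is a deliberately simplified example: it assumes a fully penetrant dominant disease, and the family members, variants, genes, and panel are all invented for illustration. Real pipelines also handle recessive, de novo, and X-linked inheritance.

```python
def segregates_dominant(carriers, affected, unaffected):
    """Check segregation under a simplified, fully penetrant dominant model.

    True if every affected family member carries the variant and no
    unaffected family member does.
    """
    return affected <= carriers and not (unaffected & carriers)


# Hypothetical sequenced family.
affected = {"proband", "mother"}
unaffected = {"father", "sister"}

# Hypothetical candidates: variant id -> (gene, family members carrying it).
candidates = {
    "var1": ("GENE_A", {"proband", "mother"}),
    "var2": ("GENE_B", {"proband", "father"}),
    "var3": ("GENE_C", {"proband", "mother", "sister"}),
    "var4": ("GENE_D", {"proband", "mother"}),
}

# Hypothetical gene panel for the disease category the patient was recruited under.
PANEL = {"GENE_A", "GENE_B"}

shortlisted = sorted(
    variant_id
    for variant_id, (gene, carriers) in candidates.items()
    if gene in PANEL and segregates_dominant(carriers, affected, unaffected)
)
print(shortlisted)
```

Here `var4` segregates with the disease but is dropped because its gene is not on the panel, which shows the trade-off of a panel-only strategy: it is precise, but it cannot find causes in genes outside the curated list.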

Computational biology. Image Credit: Gorodenkoff/Shutterstock

The curation, review, and presentation of these panels are enabled by the PanelApp tool. We also make use of a semantic phenotype similarity approach using the Exomiser software framework. Exomiser takes advantage of the fact that detailed data on the clinical phenotypes (signs and symptoms) of rare disease patients are systematically captured using the Human Phenotype Ontology (HPO; 5) and that we also capture reference knowledge of gene to phenotype relationships using the HPO for human disease and related ontologies for model organisms. Exomiser, HPO, and the reference gene-phenotype associations are maintained by an NIH-funded initiative called Monarch. Exomiser can identify rare, protein-altering, and segregating variants in genes with a similar phenotypic profile to the patient based on prior knowledge of human disease, model organisms, and proximity in protein-protein interaction networks.
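To give a feel for semantic phenotype similarity, the sketch below compares a patient's phenotype terms with gene-level phenotype profiles on a toy ontology. It uses a plain Jaccard index over each term set plus its ancestors, which is a stand-in chosen for simplicity and is not Exomiser's scoring method. The ontology, term names, and gene profiles are hypothetical and are not real HPO identifiers.

```python
# Toy ontology: each term maps to its parent terms. Hypothetical terms only.
PARENTS = {
    "focal_seizure": {"seizure"},
    "seizure": {"nervous_system_abnormality"},
    "intellectual_disability": {"nervous_system_abnormality"},
    "nervous_system_abnormality": {"phenotypic_abnormality"},
    "cardiomyopathy": {"cardiovascular_abnormality"},
    "cardiovascular_abnormality": {"phenotypic_abnormality"},
    "phenotypic_abnormality": set(),
}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    closure = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closure:
            closure.add(term)
            stack.extend(PARENTS[term])
    return closure


def jaccard_similarity(terms_a, terms_b):
    """Jaccard index of the two ancestor-closed term sets."""
    closed_a, closed_b = with_ancestors(terms_a), with_ancestors(terms_b)
    return len(closed_a & closed_b) / len(closed_a | closed_b)


# Hypothetical patient phenotype and gene-to-phenotype reference knowledge.
patient = {"focal_seizure", "intellectual_disability"}
gene_profiles = {
    "GENE_A": {"seizure", "intellectual_disability"},
    "GENE_B": {"cardiomyopathy"},
}

# Rank genes by how closely their known phenotypes resemble the patient's.
ranked = sorted(
    gene_profiles,
    key=lambda gene: jaccard_similarity(patient, gene_profiles[gene]),
    reverse=True,
)
print(ranked)
```

Because the comparison includes ancestor terms, a patient annotated with a specific term such as `focal_seizure` still matches a gene annotated only with the broader `seizure`. This tolerance to inexact matches is the main reason for using an ontology rather than exact term matching.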

Using these purely computational approaches, we were able to place the causative single nucleotide variant (SNV) or indel among just a handful of proposed candidates for 77% of samples using the panel-based approach, 88% using Exomiser, and 92% when using both together. A laborious manual review of the evidence and validation of each candidate variant is required before molecular diagnoses can be made by clinical geneticists, so this efficient prioritization of candidates was critical to making WGS-based approaches feasible in a healthcare setting.
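The three percentages also indicate how much the two approaches overlap. The short calculation below applies inclusion-exclusion to the quoted figures, under the assumption that all three percentages refer to the same set of samples; the derived numbers are an inference from the article's figures, not results reported by the study.

```python
# Percentages quoted above, assumed to share the same denominator.
panel_pct = 77      # causative variant shortlisted by the panel-based approach
exomiser_pct = 88   # causative variant shortlisted by Exomiser
either_pct = 92     # causative variant shortlisted by at least one approach

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
both_pct = panel_pct + exomiser_pct - either_pct
panel_only_pct = either_pct - exomiser_pct
exomiser_only_pct = either_pct - panel_pct

print(f"found by both approaches: {both_pct}%")
print(f"found only by panels:     {panel_only_pct}%")
print(f"found only by Exomiser:   {exomiser_only_pct}%")
```

Under that assumption, most causative variants are recovered by both methods, and each method contributes some cases the other misses, which is the argument for running them in parallel.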

Conclusion and future directions

Our work has shown it is possible to analyze whole genomes across a broad range of rare diseases and find a diagnosis in about 25 percent of people. Now that we have the infrastructure in place for this, the plan is to continue to build on it, incorporating new knowledge from researchers worldwide to try to increase the number of patients that can be helped. Part of this will be the application of new computational approaches to extend our search into the non-coding parts of the genome as well as larger structural variants. For example, recently published approaches to predicting which intronic variants affect splicing, including innovative new artificial intelligence approaches, will increase our potential to diagnose more patients.

Having a single healthcare system has made it easier for us to apply WGS diagnostics at scale in the UK. Our findings have already had an impact on our National Health Service (NHS) and will continue to do so, with ambitions to sequence around 300,000 genomes outlined in the NHS Long Term Plan. However, our hope is that our study will not only transform the UK health system but that its approach will be adopted by other healthcare systems to change the lives of rare disease patients worldwide.


  • The 100,000 Genomes Project Pilot Investigators. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care — Preliminary Report. 2021. The New England Journal of Medicine. doi: 10.1056/NEJMoa2035790
  • Karczewski, K.J., Francioli, L.C., Tiao, G. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
  • Martin AR, Williams E, Foulger RE, Leigh S, Daugherty LC, Niblock O, Leong IUS, Smith KR, Gerasimenko O, Haraldsdottir E, Thomas E, Scott RH, Baple E, Tucci A, Brittain H, de Burca A, Ibañez K, Kasperaviciute D, Smedley D, Caulfield M, Rendon A, McDonagh EM. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat Genet. 2019 Nov;51(11):1560-1565. doi: 10.1038/s41588-019-0528-2. PMID: 31676867.
  • Smedley D, Jacobsen JO, Jäger M, Köhler S, Holtgrewe M, Schubach M, Siragusa E, Zemojtel T, Buske OJ, Washington NL, Bone WP, Haendel MA, Robinson PN. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nat Protoc. 2015 Dec;10(12):2004-15. doi: 10.1038/nprot.2015.124. Epub 2015 Nov 12. PMID: 26562621; PMCID: PMC5467691.
  • Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, Danis D, Balagura G, Baynam G, Brower AM, Callahan TJ, Chute CG, Est JL, Galer PD, Ganesan S, Griese M, Haimel M, Pazmandi J, Hanauer M, Harris NL, Hartnett MJ, Hastreiter M, Hauck F, He Y, Jeske T, Kearney H, Kindle G, Klein C, Knoflach K, Krause R, Lagorce D, McMurry JA, Miller JA, Munoz-Torres MC, Peters RL, Rapp CK, Rath AM, Rind SA, Rosenberg AZ, Segal MM, Seidel MG, Smedley D, Talmy T, Thomas Y, Wiafe SA, Xian J, Yüksel Z, Helbig I, Mungall CJ, Haendel MA, Robinson PN. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D1207-D1217. doi: 10.1093/nar/gkaa1043. PMID: 33264411; PMCID: PMC7778952.
  • Shefchek KA, Harris NL, Gargano M, Matentzoglu N, Unni D, Brush M, Keith D, Conlin T, Vasilevsky N, Zhang XA, Balhoff JP, Babb L, Bello SM, Blau H, Bradford Y, Carbon S, Carmody L, Chan LE, Cipriani V, Cuzick A, Della Rocca M, Dunn N, Essaid S, Fey P, Grove C, Gourdine JP, Hamosh A, Harris M, Helbig I, Hoatlin M, Joachimiak M, Jupp S, Lett KB, Lewis SE, McNamara C, Pendlington ZM, Pilgrim C, Putman T, Ravanmehr V, Reese J, Riggs E, Robb S, Roncaglia P, Seager J, Segerdell E, Similuk M, Storm AL, Thaxon C, Thessen A, Jacobsen JOB, McMurry JA, Groza T, Köhler S, Smedley D, Robinson PN, Mungall CJ, Haendel MA, Munoz-Torres MC, Osumi-Sutherland D. The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2020 Jan 8;48(D1):D704-D715. doi: 10.1093/nar/gkz997. PMID: 31701156; PMCID: PMC7056945.
  • Danis D, Jacobsen JOB, Carmody LC, Gargano MA, McMurry JA, Hegde A, Haendel MA, Valentini G, Smedley D, Robinson PN. Interpretable prioritization of splice variants in diagnostic next-generation sequencing. Am J Hum Genet. 2021 Nov 4;108(11):2205. doi: 10.1016/j.ajhg.2021.09.014. Erratum for: Am J Hum Genet. 2021 Sep 2;108(9):1564-1577. PMID: 34739835; PMCID: PMC8595927.
  • Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB, Chow ED, Kanterakis E, Gao H, Kia A, Batzoglou S, Sanders SJ, Farh KK. Predicting Splicing from Primary Sequence with Deep Learning. Cell. 2019 Jan 24;176(3):535-548.e24. doi: 10.1016/j.cell.2018.12.015. Epub 2019 Jan 17. PMID: 30661751.

About Professor Damian Smedley

Professor Damian Smedley leads a Computational Genomics team at Queen Mary University of London, where his research focuses on the use of phenotype data to obtain novel insights into disease causes and mechanisms. His team is involved in translational aspects of a number of projects such as the International Mouse Phenotyping Consortium (IMPC).

In collaboration with other members of the Monarch Initiative, he has developed tools that use phenotype comparisons for candidate gene prioritization, particularly for whole genome sequence interpretation of rare disease patients, as in the Exomiser software suite. Prof. Smedley served as Director of Genomic Interpretation at Genomics England from 2016 to 2018 and has led the analysis of the impact of the 100,000 Genomes Project pilot on rare disease diagnosis in healthcare.

Disclaimer: This article has not been subjected to peer review and is presented as the personal views of a qualified expert in the subject in accordance with the general terms and conditions of use of the News-Medical.Net website.

Last Updated: Jan 10, 2022

