Even before the current coronavirus disease 2019 (COVID-19) pandemic, many biologists produced far more data than they personally needed to analyze, and many bioinformatics tools have been created in attempts to turn this data into useful information. The amount of data released during research into severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is overwhelming, and researchers from the University of Edinburgh have created a tool designed to make as much use of it as possible.
A preprint version of the group’s study is available on the medRxiv* server, while the article undergoes peer review.
The researchers present civet, a Python-based tool they developed to analyze very large quantities of genetic data while integrating metadata from around the world. It requires a sequence alignment and a metadata file that together represent the background diversity of the pathogen of interest, although civet can provide this if a fasta sequence is submitted. The short pipeline filters genome sequences and maps them against reference sequences.
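The filtering step can be illustrated with a minimal sketch. This is not civet's own code: the function names and the thresholds (a minimum length and a maximum fraction of ambiguous "N" bases) are assumptions chosen only to show what a genome quality filter of this kind might look like.

```python
def passes_quality_filter(sequence, min_length=10000, max_n_fraction=0.5):
    """Return True if a genome is long enough and not dominated by ambiguous bases.

    Both thresholds are illustrative assumptions, not values taken from civet.
    """
    if len(sequence) == 0 or len(sequence) < min_length:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction <= max_n_fraction


def filter_genomes(records, min_length=10000, max_n_fraction=0.5):
    """Keep only the name -> sequence entries that pass the quality filter."""
    return {
        name: seq
        for name, seq in records.items()
        if passes_quality_filter(seq, min_length, max_n_fraction)
    }
```

Sequences that survive a filter like this would then be mapped against the reference, which is the step the tool delegates to its alignment pipeline.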
The user can then define queries to generate reports for particular sequences, such as those sampled from certain areas, to see how their cases of interest compare with the wider background data. Civet does this by searching the background data to identify the set of sequences that seem the most relevant, and then comparing the single nucleotide polymorphisms (SNPs) in each query with the SNPs in each record in the background dataset. These comparisons allow the user to determine the direction of ancestry and the distance between targets, and even to search for specific mutations of interest and see which nucleotide or amino acid variants are present at given sites in both the query and the returned results.
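The idea of comparing SNPs between a query and background records can be sketched in a few lines. This is a simplified illustration under stated assumptions, not civet's implementation: it assumes all sequences are already aligned to the same reference, and the function names (`call_snps`, `snp_distance`, `closest_background`) are hypothetical.

```python
def call_snps(reference, sequence):
    """Return {position: base} for sites where an aligned sequence differs from the reference.

    Positions are 1-based. Sites where either base is not A, C, G or T
    (for example N or a gap) are skipped.
    """
    snps = {}
    for pos, (ref_base, base) in enumerate(zip(reference, sequence), start=1):
        if ref_base in "ACGT" and base in "ACGT" and base != ref_base:
            snps[pos] = base
    return snps


def snp_distance(snps_a, snps_b):
    """Count sites at which two SNP profiles disagree.

    Simplification: a site skipped in one sequence is treated as reference-like.
    """
    sites = set(snps_a) | set(snps_b)
    return sum(1 for site in sites if snps_a.get(site) != snps_b.get(site))


def closest_background(reference, query, background, n=2):
    """Return the n background records with the fewest SNP differences from the query."""
    query_snps = call_snps(reference, query)
    scored = [
        (snp_distance(query_snps, call_snps(reference, seq)), name)
        for name, seq in background.items()
    ]
    return sorted(scored)[:n]


# Toy, made-up sequences for illustration only
reference = "ACGTACGTAC"
query = "ACGTTCGTAC"
background = {
    "bg1": "ACGTTCGTAC",
    "bg2": "ACGTTCGTAA",
    "bg3": "GCGTACGTAC",
}
print(closest_background(reference, query, background))
# prints [(0, 'bg1'), (1, 'bg2')]
```

A real analysis would place these comparisons in a phylogenetic context rather than rely on raw SNP counts, but the sketch shows why a shared reference coordinate system makes queries and background records directly comparable.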
The researchers examined a case study involving an outbreak in an Edinburgh hospital in 2020, with cases across multiple wards, including both staff and patients. The outbreak started with a single patient in Ward B, and before long three more patients across Wards A and B had tested positive. This was followed by a total of five healthcare workers testing positive across three different wards. A household contact of one of these workers also contracted the disease.
Initially, it was thought that following a patient-to-staff transmission event, subsequent staff-to-staff events or multiple patient-to-staff events continued the outbreak. Genome sequencing of samples from both staff and patients showed that two different lineages of SARS-CoV-2 were present. Unfortunately, the direction of transmission remains difficult to determine even with this information, but it does suggest multiple introduction events. Civet analysis confirms this, splitting the outbreak into two distinct catchment trees. The civet report also predicted inter-ward transmission, potentially indicating only two introduction events.
The researchers also demonstrated civet's capability as a community surveillance tool, used to summarize the diversity of viruses circulating in a local area or to monitor clusters of interest. For example, the N501Y mutation has been predicted to increase spike protein binding to angiotensin-converting enzyme 2 (ACE2), which is likely to boost both infectivity and transmissibility. The researchers presented a hypothetical case of a civet report resulting from a background search, defining queries as sequences originating from the UK with the spike N501Y mutation, sampled from the beginning of September to the latest data in the background set. Two such clusters existed in the UK: one in Wales, which became known as B.1.1.70, and one in the southeast of England, which became known as B.1.1.7. Two other, very small clusters were also detected by civet.
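A background search of this kind amounts to filtering the metadata on country, mutation and sampling date. The sketch below shows that logic using only the Python standard library. It does not use civet's command-line interface or configuration format: the column names (`sequence_name`, `country`, `sample_date`, `mutations`), the `|` separator and the example rows are all made up for illustration.

```python
import csv
import io
from datetime import date


def select_queries(rows, country, mutation, start):
    """Return names of records from `country`, sampled on or after `start`, that carry `mutation`."""
    selected = []
    for row in rows:
        mutations = row["mutations"].split("|") if row["mutations"] else []
        sampled = date.fromisoformat(row["sample_date"])
        if row["country"] == country and mutation in mutations and sampled >= start:
            selected.append(row["sequence_name"])
    return selected


# Hypothetical metadata; none of these records are real
EXAMPLE_METADATA = """sequence_name,country,sample_date,mutations
seq1,UK,2020-09-20,S:N501Y|S:D614G
seq2,UK,2020-08-15,S:N501Y
seq3,UK,2020-10-02,S:D614G
seq4,Other,2020-10-05,S:N501Y
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_METADATA)))
queries = select_queries(rows, country="UK", mutation="S:N501Y", start=date(2020, 9, 1))
print(queries)
# prints ['seq1']
```

Only `seq1` satisfies all three conditions: `seq2` was sampled before September, `seq3` lacks the mutation, and `seq4` is from a different country.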
While the community surveillance is impressive, the researchers also show that civet can be used at the national level. They present a civet report of genomic surveillance in Trinidad and Tobago over 2020, which displays the sequences gathered from this population alongside the available metadata and summarizes the different catchments used to represent the genomes. There are three catchments from Trinidad and Tobago, representing lineages B.1.111, B.1.1 and B.1.1.33. This indicates at least three independent introductions of the virus into Trinidad and Tobago. The sequences form a monophyletic cluster when compared to sequences from around the world. The scientists also display a timeline of events, with B.1.111 appearing consistently towards the end of 2020, while B.1.1 and B.1.1.33 only appear intermittently.
The authors have created a powerful tool for phylogenetic analysis, and while it has primarily been used to investigate SARS-CoV-2 variants, it will likely remain in use long after the pandemic. It could provide invaluable insight into future outbreaks and into how disease differs by location. This could inform healthcare workers, researchers and drug developers, and help provide more targeted care to those in need.
medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.