Mass spectrometry has always had a powerful synergy with computers. Computers have pushed mass spectrometry forward at key junctures in its history from data collection to instrument operation to data analysis. Proteomics was enabled by both tandem mass spectrometry and informatics to rapidly assign amino acid sequences to spectra.
As instrumentation has become more powerful, informatic capabilities have grown to keep pace with increases in data production and data types. Sophisticated workflows are used to process proteomic experiments, encompassing search, quantitation, and statistical processing of data. New features added to mass spectrometers, such as ion mobility, provide additional capability for collecting data and information for interpreting peptides and peptide features. IP2 is a proteomic platform that creates a workflow combining GPU-powered search, flexible quantitation, and statistical analysis of data.
Could you tell us about the evolving relationship between mass spectrometry and computers?
Mass spectrometry and computers have had an interesting relationship over the years. There is a concept known as ‘adjacent possible’ (introduced by Stuart Kauffman in 2002) which states that both evolution and innovation tend to happen incrementally, within the realm of possibilities available at any given moment.
This idea has been extremely relevant to mass spectrometry. If you look at the history of mass spectrometry, a lot of the early work took place at academic institutions where computers were also being developed.
This led to collaborations where computers were used in various projects, such as complex and accurate mass calculations for molecular formulae. Eventually, technology developed that enabled computers to record mass spectra directly rather than relying on the traditional photographic plate. In time, computer algorithms were able to process data more efficiently, so you could get much more out of it.
How important have crowdsourcing and current technology been in the development of mass spectrometry?
Incredibly important. Around the same time as the advances in computing mentioned earlier, people realized that they should not need to interpret mass spectra more than once.
This led to the idea of creating libraries of spectra that have already been interpreted and sharing these, and this practice actually opened up the concept of library searching. This is an early example of crowdsourcing within the scientific community.
At that time, computers did not have large amounts of storage space or memory. Algorithms had to be very clever to reduce the amount of information and computing resources required to do the library search. Computer-controlled data acquisition gave way to data-dependent acquisition, and then eventually data-independent acquisition, allowing for the large-scale analysis of peptides.
It was now possible to treat a tandem mass spectrum of a peptide as an amino acid sequence barcode, using it to scan through a database and identify the amino acid sequence that it represents.
This approach enabled high throughput and large-scale experiments that could accommodate highly complex biological systems, from protein complexes and organelles to cells and tissues. However, this generated more data and more analyses, which required further organization.
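The "barcode" idea described above can be sketched in a few lines of Python. The example computes theoretical singly charged b- and y-ion m/z values for a candidate peptide from standard monoisotopic residue masses, then picks the candidate whose fragments match the most observed peaks. The function names, the 0.02 m/z tolerance, and the matched-peak count used as a score are illustrative assumptions, not the scoring used by ProLuCID or any other search engine.

```python
from itertools import accumulate

# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mz(peptide):
    """Return (b_ions, y_ions) as singly charged m/z lists for a peptide."""
    masses = [MONO[aa] for aa in peptide]
    prefix = list(accumulate(masses))[:-1]            # N-terminal pieces
    suffix = list(accumulate(reversed(masses)))[:-1]  # C-terminal pieces
    b_ions = [m + PROTON for m in prefix]
    y_ions = [m + WATER + PROTON for m in suffix]
    return b_ions, y_ions


def matched_peaks(peptide, observed_mz, tol=0.02):
    """Count observed peaks within tol of any theoretical fragment (toy score)."""
    b_ions, y_ions = fragment_mz(peptide)
    theoretical = b_ions + y_ions
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed_mz)


def best_candidate(candidates, observed_mz, tol=0.02):
    """Pick the candidate sequence with the most matched fragment peaks."""
    return max(candidates, key=lambda p: matched_peaks(p, observed_mz, tol))
```

A real search engine would also handle multiple charge states, modifications, and a statistically calibrated score; this sketch only shows why a fragment pattern can act as a lookup key for a sequence.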
How have developments in informatics impacted the potential of mass spectrometry?
Developments in mass spectrometry have placed a lot of stress on our ability to collect, analyze, organize, and interpret data. This is where informatics has become important. Software tools and scripts have become commonplace, ranging from data extraction, search engines, and quantitative analysis through to validation tools, confidence scoring tools, and data repositories. The latter regularly use Microsoft Excel as a storage medium, though we have been moving away from this as it is not ideal.
It is important that data can be used within the lab, so here at Bruker, we have been developing protein structure analysis and gene enrichment analysis tools with that in mind.
Can you give our readers an overview of the IP2 platform?
A few years ago, we did a comparison among the data extraction tools available, and we found a great deal of variability in terms of their capabilities. This is one of the most important steps of the process, but the issue with having so many software tools available is that compatibility can be difficult to achieve across a whole workflow.
We launched a company called Integrated Proteomics Applications, developing a workflow scaffold called the Integrated Proteomics Pipeline (IP2). The idea behind this scaffold is that we can plug tools that we developed in academia (which were freely available and open-source) into the pipeline, thus creating a streamlined workflow with integrated data analysis tools.
The IP2 is a middle layer program, which manages analyses, spectral quality control, back end layers, links to cloud and cluster computing, data storage, and backup.
Users can access their data or view the status of processes using a desktop computer, phone, or tablet. The IP2 is also customizable via the IP2 Developer’s Kit, meaning users can adapt the platform to work with other software and applications.
The IP2 can work with large-scale parallel proteomics data analysis, and we have been able to integrate our platform with cloud computing via Amazon Web Services, Google Cloud, and Microsoft Cloud tools. We are also taking advantage of GPUs, both within our laboratory system and cloud-based, to increase the speed and efficiency of the platform.
How does the use of GPU processing affect the performance of the IP2?
The IP2 allows the use of a GPU search engine, and this utilization of GPU cores rather than CPU cores makes it extremely fast. A GPU card will have thousands of GPU cores, and you can add even more GPU cards to improve speed. The GPU core allows database searching at incredibly fast rates, and database searching scales with computing power.
We have been looking at using dual searches to improve data quality, incorporating this feature into the IP2. Here, you first search your DDA data against a sequence database and then use the results to build a library. You can then search your DDA data a second time using that library. This approach can improve data reproducibility dramatically, and the use of GPU cores gives us the processing power to achieve this.
Can you also tell our readers about the timsTOF platform and how this integrates with the IP2?
Bruker’s timsTOF is a tool designed to measure ion mobility. This is a powerful extension to mass spectrometry that gives us information about the three-dimensional structure of an ion, helping us to increase peak capacity and overall confidence in the compound characterization.
We have been optimizing our platforms, tools, and search engine for the timsTOF, specifically around how we extract data from timsTOF’s raw files, which are large and contain a lot of information.
The result of this work has been a robust set of tools for timsTOF: the timsTOFExtractor extraction program, the ProLuCID search engine (which uses the GPU processing technology we talked about earlier), the Census quantitative data analysis application, and PaSER (Parallel Database Search Engine in Real-Time).
What are the challenges in developing search engine tools for metaproteomics and microbiome data analysis?
Working with microbiome data is extremely challenging because of its large sequence database. This is currently around 70 gigabytes in size and continuing to grow. The index database for this data is over a terabyte in size, meaning that it is difficult to search this using traditional search strategies.
To address this, we worked with Dennis Wolan from the Scripps Research Institute to develop the ProLuCID-ComPIL search engine.
ProLuCID-ComPIL pre-sorts and pre-analyzes the data using NoSQL to improve search time. These algorithms and processes can also be used with PTMs and sequence variants, which are transferred into an index database that is then searched at high speed using the GPU.
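To show why pre-sorting helps, here is a minimal sketch of a mass-sorted peptide index. Once candidates are sorted by mass, the peptides falling inside a precursor tolerance window can be found with two binary searches instead of a scan over the whole database. The `MassIndex` class, its method names, and the in-memory list are assumptions made for illustration; ComPIL itself is described above as using a NoSQL store, and this sketch does not reproduce its schema or layout.

```python
import bisect


class MassIndex:
    """Toy in-memory index of peptides sorted by neutral mass (hypothetical)."""

    def __init__(self, peptides_with_mass):
        # peptides_with_mass: iterable of (peptide, mass) pairs.
        entries = sorted((mass, pep) for pep, mass in peptides_with_mass)
        self._masses = [m for m, _ in entries]
        self._peptides = [p for _, p in entries]

    def candidates(self, precursor_mass, tol_ppm=10.0):
        """Return peptides whose mass lies within tol_ppm of precursor_mass."""
        delta = precursor_mass * tol_ppm / 1e6
        lo = bisect.bisect_left(self._masses, precursor_mass - delta)
        hi = bisect.bisect_right(self._masses, precursor_mass + delta)
        return self._peptides[lo:hi]
```

Building the index costs one sort up front, after which each lookup is logarithmic in the number of entries, which is the trade-off that makes very large databases searchable.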
We have also been able to turn our attention to metabolomics thanks to our work with Yu Gao at UCSD and his spectral alignment tool Dilu.
Can you tell us more about the PaSER system?
Our PaSER system is a parallel database search engine that can work in real-time. Many applications scan extremely fast, generating a large number of spectra, so one of the key advantages of searching in real-time is that there is no need for the data extraction step. You just take the data directly from the mass spectrometer and search right away. There is no need to upload the data.
The PaSER system is fast enough that it can accommodate several instruments at once, but our goal with the PaSER platform is not just real-time search. We want to continue to address the many challenges present in delivering effective real-time search functionality.
The search engine’s speed is critical if it is to keep pace with the rapid scanning speed of instruments feeding into it.
Like the IP2 platform, the PaSER uses GPU cores instead of CPU cores, ensuring considerable speed improvements over traditional search technology. This means that it is possible to send data from an instrument to an IP2-GPU box in real-time and the database search result will be available immediately after the experiment is done.
How did you evaluate PaSER’s speed increases over traditional offline searches?
In order to evaluate PaSER, we ran some samples on the timsTOF Pro, namely 200 nanograms of HeLa. We ran six technical replicates. The first run used real-time search and the second did not. This pattern was repeated in runs three and four, before two more real-time searches in runs five and six.
The goals of this experiment were to identify any lags in scanning speed and to check whether the number of successful identifications was affected by the use of real-time search.
The experiment found that the use of real-time search did not affect scanning time. We also found that overall, the use of real-time search returned the same number of identified results as a standard offline search.
We also evaluated the offline search time. Sometimes, following a real-time search, users may want to search again with different parameters or in different databases. In this scenario, there is still no need to convert the raw data because the initial real-time search has already stored spectra and transferred them to the database. In our example, it only took three minutes for each search using the IP2-GPU search engine.
How does the Smart Precursor Selection tool improve the search process?
PaSER’s Smart Precursor Selection tool allows PaSER to communicate with instruments bi-directionally. This opens up many creative possibilities, depending on the goal of your project.
For example, we can use this tool to implement exclusion lists. This idea has been around for a long time, but it is not very popular because it is not very easy to implement. Historically, users have had to manually collect peptide IDs from search results, then input the exclusion list manually. As identifications accumulated, this process had to be repeated, which, of course, is not ideal.
PaSER allows us to build the exclusion list automatically, right from the first experiment. PaSER will then pass this exclusion list to the second experiment, and so on, refining the exclusion list with each iteration. This approach allows our searches and experiments to yield more accurate results over time.
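The iterative refinement described above can be sketched as a small accumulation loop. In this hypothetical example, each identified precursor is stored as an (m/z, charge) pair, and a new entry is added only if no existing entry with the same charge lies within a tolerance. The function names, the tuple representation, and the 0.01 m/z tolerance are assumptions for illustration; the source does not describe how PaSER stores or transmits its exclusion list.

```python
def is_excluded(mz, charge, exclusion, mz_tol=0.01):
    """True if (mz, charge) matches an exclusion entry within mz_tol."""
    return any(c == charge and abs(mz - m) <= mz_tol for m, c in exclusion)


def refine_exclusion_list(exclusion, identified, mz_tol=0.01):
    """Return a new exclusion list extended with newly identified precursors.

    exclusion:  list of (mz, charge) carried over from earlier runs
    identified: list of (mz, charge) identified in the current run
    """
    updated = list(exclusion)
    for mz, charge in identified:
        if not is_excluded(mz, charge, updated, mz_tol):
            updated.append((mz, charge))
    return updated
```

In use, the list returned after the first run would be passed in as `exclusion` for the second run, and so on, so the list grows with each experiment without manual curation.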
Another example is the dynamic management of mass drift due to temperature changes or calibration issues. Using PaSER’s real-time search, we can measure delta mass between theoretical and experimental peptide precursor ions. We can then send the delta mass back to the instrument and the instrument can calibrate the mass drift dynamically and in real-time. This means that whether we complete ten runs or a hundred runs, the mass calibration is always up to date.
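As a rough illustration of the delta mass calculation, the sketch below expresses each precursor's error in parts per million, takes the median over recent confident identifications as the drift estimate, and shows how an observed m/z would be corrected by that estimate. The use of a median, the sign convention (observed minus theoretical), and the function names are assumptions; how the instrument actually consumes the correction is not described in the source.

```python
from statistics import median


def ppm_error(observed_mz, theoretical_mz):
    """Mass error in ppm; positive when the instrument reads high."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def estimate_drift_ppm(pairs):
    """Median ppm error over (observed_mz, theoretical_mz) pairs."""
    return median(ppm_error(o, t) for o, t in pairs)


def apply_correction(observed_mz, drift_ppm):
    """Remove an estimated systematic drift from an observed m/z."""
    return observed_mz / (1.0 + drift_ppm / 1e6)
```

The median is chosen here because a few incorrect identifications would otherwise skew a mean; a production system might also weight by score or use a sliding time window.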
When we are working with MS1-based quantitative analysis with labeling, we will often be presented with both light and heavy ions for the same peptide. For quantitative analysis, we do not need to use both precursor ions – one is enough to quantify the sample because we already know the mass difference between light and heavy ions.
During real-time search, we can ascertain if we are working with a heavy or light peptide, then we can dynamically exclude the other of the pair, so we do not have to generate redundant spectra.
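Because the mass difference between the light and heavy forms is known, the partner's m/z follows directly from the identified precursor. The sketch below assumes a fixed per-label mass shift supplied by the caller and a known number of labeled residues; the function name and parameters are hypothetical.

```python
def partner_mz(mz, charge, is_heavy, label_shift_da, n_labels=1):
    """Return the m/z of the other member of a light/heavy pair.

    mz, charge:      the identified precursor
    is_heavy:        True if the identified precursor is the heavy form
    label_shift_da:  mass added per labeled residue (Da), supplied by the user
    n_labels:        number of labeled residues in the peptide
    """
    shift = label_shift_da * n_labels / charge
    return mz - shift if is_heavy else mz + shift
```

For example, with a label shift of 8.014199 Da (the value commonly quoted for a 13C6 15N2 lysine label), a light doubly charged precursor at m/z 500.0 would have its heavy partner near m/z 504.007, and that value could be placed on the exclusion list.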
We can also use this tool to work with site-specific labeling techniques such as AHA labeling or TEV tag labeling. For example, AHA labeling can label methionine, while TEV tagging can label cysteine. According to the UniProt database, approximately 65% of peptides do not contain cysteine or methionine. With this information, we can exclude many peptides in real-time and selectively scan only those that can carry the label.
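The residue-based filter described above amounts to a simple membership test on the identified sequence. This sketch splits a list of peptide sequences into those that contain a target residue and those that do not; the function names and the default residue set ("C" and "M") are assumptions chosen to mirror the cysteine and methionine example.

```python
def can_carry_label(peptide, target_residues=frozenset("CM")):
    """True if the peptide contains at least one labelable residue."""
    return any(aa in target_residues for aa in peptide)


def split_by_label(peptides, target_residues=frozenset("CM")):
    """Partition peptides into (labelable, excludable) lists."""
    labelable, excludable = [], []
    for pep in peptides:
        if can_carry_label(pep, target_residues):
            labelable.append(pep)
        else:
            excludable.append(pep)
    return labelable, excludable
```

In a real-time setting, the `excludable` sequences would be the ones whose precursors are added to the exclusion list, leaving instrument time for peptides that can carry the label.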
How does PaSER accommodate dynamic PASEF spectra and real-time quantification?
PaSER can use PASEF, or Parallel Accumulation Serial Fragmentation, as part of its operation. In each PASEF cycle, we combine frames to build a PASEF scan, but depending on the abundance of ions, one PASEF cycle may not have enough ions to produce a result.
Here, real-time search can check the search score and evaluate the spectra. If this check reveals that the PASEF scan requires more signal, more frames can be added to boost it.
We need to know the apex point to be able to effectively trigger tandem spectra, and we can do this during real-time search by evaluating a chromatogram and triggering precursors only once the peak apex has been determined. We can also dynamically remove the ion from the exclusion list at the most appropriate time.
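One simple way to decide that an apex has been reached on a chromatogram that is still being acquired is to wait until the intensity has fallen for a few consecutive points after the running maximum. The sketch below implements that retrospective rule; the function name and the `confirm_points` parameter are assumptions, and the detection necessarily lags the true apex by that many points. The source does not say which apex detection method PaSER uses.

```python
def apex_index(intensities, confirm_points=2):
    """Return the index of the apex once it is confirmed, else None.

    The apex is confirmed when the intensity has decreased for
    confirm_points consecutive points after the running maximum.
    """
    best = 0
    falling = 0
    for i in range(1, len(intensities)):
        if intensities[i] > intensities[best]:
            best, falling = i, 0          # new running maximum
        elif intensities[i] < intensities[i - 1]:
            falling += 1                  # still descending
            if falling >= confirm_points:
                return best
        else:
            falling = 0                   # plateau or partial recovery
    return None
```

A larger `confirm_points` makes the detector more robust to noise at the cost of a later decision; smoothing the trace first is another common option.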
Lastly, we are also working on real-time quant capabilities with the timsTOF. Instead of a typical extracted ion chromatogram (XIC) peak area, we can calculate the volume of a peptide feature using the additional ion mobility dimension.
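The difference between a peak area and a peak volume can be illustrated with plain trapezoidal integration. An area integrates intensity over retention time only, whereas a volume integrates over both retention time and ion mobility. The function names and the grid layout (rows indexed by retention time, columns by mobility) are assumptions for this sketch, and it does not reflect how Census or PaSER actually compute quantitative values.

```python
def trapezoid(ys, xs):
    """Trapezoidal integral of ys sampled at positions xs."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def peak_area(intensities, rt):
    """Conventional chromatographic peak area (intensity over time)."""
    return trapezoid(intensities, rt)


def peak_volume(grid, rt, mobility):
    """Peak volume over retention time and ion mobility.

    grid[i][j] is the intensity at rt[i] and mobility[j].
    """
    per_rt = [trapezoid(row, mobility) for row in grid]
    return trapezoid(per_rt, rt)
```

Integrating over mobility first and then over time gives the same result as the reverse order for a regular grid, so the choice here is only a matter of data layout.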
It is also possible to separate co-eluting peptides using the peptide ID from the real-time search, meaning that we can perform quantitative analysis in parallel. If we are working with multiple experiments, we can also build a match between runs as we move through the experiment.
These are all good examples of the creative use of PaSER’s bi-directional communication capabilities.
Finally, where do the IP2 and Bruker’s other platforms sit within the wider data and knowledge industry?
We now have a wide range of tools that allow us to catalog and identify mass spectrometry data. The overall goal, of course, is to identify things that lead to biological discoveries.
With this in mind, our applications link to data analysis tools like Reactome, many of which can be freely accessed via the internet. There are tools available that can explore Gene Ontology output, for example by picking the top twenty most significant categories, while maintaining information on all categories in the raw file generated so that it can be examined further without having to re-analyze it.
Mathieu Lavallée-Adam and his team developed an in-house tool for us called PSEA-Quant, which is designed for protein set enrichment analysis. This tool was based on others developed for gene set enrichment analysis, but it has been optimized for both label-free and label-based protein quantification data.
Overall, compatibility between platforms and tools is a major focus of our work.