By Shelley Farrar, MSc, BSc
A proteome is the complete set of proteins expressed by an organism or biological system. Proteomics is, therefore, the large-scale study of proteomes, exploring a range of protein activities including expression, movement and interaction. Proteomics takes a quantitative approach to functional genomics and the study of biological systems, working from extensive datasets that take the form of lists of proteins.
The advent of shotgun proteomics, identifying proteins in complex mixtures through high-throughput technologies, has meant that additional methods are required to interpret the resulting large lists of identified proteins. Biostatistics and bioinformatics tools have been applied to the interpretation of proteomics data.
Interpreting Proteomics Data with Gene Ontology Annotation
The biological relevance of the vast number of identified proteins has to be extracted through functional annotation. Functional annotation of proteomics data allows biological information databases to be mined to predict the function of a protein. The classification of genes and proteins according to their roles in biological systems is also the foundation for analyzing the relationships and interactions between them.
Gene Ontology (GO) is a bioinformatics initiative to develop a controlled vocabulary, originally aimed at eukaryotes and now applied across all organisms, that can classify a gene or protein into a category.
Each annotation places the description within one of three domains:
- A biological process.
- A molecular function.
- A cellular component.
GO annotations are hierarchical, with more general terms at the higher end of the hierarchy and more specific terms at the lower end. This allows relationships to be traced between a lower ‘child’ term and potentially multiple higher ‘parent’ terms. Genes and proteins are annotated with the most specific applicable term and can therefore be traced up the hierarchy to one of the three original domains.
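The child-to-parent tracing described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a small hand-written hierarchy: the `PARENTS` map and the `ancestors` helper are hypothetical, and the term names are illustrative rather than taken from a GO release.

```python
# Toy child -> parents map. Term names are illustrative only and are
# not copied from any GO release; a term may have several parents.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "protein phosphorylation": ["cellular metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following child -> parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    # Lists all more general terms, up to the root domain.
    print(sorted(ancestors("protein phosphorylation")))
```

A protein annotated with the most specific term is implicitly associated with every ancestor returned here, which is why a single annotation can be traced back to its root domain.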
The GO database is constantly revised with new annotation files to reflect better knowledge of a relationship and to remove obsolete terms.
Enrichment Analysis of Proteomics Data
Enrichment analysis can be used to identify overrepresentation of biological information in long protein lists and allow for the visualization of biological processes. Enrichment analysis takes GO terms and uses them to summarize the biological pathways that are most likely related to the proteomic data. Statistical methodologies are used to compare the abundance of GO terms in the dataset with the natural abundance in a reference dataset.
Terms that are overrepresented in the proteomics dataset are identified by calculating a p-value for each term. More than 60 software tools have been developed to perform enrichment analysis, using a range of enrichment algorithms.
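One common way to obtain such a p-value is a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of annotated proteins by chance. The sketch below assumes that test; individual tools may use other statistics, and the function name, parameter names and example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: proteins in the reference (background) set
    K: reference proteins annotated with the GO term
    n: proteins in the experimental list
    k: proteins in the experimental list annotated with the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Made-up counts: 8 of 50 identified proteins carry a term that
    # annotates 40 of 2000 proteins in the reference set.
    print(enrichment_p_value(k=8, n=50, K=40, N=2000))
```

A small p-value suggests the term appears in the list more often than the reference abundance would predict. Because many terms are usually tested at once, the raw p-values are normally adjusted for multiple testing before interpretation.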
Different algorithms are used depending on whether one annotation term is tested at a time, via singular enrichment analysis (SEA), or whether the whole genome is taken into account, via gene set enrichment analysis (GSEA).
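To illustrate how a whole-list approach differs from testing a pre-selected subset, the sketch below computes a simplified, unweighted running-sum score over a ranked protein list. It is an assumption-laden simplification: the published GSEA method also weights each hit by its ranking metric and estimates significance by permutation, both of which are omitted here, and the function and variable names are hypothetical.

```python
def enrichment_score(ranked_proteins, protein_set):
    """Unweighted running-sum score for one set over a ranked list.

    Walks down the ranking, stepping up for set members and down for
    non-members, and returns the largest deviation from zero.
    """
    members = set(protein_set)
    hits = sum(1 for protein in ranked_proteins if protein in members)
    misses = len(ranked_proteins) - hits
    if hits == 0 or misses == 0:
        raise ValueError("need at least one member and one non-member")

    running = 0.0
    best = 0.0
    for protein in ranked_proteins:
        if protein in members:
            running += 1.0 / hits
        else:
            running -= 1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Hypothetical ranking, e.g. by fold change, highest first.
    ranked = ["P1", "P2", "P3", "P4", "P5", "P6"]
    print(enrichment_score(ranked, {"P1", "P2"}))
```

A score near +1 means the set members cluster at the top of the ranking, and a score near -1 means they cluster at the bottom. No cutoff is needed to pick a "significant" protein list first.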
Biological Network Analysis of Proteomics Data
A biological pathway is a series of cellular chemical reactions that together cause a biological effect. As proteins are involved in these chemical reactions, they can be grouped in pathway databases, which allows us to interpret the types of biological process represented in a proteomics dataset. The simplest methods analyze the protein lists for abundances that represent a particular pathway.
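The simplest form of pathway mapping can be sketched as counting how many identified proteins belong to each pathway. The example below is illustrative only: the `PATHWAYS` table is a hand-written stand-in for a real pathway database, and `pathway_coverage` is a hypothetical helper.

```python
# Hand-written stand-in for a pathway database (pathway -> members).
# Membership lists are deliberately incomplete and for illustration.
PATHWAYS = {
    "glycolysis": {"HK1", "PFKM", "ALDOA", "GAPDH", "PKM"},
    "TCA cycle": {"CS", "ACO2", "IDH2", "SDHA"},
}


def pathway_coverage(identified, pathways=PATHWAYS):
    """Map each pathway to (found, pathway size, fraction found)."""
    identified = set(identified)
    coverage = {}
    for name, members in pathways.items():
        found = identified & members
        coverage[name] = (len(found), len(members), len(found) / len(members))
    return coverage


if __name__ == "__main__":
    # Hypothetical list of identified proteins.
    for name, stats in pathway_coverage(["GAPDH", "PKM", "CS", "ALB"]).items():
        print(name, stats)
```

Raw counts like these favour large pathways, which is one reason they are usually followed by a statistical enrichment test.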
Several biological network models have been developed that aid in the interpretation of proteomics data by simulating biological systems. They allow for experimental verification of the processes involved and the simulation of complex cellular interactions. This means that the consequences of each biological pathway can be projected.
Software has also been developed to aid in the visualization of biological processes. Computational tools can process large-scale proteome datasets by integrating the results of the functional enrichment analysis, so that the overrepresented annotations can be displayed as a network.
This computational approach makes the interpretation of proteomics data easier to visualize. In the resulting network display, nodes are associated with a molecular component such as a protein, whilst edges are associated with the different types of interaction between nodes. By processing proteomics data in this way, interpreting long lists of proteins is made easier and the resulting biological information can be applied to a variety of questions within the field of proteomics.
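The node-and-edge structure described above can be represented with a simple adjacency map. This is a minimal sketch, assuming undirected interactions; the protein names, interaction types and helper functions are all hypothetical placeholders, and dedicated visualization software would take over from here.

```python
from collections import defaultdict

# Hypothetical interactions: (protein, protein, interaction type).
INTERACTIONS = [
    ("ProteinA", "ProteinB", "physical binding"),
    ("ProteinA", "ProteinC", "phosphorylation"),
    ("ProteinB", "ProteinC", "co-expression"),
    ("ProteinC", "ProteinD", "physical binding"),
]


def build_network(interactions):
    """Build an undirected network: node -> {neighbour: edge type}."""
    network = defaultdict(dict)
    for first, second, kind in interactions:
        network[first][second] = kind
        network[second][first] = kind
    return network


def degrees(network):
    """Number of interaction partners per node."""
    return {node: len(neighbours) for node, neighbours in network.items()}


if __name__ == "__main__":
    net = build_network(INTERACTIONS)
    print(degrees(net))
```

Nodes with many partners are often the first candidates to inspect, since they connect several parts of the network.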