Protein variation, which accounts for much of the complexity in biological systems and our bodies, can come in many different forms. The different types of variation in the molecular form of protein products are united by the term proteoform (previously also known as protein forms, protein isoforms, protein species, and protein variants).
Endeavors into understanding genetic variation led to the discovery that much of the variation and complexity in biology is due to proteins, rather than only genes. Different proteoforms can arise due to genetic variation, manipulation or splicing of RNA transcripts, and modifications occurring after translation.
A few kinds of protein variation are not covered by the term proteoform. These include modifications such as reagent-derivatized or isotope-labeled residues. Otherwise, the concept of proteoforms is used to capture the full complexity of proteins and how the different sources of variation can interact to give rise to differences.
Sources of protein variation
Genetic variation giving rise to proteoforms can largely be attributed to coding single nucleotide polymorphisms (cSNPs) and mutations. Variation at the RNA level can be mainly attributed to alternative splicing.
For example, it is estimated that around 93% of human genes are subject to alternative splicing, which can have implications for protein function and localization. Variation at the RNA level can also arise from RNA editing, most commonly the conversion of adenosine to inosine.
Translation is not a perfect process, and errors in translation are another source by which unique proteoforms can arise. Estimated error frequencies are at around 0.01-0.1% per amino acid, which may increase in aging or stressed cells, meaning errors can make up a sizable portion of variation in cells with many proteins.
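A rough calculation shows why a small per-residue error rate still matters. The sketch below assumes that errors occur independently at each residue with the same probability, which is a simplification. The function name and the 400-residue protein length are illustrative choices, not values taken from the studies behind the estimate.

```python
def fraction_with_error(error_rate_per_residue, length):
    """Probability that one protein copy of `length` residues contains
    at least one mistranslated residue.

    Assumption: errors are independent and equally likely at every residue.
    """
    return 1.0 - (1.0 - error_rate_per_residue) ** length


# 0.01% and 0.1% per amino acid, the range quoted above
for rate in (0.0001, 0.001):
    p = fraction_with_error(rate, 400)  # 400 residues: hypothetical protein
    print(f"error rate {rate:.2%} per residue: {p:.1%} of copies affected")
```

Under these assumptions, roughly 4% of copies of a 400-residue protein would carry at least one error at the lower rate, and roughly a third at the higher rate.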
Post-translational modifications are also a sizable source of proteoforms, as they can increase proteoform numbers exponentially. Post-translational modifications can be divided into categories based on structure or function.
For example, structural categories consider whether modifications are simple (e.g. phosphorylation or acetylation) or complex (e.g. glycosylation) and how this increases proteoform numbers. Functional categories focus on the effects of post-translational modifications on phenotype, and therefore on how different proteoforms can give rise to different phenotypes.
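The exponential growth in proteoform numbers can be illustrated with a simple count. The sketch below treats each modification site as independent and multiplies the number of possible states per site. This gives a theoretical upper bound, not a prediction of what cells actually produce, and the site counts used are hypothetical.

```python
import math


def max_ptm_combinations(states_per_site):
    """Upper bound on distinct modification patterns for one protein.

    `states_per_site` lists how many states each site can take, for example
    2 for a site that is either unmodified or phosphorylated.
    Assumption: sites are independent and every combination is possible.
    """
    return math.prod(states_per_site)


# Hypothetical proteins with 1, 5, 10 and 20 on/off modification sites
for n_sites in (1, 5, 10, 20):
    print(n_sites, "sites:", max_ptm_combinations([2] * n_sites), "patterns")
```

With only on/off sites the count doubles with each added site, so 20 sites already allow more than a million patterns in principle. Far fewer are likely to occur in practice.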
The size of the proteome is subject to a lot of debate, with estimates ranging from 20,000 to several million proteoforms. While the human genome contains an estimated 20,000 protein-coding genes, the proteome can be several orders of magnitude larger because of the great variety of proteoforms.
The presence and function of proteoforms can be critical for normal body functioning. In humans, there are 23 known proteoforms in the amyloid-β system in Alzheimer’s disease, where the different proteoforms are not detectable through traditional ELISA assays. There are also around 75 known proteoforms for the histone H4 system, which is associated with gene repression and activation.
Understanding the full extent of human proteoforms will be challenging. It is necessary to establish not only how many proteoforms exist, but also how proteoform diversity varies between cell types, what role proteoforms play in disease, and how they contribute to human diversity, all of which will be complex and difficult to decipher. Projects such as the Human Protein Atlas and the Human Cell Atlas have been launched in the past 10 years to help understand human diversity, and will likely include proteoforms.
Issues in proteoform detection and understanding
While proteomics platforms have been massively improved in recent decades, there are still discrepancies in proteoform detection. For example, alternative transcripts that are discovered via RNA sequencing are not always found using proteomics methods.
The low detection of proteoforms of this type is due to the limited sensitivity and coverage of the currently used proteomics platforms. Even in deep proteomic analyses, where the expression of most genes can be detected, the sequence coverage for many proteins is low. This is especially true for the products of genes expressed at low abundance.
Another complication is that proteoforms are hard to distinguish with the currently dominant strategy. The most widely used ‘bottom-up’ approach digests proteins and detects the resulting peptides with LC-MS/MS, but most proteoforms share peptides with each other, so this method often cannot tell them apart. The ‘top-down’ approach, in which proteins are not digested and the entire proteoform is analyzed by LC-MS/MS, is often seen as better suited.
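The shared-peptide problem can be illustrated with a toy in-silico digest. The sketch below uses a simplified trypsin rule (cut after every lysine or arginine, ignoring missed cleavages and the exception before proline) and two made-up sequences that differ at a single residue. The sequences and function name are hypothetical and serve only as an illustration.

```python
import re


def tryptic_peptides(sequence):
    """Return the set of peptides from a simplified trypsin digest.

    Assumption: cleavage after every K or R, with no missed cleavages
    and no proline exception.
    """
    return {pep for pep in re.split(r"(?<=[KR])", sequence) if pep}


# Two hypothetical proteoforms differing by one residue (A -> V)
form_a = "MAGTKLLSERDVEALKGGWTR"
form_b = "MAGTKLLSERDVEVLKGGWTR"

peptides_a = tryptic_peptides(form_a)
peptides_b = tryptic_peptides(form_b)

shared = peptides_a & peptides_b
unique = peptides_a ^ peptides_b

print("shared peptides:", sorted(shared))
print("distinguishing peptides:", sorted(unique))
```

In this toy case three of the four peptides from each form are identical, so a bottom-up experiment can tell the two forms apart only if it happens to detect the single differing peptide. A top-down measurement of the intact molecules would instead see two different masses.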