Isoelectric Point Estimation, Amino Acid Sequence and Algorithms

The isoelectric point, or pI, is the pH at which a molecule carries no net surface charge. It governs the electrophoretic mobility of proteins and also plays a role in identifying peptides from mass spectral proteomics data. pI depends on a number of factors, including amino acid sequence, post-translational modifications (PTMs) and the presence of side chain groups, all of which can alter surface charge and behavior depending on the pH of the environment.

Various methods exist for predicting pI in denatured proteins, and most base the calculation on amino acid sequence with reference to pKa values recorded for the ionizable constituents. Their performance can be variable, however, and may skew ensuing results.
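
As a general illustration of how sequence and pKa values combine (a textbook-style formulation, not necessarily the exact expression used by any of the tools discussed here), the net charge Q of a denatured chain at a given pH can be written as a sum of Henderson-Hasselbalch terms. Here n_i and n_j are assumed to be the counts of each basic and acidic group type, termini included, and the pI is the pH at which Q is zero:

```latex
Q(\mathrm{pH}) = \sum_{i \in \mathrm{basic}} \frac{n_i}{1 + 10^{\,\mathrm{pH} - \mathrm{p}K_{a,i}}}
  \;-\; \sum_{j \in \mathrm{acidic}} \frac{n_j}{1 + 10^{\,\mathrm{p}K_{a,j} - \mathrm{pH}}},
\qquad Q(\mathrm{pI}) = 0
```

Each basic group contributes close to +1 well below its pKa and close to 0 well above it; each acidic group contributes close to 0 well below its pKa and close to -1 well above it.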

Audain et al. (2015) compared five tools available to researchers for determining pI on the basis of amino acid sequence.¹ The researchers benchmarked algorithm performance by comparing the predicted values against public data sets.

The researchers chose the following tools to undergo benchmarking:

Iterative: calculated from amino acid sequence
Cofactor: calculated with correction factors according to amino acid position and adjacent charged residues
Bjellqvist: calculated according to pKa and amino acid position
Support Vector Machine (SVM): calculated from amino acid sequence and Amino Acid Index database (AAindex) data
Branca: calculated with correction factors for position, influence of neighboring groups, and statistical corrections for presence and nature of side chain groups
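
To make the simplest of these approaches more concrete, here is a minimal Python sketch of one plausible reading of an iterative, sequence-only calculation: compute the net charge from Henderson-Hasselbalch terms and bisect on pH until the charge is close to zero. The pKa table and the function names net_charge and estimate_pi are illustrative assumptions, not taken from Audain et al. or from any of the benchmarked tools, and the sketch ignores position effects, neighboring residues and PTMs.

```python
# Illustrative pKa values only (assumed for this sketch; NOT the values
# used by any of the tools benchmarked in the paper).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5, "N_term": 8.6}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_term": 3.6}


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of an unmodified, denatured sequence at the given pH."""
    # Free N-terminus (basic) and free C-terminus (acidic).
    charge = 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA["C_term"] - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            # Fraction of the basic group that is protonated (+1).
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            # Fraction of the acidic group that is deprotonated (-1).
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def estimate_pi(sequence: str, low: float = 0.0, high: float = 14.0,
                tol: float = 1e-4) -> float:
    """Bisect on pH; net charge falls monotonically as pH rises."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # neutral or negative: pI lies at lower pH
    return (low + high) / 2.0


if __name__ == "__main__":
    # Hypothetical example sequence, chosen only for demonstration.
    print(round(estimate_pi("ACDEFGHIKLMNPQRSTVWY"), 2))
```

Bisection works here because the charge curve only ever decreases with pH, so there is a single crossing point between pH 0 and pH 14. The more elaborate tools in the list differ mainly in how they adjust the pKa values or replace this calculation altogether.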

Audain et al. note that, to avoid bias in reporting, they did not optimize the evaluation methods for any of the tools under investigation.

First, the team constructed an R package (a collection of functions and data written in the statistical programming language R) as a framework for reproducible analysis within which to examine the performance of the various algorithms. This allowed direct comparison of the tools by correlation and root-mean-square deviation (RMSD). The researchers then calculated pI values using each of the tools under investigation and compared the theoretical results against publicly available values. Audain et al. used two data sets for reference. The first, PIP-DB (the Protein Isoelectric Point Database), contains a comprehensive record of protein pI data. The second is made up of values obtained for the tryptic proteome generated from the cellular fraction of Drosophila Kc167 cells.
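
The two evaluation measures named above are straightforward to compute. The following Python sketch shows one way to do so; it is not the authors' R code, the function names rmsd and r_squared are hypothetical, and it assumes that the reported R² is the squared Pearson correlation between predicted and observed pI values. The numbers in the usage section are made up for demonstration and are not data from the study.

```python
import math


def rmsd(predicted, observed):
    """Root-mean-square deviation between paired pI values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(predicted, observed):
    """Squared Pearson correlation between paired pI values."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need at least two paired values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o)
              for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov ** 2 / (var_p * var_o)


if __name__ == "__main__":
    # Made-up values for demonstration only.
    predicted_pi = [4.8, 6.1, 7.4, 9.0]
    observed_pi = [5.0, 6.0, 7.9, 8.6]
    print(rmsd(predicted_pi, observed_pi))
    print(r_squared(predicted_pi, observed_pi))
```

The two measures answer different questions: correlation shows whether predictions rise and fall with the observed values, while RMSD shows how far off they are in pH units, so a tool can score well on one and poorly on the other.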

For the theoretical values generated for proteins, the team first grouped the results into those with variable pIs and those with only one unique pI. From this analysis, they found that most proteins do not possess a unique pI. Comparing observed and theoretical values, the researchers found mostly poor performance from all five tools, with R² values ranging from 0.15 to 0.61. The best performance, with the lowest RMSD of 1.28, came from the SVM calculations.

When considering the data from peptides, the researchers found much better performance, with high correlation between predicted and observed pI values (R² = 0.96). They found the lowest RMSD with SVM predictions (0.21). Looking at peptides modified by PTMs, the team saw that the best predictions came when the algorithm included the effect of the PTM alongside its overall theoretical calculation.

Although Audain et al. found poor benchmarking performance for the five methods investigated, they make some suggestions arising from the process:

Some algorithms are suitable for in silico prediction
Machine-learning algorithms function best, although the ability depends on training and quality of training data

The authors also make further suggestions based on the results for the ideal conditions under which the algorithms function best, and have also made software and data freely available for scrutiny.

Reference

1. Audain, E., et al. (2015) “Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences,” Bioinformatics, doi: 10.1093/bioinformatics/btv674.

Post Author: Amanda Maxwell. Mixed media artist; blogger and social media communicator; clinical scientist and writer. A digital space explorer, engaging readers by translating complex theories and subjects creatively into everyday language.