A classification scoring schema to validate protein interactors

Hypothetical protein [HP] annotation poses a great challenge especially when the protein is putatively linked or mapped to another protein. With protein interaction networks (PIN) prevailing, many visualizers still remain unsupported to the HP annotation. Through this work, we propose a six-point classification system to validate protein interactions based on diverse features. The HP data-set was used as a training data-set to find putative functional interaction partners to the remaining proteins that are waiting to be interacting. A Total Reliability Score (TRS) was calculated based on the six-point classification which was evaluated using machine learning algorithm on a single node. We found that multilayer perceptron of neural network yielded 81.08% of accuracy in modelling TRS whereas feature selection algorithms confirmed that all classification features are implementable. Furthermore statistical results using variance and co-variance analyses confirmed the usefulness of these classification metrics. It has been evaluated that of all the classification features, subcellular location (sorting signals) makes higher impact in predicting the function of HPs.


Background:
The protein-protein interaction (PPI) data provide a powerful representation for discerning organization of cells besides predicting biological functions and providing insight into a variety of biochemical processes [1]. In the recent-past, there has been a twofold increase of the PPI data using protein interaction networks (PIN) while several advanced methods [2] connecting orthology mapping and comparative approaches have come up to analyze and visualize proteins. These approaches aid bioinformatical algorithms to discover families of proteins that have shared functional modules. However, bioinformatical methods stated as above are usually applicable when the protein has known functional relationship and not for those proteins like 'predicted' or 'similar to' or 'hypothetical.' On the other hand, significant experimental efforts have allowed us to analyze the interactions of known proteins in various organisms. The interactomes established so far represent proteins corresponding to various organisms and sometimes organelles. With high-throughput data still limited for the human proteome, genome-wide approaches have been used to elucidate the human interactome. However, assuming that functional protein interactions are conserved in evolution, one can consider extending the experimentally determined human protein interaction network by using data from protein interaction data-sets of the model organisms. Transferring the information from known interaction networks to the unknown have been accomplished [14] further requiring identification of genes that have a common ancestor and therefore share the same function between the orthologs. In addition, web-based databases such as String (http://www.string.embl.de) contain several thousands of predicted interactions which are assembled by mapping interactions from model organism to various orthologs using sequence similarity searches, bidirectional and reciprocal best hit approaches. All the inferred interactions work on either one or two methods which contains lots of false positive data. Hence there remains a challenge to develop efficient tools to predict and accurately define function to the hypothetical proteins.
The orthology mapping poses a great challenge as many known HPs remain un-annotated even after being 'mapped to' or 'associated' with certain known proteins. For example, a HP queried using a visualizer Osprey [3] would still be shown as a grey node (meaning unknown) and therefore remain unsupported. In addition, there is a challenge in analyzing huge data-sets of HPs as flow of operations are to be executed simultaneously for better protein annotation. Thanks to the Cloud architecture [4], the work is made simple with the data mapped into reduced phase. The reduced phases combine all the outcomes from the multiple nodes into a single outcome. Further, such learning algorithms can be implemented over multiple nodes on map-reduce framework. The data thus analysed can then be used to validate each parameter and further statistical analysis might be employed to validate the results. To overcome this, we have employed a six-point classification schema based on some set of features. Our classification was employed on a subset of 20 HPs that have been randomly considered from a group of 1455 HPs [5]. In employing the classification system, we ideate bona fide protein interactions can be determined making sensitive Protein Interaction Networks (PIN).

Six-point Classification
Each classification is a measure of different methods, viz. Pfam score, orthology inference, functional linkages, back to back orthology for protein interactants, subcellular location and protein associations taken from known databases and visualizers. Each protein is given a value of 1 if the protein matches the classifier; else 0 is given against them. The annotation scores, based on the features are available in (Table  1, see supplementary material).

Classification 1: Pfam identities
Score: Best Pfam scores are given as per the assignment returned by Pfam [6]. The Pfam-B is given value 0 and Pfam-A is given value 1.

Principle:
The underlying principle is that the presence of domains in varying combinations in different proteins tends to provide insights into the function of the protein. The Pfam, represented by multiple sequence alignments and Hidden Markov Models (HMMs) classifies the query into Pfam-A and Pfam-B. While the Pfam-A are curated and built from the seed alignment, the Pfam-B are lower quality sequences generated automatically from electronic annotation using the nonredundant clusters.

Classification 2: Orthology mapping
Score: E value <1 were given a score of 1, else 0 Principle: The ortholog proteins often retain similar functions, so a pair of orthologs that interacts in subject organism is likely to interact in target organism too (putative interactions in other organisms are called interologs). The protein sequences were blasted against the Arabidopsis thaliana and Non Redundant (NR) databases. Besides these, other organisms were also targeted towards hot spots for functional linkages. We transferred the information of the ortholog data to Arabidopsis thaliana, and mapped them to functional linkages.

Classification 3: Functional linkages using protein interactions and associations Score:
If there is an association or linkage found through Rosetta stone method, a score of 1 is given, else 0. Principle: A protein Navigator tool [7] to check orthology pairs was used to predict the functional linkages linked to them: 1. Rosetta stone method [8] works on a rationale that two polypeptides X and Y in one organism are likely to interact if their homologs are expressed as a single polypeptide XY (which is called as a Rosetta Stone) in another organism. It is also likely that some proteins might be functionally represented as pathways or chemicals or simply in GO database. 2. Gene fusion method [9] works on the theory that pairs of monomeric proteins fused in other organisms tend to be functionally related or physically interacted. 3. Gene neighbours method works on the assumption that the operons of one organism may be conserved across other organisms.

Classification 4: Back to back orthology Score:
The associators or interactors found in classifier 3 are searched in query organism, if found distinctively and are correlated a score of 1 is given; 0, if absent. Principle: The interaction is linked only if the interactant ortholog is present in the query organism too. This is similar to the bi directional best blast hits.

Classification 5: Presence of sorting signals and localization to the same organelle
Score: If localized to the same organelle, 1 else 0. Principle: An approximate 50% of interactions are between the same organelle and rarely do we find transmembrane interactions. If the proteins are predicted to be localized to different organelle, the perchance of protein interacting to the query would be less. TargetP [10] was used to employ the subcellular location classifier.

Total Reliability Score (TRS)
The total reliability score (TRS) is summation of all the scores employed for all the six classifiers. If the value exceeded 3, then we believe that the candidate is likely to have a probable interacting partner. The first five classifiers, viz. Pfam score, orthology inference, functional linkages, back to back orthology for protein interactants, subcellular location are based on the manual annotation and prediction that we have employed while the protein associations/interactions are based on the existence of known protein interactors. A scoring schema similar to phylogenetic profiling of 0 and 1 for absence and presence of genes respectively in an organism was employed to make the PIN sensitive. These scoring patterns are applicable either when annotation is transferred to another organism or when the protein is waiting to be annotated. However, the fact that the use of six classifiers makes this method more stringent, the scores are averaged into two sections, viz. the 1+2+3 classifiers and 4+5+6 classifiers and an over all, TRS summing them. In employing these scoring patterns, we ideate that the sensitivity of the proteins based on any three or more methods can give better results.

Evaluation of classification
To circumvent the problem of protein annotation on current dataset, we further evaluated the classification scores with single node through learning algorithms J48 (a version of C4.5 decision tree), SMO (a version of Support Vector Machine), Naive Bayesian, and Multilayer Perceptron (a version of Neural Network) and 22 learning algorithms [17-18]. We implemented WEKA machine learning package [16] (Version 3.6.4) on Cloud with a single node. The data-set containing the proteins over six-point classification scores was further modeled through learning algorithms with ten-fold cross validation scheme ( Table 1, see supplementary material).

Statistical analysis using Anova and Kruskalwallis
We statistically interpreted the six-point classification metrics further using MATLAB [23] (Version 7.11 on a Windows 7 desktop). We finally tested the matrix (protein and six-point classifiers) using one-way analysis of variance (Anova) and Kruskalwallis methods.

Discussion:
The classification scoring schema from the TRS was employed on the test dataset (Table 1, see supplementary material). The permutations and combinations of the dataset gave a valid protein interaction networks. For example, when the classifier 3, viz. functional linkages and classifier 6 were analyzed, only 4 of the 20 turned out to be Rosetta Stone sequences, making a putative and novel protein interaction map (Figure 1). We observe that the protein NP_057673.2, a mitochondrial protein is known to interact with AAM10026.1 suggesting that they make a bona fide interaction pairing. Further, we evaluated the classification scoring system using machine learning algorithms and statistical interpretation suggesting the fact that the subcellular location (sorting signals) make a very good impact amongst all classification features. These results demonstrate that the six-point classification scheme is capable of yielding an ultimate TRS which is capable of interpreting PIN. From Table  1 Table 3 (see supplementary material) illustrates that all subset combination method "0 1 2 3 4 5" by MLP (81.081) and Hill selected data sub set "4 0" by MLP (78.378) are the best accuracies by these methods. This further adds to the confidence that all the six classification schemes helps better modeling of TRS compared to the sub sets. From the statistical interpretation, we used a matrix; whose rows are proteins and columns are different classifiers (Table 4, see supplementary material) and observed that the function returns the p-value after transforming 37-2 degree of freedom. Covariance test further revealed that the data is unbiased in the covariance matrix, forming a normal distribution (Table 5, see supplementary material). The statistical results indicate that the six-point classifiers are uniformly useful in modeling TRS as evident from the scores (Anova = 8.0645e-031; kruskalwallis= 6.3123e-011; (Figures 2  and 3). This adds to our confidence that having such classification schema is valid and could be extended to protein annotation using cloud architecture, if the dataset is larger. However, in order to overcome the impact of ranking all classifiers, we propose either unimposing a specific classifier or employing all classifiers. Disregarding any of the classifiers will yield less score for which we propose yet another classifier based on the total number of possible pairwise interactions and degree of paralogy for the fact that paralogy increases the number of possible interactions thereby decreasing the certainty of the prediction. The bottom line in this proposed classification is that higher the TRS, greater is the chance of the protein to be interacting and more reliable the functional linkage.

Conclusions:
A six-point classification system is proposed to solve the problem of hypothetical protein annotation with respect to interaction networks. In this work, by employing our classification schema, we have shown an example taking a protein NP_057673.2 that is known to be localized to mitochondria. The functional linkages through Arabidopsis thaliana indicate that the protein is a Rosetta Stone sequence with AAM10026.1 and the fact that the protein has already been shown to be interacting with HGS (ZFYVE8) makes promiscuous interaction ( Table 2, see supplementary  material). We believe with a variety of statistical interpretation that we put forth, our in silico selection strategy can be used to select the most promising candidates from a PIN. Further the cloud computing resources employed were quite useful in validating accuracies on TRS modeling through multiple algorithms and data sub-sets.