SubmitoLoc: Identification of mitochondrial sub cellular locations of proteins using support vector machine

Mitochondria are important sub-cellular organelles in eukaryotes. Defects in the mitochondrial system lead to a variety of diseases. Therefore, detailed knowledge of the mitochondrial proteome is vital to understand the mitochondrial system and its function. Sequence databases contain a large number of mitochondrial proteins, but most of them are not annotated. In this study, we developed a support vector machine approach, SubmitoLoc, to predict mitochondrial sub cellular locations of proteins based on various sequence derived properties. We evaluated the predictor using 10-fold cross validation. Our method achieved 88.56% accuracy using all features. The average sensitivity and specificity for four-subclass prediction are 85.37% and 87.25%, respectively. The high prediction accuracy suggests that SubmitoLoc will be useful for researchers studying mitochondrial biology and drug discovery.

©Biomedical Informatics (2019)
space and the matrix. Most mitochondrial proteins are synthesized in the cytoplasm and then imported into mitochondria by protein machineries located in the mitochondrial membranes [2]. Mitochondria are involved in several biological processes such as programmed cell death, calcium signaling, and ionic homeostasis [3]. It has been shown that mutations in genes that encode mitochondrial proteins lead to various rare human diseases such as Leber's hereditary optic neuropathy, Leigh syndrome, mitochondrial myopathy, hearing loss, and diabetes mellitus [4]. Therefore, detailed knowledge of the mitochondrial proteome and its functions in the various sub mitochondrial locations is very important for designing therapies for mitochondrial disorders.
Various sequence databases provide experimentally verified mitochondrial subcellular locations of proteins, but this list is very small. Further, designing experiments to obtain the subcellular locations of all mitochondrial proteins is expensive and time-consuming. Hence, it is necessary to develop bioinformatics methods based on machine learning algorithms for identifying mitochondrial proteins and their subclasses. In the past, various machine-learning algorithms have been developed for the prediction of mitochondrial proteins, although most were not proposed solely for mitochondrial proteins. Although several methods are available for the prediction of protein sub mitochondrial locations, most of them are limited to three sub mitochondrial locations (3 compartments). Moreover, they were developed using small datasets. Therefore, in this report we describe SubmitoLoc, a Support Vector Machine (SVM) based method for identifying mitochondrial sub cellular locations of proteins from sequence derived properties. The steps involved in the SubmitoLoc prediction system are summarized in Figure 1.

Methodology:
Dataset:
A set of 39371 protein sequences was extracted from the SWISS-PROT database based on mitochondrial subcellular localization annotations in the comments block [21]. We applied the following filters to obtain high-quality data for training and testing our method: (1) only eukaryotic, non-plant protein sequences were included; (2) sequences with any ambiguous annotation such as 'possible,' 'probable,' 'by similarity' and 'potential' were omitted.

Features:
In this work, each sequence was encoded by 239 features. These features can be categorized into four groups: 60 are Composition, Centroid and Distribution features; 60 are obtained from split amino acid composition; 88 are extracted from protein functional groups and secondary structure information; and 31 are derived from physicochemical properties (AA index).

Composition, Centroid and Distribution:
The Composition, Centroid and Distribution features (60 features) were computed as described in Carr et al. 2010 [24].

Split amino acid composition:
The protein sequence is split into three equal parts. For each part, the composition of the 20 amino acids was calculated. In total, 60 features were derived from split amino acid composition.
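The split amino acid composition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the handling of sequence lengths that are not divisible by three (the last part absorbs the remainder) is an assumption, since the text does not specify it.

```python
# Minimal sketch of split amino acid composition (names are illustrative).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def composition(segment):
    """Return the fraction of each standard amino acid in a segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / n for aa in AMINO_ACIDS]


def split_aa_composition(seq, parts=3):
    """Concatenate the composition of `parts` consecutive segments.

    Assumption: boundaries use integer division, so the last segment
    absorbs any remainder when len(seq) is not divisible by `parts`.
    """
    seq = seq.upper()
    n = len(seq)
    bounds = [i * n // parts for i in range(parts + 1)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        features.extend(composition(seq[start:end]))
    return features


features = split_aa_composition("AAACCCDDD")
print(len(features))  # 3 parts x 20 residues
```

With three parts and 20 residues per part, the vector has 60 entries, which matches the feature count stated in the text.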

Frequency of functional groups:
Based on the presence of functional groups, the 20 amino acids were categorized into 10 functional groups. Similarly, we categorized the 20 amino acids into 7 physico-chemical groups. For each protein sequence, the frequency of each amino acid group was computed, which gave 17 features [25].
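A group-frequency feature of this kind can be sketched as below. The grouping in `EXAMPLE_GROUPS` is only a placeholder built from common textbook classes; it is not the set of 10 functional or 7 physico-chemical groups used in the paper, which are defined in [25]. The function name is also hypothetical.

```python
# Minimal sketch: frequency of amino acid groups in a sequence.
# EXAMPLE_GROUPS is a placeholder, NOT the grouping used in the paper.
EXAMPLE_GROUPS = {
    "acidic": "DE",
    "basic": "KRH",
    "aromatic": "FWY",
}


def group_frequencies(seq, groups):
    """Return, for each group, the fraction of residues that belong to it."""
    seq = seq.upper()
    n = len(seq)
    freqs = {}
    for name, members in groups.items():
        count = sum(seq.count(aa) for aa in members)
        freqs[name] = count / n if n else 0.0
    return freqs


print(group_frequencies("DEKKFA", EXAMPLE_GROUPS))
```

Supplying the paper's 10 functional and 7 physico-chemical groups as the `groups` argument would yield the 17 values mentioned above.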

Frequency of short peptides:
From each sequence, we extracted short peptides of 10 residues in length. Each short peptide was classified as a hydrophobic, hydrophilic, neutral, polar or non-polar short peptide, and the frequency of each short peptide class was calculated as described in Pugalenthi et al. 2010 [25].

Content of secondary structural element (SSE):
The overall content of helix, beta sheet and coil was computed for each sequence. Further, the frequencies of the 10 amino acid groups and 7 physico-chemical groups in helix, sheet, and coil regions were calculated as described in Pugalenthi [27].
In this work, we used the LIBSVM 2.86 package [28], which is available for download from http://www.csie.ntu.edu.tw/cjlin/libsvm/. The Radial Basis Function (RBF) was selected as the kernel function for the training process. The optimal values of the parameters C (penalty constant) and γ (width parameter) were determined using a grid search approach.
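The RBF kernel and the grid search over C and γ can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' procedure: `score_fn` is a hypothetical callable that would, in practice, train an SVM with the given (C, γ) and return its cross-validated accuracy, and the exponent ranges are illustrative because the paper does not report the grid it used.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def grid_search(score_fn, log2_c_values, log2_gamma_values):
    """Try every (C, gamma) pair on a log2 grid and keep the best score.

    score_fn(C, gamma) is a hypothetical callable, e.g. cross-validated
    accuracy of an SVM trained with those parameters.
    """
    best = None
    for log2_c, log2_gamma in product(log2_c_values, log2_gamma_values):
        c, gamma = 2.0 ** log2_c, 2.0 ** log2_gamma
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


# Toy scoring function with a known optimum at C = 2^3, gamma = 2^-5,
# standing in for a real cross-validation run.
def toy_score(c, gamma):
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


# Illustrative exponent ranges (an assumption, not from the paper).
best_score, best_c, best_gamma = grid_search(
    toy_score, range(-5, 16, 2), range(-15, 4, 2)
)
print(best_c, best_gamma)
```

A coarse grid on exponents followed by a finer grid around the best pair is a common refinement, although the text does not say whether it was used here.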

Feature selection:
We used the information gain approach to select a subset of features that play a prominent role in the classification [29].
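Information gain ranks a feature by how much knowing its value reduces the entropy of the class labels. The sketch below shows the calculation for features that are already discretized; how the continuous features in this study were binned is not stated in the text, so that step is left out, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels within each feature value.

    Assumption: feature_values are discrete (continuous features would
    need to be binned first).
    """
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(
        len(subset) / total * entropy(subset) for subset in by_value.values()
    )
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature indices sorted from highest to lowest information gain."""
    gains = [(information_gain(col, labels), i) for i, col in enumerate(columns)]
    return [i for _, i in sorted(gains, key=lambda t: (-t[0], t[1]))]


labels = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # carries no class information
print(rank_features([uninformative, informative], labels))
```

Keeping the top-ranked indices from `rank_features` gives a reduced feature subset of any chosen size.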

Evaluation parameters:
We quantify prediction performance using four parameters: sensitivity, specificity, overall accuracy and the Matthews correlation coefficient (MCC).
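These four measures have standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sketch below uses those textbook definitions; applying them per subclass in a one-versus-rest fashion for the four-location problem is an assumption, and the function name is hypothetical.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

Unlike accuracy, the MCC uses all four counts in both numerator and denominator, which is why it remains informative when the classes are unbalanced.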

Matthews Correlation Coefficient (MCC):
It is a statistical parameter that assesses the quality of prediction and accounts for imbalance in the data. Its value lies in the range -1 ≤ MCC ≤ 1.
(... Figure 2). This shows that our method selected the more informative features and eliminated the less contributing features without any drop in accuracy. When the features were further reduced to 100, we obtained 84.47% accuracy. The accuracy decreased by only 2% when compared to the accuracy of all 239 features. Our method produced 72% accuracy with just 10 features. The results suggest that the Info-gain feature selection approach selected useful features that have a significant effect on the prediction of mitochondrial and non-mitochondrial protein sequences.

Conclusions:
In this report, we described SubmitoLoc, a Support Vector Machine (SVM) based method for identifying mitochondrial sub cellular locations of proteins from sequence derived properties. The model distinguishes proteins among four mitochondrial subcellular locations (mitochondrial inner membrane, mitochondrial outer membrane, mitochondrial inter membrane space and mitochondrial matrix) with 88.6% accuracy under cross validation. The model can be used to assign mitochondrial sub cellular locations to uncharacterized proteins and thereby support research and development. We plan to implement a prediction tool for this purpose in the future.