Prediction of cystine connectivity using SVM

One of the major contributors to protein structure is the formation of disulphide bonds between selected pairs of cysteines in the oxidized state. Prediction of such disulphide bridges from sequence is challenging because the number of possible cysteine pairings grows rapidly as the number of cysteines in a protein increases. Here, we describe an SVM (support vector machine) model for the prediction of cystine connectivity in a protein sequence with and without a priori knowledge of the bonding state. We make use of a new encoding scheme based on physico-chemical properties and statistical features (the probability of occurrence of each amino acid residue in different secondary structure states, along with PSI-BLAST profiles). We evaluate our method on the SPX dataset (an extended dataset of SP39 (Swiss-Prot 39) and SP41 (Swiss-Prot 41) with known disulphide information from PDB) and compare our results with the recursive neural network model described for the same dataset.


Background:
The completion of the human genome project revealed a significant gap between protein sequence space and known structure space. Determination of protein structures using conventional X-ray crystallography and NMR (nuclear magnetic resonance) techniques is not adequate to cover the sequence space in the context of drug discovery. Hence, protein structure prediction using computational methods is becoming critical. However, prediction of protein tertiary structure from sequence is non-trivial and is generally approached by dividing the problem into finite levels of secondary structures and super-secondary structures.
The native protein fold is dependent on the physico-chemical properties of the amino acid residues in the sequence. Disulphide bonds between cysteines are important features in the formation of several protein folds. It has been shown that cysteines are highly conserved within a protein family and that they exist in either oxidized or reduced states.[1-3] Cysteines in the oxidized state form covalent bonds with each other, which are referred to as disulphide bridges. A schematic representation of conotoxin (PDB (protein databank) ID 1AS5) showing disulphide bonds is given in Figure 1. Information about the location of disulphide bridges finds application in the understanding of protein folding [1], and such bridges have a role in the thermodynamic stability of proteins.[2] Hence, studies on disulphide bridges have become very important.
Fariselli et al. [2] proposed a disulphide prediction model combining a neural network based predictor and evolutionary data with an accuracy of 81%. In 2000, Fiser and Simon [3] proposed a method based on multiple sequence alignment and reported an accuracy of 82% using a jackknife test on a larger dataset of 81 proteins. Martelli et al. [4] proposed a Hidden Neural Network method (a combination of a Hidden Markov Model and a Neural Network) with an accuracy of 84% on a larger dataset of 969 non-homologous proteins.
Vullo and Frasconi [5] used recursive neural networks and evolutionary data to predict bonding patterns using known information on cystine bonding states. The method was tested using a small dataset derived from Swiss-Prot release 39 (SP39) and an accuracy of 48% was reported. Prior to this, Fariselli and Casadio [6] linked connectivity prediction to graph matching. They also showed better connectivity prediction by combining it with neural network models.
Recently, Ferre and Clote [7] emphasized the importance of secondary structure and solvent accessibility information in the development of a diresidue neural network model for predicting disulphide bridges. Cheng and colleagues discussed ways to find and count disulphide bridges in a given sequence (using a recursive neural network) and tested the model performance on SPX (an extended dataset of SP39 and SP41 with known disulphide information from PDB).[8] Here, we describe an SVM (support vector machine) model for predicting cysteine bonding state as an extension of the work by Cheng and colleagues.[8] In this method, we predict disulphide bond connectivity given two cysteines, with and without a priori knowledge of their bonding state, using the SPX dataset.

Methodology:
Support Vector Machines:
The SVM (Support Vector Machine) is a class of tools used in classification and regression, as described elsewhere by Vapnik.[9]
When used as a binary classifier, an SVM constructs a hyperplane which acts as the decision surface between the two classes. This is achieved by maximizing the margin of separation between the hyperplane and the points nearest to it. The idea is further extended to data that is not linearly separable by first mapping it to a feature space of possibly higher dimension. The SVM formulation is desirable due to its mathematical tractability and good generalization properties.
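Consider a training set of N labelled examples

$$\{(x_i, y_i)\}_{i=1}^{N}, \qquad x_i \in \mathbb{R}^{n}, \quad y_i \in \{-1, +1\} \tag{1}$$

The nonlinear feature map is $\varphi(\cdot)$, and the decision function $f$ is to be chosen from a family of functions with sufficient capacity. In particular, this family contains functions for the linearly and non-linearly separable hyperplane having the following forms:

$$f(x) = w^{T} x + b \tag{2}$$

$$f(x) = w^{T} \varphi(x) + b \tag{3}$$

Now for separation in feature space, we would like to obtain the hyperplane with the following properties:

$$w^{T} \varphi(x_i) + b \geq +1 \ \text{ if } y_i = +1, \qquad w^{T} \varphi(x_i) + b \leq -1 \ \text{ if } y_i = -1 \tag{4}$$

The conditions in Equation 4 can be described by a strict linear discriminant function, so that for each pair $(x_i, y_i)$ in the training set we require:

$$y_i \left( w^{T} \varphi(x_i) + b \right) \geq 1 \tag{5}$$

The distance from the hyperplane to the points lying closest to it is given geometrically as $1 / \lVert w \rVert$. The soft-margin minimization problem relaxes the strict discriminant in Equation 5 by introducing slack variables $\xi_i$ and is formulated as:

$$\min_{w, b, \xi} \ \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} \varphi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \tag{6}$$

The constant C is selected so as to compromise between the minimization of training error and the prevention of over-fitting. Applying Lagrangian theory, the following dual problem in terms of the Lagrange multipliers $\alpha_i$ is obtained:

$$\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \varphi(x_i)^{T} \varphi(x_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{7}$$

The explicit use of the nonlinear function $\varphi(\cdot)$ has been circumvented by the use of a kernel function, defined formally as the dot product of the nonlinear functions:

$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j) \tag{8}$$

Kernels can be chosen according to Mercer's theorem. In all our experiments we use a polynomial kernel with degree d = 2 given by

$$K(x_i, x_j) = \left( x_i^{T} x_j + 1 \right)^{d} \tag{9}$$

This was chosen based on preliminary experiments involving fewer protein chains. The SVM classifier is given by:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right) \tag{10}$$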
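As an illustration of the kernel classifier described above, the following minimal Python sketch evaluates a degree-2 polynomial kernel and the resulting sign-based decision rule. It is not the SVMHeavy implementation used in this work. The kernel offset of 1 is an assumption, and the support vectors, multipliers and bias are hypothetical hand-picked values, not a trained model.

```python
def poly_kernel(x, z, degree=2, offset=1.0):
    """Polynomial kernel (x . z + offset) ** degree.

    The offset of 1 is an assumption; the degree d = 2 follows the text.
    """
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + offset) ** degree


def svm_decision(x, support_vectors, labels, alphas, bias):
    """Return +1 or -1 from sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(
        alpha * y * poly_kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    score += bias
    return 1 if score >= 0 else -1


# Hypothetical values for illustration only (not obtained by training).
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
labels = [+1, -1]
alphas = [0.5, 0.5]
bias = 0.0

prediction_a = svm_decision([2.0, 0.0], support_vectors, labels, alphas, bias)
prediction_b = svm_decision([0.0, 2.0], support_vectors, labels, alphas, bias)
```

In practice the multipliers and bias would come from solving the dual problem on the training data; only the examples with non-zero multipliers (the support vectors) contribute to the sum.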

Disulphide bonding patterns in proteins:
Human alkaline phosphatase (PDB ID: 1EW2) has 5 cysteines with 2 disulphide bonds, formed between the 2nd-3rd and 4th-5th cysteines in sequence order. It should be noted that the 1st cysteine is not involved in any disulphide bond. This describes the nature and selectivity of disulphide bond formation in human alkaline phosphatase and gives information on the bonding states of the cysteines in the sequence. However, disulphide bonds are formed in various combinations in different proteins. Therefore, it is of interest to predict the disulphide bonds from sequence for proteins whose structure is unknown. Nonetheless, this task is non-trivial, and predictions of disulphide bonds are generally performed either with or without prior knowledge of the cysteine bonding states in the sequence of interest. For example, if the structure of human alkaline phosphatase were unknown, its disulphide bonding pattern could be predicted either way. Prediction with prior knowledge of the bonding state (6 candidate pairs among the 4 bonded cysteines) is simpler than prediction without any prior knowledge (10 candidate pairs among all 5 cysteines).
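The counts of 6 and 10 quoted for human alkaline phosphatase correspond to the number of unordered cysteine pairs that a pairwise classifier has to score. The short Python sketch below enumerates them; the function name and the 1-based cysteine numbering are chosen here for illustration only.

```python
from itertools import combinations


def candidate_pairs(cysteine_ranks):
    """List every unordered pair of cysteines that could form a bridge."""
    return list(combinations(cysteine_ranks, 2))


# Cysteines of human alkaline phosphatase numbered by order in the sequence.
all_cysteines = [1, 2, 3, 4, 5]
# With prior knowledge of bonding state, the free 1st cysteine is excluded.
bonded_cysteines = [2, 3, 4, 5]

pairs_without_prior = candidate_pairs(all_cysteines)
pairs_with_prior = candidate_pairs(bonded_cysteines)

# The observed bridges (2nd-3rd and 4th-5th) are among the candidates.
observed_bridges = [(2, 3), (4, 5)]
```

Here `len(pairs_with_prior)` is 6 and `len(pairs_without_prior)` is 10, which shows how knowing the bonding state shrinks the search space even for a small protein.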

Dataset:
The SPX dataset created by Cheng et al. [8] was used in this study. The dataset contains non-homologous sequences (at a sequence similarity cut-off of < 25%) from PDB that carry information on intra-chain disulphide bonds.

Feature parameters:
We used five parameters for each cysteine, based on physico-chemical properties and on the probability of occurrence in secondary structures (alpha helix, beta strand, coil) as given by the Chou-Fasman conformational parameters.

Use of homologous sequence information:
Recent CAFASP and CASP results showed that the use of homologous sequences can improve secondary structure prediction, solvent accessibility calculations and cystine connectivity identification. This approach attempts to capture the evolutionary information for sequences and is generated by developing matrices from sequence profiling. The PSSM (position specific scoring matrix) is generated by calculating position-specific scores for each position from sequence profiles, and the scores are a measure of residue variability or similarity in the profile.[13] The PSSM generated by PSI-BLAST (http://www.ncbi.nlm.nih.gov/) from a non-redundant (NR) dataset of protein sequences was used in this analysis, with an E-value (expect value) of 0.001 and 3 iterations. A window of length w, centred on each cysteine under consideration, was taken and used as a feature for the classifier. A PSSM has 20 × L elements, where L is the protein length. In this study, we used w = 5 after several trials. The PSSM values vary approximately between -10 and +10. However, the SVM requires input values between 0 and +1. Therefore, we normalized each PSSM value x to this range using a function described elsewhere.
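The exact normalization function is not reproduced in this text. A common choice for squashing PSSM scores into the range 0 to 1 is the logistic function, and the Python sketch below uses it purely as an assumption; the function names are illustrative.

```python
import math


def normalize_pssm_value(x):
    """Map a raw PSSM score to (0, 1) with a logistic function.

    Assumption: the logistic form 1 / (1 + exp(-x)) is used here for
    illustration; the text does not state the authors' exact function.
    """
    return 1.0 / (1.0 + math.exp(-x))


def normalize_pssm(pssm):
    """Apply the normalization to every element of an L x 20 PSSM."""
    return [[normalize_pssm_value(value) for value in row] for row in pssm]


# Toy two-residue PSSM with scores inside the typical -10 to +10 range.
toy_pssm = [[-10, 0, 10] + [0] * 17, [3, -3] + [0] * 18]
normalized = normalize_pssm(toy_pssm)
```

With this choice a score of 0 maps to 0.5, strongly negative scores approach 0 and strongly positive scores approach 1.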
Instead of taking just 20 values per residue as a feature vector, we considered a window of length w, and all the values within the window were used in the feature definition.[13] By selecting a window of w = 5 for the PSSM values, we were able to incorporate the gradual variation required for the classifier to make a better decision. We included 5 × 20 PSSM values in addition to the five physico-chemical features for every cysteine under consideration, so the total feature length for every cysteine was 105. Hence, the final feature length for each cysteine pair is (w × 20 + 5) × 2, which is 210 for w = 5.
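The encoding described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the window length is assumed to be odd, and zero-padding for windows that run past the sequence ends is an assumption, since the text does not say how terminal cysteines are handled.

```python
def window_features(pssm, center, w=5):
    """Flatten the w PSSM rows centred on position `center`.

    Assumption: positions outside the sequence are padded with zeros.
    """
    half = w // 2
    n_columns = len(pssm[0])
    features = []
    for position in range(center - half, center + half + 1):
        if 0 <= position < len(pssm):
            features.extend(pssm[position])
        else:
            features.extend([0.0] * n_columns)
    return features


def cysteine_features(pssm, center, physchem, w=5):
    """Window of PSSM values followed by the five per-cysteine parameters."""
    return window_features(pssm, center, w) + list(physchem)


def pair_features(pssm, i, j, physchem_i, physchem_j, w=5):
    """Concatenate the feature vectors of the two cysteines in a pair."""
    return (cysteine_features(pssm, i, physchem_i, w)
            + cysteine_features(pssm, j, physchem_j, w))


# Toy normalized PSSM (12 residues x 20 columns) and placeholder parameters.
toy_pssm = [[0.5] * 20 for _ in range(12)]
toy_physchem = [0.1, 0.2, 0.3, 0.4, 0.5]
single = cysteine_features(toy_pssm, 7, toy_physchem)
pair = pair_features(toy_pssm, 0, 7, toy_physchem, toy_physchem)
```

For w = 5 the single-cysteine vector has 105 entries and the pair vector has 210, matching the lengths stated in the text.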

SVM parameters and performance measures:
We use an SVM with C = 10 and a polynomial kernel with d = 2 in this analysis. We used the SVM implementation SVMHeavy, which is based on incremental training of support vector machines as described elsewhere.[14,16] A five-fold cross validation was performed for each experiment. Table 1 reports the SVM results (with and without a priori knowledge of the disulphide bonding states) and compares them with the results of a recursive neural network by Cheng and colleagues [8] on the SPX dataset. The results from the SVM model were found to be similar to those of the recursive neural network presented by Cheng and colleagues.[8]
We measured the performance using the overall accuracy for disulphide bridges and proteins. These results (Table 1) show the utility of SVM models for the prediction of disulphide connectivity in proteins. In our opinion, the combination of SVM parameters and the encoding method chosen in model development played an important role in achieving better performance even on small datasets.
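The evaluation protocol can be sketched in Python as below. The definitions are assumptions, since the text does not spell them out: bridge accuracy is taken as the fraction of true bridges that are predicted, and protein accuracy as the fraction of proteins whose whole bonding pattern is predicted exactly. The round-robin fold assignment is also only one possible way to build five folds.

```python
def k_fold_indices(n_items, k=5):
    """Split item indices into k (train, test) pairs, round-robin."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        splits.append((train, test))
    return splits


def bridge_and_protein_accuracy(true_patterns, predicted_patterns):
    """Return (bridge accuracy, protein accuracy) under the assumed definitions.

    Each pattern is a collection of (i, j) cysteine pairs for one protein.
    """
    correct_bridges = 0
    total_bridges = 0
    correct_proteins = 0
    for true, predicted in zip(true_patterns, predicted_patterns):
        true, predicted = set(true), set(predicted)
        correct_bridges += len(true & predicted)
        total_bridges += len(true)
        correct_proteins += int(true == predicted)
    return (correct_bridges / total_bridges,
            correct_proteins / len(true_patterns))


# Toy example: two proteins with two bridges each (illustrative values).
true_patterns = [{(2, 3), (4, 5)}, {(1, 4), (2, 3)}]
predicted_patterns = [{(2, 3), (4, 5)}, {(1, 4), (2, 5)}]
bridge_acc, protein_acc = bridge_and_protein_accuracy(
    true_patterns, predicted_patterns)
splits = k_fold_indices(10, k=5)
```

In the toy example three of the four true bridges are recovered and one of the two proteins is fully correct, so the bridge accuracy is 0.75 and the protein accuracy is 0.5.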

The Chou-Fasman conformational parameters [10] (3 in number), the Kyte-Doolittle hydrophobicity scale [11] and the Grantham polarity [12] (1 in number each) were chosen as features. The Chou-Fasman parameter for helix (α) of amino acid i is based on the ratio (number of residues of type i in helix)/(total number of residues of type i), where i ranges over the set of amino acid residues. Grantham polarity values were taken from the ProtScale website.[17] We chose the above parameters after preliminary experimentation with a small dataset (30 protein chains) using different hydrophobicity and polarity scales.
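As a hedged illustration of the helix parameter, the Python sketch below follows the usual Chou-Fasman definition, in which the per-residue helix frequency is divided by the overall helix frequency across all residues. Whether the authors used exactly this normalization is an assumption, and the counts are made-up toy numbers.

```python
def chou_fasman_propensity(state_counts, total_counts):
    """Propensity of each amino acid for one secondary-structure state.

    state_counts[aa]: occurrences of amino acid `aa` in the state (e.g. helix).
    total_counts[aa]: all occurrences of amino acid `aa` in the dataset.
    Assumption: the standard Chou-Fasman form (f_i / <f>) is used.
    """
    overall = sum(state_counts.values()) / sum(total_counts.values())
    return {
        aa: (state_counts.get(aa, 0) / total_counts[aa]) / overall
        for aa in total_counts
    }


# Toy counts for two amino acids (illustrative only, not real statistics).
toy_total = {"A": 100, "G": 100}
toy_helix = {"A": 60, "G": 20}
helix_propensity = chou_fasman_propensity(toy_helix, toy_total)
```

A value above 1 (here approximately 1.5 for "A") indicates a residue found in helices more often than average, and a value below 1 (approximately 0.5 for "G") indicates the opposite. The same function applies to strand and coil counts.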
by Biomedical Informatics Publishing Group open access