SVM based model generation for binding site prediction on helix turn helix motif type of transcription factors in eukaryotes

Support vector machine is a class of machine learning algorithms which uses a set of related supervised learning methods for classification and regression. Nowadays this method is vividly applied to many detection problems related with secondary structure, tumor cell and binding residue prediction. In this work, support vector machines (SVMs) have been trained on 90 sequences of transcription factors with HTH motif. Four sequence features were used as attribute for the prediction of interaction site in HTH motif. A web page was also developed so that user can easily enter the protein sequence and receive the output as interaction site predicted or not predicted. The generated model shows a very high amount of accuracy, sensitivity and specificity which proves to be a good model for the selected case.


Background:
Machine learning is a branch of artificial intelligence which deals with the construction and evaluation of algorithms that expedite pattern recognition, classification and prediction based on models derived from existing data [1].Sophisticated machine learning algorithms (MLA) have also been attempted [2] on biological aspects.These utilize techniques such as neural networks [3], logistic regression [4] and support vector machines (SVM) [5].SVM as a supervised machine learning technology is attractive because it has an extremely well developed statistical learning theory [6,7].SVM is currently the best performers for a number of classification tasks ranging from text to genomic data.It has been gradually applied to protein secondary structure prediction In the present work, the characteristic features of interaction sites of DNA-binding proteins with HTH motif from the residue level were analyzed.The four sequence features include the evolutionary sequence conservation, positively charged residues, H-bond donor and acceptor residues and hydrophobic residues.A SVM based model was trained with 90 sequences (training set) using the above sequence features for the prediction of DNA-binding residues in HTH motif type of proteins.The total dataset was selected for training and test was 136 sequences and three fold cross validation was also done on the training set to get an average value of accuracy, sensitivity and specificity.Out of 136 total protein sequences 90 sequences were chosen randomly as training set and others as test set so that the ¾ of the total will remain as training set and ¼ as test set.

Features selection for training and test data sets
After the determination of crystal structures of C1 and Cro repressor proteins from bacteriophage lambda [27], the DNA binding HTH structural motif [22] has become one of the most important and studied examples of the interaction between proteins and DNA.The typical length of the HTH motif is around 25 residues.DNA interaction site was predicted through ScanProsite [28] available at http://prosite.expasy.org/scanprosite/.It was found that this site mainly compose of the maximum portion of HTH motif Table 7 (see supplementary material).For non HTH motif sequences interaction site with different motifs were found.
On the basis of these observations four features were chosen in this work for the interaction site.These are evolutionary sequence conservation, positively charged residues, H-bond donor/acceptor and hydrophobic residues (Table 1).Conservation of residues was determined from multiple sequence analysis [10].From the MSA result it was evident that in five columns (out of 25) the conservation score was very high.Thus, conditional probability was applied to find the value of each residue situated in that position.The value was calculated by,

P (AB) = P (A) P (B/A) (1)
Where A represents the column number to be selected and B represents the residue to be in that column.
As per many studies of DNA-protein interaction [2, 29, 30] the positively charges residues shows more prone of binding capacity with negatively charged residues of DNA strand.Side chain groups of positively charged residues as arginine, histidine and lysine within the interaction site were assumed to be capable of getting involved in H-bonding during the interaction [10, 13].Asparagine, arginine, glutamine, cysteine, histidine, serine, threonine, tryptophan and tyrosine were calculated as the H-bond donor, while aspartic acid and glutamic acid were calculated as the H-bond acceptor.The values were calculated and noted down for each feature.So, altogether four features were used as attributes for modeling the SVM classifier from the residue level of amino acids.

SVM implementation
The

Web page designing
The concerned web page based on above selected SVM model has a simple user interface.It contains an input field where user can paste the FASTA format protein sequence.There is also a provision of uploading text file with protein sequence.The web page also gives a brief description of our model.The overall web page was designed in HTML codes with different color combination and pictures incorporated in it (Figure 1).When the user enters the data in the form, these values are passed to the matlab engine.Once the matlab engine is ready, matlab code is executed from within the java environment.

Results & Discussion:
Goal of the work was to characterize the interaction site in HTH motif as well as non-HTH motif of TF proteins and train the model based on SVM approach.The accuracy of the model was also checked and verified with existing models.
In this work a set of features were selected, studied and applied to distinguish the interaction site with non-interaction site.The interaction site of HTH motif and non HTH motif was retrieved from ScanProsite result.The outcome presents the type of motif, its length, interaction site with DNA and its length (Figure 2).DNA interaction site identified by ScanProsite mainly constitute of a large portion of HTH motif.The total length of the HTH motif was approximately 20 to 25 residues long.Due to the much specified selection criteria of the samples, the total data set used for the study were 136 which is a pretty small in size.But in spite of fewer samples the k-fold cross validation resampling technique was employed for training and testing of the classifier for the selected features.Considering the k=3, the whole dataset was randomly divided into three parts.Each an every parts represent the same proportion (approximately) as in the original data set.The procedure was repeated for total 3 times where two parts of the data was selected as training set and one part as test set.For all 3 fold confusion matrix was generated by the LIBSVM Table 1, Table 2

Conclusion:
In this work, we have described a new SVM-based approach for prediction of binding sites within HTH motif based on amino acid sequence data.The present work is the 1 st SVM-based method for HTH motif interaction site prediction till date.The average accuracy of this work is 94.19%, which is a pretty good value for prediction purpose.K-fold cross re-sampling technique was used for validation.The sensitivity (96.7%) and specificity (89.16%) shows that this SVM -based approach can be used as a good prediction tool for HTH motif type of interaction studies.The output result of our model shows two outcomes in the web page.If HTH motif is present in the query protein, then interaction site is predicted otherwise, there is no prediction.In the future, prediction tool for other types of motifs can be developed as well.

[ 8 ]
, tumor cell prediction and DNA binding residues prediction [9] and many more pattern classification problems in biology related areas.Recently [10] SVM has been used as a new and promising technique for detecting DNA binding site prediction for TF proteins.SVM based [5, 11, 12] method for binding site prediction depend on the sequence [13] and structure characteristics or both [14] of known protein binding sites.These characteristics such as sequence conservation [15], interface propensities [16, 17], the charge or dipole moment of the protein molecule, the presence of positively charged surface patches, secondary structure [18], accessible surface area [19], 3D motifs [20] and residue evolutionary information [21] are applied to train a set of data's for model generation.The helix turn helix (HTH) motif is a major structural DNA binding motif.It is composed of two α-helices joined by short strand of amino acid and is found in many proteins that regulate gene expression [22].The two α helices of tri-helical HTH motif, one occupying the N-terminal end and the other occupying the C-terminal end of the helix are involved in recognition and binding to the DNA [23].The third helix is known as the recognition helix which has prime role in establishing the contacts with DNA major groove.The positively charged and hydrophobic amino acids residing in the third helix increase the bonding capacity in DNA-protein complexes [24, 25].A very large class of transcription factors can be classified as HTH motif type of family according to Interpro database (http://www.ebi.ac.uk/interpro/).

Figure 1 :
Figure 1: Snap shot of the web page designed for the DNA binding site prediction Methodology: Dataset selection Sequence data set was retrieved from NCBI (http://www.ncbi.nlm.nih.gov/),91 proteins (TFs) with HTH motif and 45 protein sequences without HTH motif of eukaryotes were collected.All these selected proteins are having DNA binding domain with HTH motif which was verified by Pfam database (http://pfam.sanger.ac.uk/) [26].Out of 136 total protein sequences 90 sequences were chosen randomly as training set and others as test set so that the ¾ of the total will remain as training set and ¼ as test set.

Figure 2 :
Figure 2: ScanProsite result of yeast protein showing the HTH motif as well as the interaction site Accuracy = TP+TN/(TP=TN=FP=FN) Specificity=TN/(TN+FP) Sensitivity=TP/(TP+FN)The numbers of true positive, true negative, false positive and false negative results are indicated by TP, TN, FP and FN respectively, and were calculated from the following confusion matrixes.The TP, TN, FP and FN are defined as follows: TP = Correct positive predictions / Total positives; FP = Incorrect negative predictions / Total negatives; TN = Correct negative predictions / Total negative; FN = Incorrect positive predictions / Total positive.
freely downloadable LIBSVM package was used for the implementation of SVM [31] with MATLAB interface.The widely used Radial Basis Function (RBF) kernel was used.In the present study our training and testing dataset was verified by simulating it in Matlab 7.0, Microsoft Windows 2007 Operating System and Intel Pentium D-2.80 GHz with 2 GM of Random Access Memory (RAM).

Table 4 (see supplementary material), sensitivity Table 5 (see supplementary material) and specificity Table 6 (see supplementary material) indicate
[32] our model is showing the best result with average accuracy of 94.19%.Here we showed that this SVM model can predict interaction sites (DNA-binding residues) at 96.7% sensitivity and 89.16% specificity.The result of our study suggest that this SVM -based approach can be used as a good prediction tool for HTH motif type of interaction studies.In view of the previous knowledge[32], four web servers are available at the moment for sequence-based prediction of DNA-binding sites: DBS-PRED [13], DBS-PSSM [33], BindN [34] and DP-Bind [9].Our performance measures are higher than those reported by these four studies.

Table 1 :
Confusion table for k=1 fold in SVM based model

Table 2 :
Confusion table for k=2 fold in SVM based model

Table 3 :
Confusion table for k=3 fold in SVM based model

Table 4 :
Comparative study of the accuracy of the present method (SVM) with other methods on the same dataset.

Table 5 :
Comparative study of the sensitivity of the present method (SVM) with other methods on the same dataset.

Table 6 :
Comparative study of the specificity of the present method (SVM) with other methods on the same dataset.

Table 7 :
Selected feature values of the training set.Both HTH and non-HTH motif result is given