PKSIIIexplorer: TSVM approach for predicting Type III polyketide synthase proteins

PKSIIIexplorer, a web server based on the ‘transductive Support Vector Machine’, allows fast and reliable prediction of Type III polyketide synthase proteins. It provides a simple platform for estimating, with moderately high accuracy, the probability that a given sequence is a type III polyketide synthase. We hope that our method will serve as a useful program for type III polyketide researchers. The tool is available at “http://type3pks.in/tsvm/pks3”.

Abbreviations:
PKS - Polyketide synthase, CHS - Chalcone synthase, SVM - Support vector machine, MCC - Matthews Correlation Coefficient.


Background:
Type III polyketide synthases (type III PKSs) are a large superfamily of proteins that produce a wide variety of secondary metabolites with antibiotic, antifungal, antitumor and immunosuppressive activities [1]. For example, resveratrol, a product of stilbene synthase found in grapes, shows cancer chemopreventive activity in murine models [2]. To discover more of these proteins, Support Vector Machines (SVMs) have been used successfully for classification. We previously developed the SVM-based "PKSIIIpred", in which only labeled data were used for training [3]. In the improved version presented here, we used a variant of SVM, the so-called 'Transductive SVM' (TSVM), which takes into account not only the labeled training data but also unlabeled data.

Methodology:

Dataset:
Positive (type III PKSs) and negative (non-type III PKSs) datasets of 1000 sequences each were prepared. Sequences were retrieved in FASTA format from Swiss-Prot. An unlabeled dataset (2000 sequences) was generated by using profile hidden Markov models (HMMs), built from the positive dataset, to extract proteins from Swiss-Prot. For the sequences in the unlabeled dataset, it is not known whether they are type III PKSs or not. BLASTCLUST was used to verify the non-redundancy of the datasets [4].

TSVM-implementation:
SVMs are a group of fast, optimization-based machine learning algorithms that have been used for many kinds of pattern recognition [5]. The performance of the SVM-based method was optimized by tuning the SVM parameters and the kernel function (linear, polynomial, radial or sigmoid). In classical SVMs, the training data used to build the model ideally cover the whole problem space; the model is then used to predict the labels of new data points. In most biological datasets, however, the number of labeled data points is rather small, while a large number of unlabeled data points are available. To take advantage of these unlabeled data, the so-called 'TSVMs' have been developed [6]. Here, TSVM was implemented using the SVMlight package, which possesses two modules: SVM_learn (preparing models) and SVM_classify (classifying samples). For each cluster of composite specificity, we prepared a feature file in which the sequences belonging to this specificity were labeled +1, all other sequences with a different but known specificity were labeled -1, and the uncharacterized sequences were labeled 0. The TSVM was trained on this file to obtain a model for the composite specificity. During several rounds of evaluation, many parameter sets produced poorly performing models with low MCC values. We therefore selected a set of consistently performing parameters to identify the optimally performing models. After training the SVM models, it is necessary to combine the predictions of all models into one single prediction. Here, the SVM that outputs the largest score is used to assign the specificity to the unknown sequence.
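The labeling scheme and the largest-score rule described above can be illustrated with a minimal Python sketch. It is not the code used by the server: the function names `to_svmlight_line` and `assign_specificity` are hypothetical, and the sparse line layout (`<target> <index>:<value> ...`, with a target of 0 marking an example to be handled transductively) follows the input format documented for SVMlight.

```python
from typing import Dict, Sequence


def to_svmlight_line(label: int, features: Sequence[float]) -> str:
    """Format one sequence as an SVMlight example line.

    label: +1 (sequence of this specificity), -1 (other known specificity),
           0 (uncharacterized sequence, treated as unlabeled).
    features: dense feature vector, e.g. dipeptide frequencies.
    """
    if label not in (1, -1, 0):
        raise ValueError("label must be +1, -1 or 0")
    # SVMlight feature indices start at 1; zero-valued features are omitted.
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0]
    target = "+1" if label == 1 else str(label)
    return " ".join([target] + pairs)


def assign_specificity(scores: Dict[str, float]) -> str:
    """Return the name of the model with the largest decision score.

    scores: hypothetical mapping from model name to the score that the
    corresponding trained model gave for one query sequence.
    """
    if not scores:
        raise ValueError("at least one model score is required")
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(to_svmlight_line(1, [0.5, 0.0, 0.25]))
    print(to_svmlight_line(0, [0.0, 2.0]))
    print(assign_specificity({"model_a": -0.3, "model_b": 1.2}))
```

Writing one such line per sequence yields a feature file of the kind described above; the model names in the usage example are placeholders.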

Numerical properties:
The models were trained using dipeptide and multiplet frequencies [7] derived from the amino acid sequences. For each protein, a matrix of 400 dipeptide frequencies (20 × 20 amino acid pairs) was generated and fed as input to the SVM. The repetitiveness of the amino acid sequences was analyzed by means of multiplets, which comprise homopolymeric stretches of any length (XX, XXX, ... (X)n), where X denotes any specific amino acid and n≥2.
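As an illustration of these two feature types, the following Python sketch computes a 400-element dipeptide frequency vector and a per-residue multiplet frequency. The server's own programs were written in Perl and are not reproduced here. The function names are hypothetical, and the normalization of the multiplet frequency (fraction of the sequence lying in homopolymeric runs of length ≥ 2, per amino acid) is an assumption, since the exact definition is given only in reference [7].

```python
from itertools import groupby, product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered amino acid pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(seq: str) -> List[float]:
    """Return the 400 dipeptide frequencies of a protein sequence.

    Overlapping pairs are counted; pairs containing non-standard
    residues (e.g. X) are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def multiplet_frequencies(seq: str) -> List[float]:
    """Return, for each of the 20 amino acids, the fraction of the sequence
    that lies in homopolymeric runs (XX, XXX, ...) of that residue.

    Assumption: normalization by sequence length.
    """
    seq = seq.upper()
    in_runs = dict.fromkeys(AMINO_ACIDS, 0)
    for residue, run in groupby(seq):
        n = len(list(run))
        if n >= 2 and residue in in_runs:
            in_runs[residue] += n
    length = len(seq)
    return [in_runs[a] / length if length else 0.0 for a in AMINO_ACIDS]


if __name__ == "__main__":
    example = "AAACAA"  # toy sequence, not a real protein
    dp = dipeptide_frequencies(example)
    print(len(dp), dp[DIPEPTIDES.index("AA")])
    print(multiplet_frequencies(example)[AMINO_ACIDS.index("A")])
```

The two vectors can be concatenated to give one feature vector per protein.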

Webserver:
The server was set up on Apache version 2.0 and the scripting was done in PHP version 5.3.2. The background programs that compute dipeptide and multiplet frequencies were written in Perl 5.8.5.

Performance assessment:
A fivefold cross-validation technique was used to evaluate the performance of all the models. We computed the error rate (err), specificity (SP), sensitivity (SN), and MCC [8-9] to assess the performance of the method (formulas are given in the Supplementary material). Sensitivity is the fraction of positive sequences that are correctly predicted; specificity reflects how rarely negative sequences are incorrectly recognized as positives; the 'error rate' is the fraction of type III PKS data that is classified incorrectly [9]. MCC ranges from -1 to +1, and a higher value indicates better prediction. We identified the model with the highest MCC value in each of the five subsets. In the second subset, three models with different parameter sets (47, 65 and 89) were equally good and therefore all three were included (Table 1).
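For orientation, the measures named above can be computed from the counts of true and false positives and negatives as in the following Python sketch. It uses the common textbook definitions, which are an assumption here: the authors' exact formulas are in the Supplementary material and may differ, and in particular the error rate below is taken over all classified sequences. The function names and the interleaved fold assignment are hypothetical.

```python
from math import sqrt
from typing import Dict, List


def five_fold_indices(n: int, k: int = 5) -> List[List[int]]:
    """Split indices 0..n-1 into k interleaved folds (illustrative scheme)."""
    return [list(range(n))[i::k] for i in range(k)]


def performance(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, error rate and MCC.

    tp/fn: positive sequences predicted positive/negative.
    tn/fp: negative sequences predicted negative/positive.
    """
    total = tp + tn + fp + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    err = (fp + fn) / total if total else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "err": err, "MCC": mcc}


if __name__ == "__main__":
    # Toy counts for one held-out fold, not results from the paper.
    print(performance(tp=90, tn=80, fp=20, fn=10))
    print([len(fold) for fold in five_fold_indices(2000)])
```

In a cross-validation run, each fold is held out once, the model is trained on the remaining four folds, and the measures are computed on the held-out fold.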

Discussion:
The web interface of "PKSIIIexplorer" allows one to 'upload' or 'paste in' sequences in FASTA format. Here we describe the application of TSVMs to the functional prediction of proteins based on the chemical fingerprint of their residues. Among the various kernel functions tested, the polynomial and radial (RBF) kernels gave better results than the linear and sigmoid kernels (Table 1), and the SVM models yielded very good results (MCC = 0.84-0.97). In addition to the plant proteins, we also included type III PKSs from bacteria, fungi and bryophytes in the training dataset, so that these can also be predicted reliably when submitted by users. Notably, the server correctly predicts the ketosynthase domain of type I PKS as negative, even though it adopts a similar structural fold and shows sequence similarity to type III PKS. These results demonstrate that the sequence features used by PKSIIIexplorer have strong discriminating power. The system was also found to be superior (Figure 1) to the previous prediction server "PKSIIIpred" (http://type3pks.in/prediction/).

Conclusion:
Because of their diverse pharmacological functions, the volume of data on type III PKSs is rapidly increasing. In this regard, developing a highly sensitive method to identify these proteins 'in silico' will accelerate experimental research. Our results give highly reliable predictions even though the amount of training data is relatively small, leaving room for further improvement as the number of known type III PKSs grows. BLAST could be helpful, especially for rare specificities, and we therefore plan to integrate it in a future version of PKSIIIexplorer.