Prediction of N-myristoylation modification of proteins by SVM.

Attachment of a myristoyl group to NH(2)-terminus of a nascent protein among protein post-translational modification (PTM) is called myristoylation. The myristate moiety of proteins plays an important role for their biological functions, such as regulation of membrane binding (HIV-1 Gag) and enzyme activity (AMPK). Several predictors based on protein sequences alone are hitherto proposed. However, they produce a great number of false positive and false negative predictions; or they cannot be used for general purpose (i.e., taxon-specific); or threshold values of the decision rule of predictors need to be selected with cautiousness. Here, we present novel and taxon-free predictors based on protein primary structure. To identify myristoylated proteins accurately, we employ a widely used machinelearning algorithm, support vector machine (SVM). A series of SVM predictors are developed in the present study where various scales representing physicochemical and biological properties of amino acids (from the AAindex database) are used for numerical transformation of protein sequences. Of the predictors, the top ten achieve accuracies of >98% (the average value is 98.34%), and also the area under the ROC curve (AUC) values of >0.98. Compared with those of previous studies, the prediction accuracies are improved by about 3 to 4%.


Background:
Myristoylation is an irreversible post-translational protein modification (PTM), in which a myristoyl group is derived from myristic acid and covalently attached via an amide bond to the alpha-amino group of an NH2-terminal residue of a nascent polypeptide. This PTM occurs most commonly on glycine residues exposed during co-translational NH2-terminal methionine removal. The usual irreversible modification is absolutely essential for the biological functions of most myristoylated proteins. A cytosolic enzyme, N-myristoyl transferase (NMT; E.C. 2.3.1.97), is responsible for the transfer of myristate (from myristoyl-CoA) onto the NH2-terminal residue of protein substrates. Therefore, NMT is a potential chemotherapeutic target in antifungal [1] and anticancer drug design [2]. The protein substrates of NMT (reviewed by Boutin [3]) are so diverse, including regular enzymes, such as the protein kinase A, protein kinase G, cytochrome b5 reductase, NO synthase, most of G proteinαsubunits, and key enzymes in transforming processes, such as nonreceptor associated tyrosine kinase (fyn, lyn, src, hck), and structural proteins (gag, UL, etc) of retroviruses and other viruses which include poliovirus, polyma virus, herpes simplex, influenza virus and so on. Attachment of the myristoyl residue provides hydrophobicity to influence the partitioning of proteins to cellular membranes; it can serve to promote protein-protein interactions [4]. Since myristoylation has a central role in virus maturation and oncogenesis, specific NMT inhibitors might also lead to potent antivirus agents. The experimental procedure for identifying a myristoylated protein includes cellular labeling by [ 3 H] myristate, isolation of the protein, and recognition of the released fatty acid. The experimental techniques involve reverse phase high-performance liquid chromato-graphy (RP-HPLC), gas chromatography and mass spectrometry. The ionization techniques of mass spectrometry also involve fast-atom bomdartment, chemical ionization, electrospray and tandem mass spectrometry. The sensitivity of these techniques is insufficient and they often require much more biological materials. It is, therefore, desirable to identify precisely and reliably myristoylated proteins for protein functional annotations in the proteome-wide, especially when experimental measurements are unavailable.
Studies of specificity of NMT protein substrates [5] led to a proposed model of the 8-residue NH2-terminal consensus sequence (GNXXXXRR). In PROSITE with scores lower than the threshold, they gave a lower bound and defined them as "twilight zone" predictions. Although Boisson, Giglione and Meinnel (BGM) [9] attempted to modify its threshold parameter and improve identification for plant protein sequences, "twilight zone" predictions are still unsolved. Bologna et al. [10] suggested a rule-based model using average output scores generated from 25 NNs, and Podell and Gribskov [11] put forward a plant-specific hidden Markov model. However, the former one needs much more samples to optimize the rule set and the latter one is also taxonspecific. In the present study, we show new predictors, trained by using support vector machine (SVM), to improve the prediction of myristoylated proteins.

Methodology: Training Datasets:
We retrieved 447 entries, each of which is clearly labeled with "N-myristoyl glycine" in the field of feature table, through online search with a keyword "myristate" against the UniProtKB database. Each sequence length of these proteins is larger than 100 residues; they cover a wide range of functions, including kinases, phosphatases, cytochrome c oxidase and so on. We noted that Maurer-Stroh's learning dataset and ours share 210 common entries. The negative dataset [12] consists of 429 entries, which were selected from GenBank by text-based searching. These entries contain four kinds of proteins: cytosolic proteins, secreted proteins, N-TM-C (transmembrane proteins with an NH2-terminal export signal predicted by SignalP and a hydrophobic COOHterminus), and transmembrane proteins. Homology similarity of sequences in the negative set was reduced to less than 50% by the employment of Smith-Waterman algorithm. Sequence-length requirement for these proteins in the negative dataset is also larger than 100 residues.

Feature vector construction:
According to the motif proposed by Maurer-Stroh et al. [7], we used 17 residues at NH2-terminus as input for each protein sequence. To transform these residues into a numerical format (called a feature vector), we applied a binary transform (20 binary values per residue), or the value Kyte-Doolittle (KD) scale, or their combination. Here, we exemplify the numerical transformation of a protein sequence with the feature vector named "KD+binary", which is a vector of 355 components. Of 355 components, the first 15 components are generated by applying the KD scale (window size of three residues; 15 windows in a 17-residues sequence), and the remaining 340 components (17×20) are generated from binary transformation. Also, we tried to combine KD with another scale representing physicochemical and biological properties of amino acids. As for this additional scale, we examined all the 543 scales in AAindex database [13].

SVM and Cross-validation assessment:
In this study, 1-norm soft margin optimization of SVM [14] and Radial basic Function (RBF), as the kernel function, were employed for our task. A K-fold cross validation (CV) scheme was used to prevent the over-fitting problem and K = 5 was adopted here, i.e. the whole data set was divided at random into five equal subsets, each of them served as the test set in turn and the remaining subsets used for training. Regarding CV assessment, the most frequently used performance measure, the area under the ROC curve (AUC) that provides us a single numeric value, was also used in this study; the closer AUC is to 1.00, the better performance of the predictor is; AUC of 0.50 means the predictor that predicts the class at random.

Discussion:
Since myristoylation has a central role in virus maturation and oncogenesis, NMT inhibitors would be potent antivirus and anticancer agents. Hence, the ability to computationally recognize NMT substrate has important implications. Apart from accelerating NMT protein substrate annotation in the current database, it would be helpful for designing NMT inhibitors. The assessments of various SVM predictors are summarized in Table 1 (see Supplementary material). As it's shown, the "Binary" predictor shows better performance than the "KD" or the "KD+binary" predictor; however, the remaining predictors in Table 1 improve the performance of the "Binary". Owing to addition of properties of amino acids, the prediction accuracies are all over 98% (the average value is 98.34%) and corresponding AUC values are also over 0.98, which is close to the perfect value (1.0). Regarding the top ten of 543 predictors whose features were generated by utilizing KD with other scales (the remains of them are not listed here), these ten scales representing properties of amino acid could be roughly classified into three categories: preference of protein secondary structure (AURR980117, RICJ880115, MAXF760104, MAXF760105, TANS770107 and ISOY800108), relative stability (ZHOH040102) and geometry property (FAUJ880104, FAUJ880106 and LEVM760102). The motif search by PS00008 of PROSITE, reported by Maurer-Stroh et al. [8] and Bologna et al. [10], generates a great number of the false positive hits as applied to large database scan. Therefore, we first compared our results with those of NMT predictor developed by Maurer-Stroh et al. [8]. They showed the jack-knife tests for their learning dataset show that for the fungi dataset, the prediction accuracy is 95.9% (353/368) and is 95.5% (21/22) for the eukaryotic and viral sequences. Comparing with these results, the average prediction accuracy (98.34%) produced by our SVM predictors is higher and improved by 3%. The NMT predictor is based on a statistical scoring scheme and is taxon-specific, i.e., one has to choose the parameter set according to the taxonomy of the query protein sequences. There are two taxon-specific fields: one is fungi, the other is eukaryota and their viruses (nonfungal). In contrast, our SVM predictors are taxon-free and have few parameters. Further, the classification rule of the NMT predictor generated is based on positive sequences. For screening of unknown proteins, however, it is difficult to balance between false positive and false negative errors.

Conclusion:
In this study, we presented new predictors based on the SVM for myristoylated protein prediction. The ten SVM predictors, whose features were generated by applying the KD scale and other scales, showed average accuracies of 98.34%. Roughly compared with those of previous studies, the prediction accuracies were improved by about 3 to 4%. Achievement of the higher prediction accuracies of our SVM predictors may imply that, except for hydrophobicity of amino acids represented by the KD scale, preference of protein secondary structure, relative stability and geometry property represented by the other ten scales might also be important to myristolation. Our SVM predictors have remarkable generalization ability by virtue of support vector machine algorithm, and we believe that our new predictors will be helpful to identify myristolated proteins on the proteomic scale. A crystal structure (PDB code: 1IID) shows that protein substrate in active site prefers to α helix and coil rather than β sheet since β sheet needs more space. Regarding myristoylation of proteins, the further investigations of the secondary structure and space requirement also need be done in future.