Multi-class subcellular location prediction for bacterial proteins.

Two algorithms based on Bayesian Networks (BNs) for bacterial subcellular location prediction are explored in this paper: one predicts all locations for Gram+ bacteria and the other all locations for Gram- bacteria. Methods were evaluated using different numbers of residues (from the N-terminal 10 residues to the whole sequence) and different residue representations (amino acid-composition, percentage amino acid-composition or normalised amino acid-composition). The accuracy of the best resulting BN was compared to PSORTB. The accuracy of this multi-location BN was roughly comparable to PSORTB; the difference in predictions is low, often less than 2%. The BN method thus represents both an important new avenue of methodological development for subcellular location prediction and a potentially valuable new tool for candidate subunit vaccine selection.


Background:
Only proteins liable to surveillance by the immune system are likely candidate subunit vaccines. Thus, for bacteria, subcellular location can be a prime arbiter of immunogenicity. There are five principal subcellular locations in Gram- bacteria (extracellular, outer membrane, periplasmic, inner membrane, or cytoplasmic) and three locations in Gram+ bacteria (extracellular, membrane, or cytoplasmic). Components of the proteome contain signals which can direct proteins to one or more of these locations. Such signals are legion. They can, for example, be explicit sequence motifs recognised by a membrane transporter. They can also be coincidental physical properties that render certain proteins compatible with their environment and were derived through an evolutionary process. An organism can read such signals well enough in vivo, and there is thus much interest in effectively reproducing this in silico. Bioinformaticians have, therefore, attempted to identify both sequence motifs and overall physical properties of proteins indicative of protein subcellular location.
Many methods have attempted to predict subcellular location. There are two basic types of prediction method: manual construction of rules derived from factors thought to determine subcellular location and the application of data-driven machine learning methods that automatically identify factors that determine cellular localisation, using proteins of known location as training data. The degrees of accuracy differ markedly between methods and compartments, reflecting either a lack of data for a specific compartment or the complexity of factors controlling the location of certain proteins.
However, there have been few, if any, real attempts to create prediction methods for all such compartments, since most methods predict only a subset of the 'most interesting' locations. An exception to this is PSORTB, a subcellular location-prediction expert system developed specifically for bacteria [1]. PSORTB is a modular system based on six prediction algorithms. A query protein undergoes analysis by each of the modules and the results are then combined. The modules that form PSORTB are: SCL-BLAST, which uses sequence similarity to known proteins to identify location; PROSITE, which detects motifs indicative of subcellular location [2]; HMMTOP, a method for the prediction of TM domains, used to identify membrane proteins [3]; a module that identifies outer membrane protein motifs using sequences occurring only in TM beta barrel proteins; SubLocC, a support vector machine based method, which assigns a cytoplasmic or non-cytoplasmic location based on amino acid-composition; and a hidden Markov model trained to identify signal peptide cleavage sites. The prediction for a query sequence is reported as the likelihood that the protein belongs to a particular compartment. PSORTB has a precision of 96.5% and a recall of 74.8%.
In the context of bacterial subcellular location prediction, methods based on Bayesian Networks (BNs) are explored in this paper. Two algorithms which predict all locations for Gram+ and Gram- bacteria were created. A range of variant methods was evaluated, with differences including the number of residues considered (from the N-terminal 10 residues to the whole sequence) and residue representation (amino acid-composition, percentage amino acid-composition or normalised amino acid-composition).

Methodology: Dataset
An algorithm was used to mine the bacterial subset of SWISS-PROT release 40 [4]. Initially, bacterial status was confirmed using the OC line code of the SWISS-PROT entry. Entries were split into Gram+ and Gram- at the superfamily level. The following were assigned as Gram+: actinobacteria; deinococcus; thermus; firmicutes; planctomycetes; and thermotogae, and the following assigned as Gram-: chlamydia; verrucomicrobia; cyanobacteria; chloroflexi; fusobacteria; nitrospirae; proteobacteria; spirochaetes; chlorobi; and bacteroidetes. The SWISS-PROT subcellular location descriptions (lines labelled CC) were then searched to identify whether the subcellular location was known. To remove proteins of uncertain location, only entries not labelled as 'potential', 'probable', 'hypothetical', 'possibly' or 'by similarity' were incorporated into the final data-set. A non-redundant data-set of proteins was obtained using CLUSTALW [5]. If two or more proteins were found to have sequence similarity higher than 90% then all but one were removed from the data-set. The algorithm and subsequent CLUSTALW analysis produced a Gram+ data-set of 272 extracellular proteins, 375 membranous proteins and 1500 cytoplasmic proteins, while the final Gram- data-set contained 185 extracellular, 159 outer membrane, 432 periplasmic, 273 inner membrane and 2480 cytoplasmic proteins.
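The two filtering steps described above (discarding entries with uncertain location annotations, then removing near-duplicate sequences) might be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the greedy keep-first strategy is an assumption, and `difflib.SequenceMatcher` serves only as a crude stand-in for the CLUSTALW-derived similarity used in the paper.

```python
from difflib import SequenceMatcher

# Qualifiers that mark a location annotation as uncertain (taken from the text).
UNCERTAIN = ("potential", "probable", "hypothetical", "possibly", "by similarity")


def has_certain_location(cc_text):
    """True if a subcellular-location comment carries no uncertainty qualifier."""
    lowered = cc_text.lower()
    return not any(word in lowered for word in UNCERTAIN)


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the paper used CLUSTALW, not difflib."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(sequences, threshold=0.90):
    """Greedily keep a sequence only if no already-kept one exceeds the threshold."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, k) <= threshold for k in kept):
            kept.append(seq)
    return kept
```

The greedy pass compares every candidate with every retained sequence, so it is quadratic in the number of proteins; that is acceptable for a sketch but would be slow on a full SWISS-PROT subset.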

Combined bacterial subcellular location predictor method
When training the method, a variety of sequence representations were examined. Six different sequence lengths were used: residues 1-10 of the N-terminus, residues 1-20, residues 1-30, residues 1-40, residues 1-50, and the whole protein sequence. For each sub-sequence, amino acids were represented in three ways: as the residues themselves; as the amino acid-composition (for each amino acid, the total number of occurrences in the sub-sequence); and as the normalised amino acid-composition (for each amino acid, the composition divided by the total number of amino acids in the sub-sequence).
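As a concrete illustration of the two composition-based representations, the following minimal sketch computes both for an N-terminal sub-sequence. The function names and the single-letter ordering of the 20 values are assumptions made for the example, not details taken from the original implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, single-letter codes


def subsequence(sequence, length=None):
    """Return the N-terminal `length` residues, or the whole sequence if None."""
    return sequence if length is None else sequence[:length]


def composition(sub):
    """Count of each amino acid in the sub-sequence, in AMINO_ACIDS order."""
    return [sub.count(aa) for aa in AMINO_ACIDS]


def normalised_composition(sub):
    """Each count divided by the number of residues in the sub-sequence."""
    total = len(sub)
    return [count / total if total else 0.0 for count in composition(sub)]
```

For example, `composition(subsequence(seq, 50))` gives the 20 counts for the first 50 residues of a hypothetical sequence `seq`, while `normalised_composition` rescales them so that they sum to 1 when every residue is one of the 20 standard amino acids.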
Each representation was tested with each sub-sequence length, creating 18 Naïve Bayes networks. The amino acid-composition and normalised composition sequence representations used BNs comprising 20 input nodes and 1 output node. During training, a sub-sequence is extracted from the original protein sequence and its composition calculated. To train the BN for an individual sub-sequence, each of the 20 input nodes is assigned a different composition value: the first contains that of alanine, the second that of arginine, etc. The output node is given the value of the subcellular location of the training protein; the set of possible values differs between Gram+ (3 locations) and Gram- (5 locations). This procedure is repeated until all sub-sequences have been used to train the network.
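The structure of such a network (one output node for the location and one input node per feature, each input depending only on the output) can be sketched in a few lines. The class below is a hypothetical, minimal Naïve Bayes classifier over discrete feature values; treating each composition value as a discrete category and using Laplace smoothing are assumptions of this sketch, since the text does not say how the conditional probabilities were estimated.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Naive Bayes over discrete features: one class node, one node per feature."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # counts[(feature index, class)][value] -> number of training rows
        self.counts = defaultdict(Counter)
        self.values = [set() for _ in range(len(rows[0]))]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def log_posterior(self, row, label):
        """Unnormalised log P(label | row) under the independence assumption."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for i, value in enumerate(row):
            seen = self.counts[(i, label)][value]
            # The +1 reserves probability mass for values unseen in training.
            denom = self.class_counts[label] + self.alpha * (len(self.values[i]) + 1)
            score += math.log((seen + self.alpha) / denom)
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda c: self.log_posterior(row, c))
```

For the composition networks described above, each training row would be the 20-value composition vector of a sub-sequence and each label the known compartment of that protein.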
The directed acyclic graph (DAG) required when the residue representation was the actual amino acid sequence varied with the sequence length used. A length of 10 residues required a BN with 10 input nodes, for example. When the whole protein sequence was used, the DAG required as many input nodes as the protein had amino acids. Since the same network is used for all the proteins of the data-set, the longest protein determined the total number of input nodes used. For the Gram- predictor 2248 input nodes were used and for the Gram+ predictor 1852 input nodes were used. The amino acids were converted to integers, 1 to 20, according to the alphabetical order of their single letter representations, i.e. alanine (A) had the value 1, cysteine (C) was 2, etc. When training the network, the first input node takes as its value the first residue, the second the second, and so on until the end of the sequence. Input nodes that do not have a corresponding amino acid, due to the training sequence being shorter than the maximum length, were assigned the value 0. The output node is given the value of the subcellular location of the training protein.
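The residue encoding just described is simple enough to state directly in code. The sketch below is illustrative only: the names `RESIDUE_CODE` and `encode_residues` are hypothetical, and the text does not say how non-standard residue codes were handled, so here they simply raise a `KeyError`.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# A -> 1, C -> 2, ..., Y -> 20, following single-letter alphabetical order.
RESIDUE_CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_residues(sequence, n_nodes):
    """Map residues to integers 1-20 and pad with 0 up to n_nodes input nodes."""
    codes = [RESIDUE_CODE[aa] for aa in sequence[:n_nodes]]
    return codes + [0] * (n_nodes - len(codes))
```

With `n_nodes` set to the length of the longest protein in a data-set, most encoded sequences end in a long run of zeros, which is the padding effect discussed in the results.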
Testing of the network was performed using the training set under five-fold cross-validation. For all networks, the negative set chosen was the equivalent data-set of the opposite Gram-type. To assess the predictivity of the Bayesian approach, the same data-sets were submitted to the PSORTB predictor.
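A five-fold cross-validation loop of the kind mentioned above could look like the sketch below. It is a generic illustration under stated assumptions: the shuffling seed, the way folds are dealt out, and the `train_and_predict` callback are all hypothetical, and the handling of the negative set from the opposite Gram type is not modelled.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_and_predict, k=5, seed=0):
    """Return the accuracy on each held-out fold (requires len(rows) >= k).

    `train_and_predict(train_rows, train_labels, test_rows)` is a hypothetical
    callback that fits a model and returns one predicted label per test row.
    """
    accuracies = []
    for held_out in k_fold_indices(len(rows), k, seed):
        test = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in test]
        predicted = train_and_predict(
            [rows[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [rows[i] for i in held_out],
        )
        correct = sum(p == labels[i] for p, i in zip(predicted, held_out))
        accuracies.append(correct / len(held_out))
    return accuracies
```

Each protein is held out exactly once, so averaging the five returned accuracies gives a single estimate that never scores a protein with a model trained on it.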

Results and Discussion:
For both Gram+ and Gram- predictors the same combination of residue representation and sub-sequence length produced the most accurate results: amino acid-composition and a sub-sequence length of 50 residues. See Tables 1 and 2. The accuracy of both predictors increased with increasing sub-sequence length, up to 50 residues. Generally, both predictors were more accurate when using amino acid-composition. The worst-performing representation is the one based on residues, which tries to capture residue position-specific information. Apparent inadequacies of the representation may arise from the structure of the BN DAG it requires: the number of input nodes equalled the length of the longest sequence; all other sequences therefore had many nodes assigned a value of 0 during training. For each compartment, the longest sequence was many times longer than the average sequence, thus many input nodes for most sequences had little predictive benefit.
The sub-sequence length affected accuracy more obviously. Unsurprisingly, the accuracies of all locations are highest when the first 50 residues are considered, since this will encompass the entire length of the vast majority of signal sequences. Shorter lengths may neglect important regions within such signal peptides. Charge, length, and composition, among other properties, will vary between different signal sequences and can therefore be used to distinguish accurately between different signal peptides.
A surprising feature of the results was that in most cases the accuracy of amino acid-composition for the whole protein was close to the accuracy of just the first 50 residues. However, for the extracellular compartment of the Gram+ predictor, the whole protein composition had a higher accuracy. This was unexpected as the un-normalised composition varied significantly with sequence length. A possible explanation is that Gram+ extracellular sequences have a very different length distribution to sequences from other compartments. The average length of sequences from each compartment was calculated. For the Gram+ proteins the average sequence length of the extracellular set was 397, compared to 491 (membranous proteins) and 442 (cytoplasmic). Further support comes from the Gram- sequence lengths, which were found to be 549 (extracellular), 568 (outer membrane), 322 (periplasmic), 400 (inner membrane), and 448 (cytoplasmic). If the BN based on composition draws its predictivity from the atypical Gram+ sequence length distribution, then the accuracy for the negative set should be low, since sequence length is nearer that of Gram+ extracellular sequences.
Comparing the best performing multi-location BNs to PSORTB indicates that their accuracy is roughly equivalent; the discrepancy between predictions is typically low, often less than 2%. See Table 3. Exceptions include the extracellular compartment (both Gram- and Gram+) and membrane prediction. The prediction of extracellular location is more accurate for both Gram- (8.57% higher than PSORTB) and Gram+ (7.86% higher). For membranous prediction, PSORTB has an accuracy which is 20.54% higher than that of the Gram+ multi-location predictor. This may be because PSORTB is specifically trained to identify TM spanning regions.
Table 2: Results of the Gram- all compartments predictor. The best performing network is highlighted in bold.
Table 3: Accuracy of PSORTB bacterial subcellular location predictor in comparison to the most accurate methods produced.

Conclusion:
Good levels of accuracy were achieved, yet PSORTB outperformed the BN method. Since our approach attempts to utilise a single method and sequence representation to capture all information relevant to bacterial subcellular location, the performance of the BN method reported here is most encouraging. When comparing our method to PSORTB, we see a single methodology competing against an expert system, which is specifically designed to capitalise on best-in-class methods. Constructing a successful multi-outcome predictive method is difficult. Prediction is made between input variables that are very difficult to separate using any method. The generally lower degree of prediction accuracy of the BN approach is most likely due to PSORTB applying many algorithms, each specifically trained to address the individual requirements of each particular location. Clearly, this strategy is more likely to produce a significantly greater level of accuracy. However, the BN method described here is nonetheless very competitive, notwithstanding such arguments. Thus, we can aver that BNs represent an important new avenue in subcellular location prediction and that our implementation is in itself a potentially powerful new tool for candidate subunit vaccine selection.