Prediction of protein-mannose binding sites using random forest

Mannose is an abundant cell surface monosaccharide and has an important role in many biochemical processes. It binds to a great diversity of receptor proteins. In this study we have employed Random Forest for prediction of mannose binding sites. Mannosebinding site is taken to be a sphere around the centroid of the ligand and the sphere is subdivided into different layers and atom wise and residue wise features were extracted for each layer. The method achieves 95.59 % of accuracy using Random Forest with 10 fold cross validation. Prediction of mannose binding site analysis will be quite useful in drug design.


Background:
There is an exponential increase in genome sequence and protein structure data in last few years. Comparatively less availability of experimental assays of carbohydrate binding and discoveries of essential roles of some of the proteincarbohydrate interactions in various metabolic processes suggests the necessity for prediction algorithms. It is known that carbohydrate-binding proteins share low sequence and structural similarity [1]. Despite this low similarity in sequence as well as structure, their binding sites are very specific. This specificity can be attributed to the local characteristics of binding sites such as hydrogen bonding patterns, presence of stacking interactions [2]. Another proof for presence of local characteristic features comes from biochemical studies of sugar binding in lectins by Rao et al [3]. They found conserved loop structures to be important for sugar binding. Hydrophobic stacking interactions have also been found to be specific for carbohydrate binding [2]. Such features constitute a multidimensional feature space. Prediction of mannose binding site employing Random Forest is carried out under the assumption that from such a space enough informative features can be extracted and employed for supervised classification of binding and nonbinding sites.
Mannose binding proteins cover a diverse range of functions. They can be broadly classified into two classes, viz. 1) those having N or O-glycosylation bonds with sugars and 2) those exhibiting non-covalent interactions with sugars. In this work only the proteins with non-covalent interactions are considered. In the literature there exist a few studies pertaining to prediction of carbohydrate binding sites. Shionyu Mitsuyama et al [4] first derive empirical rules based on the similarity of spatial distribution of amino acid residues in known binding protein structure and subsequently employ the derived rules for identification of positive sites. Taroni

Methodology:
For the purpose of extracting different features we need to provide a rational method of representing the binding sites of different structures [8]. As in the earlier work binding site has been represented as concentric spherical shells around its centre [8] and this centre is taken as the centroid of the atoms excluding hydrogen atoms present in the ligand. This approach was used by Nassif et al [8] for extracting different shell features. The shells are started from a distance of 3 Angstroms from the shell centre and continued up to 10 Angstroms from the shell centre, each shell with a width of 1 Angstrom. The radius of outermost shell is chosen to be 10 Angstrom unit. The radius of mannose pyranose ring is 1.5 Angstrom unit. The molecular interactions are significant to a range of 7 Angstrom unit [9]. Therefore the radius of outermost sphere is kept at 10 Angstrom unit.

Preparation of Dataset
A non-redundant list of 11 proteins, which bind to mannose non-covalently, where structures of protein-mannose complexes are known, was taken from PDB. This dataset of 11 proteins is kept as the final dataset. The positive dataset consists of 55 mannose-binding sites derived from 11 mannose-binding proteins. True binding sites for non-mannose ligands have been included as the negative data; Non-mannose ligands include non-mannose hexoses, non-sugar organic molecules and metal ions. These comprise of 78 Glucose binding sites, 40 Galactose binding sites and 69 other ligand-binding sites. All these 187 binding sites along with 55 positive sites form our input data for the experiments.   Table 2.

Feature selection
Feature selection is needed to reduce the feature space by filtering out unwanted features that reduce the classification performance. Feature selection is useful to know relatively more informative features from a collection of features that might contain redundant and non-informative features increasing the confidence of classification. For the selection of the attributes, information gain attribute evaluator from Weka software was employed.

Classification
Random Forests are an ensemble of randomly constructed independent decision trees. In each decision tree a randomly chosen fixed subset of features are employed to build a classification model. Bootstrapping technique is used in each tree for selection of training set. Due to this about one third of the examples are left unused and are known as out of bag examples. It is customary to use this out of bag examples as validation set for tuning the algorithm parameters. Hence a separate test data is not normally required in RF for checking the overall accuracy of the forest. After all individual trees are built a majority vote is then taken to decide on the class label for each case.

Discussion: Separate versus Combined features
Separate features refer to the all possible values of various properties taken together as separate features. For example, for a property called 'charge', there are three possible values viz. positive, zero and negative. These three properties taken separately can be considered as three different features. Thus, here the feature 'positive charge' shows the number of atoms with positive charge. Combined features refer to the combination of values of more than one property. Advantage of using combined features is that, the combination of more than one property avoids the redundancy in the features. Since the feature values considered here are the counts of atoms of a particular property, using different values of the properties will give redundant counts for some of the properties. Clubbing the properties together to form a new property will automatically reduce the redundant counts. Thus the combined features give more realistic properties rather than the separate features. Another advantage of the combined features is the reduction in the feature space. Here only the atom wise features are used and residue wise features are omitted from the combinations. The results Table 3 (see supplementary material) indicate that with separate features there is a slight decrease in MCC and slight increase in accuracy with feature selection. The reversal in this trend is observed for combined features. The maximum accuracy is found to be 95.59 % and 94.11% for separate and combined features respectively.

Conclusion:
In this work ligand centroid approaches were employed for prediction of mannose-binding sites. The tuned classifier model with most informative features provides an accuracy of more than 90 % percent. The developed model can be used to predict the mannose binding sites with a high degree of confidence.