FlexPred: a web-server for predicting residue positions involved in conformational switches in proteins

Conformational switches observed in the protein backbone play a key role in a variety of fundamental biological activities. This paper describes a web-server that implements a pattern recognition algorithm trained on the examples from the Database of Macromolecular Movements to predict residue positions involved in conformational switches. Prediction can be performed at an adjustable false positive rate using a user-supplied protein sequence in FASTA format or a structure in a Protein Data Bank (PDB) file. If a protein sequence is submitted, then the web-server uses sequence-derived information only (such as evolutionary conservation of residue positions). If a PDB file is submitted, then the web-server uses sequence-derived information and residue solvent accessibility calculated from this file. Availability FlexPred is publicly available at http://flexpred.rit.albany.edu


Background:
Proteins are flexible macromolecules. The protein backbone can switch from one specific folded conformation to another. Conformational switches have been shown to be involved in a variety of biological functions, such as catalysis, macromolecular recognition, signal transduction, locomotion, and a number of pathogenic disorders [1-2]. Molecular dynamics simulations of long time scale conformational transitions are very computationally expensive and therefore impractical for large-scale studies [3]. Several bioinformatics methods that attempt to predict conformational switches from sequence information alone have been developed. Most of these methods were not trained to predict conformational switches directly by using a dataset of experimental examples of such switches, but rather to identify them indirectly by predicting certain structural properties related to protein flexibility in general. Flexibility-related properties used to train these methods include the crystallographic B-factor [4-5], the ambiguity in secondary structure assignment . This dataset was used to train a supervised pattern recognition method, Support Vector Machine (SVM), to distinguish between flexible and rigid residue positions. We implemented two types of encoding of the input sequence. One is the binary encoding which utilizes the input sequence alone and represents the 20 amino acid types as 20 mutually orthogonal binary vectors. The other is the PSSM encoding which accounts for evolutionary conservation of the input sequence and is based on the PSI-BLAST position-specific scoring matrix (PSSM). If the user submits a protein structure in a Protein Data Bank (PDB) file, then the normalized residue solvent accessibility calculated from this file is also used for prediction along with one of the two types of sequence encoding. Thus, we have four possible ways of encoding protein sequence with or without solvent accessibility. Accordingly, four SVM predictors, one for each of the four combinations, were implemented.

Input:
FlexPred is freely available at http://flexpred.rit.albany.edu. It has a simple intuitive user interface that consists of four input fields described below. Instructions for each field and general information about the methodology and the output format can be found by clicking a corresponding help hyperlink on the input page.

Field 1: Protein sequence or PDB file to be analyzed.
For sequence-based prediction, the user can paste or upload an amino acid sequence in FASTA format. For the prediction based on a protein sequence and solvent accessibility of its residue positions, the user can either upload a PDB file or provide a four-character PDB id and let the server automatically download a corresponding file from ftp://ftp.wwpdb.org.
Field 2: Selection of encoding method. The user can select either binary or PSSM sequence encoding. The PSSM encoding performs better if protein sequence information alone is used for prediction, whereas the binary encoding performs better if both protein sequence and residue solvent accessibility are used. Therefore, the PSSM encoding is the default method for the sequence-based submissions, while the binary encoding is the default method for the PDB-based submissions. For any prediction method, when FPr is decreased, TPr is also decreased, and vice versa. The user can choose FPr of 5%, 10%, 15%, or 20%. Since most statistical tests consider the 5% chance of false positive prediction to be an acceptable level, the FPr of 5% is selected by default.
Field 4: Selection of retrieval method. The user can choose to receive results by E-mail (default) or manually retrieve them using a temporary URL provided upon submission. The results of prediction are kept on the web-server for one day from the moment of submission, and deleted afterwards.

Output:
The output from FlexPred consists of a header that describes the output format itself, the selected encoding type, the selected false positive rate, the submitted sequence in FASTA format, and the predicted labels for each residue position (Figure 1). The labels are 'R' (rigid) and 'F' (flexible). The column 'S_PRB' shows the probability of label 'F' for each residue position. The probability of label 'R' is (1-P F ), where P F is the probability of label 'F'. The probabilities are in range [0.0, 1.0]. Higher probability corresponds to a greater prediction confidence.

Future development:
To the best of the authors' knowledge, FlexPred is the only on-line method for predicting conformational switches in proteins directly trained on a large dataset of experimentally characterized examples. We will continue updating FlexPred by adding new predictive models and new experimental examples as they become available.