Functional annotation strategy for protein structures.

Whole-genome sequencing projects are a major source of proteins of unknown function. Because predicting protein function from sequence alone remains difficult, research groups have recently begun to use 3D protein structures and structural models instead. MED-SuMo compares protein surfaces by analyzing the composition and spatial distribution of specific chemical groups (hydrogen bond donor, acceptor, positive, negative, aromatic, hydrophobic, guanidinium, hydroxyl, acyl and glycine). It can recognize proteins that have similar binding sites and thus may perform similar functions. We present here an example that illustrates the value of the MED-SuMo approach for structure-based functional annotation.

Most functional annotation methods rely on sequence similarity searches, e.g. ProtoNet [5], or on characterized sequence motifs, e.g. PROSITE [6]. The direct use of 3D structures or structural models to assign protein functions is an emerging field. This development is due to the increasing number of available crystallographic structures and of hypothetical proteins solved by structural genomics consortia [7], and to new automatic crystallization methods. The first dedicated methods were directly derived from 3D local similarity methods, i.e. local rigid superimposition approaches. SuMo was one of the first programs to combine a description based on chemical groups with a fast graph comparison heuristic [8]. SiteEngine, developed later, took a comparable approach [9]. ProFunc is a popular web server that brings together a compendium of structure-based and sequence-based methods [10].

Description:
Recognition of similar binding regions on the protein surface is crucial for functional classification and for functional prediction. MED-SuMo (http://www.medit.fr/) is able to recognize proteins that have similar binding sites and thus may perform similar functions. It is an improved version of SuMo (http://sumopbil.ibcp.fr/ [8, 11]) with updated source code; it is now faster and considers a larger number of natural and synthetic ligands. Its heuristic is based on a unique representation of macromolecules that uses selected triplets of chemical groups, each with its own geometry, regardless of the distinction between the main chain and the side chains of amino acids. To extract similar sites, MED-SuMo transforms the binding site (or the full structure) of a query into a graph whose vertices are triplets of chemical groups. This graph is then compared to binding sites extracted from the PDB, which are pre-processed and stored in a database [11].
A major difficulty in functional annotation is identifying proteins whose function is truly unknown. The PDB website (http://www.rcsb.org/) associates more than 1,500 structures with an "unknown function" annotation. Nevertheless, many of them can be annotated using classical approaches (high sequence identity, structural homology, residue conservation analysis, sequence motif searches, cleft analysis). As an illustration, 3-keto-L-gulonate 6-phosphate decarboxylase, a lyase, is represented by 14 proteins in the PDB. Within this family, 4 structures are classified as "unknown function", but their function can be found both in the PDB entry and in the title of the reference paper (e.g. PDB code: 1XBX). Moreover, they show significant sequence identity/similarity and a low root mean square deviation (rmsd) with the 10 other protein structures.
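To make the triplet-based graph representation described in this section more concrete, the following minimal Python sketch builds such a graph from a handful of typed chemical groups. It is only an illustration under stated assumptions and not the MED-SuMo implementation: the `ChemicalGroup` class, the 8 Å distance cutoff (`max_edge`), the rule that two triplets are linked when they share two groups, and the coordinates of the toy site are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class ChemicalGroup:
    """A typed chemical group located at a point in space (hypothetical model)."""
    kind: str    # e.g. "hydroxyl", "aromatic", "guanidinium"
    xyz: tuple   # (x, y, z) coordinates in angstroms


def build_triplet_graph(groups, max_edge=8.0):
    """Return (triplets, edges) for a list of chemical groups.

    Assumption: a triplet is kept when its three pairwise distances are all
    <= max_edge, and two triplets are connected when they share exactly two
    groups. Both rules are illustrative choices, not taken from the paper.
    """
    triplets = [
        t for t in combinations(range(len(groups)), 3)
        if all(dist(groups[i].xyz, groups[j].xyz) <= max_edge
               for i, j in combinations(t, 2))
    ]
    edges = [
        (a, b) for a, b in combinations(range(len(triplets)), 2)
        if len(set(triplets[a]) & set(triplets[b])) == 2
    ]
    return triplets, edges


def triplet_signature(groups, triplet):
    """Order-independent label of a triplet: its sorted group types."""
    return tuple(sorted(groups[i].kind for i in triplet))


# Toy binding site: four nearby groups and one distant group (made-up data).
site = [
    ChemicalGroup("hydroxyl", (0.0, 0.0, 0.0)),
    ChemicalGroup("aromatic", (3.0, 0.0, 0.0)),
    ChemicalGroup("acyl", (0.0, 4.0, 0.0)),
    ChemicalGroup("hydrophobic", (3.0, 4.0, 0.0)),
    ChemicalGroup("positive", (30.0, 30.0, 30.0)),
]
triplets, edges = build_triplet_graph(site)
```

In this toy site the distant "positive" group takes part in no triplet, so only the four neighbouring groups contribute vertices. Comparing two sites would then amount to matching triplets with compatible signatures and geometry, a step this sketch does not attempt.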
For our study, we selected proteins from the "Joint Center for Structural Genomics" (JCSG, http://www.jcsg.org/), which has determined more than 350 protein structures. About half of these proteins are classified as "Structural Genomics Unknown Function", but most of them share sequence or fold similarity with known proteins. Tm1012 is a hypothetical protein from Thermotoga maritima (PDB code: 2EWR) that cannot be associated with any protein of known function. Classical approaches such as PSI-BLAST [12], run against the NR database (via the NCBI web service), and dedicated tools such as ProFunc [10] could identify neither a related sequence nor a set of residues potentially involved in a known interaction or protein function.
Like most of these methods, MED-SuMo does not give an all-or-nothing answer. The results are presented as a hit list of potentially interesting regions of the query protein, each superimposed on the corresponding similar region of a selected target. For the 2EWR query, the best MED-SuMo hit corresponds to the same protein crystallized by the same consortium under different experimental conditions (PDB code: 2FCL). The following hits are not directly related to the query (they are neither superimposable on 2EWR nor share significant sequence identity with it, i.e. less than 20%): 2CJ5, 5APR and 1OD1. 5APR and 1OD1 share a significant sequence identity of 38%, with an rmsd of 1.8 Å. Otherwise, sequence identities are below 22%. Moreover, none of these proteins can be superimposed over more than 20% of their length [13], i.e. these proteins are distinct. Figure 1a shows that these 2 proteins cannot be globally superimposed, and Figure 1b displays a closer view of the local superimposition of the corresponding residues with the ligand, an acetate ion. The local rmsd is less than 0.5 Å and the two regions correspond to the same residues (YQ-L).
The second region involves more residues. Figures 1c and 1d show the superimposition of Tm1012 and rhizopuspepsin (PDB code: 5APR).
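The rmsd and sequence identity values quoted in this section follow standard definitions, which the short Python sketch below spells out. It is an illustration only: the function names are hypothetical, the coordinates and aligned sequences are made-up toy data rather than values from 2EWR or its hits, the rmsd assumes the two sets of matched atoms have already been superimposed, and the percent identity is computed over alignment columns without gaps, which is one of several conventions in use.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized lists of
    (x, y, z) points that are already superimposed and paired in order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return sqrt(squared / len(coords_a))


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped columns of a
    pairwise alignment (assumed convention; other denominators exist)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


# Toy data: two matched atom pairs and a six-column alignment (made up).
local_rmsd = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                  [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)])
identity = percent_identity("ACDE-G", "ACNEFG")
```

With such helpers, a local rmsd computed over the few atoms of a matched site and a global sequence identity computed over a full alignment can be reported side by side, which is the kind of contrast drawn above between local similarity and globally distinct proteins.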