Analysis and modeling of mycolyl-transferases in the CMN group.

Mycolyl-transferases are a family of proteins that are specifically present in the CMN (Corynebacterium, Mycobacterium and Nocardia) genera and are responsible for the synthesis of cell wall components. We modeled the three-dimensional structures of mycolyl-transferases from Corynebacterium and Nocardia by homology modeling, using the crystal structures of mycolyl-transferases from M. tuberculosis as templates. Comparison of the models revealed significant differences in their substrate binding sites. Some mycolyl-transferases, identified by the Gene Ids Nfa25110, Nfa45560, Nfa7210, Nfa38260, Nfa32420, Nfa23770, Nfa43800, Nfa30260, Dip0365, Ncgl0987, Ce1488, Ncgl0885, Ce0984, Ncgl2101, Ncgl0336 and Ce0356, are associated with a relatively larger substrate binding site, and their amino acid substitutions (D40N, R43D/G, S236N/A) are likely to affect binding to trehalose.

forms. The structure corresponds to an α/β hydrolase fold, and the catalytic triad responsible for the mycolyl-transferase activity comprises the amino acid residues S126, E230 and H262 (numbering is according to PDB Id: 1F0P). Structural comparison of the three mycolyl-transferases (PDB Ids: 1SFR, 1F0P, 1DQZ) revealed that the active sites are virtually identical, indicating that these enzymes share a common function [9]. However, in contrast to the high level of similarity within the substrate-binding site and the active site, the surface residues distant from the active site were observed to be quite variable, suggesting that all three Ag85 enzymes in M. tuberculosis are needed to evade the host immune system.
Genome sequencing of M. tuberculosis [12], C. glutamicum [13], C. efficiens [14], C. diphtheriae [15] and Nocardia farcinica [16] has been completed. M. tuberculosis, whose genome comprises 3,986 genes, is the causative agent of tuberculosis, which causes 3 million deaths worldwide. C. glutamicum, with 3,002 genes, is a soil bacterium widely used in industry for the production of amino acids. C. efficiens, with 3,069 genes, is a non-pathogenic bacterium. C. diphtheriae, with 2,320 genes, is the causative agent of diphtheria. N. farcinica, with 5,674 genes, is the causative agent of nocardiosis, which affects the lung, central nervous system and cutaneous tissues of humans and animals.
In our earlier work [17], we identified mycolyl-transferases in the C. glutamicum and C. efficiens genomes and modeled their three-dimensional structures. We reported the relative binding of corynomycolyl-transferases towards trehalose. Our findings are in accordance with experimental data [18,19] from gene deletion studies that measured the concentrations of TMCM / TDCM. The genomes of N. farcinica, a representative species of Nocardia, and C. diphtheriae were subsequently sequenced, so complete data on all mycolyl-transferases from species of the CMN group are now available in the public databases. We have therefore carried out sequence analysis of all mycolyl-transferases, modeled the structures of those from Nocardia and C. diphtheriae, and compared their substrate binding sites. Such comparative analysis is relevant when structural information is available for proteins from only one organism, because useful inferences can then be made about the structure, function and nature of the substrate binding sites of related members from other organisms.
Table 1.
Database searching: Homologous proteins from Mycobacterium, Corynebacterium and N. farcinica were identified using BLASTP [21], with Ag85B as the query sequence, against GenBank release 153 [22]. The BLOSUM62 matrix was used with gap costs of 11 (existence) and 1 (extension), and the results were sorted by E-value (expected value).
Multiple sequence analysis: Thirty-one mycolyl-transferase sequences were aligned using the CLUSTALW program [23] available at EBI. Penalties of 10 for gap opening, 0.05 for gap extension and 8 for gap separation (default parameters) were assigned; the resulting alignment is shown in Figure 1.
Homology modeling: The three-dimensional models were constructed using MODELER [24] available in InsightII (Accelrys Inc., USA).
The structures of Ag85A (PDB Id: 1SFR), Ag85B (PDB Id: 1F0N) and Ag85C (PDB Id: 1DQZ) were used as templates for modeling. MODELER is an automated comparative modeling program designed to find the most probable structure of a protein sequence, given its alignment with related structures. The model is obtained by optimal satisfaction of spatial restraints derived from the alignment, expressed as probability density functions for the restrained features. The optimization procedure is a variable target function method that applies the conjugate gradients algorithm to position all non-hydrogen atoms [25]. In all, seventeen homology models were constructed for the mycolyl-transferases from N. farcinica and C. diphtheriae.
Model evaluation: The models were evaluated using PROCHECK [26]. The RMSD (root mean square deviation) values for topologically equivalent residues between the models and the corresponding crystal structures were obtained by structural superposition using programs in InsightII (Accelrys Inc., USA). The Profiles-3D method, which measures the compatibility of an amino acid sequence with a protein of known three-dimensional structure, was used to further assess the models [27].
Cys 87-Cys 92; Cys 146-Cys 227
Substrate docking: The trehalose substrate was docked into the binding site of all protein models using QUANTA (Accelrys Inc., USA). The enzyme-substrate complexes were refined using molecular mechanics (MM) and molecular dynamics (MD) calculations in order to understand their interactions. Hydrogen atoms were added to the structures at pH 7.00 using BIOPOLYMER in InsightII. The 'capping mode off' parameter was chosen so that the protein termini remain uncharged, with NH2 and COOH groups. The CVFF (Consistent Valence Force Field) force field was chosen, and the 'Fix' option was used to assign the potential atom types, partial charges and formal charges for the protein-substrate complex. Each docked complex was subjected to energy minimization using 3000 steps of steepest descent followed by conjugate gradient minimization until an energy gradient < 0.01 kcal/mol/Å was achieved. The energy-minimized structures were then subjected to MD simulations, performed in the canonical ensemble (NVT) at 298 K using the CVFF force field implemented in Discover-3, and equilibrated for 3000 femtoseconds with a step size of 1 femtosecond.
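The RMSD used in the model evaluation step can be illustrated with a minimal sketch. It assumes that the two coordinate sets have already been superposed and that the lists pair topologically equivalent Cα atoms in the same order. The function name and the toy coordinates are hypothetical and are not taken from the InsightII programs used in this work.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superposed coordinate sets.

    Each argument is a list of (x, y, z) tuples in angstroms; the i-th entry
    of one list is assumed to be equivalent to the i-th entry of the other.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical Calpha positions for three equivalent residues.
model_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
crystal_ca = [(0.0, 0.0, 1.0), (3.8, 1.0, 0.0), (7.6, 0.0, 0.0)]

print(f"RMSD = {rmsd(model_ca, crystal_ca):.3f} A")
```

The sketch deliberately omits the superposition itself; a full comparison would first find the rigid-body transformation that minimizes this quantity.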

Results and Discussion:
Sequence searches identified four mycolyl-transferases each in M. tuberculosis and C. diphtheriae, six in C. glutamicum, five in C. efficiens, and thirteen in N. farcinica. Details of the mycolyl-transferases analysed and modeled in this work are provided in Table 1. The mycolyl-transferases of the mycobacterial species M. tuberculosis, M. leprae and M. bovis are highly similar; therefore, only those from the M. tuberculosis H37Rv strain are used in our analysis. M. tuberculosis also contains a mycolyl-transferase precursor protein, MPT51 (Gene Id: Rv3803), that does not possess mycolyl-transferase activity [28] and was therefore excluded from our analysis. The multiple sequence alignment of the thirty-one mycolyl-transferases is shown in Figure 1. Despite the low sequence similarity shared between these proteins, 16 amino acid residues are conserved: L39, W51, P71, D81, W82, W97, F100, G124, S126, S150, D192, G214, E230, G260, H262 and W264. The alignment also indicated that some proteins have an insertion of variable length (between 2 and 19 amino acid residues) that precedes the catalytic E230. Further, two N. farcinica proteins (Nfa1810 and Nfa1820) contain a 27 amino acid residue insertion, rich in glycine and serine, between the conserved W82 and W97 (see Figure 1).
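The way conserved residues are read off a multiple sequence alignment can be sketched as follows. This is an illustration only: the function name and the toy alignment are hypothetical, the real input would be the thirty-one aligned sequences, and a column is treated as conserved only if every sequence carries the same residue and none has a gap. Positions are reported in the numbering of a chosen reference sequence, in the same way that the residues above are numbered according to one structure.

```python
def conserved_positions(aligned, ref_index=0, ref_start=1):
    """List residues that are identical in every sequence of an alignment.

    `aligned` holds equal-length strings with '-' for gaps. Labels such as
    'W3' use the ungapped numbering of the reference sequence.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    ref_number = ref_start - 1
    for column in zip(*aligned):
        ref_residue = column[ref_index]
        if ref_residue == "-":
            continue  # gap in the reference: no residue number to report
        ref_number += 1
        if all(residue == ref_residue for residue in column):
            conserved.append(f"{ref_residue}{ref_number}")
    return conserved


# Hypothetical three-sequence alignment; the first sequence is the reference.
toy_alignment = [
    "MSW-GLDH",
    "MTWAGIDH",
    "KSW-GLEH",
]

print(conserved_positions(toy_alignment))
```

For the toy alignment this reports W3, G4 and H7; the insertion in the second sequence does not shift the reference numbering.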
It is often observed that, during evolution, gene duplications, rearrangements and gene loss occur in genomes as part of a complex, general-purpose mechanism for rapid adaptation of the organism. Gene duplication gives rise to extra copies of selected genes. Duplications are important because they allow at least one of the gene copies to evolve while the function of the original gene remains intact. Many new functions arise from duplication and subsequent change of old genes. In this way, duplication of pre-existing genetic information provides the raw material from which new gene functions can evolve, thereby contributing to genetic complexity during evolution. For the mycolyl-transferases of the CMN genera, the varying number of proteins in each organism reflects gene duplication events during the evolution of these organisms. Further, we found that the overall structure, active site and hydrophobic tunnel are identical in all proteins, with significant differences in the substrate specificity pockets that may be a result of selective pressure during evolution. From this work, we propose that trehalose is the original substrate and that its binding is retained only in some corynomycolyl-transferases and nocardiomycolyl-transferases. Following gene duplication, mutations in the substrate binding site have occurred such that the newly evolved proteins can bind other sugars and so synthesize organism-specific polysaccharide-mycolate cell wall components. Further, the mycolyl-transferases Nfa1840, Ncgl2777, Ce2709 and Dip2193 contain a 300 amino acid residue C-terminal extension as a result of gene fusion events. Brand et al., 2003 reported that deletion of the Ncgl2777 gene led to a 10-fold increase in the cell volume of the organism.
We reported the identification of 55 amino acid residue tandem LGFP (conserved sequence motif: leucine, glycine, phenylalanine, proline) repeats in the C-terminal region of Ncgl2777 and Ce2709 [30], and suggested that the abnormal increase in the cell volume of C. glutamicum is due to the loss of the C-terminal domain containing the LGFP tandem repeats, which may be responsible for maintaining the integrity of the cell wall. The presence of these LGFP repeats in the C-terminal region of Nfa1840 and Dip2193 implies that these are also cell surface proteins and may be important in maintaining cell wall integrity in an analogous manner.

Conclusion:
This work describes the comparison of three-dimensional models of mycolyl-transferases in the CMN genera. Although the sequence identity is in some cases as low as 17%, the overall α/β fold characteristic of mycolyl-transferases is conserved. This conservation extends to the active site, comprising amino acid residues S126, E230 and H262. However, the amino acid residues forming the substrate binding pockets, defined by their interactions with trehalose, vary owing to certain mutations in some mycolyl-transferases. Significant differences are also observed in the size of the substrate binding pocket, owing to the close proximity of an insertion loop between the conserved W82 and W97. The size of the substrate binding pockets and the nature of their amino acid residues are likely to affect mycolyl-transferase substrate specificity. These observations lead us to believe that, during the course of evolution, gene duplication events followed by mutations at the substrate binding pockets may have given rise to mycolyl-transferases that are responsible for synthesis of the polysaccharide-mycolate complex in an organism-specific manner.
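For readers who wish to reproduce a figure such as the 17% sequence identity quoted above, a minimal sketch of a pairwise percent identity calculation is given here. The denominator convention is an assumption: identical residues are divided by the number of alignment columns in which neither sequence has a gap. Other conventions exist, and the one used in this work is not stated. The function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns without a gap ('-') in either sequence.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments: five comparable columns, four identical.
print(percent_identity("ACDE-G", "ACNEFG"))
```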