Modeling and structural analysis of cellulases using Clostridium thermocellum as template

Cellulases are among the most widely distributed enzymes and have wide application. They are involved in the conversion of biomass into simpler sugars. The cellulase of Trichoderma longibrachiatum, a known cellulolytic fungus, was compared with Clostridium thermocellum [AAA23226.1] cellulase. Blastp was performed with AAA23226.1 as the query sequence to obtain nine similar sequences from the NCBI protein database. The physicochemical properties and secondary structure of the cellulases were analyzed using the ExPASy tools ProtParam, SOPMA and GOR IV. Homology modeling was done using SWISS-MODEL, and model quality was checked by RMSD values using VMD 1.9.1. Active sites of each model were predicted using the automated active site prediction server of SCFBio. The study revealed instability of the cellulases of two eukaryotic strains, namely Trichoderma longibrachiatum [CAA43059.1] and Melanocarpus albomyces [CAD56665.1]. The negative GRAVY scores of the cellulases indicated better interaction and activity in the aqueous phase. Molecular weight (M. Wt) ranged between 25 and 127.56 kDa. The iso-electric point (pI) of the cellulases was found to be acidic. GOR IV and SOPMA were used to predict the secondary structure of the cellulases, which showed that random coil dominated. A neighbor joining tree with C. thermocellum [AAA23226.1] cellulase as root showed that the cellulases of Thermoanaerobacter subterraneus [ZP_07835928] and C. thermocellum [CAA4305.1] were more similar to eukaryotic cellulases, supported by the least bootstrap values. Pseudoalteromonas haloplanktis cellulase was found to be the ideal model, supported by the least RMSD score among the predicted structures. Trichoderma longibrachiatum cellulase was found to be the best among the cellulases compared, possessing a high number of active sites rich in ASN and THR. CYS residues were also present, ensuring stable interaction and better bonding. Hydrophilic residues were abundant in the active sites of all analyzed models and the template.

functional relations. The present study aims to identify more efficient cellulolytic enzyme-producing microorganisms for bioprospecting using computational analysis. Protein sequences of cellulase were retrieved from NCBI and subjected to ProtParam analysis of physicochemical parameters, secondary structure prediction using GOR IV and SOPMA, homology modeling (SWISS-MODEL), phylogenetic analysis and active site prediction by SCFBio.

Methodology: Sequence retrieval and alignment
The cellulase protein sequence of Clostridium thermocellum [AAA23226.1] was retrieved from the National Center for Biotechnology Information (NCBI) and used as the query sequence for structure prediction, property prediction and modeling. Blastp was performed, and nine similar sequences from different strains were obtained. Clustal W multiple sequence alignment of these sequences was done using BioEdit 5.0.

Secondary structure and physicochemical characterization of cellulase
The sequences obtained were analyzed using various tools available on the ExPASy server [7]. GOR IV analysis was performed to determine the presence of helices, beta turns and coils in the protein structure [8]. Self-optimized prediction method with alignment (SOPMA) analysis was done to analyze the structural components [9]. The GOR IV and SOPMA results were then compared. ProtParam analysis was done to determine the amino acid composition, molecular weight, instability index, aliphatic index and grand average of hydropathicity (GRAVY) [7]. Hydropathy plot analysis was performed for all cellulase sequences, and the nature of the amino acid residues was studied using ProtScale [7] based on the Kyte and Doolittle scale.
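
As a rough illustration of two of the ProtParam quantities and the ProtScale-style profile named above, the following minimal sketch computes GRAVY, the aliphatic index and a sliding-window hydropathy profile in plain Python. The function names, the toy peptide and the window size of 9 are assumptions made for illustration; this is not the ExPASy implementation. It also assumes that sequences contain only the 20 standard residues.

```python
# Kyte-Doolittle hydropathy values per residue (one-letter codes).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """GRAVY: mean Kyte-Doolittle hydropathy over all residues."""
    seq = seq.upper()
    return sum(KD[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    seq = seq.upper()
    n = len(seq)

    def pct(aa):
        return 100.0 * seq.count(aa) / n

    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))


def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each full window (window size is an assumption)."""
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical toy peptide, not one of the cellulase sequences in this study.
toy = "MKTAYIAKQRQISFVKSHFSRQ"
print(round(gravy(toy), 3), round(aliphatic_index(toy), 2))
print([round(v, 2) for v in hydropathy_profile(toy)])
```

A negative GRAVY value from `gravy` corresponds to a net hydrophilic sequence, which is how the values are read in this study.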

Homology modeling of cellulase
Homology models were predicted using SWISS-MODEL [10-12] and their quality was analysed using VMD 1.9.1 [13]. RMSD values were calculated using the RMSD calculator, and the model with the lowest RMSD was selected as the best homology model. The Ramachandran plot for the best predicted model was generated with RAMPAGE [14].
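
The selection criterion can be sketched as follows. This is a minimal illustration, not the VMD RMSD calculator: it assumes the two structures are already superimposed and that their atoms are paired by index, and the function names, coordinates and model labels are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples, paired by index.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


def best_model(rmsd_by_model):
    """Return the model label with the lowest RMSD."""
    return min(rmsd_by_model, key=rmsd_by_model.get)


# Hypothetical coordinates and labels, for illustration only.
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
models = {
    "model_1": [(0.1, 0.0, 0.0), (1.6, 0.1, 0.0), (3.0, 0.2, 0.0)],
    "model_2": [(0.5, 0.5, 0.0), (2.0, 0.5, 0.0), (3.5, 0.5, 0.5)],
}
scores = {name: rmsd(template, coords) for name, coords in models.items()}
print(best_model(scores))
```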

Phylogenetic analysis
Phylogenetic relationships among the aligned cellulase sequences obtained from Blastp were analyzed by the neighbor joining method [15] using MEGA 4.0 [16]. The cellulase sequence of C. thermocellum [AAA23226.1] was used as the root taxon for the analysis. Confidence levels were assessed using a bootstrap of 1000 replications.
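
To make the neighbor joining step concrete, the sketch below computes pairwise distances from an alignment and then applies the neighbor joining Q criterion to pick the first pair of taxa to join. The distance model used in MEGA is not stated here, so the simple p-distance is an assumption, as are the function names and toy sequences. Tree construction beyond the first join and the bootstrap are omitted.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Symmetric nested-dict distance matrix from {name: aligned_sequence}."""
    names = list(aligned)
    d = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d[n][m] = d[m][n] = p_distance(aligned[n], aligned[m])
    return d


def nj_first_pair(d):
    """Pair minimising Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    names = list(d)
    n = len(names)
    totals = {i: sum(d[i].values()) for i in names}
    q = {
        (i, j): (n - 2) * d[i][j] - totals[i] - totals[j]
        for i, j in combinations(names, 2)
    }
    return min(q, key=q.get), q


# Hypothetical aligned toy sequences, not the cellulase alignment of this study.
aligned = {
    "seqA": "MKT-AYLV",
    "seqB": "MKTGAYLV",
    "seqC": "MRTGSYIV",
    "seqD": "MRT-SFIV",
}
pair, q_values = nj_first_pair(distance_matrix(aligned))
print(pair)
```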

Activity validation by active site comparison
Active sites of the predicted models and the template were analyzed using the automated active site prediction (AADS) server of SCFBio [17]. The amino acid composition of all the cavities was analyzed, and the frequency of occurrence of each amino acid in the cavities of each model was determined.
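
The frequency calculation described above can be sketched as follows. The input layout (one list of three-letter residue names per predicted cavity) is an assumption made for illustration and is not the output format of the SCFBio server; the function name and example residues are hypothetical.

```python
from collections import Counter


def residue_percentages(cavities):
    """Percentage of each residue type pooled over all cavities of one model.

    cavities: list of lists of three-letter residue names (assumed layout).
    """
    counts = Counter(res.upper() for cavity in cavities for res in cavity)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {res: round(100.0 * n / total, 2) for res, n in counts.most_common()}


# Hypothetical cavity-lining residues, for illustration only.
example_cavities = [
    ["ASN", "THR", "GLY", "ASN"],
    ["THR", "CYS", "ASN", "SER"],
]
print(residue_percentages(example_cavities))
```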

Discussion: Blast analysis and sequence retrieval
The cellulase protein sequence of Clostridium thermocellum [AAA23226.1] was used as the query sequence, and nine sequences were obtained by performing Blastp. Multiple sequence alignment was done in BioEdit, and the alignment was then used for phylogenetic analysis in MEGA.

Secondary structure and physicochemical analysis
SOPMA and GOR IV were used to predict the secondary structure; the percentages of alpha helix, extended strand and random coil for the cellulase of each microorganism are presented in Table 1 (see supplementary material). SOPMA analysis additionally gave the percentages of alpha helix, extended strand, beta turn and random coil (Table 1). The secondary structure indicates whether a given amino acid lies in a helix, strand or coil [18,19]. SOPMA was used for structure prediction of cellulase protein [20]. Random coil dominated the other forms in the cellulases analyzed by SOPMA and GOR IV. The random coil content of M. albomyces (58.72%) and T. longibrachiatum (57.88%) was dominant compared to the other forms. After random coils, extended forms (10%-27%) were the next most dominant. In all the cellulases analyzed, α-helix (13%-37%) dominated β-turn, which had the lowest percentage of conformation (4%-10%).
ProtParam analysis was performed, and the number of amino acid residues, molecular weight, pI value, aliphatic index and GRAVY index were obtained for each sequence (Table 2, see supplementary material). The occurrence of amino acid residues in the cellulase sequences was compared, and the most dominant residues are highlighted (Table 3, see supplementary material). Molecular weight ranged from 25 to 127 kDa; it was high in C. thermocellum (83 kDa) and lowest in M. albomyces (25 kDa). Compared with the eukaryotic cellulases, a higher aliphatic index of up to 97.51 was noted in T. subterraneus strains, which indicates their stability over a wide range of temperatures. The GRAVY value was negative in all species studied. Notably, the bacterial strains had lower GRAVY values, indicating better possibilities of aqueous interaction. The pI values showed that cellulase is acidic in nature. T. subterraneus had a near-neutral pI value and the highest GRAVY value. In general, GRAVY tended to be lower at more acidic pI values. In eukaryotic cellulases, the occurrence of α-helices was found to be very low. In the case of A. bisporus, the α-helix content was similar to that of lower taxonomic groups. Moreover, these cellulases possess higher percentages of random coils. A general pattern of inverse relationship between the percentage of α-helices and random coils was observed at both higher and lower taxonomic levels.
Cellulases of M. albomyces, T. longibrachiatum and R. flavefaciens FD-1 were classified as unstable (II > 40), with an instability index (II) of 53.54, 55.23 and 54.34 respectively. It is notable that M. albomyces and T. longibrachiatum are eukaryotic isolates and possess the lowest percentage of alpha helices in their structure. Except for P. haloplanktis and R. flavefaciens FD-1, whose dominant residues were the hydrophilic Asn (10.1%) and Ser (11.6%) respectively, all the other sequences had ALA and GLY as dominant residues, which are hydrophobic in nature. ALA was dominant in the cellulases of A. bisporus, C. thermocellum, P. carotovorum, Saccharophagus sp. and T. subterraneus, whereas GLY was dominant for C. thermocellum, M. albomyces and T. longibrachiatum.

Homology model validation
SWISS-MODEL was used to predict the homology models of the cellulase sequences, and the protein structure quality was analyzed. RMSD values for the models were calculated, and the model with the lowest value, i.e. the best predicted model, is shown in Figure 1. The Ramachandran plot for the model was constructed using RAMPAGE. Residue B 169-LEU belonged to the outlier region, and the number of residues in the allowed and favoured regions was very close to the expected values. It was observed that 94.8% of residues were in the favored region, 5.5% in the allowed region and 0.2% in the outlier region.

Phylogenetic analysis
A phylogenetic tree was constructed from the ten sequences by the neighbour joining method with the reference sequence C. thermocellum [AAA23226.1] as the root (Figure 2). However, all the taxa of the group belonged to prokaryotic origin. Evolutionary divergence of the sequences showed little association with variations in secondary structure.
Compared to bacterial cellulases, fungal cellulases are widely used. Moreover, cellulolytic activities are high for fungal cellulases. The highest cellulase activity for C. thermocellum was 12.05 IU/ml [5]. P. haloplanktis is a psychrophilic bacterium, so the cellulase obtained from it is cold-adapted. Cellulase from the former has five conserved amino acid residues in its active site [21]. C. thermocellum is a thermophilic bacterium and its cellulase has better heat stability. It is known to be an ethanologenic strain, and cellulase from this source has high commercial applications [22]. Cysteine residues contribute to protein thermal stability [22]. Amongst fungi, species of Trichoderma and Aspergillus are well known for their cellulolytic potential [23]. Apart from the above, other fungi used for cellulase production are Humicola and Aspergillus sp. [24].
Hydropathy plots for the cellulase sequences were constructed using ProtScale based on the Kyte and Doolittle scale, and the hydrophilic or hydrophobic nature of the residues was observed from the plots. The majority of the residues belonged to hydrophilic regions, confirming the interaction of the enzymes in aqueous medium. The aliphatic residues ALA, LEU, ILE and VAL were among the hydrophobic residues in the profile. Similarly, PHE, which is an aromatic residue, and the sulfur-containing residues MET and CYS were the other residues belonging to hydrophobic regions of the ProtScale profile.

Active site prediction based on active site
Active sites for each model and template were predicted using Active Site prediction server and tabulated  1] cellulases. However, P. haloplanktis, an extremophile, had THR-dominant active sites. In T. longibrachiatum, ASN and THR were found to be dominant in the active sites, with a frequency of 10.58. It is clearly notable that hydrophilic amino acid residues are high in the active sites of these enzyme structures, ensuring their interaction with substrate in the aqueous phase. The least frequent residue was CYS, which assures stable interaction and bonding. Though the frequency of CYS was very low, it was found in both C. thermocellum and 3 eukaryotic cellulases. These results support the higher cellulolytic activity of T. longibrachiatum, which could be the source of the most active cellulase in the present study.

Conclusion:
These studies provide insight for better prospecting of cellulolytic isolates from the environment for various industrial applications. Among the microbial cellulases used in the present work, T. longibrachiatum cellulase was found to be the best, with a high number of active sites.

Table 4: Amino acid frequencies in active sites of predicted cellulase models and template

Percentage of amino acids in active sites
Amino acid residues