Analysis of multi-domain hypothetical proteins containing iron-sulphur clusters and fad ligands reveal rieske dioxygenase activity suggesting their plausible roles in bioremediation

‘Conserved hypothetical’ proteins pose a challenge not just for functional genomics, but also to biology in general. As long as there are hundreds of conserved proteins with unknown function in model organisms such as Escherichia coli, Bacillus subtilis or Saccharomyces cerevisiae, any discussion towards a ‘complete’ understanding of these biological systems will remain a wishful thinking. Insilico approaches exhibit great promise towards attempts that enable appreciating the plausible roles of these hypothetical proteins. Among the majority of genomic proteins, two-thirds in unicellular organisms and more than 80% in metazoa, are multi-domain proteins, created as a result of gene duplication events. Aromatic ring-hydroxylating dioxygenases, also called Rieske dioxygenases (RDOs), are class of multi-domain proteins that catalyze the initial step in microbial aerobic degradation of many aromatic compounds. Investigations here address the computational characterization of hypothetical proteins containing Ferredoxin and Flavodoxin signatures. Consensus sequence of each class of oxidoreductase was obtained by a phylogenetic analysis, involving clustering methods based on evolutionary relationship. A synthetic sequence was developed by combining the consensus, which was used as the basis to search for their homologs via BLAST. The exercise yielded 129 multidomain hypothetical proteins containing both 2Fe-2S (Ferredoxin) and FNR (Flavodoxin) domains. In the current study, 40 proteins with N-terminus 2Fe-2S domain and C-terminus FNR domain are characterized, through homology modelling and docking exercises which suggest dioxygenase activity indicating their plausible roles in degradation of aromatic moieties.

Because of the inherent thermodynamic stability of the aromatic ring, natural turnover of these compounds is slow and instead relies on complex microbial degradation pathways. With aromatic compounds comprising >25% of the earth's biomass, these pathways play a crucial role in the biogeochemical carbon cycle. However, despite the abundance of microbial degraders, man-made aromatic pollutants are often recalcitrant to existing bioprocessing pathways. As a result, these xenobiotic compounds, many of which are derived from the processing of crude oil, persist in the environment causing irreversible damage to the biosphere [7].

Figure 1: Reaction of ring cleavage mediated by RDO
Aromatic ring-hydroxylating dioxygenases, also called Rieske dioxygenases (RDOs), are class of multi-domain proteins that catalyze the initial step in microbial aerobic degradation of many aromatic compounds. Two hydroxyl groups are introduced into the aromatic ring yielding cyclic cisdihydrodiols or cis-diol carboxylic acids (Figure 1) [Substituents X and Y can be hydrogen atoms or any of several other groups] [8,9].
More than three dozen distinct RDOs have been identified. RDOs consist of a reductase, an oxygenase and in some cases, an additional ferredoxin that mediates electron transfer between the former two components. The oxygenase component catalyzes the insertion of both atoms of molecular oxygen into the aromatic substrate, which is believed to occur at a mononuclear iron site and to be accompanied by electron insertion from a Rieske-type [2Fe-2S] centre. Either the reductase or, where present, the intermediary ferredoxin component, supplies the two electrons from NAD(P)H to the dioxygenase [10]. RDOs have been empirically classified according to the various combinations of subunits and electron transfer co-factors involved in reducing the oxygenase component [10,11] as mentioned in Table 1 (see  supplementary material).
Here we present a protocol to data mine and computationally characterize redox hypothetical proteins possessing multiple domains. Most proteins consist of multiple domains, and domains determine the function and evolutionary relationships of proteins [12]. Thus, it is important to understand the principles of domain combinations and their associated inter domain interactions especially, in hypothetical proteins.
Primarily, 2Fe-2S (Ferredoxins) and FMN/FAD (Flavodoxins) were considered due to their vital and diverse roles in biological systems, the most important amongst it being their role in Electron Transport Mechanisms. Ferredoxins are small, acidic, electron transfer proteins that are ubiquitous in biological redox systems. Members of the 2Fe-2S ferredoxin family have a general core structure consisting of beta(2)-alphabeta (2). They are proteins of around one hundred amino acids with four conserved cysteine residues to which the 2Fe-2S cluster is ligated [13]. Flavoenzymes have the ability to catalyse a wide range of biochemical reactions. They are involved in the dehydrogenation of a variety of metabolites, in electron transfer from and to redox centres, in light emission, in the activation of oxygen for oxidation and hydroxylation reactions. About 1% of all eukaryotic and prokaryotic proteins are predicted to encode a flavin adenine dinucleotide (FAD) or Flavin Mono Nucleotide (FMN)-binding domains which are involved in electron transport [14].

Methodology:
The proteins belonging to oxidoreductase (Ferredoxin, Flavodoxin) families were retrieved from the ExPASy Prosite interface [15]. However, engineered and mutated sequences were not considered to avoid redundancy. Additionally, only reviewed sequence from Uniprot containing a structural entry were considered. Binding sites of all the proteins belonging to the same group were analyzed in order to arrive at a consensus pattern through multiple sequence alignment. Extended regions which had no information with the other sequences were clipped to strengthen the alignment. The protocol adopted is shown in (Figure 2).
The search for Ferredoxin family (PDOC00175) yielded 14 sequences with 2Fe-2S binding signature. As there existed heterogeneity within the group, the sequences were clustered based on phylogenetic analysis. The sequence alignment was performed through ClustalW [16] and the tree was obtained using MEGA (NJ method) [17]. The tree obtained is shown in (Figure 3A). Further to the clustering, multiple sequence alignment was performed using Multalin [18], for all the 3 clusters (groups) to obtain a representative sequence containing strong signatures. The multiple sequence alignment of sequences belonging to group 1 yielded better consensus Similarly, the search for flavodoxin family (FNR reductase -PDOC51384) yielded 7 sequences, whose Phylogenetic tree is shown in (Figure 4A). When multiple sequence alignments of both the clusters were critically analyzed, the MSA of group 1 exhibited strong signatures of the FNR domain when compared to cluster 2, which is depicted in (Figure 4B).
Thus, a consensus of the cluster of sequences from group 1, for both the 2Fe-2S and FNR domains respectively, were considered as possible representative patterns, towards generating the probable synthetic sequence, which was used as the basis for BLAST tool [19] search against the non-redundant database. Interestingly, this approach yielded 2078 sequences, and clearly contained both 2Fe-2S and FNR domains when analysed through the conserved domain database (CDD) [20]. Amongst these 2078 sequences, 129 belonged to that special class of hypothetical proteins, which were taken up for further characterization and analysis.

Results and Discussion:
Upon critical evaluation of the 129 multi-domain hypothetical sequences through CDD, significant differences in the location of 2Fe-2S domain, relative to other domains, were found. Of these 129 sequences, 61 contained an N-terminus 2Fe-2S and a C-terminus FNR domain while this order was reversed in 25 sequences as shown in the ( Figure 5A).

The remaining 43 sequences contained an N-terminus MOSC domain [21]
(pfam03473 and pfam03476) which is a super family of beta-strand-rich domains identified in the molybdenum cofactor sulfurase and several other proteins from both prokaryotes and eukaryotes. The MOSC domain is predicted to be a sulfur-carrier domain that receives sulfur abstracted by the pyridoxal phosphate-dependent NifS-like enzymes, on its conserved cysteine, and delivers it for the formation of diverse sulfur-metal clusters. The pie chart in (Figure 5B) illustrates the distribution of the domains amongst these 129 proteins.
In the current study, 61 sequences containing N-terminus 2Fe-2S and C-terminus FNR domains are only considered. The remaining sequences, 25 of which contain an N-terminus FNR and C-terminus 2Fe-2S domains, and 43 of them containing MOSC domain will be considered for detailed analysis in near future. The phylogenetic analysis of the 61 sequences containing an N-terminus 2Fe-2S and C-terminus FNR domains is depicted in (Figure 6), which exhibits the domination of the genus Pseudomonas (46% amongst 61 sequences).    Of these 61 sequences, 21 sequences had very low (<20%) sequence identity with the template 1KRH, and hence were discarded from further analysis due to lack of clarity. The remaining 40 sequences were considered with confidence for homology modelling exercises, as they exhibited high similarity with 1KRH. The overall sequence identity between the query and template was between 20-30% for most sequences, except 7 of them which was in the higher range of 40-70%. However, in spite of lower overall identities will the template, the appreciation with the patterns at domain regions was indeed revealing. The (Figure 7) shows the distribution of the overall sequence identity, identities at the FAD and 2Fe-2S binding regions for each sequence, which clearly illustrates the conservation at critical regions of functional relevance.
The In view of the poise in the signatures between the template and the 40 target sequences, model building exercises were carried out with Swissmodel automated mode [22]. The RMSD between the modelled structure and template for the Ca-atoms confirmed the quality of the models in spite of seemingly low sequence identity (refer table 3 and figure 10), in addition to the satisfaction of various criteria calculated using ProCheck [23]. Individual models were analysed for the binding of ligands through docking studies which was performed using FlexX algorithm [24]. As a case study, modelling of GI ID 238795496 is illustrated below, to define the quality of the structural and functional aspects of these hypothetical protein sequences.

1KRH based model for the GI ID 238795496
The query protein 238795496 from Yersinia mollaretii ATCC 43969 was successfully modelled using SWISS model interface, where the overall identity between the query and template is 25.22%. The alignment between the template and query is shown in (Figure 8). In spite of the low overall sequence identity, it can be appreciated that the binding regions of 2Fe-2S and FAD exhibit conservation up to 35%.
The RMSD for C-alpha atoms between the modelled structure and template is found to be 0.53 Å (for 93.2% of the atoms superposed). The quality of the model was assessed with PROCHECK (Ramachandran map) which showed that 96.8% of the residues were in allowed regions and only 3.2% non-critical residues in disallowed regions. Interestingly, none of these residues in the outlier regions belong to the functionally important regions of the model. The 2Fe-2S and FAD ligands were docked into the model and all the interactions were found similar to that of the template. The binding of 2Fe-2S Ligand and FAD are shown in (Figures 9A, B and Figure 10A, B, C).  where all the 40 models were judged to possess clashes within acceptable limits. Table 3 (see supplementary material) summarises the details of all the models generated with IKRH (which contains 338aa) as the template.

Conclusion:
129 hypothetical proteins from across the prokaryotic genomes have been data mined, and the 3D description of 40 sequences has been derived with confidence. The statistics related to comparative modelling and docking studies (with acceptable energy values) have revealed a strong interaction of redox ligands, viz., 2Fe-2S and FAD, which further strengthens the argument that these proteins may be involved in cleavage of aromatic compounds.
Though degradation of aromatic compounds by Pseudomonas is a well established fact [26,27], characterization of hypothetical sequences from Pseudomonas in the present study could aid in better understanding of these microbial systems. Additionally, large number of other bacterial systems containing these dioxygenases have also been mined and characterized in the present investigations, which could provide insights into their degradation properties.
Thus, this study on multi-domain hypothetical proteins could prove critical in two ways viz., in understanding the mechanism of uptake of nutrients which contain aromatic ring structures and hence enabling engineering of these proteins towards effective degradation of harmful xenobiotics.