Categorization of metabolome in bacterial systems

Analyses of biological databases such as those of the genome, proteome and metabolome have given insights into the organization of biological systems. However, current efforts do not utilize the full potential of the available metabolome data. In this study, the metabolomes of bacterial systems with reliable annotations are analyzed and a simple method is developed to categorize pathways hierarchically using a rational approach. Ninety-four bacterial systems, each having ≥ 250 annotated metabolic pathways, were used to identify a set of common pathways. Forty-two pathways were present in all bacteria and are termed Core/Stage I pathways. This set of pathways was used along with interacting compounds to categorize the pathways in each metabolome hierarchically. In each metabolome, non-interacting pathways were identified, including those at each stage. The case study of Escherichia coli O157, which has 433 annotated pathways, shows that 378 pathways interact directly or indirectly with 41 core pathways while 14 pathways are non-interacting. These 378 pathways are distributed among Stage II (289), Stage III (75), Stage IV (13) and Stage V (1). The approach discussed here helps in understanding the complexity of metabolic networks. It indicates that the core pathways could be the most ancient pathways, and that compounds interacting with the largest number of pathways, which can be easily identified, may be compounds with high biosynthetic potential. Further, it is shown that interactions of pathways at various stages can be one to one, one to many, many to one or many to many mappings through interacting compounds. Because the granularity of the method is high, the impact of a perturbation in a pathway on the metabolome, and particularly on sub-networks, can be studied precisely. The categorization of metabolic pathways helps in identifying choke point enzymes that are useful for identifying probable drug targets. The metabolic categorizations for 94 bacteria are available at http://115.111.37.202/mpe/.


Background:
In recent years there has been an exponential growth in the number of full genome sequencing projects. This has generated a huge amount of genome sequence and related data. Organization of data related to the genome, proteome, metabolome etc. in databases, and analysis of the data in these databases, have given insights into the organization of biological systems. At each level of organization new properties and rules emerge that could not be predicted by studying lower levels [1,2]. For example, studying genes and proteins in isolation does not provide information on the network of genes and proteins. The unique property that emerges from studying the interacting network of genes and proteins is the dynamic interaction among chemical compounds such as metabolites and substrates. Metabolism is an emergent property of this dynamic interaction. Various approaches have been used to simplify and understand the complex network of metabolic pathways. This has led to classifications of the metabolome based on different sets of rules.
Jeong et al. [3] have shown that metabolic pathways are scale free and highly interconnected. Ma et al. [4] used the KEGG LIGAND [5] database to classify the metabolome into a fully connected sub-network using substrate and product properties. The primary aim of their work was to study network properties such as the strong component of a network and closeness centrality. Another study on metabolic classification reported that metabolic networks are organized into highly connected topological modules that combine in a hierarchical manner into larger, less cohesive units, and demonstrated that their degree of clustering follows a power law [6]. Further, these authors classified the modules according to biochemical properties such as carbohydrate metabolism, nucleotide and nucleic acid metabolism, protein, peptide and amino acid metabolism, lipid metabolism etc. [6]. Ihmels et al. [7] analysed the co-expression pattern of genes and revealed strong coordinated regulation of genes involved in individual metabolic pathways. Their analysis demonstrated that a clear hierarchical structure exists in the metabolic network that converges on one of three modules associated with amino acid biosynthesis, protein synthesis and stress. From these studies it is difficult to infer a global relation between the metabolic pathways in an organism. Also, none of the above studies used a specific metabolic pathway ontology, which makes it difficult to unveil unique features of a metabolome and to compare the results from different studies.
Handorf et al. [8] introduced the concept of the biosynthetic potential of a compound and the expansion of a metabolic network depending on the scope of randomly selected seed compounds. More recently, Azuma et al. [9] reported minimal pathway maps obtained by analyzing KEGG pathway maps of 30 representative genomes. These studies have identified energy conversion, amino acid biosynthesis and nucleotide biosynthesis as essential or as part of the minimal set of metabolic pathways. It is therefore argued that simpler, conserved metabolic pathways were present in the primordial cell. However, these studies do not comment on the direct or indirect dependence of the rest of the pathways on the minimal/conserved pathways. Understanding the evolution of the present complex metabolic network from the primordial set of pathways is difficult, and the above studies thus fail to pinpoint the causality behind pathway interactions. None of the above studies fully utilizes the potential of the enormous metabolic data available for a large number of bacteria in the post-genomic era. There is a need to develop a method that extracts knowledge from this large body of data and helps in understanding the organization as well as the functioning of the organism at higher resolution. This knowledge will help in engineering organisms to attain specific functions through suitable modification of pathways. A simple method is developed with this aim, and the results obtained are discussed in the succeeding sections.
Figure 1: Extension of metabolic categories from four Core/Stage I pathways belonging to carbohydrate metabolism. Nodes in red colour denote non-interacting pathways while green coloured nodes denote interacting pathways at each stage.

Data collection
Reliable and highly curated data from the Pune University Metabolic Pathway Engineering (PuMPE) resource [10] are used in this study. Pathway data for ninety-four bacteria (50 aerobic, 38 facultative and 6 anaerobic bacteria) having 250 or more metabolic pathways were extracted from the PuMPE database. Further analysis was carried out using these data for each of the ninety-four bacteria. Data from archaea are not included in the present analysis. To find the Core pathways C, the following approach was used. Each pathway has a 'Unique id' in the PuMPE database. In order to identify the pathways common to all ninety-four bacteria, the 'Unique id' values in each bacterium were compared iteratively. Pathways with the same 'Unique id' in each of the ninety-four bacteria represent identical pathways. Forty-two such identical pathways were found to be present in each of these ninety-four bacteria. These identical metabolic pathways that are present in all bacteria are termed Core pathways C or Stage I pathways.
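The comparison of 'Unique id' values across organisms amounts to a set intersection. The following Python fragment is a minimal sketch of that idea, not the implementation used with PuMPE: the function name, the `min_pathways` parameter and the toy pathway ids are all hypothetical, and the input is assumed to be a mapping from organism name to the set of pathway ids annotated for it.

```python
def core_pathways(pathways_by_organism, min_pathways=250):
    """Return the pathway ids shared by every sufficiently annotated organism.

    pathways_by_organism: dict mapping organism name -> iterable of pathway ids
    (a hypothetical input layout; the real PuMPE schema is not assumed here).
    min_pathways: organisms with fewer annotated pathways are skipped.
    """
    id_sets = [
        set(ids)
        for ids in pathways_by_organism.values()
        if len(set(ids)) >= min_pathways
    ]
    if not id_sets:
        return set()
    # A pathway is "core" only if its id occurs in every retained organism.
    return set.intersection(*id_sets)


# Toy data with made-up ids; the threshold is lowered so the example is small.
toy = {
    "bacterium_A": {"PWY-1", "PWY-2", "PWY-3"},
    "bacterium_B": {"PWY-1", "PWY-3", "PWY-4"},
    "bacterium_C": {"PWY-1", "PWY-3"},
    "bacterium_D": {"PWY-9"},  # too few pathways, excluded by the threshold
}
core = core_pathways(toy, min_pathways=2)
print(sorted(core))
```

With the toy data, only the two ids present in all three retained organisms are reported as core.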

Categorization of metabolome according to metabolic pathways interactions
The metabolome is categorized according to the direct or indirect interaction of Core pathways (Stage I) with the rest of the pathways. The following approach is used to categorize the metabolome and assign a stage to each pathway in each bacterium. A pathway is said to interact with another pathway if at least one compound (start, intermediate or end product) is common to the two pathways under consideration.
The following algorithm was used to categorize the metabolome using the above definition of interacting pathways:
1) A set of all start, intermediate and end compounds {Cm}_i from the set of pathways {P}_i belonging to Stage i was prepared.
2) A set of pathways categorized as 'other' pathways {O}_i at Stage i was prepared after each stage of categorization of pathways.
3) Compounds (start, intermediate or end) in a particular pathway from the set of other pathways {O}_i were compared with {Cm}_i, and if even a single compound was found to be identical then that pathway was categorized to Stage (i+1).
4) This process was continued for each pathway in the set {O}_i to identify the set of interacting pathways {P}_(i+1) belonging to Stage (i+1).
5) The non-interacting pathways form the set {O}_(i+1) and are termed 'other' pathways.
6) This process was continued until the pathways in {O}_n do not interact.
7) The initial set of Stage I pathways for each bacterium is always the set of core pathways C, which is the starting point of the subsequent categorization.
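The steps above can be read as a layer-by-layer expansion from the core set. The Python sketch below is one possible rendering under stated assumptions: each pathway is represented simply as a set of compound names, the function and variable names are illustrative, and the toy pathways are invented for the example rather than taken from PuMPE.

```python
def categorize(pathway_compounds, core_ids):
    """Assign pathways to stages by shared compounds.

    pathway_compounds: dict mapping pathway id -> set of its start,
    intermediate and end compounds (assumed representation).
    core_ids: ids of the Core/Stage I pathways.
    Returns (stages, others): stages[0] is Stage I, stages[1] is Stage II, ...;
    others holds the pathways that never interact.
    """
    stages = [set(core_ids)]
    others = set(pathway_compounds) - set(core_ids)
    while others:
        # Step 1: compounds of the pathways at the current stage.
        compounds = set().union(*(pathway_compounds[p] for p in stages[-1]))
        # Steps 3-4: 'other' pathways sharing at least one compound move up.
        next_stage = {p for p in others if pathway_compounds[p] & compounds}
        if not next_stage:
            break  # Step 6: nothing left interacts.
        stages.append(next_stage)
        others -= next_stage  # Step 5: the rest stay 'other'.
    return stages, others


toy_pathways = {
    "core_1": {"pyruvate", "a"},
    "core_2": {"x"},            # a core pathway with no shared compound
    "p1": {"pyruvate", "b"},
    "p2": {"b", "c"},
    "p3": {"c", "d"},
    "p4": {"z"},                # never interacts
}
stages, others = categorize(toy_pathways, {"core_1", "core_2"})
for number, members in enumerate(stages, start=1):
    print("Stage", number, sorted(members))
print("other", sorted(others))
```

Because a pathway left in the 'other' set shares no compound with any earlier stage, comparing it only against the most recent stage is sufficient at each step.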

Result & Discussion:
Of the 422 bacteria in the PuMPE database, ninety-four have ≥250 annotated pathways. Among these pathways, 42 are identical in all ninety-four bacteria and are called Stage I or Core pathways. The list of Core pathways (Stage I) is given in Table 1 (see supplementary material).
As can be seen from Table 1, the core pathways include important pathways of amino acid, carbohydrate, lipid and nucleotide metabolism. The presence of these pathways among the core pathways supports the hypothesis, suggested by Woese [11], that these primary pathways could be the most ancient pathways. These 42 pathways are therefore suggested to be part of the universal ancestor metabolome. The methodology explained in the methods section was applied to categorize the pathways in each of the ninety-four bacteria. The results of a case study of the categorization of the metabolome of E. coli O157 are discussed below.

Metabolic categorization: a case study of Escherichia coli O157
The metabolome of Escherichia coli O157 is one of the best annotated. In this strain of E. coli, 433 pathways are annotated, and the categorization of these 433 pathways is therefore discussed. Start, intermediate and end compounds in each of the 433 pathways were extracted from the PuMPE database. These were then organized in two sets: I - compounds from the 42 Core/Stage I pathways and II - compounds from the remaining or 'other' 391 pathways. If compounds are common to Stage I and one or more of the remaining pathways, then those pathways are considered to interact with Stage I pathways. Such interacting pathways are called Stage II pathways. In Escherichia coli O157, 289 pathways from the set of 391 pathways were found to interact with 41 core pathways. The Thioredoxin pathway in the set of core pathways did not interact with any of the 391 pathways, as no compound from this pathway is common to any of the compounds from the remaining 391 pathways. The Thioredoxin pathway is therefore termed non-interacting.
In the next step, a set of start, intermediate and end compounds of the 289 Stage II pathways was prepared. These compounds were then compared with the start, intermediate and end compounds of each of the remaining 102 pathways in the set of 'other' pathways. This comparison gave 75 pathways interacting with Stage II. These 75 pathways are categorized as Stage III pathways and the remaining 27 pathways as other pathways. The interaction between the 75 pathways of Stage III and the remaining 27 pathways gave rise to 13 Stage IV pathways. Finally, Stage IV pathways interact with only one of the remaining 14 pathways, giving one Stage V pathway, namely "Lipid A-core biosynthesis". Table 2 (see supplementary material) also points out that at every stage there are pathways that do not interact with any of the remaining pathways. These are called 'next stage non-interacting pathways'. In other words, in the hierarchical network model of pathways, these non-interacting pathways interact with previous stage pathways but do not interact with next stage pathways. For example, the Thioredoxin pathway, the non-interacting pathway at Stage I, does not have any direct or indirect impact on any of the 'other' metabolic pathways in Escherichia coli O157. On the other hand, the 200 non-interacting pathways of Stage II interact with one or more of the 41 pathways of Stage I and have no interaction with the remaining 102 pathways belonging to Stage III (75), Stage IV (13), Stage V (1) or the non-interacting (13) pathways. The same holds for the other non-interacting pathways: they will not affect higher stage pathways.
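The stage counts reported for Escherichia coli O157 can be checked with simple bookkeeping. The short Python sketch below only re-derives the numbers quoted in the text, with illustrative variable names. Its last line rests on an assumption that the text does not state explicitly: that the total of fourteen non-interacting pathways is made up of the 13 uncategorized 'other' pathways plus the Thioredoxin pathway from Stage I.

```python
total_pathways = 433
core = 42
stage_sizes = {"II": 289, "III": 75, "IV": 13, "V": 1}

# Pathways outside the core set before any categorization.
initial_others = total_pathways - core

# Track how many 'other' pathways remain after each stage is assigned.
remaining = initial_others
remaining_after = {}
for stage, size in stage_sizes.items():
    remaining -= size
    remaining_after[stage] = remaining

interacting = sum(stage_sizes.values())

# Assumption: 13 leftover pathways plus the non-interacting Thioredoxin pathway.
non_interacting_total = remaining + 1

print(initial_others, remaining_after, interacting, non_interacting_total)
```

Under that reading, the 13 non-interacting pathways mentioned among the remaining 102 and the 14 non-interacting pathways mentioned elsewhere are consistent with each other.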
Previously, a set of randomly selected seed compounds from the KEGG database was used to demonstrate the expansion of a metabolic network [8]. The method described here selects seed compounds by following a rational approach and is biologically more meaningful. This simple approach of categorization allows identification of the pathways that will not be impacted even if certain pathways are modified or engineered.
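One way to list the pathways that could be impacted by modifying a given pathway is to follow shared compounds from its stage towards higher stages. The Python sketch below illustrates this under stated assumptions: pathways are sets of compound names, the stages are already known as an ordered list of sets, and the function name and toy data are hypothetical. Propagation is followed only stage by stage in the upward direction, which is a simplification of the hierarchical model described in the text.

```python
def downstream(pathway_compounds, stages, start):
    """Return higher-stage pathways linked to `start` through shared compounds.

    pathway_compounds: dict mapping pathway id -> set of compounds.
    stages: list of sets of pathway ids, ordered from Stage I upwards.
    start: id of the modified pathway (must belong to one of the stages).
    """
    for index, members in enumerate(stages):
        if start in members:
            break
    else:
        raise ValueError(f"{start!r} is not assigned to any stage")

    affected = set()
    frontier = {start}
    for next_stage in stages[index + 1:]:
        compounds = set().union(*(pathway_compounds[p] for p in frontier))
        frontier = {p for p in next_stage if pathway_compounds[p] & compounds}
        if not frontier:
            break  # the chain of shared compounds ends here
        affected |= frontier
    return affected


toy_compounds = {
    "c1": {"a"},
    "c2": {"x"},
    "p1": {"a", "b"},
    "p5": {"a", "e"},
    "p2": {"b", "c"},
}
toy_stages = [{"c1", "c2"}, {"p1", "p5"}, {"p2"}]
print(sorted(downstream(toy_compounds, toy_stages, "c1")))
print(sorted(downstream(toy_compounds, toy_stages, "c2")))
```

In the toy data, modifying `c2` reaches no other pathway, which mirrors the behaviour described for a non-interacting pathway.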
The fourteen non-interacting pathways are listed in Table 3 (see supplementary material). None of these pathways has a single compound in common with the remaining pathways in the metabolome of E. coli O157. In other words, the substrates and catalysing enzymes are unique to these fourteen pathways. Identification of such pathways is not easy but is important, as they may be very good targets for engineering specific properties of the organism. The analysis of these non-interacting pathways thus becomes essential. A careful reading of the names of the pathways in Table 3 gives some idea as to why these pathways are present but do not interact with each other or with any other pathway in E. coli O157.
The first three pathways in Table 3 (see supplementary material) are responsible for the degradation of aldoximes, acrylonitrile and catechol respectively, which can enter the bacterial cell through free diffusion or through a permease or other transport system [12]. Aldoximes, acrylonitrile and catechol are toxic and are often discharged into the environment by various chemical industries. If these compounds are not degraded, they harm the environment and can have a severe impact on aquatic ecology. The capability of Escherichia coli O157 to degrade aldoximes and acrylonitrile is of interest as it may help in the treatment of nitrile pollution. Similarly, catechols are also of great environmental concern as they are known to be carcinogenic to human beings [12]. Therefore, the three degradation pathways 1 to 3 in Table 3 (see supplementary material) in Escherichia coli O157 can help in the bioremediation of contaminated sites. The non-interaction of these pathways can thus be understood, as they seem to have evolved in response to changes in the environment.
In a similar manner, the cis-vaccenate biosynthesis, sulfate activation for sulfonation and cyclopropane fatty acid biosynthesis pathways (4 to 6 in Table 3) help the organism to survive under stress by producing or utilizing unusual compounds. For example, cis-vaccenate is the only unsaturated fatty acid found in Escherichia coli O157. It helps in adjusting growth and differentiation to differing temperatures. The sulfate activation for sulfonation pathway improves the defence mechanism of E. coli, whereas the cyclopropane fatty acid synthesised in the cyclopropane fatty acid biosynthesis pathway protects from acid shock [13]. Pathways 7-11 in Table 3 can be categorized as co-factor biosynthesis pathways and are known to be essential for the viability of an organism [14-16]. On the other hand, pathways 12 and 13 in Table 3 are electron donor and acceptor pathways, whereas pathway 14 in Table 3 is a transporter of an inorganic nutrient, namely ammonium. In fact, pathways 7-10 in Table 3 are identified as drug targets in various bacteria [14-16] and can be used as target pathways for designing drugs against diseases such as haemorrhagic colitis or haemolytic uremic syndrome caused by E. coli O157 [17].
From the above analysis it is clear that the set of non-interacting pathways is essential and plays a very important role in the survival, growth and differentiation of an organism. These pathways appear to have evolved in response to stress or environmental conditions, or to improve the defence mechanism of the organism. They therefore use carefully chosen enzymes and substrates that have unique properties and are not part of other complex networks of pathways, so that any change in those complex networks will not have an adverse impact on survival or on the defence mechanism. Such non-interacting pathways exist in every organism. Since no simple tool to identify them was available before this study, their importance and roles have not been studied in depth or utilised in engineering organisms or pathways. Identification of a set of non-interacting pathways along with the set of identical or Stage I pathways is the most important contribution of the simple metabolic pathway categorization study described here.
This metabolome categorization approach also allows the complex metabolic networks in the organism to be studied in detail. It is clear from Table 1 that only one network has the highest complexity, extending up to Stage V, while thirteen metabolic networks have four stages. To understand the complexity of the network, a case study of the pathways belonging to carbohydrate metabolism at Stage I is discussed. It can be seen from Table 1 that only four pathways, namely gluconeogenesis I, glycolysis I, pentose phosphate pathway (non-oxidative) and TCA cycle, can be grouped as carbohydrate metabolism.
These four pathways interact with 178 pathways, which are termed Stage II pathways. Out of these 178 pathways, only 53 pathways interact to give 50 Stage III pathways. From these 50 Stage III pathways, 8 pathways interact to give 9 Stage IV pathways, which do not interact further. The same is depicted in Figure 1. It is clear that there is a one to many mapping for Stage I to Stage II pathways (4→178), but for at least a few pathways of Stage II there is a many to one mapping to Stage III pathways (53→50), as well as for Stage III to Stage IV pathways (8→9). However, the analysis also points out one to one as well as many to many mappings of pathways in the complex metabolic networks. To understand this complexity, the networks of two pathways, glycolysis I and gluconeogenesis I, belonging to the Stage I/Core pathways in the group of carbohydrate metabolism, were studied.
It can be seen from Figure 2 that both these pathways interact through the compounds pyruvate and dihydroxyacetone phosphate with the Stage II pathways glycine betaine degradation and CDP-diacylglycerol biosynthesis I & II. These Stage II pathways interact through 1,2-diacylglycerol-3-phosphate and glycine betaine with the Stage III pathways phospholipid degradation, choline degradation and glycine betaine biosynthesis I. These three pathways, through choline and phosphatidylcholine interactions, give rise to a single pathway, phosphatidylcholine biosynthesis I (Figure 2). It can also be seen that the compound glycine betaine is common to two pathways, choline degradation I and glycine betaine biosynthesis I. The network and its complexity can be worked out to high precision using the above-mentioned approach for every pathway. Therefore, the effect of a pathway perturbation on the substructure as well as on the whole metabolome can be understood. Precise modification or engineering of the metabolome becomes possible because the connections among pathways and chemical/biochemical reactions are understood precisely. It must be kept in mind that these networks result from the assumption that the pathways given in Table 1 (see supplementary material) are at the lowest level. The occurrence of various types of complex networks, differing not only in hierarchy but also in mappings, indicates that interacting compounds are used with varied frequency. An in-depth analysis is necessary to understand the correlation between the number of times the same compound is used to connect different pathways and its chemical properties. The simple analysis of the complex network of pathways discussed above suggests that a few compounds are used more frequently in connecting pathways from a lower stage to a higher stage. Figure 3 shows the top 10 compounds from Stage I pathways that interact with the remaining pathways to form Stage II.
Pyruvate from four Stage I pathways interacts with more than 50 Stage II pathways, clearly demonstrating the many to many relationship. Similarly, acetyl-CoA from three Stage I pathways interacts with more than 40 Stage II pathways. As can be seen from Figure 3, there is then a sharp drop in the number of Stage II pathways that interact with compounds from Stage I. This points out that pyruvate and acetyl-CoA play a major role in connecting pathways from Stage I to Stage II. It is clear that removal of, or reduction in the supply of, a metabolite like pyruvate or acetyl-CoA will have an immense impact on the metabolome of E. coli O157. Many pathways will be affected at Stage II, and those that are connected to them at higher stages will therefore also be affected. The list of such pathways can be easily determined with the approach used here. In short, the approach of metabolic categorization discussed here pinpoints the role of compounds in connecting metabolic pathways and in the expansion of the network. One can also study the impact of the removal of a metabolite on metabolic organization by using metabolic categorization. A similar analysis for compounds from Stage II interacting to form Stage III pathways revealed no over-representation of a particular compound. Among the top 10 interacting compounds, three compounds, viz. dimethylbenzimidazole, UDP-D-glucose and menaquinol, which belong to one Stage II pathway, interact with five Stage III pathways. In this case a one to many relationship is evident. The situation did not differ when we studied compounds from Stage III interacting to form Stage IV. However, the compound KDO2-lipid IVA from Stage IV interacts to form the Stage V Lipid A-core pathway, an example of a one to one relationship, as it belongs to one Stage IV pathway and interacts with one Stage V pathway. This shows that as the level of categorization increases the metabolites become more pathway specific.
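The ranking of connecting compounds described above can be illustrated by counting, for each compound of a lower stage, how many pathways of the next stage contain it. The Python sketch below is a minimal illustration under stated assumptions: pathways are sets of compound names, the function name is hypothetical, and the toy pathways and their memberships are invented for the example and do not reproduce the counts in Figure 3.

```python
from collections import Counter


def compound_connectivity(pathway_compounds, lower_stage, upper_stage):
    """Count, per compound of the lower stage, the upper-stage pathways using it.

    pathway_compounds: dict mapping pathway id -> set of compounds.
    lower_stage, upper_stage: sets of pathway ids at Stage i and Stage i+1.
    """
    lower_compounds = set().union(
        *(pathway_compounds[p] for p in lower_stage)
    )
    counts = Counter()
    for pathway in upper_stage:
        # Only compounds that also occur at the lower stage connect the stages.
        for compound in pathway_compounds[pathway] & lower_compounds:
            counts[compound] += 1
    return counts


toy = {
    "glycolysis": {"pyruvate", "DHAP"},
    "tca": {"pyruvate", "acetyl-CoA"},
    "s2_a": {"pyruvate", "m1"},
    "s2_b": {"pyruvate", "acetyl-CoA"},
    "s2_c": {"DHAP", "m2"},
}
counts = compound_connectivity(
    toy, {"glycolysis", "tca"}, {"s2_a", "s2_b", "s2_c"}
)
print(counts.most_common(1))
```

Sorting the resulting counts in decreasing order gives a top-N list of the kind plotted in Figure 3; removing a highly ranked compound from the input sets and re-running the categorization is one way to explore the effect of metabolite removal.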
It is clear from the above analysis that the compounds that connect lower level pathways to higher level pathways become more specific as the level of pathway connectivity increases. For example, pyruvate interacts with fifty-one Stage II pathways through Stage I pathways. On the other hand, the highest connectivity for Stage II to Stage III is five and for Stage III to Stage IV is four, while one Stage IV pathway interacts with only one Stage V pathway. Morowitz [18] has identified compounds such as pyruvate, acetyl-CoA, L-glutamate, succinate and oxaloacetate as part of the core metabolome, and they are reported to have high biosynthetic potential [19]. It can be seen from Figure 3 that all these compounds interact with Stage I and Stage II pathways a large number of times, which supports the validity of the approach used in this study. In fact, as mentioned earlier, because the granularity is very high, this type of analysis allows the specific role of various metabolites in the categorization of the metabolome to be studied.

Conclusion:
A very simple approach to the categorization of the metabolome has helped to identify most of the primordial pathways. The categorization approach provides a framework to understand the global pattern of metabolic organization. Analysis of the metabolic categories revealed compounds with high connectivity that play an important role in maintaining the complex network of the metabolome. For the first time, the categorization has helped to identify the precise sub-group of pathways that may be affected by perturbations in a particular pathway. The categorization of pathways is useful in identifying choke point enzymes by analyzing the non-interacting pathways at each stage. These choke point enzymes can further be used to identify probable drug targets. The process of categorization, being simple and objective, has been used to categorize the metabolomes of ninety-four well-annotated bacteria, and the results are available at http://115.111.37.202/mpe/