Towards reconstructing a metabolic tree of life.

Using information from several metabolic databases, we have built our own metabolic database containing 434 pathways and 1157 different enzymes. We have used this information to construct a dendrogram that demonstrates the metabolic similarities between 282 species. The resulting species distribution and the clusters defined in the tree show a certain taxonomic congruence, especially in recent relationships between species. This dendrogram is another representation of the tree of life, based on metabolism that may complement the trees constructed by other methods. For example, the metabolic dissimilarity we demonstrate between Symbiobacterium thermophilum (previously defined as Actinobacteria) and the other Actinobacteria species, and the metabolic similarity between S. thermophilum and Clostridia, combined with other evidence, suggest that S. thermophilum may be re-classified as Firmicutes, Clostridia.


Background:
For many years phylogenetic trees have been used to study the evolution of organisms.Since Charles Darwin first described the evolution of species as a tree, scientist have attempted to create a tree that could represent a hierarchical classification of all known species based on their evolution and at the same time provide information about extinct species and the common ancestry shared by known species.When sequencing technologies were developed, the use of taxonomic marker molecules such as the small subunit ribosomal RNA seemed sufficient to draw consistent phylogenetic trees.Studies using genes or protein sequences led to a classification of microorganisms and recognised the Archaea as the third domain of life.[1] When whole genome sequences of prokaryote organisms became available, everyone hoped that this extended information would help them to build more accurate phylogenies but it was then discovered that different genes produced different trees.It was at this point that doubts were raised as to whether a tree structure was the best representation of evolution.
[2] Simultaneously, the discovery that horizontal gene transfer events (HGT) between species was more common than previously suspected [3, 4] put a strain on the search for the "true tree".[5] After all, the gene used in a phylogenetic study may very well have been acquired from an organism that was in no way a direct ancestor.[6] In view of the above, some scientists have started to consider that evolution is perhaps better represented by a network than by a tree.[7] Studies have also begun into new ways of creating a universal tree of life.If taking a single gene had become insufficient for consistent tree representation, now that hundreds of whole genomic sequences are available, new phylogenomic methods are being developed.
[8] As it is difficult to align the sequences of two genomes, several methods that use traditional sequence alignment tools have been developed to construct genome trees.[8, 9, 10] These methods involve concatenating the homologous sequences from different gene families to construct a single tree [9, 10,11] or comparing different trees to create a supertree.[12] Another way to describe the relationships between genomes is to use their gene repertoire.[13] New methods based on gene order or gene content have therefore been developed.[10] The main problem with these methods is the imbalance in the number of genes between small and large genomes.Two large genomes that are not phylogenetically closely related can have more common genes than a large and a small genome that are closely related.Measures to prevent this must be taken so that the phylogenetic tree does not become biased.[10] Genome trees seem to reveal a phylogenetic signal that supports the three-domain evolutionary scenario and the relationships between some clades of Bacteria.However, deep-level prokaryotic relationships are difficult to infer.[12] We have developed a new method for constructing a genome tree based on the metabolic pathways present in each species.The main structure of the metabolic pathways seems to be largely unaffected by HGT. [14] This enables us to use them as templates for comparing genomes.Using the orthologous groupings of enzymes found in the KEGG database, we have related genomes and metabolic pathways and created a tree-like representation of a fairly large group of organisms based on their metabolism.

Methodology:
Our aim was to create a dendrogram of different eukaryotic and prokaryotic species based on metabolic data.Here we detail the characteristics of the process used:

Database creation
Starting from the metabolic maps available in the KEGG: Kyoto Encyclopedia of Genes and Genomes [15] (http://www.genome.ad.jp/kegg/) and the MetaCyc [16] (http://www.metacyc.org)databases, we defined a representative group of pathways and introduced into our database the enzymes that catalyse each of the reactions that form every pathway by their KO number as defined in KEGG.Since a same pathway can follow slightly different routes in different organisms, we

Percentage matrix
The next step was to relate the data found in our database to a group of organisms.We used the complete genomes found in the KEGG database.For each organism, we created a list of enzymes codified in the genome, listed by their K number.Since the KEGG database is still growing and new genomes are being introduced, some of them still did not have all their KEGG numbers assigned.So, we compared the number of proteins with an assigned KEGG number to the total number of proteins coded in each genome.Those organisms in which the assigned number of proteins in the KEGG database was less than 20 percent were excluded from the list of organisms used to build the dendrogram.Finally we took 282 organisms which are listed in Table 1 (supplementary material) with their abbreviation.Using information from the metabolic database we had previously created, we searched in each genome for the enzymes that completed each pathway.To do so, we made a PERL script that calculated the percentages of enzymes that appeared in a pathway for each organism.The results were presented in a matrix whose rows were the pathways, whose columns were the organisms analysed and in which each element represented the percentage of enzymes of a pathway that one organism contains.

Dendrogram construction
By calculating the Pearson Correlation with the enzyme percentages of all pathways for each pair of organisms, we transformed the percentage matrix into a distance matrix containing the metabolic distance between each pair of organisms.From this distance matrix, and using the PAUP* program version 4.0, we built a dendrogram using the neighbour-joining (NJ) algorithm.This dendrogram graphically represents the relationships between organisms based on their metabolism.We also built the dendrogram with the UPGMA algorithm, but this dendrogram was fairly similar to the one obtained by NJ.

Bootstrap calculation
To verify the dendrogram obtained, we developed a new method based on bootstrap calculations to check how robust each cluster was.From the primary percentage matrix, this method creates a certain number of distance matrices (a thousand in our case) by randomly selecting the metabolic pathways and allowing repetition.Using this group of matrices, we followed the same process as before and obtained a thousand trees.Using the consense program of the Phylip package, we calculated a consensus tree using the majority rule extended option with default parameters.The number of times each node is repeated indicates how reliable that cluster is.

Discussion: Dendrogram based on metabolism
To ensure that the method developed was suitable for creating a dendrogram that would take into account at least the most basic taxonomic classification, we used it on 282 organisms (9 Eukaryota, 23 Archaea and 250 Bacteria) from the KEGG database.The evolution based on metabolic pathways is represented in the dendrogram in Figure 1.To make comparison easier, we have coloured the branches according to the taxonomic classification of their organism and classified the organisms into fourteen groups.These groups, which differ in size, were defined by taking into account the clusters observed in Figure 1 and their bootstrap values.The result of the groupings and the taxonomic group to which each organism belongs are shown in Table 1 (supplementary material).In general, although this dendrogram does not follow the taxonomic classification perfectly, some large clusters encompass taxonomically related organisms while others appear as mixed clusters.Here we comment two causes that may lead to the grouping of mixed taxonomic clusters.

Reduced genomes
All Archaea are clustered together separately from the bacterial cluster, the only exception is Nanoarchaeum equitans Kin4-M (neq).Unlike the other Archaea we used to construct the dendrogram, this organism is an obligate symbiont.[17] It appears clustered with most of the intracellular or obligate parasites with a small genome found in our dendrogram (groups 4, 5 and 6).Parasitic organisms have reduced genomes, which means that their metabolic capacity has been lowered to a certain degree.This could explain the clustering of several parasite species even though they are phylogenetically distant.In a tree based on metabolic information, therefore, it should not be surprising to find that the only symbiont Archaea clusters with other parasites due to their particular metabolic characteristics.

Metabolic similarity
The firmicutes are grouped in two main groups, Lactobacillales (Group 9) and Bacillales (Group 10).Between these two groups there are smaller groups of other Firmicutes, one of which contains the Clostridia Thermoanaerobacter tengcongensis (tte) and Clostridium tetani (ctc) with two other organisms that do not belong to the Firmicutes phylum: Symbiobacterium thermophilum (sth) and Fusobacterium nucleatum (fnu).The location of F. nucleatum among Firmicutes can be explained by their shared metabolic pathways.[18] Despite being gram negative, F. nucleatum has been found to be more similar to gram positive bacteria than to gram negative ones.This is also true of S. thermophilum.The 16S ribosomal DNA-based phylogeny suggested that this bacterium belongs to an unknown taxon in the gram-positive Actinobacteria [19], even though the traditional Gram-stain result indicates that it is gram negative.
[20] Also, the proteins of S. thermophilum show a greater similarity to the proteins found in Firmicutes organisms, in particular to T. tengcongensis, than to those found in Actinobacteria.[20] The metabolic similarity between S. thermophilum and T. tengcongensis shown in figure 1 and the metabolic dissimilarity between S. thermophilum and the other Actinobacteria, combined with previous evidences [20, 21], suggest that S. thermophilum may be re-classified as Firmicutes, Clostridia.[21] Metabolic influence Not all kinds of metabolism influence our dendrogram in the same way.In Table 2 (supplementary material) we can see a distribution of the enzymes found in the defined groups in the different metabolic groups.For example, Carbohydrate Metabolism has much more influence on the dendrogram than Energy Metabolism, simply because it has many more enzymes and pathways.Also, some of these enzymes are not very useful for classifying organisms into the different clusters.A clear example is the enzymes that catalyse the reactions that produce the different Aminoacyl-tRNAs as they are present in nearly Table 2 (under supplementary material) also shows that for several groups some kinds of metabolisms stand out because of the high number of enzymes they possess compared to the main number of enzymes that the metabolic group has in all organisms.For example, Lipid metabolism in Metazoa (Group 13).This is explained by the presence of pathways such as the synthesis of Lecitin or Cholesterol.The contrary is also true.Some groups have fewer enzymes than most.Examples of this are the three parasitic groups (Group 4, 5 and 6).In their low enzyme values, we can clearly see the effects of genome evolutive reduction due to their parasitic nature.

Limitations of metabolic-based methods
By their nature, metabolic pathways databases are humandefined and may be quite inexact, especially when a metabolic pathway found in one species is generalized to another.Several alternative pathways that have not yet been discovered surely exist in different organisms.Therefore, when only one or a few enzymes from a metabolic pathway are missing in one species, an orthologous gene displacement needs to be considered before we can conclude that the pathway is incomplete.Moreover, when a new sequenced genome is annotated, a high percentage of its proteins are not mapped to any pathway.It may therefore be argued that metabolic databases, while extremely useful for reconstructing metabolic properties of organisms, cannot be used to reconstruct the tree of life.However, we have shown that, assuming that any metabolic prediction of a large group of organisms is still incomplete, the phylogenetic signal that it contains partially agrees with the taxonomic information of the species.A metabolic dendrogram of different species can therefore be used as an additional criterion that may help to correctly re-classify some species, as in the case of the Symbiobacterium thermophilum we described earlier.

Conclusion:
We have developed a new method for constructing a dendrogram based on metabolic comparisons between species whose genome has been fully sequenced.Although the evolutionary signal that can be derived from metabolic data is not very strong, it is enough to obtain a rough sketch of the known taxonomic classification.We expect that the reconstruction of metabolic dendrograms may improve as more pathways are discovered and their enzymes are properly situated within those pathways.Until such a time metabolic-based dendrograms may be a useful addition when they are combined with other phylogenetic methods, allowing us to fine-tune dubious classifications that can not be accurately described by other methods.

Figure 1 :
Figure 1: Dendrogram created from metabolic pathways by neighbour joining.The small squares represent nodes with more than 750 repetitions in the bootstrap analysis.The triangles are nodes with more than 900 repetitions.Taxonomic groups are marked by the same colouring: Actinobacteria in purple, Archaea in red, Bacteroidetes in green, Chlamydiae in pink, Cyanobacteria in pale blue, Deinococcus-Thermus in cyan, Eukaryota in dark blue, Firmicutes in yellow, Proteobacteria in orange, Spirochaeta in grey, and others in black.

Table 2 :
Metabolic influence over the different clusters.For each group of organisms the mean number of enzymes involved in each kind of metabolism and the percentage of enzymes that belong to a determined metabolism in comparison to the total number of enzymes used to create the dendrogram are shown.Green and red numbers or percentages indicate a group of organisms that has more or lower enzymes of a kind of metabolism than most of the other groups.* Sus scrofa (ssc) was excluded from these data because of the lack of KEGG numbers on important metabolic protein