Data mining analysis of human gut microbiota links Fusobacterium spp. with colorectal cancer onset

Gut microbiota and their metabolites play a vital role in colon health and disease. Accumulating evidence suggests that the gut microbiota contributes to the risk of colorectal cancer (CRC). However, the role of specific microbial communities, together with their metabolites, in the risk, initiation and progression of CRC is still unknown. Hence, we used Bayesian networks in combination with IDA (Intervention calculus when the DAG is Absent) to generate a graphical model that allows causal relationships to be inferred from observational data. Results from the analysis of publicly available datasets showed that four species (Fusobacterium, Citrobacter, Microbacterium and Slackia) have estimated non-null lower bounds of causal effects on CRC. These findings support the hypothesis that specific bacterial species (microbial markers) act in concert with locally modified microbiota to cause or influence CRC progression. Additional comprehensive studies are required to validate the potential use of F. nucleatum, Citrobacter and Slackia as microbial biomarkers in CRC for prevention, diagnosis, prognosis and/or therapeutics.


Background:
In an era marked by the emergence of complex diseases such as cancer, diabetes, obesity and cardiovascular diseases, with their heavy economic and social burden, a better understanding of the complexity, causes and progression of these diseases is essential to tailor more effective approaches for both prevention and therapy [1]. The advent of high-throughput technologies has allowed a better understanding of how each individual's genomic makeup interacts with his/her environment and lifestyle to foster health or initiate disease. These systems medicine approaches, which rely mainly on advanced metagenomics and bioinformatics tools, have revealed the importance of the microbiome (as a living component of the environment) in both health and disease [2]. Human microbiota is reported to be implicated in ~20% of human malignancies [3]. Accumulating evidence supports the relationship between infective agents and human cancer, suggesting an important role of bacterial species in the pathogenesis of cancer [4][5][6]. Studies have reported an association between certain bacterial species and different types of cancers [7,8].
©Biomedical Informatics (2019)
The human gut hosts a large, diverse and dynamic microbiota shaped by each individual's genomic background, environment and lifestyle, mainly diet and exercise. The imbalance of gut microbiota homeostasis (dysbiosis) is reported to induce several diseases [9]. Recent works have provided mechanistic evidence for the involvement of gut microbiota in the onset of CRC [10][11][12]. The mechanisms through which microbes initiate carcinogenesis can be assessed by different parameters including DNA damage, inflammation, cell proliferation and migration, activation of pro-carcinogenic pathways or production of genotoxins [13,14]. Many bacterial species have been shown to be involved in these mechanisms. Among these species is S. gallolyticus subspecies, which colonizes and invades colon tumors, leading to enhanced tumor growth through inflammatory signalling [8]. Another species is enterotoxigenic Bacteroides fragilis, which alters the structure and function of colonic epithelial cells by enhancing Wnt/β-catenin signaling, leading to increased colonic carcinoma cell proliferation and expression of the proto-oncogene MYC [15]. Escherichia coli strains carrying the polyketide synthase (pks) island induce double-strand DNA breaks through the production of the genotoxin colibactin, thereby promoting colon carcinogenesis [16]. Moreover, Fusobacterium spp. induce an expansion of myeloid-derived immune cells in the tumor microenvironment and upregulate inflammatory genes in colon tumors [17].
It is still not clear whether a specific species, a microbial community, or both could initiate the above-mentioned biological events and promote CRC. Moreover, dietary metabolites derived from gut microbiota were reported to affect colon health and play a significant role in the aetiology of CRC. Some studies indicated that microbial metabolites such as phenol, ammonia, and primary and secondary bile acids could act as pro-carcinogens in the colon, whereas other metabolites were found to have a positive effect by suppressing inflammation, inhibiting proliferation, and modulating differentiation and gene expression in colonic epithelial cells. These metabolites include short chain fatty acids (SCFA) such as butyrate, acetate and propionate, as well as polyphenolic compounds. It is possible that hormonal secretion or the production of bacteriocins acts on the balance of this community, depending on the host physiology, as a response to the host's diet or ingested pharmaceuticals. Despite this, no signature of distinct bacterial colonization patterns in CRC patients has been established so far. The standard methods to discover these patterns are based on the statistical analysis of diversity and community structure between individuals (healthy vs patients) and reflect enrichment in tumor biopsies versus healthy tissues. However, these results rely on classical statistical analyses such as pairwise comparisons (t-test) or regression techniques, which are neither sufficient to model the complexity of these mechanisms nor enough to establish a causal relationship between species prevalence and cancer. Most importantly, these methods are unable to determine which species, among those that show different abundances in tumors and healthy tissues, are actually causal in CRC.
The input dataset has p+1 variables: p quantitative variables, which are the abundances of p species, and one binary response variable Y (tumour/non-tumour or patient/healthy).
In the first phase, the causal structure in the form of a CPDAG (Completed Partially Directed Acyclic Graph) is learnt from data by applying the PC algorithm. We used partial correlation as a conditional independence test for the PC algorithm. However, when the number of nodes in the CPDAG was large, we applied the theorem cited by Le et al. (2013) [20] to reduce the search space of possible DAGs. In the second phase, the do-calculus is used to estimate the causal effect of a given species 'i' on the target variable 'Y'. In our case, the causal effect of a species 'i' is estimated as the binary logistic regression coefficient of 'Y' on the species and its parents in the DAG. This coefficient is the lower bound of the causal effect of species 'i' on 'Y' [19]. If a species had a lower bound >0, we concluded that it was a causal agent; otherwise (if the lower bound was 0), we could not exclude a causal effect (Figure 1).
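The partial-correlation test used in the first phase can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the significance level of 0.05 and the toy variables A, B and C are assumptions made for illustration only. The sketch computes partial correlations with the standard recursive formula and applies Fisher's z-transform to decide whether two variables are conditionally independent.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(data, i, j, cond=()):
    """Partial correlation of variables i and j given the variables in cond."""
    if not cond:
        return pearson(data[i], data[j])
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def is_independent(data, i, j, cond=(), alpha=0.05):
    """Fisher z-test: True if i and j look independent given cond."""
    cond = tuple(cond)
    n = len(data[i])
    r = partial_corr(data, i, j, cond)
    r = max(min(r, 0.999999), -0.999999)  # keep the logarithm finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return stat <= NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical toy chain A -> B -> C built from mutually orthogonal patterns,
# so that A and C are correlated but independent once B is known.
a = [1, -1] * 32
noise_b = [1, 1, -1, -1] * 16
noise_c = [1, 1, 1, 1, -1, -1, -1, -1] * 8
b = [x + e for x, e in zip(a, noise_b)]
c = [x + e for x, e in zip(b, noise_c)]
data = {"A": a, "B": b, "C": c}

print(is_independent(data, "A", "C"))          # marginal test
print(is_independent(data, "A", "C", ("B",)))  # test conditional on B
```

In the PC algorithm, an edge between two nodes is removed as soon as some conditioning set makes such a test accept independence; in the toy chain this would remove the edge between A and C.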

Algorithm:
Step 1: First, we divided the dataset by category (Normal vs Cancer) and identified the differential abundance of taxa across categories. We assumed that taxa with little or no change in abundance between categories play a minimal role in the biological processes, and these were therefore omitted.
Let X1, . . . , Xm represent the taxa abundances and 'Y' the status category (binary variable: normal versus cancer). We have a dataset for the (m+1) variables.
Step 2: We used the PC algorithm to estimate the CPDAG G of the (m+1) variables and the conditional dependencies among the variables. We used partial correlation as the conditional independence test for the PC algorithm, as partial correlations are easy to compute in a high-dimensional dataset.
Step 3: We estimated the causal effect of each taxon on the status category. In principle, all possible DAGs in the CPDAG can be enumerated and the causal effects estimated with each DAG. However, when the number of nodes in the CPDAG was large, we reduced the search space of possible DAGs.
Step 4: Output the causal effects of the taxa. The outcome of Step 3 was treated as an array of multisets, one per taxon, each containing all estimated causal effects of that taxon on the status. From each multiset, we selected the causal effect with the smallest absolute value and reported it as the causal effect of the taxon on the status.
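Steps 3 and 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used here: the function names, taxon labels and effect values are hypothetical, and the check that a candidate orientation creates no new v-structure (which a full IDA implementation performs) is omitted. Each candidate parent set would give one regression-based effect estimate, and the value with the smallest absolute value is reported.

```python
from itertools import combinations


def candidate_parent_sets(parents, undirected_neighbours):
    """Parent sets for one taxon: its directed parents in the CPDAG plus any
    subset of its undirected neighbours (validity check omitted in this sketch)."""
    base = tuple(sorted(parents))
    neighbours = sorted(undirected_neighbours)
    return [base + extra
            for k in range(len(neighbours) + 1)
            for extra in combinations(neighbours, k)]


def smallest_absolute_effect(effects):
    """Step 4: pick the effect with the smallest absolute value in the multiset."""
    return min(effects, key=abs)


# Hypothetical taxon with one directed parent and two undirected neighbours.
parent_sets = candidate_parent_sets({"TaxonP"}, {"TaxonU", "TaxonV"})

# Made-up effect estimates, one per candidate parent set (Step 3 would obtain
# them by regressing Y on the taxon and each parent set).
effects = [0.31, 0.12, -0.27, 0.40]
bound = smallest_absolute_effect(effects)

print(parent_sets)
print(bound)
```

Taking the minimum over all candidate parent sets is what makes the reported value a lower bound: the true DAG is one member of the equivalence class, so its effect cannot be smaller in absolute value than the selected one.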

Description of the datasets used:
Only two independent datasets related to colorectal cancer, in which the abundance of specific taxa was available, were used for applying the IDA method. In the first study by Marchesi et  The significance of differences in the abundance of specific taxa was assessed by t-test in R. To reduce the number of variables, a binary logistic regression was applied to the significantly associated species. The latter were used as the input for the IDA method.
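The t-test screening described above was run in R; the following Python sketch only illustrates the idea. It assumes Welch's unequal-variance form of the test (the default of R's t.test), and the taxon names, abundance values and the cut-off of 2.0 on the absolute t statistic are hypothetical. A real analysis would compare the statistic with the t distribution to obtain a p-value.

```python
import math
from statistics import mean, variance


def welch_t(normal, cancer):
    """Welch t statistic and degrees of freedom for two abundance samples."""
    va = variance(normal) / len(normal)
    vb = variance(cancer) / len(cancer)
    t = (mean(cancer) - mean(normal)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(normal) - 1) + vb ** 2 / (len(cancer) - 1))
    return t, df


# Hypothetical abundance table: taxon -> (normal samples, cancer samples).
abundances = {
    "TaxonA": ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    "TaxonB": ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]),
}

# Keep only taxa whose abundance differs clearly between the two categories.
selected = [name for name, (normal, cancer) in abundances.items()
            if abs(welch_t(normal, cancer)[0]) > 2.0]

print(selected)
```

Only the taxa retained by such a screen would then enter the logistic regression and, finally, the IDA analysis.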

Results: A network model representation of the relationship between CRC microbiota species:
First study: For the first study of Marchesi et al., the network structure obtained is illustrated in Figure 2. All estimated lower bounds of causal effects of genera on cancer status were 0 except for Fusobacterium (0.23), Citrobacter (0.21) and Slackia (0.16).

Second study:
In the second study of Zeller et al., over 1500 species were reported. We first selected only those showing a significant association with cancer by t-test. The list was then refined by keeping only the species showing significant associations with cancer in a multivariate binary logistic regression. The resulting shortlist of species was used as input for IDA. The graph structure generated also highlighted the importance of Fusobacterium spp. as a causal species involved in colorectal cancer, with the highest regression coefficient, followed by Microbacterium testaceum and Slackia (0.19, 0.16 and 0.13, respectively; Figure 3). The other species (Parabacteroides distasonis, Desulfovibrio vulgaris, Bifidobacterium adolescentis, Bacteroides fragilis, Cronobacter) seemed to play a role in the carcinogenesis of CRC, but with a moderate effect. Across the two independent analyses, the common hubs of the two networks were Fusobacterium spp. and Slackia. In fact, these two species seemed to be involved together in the biomolecular perturbations causing CRC.

Pathways enrichment:
Our results indicated that there was substantial evidence that the four species detected by the IDA method had causal effects in carcinogenesis pathways. Pathway information captures knowledge of biological processes at the molecular level and, through pathway enrichment analysis, is an important tool for interpreting the growing amount of biological data. Here, we used the BioCyc software (https://biocyc.org/) to search pathways across the four species [23]. Since different strains were available for these species, only strains described in human colorectal cancer and available in biocyc.org were chosen: Fusobacterium nucleatum animalis 11_3_2; Microbacterium testaceum StLB037; Slackia exigua ATCC 700122 and Citrobacter koseri ATCC BAA-895.
This refinement allowed pathways to be divided into groups based on their biological functions and on the classes of metabolites that they produce and/or consume. Here, we reported the 7 classes of pathways generated by the BioCyc database for each taxon (Table 1). We counted only the pathways shared between organism pairs. Interestingly, Fusobacterium was shown to be the main causal bacterium in the two graphs generated by the IDA method for the two studies. Furthermore, we found that the numbers of pathways shared between Fusobacterium and the other organisms ranked in accordance with the regression coefficients calculated via the IDA method. In fact, F. nucleatum animalis 11_3_2 shared 145, 144 and 108 pathways with C. koseri ATCC BAA-895, M. testaceum StLB037 and S. exigua ATCC 700122, respectively (Table 2).
So far, not all molecular interactions in a pathway have been associated with a corresponding known gene that has been completely identified in the human genome. This explains why pathway holes may exist. They may represent true enzymatic or metabolic functions in the organism for which the corresponding human gene has not yet been identified, or they could represent false positive pathway predictions, or particular cases in which the pathway in this organism differs slightly from the reference pathway in MetaCyc. Table 3 counts all the pathway holes in each organism database and classifies pathways based on their number of pathway holes. Fusobacterium possessed a high percentage of pathways with holes. We sought to explain this pathway prediction through a data-driven analysis of functional homology. We found only 342 orthologous proteins shared among the four species (Figure 4).
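Counting the pathways shared between organism pairs amounts to intersecting pathway sets. The short Python sketch below illustrates this with placeholder data: the organism labels and pathway identifiers are hypothetical, and the retrieval of pathway lists from BioCyc is not shown.

```python
from itertools import combinations


def shared_pathway_counts(pathways_by_organism):
    """Number of pathways shared by every pair of organisms."""
    return {(a, b): len(pathways_by_organism[a] & pathways_by_organism[b])
            for a, b in combinations(sorted(pathways_by_organism), 2)}


# Placeholder pathway sets; real sets would come from the organism databases.
pathways = {
    "OrganismA": {"PWY-1", "PWY-2", "PWY-3"},
    "OrganismB": {"PWY-1", "PWY-2"},
    "OrganismC": {"PWY-1", "PWY-4"},
}

counts = shared_pathway_counts(pathways)
for pair, n_shared in counts.items():
    print(pair, n_shared)
```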

Figure 4:
Individual and shared distribution of orthologue proteins in the four microbial species.

Discussion:
Human microbiota is nowadays recognized as a key player in wellness and disease. Despite the huge progress in OMICs technologies and systems biology, the comprehensive study of microbial species types, abundance, diversity, interactions, and contribution to health and disease is still challenging [24,25]. More powerful tools and integrative, multidisciplinary approaches are required to demystify such complexity in order to set up a more predictive, preventive, participative, precise and cost-effective systems medicine that enhances the wellness of society and prevents complex and chronic diseases [26]. Increasing evidence suggests a possible role for certain bacteria in colorectal carcinogenesis [27,28]. However, the association of the composition of the microbiome with host factors such as gender, age, smoking, diet, exercise, and oncogene polymorphisms is still poorly understood [27, 28, 17, 11]. In fact, the statistical relationships obtained with most computational methods reveal associations or correlations but not causality, which is yet to be established. In this study, we used an alternative approach, namely the IDA method [19,18], which can reveal causal relationships between bacteria and CRC status. Results obtained from the IDA analysis of the two previously described studies were in agreement with previous association studies conducted in CRC. In fact, metagenomic analysis showed a significant enrichment of Fusobacterium spp. and particularly F. nucleatum. This bacterium enriched in CRC was significantly more abundant in adenoma CRC patients compared to healthy individuals. This strain of Fusobacterium was associated with a specific human expression profile, which was not shared by other bacterial species present in high numbers in the colon [29]. Furthermore, F. nucleatum appeared to be the dominant phylotype positively associated with lymph node metastases [30].
The abundance of Fusobacterium has been previously correlated with the expression of myeloid-associated genes as well as with NF-κB-driven inflammatory genes in human CRC [31]. It has been shown that Fusobacterium expanded myeloid-derived immune cells, which inhibit T-cell proliferation and induce T-cell apoptosis in murine models [31]. Our model allowed the discussion of disease causation scenarios and supported the hypothesis that a limited number of bacterial species act in concert with locally modified gut microbiota to cause CRC. One could envision a keystone species, or microbial driver, that recruits a consortium of disease-facilitating microbes to initiate the biological events causing CRC. However, our model could be improved by adding confounding factors (e.g. gender, BMI, age, smoking, diet, exercise) to validate the potential use of F. nucleatum, Citrobacter and/or Slackia as early detection and/or diagnostic microbiota markers in CRC screening. Our results are in line with several recent evidence-based studies linking the causality, onset and progression of CRC to a severe imbalance of the gut microbiota [14][38][39][40].
Taken together, these findings lay a foundation and provide insights for future studies to develop strategies for CRC prevention, screening (for diagnosis) and treatment through targeting the driver gut microbiota. A better understanding of the diversity, interactions and roles of gut microbial communities is crucial to demystify the underlying causes of complex and chronic diseases and to promote innovative proactive approaches to alleviate their burden on human health, society and the economy. The interpretation of these studies also requires a better understanding of interindividual variations, the heterogeneity of microbial communities along and across the GI tract, functional redundancy, and the need to distinguish cause from effect in states of dysbiosis [39].

Conclusion:
This pioneering study provided supportive evidence for the crucial role of specific gut microbial species and their metabolites in CRC risk and onset. Further comprehensive studies using both high-throughput metagenomic approaches and advanced computational tools are required to demystify host-microbiota interactions and enhance our understanding of their impact on gut integrity, microbiota homeostasis and human health.