Computational Epigenetics: the new scientific paradigm

Epigenetics has recently emerged as a critical field for studying how non-gene factors can influence the traits and functions of an organism. At the core of this new wave of research is the use of computational tools that play critical roles not only in directing the selection of key experiments, but also in formulating new testable hypotheses through detailed analysis of complex genomic information that is not achievable using traditional approaches alone. Epigenomics, which combines traditional genomics with computer science, mathematics, chemistry, biochemistry and proteomics for the large-scale analysis of heritable changes in phenotype, gene function or gene expression that are not dependent on gene sequence, offers new opportunities to further our understanding of transcriptional regulation, nuclear organization, development and disease. This article examines existing computational strategies for the study of epigenetic factors. The most important databases and bioinformatic tools in this rapidly growing field have been reviewed.

Epigenetics is highly combinatorial in nature due to the array of diverse control elements.The human genome contains ~23,000 genes that are active in specific cells at precise moments.Cells control gene expression by wrapping DNA around clusters of core histone proteins to form nucleosomes [21], which are then organized into chromatin.
Changes to the structure of chromatin affect gene expression patterns: genes are inactivated when the chromatin is condensed, and they are expressed when chromatin is relaxed [9].These dynamic chromatin states are controlled by DNA methylation [22], histone modifications (e.g., methylation, acetylation, phosphorylation, sumoylation and ubiquitylation) [23], [24] and DNA-binding proteins (e.g., polycomb and trithorax group proteins) [25].Most of these epigenetic modification mechanisms have been shown to be regulated by noncoding RNAs (ncRNAs), such as microRNAs (miRNAs), small RNAs (guide RNAs, piRANS) and large RNAs, which play important roles in events including transposon activity and silencing, position effect variegation, X-chromosome inactivation and paramutation [26].
With the rapid increase in the number of new modification sites being discovered each year, it has been suggested that post-translational modification may affect almost every solvent-accessible histone residue, allowing a high level of variability for signal transduction events [21], [27].This enormous combinatorial complexity [28] requires an extraordinarily large number of experiments, such as DNA methylation profiling, for systematic studies.Already, a number of large-scale initiatives have been established for the systematic mapping of epigenomic and related data.These include projects by the Alliance for the Human Epigenome and Disease (AHEAD) Task Force [29], the ENCyclopedia Of DNA Elements (ENCODE) Project Consortium [30], the Human Epigenome Project (HEP) Consortium [31] and the Highthroughput Epigenetic Regulatory Organisation In Chromatin (HEROIC) Project Consortium [32].The huge quantity of experimental data generated by these and other projects requires appropriate bioinformatics infrastructure spanning general and specialist databases, basic bioinformatics tools and sophisticated algorithms for meaningful analysis, modeling and prediction of DNAprotein interactions.Pioneering efforts in the field of computational epigenetics have been reviewed by Bock and Lengauer [33].Here, we review major tools and resources that have been developed in this rapidly growing field, with special analysis on the latest trends and future directions.

Data sources for Epigenetic research:
Large amount of data relevant for epigenetic research are available in scientific literature, molecular databases and case reports.Scientific literature serves as the primary source of data, providing high-level descriptions of biological entities and processes.As of January 2009, PubMed contained over 288,000 records related to epigenetic research.This information exists in the form of unstructured free text that makes the extraction of biologically meaningful information difficult.As the amount of electronically accessible textual material accumulates, the quality of epigenetic research will be increasingly dependent on the ability to retrieve quality data to facilitate the discovery of new facts, interpretation of results, and design of experiments [34].
The number and size of molecular databases have been increasing steadily.A total of 1,078 molecular biology databases are currently (March 2009)  provides information about the precise distributions of specific genes and proteins throughout brain development.StemBase [47] details gene expression data of stem cells and derivatives from rat, mouse and human.The Gene Normal Tissue Expression (GeneNote) database [48] contains complete gene expression profiles in healthy human tissues (bone marrow, brain, heart, kidney, liver, lung, pancreas, prostate, skeletal muscle, spinal cord, spleen and thymus) using the Affymetrix GeneChip HG-U95 set.The BloodExpress database [49] details information about mouse blood cell expression profiles, including both progenitors and terminally differentiated cells, derived from array experiments and independent studies.Such information allows for the identification of dynamic changes in gene expression during cell differentiation down the hematopoietic hierarchy.Other data sources exist and have been reviewed elsewhere [50].

Computational tools for Epigenetic research:
Numerous computational, mathematical and statistical methods, ranging from data mining, sequence analysis, molecular interactions, to complex system-level simulations, have been reported in the literature.Efforts have been channeled into the text mining of epigenetic information, though development in this field is still at an early stage.Current efforts are primarily focused on the extraction and analysis of DNA methylation patterns in various cancer types A variety of methods for the modeling and prediction of DNA methylation patterns have been reported.An example is the use of linear discriminant analysis and artificial neural networks (ANN) for the classification of individual lung cancer cell lines [81].The use of support vector machines (SVM) for genomic mapping of methylation patterns for all 22 human autosomes has also been described [82].In recent years, there has been increasing focus on the development of computational technologies that facilitates the prediction of protein methylation sites.These include procedures based on support vector machines (SVM), hidden Markov models (HMM), ANNs, naïve bayes, logistic regression, K-nearest neighbors and decision trees [83], [84], [85].However, the implementation of such systems is difficult due to the lack of publicly available experimental data for model construction.Many such systems are currently focused on arginine and lysine methylations as their mechanisms are currently the best understood and experimental data is most readily available [84], [85].Epigenome prediction pipeline that integrates DNA methylation, polymerase II preinitiation complex binding, histone H3K4 di-and trimethylation, histone H3K9/14 acetylation, DNase I hypersensitivity and SP1 binding has also been reported [86].Currently, the relative value of such computational tools remains unknown.As experimental data becomes increasingly available, the usefulness of these technologies will become clearer and it should be expected that more integrative models will be made available and that current models will also be refined.

Computational analysis of histone modifications:
Histones are the main protein components of chromatin.They play a key role in the compaction and accessibility of eukaryotic and probably archaeal genomic DNA [41], and are subject to a wide variety of post-translational modifications including methylation, acetylation, phosphorylation, sumoylation and ubiquitylation [23], [24].Covalent modifications of the histone proteins take place primarily within the histone amino-terminal regions that protrude from the surface of the nucleosome as well as the globular core region [87], [88].Histone modifications may affect chromosome function via two distinct mechanisms [89].First, they may alter its electrostatic properties, resulting in a change in the histone structure or its DNAbinding activity.Second, they may generate binding surfaces for protein recognition modules, and help engage specific functional complexes to their relevant sites of action.
There is intense interest in the use of informatics for the analysis, modeling and prediction of histone modifications in DNA sequences.Cellular automata have been described for examining a variety of epigenetic modifications.For example, Sneppen and coworkers [90] reported the use of a simplified stochastic model to examine the conditions for bistability and heredity of nucleosome modificationbased epigenetic memory.The team demonstrated that robust bistability required cooperativity of two or more modified nucleosomes in the modification reactions, and that nucleosomes occasionally stimulate modifications beyond their neighbor nucleosomes.Comparative genomics has taken advantage of new technologies to help identify histone marks and regulatory elements in higher eukaryotic genomes.Schübeler and colleagues [91] have performed a genome-wide comparison of chromatin structures in higher eukaryotes.Their work revealed the existence of a binary pattern of histone modifications among euchromatic genes, with active genes being hyperacetylated at H3/4 and hypermethylated at H3, and inactive genes being hypomethylated and deacetylated at the same locations.Roh et al. [92] reported a genome-wide mapping technique to determine the distribution of lysine-9/14-diacetylated histone H3 in human peripheral T cells.The team showed that this form of chromatin modification is correlated with active gene promoters and with regulatory elements associated with gene expression.In a followup study, the team extended their work to the genome-wide screening of conserved and non-conserved enhancers by histone acetylation patterns [93].
Figure 1: The computational epigenetics paradigm.Existing data sources from epigenetic-related experiments are analyzed with computational strategies and methods for the simulation and prediction of epigenetic patterns.Computational data analysis generates new hypothesis and knowledge for the development of therapeutic, diagnostic and prophylactic clinical tools.
The design of machine-learning algorithms for locating histoneoccupied as well as acetylation, methylation and phosphorylation positions in DNA sequences has also been well reported [71], [94], [95], [96], [97].Some of these tools have also been applied to ChIPon-chip and ChIP-seq datasets.An example is the use of HMMs to infer the states of histone modification changes at each genomic position based on ChIP fragment counts [71].The use of wavelet analysis combined with HMMs for the discovery of activating and repressive histone modifications using selected ChIP-on-chip datasets from the ENCODE project was also reported [98].These algorithms allow the screening of histone marks in large sets of protein sequences, such as those encoded by the complete genomes of higher complexity organisms.In recent years, a number of structure-based techniques, including quantitative structure-activity relationship (QSAR) analysis [99], [100], homology modeling [101] and molecular docking techniques [102], for the design of epigenetic inhibitors were also described.Dynamic activities over the past two years have seen the development of at least five computational methods for the functional annotation of epigenetic factors [96], [103].These tools are particularly useful for the understanding of epigenetic events, both within specific cell types and in an evolutionary context.

Cancer informatics:
Cancer progression is a form of somatic evolution in which certain mutations provide cancer cells with a selective growth advantage.It is now known that DNA methylation patterns in cancers generally display more variation compared with that of normal tissues Genome-wide analysis of MeDIP data from colon (Caco-2) and prostate (PC3) cancer as well as several tumor cell lines have shown that tumor-specific methylated genes belong to distinct functional categories, possess common sequence motifs in their promoters and occur in clusters on chromosomes [107].Abnormal DNA methylation within CpG islands is among the most frequent form of alterations in cancers.Experiments have now shown that entire CpG islands may become aberrantly methylated in cancer [108], and is mechanistically linked to histone methylation [109].Bock and coworkers [110] have performed a detailed analysis of inter-individual stability and variations of DNA methylation profiles among healthy individuals using linear regression models and the EpiGRAPH web service (http://epigraph.mpi-inf.mpg.de/WebGRAPH/).This work showed that CpG islands may act collectively as emergent and bistable epigenetic switches for maintaining a CpG-island-wide 'on' or 'off' state.An example for tumor class prediction in human cancers was reported by Olek and colleagues [111], in which SVMs were trained to recognize the difference between T and B cell leukaemias and CD19+ B cells and CD4+ T cells obtained from healthy donors using a set of selected CpG sites.Other computational models for classifying cancer subtypes based on epigenetic marks were also reported.Specific examples include the use of SVMs for discriminating between acute lymphoblastic leukemia and acute myeloid leukemia [112], as well as the use of Manhattan distance and average linkage algorithms for hierarchical cluster analysis of human colorectal tumors [113].

Stem cell informatics:
Stem cells are unspecialized cells that can either renew themselves through mitotic cell division or undergo differentiation into more specialized cells [114].Two classes of mammalian stem cells are available: 1) embryonic stem (ES) cells, which are blastocyst-derived, pluripotent cells that can differentiate into all cell types except the extra-embryonic tissue [114], [115], and 2) adult stem cells, which act as a repair system for the body, replenishing specialized cells and regenerating damaged tissues [116].Recent studies have shown that DNA methyltransferases [117] and Polycomb/Trithorax group response elements (PRE/TRE) [118], [119] possess epigenetic signatures that are important for the differentiation of both human ES cells and germ line stem cells.Of particular interest is the revelation that stem cells are the target cells for cancer, and epigenetic changes may occur long before they are distinguishable as tumor cells [120].By unraveling the nature of epigenetic modifications, it is expected that this will lead to improved culture and differentiation technologies, as well as next-generation drugs that can directly manipulate stem cells in patients.
Bioinformatic analysis of epigenetic marks in stem cells is at its formative stages.Analyses of up-and down-regulated gene clusters provide valuable information on the effect of exogenous control on ES cell state in human.Stanford and colleagues [121] have recently performed temporal expression microarray analyses of ES cells after the initiation of commitment and integrated these data with known genome-wide transcription factor binding.This work revealed a repressive model of ES cell maintenance, and helped define the regulatory balance that is needed for maintaining ES cell state.Ringrose and coworkers [122] performed an analysis of PRE/TREs in the Drosophila melanogaster genome and defined the sequence criteria that distinguish PRE/TREs from non-PRE/TREs.Using a series of weighted motifs, the team identified 167 candidate PRE/TRE sequences, which map to genes involved in development and cell proliferation.Position-specific matrices for predicting cis-regulatory elements were also reported, and used for studying PRE/TREs in Drosophila melanogaster [123].

Conclusion:
Realizing the full benefits of the informatics revolution will require significant advances in the efficiency of which new data is discovered, processed, interpreted and made accessible to researchers.With the huge amount of epigenetic-related experimental data generated by high throughput methodologies, the future will witness a shift towards the computational epigenetics paradigm (Figure 1).With the paradigm shift, one crucial issue lies in effective data annotation and management.Currently, a centralized repository for epigenetic-related data is still lacking.Resources like such will greatly facilitate computational studies on epigenetics.Another challenge in the field of computational epigenetics lies in the efficient processing of experimental data, which includes normalization and interpretation of data across various experiments from different research groups.
To date, computational algorithms that model different aspects of epigenetic modifications [81]- [86] and disease [107]- [110] have been described.On the other hand, cellular automata have also been proposed for exploring a variety of epigenetic modifications [90].With the explosion in the number, variety and sophistication of resources and analysis tools, the challenge lies in integrating the strengths and not the weaknesses of each approach.The next few years will see increased interest in the use of cluster computing, central (cloud) computing and distributed systems for large-scale epigenetic data analysis and screening.Computing grid technologies harnessing the resources of multiple computers in a network have been developed rapidly to solve high-throughput scientific research problem [124].On the other hand, cloud computing technologies, which offers scalable resources on demand, have emerged in recent years to complement the rate of data output and drive the rate of data analysis and knowledge discovery [125].The different bioinformatic and mathematical modeling approaches, in combination with advances in computational infrastructures, clearly could lead to improved understanding of epigenetic modifications at multiple levels of complexity, from the sub-cellular molecular level, to the cellular and systems level, and beyond.More importantly, research efforts in computational analysis, identification and classification of variations in epigenetic modifications contribute to further understanding of epigeneticassociated diseases and consequently, the design of relevant diagnostic, therapeutic and prophylactic tools.One exciting possibility, based on the highly combinatorial nature of epigenetics, is the cataloguing of individuals' genome and epigenome and the development of personalized or more specific drugs with lower toxicity and less side effects, paving the way for personalized medicine and a new era of "personal"-omics [126].
[43],[44].Traditional sequence analysis tools, such as ClustalW [51], BLAST software suite [52], BLAT (BLAST-Like Alignment Tool) [53] and MEGA (Molecular Evolutionary Genetics Analysis) [54], allow for the inference of functional, structural, or evolutionary relationships between DNA or protein sequences.Such methods are employed in diverse applications, and have been applied to homology searches of ortholog candidates for the KEGG/GENES database [55], predicting the secondary structures of histone deacetylases [56], homology modeling of DNA methyltransferases [57], and optimizing the activities of histone deacetylase inhibitors [58].Computational models have been used extensively to support various epigenome mapping initiatives such as chromatin immunoprecipitation (ChIP)-on-chip [59], ChIP-seq [60] and bisulfite sequencing [61].ChIP-on-chip is a microarray-based platform that allows the identification of DNA-protein binding sites on a genome-wide level [59].The main computational tools that have been developed for ChIP-on-chip analysis are focused on the identification of ChIP enrichment sites.Examples include Chromatin ImmunoPrecipitation On Tiled arrays (ChIPOTle) [62], TileMap [63] and Ringo [64].ChIPseq is a variant of ChIP-on-chip that uses high-throughput DNA sequencing for detecting differences between sample and control DNA [60].Although such an approach requires minimal data processing and allows analysis to be made directly from sequence read counts [65], a critical issue that needs to be resolved is the accurate mapping of short sequence reads to the reference genome.Algorithms that can identify regions of similarity between sequences such as BLASTN [52] and BLAT [53] are valuable for speeding up this process.Efforts are also channeled into the development of specialized algorithms for shortread assembly.Examples include QPalma [66] and AMOScmp [67].Bisulfite sequencing [61] employs the use of bisulfite treatment of DNA to detect cytosine methylation patterns.Computational tools that focus on bisulfite sequencing include methods for data processing and quality assessment.The basic methods for bisulfite sequence analysis allow the quantitative measurement of cytosine methlylation levels [68], estimating the effectiveness of bisulfite treatment [68], and visualization of results [69].Collectively, the developed algorithms enable the analysis of DNA methylation patterns of different tissue types [70] and the genome-wide comparison of histone modification sites identified by various epigenome mapping initiatives [71].Computational analysis of DNA methylation: DNA methylation plays an important role in the regulation of genomic stability and cellular plasticity [72].It is essential for normal cell development and is associated with numerous fundamental processes including genomic imprinting [73], X-chromosome inactivation [74], maintenance of repetitive elements [75] and carcinogenesis [76].DNA methylation is mainly accomplished by the transfer of methyl groups from S-adenosyl-methionine to the 5 position of the cytosine pyrimidine ring in a reaction catalyzed by a DNA methyltransferase or methylase [77].In mammals, four active DNA methyltransferases (DNMT) have been identified, namely DNMT1, 2, 3A and 3B [78], [79].DNMT1 is the most commonly found DNA methyltransferases in mammals, and predominantly methylates hemimethylated CpG dinucleotides.DNMT2 has been identified as a DNA methyltransferases homolog that methylates cytosine-38 in the anticodon loop of aspartic acid transfer RNA instead of DNA [80], while, DNMT3A and DNMT3B are de novo methyltransferases that act on both hemimethylated and unmethylated CpG sites [78], [79].
[104].Several studies have shown that aberrant methylation occurs in a tumor type-specific manner [105].A number of cancer epigenetic projects are currently underway to identify novel methylation patterns that correlate with progression to malignancy.An example is the CancerDip Consortium, a research initiative funded by the 7th Framework Programme for Research and Technological Development of the European Commission (FP7), which employs the use of Methyl-DNA immunoprecipitation (MeDIP) assays for identifying methylation patterns in different tumor types and the epigenetic machinery involved in establishing these abnormal patterns [106].

Table 1 in supplementary material). DNA
described in the Nucleic Acids Research online Molecular Biology Database Collection [35].These include 3 nucleotide sequence databases, 60 databases on transcriptional regulatory sites and transcription factors, 65 databases on microarray data and gene expressions, and 114 databases on human genes and diseases.The international collaborative GenBank [36], DNA Data Bank of Japan (DDBJ) [37] and EMBL [38] serve as worldwide repositories for nucleotide sequences of different origins.A number of epigenetic databases have been reported.We have reviewed some of these databases (

[41], [42] serves as
a central data source for histones and histone fold-containing proteins.Cancer methylation databases are valuable for analyzing irregular methylation patterns that are correlated with various cancers.Major data sources include PubMeth [