GT-Miner: a graph-theoretic data miner, viewer, and model processor

Inexpensive computational power combined with high-throughput experimental platforms has created a wealth of biological information requiring analytical tools and techniques for interpretation. Graph-theoretic concepts and tools have provided an important foundation for information visualization, integration, and analysis of datasets, but they have often been relegated to background analysis tasks. GT-Miner is designed for visual data analysis and mining operations, interacts with other software, including databases, and works with diverse data types. It facilitates a discovery-oriented approach to data mining in which the user is encouraged to explore alterations of the data and variations of the visualization. The user is presented with a basic iterative process consisting of loading, visualizing, transforming, and then storing the resultant information. Complex analyses are built up through repeated iterations and user interactions. The iterative process is streamlined by automatic layout following each transformation and by maintaining a current selection set containing the elements modified by the transformations. Multiple visualizations are supported, including hierarchical, spring, and force-directed self-organizing layouts. Graphs can be transformed with an extensible set of algorithms or manually with an integral visual editor. GT-Miner is intended to give the non-expert easier access to visual data mining.

Availability:
The GT-Miner program and supplemental materials, including example uses and a user guide, are freely available from http://www.cifr.ncsu.edu/bioinformatics/downloads/


Background:
Contemporary biology faces the challenge of analyzing and integrating ever-accumulating high-throughput datasets to derive a coherent systems-based view of organisms [1]. Important challenges include relating genomic, transcriptomic, proteomic, and other data to infer the metabolic and regulatory networks that embody complex processes such as disease phenotypes. Graphs, structures containing nodes and edges linking the nodes, can be used to model biological systems [2], wherein entities such as genes, proteins, RNA elements, and metabolites serve as the nodes and experiment-specific relationships serve as the edges. Attributes of the graph, that is, defined properties or additional information associated with the nodes and edges, form additional dimensions of information. Exploration of a graph's properties and network topology can provide insight into a biological system's architecture and/or functioning.
From a systems biology perspective, software applications supporting visualization, exploration, integration, and analysis of disparate datasets are available, such as Cytoscape, VisANT, Osprey, and PathwayStudio [3, 4, 5, 6]; however, they can be economically and computationally expensive, may be restricted to specific computing platforms, may require significant specialist knowledge or have narrow utility, and may be constrained to handle information in specific forms. In the context of visual data mining for bioinformatics, frameworks for discovering and interpreting relationships, for characterizing graphs, and for graph-based visualization have been developed [7].

Implementation:
GT-Miner [8] integrates a graphical user interface (GUI), transformational analyses for modifying the graph structure and information content, visualization layout of the graph, direct editing of the graph, and storage access for graph representations of datasets. The program accepts data from text files, from applications like Microsoft Excel, or from databases like MySQL and PostgreSQL. The GUI supports user interactivity and graph visualization through multiple visual layouts, as well as multiple transformations for element filtering, merging of labeled graphs, and cluster analysis.
GT-Miner forms a lightweight, parsimonious framework wherein the graph and its associated attributes are the primary means for coupling information flow between software components. Much of the functionality is implemented as modules, each focusing on one part of an overall iterative analytical process. New transformations and layouts are added through a simple programming interface that gives direct access to the graph structure and to the Java Swing graphic display, and the extensions are incorporated into the framework through runtime configuration files. The base program and most of the plug-ins are written in the Java language. Visual layouts in the distributed software are based on GraphViz, and in-memory modeling of the graph is based on a modified version of Grappa. Database queries are performed using JDBC, thus enabling access to any database for which a JDBC driver is available, and result-table columns are mapped to graph elements by interpretation of the table's metadata. Since database access is critical for handling large volumes of information, a copy of Apache Derby, a SQL-92 compliant database, is included with the software distribution.
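The metadata-driven mapping of result-table columns to graph elements can be illustrated with a minimal sketch. It is not GT-Miner's actual code: the class and type names (`ResultTableMapper`, `EdgeRecord`) are hypothetical, and the convention that the first two columns name the end nodes of an edge while the remaining columns become edge attributes is an assumption made only for this example.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of GT-Miner's API. */
final class ResultTableMapper {

    /** One edge plus its attributes (illustrative type). */
    static final class EdgeRecord {
        final String source;
        final String target;
        final Map<String, String> attributes;

        EdgeRecord(String source, String target, Map<String, String> attributes) {
            this.source = source;
            this.target = target;
            this.attributes = attributes;
        }
    }

    /** Reads the column labels from JDBC metadata (JDBC columns are 1-indexed). */
    static List<String> columnLabels(ResultSetMetaData meta) throws SQLException {
        List<String> labels = new ArrayList<>();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            labels.add(meta.getColumnLabel(i));
        }
        return labels;
    }

    /**
     * Assumed convention: columns 1 and 2 hold the source and target nodes;
     * every further column becomes an edge attribute keyed by its label.
     */
    static EdgeRecord toEdge(List<String> labels, List<String> row) {
        if (labels.size() < 2 || row.size() != labels.size()) {
            throw new IllegalArgumentException("need two node columns and one value per label");
        }
        Map<String, String> attributes = new LinkedHashMap<>();
        for (int i = 2; i < labels.size(); i++) {
            attributes.put(labels.get(i), row.get(i));
        }
        return new EdgeRecord(row.get(0), row.get(1), attributes);
    }

    /** Walks a JDBC result set and maps every row to an edge record. */
    static List<EdgeRecord> read(ResultSet rs) throws SQLException {
        List<String> labels = columnLabels(rs.getMetaData());
        List<EdgeRecord> edges = new ArrayList<>();
        while (rs.next()) {
            List<String> row = new ArrayList<>();
            for (int i = 1; i <= labels.size(); i++) {
                row.add(rs.getString(i));
            }
            edges.add(toEdge(labels, row));
        }
        return edges;
    }
}
```

Because the mapping is driven by column labels rather than by a fixed schema, a query such as `SELECT gene_a, gene_b, weight FROM interactions` (a hypothetical table) would yield edges carrying a `weight` attribute without any change to the code.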

Utility and caveat:
Flexibility arises from maintaining a distinction between the visualization and analytical processes. The user can apply multiple transformations under a given visualization or, conversely, view the result of a given transformation under multiple visualizations. An unbounded set of attributes can be associated with the graph elements and used with the transforms to modify the graph's structure or visual presentation. Modifications can be saved for incorporation into additional analytical processes. Combining attribute-based transformations with the program's built-in support for visual editing of the graph through simple mouse gestures can greatly facilitate the discovery process (see Figure 1).
Bioinformatics source data is often represented in a variety of potentially incompatible formats, requiring burdensome reformatting of the information into an acceptable form. Our solution partially addresses the problem by decoupling the acquisition and preparation of data from the analytical data mining and visualization processes through two approaches for loading information, both external to the application: 1) support for common graph file formats like DOT, PHYLIP NEWICK, or GXL; and 2) acquisition of the information as tabular adjacency lists describing the nodes, edge relationships, and attributes. This allows the user to convert raw data, typically via a SQL selection expression, into a graph format without the need to extend the application program. Consequently, the information can originate from specialized applications such as phylogenetic analysis programs, or from more general sources like databases and spreadsheets. The final result is saved in the above file formats or in a database. The distribution includes a user guide and three complete tutorials covering phylogenetic ancestral recombination graphs [9], networks of gene duplications [10], and visualization of Gene Ontology annotations.
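As a rough illustration of an attribute-based transformation, the sketch below filters a graph by a numeric node attribute and drops any edge that loses an end node. The `Graph` type and the `keepNodesAtLeast` method are hypothetical stand-ins written for this example; they do not reproduce GT-Miner's plug-in interface or its Grappa-based graph model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical attribute filter for illustration; not GT-Miner's plug-in API. */
final class AttributeFilterSketch {

    /** Minimal attributed graph: node id -> attribute map, plus directed edges. */
    static final class Graph {
        final Map<String, Map<String, String>> nodes = new LinkedHashMap<>();
        final List<String[]> edges = new ArrayList<>(); // each entry is {source, target}

        void setAttribute(String node, String key, String value) {
            nodes.computeIfAbsent(node, k -> new LinkedHashMap<>()).put(key, value);
        }

        void addEdge(String source, String target) {
            nodes.computeIfAbsent(source, k -> new LinkedHashMap<>());
            nodes.computeIfAbsent(target, k -> new LinkedHashMap<>());
            edges.add(new String[] {source, target});
        }
    }

    /**
     * Returns a new graph holding only the nodes whose numeric attribute is
     * at least the threshold. Nodes lacking the attribute are dropped, and a
     * non-numeric value raises NumberFormatException. The input is not modified.
     */
    static Graph keepNodesAtLeast(Graph in, String attribute, double threshold) {
        Graph out = new Graph();
        for (Map.Entry<String, Map<String, String>> entry : in.nodes.entrySet()) {
            String raw = entry.getValue().get(attribute);
            if (raw != null && Double.parseDouble(raw) >= threshold) {
                out.nodes.put(entry.getKey(), new LinkedHashMap<>(entry.getValue()));
            }
        }
        // Keep an edge only if both of its end nodes survived the filter.
        for (String[] edge : in.edges) {
            if (out.nodes.containsKey(edge[0]) && out.nodes.containsKey(edge[1])) {
                out.edges.add(edge.clone());
            }
        }
        return out;
    }
}
```

Returning a new graph rather than editing in place keeps the original available, so the same input can be passed through several transformations or viewed under several layouts.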
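The two loading approaches are related, since a tabular adjacency list can be rewritten as a DOT file. The sketch below shows one way this might be done. It assumes, for illustration only, that the first two columns hold the source and target nodes, that the remaining column headers are valid DOT attribute names, and that the graph is directed; `AdjacencyListToDot` is a hypothetical name and the code is not taken from GT-Miner.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical converter for illustration; not part of GT-Miner. */
final class AdjacencyListToDot {

    /** Wraps a value in double quotes, escaping embedded double quotes for DOT. */
    static String quote(String value) {
        return "\"" + value.replace("\"", "\\\"") + "\"";
    }

    /**
     * Renders rows of an adjacency-list table as a directed DOT graph.
     * Assumed layout: column 0 = source node, column 1 = target node,
     * columns 2 and up = edge attributes named by the header row.
     */
    static String toDot(String graphName, List<String> header, List<List<String>> rows) {
        StringBuilder dot = new StringBuilder();
        dot.append("digraph ").append(quote(graphName)).append(" {\n");
        for (List<String> row : rows) {
            dot.append("  ").append(quote(row.get(0)))
               .append(" -> ").append(quote(row.get(1)));
            List<String> attributes = new ArrayList<>();
            for (int i = 2; i < header.size() && i < row.size(); i++) {
                // Header names are assumed to be plain identifiers, so they are not quoted.
                attributes.add(header.get(i) + "=" + quote(row.get(i)));
            }
            if (!attributes.isEmpty()) {
                dot.append(" [").append(String.join(", ", attributes)).append("]");
            }
            dot.append(";\n");
        }
        dot.append("}\n");
        return dot.toString();
    }
}
```

For example, a table with the header `source, target, weight` and the single row `geneA, geneB, 0.8` would be rendered as one edge statement, `"geneA" -> "geneB" [weight="0.8"];`, inside a `digraph` block.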