Omics Metadata Management Software (OMMS)

Next-generation sequencing projects have underappreciated information management tasks requiring detailed attention to specimen curation, nucleic acid sample preparation and sequence production methods required for downstream data processing, comparison, interpretation, sharing and reuse. The few existing metadata management tools for genome-based studies provide weak curatorial frameworks for experimentalists to store and manage idiosyncratic, project-specific information, typically offering no automation supporting unified naming and numbering conventions for sequencing production environments that routinely deal with hundreds, if not thousands, of samples at a time. Moreover, existing tools are not readily interfaced with bioinformatics executables (e.g., BLAST, Bowtie 2, custom pipelines). Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate intuitive, consistent metadata, and to perform analyses and information management tasks via an intuitive web-based interface. Several use cases with short-read sequence datasets are provided to validate installation and integrated function, and to suggest possible methodological road maps for prospective users. The examples highlight possible OMMS workflows for metadata curation, multistep analyses, and results management and downloading. The OMMS can be implemented as a stand-alone package for individual laboratories, or can be configured for web-based deployment supporting geographically dispersed projects. The OMMS was developed using an open-source software base, and is flexible, extensible and easily installed and executed. The OMMS can be obtained at http://omms.sandia.gov.

We developed the Omics Metadata Management Software (OMMS), a flexible, extensible, open-source, web-based tool that provides semi-automated curation utilities and integration with widely used bioinformatics executables, such as BLAST [6] and Bowtie [7], for human-microbiome-oriented research in our laboratories [8]. Example use cases with publicly available human microbiome and chimpanzee RNASeq datasets [9, 10] are detailed to demonstrate OMMS function and versatility, and its operation as a pipeline frontend.

Figure 1:
Omics Metadata Management Software. Core functionality resides in three tables, "Specimen Information," "Sample Processing," and "Sequence MetaInformation," which have fields with embedded automation supporting efficient metadata entry and storage, and intuitive entity relationships facilitating data sharing and analysis. These tables are accessed via the "MetaData" portal.

Omics Metadata Management Software
The OMMS was engineered to support a large, multidisciplinary, geographically dispersed research team developing next-generation sequencing-based approaches for identifying potentially rare etiologic agents in human microbiomes across hundreds of distinct sample types [8]. The OMMS graphical user interface (GUI) enables semi-automated, project-specific metadata entries in each table (Figures 1 & 2; Table 1, 3-5 (see supplementary material)), and associated input sequence data are referenced to archiving locations in directories generated on Linux-based file systems. Point-and-click selection of input sequence files, analysis configuration and execution are carried out in the "Analysis" portal (Figure 3; Table 2 (see supplementary material)). In the examples provided here, BLAST, Bowtie 2, TopHat and Cufflinks were integrated with the OMMS interface [6, 7, 10]. In principle, any open-source application, including in-house pipelines, can be integrated with the OMMS using custom scripts developed for that purpose.

Methodology: Creating a record
The three main portals are displayed after login. Tables for detailed biological curation ("Specimen Info," "Sample Processing," "Sequence MetaInfo") reside in the "MetaData" portal. To create an entry, select "Create New," then click on "New (Empty fields)" in the "Specimen Info" table, and provide the required information (Figure 1; Table 1 (see supplementary material)). The following values (in parentheses) were entered: Host Species (Homo sapiens); Tissue Sampled (Stool). Click the "Add Specimen" button to generate the "Specimen Unique ID" (HsStoo_01). For the second record, repeat the previous steps, but insert "Pan troglodyte" and "Brain" for the host and tissue, respectively, to generate the "Specimen Unique ID" (PtBrai_02).
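The two identifiers generated above (HsStoo_01, PtBrai_02) suggest a simple naming pattern: the genus and species initials, the first four letters of the tissue, and a two-digit record number. The Python sketch below reproduces that pattern for illustration only. The rule is inferred from these two examples rather than taken from the OMMS source (which is PHP-based), and the function name `specimen_uid` is hypothetical.

```python
def specimen_uid(host_species: str, tissue: str, record_number: int) -> str:
    """Build a specimen identifier such as 'HsStoo_01'.

    Assumed rule (inferred from the examples in the text, not from the
    OMMS source): genus initial in upper case, species initial in lower
    case, first four letters of the tissue, an underscore, and a
    zero-padded two-digit record number.
    """
    genus, species = host_species.split()[:2]
    prefix = genus[0].upper() + species[0].lower()
    tissue_code = tissue.strip()[:4].capitalize()
    return f"{prefix}{tissue_code}_{record_number:02d}"


# Values as entered in the walkthrough above.
print(specimen_uid("Homo sapiens", "Stool", 1))    # HsStoo_01
print(specimen_uid("Pan troglodyte", "Brain", 2))  # PtBrai_02
```

A deterministic rule of this kind is what allows identifiers to stay consistent across users and sites without manual naming.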

Figure 2:
Unified framework for metadata management and state-of-the-art analyses. Curation (highlighted in aqua) and analysis (indicated in yellow) tasks are intrinsically related (overlap region) in next-generation sequencing studies, because sample handling and sequence production are multistep processes, and careful metadata tracking and management are required for downstream analyses and publication preparation. The OMMS supports user input of project metadata, automated creation of consistently named and enumerated unique identifiers for specimens, samples and sequence production information, and straightforward integration with bioinformatics utilities. Spreadsheets can be generated for structured data extraction and local download. Standard input and output of the executables used here are stored in automatically generated files and directories.

Corresponding records are generated in the "Sample Processing" table, where dropdown menus streamline curation by explicitly linking the "Specimen Unique ID" (e.g., HsStoo_01, PtBrai_02) and the user-defined "Sample Alias" with the new sample entries. "Sample Unique ID" entries are generated by clicking "Add Sample" (e.g., HsStoo_01_01, PtBrai_02_01). Four corresponding records were created in the "Sequence MetaInfo" table to complete the curation exercise and to illustrate functional integration with executables (Table 2 (see supplementary material)). In the "Provider Sequence Directory Name" field, arbitrary directory names were given; for "Fastq File Mate Pair 1" and "Fastq File Mate Pair 2," test input file names were used; and appropriate options were chosen in the "Read Type" field for testing in this order: Bowtie 2 (single-end, then paired-end, with human microbiome stool fastq files), BLAST (with human microbiome stool fasta input), and TopHat and Cufflinks (with the single-end chimp RNASeq file). The "Sequence Run ID" and "Unique Experiment Name" were generated by clicking "Add Sequence." In each of the tables, the "Update" function can be used to extend curation. Methods for generating and downloading custom metadata tables are further detailed in the "OMMS Integrated Workflow" link (under Quick Start).
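The example identifiers indicate that each level extends its parent with a two-digit counter (HsStoo_01, then HsStoo_01_01, then HsStoo_01_01_01), so a sequence run ID carries its full provenance. The following Python sketch illustrates that idea under the assumption that this suffixing rule holds in general; `child_uid` and `split_run_id` are hypothetical helper names, not OMMS functions.

```python
def child_uid(parent_uid: str, ordinal: int) -> str:
    """Append a zero-padded counter to a parent identifier (assumed rule)."""
    return f"{parent_uid}_{ordinal:02d}"


def split_run_id(sequence_run_id: str) -> dict:
    """Recover the parent identifiers embedded in a sequence run ID.

    Assumes the four-part layout seen in the text:
    <specimen code>_<specimen no.>_<sample no.>_<run no.>
    """
    parts = sequence_run_id.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected identifier format: {sequence_run_id!r}")
    return {
        "specimen_uid": "_".join(parts[:2]),
        "sample_uid": "_".join(parts[:3]),
        "sequence_run_id": sequence_run_id,
    }


sample_uid = child_uid("HsStoo_01", 1)  # HsStoo_01_01
run_id = child_uid(sample_uid, 1)       # HsStoo_01_01_01
print(split_run_id(run_id))
```

Because the parent identifiers can be read directly from the run ID, a results file labeled with it can be traced back to its sample and specimen records without a separate lookup table.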

Figures 3A-C:
Omics Metadata Management Software (OMMS) curation and analysis interface. The OMMS was designed to integrate with open-source bioinformatics tools, such as BLAST, Bowtie 2 and TopHat, and/or custom pipelines. These tools are accessed via the "Analysis" portal (panel A). End users select the identifier ("Sequence Run ID") of interest, which is referenced to particular sequence files (panel B and inset). Following input selection, the desired program is chosen and parameterized (panel C and inset) to launch a run. Output from a given analysis run can be downloaded via the OMMS "Results" portal (not shown).

Enabling integrated workflows
To call integrated executables via the "Analysis" portal, click on "Select Input" for the relevant Sequence Run ID (e.g., HsStoo_01_01_01), and then choose the desired program (Figure 2). To launch a Bowtie 2 run on "Sequence Run ID" HsStoo_01_01_01, select the "Staphylococcus_aureus" index from the dropdown menu, enter an integer value in "Processors," and click "Go." The results file name will appear, and standard output can be downloaded upon run completion. The same steps apply for paired-end analyses with Bowtie 2. For BLAST runs, select the pertinent Sequence Run ID and input (HsStoo_BLAST500.fa), and choose the desired program (blastn) and database (Clostridium kluyveri). Set the expectation (E) value significance threshold at 0.001 or higher, indicate the desired output format in the dropdown menu, and click "Go." Similar steps are followed for splice-variant and/or differential expression analyses using TopHat and Cufflinks (Figure 3; Table 2 (see supplementary material)). In the example here, the chimp "Sequence Run ID" was selected (PtBrai_02_01_01, referenced to RNASeq file SRR023838_RNASeq.fq) and aligned with the hg19 index [7]. Standard output can be downloaded via the "Results" portal by choosing the "Sequence Run ID" of interest (e.g., HsStoo_01_01_01). The website provides additional instructions for building integrated analyses (in the "OMMS Integrated Workflow" link under Quick Start).
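For readers who want to relate the portal selections to the underlying tools, the Python sketch below assembles roughly equivalent Bowtie 2 and blastn command lines. It is an illustration under stated assumptions, not the code the OMMS runs: the helper names, the read file `HsStoo_reads.fq`, the database path `Clostridium_kluyveri`, and the output file names are hypothetical, whereas the index name, the BLAST input file and the E-value come from the text. Only standard Bowtie 2 (`-x`, `-U`, `-1`, `-2`, `-p`, `-S`) and BLAST+ (`-query`, `-db`, `-evalue`, `-outfmt`, `-out`) options are used.

```python
import shlex


def bowtie2_command(index, out_sam, processors, reads=None, mate1=None, mate2=None):
    """Return a Bowtie 2 argument list for single- or paired-end input."""
    cmd = ["bowtie2", "-x", index, "-p", str(processors)]
    if mate1 and mate2:
        cmd += ["-1", mate1, "-2", mate2]  # paired-end reads
    elif reads:
        cmd += ["-U", reads]               # single-end (unpaired) reads
    else:
        raise ValueError("provide either reads or both mate files")
    return cmd + ["-S", out_sam]


def blastn_command(query, database, out_file, evalue=0.001, outfmt=6):
    """Return a blastn argument list; outfmt 6 is tabular output."""
    return ["blastn", "-query", query, "-db", database,
            "-evalue", str(evalue), "-outfmt", str(outfmt), "-out", out_file]


# Hypothetical file names, labeled with the Sequence Run ID from the text.
bt2 = bowtie2_command("Staphylococcus_aureus", "HsStoo_01_01_01.sam", 4,
                      reads="HsStoo_reads.fq")
bl = blastn_command("HsStoo_BLAST500.fa", "Clostridium_kluyveri",
                    "HsStoo_01_01_01_blastn.tsv")
print(shlex.join(bt2))
print(shlex.join(bl))
```

The commands are only built and printed here; on a machine with the tools and indexes installed, each list could be passed to `subprocess.run(cmd, check=True)`.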

Software: Design and function
Interoperable, open-source software packages (i.e., the LAMP bundle: Linux, Apache, MySQL, PHP) were used to develop the browser-based OMMS interface to support next-generation sequencing-based research efforts in our laboratories [8]. Real-world metadata associated with the test datasets were entered in the three tables, "Specimen Info," "Sample Processing," and "Sequence MetaInfo" (Figure 1; Table 1, 3-5 (see supplementary material)), to instantiate example database records. Most of the fields in the tables accommodate varied data types, such as character strings (e.g., the "Sample_Alias" field in the "Sample_Processing" table), but in cases with fewer possibilities, dropdown menus are provided (e.g., the "Nucleic Acid" field in the "Sample Processing" table).
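The relationship among the three tables can be pictured as a chain of foreign keys. The Python sketch below uses the standard-library `sqlite3` module as a stand-in for MySQL. The column names are simplified from the field names mentioned in the text, while the column types, the permitted "Nucleic Acid" values, the sample alias and the read file name are assumptions made for illustration, not the actual OMMS schema.

```python
import sqlite3

# Simplified, hypothetical schema: three tables linked by unique identifiers.
SCHEMA = """
CREATE TABLE specimen_info (
    specimen_uid   TEXT PRIMARY KEY,
    host_species   TEXT NOT NULL,
    tissue_sampled TEXT NOT NULL
);
CREATE TABLE sample_processing (
    sample_uid   TEXT PRIMARY KEY,
    specimen_uid TEXT NOT NULL REFERENCES specimen_info(specimen_uid),
    sample_alias TEXT,                                      -- free text
    nucleic_acid TEXT CHECK (nucleic_acid IN ('DNA', 'RNA')) -- dropdown-like
);
CREATE TABLE sequence_metainfo (
    sequence_run_id        TEXT PRIMARY KEY,
    sample_uid             TEXT NOT NULL REFERENCES sample_processing(sample_uid),
    read_type              TEXT,
    fastq_file_mate_pair_1 TEXT,
    fastq_file_mate_pair_2 TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Example records; the alias and file name are placeholders.
conn.execute("INSERT INTO specimen_info VALUES (?, ?, ?)",
             ("HsStoo_01", "Homo sapiens", "Stool"))
conn.execute("INSERT INTO sample_processing VALUES (?, ?, ?, ?)",
             ("HsStoo_01_01", "HsStoo_01", "stool_prep_A", "DNA"))
conn.execute("INSERT INTO sequence_metainfo VALUES (?, ?, ?, ?, ?)",
             ("HsStoo_01_01_01", "HsStoo_01_01", "single-end", "reads.fq", None))

# Walk from a sequence run back to its specimen via the linked identifiers.
row = conn.execute(
    """
    SELECT sp.host_species, sp.tissue_sampled, sa.nucleic_acid, se.read_type
    FROM sequence_metainfo AS se
    JOIN sample_processing AS sa ON sa.sample_uid = se.sample_uid
    JOIN specimen_info AS sp ON sp.specimen_uid = sa.specimen_uid
    WHERE se.sequence_run_id = ?
    """,
    ("HsStoo_01_01_01",),
).fetchone()
print(row)
```

The `CHECK` constraint plays the role of a dropdown menu by restricting a field to a short list of values, whereas free-text fields such as the sample alias are left unconstrained.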

Test datasets for validating installation and benchmarking integrated tools
Distinct hosts and tissues (Homo sapiens, Pan troglodyte; Stool, Brain) were used to demonstrate automated metadata tracking and storage, and functional integration with utilities [9, 10]. Test datasets were obtained from the GenBank Short Read Archive (accessions SRX025177: SRR063480 and SRX008322: SRR023838), and were pre-processed using the NCBI SRA Toolkit, the FASTX-Toolkit, and in-house custom scripts. These pre-processing steps are explained in the README file included in the distribution and in the "OMMS Integrated Workflows" link (see the Supplemental Materials for fine details pertaining to curation and pre-processing steps).

Semi-automated curation and results downloading
After the minimum required information (indicated by asterisks) is entered for a specimen, the OMMS generates a unique identifier under the "Specimen_UID" field (Figures 1 & 2; Table 1 & 3 (see supplementary material)) describing the subject/host and tissue/microhabitat from which nucleic acid preparations and sequence data will be derived. Unique identifiers are automatically propagated to corresponding fields in the other tables (Figures 1 & 2), intuitively linking specimen, sample and sequence data. Input sequence data files can be uploaded, and results files (output) downloaded, as can metadata for specific entries, as well as custom table-overview spreadsheets (Figures 2 & 3).
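As a rough illustration of what a custom metadata spreadsheet amounts to, the Python sketch below serializes a user-chosen subset of fields to CSV text. The function name `records_to_csv` and the record layout are hypothetical, and the OMMS export format itself is not described in the text, so this should be read as an analogy rather than a description of the software's output.

```python
import csv
import io


def records_to_csv(records, columns):
    """Serialize the selected metadata columns of each record as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore",  # skip unselected fields
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Example records using the values entered earlier in the walkthrough.
records = [
    {"Specimen_UID": "HsStoo_01", "Host Species": "Homo sapiens",
     "Tissue Sampled": "Stool"},
    {"Specimen_UID": "PtBrai_02", "Host Species": "Pan troglodyte",
     "Tissue Sampled": "Brain"},
]
print(records_to_csv(records, ["Specimen_UID", "Tissue Sampled"]))
```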

Concluding Remarks:
The freeware reported here provides standardized, intelligible, automated curation and management of biological metadata, and supports integrated analyses. Recent events, from the outbreak of Ebola Virus Disease in West Africa to the emergence of antibiotic-resistant bacteria (e.g., Clostridium difficile, Carbapenem-resistant Enterobacteriaceae), make it impossible to overstate the importance of rigorous metadata curation and management systems in high-intensity scenarios, clinical and otherwise. For our project, the OMMS frontend was foundational for handling metadata inherent to next-generation sequencing-based experiments involving large numbers of samples at a time. In the context of a host microbiome, potential etiologic agents are typically rare and difficult to detect using standard in silico and experimental approaches, and careful metadata curation is crucial for identifying signal (infectious disease) in the presence of overwhelming noise (background microbiota), and for interpreting results. Looking ahead, the OMMS and user-tailored versions of it will be promising, easy-to-use tools for microbiome-centric research, from clinical and public health challenges to new frontiers in agricultural research and development, where handling and tracking hundreds, if not thousands, of samples from diverse subjects at a particular location and time are necessary. Additionally, the OMMS enables development of integrated workflows with state-of-the-art utilities (BLAST, Bowtie 2) and in-house pipelines (e.g., local implementations of Galaxy), facilitating fine-grained comparative analyses, such as strain discrimination (e.g., Zaire vs. Sudan ebolavirus) and microbiome composition and functional profiling.

Table 1: Overview of tables with embedded automation supporting specimen, sample and sequence data curation and storage. The OMMS supports next-generation sequencing projects, which invariably have substantial metadata management overhead. Core software functionality resides in the three interoperable tables, which are accessed via the "MetaData" portal.

custom pipelines. Sequence files are referenced to automatically generated metadata entries (e.g., "Sample Unique ID", "Sequence Run ID"), and these files can be used as input for programs accessed via the "Analysis" portal. The OMMS also generates directories to store results (output) for each run of a program. Output can be accessed and downloaded via the "Results" portal.