EuDBase: An online resource for automated EST analysis pipeline (ESTFrontier) and database for red seaweed Eucheuma denticulatum

Functional genomics has proven to be an efficient tool for identifying genes involved in various biological functions. However, functional genomics resources for the commercially important seaweed Eucheuma denticulatum are still limited. EuDBase is the first online seaweed repository that provides integrated access to ESTs of Eucheuma denticulatum generated from samples collected from Kudat and Semporna in Sabah, Malaysia. The database stores 10,031 ESTs that were clustered and assembled into 2,275 unique transcripts (UTs), comprising 1,320 consensus sequences and 955 singletons. Raw data were automatically processed using ESTFrontier, an in-house automated EST analysis pipeline. The data are stored in a MySQL database. The web interface is implemented in PHP and allows browsing and querying of EuDBase through a search engine. Data are searchable via BLAST hit, domain search, Gene Ontology or KEGG Pathway. A user-friendly interface allows sequences to be identified using either a simple text query or a similarity search. EuDBase was developed to store, manage and analyze the E. denticulatum ESTs and to provide a cumulative digital resource for the global scientific community. EuDBase is freely available from http://www.inbiosis.ukm.my/eudbase/.


Background:
Species of Eucheuma occur throughout the Indo-Pacific region from East Africa to Guam, in Chinese and Japanese waters, and mostly in algal reef areas of islands in Southeast Asia. In Malaysia, E. denticulatum is commonly farmed in Kudat and Semporna in the state of Sabah. E. denticulatum is also known as "spinosum", a trade name indicating its production of iota-carrageenan. It has been the focus of many studies because the polysaccharides that constitute its cell wall are unlike those found in plants. Currently, there is no dedicated database for the expressed sequence tag (EST) data of E. denticulatum, even though interest in seaweed has increased globally because of its economic value. ESTs are especially important for organisms whose genome sequences are not available, and they can be used as a basis for structural genomic annotation. Until now, only Ectocarpus siliculosus (a brown alga) has had its genome fully sequenced [1]. We aim to generate as many ESTs from E. denticulatum as possible and to use the encoded information to reveal interesting information on the biosynthetic pathway of iota-carrageenan.
Bioinformatics analysis has been carried out to facilitate the finding of interesting biological information. Table 4 (see supplementary material) shows the output for domain analysis using InterProScan. We used Blast2GO for functional annotation of the E. denticulatum ESTs, and 1,935 Gene Ontology terms were assigned to 399 UTs. Blast2GO used the 5 best hits from the BLASTX results to annotate each UT sequence and annotated 823 GO terms under biological process, 578 under molecular function, and 488 under cellular component (Table 5, see supplementary material). We also mapped the E. denticulatum UTs onto KEGG pathways to observe their interactions. Using the Blast2GO KEGG pathway mapping functionality, with complementary support from KOBAS, 57 unique pathways were mapped with E. denticulatum ESTs; 100 UTs mapped to the plant hormone biosynthesis pathway and 99 UTs to the phenylpropanoid biosynthesis pathway. Table 6 (see supplementary material) lists the 10 most abundant pathways that were mapped with E. denticulatum UTs.
The EuDBase web interface enables users to perform keyword searches and to browse the database. Users can combine keywords with Boolean operators such as AND, OR and NOT to perform complex queries.
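As an illustration of how Boolean keyword queries of this kind can be translated into a database query, the following is a minimal sketch in Python. It is not the EuDBase implementation, which is written in PHP on MySQL; here an in-memory SQLite database stands in for MySQL, and the table name `annotation`, its columns, the function names and the three example rows are all hypothetical. Operators are applied with standard SQL precedence (AND binds more tightly than OR), and NOT is interpreted as AND NOT.

```python
import sqlite3

ALLOWED_OPERATORS = {"AND", "OR", "NOT"}


def build_query(terms):
    """Build a parameterised SQL query from (operator, keyword) pairs.

    For the first pair only NOT has an effect; for later pairs the
    operator joins the keyword to the preceding ones (NOT means AND NOT).
    """
    if not terms:
        raise ValueError("at least one keyword is required")
    parts, params = [], []
    for position, (operator, keyword) in enumerate(terms):
        operator = operator.upper()
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        if position > 0:
            parts.append("OR" if operator == "OR" else "AND")
        negation = "NOT " if operator == "NOT" else ""
        parts.append(f"description {negation}LIKE ?")
        # Keywords are passed as bound parameters, never pasted into the SQL.
        params.append(f"%{keyword}%")
    sql = ("SELECT ut_id FROM annotation WHERE "
           + " ".join(parts) + " ORDER BY ut_id")
    return sql, params


def search(connection, terms):
    """Run the Boolean keyword search and return the matching identifiers."""
    sql, params = build_query(terms)
    return [row[0] for row in connection.execute(sql, params)]


def demo_connection():
    """Create a toy database; the rows are invented, not EuDBase records."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE annotation (ut_id TEXT, description TEXT)")
    connection.executemany(
        "INSERT INTO annotation VALUES (?, ?)",
        [("UT_A", "protein kinase"),
         ("UT_B", "hypothetical protein kinase"),
         ("UT_C", "RNA-binding protein")])
    return connection


if __name__ == "__main__":
    conn = demo_connection()
    # "kinase NOT hypothetical"
    print(search(conn, [("AND", "kinase"), ("NOT", "hypothetical")]))
```

Binding the keywords as parameters rather than concatenating them into the SQL string keeps user input from being interpreted as SQL, which matters for any publicly accessible search form.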
EuDBase includes a local BLAST server that enables BLAST searches against the EuDBase assembled UTs and translated peptides using the appropriate BLAST subprogram, such as BLASTN, BLASTP or BLASTX (Figure 4).
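The choice of subprogram follows from the type of the query and of the target database: BLASTN compares a nucleotide query with a nucleotide database, BLASTP a protein query with a protein database, and BLASTX a translated nucleotide query with a protein database. The sketch below, in Python, shows one way such a selection could be made. It is an illustrative assumption rather than the EuDBase server code; the function names are hypothetical, and the sequence-type guess is a simple heuristic that would misclassify a peptide composed only of the letters A, C, G, T, U and N.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

# (query type, database type) -> BLAST subprogram named in the text
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
}


def guess_sequence_type(sequence):
    """Heuristically classify a raw sequence as nucleotide or protein."""
    letters = {char for char in sequence.upper() if char.isalpha()}
    if not letters:
        raise ValueError("empty sequence")
    # Only A, C, G, T, U or N present: treat the query as nucleotide.
    return "nucleotide" if letters <= NUCLEOTIDE_LETTERS else "protein"


def choose_blast_program(query, database_type):
    """Pick the subprogram for a query against a database of a given type."""
    query_type = guess_sequence_type(query)
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no listed subprogram for a {query_type} query "
            f"against a {database_type} database") from None


if __name__ == "__main__":
    # A nucleotide query against the translated peptides (a protein database)
    print(choose_blast_program("ATGGCCAAGT", "protein"))
```

In this scheme the assembled UTs would be a nucleotide database and the translated peptides a protein database; a protein query against the UTs is rejected because the text lists no subprogram for that combination.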

Future developments:
Eventually, EuDBase will incorporate E. denticulatum proteomics, transcriptomics and metabolomics data and will be integrated with a genome browser. The server will be periodically upgraded for faster access and to accommodate the growing volume of data.

Conclusion:
EuDBase is the first online resource for red seaweed that allows easy data integration and retrieval, with the aim of providing a tool to expand knowledge of E. denticulatum functional genomics.

Authors' contributions:
ZAMH formulated the study, directed the work and wrote the manuscript. RAZA worked on the preliminary development of the database. LKK continuously developed, implemented and managed the database and analysis pipeline. RO

Figure 2: EST analysis pipeline in EuDBase, known as ESTFrontier.

Figure 3: Data mining route in EuDBase. There are three main branches for mining in EuDBase.

Figure 4: Snapshots of the EuDBase web interface. EuDBase top page with links to Browse and Search. A) Keyword search results with links to the sequence summary report; B) Consensus sequence summary report; C) Browse EuDBase by raw sequences, singletons, consensus, protein domain, Gene Ontology (GO) and KEGG pathway; D) Raw EST sequence summary report; E) Singleton sequence summary report; F) Consensus sequence summary report.

[2], RepeatMasker, CAP3 [3], ESTScan [4], FrameDP [5], BLASTX [6], InterProScan, InterPro2GO, BLAST2GO [6], AutoFACT [7] and KOBAS [8]. A comprehensive spreadsheet report in Excel format is generated as the output file.

Figure 1: A database model of EuDBase

Utility:
E. denticulatum EST statistics in EuDBase:
To date, we have uploaded 9,057 high-quality ESTs to the GenBank EST repository. We present the E. denticulatum EST database (EuDBase), which consists of EST data, functional annotation and metabolic pathway assignments. The content of EuDBase will continue to increase in parallel with the EST sequencing effort carried out at It also provides comparative data for analyses of organisms that have no comparable genomic resources. EuDBase also links to the ESTFrontier pipeline for comprehensive EST data analyses.
ISSN 0973-2063 (online) 0973-8894 (print) Bioinformation 7(4): 157-162 (2011) 158 © 2011 Biomedical Informatics

Universiti Kebangsaan Malaysia (UKM).

Methodology:
EuDBase and ESTFrontier have been designed for simple and efficient information search and retrieval. EuDBase is composed of two major components, i.e. a relational database created using the open-source MySQL version 5.1.36 and a PHP version 5.
The StackPack EST assembly pipeline was used to assemble the raw EST data, resulting in 2,275 unique transcripts consisting of 1,320 consensus sequences and 955 singletons (Table 1, see supplementary material). A sequence similarity search against the NCBI nr database with a cut-off value of 1e-06 showed that 961 UTs have significant homologues, 145 UTs were categorised as predicted proteins, whilst 138 UTs were grouped into hypothetical and unknown proteins (Table 2, see supplementary material). Table 3 (see supplementary material) shows the most abundant similarity search results for the E. denticulatum UT data set, where 62 ESTs were found to match RNA-binding proteins. Table 4 (see supplementary material)

conceived and directed the molecular biology studies. All authors read and approved the final manuscript.

Supplementary material:
Table 1: A summary of E. denticulatum EST in EuDBase
a Mean EST length following vector and end clipping; b EST assembly parameters were 80% minimum match with 40 minimum base overlap; c Unique transcripts are the sum of contigs and singletons.

Table 2: BLASTX analysis results for E. denticulatum UTs

Table 3: Most abundant EST similarity search of E. denticulatum UT data set

Table 4: Gene Ontology of E. denticulatum UTs. Classified using the guidelines of the Gene Ontology Consortium 2001 (http://www.geneontology.org). Indented terms are children of the parent term above. Only mapped GO terms are presented.

Table 5: Domain analysis using INTERPRO

Table 6: Ten most abundant ESTs mapped to KEGG PATHWAY