ESMP: A high-throughput computational pipeline for mining SSR markers from ESTs

With the advent of high-throughput sequencing technology, sequences from many genomes are being deposited to public databases at a brisk rate. Open access to large amount of expressed sequence tag (EST) data in the public databases has provided a powerful platform for simple sequence repeat (SSR) development in species where sequence information is not available. SSRs are markers of choice for their high reproducibility, abundant polymorphism and high inter-specific transferability. The mining of SSRs from ESTs requires different high-throughput computational tools that need to be executed individually which are computationally intensive and time consuming. To reduce the time lag and to streamline the cumbersome process of SSR mining from ESTs, we have developed a user-friendly, web-based EST-SSR pipeline “EST-SSR-MARKER PIPELINE (ESMP)”. This pipeline integrates EST pre-processing, clustering, assembly and subsequently mining of SSRs from assembled EST sequences. The mining of SSRs from ESTs provides valuable information on the abundance of SSRs in ESTs and will facilitate the development of markers for genetic analysis and related applications such as marker-assisted breeding. Availability The database is available for free at http://bioinfo.aau.ac.in/ESMP


Background:
Expressed sequence tags (ESTs) represents short, unedited and randomly selected single-pass reads derived from cDNA libraries provides an alternative to whole genome sequencing of organisms. The analysis of EST data enable gene discovery, complete genome annotation, gene structure identification, establish the viability of alternative transcripts, guide single nucleotide polymorphism (SNPs) characterization and facilitate in proteomic exploration [1].
The ubiquity of microsatellite or simple sequence repeats (SSRs) in eukaryotic genomes and their usefulness as genetic markers has been well established over the last decade. SSRs are short (1-6 bp) repeat DNA motifs that are usually single locus markers with characteristics of hypervariability, abundance, reproducibility and ease of detection by polymerase chain reaction with unique primer pairs that flank the repeat motif are available with their own objectives, provides cleansing EST sequences and annotating them using public databases. The mining of SSRs from ESTs requires different high-throughput computational tools that need to be executed individually which are computationally intensive and time consuming. To reduce the time lag and to streamline the cumbersome process of SSR mining from ESTs, we have developed EST-SSR Marker Pipeline: ESMP for mining of putative SSRs from EST sequences. ESMP accomplish EST pre-processing, clustering, assembly and subsequently mining of SSRs from assembled EST sequences. Cross_match ESMP has a three-tier architecture system. Presentation tier helps the user interact with ESMP through a web browser, whereas the business tier performs different analytical services associated with user specific options. The data generated in the business tier is then deposited into the data tier. For the use of this pipeline it does not require any database or any application installation on user machine. Instead the user simply uploads the fasta formatted EST sequence data into the server to run the pipeline with default parameters. It also has the options to choose the user defined parameters which makes the pipeline more interactive, user friendly and flexible.

Implementation:
ESMP interface has been developed using HTML, CSS, JavaScript and PHP. MySQL has been used to store input EST data, intermediate data of the pipeline and mined SSRs statistics. The database schema is available at ESMP website. The backend system is a Linux machine with Intel ® Core(TM) 2Duo@3.33GHz CPU and 3GB RAM. Architecture and workflow of the pipeline is depicted in Figure1.

Software input:
The ESMP web interface allows the user to submit EST sequences in the fasta format with ".reads" extension. It also asks the user to upload vector sequences in a plain text format which can be obtained from FTP site UniVec database (ftp://ftp.ncbi.nih.gov/pub/UniVec/) of NCBI. Although most EST projects produces a large number of chromatogram files, ESMP cannot accept chromatogram files due to file-size limitations of web-based uploading. Accordingly, chromatogram files have to be converted into DNA sequence files using a base-calling program such as phred [8].

Software output:
The ESMP output is stored in a MySQL database. All the output files are stored in ".rar" extension which can be downloaded by the user as well as can be viewed in the current web page. The statistics files contains the statistics of putative SSRs i.e., total number of sequence examined, total size of examined sequences (bp), total number of identified SSRs, number of SSRs containing sequences and number of sequence containing more than one sequences, number of SSR present in compound formation about the putative SSRs mined in the run. The statistics file can be transferred into an excel file for better visualisation of putative SSRs.

Conclusion:
ESMP pipeline is the integration of multiple tools which are individually used for their respective applications to accomplish the mining of putative SSRs from ESTs.

Caveat and future development:
ESMP currently supports pre-processing, assembly and putative SSR detection from EST datasets. This web-based EST-SSR pipeline reduces time lag and streamline the cumbersome process of SSR mining from ESTs, which is user-friendly. Our goal is not just to limit this pipeline for EST-SSR mining but to extend further for annotation and detection of suitable primer pairs which will flank the repeat motif. The mining of SSRs from ESTs provides valuable information on the abundance of SSRs in ESTs and will facilitate the development of markers for genetic analysis and related applications.