Mining of SSR markers from Expressed Sequence Tags of bamboo species.

With the ever-increasing number of Expressed Sequence Tags (ESTs) from various sequencing projects, ESTs have become a valuable, first-hand source for in silico mining of simple sequence repeat (SSR) markers. We examined a total of 3419 EST sequences from three bamboo species, namely Phyllostachys edulis, Bambusa oldhamii and Dendrocalamus sinicus, for the presence of di- to hexa-nucleotide microsatellites. The frequency of SSR-containing ESTs varied from 5.36% in B. oldhamii to 13.05% in P. edulis. No SSRs were found in D. sinicus. Tri-nucleotide repeats (49.34%) were the most frequent in P. edulis, whereas the repeat classes did not differ appreciably in frequency in B. oldhamii. Flanking primer pairs were also designed in silico for the SSR-containing sequences, and their positions on the genome were hypothesized using similarity searching. SSRs located in open reading frames (ORFs) were given functional annotation using Gene Ontology. Polymorphic SSRs were also detected using a new pipeline, PolySSR. The level of polymorphism was very low (2.43%), and the positions of the polymorphic SSRs were determined. The development of SSRs and the study of polymorphism will help in the further study of intra- and inter- gene flow, genetic structure, variability, linkage mapping and evolutionary relationships in bamboo.


Background:
A large number of plant genomes are under consideration for whole-genome sequencing, but for most of them the task is difficult and time-consuming because of their large genome size. For these species, large-scale EST sequencing projects have been started beforehand as an alternative. Expressed Sequence Tags (ESTs) are sequences of cDNA clones, picked at random and sequenced in a single-pass run. Depending on which end of the clone is sequenced, an EST can be a 5' EST or a 3' EST. Since ESTs are derived from cDNA, they provide direct evidence for the study of the transcriptome and the genome.
Microsatellites, also called Simple Sequence Repeats (SSRs) or Short Tandem Repeats (STRs), are tandemly repeated motifs of 1-6 bp present in both coding and noncoding regions of prokaryotic and eukaryotic genomes. SSRs are extensively used as molecular markers because of their multiallelic nature, co-dominant inheritance and relative abundance. Since ESTs represent the coding part of the genome, they serve as an important source for mining putative SSR markers and provide first-hand insight into an organism's genetic diversity. In this study, we present the mining of EST-SSR markers and the detection of microsatellite polymorphism in three bamboo species, namely Phyllostachys edulis, Bambusa oldhamii and Dendrocalamus sinicus. Two approaches were used. First, SSR markers were predicted using MISA, and the SSR-containing sequences were functionally annotated. Second, polymorphism was determined using the novel pipeline PolySSR.

Methodology:
ESTs for three bamboo species, namely Phyllostachys edulis, Bambusa oldhamii and Dendrocalamus sinicus, were downloaded from NCBI's dbEST. A total of 3087 EST sequences of P. edulis, 318 EST sequences of B. oldhamii and 14 EST sequences of D. sinicus were downloaded as of May 29, 2009 (dbEST release 052909). Because dbEST contains redundant EST sequences, the CAP3 assembler was used to remove the redundancy. EST-SSR sequences were subjected to similarity searching against the nonredundant (nr) database, restricted to ORGN: Oryza sativa. Because bamboo is a monocotyledon, rice was used as the model organism owing to its closer phylogenetic proximity. The search was performed using the BLASTX variant of NCBI's Basic Local Alignment Search Tool (BLAST) [4]. A sequence was considered homologous if the e-value was ≤ 1e-5 and the score was ≥ 100. Based on the position of the SSR relative to the homology match, each SSR was assigned to the 5' UTR, the 3' UTR or the ORF. Only those EST sequences in which the SSR was predicted to lie in the ORF were analyzed further. Function was assigned according to Gene Ontology (GO) annotation: the GO annotation of Oryza sativa was used to map functions by similarity searching in GRAMENE. Polymorphism exhibited by the EST-SSRs was mapped using the new pipeline PolySSR [5].
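The homology thresholds and the position assignment described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `BlastxHit`, `is_homologous` and `classify_ssr` are hypothetical, and the sketch assumes that the region of the EST aligned to a rice protein approximates the ORF, that any overlap with that region counts as ORF, and that a negative BLASTX query frame means the EST is in antisense orientation. The text does not say whether "score" is a raw or a bit score, so the field is left generic.

```python
from dataclasses import dataclass


@dataclass
class BlastxHit:
    """Best BLASTX hit of one EST against rice proteins (illustrative fields only)."""
    evalue: float
    score: float
    query_start: int  # alignment start on the EST, 1-based
    query_end: int    # alignment end on the EST, 1-based
    frame: int        # query frame; positive = EST assumed to be in sense orientation


def is_homologous(hit, max_evalue=1e-5, min_score=100.0):
    """Apply the thresholds stated in the text: e-value <= 1e-5 and score >= 100."""
    return hit.evalue <= max_evalue and hit.score >= min_score


def classify_ssr(ssr_start, ssr_end, hit):
    """Assign an SSR (1-based, inclusive EST coordinates) to 5' UTR, ORF or 3' UTR."""
    if hit is None or not is_homologous(hit):
        return "unassigned"
    # BLASTX reports reversed query coordinates for negative frames, so sort them.
    lo, hi = sorted((hit.query_start, hit.query_end))
    if ssr_end < lo:
        side = "upstream"
    elif ssr_start > hi:
        side = "downstream"
    else:
        return "ORF"  # assumption: any overlap with the aligned region counts as ORF
    # On an antisense EST, "upstream" in EST coordinates is 3' on the transcript.
    if hit.frame < 0:
        side = "downstream" if side == "upstream" else "upstream"
    return "5' UTR" if side == "upstream" else "3' UTR"


hit = BlastxHit(evalue=1e-20, score=250.0, query_start=101, query_end=400, frame=1)
print(classify_ssr(20, 37, hit))    # SSR ends before the aligned region
print(classify_ssr(150, 170, hit))  # SSR lies inside the aligned region
```

Treating the aligned region as the ORF is a simplification: a partial alignment can leave coding sequence outside the match, so SSRs just beyond the alignment boundaries may be labelled as UTR when they are in fact coding.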

Results and Discussion:
For analysis, CAP3 reduced the EST data to a nonredundant set of sequences, shrinking it by at least 30%. Mono-nucleotide repeats were not considered since they are of little use as molecular markers. The search criteria were kept low to maximize SSR discovery [6]. In D. sinicus, no SSRs were detected by MISA, and this species was excluded from further study. Most SSRs for which primers were designed were < 20 bp long and thus have a smaller chance of mutation or slipped-strand mispairing than longer repeats. This may lead to a greater chance of sequence conservation.
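To make the search for di- to hexa-nucleotide repeats concrete, the following Python sketch finds perfect SSRs with regular expressions. It is an illustration only and is not MISA. The function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for the example, since the thresholds actually used are not stated here.

```python
import re

# Hypothetical minimum repeat counts per motif length (not the values used in the study).
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Coordinates are 0-based with an exclusive end. Mono-nucleotide runs are
    never reported because only motif lengths listed in min_repeats are searched.
    """
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs built from a shorter unit (e.g. "AA" or "AGAG"),
            # so that each repeat is reported under its shortest motif only.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


est = "GGC" + "AG" * 7 + "TTC" + "CGG" * 5 + "AT"
for start, end, motif, count in find_ssrs(est):
    print(start, end, motif, count)
```

Lowering the values in `MIN_REPEATS` reports more and shorter repeats, which corresponds to keeping the search criteria low at the cost of including repeats that are less likely to be polymorphic.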

Bioinformation
The SSR-containing sequences for which primers were designed were then analyzed to determine the relative position of each SSR on the genome using sequence similarity search. Most of the SSRs were predicted to lie in the 5' UTR, followed by the 3' UTR; very few were predicted to lie in the ORF. The predominance of SSRs in the 5' UTR in bamboo species is in accordance with another study [18]. Almost all AG motifs were found in the 5' UTR, whereas CT motifs were found mostly in the 5' UTR and very rarely in the 3' UTR. ORFs contained mostly trinucleotide repeats [19]. The high frequency of trinucleotide repeats can be explained by their being less affected by base mutations and hence more conserved. However, disease-causing effects of changes in SSR sequences have been reported in humans [20].
The SSRs predicted to lie in ORFs were mapped against related Gene Ontology (GO) IDs. They were mapped to a variety of functions, which are summarized giving the corresponding GO annotations for sequences in