Mining and characterization of EST derived microsatellites in Curcuma longa L.

Turmeric (Curcuma longa L.; family Zingiberaceae) is a perennial rhizomatous herbaceous plant that has been used as a spice since time immemorial. Turmeric is also widely known for its medicinal applications. EST-derived SSRs (simple sequence repeats) are a free by-product of the currently expanding EST (expressed sequence tag) databases. SSRs have been widely applied as molecular markers in genetic studies, and the development of high-throughput methods for detecting SSRs has given a new dimension to their use as molecular markers. The software tool SciRoKo was used to mine class I SSRs in a Curcuma EST database comprising 12953 sequences. A total of 568 non-redundant SSR loci were detected, with an average of one SSR per 14.73 kb of EST. Trinucleotide repeats were the most abundant among the 1-6-nucleotide repeat types, accounting for 41.19% of the total, followed by mononucleotide (20.07%) and hexanucleotide repeats (15.14%). Among all the repeat motifs, (A/T)n accounted for the highest proportion, followed by (AGG)n. The detected SSRs can be used to design primers that serve as markers for constructing saturated genetic maps and conducting comparative genomic studies in different Curcuma species.


Background:
The genus Curcuma of the family Zingiberaceae comprises 80 species distributed across Asia, South East Asia and Africa [1]. Turmeric, also known as the "golden spice", is one of the most important herbs in tropical and subtropical countries. Turmeric rhizome is valued the world over and has been used since ancient times as a spice, food preservative and coloring agent, and in traditional systems of medicine [2]. Its medicinal uses are diverse, ranging from cosmetic face cream to the prevention of Alzheimer's disease. Turmeric is also described as the queen of natural Cox-2 inhibitors [3]. India is the world's largest producer and exporter of turmeric, followed by China, Indonesia, Bangladesh and Thailand [4]. The International Trade Centre, Geneva, has estimated an annual growth rate of 10% in the world demand for turmeric. Conventional crop improvement methods are not suitable for turmeric because it is completely sterile and propagates exclusively by vegetative means. Characterization of Curcuma longa using molecular markers is very limited, apart from a few sporadic reports on isozyme studies and genetic stability studies using RAPD [5,6]. Moreover, it is well known that the genotypic diversity of exclusively asexually reproducing plants like turmeric will be lost in the long course of evolution. Hence, the development of reliable and reproducible molecular markers in turmeric is essential to assess genetic diversity for germplasm conservation and crop improvement.

Microsatellites, or simple sequence repeats (SSRs), are stretches of DNA consisting of tandemly repeated short units of 1-6 base pairs in length. Compared with other molecular markers, SSRs are more advantageous because of their simplicity, high information content and co-dominant nature, and because they can be rapidly screened and analyzed by polymerase chain reaction (PCR) and gel electrophoresis. In addition, SSR loci are present not only in the non-coding regions of genes but are also widely distributed in the coding regions. Microsatellites are categorized into two groups: class I hypervariable markers, ≥20 nucleotides in length, and class II potentially variable markers, <20 nucleotides in length. The standard method for development of genomic SSRs is highly time-consuming and labor-intensive.

Keeping in view the above, the objective of the research described in this paper was to assess the potential of existing public databases for the discovery of simple sequence repeats. We mined updated EST tissue libraries of Curcuma longa to find SSR polymorphisms. The SSR-detection software SciRoKo was used to identify them. There are other SSR-detection software tools, such as MISA.
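
As a rough illustration of what mining perfect SSRs involves, the following Python sketch scans a sequence for tandem repeats of 1-6 bp units and keeps those at least 20 nucleotides long. It is not SciRoKo's algorithm. The function name `find_perfect_ssrs`, the regular-expression approach and the 20-nucleotide cut-off (the length criterion commonly used for class I loci) are assumptions made for illustration only.

```python
import re


def find_perfect_ssrs(seq, min_len=20, max_unit=6):
    """Return (start, motif, repeat_count, length) for perfect tandem repeats.

    Illustrative only: min_len=20 is an assumed class I length threshold.
    """
    seq = seq.upper()
    hits = []
    for unit in range(1, max_unit + 1):
        # ([ACGT]{unit}) captures one motif; \1+ requires at least one more copy
        pattern = re.compile(r"([ACGT]{%d})\1+" % unit)
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "AAGAAG")
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            length = m.end() - m.start()
            if length >= min_len:
                hits.append((m.start(), motif, length // unit, length))
    return sorted(hits)


# Toy sequence (not real turmeric data): seven copies of AAG flanked by other bases
example = "GGC" + "AAG" * 7 + "TTC" + "A" * 5
print(find_perfect_ssrs(example))
```

The motif reported by such a scan depends on where the repeat begins (AAG, AGA and GAA are rotations of one another), so a dedicated tool or a normalization step is needed before motifs are tallied.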

Methodology:
The EST database of NCBI contains 12953 Curcuma longa expressed sequence tag records. We mined 12593 EST sequences from two tissue libraries: rhizome, 6870 sequences (DY395309-DY388440), and leaf, 5723 sequences (DY388439-DY382717). The EST sequences were screened against the UniVec database from NCBI (ftp://ftp.ncbi.nih.gov/pub/UniVec/) to detect vector and adapter sequences using the program Cross_Match [Li et al 2006], with the parameters minmatch ≥13 and minscore ≥20. Furthermore, polyA/T tails and X characters were removed using the est_trimmer.pl script (http://pgrc.ipk-gatersleben.de/misa/download/est_trimmer.pl) until no stretch of (A/T)5 or (X)1 was present in a window of 100 bp at the 5′ or 3′ end, respectively.

The mined EST-SSRs were classified according to their structure into the simple type, with a single motif, and the compound type, with more than two motifs. Among the 568 EST-SSRs, most (98.92%) consisted of simple repeats with no interruptions in the motif, whereas only six loci (1.05%) were of the compound type (Table 2, see supplementary material).
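
The end-trimming rule described in the Methodology (remove terminal sequence until no run of five A or T, and no X, remains within a 100 bp window at either end) can be sketched in Python as follows. This is one possible reading of that rule, not a port of est_trimmer.pl; the function name `trim_ends`, the cut positions and the cap that keeps the two windows from overlapping on short reads are all assumptions.

```python
import re

# A run of >=5 A, a run of >=5 T, or any X (masked vector) counts as a tail signal
TAIL = re.compile(r"A{5,}|T{5,}|X+")


def trim_ends(seq, window=100):
    """Trim poly-A/T stretches and X characters from both ends of an EST.

    Assumed interpretation: at the 5' end, cut through the last hit inside the
    window; at the 3' end, cut from the first hit inside the window.
    """
    seq = seq.upper()
    # 5' end: repeat until the leading window is clean
    while True:
        w = min(window, len(seq) // 2)  # assumption: windows never overlap
        hits = list(TAIL.finditer(seq, 0, w))
        if not hits:
            break
        seq = seq[hits[-1].end():]
    # 3' end: repeat until the trailing window is clean
    while True:
        w = min(window, len(seq) // 2)
        hit = TAIL.search(seq, len(seq) - w)
        if hit is None:
            break
        seq = seq[:hit.start()]
    return seq


# Toy read (not real data): poly-T head, clean core, poly-A tail
core = "ACGTGCATGCCGATCGGATC" * 10
print(trim_ends("TTTTTTGG" + core + "CCAAAAAAAA") == "GG" + core + "CC")
```

Trimming before SSR detection matters because an untrimmed poly-A tail would otherwise be reported as a mononucleotide SSR.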
Among the 1-6 repeat types, the most abundant was the trinucleotide repeat type, which accounted for 41.19% of the total, followed by the mononucleotide (20.07%) and hexanucleotide types (15.14%). The dinucleotide, tetranucleotide and pentanucleotide types accounted for only 9.68%, 6.16% and 6.69%, respectively. Many studies have suggested that the trinucleotide repeat is the main EST-SSR repeat type in most plants, followed by the dinucleotide and tetranucleotide repeat types [22,23]. However, the most abundant motif in the trinucleotide repeat type differed among plants [13]. Kantety et al. [24] showed that the (CCG)n repeat motif accounted for 32% and 49% of all repeat motifs in wheat and sorghum, respectively.

The rest of the repeat motifs contributed less than 5% to the total SSR motifs. Within the 1-6 repeat types, the most frequent repeat motifs were A/T, AG/CT, AAG/CTT, AAAC/GTTT, AAAAC/GTTTT and AGGCGG/CCGCCT, which accounted for 78.9%, 54.54%, 29.01%, 31.42%, 15.68% and 8.98% of the SSRs within their respective repeat types. This frequency analysis of repeat motifs can be used as a potential source for designing repeat probes for effective targeting and isolation of microsatellite repeats from turmeric. Moreover, these probes can be used for designing informative primers for genetic diversity analysis and related studies.
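
Motif classes such as AAG/CTT pair a motif with its reverse complement, because the same repeat reads differently on the two strands and at different starting positions. A minimal Python sketch of that grouping follows. The function names are hypothetical, the choice of the alphabetically smallest form as the class label is an assumption, and the motif list is toy data rather than the turmeric results.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Return one label for a motif, its rotations and their reverse complements."""
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    candidates = []
    for s in (motif, revcomp):
        # every cyclic rotation describes the same tandem repeat
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


def motif_shares(motifs):
    """Percentage of loci falling in each canonical motif class."""
    counts = Counter(canonical_motif(m) for m in motifs)
    total = sum(counts.values())
    return {motif: 100.0 * n / total for motif, n in counts.items()}


# Toy input: eight loci reported with strand- and phase-dependent motifs
toy = ["AAG", "CTT", "TTC", "AG", "T", "A", "CCT", "GGA"]
print(motif_shares(toy))
```

With this grouping, CTT and TTC are both counted under AAG, and CCT and GGA under AGG, which is the convention behind paired labels such as AAG/CTT.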

Conclusion:
In total, we identified 568 non-redundant hypervariable microsatellites from the EST data of Curcuma longa using the SSR identification tool SciRoKo. Development of SSR markers from EST databases saves both cost and time once a sufficient number of EST sequences is available. These non-redundant SSR resources will not only be applied in studies of genetic variation and linkage mapping but will also provide the foundation for an in-depth analysis of the distribution of genes on chromosomes and for comparative genomic studies on different Curcuma species.
Recent advances in Curcuma genomic technologies have generated a large number of expressed sequence tags (ESTs) that have been made available in public databases, thereby offering an opportunity to develop EST-derived SSR markers by data mining. ESTs are short, single-pass sequences read from mRNA (cDNA) [8], representing a snapshot of the genes expressed in a given tissue and/or at a given developmental stage. As of July 2010, GenBank had released 12593 EST sequences from Curcuma longa. In this context, the use of EST or cDNA-based SSRs has been reported for several species including grape [9], sugarcane [10], durum wheat [11] and rye [12].

Figure 1: Distribution of EST-SSRs based on the motifs.

Discussion:
Large-scale sequencing of expressed sequence tags and complete genomes offers information of use to plant breeding programs. With the completion of the first crop genome sequencing projects [20], the potential for plant breeding to be impacted by new technology has never been greater. A total of 12953 redundant EST sequences, representing about 8.4 Mb of the Curcuma longa genome, were retrieved from the NCBI database. During pre-processing, 38366 bp of empty vectors, low-quality sequences and poly A/T tails were removed. After redundancy analysis, 7139 unique sequences with a combined length of 5.11 Mb were obtained and used for mining hypervariable class I microsatellites. Using the SciRoKo SSR mining program to search for SSRs with 1-6 nucleotide repeat motifs, 568 hypervariable SSR loci were observed (Table 3, see supplementary material). The frequency of SSR loci in turmeric ESTs was one SSR in every 14.73 kb of EST sequence (Table 1, see supplementary material), a higher frequency than the previously reported one SSR per 17.96 kb [21]. Cardle et al. [22] estimated the average distances between SSRs in sets of non-redundant ESTs in poplar (1/14.0 kb), cotton (1/20.0 kb), Arabidopsis (1/13.8 kb), maize (1/8.1 kb), rice (1/3.4 kb), tomato (1/11.1 kb) and soybean (1/7.4 kb). This suggests that, as the transcript data available for plants increase, SSR estimation in ESTs will become more precise and reliable.

Gupta et al. [25] found that the (AAG)n repeat was the most abundant motif in the trinucleotide repeat type. In a similar study, Lu et al. (2010) found (AAG)n to be the most abundant repeat motif in Gossypium barbadense. Siju et al. [21] also found (AAG)n to be the most abundant motif in turmeric, accounting for 8.2%. The dominance of trimeric SSRs observed in the present study could be attributed to the suppression of non-trimeric SSRs in coding regions, where changes in their repeat number would cause frameshift mutations [26]. Moreover, we also found that mononucleotide (20.07%) and hexanucleotide (15.14%) repeats were more frequent than the remaining repeat types. This suggests that the functions of EST-SSRs derived from Curcuma longa may be different from those of other members of the Zingiberaceae family. Across all the repeat motifs, the most frequent SSR motifs derived from the ESTs were A/T (16.54%), followed by AAG/CTT (11.44%), AGG/CCT (10.91%) and AG/CT (5.28%) (Figure 1).
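
The frameshift argument can be made concrete: gaining or losing one repeat unit changes the length of a coding sequence by the unit length, and the downstream reading frame is kept only when that change is a multiple of three. A minimal Python sketch (the function name is hypothetical):

```python
def preserves_reading_frame(unit_length):
    """True if adding or removing one repeat unit keeps the codon frame."""
    return unit_length % 3 == 0


# Repeat unit sizes of 1-6 bp, as considered in this study
frame_safe = [k for k in range(1, 7) if preserves_reading_frame(k)]
print(frame_safe)  # [3, 6]
```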

Table 1:
Summary of EST-derived microsatellites from the EST database of Curcuma longa L.

Table 2:
Distribution of SSR motifs in Curcuma longa

Table 3:
Total number of detected SSR loci.