Mining, characterization and validation of EST-derived microsatellites from the transcriptome database of Allium sativum L.

Expressed sequence tags (ESTs) carry comprehensive transcript information and, because they are derived from conserved genic regions, are valuable resources for the development of molecular markers. The present study describes the mining of the EST database to identify class I hypervariable SSRs in A. sativum. From 21694 garlic EST sequences, 642 non-redundant SSRs were identified, with an average frequency of one SSR per 11.19 kb of the garlic transcriptome. The most abundant SSR motifs were the mononucleotides (32.86%), followed by trinucleotides (28.50%) and dinucleotides (13.39%). Among the individual SSRs, (A/T)n accounted for the highest number (137; 21.33%), followed by (G/C)n (74; 11.52%) and (AAG)n (63; 9.81%). Primers designed from a robust set of 7 AsEST-SSRs amplified 63 polymorphic alleles in 14 accessions of garlic. The resolving power of the markers varied from 4.286 (AsSSR7) to 18.143 (AsSSR13), while the average marker index (MI) was 5.087. These EST-SSR markers could be useful for improving the garlic linkage map, for evaluating genetic variation and for comparative genomics studies in Allium species.


Background:
Garlic (Allium sativum L.), the "spice of life", is a unique monocot plant from the economically important family Liliaceae. It belongs to the genus Allium, which consists of 1250 species, and has been used throughout history for both culinary and medicinal purposes [1]. It is credited with several medicinal properties, acting as a stimulant, diaphoretic, expectorant, diuretic and tonic, owing to the presence of allicin in the bulb and garlic oil in both bulb and leaves [2]. At present it is considered useful for the treatment and prevention of a number of diseases, including cancer, coronary heart disease, obesity, type 2 diabetes, hypertension, cataract and disturbances of the gastrointestinal tract [3]. Interest in and demand for garlic have increased significantly in recent times because of its nutritional and pharmaceutical properties. India contributes about 4.1% of global garlic production and is second only to China in terms of productivity [4]. The Food and Agriculture Organization of the United Nations has estimated an annual growth rate of 4.7% in world demand for garlic. Although sexual reproduction is possible in garlic, nearly all garlic plants are propagated vegetatively. Clonal selection is the main breeding method for modern garlic, since the usual absence of sexual propagation precludes crop improvement through hybridization. In garlic breeding programmes, genetic variation is increased only through somaclonal variation, genetic transformation and mutation [5]. Continuous domestication of preferred genotypes, together with their exclusive asexual propagation, has eroded the genetic base of this crop and made the plant vulnerable to various biotic and abiotic stresses, which together account for up to 60% yield losses [4]. It is therefore essential to develop efficient and reliable molecular markers to analyse genetic diversity in garlic for conservation of germplasm and improvement of the extant crop.
Several molecular marker systems are used in plants for the characterization and mapping of important traits [6]. Among them, microsatellites are the most reliable. Microsatellites, or simple sequence repeats (SSRs), are short arrays of tandemly repeated motifs of 1 to 6 bp that are present throughout the genome and are mainly used as markers in studies of genetic variation and population genetics [7]. SSRs have great advantages over other markers because they are simple, highly abundant, polymorphic, multiallelic and co-dominant, and occur in both coding and non-coding regions of the genome [8].
Although several SSR markers have been identified in garlic [9, 10], additional polymorphic SSRs are needed, particularly for the development of linkage maps for use in trait-specific mapping studies. In recent years, high-throughput next-generation sequencing technologies have led to the generation of large databases of expressed sequence tags (ESTs) and genomic sequences. ESTs are short, single-pass sequence reads of mRNAs or cDNAs that correspond to partial coding sequences of expressed genes, and they are an attractive alternative source for mining SSRs from the coding regions of the genome [11]. EST-SSRs have advantages over genomic SSRs because they are readily available and highly transferable to related species, and they therefore serve as reliable markers for gene mapping analyses [12]. The use of ESTs to generate EST-derived SSRs has been reported in several plant species [13].
Taking the above facts into account, the present study aims to exploit the EST database of Allium sativum to develop EST-derived SSRs. A total of 21694 Allium sativum ESTs are available in the dbEST database of the National Center for Biotechnology Information (NCBI) (as of 22 December 2014). A user-friendly SSR identification tool, SSRLocator [14], was used for this purpose. Further, primers were designed from a selected set of EST-SSRs to determine polymorphism and to validate their utility by evaluating genetic diversity among different garlic accessions.

Methodology:
A total of 21,694 expressed sequence tag (EST) sequences were downloaded from the dbEST database hosted in GenBank at NCBI using the keyword "Allium sativum". The 21694 ESTs retrieved were from three different garlic tissues, i.e. leaf, stem and root. The Cross_Match program [15] was used with parameters set at minmatch ≥13 and minscore ≥20 to screen the ESTs against the UniVec database (ftp://ftp.ncbi.nih.gov/pub/UniVec/) to detect vector and adapter sequences. Additionally, poly A/T tails and X characters were removed from the EST sequences using the EST_trimmer.pl script (http://pgrc.ipk-gatersleben.de/misa/download/est_trimmer.pl). Trimming of the ESTs was continued until no stretch of (A/T)5 or (X)1 was observed in a window of 100 bp at the 5′ or 3′ end, respectively. The assembly program CAP3 [16] was used to assemble all the ESTs, thereby creating a non-redundant dataset. The resulting output of unigenes, contigs and singlets was combined as a selected non-redundant dataset for SSR identification. The SSR detection tool SSRLocator was used to detect EST-SSR loci in the garlic EST datasets. Two repeat motifs found close to each other within an EST were considered as individual entities and not as compound SSRs, as suggested by Gupta et al. [17]. Primers were designed from Class I SSR-containing EST sequences using Primer3plus [18] and validated using NetPrimer (http://www.premierbiosoft.com/netprimer/index.html). Primers with a score of more than 75, as evidenced by the absence of self-dimers and/or cross-dimers, were selected and synthesized. The seven selected primer pairs were tested for functionality and polymorphism against a panel of 14 Allium sativum accessions.
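The SSR search step can be illustrated with a minimal sketch. It is not the SSRLocator algorithm; it is a plain regular-expression scan for perfect repeats of 1 to 6 bp motifs. The 20 bp minimum length used here for a Class I SSR is an assumption, since the cutoff is not stated in the text, and the function names are hypothetical.

```python
import math
import re

def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'AGAG')."""
    k = len(motif)
    return not any(k % d == 0 and motif[:d] * (k // d) == motif for d in range(1, k))

def find_class1_ssrs(seq, min_length=20):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    min_length is the assumed Class I cutoff in bp (an assumption, not from the text).
    """
    seq = seq.upper()
    hits = []
    for k in range(1, 7):  # motif sizes from mono- to hexanucleotide
        min_repeats = math.ceil(min_length / k)
        # one captured motif followed by at least (min_repeats - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # skip e.g. 'AA', already reported as 'A'
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)

# Toy sequence (not a real garlic EST): an (AG)12 repeat and an (A)22 run
toy_est = "GCTA" + "AG" * 12 + "TTCG" + "A" * 22 + "C"
ssrs = find_class1_ssrs(toy_est)
```

Filtering out non-primitive motifs keeps a mononucleotide run from being counted again as a di- or trinucleotide SSR, which matters when motif classes are tallied separately.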

Results & Discussion:
The 21694 redundant EST sequences retrieved from NCBI represented approximately 11.87 Mb of the Allium sativum genome. During the scan for class I microsatellite repeats, 793 SSRs were detected in this dataset, corresponding to one SSR per 14.9 kb. A total of 53672 bp of empty vectors, low-quality sequences and poly A/T tails was removed during pre-processing. The remaining sequences were clustered and assembled into a non-redundant dataset of 14054 unique gene sequences (1491 contigs and 12563 singlets). Mining of Class I microsatellites revealed 642 unique SSR-containing sequences within the non-redundant dataset, accounting for one SSR per 11.19 kb of the garlic genome (Table 1; see supplementary material). The reduction in SSR redundancy obtained by trimming and clustering the sequences is shown in Figure 1. Assembling ESTs into contiguous or single gene sequences reduces redundancy, which allows the SSR frequency to be calculated accurately and unique primer sets to be designed. The parameters used for SSR detection and frequency analysis vary widely among studies of different plant species [17,19]. Cardle et al. [20] used a comprehensive computational strategy to estimate the average distances between SSRs in non-redundant ESTs of various plant species such as rice (3.4 kb), soybean (7.4 kb), tomato (11.1 kb), Arabidopsis (13.8 kb), poplar (14.0 kb) and cotton (20.0 kb). We followed the same strategy and found one SSR per 11.19 kb of the garlic non-redundant EST dataset. Since a shorter average distance means a higher SSR frequency, this suggests that the frequency of EST-SSRs in the expressed portion of the garlic genome is lower than in rice and soybean, similar to that in tomato, and higher than in Arabidopsis, poplar and cotton.
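The density figures can be checked with simple arithmetic: the average distance between SSRs is the total sequence length divided by the number of SSRs, and a smaller distance means a higher SSR frequency. The sketch below uses only values quoted in the text; the helper name is hypothetical, and the total length of the non-redundant dataset is not reported, so only the redundant figure is recomputed.

```python
def kb_per_ssr(total_bp, n_ssrs):
    """Average distance between SSRs in kb (total length / SSR count)."""
    return total_bp / n_ssrs / 1000.0

# Redundant dataset: about 11.87 Mb and 793 SSRs, as quoted in the text
redundant_kb = kb_per_ssr(11.87e6, 793)

# Reported average distances (kb per SSR) in non-redundant ESTs
garlic_kb = 11.19
other_species_kb = {
    "rice": 3.4, "soybean": 7.4, "tomato": 11.1,
    "Arabidopsis": 13.8, "poplar": 14.0, "cotton": 20.0,
}

# Species whose SSRs are more closely spaced (more frequent) than in garlic
denser_than_garlic = sorted(s for s, kb in other_species_kb.items() if kb < garlic_kb)
# Species whose SSRs are more widely spaced (less frequent) than in garlic
sparser_than_garlic = sorted(s for s, kb in other_species_kb.items() if kb > garlic_kb)

print(f"redundant dataset: one SSR per {redundant_kb:.2f} kb")
print("higher SSR frequency than garlic:", denser_than_garlic)
print("lower SSR frequency than garlic:", sparser_than_garlic)
```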
The distribution of individual SSR motifs among the non-redundant set of 642 SSRs is presented in Table 2 (see supplementary material). The mined EST-SSRs were classified as simple SSRs, with a single motif, and compound SSRs, with more than two motifs. Among the 642 EST-SSRs, 12 loci (1.86%) were compound SSRs while the remaining 630 loci (98.14%) consisted of simple repeats (Figure 2). Among the Class I SSR motifs detected, mononucleotides were the most abundant (32.86%), followed by trinucleotides (28.50%) and dinucleotides (13.39%). The most common repeat motif was (A/T) among mononucleotides, (AG/CT) among dinucleotides, (AAG/CTT) among trinucleotides, (AAAC/GTTT) among tetranucleotides, (AAAAC/GTTTT) among pentanucleotides and (AGCCTG/CAGGCT) among hexanucleotides. The high occurrence of mononucleotide repeats in the poly A/T-trimmed dataset suggests that they are located within the expressed regions and not at the ends of the mRNA sequences. This is in agreement with earlier reports in Catharanthus [21] and turmeric [22]. Among the dimeric SSRs, the AT motif (18%) had the lowest abundance while AG (68.6%) had the highest frequency (Table 3; see supplementary material). The deficit of AT SSRs in EST sequences agrees with reports from rice [23] and Arabidopsis [20]. Depending on the reading frame, AG/CT repeats correspond to the codons GAG, AGA, UCU and CUC, which encode glutamic acid, arginine, serine and leucine, respectively. The surplus of AG/CT in the garlic ESTs is consistent with the high leucine content of most plant proteins. Among the trimeric SSRs, AAG (34.4%), AAC (20.2%) and AAT (11.4%) were the most common patterns in the garlic ESTs. AAG is the most common plant trinucleotide repeat and has been reported at the highest percentage in cotton [24] and Catharanthus [21]. AAC/GTT and AAT/ATT repeats are also significantly represented in periwinkle [21], turmeric [22], pearl millet [25] and barley [26].
Likewise, the majority of monocot genomes show a specific occurrence of CGG repeats [27]. ATT represents only 4% of the total garlic SSRs, which is in accordance with the observation that most plants have the lowest percentage of ATT repeats because TAA-based variants encode stop codons [27]. The most frequent tetranucleotide SSR motifs were AAAC/GTTT (27) and AAAG/CTTT (9). The pentanucleotide repeats, hexanucleotide repeats and compound SSRs together accounted for less than 3% of the total SSR patterns.
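Motif classes written as AG/CT or AAG/CTT group a repeat unit with its reverse complement, and the same repeat can also be read from any starting position (AG and GA describe the same array). A minimal sketch of this grouping, assuming each class is labelled by the alphabetically smallest rotation of the motif or of its reverse complement; the function names and the toy motif list are hypothetical and are not garlic data.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    """Reverse complement of a DNA motif, e.g. 'AAG' -> 'CTT'."""
    return motif.translate(_COMPLEMENT)[::-1]

def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement."""
    motif = motif.upper()
    candidates = []
    for unit in (motif, reverse_complement(motif)):
        # every cyclic rotation of the repeat unit describes the same array
        candidates.extend(unit[i:] + unit[:i] for i in range(len(unit)))
    return min(candidates)

# Toy list of motifs as they might be read from individual ESTs
observed = ["AG", "GA", "CT", "TC", "AAG", "CTT", "TTC", "AT", "GTTT"]
class_counts = Counter(canonical_motif(m) for m in observed)
```

With this labelling, AG, GA, CT and TC all fall into the AG class, so motif frequencies are not split across equivalent spellings of the same repeat.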
Seven AsEST-SSR primer pairs designed from the selected set of SSR loci amplified 82 unambiguously scorable bands in 14 accessions of Allium sativum (Table 4; see supplementary material). Of these, 63 bands (76.8%) were polymorphic, averaging 9 polymorphic bands per primer pair. Of the 63 polymorphic bands, 39 (61.9%) were highly informative, as they were characterized by allele frequencies ranging from 0.2 to 0.8. The 7 AsEST-SSRs had polymorphism information content (PIC) values of 0.689 to 0.780, with an average of 0.730. Thus, the markers showed good power to discriminate among the genotypes of Allium sativum used in this study. The resolving power of the markers varied from 4.286 (AsSSR7) to 18.143 (AsSSR13). The AsEST-SSRs had an average marker index (MI) of 5.087, with values ranging from 1.540 (AsSSR9) to 8.504 (AsSSR2). Three AsEST-SSR primer pairs (AsSSR2, AsSSR8 and AsSSR11) revealed a high level of polymorphism among the garlic accessions, as shown by their higher number of alleles. Hence, these EST-derived SSR primers have potential as informative markers for genetic diversity analysis and trait identification studies in garlic improvement programmes.
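The band statistics above can be illustrated with a short sketch that works on a binary band matrix (1 = band present, 0 = absent). The formulas used by the authors are not given in the text, so the definitions here are assumptions: a band is polymorphic if it is present in some but not all accessions, and resolving power follows a commonly used definition, the sum over bands of the band informativeness Ib = 1 - 2|0.5 - p|, where p is the proportion of accessions carrying the band. The function names and the toy matrix are hypothetical.

```python
def band_frequencies(matrix):
    """matrix: list of bands, each a list of 0/1 scores across accessions."""
    return [sum(band) / len(band) for band in matrix]

def percent_polymorphic(matrix):
    """Percentage of bands present in some, but not all, accessions."""
    polymorphic = sum(1 for band in matrix if 0 < sum(band) < len(band))
    return 100.0 * polymorphic / len(matrix)

def resolving_power(matrix):
    """Assumed definition: sum of Ib = 1 - 2 * |0.5 - p| over all bands."""
    return sum(1 - 2 * abs(0.5 - p) for p in band_frequencies(matrix))

def informative_bands(matrix, low=0.2, high=0.8):
    """Indices of bands whose frequency lies between low and high."""
    return [i for i, p in enumerate(band_frequencies(matrix)) if low <= p <= high]

# Toy matrix: 4 bands scored in 4 accessions (not the garlic data)
toy_bands = [
    [1, 1, 1, 1],  # monomorphic band
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
toy_percent = percent_polymorphic(toy_bands)
toy_rp = resolving_power(toy_bands)
toy_informative = informative_bands(toy_bands)

# The reported proportion of polymorphic bands: 63 of 82
reported_percent = round(100.0 * 63 / 82, 1)
```

A band found in half of the accessions contributes the maximum of 1 to the resolving power, while a band present in all or none contributes 0, which is why markers with many intermediate-frequency bands score highest.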

Conclusion:
Microsatellite markers have myriad uses in plant genome analysis. The present study describes the frequency, type and distribution of garlic EST-derived microsatellites and demonstrates the successful development of EST-SSR markers in garlic accessions. The reproducible EST-SSR markers developed in this study could enrich the molecular marker resources for garlic and could be applied to trait mapping, assessment of genetic diversity, marker-assisted selection and functional analysis of candidate genes in commercial inbred varieties as well as in related Allium species.