Mining for SSRs and FDMs from expressed sequence tags of Camellia sinensis

Simple Sequence Repeats (SSRs) developed from Expressed Sequence Tags (ESTs), known as EST-SSRs, are a widely used and potentially valuable source of gene-based markers because of their high cross-taxon portability and their rapid, inexpensive development. The EST sequence information in publicly available databases is increasing rapidly. Computational approaches provide a better alternative to conventional methods for developing SSR markers from ESTs. In the present study, 12,851 EST sequences of Camellia sinensis, downloaded from the National Center for Biotechnology Information (NCBI), were mined for the development of microsatellites. After preprocessing and assembly of these sequences using various computational tools, 6148 non-redundant EST sequences (4779 singletons and 1369 contigs) were obtained. Out of a total of 3822.68 kb of sequence examined, 1636 (26.61%) EST sequences containing 2371 SSRs were detected, a density of 1 SSR/1.61 kb, leading to the development of 245 primer pairs. These mined EST-SSR markers will support further studies of variability, mapping and evolutionary relationships in Camellia sinensis. In addition, the developed SSRs can also be applied in studies across species.


Background:
SSRs (Simple Sequence Repeats), otherwise known as microsatellites, are short repeat motifs present in both protein-coding and non-coding regions of DNA sequences. SSRs show a high level of length polymorphism due to mutations of one or more repeats. SSRs are favored as molecular markers because of their multiallelic nature, reproducibility, high abundance and extensive genome coverage [1]. Expressed Sequence Tags (ESTs) are single-pass sequences of cDNA clones that provide direct information on gene expression and serve as the main source of microsatellites [2]. The traditional methods of developing SSR markers are usually time-consuming and labor-intensive. Generally these processes involve genomic library construction, hybridization with the repeated units of nucleotides and sequencing of the clones. The computational approach for developing SSR markers from ESTs provides a better platform than the conventional approach. The EST databases store EST sequences that are redundant, so they contain repeated entries. The EST sequences can be obtained from the databases and then preprocessed and assembled to develop potential SSR markers. Numerous tools (both standalone and web-based) are available for mining EST data to design EST-SSR markers at a large scale. The free computational programs and the large set of EST data available on the web allow researchers to perform data mining easily and rapidly on their local systems at very low cost. Tools such as cross_match and trimest provide non-redundant, high-quality EST sequences that are free of vector contamination and poly-A and poly-T tails. CAP3 assembles the EST sequences that have overlapping regions and produces contigs by joining them. Sequences that share no overlap with others remain as singletons. MISA detects the Simple Sequence Repeats in both contigs and singletons.
EST-SSR markers are potential candidates for gene tagging and comparative studies in various species because they are gene-specific. The use of EST-SSR markers has been reported in a large number of commercially important plants. Camellia sinensis (tea), a perennial evergreen shrub, is best known for its commercial value and is a companion of daily life. It is perhaps the second most consumed beverage in the world and has some important health benefits. Tea is reported to promote a healthier immune system and to lower blood pressure and cholesterol. It is produced in major quantities, and in many varieties, in India and Sri Lanka. A large amount of EST data for tea has been generated and stored in the databases. The present study deals with the detection of SSRs in the EST sequences of Camellia sinensis available in the public domain and with the functional annotation of the SSR-containing ESTs. The annotation helps to identify the putative functions of the ESTs and to find the important functional domain markers (FDMs) related to the SSR-ESTs, leading to a gene ontology study. The gene ontology covers three domains: biological process (operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs and organisms); cellular component (the parts of a cell or its extracellular environment); and molecular function (the elemental activities of a gene product at the molecular level, such as binding or catalysis).

Data source
Sequence data were collected from the public domain GenBank database at NCBI (National Center for Biotechnology Information) [3]. A search for EST sequences of Camellia sinensis returned 12,851 sequences. All the sequences were downloaded to the local machine for further analysis.

Processing EST sequence:
6148 non-redundant EST sequences were searched for vector sequence, and the vector sequence was removed using the cross_match software [4]. The vector sequences were obtained from the UniVec database [5]. The trimest program from EMBOSS was then used to remove the poly-A and poly-T ends of the EST sequences [6]. EMBOSS is freely available on the web and can be downloaded and installed on a PC. trimest trims the poly-A and poly-T ends from a given sequence according to the parameters supplied.
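The effect of the trimming step can be illustrated with a minimal Python sketch. It is not trimest's algorithm (which, among other things, tolerates mismatches inside the tail); the function name `trim_poly_tails` and the `min_run` threshold are assumptions made for this example only.

```python
import re

def trim_poly_tails(seq: str, min_run: int = 4) -> str:
    """Remove a 3' poly-A run and a 5' poly-T run of at least min_run bases.

    Illustrative only: exact runs, no mismatch tolerance.
    """
    seq = seq.upper()
    # Strip a trailing run of A (poly-A tail at the 3' end).
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    # Strip a leading run of T (poly-T stretch at the 5' end).
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq

# Hypothetical input: 6 leading T and 8 trailing A are removed.
print(trim_poly_tails("TTTTTTGCATGCCGTAAAAAAAA"))  # GCATGCCGT
```

Runs shorter than `min_run` are left untouched, so a sequence that merely ends in a few A bases is not shortened.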

EST assembly:
After the EST sequences had been processed and trimmed, they were submitted to CAP3 for assembly [7]. CAP3 is a DNA sequence assembly program, freely available for academic use. CAP3 produces contig, singlet, quality, info and output files.

SSR detection:
The 6148 unique EST sequences, i.e. 4779 singlets and 1369 contigs, were searched for microsatellite sequences using MISA (MIcroSAtellite identification tool) [8]. MISA is a freely downloadable Perl script available on the internet. Along with misa.pl, a second file, misa.ini, which contains the search parameters, was also downloaded.
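The kind of search performed in this step can be sketched in Python with regular expressions. This is a simplified stand-in, not MISA itself: it finds perfect repeats only, and the minimum repeat counts in `MIN_REPEATS` are illustrative assumptions, not the values from the misa.ini file used in the study.

```python
import re

# Minimum number of repeats per motif length (assumed values for illustration).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    start is 1-based. Motifs that are themselves repeats of a shorter
    unit (e.g. "AA" or "CTCT") are skipped so each SSR is reported once.
    """
    seq = seq.upper()
    hits = []
    for unit in sorted(min_repeats):
        n = min_repeats[unit]
        # One motif of length `unit`, followed by at least n-1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            hits.append((m.start() + 1, motif, len(m.group(0)) // unit))
    return sorted(hits)
```

For example, `find_ssrs("GGC" + "CT" * 8 + "GATTC" + "GAA" * 5 + "T")` returns `[(4, 'CT', 8), (25, 'GAA', 5)]`: a di-nucleotide repeat of 8 units starting at position 4 and a tri-nucleotide repeat of 5 units starting at position 25.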

Primer designing:
Primer pairs were designed from the obtained SSR sequences using the Primer3 tool [9]. Some default parameters were adjusted: the primer size was set to a minimum of 17, an optimum of 21 and a maximum of 27 nucleotides, and the GC content was set to a minimum of 45% and a maximum of 65%. The SSR-containing sequences were then searched for both forward and reverse primers.
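For a command-line run, the size and GC limits given above could be passed to primer3_core as a Boulder-IO record. The sketch below builds such a record in Python. The tag names follow the Primer3 2.x input format, which may differ from the version or interface used in the study; the function name, sequence ID, template and target coordinates are hypothetical.

```python
def primer3_record(seq_id, template, target_start, target_len):
    """Build one Boulder-IO input record for primer3_core.

    The target (e.g. the SSR region) is given as a start position and
    a length, so that primers are picked on either side of it.
    """
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", f"{target_start},{target_len}"),
        # Primer size limits stated in the text.
        ("PRIMER_MIN_SIZE", 17),
        ("PRIMER_OPT_SIZE", 21),
        ("PRIMER_MAX_SIZE", 27),
        # GC content limits (percent) stated in the text.
        ("PRIMER_MIN_GC", 45),
        ("PRIMER_MAX_GC", 65),
    ]
    lines = [f"{key}={value}" for key, value in tags]
    lines.append("=")  # a lone "=" terminates a Boulder-IO record
    return "\n".join(lines) + "\n"

# Hypothetical contig with a target region starting at base 40, 20 bases long.
print(primer3_record("contig_example", "ACGT" * 30, 40, 20))
```

Writing one record per SSR-EST to a single file lets all primer pairs be designed in one batch run.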

Functional annotation:
Functional domain markers (FDMs) were identified in the SSR-containing sequences using InterProScan at EBI [10].
InterProScan provides a platform for analyzing functional domains with the help of member databases and applications such as BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, HAMAP, PatternScan, SuperFamily, SignalPHMM, TMHMM, HMMPanther and Gene3D. The SSR-EST sequences were searched for significant matches against the non-redundant protein database using BLASTx at the National Center for Biotechnology Information [11]. BLASTx searches a protein database using a translated nucleotide query. Only BLASTx hits with an identity above 70% were retained. The SSR-FDM contig sequences found with InterProScan were annotated for biological process, cellular component and molecular function using QuickGO (http://www.ebi.ac.uk/QuickGO) at the EBI server [12]. QuickGO is a fast web-based browser for Gene Ontology terms and annotations, provided by the UniProtKB-GOA group.
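The identity cut-off could be applied to saved BLAST results as in the following sketch. It assumes the standard 12-column, tab-separated tabular BLAST output (query ID, subject ID, percent identity, and so on), which is an assumption about the file format rather than a description of how the study processed its results; the function name and the example identifiers are hypothetical.

```python
def filter_hits(tabular_text, min_identity=70.0):
    """Return {query: (subject, percent_identity)} for the first hit per
    query whose percent identity exceeds min_identity.

    Expects tab-separated lines with the query ID, subject ID and
    percent identity in the first three columns.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if pident > min_identity and query not in best:
            best[query] = (subject, pident)
    return best

# Hypothetical results: est1 has two hits, est2 falls below the cut-off.
example = "\n".join([
    "est1\tprotA\t85.50\t120\t17\t0\t1\t360\t5\t124\t1e-50\t200",
    "est1\tprotB\t72.00\t110\t30\t1\t1\t330\t9\t118\t1e-30\t150",
    "est2\tprotC\t65.00\t100\t35\t0\t1\t300\t2\t101\t1e-20\t100",
])
print(filter_hits(example))  # {'est1': ('protA', 85.5)}
```

Because BLAST lists hits for each query from best to worst, keeping the first passing hit per query retains the top-scoring match above the threshold.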

Discussion:
Powerful computational tools were used to mine the publicly available Camellia sinensis EST data. In the present study, 2371 SSRs were found in the screened EST data, Table 1 (see supplementary material). The EST sequences containing SSRs are generally referred to as SSR-ESTs, whereas the markers developed from SSR-ESTs are called EST-SSRs. A total of 6148 sequences, amounting to 3822.68 kb, were examined for microsatellites, of which 1636 (26.61%) were found to contain SSRs, a density of 1 SSR/1.61 kb. Figure 1 shows the frequency distribution of the different repeat types identified, and Figure 2 shows the frequency distribution of mono-, di-, tri-, tetra-, penta- and hexa-nucleotide repeat motifs in EST sequences of Camellia sinensis. The most frequent repeat type found within the EST sequences was mono-nucleotide repeats (65.9%), followed by di-nucleotide repeats (24.6%), tri-nucleotide repeats (7.8%), tetra-nucleotide repeats (0.8%) and penta- and hexa-nucleotide repeats (0.8%). The frequencies of the identified SSR motifs are provided in Table 2 (see supplementary material). The detected SSR motifs give insight into the frequency distribution of the different types of nucleotide repeats in Camellia sinensis. The analysis showed that the SSR frequency is moderate, though higher than that of many other species. Ignoring the mono-nucleotide repeats, the di-nucleotide repeats were the most abundant repeat type. Among the di-nucleotide repeats, CT and AG had the highest frequency. This supports the observation that AG/CT repeats are the most frequent in plant EST-SSRs [13,14]. According to previous reports, GC repeats are abundant in most eukaryotes; in this study, however, GC repeats were completely absent [15,16]. Tri-nucleotide repeats are less frequent than di-nucleotide repeats in Camellia sinensis. Studies have revealed that the most common tri-nucleotide SSRs in plants are AAG and CCG [17].
In tea, the AAG repeat was not present, and CCG repeats were present but at low frequency. The most frequent tri-nucleotide repeats in tea were GAA, TCT and CAC. The length of the SSRs detected in tea varied from 10 to 392. Polymorphic markers are mostly found in long repeats, so the ESTs containing SSRs longer than 100 were chosen for the design of primer pairs. From 97 such SSR-ESTs, 245 primer pairs were designed. Figure 3 shows a flowchart of the stepwise procedure for in silico mining of SSRs from Camellia sinensis EST sequences. Of the 469 SSR-containing EST contigs, those containing only mono-nucleotide SSRs were not taken into account. The remaining 314 contigs were analyzed using InterProScan, and a total of 935 different functional domains were found, Table 3 (see supplementary material). 250 sequences were found to be SSR-FDMs, containing both SSRs and FDMs. These sequences were then assigned to gene ontology terms in the Swissprot database.
All 469 SSR-containing sequences were annotated against the non-redundant protein database using BLASTx to determine their functions. The BLASTx results (Figure 4) comprise 54 (11.51%) putative proteins, 74 (15.77%) predicted proteins, 50 (10.67%) hypothetical proteins and 146 (31.13%) proteins of different functional classes; for 145 (30.92%) sequences no function was found, as there was no specific similarity. The SSR-ESTs were also assigned to gene ontology terms. 59 of the 250 contigs containing FDMs were found to have a vital role in various processes and functions. Figure 5 gives the details of the biological processes, cellular components and molecular functions of the gene ontology. The gene ontology provided information on ESTs related to many genes coding for secondary metabolite production, translation, lipid transport, stress response, GTP-binding proteins, iron-sulfur cluster binding proteins and many other functions.
A number of studies are currently being reported on the development of EST-SSRs in a large number of plant species using computational tools. Microsatellite markers are very important for studying genetic variability and understanding genetic information. In the present study, 2371 microsatellites were detected in the available EST data for Camellia sinensis. The redundancy in the EST sequences was reduced by preprocessing and assembly, which helped in detecting unique SSRs. The designed primer pairs can be tested for transferability to related species. The BLASTx results helped to identify the putative functions of the ESTs, and the GO annotation revealed numerous functional domains. This will support the development of specific markers for genes involved in secondary metabolite production, stress-responsive genes and other genes that may be present in the plants.
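The SSR frequency and density reported in the Discussion follow directly from the counts given in the text, as this short Python check shows. The function name `ssr_summary` is introduced here for illustration only.

```python
def ssr_summary(total_ests, ssr_ests, total_kb, ssr_count):
    """Return (percent of ESTs containing SSRs, kb of sequence per SSR)."""
    percent_with_ssr = 100 * ssr_ests / total_ests
    kb_per_ssr = total_kb / ssr_count
    return round(percent_with_ssr, 2), round(kb_per_ssr, 2)

# Counts taken from the text: 6148 ESTs examined, 1636 containing SSRs,
# 3822.68 kb of sequence and 2371 SSRs in total.
print(ssr_summary(6148, 1636, 3822.68, 2371))  # (26.61, 1.61)
```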
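Repeat type (total): motif (count)
Tetra-nucleotide motifs, continued: TGTA (2), TGTT (1), TTAT (1), TTCT (1), TTGT (1), TTTC (1), TTTG (2)
Penta-Nucleotide (11): AACAA (1), AGGCG (1), CTCCT (1), GATTG (1), TCGAG (1), TCTCT (1), TGATC (2), TTGTT (2), TTTTC (1)
Hexa-Nucleotide (10): CCTAAG, CCTCCA, CTCAAT, CTCCAG, GAAGCA, GAATTG, TCCAAC, TGGAGG, TTTATT, TTTTTG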