Microsatellites in palm (Arecaceae) sequences

Microsatellites are the most promising co-dominant markers, widely distributed throughout the genome. Identification of these repeating genomic subsets is a tedious and iterative process making computational approaches highly useful for solving this biological problem. Here 38,083 microsatellites were localized in palm sequences. A total of 2, 97,023 sequences retrieved from public domains were used for this study. The sequences were unstained using the tool Seqclean and consequently clustered using CAP3. SSRs are located in the sequences using the microsatellite search tool, MISA. Repeats were detected in 33,309 sequences and more than one SSR had appeared in 3,943 sequences. In the present study, dinucleotide repeats (49%) were found to be more abundant followed by mononucleotide (30%) and trinucleotide (19%). Also among the dinucleotides, AG/GA/TC/CT motifs (55.8%) are predominantly repeating within the palm sequences. Thus in future this study will lead to the development of specific algorithm for mining SSRs exclusively for palms.


Background:
Genome, the genetic blue print of the hereditary information of an organism comprises many functional regions, non-coding sections and the vast unexplored areas of the nucleic acids. Determining the nucleotide sequences of the genome has many technical hurdles which could be accomplished by the hand of computational approaches only. Unlocking the knowledge encoded in the genomes can increase our understanding of species, which could lead to substantially increase the productivity of crops. With such enhanced productivity, it could be the sustainable solution for fulfilling the world's demand for the food. Molecular markers are used to provide a link between genotype and phenotype, for the generation of molecular genetic maps and to assess the genetic diversity within and between related species. Molecular markers will provide the relationship between an inherited disease and its genetic cause, and can be used to determine the precise inheritance pattern of the gene that has not yet been exactly localized. Simple Sequence Repeats (SSRs) or microsatellites are becoming the most important molecular markers in both animals and plants, incorporates stretches of tandemly repeated short oligonucleotides [1] whose functional and/or structural characteristics distinguish them amidst the general DNA sequence.
Since SSR markers are highly informative, they are widely used for genetics and breeding studies in several plant species. Thus characterization of microsatellites is extremely important for the development of molecular markers. SSRs are highly abundant and exhibit broad levels of polymorphism in eukaryotic [2, 3] and prokaryotic [4,5] genomes. Their variability in length is caused by slip-strand mutations and that may affect the local structure of DNA molecule or encoded protein [6]. SSRs often serves to modify the genes with which they are associated and their influence on gene regulation, transcription and protein function typically depends on the number of repeats [7]. Thus SSRs provide a prolific source of quantitative and qualitative variation. Transferability between species and sometimes between genera has often been reported [8]. The ubiquitous properties of microsatellites made them important targets for genetic diversity studies, genetic and comparative mapping [9,10]. Moreover, the polymorphism revealed by EST-SSR was similar to that exposed by genomic SSR [11]. Also the SSRs are cross transferable across species and the flanking regions of SSRs are found conserved [12]. A high level of polymorphic loci was obtained regardless of the species considered.
Given the interest of the plant genetics community in SSRs as genetic markers, there has been a particular concern in the establishment of methods for rapid identification of robust and informative SSRs linked to genes of agronomic significance. Compared to genome-wide isolation approaches, gene-targeted strategies are more likely to yield SSRs that are relevant to the goals of marker-assisted selection and germplasm assessment [13].
The distribution of microsatellites in the palm sequences has practical implications with regard to their use as molecular markers. Cross transferability or cross species amplification of coconut microsatellite markers in rattan (climbing palms) [14] and other economic palms (date palm, oil palm, palmyrah and arecanut) [15] is already reported. Thus the incorporation of SSR markers in the palm research are extremely valuable and are increasingly becoming popular in comparative genomics where SSR markers developed from one species could be utilized in a related species towards genetic mapping, characterization, gene cloning, diversity and evolutionary studies [16]. This approach gained momentum in plant genomics during the recent years based on the observation that despite a wide range in genome sizes, plants were found to exhibit extensive conservation of both gene content and gene order [17]. The availability of a very large set of microsatellite markers distributed throughout the genome in oil palm and date palm would facilitate the development of high resolution maps, which are instrumental for applications like positional gene cloning and detailed comparative mapping to other palms like coconut, arecanut, palmyrah palm etc.
Sequences from many genes and genomes are continuously made freely available in the public databases and mining of these sources using computational approaches permits rapid and economical marker development. But the number of robust, informative and user-friendly markers (e.g., SSRs) publicly available for palms is still insufficient for some applications, particularly considering the low intra-specific polymorphism level observed even with microsatellite markers (10-20%). The main objective of the present work is to locate and analyze the SSRs in the sequences of palms. The predicted size of the date palm genome sequence is ~650Mbp. The sequences of major palms like coconut (Cocos nucifera), arecanut (Areca catechu), oil palm (Elaeis) and date palm (Phoenix dactylifera) were processed individually while rest of the arecaceae members have been underwent treatment in mass.

Sequence Preprocessing
Sequences were then pre-processed to minimize the sequencing errors and to avoid redundant sequences, were grouped into contigs. The sequence flaws were obliterated using the standalone tool SeqClean (Gene Index project) [20], for the trimming of poly A (T) tails, low quality/complexity regions and distal oligoN series. Before execution Seqclean program was customized by configuring the vector contamination database UniVec [21] with local Blast [22] for the removal of vector, linker and adaptor sequences. The filtered sequences were clustered and then assembled by adapting CAP3 [23] tool. The tool uses forward-reverse constraints to correct assembly errors and link contigs and uses base quality values in alignment of sequence reads. Automatic clipping of 5' and 3' poor regions of reads and the generation of assembly results in ace file format are remarkable. The contigs and singletons acquired were blended to engender the substrate file for SSR detection.

Microsatellite mining
Subsequently, the identification and localization of perfect as well as compound microsatellites were done using MISA (http://pgrc.ipk-gatersleben.de/misa) tool [24], a repeat searching perl script. Among different programs available in the public domain, the MIcroSAtellite (MISA) search module has improved features that are useful for EST quality control and for designing the primer pairs for EST-SSRs in a batch file [25].The categorized results of the microsatellite searches are stored in two files, one to hold the localization and type of identified microsatellite(s) in a table wise manner and a statistics file which summarizes different statistics as the frequency of a specific microsatellite type according to the unit size or individual motifs. The analysis of occurrence and frequency of SSRs among the palm species was carried out by exporting the MISA results to Microsoft Excel spreadsheets [26]. But to advance for further analysis, it has been found onerous for getting sequences of only SSR containing regions from the bulk palm sequence source file. To resolve this obstacle, the MISA perl script has been configured to get an additional SEQUENCE file holding only SSR containing sequences. This has considerably reduced the manual effort and toil.

Results and Discussion:
Microsatellite density profile it was found that palm sequences are rich source of microsatellites repeats. A total of 46,916 sequences were retrieved from the web base National Center for Biotechnology Information, NCBI (http://www.ncbi.nlm.nih.gov/). Since the sequences in public domain are cumulated exponentially, available sequences as on 14 th January, 2011 were undergone for the present analysis. Sequences of coconut (390), arecanut (22), oil palm (41374) and date palm (250107) were collected besides 5130 sequences of remaining 546 palm species. These values show the complexity of different palm sequences, which should be handled carefully. A minor negligence while processing itself will lead to false positives. The reliable softwares and programs were met the needs at this stage and thus the computational approaches greatly help in getting and manipulating these informations. After pre-processing of these sequences respective size of them were significantly reduced and they were subjected to microsatellite mining. From the 390 Cocos nucifera sequences of 205Kb size, 426 SSRs per Kb were collected. Similarly 5, 3024, 34051 and 577 SSRs per Kb were detected in Areca catechu, Elaeis, Phoenix dactylifera and other palms respectively. These repeats were occurred at a frequency of 2.04 (coconut), 0.7 (arecanut), 0.22 (oil palm), 3.5 (date palm genome) and 0.62 (other palms) per Kb of sequences (Table 1, see supplementary  material). Thus altogether, 38,083 SSRs were detected in the entire palm sequences.
Date palm and coconut were observed to be the richest source of microsatellite repeats. Thus SSR markers developed from date palm and coconut were cross transferable to other species of Arecaceae which will be helpful for the generation of genic regions in the less available species of other palms. With an earlier report, of microsatellite repeats in oil palm EST sequences (2413 ESTs), the predicted frequency of 1SSR/4.4Kb (0.227SSR per Kb) is found to be in agreement with the present report (41374) in oil palm sequences of 0.22 SSR per Kb. So even if more genic regions of oil palm are revealed on a daily basis, the occurrence of its repeat frequency is found to be constant. This observation discloses the conservative and ubiquitous character of SSRs over palms. In cereals, including barley, maize, oat, rice, rye and wheat, lower frequencies of SSRs (7-10% of total ESTs) were found from their available genome database. The SSR frequency reported in the present study might change when the more sequences of other palms are made available in the public database.

Figure 1: Distribution of SSR repeats in palms
The SSRs were grouped as monomer, dimer, trimer, and tetramer and above based on the size of the repeating unit (Table 1, see supplementary material). Mononucleotide repeats were grouped to two as A/T and C/G. Similarly all dinucleotides motifs were grouped into the following four unique classes; (i) AT/TA, (ii) AG/GA/CT/TC, (iii) AC/CA/TG/GT, and (iv) GC /CG. Among all the repeats dinucleotide repeats (49%) were found to be more abundant in palms followed by mononucleotides (30%) and then tri nucleotides (19%) ( Table 2, see supplementary material). This observation was in exception with arecanut and date palm (Figure 1), as these two can be taken concession due to its less sequence availability (arecanut) and non-furnished draft contigs (date palm). Also within dinucleotides, AG/GA/TC/CT motif (55.8%) was observed as dominant repeats.

Conclusion:
By this study, the wide occurrence and conservative nature of microsatellite repeats was accomplished in palms. We have predicted and characterized huge data sets, having all the possible SSRs, their locations, types and classes in palms. The predominant occurrence of GA/CT dimeric and A/T mono type repeats in plants are found to be strengthened in this study. Also the scarcity of AC/GT repeats in plant genomes due to the high frequency of certain amino acids in plants is yet again revealed. The frequency of SSRs in oil palm is found to be preserving even with the addition of more and more updated coding regions. Thus the predicted SSR frequency of different palms can be extended with more upcoming sequence information. This can be targeted towards the genetic mapping, characterization, gene cloning, diversity and evolutionary studies of palms. But, the exploration of genes and genomes will never commit a pause. Magnificent and convoluted amount of data and sequences will be depositing in science each and every day. The development of a specific algorithm which combines the sequence processing and marker extraction will outperform in the near future, which will overwhelm the mining of large data sets. To achieve this target, the currently generated data will be highly useful to design and train algorithm for the extraction of microsatellites exclusively for palms.