Analysis of SSR dynamics in chloroplast genomes of Brassicaceae family.

Simple sequence repeats (SSRs) are abundant in most eukaryotic genomes. They affect several cellular processes, including chromatin organization, regulation of gene activity, DNA repair and DNA recombination. Although considerable data exist on the use of nuclear SSRs to infer phylogenetic relationships, the potential of chloroplast microsatellites (cpSSRs) in this regard remains largely unexplored. In the present study we examine the nucleotide repeat motifs (NRMs), i.e. the types of SSRs, present in the chloroplast genomes (cpDNA) of 12 species belonging to the family Brassicaceae. NRMs show a non-random distribution between the coding and non-coding compartments of cpDNA. As expected, trinucleotide repeats are more common in coding regions, while other repeat motifs predominate in non-coding DNA. The total number of SSRs in coding regions shows little variation between species, whereas the number of SSRs in non-coding regions varies considerably. Finally, we have designed universal primers that yield polymorphic amplicons from all 12 species. Our analysis also suggests that amplicon length polymorphism shows no significant relationship with the sequence-based phylogeny of SSRs in cpDNA of the Brassicaceae.


Background:
SSRs, or microsatellites, are tandem repeats of mono-, di-, tri-, tetra- and longer nucleotide motifs. They are widespread in the genomes of most eukaryotes and often exhibit length polymorphism. These reversible, length-altering mutations result from unequal crossing over and replication slippage. Relative conservation of the flanking regions allows variable-length microsatellites to be used as locus-specific, co-dominant genetic markers across taxa. SSRs present in non-coding DNA were earlier thought to be non-functional, but they are now known to play distinct roles in genome organization, regulation of transcription, DNA recombination and repair, etc. [1] When present in protein-coding DNA segments, their expansion or contraction can have a large impact on protein function. Several human diseases have been linked to the expansion of trinucleotide microsatellites within protein-coding genes and are known as trinucleotide repeat disorders. [2] Through "guilt by association", several microsatellite loci in plants have been linked to stress tolerance, disease resistance, domestication events and various agronomic traits. [3] Although cpDNA has a smaller percentage of non-coding sequence than nuclear DNA, SSRs are still abundant in chloroplast genomes. In contrast to nuclear DNA markers, which are inherited through both seed and pollen, cpDNA is inherited only through the maternal route in most angiosperms. It is considered a highly polymorphic marker that can be used to trace divergence through geographical isolation. [4] The present work determines the relative percentages of different SSR motifs and their distribution in the coding and non-coding compartments of cpDNA in members of the family Brassicaceae. Brassicaceae includes about 338 genera comprising about 3700 species with a cosmopolitan distribution. [5] The family is also known as the mustard family, and its members are mostly annual or perennial herbs. Arabidopsis thaliana, a commonly used 'model plant', also belongs to this family.
Chloroplast genome sequences of 12 species of Brassicaceae are present in GenBank [6] and we have included all of them in our analysis of microsatellite loci. The study also tests the appropriateness of using data from cpDNA SSR length polymorphism as a genetic marker to plot phylogenetic relationships.

Mining SSRs from cpDNA and Primer design
A standalone Perl script was used to find SSR motifs. The minimum repeat size was 10 nt for monomers, 5 nt for dimers and 3 nt for trimers to decamers. Both perfect and compound SSRs were detected. The maximum interruption between the components of a compound SSR was set to 5 nucleotides. SSRs were searched for in the full chloroplast genome as well as in the separate coding and non-coding regions of each species. Sequences of about 200-400 nt flanking each SSR were used in the online tool Primer3 [8] for designing primers. The parameters used for Primer3 were: optimum primer size, 20 nt; optimum annealing temperature, 59°C; optimum GC content, 50%.
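The mining step can be illustrated with a minimal Python sketch. This is not the Perl script used in the study; it is a regular-expression approximation under two stated assumptions: the thresholds above are read as repeat-unit counts (10, 5 and 3), and a compound SSR is taken to be any run of perfect SSRs separated by at most 5 nt. The function names `find_perfect_ssrs` and `group_compound` are hypothetical.

```python
import re

# Assumption: the thresholds in the text are interpreted as minimum numbers of
# repeat units (10 for monomers, 5 for dimers, 3 for trimers to decamers).
MIN_REPEATS = {1: 10, 2: 5}
MIN_REPEATS.update({k: 3 for k in range(3, 11)})


def is_primitive(motif):
    """False if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(motif == motif[:d] * (n // d) for d in range(1, n) if n % d == 0)


def find_perfect_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples; 0-based, end exclusive."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (motif_len, min_rep - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            if is_primitive(m.group(2)):
                hits.append((m.start(), m.end(), m.group(2), len(m.group(1)) // motif_len))
                pos = m.end()
            else:
                # Already reported under the shorter motif; step one base forward.
                pos = m.start() + 1
    return sorted(hits)


def group_compound(hits, max_gap=5):
    """Group perfect SSRs whose interruption is at most max_gap nucleotides."""
    groups = []
    for hit in sorted(hits):
        if groups and hit[0] - max(h[1] for h in groups[-1]) <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups
```

Groups with more than one member correspond to compound SSRs under this assumption. Running the same functions separately on the coding and the non-coding sequences would give the per-compartment counts discussed later.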

SSR dynamics in coding, non-coding and complete cpDNA
The data generated from SSR mining were analyzed using Microsoft Excel®. Percentages of the different types of SSR motifs were calculated and their occurrence in coding vs. non-coding DNA was determined. The number of SSRs in coding and non-coding DNA across species was represented graphically to appreciate the dynamics.

Electrophoresis prediction and phylogenetic tree construction:
The amplicons generated by FastPCR from the cpDNA of all species, using the selected universal primers, were loaded into CLC DNA Workbench [10]. An in silico gel electrophoretogram was plotted using CLC DNA Workbench. Length polymorphism data were used to generate a distance matrix with the 'simint' module of NTSYSpc ver 2.2 [11], and the matrix was clustered using the UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm to plot a dendrogram. Amplicon sequence data were used to plot a second dendrogram on the basis of structural polymorphism: the amplicon sequences were aligned using ClustalW [12] and clustered with the Neighbor-Joining algorithm. The robustness of the tree was tested using bootstrap analysis (1000 iterations).
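The length-based clustering step can be sketched in plain Python. This does not reproduce NTSYSpc: the distance used here (mean absolute difference in amplicon length across primer sets) is an assumption, since the coefficient chosen in 'simint' is not stated, and the species labels and lengths in the usage note are hypothetical placeholders.

```python
from itertools import combinations


def length_distances(amplicon_lengths):
    """Pairwise distance: mean absolute amplicon-length difference (assumed metric)."""
    dist = {}
    for a, b in combinations(sorted(amplicon_lengths), 2):
        diffs = [abs(x - y) for x, y in zip(amplicon_lengths[a], amplicon_lengths[b])]
        dist[frozenset((a, b))] = sum(diffs) / len(diffs)
    return dist


def upgma(dist, names):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples."""
    clusters = {name: 1 for name in names}  # cluster -> number of leaves it holds
    dist = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for other in clusters:
            # Distance to the new cluster is the size-weighted mean of the old ones.
            d = (dist[frozenset((a, other))] * size_a
                 + dist[frozenset((b, other))] * size_b) / (size_a + size_b)
            dist[frozenset((merged, other))] = d
        clusters[merged] = size_a + size_b
    return next(iter(clusters))
```

For example, with hypothetical lengths `{"sp_A": [200, 310], "sp_B": [202, 312], "sp_C": [240, 350], "sp_D": [244, 352]}`, calling `upgma(length_distances(lengths), sorted(lengths))` groups sp_A with sp_B and sp_C with sp_D before joining the two pairs.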

Discussion:
Chloroplast genomes of 12 species of the family Brassicaceae were analyzed for their total microsatellite content, the distribution of microsatellites across coding and non-coding domains, and the relative percentages of motif types in different species. The data reveal non-random patterns of SSR distribution. Finally, using the designed universal primers, an attempt was made to use SSR length polymorphism for exploring phylogenetic relationships.

Types of SSR motifs in cpDNA of Brassicaceae:
The total numbers of the different types of SSR motifs present in all cpDNA samples were estimated. Mononucleotide and trinucleotide motifs are the most common and are present in almost the same proportion (42% and 40% respectively). Motifs larger than pentanucleotides represent only 1% of the total SSRs (Figure 1a). Twelve percent of the total NRMs are dinucleotide repeats, among which AT/TA repeats predominate over CG/GC repeats. One reason for this disparity may be that CG repeats are known to increase the stacking energy of DNA and can form Z-DNA in negatively supercoiled regions during transcription [13].
Further, of the two most prominent repeat motifs, the mononucleotide motif was overrepresented in non-coding DNA, whereas the trinucleotide motif was more common in coding DNA. Other NRMs were also present in larger numbers in non-coding DNA (Figure 1b). Expressed sequence tags (ESTs) represent the coding portion of the genome. [14] Mining of publicly available EST sequences for SSRs has been performed by several groups and, almost invariably, the trinucleotide motif has been reported as the most common [15,16], which corroborates our results. This is expected, as changes in the length of repeats other than trinucleotide motifs would generally result in frame shift or null mutations and hence in non-functional proteins, and would therefore tend to be selected against during evolution.
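The reading-frame argument reduces to simple modular arithmetic: a change of k repeat units of an m-nucleotide motif alters the sequence length by k × m nucleotides, and the codon frame is kept only when that number is divisible by 3. The short sketch below (the function name `preserves_reading_frame` is hypothetical) makes this explicit.

```python
def preserves_reading_frame(motif_len, units_changed):
    """True if gaining or losing `units_changed` repeat units keeps the codon frame."""
    return (motif_len * units_changed) % 3 == 0


# For each motif length, report the effect of changing 1, 2 or 3 repeat units.
for motif_len in range(1, 7):
    effects = [preserves_reading_frame(motif_len, k) for k in (1, 2, 3)]
    print(motif_len, effects)
```

Under this arithmetic, trinucleotide (and hexanucleotide) motifs preserve the frame for any change in unit number, whereas mono-, di-, tetra- and pentanucleotide motifs do so only when the number of units changes by a multiple of three.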

Distribution of SSRs in cpDNA of different species
The distribution of the total number of SSRs across the 12 species was examined. The total number of SSRs in complete cpDNA molecules shows considerable variation across species. Further analysis showed that this variation is mainly due to differences in SSR content in non-coding regions. Coding regions, in contrast, showed less variation in SSR numbers across species (Figure 1c & Table 1 in Supplementary data). As a corollary, it also appears that although 51% of the cpDNA of the selected species encodes proteins, the total number of SSRs in coding regions ranges only from 47 to 58 across species, while the total number of SSRs in non-coding regions (about 49% of the chloroplast genome) ranges from 84 to 120. Thus, on average, the non-coding compartment of cpDNA in Brassicaceae contains about twice as many SSRs as the coding DNA. Similar results have been reported for the mitochondrial and chloroplast genomes of rice [17]. SSRs are known to undergo cyclical expansion and contraction. These mutations are caused by recombination or by replication slippage [1]. Such changes would be more easily tolerated in non-coding DNA, and this could be one of the reasons why more SSRs are observed in non-coding regions.
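The "about twice" figure can be checked with a rough calculation from the ranges quoted above. The sketch below pairs the lowest coding count with the lowest non-coding count and the highest with the highest; this pairing is an assumption made only for illustration, since the extremes need not come from the same species. The function name `noncoding_to_coding_ratio` is hypothetical.

```python
# Values taken from the text: share of cpDNA and range of SSR counts.
CODING_FRACTION, NONCODING_FRACTION = 0.51, 0.49
CODING_RANGE, NONCODING_RANGE = (47, 58), (84, 120)


def noncoding_to_coding_ratio(noncoding, coding, per_unit_length=False):
    """Ratio of SSR counts, optionally normalised by each compartment's share of cpDNA."""
    if per_unit_length:
        return (noncoding / NONCODING_FRACTION) / (coding / CODING_FRACTION)
    return noncoding / coding


# Assumption: range endpoints are paired (low with low, high with high).
for label, coding, noncoding in zip(("low end", "high end"), CODING_RANGE, NONCODING_RANGE):
    raw = noncoding_to_coding_ratio(noncoding, coding)
    density = noncoding_to_coding_ratio(noncoding, coding, per_unit_length=True)
    print(f"{label}: count ratio {raw:.2f}, density ratio {density:.2f}")
```

Both endpoint ratios lie close to 2, and normalising by compartment size changes them only slightly because the two compartments are of nearly equal size.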

Microsatellite polymorphism across Brassicaceae:
Length polymorphism of SSRs can be easily assessed using simple polymerase chain reaction (PCR) assays, with primers designed on conserved flanking regions [18]. Two sets of universal primers, which yield variable-length amplicons across the 12 Brassicaceae species, have been designed (Figure 2a). An electrophoretogram was simulated to demonstrate the extent of length polymorphism (Figure 2b). On the basis of amplicon size distribution, two major classes emerged. These have been marked in the electrophoretogram and are also represented in the dendrogram plotted on the basis of length variation (Figure 2c). To test whether sequence-based phylogeny (structural polymorphism) correlates with the dendrogram plotted using SSR length variation, we aligned the amplicon sequences in ClustalW (Figure 3). Subsequently, clustering was performed and a sequence-based phylogenetic tree was constructed (Figure 2d). Our analysis suggests that SSR length polymorphism and structural polymorphism may not correlate for cpDNA of Brassicaceae. Similar results were reported for SSR data from the mitochondrial genome of domestic animals [19] as well as for the chloroplast genome in Cucumis species [20].
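The principle behind the simulated amplicons can be shown with a simplified in silico PCR sketch. It is not FastPCR: it assumes exact primer matches on a single strand of a linear template, and the primer and template sequences in the usage note are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon_length(template, forward, reverse):
    """Length of the exact-match PCR product on `template`, or None if no product."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears on the template strand.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return end + len(site) - start
```

For instance, two hypothetical templates that share the flanks `GATTACAGGCTA` and `CCGTTAGGCATC` but carry (AT)6 and (AT)9 between them give products of 36 and 42 nt with the same primer pair, which is the kind of length difference resolved on the electrophoretogram.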

Conclusion:
This work reveals the distribution of different types of SSRs in the coding and non-coding compartments of cpDNA in 12 species of the family Brassicaceae.
Our results indicate that, in general, SSRs are more abundant in non-coding segments of cpDNA than in coding regions. However, triplet repeats are more prevalent in coding regions, as expected.
Considerable variation in the number of SSRs is observed in non-coding regions of cpDNA across species. An attempt to use amplicon length polymorphism to construct phylogenetic relationships did not yield results that were fully congruent with the amplicon sequence-based phylogeny.