Analysis of unigene derived microsatellite markers in family Solanaceae.

The family Solanaceae is the source of several economically important plants. The aim of this study was to trace and characterize simple sequence repeat (SSR) markers from unigene sequences of Solanum lycopersicum, an important member of the family Solanaceae. A total of 18,228 unigene sequences of Solanum lycopersicum were mined to develop SSR markers and analyzed for the in-silico design of PCR primers. A total of 12,090 (66.32 %) unigenes containing 17,524 SSRs (microsatellites) were identified. The average frequency of microsatellites in unigenes was one in every 1.3 kb of sequence. The analysis revealed that GAA, coding for glutamic acid, was the most frequent trinucleotide motif and that AT/TA was the most frequent dinucleotide repeat. Flanking sequences of the SSRs yielded 877 primer pairs (forward and reverse). Functional categorization of SSR-containing unigenes was done through the gene ontology terms biological process, cellular component and molecular function.


Background:
Tomato (Solanum lycopersicum) fruit is an important source of antioxidant compounds (mainly pigments), including lycopene, β-carotene, ascorbic acid and polyphenols [1]. Tomato ripening involves a number of physiological processes, which include the visible breakdown of chlorophyll and build-up of carotenoids, with massive accumulation of antioxidant components such as lycopene and β-carotene [2] within the plastids. Microsatellites are abundant across prokaryotic and eukaryotic genomes [3]. Here, microsatellites were extracted from unigene sequences. The unigene database has been used to identify molecular markers for the identification of genes in different plants. Expressed Sequence Tag (EST) databases have also been mined for SSR markers, and these SSR motifs serve as locus-specific markers. A major disadvantage of EST-derived microsatellites, however, is sequence redundancy, which yields multiple sets of markers at the same locus. Unigene derived microsatellite (UGMS) markers have the advantage of assaying variation in the expressed component of the genome with unique identity and positions [4]. The simple sequence repeat (SSR) or microsatellite marker is therefore currently the preferred molecular marker owing to its highly desirable properties [5], and can serve as an efficient and cost-effective alternative marker in such species [6]. Several studies identifying microsatellites in rice, barley, wheat, maize, soybean, grapevine, sunflower and Brassica sp. have been reported [7,8]. In-silico microsatellite marker studies have also been carried out in commercially important medicinal and aromatic plants [9,10].
Microsatellites or simple sequence repeats (SSRs) are stretches of DNA consisting of tandemly repeated short units of 1-6 base pairs in length [8,11]. They are ubiquitous in prokaryotes as well as eukaryotes and can be found in both coding and non-coding regions [12]. The elevated frequency of length polymorphism associated with microsatellites provides the basis for a marker system that has broad application in genetic research, including studies of genetic variation, linkage mapping, gene tagging and evolution [7]. Microsatellite markers have become a valuable tool for genetic studies, as they allow large populations to be screened efficiently [13]. The uniqueness and value of microsatellites arise from their multiallelic nature, codominant transmission, ease of detection by PCR, relative abundance and extensive genome coverage [14].
In this study, SSRs were mined from unigenes. The various types of SSRs and their percentage distributions were determined. SSR markers developed from Solanum lycopersicum unigene sequences can be used as marker tags in other plants. The primer sequences are complementary to the flanking ends of a stretch of simple sequence repeats (SSRs). The functional perspectives of SSRs suggest that microsatellites are more than mere repetitive sequences, and roles in many biological functions have been attributed to them [9].

Methodology:
Sequence data source: The assembled and functionally annotated EST sequences, i.e. unigene sequences, of the solanaceous plant Solanum lycopersicum were retrieved from the UniGene database of NCBI (ftp://ftp.ncbi.nih.gov/repository/UniGene/Solanum_lycopersicum/). There were 18,228 unigenes available in the database. These unigenes form a nonredundant dataset that was used for microsatellite identification, primer design and gene ontology characterization.

Microsatellite identification:
The unigene sequences were mined for microsatellites using the MISA (MIcroSAtellite) identification tool [15], a program written in the Perl scripting language. This tool analyses microsatellite repeats in FASTA-formatted unigene sequences. The minimum number of motif repeats was set to 10 for mononucleotides, 6 for dinucleotides, and 5 for tri-, tetra-, penta- and hexanucleotides. The SSRs were analysed on the basis of their type (mono- to hexanucleotide), number of repeats, percentage frequency of occurrence of each SSR motif and their distribution in the sequence. The results were cross-checked through CUGI's SSR server.
The study of occurrences of different types of SSRs revealed that mononucleotide SSRs were the most frequent (89.36 %), followed by trinucleotide SSRs (6.38 %) (Figure 1a). Among mononucleotide repeats, polyA/polyT repeats were predominant, while polyC/polyG repeats were rare (Figure 1b). As reported earlier, A-T repeat motifs are the most abundant type of SSRs in plants [14]. All dinucleotide repeat combinations excluding homomeric dinucleotides can be grouped into six unique classes, namely (AG)n, (AT)n, (AC)n, (GT)n, (TC)n and (GC)n. AT/TA dinucleotide repeats were the most frequent, followed by TC/CT and AG/GA combinations, while the GC/CG combination was the least frequent (Figure 1c). [...] (Figure 1d). Similarly, the AAAT/AATT/TAAT/TTAA/TTTA/ATTT motif showed the maximum frequency among tetranucleotides (51.72 %), followed by AGTG/ATGT/TATG/ATAG/TTGA (20.69 %). This also reflects the dominance of adenine and thymine over the other nucleotides. TTCA, TTTG and GAAG (3.45 %) showed the least occurrence (Figure 1e).
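The repeat thresholds used for mining (10 for mono-, 6 for di- and 5 for tri- to hexanucleotide motifs) can be illustrated with a minimal Python sketch. This is not the MISA implementation: it finds perfect repeats only, the function names are hypothetical, and grouping motifs by rotation (AT/TA, AG/GA, ...) is an assumption made to mirror the motif classes reported above.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# 10 for mono-, 6 for di-, 5 for tri- to hexanucleotide motifs.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit (e.g. 'ATAT' or 'AA')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def canonical_motif(motif):
    """Label a motif by its alphabetically smallest rotation (TA -> AT, GAA -> AAG)."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Hits found for different motif lengths are not merged, so they may overlap.
    """
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least (min_rep - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), m.end(), motif, len(m.group(0)) // size))
                pos = m.end()
            else:
                # e.g. 'AA' repeats belong to the mononucleotide class; step on.
                pos = m.start() + 1
    return sorted(hits)


if __name__ == "__main__":
    demo = "GC" + "A" * 12 + "CGT" + "GAA" * 5 + "CCGG" + "AT" * 6 + "C"
    for start, end, motif, count in find_ssrs(demo):
        print(start, end, motif, count, canonical_motif(motif))
```

Counting hits per motif length and per canonical motif over all unigenes would give percentage distributions of the kind summarized in Figure 1.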

Codon Repetitions
The trinucleotide SSRs are triplet codons, each coding for a particular amino acid. Of all triplet codons, GAA (encoding glutamic acid) repeats were predominant, followed by AAG (encoding lysine) and TTC (encoding phenylalanine) repeats. The triplet codons form an open reading frame (ORF) that is translated into protein (Figure 2).
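The mapping from trinucleotide motif to amino acid can be checked against the standard genetic code. The sketch below is illustrative only; it assumes the standard code (NCBI translation table 1) and that the repeat is read in frame from the first base of the motif, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_repeat(motif, repeats):
    """Translate a trinucleotide SSR, read in frame from the first motif base."""
    if len(motif) != 3:
        raise ValueError("expected a trinucleotide motif")
    return CODON_TABLE[motif.upper()] * repeats


def residues_by_frame(motif):
    """Residue encoded by the same repeat in each of the three reading frames."""
    motif = motif.upper()
    return [CODON_TABLE[motif[i:] + motif[:i]] for i in range(3)]


if __name__ == "__main__":
    print(translate_repeat("GAA", 5))   # one-letter codes, E = glutamic acid
    print(residues_by_frame("GAA"))     # frames GAA, AAG, AGA
```

Because GAA and AAG are rotations of the same repeat, which amino acid a given SSR encodes depends on the reading frame of the surrounding ORF.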

Amino acid distribution
The trinucleotide microsatellites code for 21 categories, i.e. the amino acids together with the stop codon. Of all the encoded amino acids, serine (Ser) showed the highest percentage of occurrence, followed by leucine (Leu). In Solanum lycopersicum, serine occurs in Serine/threonine-protein phosphatase 5, which is involved in biological processes such as intracellular signaling cascade, lipid metabolic process and protein amino acid dephosphorylation. Serine protease is associated with biological processes such as negative regulation of catalytic activity and proteolysis. Methionine (Met) and aspartic acid (Asp) showed the least occurrence (Figure 3).
The analysis of the data revealed that the majority of amino acids were polar (56.67 %) in nature (Figure 4a). Hydrophilic amino acids (50.91 %) occurred more frequently than hydrophobic ones (49.09 %) (Figure 4b). Similarly, aliphatic amino acids (76.92 %) were more frequent than aromatic amino acids (23.08 %) (Figure 4c). With respect to chemical nature, neutral amino acids occurred most frequently (75.47 %), compared with basic (16.98 %) and acidic (7.55 %) amino acids (Figure 4d). A mutation that changes the last nucleotide of a triplet codon can convert one amino acid into another. The percentage frequency of occurrence of such mutations was also tabulated. The analysis of the data revealed that mutation of the stop codon to tyrosine and vice versa was the most frequent (Table 1).
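The effect of changing the last nucleotide of a codon can be enumerated directly from the genetic code. The following sketch assumes the standard code (NCBI translation table 1); the function name is hypothetical, and the code only lists which substitutions are possible, not how often they were observed in the unigenes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def third_base_changes(codon):
    """Third-position variants of `codon` that change the encoded residue.

    Returns {variant_codon: new_residue}; synonymous variants are omitted.
    """
    codon = codon.upper()
    original = CODON_TABLE[codon]
    changes = {}
    for base in BASES:
        variant = codon[:2] + base
        if variant != codon and CODON_TABLE[variant] != original:
            changes[variant] = CODON_TABLE[variant]
    return changes


if __name__ == "__main__":
    for codon in ("TAT", "TAA", "GAA", "TCT"):
        print(codon, CODON_TABLE[codon], third_base_changes(codon))
```

For the TA_ codon family, any third-base change that alters the product switches between tyrosine (TAT, TAC) and stop (TAA, TAG), which is the interconversion reported above.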

Study of functional protein of unigene having polymorphic condition
Microsatellites present in the transcribed regions of the genome have the potential to reveal functional diversity [4]. SSR motifs that vary in their number of tandem repeats, occur more than once within the same unigene, or are replaced by other SSR motifs give a particular account of polymorphism. This polymorphism was accompanied by a change in the function of the corresponding proteins. The SSR motifs could be utilized as molecular markers according to the gene or protein of interest. Out of all the unigene sequences, 3 unigenes with dinucleotide SSRs (Table 2 see supplementary material) and 8 unigenes with trinucleotide SSRs showed a change in protein function. The changes in the encoded amino acids are also listed with the corresponding trinucleotide SSRs (Table 3 see supplementary material).
Some amino acids, such as threonine, lysine and glutamine, are associated with homeodomain proteins, and cysteine with zinc-finger-like proteins. The findings support the theory that SSRs are randomly distributed in genomes and generally play a direct or indirect role in protein regulation [20].
Changes in gene ontology characterization were also studied with reference to the changes in functional proteins caused by polymorphism of dinucleotide and trinucleotide SSRs (Table 4 see supplementary material). The proteins involved in polymorphism belong to the biological process categories of transport, ovule development, metabolic process, defense response, regulation of transcription and biosynthetic process. The cellular components for these proteins were mostly cytoplasm, plastids, plasma membrane and mitochondrion. In terms of molecular function, these proteins were of the DNA, GTP, protein and zinc ion binding types, and some showed transcription factor activity, trypsin inhibitor activity or were defensin proteins.

Change in physicochemical properties of amino acids in trinucleotide polymorphism
The function of a protein changes with a change in the amino acid repeat encoded by the SSR. The physicochemical properties of the amino acids also vary with this conversion (Table 5 see supplementary material). In almost all cases the changeover was from hydrophilic to hydrophobic amino acids. Two unigenes were found in which the hydrophilic character was retained, i.e. glutamine and glutamic acid were converted to lysine, which is also hydrophilic in nature.

Gene ontology classification
Unigenes with corresponding SSRs were assigned GO terms for biological process, cellular component and molecular function (Table 6 see supplementary material). Some unigenes have a known function, whereas others show similarity to proteins of other plants; some have no assigned function or are putative. Most of the SSR-containing unigenes belong to the transport category of biological process. This points to the development of SSR markers for proteins and enzymes involved in the transport of lipids, potassium ions and phytohormones in metabolic pathways. A group of proteins was reported that plays enzymatic roles in the folic acid derivative, gibberellin and malate pathways. Transcription factors associated with these SSRs, such as Homeodomain/Homeobox protein, auxin response factor 8 and NAC domain, were identified that have a role in gene regulation. Other proteins were identified in the regulatory gene category, such as Myb-related transcription factor protein, glycine rich protein and homeodomain protein. Superoxide dismutase [SOD], carbonic anhydrase, telomere binding protein, CTR1-like kinase protein, WRKY transcription factor 2 and isochorismate synthase are the prominent proteins that figured under the different heads of defense and stress response. Different class of enzymes

Primer designing
Primers were designed for PCR amplification of the desired microsatellites using Primer 3.0 software. Primers flanking the microsatellite repeat motifs could be designed for 877 (47.02 %) of the 1,865 microsatellites in the analyzed plant species [18]. The largest number of forward and reverse primer pairs was obtained from trinucleotide SSRs (628), followed by dinucleotide SSRs (227). Tetranucleotide and hexanucleotide SSRs yielded 14 and 8 primer pairs, respectively.
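How a primer pair relates to the sequences flanking an SSR can be shown with a deliberately naive sketch. It is not the Primer 3.0 procedure: it picks primers at fixed positions and ignores melting temperature, GC content and secondary structure. The function names and the default primer length and offset are hypothetical values chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def naive_primer_pair(seq, ssr_start, ssr_end, primer_len=20, offset=10):
    """Pick fixed-position primers around the SSR at seq[ssr_start:ssr_end].

    Returns (forward, reverse, product_size), or None when the flanking
    sequence on either side is too short for the requested layout.
    """
    seq = seq.upper()
    fwd_start = ssr_start - offset - primer_len
    rev_end = ssr_end + offset + primer_len
    if fwd_start < 0 or rev_end > len(seq):
        return None
    # Forward primer: copied from the upstream flank of the given strand.
    forward = seq[fwd_start:fwd_start + primer_len]
    # Reverse primer: reverse complement of the downstream flank.
    reverse = reverse_complement(seq[rev_end - primer_len:rev_end])
    return forward, reverse, rev_end - fwd_start


if __name__ == "__main__":
    unigene = "ACGT" * 10 + "GAA" * 5 + "TTGCA" * 8
    print(naive_primer_pair(unigene, 40, 55))
```

In this layout the amplified product spans both flanks and the repeat itself, so a change in repeat number appears as a change in product length.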

Conclusions:
This study is a step forward towards the utilization of in-silico approaches to analyze microsatellites (SSRs) from unigene sequences of plants. The unigene database provides a valuable resource for the development of SSR markers associated with transcribed genes. Our study presents an analysis of microsatellites in the unigenes of Solanum lycopersicum. The UGMS markers identified and characterized in this study provide insight into the abundance and distribution of SSRs in the expressed organellar genome of Solanum lycopersicum. SSR markers are very informative because they are codominant and highly polymorphic. SSR markers are highly mutable loci that can be used to scan the genome by genotyping, so that a particular region of the genome can be identified.
The development of SSR markers from unigene sequences saves both cost and time once a sufficient amount of data is available. It is also interesting to note the mutational changes among trinucleotide SSRs due to a change in the last nucleotide of the triplet codon. This helps in deducing markers of one's own choice for study. Of the unigenes investigated, some possessed more than one SSR. Flipping of one SSR to another led to a change in the function of the protein.
Microsatellites have proved very useful as molecular markers in diverse areas of genetic research including genome characterization and mapping.
This study demonstrated the utility of computational approaches for mining SSRs from the ever-increasing repertoire of publicly available plant unigene sequences present in different databases. Computational approaches provide an attractive alternative to conventional laboratory methods for the rapid and economical development of SSR markers by utilizing freely available genomic sequences in public databases.

Figure 1: a) Percentage distribution of different SSRs. b) Percentage distribution of mononucleotide SSRs. c) Percentage distribution of dinucleotide SSRs. d) Percentage distribution of trinucleotide SSRs. e) Percentage distribution of tetranucleotide SSRs. f) Percentage distribution of pentanucleotide SSRs. g) Percentage distribution of hexanucleotide SSRs.

Figure 2: Frequency of distribution of triplet codons

Figure 3: Percentage distributions of amino acids

Figure 4: a) Percentage frequency of polar & non-polar amino acids. b) Percentage frequency of hydrophilic & hydrophobic amino acids. c) Percentage frequency of aromatic & aliphatic amino acids. d) Percentage frequency of neutral, basic & acidic amino acids.
Unigene derived microsatellite (UGMS) markers have the advantage of unique identity and positions in the transcribed regions of the genome [19]. With the availability of large unigene databases, it is now possible to search systematically for microsatellites in the unigenes [19]. The SSR repeat motifs were analyzed from these unigenes. Out of 18,228 unigene sequences (22.2 Mb), 12,090 showed the presence of simple sequence repeats (SSRs), suggesting that 66.32 % of unigenes contained SSRs. The average density of SSRs was one SSR per 1.3 kb of unigene sequence screened.
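The summary figures follow from simple ratios, as the short sketch below shows. The count of 17,524 SSRs is taken from the abstract, the other inputs from the paragraph above; the function name is hypothetical, and the calculation assumes that density is defined as total sequence length divided by the total number of SSRs.

```python
def ssr_summary(n_unigenes, n_with_ssr, n_ssrs, total_bp):
    """Return (percent of unigenes containing SSRs, kb of sequence per SSR)."""
    percent_with_ssr = 100.0 * n_with_ssr / n_unigenes
    kb_per_ssr = total_bp / n_ssrs / 1000.0
    return percent_with_ssr, kb_per_ssr


if __name__ == "__main__":
    percent, kb = ssr_summary(
        n_unigenes=18228,   # unigene sequences screened
        n_with_ssr=12090,   # unigenes containing at least one SSR
        n_ssrs=17524,       # total SSRs identified (from the abstract)
        total_bp=22.2e6,    # 22.2 Mb of unigene sequence
    )
    print(f"{percent:.2f} % of unigenes contain SSRs")
    print(f"one SSR per {kb:.1f} kb")
```

With these inputs the share of SSR-containing unigenes comes to about 66.3 % and the density to roughly one SSR per 1.3 kb, in line with the reported values.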

Table 4: Changes in gene ontology

Table 5: Change in characteristics of amino acids in trinucleotide polymorphism

Table 6: Gene ontology based functional annotation and classification of SSRs of Solanum lycopersicum