Classification and comparative analysis of Curcuma longa L. expressed sequence tags (ESTs) encoding glycine-rich proteins (GRPs).

Glycine-rich proteins (GRPs) are a group of proteins characterized by a high content of glycine residues, often occurring in repetitive blocks. The diverse expression patterns and subcellular localization of various GRPs suggest their involvement in different physiological processes. Several GRPs have been isolated and characterized from different monocots and dicots. However, little or no information is available about the structure and function of GRPs in asexually reproducing plants. In this study, in-silico analysis of the expressed sequence tag database resulted in the isolation of fifty-one GRPs from Curcuma longa L., an asexually reproducing plant of great medicinal and economic significance. Phylogenetic analysis grouped the GRPs into four distinct classes based on conserved motifs and the nature of the glycine-rich repeats. The majority of the isolated GRPs exhibited high homology with known GRPs from other plants that are expressed in response to various stresses. The high structural diversity and the presence of a signal peptide in some GRPs suggest diverse physiological roles and tissue-specific localization. The isolated sequences can be used as a framework for cloning, characterization and expression analysis of GRPs in response to various biotic and abiotic stresses in Curcuma longa as well as other asexually reproducing plants.


Background:
The glycine-rich proteins (GRPs) belong to a superfamily characterized by the presence of semi-repetitive glycine-rich motifs. These proteins have a glycine content of 20 to 70%, with the glycine residues arranged in (Gly)n-X repeats. Although the first genes encoding GRPs were isolated from plants, GRPs have been reported in a wide variety of organisms, from cyanobacteria to animals [1]. GRPs are broadly classified into four major groups based on conserved motifs and the arrangement of glycine repeats. Class I GRPs contain a signal peptide followed by a region of high glycine content with (GGX)n repeats. These proteins are attributed a structural function because of their cell wall localization [2]. Class II GRPs may or may not have a signal peptide. They carry a C-terminal cysteine-rich region following the glycine-rich region and are characterized by the presence of universal (GGXXXGG)n repeats. Class III GRPs may carry a signal peptide and have the lowest glycine content of all the classes. They are characterized by the presence of GXGX repeats and show a high degree of structural diversity. Class IV comprises the RNA-binding GRPs, which have a characteristic RNA recognition motif (RRM) or a cold shock domain in addition to the glycine-rich domain. A few of the RNA-binding GRPs are also characterized by the presence of CCHC zinc fingers in their structure.
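The repeat types described above can be illustrated with a minimal Python sketch that computes glycine content and counts repeat blocks in a protein sequence. The regular expressions are assumptions made for illustration only: X is taken to be any single non-glycine residue, and at least two consecutive units are required for the (GGX)n and GXGX patterns. The function names and test peptides are hypothetical, and this is not the procedure used in the study.

```python
import re

# Illustrative patterns only. "X" is assumed to mean any single non-glycine
# residue, and at least two consecutive units are required for GGX and GX.
MOTIFS = {
    "GGX": re.compile(r"(?:GG[^G]){2,}"),
    "GGXXXGG": re.compile(r"GG[^G]{3}GG"),
    "GXGX": re.compile(r"(?:G[^G]){2,}"),
}


def glycine_content(protein: str) -> float:
    """Return the percentage of residues that are glycine (0.0 if empty)."""
    protein = protein.upper()
    if not protein:
        return 0.0
    return 100.0 * protein.count("G") / len(protein)


def motif_hits(protein: str) -> dict:
    """Count non-overlapping matches of each illustrative repeat pattern."""
    protein = protein.upper()
    return {name: len(pattern.findall(protein)) for name, pattern in MOTIFS.items()}
```

A real classification would also need the signal peptide, cysteine-rich region and RNA-binding domain information described above, which simple repeat counting does not capture.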
In the past few years, several plant GRPs have been functionally characterized. They are believed to be developmentally regulated as well as modulated by biotic and abiotic factors. Although most GRPs are attributed a structural function owing to their cell wall location, recent developments suggest that GRPs are diverse in their location and function, the only similarity being the presence of glycine-rich repeats.
Curcuma longa L. (turmeric), of the family Zingiberaceae, is one of the most important crops, with great medicinal and economic significance. Its medicinal uses are diverse, ranging from cosmetic face cream to the prevention of Alzheimer's disease. Turmeric has also been described as the queen of natural COX-2 inhibitors [5]. India is the world's largest producer and exporter of turmeric, followed by China, Indonesia, Bangladesh and Thailand [6]. However, turmeric is completely sterile and is propagated exclusively by vegetative means using the rhizome. This seems to have eroded its genetic base, making it susceptible to major biotic and abiotic stresses. Characterization and comparative analysis of GRPs in turmeric can provide a wide array of information on the regulation of different stress responses in vegetatively propagated plants. Recent advances in Curcuma genomic technologies have generated a large number of expressed sequence tags (ESTs) that have been made available in public databases, thereby offering an opportunity to classify and compare glycine-rich protein sequences in turmeric. As of July 2010, GenBank had released 12593 EST sequences from Curcuma longa. In the present study, we describe the isolation, classification and characterization of glycine-rich proteins from the Curcuma longa EST database using known GRP sequences.

Methodology:
A basic local alignment search tool (BLAST) TBLASTN search [7] was performed using protein sequences of reported plant GRPs as baits against the Curcuma longa expressed sequence tag (EST) database. A total of 12593 Curcuma longa EST sequences were mined from two tissue libraries: rhizome, with 6870 ESTs (DY395309-DY388440), and leaf, with 5723 ESTs (DY388439-DY382717). The EST sequences were screened against the UniVec database from NCBI (ftp://ftp.ncbi.nih.gov/pub/UniVec/) to detect vector and adapter sequences using the program Cross_Match. The CAP3 program was used to assemble the EST sequences into contigs to create a non-redundant dataset. The GRP sequences used as baits include those reviewed by Sachetto-Martins. All the turmeric GRPs isolated were subsequently translated to obtain their putative protein sequences. The open reading frames (ORFs) for each retrieved contig were predicted using the Expasy Translate Tool (bo.expasy.org/tools/dna.html). The protein sequences obtained were used in a second round of TBLASTN search against the non-redundant protein database at the National Center for Biotechnology Information (NCBI) to identify their closest homologues. Additional domains were detected using the Prosite (http://bo.expasy.org/prosite) and Pfam (http://www.sanger.ac.uk/Software/Pfam/search.shtml) prediction programs. Signal peptides were predicted using the SignalP server (http://www.cbs.dtu.dk/services/SignalP). The ClustalX program [18] was used to align the GRPs deduced from the turmeric EST database. The phylogenetic tree was constructed using the Molecular Evolutionary Genetics Analysis (MEGA) software package version 2.1 [19]. The neighbor-joining distance method was used, with pairwise deletion to treat amino acid gaps in the multiple alignment of turmeric GRPs. For construction of the phylogenetic tree, the confidence levels for the nodes were determined with 1000 replications using the internal branch test [20].
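The pairwise deletion option mentioned above can be illustrated with a minimal Python sketch of the p-distance, the distance measure named in the legend of Figure 1. For each pair of aligned sequences, columns in which either sequence has a gap are skipped, and the distance is the proportion of the remaining columns that differ. This is a simplified illustration, not MEGA's implementation; the function names and the toy alignment used to exercise it are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # pairwise deletion: skip this column for this pair only
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns shared by this pair")
    return differing / compared


def distance_matrix(alignment: dict) -> dict:
    """Return p-distances for every unordered pair of named aligned sequences."""
    names = sorted(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

A matrix of this kind is the input to neighbor joining. Because gaps are removed pair by pair rather than across the whole alignment, different pairs may be compared over different numbers of sites.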

Results and Discussion:
Typical GRP protein sequences were used to search the Curcuma longa EST database for genes encoding glycine-rich proteins. Fifty-one potential turmeric GRP gene sequences were isolated and distributed into four distinct classes: class I (GGGX), class II (GGXXXGG), class III (GXGX) and class IV (RNA-binding GRPs) (Table 1; see supplementary material).
A similar in-silico approach has been used earlier to identify GRPs in other important plants such as sugarcane [15] and eucalyptus [16].
The turmeric GRP sequences were almost equivalent to other monocotyledonous GRP sequences already published. Fourteen sequences encoding GRPs with GGGX repeats were identified in the Curcuma longa EST database. The sequences differed considerably from one another and were related to previously known GRPs from monocots and dicots. Four sequences showed high similarity with AtGRP6, the cold shock glycine-rich protein from Arabidopsis. Likewise, four and three sequences showed high similarity with GRPs from Oryza sativa and Zea mays, respectively. Eleven of the 14 class I GRPs had a signal peptide at their N-terminal end, suggesting their location in the cell wall or cell membrane. Ten sequences encoded GRPs that were highly enriched in histidine, with GGGH repeats. Similar results were obtained for Eucalyptus GRPs [16].
Searching the turmeric EST database using previously reported GRPs with cysteine-rich domains and C-terminal homology to nodulins resulted in the identification of five GRPs with GGXXXGG repeats. The tripeptide between the glycine pairs was composed of the amino acids Y, N and R. One of the five class II turmeric GRPs, CL.CON.1727, showed high homology with a Triticum aestivum predicted GRP having a distinct signal peptide.
The class III GRPs with GXGX repeats have the lowest glycine content, only 20%. In turmeric, twenty-one different sequences encoding this type of GRP were identified. These GRPs were also rich in alanine and arginine in addition to having the glycine-rich domains. The GRP sequences were highly diverse, representing heterogeneous groups of
Four well-separated clusters were obtained in the phylogenetic tree (Figure 1), each representing the members of a particular class of GRPs. The GRPs with GGXXXGG repeats and the RNA-binding GRPs were relatively closer to each other than to the other classes. Likewise, the turmeric GRPs with GGGX repeats formed a cluster completely separate from the other three classes.

Conclusion:
Turmeric is a sterile monocot that exhibits high stigmatic incompatibility and reproduces vegetatively. It has a relatively small triploid genome with n=21 that exhibits secondary polyploidy. Thus, it can be effectively used as a model plant for characterizing and analyzing various genes that are expressed in response to different stresses in asexually reproducing monocots. In the present study, we identified fifty-one Curcuma longa glycine-rich protein sequences from the EST database of the plant. Although several genes encoding GRPs have been isolated from different species, only a few have been cloned and characterized and had their functions determined. With the availability of a large number of in-silico derived GRPs from different plant species, more information can be obtained on the role of GRPs in diverse processes such as stress responses, signal transduction and developmental regulation. The greater diversity in structure, modulation and localization among the grp genes expressed in turmeric suggests that they are directly or indirectly involved in several physiological processes. Thus, the in-silico derived GRPs isolated in the present study will act
[3]. In plants, the genes encoding GRPs are induced by physical, chemical and biological factors such as temperature, wounding, pathogen infection, salinity, drought, flooding, light and salicylic acid [1]. This diverse functionality suggests that GRPs are components of different multimolecular complexes in which glycine-rich domains are required for maintaining the stability and flexibility of molecular interactions [4]. Several GRPs have been characterized in different plants such as Arabidopsis, rice, sugarcane and Eucalyptus. The differential modulation and subcellular localization, together with the broad structural diversity, suggest that GRPs do not represent a single family of proteins but a group of proteins that share a common structural motif [1].
The remaining four appear closely related to HvGRP1 of Hordeum vulgare.
The class II GRPs have been found to interact with cell wall-associated kinases, which initiate the recognition of various environmental signals in response to external stresses and transduce them into the cell [21].
ISSN 0973-2063 (online) 0973-8894 (print) Bioinformation 8(3):142-146 (2012) 144 © 2012 Biomedical Informatics
proteins with no significant sequence similarity except within the glycine-rich motif. Five GRPs with GHGH repeats showed highest homology to a Zea mays aluminium-induced GRP. Likewise, two turmeric GRPs, CL.CON.1712 and CL.CON.1872, exhibited highest similarity with a cold- and drought-regulated GRP from Medicago sativa. This suggests that the class III GRPs from turmeric are likely to be expressed in response to abiotic stresses. Five of the 21 class III GRPs possessed a signal sequence at the N-terminal end, reflecting their extracellular localization. Many GRPs isolated from other plants have a signal peptide and are located outside the cell [2, 22].
As in other monocotyledonous plants, no oleosin GRPs were isolated from turmeric. As the oleosin GRPs are involved in tapetal development in dicots, their absence in turmeric further supports the existing difference in anther and pollen grain development between monocots and dicots [23].
The class IV GRPs consist of an RNA-binding motif at the N-terminal end followed by a C-terminal region rich in glycine repeats. They are broadly classified into four subclasses: subclass I, with an RNA recognition motif (RRM) at the N-terminal end; subclass II, with a conserved RRM and a CCHC zinc finger motif within the glycine-rich region; subclass III, with a cold shock domain at the N-terminus and CCHC zinc fingers within the glycine-rich motif; and subclass IV, with two RRMs at the N-terminus [1, 15, 24]. Eleven Curcuma longa sequences encoding RNA-binding GRPs were identified and classified according to their domain organization. Five sequences from the turmeric EST database, CL.CON.447, CL.CON.794, CL.CON.1078, CL.CON.1582 and CL.CON.3068, showed highest homology with a putative RNA-binding glycine-rich protein from Oryza sativa Japonica group. All five GRPs encoded a conserved RRM at the N-terminus and a CCHC zinc finger motif within the glycine-rich motif, classifying them as subclass II RNA-binding GRPs. Three sequences exhibited homology with an abscisic acid-inducible RNA-binding GRP from Zea mays. They were classified as subclass I RNA-binding GRPs because of the presence of a single RNA recognition motif at the N-terminus of the protein sequence. The remaining three sequences were classified as subclass IV RNA-binding GRPs. They showed homology with different sequences from Medicago sativa, Sorghum bicolor and Triticum aestivum, respectively, each having at least two RRMs followed by a C-terminal glycine-rich region. No subclass III GRPs were identified in the Curcuma longa EST data bank. The absence of subclass III GRPs has also been reported in other monocotyledonous plants [25]. However, Fusaro et al. [15] reported the isolation of ten subclass III RNA-binding GRPs from the sugarcane EST data bank.
The greater diversity among the GRPs of turmeric suggests their origin through DNA recombination. The high GC content of the grp genes makes them active sites for recombination events, resulting in high variability in the different classes of turmeric GRPs. The existence of high recombination in glycine-rich regions has already been demonstrated in mammals [3]. Alignment of the turmeric EST-encoded GRPs using MEGA ver 2.1 resulted in a distinct unrooted tree (Figure 1).
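The link between glycine-rich repeats and high GC content follows from the standard genetic code, in which all four glycine codons have the form GGN. The short Python sketch below makes this arithmetic explicit. The repeat sequence in it is a hypothetical coding sequence constructed for illustration, not a turmeric EST, and the sketch says nothing about recombination itself.

```python
# All four glycine codons in the standard genetic code (DNA alphabet).
GLYCINE_CODONS = ("GGT", "GGC", "GGA", "GGG")


def gc_content(dna: str) -> float:
    """Return the percentage of G and C bases in a DNA string (0.0 if empty)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


# GC content of each glycine codon: at least two of the three bases are G.
per_codon = {codon: gc_content(codon) for codon in GLYCINE_CODONS}

# Hypothetical coding sequence for four Gly-Gly-His units (GGC GGT CAC).
toy_repeat = "GGCGGTCAC" * 4
toy_gc = gc_content(toy_repeat)
```

Every glycine codon is at least two-thirds GC, so a coding region dominated by glycine repeats is GC-rich regardless of which synonymous codons are used.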

Figure 1: Unrooted dendrogram of glycine-rich protein sequences encoded by Curcuma longa ESTs. The relationships were calculated using MEGA (p-distance, neighbour-joining method and bootstrap test with 1000 replications, pairwise deletion). The analysis was based on the ClustalX alignment of the sequences.

Supplementary material:
Table 1: Curcuma longa ESTs encoding different classes of glycine-rich proteins, including data on the homologous sequences, accession numbers and E-values.