Identification and annotation of abiotic stress responsive candidate genes in peanut ESTs

Peanut (Arachis hypogaea L.) ranks fifth among the world oil crops and is widely grown in India and neighbouring countries. Due to its large and unknown genome size, studies on genomics and genetic modification of peanut are still scanty as compared to other model crops like Arabidopsis, rice, cotton and soybean. Because of its favourable cultivation in semi-arid regions, study on abiotic stress responsive genes and its regulation in peanut is very much important. Therefore, we aim to identify and annotate the abiotic stress responsive candidate genes in peanut ESTs. Expression data of drought stress responsive corresponding genes and EST sequences were screened from dot blot experiments shown as heat maps and supplementary tables, respectively as reported by Govind et al. (2009). Some of the screened genes having no information about their ESTs in above mentioned supplementary tables were retrieved from NCBI. A phylogenetic analysis was performed to find a group of utmost similar ESTs for each selected gene. Individual EST of the said group were further searched in peanut ESTs (1,78,490 whole EST sequences) using stand alone BLAST. For the prediction as well as annotation of abiotic stress responsive selected genes, various tools (like Vec-Screen, Repeat Masker, EST-Trimmer, DNA Baser, WISE2 and I-TASSER) were used. Here we report the predicted result of Contigs, domain as well as 3D structure for HSP 17.3KDa protein, DnaJ protein and Type 2 Metallothionein protein.


Background:
Peanut (Arachis hypogaea L.) is an important source of protein and vegetable oil. Most of the species among the genus Arachis are wild and are not domesticated. Cultivated peanut (A. hypogea L.) is an allotetraploid (2n = 4x = 40) species which is widely cultivated around the world in tropical, sub-tropical and warm temperate climates [1]. Thus, peanut is an internationally important crop for human consumption and oil production. Abiotic stresses such as drought, cold, salinity and sodacity are major factors affecting the crop yield. There are many genes or gene networks are involved in response to abiotic stresses. Some of the major challenges for the breeding programs are developing abiotic stress tolerant, improved oil quality and flavoured varieties of peanut. As peanut can be cultivated in arid and semi-arid regions and drought or dehydration is most frequent stress reduced the yield. Therefore, study on drought stress responsive genes and its regulation is very much important. Use of molecular markers and marker-assisted selection for crop improvement and evolution showed little variation which resulted in limited application. Therefore, the development of transgenic peanut has enormous potential for enhanced productivity, trait improvement and identification of agronomically useful genes.
Expressed sequence tags (ESTs) are considered as most appropriate tool for discovering a gene, mapping a gene and analyzing various traits necessary for crop improvement. They are generated from large scale sequencing of clones from cDNA libraries prepared from mRNA isolated at a specific stage of cell/tissue development. Many EST libraries have been constructed and are available in sequence databases. As the functions and regulation of most of the newly identified genes are not known, but can be determined by multiple sequence alignment with known genes/ ESTs available in the databases.
EST approach developed during the establishment of genomics era and in HGP (Human Genome Project) has proved to be the most high throughput technologies ever invented [2]. Use of EST maps in major crop genome projects such as Zea mays [3], Glycine max [4] and Triticum aestivum [5] have been successfully applied. Study on identification and functional validation of some drought induced genes differentially expressed during gradual water stress in peanut has been conducted [6].
In the present study, we aim to identify the drought stress responsive candidate genes in peanut ESTs available in databases. Experimental data of expression analysis of drought stress responsive genes and its corresponding EST sequences as heat maps and supplementary tables reported previously [6] were screened. Furthermore, a phylogenetic analysis was performed to find a cluster of utmost similar ESTs for each selected gene. Individual EST of the mentioned cluster were further searched in peanut ESTs (1, 78,490) using stand alone BLAST. To predict as well as annotate abiotic stress responsive selected genes various tools like Vecscreen (a system that is developed by NCBI for quickly identifying and removing segments of a nucleic acid sequence that may be of vector origin and act as contaminant, before sequence analysis), RepeatMasker (a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences), EST trimmer (a tool used for trimming EST sequences concerning ambiguous sequences, removing of distal oligoN series from either the 5' or 3' end and size cut off), DNA Baser (that allows to assemble a set of contiguous sequences, contigs) and WISE2 (that compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors) were used. Here we report six best putative candidate genes belonging to three different gene families for drought stress response in peanut. Further, Predicted result of contigs, domain as well as 3D structure for HSP 17.3KDa protein, DnaJ protein and Type 2 Metallothionein protein were also discussed.

Screening of Peanut ESTs
Experimental data for expression analysis of drought stress responsive genes shown in the heat map and supplementary table [6] were downloaded. Depending upon the higher expression during drought five gene families was selected. Total twenty two ESTs were screened from the above selected five different gene families. With the efforts of Peanut Genome Initiative (PGI), out of 2, 52,832 ESTs reported from the genus Arachis at NCBI, 1, 78,490 ESTs were belongs to cultivated peanut A. hypogaea. Total ESTs sequences of A. hypogaea and above twenty two screened ESTs sequences were retrieved from the NCBI database.

Phylogenetic Analysis
To find relationship between ESTs of individual gene family with representative genes, phylogenetic analysis was performed using MEGA 5.1 Beta (a unified tool for showing automatic and manual sequence alignment, concluding phylogenetic trees, estimating rates of molecular evolution, mining web-based databases, inferring inherited sequences, and testing evolutionary hypotheses) software available at http://www.megasoftware.net. It is used for phylogenetic analysis in various research fields  The phylogenetic tree represented the nearest gene with respect to reference gene is gi|108954762; (b) Dendrogram depicting phylogenetic relationship between screened ESTs of desiccation protective protein and representative gene of Lotus japonicu (gi|388503839|gb|BT140191.1). The phylogenetic relationship is analysed by using MEGA 5.1. The phylogenetic tree represented the nearest gene with respect to reference gene is gi|108954989; (c) Dendrogram depicting phylogenetic relationship between screened ESTs of type 2 metallothionein and representative gene of A. hypogaea (gi|71040682|gb|DQ097730.1). The evolutionary relationship is analysed by using MEGA 5.1. The phylogenetic tree represented the nearest gene with respect to reference gene is gi|108955131; (d) Dendrogram depicting phylogenetic relationship between screened ESTs Hsp70-60 KDa chaperonin and representative gene of Oryza sativa (gi|297722722|ref|NM 0011867). The phylogenetic relationship is analysed by using MEGA 5.1. The phylogenetic tree represented the nearest gene with respect to reference gene is gi|108955006.

Standalone BLAST
The standalone BLAST wasperformed against whole peanut ESTs and using screened ESTs as query sequences. BLAST analysis was done on the basis of their highest bit score and lowest e-value.

Data Normalization
After BLAST analysis, the shortlisted ESTs were further checked for the presence of vector sequences, polyA tail, interspersed repeats and low complexity regions in the DNA followed by their removal by using different online available software's like VecScreen [available at www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html and was used for the identification of a nucleic acid sequence that may contain a vector origin. A hash search algorithm was used to remove contaminating sequences [8]. The algorithm initially uses a hash lookup, followed by a gapless hit extension. Finally, ESTs that passed the filters but possessed an unmasked sequence were discarded], EST trimmer (available at http://imed.med.ucm.es/EMBOSS/runs/tmp.0.xuYbpD/inde x.html) and can recognize one or more nucleotide sequences with any 3' poly-A tail (or, optionally, 5' poly-T tail) in the input sequences that are at least the specified minimum length [8] and Repeat Masker available at www.repeatmasker.org/and is a program that curtains DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the database is a detailed explanation of the repeats that are contemporary in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked. In various contig assemblies, repeat masker has been used [9].

EST Assembly
EST assembly was performed using DNABaser (www.dnabaser.com/) software. Data filtering and assembly of EST sequences were accomplished with the help of DNABaser (v3.5.3), which incorporates quality filtering, clustering, and assembly into one distinct pipeline. The filtering step includes masking of vector sequence and low-quality regions and annotation of low-complexity regions, repeats, and poly (A) regions. The final assembled ESTs were named as contigs.

Identification of putative candidate genes
The contigs of individual gene families were subjected to BLAST analysis with the whole genome of Arabidopsis available in TAIR10 (www.arabidopsis.org/) database. The putative candidate gene was identified by WISE2 (www.ebi.ac.uk/Tools/Wise2/) analysis on the basis of lowest e-value and highest WISE2 score (in bits). Wise2 help to perform the comparison of a protein sequence to a genomic DNA sequence which allows understanding of the errors in intron and frame shifting using Gene Wise algorithm [10]. It helped to select the putative candidate genes with known functions.
Three dimensional structure prediction of identified domains 3D structure of the identified domains of the protein sequences were predicted using I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/). I-TASSER (The iterative threading assembly refinement) server determines 3D structures of protein based on the sequence-tostructure-to-function paradigm algorithm. It predicts secondary structure, tertiary structure and functional annotations on ligand-binding sites, enzyme commission numbers and gene ontology terms. The accuracy of prediction is based on the confidence score of the modelling [11, 12].

Discussion:
Peanut (A. hypogaea) is an important oilseed crop having major nutritional as well as economic value. It is the third major oilseed crop in the world next to soybean and cotton. Production of peanut at global level plays a significant role as far as nutrition, fuel and energy, sustainable agriculture and enhanced productivity is concerned. To enhance crop production, genetic improvement through genomic research is more beneficial [2]. Numerous applications and recent developments from the EST concept have been made after it was proposed by Adams et al [13]. EST sequencing was aimed to clone the full length sequences of genes with agronomic value. Large-scale EST sequencing offers an access into the genome of an organism. The ESTs gave significant evidence about its coding content and expression patterns in different tissues of plants in different environmental conditions. The EST resource of A. hypogaea will play major role for genome sequencing and gene expression analysis in different abiotic stresses. Therefore, to get necessary information about genes involved in abiotic stresses, ESTs were selected which currently encompass more entries in the public databases than any other form of sequence data. Thus, EST datasets provide a vast resource for gene identification and expression profiling. By doing EST assembly, researchers will be able to learn about tissue specificity and expression profiling of peanut genes [14].  (Figure 1a, 1b, 1c & 1d).
The shortlisted ESTs were further normalized by removing vector sequences, Poly-A tail and masking of low complexity region using different online available software's. We found that out of 3073 ESTs, 2931 EST sequences did not contain vector sequences were used for further analysis. As reported by Joshua et al [15], a number of quality features are evaluated for EST assembly such as: (a) the occurrence of chimeric contigs; (b) the constancy of mate-pairs in the same contig; (c) phylogenetic analysis using known genes and their relationships; and (d) the extent of redundancy among the assembly's contigs and singletons. In an ideal assembly, for the generation of a single contig, ESTs transcribed from a single gene are conjoined together. A phylogenic approach was also used to assess EST assembly quality [16]. After normalization, EST sequences were assembled for finding the contigs using DNABaser (v3.5.3) (Figure 2). These contigs were further BLAST (v2.2.8) with Arabidopsis Whole Genome (TAIR v10) Table 2 (see  supplementary material). On the basis of lowest e-value, out of five, three gene families (i.e. 17KDa Class II Heat Shock Protein, DnaJ protein and Type 2 Metallothionein) were selected. WISE2 (v 2.1.20) was performed to select putative candidate genes with known functions as well as intron and exon identifications [17,18]. Based on highest WISE2 score (in bits) six putative candidate genes were found and their respective families with their roles are mentioned ( Table 2).
The functional characterization of protein depends upon their 3D structure, therefore from the resulting assembled contig sequences, conserved protein domains were assessed from InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan) [19] which allows to scan protein sequence for matches against the InterPro collection of protein signature databases. The different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can thus provide insights into their function. We also predicted the possible domains in three protein sequences of HSP17.3KDa protein (Figure 3a), DnaJ protein (Figure 3b) and Type 2 Metallothionein protein ( Figure  3c) by using InterProScan. InterProScan of HSP17.3KDa protein has Alpha crystallin/HSP20 domain (IPR002068) as well as HSP20-like Chaperone and other unassigned HSP20 domains (Figure 3a). Likewise, DnaJ protein showed N-terminal HSP domain (IPRO01623), C-terminal HSP domain (IPRO02939) and others (Figure 3b). Again, metallothionein type 2 has entirely a single domain of metallothionein family-15 type 2 domains (IPRO00347) (Figure 3c).Earlier it was reported that Small heat shock proteins (sHSPs) varies from 15-42 kDa [20] that play a significant role in providing resistance to variety of abiotic stresses such as drought, salt, cold and oxidants besides high temperature [21,22]. Tzi et al [23] reported that DnaJ protein plays an important role in various stress response mechanism. Clement et al [24] reported that DnaJ protein interacts with Hsp70 that might also be involved with ABA signalling. Metallothioneins are associated in metal detoxification, homeostasis, and protection during oxidative damage [25]. It was reported that metallothionein transcripts were present in both cultivated peanut and wild species [26,1].
Simultaneously, we predicted the 3D structure of corresponding protein sequences of contigs HSP17.3KD protein (Figure 4a), DnaJ protein ( Figure 4b) and Type 2 Metallothionein protein (Figure 4c) by using I-TASSER. Biological role of above three proteins were also noted from I-TASSER result. The quality of the generated models are estimated based on a confidence score (C-score), ranges from -5 to 2 where a high value signifies a model with a high confidence and vice-versa. C-score is highly correlated with TM-score and RMSD thus, TM-score and RMSD are known standards to measure the accuracy of structure modelling thereby measuring structural similarity between two protein structures. RMSD is an average distance of all residue pairs in two structures and is sensitive to local errors (i.e., a misorientation of the tail) which occurs inspite of the correct global topology hence, TM-score must be used for solving these errors. A TM-score >0.5 indicates a model of correct topology. Therefore, the best predicted model is selected on the basis of confidence score; TM-Score as well as RMSD value Table 3 (see  supplementary material). I-TASSER server predicts and displays various features in different sections for best model studies. We are considering the prediction and generation of the best model based on C-Score, their structural analogs and binding sites. The C-Score value for the best predicted model HSP17.3KD, DnaJ protein and Type 2 Metallothionein protein (0.3), (-3.18) and (-2.98) respectively and furthermore, highly similar structures in PDB (as identified by TM-align) of HSP17.3KD, DnaJ and Type 2 metallothionein were identified and listed in Table 4 (see supplementary material). Template proteins with similar binding sites for HSP, DnaJ and Type 2 metallothionein were listed in Table 5 (see supplementary  material). The best binding site is predicted on the basis of Cscore LB (Range = 0-1) and BS-Score (>1) values. A higher score Cscore LB indicates a more reliable ligand-binding site prediction and BS-score reflects a significant local match between the predicted and template binding site (Table 5) [11,12].
We found that HSP 17.1 KDa has important role in some molecular function (GO: 0005212) that contributes in the structural integrity of the cell and less role in creating protein oligomers (GO: 0051260). Similarly DnaJ interact selectively and non-covalently with unfolded proteins (GO: 0051082) and is involved in assisting the covalent and non-covalent assembly of single chain polypeptides or multi-subunit complexes into the correct tertiary structure (GO: 0006457). On the other hand, type 2 Metallothionein interact selectively and non-covalently with an adenyl-ribonucleotide, any compound consisting of adenosine esterified with (ortho) phosphate or an oligophosphate at any hydroxyl group on the ribose moiety (GO:0032559) and is thus involved in the chemical reactions and pathways involving the phosphate group, the anion or salt of any phosphoric acid (GO:0006796). Govind et al [6] described that for cellular mechanisms, groups of gene products are involved that act in coordination in response to stimuli of water withhold.

Figure 4: (a)
Result showing 3D structure of HSP17.3KDa protein predicted by I-TASSER. The coloring method is based on secondary structure. The cartoon shaped structure representing α helix and the yellow colored arrow representing β-sheets. The black color part of ribbon representing coil and the purple color representing turn in this 3D structure; (b) Result showing 3D structure of DnaJ protein predicted by I-TASSER. The colouring method is based on secondary structure. The cartoon shaped structure representing α helix and the yellow colored arrow representing β-sheets. The black colour part of ribbon representing coil and the purple colour representing turn in this 3D structure; (c) Result showing 3D structure of Type 2 Metallothionein protein predicted by I-TASSER. This structure is not containing any α helix and βsheets. But it is completely made up of only coil (black colour part of ribbon) and turns (in purple colour).

Conclusion:
The present in silico study allowed identification of six best putative candidate genes belonging to three different gene families for drought stress response in peanut. Contigs, domains as well as 3D structure of HSP 17.3KDa protein, DnaJ protein and Type 2 Metallothionein protein were also predicted. Thus, EST approach play key role in providing an excellent resource for novel gene discovery and its annotation for drought stresses. Table 3: Best predicted model with their C-Score, TM Score and RMSD value where C-Score is the confidence score for the predicted model, TMscore is a measure of global structural similarity between query and template protein and Root Mean Square Deviation is the RMSD between residues that are structurally aligned by TM-align.