High quality SNPs/Indels mining and characterization in ginger from ESTs data base

Ginger (Zingiber officinale Rosc.) is an important herb of the family Zingiberaceae. It is accepted as a universal cure for a multitude of diseases in Indian systems of medicine, and its rhizomes are equally popular as a spice ingredient throughout Asia. SNPs, the definitive genetic markers representing the finest resolution of a DNA sequence, are abundantly found in populations, have a low rate of mutation and are used for genomic analysis. Public EST sequences mostly lack quality files, which makes detection of high quality SNPs more difficult because it relies exclusively on sequence comparisons. In the present study, the current dbEST of NCBI was mined, and 38115 ginger EST sequences were obtained and assembled into contigs using the CAP3 program. The recent software tool QualitySNP was used to detect 11523 potential SNP sites, 8810 high quality SNPs and 1008 indel polymorphisms, with a frequency of 1.61 SNPs per 1000 bp. Among the EST libraries generated from three ginger tissues, rhizomes had a frequency of 0.32 SNPs and 0.03 indels per 1000 bp, leaves had 2.51 SNPs and 0.23 indels per 1000 bp, and roots had 0.76 SNPs and 0.02 indels per 1000 bp. The present analysis provides additional information about the tissue-wise presence of haplotypes (222), the distribution of high quality exonic (2355) and intronic (6455) SNPs, singletons (7538) in addition to contigs, and the transition to transversion ratio (0.57). Among the SNPs detected across all tissues, transversions outnumber transitions. The quality SNPs detected in this work can be used as markers for further genetic experiments in ginger.


Background:
Ginger (Zingiber officinale Rosc.) is a valued crop of the family Zingiberaceae, the rhizomes of which are used in medicine as well as a spice. India is a leading ginger producer, accounting for about 30% of the global share; during 2012-13 it produced 7.45 lakh tonnes from an area of 1.57 lakh hectares, out of a total global production of 20.95 lakh tonnes [1]. Ginger is cultivated as a spice crop in many states of India. Kerala, Meghalaya, Orissa, Gujarat, Assam and Arunachal Pradesh together contribute 65 per cent of the country's total ginger production. There is an international market for Indian ginger at a current selling price of around $2800-$2850 per ton [2]. Reports evaluating the rich diversity of Indian ginger germplasm are based on phenotypic and phytochemical characteristics, which exhibit plasticity and sensitivity to environmental conditions [3].
Molecular characterization using PCR-based markers is established as robust, reproducible and reliable when compared with morphological and phytochemical markers. Use of these markers reduces the cost, time and labour of analysis [4,5]. Markers such as ISSR, SSR and IRAP have been reported for assessing genetic variation in Zingiber and other members of the Zingiberaceae family [6-8], but these reports do not cover the discovery and use of quality SNPs for genetic analysis of ginger. A single nucleotide polymorphism (SNP) is a single-base allelic variation between two haplotype sequences or between the two chromosomes of a homologous pair. SNPs are abundant among the variations prevalent in genomic DNA, in both coding and non-coding regions [9]. They are responsible for various genetic traits, are conserved through evolution and are compatible with next generation high-throughput genotyping. However, sequencing of selected DNA fragments for SNP identification is subject to limitations such as a higher rate of sequencing error and the high cost of sequencing the amplified fragments. A cost-effective alternative is SNP discovery from public databases, using the most extensive resource of EST data, dbEST, hosted by the National Center for Biotechnology Information (NCBI) [10]. EST databases have commonly been used for discovery of new genes, verification of exon-intron structure, cDNA array construction and gene mapping [11]. Polymorphisms obtained from ESTs represent numerous functional genes that control many genetic traits [12-15]. Many bioinformatics tools, programs and pipelines have been developed for SNP mining; they differ in input and output formats, computational algorithms, and filtration and evaluation strategies for obtaining quality SNPs [28].
The haplotype-based strategy makes full use of the redundancy in sequences by re-clustering them, so that the influence of sequencing errors is avoided and poor quality sequences are removed. The QualitySNP pipeline identifies paralogs and quality SNPs in heterozygous diploid as well as polyploid species. Therefore, the present study uses the existing, updated EST tissue libraries of Zingiber officinale to find SNP and indel polymorphisms in 38115 ESTs, categorized into three tissue libraries: leaves (13282), rhizomes (12763) and roots (12092) [29,30]. The high quality SNP detection tool QualitySNP is used to identify high quality exonic and intronic polymorphisms, haplotypes, and DNA substitutions such as transitions, transversions and indels.

Sequence Pre-Processing and Clustering:
To obtain high quality ESTs, poly-A/T tails and unexpected vector sequences were filtered and trimmed using the Trimest [32] online program of the EMBOSS suite and the SeqClean software [33], with the UniVec database of NCBI as reference. ESTs shorter than 50 bp, if any were found, were removed to increase assembly quality. The high quality ESTs were then assembled into contigs using the CAP3 [34] program at 90% identity. Tissue-wise EST assemblies were conducted to reduce redundancy.
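The pre-processing step can be illustrated with a minimal Python sketch. It is not the Trimest or SeqClean implementation: the function names, the minimum tail run of 10 bases and the reading of the 50 bp cutoff as a minimum retained length are all assumptions made for illustration, and vector screening against UniVec is not shown.

```python
import re

MIN_LENGTH = 50  # bp cutoff from the text, treated here as a minimum kept length (assumption)

def trim_poly_tails(seq, min_run=10):
    """Strip a 3' poly-A run and a 5' poly-T run of at least min_run bases.

    min_run is a hypothetical threshold, not a value taken from the paper.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)   # trailing poly-A tail
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)   # leading poly-T tail
    return seq

def clean_ests(records, min_length=MIN_LENGTH):
    """Return {est_id: trimmed sequence} for ESTs still at least min_length bp long."""
    cleaned = {}
    for est_id, seq in records.items():
        trimmed = trim_poly_tails(seq)
        if len(trimmed) >= min_length:
            cleaned[est_id] = trimmed
    return cleaned
```

The cleaned sequences would then be written to FASTA and passed to an assembler such as CAP3.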

High Quality SNPs Discovery:
The Linux-based command line pipeline QualitySNP was used for extraction of SNPs [35]. QualitySNP detected the haplotypes present in the contigs through EST re-clustering, and nucleotide discrepancies were extracted between the identified haplotypes of a contig. These are considered potential SNPs (pSNP). Based on the confidence scores of SNPs and alleles, quality SNPs (qSNP) were identified [15]. The percentage of nucleotide discrepancies was obtained from the qSNP/pSNP ratio using the QualitySNP algorithm.
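The idea of calling a potential SNP only where a nucleotide discrepancy is supported by several reads can be sketched as follows. This is a simplified column-wise illustration, not the QualitySNP algorithm: it assumes the reads of a contig are already aligned to equal length, and the `min_allele_support` threshold and function names are hypothetical. The second function computes the qSNP/pSNP percentage from the counts reported in this study.

```python
from collections import Counter

def potential_snps(aligned_reads, min_allele_support=2):
    """Return [(column, {base: count})] for columns where at least two
    alleles are each seen in min_allele_support or more reads.

    aligned_reads: equal-length strings; gap or N characters are ignored.
    """
    sites = []
    for col in range(len(aligned_reads[0])):
        counts = Counter(r[col] for r in aligned_reads if r[col] in "ACGT")
        supported = {b: n for b, n in counts.items() if n >= min_allele_support}
        if len(supported) >= 2:
            sites.append((col, supported))
    return sites

def qsnp_percentage(q_snps, p_snps):
    """Percentage of potential SNPs retained as quality SNPs."""
    if p_snps <= 0:
        raise ValueError("p_snps must be positive")
    return 100.0 * q_snps / p_snps

# Counts reported in this study: 8810 qSNP out of 11523 pSNP
print(round(qsnp_percentage(8810, 11523), 2))
```

A discrepancy seen in a single read is discarded in this sketch because it cannot be told apart from a sequencing error.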

Prioritizing High Quality SNPs:
Contigs from all ginger assemblies were processed through the ORF finder program [36] to locate possible open reading frames on the positive or negative strand. In the whole EST assembly, only 101 contigs contained an ORF with AUG as the start codon. Of the 8810 qSNP, only 2355 were located in exonic/ORF regions, while 6455 were located in non-coding regions. The distribution of qSNP across exonic and intronic regions is shown in Figure 1(a) and 1(b).
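Locating open reading frames on both strands can be sketched with a simple six-frame scan. This is a minimal illustration and not the ORF finder program cited above; the `min_codons` default is a hypothetical threshold, and on DNA contigs the AUG start codon appears as ATG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) for each ATG..stop ORF.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand
    (the reverse complement for strand "-"). ORFs with no stop are skipped.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs
```

A qSNP position can then be labelled as lying in an ORF when it falls inside one of the returned intervals on the matching strand.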

Discussion:
We obtained 6323 contigs and 17421 singletons from the 38115 ESTs of all tissues together, amounting to a consensus size of 5455657 bp out of 23708402 bp. In this study, a total of 11523 potential SNP sites, 8810 high quality SNP sites and 1008 indel polymorphisms were discovered from the 38115 ESTs analysed, with an average frequency of 1.61 SNPs and 0.18 indels per 1000 bp. The size of the contigs varied from 100 to 2954 bp, with an average length of 862 bp. An overview of the tissue-wise assembly, covering contigs, singletons, indels, SNP detection and other parameters, is given in Table 1. For comparison, 31815 potential SNPs, 16772 high quality SNPs and 1815 indels were mined from 83565 ESTs in potato [15], and 37344 SNPs were reported in Arabidopsis [37]; in maize, SNPs were found at 1 per 31 bp in non-coding regions and 1 per 124 bp in coding regions [38]. The average SNP occurrence in apple ESTs was found to be one in every 706 bp [39]. The SNP frequency is higher in certain genomes such as ginger, i.e. 1.61 SNPs per 1000 bp. Similar findings on SNPs were also reported in Arabidopsis ecotypes, viz. one SNP every 3.3 kb in Landsberg erecta and one SNP every 6.1 kb in Columbia [40]. One SNP per 20 bp is reported in bread wheat between genes from the A, B and D genomes [41], and in maize the frequency is found to be one indel per 160 bp and one SNP per 70 bp [42]. QualitySNP detected a high frequency of transitions in ginger in the present analysis, which corroborates previous SNP discovery reports [42]. The high frequency of C to T mutations among these EST-derived SNPs in ginger may be due to methylation [42]. The reported abundance of the A/T ratio, as well as of its reverse complement, remains unexplained.
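The average frequencies can be checked with a short calculation. The sketch below assumes that the contig consensus size of 5455657 bp is the denominator; the function names are illustrative. With that assumption, the reported counts give the 1.61 SNPs and 0.18 indels quoted in this study when expressed per 1000 bp.

```python
CONSENSUS_BP = 5455657   # total contig consensus size reported above
HIGH_QUALITY_SNPS = 8810
INDELS = 1008

def per_1000_bp(count, total_bp):
    """Number of events per 1000 bp of sequence."""
    return 1000.0 * count / total_bp

def bp_per_event(count, total_bp):
    """Average spacing between events in bp."""
    return total_bp / count

print(round(per_1000_bp(HIGH_QUALITY_SNPS, CONSENSUS_BP), 2))  # 1.61
print(round(per_1000_bp(INDELS, CONSENSUS_BP), 2))             # 0.18
print(int(bp_per_event(HIGH_QUALITY_SNPS, CONSENSUS_BP)))      # 619
```

Expressed as spacing, this is roughly one high quality SNP every 619 bp, which can be set directly against figures such as one SNP in every 706 bp in apple.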
Of the total SNP substitutions detected in this study, transversions are more frequent than transitions in all tissues. This corroborates the previous report on SNP detection in ginger, which attributed the pattern to ginger being a crop propagated vegetatively through rhizomes [43].
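The classification of substitutions can be made explicit with a small sketch. Transitions exchange bases within the same chemical class (A and G, or C and T), and transversions exchange a purine for a pyrimidine or the reverse. The function names are illustrative and the example pairs are not data from this study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Label a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_ratio(substitutions):
    """Transition to transversion ratio for an iterable of (ref, alt) pairs."""
    counts = Counter(classify_substitution(r, a) for r, a in substitutions)
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed")
    return counts["transition"] / counts["transversion"]
```

A ratio below 1, such as the 0.57 reported here, means that transversions outnumber transitions.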

Figure 1: Distribution of High Quality SNPs (a) Exonic (b) Intronic (c) Nucleotide Substitutions
Our study provides, for the first time, information about high quality SNPs (8810) and indel polymorphisms (1008) in addition to potential SNPs (11523). This results from the triple filtration applied by QualitySNP in a stringent analysis of the EST database, in contrast to the earlier analysis of ginger ESTs with the Perl script AutoSNP version 1.0, which reported a higher number of potential SNP sites (64026) and indel polymorphisms (7034) [43]. The present analysis provides additional information about the tissue-wise presence of haplotypes (222), the distribution of high quality exonic (2355) and intronic (6455) SNPs, and singletons (17421) in addition to contigs (Figure 1(a), 1(b), 1(c) and Tables 1-4). An in silico study in sea bass that applied different SNP mining software tools to the same EST database showed that the level of stringency applied changes the output of quality SNPs [44], which justifies the use of the most recent and efficient QualitySNP software with its better filtration strategy. We were able to identify a reduced set of quality SNPs in addition to the potential SNPs, which may help geneticists and breeders working on cultivar identification and germplasm conservation in ginger.

Conclusion:
We detected 11523 potential SNP sites, of which 8810 were high quality, and 1008 indel polymorphisms, with a frequency of 1.61 SNPs per 1000 bp, using the recent software QualitySNP. Among the EST libraries collected from three tissues, rhizomes had a frequency of 0.32 SNPs and 0.03 indels per 1000 bp, leaves had 2.51 SNPs and 0.23 indels per 1000 bp, and roots had 0.76 SNPs and 0.02 indels per 1000 bp. The present analysis provides additional information about the tissue-wise presence of haplotypes (222), the distribution of high quality exonic (2355) and intronic (6455) SNPs, and singletons (7538) in addition to contigs; the transition to transversion ratio was 0.57. Among the SNPs detected across all tissues, transversions outnumber transitions. The quality SNPs detected can be used as potential markers for genetic studies in ginger.