Analysis of expressed sequence tags (ESTs) from cocoa (Theobroma cacao L.) upon infection with Phytophthora megakarya

Phytophthora megakarya, the causative agent of cacao black pod disease in West African countries, causes extensive yield loss. In this study we analyzed four libraries of ESTs derived from Phytophthora megakarya infected cocoa leaf and pod tissues. In total, 6379 redundant sequences were retrieved from the ESTtik database, and EST processing was performed using the SeqClean tool. Clustering and assembly using CAP3 generated 3333 non-redundant sequences (907 contigs and 2426 singletons). Primary sequence analysis of the 3333 non-redundant sequences showed a GC percentage of 42.7 and sequence lengths ranging from 101 to 2576 nucleotides. Functional analysis (Blast, Interproscan, Gene Ontology and KEGG search) was then carried out, and 1230 orthologous genes were annotated. In total, 272 enzymes corresponding to 114 metabolic pathways were identified. Functional annotation revealed that most of the sequences are related to molecular function, stress response and biological processes. The annotated enzymes include aldehyde dehydrogenase (E.C: 1.2.1.3), catalase (E.C: 1.11.1.6), acetyl-CoA C-acetyltransferase (E.C: 2.3.1.9), threonine ammonia-lyase (E.C: 4.3.1.19), acetolactate synthase (E.C: 2.2.1.6) and O-methyltransferase (E.C: 2.1.1.68), which play an important role in amino acid biosynthesis and phenylpropanoid biosynthesis. All of this information was stored in a MySQL database management system for future use in reconstructing the biotic stress response pathway in cocoa.


Background:
Theobroma cacao (cocoa) is a diploid tree grown in tropical countries [1]. Worldwide, many people depend on cocoa for their income. Cocoa is grown under a range of conditions, from full sun to the more traditional shade. In India, cocoa has been grown as a mixed crop under the shade of arecanut, coconut and oil palm. Demand for cocoa has increased tremendously, not only as a raw material for the chocolate industry but also for its flavor and for other properties that impart several health benefits [2, 3]. Diseases are a major cause of the decline in cocoa production, accounting for an annual crop loss of 20-30% [4]. The major diseases of cocoa include black pod (Phytophthora spp.), witches' broom (Crinipellis perniciosa) and frosty pod rot (Moniliophthora roreri), which cause heavy production losses worldwide. Phytophthora megakarya, the causative agent of black pod disease in West African countries, is the most damaging pathogen in the cocoa industry. Although Phytophthora megakarya exists only in Africa, the species Phytophthora palmivora and Phytophthora capsici are responsible for the disease in South America and India. Fungicides are used to control the disease with varying success and at significant cost to smallholder farmers [5, 6].

EST Analysis
EST analysis includes the following steps: 1) EST preprocessing, 2) EST assembly, and 3) functional annotation. The implemented steps are illustrated in Figure 1.

Primary sequence analysis
The primary sequence statistics (GC percentage, average contig length and contig length range) were computed using a custom Perl script (DSA.pl). To determine the number of clustered sequences in each contig, the CAP3 assembly files (.ace) were analyzed using a second Perl script (cap3_analyzer.pl).
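The statistics named above can be illustrated with a minimal Python sketch. This is not the DSA.pl script, whose source is not shown here; the function names `read_fasta` and `summarize` and the two toy sequences are assumptions made purely for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(sequences):
    """Return GC percentage, mean length and length range for the sequences."""
    sequences = [s.upper() for s in sequences]
    if not sequences:
        raise ValueError("no sequences given")
    lengths = [len(s) for s in sequences]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in sequences)
    return {
        "count": len(sequences),
        "gc_percent": 100.0 * gc / total,  # pooled over all bases
        "mean_length": total / len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Toy input (hypothetical names and bases, not data from this study).
example = """>Contig1
ATGCGC
GGAT
>Singleton1
ATATAT
""".splitlines()

stats = summarize(seq for _, seq in read_fasta(example))
print(stats)
```

Note that the GC percentage here is pooled over all bases rather than averaged per sequence; which of the two conventions DSA.pl uses is not stated in the text.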

Database design
The information obtained from the processing and annotation of the EST sequences was deposited in a MySQL relational database. Three tables were created using SQL to store the sequences, the blast hits and the functional annotation.
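A three-table layout of this kind might look like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that the example runs without a server. All table names, column names and the single example row are hypothetical, since the actual schema is not given in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database

# Hypothetical schema: one table per kind of result, linked by a serial number.
conn.executescript("""
CREATE TABLE est_sequence (
    serial_no  INTEGER PRIMARY KEY,
    seq_name   TEXT NOT NULL,
    length     INTEGER,
    gc_percent REAL,
    sequence   TEXT
);
CREATE TABLE blast_hit (
    serial_no        INTEGER PRIMARY KEY REFERENCES est_sequence(serial_no),
    hit_description  TEXT,
    e_value          REAL,
    percent_identity REAL
);
CREATE TABLE functional_annotation (
    serial_no    INTEGER PRIMARY KEY REFERENCES est_sequence(serial_no),
    go_terms     TEXT,
    ec_number    TEXT,
    kegg_pathway TEXT
);
""")

# One made-up row per table, for illustration only (not data from this study).
conn.execute("INSERT INTO est_sequence VALUES (1, 'example_seq_1', 10, 60.0, 'ATGCGCGGAT')")
conn.execute("INSERT INTO blast_hit VALUES (1, 'catalase (placeholder hit)', 1e-50, 90.0)")
conn.execute("INSERT INTO functional_annotation VALUES (1, 'placeholder GO term', '1.11.1.6', 'placeholder pathway')")


def keyword_search(connection, keyword):
    """Return (name, hit, EC number) rows whose hit description contains keyword."""
    query = """
        SELECT s.seq_name, b.hit_description, f.ec_number
        FROM est_sequence AS s
        JOIN blast_hit AS b ON b.serial_no = s.serial_no
        JOIN functional_annotation AS f ON f.serial_no = s.serial_no
        WHERE b.hit_description LIKE ?
    """
    return connection.execute(query, (f"%{keyword}%",)).fetchall()


rows = keyword_search(conn, "catalase")
print(rows)
```

Sharing one serial number across the three tables keeps each sequence's record joinable with its blast hit and annotation. The SQL would need minor dialect changes to run on MySQL itself.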

Results and Discussion:
EST pre-processing and assembly
The 6379 EST sequences retrieved from the ESTtik database were processed using the SeqClean tool, resulting in 6349 good-quality EST sequences that were used for further analysis. Contig assembly with CAP3 yielded 3333 non-redundant EST sequences (907 contigs and 2426 singletons) (Table 2, see supplementary material).
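The counts reported above can be cross-checked with a few lines of arithmetic. The redundancy figure computed below is an assumption about how one might summarize the assembly (the fraction of cleaned ESTs not needed as a separate non-redundant sequence); it is not a value reported in the text.

```python
raw_ests = 6379      # sequences retrieved from ESTtik
clean_ests = 6349    # sequences retained after SeqClean
contigs = 907
singletons = 2426

non_redundant = contigs + singletons
removed_by_cleaning = raw_ests - clean_ests
reads_in_contigs = clean_ests - singletons

# Assumed definition: share of cleaned ESTs collapsed away by the assembly.
redundancy = 1 - non_redundant / clean_ests

print(non_redundant, removed_by_cleaning, reads_in_contigs, round(100 * redundancy, 1))
```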

Primary sequence analysis
The primary sequence analysis showed that the total GC content of the non-redundant EST collection was 42.7%, the average length was 419 residues per sequence, and the sequence length ranged from 101 to 2576 residues (Table 2, see supplementary material). CAP3 clustered the ESTs into 907 contigs. The number of sequences per contig ranged from 2 to 141. Contig374 contained the maximum number of sequences, i.e. 141 (Figure 2).
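Per-contig read counts of this kind can be read from the `CO` header lines of an ACE file, which have the form `CO <contig name> <bases> <reads> <base segments> <U|C>`. The sketch below is a stand-in for cap3_analyzer.pl, not a copy of it; the function name and the toy ACE fragment are assumptions for illustration.

```python
def reads_per_contig(ace_lines):
    """Map each contig name to the read count given on its ACE 'CO' line."""
    counts = {}
    for line in ace_lines:
        fields = line.split()
        # CO <contig name> <bases> <reads> <base segments> <U|C>
        if len(fields) >= 4 and fields[0] == "CO":
            counts[fields[1]] = int(fields[3])
    return counts


# Toy ACE fragment (hypothetical contigs; consensus and read records omitted).
example_ace = [
    "AS 2 5",
    "",
    "CO ContigA 120 3 1 U",
    "ACGTACGTACGT",
    "",
    "CO ContigB 95 2 1 U",
    "TTGACCA",
]

counts = reads_per_contig(example_ace)
largest = max(counts, key=counts.get)
print(counts, largest, min(counts.values()), max(counts.values()))
```

Taking the minimum, the maximum and the arg-max of the resulting dictionary gives the range of reads per contig and the largest contig, the quantities plotted in Figure 2.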

Functional annotation
Similarity search (blastx) was executed against the non-redundant database. In total, 1230 orthologous genes were annotated with a significant E-value of < e-10. The blast results showed that contig473 (2 sequences assembled) had high similarity (95.75% with E-value 3.5 E-138) to heat shock protein and that contig396 (3 sequences assembled) had high similarity (93% with E-value 1.9 ...), carbonic anhydrase (Contig576), glutamine synthetase (Contig308), thioredoxin (Contig80, Contig280, Contig667), cyclophilin (Contig758), F-box family protein (Contig660), glutathione peroxidase (Contig184), ascorbate peroxidase (Contig573, Contig487), lipid transfer protein (Contig480, Contig342), cellulose synthase (Contig356), expansin, and pathogen-related protein (Contig868) could be identified in the present study. Other major proteins involved in cell growth, cellular communication, cellular transport, transport mechanisms, energy pathways, protein destination and protein synthesis were also found in our EST collection (see supplementary material).

Gesteira et al [17] identified pathogenesis-related proteins, receptor kinase, MAP kinase and trypsin inhibitors as proteins related to Moniliophthora perniciosa infection in cocoa through comparative analysis of ESTs. In a similar work, Verica et al [18] found that proteins such as chitinase, heat-shock proteins and beta-cyanoalanine synthase were upregulated in cocoa treated with inducers of the defense response. The cDNAs developed for genes differentially expressed in cocoa in response to witches' broom disease were putatively categorized as belonging to the signal transduction, response to biotic and abiotic stress, metabolism, RNA and DNA metabolism, protein metabolism and cellular maintenance classes [19].

Gene Ontology (GO) classification, HMMER search against the Pfam database, Interproscan and enzyme search were done using the Blast2go tool. The Gene Ontology results revealed that most of the sequences were related to cellular function, stress response and biological process (see supplementary material). The enzyme search against KEGG annotated 272 enzymes belonging to 114 metabolic pathways. The annotated enzymes included aldehyde dehydrogenase (E.C: 1.2.1.3), catalase (E.C: 1.11.1.6), acetyl-CoA C-acetyltransferase (E.C: 2.3.1.9), threonine ammonia-lyase (E.C: 4.3.1.19), acetolactate synthase (E.C: 2.2.1.6) and O-methyltransferase (E.C: 2.1.1.68), which play an important role in amino acid biosynthesis and phenylpropanoid biosynthesis.

Database design
Three tables were created using SQL commands in the MySQL relational database management system. The results of EST processing and primary sequence analysis were organized in the first table. The second table held the information obtained from the similarity search, and the further functional annotation results were saved in the third table. The three tables were logically linked, and each row was assigned a unique serial number. All the information was deposited in 3333 rows in each table and can be retrieved by either logical or keyword search.

Conclusion:
Four libraries of EST sequences derived from Phytophthora megakarya infected cocoa tissues have been analysed. Functional annotation resulted in 1230 orthologous genes, of which 272 were enzymes and the others were defense-related and cellular function genes. The annotated information was organized in a MySQL database. This information will be useful for the reconstruction of biotic stress response pathways in cocoa.

Figure 1: EST analysis workflow

Non-redundant EST sequences were subjected to blastx [12] similarity search. The set of homologous sequences was then made more stringent by selecting only those with an E-value below e-10. Gene Ontology [15] search, enzyme search, Interproscan and KEGG mapping were done using the Blast2go tool (www.blast2go.org) [16].
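The E-value cutoff described above can be sketched as a small filter over BLAST tabular output. The text does not say which output format was used, so this assumes the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name and the example rows are hypothetical.

```python
import csv
import io


def best_significant_hits(tabular_text, max_evalue=1e-10):
    """Keep, per query, the lowest-E-value hit with E-value below max_evalue."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        identity, evalue = float(row[2]), float(row[10])
        if evalue >= max_evalue:
            continue  # not significant under the chosen cutoff
        if query not in best or evalue < best[query][2]:
            best[query] = (subject, identity, evalue)
    return best


# Made-up rows for illustration only (not results from this study).
example = "\n".join([
    "est_A\tprot_2\t80.0\t150\t30\t0\t1\t450\t1\t150\t1e-20\t90",
    "est_A\tprot_1\t95.0\t200\t10\t0\t1\t600\t1\t200\t3e-50\t180",
    "est_B\tprot_3\t40.0\t60\t36\t0\t1\t180\t1\t60\t1e-5\t35",
])

hits = best_significant_hits(example)
print(hits)
```

Here `est_B` is dropped because its only hit does not pass the cutoff, and `est_A` keeps the hit with the lower E-value.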

Figure 2: Contigs vs number of sequences clustered in corresponding contigs.

Genomic research provides new tools to study the genetic and molecular bases of different traits. The complete genome of cocoa has recently been published [7]. Expressed Sequence Tags (ESTs) are sequenced regions of cDNA copies of mRNA that are expressed under different conditions and represent part of the transcribed portion of the genome [8]. ESTs can ...

Methodology:
Primary sequence source
Four libraries of EST sequences derived from Phytophthora megakarya infected cocoa leaf and pod tissues belonging to the genotypes PNG and UPA134 were used in this study. In total, 6379 redundant EST sequences were retrieved from the ESTtik database [