Synonymous codon usage in chloroplast genome of Coffea arabica

Synonymous codon usage of 53 protein coding genes in chloroplast genome of Coffea arabica was analyzed for the first time to find out the possible factors contributing codon bias. All preferred synonymous codons were found to use A/T ending codons as chloroplast genomes are rich in AT. No difference in preference for preferred codons was observed in any of the two strands, viz., leading and lagging strands. Complex correlations between total base compositions (A, T, G, C, GC) and silent base contents (A3, T3, G3, C3, GC3) revealed that compositional constraints played crucial role in shaping the codon usage pattern of C. arabica chloroplast genome. ENC Vs GC3 plot grouped majority of the analyzed genes on or just below the left side of the expected GC3 curve indicating the influence of base compositional constraints in regulating codon usage. But some of the genes lie distantly below the continuous curve confirmed the influence of some other factors on the codon usage across those genes. Influence of compositional constraints was further confirmed by correspondence analysis as axis 1 and 3 had significant correlations with silent base contents. Correlation of ENC with axis 1, 4 and CAI with 1, 2 prognosticated the minor influence of selection in nature but exact separation of highly and lowly expressed genes could not be seen. From the present study, we concluded that mutational pressure combined with weak selection influenced the pattern of synonymous codon usage across the genes in the chloroplast genomes of C. arabica.

ribosomes to participate in translation of mRNA [12]. The other biological factors that affect patterns of synonymous codon usage bias are genome size [17], length of the gene [18], codon context [19], rate of recombination [20] and amino acid composition [21].
In plant molecular evolution, chloroplast genomes generate interest among biologist owing to its smaller size, larger copy number and known functions of many genes at molecular level [22]. Translational process in the chloroplast genome has been reported to be similar to that of unicellular organisms, indicating that synonymous codon usage may be identical to that of Escherichia coli [23]. The significance of mutational pressure in shaping the SCU variations in chloroplast genome was already established [24]. However, natural selection is also a driving force that frames SCU variations in plants and algal lineages [25]. The neutral theory of molecular evolution demonstrated the inverse relation between the rate of molecular evolution and the amount selective forces. Synonymous substitution in protein coding genes is in a slower rate than pseudo genes [26,27], indicating the influence of selective forces on the rate of synonymous codon evolution. Coffee is a most attractive beverage crop in the world. More than 100 species of Coffea are diploids in nature (2n=2x=22) except Coffee arabica (2n=4x=44) which is autogamous (self -fertile) and considered as allotetraploid. C. arabica is a species for centre of attraction owing to its inherent quality. Unfortunately, this species is highly susceptible to major pests and diseases. The complete nucleotide sequence of C. arabica chloroplast genome has been determined [28]. However, studies on synonymous codon usage bias in chloroplast genome of C. arabica (155,189 bp) have not been reported. Chloroplast genome of C. arabica comprises a total of 130 genes (79 protein coding genes, 29 tRNA genes, four ribosomal genes and 18 intron sequences). In this study, synonymous codon usage was analyzed using 53 protein coding genes (PCGs), having more than 100 codons by measuring codon usage indices such as relative synonymous codon usage, effective number of codons (ENC) and codon adaptation index (CAI) and the findings on synonymous codon usage are discussed.

Methodology:
Sequence data of C. arabica chloroplast genes A total of 53 protein coding genes in the chloroplast genome of Coffea arabica (EF044213) were retrieved from the National Centre for Biotechnology Information (NCBI) and identity of those genes were presented Table 1 (see supplementary  material). To avoid sampling errors, PCGs contain more than 100 codons were chosen for analysis.

Relative Synonymous Codon Usage (RSCU)
To analyze the characteristics of variations in synonymous codon usage by neglecting the influence of amino acid composition, the relative synonymous codon usage values of all sequences were determined according to equation (described in supplementary material) [29].

Effective number of codons (ENC)
This index is used to measure the extent of synonymous codon usage bias. The ENC values would be 20 when only one codon is used for each amino acid, In contrast, when codons are used randomly, the ENC values is 61. If the calculated ENC is greater than 61 due to more even distribution of codon usage than expected, it is adjusted to 61. Thus, the expected ENCs are calculated by following equation [30], ENC = 2 + s + {29/ [s 2 + (1-s 2 )]} Where s = GC3 (GC content at the third codon position)

Identification of optimal codons
Optimal codons occur most frequently only in highly expressed genes. Difference in RSCU of a given codon between putative high and low expression data set was calculated and tested the significance (p ≤ 0.05) using one-way analysis of variance (ANOVA). Codons that occur significantly at high frequency in highly expressed genes were regarded as putative optimal codons.

Codon adaptation index
Codon adaptation index (CAI) is used to measure the extent of bias towards preferred codons in a gene by defining the translationally optimal codons that are mostly represented in a reference set of highly expressed genes [31]. It takes value from zero to one and a higher value reveals a stronger codon bias with high expression level. In this study, the ribosomal protein coding genes have been used as reference for estimating CAI values on the basis of equation (described in supplementary material).

Correspondence analysis
Correspondence analysis (COA) is extensively used for analyzing multidimensional data [34]. COA was performed to analyze RSCU values to explore the different features of synonymous codon usage (SCU) patterns across the 53 protein coding genes in the chloroplast genome of C. arabica.

Correlation analysis
Correlation analysis was used to demonstrate the relationship between base compositions and synonymous codon usage patterns. This analysis was implemented based on the Pearson correlation analysis way. All statistical processes were carried out with software Past version 2.12 [35]. Figure 1: ENC Vs GC3 plot showing the grouping of majority of genes on or just below the expected GC3 curve. Red spots indicate genes that are independent of GC compositional constraints as they lie considerably below the continuous curve.

Discussion: Base compositional analysis of ORFs of 53 protein coding genes
Total and silent base compositions were identified Table 2 (see  supplementary material). Intricate correlations were found between A, T, G, C, GC and A3, T3, G3, C3, GC3 contents in the 53 ORFs of chloroplast genome of C. arabica Table 3 (see supplementary material). Interestingly GC3 has significant correlations with A, G, C, and GC contents, prognosticating that GC contents may have high influence, balancing between mutational pressure and natural selection. We observed that A was significantly correlated with A3, T3, C3 and GC3 contents and T had higher correlations only with A3 and T3. Base composition analysis of ORFs revealed that G, C content has significant correlations with T3, G3, GC3, and G3, C3, GC3 respectively. This implies that compositional constraints may have direct influence in the evolution of synonymous codon usage in the chloroplast genome of C. arabica.

Characteristics of relative synonymous codon usage
Overall and strand specific codon usage patterns of 53 PCGs were analyzed Table 4 (see supplementary material). Strand asymmetry was reported in the chloroplast genome of Euglina gracillis with regard to base composition [36]. Strand specific analysis was carried out to examine the differences in preference for codons in leading and lagging strand and found that there was no difference in the selection of preferred codons for coding all the 18 degenerate amino acid in both the strand. Arg, Leu and Ser have six fold degeneracy. A ending codons TTA and AGA were preferred to code Leu and Arg respectively whereas TCT was most frequently used to code Ser. The amino acids Ala, Gly, Pro, Thr and Val possess four-fold degeneracy. We found that T ending codons were often used for coding Ala (TCT), Pro (CCT), and Thr (ACT) but A ending codons were preferred for coding Gly (GGA) and Val (GTA). In two fold degenerate family, A ending codons were used most frequently to code Glu (GAA), Gln (CAA) and Lys (AAA) whereas T ending codons were preferred to code His (CAT), Phe (TTT), Tyr (TAT), Asn (AAT), Asp (GAT) and Cys (TGT). Three fold degenerate Ile was preferably coded by ATT. All the rare of non preferred codons were observed to be ending in C or G. Putative optimal codons were identified for amino acids Asp (GAT), His (CAT), Ile (ATA), and Arg (AGA). No optimal codons were identified for any of the four fold degenerate amino acid family. The values of ENCs among 53 PCGs were found to be varying from 36.79 to 53.86 with mean of 46.87 and S.D of 4.02 in the chloroplast genome of C. arabica and the overall GC3 values varied from 19.60% to 35.30% with a mean and S.D of 27.82% and 3.52 respectively, ensuring the heterogeneity of codon usage ( Table 2).

ENC Vs GC3 plot
The ENC plot is developed for determining the variations in SCU across a number of genes by exhibiting intraspecific and interspecific variations. It has been considered as an alternative to complex multivariate statistical methods to analyze SCU patterns in a genome. This plot effectively demonstrates SCU variation when the genome under study has significantly different GC content from 0.50 [30]. ENCs and GC3 were estimated and plotted for 53 PCGs, following the null hypothesis that GC3 compositional constraints alone influences expected SCU patterns. If a gene is influenced by GC compositional constraints, it would lie on or below the expected curve. If translational selection acts on a gene, it lies distantly below the expected curve. Accumulation of majority of genes on or below the left hand side of the expected GC3 based on null hypothesis suggested that SCU variations across majority of PCGs in C. arabica chloroplast genome were influenced by GC3 compositional constraints (Figure 1). To confirm this further, Parity rule 2 (PR2) plot [37] was analyzed (Figure 2). If GC3 mutational pressure acts on the genes, G and C (A and T) nucleotides should be used proportionally. To analyze whether codon preferences were restricted in strongly biased ORFs, relationship between synonymous G, C and A, T contents were analyzed. It was observed that G, C, and A, T contents were used proportionally ( Figure 2) and this confirmed the influence of GC3 biased mutational pressure in framing codon usage biases. Neutrality plot [38] (Figure 3) revealed that some amount of selection may act on PCGs in the chloroplast genome of Coffea arabica as there was no correlation between GC12 and GC3 [39]. However, some of the genes, viz., rps12, ndhF, ndhE, ndhB, atpF, petB, petD, psaB, psbA, psbC, psbD, rpl16, rps8 and rps14 were grouped distantly below the expected GC3 curve indicates the interaction of some other factors that are independent of base compositional constraints.

Figure 4:
Correspondence analysis plots of RSCU values of 53 ORF's in the C. arabica chloroplast genome. Distribution of these ORFs was found to be mainly due to base compositions.

Correspondence analysis (COA)
Frequently used multivariate statistical analysis, viz., correspondence analysis was used for determining the SCU variation across 53 PCGs in the C. arabica chloroplast genome. To avoid effect of aminoacid composition, RSCU values were used instead of codon counts. In this study, ORF of each gene was represented as 59 dimensional vectors (excluding three stop codons, Met and Trp) and each dimension denotes RSCU value of one codon. The first, second, third and fourth axes account for 10.94%, 8.09%, 7.35% and 6.89% respectively ( Figure  4). To unveil the factors that are responsible for the distribution of genes in the COA plot, ordination of 53 PCGs of axis 1 to 4 were analyzed for possible correlation with silent base composition and various indices of codon usage Table 5 (see supplementary material). Axis 1 was significantly correlated with A3 (r = -0.481, p < 0.01) and with C3 (r = 0.542, p < 0.01). Axis 3 was in correlation with G3 (r = -0.276, p < 0.01) and C3 (r = 0.395, p < 0.01). These correlations indicated the possible influence of nucleotide compositional constraints in framing codon usage. ENC, an index measuring the level of gene expression was found to be in correlation with axis 1 (r = 0.364, p < 0.01) and with axis 2 (r = -0.377, p < 0.01). Another index for measuring gene expression, viz., CAI was significantly correlated with axis 1 (r = 0.498, p < 0.01) and with axis 2 (r = 0.381, p < 0.01). Hence, it could be predicted that gene expression levels also slightly influences (weak selection) codon usage pattern in the chloroplast genome of C. arabica.
An equilibrium between positive and negative directional mutational pressure on GC base pairs results in the base composition of a genome [40]. The non random usage of synonymous codons (codon bias) was detected in all coding sequences of prokaryotes and eukaryotes as a result of the interaction between two significant evolutionary forces such as directional mutational pressure and selection in nature for translational optimization [2, 31, 6, 7]. Codon bias of genes in the chloroplast genomes has been reported to be towards A and T ending codons due to the compositional bias towards AT rich content [41, 42, 24]. Codon bias is reported to be relatively weaker in angiosperms [25]. A similar finding was also noticed from the ENC values of PCGs in the C. arabica chloroplast genome as ENCs of all the genes are greater than 35, indicating a weaker bias. Correlation analysis between total nucleotide contents and silent base contents revealed that base compositional constraints greatly influence in framing the SCU pattern in the PCGs of C. arabica chloroplast genome. Nucleotide composition is considered to be the most important factor that influences the SCU variations in chloroplast genomes [43]. Since all chloroplast genomes have high AT content, AT biased mutational pressure is believed to be the factor responsible for codon usage bias. But in the psbA gene of higher angiosperms, the codon usage is directly linked to tRNA population for translational optimization, indicating selection acts on this gene [44]. Since selection influences codon usage of psbA, lower rate of synonymous substitutions was reported in psbA gene [45].
Identified putative optimal codons were found to be ending in A/T and found consistent with previous findings [46]. In our study, ENC plot of selected chloroplast genes in C. arabica unravels the possibility that some amount of selection may act on petB, petD, psaB, psbA, psbC, psbD, rpl16, rps8 and rps14 as these genes lie considerably below the expected GC curve. However almost all the other genes lie on or just below the continuous curve, revealing the influence of compositional constraints in regulating SCU variation in C. arabica chloroplast genome, further confirmed using PR2 bias plot. COA analysis to find out the factors responsible for codon usage discrepancies shown that compositional constraints rather than selection plays crucial role in shaping the SCU variation across genes in C. arabica chloroplast genome as there was a grouping of A/T ending and C/G ending codons along the axis 1. Nevertheless, influence of selection on codon usage of some of the genes cannot be nullified because factors responsible for selective constraints are regarded as dynamic processes [44]. It was reported that selection against mutational pressure may narrow the GC3 distribution [39]. Similar to previous finding [47], we observed a narrow distribution of GC3 contents (19.36%-35.30%) in the Coffea arabica chloroplast genome and no correlation was observed between GC12 and GC3. This result demonstrates that some amount of selection may act in the chloroplast genome of Coffea arabica. Correlations of axis 1 & 2 with CAI, and axis 1 & 4 with ENC point out the influence of gene expression levels in framing the codon usage patterns. However, no exact separation is observed in the COA plot between highly and lowly expressed genes based on either ENC or CAI. In contrast to the previous findings that revealed significant correlation between length of CDS, hydropathy, aromaticity and codon bias [48,49,22], but our study could not find any significant correlation between these parameters and codon bias. Thus it can be concluded that the factor responsible for SCU variation across genes in the C. arabica chloroplast genome may possibly due to mutational pressure combined with weak selection. Moreover, information about the rare and preferred codons can be effectively used for enhancing expression of genes by optimizing synonymous codons. To the best of our knowledge, this is the very first report describing the codon usage bias in the genus Coffea.