Synonymous codon usage in Thermosynechococcus elongatus (cyanobacteria) identifies the factors shaping codon usage variation

Multivariate statistical analysis of synonymous codon usage in the genome of the thermophilic cyanobacterium Thermosynechococcus elongatus BP-1 revealed a single major explanatory axis accounting for codon usage variation in the organism. In correspondence analysis of T. elongatus genes, this axis is correlated with the GC content at the third base of synonymous codons (GC3s). A negative correlation was observed between the effective number of codons (Nc) and GC3s. These results suggest that mutational bias is the major factor shaping codon usage in this cyanobacterium. Compared with lowly expressed genes, highly expressed genes of this organism possess a significantly higher proportion of pyrimidine-ending codons, suggesting that, besides mutational bias, translational selection also influences codon usage variation in T. elongatus. Correspondence analysis of relative synonymous codon usage (RSCU) with A, T, G and C at third positions (A3s, T3s, G3s and C3s, respectively) supported this conclusion, and gene expression level and gene length also influenced codon usage. A role of translational accuracy in dictating the codon usage variation of this genome was also identified. Overall, although mutational bias is the major factor shaping codon usage in T. elongatus, translational selection, translational accuracy and gene expression level also influence codon usage variation.


Background:
Most amino acids are encoded by more than one nucleotide triplet. These synonymous codons usually differ by a single nucleotide at the third codon position, although for some amino acids they also differ at the first or second position. Codon usage is a genome-specific strategy: each genome has a particular codon usage signature that reflects the evolutionary forces acting within that genome [1]. Codon usage patterns differ between [2-5] and within prokaryotic and eukaryotic organisms [6][7][8][9]. Differences in codon usage pattern within bacterial genomes can indicate lateral gene transfer events, and analysis of codon bias has helped establish horizontal gene transfer as a major evolutionary force [10]. Variation in codon usage bias depends on factors such as gene expression level [11,12], with highly expressed genes showing greater codon usage bias than lowly expressed genes [13], amino acid composition [14], gene length [15] and tRNA abundance [16,17]. Because genomes are a source of valuable biological information [18], codon usage pattern has both practical and theoretical importance for understanding the molecular basis of biology. It can be used directly in the molecular characterization of species and as a molecular marker of molecular evolution. DNA primers and oligonucleotide probes can also be designed on the basis of this knowledge [19]. Natural selection acting through external environmental factors also influences genomic patterns of codon usage [20], variations in which can quantitatively predict the level of gene expressivity [21].
Among bacteria, the primary factor responsible for codon usage variation is differences in mutational bias. However, many species show additional variation among genes that is consistent with natural selection [22]. Codon bias is maintained by a balance between selection, mutation and genetic drift [16,23], and translational accuracy appears to be a major factor, as certain preferred codons are translated more accurately and efficiently [24]. Thermosynechococcus elongatus BP-1 is a unicellular, rod-shaped thermophilic cyanobacterium with an optimum growth temperature of 55°C. In evolutionary terms, T. elongatus branches very near the origin of cyanobacteria [25]. Because of its ability to undergo natural transformation, this obligate photoautotroph has been widely used as a model organism for the study of photosynthesis. We analyzed the codon usage pattern of the whole genome of T. elongatus to understand its genetic organization and the factors shaping codon usage in this organism.

Methodology:
Sequence source and parameters
The complete genome and coding sequences of T. elongatus were obtained from NCBI (www.ncbi.nlm.nih.gov/genbank/genomes). The whole genome consists of 2476 genes. To minimize sampling error, only coding sequences (CDS) with correct initial and terminal codons and a minimum of 100 codons were considered. CDS with uncertain annotation or annotated as hypothetical were excluded using PERL scripts.
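The filtering criteria above can be sketched in a few lines. The authors used PERL scripts that are not shown, so the Python function below is only an illustrative stand-in. The accepted start codons (ATG, GTG, TTG), the rejection of internal stop codons, and the choice to count the start and stop codons towards the 100-codon minimum are assumptions, as is the hypothetical name is_usable_cds.

```python
# Illustrative CDS filter; not the authors' PERL scripts.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 100                      # threshold stated in the text


def is_usable_cds(seq, min_codons=MIN_CODONS):
    """Return True if seq looks like a complete CDS of sufficient length."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Assumption: start and stop codons count towards the minimum length.
    if len(codons) < min_codons:
        return False
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # Assumption: an in-frame internal stop marks a doubtful annotation.
    return not any(c in STOP_CODONS for c in codons[1:-1])
```

Applied to a list of sequences, a comprehension such as [s for s in sequences if is_usable_cds(s)] would keep only the CDS that pass these checks.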

Indices of codon usage bias
The frequency of the 59 codons encoding 18 amino acids (excluding Met, Trp and stop codons) was examined using the following codon indices: relative synonymous codon usage (RSCU) [26]; effective number of codons (Nc) [27]; codon adaptation index (CAI) [28]; frequency of G+C in a coding sequence (GC content); frequency of G+C at the third position of synonymous codons (GC3s); and frequency of A, T, G and C at the synonymous third positions of codons (A3s, T3s, G3s and C3s, respectively).
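As an illustration of one of these indices, GC3s can be computed roughly as follows. This is a minimal sketch under the definition given above (G+C fraction at the third position of synonymous codons, with Met, Trp and stop codons excluded). The function name gc3s is hypothetical, and dedicated codon usage software may differ in detail.

```python
# Codons that carry no synonymous information: Met, Trp and the three stops.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(seq):
    """Fraction of G or C at the third position of synonymous codons."""
    seq = seq.upper()
    gc = total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in EXCLUDED:
            continue
        if any(base not in "ACGT" for base in codon):
            continue  # skip codons containing ambiguous bases
        total += 1
        gc += codon[2] in "GC"
    if total == 0:
        raise ValueError("sequence contains no synonymous codons")
    return gc / total
```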
RSCU is the ratio of the observed frequency of a codon to the frequency expected if all synonymous codons for the same amino acid were used equally. An RSCU value >1 indicates that the codon is used more frequently than expected, whereas an RSCU value <1 indicates the reverse [26]. Nc measures the bias in synonymous codon usage and ranges from 20 (when only one codon is used per amino acid) to 61 (when all codons are used with equal probability) [27].
CAI estimates the extent of bias towards codons known to be preferred in highly expressed genes. CAI values range from 0 to 1.0; a higher value indicates stronger codon usage bias and a higher expression level. CAI is a well-accepted measure of gene expressivity [28].
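The RSCU definition above translates directly into code. The sketch below assumes the standard genetic code and DNA (T, not U) input; the names FAMILIES and rscu are hypothetical and this is not the software used in the study.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families: 59 codons for 18 amino acids (Met, Trp, stops excluded).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        FAMILIES.setdefault(aa, []).append(codon)


def rscu(seq):
    """Map each synonymous codon to observed / expected-under-equal-usage."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for aa, codons in FAMILIES.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            # Expected count under equal usage is total / len(codons).
            values[c] = counts[c] * len(codons) / total
    return values
```

For example, in a sequence where alanine is encoded three times by GCT and once by GCC, GCT has RSCU 3.0, GCC has 1.0, and GCA and GCG have 0.0.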
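A minimal sketch of a CAI calculation follows, assuming the usual formulation in which each codon receives a relative adaptiveness w (its count in a reference set of highly expressed genes divided by the count of the most used synonym) and CAI is the geometric mean of w over the codons of a gene. Replacing zero reference counts with 0.5 is a common convention adopted here as an assumption. The reference set itself and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":  # Met, Trp and stops carry no synonymous choice
        FAMILIES.setdefault(aa, []).append(codon)


def split_codons(seq):
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def relative_adaptiveness(reference_seqs):
    """w value per codon from a reference set of highly expressed genes."""
    counts = Counter()
    for seq in reference_seqs:
        counts.update(split_codons(seq))
    w = {}
    for codons in FAMILIES.values():
        # Assumption: zero counts are replaced by 0.5 so that CAI stays > 0.
        fam = {c: (counts[c] or 0.5) for c in codons}
        top = max(fam.values())
        for c in codons:
            w[c] = fam[c] / top
    return w


def cai(seq, w):
    """Geometric mean of w over the synonymous codons of seq."""
    logs = [math.log(w[c]) for c in split_codons(seq) if c in w]
    if not logs:
        raise ValueError("sequence contains no synonymous codons")
    return math.exp(sum(logs) / len(logs))
```

With a toy reference in which GCT is used three times and GCC once, a gene made only of GCT codons scores 1.0, while a gene with one GCT and one GCC scores the square root of 1/3.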

Multivariate analysis
Correspondence analysis (CA) has been successfully used to investigate the variation of RSCU values among genes. CA plots all genes in a 59-dimensional hyperspace according to their usage of the 59 informative codons (excluding Met, Trp and termination codons) and identifies the axes that represent the most prominent factors contributing to the variation among genes. Statistical analysis was performed with SPSS version 16.0.
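To make the idea of a major axis concrete, the sketch below extracts only the first CA axis from a genes-by-codons table of non-negative values, using the textbook formulation (standardized residuals followed by a singular value decomposition, approximated here by power iteration). It is a pure-Python illustration under those assumptions, not the software used in the study. The function name ca_first_axis is hypothetical, and every row and column of the table must have a positive total.

```python
import math


def ca_first_axis(table, max_iter=1000, tol=1e-12):
    """Return (row coordinates on axis 1, share of total inertia on axis 1)."""
    grand = sum(sum(row) for row in table)
    P = [[x / grand for x in row] for row in table]
    r = [sum(row) for row in P]          # row masses
    c = [sum(col) for col in zip(*P)]    # column masses
    n_rows, n_cols = len(r), len(c)
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
    S = [[(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    total_inertia = sum(x * x for row in S for x in row)

    # Power iteration on S^T S to find the leading right singular vector.
    v = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary start vector
    previous = 0.0
    for _ in range(max_iter):
        u = [sum(s * x for s, x in zip(row, v)) for row in S]      # u = S v
        w = [sum(S[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]                                # w = S^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("table shows no association to decompose")
        v = [x / norm for x in w]
        if abs(norm - previous) < tol:
            break
        previous = norm

    u = [sum(s * x for s, x in zip(row, v)) for row in S]  # sigma_1 * u_1
    inertia_1 = sum(x * x for x in u)                       # sigma_1 squared
    coords = [u[i] / math.sqrt(r[i]) for i in range(n_rows)]
    return coords, inertia_1 / total_inertia
```

The sign of the coordinates is arbitrary, as in any CA, so only the relative positions of genes along the axis are meaningful. For a full decomposition with several axes, a linear algebra library would normally be preferred over power iteration.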

Discussion:
Overall codon usage analysis
The T. elongatus genome has a moderately high GC content (53%); given this compositional constraint, little or no bias is expected in the distribution of G-, C-, A- and T-ending codons. Among the 18 most widely used codons, 8 are GC-ending (6 C-ending and 2 G-ending), whereas 10 are AT-ending (7 T-ending and 3 A-ending) (Table 1; see supplementary material).
Nc and GC3s were used to further analyze the degree of heterogeneity in codon usage in T. elongatus. Nc values for the genes ranged from 34.77 to 61, with a mean of 49.17 and a standard deviation of 3.65, reflecting marked variation of codon usage among the genes. Analysis of GC3s values also confirmed the heterogeneity of synonymous codon usage among genes of T. elongatus. The strong influence of compositional constraints on codon usage bias is evident from the significant negative correlation between GC3s and Nc (r = -0.558, P < 0.01). These results supported our hypothesis that compositional constraints influence codon usage variation among the genes of this organism.
The Nc plot (Nc versus GC3s) is used to explore the determinants of codon usage variation among genes. Since Nc is constrained by G+C content, it is plotted against the GC3s of each gene to investigate patterns of codon usage. It has been suggested that if GC3s were the only determinant of codon usage variation among genes, Nc values would fall on the continuous curve giving the expected Nc as a function of GC3s [27]. In the Nc plot of T. elongatus (Figure 2), the majority of genes with higher GC3s have lower Nc values and, except for a few genes, most lie far below the expected curve, suggesting additional codon usage bias. This again indicates that factors other than compositional constraint shape the codon usage bias of this genome. Translational selection therefore also seems to affect codon usage, given the wide range of Nc values at the same GC3s.
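The correlation coefficients reported here and in later sections are bivariate (Pearson) correlations. A minimal sketch of that calculation is given below. The function name pearson_r is hypothetical, and the numbers in the usage note are toy values, not data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

Given per-gene lists of GC3s and Nc values, pearson_r(gc3s_values, nc_values) would return a negative r when genes with higher GC3s tend to have lower Nc; for the toy inputs [1, 2, 3] and [6, 4, 2] it returns -1.0.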
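The expected curve in an Nc plot is usually drawn from the relation Nc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3s. The sketch below assumes that this commonly used form is the curve referred to in [27] and shown in Figure 2; the function name expected_nc is hypothetical.

```python
def expected_nc(gc3s):
    """Expected Nc if GC3s were the only determinant of codon usage.

    Assumes the commonly used form Nc = 2 + s + 29 / (s^2 + (1 - s)^2).
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("GC3s must lie between 0 and 1")
    s = gc3s
    return 2.0 + s + 29.0 / (s * s + (1.0 - s) * (1.0 - s))


# Points for drawing the curve across the full GC3s range.
curve = [(i / 100, expected_nc(i / 100)) for i in range(101)]
```

A gene whose observed Nc is well below expected_nc of its own GC3s has stronger codon bias than composition alone would predict, which is the reading applied to the genes lying below the curve.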

Multivariate statistical analysis
To examine whether amino acid composition exerts any constraint on synonymous codon usage, we performed CA on RSCU values. The first axis accounted for 17.99% of the total variation, compared with 7.77% for the second axis. The much higher value for the first axis indicated a single major explanatory axis accounting for codon usage variation in the organism. Bivariate correlation analysis was used to relate the first and second major axes to the codon usage indices, and it showed a significant negative correlation between Nc and GC3s (r = -0.558, P < 0.01). We therefore suggest that genes located on the right side of the first axis, with higher GC3s and GC content and lower Nc values, have strong codon bias, and that this variation in codon usage is strongly correlated with the nucleotide composition of the genes. The first major axis showed a significant positive correlation with C3s and G3s but a negative correlation with A3s and T3s (Table 2; see supplementary material). Highly biased genes, those with G- and C-ending codons, clustered on the positive side of the first major axis, whereas codons ending with A and T predominated on the negative side (Figure 1). The negative correlation between Nc and GC3s (r = -0.558, P < 0.01) was stronger than that between Nc and GC (r = -0.394). A significant positive correlation was also observed between Axis 1 and GC3s, confirming that GC3s plays an important role in shaping codon usage bias in this genome. Moreover, when G3s and C3s are considered separately, the correlation of gene positions along the first axis with C3s (r = 0.542, P < 0.01) is much larger than that with G3s (r = 0.151, P < 0.01), indicating that C3s contributes more than G3s to the variation in overall GC3s content among genes. We therefore speculate that compositional mutational bias shapes the codon usage bias in the T. elongatus genome.

Translational optimal codons
To identify optimal codons, which are preferred among highly expressed sequences, we analyzed codon usage in the 10% of genes at each extreme of axis 1. Among highly biased genes, 20 codons showed significantly higher usage (relative to their synonyms) and could be inferred to be translationally optimal codons [7]. Of these 20 codons, 12 were C-ending (60%), 6 G-ending (30%) and 2 T-ending (10%); none was A-ending (Table 3; see supplementary material). This is consistent with the correlation of Axis 1 with C3s (r = 0.542, P < 0.01). The most frequently used stop codon among highly expressed genes was UAG, whereas among lowly expressed genes it was UGA.
The predominance of C-ending codons suggested that translational selection also operates on codon usage variation in this organism. If compositional constraints were the only factor shaping codon usage variation among the genes, A- and T-ending codons should have been present in good proportion among the optimal codons, which is not the case.
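The selection of genes from both extremes of axis 1 can be sketched as follows. This covers only the grouping step, not the statistical comparison of codon usage between the two groups. The function name axis_extremes and the dictionary layout (gene identifier mapped to axis 1 coordinate) are hypothetical, and which end corresponds to highly expressed genes has to be decided from the data.

```python
def axis_extremes(axis1, fraction=0.10):
    """Return (genes at the low end, genes at the high end) of axis 1.

    axis1 maps a gene identifier to its coordinate on the first CA axis.
    """
    if not axis1:
        raise ValueError("no genes supplied")
    n = max(1, int(len(axis1) * fraction))
    ranked = sorted(axis1, key=axis1.get)  # gene ids, ascending by coordinate
    return ranked[:n], ranked[-n:]
```

With 20 genes and the default fraction, each group contains the two genes with the most extreme coordinates at its end of the axis.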

Differential base usage in third codon position
The correlation between the frequencies of the four bases at the third codon position and the Nc values of all genes was examined (Table 4; see supplementary material). We considered highly biased genes, those with low Nc values, as highly expressed and vice versa. Across all genes, the frequency of C at the third codon position increased with decreasing Nc, whereas the frequencies of A and T increased with Nc. The frequency of G at the third codon position, however, was not influenced. Thus, the influence of mutational bias is reflected in the choice of bases at the third position. This is expected, since optimal codons are, in general, chosen in accordance with the mutational bias. Owing to translational selection, mutational bias is expected to be more prominent at the third codon position of highly expressed genes with longer translational timing.
Gene length was significantly correlated with gene position on axis 1, GC3s and Nc (r = 0.158, 0.170 and -0.173, respectively; P < 0.01), indicating that gene length influences codon usage in T. elongatus. Gene length was negatively correlated with Nc (r = -0.173) and positively correlated with GC3s (r = 0.170). Therefore, Nc is lower, and codon bias correspondingly stronger, in longer genes than in shorter ones (Figure 3). This suggests that, to a certain extent, translational accuracy also plays a role in dictating the codon usage variation in this genome.

CAI and codon usage variation: role of gene expressivity
Codon Adaptation Index (CAI) has been used extensively as a measure of gene expression level. Within the T. elongatus genome, CAI was positively correlated with the positions of genes along axis 1 and with GC3s (r = 0.274 and r = 0.229, respectively; P < 0.01) (Figure 4) but negatively correlated with Nc (r = -0.389). This reflects the effect of gene expression level on the codon selection pattern in T. elongatus. The data suggest that genes with higher expression levels, which exhibit a greater degree of codon usage bias, are distributed on the right side of the first axis and prefer G- or C-ending codons. This result indicates a role of gene expression level in codon usage variation along with compositional constraints, and supports the argument that additional factors besides compositional constraint shape the codon usage bias of this genome. Furthermore, CAI was positively correlated with C3s (r = 0.605, P < 0.01) but negatively correlated with G3s (r = -0.243, P < 0.01), supporting our hypothesis that the influence of mutational bias is reflected in the choice of bases at the third codon position and thus favors translational selection.

Conclusion:
Codon usage pattern was analyzed in T. elongatus genome to elucidate the factors responsible for variation in codon usage.
From the analysis, it is clear that compositional constraint is not the only factor shaping codon usage variation in this genome. In T. elongatus, a single major explanatory axis accounts for codon usage variation. CA of T. elongatus genes revealed a strong correlation of GC3s with this axis. In addition, a negative correlation was observed between Nc and GC3s. These results suggest that mutational bias is the major factor shaping codon usage in this organism. Among the four nucleotides at the third position, C3s was found to directly influence codon usage variation. Thus, the influence of mutational bias is reflected in the choice of bases at the third position, supporting our hypothesis that compositional mutational bias plays an important role in shaping the codon usage bias of this genome. The third base of a codon is said to wobble, and changes in the third base very often do not change the encoded amino acid, so base compositional mutational bias can lead to different codon choices within the same protein sequence [23]. Highly expressed genes tend to use C or G at synonymous positions compared with lowly expressed genes. The preference for C-ending codons in highly expressed genes might be related to translational efficiency, as it has been reported that RNY (R, purine; N, any nucleotide; Y, pyrimidine) codons are more advantageous for translation [28]. Interestingly, 68% of the preferred triplets in the highly expressed genes of T. elongatus are pyrimidine-ending codons, suggesting that, besides mutational bias, translational selection also operates to some extent in determining codon usage variation in this organism. The interpretation of pyrimidine selection involves the energy level of the codon-anticodon interaction and is connected to translational fidelity [12]. Gene length was also found to influence codon usage, which supports a role of translational accuracy in dictating the codon usage variation of this genome. Gene expression level also influences the codon selection pattern of this organism. Thus, mutational bias is the major factor shaping codon usage in the T. elongatus genome, but the roles of translational selection, translational accuracy and gene expression level cannot be excluded.

Figure 1: Position of T. elongatus genes along the two major axes of variation in the correspondence analysis on RSCU.

Figure 2: Nc plot for T. elongatus genes. The solid line indicates the expected Nc value if codon bias is only due to GC3s.

Figure 3: Plot of Nc vs Gene length for T. elongatus genes.
Selection for translational accuracy is predicted to produce a positive correlation between codon bias and gene length [15]. From the plot of gene length against Nc (Figure 3), it is clear that shorter genes have a much wider variance in Nc values than longer genes. Lower Nc values in longer genes may be expected owing to the direct effect of translation time on fitness or to the extra energy cost associated with longer translational timing.

BIOINFORMATION open access ISSN 0973-2063 (online) 0973-8894 (print) Bioinformation 8(13): 622-628 (2012) © 2012 Biomedical Informatics

Figure 4: Scatter plot of CAI with Axis 1

Table 2: Relationship between two major axes and codon usage indices identified through bivariate correlation analysis

Table 3: Relative synonymous codon usage for the highly and lowly expressed genes in Thermosynechococcus elongatus BP-1 identified as optimal at p < 0.01; each group contains 10% of sequences at either extreme of the major axis generated by CA; AA, amino acid; N, number of codons; h, highly expressed genes; l, lowly expressed genes