Unusual codon usage bias in low expression genes of Vibrio cholerae.

Positive correlation between gene expression and synonymous codon usage bias is well documented in the literature. However, in the present study of the Vibrio cholerae genome, we have identified a group of genes that have unusually high codon usage bias despite low potential expressivity. Our results suggest that codon usage in lowly expressed genes might also be under selection, favoring non-optimal codons to maintain a low cellular concentration of the encoded proteins. This would predict that lowly expressed genes are also biased in codon usage, but in a direction opposite to the bias of highly expressed genes.


Background:
In most species synonymous codons are not used with equal frequencies, a phenomenon known as codon usage bias. Codon bias is generally governed by a balance between mutation, genetic drift and natural selection [1-5].
In various organisms, such as Escherichia coli and Saccharomyces cerevisiae, synonymous codon usage bias has been shown to be correlated with the abundance of isoaccepting tRNA [6]. An optimal codon is thought to increase translation rate [7-9]. On the other hand, the presence of non-optimal codons has been postulated to reduce translation rate [10], probably due to a relative scarcity of cognate tRNA species. Non-optimal codons may therefore confer a selective advantage by maintaining a low cellular concentration of the proteins that they encode [11]. It was reported previously that non-optimal codons occur at high frequency in the signal sequence of secretory genes in Escherichia coli [12]. A high occurrence of non-optimal codons in the signal sequence of secretory proteins has also been observed in the gram-positive bacterium Streptomyces coelicolor [13].
Apart from gene expression level, gene length also plays an important role in synonymous codon usage bias. Several earlier studies have documented strong effects of gene length on codon bias in a variety of organisms. The level of synonymous codon usage bias has been shown to be positively correlated with gene length in Escherichia coli [14]. In contrast, in the Drosophila genome, longer genes had lower codon usage bias [15]. Hou and Yang [16], however, reported that in the S. pneumoniae genome longer genes had higher expression levels and higher codon usage bias.
Cholera remains a heavy burden on human health in some developing countries, including India, where sanitation is poor and health care is limited [17][18][19][20]. The publication of the complete genome sequence of Vibrio cholerae [21], the etiological agent of cholera, opened up extensive possibilities, earlier unavailable, for understanding the genetic organization of this organism. The present study demonstrates an unusual trend in the synonymous codon usage pattern of lowly expressed genes of the Vibrio cholerae genome. Contrary to the usual expectation, we have identified 138 genes that are highly biased yet lowly expressed. Moreover, the usage pattern of non-optimal codons in lowly expressed genes depends on gene length. Our results suggest that translational selection has a significant influence on the codon usage pattern of lowly expressed genes, depending on gene length.

Methodology:
The complete genome sequence of Vibrio cholerae was downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes and the coding sequences were extracted. To minimize sampling errors [22], only coding sequences of at least 30 amino acids were retained for analysis. Correspondence analysis [23], as available in CodonW 1.4.2 (J. Peden, 2000; http://www.molbiol.ox.ac.uk/cu/), was used to investigate the major trend in relative synonymous codon usage variation among the genes. We also used CodonW to calculate Relative Synonymous Codon Usage (RSCU) values and gene length.
Synonymous codon usage bias was measured by calculating the 'effective number of codons used in a gene' (Nc) [22,24]. The values of Nc range from 20 (when one codon is used per amino acid) to 61 (when all the codons are used with equal probability). In the present study, a gene is designated as highly biased if Nc < 36 and lowly biased if Nc > 44. We used the Codon Adaptation Index (CAI) to estimate gene expressivity. CAI is widely accepted as an effective measure of the potential level of gene expression [25]. The CAI of individual genes was calculated taking as the reference gene set all the ribosomal proteins, which are known to be highly expressed. We sorted our dataset according to the CAI values and took the genes in the extreme 20% of the population from both ends of the sorted dataset. Using these criteria, a gene is considered lowly expressed if its CAI < 0.318 and highly expressed if its CAI > 0.502.
The transfer RNA gene copy numbers were taken from the tRNA scan database (http://lowelab.cse.ucsc.edu/GtRNAdb/Vibr_chol/). Student's t-test was used to evaluate the significance of all pairwise differences. Correlation coefficients and their statistical significance were determined using SPSS (13.0).
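The RSCU and CAI measures described above can be illustrated with a minimal Python sketch. It is not CodonW's implementation. The function names are hypothetical, the standard genetic code is assumed, and the pseudo-count of 0.5 for codons absent from the reference set is an assumption of this sketch that may differ from the setting used in the study.

```python
from collections import Counter
from itertools import product
from math import exp, log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid ('*' marks a stop codon).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families; Met and Trp (single codon) and stops are excluded.
FAMILIES = {}
for codon, aa in CODE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)
FAMILIES = {aa: cods for aa, cods in FAMILIES.items() if len(cods) > 1}


def codon_counts(seq):
    """Count in-frame codons of a coding sequence (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))


def rscu(counts):
    """RSCU = observed count / mean count within the synonymous family."""
    values = {}
    for cods in FAMILIES.values():
        total = sum(counts[c] for c in cods)
        for c in cods:
            values[c] = counts[c] * len(cods) / total if total else 0.0
    return values


def relative_adaptiveness(ref_counts, pseudo=0.5):
    """w = count / largest count in the family, from pooled reference genes.

    ref_counts must be a Counter; unused codons get the assumed pseudo-count.
    """
    w = {}
    for cods in FAMILIES.values():
        adj = {c: ref_counts[c] or pseudo for c in cods}
        top = max(adj.values())
        for c in cods:
            w[c] = adj[c] / top
    return w


def cai(seq, w):
    """Geometric mean of w over the codons of one gene."""
    logs = [log(w[c]) for c in codon_counts(seq).elements() if c in w]
    return exp(sum(logs) / len(logs)) if logs else float("nan")
```

In use, the codon counts of all ribosomal protein genes would be pooled into one Counter, passed to relative_adaptiveness, and the resulting weights applied to every gene with cai.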

Results and Discussion:

Codon usage bias and gene expression level
Many studies have demonstrated a positive correlation between the degree of codon bias and the level of gene expression [25, 29, 30]. As a result, it is generally expected that lowly expressed genes should have lower codon bias and highly expressed genes should have higher codon bias. When the analysis was performed on all genes in the V. cholerae genome, we also observed a negative correlation (r = -0.2994, P < 0.01) between Nc and CAI. Because a lower Nc means stronger bias, this indicates that the degree of codon bias increases with gene expression level. However, careful inspection of the plot of Nc against CAI (Figure 1) reveals that although lowly biased genes are lowly expressed, not all highly biased genes are highly expressed. We identified 138 genes that show this unusual pattern of codon usage (i.e., high codon usage bias yet low expression).
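As an illustration of the screening step, the sketch below computes a Pearson correlation and flags genes that are highly biased yet lowly expressed, using the Nc and CAI cut-offs stated in the Methodology. The function names and the four gene records are hypothetical and serve only as an example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def classify(nc, cai):
    """Apply the Nc (36, 44) and CAI (0.318, 0.502) cut-offs."""
    bias = "high" if nc < 36 else "low" if nc > 44 else "intermediate"
    expr = "low" if cai < 0.318 else "high" if cai > 0.502 else "intermediate"
    return bias, expr


def unusual(genes):
    """Names of genes that are highly biased yet lowly expressed."""
    return [name for name, nc, cai in genes
            if classify(nc, cai) == ("high", "low")]


# Hypothetical (name, Nc, CAI) records, for illustration only.
genes = [("g1", 32.0, 0.25), ("g2", 30.5, 0.61),
         ("g3", 52.3, 0.28), ("g4", 40.0, 0.40)]
print(unusual(genes))
print(pearson_r([g[1] for g in genes], [g[2] for g in genes]))
```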

Correspondence analysis on RSCU: Identification of translationally non-optimal codons
We performed correspondence analysis (CoA) on the set of highly and lowly expressed genes. Since codon usage is by its very nature multivariate, one of the most popular multivariate methods for studying codon usage variation is correspondence analysis [23]. Correspondence analysis identifies the major trends in the variation of the synonymous codon usage data and distributes genes along continuous axes in accordance with these trends. Correspondence analysis on relative synonymous codon usage (RSCU) detected one major trend of codon usage variation, on the first axis of inertia. The first axis accounted for 17.48% of the total variation and no other axis accounted for more than 6.97% of the total variation. As expected, the position of the genes along the first major axis is significantly correlated with the corresponding CAI values (r = -0.9648, P < 0.01). By looking at the distribution of codons along the first two major axes (Figure 2), we identified the five most preferred codons in lowly expressed genes, situated at the most extreme positions on the positive side of Axis 1. These codons are AGG, CGA, CGG, AUA, and AGA. Non-optimal codons are defined by their low usage in the genome and the low abundance of their corresponding tRNA [8, 31]. Comparing the RSCU values of these five codons with those of their synonymous alternatives (Table 1 in supplementary material) shows that they are used less frequently than their synonyms. Moreover, we used tRNA gene copy number data (Table 1 in supplementary material) to assess the abundance of their corresponding tRNA. Table 1 also suggests that, for the five most preferred codons in lowly expressed genes, the corresponding tRNA gene copy number is either nil or the lowest among the synonyms. Thus, these five codons can be considered non-optimal codons.
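The quantities reported above (gene positions on the first axis and the share of total variation it explains) can be sketched in plain Python as follows. This is a generic correspondence analysis using power iteration, not CodonW's implementation. The function name and the toy table are hypothetical, and the input is assumed to be a non-negative genes-by-codons table with no all-zero row or column.

```python
import random
from math import sqrt


def first_axis(table, iters=500, seed=0):
    """Return (row coordinates on axis 1, fraction of inertia explained)."""
    total = sum(map(sum, table))
    P = [[x / total for x in row] for row in table]
    r = [sum(row) for row in P]          # row masses
    c = [sum(col) for col in zip(*P)]    # column masses
    # Standardized residuals; their squared sum is the total inertia.
    S = [[(P[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j])
          for j in range(len(c))] for i in range(len(r))]
    inertia = sum(x * x for row in S for x in row)

    # Power iteration on S^T S for the leading right singular vector.
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in c]
    for _ in range(iters):
        Sv = [sum(s * x for s, x in zip(row, v)) for row in S]
        v = [sum(S[i][j] * Sv[i] for i in range(len(r)))
             for j in range(len(c))]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]

    Sv = [sum(s * x for s, x in zip(row, v)) for row in S]
    sigma2 = sum(x * x for x in Sv)      # inertia carried by axis 1
    coords = [Sv[i] / sqrt(r[i]) for i in range(len(r))]
    return coords, sigma2 / inertia


# Hypothetical 3 genes x 3 codons table, for illustration only.
coords, share = first_axis([[10, 2, 4], [3, 9, 5], [5, 5, 6]])
print(coords, share)
```

The sign of the axis is arbitrary, so only the relative positions of genes along it are meaningful. The gene coordinates can then be correlated with CAI, Nc, or gene length.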

Correspondence analysis on RSCU of low expression genes
We identified a set of 138 genes showing an unusual pattern of synonymous codon usage, i.e., they are highly biased (low Nc) but lowly expressed (low CAI). Therefore, correspondence analysis was performed on the RSCU of lowly expressed genes to analyze the differential nature of the selective constraints acting on their synonymous codon usage pattern. CoA detected a single explanatory axis of major synonymous codon usage variation. The first major axis accounted for 9.58% of the total variation in codon usage and no other axis accounted for more than 5.34% of the total variation. More importantly, the position of the genes along the first major axis is significantly correlated with Nc (r = -0.3395, P < 0.01) and gene length (r = -0.2818, P < 0.01). Thus genes placed on the positive side of Axis 1 are highly biased and shorter. We also compared the average length of the lowly biased and highly biased groups of lowly expressed genes. The average length of highly biased genes is 50.15 and the average length of lowly biased genes is 372.08. This difference in gene length between the lowly biased and highly biased groups of lowly expressed genes is statistically significant at P < 0.001.
We also analyzed the codon distribution along the first major axis generated from CoA on the RSCU of lowly expressed genes (data not shown). The most preferred codons at the extreme of the positive side of Axis 1 are AGG, AUA, UCA, ACA, and AGA, and those at the extreme of the negative side of Axis 1 are CGC, CGG, CCG, CGU, and CUG. One interesting observation is that among the five most preferred codons on the positive side of Axis 1, three are non-optimal codons (please see section Correspondence analysis on RSCU: Identification of translationally non-optimal codons). On the other hand, there is no non-optimal codon present among the most preferred codons at the negative side of Axis 1.
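The length comparison between the two groups relies on Student's t-test. A minimal sketch of the pooled-variance t statistic is given below. The function name and the length values are hypothetical, and turning the statistic into a P value additionally requires the t distribution (for example through a statistics package), which this sketch does not implement.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic (equal variances assumed) and its df."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical gene lengths for the two groups, for illustration only.
highly_biased = [42, 55, 48, 61, 39]
lowly_biased = [310, 420, 365, 398, 350]
print(student_t(highly_biased, lowly_biased))
```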

Non-optimal codon usage: Shorter gene length and greater codon biasness of lowly expressed genes
From the above results it is clear that the frequencies of non-optimal codons are greater in the highly biased group of lowly expressed genes, and that the average length of this group of genes is significantly smaller than the average length of lowly biased lowly expressed genes. The presence of non-optimal codons has been postulated to reduce translation rate [10], probably due to a relative scarcity of cognate tRNA species. Considering these facts, it is reasonable to argue that selective constraints on the usage of non-optimal codons are greater in the highly biased group of lowly expressed genes than in lowly biased lowly expressed genes.
If the synonymous codon usage pattern among the lowly expressed genes is explained by selection to reduce translation rate, is this consistent with the length effect? Several earlier studies have also documented the influence of gene length on codon bias in a variety of organisms. This length effect could be explained by selection on translation rate: for example, in a short gene with 100 codons, a single codon mutation that slows translation would increase translation time by 1%, whereas the same mutation in a gene with 1000 codons would increase translation time by only 0.1%. In the present study, among the lowly expressed genes, mutations of non-optimal codons will have a greater relative effect in smaller genes than in larger genes. Thus such mutations are more likely to be counter-selected in short genes than in long genes.
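The length argument above is simple arithmetic and can be checked with a short sketch. It assumes, for illustration only, that every codon normally takes one unit of translation time and that the mutated codon costs one additional unit; the function name is hypothetical.

```python
def relative_slowdown(n_codons, extra_codon_times=1.0):
    """Fractional increase in translation time for a gene of n_codons codons.

    Assumes one time unit per codon and `extra_codon_times` additional
    units for the single mutated codon.
    """
    return extra_codon_times / n_codons


for length in (100, 1000):
    print(length, f"{relative_slowdown(length):.1%}")
```

Under these assumptions the relative effect of one mutation scales as the inverse of gene length, so it is ten times larger in the 100-codon gene than in the 1000-codon gene.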

Conclusion:
In summary, the present study focuses on the unusual trends in the synonymous codon usage pattern of lowly expressed genes in the V. cholerae genome. Selection forces governing synonymous codon usage in bacterial genes usually vary across or within genomes. One might, therefore, expect to observe species-specific and/or gene-specific trends in synonymous codon usage. This study finds that selective preference for non-optimal codons in shorter lowly expressed genes has made them highly biased, and that this preference might have a greater role in translational pausing to allow correct folding of proteins. The unusual pattern of synonymous codon usage observed in a subset of lowly expressed genes of the Vibrio cholerae genome may provide a new starting point for the study of the organism's environmental and pathobiological characteristics. It will be interesting to see whether the synonymous codon usage pattern influences the expression of genes that are unique to the organism's survival and replication during human infection [35] as well as in the environment [17, 36].

Figure 1: Variation of Effective Number of Codons (Nc) against Codon Adaptation Index (CAI).

Figure 2: The distribution of codons of all Vibrio cholerae genes along the first and second axes of the correspondence analysis.

Table 1: RSCU values and corresponding tRNA copy number of Vibrio cholerae genome.