A cross talk between codon usage bias in human oncogenes

Background: Oncogenes are genes that have the potential to induce cancer. The extent and origin of codon usage bias is an important indicator of the forces shaping genome evolution in living organisms. Results: We observed moderate correlations between gene expression, as measured by CAI, and GC content at any codon site. There was a significant positive correlation (Spearman's r = 0.45, P < 0.01) between GC content at the first and second codon positions and that at the third codon position. Further, a striking negative correlation (r = -0.771, P < 0.01) between ENC and the GC3s values of each gene, and a positive correlation (r = 0.644, P < 0.01) between CAI and ENC, were also observed. Conclusions: Mutation pressure, rather than natural selection, is the major determining factor in shaping the codon usage pattern of oncogenes, since its effects are present at all codon positions. The results revealed that codon usage bias determines the level of oncogene expression in humans. Highly expressed oncogenes had high GC content and a high degree of codon usage bias.


Background:
The genetic code provides the basic instructions and information needed to direct efficient protein synthesis and folding. Sixty-one codons specify only the twenty amino acids commonly found in protein sequences, so most of these amino acids (the building blocks of proteins) can be encoded by more than one codon (i.e., a triplet of nucleotides); such codons are described as synonymous and mostly differ by one nucleotide in the third position [1]. The term codon bias, or more preferably codon usage bias, refers to the unequal usage of synonymous codons for encoding amino acids, which may vary significantly between genomes, between genes in the same genome, and within a single gene [2-3]. Since the 1970s, the unequal use of synonymous codons has been confirmed in many organisms. To date, the codon usage patterns of many organisms have been interpreted in terms of diverse causes. It has been reported that two major factors are involved in the maintenance of codon usage bias: weak natural selection and mutational pressure [4]. The selection associated with translational efficiency or accuracy is often termed 'translational selection'. Moreover, synonymous codon usage has been reported to vary at distinct sites along a coding sequence and to be related to the balance of strong versus weak base pairing, the maintenance of DNA and RNA secondary structure, and translational efficiency and fidelity [5]. For these reasons, codon usage bias among different organisms, or among the genes of the same organism, has attracted much attention, and various works on the subject have been published in recent years. Lavner and Kotlar (2005) suggested three possible ways in which selection may act on codon bias in the human genome: (1) increasing translation efficiency in highly expressed genes; (2) regulating the translation efficiency of some proteins that can be disadvantageous at high levels; and (3) improving translation efficiency and reducing the rate of amino acid misincorporation in the production of biosynthetically expensive proteins [3]. Many genomic analyses have been carried out on oncogenes, but to date very little is known about their codon usage patterns and the factors that influence them. Codon usage patterns are important for understanding the molecular mechanisms and evolution of oncogenes. In this paper, we analyzed the key genetic factors that play a crucial role in determining the codon usage pattern in fifty (50) oncogenes. To the best of our knowledge, this is the first systematic study to examine the synonymous codon usage pattern of oncogenes and the factors affecting it.

Retrieval of Sequence data
A list of human oncogenes was compiled from the web site (http://cbio.mskcc.org/CancerGenes/Select.action). The complete nucleotide coding sequence of each gene was obtained from the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov). Codon usage bias was measured in the 50 oncogenes listed in Table 1 (see supplementary material). The complete coding sequence (cds) of each oncogene was analyzed using a PERL program developed by us.

Analysis of synonymous codon usage bias
We measured the non-uniform usage of synonymous codons for the oncogenes by analyzing several genetic indices given below:

Nucleotide composition
The frequency of G+C at the synonymous third codon position (GC3s) is a good indicator of the extent of base composition bias [6]. The frequencies of the nucleotides A, C, U (T) and G in the complete coding sequence of each oncogene, and the overall (G+C)% content at the different codon positions (GC1%, GC2%, GC3%), were calculated to study the relationship between codon usage variation and compositional constraints.
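The position-wise calculation can be sketched as follows. This is a minimal illustration in Python, not the PERL program used in the study (which is not shown here); the function name `gc_by_position` is hypothetical. The sketch counts every codon in the reading frame, whereas GC3s would additionally exclude the Met, Trp and stop codons.

```python
def gc_by_position(cds):
    """Return (GC1%, GC2%, GC3%) for a coding sequence read in frame.

    Assumption: the sequence starts at codon position 1; a trailing
    partial codon, if any, is ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3      # drop a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    percents = []
    for pos in range(3):
        bases = cds[pos:usable:3]         # every base at this codon position
        gc = sum(1 for b in bases if b in "GC")
        percents.append(100.0 * gc / n_codons)
    return tuple(percents)
```

For the toy sequence `ATGGCCTAA` (codons ATG, GCC, TAA), one of three bases is G or C at each of the first two positions and two of three at the third position.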

Effective number of codons
The effective number of codons (ENC) is generally used to measure the codon usage bias of a gene and is independent of gene length and the number of amino acids [7]. The ENC value ranges from 20 to 61. For a gene in which only one codon is used for each amino acid, this value would be 20, whereas if all codons are used equally the value would be 61 [7]. An ENC value closer to 20 indicates strong codon usage bias in the gene, and such biased genes are highly expressed [8]. The ENC values for all cds sequences were computed as per Wright (1990) [7]. In addition, to examine the influence of GC content on codon usage, the relationship between ENC and the GC3s content of each gene was plotted according to the equation described by Wright (1990) [7].
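The reference curve of Wright (1990) gives the ENC expected when GC3s is the only source of bias: ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3s expressed as a fraction. A minimal Python sketch follows; the function name `expected_enc` is hypothetical and is not taken from the authors' program.

```python
def expected_enc(gc3s):
    """Expected ENC under GC3s-only constraint (Wright, 1990).

    gc3s is the G+C frequency at synonymous third positions as a
    fraction between 0 and 1 (not a percentage).
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)
```

The curve peaks near GC3s = 0.5, where the expected value is 60.5, and falls to 31 and 32 at the two extremes. Comparing a gene's observed ENC with `expected_enc` at its GC3s shows how far the gene lies from the reference curve.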

Codon adaptation index
The codon adaptation index (CAI) is used to estimate the degree of bias toward the codons used in highly expressed genes and thus assesses the effective selection that helps shape the codon usage pattern [9-10]. The CAI ranges from 0 to 1: low values indicate little bias toward the codons preferred in highly expressed genes, and the value is 1, indicating the strongest bias, when only optimal codons are used [11]. The CAI value was calculated as per Sharp et al. [12].
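In the usual formulation, CAI is the geometric mean of the relative adaptiveness values w of the codons in a gene, where w for a codon is its usage in a reference set of highly expressed genes divided by the usage of the most frequent synonymous codon. The Python sketch below assumes such a table of w values is already available; the function name `cai` and the `weights` dictionary are hypothetical, and no reference set from this study is reproduced here.

```python
import math

def cai(cds, weights):
    """Geometric mean of relative adaptiveness values over a coding sequence.

    weights maps codon -> w (0 < w <= 1). Codons absent from the table
    (for example ATG, TGG and the stop codons) are skipped.
    """
    cds = cds.upper().replace("U", "T")
    logs = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        w = weights.get(cds[i:i + 3])
        if w is None:
            continue                      # codon carries no weight, ignore it
        logs.append(math.log(w))          # assumes w > 0 for every listed codon
    if not logs:
        raise ValueError("no codons with a weight were found")
    return math.exp(sum(logs) / len(logs))
```

With the illustrative table `{"CTG": 1.0, "CTA": 0.25}`, the sequence `CTGCTA` scores 0.5, the geometric mean of 1.0 and 0.25.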

Frequency of optimal codons
The frequency of optimal codons (Fop) is used to measure codon usage bias in a gene [11]. Fop is calculated as the ratio of the number of optimal codons used to the total number of synonymous codons [13]. The Fop value ranges from 0.36, for a gene in which the codon usage pattern is uniform, to 1, for a gene in which codon usage is highly biased [11]. We used the formula given by Lavner & Kotlar to calculate the Fop values for each of the cds selected for the present study.
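The ratio described above can be sketched in Python as follows. This illustrates only the plain ratio of optimal to synonymous codons and is not claimed to reproduce the exact formula used in the study; the function name `fop` and the `optimal_codons` set are hypothetical, and the set of optimal codons must be supplied by the user.

```python
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}  # Met, Trp and stop codons

def fop(cds, optimal_codons):
    """Fraction of synonymous codons in cds that belong to optimal_codons."""
    cds = cds.upper().replace("U", "T")
    optimal = 0
    total = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in NON_SYNONYMOUS:
            continue                      # carries no synonymous choice
        total += 1
        if codon in optimal_codons:
            optimal += 1
    if total == 0:
        raise ValueError("no synonymous codons found")
    return optimal / total
```

For `ATGCTGCTACTGTAA` with CTG as the only optimal codon, ATG and TAA are skipped and two of the three remaining codons are optimal.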

Relative synonymous codon usage
Relative synonymous codon usage (RSCU) is calculated by dividing the observed frequency of a codon by the frequency expected if all synonymous codons for that amino acid were used equally [14].
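A minimal Python sketch of the RSCU calculation is given below, assuming the standard genetic code. The function name `rscu` is hypothetical; amino acids that do not occur in the sequence are left out of the result rather than assigned a value.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds):
    """Return {codon: RSCU} for every codon of each multi-codon amino acid."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue                      # stops, Met and Trp carry no bias
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                      # amino acid absent from this cds
        for c in family:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(family) / total
    return values
```

For the toy sequence `CTGCTGCTGCTA`, leucine has six synonymous codons and four occurrences, so CTG (three occurrences) has an RSCU of 4.5 and CTA (one occurrence) an RSCU of 1.5.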

Correlation analysis
Correlation analysis was used to identify the relationship between the pattern of synonymous codon usage and the genetic indices used in the present study. The analysis was based on Spearman's rank correlation. All statistical analyses were carried out using SPSS software.
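For readers without access to SPSS, Spearman's rank correlation can be reproduced with a short Python sketch such as the one below. It ranks both variables (tied values share their average rank) and then computes the Pearson correlation of the ranks. The function names are hypothetical, and the sketch returns only the coefficient, not the P value.

```python
import math

def _average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                        # extend over a run of tied values
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation coefficient of two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

In this setting, `x` and `y` would be, for example, the per-gene ENC and GC3s values of the 50 oncogenes.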

Results:
In the present study, the selected oncogene cds sequences were downloaded from the NCBI nucleotide database using a Perl program. The program was written so that it selects only those cds sequences that have a perfect start and stop signal and are devoid of any unknown bases (N). We found fifty cds sequences in the correct format for the codon bias study. The extent of codon usage bias was determined in these fifty oncogenes (Table 1). Two amino acids, methionine and tryptophan, which are coded by the single codons ATG and TGG, respectively, and the three stop codons (TAA, TAG, and TGA) would not reveal any usage bias and were therefore excluded from the calculation.
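The selection criteria described above can be sketched in Python as follows. This is an illustration only and not the authors' Perl program; the function name `is_clean_cds` is hypothetical. Two points are assumptions beyond the text: the sketch requires the length to be a multiple of three, and it rejects any character other than A, C, G and T, which is stricter than rejecting N alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_clean_cds(seq):
    """True if seq starts with ATG, ends with a stop codon and has no unknown bases."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False                      # assumption: whole codons only
    if set(seq) - set("ACGT"):
        return False                      # assumption: rejects N and any other symbol
    return seq[:3] == "ATG" and seq[-3:] in STOP_CODONS
```

A list of downloaded sequences could then be filtered with `[s for s in sequences if is_clean_cds(s)]`.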

Codon usage bias and correlation with GC3s
The overall percentages of guanine and cytosine (GC%) and of adenine and thymine (AT%) at the first, second, and third codon positions of the 50 target human oncogenes were investigated (Table 2, see supplementary material). The evolution of codon usage may be controlled either by natural selection or by mutation pressure. To determine the extent of the role of these two evolutionary forces in the codon usage pattern of human oncogenes, we performed correlation analysis between different nucleotide constraints. First, we calculated the GC content at the different codon positions (Figure 1) and found that the GC content at each codon position varies among the genes. We then compared the GC content at the first (GC1) and second (GC2) codon positions with that at the third codon position (GC3s) and observed a significant positive correlation (r = 0.45, P < 0.01) (Figure 2). This suggests that base composition results from mutation pressure rather than natural selection, since its effects are present at all codon positions.

Effective number of codons and its relationship with GC3s values
The average ENC value of the oncogenes was 53.74, with a range of 38 to 60. Thirty-eight oncogenes had ENC values in the range 50-60, 11 in the range 40-50, and 1 between 38 and 40 (Figure 3). Codon usage bias is therefore weak in most cases, although some variation is evident. Moreover, the GC3 values were found to range from 0.3 to 0.1. We calculated the correlation coefficient between the ENC and GC3s values. The ENC value was strongly negatively correlated with the GC3s value of each gene (r = -0.771, P < 0.01). These calculations suggest that genes with higher GC3s values and lower ENC values had strong bias. Finally, we plotted ENC against the GC3% values to investigate the general codon usage variation with the GC content of each gene (Figure 4) [7]. The continuous curve represents the expected positions of genes for which GC3 is the only factor shaping the codon usage pattern. Most genes were located on or above the reference line, suggesting that their codon usage pattern was determined by GC3 values. However, the fact that some genes are located above the reference line indicates that GC3 is not the only factor shaping the codon usage pattern; other factors, such as nucleotide composition, may be involved for these genes.

Level of oncogene expression and codon bias
The level of oncogene expression was measured through codon adaptation index (CAI) values [10, 15], which varied from 0.124 to 0.735, with a mean of 0.395 and a standard deviation of 0.159. The CAI values indicate that most of the genes selected for the present study are highly expressed. Moreover, a significant negative correlation was observed between CAI and GC3s (r = -0.489, P < 0.01) and between CAI and GC content (r = -0.463, P < 0.01). Furthermore, a significant positive correlation was observed between ENC and CAI (r = 0.644, P < 0.01) (Figure 5). The results revealed that the codon usage pattern determines the level of oncogene expression in humans and that highly expressed genes have high GC content and a greater extent of codon usage bias. We also calculated the frequency of occurrence of the synonymous codons for each amino acid. The frequencies were combined with statistical analysis to identify the most and least frequently used codons (Figure 6). Relative synonymous codon usage (RSCU) values were calculated for each synonymous codon to identify the most and least abundant codons. The results indicate that the most abundant codons are CTG for leucine and GTT for valine. The least abundant codons are GTC, ACT, TCG, CTA, and ATA for the amino acids valine, threonine, serine, leucine and isoleucine, respectively (Figure 7).

Discussion:
In brief, we analyzed the codon usage pattern of fifty oncogenes and the key genetic factors that play a decisive role in determining it. Based on the hypothesis that gene expressivity and codon composition are strongly correlated, the codon adaptation index was defined to provide an intuitively meaningful measure of the extent of codon preference in a gene. The present study was carried out to analyze the CAI, Fop, ENC, RSCU and base composition of the oncogenes, and to determine the extent to which these genetic factors are involved in shaping the codon usage pattern. In line with these objectives, we selected fifty oncogenes from Homo sapiens for codon usage bias (CUB) analysis. Accurate coding sequences with correct initiation and termination codons were retrieved using a Perl program developed by us. Analysis of the cds sequences showed that 70% of the cds selected for the study are rich in GC. We also assessed the heterogeneity of codon usage by analyzing the effective number of codons (ENC). We measured the variation of codon usage bias among the oncogenes, which was further confirmed by the distribution of GC content at the third synonymous codon positions. These results indicate that, apart from compositional constraints, other factors might influence the overall codon usage variation among the oncogenes. We calculated the CAI values for the oncogenes and found that seventy-five percent of the cds selected from Homo sapiens qualify as highly expressed genes. We analyzed the normalized AT and GC frequencies at each codon site. A significant correlation was observed between gene expression, as measured by CAI, and GC content at any codon site. Among these, GC3s showed the strongest correlation (-0.489) with gene expression. The frequency of occurrence of each synonymous codon for the amino acids was calculated. The frequencies were combined with statistical analysis to identify the most and least frequently used codons. This frequency analysis showed that AAC, GAC, TGC, CAG, GAG, CAC, AAG and TAC are the codons used most frequently among the cds sequences of the oncogenes.

Conclusion:
Mutation pressure, rather than natural selection, is the major determining factor in shaping the codon usage pattern of oncogenes, since its effects are present at all codon positions. The results revealed that codon usage bias determines the level of oncogene expression in humans. Highly expressed oncogenes had high GC content and a high degree of codon usage bias. The supplementary material contains two tables (Table 3 and Table 4). Table 3 contains the frequency of optimal codons (Fop) in the complete coding region of the 50 oncogenes. Table 4 contains the relative synonymous codon usage (RSCU) values of the 50 cds selected in this study.

Figure 1: Percentage of GC content at three codon positions.

An RSCU value close to 1 indicates a lack of bias, RSCU > 1 indicates a codon used more frequently than expected randomly, and RSCU < 1 indicates a codon used less frequently than expected randomly [14].

Figure 2: Correlation between GC content at first and second codon positions (GC1 & GC2) with that at synonymous third codon positions (GC3s). GC12: average GC content at first and second codon positions.

Figure 4: Distribution of ENC and GC content of the third codon position of 50 different oncogenes. The continuous curve represents the expected curve between ENC and GC contents under random codon usage.

Figure 6: Frequency of highest and least used codons among the 50 cds selected for the present study.

Table 1: Information on the 50 oncogenes used in this study, with accession number and gene length.

Table 2: GC and AT contents at different codon positions in the complete coding regions of the 50 oncogenes.