Nucleotide composition determines the role of translational efficiency in human genes

We analysed the basic sequence features that influence gene expression via codon usage bias in forty selected coding sequences of Homo sapiens, using a simple prokaryotic model, the E. coli K-12 genome. The prime objective was to elucidate the interrelationships among tRNA gene copy numbers, synonymous codons, amino acids and translational efficiency using the tRNA adaptation index. RSCU scores and principal component analysis showed that the isoacceptor tRNAs used only those preferred codons that had G or C at the third codon position. Correlating the tRNA adaptation index with amino acid usage revealed that valine, arginine, serine and isoleucine showed a significant positive correlation with gene expression. It could therefore be inferred that the GC content of these genes might play a major role in shaping codon bias and affecting the translational efficiency of the coding sequences.

certainly not covered by other CUB measures. In recent years, tAI has been used extensively to study problems in diverse biomedical disciplines such as functional genomics, evolutionary biology, and systems biology [5]. Human protein-based therapeutics are an emerging area of drug development, primarily because the high sensitivity and specificity of these molecules have resulted in high success rates in drug development. However, the inherent complexity of proteins restricts their synthesis in living cells, so that producing recombinant proteins on a commercial scale becomes more expensive. Another drawback is that such proteins are not orally bioavailable, as they are denatured or proteolyzed in the gut.
Thus, the cost of heterologous human protein production may be reduced by applying codon usage bias analysis and codon optimization to the design of synthetic genes. Because of its fast growth rate and well-known genetic features, E. coli has been widely used to produce recombinant proteins [6]. Human proteins newly synthesized in E. coli cells are subject to several translational errors such as amino acid substitution, stalling, termination, and possibly frame-shifting. These errors occur when the codon bias of the human genes differs from that of E. coli [6]. We therefore hypothesize that tRNA levels correlate significantly with other sequence features related to translational activity, and we analyze the coding sequences with a view to improving expression efficiency. These sequence features include gene expression level [7], gene length [8], protein amino acid composition [9], tRNA abundance [10], mutation frequency and patterns [11], GC composition [12], tRNA adaptation index (tAI) and amino acid frequencies.
This study aims to predict the expression levels, in the E. coli K-12 genome, of forty human genes of medical importance. It also focuses on the relationship between the codon usage of the human genes and the tRNA genes of E. coli K-12, since tRNA molecules carry the amino acids and pair, through their anticodons, with the corresponding codons. We used tAI as the prime measure to investigate the role of tRNA in protein expression and to estimate gene expression levels.

Materials and Methods:

Data retrieval:
Forty coding sequences (CDS) of human genes were retrieved from the NCBI database (http://www.ncbi.nlm.nih.gov) to analyze their nucleotide content and synonymous codon usage patterns; the sequences are listed in the supplementary sheet.

Relative Synonymous Codon Usage (RSCU):
RSCU is a simple measure of the heterogeneity in the usage pattern of synonymous codons [13]. An RSCU value greater than one indicates that the codon is over-represented, and a value less than one indicates that it is under-represented [14]. It is calculated as

$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$

where $X_{ij}$ is the frequency of the j-th codon for the i-th amino acid and $n_i$ is the number of alternative synonymous codons available for the i-th amino acid.
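The RSCU definition can be illustrated with a short Python sketch. It is not the in-house PERL program used in this study; the function name `rscu` and the choices to use the standard genetic code, to ignore stop codons and to skip amino acids absent from the sequence are assumptions made for illustration.

```python
from collections import Counter

# Standard genetic code in the conventional TCAG ordering (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for one coding sequence (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families, one family per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not present: RSCU is undefined
        for c in family:
            # X_ij divided by the mean count over the n_i synonymous codons
            values[c] = counts[c] * len(family) / total
    return values
```

For example, `rscu("CTGCTGCTGCTA")` scores CTG at 3 x 6 / 4 = 4.5, because leucine has six synonymous codons and CTG supplies three of the four leucine codons in the sequence.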

tRNA adaptation index:
The tRNA adaptation index is measured relative to the supply of the tRNAs required to translate codons into amino acids. tRNA availability is the driving force for translational selection. The index estimates the extent to which a gene (CDS) is adapted to its genomic tRNA pool, and it is used as a measure for predicting gene expression [4]. It is calculated as

$$\mathrm{tAI}_g = \left(\prod_{k=1}^{l_g} w_{i_k}\right)^{1/l_g}$$

where $l_g$ is the length of the gene in codons and $w_{i_k}$ is the relative adaptiveness value of the codon at the k-th position of the gene.
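As a minimal sketch, tAI can be computed as the geometric mean of the relative adaptiveness values over the codons of a gene. The function name `tai` and the dictionary `w` are hypothetical; the derivation of `w` from E. coli K-12 tRNA gene copy numbers is not shown, and skipping codons that are absent from `w` is an assumption of this sketch.

```python
import math


def tai(cds, w):
    """Geometric mean of relative adaptiveness values over a CDS.

    `w` maps codons to relative adaptiveness values in (0, 1]; codons
    missing from `w` (e.g. stop codons) are skipped, which is an
    assumption of this sketch.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    scored = [w[c] for c in codons if c in w]
    if not scored:
        raise ValueError("no codon of the sequence has a value in w")
    # Summing logarithms avoids numerical underflow for long genes.
    return math.exp(sum(math.log(x) for x in scored) / len(scored))
```

With the illustrative values `w = {"CTG": 1.0, "CTA": 0.25}`, the two-codon sequence `"CTGCTA"` gives a tAI of sqrt(1.0 x 0.25) = 0.5.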

Aromaticity and Isoelectric point (pI) of proteins:
Aromaticity is the relative frequency of aromatic amino acids in the translated gene product [15]. The isoelectric point (pI) of an amino acid is the pH at which the amino acid does not migrate in an electric field. The pI values reflect the zwitterionic character of the amino acids.
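The aromaticity score can be sketched in a few lines of Python. The sketch assumes that the aromatic residues are Phe, Tyr and Trp, which is the usual convention; the function name `aromaticity` is hypothetical and the pI calculation is not covered.

```python
def aromaticity(protein):
    """Fraction of aromatic residues (F, W, Y assumed) in a protein sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty protein sequence")
    return sum(protein.count(aa) for aa in "FWY") / len(protein)
```

For a five-residue peptide such as `"MFWYA"`, three of the five residues are aromatic, so the score is 0.6.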

Principal component analysis (PCA) and Correspondence analysis (COA):
PCA is a multivariate statistical method that reduces the multidimensional information in a data matrix to a two-dimensional map [16]. The first and second principal components usually contain the most information. COA identifies the major trends in the variation of the data and distributes the genes along continuous axes, with each subsequent axis explaining a decreasing amount of the variation [17].
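To make the idea of a principal axis concrete, the sketch below extracts the first principal component of a small data matrix by power iteration on the covariance matrix. This is a swapped-in illustrative technique, not the XLSTAT procedure used in this study; the function name, the row and column layout (rows as observations, columns as variables) and the starting vector are assumptions.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (loadings, scores, explained_fraction) for the first axis.

    `rows` is a list of equal-length numeric lists with at least two rows
    and non-zero total variance. Power iteration is used for illustration.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]

    # Sample covariance matrix of the variables (m x m).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    total_variance = sum(cov[j][j] for j in range(m))

    # Non-uniform start, so the vector is unlikely to be orthogonal
    # to the leading eigenvector.
    vec = [float(j + 1) for j in range(m)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]

    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]

    # Rayleigh quotient gives the variance captured by this axis.
    eigenvalue = sum(vec[a] * sum(cov[a][b] * vec[b] for b in range(m))
                     for a in range(m))
    scores = [sum(c[j] * vec[j] for j in range(m)) for c in centered]
    return vec, scores, eigenvalue / total_variance
```

The third returned value is the fraction of total variance explained by the axis, which is the kind of figure reported for F1 and F2 in a PCA summary.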

Software used:
All codon usage bias and base composition analyses were performed using an in-house PERL program developed by SC (corresponding author). PCA was performed with XLSTAT. COA of amino acid usage was performed using CodonW. Heatmaps based on the relative frequencies of codons and amino acids were plotted with XLSTAT. Codon usage variation was analyzed from the RSCU value of each synonymous codon using XLSTAT. SPSS version 21.0 was used to produce the other graphical representations.

Statistical analysis:
Microsoft Excel was used for basic data analysis and interpretation. Pearson's correlation and statistical tests of significance (p<0.01 and p<0.05) were performed using SPSS version 21.0.

Results and Discussion:
The prime goal of the study was to analyze the codon usage patterns, to predict the expression level of the proteins (Table 1), and to evaluate the degree of heterogeneity in codon usage.

Base compositional dynamics:
The human coding sequences were analyzed for their individual nucleotide content, as well as for GC and AT content at the third position of the synonymous codons. The observed pattern of base composition is presented in the supplementary sheet. These human genes were rich in guanine (27.34%) and cytosine (25.77%), followed by adenine (24.67%) and thymine (22.22%). Similarly, the average percentage of GC (53.12%) was higher than that of AT (46.87%). High GC content has been reported to be a consequence of GC-biased repair of mismatches during recombination [18]. GC contents at the three codon positions (GC1, GC2, and GC3) were calculated. The GC3 and GC1 contents were 62.83% and 55.85%, respectively. In humans, the GC content of large genome fragments (isochores) ranges from 30% to 60%, and the GC content at the third codon position (GC3) ranges from 25% to ≥90% [19].
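Position-specific GC content can be computed as in the following sketch. The function name `gc_by_position` is hypothetical, and the sketch counts every codon in the sequence; excluding Met, Trp and stop codons from GC3, as some definitions do, is not implemented here.

```python
def gc_by_position(cds):
    """Return GC1, GC2 and GC3 (in percent) for one coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence shorter than one codon")
    result = {}
    for pos in range(3):
        bases = cds[pos:usable:3]  # every base at this codon position
        gc = sum(base in "GC" for base in bases)
        result[f"GC{pos + 1}"] = 100.0 * gc / n_codons
    return result
```

For the two-codon sequence `"ATGGCC"`, the first positions are A and G, the second T and C, and the third G and C, so GC1 and GC2 are 50% and GC3 is 100%.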

GC3 on codon bias:
To examine the association between codon usage bias and mutational pressure in these genes, we correlated GC3 with GC12, GC23, and GC13, where GC12 is the average of GC1 and GC2, GC23 is the average of GC2 and GC3, and GC13 is the average of GC1 and GC3. As shown in Fig. 1, GC3 showed a significant positive correlation with GC23 (r=0.948, p<0.01), GC13 (r=0.927, p<0.01) and GC12 (r=0.651, p<0.01), which indicated a similar effect of GC mutation bias on the three codon positions. Codon usage might have been subject to an underlying bias in dinucleotide usage [20].
Hence it can be inferred that codon bias in the selected human genes is more likely to be shaped by mutation bias. To predict the expression level of the genes, however, %GC at the three codon positions was correlated with tAI. We observed only weak correlations of tAI with GC1 (r= -0.07), GC2 (r= -0.10) and GC3 (r= -0.11). This indicated that GC content is not a good predictor of gene expression. In conformity with our study, GC3 content was also found to be a very poor predictor of human gene expression in E. coli K-12 [21]. This suggests that a high %GC content may not be required for the efficient expression of genes in E. coli and that forces other than %GC might affect human gene expression in the E. coli genome. Bernardi (1993) observed that codon usage bias is mainly due to differences in the patterns of GC content found in the human genome [22]. An association between codon usage bias and GC content in the surrounding non-coding region could be taken as support for directional mutational pressure [23].
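The pairwise GC averages and Pearson's correlation coefficient used in this section can be sketched as follows. The correlations in this study were computed with SPSS; the helper names `pairwise_gc` and `pearson_r` are hypothetical, and no significance test is included.

```python
import math


def pairwise_gc(gc1, gc2, gc3):
    """Average GC content over each pair of codon positions."""
    return {
        "GC12": (gc1 + gc2) / 2,
        "GC23": (gc2 + gc3) / 2,
        "GC13": (gc1 + gc3) / 2,
    }


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either list is constant.
    return sxy / math.sqrt(sxx * syy)
```

In use, one list would hold the GC3 value of each gene and the other the matching GC12, GC23, GC13 or tAI values, with one entry per gene in the same order.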

Codon usage pattern:
The codon usage pattern of the forty human coding sequences was investigated by calculating RSCU values (Table 2). The preferred codons (RSCU > 1.6) were strongly biased towards G/C bases in the third position, supporting the findings of Dass and Sudandiradoss (2012) [24]. The over-represented codons (RSCU>1.6) are depicted in red, and the under-represented ones (RSCU<1.6) in black to light grey (Figure 2). Most of the preferred codons end in G or C; thus GC3 content might play a key role in gene expression, as also reported by Duret and Mouchiroud (1999) [25].

Principal Component Analysis (PCA):
To further test whether codon usage bias is a consequence of mutation bias, we performed PCA using the RSCU values of all codons. If mutation bias alone shaped codon usage, GC and AT bases should be used uniformly in the coding sequences. In contrast, natural selection for codon choice should not lead to proportional use of GC and AT, since tRNA isoacceptors play an important role in shaping codon usage bias, thereby influencing gene expressivity [26]. The analysis showed that the coding sequences did not use GC and AT bases uniformly, which further indicated that natural selection might have played a key role in determining gene expression (Supplementary sheet). The pattern of codon usage in the human genes is shown in Figure 2. PCA revealed that the first axis (F1) accounted for 27.16% and the second axis (F2) for 7.90% of the total variation in codon usage. Furthermore, the synonymous codons with positive F1 values showed maximum usage of G or C at the synonymous third codon position (except GGG and TTG), whereas codons with negative F1 values were enriched in A or T at the third position (except CGT) (Figure 3).

Amino acid usage and gene expression:
Correlation analysis between tAI and the individual amino acids revealed that Val (r=0.67, p<0.01), Arg (r=0.65, p<0.01), Ser (r=0.65, p<0.01) and Ile (r=0.64, p<0.01) showed a highly significant positive correlation with tAI. This suggests that these amino acids could influence tAI, i.e. gene expression, through the corresponding optimal codon-anticodon pairing.
The codon-anticodon combinations in these genes seemed to be constrained by tRNA availability, owing to the limited tRNA gene copy numbers in the E. coli genome compared with the human genome. We observed that the preferred codons correspond to the most abundant tRNA species; similar results were reported by Ikemura [27]. In another study, Zhang et al. (1991) [28] showed that in E. coli, S. cerevisiae, D. melanogaster, and primates (mainly Homo sapiens), proteins with a high percentage of low-usage codons can be regarded as cases in which an excess of the protein may be detrimental. In bacteria, Saier (1995) presented evidence that the translation of proteins involved in some specific functions may be regulated by the use of rare codons [29]. Our results indicate that the translation efficiency of human genes in a heterologous system such as the E. coli K-12 genome may be governed by the demand for tRNA molecules carrying specific amino acids and the supply of matching codons by the coding sequences on ribosomes [30].
Correspondence analysis (COA) of amino acid usage revealed a single major axis of variation in the genes. Axis 1 showed no significant correlation with GRAVY, aromaticity, hydrophilicity, molecular weight of the proteins, the gene expressivity measure (CAI), the effective number of codons or tAI. However, the isoelectric point of the proteins showed a moderately negative correlation with axis 1 (Table 2). The correlation of axis 1 with pI suggests that amino acid usage may only weakly affect the surface charge distribution of proteins, given the prevalence of acidic and basic residues in these sequences.

Conclusion:
This study was undertaken to understand the role of base composition in the translational efficiency of human genes in E. coli, and it revealed that GC3 content is a poor predictor of gene expression (tAI). Although the protein composition is determined by GC richness in the genes, other factors such as RSCU, tAI and pI of the proteins may also be involved. However, the codon usage variation was constrained by the compositional pattern. Principal component analysis also indicated that most codons were biased towards guanine or cytosine at the third codon position, and the preferred codons usually end with either G or C. The significant positive correlation of the gene expression parameter with a few amino acids, viz. Val, Arg, Ser and Ile, indicated that amino acid composition might influence gene expression. Only pI showed a moderate negative correlation with axis 1 of the correspondence analysis. It could be stated that mutation might have played a major role in determining the level of gene expression. This study has provided a basic understanding of the causes for codon usage bias, which could be beneficial for further