Codon usage bias as a function of generation time and life expectancy.

It has recently been demonstrated that human natural codon usage bias is optimized towards a higher buffering capacity to mutations (measured as the tendency of single point mutations in a DNA sequence to yield the same or similar amino acids) compared to random sequences. In this work, we investigate this phenomenon further by analyzing the natural DNA of four different species (human, mouse, zebrafish and fruit fly) to determine whether such a tolerance to mutations is correlated with the life span and age of sexual maturation for the corresponding organisms. We also propose a new measure to quantify the buffering capacity of a DNA sequence to mutations that takes into account the observed mutation rates within every genome and the effect of the corresponding mutation. Our results suggest there is a propensity for tolerance to mutations that is positively correlated with the life expectancy of the considered organisms. Moreover, random sequences that are constrained to produce the same protein as the naturally occurring sequences are found to be more buffered than completely random sequences while being less buffered than the natural sequences. These results suggest that optimization toward protective mechanisms tolerant to mutations is correlated with both life expectancy and age to sexual maturity at both the levels of codon usage bias and the bias of the natural sequence of codons itself.


Background:
In gene coding regions, DNA is put together in triplets to form 64 distinctive codons. 61 of those codons synthesize for the 20 amino acids, each of which can be synthesized by as few as one or as many as six synonymous codons. An intuitive assumption is that codons and amino acids are evenly used throughout different genomes. However, it has been discovered that amino acid incorporation as well as codon usage are biased in all organisms from bacteria to mammals [1][2][3]. It is anticipated then that selective forces are acting to maintain the current balance between mutation and selection, resulting in optimal coding regions. Some of the proposed factors include degree and timing of gene expression, codon-anticodon interaction, transcription and translation rate and accuracy, codon context, and global and local G+C content.
In humans, bias has been reported on nucleotide abundance in the three codon positions, dinucleotide usage, and di-codon usage within coding regions [4]. This bias leads to a high usage of acidic amino acids as well as reflects the avoidance of stop codons. Bias in codon usage has been attributed to selection towards a more efficient translation model. This suggests codon usage in highly expressed genes is biased toward optimal codons corresponding to more abundant tRNAs. Such models take into consideration both the elongation rate and the translation accuracy [5][6]. In more recent studies [7][8], it has been shown that the codon usage bias is found to relate to gene expression with highly expressed genes favoring codons corresponding to the more abundant tRNAs (thus increasing translation efficiency) and lower expressed genes favoring less abundant tRNAs. In addition, highly expressed genes in humans are found to be shorter in general and contain less intronic content [9]. This study also reports a bias in the amino acid usage where highly expressed genes avoid complex amino acids, where complexity is based on the weight and shape of the amino acid. Other selection factors proposed include the effects of DNA polymerase and repair mechanism, methylation, CpG islands [10-11], tissue or organelle specificity [12], increasing mRNA stability, transcription rate [13-14] and evolutionary age [15]. None of the previously mentioned factors manage to provide a complete explanation for codon usage bias. It is likely that all of these forces work together competitively with selection, organelle and organism specificity to produce the current bias found in every gene and every genome.
In a previous paper [16], we demonstrated that in humans, bias in codon usage makes the coding DNA significantly more tolerant to point mutations than random sequences. If only substitution events are considered in a codon, the degeneracy of the genetic code allows for a single nucleotide conversion to result in a codon representing the same amino acid, a different amino acid (missense mutation), or a stop codon (nonsense mutation). Analysis based on single nucleotide mutation rates and similarity among amino acids showed that point mutations in the natural coding sequence of humans are significantly more likely to yield the same or similar amino acids than would be the case in random DNA sequences. Moreover, the coding sequence based on the natural transcription reading frame was shown to be more buffered to mutations than the other two possible shifted reading frames. Here, we say a codon is buffered to or tolerates point mutations when it tends to translate to the same or similar amino acid as the original one more often than would be expected by random when exposed to point mutations.

Buffering capacity
Buffering capacity refers to the tendency of a coding sequence to allow for point mutation events that result in a sequence of the same or similar amino acids, as opposed to missense or nonsense mutations. In order to quantify this property, we have developed a mathematical model that approximates the buffering capacity for an individual sequence by taking into account the codon composition, individual nucleotide mutation rates, and a distance matrix between amino acids that reflects the resulting damage caused by individual amino acid substitutions(See supplementary material).
Selection toward error minimization or increased buffering is not a new concept. For example, the natural coding or the natural mapping between the 20 amino acids and the 61 coding codons is believed to have come to its current state by evolution and selection and it has been shown to provide high tolerance to mutations or translation errors [17][18][19]. In [19], the authors have shown that only two in a billion randomly generated mappings would provide better error minimization than the natural genetic code.
In this paper, we test for a possible correlation between the expected life span of an organism and the buffering capacity in its coding DNA. This hypothesis is motivated by the fact that an organism that lives long will require genes buffered against mutations to protect against harmful changes to protein sequences while a species with a short life span does not need such a property since it will not be subjected to nearly as many point mutations throughout its life. Furthermore, organisms that have a long life span usually reach sexual maturity at a later age at which time their DNA gets passed to the next generation. Therefore, it is expected that the longer the life span or age to sexual maturity of an organism, the higher the buffering.
To calculate the buffering capacity of a given coding DNA sequence, we used a similar measure to the one used in [16]. The computed buffering capacity takes into consideration both the organism-specific rates of mutation and the consequences of the corresponding mutations. For mutation rates, we have used the neighbor-dependent substitution rates reported in a recent study based on observed neutral mutations within inserted retrotransposable elements [20].
To estimate the consequence of a mutation, an amino acid similarity matrix is used to measure the distance between the original amino acid and the amino acid resulting from the mutation [19]. This similarity matrix is based on computations of the change in the structure and folding free energy of a protein when a single amino acid is mutated to another one at all positions in a set of 141 different proteins. On the other hand, a nonsense mutation is thought to be the worst type of mutation; nonetheless there is no natural way to weigh a nonsense mutation compared to other types of missense mutations. As an approximation, in this paper we consider it to be three times as detrimental as the worst missense mutations.

Dataset
The species selected include Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), Mus musculus (mouse), and Homo sapiens (humans). These were selected based upon varying rates of average life expectancy and time to sexual maturity as well as the availability of sequence data and species specific mutation rates. The rates for life expectancy and time to sexual maturity were derived from the Human Aging Genome Resources [21]. These rates are provided in  16,800 human genes in this dataset were considered. These sequences were further filtered to exclude genes whose nucleotide sequence is not a multiple of three, or who do not begin with the start codon ATG. In addition, gene sequences that do not end in one of the three stop codons (TAA, TAG, or TGA) were removed. For zebrafish, mouse, and fruit fly, mRNA sequences were downloaded from the RefSeq database. The total number of genes used for each organism is provided in Table 2 (See supplementary material).

CDS categorization
The data downloaded represents known mRNA sequences from the coding sequences (CDS) of genic regions. These datasets were categorized and labeled as "natural". For each known CDS, a randomly generated mRNA sequence was constructed that yields the exact same protein. These correspond to a group labeled as "constrained random" since they are randomly generated sequences constrained by the property that they must yield an identical amino acid sequence upon translation. A third set of sequences labeled "random" was constructed consisting of mRNA sequences of the same length as the natural group.

Analysis of buffering capacity
For each of the natural, constrained random and random sequences, the buffering capacity was calculated as described in the buffering capacity section. Mutation rates for fruit fly, zebrafish, and human were downloaded from an existing dataset [20]. Mutation rates for the mouse were calculated using the publicly available web server using the methods described in [20] with the mouse B1_Mur1 repetitive element from RepBase as the ancestral sequence and a set of 4,010 annotated B1_Mur1 occurrences in the mouse genome as the descendant sequences. A distribution of the buffering capacities for each of these three groups was examined. A distribution close to zero would indicate a set of sequences that are highly buffered to point mutations at the mRNA level that do not significantly affect the corresponding translated protein sequence. The main hypothesis is that if a species has a built-in tolerance to mutations, the distribution of buffering capacity scores between the naturally occurring mRNA sequence and the random sequence sets should be significantly different.

Statistical analysis
For further analysis of the results two different statistical analyses tools are used. A t-test is used to measure the significance of the increased buffering in the natural coding sequences as compared to the random sequences. Second, for each of the four species, we use Cohen's d measurement to quantify the effect size of codon usage bias in increasing the tolerance to mutations in the natural coding sequences compared to the random sequences and to quantify how much more the constrained random sequences are buffered than the completely random sequences. Note that a direct comparison between the three organisms' coding sequences would be invalid because the mutation rates are time independent and were computed within every genome separately. The mutation rates are relative values within the same genome only. However, the comparison between the natural coding and the random sequences under the same mutation rates should tell us how much bias is there in the natural DNA structure toward buffering within every genome separately. Therefore, the effect size within every genome should be a valid indicator for comparison between the different genomes.

Discussion:
Our results show that both human and zebrafish DNA are significantly more buffered than random sequences. Buffering is more apparent in human while being absent in fruit fly. Furthermore, random sequences that are constrained to produce the same protein as the natural ones are found to be more buffered than completely random sequences while being less buffered than the natural sequences themselves in both human and zebrafish. This property is also more apparent in human than zebrafish while being absent in fruit fly. These results suggest evidence that bias in the DNA structure toward tolerance to mutations is correlated with life span and is realized on two levels: protein selection and codon usage bias. Distributions of buffering capacities for fruit fly, zebrafish, mouse, and human are given in Figure 1. Note that the x-axis for each of these cannot be compared directly due to the different mutation rates in each of these organisms. As Figure 1 clearly indicates, a difference in the distribution of buffering capacity for the random, constrained random and natural mRNA sequences cannot be observed in the fruit fly while the separation is evident in zebrafish and more pronounced in the mouse and human.
Analysis of the effect size Table 3 (see supplementary material) shows that the separation between the naturally occurring mRNAs and random sequences is greatest in the mouse and in the human. When the effect size is compared to the life expectancy ( Figure 2) and time to sexual maturity (Figure 3), a clear trend results with an exception occurring with the mouse buffering distribution. This may be due to a difference in the method for calculating mutational rates within the mouse. In order to provide a more accurate measure of the relationship between life expectancy and codon buffering, a more thorough analysis is needed for organisms with a wider range of life expectancies and age to sexual maturity. Nonetheless, the presented results demonstrates an overall significant correlation between buffering capacity to mutations as a result of codon usage bias and both the life expectancy and the time to sexual maturity of the corresponding organisms. To our knowledge, this is the first time such a hypothesis has been tested. Our analysis also shows that the optimization toward higher tolerance to point mutations is accomplished at both levels the codon usage and the sequence of codons themselves. These results add a new additive explanation to current theorems about codon usage bias in the natural transcripts of these organisms. Finally, as more organisms get sequenced and as more accurate and comprehensive studies of mutation rates become available, we hope to continue this work on a larger scale, with additional phylogenetic information included as well.

Methodology: Buffering capacity
Buffering capacity refers to the tendency of a coding sequence to allow for point mutation events that result in a sequence of the same or similar amino acids, as opposed to missense or nonsense mutations. In order to quantify this property, we have developed a mathematical model that approximates the buffering capacity for an individual sequence by taking into account the codon composition, individual nucleotide mutation rates, and a distance matrix between amino acids that reflects the resulting damage caused by individual amino acid substitutions. For a given DNA or mRNA coding sequence Here, we define the distance between two codons as a measure of the difference between the encoded amino acids and to approximate this distance, we use the amino acid similarity matrix proposed in [19] as follows: In (2) ( ) max Sim is the maximum similarity value in the same similarity matrix. If the new codon j s C is a stop codon (nonsense mutation), the distance will be considered three times the highest distance given by (2). Note that other values for the nonsense mutation distance produce similar results (not shown).
As an approximation of the neighbor dependent mutation probability natural coding against random DNA, replacing the mutation probabilities by the relative missing period mutation rates should have no effect on the significance test and the effect size measurement since both statistical measures are scale independent