Codon usage pattern in human SPANX genes

Background: SPANX (sperm protein associated with the nucleus in the X chromosome) genes play a crucial role in human spermatogenesis. Codon usage bias (CUB) is a well-known phenomenon that exists in many genomes and is mainly determined by mutation and selection. CUB is species specific and a unique characteristic of a genome. Analysis of the compositional features and codon usage pattern of human SPANX genes helps to explore the molecular biology of these genes. In the current study, we retrieved the sequences of different variants of the SPANX gene from NCBI using accession numbers, and a Perl script was used to analyze the nucleotide composition and the parameters of codon usage bias. Results: Codon usage bias is low, as measured by the codon bias index (CBI), and most of the GC-ending codons were positively correlated with GC bias as indicated by GC3. Correspondence analysis (COA) and the neutrality plot revealed that both mutation pressure and natural selection affect the codon usage pattern. Moreover, the neutrality plot suggested that the role of natural selection is greater than that of mutation pressure in SPANX genes. Conclusions: The codon usage bias in SPANX genes is not very high, and natural selection dominates over mutation pressure in the codon usage of human SPANX genes.


Background:
The genes of the SPANX family (sperm protein associated with the nucleus in the X chromosome) are located in a cluster on chromosome Xq27 and encode proteins that are expressed in germ cells, in non-gametogenic tissue and in several tumors [1]. SPANX genes encode small unfolded proteins of approximately 100 amino acid residues. These proteins resemble, to some extent, the high mobility group A (HMGA) proteins, which are involved in the formation of different nucleoprotein complexes. Like the HMGA proteins, they can form dimers and complexes with other proteins [2]. SPANX proteins are linked with the nuclear envelope in transformed mammalian cells, as they are in human spermatozoa. SPANX genes appear to have evolved under strong positive selection, in parallel with other genes associated with reproduction [3]. They consist of two subfamilies, SPANX-A/D and SPANX-N. SPANX-A/D proteins are found within the cytoplasm, associated with the nuclear envelope, in mature spermatids [4]. SPANX-A/D genes map within segmental duplications, which are regions involved in genomic rearrangements that result in an abnormally high level of structural polymorphism. SPANX-A1 serves as a biochemical marker to study unique structures in spermatozoa. Accordingly, the SPANX-B and SPANX-C genes were shown to be present in variable copy number (ranging from one to >11) in the normal population. SPANX-A/D genes help in spermatogenesis, but their expression was not found in non-gametogenic tissue. Analysis of SPANX gene homologs in nonhuman primates showed that SPANX-A/D genes arose approximately 7 million years ago and subsequently expanded in hominids [3]. The SPANX-N gene subfamily, found in all mammals, gave rise to the SPANX-A/D subfamily in the hominoid lineage. The SPANX-N genes N1, N2, N3 and N4 map 1.3 Mb away from the SPANX-A/D gene cluster, and SPANX-N5 is located on the short arm of the X chromosome at Xp11 [5].
It is well known that the genetic code consists of 64 codons, of which 61 encode the 20 standard amino acids and the remaining three (UAA, UAG, and UGA) encode termination signals. The usage of synonymous codons differs among the genes of an organism and also among organisms. Unequal usage of synonymous codons is called "codon usage bias". Codon usage bias is an intricate evolutionary phenomenon and exists in diverse organisms, from prokaryotes to unicellular and multicellular eukaryotes. The usage of preferred synonymous codons is a unique property of a genome [6]. Generally, mutational pressure and natural selection have been reported to be the two vital factors contributing to variation in synonymous codon usage among the genes of an organism [7]. Mutation within synonymous codons generally occurs at the third base position and does not alter the primary sequence of the protein product. In some organisms with extremely high A, T, G or C content, mutation pressure plays a central role in influencing the pattern of synonymous codon usage. Furthermore, the processes of DNA replication and transcription, gene structure, and environmental conditions significantly influence the codon usage pattern [8]. Altering the synonymous codon usage pattern is a technique for reengineering genomes from the nucleotide level to the megabase scale [9]. Codon usage bias has practical implications in mRNA translation, new gene discovery, design of transgenes, and studies of molecular biology and evolution.
Analysis of the codon usage pattern is a key tool for understanding the molecular mechanism of codon distribution. The present study was undertaken to elucidate the compositional features and codon usage pattern of human SPANX genes. Our analysis gives a novel insight into the codon usage patterns of SPANX genes that should assist in better understanding the synonymous codon usage pattern as well as the factors influencing it.

Methodology:
Coding sequence data
Different variants of SPANX genes were retrieved from NCBI (http://www.ncbi.nlm.nih.gov/) using their accession numbers. Only coding sequences (cds) whose length is an exact multiple of three bases and that have proper start and stop codons were considered for analysis. The accession numbers of the 46 cds are shown in Table 1 (see supplementary material).
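The inclusion criteria above can be expressed as a small filter. The following Python sketch is only an illustration and is not the Perl script used in the study; the function name `is_valid_cds` is hypothetical, and it assumes that a "proper" start codon means ATG and that a proper stop codon is one of TAA, TAG or TGA.

```python
# Hypothetical helper illustrating the stated inclusion criteria.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq: str) -> bool:
    """Return True if seq has a length that is a multiple of three,
    starts with ATG (assumed start codon) and ends with a stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA-style input as well
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    if not seq.startswith("ATG"):
        return False
    return seq[-3:] in STOP_CODONS


# Toy sequences (not SPANX data): only the first one passes the filter.
candidates = ["ATGGCCTAA", "ATGGCCTA", "CTGGCCTAA"]
kept = [s for s in candidates if is_valid_cds(s)]
```

A stricter filter could also reject sequences with internal stop codons or ambiguous bases; the text does not state whether that was done.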

Indices of codon usage bias
Relative synonymous codon usage (RSCU) was calculated for the 59 synonymous codons to explore the pattern of codon usage in the translation of amino acids. An RSCU value >1.6 indicates that a codon is over-represented, while an RSCU value >1.0 indicates that the codon is used more frequently than expected [10]. The formula used to estimate RSCU is as follows:

$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$

where X_ij is the frequency of occurrence of the jth codon for the ith amino acid (any X_ij with a value of zero is arbitrarily assigned a value of 0.5) and n_i is the number of codons for the ith amino acid (the ith codon family).
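As an illustration of the RSCU definition above, the following Python sketch pools codon counts over a set of coding sequences and divides each count by the mean count of its codon family. It is not the authors' Perl script: the names `CODON_TABLE` and `rscu` are hypothetical, the standard genetic code is assumed, and zero counts are replaced by 0.5 as described in the text.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumed), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seqs):
    """Return {codon: RSCU} for the 59 codons with synonymous alternatives."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    # Group codons by amino acid, skipping stop codons, Met and Trp.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        # Zero counts are set to 0.5, following the definition in the text.
        x = {c: (counts[c] if counts[c] > 0 else 0.5) for c in codons}
        mean = sum(x.values()) / len(codons)
        for c in codons:
            values[c] = x[c] / mean
    return values


# Toy input (not SPANX data): four alanine codons.
values = rscu(["GCTGCTGCCGCA"])
```

With this toy input the alanine family has counts 2, 1, 1 and 0 (the zero becomes 0.5), so GCT is the only alanine codon with an RSCU above 1.6.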
The codon adaptation index (CAI) was used to estimate the expression level of a single gene. The CAI value ranges between 0 and 1.0, and a high CAI value indicates high gene expression [11]. The CAI is calculated as

$$\mathrm{CAI} = \exp\left(\frac{1}{L}\sum_{k=1}^{L} \ln \omega_k\right)$$

where ω_k is the relative adaptiveness of the kth codon and L is the number of synonymous codons in the gene.
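The CAI described above is the geometric mean of the relative adaptiveness values of the codons in a gene. The Python sketch below illustrates this under stated assumptions: the relative adaptiveness of a codon is taken as its count divided by the count of the most used synonymous codon in a reference set, zero counts are replaced by 0.5, and the names `relative_adaptiveness` and `cai` are hypothetical. The reference set used in the study is not stated in the text, so the one below is a toy example.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (assumed), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_of(seq):
    """Yield successive in-frame codons of a coding sequence."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        yield seq[i:i + 3]


def relative_adaptiveness(reference_seqs):
    """Return {codon: w} from a reference set (Met, Trp and stops excluded)."""
    counts = Counter()
    for seq in reference_seqs:
        counts.update(codons_of(seq))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families.setdefault(aa, []).append(codon)
    w = {}
    for codons in families.values():
        x = {c: (counts[c] if counts[c] > 0 else 0.5) for c in codons}
        top = max(x.values())
        for c in codons:
            w[c] = x[c] / top
    return w


def cai(seq, w):
    """Geometric mean of w over the synonymous codons of seq."""
    logs = [math.log(w[c]) for c in codons_of(seq) if c in w]
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))


# Toy reference set (not SPANX data): GCT is three times as frequent as GCC.
w = relative_adaptiveness(["GCTGCTGCTGCC"])
score = cai("GCTGCC", w)
```

A gene that uses only the most frequent codon of each family scores 1.0, and any use of rarer codons lowers the score.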
The codon bias index (CBI) measures the extent to which preferred codons are used in a gene. The formula used to calculate CBI is as follows

$$\mathrm{CBI} = \frac{N_{opt} - N_{ran}}{N_{tot} - N_{ran}}$$

where N_opt is the number of preferred (optimal) codons, N_tot is the total number of codons, and N_ran is the expected number of optimal codons if random codon assignments were made for each amino acid [12]. The GRAVY (Grand Average of Hydropathicity) value is the sum of the hydropathy values of all the amino acids in the encoded protein divided by the number of residues in the sequence [13]. Aromo stands for aromaticity and refers to the frequency of aromatic amino acids (Phe, Tyr, Trp) in the translated gene product [14].
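The three indices defined above reduce to simple arithmetic once the inputs are available. The Python sketch below is illustrative only; the function names are hypothetical, the hydropathy values are assumed to be those of the Kyte-Doolittle scale, and the counts passed to `cbi` are made-up numbers, because the set of optimal codons used in the study is not given in the text.

```python
# Kyte-Doolittle hydropathy values (assumed scale for GRAVY).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FYW")  # Phe, Tyr, Trp


def gravy(protein: str) -> float:
    """Mean hydropathy over all residues of the protein."""
    return sum(KD[aa] for aa in protein) / len(protein)


def aromaticity(protein: str) -> float:
    """Fraction of residues that are Phe, Tyr or Trp."""
    return sum(aa in AROMATIC for aa in protein) / len(protein)


def cbi(n_opt: float, n_ran: float, n_tot: float) -> float:
    """Codon bias index: (Nopt - Nran) / (Ntot - Nran)."""
    return (n_opt - n_ran) / (n_tot - n_ran)


# Made-up counts: 60 optimal codons observed, 40 expected at random, 100 total.
example_cbi = cbi(60, 40, 100)
```

A CBI of 0 corresponds to random codon choice and a CBI of 1 to exclusive use of optimal codons, so the made-up example above gives a value of about 0.33.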
The overall frequencies of A, T, G and C, their frequencies at the third codon position, the overall GC content, and the GC contents at the first, second and third codon positions (GC1, GC2, GC3) were calculated using a Perl script. GC3s was used as a marker of compositional constraint bias.
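The positional GC contents described above can be obtained by slicing a coding sequence at every third base. This Python sketch is an illustration rather than the Perl script used in the study; the name `gc_profile` is hypothetical, and GC3s (which is restricted to synonymous codons) is not computed here.

```python
def gc_fraction(bases: str) -> float:
    """Fraction of G and C in a string of bases."""
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


def gc_profile(cds: str) -> dict:
    """Overall GC and GC at codon positions 1, 2 and 3 of an in-frame cds."""
    seq = cds.upper()
    gc1 = gc_fraction(seq[0::3])  # first codon positions
    gc2 = gc_fraction(seq[1::3])  # second codon positions
    gc3 = gc_fraction(seq[2::3])  # third codon positions
    return {
        "GC": gc_fraction(seq),
        "GC1": gc1,
        "GC2": gc2,
        "GC3": gc3,
        "GC12": (gc1 + gc2) / 2,  # average of GC1 and GC2
    }


# Toy sequence (not SPANX data).
profile = gc_profile("ATGGCCTAA")
```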

Analysis tools
Codon usage parameters and compositional dynamics were calculated (excluding the codons for Met and Trp and the termination codons) using the Perl script developed by the corresponding author SC. Correspondence analysis (COA), a multivariate statistical method, was performed in XLSTAT to analyse the variation in codon usage pattern. The correspondence analysis uses RSCU values, and its axes 1 and 2 account for a share of the total variation. Correlation and regression analyses were carried out using SPSS 21.0.

Results & Discussion:
Codon usage bias can be affected by the overall nucleotide composition of genomes [15]. Therefore, we first analyzed the compositional features of coding sequences from different variants of SPANX genes. It is observed from the  In order to investigate the codon usage pattern of the SPANX genes, we correlated codon usage with GC3 content. Figure 1 shows the heat map of the correlation coefficients between codon usage and GC bias in human SPANX genes. In our analysis, most of the G- and C-ending codons were positively correlated with GC3, and most of the A- and T-ending codons were negatively correlated with GC3. However, twelve G/C-ending codons, namely ACG, TTG, CAG, CTG, GAG, GCG, ATC, AGC, TCC, TGC, GAC and GGC, showed a negative correlation between codon usage and GC3, whereas eight A/T-ending codons, namely ATA, AGA, CAA, CGA, AGT, CCT, GTT and GCT, showed a positive correlation with GC3. This indicates that the usage of these twelve G/C-ending codons decreases as GC3 increases, whereas the usage of these eight A/T-ending codons increases as GC3 increases. Palidwor et al. reported that GC-ending codons were mostly positively correlated with GC3 and AT-ending codons were mostly negatively correlated with GC3 in prokaryotes, plants and human, thus supporting our results [17].
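The per-codon correlation with GC3 summarised in the heat map can be illustrated with a plain Pearson correlation across genes. The Python sketch below makes several assumptions: the text does not say which usage measure or correlation coefficient was used, so RSCU and Pearson's r are assumed here; the names `pearson` and `codon_gc3_correlations` are hypothetical; and the input values are toy numbers, not SPANX data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sx * sy)


def codon_gc3_correlations(usage_by_gene, gc3_by_gene):
    """usage_by_gene: one {codon: usage} dict per gene;
    gc3_by_gene: GC3 of the same genes, in the same order."""
    codons = sorted(set().union(*usage_by_gene))
    return {
        c: pearson([gene.get(c, 0.0) for gene in usage_by_gene], gc3_by_gene)
        for c in codons
    }


# Toy data for three hypothetical genes.
usage = [
    {"GCC": 0.8, "GCT": 1.4},
    {"GCC": 1.0, "GCT": 1.2},
    {"GCC": 1.2, "GCT": 1.0},
]
gc3 = [0.40, 0.45, 0.50]
r = codon_gc3_correlations(usage, gc3)
```

In this toy example the C-ending codon GCC rises with GC3 and the T-ending codon GCT falls, which is the majority pattern described above.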
To investigate the variation of codon usage in the SPANX genes, correspondence analysis (COA) was performed based on the RSCU values of each gene (Figure 2). In the COA of the different variants of the SPANX gene, the first principal axis (F1) accounted for 51.84 % of the total synonymous codon usage variation, whereas the second principal axis (F2) accounted for 19.25 % of the total variation. Several significant correlations were observed between the two principal axes and nucleotide contents (Table 3, see supplementary material). Axis 1 (F1) showed a significant positive correlation with A, T, A3 and T3 but a significant negative correlation with C3, GC, GC1, GC2, GC3, Gravy and CBI. Axis 2 (F2) showed a significant positive correlation with GC2, GC3, ARO, Gravy and CBI and a significant negative correlation with A, T, G, C, A3, T3, G3, C3, GC1, CAI and Laa. Our analysis suggests that mutational pressure and natural selection might both have played a major role in shaping the dynamics of codon usage patterns within different variants of the SPANX gene, supporting the finding of Wei [18].
A neutrality plot was drawn to estimate the magnitude of natural selection against mutation pressure in the codon usage pattern of the SPANX genes. The neutrality plot is the regression of GC12 (the average of GC1 and GC2) on GC3. The points in the neutrality plot are not distributed along the diagonal and the values of GC3 fall within a narrow range, indicating that the variation in GC12 and GC3 is not due to mutational bias (Figure 3). Moreover, the regression line (green line) tended to slope towards the horizontal axis. The regression coefficient of GC12 on GC3 in SPANX genes is 0.242, indicating that the relative neutrality is 24.20 % while the relative constraint on GC3 is 75.80 %. This result suggests that natural selection played a major role, and mutation pressure a minor role, in shaping the codon usage pattern of the SPANX genes. Jia et al. also found that natural selection played a prominent role in the codon usage pattern of Bombyx mori [19], which is similar to our result.
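The arithmetic behind the neutrality plot can be sketched as an ordinary least-squares regression of GC12 on GC3, with the slope read as the relative neutrality and one minus the slope as the relative constraint. The Python code below is an illustration only (the study used SPSS); the function names are hypothetical and the data points are synthetic, constructed to lie on a line with the reported slope of 0.242.

```python
def linear_regression(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def neutrality_summary(gc3, gc12):
    """Regress GC12 on GC3 and express the slope as percentages."""
    slope, intercept = linear_regression(gc3, gc12)
    return {
        "slope": slope,
        "intercept": intercept,
        "relative_neutrality_pct": slope * 100,
        "relative_constraint_pct": (1 - slope) * 100,
    }


# Synthetic points on the line GC12 = 0.3 + 0.242 * GC3 (not SPANX data).
gc3 = [0.2, 0.4, 0.6]
gc12 = [0.3 + 0.242 * v for v in gc3]
summary = neutrality_summary(gc3, gc12)
```

A slope near 1 would mean that GC12 tracks GC3 closely, as expected under mutation pressure alone, whereas a slope near 0 points to selective constraint on the first two codon positions.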

Conclusions:
The codon usage bias is not very high in SPANX genes. The overall GC content is low and the genes are AT rich. Natural selection, rather than mutation pressure, is the major factor shaping the pattern of codon usage in different variants of the SPANX gene.