Synonymous codon usage in forty staphylococcal phages identifies the factors controlling codon usage variation and the phages suitable for phage therapy

The emergence and dissemination of multidrug-resistant strains of Staphylococcus aureus in recent years have accelerated research on novel anti-staphylococcal agents. Bacteriophages have long shown considerable potential for curing infections caused by various pathogenic bacteria, including S. aureus. Thus far, only a few virulent bacteriophages that carry no toxin-encoding gene but are capable of eradicating staphylococcal infections have been reported. Based on a codon usage analysis of sixteen S. aureus phages, three phages were previously suggested to be useful as anti-staphylococcal agents. To search for additional S. aureus phages suitable for phage therapy, relative synonymous codon usage bias was investigated in the protein-coding genes of forty new staphylococcal phages. All phages predominantly carried A- and T-ending codons. Several factors, such as mutational pressure, translational selection and gene length, seemed to be responsible for the codon usage variation in the phages. Codon usage also varied from phage to phage. Of these phages, G1, Twort, 66 and Sap-2 may be extremely lytic in nature, as the majority of their genes possess high translational efficiency, indicating that these phages may be employed in curing staphylococcal infections.


Background:
Staphylococcus aureus, a Gram-positive bacterium, causes various mild as well as life-threatening diseases in humans and other animals [1]. Numerous virulence factors are synthesized in this organism by the coordinated action of multiple regulators. To eliminate S. aureus infections, several antibiotics (which block the synthesis of cell wall, DNA, RNA, or protein) have been administered since 1943. Unfortunately, S. aureus strains resistant to multiple antibiotics have gradually evolved worldwide. Currently, vancomycin, linezolid, daptomycin, and tigecycline are used in clinics to eradicate multidrug-resistant S. aureus strains [2]. These agents are, however, either not completely safe for use or associated with the development of staphylococcal resistance. The S. aureus vaccines developed so far by different groups failed in clinical trials although they yielded promising results in animal models [3]. Bacterial viruses or bacteriophages (or simply phages), which usually kill their host bacteria, have been proposed as substitutes for antibiotics because they showed promising results in the treatment of various bacterial infections, including staphylococcal infections [4]. Over 250 staphylococcal phages have been reported so far [5]. Of the staphylococcal phages, relatives of phages K, G1, and Twort exhibited therapeutic potential [6]. All these phages belong to the Myoviridae family and are virulent in nature. The therapeutic potential of other staphylococcal phages, particularly those belonging to the families Podoviridae and Siphoviridae, has not been demonstrated yet [7].
Synonymous codon usage biases have been determined in numerous living organisms and in viruses [8-12]. The results consistently demonstrated that codon usage is species-specific and even varies from gene to gene within the same species. Codon usage is usually positively correlated with the GC base composition of the genome. Several other factors, such as mutational pressure, translational selection, secondary structure of proteins, replication-translation selection, and environmental factors, were reported to influence codon usage bias. Codon usage analysis has enriched both the basic and applied biological sciences in a number of ways. It has assisted in understanding the expression levels of genes, provided clues about the evolution of genes and genomes, maximized protein expression in heterologous systems, enhanced the immunogenicity of vaccines, etc. Previously, codon usage biases were investigated in twelve of these phage genomes [16]. The results suggested that phages K, 44AHJD, and P68 are extremely virulent in nature and may be useful in therapy. Currently, very little is known about the codon usage bias and the lytic ability of the remaining thirty-nine phages. In the present study, we performed a detailed codon usage analysis of the genes of forty staphylococcal phages (including phage K). Our results indicate that phages G1, Twort, 66 and Sap-2 are extremely virulent in nature, suggesting that these phages could be used in curing staphylococcal infections.

Methodology:
Complete genome sequences of forty staphylococcal phages were downloaded from different databases [available on the NCBI (USA) and EMBL (UK) web pages]. The forty phages whose genomes were downloaded are 187, 2638A, 29, 47, 52A, 53, 69, 71, 85, 88, 92, 96, X2, 37, 3A, 42e, 55, 66, 80alpha, CNPH82, EW, G1, K, P954, PH15, PT1028, ROSA, SAP-2, SAP-26, Twort, phiETA2, phiETA3, phiMR11, phiMR25, phiNM1, phiNM3, phiPVL-CN125, phiPVL108, phiSauS-IPLA35 and phiSauS-IPLA88. Phage K was included in the study in order to compare its codon usage bias with those of other staphylococcal phages, particularly phages G1 and Twort. The basic characteristics of all forty phages are presented in Table 1 (see supplementary material). While phages PH15 and CNPH82 infect S. epidermidis, the other phages infect S. aureus. Except for phages P954, PT1028, ROSA, and phiPVL-CN125, all phages are distributed across the three major families, namely Myoviridae, Podoviridae and Siphoviridae. The gene number in the 40 phages varies from 20 to 214. The genome sequence of S. aureus USA300 [19] was also downloaded for a comparative study. Coding sequences of all forty phages were extracted with 'Coderet', a program in EMBOSS 6.4 [20]. Coding sequences carrying fewer than 50 codons were not considered in this study. Hence, a total of 2722 protein-coding genes from the 40 phages were used for all analyses. All coding sequences of S. aureus USA300 were extracted in a similar manner.
The relative synonymous codon usage (RSCU) in all forty phages was determined to understand the overall codon usage variation among the genes and genomes. RSCU is defined as the ratio of the observed frequency of a codon to the frequency expected if all the synonymous codons for the corresponding amino acid were used uniformly [21]. RSCU values greater than 1.0 denote that the corresponding codons are used more often than expected, whereas the opposite is true for RSCU values less than 1.0. G3s, A3s, T3s, and C3s are the frequencies of G, A, T, and C, and GC3s is the frequency of (G+C), at the synonymous third positions of codons. Nc, the effective number of codons used by a gene, is commonly used to measure synonymous codon bias and is independent of amino acid composition and codon number [22]. The values of Nc vary from 20 (when one codon is used per amino acid) to 61 (when all the codons are used equally). Highly biased genes are generally highly expressed [23]. As the gene expression levels of the phages are not known with certainty, we have considered the highly biased genes as highly expressed in the staphylococcal phages under study. The correspondence analysis

Discussion and Conclusion:
Synonymous codon usage variation in 40 staphylococcal phages
To study the codon usage variation in the forty staphylococcal phages, the overall RSCU values in the 2722 protein-coding genes of these phages were determined (data not shown). As expected from the AT-rich phage genomes, A- and T-ending codons appeared to be predominant in all the phages (data not shown). To determine the possible codon usage variation in the above phage genes, both Nc and GC3s were determined and found to vary from 36.52 to 48.44 (with a mean of 44.39 and standard deviation 3.061) and from 0.159 to 0. including mutational bias seem to be responsible for the codon usage alteration in the staphylococcal phages.

Effect of mutational pressure on the codon usage variation
To determine the factors responsible for codon usage variation in different organisms, both the Nc plot (a plot of Nc versus GC3s) and correspondence analysis (CA) have been employed extensively [11]. Genes are expected to fall on a continuous curve relating Nc to GC3s if GC3s is the only determinant of Nc. Comparing the actual distribution of genes with the distribution expected under no selection can therefore indicate whether codon usage bias is influenced by factors other than mutational bias [22]. The Nc plot reveals that a small fraction of the genes of the 40 staphylococcal phages lie on the expected curve towards the GC-poor region (Figure 1), suggesting that their bias originates from extreme mutational bias. In contrast, the majority of the genes with low Nc values lie well below the expected curve (Figure 1), indicating that these phage genes possess an additional codon usage bias that does not depend on GC3s. Interestingly, the genes of phages K, G1 and Twort mostly overlap, lie far from the expected curve and tend to be separated from the genes of the remaining phages. The data suggest that the effect of mutational bias on codon usage variation in phages G1, K, and Twort is very weak. To verify the suggestions made from the Nc plot, correspondence analysis of the RSCU values of the 2722 protein-coding genes of the 40 phages was performed as described in Methodology. Only the distributions of the genes along the first two major axes are shown, as these axes accounted for 6.64% and 5.44% of the total variation, respectively (Figure 2). A comparison of the positions of the genes along the first and second major axes (generated by CA) with the nucleotide composition at the third codon position revealed that the first major axis is negatively correlated with A3s and T3s but positively correlated with G3s and C3s (Table 2; see supplementary material). Interestingly, the second major axis does not show any correlation with A3s, T3s, G3s or C3s. The results suggest that A- and T-ending codons are clustered on the negative side, whereas G- and C-ending codons are located on the positive side of the first major axis. To determine the differences between these two clusters of genes, the codon usage of the 10% of genes located at the extreme right of axis 1 was compared with that of the 10% of genes located at the extreme left of axis 1. As evident from Table 3 (see supplementary material), there are 24 over-represented codons in the genes located at the negative (left) side of the first major axis. Of these 24 predominant codons, 10 are A-ending, 13 are T-ending and 1 is C-ending, corresponding to 41.67% A-, 54.17% T-, and 4.17% C-ending codons. The results suggest that genomic GC composition has a strong effect in separating the genes along the first major axis according to their RSCU values. The data from the correspondence analysis therefore support the data from the Nc plot, and together they indicate that mutational bias plays a major role in shaping the codon usage variation in the genes of the 40 staphylococcal phages.
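The comparison with the expected curve described above can be sketched in a few lines. The text does not state which expression was used to draw the curve; the sketch below assumes the widely used approximation Nc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3s. The function names and the example gene values are hypothetical and are not taken from the study.

```python
def expected_nc(gc3s):
    """Expected Nc when GC3s is the only determinant of codon bias.

    Assumption: the common approximation Nc = 2 + s + 29 / (s^2 + (1 - s)^2).
    """
    s = gc3s
    return 2.0 + s + 29.0 / (s * s + (1.0 - s) ** 2)


def lies_below_curve(nc, gc3s, margin=0.0):
    """Return True if a gene's observed Nc falls below the expected curve.

    `margin` is a hypothetical tolerance (in Nc units) for calling a gene
    clearly below the curve rather than on it.
    """
    return nc < expected_nc(gc3s) - margin


# Hypothetical gene: observed Nc of 30 at GC3s of 0.2.
print(round(expected_nc(0.2), 2), lies_below_curve(30.0, 0.2))
```

A gene flagged by `lies_below_curve` is more biased than its GC3s alone would predict, which is the pattern the text attributes to factors other than mutational bias.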

Effects of translational selection and gene length on the codon usage variation
To ascertain whether translational selection or gene length also influences the synonymous codon usage bias of the forty staphylococcal phages, correlation coefficients were determined between the positions of the genes along the first two major axes and both Nc and gene length (Table 2; see supplementary material). Only the first major axis is positively correlated with Nc. A scatter plot (Figure 3) of the positions of the genes along the first major axis against their Nc values also revealed that several genes of phages Sap-2, 66, K, G1 and Twort possess lower Nc values and seem to be separated from the other phage genes. The phage genes with lower Nc values therefore carry comparatively AT-rich codons. The data together suggest that a balance between mutational bias and translational efficiency contributes strongly to the codon usage variation in the genes of staphylococcal phages. Moreover, phages Sap-2, 66, K, G1, and Twort seem to be composed primarily of highly expressed genes. Synonymous codon usage in highly expressed genes is usually influenced by cellular tRNA abundance [16, 27-29]. In most organisms, cellular tRNA abundance was found to be directly proportional to the corresponding tRNA gene copy number. To see whether there is any positive correlation between host tRNA abundance and the synonymous codon usage of phages G1, K, Twort, Sap-2 and 66, the tRNA copy numbers of S. aureus USA300 were compared with the over-represented synonymous codons of each of these phages separately (Table 4; see supplementary material). Of the 18 abundant synonymous codons of phage Sap-2 or 66, 10 codons were found to pair with USA300-specific tRNAs whose copy numbers are two or more per cell. Considering the standard codon-anticodon base-pairing rules, 7 additional over-represented codons of Sap-2 or 66 appeared to be recognized by abundant host tRNAs (Table 4; see supplementary material). In contrast, 13 out of 18 over-represented codons of G1, Twort or K were recognized by the over-represented tRNAs in USA300. Most of the genes of phages G1, Twort, K, 66 and Sap-2 therefore seem to possess a high translational efficiency. Further analysis also shows that the first axis in CA is negatively correlated with the length of the protein-coding genes of the 40 phages (Table 2; see supplementary material). This indicates that gene length influences codon usage variation to some extent in the above phages.
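The codon-to-tRNA comparison described above can be illustrated with a minimal sketch. It handles only perfect Watson-Crick pairing (the anticodon is the reverse complement of the codon) and does not model wobble pairing, so it corresponds to the first step of the comparison only. The function names are hypothetical, and the copy numbers in the example are invented for illustration; they are not the USA300 values in Table 4.

```python
# Complement map for DNA-alphabet codons (T used in place of U).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon_for(codon):
    """Return the anticodon (5'->3') that pairs perfectly with `codon`."""
    return codon.upper().translate(COMPLEMENT)[::-1]


def codons_read_by_abundant_trnas(codons, trna_copies, min_copies=2):
    """Keep the codons whose perfectly matching tRNA has >= min_copies genes.

    `trna_copies` maps anticodon (5'->3') to tRNA gene copy number, for
    example as parsed from tRNAscan-SE output (parsing not shown here).
    """
    return [c for c in codons
            if trna_copies.get(anticodon_for(c), 0) >= min_copies]


# Hypothetical copy numbers, for illustration only.
toy_copies = {"TTT": 3, "TTC": 1}
print(codons_read_by_abundant_trnas(["AAA", "GAA", "GAT"], toy_copies))
```

Codons left unmatched by this step would then need to be checked against wobble rules, which is how the text accounts for the additional over-represented codons.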

Synonymous codon usages of 40 staphylococcal phages are distinct
The data presented above clearly show that the synonymous codon usage of phages G1, K, Twort, 66 and Sap-2 differs from that of the other staphylococcal phages studied here (Figures 1-3). To determine whether the synonymous codon usages of all 40 staphylococcal phages differ from each other, correspondence analysis was also carried out on the RSCU values of genes grouped according to their corresponding genomes. The data reveal that the positions of the genes along the first major axis clearly separate the genomes according to their genomic AT composition (Figure 4). Additional statistical analysis (using F statistics) of the RSCU values for all 18 amino acids that are encoded by two or more synonymous codons in the 40 staphylococcal phages revealed that the usage of each of the 59 synonymous codons differed significantly from genome to genome. The data together indicate that the codon usage of the staphylococcal phages varies from phage to phage. To study the variation in the codon usage of the 40 staphylococcal phages more accurately, a cluster analysis was performed on the overall codon usage data of these phages by the D-squared ($D^2$) statistic approach [30]. $D^2$ values were determined by the following equation: $D^2 = \sum_{c=1}^{64} \left(f_{c,1} - f_{c,2}\right)^2$, where $f_{c,1}$ and $f_{c,2}$ are the relative synonymous codon usage values of codon $c$ in codon usage tables 1 and 2, respectively. A lower $D^2$ value indicates a stronger resemblance in codon usage. A matrix containing the $D^2$ values of all pairs was used to generate clusters. The clustering (dendrogram) produced by the UPGMA (unweighted pair group method using arithmetic averages) procedure [9] shows that phages belonging to the Myoviridae (such as G1, K, and Twort) and Podoviridae (e.g., Sap-2 and 66) families are clustered in two distinct branches, whereas siphoviruses and unclassified viruses are clustered in various other branches (Figure 5). Such a distribution of the phages not only supports the correspondence analysis data (Figure 4) but also reveals that synonymous codon usage is dissimilar even among the phages of each branch.
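The D-squared statistic defined above is a sum of squared differences between two codon usage tables, and can be sketched as follows. The function names are hypothetical, the toy RSCU tables are invented for illustration, and the UPGMA clustering step is not shown; the resulting pairwise values would be the input to such a clustering routine.

```python
from itertools import combinations


def d_squared(rscu_a, rscu_b):
    """Sum of squared RSCU differences over all codons in either table.

    A codon missing from a table is treated as having an RSCU of 0.0,
    which is an assumption of this sketch.
    """
    codons = set(rscu_a) | set(rscu_b)
    return sum((rscu_a.get(c, 0.0) - rscu_b.get(c, 0.0)) ** 2 for c in codons)


def pairwise_d_squared(tables):
    """Return {(name_a, name_b): D^2} for every pair of codon usage tables."""
    return {(a, b): d_squared(tables[a], tables[b])
            for a, b in combinations(sorted(tables), 2)}


# Hypothetical two-codon tables, for illustration only.
toy_tables = {
    "phage_A": {"GCT": 2.0, "GCA": 1.0},
    "phage_B": {"GCT": 1.0, "GCA": 1.0},
    "phage_C": {"GCT": 2.0, "GCA": 1.0},
}
for pair, value in pairwise_d_squared(toy_tables).items():
    print(pair, value)
```

In this toy input, phage_A and phage_C have identical tables and so a D-squared of zero, which is the sense in which a lower value indicates closer codon usage.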

Identification of the staphylococcal phages possessing therapeutic potentials
To identify the staphylococcal phages suitable for therapy, we carried out an additional analysis of the data presented in Figure 3 and in Table 4 (see supplementary material). It appears that the Nc values of the genes of phages G1, K, Twort, Sap-2 and 66 mostly vary from ~20 to 40. Secondly, phages G1, K, and Twort harbor nearly 72% highly expressed genes, whereas phages 66 and Sap-2 carry about 94% highly expressed genes. The results together suggest that the genes of the above five phages would be expressed rapidly by the translational machinery of the pathogenic S. aureus strain USA300. Based on such translational efficiency, the two podoviruses (66 and Sap-2) seem to possess higher lytic ability than the three myoviruses (G1, K, and Twort). Interestingly, within each family, the podoviruses or the myoviruses studied here might be similarly lytic in nature. These five phages carry no or very few GATC sites and also do not harbor any pathogenic factor-encoding or antibiotic-resistance gene.
Each group contains 10% of genes at the two extreme ends of the first major axis. The '*', 'N', 'a', 'b' and 'AA' indicate the codons whose occurrences are significantly (p<0.01) higher in the group of genes on the left side than in the genes on the right side of the first major axis, number of codons, genes on the left side, genes on the right side, and amino acid, respectively.
[13-15]. In addition, codon usage analysis also identified virulent phages having prospects in therapy [16-18]. The complete genome sequences of fifty-one staphylococcal phages (including phages K, G1 and Twort) are available in different viral databases (as of December 2011).

[9] and the determination of all the parameters (mentioned above) were performed by the program CodonW 1.4 [24]. The copy number of tRNA species in S. aureus USA300 and the corresponding anticodon sequences were determined by the program tRNAscan-SE [25]. The Sau3AI cut sites in the genomic DNA were determined by the 'Webcutter 2.0' program [26].
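The RSCU and GC3s definitions given in Methodology can be illustrated with a short sketch. This is not CodonW's implementation; it assumes the standard sense-codon assignments, excludes the single-codon amino acids (Met, Trp) and stop codons from both indices, and uses hypothetical function names. The example sequence is invented.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard codon assignments in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(seq):
    """Count in-frame codons of a coding sequence (incomplete tail ignored)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(seq):
    """RSCU = observed count / count expected under uniform synonymous usage."""
    counts = count_codons(seq)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) == 1 or total == 0:
            continue  # skip Met/Trp and amino acids absent from the gene
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def gc3s(seq):
    """Fraction of G or C at the third position of synonymous codons."""
    counts = count_codons(seq)
    syn = {c: n for c, n in counts.items() if CODON_TABLE[c] not in "*MW"}
    total = sum(syn.values())
    gc = sum(n for c, n in syn.items() if c[2] in "GC")
    return gc / total if total else float("nan")


# Hypothetical four-codon fragment encoding alanine only.
toy = "GCTGCTGCTGCA"
print(rscu(toy)["GCT"], rscu(toy)["GCA"], gc3s(toy))
```

In the toy fragment, GCT is used three times out of four alanine codons, so its RSCU exceeds 1.0, while the unused GCC and GCG receive 0.0.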

Figure 1: Nc plot of 40 S. aureus phage genes. The genes are represented as different colored dots or as indicated colored shapes.

Figure 2: Positions of the genes of 40 staphylococcal phages along the two major axes of variation in the correspondence analysis on RSCU values. The genes are presented by the same symbols as in Figure 1.

Figure 3: Scatter plot of the positions of S. aureus phage genes along the first major axis against their Nc values. The genes are presented by the same symbols as in Figure 1.

Figure 4: Positions of the genes along the first two major axes produced by correspondence analysis on RSCU values of 40 staphylococcal phage genomes. The genomes are represented as different colored dots or as indicated colored shapes.

Figure 5: A dendrogram showing the codon usage variation of 40 staphylococcal phages constructed by the UPGMA method.

The results together indicate that all five phages may be introduced into clinical use for curing staphylococcal infections. Phages similar to G1, K, and Twort have already been shown to have medicinal properties [5, 6]. Very little has been done so far to test the efficacy of podoviruses as remedies for staphylococcal diseases.

Table 1: Basic characteristics of the staphylococcal phages used in the study

Table 2: Correlation coefficients between the positions of the genes along the first two major axes with both Nc and gene length

Table 3: Relative synonymous codon usage (RSCU) values of each codon for the two groups of genes located at the extreme ends of the first major axis as determined by CA.