Synonymous codon usage pattern among the S, M, and L segments in Crimean-congo hemorrhagic fever virus

Crimean-Congo hemorrhagic fever (CCHF) virus is one among the major zoonosis viral diseases that use the Hyalomma ticks as their transmission vector to cause viral infection to the human and mammalian community. The fatality of infectious is high across the world especially in Africa, Asia, Middle East, and Europe. This study regarding codon usage bias of S, M, and L segments of the CCHF virus pertaining to the host Homo sapiens, reveals in-depth information about the evolutionary characteristics of CCHFV. Relative Synonymous Codon Usage (RSCU), Effective number of codons (ENC) were calculated, to determine the codon usage pattern in each segment. Correlation analysis between Codon adaptation index (CAI), GRAVY (Hydrophobicity), AROMO (Aromaticity), and nucleotide composition revealed bias in the codon usage pattern. There was no strong codon bias found among any segments of the CCHF virus, indicating both the factors i.e., natural selection and mutational pressure shapes the codon usage bias.

The codon usage bias is the most preferable factor in the biological evolution of most organisms. Codon bias is always known as the choice of synonymous codons that are non-random for every different gene or genomes. Particular codon bias is specific to the taken organism and can be affected by GC content, gene lengths, and gene expression level. To understand the molecular mechanism of expression, and the consequence of long-term evolution on a genome, it is important to study the recognition of a distinct pattern of codons that possesses the distinct type of biological influences. Codon bias is the trendiest and widely acknowledged hypothetical analytic technique that describe codon usage bias is mutationselection balance, determination of the codon usage exhibits the collective results of three evolutionary forces: genetic drift within a sample, natural selection, and mutational pressure. Overall, shuffle in GC and AT(U) pairs to cause nucleotide composition bias leads to mutational pressure, efficiency to maximize the production of protein by the preferred codons are known natural selection and eradication of codon changes among the generations as a result of emigration and immigration at the population level will lead to genetic drift [4]. The evolutionary process, an adaptation of the virus to the host, genetic drifts, selection, and mutation pressure are some of the information that can be obtained from codon usage patterns. The bias in the codon usage pattern may show variations in gene expression and protein synthesis efficiency. The viral-host adaptiveness affects the replication efficiency, virulency, synthesizing proteins, and survival of the virus is an extent of the bias in the codon usage pattern [9]. Several studies have suggested that mutational pressure is the main force for the establishment of a codon usage pattern [5,6,7,8]. In the current study, we have attempted to explain the codon usage bias of each segment of the CCHF virus using the various bioinformatics tools and R programming modules. Known data shows an occurrence of mutational pressure and natural selection in the CCHF virus [5]. But the strive of analysing the CCHF viral genome segment-wise has been employed with various methodologies to study the significant variations in codon usage pattern.

Methods: Data Collection:
Nucleotide sequences are the major factor of the data collection. The complete CDSs nucleotide sequences of S, M, L segments of Homo sapiens host of CCHF virus were downloaded separately from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/ vssi/#/) database in FASTA format. Overall, 157 sequences were retrieved and analyzed. The coding sequences of each segment were aligned and edited individually with MEGA (Molecular Evolutionary Genetics Analysis) software [6].

Overall Nucleotide Content Analysis:
Overall nucleotide content of each segment S, M, and L, which is a composition of A T G C, more specifically nucleotide at 3rd position of the codons (A3%, C3%, T3%, G3%), other entities like GC, GC1, GC2, GC3, and GC12 (mean values of G&C at the 1st and 2nd position of codons) were calculated using MEGA software. Mononucleotide and GC contents frequencies were calculated using R Studio programming software using the required external library "seqinR" [7] [8].

Relative Dinucleotide Abundance Analysis:
The relative dinucleotide abundance may take a role while predicting the codon usage indices; analysis is used to predict the organism's favourable dinucleotide. Totally there are 16 variable occurrences of dinucleotide is possible, the outline of the dinucleotide frequency specifies both mutational and selection pressure [9], and Relative Dinucleotide Abundance of three segments S, M, and L of CCHF virus were calculated using the method defined by Karlin and Burge [10].

PXY = FXY /(FX FY)
Where FX & FY are the frequency of individual nucleotide and dinucleotides are denoted by FXY in the same equation. As a conservative criterion, PXY > 1.23 is considered as high and PXY < 0.73 is low relative abundance [10]. Dinucleotide frequencies were calculated using R Studio programming software using the required external library "seqinR".

Relative Synonymous Codon Usage (RSCU) Analysis:
RSCU method is described as the ratio of the observed to the expected value for a given amino acid. Amino acid frequency or the length of the sequence does not affect the RSCU values. The codon that achieves the more than 1.6 values are overrepresented, whereas codons that lie lesser than 0.6 are underrepresented and the codon values that fall between 1.6 and 0.6 are considered to be unbiased or randomly used. The RSCU values were calculated by the following formula:

RSCU = gij / ∑ i j gij
Where gij denotes, an observed number of the ith codon for jth amino acid, that has ni types synonymous codons [11]. RSCU values of all segments S, M, L were obtained and visualized using R Studio programming software and "seqinR" library.

Effective Number of Codons (ENC) Analysis:
An ENC evaluation reflects the deviation of codon from random selection. Commonly, Effective number of codon range between 20-61 [12]. The value 20 signifies an enormously biased in which only one codon is being used to code for each amino acid. Whereas value 61 indicates no bias and the codons have been used equally. If the ENC values are less than 45, are deemed to have moderately biased codon usage [9].
The ENC value was evaluated using the following formula: ENC = 2+ 9/F2 + 1/F3 + 5/F4 + 5/F6 Where Fi (i= 2,3, 4, 6) denotes the average Fi in the i-fold degenerate amino acid family. Where the Fi value is calculated using: Where n denotes the sum of observed codons for particular amino acid; nj denotes the sum of the observed jth codon for a particular amino acid. The ENC values of the S, M, L segments of the CCHF virus were calculated in R Studio programming software, "vhica" library [13]. To illustrate the relationship between an effective number of codons and GC3 (sum of G&C nucleotide at the third position) the ENC plot was generated. This method defines and quantifies codon usage bias of gene or genome, which is the finest overall method for estimating absolute synonymous codon usage. Whereas the formula to calculate the ENC values is [10] [14],

ENC expected =2 + S + (29/ S 2 + (1-S))
Where S represents the GC3 (sum of G&C nucleotide at the third position) content. If the ENC values situate on the expected standard curve, it specifies that the codon usage be impacted by mutational pressure. Values below the standard curve indicate that the values are restricted by another factor i.e., natural selection.

Neutrality Plot Analysis:
The neutrality plot analysis is used to determine the effect of mutational pressure and natural selection that influences the pattern of codon usage. The neutrality plot was illustrated using the GC3 values against the mean of GC12. If GC3 values are significant and closer to 1, mutational pressure plays a major role to build the codon usage pattern over natural selection. The regression slope is =0 then, natural selection plays a major role [12]. The same technique was carried out for each S, M, L segments of the CCHF virus by plotting the GC12 values against GC3 values. The regression line on the neutrality plot is indicative of the mutational pressure [15] [14].

Parity Rule 2 (PR2) plot Analysis:
A PR2 or Parity rule 2 analyses was done by plotting the GC bias on abscissa [G3/(G3+C3)] and AT bias [A3 / (A3+T3)] on the ordinate. The analysis usually reveals comparative magnitude between natural selection and mutation pressure based on the genome composition [7]. The origin for both axes will be 0.5 (X= 0.5, Y= 0.5). This suggests that A=T, G=C. points situating on the origin indicates no deviation between natural selection and mutational pressure.

Codon Adaptive Index (CAI) Analysis:
Codon adaptive index (CAI) is a method to measure the level of expression gene based on the coding gene. The range of the CAI lies between 0 and 1. The highest relative adaptations were gained by the most frequent codon. The coding sequence that acquires the highest CAI values is more preferred over the lowest CAI values [11]. In the current study, Codon adaptive index values of each segment were calculated using DAMBE 7.0 software, considering the reference of Synonymous codon usage of H. sapiens [13].

Average Hydrophobicity (GRAVY) and Aromaticity (AROMA):
The GRAVY is the total amount of hydropathy values of the entire amino acid in a sequence divided by the number of residues. The average range of hydropathy range from -2.0 to +2.0, hydrophobicity of a protein were indicated by positive values, hydrophilicity was indicated by negative values. Aromaticity (AROMO) is the frequency value of aromatic amino acids, i.e., Trp, Tyr, and Phe in a sample amino acid sequence. The total GRAVY and AROMO values were calculated using the CodonW tool (CodonW download | SourceForge.net).

Results: Data collection:
The coding nucleotide sequences of each segment, S (n = 48, l = 3944bp), M (n= 57, l= 1687bp), and L (n= 52, l = 3945pb) of the CCHF virus were retrieved from the NCBI Virus database. The alignment of the nucleotide coding sequence of all segments, the estimation of nucleotide composition, and removal of stop codons from each sequence of all the segments were done using MEGA X (MUSCLE algorithm for alignment) [7].  Table). The nucleotide composition of each segment of the CCHF was calculated to assess the influence of nucleotide on codon usage patterns (

Relative Dinucleotide abundance frequency analysis:
The bias of dinucleotide can influence codon usage bias. Calculation of relative abundance of total 16 dinucleotides of each segment S, M, and L was calculated using R Studio software. The abundance frequency of each segment was seen to have less ©Biomedical Informatics (2021) 485 consistency compared with a theoretical value (equal to 1.0). Overall abundance frequency is classified based on overrepresented (>1.23) and underrepresented (< 0.78 ) [17] ( Table  2).

Relative Synonymous Codon Usage (RSCU) Analysis:
The relative synonymous codons usage of each codon of the three segments was calculated. RSCU values are represented based on the range from 0.6 to 1.6. Values that are < 0.6 are considered as underrepresented and values > 1.6 are overrepresented. Codons that gain the significance value of >1.0 represent the positive codon bias and < 1.0 represent the negative codon bias [14][9] ( Table 3). The result of the Relative Synonymous Codon Usage of each segment was as follows:   [1] Segment S: Bias of GC and AT was 0.46 and 0.45, respectively. Suggesting the preference of pyrimidines over purines. Figure 5.S [2] Segment M: GC and AT bias of M segments were 0.42 and 0.54, respectively. Indicating the preference of AT over GC and purines over pyrimidines. Figure 5.M [3] Segment L: Whereas GC and AT bias of L segment was 0.50 and 0.51, suggesting the AT preference over GC, and purines over pyrimidines.

Neutrality plot analysis:
A neutrality plot is used to examine the relationship and dominant factors (mutational pressure and natural selection) between GC12 and GC3, the plot was illustrated using the mean values of the first and second position of GC against GC3. In this study, the neutrality analysis of S, M, and L segments was seen as follow: CUB of M segments is also influenced by natural selection over the mutational pressure. The amount of genetic disparity within the population is determined by the rate of mutation. And these disparities arose from the errors made during the replication process. Figure 6.M [3] Segment L: values of GC12 and GC3 were situated near negative significant negative regression line with y= 0.448-0.106, +100x, R 2 = 0.31. Highlighting the 10.6% neutrality. Natural selection plays a major effecting factor in CUB. Figure 6.L

Codon Adaptation index analysis:
The Codon adaptation index was executed to examine the optimization of codon usage and adaptation of the virus to the host. CAI values were calculated by considering the codon usage pattern of H. sapiens as a reference. This study identified that all segments of CCHF possess a higher tendency of CAI values (> 0.5). The CAI values were varied from 0.75 to 0.77, 0.71 to 0.74, 0.71 to 0.72 with a mean value ± standard deviation of 0.76 ± 0.007, 0.73 ± 0.008, and 0.71 ± 0.002 in S, M, and L, respectively.

Correlation analysis:
The major two determinants, natural selection, and mutational pressure were considered to study the codon usage bias in each segment of CCHF. To further confirm the natural selection, the correlation analysis was performed among T, C, A, G, GC, GC1, GC2, GC3, ENC, CAI, GRAVY, and AROMO. The significant values r= -0.21675, and r = 0.4764 were observed between the ENC and GC3 of the S and L segment respectively, indicating that the pattern of codon bias is influenced by GC nucleotide on the third position. Whereas in the M segment of the CCHF virus, a nonsignificant value r = -0.1945 was obtained. A significant correlation was seen between CAI and GC3 r = 0.6528, r = 0.7138 of S, and M segments, respectively. Also, indicates that the influence of GC on the third position impacts the CUB. But the correlation value r = 0.0020 of segment L of the CCHF virus has a non-significance value, saying non-impact of GC3 on CUB. The correlation between CAI and ENC was significant value r= -0.29674, r=-0.675, and r= 0.  Table.2), and

Discussion:
CCHF is zoonotic, tick-borne, and one among the virus affecting on the human community. In the majority of living organisms, the choice of particular codon usage is a major sign of biological evolution. Therefore, the codon usage pattern delivers significant information about the host adaptation, evolution, and factor influencing CUB [5]. We suggest that the segmentation of the entire genome is a molecular key that reduces the communication between capsid stability and geometrically constrained viral particles. So, analysing the genome segment-wise will lead to identifying the specific residual differences in the genes of the viral genome.
The nucleotide composition is the base element to shape the codon usage pattern. Interestingly, the mean value of nucleotide A and Aend codons was seen to be the highest in each segment of the CCHF virus. Also found that selection of nucleotide A was consistent even when the entire nucleotide of the genome was studied as in previous findings [5]. The high nucleotide content of A in the CCHF CDs may be a genomic feature of genus Nairovirus. The CA & TG and CG & TA dinucleotides were seen to be over and underrepresented, respectively in S, M segments whereas in the L segment CT dinucleotide was an addition to overrepresented codons. Similar results were observed in the previous study [9,15]. Further, a total of 7 (ACG, CCG, TCG, TGC, AGC, GGC, and TGC), 11(ACG, CCG, CGC, GCG, TCG, AGC, CGC, GCA, GCG, and TGC), and 10 (AGC, CGC, GCA, GCG, GGC, TGC, CCG, CGC, GCG, and TCG) RSCU codons were containing underrepresented CG dinucleotide in S, M, and L segment respectively, signifying all these codons are not favourably preferred. so, this study says that dinucleotide composition performs an impact on codon usage pattern. Also, we attempted to track the shape of the codon usage bias of each segment in the CCHF virus i.e., S, M, and L. The previous CUB study of the entire genome of the CCHF was able to determine 31 high-frequency RSCU codons from the entire genome but when we analyzed the genome by each different segment, we have achieved to track 27, 29, and 31 high-frequency codons in S, M, and L segment, respectively. A similar result was observed in a previous study as well [9,16]. The previous analysis of CAI of the CCHF virus shows that the entire CAI was observed to be 0.80, but compared to this study, observed maximum CAI values of each S, M, and L were 0.77, 0.74, and 0.72 respectively, which signifies the low adaptation ability to the host compared to a previous study [5]. Similar results were also found in the CUB of the Rift Valley fever virus that belongs to the same family Bunyariridea [20].
The magnitude between natural selection and mutational pressure wasn't seen in the previous study [5], in order to analyze, we performed parity rule 2 analysis. The obtained results reveal that segment L has a slight or low bias compared to S, and M. Also, purines over pyrimidines in M and L segment whereas pyrimidine over purines in S segments. The parity plot of the GC and AT on the third position revealed that natural selection making a remarkable role over mutational pressure. A neutrality analysis plot was performed to determine the influencing factor on CUB, based on the previous study, when the entire genome was analysed, there was no significant relationship observed between GC12 and GC3 [5]. But analyzing the genome segment-wise reveals the importance of neutrality 3.95%, 5.5%, and 10.6% in S, M, and L segments, respectively. An indication of a natural selection over the mutational pressure in effecting the codon usage bias [21]. In the earlier study, only the correspondence analysis was performed segment-wise and all the other analyses were carried out for S, M, and L segments combined with H. sapiens, Hyalomma, Bos taurus, and Ovis aries organism. Parity rule 2 was employed in this study which also determines the influence of mutation pressure and natural selection in analysing the codon usage bias, which was not analysed in the previous study [5]. However, analyzing the entire genome segment-wise, resulted in significant interpretation compared to the previous study, it was observed that there are distinct variations while studying the entire genome compared to analysing the genome segments-wise. Further, the RSCU and ENC values among the segments S, M, and L indicated that S was dominating over the segments M and L. Natural selection makes a significant role in building the codon usage pattern when the gene is highly expressed, meanwhile mutational drifts play a major role when there is an occurrence of low-level gene expression. Based on these two factors, the origin of codon usage can be clarified. But it seems that these two factors aren't sufficient enough to confirm attributes of codon usage [22]. The disease-related wet-lab experiments are needed for confirmation about the codon usage pattern but with time constraints the codon usage bias analytical technique has been a boon approach to predict and analyze the CUB computationally.

Conclusion:
The CCHF virus is one of the deadliest viral diseases that causes a major public health concern. Indeed, there are no potential medicines available, so there is a need for the development of potential drugs and therapeutic. Studying the potential host and viral genome may be beneficial in identifying various preventive measures. We examined that, the CCHF virus had a weak codon usage bias, and adaptation of the CCHF virus to the host varies from each segment, meanwhile this study proves that the virus has less adaptation capability with humans. Also, the RSCU, ENC, CAI, Aromaticity, Gravy, and other nucleotide bases play a different role in each segment of the CCHF virus to undergo natural selection and shaping the codon usage pattern, respectively. The results of the study will aid future CCHF surveillance and other basic research that provides significant insights into the understanding of CCHF evolution.