Distinctive features gleaned from the comparative genomes analysis of clinical and non-clinical isolates of Klebsiella pneumoniae

It is of interest to describe the distinctive features gleaned from the comparative genome analysis of clinical and non-clinical isolates of Klebsiella pneumoniae. The core genome of K. pneumoinae consisted of 3568 genes. Comparative genome analysis shows that mdtABCD, toxin-antitoxin systems are unique to clinical isolates and catB, benA, and transporter genes for citrate utilization are exclusive to non-clinical isolates. We further noted aromatic compound degrading genes in non-clinical isolates unlike in the later isolates. We grouped 88 core genes into 3 groups linked to infections, drug-resistance or xenobiotic metabolism using codon usage variation analysis. It is inferred using the neutrality plot analysis of GC12 with GC3 that codon usage variation is dominant over mutation pressure. Thus, we document data to distinguish clinical and non-clinical isolates of K. pneumoniae using comparative genomes analysis for understanding of genome diversity during speciation.


Background
Klebsiella pneumoniae is a non-motile, gram-negative bacterium, which inhabits in diverse ecological niches ranging from soil to water and plants, and it is also opportunistic pathogen causing hospital-acquired disease in patients with the compromised immune system [1]. Human clinical isolates are considered as indistinguishable from environmental isolates with respect to their biochemical reactions and other attributes [2]. In plants, K. pneumoniae strains had been reported to fix nitrogen and promote growth of the plants [3]. Rajkumari et al. [4] had reported that ©Biomedical Informatics (2020) Klebsiella strains have the ability to degrade hydrocarbons, including polyaromatic hydrocarbons (PAH). Previously, Holt et al. [5], have determined the diversity of genetic variation in specific genes associated with virulence and antibiotic resistance to track the emergence of invasive K. pneumoniae infections.
The advent of whole-genome sequencing along with reliable bioinformatics workflow has made possible to study the distinct genome patterns that may reflect the variation due to evolutionary pressures. Identification of conserved genomic core genes and pan genes from a collection of bacterial genomes depends on classifying orthologous genes based on similar sequences [6]. It provides valuable information and deeper insights into bacterial genome evolution, genes associated with host adaptation, virulence, and pathogenesis [7]. In previous work, the whole genome sequence analysis of clinical and environmental K. pneumoniae suggested that they are closely related but antibiotic resistance and virulence factors were more frequent in clinical isolates. In fact, the phylogenomics analysis of K. pneumoniae whole genome failed to result in any distinct segregation of clinical and non-clinical clades of K. pneumoniae, as the genomes originating from either group were mixed throughout the tree [8].
Therefore, in this work, a genome analysis workflow was designed and used to identify the genes and their functions, which can be considered to resolve between clinical and non-clinical K. pneumoniae isolates (Figure 1). We used phylogenomics approach followed by the comparison of K. pneumoniae codon usage bias (CUB) pattern across the genes. In fact, the synonymous codons are known to be used non-randomly, and this unequal usage of the synonymous codon is called codon usage bias [9]. It is of interest to describe the distinctive features gleaned from the comparative genome analysis of clinical and non-clinical isolates of K. pneumoniae.

The effective number of codons
The observed effective number of codons (ENC) for each coding sequence of gene sets of K. pneumoniae was calculated using the formula given by Wright where Fk (k= 2, 3, 4, 6) is the mean of Fk values for the k-fold degenerate amino acids

Nucleotide composition analysis
The overall nucleotide composition (A%, C%, T% and G%) and occurrence of overall frequency of the nucleotide (G+C) at first (GC1), second (GC2) and third (GC3) position of the synonymous codons were calculated in the coding sequences of the genes to quantify the extent of base compositional bias. The calculations were done using a Perl script developed by one of the authors (SC).

Neutrality plot
The neutrality plot is a scattered plot, which is used to determine the role of directional mutational pressure against selection pressure during evolution. It is the regression of GC12 on GC3, as the synonymous mutation occurs in the 3 rd position of codon while non-synonymous mutations occur in the 1 st and 2 nd position. The non-synonymous mutation transforms the activity of the gene, which resulted from the alteration of amino acid sequence. In neutrality plot, if the regression line falls near the diagonal, it signifies weak external selection pressure and the role of mutation pressure is dominant.

Software and Statistical Analysis
Heat map of the specific clinical and non-clinical genes was generated by Expression Heatmapper using an average linkage method with Euclidean distance [11]. A network of genes was created for selected unique genes of clinical and non-clinical isolates by Cytoscape 3.4.0 with GeneMANIA plugin. The node degree distribution of the complex protein-protein interaction network was obtained from Cytoscape by Network analyzer [12]. A PERL program was developed to estimate the genetic codon usage bias indices and the selection pressure on the coding sequence of K. pneumoniae genes. Correlation analysis was performed to identify the degree of relationship between two parameters by Karl Pearson's method. The significance of the correlation coefficient was tested by t-test for (n-2) degrees of freedom at p<0.01 or p<0.05. Statistical analyses were performed using IBM SPSS version 21.0 for windows.  Table 1). The tree was built out of core genome taking 3568 genes per genome, 135584 in total. There was no clear separation of clades between clinical and non-clinical isolates ( Figure 2). Phylogenomic analysis of the core genomes 48 clinical and 29 environmental K. pneumoniae isolates had demonstrated that the isolates were intermixed and failed to result in any distinct segregation of clinical and non-clinical clades of K. pneumoniae [8].
©Biomedical Informatics (2020) Based on the phylogenomics results, genomes (clinical and nonclinical) were selected for further analysis and orthologous genes in K. pneumoniae were calculated. Core genome analyses of the K. pneumoniae genomes showed that it consisted of 3568 conserved genes. While the pan-genome appeared to grow rapidly and the core genome was limited to less than 4000 genes. There are several Pan-genome analysis pipelines are available [13, 14, 15, 16] however we have used EDGAR pipeline for studying genetic variation, and function enrichment analyses of the gene clusters. Core versus pangenome development analysis of K. pneumoniae genomes revealed that 3568 formed the core genome while 11780 genes formed the pan-genome when K. pneumoniae AWD5 is used as the reference genome ( Figure 3). The genes in AWD5 strain cover 89.45% of the coding genes in the genome. Average nucleic acid identities (ANI) of AWD5 with K. pneumoniae KP-1, K. pneumoniae ATCC-BAA 2146 and K. pneumoniae subsp. pneumoniae NTUH-K2044 revealed 99% sequence homologies and 94.02% with K. pneumoniae 342. The genomes of AWD5 and ATCC BAA-2146 (clinical strain) appear to be most similar by sharing 4529 orthologs whereas 4442 orthologs were found between AWD5 and the environment isolate KP-1. Pangenome of six K. pneumoniae strains was reported to be consisted of 4,829 core genes, such a high percentage signifies a high rate of conservation among the strains. Previously, some studies had reported that phenotypic and genetic features of K. pneumoniae of environmental and clinical origin were similar and therefore, the isolates cannot be distinguished [2, 17]. Gene content analysis of K. pneumoniae By investigating the presence and absence of genes in K. pneumoniae, most of the regions within the genomes were found to be conserved, including the virulence genes present in the clinical strains regardless of disease source. The orthologs include components of regulatory pathways such as basic transcriptional machinery, DNA relication, homologous recombination, mismatch repair, nucleotide excision repair, bacterial secretion system and protein export. Genes attributable to the production of indole-3acetic acid (IAA) (ipdC), solubilization of phosphate (pqqABCDEF, phn and pho gene clusters, pstBACS), synthesis of siderophore (ent, fep gene clusters), acetoin and 2,3-butanediol (alsDSR, budC) are found to be conserved. It also revealed the presence of multicomponent nitrate or nitrite transport system. More than 13 genes involved in benzoate (ben genes), catechol (cat genes), ©Biomedical Informatics (2020) 260 protocatechuate (pca genes) were conserved in the core genome of K. pneumoniae.
Further, to obtain the unique genes, the genomes were segregated into two groups i.e., clinical and non-clinical, taking five genomes for each group and analyzed separately. In clinical isolates, such core genes included virulence factor capsule assembly protein, multidrug transporter subunit (mdtABCD), type II toxin-antitoxin systems, and type VI secretion system (TVISS). TVISS were found to be shared among clinical and non-clinical isolates however, at least eight CDSs determined in clinical genomes involved in TVISS, were not found in non-clinical isolates. It has been reported that an environmental isolate Kp342 and clinical isolate MGH78578 seemed to share core components of TVISS [18]. The unique core genome of non-clinical isolate consisted of aromatic compound degrading genes catB, benA and several transporters including genes for citrate utilization. This was in agreement to the hypothesis that environmental isolates are more versatile for their catabolic processes, as the ability to degrade organic compounds govern the evolution of novel catabolic abilities of bacteria that can survive in different habitats [19]. We took the presence/absence of such individual genes as the categorical analyst parameter subjected to build a phylogenetic tree. And showed that the strains from similar clinical and environmental sources were not linked than those from different sources (Figure 4). The core genes discriminated the clinical and environmental genomes and accessory genes segregated the strains within the group. Based on the gene functions, several gene sets were experimented which may resolve clinical and non-clinical isolates. Accordingly, one set of unique core genes in non-clinical isolates and clinical isolates is given in the figure 4 on the basis of which genomes were resolved into two distinct categories, further few genes were useful for resolving the genomes within the particular group. Virulence factor (galF, wcaI, wcaG, wzb, wzc, manBC), allantoin utilization genes (gcl, allABCDS, ybbW), TVISS (impL, vgrG, tssFGJ) and other antibiotic resistance genes were also observed as accessory genes in ten K. pneumoniae. The clinical genomes possessed multidrug resistance gene blaCTX-M- 15 as accessory gene; this is an indication of horizontal gene transfer that might have occurred between virulent and multidrug-resistant K. pneumoniae strains [20]. This finding agrees with the previous study, which reported that resistance to multiple antibiotics was found to be more frequent in clinical origin than that of environmental origin [21]. The gene interaction network of specific core genes of clinical and non-clinical K. pneumoniae genomes was analyzed in cytoscape and GeneMANIA ( Figure 5) and data is given in supplementary file. It was noted that the functional partners were not part of the core genome of the respective group; however, it surely had interactions, and hence influence the functions of these genes.

Extent of Codon usage variation in K. pneumoniae
In order to analyze codon usage variation five K. pneumoniae genomes were selected considering their level of similarity. The data set was restricted to 5 genomes to avoid non-ambiguity. According to the EDGAR interface, genes were placed at intersections of the Venn diagram only if they were reciprocal matches. The analysis utilizes all CDS of the genomes and it is not restricted to the core genome. Among the K. pneumoniae strains, AWD5 chromosome shared 31 orthologous CDS with BA2146 and further 188 CDS conjointly with the strain KP-1. Moreover, 65 orthologous CDS are found to be shared conjointly by environmental isolates AWD5, KP-1 and Kp342 ( Figure 6). Also, it indicated that AWD5 has 10 singleton genes. Besides, the singletons that resemble to the genes without reciprocal best hit to another genome, as orthologs does not necessarily have to be a proper singleton.  The codon usage variation was studied for eighty-eight genes (24 DRGs, 16 IRGs, and 48 XMGs) from five K. pneumoniae genomes of non-clinical origin (Kp342, KP-1, AWD5) and clinical origin (NK2044, BA2146) core genome. Selected genes details are given in the supplementary file ( Table 2).
To quantify the extent of variation in codon usage among different genomes of K. pneumoniae, the effective number of codon (ENC) values for three gene sets of each genome was calculated. The ENC is a non-directional measure of codon usage bias, widely used to measure for individual genes. The ENC values among five K. pneumoniae strains ranged from 37.1 to 39.71, indicating low codon usage bias (Table 3) for IRGs, DRGs and XMGs respectively. (ENC < 40) represents stable ENC values and indicates that there is almost no variation of codon usage bias among the genes of K. pneumoniae. It also indicates conserved genomic composition among different K. pneumoniae genomes. Correlation analysis between codon usage bias and compositional properties of GC content were analyzed to understand the effect of base composition on codon usage bias. Negative correlation was observed between ENC and GC composition ( Table 4). These results suggested that natural selection might have played an important role in codon usage pattern across the genes. Our results show that codon usage bias and gene expression among different K. pneumoniae genomes was lower, slightly biased in the genes [22].   Nucleotide composition analyses of coding sequences of IRGs, DRGs, and XMGs (Table 3) showed mean percentage of GC and AT compositions was 59.44% and 40.5% in IRGs, 59.36% and 40.66% in DRGs and 61.66% and 38.34% in XMGs respectively. Genes were GC-rich, GC content at the third position was higher than at the first and second codon position and the greatest difference of GC content was found between the second and the third codon positions. Hence, the overall nucleotide composition suggested that the nucleotide C and G occurred more frequently compared to A and T in the coding sequences and it is expected that G/C-ended codons might be preferred over A/T ended codons in the genomes. The difference in frequencies of A and T and to that of G and C were not the same which indicates that natural selection might have played a role in codon usage pattern [22]. A neutrality plot analysis of GC12 (average value of GC1 and GC2) versus GC3 (Figure 7) was drawn to characterize the correlation of the three codon position of GC [23], and to estimate the influence of selection and mutation pressure on codon usage bias of IRGs, DRGs and XMGs of five K. pneumoniae strains [24]. The regression coefficient of GC12 on GC3 for IRGs of the K. pneumoniae strains AWD5, BA2146, KP-1, Kp342 and NK2044 were 0.0133, 0.0701, 0.0249, 0.0033 and 0.0099 indicating relative neutrality of 1.33%, 7.01%, 2.49%, 0.33%, and 0.99% respectively. The GC12 was influenced by mutation pressure and natural selection with a ratio of  ©Biomedical Informatics (2020)

Conclusion:
A comparative genome analysis shows that clinical and non-clinical strains of K. pneumonia are similar and are not separated by phylogeny. However, gene level comparison help distinguish these isolates where mdtABCD, toxin-antitoxin systems are distinctive to clinical isolates and catB, benA, transporter genes for citrate utilization are limited to non-clinical isolates. Thus, these data help distinguish clinical and non-clinical isolates of K. pneumoniae towards the understanding of genome diversity during speciation.

Supplementary Materials:
The gene network of specific core genes of clinical and non-clinical isolates was analyzed in Cytoscape (Figure 5a & 5b). The gene network for the unique core genes of clinical isolates consisted of 64 nodes and 191 edges. The relationship of genes of interest in our study with functional partners was specified to find co-expression, genetic interactions, and physical interactions. A single cluster was identified. The relationship of the selected genes was correlated with other genes, for example, phosphoproteins that are assistant transporter of drug transmembrane transport (baeS, baeR, uvrY) correlated to multidrug efflux system (mdtABCD). These two clusters were further connected closely to uracil catabolism system. Similarly, gene cluster of type II toxin-antitoxin module was connected to a group of genes with the phosphotransferase system and fructose specific subsystem. Hierarchical clustering led to the removal of four outlier samples (malE, aslA, yaiZ, ushA), which belong to the unique core of clinical isolates. This is probably due to no interaction with the genes in the network.
We then expanded the analysis of the gene network for the unique core genes of non-clinical isolates that consisted of 36 nodes and 101 edges. Hierarchical clustering led to the removal of seven outlier samples (arsB, benA, betB, catB, ppx, sufE). Interestingly, two clusters were identified in which the cluster for the query gene ppdA with its functional partners showed no interaction with other query genes. In the larger cluster, the relationship of the genes of interest was also correlated, such as an integral component of plasma membrane ynfA; melB interacted with plasma membrane genes ubiD, ybdJ with its gene cluster. The query gene of citrate utilization (citCDE) showed its interaction with functional partners of the gene cluster along with other genes. As well, the functional partners of the outliers' catB, benA, and arsB were specified, which have a functional role in the catabolism of catechol, benzoate and arsenite transport. It was noted that the functional partners were not part of the core genome of the respective group; however, it surely had interactions, and hence influence the functions of these genes.
To understand the topological properties of these networks, the probability of the node degree distribution P(k) showed that each network satisfied scale-free topology following power law (r 2 >0.6; in core genes of clinical genomes and r 2 >0.2 in core genes of nonclinical genomes) (Figure 5c & d). The node distribution based on degree implied the presence of genes with centrality values. The distribution following the power law distribution and showing the nature of the scale-free network suggested a hierarchical organization in the network.