Identification and prioritization of macrolide resistance genes with hypothetical annotation in Streptococcus pneumoniae

Macrolide resistant Streptococcus pneumoniae infections have limited treatment options. While some resistance mechanisms are well established, a full understanding is limited by incomplete genome annotation (hypothetical genes). Some hypothetical genes encode a domain of unknown function (DUF), a conserved protein domain with uncharacterized function. Here, we identify and confirm macrolide resistance genes. We further explore DUFs from macrolide resistance hypothetical genes to prioritize them for experimental characterization. We found gene similarities between two macrolide resistance gene signatures from untreated and either erythromycin- or spiramycin-treated resistant Streptococcus pneumoniae. We confirmed the association of these gene sets with macrolide resistance through comparison to gene signatures from (i) a second erythromycin resistant Streptococcus pneumoniae strain, and (ii) an erythromycin-treated sensitive Streptococcus pneumoniae strain, both from non-overlapping datasets. Examining which cellular processes these macrolide resistance genes belong to revealed connections to known resistance mechanisms, such as increased expression of amino acid biosynthesis and efflux genes and decreased expression of ribonucleotide biosynthesis genes, highlighting the predictive ability of the method used. Twenty-two genes had hypothetical annotation, with 10 DUFs associated with macrolide resistance. DUF characterization could uncover novel co-therapies that restore macrolide efficacy across multiple macrolide resistant species. Application of the methods to other antibiotic resistances could revolutionize treatment of resistant infections.

resistance genotypes our understanding of direct resistance mechanisms further, some macrolide resistant isolates use one of these mechanisms while others use multiple mechanisms with no clear connection to level of resistance [2]. Therefore, despite a good understanding of various direct macrolide resistance mechanisms, co-therapies to overcome macrolide resistance have yet to be established.
Mutant library studies have revealed large numbers of genes that influence drug resistance both directly, as discussed above, and indirectly, with many of these genes not clearly involved in known drug resistance mechanisms [5]. Indirect mechanisms associated with resistance can be metabolic, such as decreased Krebs (i.e. TCA) cycle activity in vancomycin intermediate resistant Staphylococcus aureus [6]. One way to uncover genes functioning indirectly with antibiotic resistance mechanisms is to examine gene expression and identify genes with markedly different expression (differentially expressed) between two conditions for further examination (i.e. hypothesis generation). This approach has been used to successfully predict antibiotic resistance in some bacterial pathogens such as Escherichia coli [5], but its application to other organisms like S. pneumoniae has been slow. Identifying differentially expressed genes associated with antibiotic resistance is a first step in fully elucidating the interaction between direct and indirect drug resistance mechanisms.
Incomplete genome annotation substantially limits gene expression analysis [7] and is common for bacterial genomes [4,8], with up to 50% of some bacterial genomes lacking annotation [9]. A hypothetical gene is defined by its sequence alone, having little to no experimental evidence of its function and lacking homology to genes with known function [4,9]. There are two types of hypothetical proteins: (i) uncharacterized protein families that lack domain information and are not usually conserved across phylogenetic lineages, and (ii) domains of unknown function (DUFs), functionally uncharacterized protein sections that have been shown to play essential roles in bacterial processes [9,10]. Over 20% of protein domains have DUF annotations, with around 2,700 DUFs found in bacteria and more than 800 DUFs shared between the domains of life [10]. Identifying hypothetical genes associated with antibiotic resistance and prioritizing them for experimental characterization, such as structural determination [7], could lead to the development of life-saving co-therapies to preclude or overcome antibiotic resistance.
In this paper, we identify and validate genes associated with macrolide resistance by comparing therapeutic response gene expression signatures (lists of genes ranked from high to low differential expression between untreated and macrolide treated samples) in S. pneumoniae (Figure 1). We found that these genes were associated with known mechanisms of macrolide resistance, such as efflux, showing our approach's ability to identify potential co-therapies to overcome macrolide resistance. However, as anticipated, 22 out of 160 (13.75%) macrolide resistance genes identified had hypothetical annotation. To address this, we examined hypothetical genes for DUFs and propose prioritized gene targets related to macrolide resistance for immediate experimental characterization. Though we introduce this approach while exploring erythromycin resistant S. pneumoniae, applying it to other antibiotic resistant bacterial infections could reduce development costs and time to availability for potential new co-therapy targets, substantially changing the way antibiotic resistant infections are treated clinically.

Identification and validation of macrolide resistance genes
Using these expression data, we created erythromycin and spiramycin gene signatures (ranked by T-score) for the macrolide resistant GA17547 strain (Figure 1). We used the erythromycin signature as reference and the 250 most over- or under-expressed spiramycin genes as query gene sets for Gene Set Enrichment Analysis (GSEA), which calculates a running summation (enrichment score) based on the T-score of matches (hits) between the reference signature and query gene sets [13]. From this, we can (i) estimate how similar these signatures are (significance) by calculating a normalized enrichment score (NES) and p-value from 1000 gene permutations, and (ii) identify genes that contribute to achieving the maximum enrichment score (i.e. leading-edge genes) associated with macrolide resistance. Leading-edge genes (93 over- and 67 under-expressed; Table 1 and Table 2, respectively) were used for further analysis.
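The signature construction and running-sum comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the GSEA software used in the study: the function names, the choice of Welch's t-statistic for the T-score, the weighting of hits by absolute T-score, and the toy gene names are all assumptions made for the example.

```python
import math
import random
from statistics import mean, variance

def welch_t(treated, untreated):
    """Welch's t-statistic for one gene (assumed T-score; needs >= 2 samples per group)."""
    se = math.sqrt(variance(treated) / len(treated) + variance(untreated) / len(untreated))
    return (mean(treated) - mean(untreated)) / se

def build_signature(expr_treated, expr_untreated):
    """Return [(gene, t), ...] ranked from highest to lowest T-score."""
    scores = {g: welch_t(expr_treated[g], expr_untreated[g]) for g in expr_treated}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def enrichment(signature, query):
    """Weighted running sum over the ranked signature.

    Returns (enrichment score, leading-edge genes). Hits add |t| / sum(|t| of hits);
    misses subtract 1 / (number of misses).
    """
    query = set(query)
    hit_weight = sum(abs(t) for g, t in signature if g in query)
    n_miss = sum(1 for g, _ in signature if g not in query)
    running, best, best_idx = 0.0, 0.0, -1
    for i, (g, t) in enumerate(signature):
        running += abs(t) / hit_weight if g in query else -1.0 / n_miss
        if abs(running) > abs(best):
            best, best_idx = running, i
    if best >= 0:   # peak near the top: hits ranked at or before the peak
        leading = [g for g, _ in signature[:best_idx + 1] if g in query]
    else:           # trough near the bottom: hits ranked after the trough
        leading = [g for g, _ in signature[best_idx:] if g in query]
    return best, leading

def permutation_test(signature, query, n_perm=1000, seed=0):
    """Gene-permutation null: random gene sets of the same size as the query."""
    rng = random.Random(seed)
    genes = [g for g, _ in signature]
    observed, _ = enrichment(signature, query)
    k = len(set(query) & set(genes))
    null = [enrichment(signature, rng.sample(genes, k))[0] for _ in range(n_perm)]
    same_sign = [abs(s) for s in null if (s >= 0) == (observed >= 0)]
    nes = observed / mean(same_sign) if same_sign else float("nan")
    # Pseudo-count keeps the estimated p-value above zero.
    p = (1 + sum(s >= abs(observed) for s in same_sign)) / (1 + len(same_sign))
    return nes, p

# Toy reference signature (hypothetical genes and T-scores).
toy_signature = [("g1", 5.0), ("g2", 4.0), ("g3", 3.0), ("g4", 2.0), ("g5", 1.0),
                 ("g6", -1.0), ("g7", -2.0), ("g8", -3.0), ("g9", -4.0), ("g10", -5.0)]
es_up, leading_up = enrichment(toy_signature, {"g1", "g2", "g3"})
es_down, leading_down = enrichment(toy_signature, {"g8", "g9", "g10"})
nes_up, p_up = permutation_test(toy_signature, {"g1", "g2", "g3"})
```

In this sketch an over-expressed query set yields a positive enrichment score with its leading edge at the top of the signature, and an under-expressed query set yields a negative score with its leading edge at the bottom, mirroring the two leading-edge gene sets used in the analysis.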
To confirm that identified leading-edge genes are related to resistance, we (i) used leading-edge genes as query gene sets for GSEA with a T-ranked erythromycin response signature from a macrolide sensitive strain (XZ8009) as reference, and (ii) utilized Principal Component Analysis (PCA) and Leave One Out Cross Validation (LOOCV) to examine expression of leading-edge genes in another macrolide resistant strain, XZ7022 (Figure 1). PCA is an unsupervised dimensionality reduction machine-learning technique that can be used to visualize high-dimensional datasets (in our case 67 and 93 dimensions) in 2D space. PCA takes the high-dimensional measurements for all samples and converts them into principal components, a smaller number of uncorrelated variables. When principal components are plotted in 2D space, variation between samples is observed as separation along principal components. LOOCV, in contrast, sets aside each sample individually (i.e. test set) and calculates a multiple linear regression equation from the remaining samples (i.e. training set). The resulting equation is used to predict the treatment condition of the test set sample. This process is repeated until every sample has been left out once, and accuracy is determined from the results.
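As a rough illustration of the two validation steps just described, the sketch below projects samples onto the first two principal components and runs leave-one-out cross-validation with a least-squares linear model. It assumes NumPy is available; the function names, the 0/1 coding of treatment, the 0.5 decision threshold, and the synthetic data are assumptions for the example and are not taken from the study. With more genes than samples, `np.linalg.lstsq` returns the minimum-norm solution, which is one possible way to handle that case.

```python
import numpy as np

def pca_2d(X):
    """Project samples (rows of X) onto the first two principal components."""
    Xc = X - X.mean(axis=0)                       # center each gene (column)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # sample coordinates on PC1 and PC2

def loocv_accuracy(X, y):
    """Leave-one-out CV with multiple linear regression; y is coded 0/1."""
    correct = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i            # every sample except sample i
        A = np.column_stack([np.ones(train.sum()), X[train]])  # intercept + genes
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.concatenate([[1.0], X[i]]) @ coef
        correct += int((pred >= 0.5) == bool(y[i]))  # threshold the fitted value
    return correct / len(y)

# Synthetic toy data: 10 samples x 5 genes, treated samples shifted upward.
rng = np.random.default_rng(0)
y = np.array([0] * 5 + [1] * 5)                   # 0 = untreated, 1 = macrolide treated
X = rng.normal(0.0, 0.1, size=(10, 5)) + 5.0 * y[:, None]

coords = pca_2d(X)
accuracy = loocv_accuracy(X, y)
```

Plotting `coords[:, 0]` against `coords[:, 1]` and coloring points by `y` gives the kind of 2D separation plot described above, while `accuracy` is the fraction of left-out samples whose treatment condition was predicted correctly.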

Functional association and hypothetical gene identification from macrolide resistance genes
To identify cellular processes associated with our leading-edge genes, we utilized the Panther search feature [14] at the Gene Ontology (GO) knowledgebase [15,16], accessed October 17, 2018 (Figure 1). Panther calculates a p-value using Fisher's Exact Test for each user-inputted gene set compared to established gene sets in the GO knowledgebase. For this comparison, we converted each leading-edge gene's locus tag provided by GEO to a gene symbol. To do this, we queried the Protein database from the National Center for Biotechnology Information (NCBI) for each locus tag and collected gene symbols from the connected Conserved Domains Database [17]. If a locus tag did not have domains in the Conserved Domains Database, we verified its hypothetical status by examining homologs identified via Basic Local Alignment Search Tool Protein (BLASTP) [18]. Symbols for all leading-edge genes without exception were included in GO analysis.
Figure 1: Schematic representation of study approach. Signatures (ranked lists of genes from high to low expression) are created by ranking genes in the expression dataset by T-score, calculated by comparing untreated and macrolide-treated (either erythromycin or spiramycin) samples collected during mid-log phase of growth. To identify macrolide resistance genes, the erythromycin signature from erythromycin resistant S. pneumoniae strain GA17547 was used as reference and the top and bottom 250 genes from the GA17547 spiramycin signature were used as query gene sets (unranked lists of genes with biological relevance) for Gene Set Enrichment Analysis (GSEA) comparison. Gene matches (hits) between the reference signature and the query gene set being compared that contribute most to the maximum enrichment score are grouped together as a leading-edge gene set. Leading-edge gene sets were then used (i) as query gene sets for GSEA comparison against an erythromycin gene signature from erythromycin sensitive strain XZ8009 (reference), (ii) to select genes for principal component analysis and leave one out cross validation on erythromycin resistant strain XZ7022, and (iii) for functional analysis, by collecting gene symbols, descriptions, and protein domain information from National Center for Biotechnology Information (NCBI) databases, then using Gene Ontology to assess lists for over-representation relative to gene sets of known biological processes, and prioritizing genes without symbols (hypothetical genes) by domain of unknown function detection.
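The over-representation test described above can be illustrated with a one-sided Fisher's Exact Test computed from the hypergeometric distribution. This is a minimal sketch and not Panther's implementation; the function name is hypothetical, and the background size and GO term size used in the example call are placeholder values, since neither is given in the text.

```python
from math import comb

def fisher_overrep_p(k, n, K, N):
    """One-sided Fisher's exact p-value for over-representation.

    k: query genes annotated to the process
    n: size of the query gene set
    K: genes in the background annotated to the process
    N: size of the background (e.g. all annotated genes in the genome)
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Example call: 7 of 93 query genes fall in one process; K and N are assumed
# placeholder values, not figures from the study.
p_example = fisher_overrep_p(k=7, n=93, K=60, N=2000)
```

A small p-value indicates that the query gene set contains more genes from the process than expected by chance given the background; multiple-testing correction across GO terms would be applied separately.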

Discussion:
Similarities between erythromycin and spiramycin signatures reveal genes associated with macrolide resistance
To identify genes associated with macrolide resistance, we compared erythromycin and spiramycin gene expression signatures with the idea that genes with similar differential expression when erythromycin resistant S. pneumoniae is treated with different macrolides are associated with macrolide resistance. We observed a statistically significant similarity between erythromycin and spiramycin signatures (p<0.22, Figure 2a). Of the 250 most over- and under-expressed genes from the spiramycin signature used as query gene sets for comparison to the erythromycin signature, 93 over- and 67 under-expressed genes were identified as contributing most to achieving the maximum enrichment score (i.e. leading-edge, Table 1 and Table 2, respectively). We then used each leading-edge gene set as query for GSEA against an erythromycin response signature from a macrolide sensitive S. pneumoniae strain, with the hypothesis that these genes would not be differentially expressed in response to macrolide treatment. We observed a relatively random distribution of leading-edge genes across the macrolide response signature (p>0.900, Figure 2b), supporting their role in resistance rather than their expression changing as a response to treatment. These genes may contribute to macrolide resistance and could become valuable therapeutic targets for reversing macrolide resistance.

Figure 2:
(a) Similarities detected between erythromycin (reference) and spiramycin (spira, query gene sets) signatures from an erythromycin resistant S. pneumoniae strain, revealing leading-edge genes. (b) Differential expression of leading-edge genes (query gene sets) was not a response to erythromycin treatment, as seen by comparison to an erythromycin signature from a macrolide sensitive S. pneumoniae strain (reference). These findings suggest identified leading-edge genes are associated with macrolide resistance rather than response to macrolide treatment.
To confirm that the macrolide resistance genes (leading-edge) we identified are related to resistance, we used PCA to see if expression of these genes in a non-overlapping dataset from a related erythromycin resistant S. pneumoniae strain could separate samples based on treatment (macrolide or untreated). Both over- and under-expressed leading-edge gene sets were able to separate macrolide treated from untreated samples, regardless of which macrolide (erythromycin or spiramycin) was used for treatment (Figure 3a). To quantify this separation ability, we used LOOCV on the same erythromycin resistant S. pneumoniae strain dataset. Multiple linear regression equations derived from these data predicted the treatment of left-out samples with 100% and 80% accuracy for over- and under-expressed leading-edge genes, respectively (Figure 3b). While the sample size used in this study is small, and we acknowledge that inclusion of more samples would make the findings more robust and help prevent overfitting, these results support the conclusion that our leading-edge genes are related to macrolide resistance.

Macrolide resistance genes involved in increased amino acid biosynthesis and decreased ribonucleotide synthesis
To identify which cellular processes our macrolide resistance genes (i.e. leading-edge) correspond to most, we compared leading-edge gene lists to gene lists of known biological processes from GO to assess for over-representation via Fisher's Exact Test [14,15,16]. GO identified that 7 of 93 over-expressed leading-edge genes were related to amino acid biosynthesis: dapA, asd, alr, proC, proB, ilvE, and glyA (p=0.024). We noted genes clustered into several processes not identified as over-represented by GO

Cellular process detection is limited by incomplete genome annotation:
Incomplete genome annotation is a widespread challenge to examining cellular processes in bacteria [20]. This study was not immune to this major limitation: 6 of 93 (6.5%) over-expressed and 10 of 67 (14.9%) under-expressed leading-edge genes had hypothetical annotation, which we confirmed via BLASTP. Since true hypothetical proteins require experimental investigation, we explored genes with hypothetical annotation further to provide useful recommendations for experimental work. Following such guidance would maximize the potential to identify targets for new therapeutics that preclude and overcome macrolide resistance while minimizing experimental exploration costs.

Conclusion:
We identified and confirmed macrolide resistance genes in S. pneumoniae that are involved in increased amino acid biosynthesis and decreased ribonucleotide synthesis. Reversing activity for these cellular processes may overcome macrolide resistance. We noted that incomplete genome annotation (i.e. hypothetical genes) is a limitation to our analysis and further explored hypothetical genes related to macrolide resistance to recommend DUFs that are a priority for experimental characterization, such as structural determination via nuclear magnetic resonance or X-ray crystallography. Characterization of DUFs identified here has the potential to uncover novel co-therapies that reverse macrolide resistance [7], restoring efficacy not only for S. pneumoniae patients but across multiple macrolide resistant species, potentially saving thousands of lives annually.
Our gene signature comparison approach to identify DUFs associated with antibiotic resistance is a novel way to prioritize hypothetical genes for experimental characterization. Application of our approach across resistant bacterial infections would be valuable in reducing experimental time and financial costs for identifying new therapeutic targets. However, a major hindrance to these efforts is the limited availability of datasets run on the same platform. Variations in the platforms used in gene expression studies require the use of gene symbols, reducing signature similarities and resulting in detection loss. Regardless, gene expression datasets for antibiotic resistant bacteria using the same platform are publicly available, with more being deposited regularly. Results from further analysis could lead to far-reaching advances in treating antibiotic resistant infections globally.