Improved Annotations of 23 Differentially Expressed Hypothetical Proteins in Methicillin-Resistant S. aureus

Antibiotic-resistant Staphylococcus aureus is a major public health concern affecting millions of people annually. Medical science has documented completely untreatable S. aureus infections. These strains are appearing in the community with increasing frequency. New diagnostic and therapeutic options are needed to combat this deadly infection. Interestingly, around 50% of the proteins in S. aureus are annotated as hypothetical. Methods to select hypothetical proteins related to antibiotic resistance have been inadequate. This study uses differential gene expression to identify hypothetical proteins related to antibiotic-resistant phenotype strain variations. We apply computational tools to predict physicochemical properties, cellular location, sequence-based homologs, domains, 3D models, active site features, and binding partners. Nine of 23 hypothetical proteins were <100 residues, unlikely to be functional proteins based on size. For the 14 differentially expressed hypothetical proteins examined, confident predictions of function could not be made. Most identified domains had unknown functions. Six hypothetical protein models had >50% confidence over >20% of residues. These findings indicate that the method of hypothetical protein identification is sufficient; however, current scientific knowledge is inadequate to properly annotate these proteins. This process should be repeated regularly until entire genomes are clearly and accurately annotated.


Background:
Antibiotic therapy has been the marvel of modern medicine since the advent of Penicillin in the 1920s. Over seventy billion doses of antibiotics are consumed globally each year [1]. Antibiotics are a low-cost resource to treat food-borne and other sanitation-related infections that commonly affect poor people. Among wealthier countries, antibiotics play a pivotal role as a prophylactic, controlling infections associated with medical practices such as surgery [1,2]. Unfortunately, this usage exposes normal microbial flora to anti-bacterial drugs, allowing them to develop resistances so the drugs lose effectiveness. Medical science has been unable to cultivate new antibiotics as fast as resistances to current therapies are rising [2,3]. Infectious organisms that are resistant to every antibiotic developed have been reported. This antibiotic resistance crisis is a critical challenge for humanity's medical future.
Staphylococcus aureus, an opportunistic pathogen that was originally associated with hospital-acquired infections, was the first organism to show resistance to Penicillin and its synthetic offspring like Methicillin. Though hospital-acquired Methicillin-resistant S. aureus (MRSA) cases proliferated through the late 20th century, recent years have seen decreases in the number of hospital-acquired MRSA infections due to improvements in sanitation procedures and increases in Vancomycin use despite its potential side effects [4]. Unfortunately, community-acquired MRSA infections have dominated recently since over 100 million people harbor MRSA strains as part of normal skin flora according to Dutch
A challenge to developing new antibiotic therapies is genome annotation. Around 50% of proteins identified in the S. aureus genome are annotated as hypothetical [6,7]. At annotation, hypothetical proteins are predicted by sequence only and lack homology to known proteins. Researchers further define hypothetical proteins by a size larger than 100 amino acids, since smaller sequences likely represent other macromolecular structures such as short interfering RNA (siRNA) rather than functional proteins [8]. True hypothetical proteins have similar features to other hypothetical proteins due to lack of experimental evidence to predict function for the protein family, though frequently hypothetical proteins found in databases represent old genome annotations in need of update. Several studies have used various methods to identify hypothetical proteins related to antibiotic resistance in S. aureus. Early studies randomly selected hypothetical proteins for characterization [6,7,9,10]. While this approach developed and demonstrated computational procedures that contribute to hypothetical protein characterization, it is limited in its ability to identify hypothetical proteins specifically connected to antibiotic resistance.
To improve the selection process, we formerly developed a cross-species approach that used proteins with experimentally established structures from the major facilitator superfamily, a large, highly conserved protein family associated with antibiotic resistance [7]. This approach worked because of the large percentage of hypothetical proteins in the S. aureus genome, but it becomes inadequate if a hypothetical protein related to resistance has no well-characterized homolog in another species, a common challenge for hypothetical proteins. Better methods for identifying antibiotic resistance-related hypothetical proteins are needed.
Microarray and other forms of publicly accessible gene expression data can provide an excellent repository for targeted identification of resistance-linked hypothetical proteins in S. aureus. For example, Ham and colleagues examined mRNA expression between antibiotic-resistant (MRSA; ATCC 33591, shown to be susceptible only to Vancomycin and Kanamycin) and antibiotic-sensitive (MSSA; ATCC 25923) strains using Affymetrix GeneChip® technology [11]. They statistically compared mRNA expression levels between the strains to uncover potential mechanisms of resistance, but did not consider hypothetical proteins. This presents an opportunity to characterize hypothetical proteins whose differential expression constitutes a drug-resistant genomic background.
This study uses computational procedures to characterize statistically significant differentially expressed hypothetical proteins from the microarray data generated by Ham and associates. By comparing natural gene expression between antibiotic sensitive and resistant strains, new insight into strain background differences is gained. These variations could uncover new resistance mechanisms, further developing into a useful diagnostic tool or potential antibiotic therapeutic target. This would improve outcomes for patients infected with MRSA strains through faster and more effective treatment options.

Methodology:
Normalized mRNA expression data from Ham's study are available at the National Center for Biotechnology Information's (NCBI) Gene Expression Omnibus (GEO; Dataset Record GDS4242; GEO accession GSE18289) [11]. Data consisted of 7774 entries, each with a probe name and six samples representing triplicates of both MSSA (ATCC 25923) and MRSA (ATCC 33591) strains. Probe names were converted to gene names and descriptions per the Affymetrix chip platform, and non-hypothetical proteins were removed. T-scores and p-values were calculated in Excel using the two-tailed, equal-variance Student's t-test. Hypothetical proteins with p>0.05 were rejected as not differentially expressed. Hypothetical protein annotation was confirmed against the NCBI and UniProt databases.
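The per-probe test described above can be sketched in Python. This is a minimal sketch, not the Excel workbook used in the study: the function names, the sign convention (MRSA minus MSSA, so a positive T-score means higher expression in MRSA), and the expression values are illustrative assumptions. With triplicates in each group the test has 4 degrees of freedom, for which the Student's t distribution has a closed-form CDF, so only the standard library is needed.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(mrsa, mssa):
    """Equal-variance (pooled) two-sample Student's t statistic.

    Sign convention (an assumption): positive means higher in MRSA.
    Returns the statistic and its degrees of freedom.
    """
    n1, n2 = len(mrsa), len(mssa)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(mrsa) + (n2 - 1) * variance(mssa)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if std_err == 0:
        raise ValueError("zero variance in both groups; t is undefined")
    return (mean(mrsa) - mean(mssa)) / std_err, df


def two_tailed_p_df4(t):
    """Two-tailed p-value for a t statistic with exactly 4 degrees of freedom.

    Uses the closed-form CDF of the t distribution for df = 4.
    """
    u = 1 + t * t / 4
    cdf = 0.5 + 0.375 * (t / math.sqrt(u)) * (1 - (t * t / u) / 12)
    # Clamp at zero to absorb floating-point round-off for very large |t|.
    return max(0.0, 2 * min(cdf, 1 - cdf))


# Made-up normalized expression values for one probe (three replicates each).
mrsa_values = [9.1, 9.4, 9.0]
mssa_values = [7.2, 7.5, 7.1]

t_score, df = pooled_t_statistic(mrsa_values, mssa_values)
if df != 4:
    raise ValueError("two_tailed_p_df4 is valid only for triplicate vs. triplicate")
p_value = two_tailed_p_df4(t_score)
print(f"T-score = {t_score:.2f}, p = {p_value:.4g}, retained = {p_value <= 0.05}")
```

For group sizes other than three versus three, a general routine such as `scipy.stats.ttest_ind(a, b, equal_var=True)` would be the more natural choice; the closed form here only keeps the sketch dependency-free.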
This study used numerous algorithms to characterize these hypothetical proteins, with default program settings for all analyses. ExPASy's ProtParam server calculated physicochemical properties including number of amino acids, molecular weight, positively and negatively charged residues, theoretical isoelectric point (pI), extinction coefficient, aliphatic index (AI), instability index (II), and the grand average of hydropathy (GRAVY) [12]. By hypothetical protein definition, those identified through differential expression yet smaller than 100 amino acids were excluded from further study.
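Several of these properties are simple functions of amino acid composition, and a short Python sketch can make the definitions concrete. It is not a replacement for ProtParam: it covers only GRAVY (mean Kyte-Doolittle hydropathy), the aliphatic index (Ikai's formula on mole percentages), charged-residue counts, and the 100-residue size cutoff. The function names and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy scale, used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)), X in mole %."""
    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_counts(seq):
    """Return (# neg, # pos): Asp + Glu and Arg + Lys counts."""
    return seq.count("D") + seq.count("E"), seq.count("R") + seq.count("K")


def passes_size_filter(seq, min_length=100):
    """Keep only sequences of at least min_length residues."""
    return len(seq) >= min_length


# Hypothetical sequence, for illustration only (not a protein from the study).
example = "MKKLLVAIDGSEHAERAL" * 6
print(len(example), passes_size_filter(example))
print(round(gravy(example), 3), round(aliphatic_index(example), 2))
print(charged_counts(example))
```

Molecular weight, pI, extinction coefficient, and instability index need additional lookup tables and are left out of this sketch.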
PSortB and SOSUI servers predicted each hypothetical protein's cellular location. PSortB predicted between cytoplasm, cytoplasmic membrane, cell wall, or extracellular locations [13].
For model development and characterization, we used the integrated Phyre2 and 3DLigandSite servers. Phyre2 produced a tertiary structure model, predicted ligand-binding sites, and analyzed the effect of amino acid variants through automatic homology detection methods [17]. The Phyre2 model was then passed to 3DLigandSite for active site characterization and docking predictions. 3DLigandSite identifies homologous structures with bound ligands by searching a structural library, then superimposes those ligands onto the Phyre2 protein structure [18]. Together, the Phyre2 and 3DLigandSite servers modeled the protein and characterized its binding site.
The Search Tool for Interactions of Chemicals (STITCH) database predicted potential ligand interactions for each hypothetical protein. STITCH draws upon scientific literature and several databases, including the formerly separate Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, which houses high-throughput experiment and conserved coexpression data, to calculate drug-target interactions, binding affinities, and biological pathways [19]. STITCH is a useful tool to predict protein and chemical binding partners.

Results:
The mRNA expression dataset, GSE18289, was downloaded from GEO, and the T-statistic and p-value for each protein were calculated in Excel. Twenty-seven proteins labeled as hypothetical in NCBI, 16 up-regulated and 11 down-regulated in MRSA, had p-values <0.05. Four of these proteins had predicted functions in UniProt: an endotoxin (SACOL0468, up-regulated, T-score 9.00), an exotoxin (SACOL1178, up-regulated, T-score 10.17), a phosphate dikinase regulatory protein (SACOL1620, down-regulated, T-score -9.80), and a lipoprotein (SACOL1531, up-regulated, T-score 7.89). Since these proteins had predicted identities, they were excluded from further study. The remaining 23 proteins are listed by T-score in Table 1.

Figure 1:
Venn diagram illustrating overlap of study evaluations. PSortB, PSI-BLAST, and Phyre2 (green line) characterized all 14 hypothetical proteins that passed ExPASy's size exclusion criteria. Only those algorithms found results for SACOL2481 (1). SACOL2241 also had a SOSUI (purple line) result (1). SACOL0710 and SACOL0323 had STITCH (blue line) results (2). SACOL0488 had both STITCH and CDD-BLAST (orange line) results (2). SACOL0267, SACOL0109, and SACOL0075 had SOSUI and STITCH results (3). SACOL2123 and SACOL0350 had CDD-BLAST, Pfam (yellow line), and STITCH results (2). SACOL1956 and SACOL0644 had results from all programs (2) and SACOL0835 had results from all except STITCH (1).
# AA, number of amino acids; MW, molecular weight; pI, theoretical isoelectric point; # neg, total number of negatively charged residues (Asp + Glu); # pos, total number of positively charged residues (Arg + Lys); EC, extinction coefficient assuming all pairs of Cys residues form cystines; II, instability index; AI, aliphatic index; GRAVY, grand average of hydropathy.
1 As there are no Trp, Tyr, or Cys in the region considered, protein should not be visible by UV spectrophotometry.
Few hypothetical proteins had well-defined homologs in the NCBI database as identified by PSI-BLAST (
Phyre2 and 3DLigandSite servers performed hypothetical protein modeling and active site characterization. Similarity measurements of the hypothetical protein target to its experimental structure template are in Table 7. These findings represent Phyre2 running in normal mode. Hypothetical proteins with coverage >25% in normal mode were re-run under Phyre2's intensive mode, with the results shown in Figure 2. Remarkably, under this mode, the SACOL1859 and SACOL0710 models had 88% and 89% of residues modeled with >90% confidence, respectively. No amino acids from the other four proteins could be modeled with that confidence.
Unfortunately, 3DLigandSite was unable to make a prediction for any hypothetical protein examined in this study due to insufficient homologous structures with bound ligands.
STITCH predicted binding partners for the hypothetical proteins, with three exceptions: SACOL2481, SACOL0835, and SACOL2241. Most top binding partners were fellow hypothetical proteins, with confidence scores listed in Table 8. This implies that more database annotation and/or wet bench work is needed to fully understand how these proteins work. SACOL0323, SACOL2123, and SACOL0710 had top matching binding partners that were not hypothetical proteins.
SACOL0323 matched a prophage L54a, Cro-like protein. SACOL2123 had equal scores to a M20/M25/M40 family peptidase (SACOL2125) and a hypothetical protein (SACOL2124). SACOL0710 equally matched a phosphotransferase mannose-specific family component IIA (SACOL0709) and a DAK2 domain-containing protein (SACOL0708). These results did not correlate with the findings from other programs used in this study.
1 Equal probability of the protein being located in any cellular structure: cytoplasm, cytoplasmic membrane, cell wall, or extracellular.
2 Equal probability of protein being located in cytoplasmic membrane, cell wall, or extracellular.