Predictive characterization of hypothetical proteins in Staphylococcus aureus NCTC 8325

Staphylococcus aureus is one of the most common causes of hospital-acquired infections. It colonizes immunocompromised patients, and with the number of antibiotic-resistant strains increasing, medicine needs new treatment options. Understanding more about the proteins this organism uses would further this goal. Hypothetical proteins are sequences thought to encode a functional protein but for which little to no evidence of that function exists. About half of the genomic proteins in the reference strain S. aureus NCTC 8325 are hypothetical. Since annotation of these proteins can lead to new therapeutic targets, demand to characterize hypothetical proteins is high. This work examines 35 hypothetical proteins from the chromosome of S. aureus NCTC 8325. Examination includes physicochemical characterization, sequence homology, structural homology, domain recognition, structure modeling, active site depiction, predicted protein-protein interactions, protein-chemical interactions, protein localization, protein stability, and protein solubility. The examination revealed some hypothetical proteins related to virulence domains and protein-protein interactions, including superoxide dismutase, O-antigen, bacterial ferric iron reductase, and siderophore synthesis. Other hypothetical proteins appear to be metabolic or transport proteins, including ABC transporters, major facilitator superfamily proteins, S-adenosylmethionine decarboxylase, and GTPases. Evaluation of some hypothetical proteins, particularly the smaller ones, was incomplete due to limited homology and structural information in public repositories. These data characterizing hypothetical proteins will contribute to the scientific understanding of S. aureus by identifying potential drug targets and aiding future drug discovery.


Background
While Staphylococcus aureus is a natural bacterial inhabitant of the nasal passages, it is a major cause of nosocomial infections of surgical wounds, particularly those involving indwelling medical devices [1]. It can also present as superficial skin lesions or localized abscesses that turn into deep-seated infections such as furunculosis if left untreated. S. aureus can cause toxic shock syndrome when the infection becomes septic, a serious concern given the rise of antibiotic resistance in the organism. Other health issues related to internalized infections are heart and lung diseases such as endocarditis and necrotizing pneumonia, which are now being diagnosed in younger community populations rather than remaining solely hospital-acquired (HA) infections. Deaths have been reported in relation to these heart and lung infections [2].
Methicillin-resistant Staphylococcus aureus (MRSA) bacteria are resistant to all beta-lactam antibiotics such as penicillin, methicillin, amoxicillin, and oxacillin. In 2011, the Centers for Disease Control and Prevention estimated 80,000 invasive MRSA infections and 11,285 related deaths in the United States annually [3]. Most of these are nosocomial infections, though community-acquired (CA) MRSA infections are increasing, particularly among immunocompromised patients. In the community setting, younger people have shown a greater tendency to acquire MRSA than older people. According to Casey, the median age for CA-MRSA in 2010 was 24 years versus a median age of 61 years for nosocomial MRSA infections [4]. Another predictor of CA-MRSA infection was an increased number of antibiotics prescribed in the year before infection. There are no significant data showing a link between race and CA-MRSA infection rates, but obesity was identified as a risk factor.
CA-MRSA in the United States is of particular concern because the USA300 strain is gaining momentum globally, which is attributed to its excessive production of exotoxins and its genetically encoded polyamine resistance [5]. One of the most prominent mechanisms responsible for the virulence of USA300 is its Arginine Catabolic Mobile Element (ACME), which ultimately inhibits the polyamines that promote wound healing in patients [6]. The ACME of this CA-MRSA also allows it to survive acidic environments that normally limit its colonization.
Because of the documented global spread of CA-MRSA in just the past 20 years, innovative therapies that target these divergent strains in new ways are urgently needed worldwide [2]. This directs attention to in silico prediction work with hypothetical proteins, which allows further investigation of the S. aureus genome. By analyzing the phylogeny of its protein sequences, prediction of the bacterium's next mutation becomes possible, thus enhancing knowledge of its mechanisms of action. In this way, science gains insight into targets whose inhibition could limit reproduction of such a resistant and virulent species.
Approximately 50% of the proteins encoded in the S. aureus NCTC 8325 genome are hypothetical. Hypothetical proteins are protein sequences predicted from nucleic acid sequence only, with unknown function [7]. There is little to no experimental evidence for their function, and they are characterized by low identity to proteins of known function. Frequently, these nonconserved proteins do not follow established phylogenetic lineage. There are two groups of hypothetical proteins: uncharacterized protein families and domains of unknown function. The latter are experimentally identified proteins with no known structural domains related to function.
Several studies have characterized hypothetical proteins. Mohan and Venugopal examined ten hypothetical plasmid proteins in S. aureus in 2012 [8]. They characterized an ABC transporter ATP-binding protein, export proteins, and a protein related to the multiple antibiotic resistance family, among others. In 2015, Varma and colleagues examined one hypothetical protein from S. aureus, selected for its size and Basic Local Alignment Search Tool (BLAST) result, which appears to bind to ribosomal subunits [7]. Shahbaaz and colleagues predicted the function of 83 hypothetical proteins in Mycoplasma pneumoniae type 2a strain 309, several of which appear to be virulence-related [9]. Islam et al. characterized six hypothetical proteins in Vibrio cholerae O139, predicting the function of an antibiotic resistance protein, an integrase enzyme, and a restriction endonuclease [10]. All used methods similar to those presented in this study.
With approximately half of all genomic protein sequences currently annotated as hypothetical, great potential exists for the discovery of new drug targets [10]. The pharmaceutical industry is struggling to discover and develop new drugs quickly and cheaply. Characterizing hypothetical proteins increases the number of available targets that pharmaceutical agents could act on, which may relieve some of this pressure. This could lead to novel and improved therapeutic agents for better patient care, increased corporate and hospital profits, and decreased drug prices for consumers.
Several algorithms were used to characterize these hypothetical proteins. Position-Specific Iterative BLAST (PSI-BLAST) at NCBI identified potential homologs through iterative searches with position-specific scoring matrices built from protein sequence alignments. ExPASy's ProtParam server computed the number of amino acids, amino acid composition and frequencies, molecular weight, the total number of charged residues (aspartic acid plus glutamic acid for negatively charged and arginine plus lysine for positively charged), theoretical isoelectric point (pI), extinction coefficient, instability index (II), aliphatic index (AI), and grand average of hydropathicity (GRAVY) [11].
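Several of the ProtParam quantities listed above follow directly from amino acid composition. The following is a minimal sketch, not the ProtParam implementation, of how the charged residue counts, GRAVY, and aliphatic index can be computed. It assumes the Kyte-Doolittle hydropathy scale for GRAVY and Ikai's coefficients (2.9 for Val, 3.9 for Ile and Leu) for the aliphatic index; the function name `basic_parameters` and the example sequence are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def basic_parameters(seq: str) -> dict:
    """Compute composition-based parameters for a protein sequence.

    Raises ValueError for an empty sequence and KeyError for residues
    outside the 20 standard amino acids.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Mole percent of each residue, used by the aliphatic index.
    pct = {aa: 100.0 * c / n for aa, c in counts.items()}
    return {
        "length": n,
        "negative": counts["D"] + counts["E"],  # Asp + Glu
        "positive": counts["R"] + counts["K"],  # Arg + Lys
        # GRAVY: sum of hydropathy values divided by sequence length.
        "gravy": sum(KD[aa] for aa in seq) / n,
        # Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)).
        "aliphatic_index": pct.get("A", 0.0)
        + 2.9 * pct.get("V", 0.0)
        + 3.9 * (pct.get("I", 0.0) + pct.get("L", 0.0)),
    }


if __name__ == "__main__":
    # Made-up peptide, used only to exercise the function.
    print(basic_parameters("AVILDEKR"))
```

A positive GRAVY value indicates a more hydrophobic sequence overall, and a higher aliphatic index indicates a larger relative volume of aliphatic side chains.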

Methodology
Both Pfam and the conserved domain database BLAST (CDD-BLAST) from NCBI performed protein domain identification. Pfam is a comprehensive collection of multiple sequence alignments and Hidden Markov Models that represent protein domains and families [12]. The CDD-BLAST algorithm uses a PSI-BLAST variant to compare the protein sequence against position-specific scoring matrices [13]. Researchers frequently use Pfam and CDD-BLAST together to characterize parts of the protein involved in binding capability [8, 10].
Two programs analyzed protein location within the cell. PSortB predicts the location of each protein [18]. The SOSUI server characterized a protein's solubility and identified potential transmembrane regions [19]. Because cysteine residues can form disulfide bonds that stabilize a protein, DISULFIND was used to predict disulfide bridges and examine structural and functional properties of the hypothetical proteins [20]. Default program settings were used for all analyses except for STITCH, where the required confidence (score) was set to highest confidence (0.900).

Discussion
Thirty-five chromosomal hypothetical proteins from S. aureus NCTC 8325 were randomly selected from 1509 possible hypothetical proteins. Characterization included homolog identification, physicochemical measurements, domain identification, active site description, binding partners, cellular location, and solubility calculations.

Sequence Similarity
PSI-BLAST searches for homologs by iteratively comparing a position-specific profile of the query sequence against a protein database. The top PSI-BLAST result for each hypothetical protein is listed in Table 1. All hypothetical proteins matched proteins in S. aureus with 100% query coverage except SAOUHSC_01937, for which PSI-BLAST found no match. SAOUHSC_00010 matched a protein in S. aureus MRSA131. SAOUHSC_00328 matched a protein in S. aureus A5948. SAOUHSC_01024 matched a protein in S. aureus VRS1. Percent identity ranged from 97% to 100% with e-values of 0.0 to 4e-11, indicating strong matches between hypothetical proteins and their homologs.
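As an illustration of how a top hit per query can be selected from search results, the sketch below parses BLAST tabular output (the standard 12-column `-outfmt 6` layout) and keeps the highest-scoring hit for each query under an e-value cutoff. This is not the procedure used in this study, which relied on the NCBI web service. The function name `top_hits`, the cutoff of 1e-5, and the example rows are hypothetical.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def top_hits(tabular_text: str, max_evalue: float = 1e-5) -> dict:
    """Return the best hit (highest bit score) per query sequence.

    Hits with an e-value above max_evalue are ignored, so a query with
    no acceptable hit is absent from the result.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[hit["qseqid"]] = {
                "subject": hit["sseqid"],
                "pident": float(hit["pident"]),
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


if __name__ == "__main__":
    # Made-up rows for illustration only; not results from this study.
    rows = [
        ["query_A", "subject_1", "100.00", "120", "0", "0",
         "1", "120", "1", "120", "0.0", "245"],
        ["query_A", "subject_2", "97.50", "120", "3", "0",
         "1", "120", "1", "120", "2e-80", "230"],
        ["query_B", "subject_3", "35.00", "40", "26", "0",
         "1", "40", "5", "44", "0.5", "22.3"],
    ]
    text = "\n".join("\t".join(r) for r in rows)
    for query, hit in top_hits(text).items():
        print(query, hit)
```

A query with no hit below the cutoff simply does not appear in the returned dictionary, which mirrors how a protein without a significant match would be absent from a results table.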

Physicochemical Characterization
ExPASy's ProtParam calculated the physicochemical parameters listed in Table 2. The number of amino acids ranged from 30 to 1370, with molecular weights from 3544.3 to 163266.7 Da. The theoretical isoelectric point (pI), the pH at which the protein carries no net charge, was calculated from the numbers of negatively and positively charged residues (Asp and Glu, and Arg and Lys, respectively). The extinction coefficient values are for 280 nm because that is the wavelength at which proteins absorb light strongly while other substances common to protein solutions do not. The extinction coefficient for two smaller hypothetical proteins, SAOUHSC_01024 and SAOUHSC_01291, could not be determined because they contain no Trp, Tyr, or Cys, so these proteins should not be visible by UV spectrophotometry. The instability index (II) predicts whether a protein would be stable in a test tube under normal conditions. Proteins with II values over 40 are considered unstable. The aliphatic index (AI) represents the relative volume of the protein taken up by aliphatic side chains (Ala, Val, Leu, and Ile). The higher the AI, the wider the temperature range over which the protein will be stable. GRAVY measures the protein's hydrophobicity. Values spanned -1.984 to 1.096, with higher scores indicating greater hydrophobicity.
Domains identified by CDD-BLAST and Pfam are listed in Tables 3-6. The programs could not find domains within proteins not listed. If both programs identified a domain, it is identified and defined in the CDD-BLAST tables (Tables 3 and 4) and not repeated in the Pfam tables.
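The extinction coefficient and stability statements in this section can be made concrete with a short sketch. It assumes the molar extinction values documented for ProtParam at 280 nm (Tyr 1490, Trp 5500, and cystine 125 M^-1 cm^-1) and the II threshold of 40 stated above. The function names are hypothetical, and returning `None` for a protein without Trp, Tyr, or Cys is an illustrative choice, not ProtParam's actual output format.

```python
from typing import Optional

# Molar extinction values at 280 nm, in M^-1 cm^-1.
EXT_TYR = 1490
EXT_TRP = 5500
EXT_CYSTINE = 125  # per disulfide-bonded Cys pair


def extinction_coefficient_280(seq: str, cys_paired: bool = True) -> Optional[int]:
    """Estimate the molar extinction coefficient at 280 nm.

    Returns None when the sequence has no Trp, Tyr, or Cys, because
    such a protein is not expected to absorb at 280 nm.
    With cys_paired=True, all Cys are assumed to form cystines;
    with cys_paired=False, all Cys are assumed reduced.
    """
    seq = seq.upper()
    n_tyr = seq.count("Y")
    n_trp = seq.count("W")
    n_cys = seq.count("C")
    if n_tyr + n_trp + n_cys == 0:
        return None
    n_cystine = n_cys // 2 if cys_paired else 0
    return EXT_TYR * n_tyr + EXT_TRP * n_trp + EXT_CYSTINE * n_cystine


def is_predicted_stable(instability_index: float) -> bool:
    """Apply the II threshold: values below 40 are predicted stable."""
    return instability_index < 40.0


if __name__ == "__main__":
    # Made-up peptides, used only to exercise the functions.
    print(extinction_coefficient_280("WYCC"))
    print(extinction_coefficient_280("AGLK"))
    print(is_predicted_stable(35.2))
```

The two `cys_paired` settings correspond to the two assumptions commonly reported for this calculation, all cysteines oxidized to cystines or all cysteines reduced.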

Active Site and Substrate Characterization
The (PS)2 server attempted to model each hypothetical protein. Template information, including percent identity and e-value, is in Table 7. Several proteins could not be modeled by (PS)2. Two hypothetical proteins, SAOUHSC_01931 and SAOUHSC_02570, returned an error message consisting of program code when (PS)2 attempted to model them. The problem was reported to the (PS)2 maintainers at chieh.bi91g@nctu.edu.tw, but no correction was made. The program could not find significant templates for other hypothetical proteins not listed.
3DLigandSite characterized the active sites of the hypothetical proteins. Figure 1 depicts the predicted active site with binding heterogens for the 12 of 22 proteins with the largest active sites. Table 8 lists the predicted residues responsible for forming active sites, along with the heterogens. There were insufficient homologous structures with bound ligands for the other hypothetical proteins not listed in the table.
The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) and Search Tool for Interactions of Chemicals (STITCH) predicted interactions with hypothetical proteins. Table 9 shows the top non-hypothetical protein interactions with the highest confidence from STRING. If a protein is not listed in the table, either it had no predicted functional partners or all of its predicted partners were other hypothetical proteins. Figure 2 illustrates the highest confidence interactions that STITCH predicted with multiple proteins. Because the STITCH and STRING findings were similar, a protein for which only one non-hypothetical partner was predicted with highest confidence is listed in Table 9 but not shown in Figure 2.
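The selection rule described above, keeping only highest-confidence partners that are not themselves hypothetical, can be sketched as a simple filter. This is an illustration under stated assumptions and does not call the STRING or STITCH services: the `Interaction` record, its field names, and the example entries are hypothetical, and scores are assumed to be on a 0 to 1 scale so that 0.900 corresponds to the highest confidence setting.

```python
from typing import List, NamedTuple


class Interaction(NamedTuple):
    query: str        # hypothetical protein being characterized
    partner: str      # predicted interaction partner
    annotation: str   # partner's functional annotation
    score: float      # confidence score on a 0-1 scale (assumed)


def high_confidence_partners(
    interactions: List[Interaction], min_score: float = 0.900
) -> List[Interaction]:
    """Keep partners at or above min_score that are not hypothetical.

    Results are sorted from highest to lowest confidence.
    """
    kept = [
        i for i in interactions
        if i.score >= min_score
        and "hypothetical" not in i.annotation.lower()
    ]
    return sorted(kept, key=lambda i: i.score, reverse=True)


if __name__ == "__main__":
    # Made-up records for illustration only; not results from this study.
    records = [
        Interaction("query_A", "partner_1", "ABC transporter permease", 0.951),
        Interaction("query_A", "partner_2", "hypothetical protein", 0.987),
        Interaction("query_A", "partner_3", "GTPase", 0.720),
    ]
    for hit in high_confidence_partners(records):
        print(hit)
```

With this rule, a query whose only high-scoring partners are annotated as hypothetical yields an empty list, matching the case in which a protein is omitted from the interaction table.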

Cellular Location, Solubility, and Stability
PSortB predicted the cellular location of hypothetical proteins with results summarized in Table 10. PSortB was unable to determine cellular location for unlisted proteins.
SOSUI calculates the average hydrophobicity and uses it to determine whether the protein is soluble. Sufficiently hydrophobic portions of the protein are labeled as transmembrane regions. Table 11 shows the transmembrane regions of the eight proteins in which they were found. SOSUI deemed all other proteins soluble.
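The idea of flagging hydrophobic stretches as candidate transmembrane regions can be illustrated with a classic sliding-window hydropathy scan. This is not the SOSUI algorithm, which combines several physicochemical indices; it is a simpler substitute that assumes the Kyte-Doolittle scale with a 19-residue window and an average hydropathy threshold of 1.6. The function name `hydrophobic_segments` and the example sequences are hypothetical.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_segments(
    seq: str, window: int = 19, threshold: float = 1.6
) -> List[Tuple[int, int]]:
    """Return candidate transmembrane segments as 1-based (start, end).

    Every window whose mean hydropathy meets the threshold is flagged,
    and runs of consecutive flagged windows are merged into one segment.
    Sequences shorter than the window yield no segments.
    """
    seq = seq.upper()
    segments = []
    start = None  # 0-based start of the current run of flagged windows
    end = 0       # 0-based exclusive end of the last flagged window
    for i in range(len(seq) - window + 1):
        mean = sum(KD[aa] for aa in seq[i:i + window]) / window
        if mean >= threshold:
            if start is None:
                start = i
            end = i + window
        elif start is not None:
            segments.append((start + 1, end))
            start = None
    if start is not None:
        segments.append((start + 1, end))
    return segments


if __name__ == "__main__":
    # Made-up sequences: a hydrophobic core flanked by charged residues,
    # and a fully charged sequence with no hydrophobic stretch.
    print(hydrophobic_segments("D" * 10 + "L" * 20 + "D" * 10))
    print(hydrophobic_segments("DEKR" * 10))
```

A protein for which the scan returns an empty list would be treated as soluble under this simplified rule, while any returned segment marks a region worth examining as a possible membrane-spanning helix.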