Homology modeling and assigned functional annotation of an uncharacterized antitoxin protein from Streptomyces xinghaiensis

Streptomyces xinghaiensis is a Gram-positive, aerobic and non-motile bacterium whose genome sequence is known. It is therefore of interest to study the uncharacterized proteins encoded in the genome. An uncharacterized protein (gi|518540893|, 86 residues) was selected for a comprehensive computational sequence-structure-function analysis using available data and tools. The subcellular localization of the target protein, its conserved residues and its assigned secondary structures are documented. Sequence homology searches against the Protein Data Bank (PDB) and the non-redundant GenBank proteins using BLASTp identified several homologous proteins with known antitoxin function. A homology model of the target protein was developed using a known template (PDB ID: 3CTO:A; 62% sequence similarity in HHpred) and was assessed using the programs PROCHECK and QMEAN6. The active site predicted using CASTp was analyzed with respect to the assigned antitoxin function. This information is useful for annotating the uncharacterized protein in the bacterial genome.


Background:
Streptomyces are soil-dwelling Gram-positive bacteria belonging to the order Actinomycetales [1]. Streptomyces xinghaiensis, a novel species of Streptomyces, was isolated from a marine sediment sample collected from Xinghai Bay, Dalian, China [2]. The S. xinghaiensis draft genome contains 7,618,725 bp with a GC content of 72.5%, representing approximately 92.7% of the 8.2-Mb estimated size of the genome. Analysis of the genome revealed a number of genes related to the biosynthesis of secondary metabolites. At least 15 clusters involved in secondary metabolism were identified; these include one gene cluster that highly resembles the gene cluster of ribostamycin [3], an aminoglycoside antibiotic. Toxin-antitoxin (TA) systems are widespread in bacterial and archaeal genomes and are usually recognized as maintenance or stability mediators [4,5]. Although the exact role of these systems is not clear, they are thought to act as sentinels against DNA loss and to take part in stress-management processes such as programmed cell death and antibiotic resistance [6]. According to their mode of action, TA systems have been classified into three broad classes: class I, class II and class III. Among them, class II is predominant in many organisms [7]. The class II TA system consists of two proteins, a toxin and an antitoxin. The toxin is neutralized by the antitoxin through direct protein-protein interaction, and/or the antitoxin interacts with palindromic sequences within the promoter to suppress transcription of the TA system [8-10].
Sequencing technology has become sophisticated and capable of handling massive amounts of data in recent years. Unfortunately, many of the resulting genomes are still not fully annotated, and they contain various genes or proteins with uncharacterized function and unknown 3D structure. This is due to several limitations, such as the cost and time required by experimental methodologies. Hence, alternative methods based on computer-aided mathematical models are frequently used to gain insight [11][12][13]. It is therefore of interest to study the uncharacterized proteins in the genome. An uncharacterized protein (gi|518540893|, 86 residues) in the bacterial genome was selected for a comprehensive computational sequence-structure-function analysis using available data and tools.

Sequence retrieval
We searched the NCBI (http://www.ncbi.nlm.nih.gov/) [14] protein databases for proteins containing antitoxin-like sequences. An uncharacterized protein (gi|518540893|) from Streptomyces xinghaiensis consisting of 86 amino acid residues was selected for the study, and its sequence was downloaded in FASTA format for further analysis.
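As an illustration of handling the downloaded record, the following is a minimal sketch of a FASTA parser using only the Python standard library. It is not the procedure used in the study. The function name `parse_fasta` is hypothetical, and the residues in the example record are made-up placeholders, not the actual sequence of gi|518540893|.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical record: the identifier follows the text, the residues are placeholders.
example = """>gi|518540893| hypothetical placeholder
MSITASEARQ
NLFPLIEQVN
"""

records = parse_fasta(example)
for header, seq in records.items():
    print(header, len(seq))
```

Joining the sequence lines only at the end keeps the parser linear in the input size, which matters little for one 86-residue protein but is the usual idiom for larger files.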

Analysis of physico-chemical properties
The ProtParam tool (http://web.expasy.org/protparam/) [15] of ExPASy was used to analyze the physical and chemical properties of the target protein sequence. The properties analyzed included the aliphatic index (AI), the grand average of hydropathy (GRAVY), the extinction coefficient, the isoelectric point (pI) and the molecular weight.
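Three of these properties can be computed directly from the sequence. The sketch below is an independent illustration, not ProtParam's own code. It assumes the standard published definitions: the Kyte-Doolittle hydropathy scale for GRAVY, the Ikai formula for the aliphatic index, and molar extinction values at 280 nm of 5500 (Trp), 1490 (Tyr) and 125 (cystine). The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    n = len(seq)

    def mole_pct(aa):
        return 100.0 * seq.count(aa) / n

    return mole_pct("A") + 2.9 * mole_pct("V") + 3.9 * (mole_pct("I") + mole_pct("L"))


def extinction_coefficient(seq, cystines=0):
    """Molar extinction coefficient at 280 nm; cystine count is supplied by the caller."""
    return 5500 * seq.count("W") + 1490 * seq.count("Y") + 125 * cystines


# Toy inputs chosen only to exercise the formulas, not the target protein.
print(round(gravy("AAAA"), 3))
print(round(aliphatic_index("AVIL"), 1))
print(extinction_coefficient("WY"))
```

The number of cystines is passed in explicitly because it cannot be read from the sequence alone: it depends on which cysteines are actually disulfide-bonded.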

Sub-cellular localization prediction
Determining sub-cellular localization is crucial for understanding protein function and is also vital for genome analysis. Prediction of sub-cellular localization of the protein from Streptomyces xinghaiensis was completed using CELLO (version 2.5), a multiclass support vector machine classification system [16,17].

Protein family and phylogeny analysis
The BLASTp program from NCBI (http://www.ncbi.nlm.nih.gov/) [18] was used to search for sequences similar to the target protein in the non-redundant database with default parameters. The target protein was then analyzed for the presence of conserved domains based on a sequence similarity search against close orthologous family members. For this purpose, three different tools and/or databases were used: the Protein Families Database (Pfam) [19], the NCBI Conserved Domains Database (NCBI-CDD) [20] and SUPERFAMILY [21]. Pfam is a database of protein families that includes annotations and multiple sequence alignments generated using hidden Markov models. NCBI-CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. The SUPERFAMILY annotation is based on a collection of hidden Markov models that represent structural protein domains at the SCOP superfamily level. The annotation is produced by scanning protein sequences from completely sequenced genomes against these hidden Markov models. The phylogenetic analysis was performed using CLC Sequence Viewer v7.0.2 (http://www.clcbio.com) to understand the molecular evolution of the protein. The sequence of the target protein (gi|518540893|), with its secondary structures (alpha helices and beta strands), is shown at the top of the alignment. The target protein shows 62% sequence similarity with the template of known structure (PDB ID: 3CTO:A). The remaining sequences show up to 90% similarity with the target protein.

Multiple sequence alignment and Secondary structure analysis
A combined approach was used to gain structural and functional insights through sequence comparison. We retrieved several annotated antitoxin protein sequences of Streptomyces species from the NCBI protein database, and a multiple sequence alignment (MSA) of these sequences together with the target protein was obtained using the BioEdit biological sequence alignment editor [22]. The aligned sequences were then used for the prediction of secondary structures using ESPript 3.0 [23].
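To make the comparison of aligned sequences concrete, the following minimal sketch computes the pairwise percent identity between two rows of an alignment. It is an assumption-laden illustration: percent identity counts exact matches only, whereas the "similarity" values reported by BLASTp or HHpred also credit conservative substitutions, so the numbers are not interchangeable. The function name and the short aligned fragments are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # ignore columns containing a gap in either row
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments, not taken from the actual MSA.
print(percent_identity("MKT-AL", "MRTQAL"))  # 4 matches over 5 compared columns -> 80.0
```

Other conventions divide by the alignment length or by the length of the shorter sequence; the choice of denominator should be stated whenever an identity value is reported.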

Homology Modeling
Homology modeling was used to predict the three-dimensional structure of the target protein. A BLASTp [18

Model quality assessment
The quality of the predicted structure was assessed with the PROCHECK [27] and QMEAN6 [28] programs of the ExPASy server of the SWISS-MODEL Workspace [29]. Furthermore, the query and template structures were superimposed and the root mean square deviation (RMSD) was calculated using UCSF Chimera 1.5.3 [30]. The Z scores of the template and query were also assessed with the ProSA-web server [31]. Finally, the superimposed model and template structures were visualized using PyMOL
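For reference, the RMSD between two sets of paired atoms is the square root of the mean squared distance between corresponding positions. The sketch below computes this quantity for coordinates that are assumed to be already superimposed; it does not perform the optimal superposition that UCSF Chimera carries out, and it is not Chimera's implementation. The function name and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples, paired by index."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance between the paired atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates in angstroms: every atom is displaced by 1 along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, template))  # 1.0
```

In practice the pairing usually covers the C-alpha atoms of aligned residues, and the value depends on which atoms are included, so RMSD figures are comparable only when computed over the same atom set.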

Results & Discussion:
Various physico-chemical properties of the target protein were assessed with the ProtParam tool. These include the aliphatic index (AI; 88.60), the instability index (II; 81.60), the isoelectric point (pI; 4.61), the extinction coefficient (6990) and the grand average of hydropathicity (GRAVY; -0.573). These values relate to the stability of the protein and hence to its function [34].
Sub-cellular localization is an essential feature of a protein. Cellular functions are usually localized in specific compartments, so predicting the sub-cellular localization of an unknown protein can provide useful information about its function. This information is also valuable for drug design against the target protein [35]. Here, the sub-cellular localization of the target protein predicted by CELLO is the cytoplasm. The BLASTp search against the non-redundant database showed homology (up to 90% sequence similarity) with other known antitoxin proteins from different Streptomyces species (Table 1, see supplementary material). The phylogenetic analysis based on the same data is shown in Figure 1 and depicts the evolutionary relatedness of these proteins. The tree, drawn with true distances, indicates the evolutionary similarity of the different antitoxin genes.
Several web tools were used to search for conserved domains and the potential function of the target protein. The consensus of the predictions made by Pfam, NCBI-CDD and SUPERFAMILY suggested that the target protein contains a PhdYeFM_antitox superfamily domain and can be classified as an antitoxin Phd_YefM of the type II toxin-antitoxin system. The Pfam server predicted the Antitoxin Phd_YefM, type II toxin-antitoxin system domain at amino acid residues 1-74 with an e-value of 1.9e-21. The PhdYeFM_antitox superfamily was also found by the NCBI-CDD server at amino acid residues 2-81 with an e-value of 3.27e-20. The SUPERFAMILY server found the domain at amino acid residues 3-79 with an e-value of 2.49e-22. In this system, once the antitoxin proteins are bound to their toxin partners, they bind DNA via the N-terminus and inhibit the expression of the operons that contain the genes encoding the TA system [36, 37].
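One simple way to summarize the three domain predictions is to take the residue range on which all servers agree. The sketch below computes that intersection from the boundaries reported above. This is an illustrative post-processing step that is assumed here, not an analysis described by the authors, and the function name `consensus_range` is hypothetical.

```python
def consensus_range(ranges):
    """Return the (start, end) residue range shared by all inclusive ranges, or None."""
    ranges = list(ranges)
    start = max(s for s, _ in ranges)  # latest predicted start
    end = min(e for _, e in ranges)    # earliest predicted end
    return (start, end) if start <= end else None


# Domain boundaries as reported by each server in the text.
predictions = {
    "Pfam": (1, 74),
    "NCBI-CDD": (2, 81),
    "SUPERFAMILY": (3, 79),
}

core = consensus_range(predictions.values())
print(core)  # (3, 74)
```

The shared region spans residues 3-74, that is, 72 of the 86 residues, which is consistent with the three servers describing the same domain with slightly different boundaries.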

Figure 4:
The 3D structure superposition of the template structure and the predicted model is shown. In Figure 4A, the template 3CTO:A (red) and the target protein (cyan) are shown. The RMSD value for this superposition is 0.709 Å. Figure 4B shows the Z score of the model (target protein) and Figure 4C shows the Z score of the template (3CTO:A).
The MSA of different antitoxin proteins of Streptomyces and the target protein (gi|518540893|) is depicted in Figure 2. The secondary structures of these proteins are also included in this figure and are mostly conserved throughout the alignment, including the template. Homology modeling has become an important approach for the comparative modeling of unknown structures, and many tools are now available for it [38,39]. The structure of the target protein is unknown. It is therefore of interest to develop a homology model of the protein, as shown in Figure 3. Here, the template (PDB ID: 3CTO:A) is the M. tuberculosis YefM antitoxin, which has 62% sequence similarity with the target.
Quality assessment of the predicted 3D model was carried out with PROCHECK using the Ramachandran plot, in which 93.6% of the amino acid residues were within the favored region. The quality of the model was further checked with the QMEAN6 server, where the model was placed inside the dark grey zone and was considered a good model, with a QMEAN6 score of 0.608. The superimposition of the model and the template is shown in Figure 4A. The RMSD value obtained from the superimposition of the target and the template (3CTO:A) in UCSF Chimera was 0.709 Å, suggesting a reliable three-dimensional model. The Z score evaluates the global model quality and is used to check whether the input structure is within the range of scores usually found for native proteins of similar size. The Z score obtained from ProSA was -3 for the model (Figure 4B) and -3.44 for the template (Figure 4C), suggesting that the overall quality of the model is comparable to that of the template. The active site of the protein was analyzed using the CASTp server. The identification and characterization of functional sites on proteins have increasingly become an area of interest, because analysis of the active-site residues involved in ligand binding provides insight for the design of inhibitors. In this study, we analyzed the best active-site area of the protein as well as the number of amino acids involved (Figure 5). In most cases, class II antitoxins have two domains: a DNA-binding domain located in the N-terminal region and a toxin-binding domain located at the C-terminal end [40][41][42][43]. In our analysis, we found similar domain-based active sites in the target protein model. These are depicted using a spherical view in Figure 5.

Figure 5:
Identification of the active sites (spherical view) of the protein using the CASTp server is shown. The amino acid residues in the active sites are depicted in a zoomed view for better visualization. The N-terminal region starts at the left end (marked in blue), and the right end (red coil region) is the C-terminus.

Conclusion:
We describe a homology model, with a possible assigned function, of an uncharacterized protein from Streptomyces xinghaiensis. The analysis shows that the target protein is an antitoxin that acts in a type II toxin-antitoxin (TA) system. This TA system is composed of two genes encoding a labile antitoxin and a stable toxin. These data are useful for the annotation of the target protein.