Functional prediction of hypothetical proteins in human adenoviruses

Assigning functional information to hypothetical proteins in virus genomes is crucial for gaining insight into their proteomes. Human adenoviruses are medium-sized viruses that cause a range of diseases. Their genomes encode proteins of uncharacterized function, known as hypothetical proteins. Using a wide range of protein function prediction servers, functional information was obtained about these hypothetical proteins. A comparison of the output from these servers revealed that some produced functional predictions for most of the human adenovirus hypothetical proteins, while others provided little or no functional information. The PFP, ESG, PSIPRED, 3d2GO, and ProtFun servers produced the most functional information regarding these hypothetical proteins.


Background:
Human adenoviruses (HAdVs) are double-stranded DNA viruses with genomes of around 35 kb [1]. These viruses cause a wide variety of diseases such as acute respiratory disease [2], keratoconjunctivitis [3], and gastroenteritis [4]. Therefore, HAdVs are important human pathogens. There are 7 species of human adenoviruses (species A-G), which are further divided into different strains/types. Typing is increasingly based on bioinformatics and genomics approaches owing to the availability of whole genome sequences, whereas earlier it was based on serum neutralization and hemagglutination inhibition [5]. In recent years, the availability of whole genome sequences of various organisms has increased dramatically due to next generation sequencing methods. For example, there was a 21% annual increase in the number of virus nucleotide base-pairs in GenBank and an overall annual increase in all GenBank nucleotide base-pairs of 43.6% in 2014 [6]. Many of the proteins in sequenced genomes are annotated as "hypothetical proteins." These are predicted proteins that have neither experimental evidence for their translation [7] nor a characterized function [8]. To better understand the genomes to which these proteins belong, it will be extremely helpful to assign functions to these hypothetical proteins. Even with their relatively small genome size compared with prokaryotes and eukaryotes, HAdVs possess several hypothetical proteins that need to be functionally annotated.
A myriad of computational approaches to protein function prediction have been developed, ranging from template-based methods, where a template with known function and structure is used to predict the function of a query sequence [9], to text mining methods [10] and computational intelligence methods [11]. In this study, we used several well-known protein function prediction servers to annotate HAdV hypothetical proteins. We found that some of these servers provided little to no information about the function of these HAdV hypothetical proteins, while others provided information that could potentially be used by wet bench biologists to further characterize these proteins experimentally. These results can serve as a guide to users, particularly wet bench biologists, as to which servers to use to predict the function of hypothetical proteins, particularly those belonging to viruses.

Methodology:
Twenty-eight hypothetical proteins across 11 HAdVs (Table 1; see supplementary material) were obtained from GenBank [6] by searching these genomes for the keyword "hypothetical". Three additional proteins not explicitly annotated as hypothetical (AAT97486, AAT97487, AAT97489 from HAdV-4) were included because their BLASTP hits to other hypothetical proteins suggest that they are very likely hypothetical as well. One of the 31 proteins, ADN06471 from HAdV-40/41, although annotated as hypothetical, is known to be expressed [12]. All 31 of these proteins were then submitted to several sequence-based protein function prediction servers. These were PFP [

Results:
The average length of the 31 hypothetical proteins from 11 different human adenovirus genomes was 124 amino acids, ranging from 58 to 224 (Table 1). The PFP server predicted functions for all 31 hypothetical proteins, some with high confidence, such as beta1-adrenergic receptor activity at 92% confidence for protein ACN88103 and purine nucleotide binding at 100% for protein AAW65500 (Table 2; see supplementary material). The ESG server was less successful than the PFP server, but still predicted functions for 26 of the 31 hypothetical proteins. For instance, GTPase activity and GTP binding were predicted at 99% confidence for AGF90820, and lyase activity and aldehyde-lyase activity were predicted at 89% confidence for ACN88103, as shown in Table 2.
ARGOT2 predicted functions for only 7 hypothetical proteins, such as hydrolase activity at 100% confidence for protein AGE46441 and transferase activity at 85% confidence for protein AAT97487 (Table 3; see supplementary material). Additionally, as shown in Table 3, BAR+ was unable to predict a function for any of the hypothetical proteins. Similarly, the dcGO server was unable to predict a function for any of the hypothetical proteins (table not shown). The PSIPRED server predicted functions for all 31 hypothetical proteins, such as GTP binding at 94% probability for AFH58045 and oxidoreductase activity at 99% probability for protein AAT97539 (Table 4; see supplementary material).
The fold recognition server Phyre2 identified potential folds in 8 of the 31 hypothetical proteins, as shown in Table 4. These folds include pyruvate kinase C-terminal domain-like at 17.70% confidence for AFH58048 and barrel-sandwich hybrid at 25.10% confidence for protein AAW65505. The ProtFun server predicted functions for 24 of the 31 proteins, along with categorical information concerning gene ontology and whether or not the protein was an enzyme (Table 5; see supplementary material). Protein AAT97531 was predicted to play a role in the cell envelope with 53% probability, to be an enzyme with 46% probability, and to be a structural protein with 27% probability. Additionally, protein AFH58048 was predicted to play a role in transport and binding with 74% probability, to be a non-enzyme with 82% probability, and to be a growth factor with 7% probability, as shown in Table 5. The homology modeling server SWISS-MODEL did not produce a structure for any of the 31 hypothetical proteins for use with the 3d2GO server. However, the structure-based 3d2GO server predicted a function for 22 of the 31 hypothetical proteins from structures proposed by MuFold (Table 6; see supplementary material). For example, 3d2GO predicted oxidoreductase activity at 29% confidence as a function for AAW33184 and transport at 61% confidence for protein AAW65506. The protein family server Pfam found no domains in any of the hypothetical proteins (Table 7; see supplementary material). In contrast, the protein domain prediction server SMART produced results for 25 of the 31 hypothetical proteins, with the majority containing low complexity regions, as shown in Table 7.
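The per-server counts reported above can be put side by side as a simple coverage tally. The following is a minimal sketch, not part of the original analysis: the counts are copied from the Results text (the SMART value follows the Results; the Discussion reports 26), while the names `predicted_counts` and `coverage_table` are illustrative assumptions. Note that the counts are not strictly comparable, since Phyre2 reports folds and Pfam and SMART report domains or regions rather than functions.

```python
# Number of the 31 hypothetical proteins for which each server returned a
# result, as stated in the Results text. Names below are illustrative only.
TOTAL_PROTEINS = 31

predicted_counts = {
    "PFP": 31,
    "ESG": 26,
    "ARGOT2": 7,
    "BAR+": 0,
    "dcGO": 0,
    "PSIPRED": 31,
    "Phyre2": 8,    # folds identified, not functions
    "ProtFun": 24,
    "3d2GO": 22,    # run on structures proposed by MuFold
    "Pfam": 0,      # domains, not functions
    "SMART": 25,    # mostly low complexity regions
}


def coverage_table(counts, total):
    """Return (server, count, percent) rows sorted by descending coverage."""
    rows = [(name, n, 100.0 * n / total) for name, n in counts.items()]
    # Sort by count (highest first), breaking ties alphabetically by name.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    for name, n, pct in coverage_table(predicted_counts, TOTAL_PROTEINS):
        print(f"{name:<8} {n:>2}/{TOTAL_PROTEINS}  {pct:5.1f}%")
```

Coverage alone says nothing about how specific or reliable a prediction is, so a tally like this is only a first step before the predicted terms and confidence values are examined.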

Discussion:
The PFP server predicted some form of "binding" in 25 of its 31 function predictions, and had an average prediction confidence of 81% (Table 2). Additionally, the ESG server made function predictions for 26 of the 31 proteins, averaging 50% confidence. ESG did not predict a function for all proteins as PFP did, but it provided more complete functional information, albeit with average to low confidence. For example, for protein AAT97533, the following functions were predicted at 32% confidence: 4-hydroxy-tetrahydrodipicolinate reductase; oxidoreductase activity; oxidoreductase activity, acting on CH or CH2 groups, NAD or NADP as acceptor; NADP binding; NAD binding; and NADPH binding (Table 2). Also, for protein ADN06471, the following functions were predicted at 53% confidence: N-acetyltransferase activity; transferase activity; transferase activity, transferring acyl groups; and transferase activity, transferring acyl groups other than amino-acyl groups.
ARGOT2 predicted functions for only 7 proteins, averaging 80% confidence (Table 3). The BAR+ and dcGO servers were both unable to predict a function for any of the proteins, as shown in Table 3. PSIPRED predicted a function for all 31 proteins, averaging 91% confidence (Table 4). The function "structural constituent of ribosome" was predicted for 7 of the 31 proteins. Also, some form of "binding" was predicted for 16 of the 31 proteins, ranging from "calcium ion binding" to "actin binding". While the PSIPRED predictions were rather vague, the confidence of the predictions remained high across all 31 hypothetical proteins. Additionally, the fold recognition server Phyre2 identified potential matching folds for only 8 of the 31 proteins, with an average confidence of 16.60%, where confidence is the probability that the query sequence and template are homologous (Table 4). Because Phyre2 relies on fold recognition, the information it provides gives users insight into the fold of a protein rather than directly into its function.
ProtFun provided a more thorough functional prediction for each protein for which it could predict a function. ProtFun made predictions for 24 of the 31 hypothetical proteins (Table 5). Not only did ProtFun predict functions for the 24 proteins, it also predicted whether each protein was an enzyme or non-enzyme, and its gene ontology (GO) category. Across the 24 predictions, function prediction confidence averaged 29%, enzyme/non-enzyme prediction confidence averaged 63%, and gene ontology prediction confidence averaged 17%.
SWISS-MODEL did not find templates for any of the proteins and therefore could not produce a structure to use as input to the 3d2GO server. However, MuFold predicted a structure for 22 of the 31 hypothetical proteins (Table 6). The structure-based server 3d2GO used those predicted structures to predict a function for the 22 proteins, as shown in Table 6. Average prediction confidence was 50%, and the server was able to predict a function from all structures proposed by MuFold. The function of protein AAW33433 was predicted to be RNA binding, ribosome, ribonucleoprotein complex, structural molecule activity, intracellular, translation, rRNA binding, and structural constituent of ribosome at 99% confidence, but aside from this thorough prediction, most other predictions were rather vague, such as "cytosol", "cytoplasm", and "membrane", as shown in Table 6.
While Pfam and SMART are not strictly protein function prediction servers, we wanted to investigate whether they could provide pertinent domain information for the HAdV hypothetical proteins. Pfam did not find any domains in these proteins. Further, while the SMART server did find matching regions for 26 of the 31 hypothetical proteins, the information provided by the server was minimal, as 23 of the 26 matches were "low complexity regions" and the other 3 were classified as "signal peptide regions" (Table 7).

Conclusions:
It is apparent from the results that no single server produces the most complete functional determination of these "hard" HAdV hypothetical proteins. The servers that provided the most information were PFP, ESG, PSIPRED, 3d2GO, and ProtFun. The servers that provided very little or no functional information were ARGOT2, BAR+, and dcGO. We believe that the best option for functional prediction of hypothetical proteins is to use a multitude of servers and analyze the results produced. Furthermore, we agree with Radivojac et al. [25] that these servers need to be improved in order to better predict protein function.
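One way to analyze the output of a multitude of servers is to collect all predictions per protein and view them side by side. The following is a minimal sketch under stated assumptions: the rows are a handful of examples quoted in the Results, the function name `merge_by_protein` and the `min_confidence` cutoff are hypothetical, and the servers' confidence values are defined differently (for instance, the Phyre2 value is a homology probability), so sorting by them is only a rough ordering.

```python
from collections import defaultdict

# (protein accession, server, prediction, confidence in %) for a few examples
# quoted in the Results; a real analysis would include every server output.
predictions = [
    ("ACN88103", "PFP", "beta1-adrenergic receptor activity", 92),
    ("ACN88103", "ESG", "lyase activity", 89),
    ("AAT97487", "ARGOT2", "transferase activity", 85),
    ("AFH58048", "Phyre2", "pyruvate kinase C-terminal domain-like fold", 17.7),
    ("AFH58048", "ProtFun", "transport and binding", 74),
]


def merge_by_protein(rows, min_confidence=0.0):
    """Group predictions by protein, highest confidence first.

    min_confidence is a hypothetical cutoff (in %) used only for illustration.
    """
    merged = defaultdict(list)
    for protein, server, function, confidence in rows:
        if confidence >= min_confidence:
            merged[protein].append((confidence, server, function))
    return {p: sorted(hits, reverse=True) for p, hits in merged.items()}


if __name__ == "__main__":
    for protein, hits in merge_by_protein(predictions).items():
        print(protein)
        for confidence, server, function in hits:
            print(f"  {server:<8} {confidence:5.1f}%  {function}")
```

Grouping in this way makes disagreements visible, such as the differing PFP and ESG predictions for ACN88103, which is where manual inspection or experimental follow-up would be most informative.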