Fold combinations in multi-domain proteins

Domain-domain interactions in multi-domain proteins play an important role in combining the functions of individual domains into the overall biological activity of the protein. The functions of the tethered domains are often coupled, and hence only a limited number of domain architectures with defined folds are known in nature. It is therefore of interest to document the available fold-fold combinations and their preferences in multi-domain proteins. We analyzed all multi-domain proteins with known structures in the Protein Data Bank and observed that only about 860 fold-fold combinations are present among them. Analysis of multi-domain proteins represented in sequence databases identified 29,860 fold-fold combinations, which account for only ~2.8% of the theoretically possible 1,036,080 fold-fold combinations (1439C2 plus the 1,439 same-fold pairs). The observed preference for fold-fold combinations in multi-domain proteins is of interest in the context of acquiring multiple functions through structural adaptation by gene fusion.


©Biomedical Informatics (2019)

Including sequence data in the dataset increases the number of folds observed in multi-domain proteins to 1021 (71% of the known folds) and the number of fold-fold combinations to 29,860. Although integrating the sequence data increases the number of fold-fold combinations ~35 times, the number is still far lower than the number of theoretically possible fold-fold combinations (1,036,080 for 1439 folds). The analysis shows that only a few (~2.8%) of the possible fold-fold combinations are observed in multi-domain proteins. This limitation could be due to geometrical constraints associated with fold-fold associations in multi-domain proteins. Our analysis also highlights the incomplete representation of multi-domain proteins in the current version of the PDB, as only 504 of the 1439 folds described in SCOPe are observed.

Materials and Methods:
Dataset of multi-domain proteins from SCOPe - structure dataset:
We retrieved 3-D structures of proteins solved by X-ray diffraction or Nuclear Magnetic Resonance (NMR) from the Protein Data Bank (PDB) [18]. All PDB entries with no SCOP annotations available in SCOPe [22] were discarded. The 244,326 protein domains in SCOPe ver. 2.07 were filtered to remove genetic domains, discontinuous domains and His-tag domains (SCOPe code: l.1.1.1). Single-domain entries were then removed from the resulting 217,523 protein domains, leaving 71,016 protein domains that are part of multi-domain proteins. Fold-fold combinations in these multi-domain proteins were enumerated only if the domains interact. Two domains in a multi-domain protein were considered to interact if at least five residue-residue interactions were observed between them. The interactions between domains were identified using Protein Interaction Calculator (PIC) [23]. For the analyses here, we did not take into consideration how frequently a given fold-fold combination is observed; we only considered whether or not the association between two specific folds is observed.
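The enumeration rule described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input layout (a mapping from domain identifiers to SCOPe sccs strings, plus a list of residue-residue contacts given as pairs of domain identifiers), the function names and the toy data are all assumptions made for illustration. Only the threshold of five interactions and the presence/absence counting come from the text.

```python
from collections import Counter

MIN_CONTACTS = 5  # threshold stated in the text: at least five residue-residue interactions


def fold_of(sccs):
    """Return the fold part of a SCOPe sccs string, e.g. 'c.37.1.8' -> 'c.37'."""
    return ".".join(sccs.split(".")[:2])


def interacting_fold_pairs(domain_sccs, contacts):
    """Return the set of fold-fold combinations observed among interacting domains.

    domain_sccs: {domain_id: sccs string} (hypothetical input layout)
    contacts: iterable of (domain_id_a, domain_id_b), one entry per
              residue-residue interaction (hypothetical input layout)
    """
    per_domain_pair = Counter()
    for a, b in contacts:
        if a != b:  # ignore contacts within a single domain
            per_domain_pair[frozenset((a, b))] += 1

    fold_pairs = set()
    for pair, n_contacts in per_domain_pair.items():
        if n_contacts >= MIN_CONTACTS:
            a, b = tuple(pair)
            # Sort so that (X, Y) and (Y, X) are the same combination;
            # a set records presence only, not frequency.
            fold_pairs.add(tuple(sorted((fold_of(domain_sccs[a]), fold_of(domain_sccs[b])))))
    return fold_pairs


# Toy input, not real data: three domains and their residue-residue contacts.
toy_domains = {"d1": "c.37.1.1", "d2": "a.4.5.1", "d3": "c.37.1.8"}
toy_contacts = [("d1", "d2")] * 6 + [("d2", "d3")] * 3 + [("d1", "d3")] * 5
toy_pairs = interacting_fold_pairs(toy_domains, toy_contacts)
```

In this toy input the d2-d3 pair has only three contacts and is therefore not counted, while d1-d3 contributes a same-fold combination.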

Limited set of folds and fold-fold combinations:
There are 1439 folds reported in SCOPe version 2.07. In the dataset of multi-domain proteins of known structure, only 504 folds are observed for the constituent domains. On enumerating the fold-fold preferences in multi-domain proteins, only 860 fold-fold combinations are observed out of the 1,036,080 theoretical combinations calculated for 1439 folds (Figure 1A). The observed fold-fold combinations are listed in Table 1. Further analysis of the number of fold-fold combinations observed for each of the 504 folds shows that 430 (~79%) folds interact with fewer than five folds and the remaining ~21% interact with more than five folds (Figure 1B). The total number of combinations observed for the c.37 fold is 42 (Figure 1C).
Interestingly, domains corresponding to ~28% of the folds (155 out of 504) interact with domains of the same fold (dots along the diagonal in Figure 1A). Out of the 34,002 domain-domain interactions observed, 10,591 are between domains of the same fold. About 2,697 of these same-fold domain-domain interactions are not domain repeats (i.e. the domains belong to different families). The occurrence of such interactions between domains of the same fold but from different superfamilies supports the concept that these domains evolved through gene-duplication events.
Notably, 48 of these self-interacting folds interact only with domains of the same fold (boxed in blue in Figure 1D). Such folds are henceforth referred to as "solely self-interacting folds". Examples of solely self-interacting folds are the serum albumin-like fold (a.126) and the snake toxin-like fold (g.7) (Figure 1E). It should be noted that the observation of solely self-interacting folds is not due to a limited number of occurrences of these folds in the SCOPe database [22] (Table 2). For example, a.126 and g.7 are represented 488 and 269 times, respectively, in the SCOPe database.
These observations imply that certain folds prefer to tether with a domain of the same fold. Geometrical compatibility between the folds could be a defining feature of the folding of proteins that have interacting domains of the same fold. Many of the folds that interact with domains of the same fold are also observed to interact with many other folds (extreme right column bars in Figure 1D). This observation implies that certain folds have geometrical features that are compatible with interaction with many different folds. One of the many possible reasons for this is the number of functions associated with such folds, which is discussed in detail in the following section.
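The per-fold tallies discussed in this section (number of partner folds, self-interacting folds, and solely self-interacting folds) can be derived from a set of observed fold-fold combinations as in the sketch below. The function names and the placeholder fold labels `f1`, `f2` and `f3` are hypothetical and do not correspond to entries in Table 1. The definition of a solely self-interacting fold follows the text: a fold whose only observed partner is itself.

```python
from collections import defaultdict


def partner_map(fold_pairs):
    """Map each fold to the set of folds it is observed to combine with."""
    partners = defaultdict(set)
    for a, b in fold_pairs:
        partners[a].add(b)
        partners[b].add(a)
    return dict(partners)


def classify_folds(fold_pairs):
    """Return (partners, self_interacting, solely_self_interacting)."""
    partners = partner_map(fold_pairs)
    # A fold is self-interacting if it appears among its own partners.
    self_interacting = {f for f, p in partners.items() if f in p}
    # It is solely self-interacting if it has no partner other than itself.
    solely_self = {f for f in self_interacting if partners[f] == {f}}
    return partners, self_interacting, solely_self


# Placeholder fold labels, not real SCOPe folds or real results.
toy_pairs = {("f1", "f1"), ("f2", "f2"), ("f2", "f3")}
partners, self_interacting, solely_self = classify_folds(toy_pairs)
n_combinations = {fold: len(p) for fold, p in partners.items()}
```

Here `n_combinations` is the quantity that would be binned into "fewer than five" and "more than five" partner folds in a per-fold distribution such as the one described for Figure 1B.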

Fold-fold combinations and associated functions:
We investigated the potential biological implications of a substantial proportion of domain folds (~21%) preferring to interact with many domain folds. For this, we used the functional annotations from the SUPERFAMILY database [26] for the superfamilies listed under each of these domain folds [11,27]. The functional annotation of each superfamily was extrapolated to the fold, i.e. each fold is assigned all the functions that its constituent superfamilies have been assigned. In the SUPERFAMILY database, the function associated with the majority of the family members is assigned to the superfamily. It should be noted that the definition of function used in the SUPERFAMILY database refers to the most common role of the domain in proteins, in a particular pathway or in the cell/organism. The definition is a mix of the definitions of 'biological process' and 'molecular function' used in the Gene Ontology annotation [28]. It should also be noted that functional information could be retrieved for only 470 of the 504 folds. The majority of the folds follow the trend that the higher the number of functions associated with a fold, the higher the number of fold combinations (Figure 2A). Folds that have 20 or more fold-fold combinations are listed in Table 3, along with the number of functions associated with them. A few examples that do not follow this trend are also observed, i.e. folds that have fewer than 5 fold-fold combinations but are associated with a substantial number of functions. For example, the STAT-like fold (a.47) has only one fold-fold combination, with the common fold of diphtheria toxin/transcription factors/cytochrome f (b.2), but six associated functions (Figure 2B). The two folds share 4 out of 6 functions, suggesting that the tethered domain influences the functions of the domain folds.
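The extrapolation of superfamily-level functions to folds described above amounts to taking, for each fold, the union of the functions of its superfamilies. The sketch below assumes a hypothetical input mapping from superfamily sccs strings to a single function label; the labels, the placeholder class letters `x` and `y`, and the function names are illustrative and are not taken from the SUPERFAMILY database.

```python
from collections import defaultdict


def fold_functions(superfamily_functions):
    """Assign to each fold every function annotated for its superfamilies.

    superfamily_functions: {superfamily sccs, e.g. 'a.47.1': function label}
    (hypothetical input layout)
    """
    functions_by_fold = defaultdict(set)
    for superfamily, function in superfamily_functions.items():
        fold = ".".join(superfamily.split(".")[:2])  # 'a.47.1' -> 'a.47'
        functions_by_fold[fold].add(function)
    return dict(functions_by_fold)


def shared_functions(functions_by_fold, fold_a, fold_b):
    """Functions annotated for both folds of a fold-fold combination."""
    return functions_by_fold.get(fold_a, set()) & functions_by_fold.get(fold_b, set())


# Placeholder superfamilies and function labels, not real annotations.
toy_annotations = {"x.1.1": "regulation", "x.1.2": "metabolism", "y.2.1": "regulation"}
toy_functions = fold_functions(toy_annotations)
toy_shared = shared_functions(toy_functions, "x.1", "y.2")
```

Comparing `len(toy_functions[fold])` with the number of partner folds per fold gives the kind of relationship described for Figure 2A, and `shared_functions` corresponds to the overlap reported for the a.47 and b.2 example.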

Limited repertoire of fold-fold combinations:
The structural information available on multi-domain proteins is limited owing to the technical difficulties in crystallising multi-domain proteins and the size limitation on structure elucidation by nuclear magnetic resonance (NMR). Since multi-domain proteins make up only a minor fraction of the PDB (~35%), in contrast to the proportion of multi-domain proteins encoded in genomes (~75% on average), it is imperative to include sequence data, which are far more abundant, in our analysis dataset. For this, we mapped the information on sequence domains available in the Pfam database (ver. 31) [24] to SCOP domains (SCOPe version 2.07 [22]). Information on domain architecture was retrieved from the Pfam database [24]. We derived the corresponding SCOP domains for each domain architecture. For the analysis of the sequence dataset, it is assumed that all the domains in a multi-domain protein interact with one another. Although this adds fold-fold combinations that may not actually be observed, it gives an upper limit to the number of fold-fold combinations possible in nature. In total, 4,095 of the sequence domains documented in Pfam ver. 31 could be mapped to structural/evolutionary domains documented in the SCOP database using methods discussed previously by Kumar et al. [25]. Mappings where a Pfam domain corresponded to two different SCOP folds were discarded. The 374,319 multi-domain Pfam architectures were mapped to 29,860 SCOP fold-fold combinations (Figure 3A). These fold-fold combinations represent 1021 of the 1439 folds in the SCOPe database. Although the number of fold-fold combinations deduced from the sequence database is ~35 times higher than the number observed in the structure dataset, it is still far lower than the number of theoretically possible fold-fold combinations (1,036,080 fold-fold combinations for 1439 folds).
These observations strongly suggest that only a few (~2.8%) of the fold-fold combinations are preferred in multi-domain proteins. Thus, only a few fold-fold combinations have been selected by nature during the evolution of multi-domain proteins. The selection of fold-fold combinations could be due to geometrical constraints during evolution and/or functional constraints, such as coupling of functions and allosteric regulation, on the constituent domains of multi-domain proteins. In the sequence dataset, 202 folds have fewer than five fold-fold combinations (Figure 3B), with 31 folds in common between the sequence and structure datasets. Since ~35 times more fold-fold combinations are observed in the sequence dataset than in the structure dataset, this underscores the need for structure elucidation of multi-domain proteins to better understand their structure and functions.
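The sequence-based enumeration and the coverage figure can be sketched as follows. The sketch takes the figure of 1,036,080 to count unordered fold pairs including same-fold pairs, since 1439 × 1440 / 2 = 1,036,080. The function names, the Pfam-like identifiers and the choice to skip unmapped domains are assumptions made for illustration; the mapping procedure of Kumar et al. [25] is not reproduced here.

```python
from itertools import combinations
from math import comb


def unambiguous_mapping(pfam_to_folds):
    """Keep only Pfam families that map to exactly one SCOP fold."""
    return {pf: next(iter(folds)) for pf, folds in pfam_to_folds.items() if len(folds) == 1}


def architecture_fold_pairs(architecture, pfam_to_fold):
    """All fold-fold combinations in one architecture, assuming every
    domain interacts with every other domain (upper-limit assumption).
    Domains without a fold mapping are skipped (an assumption of this sketch)."""
    folds = [pfam_to_fold[pf] for pf in architecture if pf in pfam_to_fold]
    return {tuple(sorted(pair)) for pair in combinations(folds, 2)}


def theoretical_pairs(n_folds):
    """Unordered fold pairs, including same-fold pairs: C(n, 2) + n."""
    return comb(n_folds, 2) + n_folds


# Placeholder identifiers, not real Pfam families or SCOP folds.
toy_mapping = unambiguous_mapping({"PF_A": {"f1"}, "PF_B": {"f2"}, "PF_C": {"f1", "f3"}})
toy_pairs = architecture_fold_pairs(("PF_A", "PF_B", "PF_A", "PF_C"), toy_mapping)

total = theoretical_pairs(1439)      # 1,036,080, as quoted in the text
sequence_coverage = 29860 / total    # fraction observed in the sequence dataset
structure_coverage = 860 / total     # fraction observed in the structure dataset
```

In the toy example, `PF_C` is dropped because it maps to two folds, and the repeated `PF_A` domain contributes a same-fold combination.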

and it is one of the reasons for selection of certain domain-domain combinations during evolution, here we addressed the question of whether the geometrical compatibility required for domains to interact restricts the number of fold-fold combinations observed in multi-domain proteins. In addition, we asked whether certain folds are pre-disposed to occur in multi-domain proteins. For this, we first analysed all the available multi-domain proteins with known structure in the light of annotated domain folds. Interestingly, only 504 of the 1439 folds are observed in multi-domain proteins with known structures, and these 504 folds form 860 fold-fold combinations (Figure 4). Repeating the analysis with a much larger sequence dataset resulted in the observation of 29,860 fold-fold combinations out of the 1,036,080 theoretical fold-fold combinations possible for 1439 folds. In the sequence dataset, 1021 folds were observed as part of multi-domain proteins. These observations strongly suggest that certain folds are pre-disposed to occur in multi-domain proteins and that only a few fold-fold combinations are selected for during evolution. The selection pressure for these fold-fold combinations could be the geometrical constraints during the folding of proteins and/or the functional constraints on tethered domains to optimise the function/fitness cost of the protein during evolution. The observation of only 504 folds in the multi-domain proteins with known structures reflects the paucity of multi-domain proteins in structure databases such as the PDB and underscores the need for rigorous structural elucidation of multi-domain proteins.
About 21% of the folds are observed to form more than 5 fold-fold combinations. Detailed analyses show that these folds have multiple distinct functions associated with them. Another interesting feature is that a few folds that have fewer than 5 fold-fold combinations also have multiple distinct functions associated with them. Detailed analyses showed that such domains have overlapping functions with the tethered domain. Such overlap of functions with the tethered domain again stresses the importance of domain-domain interactions in the function of proteins.

Figure 4:
Flowchart representing the size of data considered for enumeration of fold-fold combination for the PDB dataset.

Conclusions:
The results discussed here suggest that a limited set of fold-fold combinations has been selected for multi-domain proteins during evolution. Our analyses also highlight the disparity between the number of multi-domain proteins for which structures have been elucidated and the number of multi-domain proteins encoded in the genomes of all life forms.