A Comparison of Rosetta Stones in Adapter Protein Families

The inventory of proteins used in different kingdoms appears surprisingly similar across all sequenced eukaryotic genomes. Protein domains are the basic evolutionary units that form proteins. Domain duplication and shuffling by recombination are probably the most important forces driving protein evolution and hence the complexity of the proteome. While the duplication of whole genes and of domain-encoding exons increases the abundance of domains in the proteome, domain shuffling increases versatility, i.e. the number of distinct contexts in which a domain can occur. In this study we considered five important adapter domain families, namely WD40, KELCH, Ankyrin (ANK), PDZ and Pleckstrin Homology (PH), and compared their domain versatility, domain abundance and the domain sharing between them. We applied ecological statistics methods, namely Jaccard’s Similarity Index (JSI), Detrended Correspondence Analysis (DCA) and k-means clustering, to the domain distribution data. We found a high propensity for domain sharing between PH and PDZ, and a higher abundance of only a few selected domains in the PH, PDZ, ANK and KELCH families. The WD40 family, in contrast, showed high versatility, less redundant domain occurrence and less domain sharing. Hence, assigning functions to more orphan WD40 proteins will help in the identification of suitable drug targets.


Background:
The building blocks that make up the three-dimensional structures of proteins are called domains. Domains are often combined to create multi-domain or tethered proteins, a process called "domain tethering". In many vertebrate proteins, repeats of several adjacent domains from the same family can be found [1]. During evolution, domains have been duplicated, fused and recombined to produce proteins with novel structures and functions [2]. Comparisons of the proteomes of different organisms suggest that proteins have evolved increasingly complex functions primarily by acquiring pre-existing domains, resulting in new multi-domain architectures, whereas the emergence of an entirely new domain is a relatively rare event [3].
Proteins with two or more domains constitute the majority of proteins in all genomes, so the recombination of existing domains may be a major mechanism that modifies protein function and increases proteome complexity [3]. The combination or shuffling of domains increases what is termed the versatility of a domain superfamily, that is, the number of different partner domains to which domains of that superfamily are adjacent. The extent of duplication of different combinations varies widely and, in nature, depends on selection for the domain combination based on its function. Some of the pair-wise domain combinations that are highly duplicated also recur frequently with other partner domains [4].
We studied five families, namely WD40, ANK, KELCH, PDZ and PH. The study pool includes the WD40 and Ankyrin repeat (ANK) families, which are among the most frequently occurring repeats in eukaryotes [1]. KELCH and WD40 repeats consist of repeated sequence motifs with hallmark residues spaced at regular intervals. Significant diversity has been observed in both WD40 and KELCH repeat sequences: many differ in repeat length and repeat spacing, yet they adopt the same three-dimensional beta-propeller structure [5]. The PDZ family was included because PDZ domains are abundant protein interaction modules that often recognize short amino acid motifs at the C-termini of target proteins and are known to regulate many signalling pathways [6]. Pleckstrin Homology (PH) domains are known to have multiple roles but are predominantly involved in inositol phosphate signalling [7]. All of these families contain proteins with a large number of interaction partners, indicating their participation in the interactome [8].
ANK belongs to the all-alpha class; PDZ and PH belong to the alpha-beta class and are predominantly found at the cell membrane; KELCH and WD40 belong to the all-beta class of proteins. WD40 and KELCH have been recognised for their roles in transcription regulation and protein ubiquitination, respectively. ANK plays a role in various functions, from signal transduction to transcriptional regulation. ANK forms a folded solenoid structure (Figure 1a), the PH domain is made of two perpendicular antiparallel beta sheets followed by a C-terminal amphipathic helix, and WD40 and KELCH proteins form propeller structures. None of these domains has inherent catalytic activity; they act as modules for protein-protein interactions, thereby regulating the processes in which they are involved. The structural and functional diversity exhibited by the families is expected to be reflected in the sample sequences as well. Hence, the study pool spans topologically and functionally diverse proteins. We compared the domain tethering pattern in each family and the domain sharing between the families. Applying methods for ecological data analysis, we first analysed the similarity and then the difference between families in terms of domain composition. We sought to identify the most diverse and the most skewed families, and thereby the unique domain family among those taken in this study.

Construction of local repository:
The sequences representing the five families were retrieved using profile HMMs from the Pfam database according to the method prescribed by Krupa and Srinivasan [19]. Each sequence was manually curated for the gene name using the NCBI Gene and UniProt databases. Sequence redundancy was removed manually, and the domain repertoire was catalogued in an Excel file and converted to DB format using a PERL program. The database was built on WAMPP architecture with HTML-PHP scripts as the front end, MySQL as the back end and PERL as the query program. [...] and Domain C, the tethering number is 2. The frequency of domains occurring among the five families was also recorded.
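The tethering number described above can be illustrated with a minimal sketch. It assumes that the tethering number of a protein is the count of distinct partner domains that co-occur with the family's defining domain, which is consistent with the example of a tethering number of 2 for a protein carrying Domain B and Domain C. The function names, the list-based representation of a domain architecture and the example data are hypothetical and are not taken from the repository itself.

```python
from collections import Counter


def tethering_number(architecture, family_domain):
    """Count distinct partner domains tethered to family_domain in one protein.

    Assumption: repeats of the family domain itself are not counted as partners.
    """
    if family_domain not in architecture:
        return 0
    return len(set(architecture) - {family_domain})


def partner_frequencies(architectures, family_domain):
    """Count in how many proteins of a family each partner domain occurs."""
    counts = Counter()
    for arch in architectures:
        if family_domain in arch:
            counts.update(set(arch) - {family_domain})
    return counts


# Hypothetical domain architectures for three proteins of one family.
example_architectures = [
    ["WD40", "WD40", "DomainB", "DomainC"],
    ["WD40", "DomainB"],
    ["WD40", "WD40"],
]

numbers = [tethering_number(a, "WD40") for a in example_architectures]
frequencies = partner_frequencies(example_architectures, "WD40")
```

Here `numbers` holds one tethering number per protein and `frequencies` records how often each partner domain occurs within the family, which corresponds to the kind of frequency data recorded for the five families.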

Data analysis:
For similarity analysis, a matrix was created with domains as rows and families as columns, giving a score of 1 for the presence and 0 for the absence of a domain in the family under consideration. Jaccard's similarity index (JSI) was calculated based on the method proposed by Real and Vargas, 1996 [9]. A cluster analysis was done using Euclidean distances and Ward's method.
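The similarity calculation can be sketched as follows. This is a minimal illustration that assumes the standard form of Jaccard's index, J = C / (A + B - C), where A and B are the numbers of domains present in two families and C is the number present in both. The family names, domain labels and function names are hypothetical, and the study itself used the PAST programme rather than this code.

```python
from itertools import combinations


def jaccard_index(domains_a, domains_b):
    """Jaccard similarity of two domain sets: shared domains / all domains."""
    union = domains_a | domains_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(domains_a & domains_b) / len(union)


def jaccard_matrix(domains_by_family):
    """Pairwise JSI for every pair of families, keyed by sorted name pair."""
    return {
        (f1, f2): jaccard_index(domains_by_family[f1], domains_by_family[f2])
        for f1, f2 in combinations(sorted(domains_by_family), 2)
    }


# Hypothetical presence data: each set lists the domains scored 1 for a family.
toy_families = {
    "FamilyA": {"D1", "D2", "D3"},
    "FamilyB": {"D2", "D3", "D4"},
    "FamilyC": {"D5"},
}

scores = jaccard_matrix(toy_families)
```

Representing each family as the set of domains scored 1 is equivalent to the presence/absence matrix: FamilyA and FamilyB share two of four domains in total, while FamilyC shares none with either.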
Another matrix was created indicating the frequency of domain occurrences in the respective families (Table 2). Detrended Correspondence Analysis (DCA) of the matrix was done to identify the unique family. All statistical analyses were done using the PAST programme [22].
Domains impart structure and function to a protein. Due to exon shuffling and recombination, many domains tend to occur in a single polypeptide. Such tethered proteins, also called Rosetta stone proteins, are hallmarks of protein evolution [10]. The analysis of domain tethering has been described in the methods section. The total number of tethered proteins and the total number of tethered domains are shown in Figure 3. It is evident that, though the number of tethered proteins is high, the tethering number is low in the case of PH and KELCH. The ratio of tethered protein to tethered domain was found to be 0.65, 0.63, 0.23, 0.11 and 0.66 for the WD40, PDZ, PH, KELCH and ANK families respectively. This indicates that PH and KELCH may contain repeated combinations of the same domains, which may give rise to paralogs with different functions. Alternatively, there may be many isoforms of the same protein with different tissue-specific expression. In either case, because the same domains occur repeatedly, the function of the family will also be skewed towards a few biological processes.
Protein domains may either be shared between different protein families or be strictly confined to one particular family. All the families under consideration show both situations (Figure 3). The WD40 family shows the highest percentage of unique domains, that is, domains that occur in the WD40 family and are confined to it. The PDZ family, on the other hand, has the highest percentage of shared domains. This domain distribution is similar to classical biodiversity data, in which similarity and dissimilarity are examined to decide the extent of diversity of the flora and fauna of a particular area, and for which JSI has been widely used. JSI provides the association between different entities in a data distribution. The scores in the matrix represent the level of association between the families under consideration: the higher the score, the higher the number of domains shared between two families. JSI has been used earlier in various cases, such as clustering protein similarity networks [12], domain architecture comparison for multidomain homology [13] and automatic classification of protein structures [14].
The highest level of domain association is between the PDZ and PH families, followed by PH and ANK (Table 1). However, the PH and PDZ families were not found to share domains with KELCH. The maximum score for the association of KELCH with any other family is 0.028, indicating that KELCH has limited domain sharing. WD40 has a maximum value of 0.076 with ANK and a minimum of 0.024 with KELCH. It is clear that there is a certain propensity in the distribution of domains between the families. When the families were clustered using the JSI scores (Figure 2), the WD40 and KELCH families clustered as separate outgroups, substantiating the observations in the JSI matrix. This hints at the uniqueness of the WD40 and KELCH families, which have limited domain sharing tendencies. Domain distribution patterns were then examined for the shared domains, along with their frequencies of occurrence across the five families considered in this study (Table 2). A detrended correspondence analysis, which is one of the community ordination methods, was done for the dataset. Detrended correspondence analysis (DCA) is a multivariate statistical technique used to find the main factors or gradients in large data sets. DCA is an iterative algorithm that has proven to be a highly reliable and useful tool for data exploration and summary in community ecology [15].
Observations from the DCA plot (Figure 5), domain sharing data (Figure 4), Jaccard's similarity index [...] The WD40 family has high versatility because it comprises the highest percentage of unique domains (Figure 3). It is evident that WD40 tends not to share common domains with other families and that it shows less domain redundancy, unlike KELCH (Figure 4 and Table 3). KELCH has a high abundance of the BTB domain but lower versatility, suggesting functional skewness (Table 2). Hence it can be concluded that KELCH is skewed towards a specific function, whereas WD40 performs different functions but with a set of domains dedicated to the WD40 family only. Greater domain shuffling results in a higher tethering number, which imparts functional versatility [11] to the domain family; without it, the versatility is lost, which may skew the family towards certain biological functions. It is clear that some domains are shared specifically between two families, while others are restricted to a single family. Taken together, the above data and the behaviour of the families make it clear that the WD40 family is unique with respect to domain tethering behaviour.
The PDZ and PH families were found to share their domains most often (59%), followed by PH and ANK (43%) ([...] and 4% respectively, and their JSI scores are 0.028 and 0.024 respectively. This may be due to very few tethered domains (Figure 3) and fewer shared domains (Table 3 and Figure 4). Domain sharing in pairs involving the WD40 family is also low (Figure 6). For example, the domain sharing is 11% (WD40 and PDZ), 24% (WD40 and PH) and 26% (WD40 and ANK), with JSI scores of 0.038, 0.068 and 0.076 respectively. Unlike KELCH, the WD40 family has a six-fold larger domain repertoire, yet it still contributes very little to domain sharing, accounting for only 15% (Figure 4). This shows that there is versatility in the domain repertoire of the WD40 family, but that most of its domains are restricted to the WD40 family only.
WD40 is one of the top 10 most promiscuous domains in eukaryotes. It has been reported to be among the most highly connected domains and is therefore likely to have multiple potential functions rather than being restricted to any particular functional branch. Because of its high versatility, WD40 is regarded as one of the top 10 members of the highly social domain club, meaning that it occurs in proteins with multiple distinct domains [3]. WD40 domains appear to have been favoured in nature because they possess greater structural symmetry than other abundant domains that predominate in intracellular processes. The reason for the symmetry is the regular repetition of super-secondary structure elements. The beta-propeller scaffold tolerates long insertions, deletions or multiple single amino acid substitutions that evolve a new binding site for an interaction partner. In the WD40 propeller scaffold, unlike in TIM barrel domains, the secondary structures do not interlock, which allows mutations to occur without a drastic effect on the structure of the scaffold. PDZ, PH and SH3 are complex structures with no such system of regular repeats, so drastic changes in the protein sequence are more likely to disrupt the overall structure [17]. In isolation, WD40 domains have been challenging to characterize and study, probably because they are often subunits of larger assemblies, but also because, in most cases, they lack measurable intrinsic catalytic activity. Whatever the reason for the adaptability of WD40 domains as scaffolds, they clearly represent one of the most important domain families for most critical cell processes [18]. This emphasizes that, owing to the diversity of their domain structure and composition, WD40 proteins are able to interact with various proteins, forming a high-degree protein interaction network and regulating many different biological processes, yet there are limited publications on members of this family.
Many WD40 proteins are still largely regarded only as WD repeat (WDR) containing proteins. The gene ontology of WDRs has not been clearly established in any knowledge base. A deeper understanding of their structures, interactions and functional diversity will be crucial for our understanding of detailed cellular processes, and ultimately might provide new means to tinker with biological functions via synthetic and systems biology approaches, which in turn may open new avenues for identifying potential biomarkers.

Conclusion:
In the present study, we have compared five adapter protein families with respect to their domain composition and the distribution of domains among them. We found that the ANK, PH and PDZ families share their domains more often than WD40 and KELCH. We applied ecological tools to the domain distribution data, and the analysis helped us to determine how unique the families are with respect to domain composition. We found a high degree of redundancy in domain composition in all families except WD40, and we also found that WD40 has the highest versatility. In light of the limited study of WD40 proteins, evident from the limited number of publications in the public domain, we propose