Substitutional Analysis of Orthologous Protein Families Using BLOCKS

Orthologous proteins, formed by divergence of a parental sequence, perform similar functions under different environmental and biological conditions. Amino acid changes at locus-specific positions form hetero-pairs whose role in BLOCK evolution is yet to be understood. We analyze eight protein BLOCKs of known divergence rate to gain insight into the role of hetero-pairs in evolution. Our procedure, APBEST, uses a BLOCK-FASTA file to extract BLOCK-specific evolutionary parameters: the dominantly used hetero-pair (D), the usage of hetero-pairs (E), the ratio of non-conservative to conservative substitutions (R), the maximally diverse residue (MDR), and residue-specific (RD) and class-specific (CD) diversity. All these parameters show BLOCK-specific variation. The conservative nature of D points towards retention of the function of the BLOCK. While E sets the upper limit on the usage of hetero-pairs, the strong correlation of R with divergence rate indicates that the latter depends directly on non-conservative substitutions. The observation that MDRs, a measure of positional diversity, occupy very few positions in a BLOCK indicates that the accommodation of diversity is positionally restricted. Overall, the study extracts quantitative, multi-parametric details of the observed hetero-pairs of a BLOCK, which find application in evolutionary biology.


Background:
Homologous proteins that emerged through speciation events are structurally and functionally similar [1]. Evolution accommodates changes in these sequences. Amino acid changes are mostly achieved by substitution, deletion and insertion mechanisms [2], of which the first results from the accumulation of changes at locus-specific positions. Two types of substitution occur in evolution, conservative and non-conservative; most of the latter are deleterious and are eventually eliminated through purifying selection. Beneficial substitutions (both conservative and non-conservative) are retained in the sequence population and thus contribute to species differentiation [3]. Comparison of homologous sequences in databases reveals that sequences of closely related species (e.g. human and mouse) are more similar than those of distantly related species (e.g. human and bacteria). When homologous positions are fixed (column-wise in a BLOCK), each position shows characteristic features: some are invariant, while others carry conservative or non-conservative substitutions [3]. Henikoff and Henikoff (1992) pioneered the concept of a BLOCK of sequences. A BLOCK contains homologous sequences whose allelic positions are fixed. BLOCKs at different levels of sequence similarity were used to develop the different series of average BLOSUM matrices [4].
The concept of divergence rate has become an important tool for assessing mechanisms of diversification in sequence evolution [5]. For example, fibrinopeptide, a blood-clotting factor, has the highest divergence rate and histone, a DNA-binding protein, has the lowest [7,8]. The variability in these rates is related to the structural and functional requirements of these molecules [10]. A great deal of work is available on this aspect [6,7,9]. Understanding the mechanism of substitution largely involves comparing locus-specific positions [11] for their effect on physicochemical properties [12] and for identity [13] or similarity [14]. Similarity or identity scores are used for pair-wise comparison of sequences, which in turn helps in aligning them, finding relatedness [14], inferring functional significance and constructing phylogenetic trees [12,15]. Sequence-based studies also include the analysis of INDEL regions of alignments, an additional alternative to the substitution mechanism for understanding protein evolution [16]. While these studies have widened our understanding of different aspects of the molecular evolution of protein sequences, the governing principles of evolution for homologous protein families in relation to acquired substitutions (i.e. the usage of observed hetero-pairs) still remain an enigma. The fundamental question of how non-conservative substitutions are managed in these functionally similar proteins, when they are known to be deleterious [3,17], remains to be answered. In this work, we report results on substitution hetero-pairs (SHPs) for eight protein BLOCKs of known divergence rate [6,7] to work out a general model of evolution of homologous proteins. We use APBEST for efficient extraction of the BLOCK parameters (D, R, E, MDR, RD and CD).
The study then shows the application of these parameters to amino acid substitution; the roles of R and MDR are highlighted for the first time in this work. Overall, our study extracts evolutionary parameters whose knowledge has potential application in understanding the molecular evolution of homologous protein families.

Collection of Data
A total of eight homologous protein families (Ubiquitin, Glyceraldehyde-3-phosphate dehydrogenase (G3PDH), Lactate dehydrogenase (LDH), Acid-protease, Hemoglobin, Ribonuclease, Somatotropin and Kappa-casein) were taken in the present study. The families were chosen so that their divergence rates give a wide coverage. For example, the rate for Ubiquitin is 0.1% per 100/mYr and that for Kappa-casein is 33% per 100/mYr [6,7]. Family-specific sequences were obtained from the UNIPROT database [18] and then aligned using ClustalW2 [13] for each of the eight protein families.

Preparation of BLOCK FASTA files
BLOCK-FASTA files were prepared using the automated block preparation tool (ABPT) of PHYSICO2 [19]. As the method involves a manual step during the removal of partial sequences, care was taken to retain maximal sequence information in the BLOCK. The BLOCK-FASTA file thus produced was used as input for APBEST. An example input BLOCK-FASTA file can be downloaded from https://sourceforge.net/projects/apbest/files/. A flowchart from methodology to analysis using APBEST is shown in Figure 1.

Analyses of BLOCK FASTA file and extraction of evolutionary parameters
Analysis of BLOCK FASTA files was performed using the in-house procedure APBEST. The program is written in the AWK programming language and runs in a CYGWIN (UNIX-like) environment. It is efficient, error free and user-friendly. A compact itemized output (Items A through F) is redirected to an Excel file. It is freely available at http://sourceforge.net/projects/APBEST/ for academic users. The D, R, E, MDR, RD and CD parameters were computed from the relevant observed frequencies of substitution hetero-pairs (SHPs) (Figure 2). BLOCK positions undergo different types of substitution, and positions are also assessed by residue type. If a position contains only one type of amino acid, it is marked as invariant. If it is substituted, the positional substitutions are qualitatively assigned to categories such as hydrophobic-hydrophobic, hydrophilic-hydrophilic and hydrophobic-hydrophilic.
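The column-wise bookkeeping described above can be illustrated with a minimal Python sketch. It is not APBEST itself (which is written in AWK) and it does not reproduce the equations of Figure 2. It assumes that pairs are counted between all sequences within a column, that '-' marks a gap, and that a substitution is conservative when both residues fall in the same class of a hypothetical two-class (hydrophobic/hydrophilic) grouping. The names HYDROPHOBIC, column_pairs and analyse_block are introduced here only for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumed two-class grouping, for illustration only; the classes
# actually used by APBEST may differ.
HYDROPHOBIC = set("AVLIMFWPGC")
GAP = "-"  # assumed gap symbol


def column_pairs(column):
    """Count unordered residue pairs among all sequences in one column."""
    residues = [r for r in column if r != GAP]
    return Counter("".join(sorted(p)) for p in combinations(residues, 2))


def analyse_block(sequences):
    """Return (invariant positions, hetero-pair counts, ratio) for an aligned block.

    The ratio is non-conservative / conservative hetero-pair counts under
    the assumed grouping above (an R-like quantity, not the paper's Equation 3).
    """
    invariant, hetero = [], Counter()
    for pos, column in enumerate(zip(*sequences), start=1):
        if len(set(column)) == 1:
            invariant.append(pos)  # only one residue type at this position
            continue
        for pair, n in column_pairs(column).items():
            if pair[0] != pair[1]:  # skip homo-pairs such as "LL"
                hetero[pair] += n
    conservative = sum(
        n for p, n in hetero.items()
        if (p[0] in HYDROPHOBIC) == (p[1] in HYDROPHOBIC)
    )
    non_conservative = sum(hetero.values()) - conservative
    ratio = non_conservative / conservative if conservative else float("nan")
    return invariant, hetero, ratio
```

For a toy block such as ["MLVE", "MLIE", "MVIA"], position 1 is reported as invariant, and hetero.most_common(1) returns one most frequent hetero-pair, in the spirit of the parameter D.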

Results and discussion
To explore the evolutionary and functional significance of substitution hetero-pairs (SHPs) for a given homologous protein family, we analyzed eight homologous protein BLOCKs of known divergence rate [6,7] (Table 1: second column) using APBEST. A representative output is available at https://sourceforge.net/projects/apbest/files/. It provides details of six items (Items A through F). Items A to E give quantitative results on substitutions. Item F provides qualitative insight into positional mutations and quantitative insight into positional variability. The study is a first attempt to gain insight into the mechanism of substitution based on observed hetero-pairs and their diversity. It is worth noting that the BLOSUM series of matrices made use of observed hetero-pairs for the computation of log-odds scores [4]; however, their use in relation to the mechanism of substitution is rare. In the course of evolution, observed SHPs, the source of diversity in a BLOCK, emerge at the expense of homo-pairs in the ancestral protein. A total of 20 homo-pairs (diagonal) and 190 hetero-pairs (off-diagonal) participate in this process. BLOCK-specific frequency parameters such as R, E and N, and diversity parameters such as RD, CD and MDR, are presented in Table 1. Homo-pair and hetero-pair frequencies and types for a typical BLOCK are presented in Figure 3. Several points are noteworthy from Table 1 and Figure 3. First, type-specific hetero-pair frequencies differ among BLOCKs (Figure 3), as does the usage of hetero-pairs (E) (Table 1: column 5). Second, the dominantly used hetero-pair (D) is conservative in nature. Because individual SHP frequencies vary within a given BLOCK (Figure 3) and E also varies among BLOCKs (Table 1: column 4), we have plotted hetero-pair frequency against observed probability in Figure 4 (plots A1 and A2).
The figure shows that the overall distribution pattern and region-specific details of the observed hetero-pair types vary greatly among BLOCKs. In the low probability range, the observed hetero-pair frequency is very high and non-selective. Towards the higher probability range, the frequency and types of hetero-pairs become narrower and more selective. For example, in the highest probability range, the only observed hetero-pairs are LV and ED for plots A1 and A2 respectively (Figure 4). Both are conservative types: the former is hydrophobic and the latter is hydrophilic. In evolution, functionally similar sequences (a BLOCK of homologous/orthologous sequences) result from substitutions in the parental sequence. While conservation of specific parental sequence positions (such as the active site, binding site and protein core-forming region) is a prerequisite for functionality, evolution demands substitutions (i.e. formation of SHPs) at homologous positions for environmental adaptation. At the same time, lethal substitutions may lead to malfunctioning of proteins [3,17]. This raises the question of what the lower and upper limits of SHP usage are. To check this, we plotted E for the BLOCKs (Figure 4: plot B). In principle, E varies between 0 and 1 (Figure 2; Equation 4), indicating non-use and full use of SHPs respectively. However, the observed lower and upper limits of E are 0.3 and 0.7 respectively. Interestingly, Kappa-casein, which has the highest divergence rate (Table 1: column 1), shows a low E value (0.32). The same is true of Somatotropin. Thus, the parameter E is largely uncorrelated with the divergence rate.
Is there a BLOCK-specific parameter that correlates with divergence rate? In Figure 4 (C), R is plotted and fitted against the divergence rates [6,17]. R is the ratio of non-conservative to conservative substitutions (Figure 2; Equation 3). The plot shows that R is positively and linearly correlated with divergence rate (correlation coefficient of 0.93). Such a strong correlation indicates that R could be useful in the analysis of substitutions in orthologous protein families.
[Table 1 footnotes: (Marks, 1988; Dayhoff and Schwartz, 1978). LDH: Lactate dehydrogenase; G3PDH: Glyceraldehyde-3-phosphate dehydrogenase. Dominant pair: the hetero-pair type whose observed frequency is the maximum for a BLOCK.]
Many factors might affect the positional divergence or diversity of a BLOCK. These include positional (Shannon) entropy [20] and the position-specific physicochemical characteristics of BLOCKs. APBEST also computes some of these details, a few of which are listed in Table 2. Several points are noteworthy from the table.
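Positional (Shannon) entropy, named above as one factor that may affect positional diversity, can be computed per BLOCK column as in the following minimal Python sketch. It uses the standard Shannon formula in bits; whether APBEST reports entropy in this form is not stated in the text, and the function names column_entropy and block_entropies are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the residue types present in the column
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def block_entropies(sequences):
    """Entropy of every column of an aligned block (equal-length strings)."""
    return [column_entropy(col) for col in zip(*sequences)]
```

An invariant column has zero entropy, and the value grows as more residue types share a position more evenly, so high-entropy columns are candidates for the positionally restricted diversity discussed here.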

Conclusion
Analysis of eight protein BLOCKs of known divergence rate shows BLOCK-specific variation in the distribution pattern, in hetero-pair frequency and in parameters such as D, E, R, MDR, RD and CD. E is suitable for understanding the usage limits of hetero-pairs, and R is directly related to the divergence rate. Non-conservative substitutions act as a determinant of the divergence rate. MDRs contribute not only to class-specific variability (the CD parameter) but also to the divergence rate. They occupy only a limited number of BLOCK positions, which indicates that divergence uses a limited portion of the total width of the BLOCK. In other words, a BLOCK with high conservation can still have high divergence. Such a strategy of limited yet unique use of positions for divergence is postulated to allow the incorporation of other important mechanisms of substitution, such as conservation. Taken together, the procedure seems to have novel applications in the substitution analysis of orthologous protein families.