A decision tree model for the prediction of homodimer folding mechanism.

The formation of protein homodimer complexes for molecular catalysis and regulation is fascinating. The homodimer formation through 2S (2 state), 3SMI (3 state with monomer intermediate) and 3SDI (3 state with dimer intermediate) folding mechanism is known for 47 homodimer structures. Our dataset of forty-seven homodimers consists of twenty-eight 2S, twelve 3SMI and seven 3SDI. The dataset is characterized using monomer length, interface area and interface/total (I/T) residue ratio. It is found that 2S are often small in size with large I/T ratio and 3SDI are frequently large in size with small I/T ratio. Nonetheless, 3SMI have a mixture of these features. Hence, we used these parameters to develop a decision tree model. The decision tree model produced positive predictive values (PPV) of 72% for 2S, 58% for 3SMI and 57% for 3SDI in cross validation. Thus, the method finds application in assigning homodimers with folding mechanism.


Background:
Homodimers play an important role in catalysis and regulation.The formation of homodimer interface is structurally intriguing [1].The mechanism of formation of such homodimer interfaces is further appealing.Structures for 47 homodimers with known folding information are now available as given in Hence, the future definition of homodimers as drug targets is evident.Therefore, it is important to understand both homodimer association and its folding mechanism of formation.A number of attempts have been made to relate homodimer structures to folding mechanism to decipher folding specific structural features [50][51][52][53][54].We recently documented the relationship between structural features describing homodimer folding mechanism [55].Nevertheless, folding information on homodimers is far less than the known number of homodimer structures stored in databases [1].Therefore, it is of interest to predict folding mechanism to known homodimer structures.We created an improved dataset of 47 homodimer structures from PDB with known folding mechanism to glean parameters and to develop models for homodimer folding mechanism prediction given their structures.We then use these parameters to design a decision tree model to classify homodimer structures with unknown folding mechanism.

Methodology: Dataset:
We created a dataset of 47 homodimer structures from PDB with known folding information taken from respective literature (Table 1 in supplementary material).The dataset consists of twenty eight 2S, twelve 3SMI and seven 3SDI structures.Table 1 (see supplementary material) also provides information on structural parameters such as monomer length (ML), interface area (B/2) and interface to total residue (I/T) ratio for each structure.The structural features in the dataset are summarized in Table 2 (see supplementary material).

Monomer length (ML):
Monomer length (ML) refers to the protein length of monomers forming the homodimer complex.The distribution of 2S, 3SMI and 3SDI with ML is shown in Figure 1a.The figure illustrates the minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset.The length of 2S proteins are found in the range of 45 to 271, 3SMI in the range of 72 and 381, while 3SDI between 90 and 835.There is some degree of ML overlap between the three categories of homodimers.

Interface area (B/2):
Interface area (B/2) is defined as the change in accessible surface area (delta ASA) when going from monomer state to dimer state during complex formation.Accessible surface area (ASA) is calculated using the software SURFACE RACER 5.0 [56] using the algorithm described by Lee and Richard

Interface to total residue (I/T) ratio:
It is the ratio between the numbers of interface residues per monomer (residues involved in homodimer interactions at the interface) to the total number of residues in monomer protein.
Interface residues are identified using ASA calculation described in previous section.The distribution of 2S, 3SMI and 3SDI with I/T ratio is shown in Figure 1c.The figure shows the graphical representation of homodimers to I/T ratio.Here, the 3SDI proteins lie in the range of 5 to 50%, and 3SMI in the range of 9 to 44%, while the 2S proteins lie in the range of 6 to 80%.

Decision tree model:
A decision model is a clear logical model that can be easily understood by persons who are not mathematically inclined.The decision tree model is a classification tree to classify the target variable (folding mechanism in this case) based on the predictor variables (ML, B/2 and I/T) described in previous sections.The cumulative frequencies of the three predictors (ML, B/2 and I/T) were used to decide the values in the logical conditions of the decision tree.A flowchart describing the decision tree model is illustrated in Figure 3.The model checks for ML, I/T and B/2 for each known homodimer structures to assign their folding mechanism using human expert cut-off values as shown in Figure 3.

Validation:
An internal cross validation is performed for 47 homodimers in Table 1 using the decision tree model described above.The results of the validation using true positive (TP), false positive (FP) and positive predictive value (PPV) is given in Table 5. PPV (%) is defined as TP/(TP+FP)*100.

Assignment dataset:
We created a dataset of 149 homodimers with unknown folding information for prediction and assignment of folding mechanism using structural parameters (Table 3 in supplementary material).The structural features in the dataset are summarized in Table 4 (see supplementary material).A classification of 149 homodimers into three target categories using the decision tree model is given in Table 6 (see supplementary material).   .However, no attempt has been made to predict their folding mechanism given their structures in complex state.Here, we describe a novel decision tree model using predictors ML, B/2 and I/T to predict folding mechanism (target variable) given their structures in complex state.
The decision tree model is developed based on the prevalence of weight associated with these predictors in a dataset of structures with known folding data (Figure 1).The distribution of its percent cumulative frequency of predictor variables in the datasets are given in Figures 2. Figure 2a gives percent cumulative frequency of 2S, 3SMI and 3SDI for ML.More than 90% of 2S lie when ML <= 250.
When ML = 250 only about 15% of 3SDI and 60% of 3SMI are covered.Hence, ML <=250 was selected as a decisive condition in the development of the model.Figure 2b gives percent cumulative frequency of 2S, 3SMI and 3SDI for I/T ratio.About 90% of 3SMI and 3SDI lie when I/T <= 25%.When I/T <= 25%, only about 30% of 2S is covered.Therefore, I/T <=25% was selected as a decision condition in the development of the model.Figure 2c gives percent cumulative frequency of 2S, 3SMI and 3SDI for B/2.When B/2 <= 1500, about 70% 3SMI, 50% 2S and 30% 3SDI are covered.So, B/2 <= 1500 was selected as a decision condition in the development of the model.Thus, percent cumulative frequency values for predictors are used in the design and development of the decision tree model (Figure 3).The conditional values of the predictor variables are selected based on their biased cumulative frequency in the target categories (datasets).The decision tree model checks for predictor values within defined conditional values for multiple variables in a subsequent manner sequentially so as to reach the respective nodes to predict and assign target variables.
The decision tree model was applied to classify the dataset of 47 homodimers (with known folding data) in a cross validation experiment.The model produced the positive predictive values (PPV) 71.4%, 58.4% and 57.1% for 2S, 3SMI and 3SDI, respectively (Table 5 in supplementary material).We then extended the application of the decision tree model to a dataset of 149 homodimers with no folding data known.The model was able to assign folding data to 132 (88.5%) of 149 structures to predicted target variables with only 17 structures unable to classify (Table 6 in supplementary material).This predicted data serves a framework to understand their folding mechanism given their structures.It should be noted that these predicted mechanism should be verified using denaturation experiments.

Figure 1 :
Figure 1: Distribution of 2S, 3SMI and 3SDI for ML, B/2 and I/T is shown.(a) An illustration of the minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented.The X -axis represents monomer length.The overlap regions are shown horizontally.2S proteins range from 45 to 271, 3SMI range from 72 to 381 and 3SDI range from 90 to 835.(b) An illustration of the minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented.The X -axis represents interface area.The overlap regions are shown horizontally.2S proteins range from 156 to 2507, 3SMI range from 309 to 2332 and 3SDI range from 1351 to 2317.(c) Distribution of 2S, 3SMI and 3SDI for I/T ratio.An illustration of the minimum and maximum limits of I/T for 2S, 3SMI and 3SDI homodimers in the dataset is presented.The X -axis represents I/T ratio.The overlap regions are shown horizontally.2S proteins range from 6 to 80, 3SMI range from 9 to 44 and 3SDI range from 5 to 50.It should be noted that there is no Y-axis variable defined in this case.However, a Y-axis is shown for convenience of visualization.

Figure 2 :
Figure 2: Percent cumulative frequency of 2S, 3SMI and 3SDI for ML, I/T and B/2 is given.(a) The distribution of the cumulative frequency of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented.About 90% of 2S, 60% of 3SMI and 15% of 3SDI are covered when ML <= 250.Hence, ML <=250 was selected as a decision condition in the development of the model.(b) The distribution of the cumulative frequency of I/T ratio for 2S, 3SMI and 3SDI homodimers in the dataset is presented.About 30% of 2S and 90% of 3SMI and 3SDI are covered when I/T <= 25%.Hence, I/T <=25% was selected as a decision condition in the development of the model.(c) The distribution of the cumulative frequency of interface area for 2S, 3SMI and 3SDI homodimers in the dataset is presented.About 50% of 2S, 70% of 3SMI and 30% of 3SDI are covered when B/2 <= 1500.Hence, B/2 <= 1500 was selected as a decision condition in the development of the model.

Figure 3 :
Figure 3: A flowchart describing the decision tree model is given.The decision tree model checks for predictor values within defined conditional values for multiple variables in a subsequent manner sequentially so as to reach the respective nodes to predict and assign target variables.

Table 1 (see supplementary material
[55]is is time consuming, laborious and tedious.The number of homodimer structures with unknown folding mechanism is substantial[1].Therefore, it is of interest to predict homodimer folding mechanism given their 3dimenisonal structures.A number of studies have been documented to relate folding and structural features [50-54].We recently described the trends in parameters (monomer size, interface residues, interface area, hydrophobicity factor, hydrophilic residues and charged residues) for distinguishing 2S from 3S proteins[55]

Table 1 :
Dataset of 47 homodimer structures from PDB with known folding information

Table 2 :
The minimum, maximum, mean and standard deviation value of the predictor variables is given for 47 homodimers.

Table 3 :
An assignment dataset of 149 homodimers with unknown folding data.

Table 4 :
The minimum, maximum, mean and standard deviation value of the predictor variables is given for 149 homodimers of the assignment dataset.

Table 5 :
Cross validation experiment positive predictive values (PPV) of the decision tree model when applied to the dataset of 47 homodimers.

Table 6 :
Classification results of the assignment dataset.