Comparative analysis of prokaryotic and eukaryotic transcription factors using machine-learning techniques

DNA-protein interactions play vital roles in the central dogma of molecular biology. Proper interactions between DNA and proteins initiate biological processes such as transcription, translation, and replication. However, the mechanisms of these well-known processes vary between prokaryotic and eukaryotic organisms, and their exact molecular mechanisms remain unknown. Therefore, it is of interest to report a comparative estimate of the properties of DNA binding proteins from prokaryotic and eukaryotic organisms. We analyzed sequence-based features, such as the frequencies of amino acids and amino acid groups, in the proteins of prokaryotes and eukaryotes using statistical measures. We document the general pattern of differences between these DNA binding proteins as a basis for developing a prediction system that discriminates between prokaryotic and eukaryotic proteins.

However, our understanding of the molecular details of transcription in pro- and eukaryotic organisms is still in its infancy. In this work, we analyzed different aspects of the transcription factors from pro- and eukaryotic organisms. For comparison, we used the amino acid sequences of DNA binding proteins (DBPs) and transcription factors (TFs) from UniProt [4].
We compared the TFs using sequence information only, as sequence data are more abundant than structural data [5]. The main motivation for this work is to discriminate between the different classes of microorganisms. We put forward, for the first time, some plausible discriminatory features between the TFs from the different branches of organisms. Interestingly, the TFs from pro- and eukaryotic organisms can be distinctly identified by amino acid frequency analysis. We also assessed, using machine-learning techniques, the statistical efficacy of these features in discriminating between the different classes of microorganisms. These features may further be used to build a prediction system that discriminates between the different branches of organisms.

Methodology: Data collection:
We downloaded the sequences of DNA binding proteins (DBPs) from UniProt [4]. We collected the amino acid sequences of the DNA binding proteins from 1012 prokaryotic organisms and 1425 eukaryotes. We divided our dataset into two groups: the larger group containing the whole DBP data, and a smaller subgroup containing the transcription factor (TF) sequences, which were also present in the DBP dataset. Data collection was carried out using an in-house tool written in Python (Figure 1).

Redundancy check of the dataset:
The raw dataset may be biased by multiple copies of a single sequence. We therefore performed a redundancy check by calculating a distance matrix using the Hamming distance [6, 7]. This check eliminated the redundant entries and yielded a clean dataset. The clean dataset contained 270 DBP sequences from prokaryotes and 347 DBP sequences from eukaryotes; among them were 92 TF sequences from prokaryotes and 182 TF sequences from eukaryotes. As the eukaryotic DBP sequences outnumbered the prokaryotic DBP sequences, we split the eukaryotic DBP sequences into two sets. Eukaryotic DBP set 1 contained sequences 1 to 270 and eukaryotic DBP set 2 contained sequences 78 to 347, so that each set had the same number of amino acid sequences as the prokaryotic DBP set. For the same reason, the eukaryotic TF dataset was split into two sets: TF set 1 contained sequences 1 to 92 and TF set 2 contained sequences 91 to 182. Thus all the datasets were balanced. The distribution of the dataset is shown in Table 1. The list of UniProt IDs used in these datasets is given in Table S1 (see Supplementary data).
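The redundancy check and the balanced split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the in-house tool itself: the function names (`hamming`, `remove_redundant`, `balanced_windows`) are hypothetical, the redundancy threshold of 0 (exact duplicates only) is assumed, and because the Hamming distance is strictly defined for equal-length strings, the sketch counts the length difference between two sequences as additional mismatches.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two sequences.

    Assumption: positions beyond the end of the shorter sequence are
    counted as mismatches, since the source does not say how sequences
    of unequal length were compared.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = hamming(seqs[i], seqs[j])
    return d


def remove_redundant(seqs, threshold=0):
    """Keep a sequence only if it is farther than `threshold` from every
    sequence already kept (threshold=0 removes exact duplicates)."""
    d = distance_matrix(seqs)
    kept = []
    for i in range(len(seqs)):
        if all(d[i][k] > threshold for k in kept):
            kept.append(i)
    return [seqs[i] for i in kept]


def balanced_windows(items, size):
    """Return the first `size` items and the last `size` items."""
    return items[:size], items[-size:]


# Toy sequences (invented for illustration, not taken from the dataset).
clean = remove_redundant(["MKTAY", "MKTAY", "MKTAV", "MKT"])
print(clean)

# Positions 1..347 split into two windows of 270 entries each.
set1, set2 = balanced_windows(list(range(1, 348)), 270)
print(set1[0], set1[-1], set2[0], set2[-1])
```

With 347 entries and a window size of 270, the second window starts at position 78, which matches the set boundaries reported above; the same call with 182 entries and a window size of 92 gives a second window starting at position 91.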

Frequency Calculation:
After preparing these clean datasets, we calculated amino acid and amino acid group frequencies. We categorized the amino acids into Hydrophobic (HB), Hydrophilic (HI), Charged (CR), Basic (BS) and Acidic (AC) groups [8]. The frequency calculation was done to normalize the dataset, and was carried out entirely with an in-house Python script. We calculated the frequencies of amino acids and amino acid groups separately for the DBP and TF datasets, and separately for eukaryotic set 1 and eukaryotic set 2.
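A frequency calculation of this kind might look like the sketch below. It is not the in-house script. The group memberships in `GROUPS` are an assumed textbook-style assignment and may differ from the scheme of [8]; the function names are hypothetical; and the sketch assumes that frequencies are relative counts per sequence, with non-standard residue codes ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed group membership, for illustration only; the grouping used
# in [8] may assign some residues differently.
GROUPS = {
    "HB": set("AVLIMFWPG"),  # hydrophobic
    "HI": set("STNQCY"),     # hydrophilic (uncharged polar)
    "CR": set("DEKRH"),      # charged
    "BS": set("KRH"),        # basic
    "AC": set("DE"),         # acidic
}


def amino_acid_frequencies(seq):
    """Relative frequency of each of the 20 standard amino acids.

    Dividing by the number of standard residues normalizes for sequence
    length; other characters (e.g. X) are ignored.
    """
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def group_frequencies(seq):
    """Relative frequency of each amino acid group."""
    freqs = amino_acid_frequencies(seq)
    return {name: sum(freqs[a] for a in members)
            for name, members in GROUPS.items()}


# Toy sequence (invented for illustration).
print(amino_acid_frequencies("AAKD")["A"])
print(group_frequencies("AAKD"))
```

Because the groups overlap (every basic or acidic residue is also charged), the group frequencies are not expected to sum to 1, whereas the 20 amino acid frequencies are.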

Machine learning using WEKA:
We used the overall amino acid frequencies and amino acid group frequencies of the prokaryotic and eukaryotic organisms as features to distinguish between prokaryotic and eukaryotic organisms using the tool WEKA [9]. WEKA is a collection of machine learning algorithms that is commonly used for data mining problems in bioinformatics. We used the support vector machine (SVM) algorithm, as implemented in WEKA's SMO classifier [10], with 10-fold cross-validation. The 10-fold cross-validation is a default test option of WEKA. It randomly partitions the dataset into 10 folds; in each of 10 rounds, nine folds are used for training and the remaining fold is used for testing, so that every sequence is tested exactly once. We prepared the input dataset for WEKA using the data distribution described in Table 1.
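The partitioning behind 10-fold cross-validation can be illustrated with a short standard-library sketch. WEKA performs this step internally, so the code below is not part of the reported workflow; the function name `k_fold_indices`, the fixed seed, and the example size of 540 samples (270 prokaryotic plus 270 eukaryotic sequences, assuming one balanced DBP dataset is pooled in this way) are illustrative assumptions, and the split is not stratified by class.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    The indices are shuffled once and dealt into k folds; each fold
    serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


# Assumed example size: 270 + 270 sequences in one balanced dataset.
for round_no, (train, test) in enumerate(k_fold_indices(540), start=1):
    print(round_no, len(train), len(test))
```

A classifier would be trained on `train` and evaluated on `test` in each round, and the ten results averaged into one accuracy estimate.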

Figure 1:
Flowchart of the in-house Python tool.

Results: Amino acids and amino acid group frequency
A clear difference was found in the frequency patterns between eukaryotic and prokaryotic amino acid sequences of the DNA binding proteins. This difference in amino acid and amino acid group frequencies can be used to discriminate between them. A bar graph (Figure 2) and boxplots (Figure 3 and Figure 4) were used to examine the patterns of the differences.

Machine learning results:
We found that amino acid and amino acid group frequencies can be used as features to train an SMO classifier in WEKA that distinguishes prokaryotic from eukaryotic DNA binding proteins, as given in Table 2.