Classification of Functional Metagenomes Recovered from Different Environmental Samples

Classification of functional metagenomes from microbial communities plays a vital role in metagenomics research. In this paper, we investigate the performance of the beta-t random forest classifier for the classification of metagenomics data. Nine key functional metagenomic variables were selected from 10 different microbial communities using the beta-t test statistic, with p-values assessed at the 5% level of significance. The beta-t random forest classifier showed higher accuracy (96%) and true positive rate (96%), and lower false positive rate (5%), false discovery rate (5%), and misclassification error rate (5%), for the classification of metagenomes. This method performed better than Bayes, SVM, KNN, AdaBoost, and LogitBoost.


Background:
Metagenomics is an important research area in bioinformatics for studying the microbial communities found in different environments. Classifying functional metagenomes drawn from different sources of metagenome data is a major statistical challenge, and classifying the potential metabolic functions of a microbial community from metagenomic information is an important task in metagenomics research. Different microbial processes have different metagenomic functions in different environments [1]. Metagenomics gives a complete picture of microbial activity and eases the interpretation of thousands of proteins through BLAST matches [2]. Many web tools are available for the statistical analysis of metagenomic datasets, but not all of them provide accurate and valid results [3].
Traditional multivariate statistical methods such as principal component analysis (PCA), multidimensional scaling (MDS), and canonical discriminant analysis (CDA) are often used to analyze genomic data and microbial communities [4]. Multivariate statistical techniques play a vital role in the classification and visualization of metagenomics data from several microbial communities. Profiling metagenomic data from different environments and classifying it is important for separating functional metagenomes. MetaGUN is a three-stage gene prediction method for metagenomic fragments that uses a support vector machine (SVM) [5,7]. The k-nearest neighbor method has been important for exploring the universe of metagenomes across several microbial communities [6]. AdaBoost is an efficient method for analyzing very large metagenomic datasets, which is a challenging task for bioinformaticians and computer scientists [8]. The machine learning method Random Forest has proved very useful for predicting ribosomal proteins in plants [9][10][11][12]. Statistical testing is very important for identifying potential metabolic functions within and between environments based on the different microbial communities. Random forest is an ensemble learning method for classification and regression on datasets with multiple patterns, and it is efficient for the robust classification of complex, high-dimensional data such as microbial community data. High-dimensional datasets with a large number of features (metabolic functions or metabolic variables) pose a fundamental problem. It is therefore essential to choose an appropriate feature selection method when classifying high-dimensional metagenomics datasets. In this study, we used the beta-t statistic to select features of metagenomic data from several microbial communities and then applied the random forest algorithm for efficient classification of functional metagenomes.
©Biomedical Informatics (2019)

Methodology:
Dataset:
The dataset used in this study was collected from a previously published article [17]. It contains 212 microbial metagenomes generated from 10 different environments, with 26 metabolic functions.

Model:
The Bayesian classifier is generally known as a simple probabilistic classifier. The sequence features were used as the input X = (x1, x2, …, xp) to the Bayesian classifier. For each metagenome, the Bayesian classifier produced a multiclass prediction, and it was trained on a labeled training dataset (X, C). The Support Vector Machine (SVM), K-nearest Neighbor (KNN), AdaBoost, LogitBoost [13], Random Forest [15], and beta-t statistic [16] methods were used for the classification and comparison of functional metagenomes from the different microbial communities (Figure 1). All computational analysis in this study was conducted using the R statistical programming language (https://www.r-project.org/) [14].
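To make the idea of a simple probabilistic classifier trained on labeled pairs (X, C) concrete, the following is a minimal sketch of a Gaussian naive Bayes classifier in plain Python. It is an illustration only: the paper's analysis was done in R, and the Gaussian, feature-independence form, the function names, and the toy data are all assumptions that do not come from the study.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate log-priors, per-feature means, and variances for each class.

    Assumption: features are modelled as independent Gaussians within a class.
    """
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    n_total = len(X)
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance strictly positive for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / n_total), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for one sample x."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for value, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (made up for illustration): two features, two environments.
X_train = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [5.0, 6.0], [5.2, 5.9], [4.8, 6.2]]
y_train = ["soil", "soil", "soil", "marine", "marine", "marine"]
model = fit_gaussian_nb(X_train, y_train)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [5.1, 6.0]))
```

Working in log space avoids numerical underflow when many features are multiplied together, which matters once X carries the nine selected functional variables rather than two toy features.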

Results and Discussion:
To identify the key functional metagenomes, we used the beta-t test statistic, which is described in detail in a previously published paper [16]. Using this method, we selected the top nine key functional metagenomes (AAD, CDCC, CVPGP, DNAM, MT, MC, NN, Plasmids, and SM) based on p-values at the 5% level of significance (Table 1). To investigate the performance of the different classifiers, we checked performance on the full dataset and on three cross-validation (CV) datasets derived from it: 10-fold, 5-fold, and 3-fold. The performance of the classifiers (Bayes, SVM, KNN, AdaBoost, LogitBoost, and beta-t Random Forest) on the full dataset is shown in Table 2. Performance was measured using accuracy (ACC), true positive rate (TPR), false positive rate (FPR), false discovery rate (FDR), and misclassification error rate (MER). The Bayes classifier showed the lowest performance, with ACC 57%, TPR 57%, FPR 42%, FDR 48%, and MER 43%, whereas beta-t Random Forest showed the highest performance, with ACC, TPR, FPR, FDR, and MER of 94%, 94%, 5%, 6%, and 6%, respectively. Beta-t Random Forest therefore performed best on the full dataset. For the 10-fold CV dataset, the Bayes classifier showed the lowest performance, and LogitBoost and beta-t Random Forest performed approximately equally, although beta-t Random Forest was ultimately judged the better classifier. For the 5-fold and 3-fold CV datasets, beta-t Random Forest again showed better ACC, TPR, FPR, FDR, and MER than the other methods.
Figure 3 shows that, among the six classifiers, SVM had the highest MER and beta-t Random Forest the lowest for the full dataset. Similarly, for the other datasets (10-fold, 5-fold, and 3-fold CV), SVM showed the highest MER and beta-t Random Forest the lowest. Beta-t Random Forest thus showed the lowest MER for all datasets.
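The five performance measures used above can all be derived from the true and predicted class labels. The sketch below is a hedged illustration in plain Python: the paper does not state how the per-class rates were aggregated across the 10 environments, so the one-vs-rest macro average used here is an assumption, as are the function name and the toy labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute ACC, TPR, FPR, FDR, and MER for a multiclass problem.

    Assumption: TPR, FPR, and FDR are computed one-vs-rest for each class
    and then averaged with equal weight per class (macro average).
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)

    tprs, fprs, fdrs = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = n - tp - fp - fn
        # Guard against empty denominators (e.g. a class never predicted).
        tprs.append(tp / (tp + fn) if tp + fn else 0.0)
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)
        fdrs.append(fp / (fp + tp) if fp + tp else 0.0)

    accuracy = sum(1 for t, p in pairs if t == p) / n
    k = len(labels)
    return {
        "ACC": accuracy,
        "TPR": sum(tprs) / k,
        "FPR": sum(fprs) / k,
        "FDR": sum(fdrs) / k,
        "MER": 1.0 - accuracy,
    }


# Toy labels (made up): one sample of class "a" is misclassified as "b".
metrics = classification_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(metrics)
```

Under this definition MER is simply 1 minus ACC, which matches the full-dataset figures reported for the Bayes classifier (57% and 43%) and for beta-t Random Forest (94% and 6%).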

The false discovery rate (FDR) was calculated for each dataset. Figure 4 shows that SVM produced the largest FDR for all datasets, followed by the Bayes classifier and KNN. Among the six classifiers, beta-t Random Forest produced the lowest FDR when classifying the functional metagenomes from several microbial communities.
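The 10-fold, 5-fold, and 3-fold cross-validation datasets on which these rates were computed can be thought of as index splits over the 212 metagenomes. The following plain-Python sketch shows one way such splits might be generated. The shuffling, the seed, and the function name are assumptions, since the paper does not describe how its folds were drawn (for example, whether they were stratified by environment).

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: samples are shuffled once and dealt into k folds round-robin,
    without stratifying by class.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# The study uses 212 metagenomes with 10-, 5-, and 3-fold cross-validation.
for k in (10, 5, 3):
    fold_sizes = [len(test) for _, test in k_fold_indices(212, k)]
    print(k, fold_sizes)
```

Each metagenome lands in exactly one test fold, so every sample is predicted once by a model that did not see it during training, and the resulting predictions can be pooled before computing FDR or MER.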

The radar plot (Figure 5) compares the performance measures of popular classifiers from the literature for classifying functional metagenomes from the different microbial communities. The beta-t Random Forest classifier showed the highest TPR and the lowest FDR and MER for classification of the metagenomes.

Conclusions:
Classification of metagenomic data obtained from different microbial communities is an important task in the context of their associated functional metagenomic variables. In this study, the beta-t random forest classifier showed the lowest FDR and MER, along with the highest TPR, in all cases compared with the Bayes, SVM, KNN, AdaBoost, and LogitBoost classifiers. The beta-t random forest classifier is therefore considered the best of these classifiers for grouping metagenomes derived from different environmental samples.