Robust Feature Selection Approach for Patient Classification using Gene Expression Data

Patient classification through feature selection (FS) based on gene expression data (GED) has become popular in the research community. The t-test is a well-known statistical FS method in GED analysis. However, it produces more false positives and lower accuracy for small sample sizes or in the presence of outliers. To overcome the shortcomings of the t-test with small sample sizes, significance analysis of microarrays (SAM) has been applied to GED, but it is highly sensitive to outliers. Recently, robust SAM using minimum β-divergence estimators has overcome these problems of the classical t-test and SAM, and it has been successfully applied to the identification of differentially expressed (DE) genes. However, it has not yet been applied to classification. Therefore, in this paper, we employ robust SAM as a feature selection approach along with classifiers for patient classification. We compare the performance of robust SAM with that of the classical t-test and SAM using four popular classifiers (LDA, KNN, SVM and naive Bayes) on both simulated and real gene expression datasets. The results from the simulation and real data analysis confirm that the four classifiers perform better with robust SAM than with the classical t-test or SAM. From a real colon cancer dataset, we identified 21 additional DE genes using robust SAM that were not identified by the classical t-test or SAM. To reveal the biological functions and pathways of these 21 genes, we performed KEGG pathway enrichment analysis and found that these genes are involved in some important cancer-related pathways.


Background:
Big biological data is currently one of the most active research topics. Gene expression datasets are high-dimensional because they contain tens of thousands of genes (features) but very few patients (samples) [1]. This property of gene expression data is often referred to as the curse of dimensionality [2][3], and it makes the analysis of such datasets complicated and challenging. The goal of classification is to allocate new objects to one of two or more populations of the training dataset whose categories are known in advance. Cancer classification based on gene expression data is important for subsequent diagnosis and treatment. Without correct classification of a patient's cancer type, it is very difficult to provide proper treatment and therapy [4]. Conventional classification methods depend largely on morphological parameters to classify cancer, so their applicability is limited and their prediction accuracy is low. To overcome the curse of dimensionality in GED, classification through informative gene identification, or feature selection (FS), has attracted the attention of the research community [5]. FS can boost the performance of classifiers by selecting a smaller number of features. It also reduces the computational time and provides more reliable estimates for training the classifiers. There are three types of FS methods for GED analysis: (a) wrapper methods, (b) embedded methods and (c) filter-based methods [6][7]. Wrapper methods search over features until a certain classifier accuracy is achieved. Embedded methods embed feature selection within classifier construction. Filter-based methods first select a few informative features (DE genes) using the labeled samples of the training dataset, and the subsequent classification task is then performed on these preselected features.
Filter-based methods are easier to understand and computationally faster than wrapper and embedded methods, so they are better suited to high-dimensional datasets [8]. Among the filter-based methods, the t-test is one of the most popular and widely used in gene expression data analysis [9].
However, the major drawback of the classical t-test is that it produces more false discoveries and lower accuracy with small sample sizes or outlying gene expressions. Significance Analysis of Microarrays (SAM) overcomes the shortcomings of the t-test in the small-sample case by controlling false discoveries [10]. However, SAM is very sensitive to outliers and produces misleading results in the presence of outlying gene expressions. Consequently, popular classifiers also produce misleading results in the presence of outliers when feature selection is performed using the classical t-test or SAM. Recently, we robustified the SAM approach using minimum β-divergence estimators to solve the aforesaid problems of the classical t-test and SAM [11]. Therefore, in this paper, we employ robust SAM as a feature selection method along with classifiers. To investigate the performance of robust SAM in comparison with the classical t-test and SAM, we chose four popular classifiers: linear discriminant analysis (LDA) [12], k-nearest neighbors (KNN) [13], support vector machine (SVM) [14] and the naive Bayes classifier [15]. From a real colon cancer dataset, we identified 21 additional DE genes using the robust SAM approach that were not identified by the classical t-test or SAM. Using functional annotation and KEGG pathway enrichment analysis, we found that 15 of the 21 genes are involved in some important cancer-related pathways.

Methodology: Performance Evaluation:
To evaluate the performance of the classifiers on a binary classification task (normal versus cancer), we used several statistical measures. For binary class prediction, each outcome falls into one of four categories: true positive (TP), false positive (FP), true negative (TN) and false negative (FN).
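
As an illustration, the four outcome categories of a binary prediction (taken here to be the usual true/false positives and negatives, TP, FP, TN and FN) can be counted and turned into performance measures as in the minimal sketch below. The particular measures (accuracy, sensitivity, specificity, positive predictive value, false discovery rate), the function names and the toy labels are assumptions made for illustration; the text above does not list which measures were used.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary prediction task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def performance_measures(tp, fp, tn, fn):
    """Derive common measures; NaN is returned when a denominator is zero."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "fdr": ratio(fp, tp + fp),          # false discovery rate
    }


# Toy labels (hypothetical): 1 = cancer, 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
measures = performance_measures(*counts)
```

Treating the cancer class as "positive" is a convention chosen for this sketch; swapping the `positive` argument exchanges sensitivity and specificity.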

Patient Classification through Robust SAM Approach:
Gene expression datasets are often contaminated by outliers because of the several steps involved in the data-generating process, from hybridization of DNA samples to image analysis [16]. If outliers are present in the dataset, the results of the downstream analysis may change. Despite their popularity, the statistical FS methods (t-test and SAM) are sensitive to outliers. Therefore, in this paper, we used robust SAM [11] as the feature selection method to select a small number of informative features for training the classifiers (Figure 4). The detailed procedure of patient classification is as follows:
1) Apply the robust SAM approach to the GED to obtain the p-values used for selecting the informative features (DE genes).
2) Adjust the p-values for multiple testing using the Benjamini-Hochberg method, then arrange the adjusted p-values in ascending order.
3) Select the first T < max(n1, n2) genes out of G genes as the top DE genes from the training dataset. Here, n1 and n2 are the numbers of patients in the normal and cancer groups, respectively, and G is the total number of genes in the dataset.
4) Estimate the parameters of the classifiers using the expressions of these top T DE genes in the training dataset.
5) Select the expressions of the top T DE genes from the test dataset to obtain the reduced test dataset.
6) Finally, classify the patients of the test dataset into one of the two groups (normal/cancer).
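
The six steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Welch's t-test from SciPy stands in for robust SAM (which is not reimplemented here), a plain k-nearest-neighbour vote stands in for the four classifiers, and the toy data, the function names `bh_adjust`, `select_top_genes` and `knn_predict`, and the choice T = 9 are all hypothetical.

```python
import numpy as np
from scipy import stats


def bh_adjust(pvals):
    """Step 2: Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def select_top_genes(X_train, y_train, T):
    """Steps 1-3: rank genes by adjusted p-value and keep the first T.

    X_train is samples x genes; y_train holds 0 (normal) or 1 (cancer).
    ASSUMPTION: Welch's t-test is a stand-in for robust SAM.
    """
    _, pvals = stats.ttest_ind(X_train[y_train == 0], X_train[y_train == 1],
                               axis=0, equal_var=False)
    adjusted = bh_adjust(pvals)
    # Sort by adjusted p-value; ties are broken by the raw p-value.
    return np.lexsort((pvals, adjusted))[:T]


def knn_predict(X_train, y_train, X_test, k=3):
    """Steps 4 and 6: k-nearest-neighbour majority vote (Euclidean)."""
    dist = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest]  # shape: (n_test, k)
    return (votes.mean(axis=1) > 0.5).astype(int)


# Toy data (hypothetical): 500 genes, the first 20 shifted in the cancer group.
rng = np.random.default_rng(0)
G, n_de, n_per_group = 500, 20, 10


def make_split():
    y = np.repeat([0, 1], n_per_group)
    X = rng.normal(0.0, np.sqrt(0.1), size=(2 * n_per_group, G))
    X[y == 1, :n_de] += 2.0
    return X, y


X_train, y_train = make_split()
X_test, y_test = make_split()

T = 9  # satisfies T < max(n1, n2) = 10
top = select_top_genes(X_train, y_train, T)
# Step 5: reduce both datasets to the selected genes, then classify.
y_hat = knn_predict(X_train[:, top], y_train, X_test[:, top])
accuracy = float(np.mean(y_hat == y_test))
```

Note that the genes are selected using the training samples only, so the test samples play no role in feature selection.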

Dataset: Simulated Gene Expression Dataset:
We generated the simulated gene expression datasets from the model described in Table 1. In this table, g1 and g2 represent the up-regulated and down-regulated DE gene groups, respectively, and g3 represents the equally expressed (EE) gene group. We generated gene expression profiles of G = 10,000 genes with k = 2 groups (normal/cancer). We considered 100 datasets for each of the small-sample (N1 = N2 = 10) and large-sample (N1 = N2 = 40) cases. Each dataset contains the expression profiles of G = 10,000 genes for N = N1 + N2 samples. We set the parameters to d = 2 and σ² = 0.1. Of the 10,000 genes in each dataset, 200 are important features (DE genes) and 9,800 are unimportant features (EE genes). We randomly divided each of the 100 datasets into two independent parts to construct the training and test datasets, such that each consists of n1 = N1/2 samples in the normal group and n2 = N2/2 samples in the cancer group.
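
A simulation of this kind might be sketched as below. Because Table 1 is not reproduced in this text, the mean structure is an assumption: all genes have mean 0 in the normal group, the up-regulated group g1 is shifted by +d and the down-regulated group g2 by -d in the cancer group, with Gaussian noise of variance σ². The equal split of the 200 DE genes between g1 and g2, and the function names `simulate_dataset` and `split_half`, are likewise hypothetical.

```python
import numpy as np


def simulate_dataset(n1, n2, G=10_000, n_de=200, d=2.0, sigma2=0.1, seed=None):
    """Return (X, labels): X is samples x genes, labels 0 = normal, 1 = cancer.

    ASSUMED model (Table 1 is not available here): noise ~ N(0, sigma2);
    in the cancer group, g1 genes are shifted by +d and g2 genes by -d.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], [n1, n2])
    X = rng.normal(0.0, np.sqrt(sigma2), size=(n1 + n2, G))
    half = n_de // 2
    X[labels == 1, :half] += d      # g1: up-regulated DE genes
    X[labels == 1, half:n_de] -= d  # g2: down-regulated DE genes
    return X, labels                # remaining genes form g3 (EE)


def split_half(X, labels, seed=None):
    """Randomly put half of each group into training, the rest into test."""
    rng = np.random.default_rng(seed)
    train_mask = np.zeros(labels.size, dtype=bool)
    for group in (0, 1):
        members = rng.permutation(np.flatnonzero(labels == group))
        train_mask[members[: members.size // 2]] = True
    train = (X[train_mask], labels[train_mask])
    test = (X[~train_mask], labels[~train_mask])
    return train, test


# Small-sample case: N1 = N2 = 10, so n1 = n2 = 5 per split.
X, labels = simulate_dataset(10, 10, seed=1)
(X_train, y_train), (X_test, y_test) = split_half(X, labels, seed=2)
```

Repeating the two calls with 100 different seeds would give the 100 datasets per case described above.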

Real Gene Expression Dataset:
This dataset consists of the gene expression profiles of 6,500 human genes from 40 tumor and 22 normal colon tissue samples, analyzed with Affymetrix technology [17]. Of the 6,500 genes, the 2,000 with the highest minimal intensity across the samples were selected for further analysis. The dataset is also available in the R package "plsgenomics".
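
Filtering genes by their minimal intensity across samples can be sketched as follows. The function name `top_by_min_intensity`, the genes-by-samples orientation of the matrix and the toy values are assumptions for illustration; the copy of the dataset distributed with the R package is already reduced to 2,000 genes.

```python
import numpy as np


def top_by_min_intensity(expr, n_keep):
    """Indices of the n_keep genes with the highest minimal intensity.

    expr is assumed to be a genes x samples matrix of intensities.
    """
    min_intensity = expr.min(axis=1)
    keep = np.argsort(min_intensity)[::-1][:n_keep]
    return np.sort(keep)  # report in original gene order


# Toy matrix (hypothetical values): 4 genes x 3 samples.
expr = np.array([
    [5.0, 7.0, 6.0],  # minimum 5
    [1.0, 9.0, 9.0],  # minimum 1
    [4.0, 4.0, 8.0],  # minimum 4
    [6.0, 6.0, 6.0],  # minimum 6
])
kept = top_by_min_intensity(expr, n_keep=2)
```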

Results & Discussion:
To demonstrate the performance of the robust SAM in comparison with the classical [...] Figure 3 shows a bar chart of the biological process, cellular component and molecular function categories. In this figure, 15 of the 21 genes are involved in the three categories. The top ten KEGG pathways for the 21 additional genes detected by robust SAM are summarized in Table 3. We found that the DNA replication pathway is the most highly enriched pathway.
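
Pathway enrichment of a gene list is commonly assessed with a hypergeometric over-representation test, sketched below. The text does not state which tool or test was used for the KEGG analysis, so this is an assumption, and the function name `enrichment_pvalue` and all counts are hypothetical numbers chosen only to show the calculation.

```python
from scipy.stats import hypergeom


def enrichment_pvalue(n_background, n_in_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) if n_selected genes were drawn at random
    from a background of n_background genes, n_in_pathway of which belong
    to the pathway (one-sided hypergeometric test)."""
    return float(hypergeom.sf(n_overlap - 1, n_background, n_in_pathway, n_selected))


# Hypothetical counts: 2,000 background genes, a 30-gene pathway,
# 21 selected genes, of which 1 or 4 fall in the pathway.
p_one_hit = enrichment_pvalue(2000, 30, 21, 1)
p_four_hits = enrichment_pvalue(2000, 30, 21, 4)
```

A larger overlap gives a smaller p-value; in practice the p-values of all tested pathways would also be adjusted for multiple testing.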

Conclusion:
Classifying patients into the populations of a training dataset is a very popular task in GED analysis. The t-test and SAM are popular FS methods for patient classification using GED. However, both are sensitive to outliers. To overcome the problems of the classical t-test and SAM, robust SAM using minimum β-divergence estimators was proposed [11]. In this paper, we employed robust SAM as an FS method along with classifiers. From a real colon cancer dataset, we identified 21 additional DE genes using robust SAM and found that these genes are involved in some important cancer-related pathways. We then used the expressions of these 21 genes for classification and found that classification performance improves with these genes. Moreover, we observed that the SVM and naive Bayes classifiers performed better than LDA and KNN.