SDED: a novel filter method for cancer-related gene selection.

Gene selection aims to detect the genes whose expression differs most significantly between conditions in expression data. The current challenge in gene selection is the comparison of a large number of genes with limited patient samples. Thus it is a non-trivial task in simple statistical analysis. Filter methods applied in gene selection studies adopt various statistical measurements, and their ability to discriminate phenotypes is crucial in classification and selection. Here we describe the standard deviation error distribution (SDED) method for gene selection. It utilizes within-class and among-class variations in gene expression data. We tested the method using 4 leukemia datasets available in the public domain and compared it with the GS2 and CHO methods. The prediction accuracies of SDED are better than those of both GS2 and CHO across the datasets, by 0.8-4.2% and 1.6-8.4%, respectively. The related OMIM annotations and KEGG pathway analyses verified that SDED can pick out 4.0% and 6.1% more genes with biological significance than GS2 and CHO, respectively.


Background:
DNA micro-array technology has enabled biologists to associate phenotypes with molecular genetics [1, 2]. It is commonly used to compare gene expression levels of different phenotypes (normal versus cancer), and it enables the expression of thousands of genes to be studied simultaneously. The difficulty lies in interpreting the expression data. Genes with significant expression across the sample set are selected using sound statistical techniques. These discriminatory genes help to classify different cancer subtypes [3, 4]. There are two categories of gene selection strategies, namely filter and wrapper [1], and both have their disadvantages. Among the filter methods, GS2 is stable, but its calculations are too complex and its biological meaning is difficult to annotate. The CHO method considers within-class information but loses the among-class information. The wrapper methods face a feature space whose dimensions increase exponentially for large gene sets. Thus, wrappers are computationally intractable for high-dimensional gene data [1]. Their inherent linear nature is a further disadvantage, making it difficult to identify important genes with wrapper methods [11]. Here, we propose a statistical measurement to better score genes with subtle expression patterns. It incorporates the within-class and among-class variations in gene expression data.
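To make the terms "within-class" and "among-class" variation concrete, the following minimal sketch computes both quantities for a single gene as sums of squares. This is not the SDED score, whose formula is not given in this section; the function name `class_variation` and the toy expression values are assumptions made purely for illustration.

```python
from statistics import mean

def class_variation(values, labels):
    """Return (within_ss, among_ss) for one gene.

    within_ss: squared deviations of each sample from its own class mean.
    among_ss:  squared deviations of class means from the grand mean,
               weighted by class size.
    Illustrative only; this is NOT the SDED statistic.
    """
    grand_mean = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    within_ss = 0.0
    among_ss = 0.0
    for members in groups.values():
        class_mean = mean(members)
        within_ss += sum((v - class_mean) ** 2 for v in members)
        among_ss += len(members) * (class_mean - grand_mean) ** 2
    return within_ss, among_ss

# Hypothetical expression values of one gene in six samples.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
within, among = class_variation(values, labels)
print(within, among)
```

A gene that separates the classes well shows a small within-class term and a large among-class term; a measurement that uses only one of the two discards part of this information.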

ALL-AML dataset:
The ALL-AML dataset was obtained from the Cancer Program of the Broad Institute.

Data normalization
These 4 datasets were used in the analyses. Each sample was normalized to the standard normal distribution N(0,1) before scoring for gene selection; that is, the expression of each gene was normalized relative to the expression levels within its sample.
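The per-sample normalization can be sketched as a z-score transform over all gene values of one sample. This is a minimal illustration under stated assumptions: the function name `normalize_sample` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev

def normalize_sample(expression):
    """Rescale one sample's gene expression values to mean 0 and SD 1.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu = mean(expression)
    sigma = pstdev(expression)
    if sigma == 0:
        raise ValueError("all expression values are identical; cannot scale")
    return [(x - mu) / sigma for x in expression]

# Hypothetical expression values of eight genes in a single sample.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = normalize_sample(sample)
print(z)
```

After this step every sample has the same location and scale, so gene scores computed across samples are not dominated by array-wide intensity differences.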

SVM classifier
SVM is a powerful and popular machine-learning method that has been widely used in biological classification. The key idea of SVM is to maximize the margin separating the two classes while minimizing the total classification error. A number of kernels can be used in SVM models for computing the decision plane; the radial basis function (RBF) kernel was chosen for our purpose. For the multi-class SVM classifier, we used the one-versus-one method. The final prediction was given by a voting strategy: the sample is assigned to the class that receives the maximum number of votes. If more than one class receives the same maximum number of votes, the classifier makes a random prediction among them. Proper selection of parameters is known to be very important for SVM, so the grid search strategy by Chih-Jen Lin [15] was performed to find the best combination of parameters for each prediction process. The SVM toolkit we used in MATLAB was LIBSVM version 2.82 [15].
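The one-versus-one voting rule with a random tie-break can be sketched as follows. This is an illustration of the voting logic only, not of LIBSVM's internals: the function `one_vs_one_predict`, the toy one-dimensional `nearest_center` decider that stands in for a trained binary SVM, and the class centers are all hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def one_vs_one_predict(sample, classes, pairwise_decide, rng=random):
    """Let every pair of classes vote; break ties among top classes at random."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, sample)] += 1
    top = max(votes.values())
    tied = [c for c in classes if votes[c] == top]
    return rng.choice(tied), votes

# Hypothetical stand-in for a binary SVM: pick the class with the nearer center.
centers = {"ALL": 0.0, "AML": 5.0, "MLL": 10.0}

def nearest_center(a, b, sample):
    return a if abs(sample - centers[a]) <= abs(sample - centers[b]) else b

classes = ["ALL", "AML", "MLL"]
predicted, votes = one_vs_one_predict(4.0, classes, nearest_center)
print(predicted, dict(votes))

# A cyclic outcome gives each class one vote, so the prediction is random.
cycle = {("ALL", "AML"): "ALL", ("ALL", "MLL"): "MLL", ("AML", "MLL"): "AML"}
tie_pick, tie_votes = one_vs_one_predict(
    None, classes, lambda a, b, s: cycle[(a, b)], rng=random.Random(0)
)
print(tie_pick, dict(tie_votes))
```

With k classes this scheme trains k(k-1)/2 binary classifiers, so a three-class dataset needs three of them.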

Discussion:
Samples were first divided into testing and training data for each dataset, and the training samples were used for scoring the genes. The quality of the top-ranked x genes was assessed on two aspects: (1) the classification accuracy; (2) the relevance to inheritance or disease association in related pathways.

Classification accuracies
We used the top-ranked genes selected by a gene selection method, together with their expression values in the training dataset, to build a classifier for each testing sample. We defined the classification accuracy as the percentage of correct decisions made by the classifier on the testing samples. We adopted the SVM classifier to compare the performance of SDED with GS2 and CHO. The classification accuracy was obtained through leave-one-out cross-validation (LOO_CV): one sample was taken as testing data and the remaining samples were used as training data. This was repeated for all samples and for every set of top-ranked x genes (x from 1 to 100, with p < 0.01) in the datasets. Figure 1 shows the classification accuracy of the SVM classifier based on SDED, GS2 and CHO on the MLL dataset. Results are reported as accuracy/number of genes. For the MLL, ALL-AML-3 and ALL-AML-4 datasets, SDED achieved better results than GS2 (94.444%/91, 97.222%/48, 93.056%/36) and CHO (88.889%/82, 95.833%/74, 93.056%/69). SDED reached 97.222%/48, 98.611%/16 and 97.222%/57 on these datasets, respectively, in most cases with fewer genes. On the ALL dataset, the performance of SDED (98.387%/96) was only comparable with GS2 (97.581%/68) and CHO (96.774%/87). In summary, the SDED filter method achieves classification accuracies about 0.8-4.2% and 1.6-8.4% better than GS2 and CHO, respectively.
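The leave-one-out procedure and the accuracy definition can be sketched as below. This is a minimal illustration, not the pipeline used in the study: a nearest-centroid rule stands in for the RBF-kernel SVM so that the example needs no external library, gene ranking inside each fold is omitted, and the function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, test_x):
    """Stand-in classifier (NOT an SVM): assign the class with the closest centroid."""
    groups = {}
    for x, y in zip(train_X, train_y):
        groups.setdefault(y, []).append(x)
    best_label, best_dist = None, float("inf")
    for label, members in groups.items():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(test_x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_cv_accuracy(X, y, train_and_predict):
    """Hold out each sample once; return the percentage of correct predictions."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # all samples except the i-th
        train_y = y[:i] + y[i + 1:]
        if train_and_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return 100.0 * correct / len(X)

# Hypothetical data: six samples described by two selected genes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
accuracy = loo_cv_accuracy(X, y, nearest_centroid_predict)
print(f"{accuracy:.3f}%")
```

Because the accuracy is a count of correct held-out samples divided by the number of samples, it can only take values in steps of 100/n percent for a dataset of n samples.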

Biological meaning
We examined the selected genes and their associations in pathways to demonstrate the biological significance of the gene selection. The top 100 ranked genes (p < 0.01) were chosen for each method and dataset. The numbers of genes in the dataset that are found in