Entropy based sub-dimensional evaluation and selection method for DNA microarray data classification

DNA microarrays allow the expression levels of tens of thousands of genes to be measured simultaneously and have many applications in biology and medicine. Microarray data are very noisy, which makes data analysis and classification difficult. Sub-dimension based methods can overcome the noise problem by partitioning the conditions into sub-groups, performing classification with each group, and integrating the results. However, there can be many sub-dimensional groups, which leads to high computational complexity. In this paper, we propose an entropy-based method to evaluate and select important sub-dimensions and eliminate unimportant ones. This improves computational efficiency considerably. We tested our method on four microarray datasets and two other real-world datasets, and the experimental results demonstrate its effectiveness.

number of conditions (features). The sub-dimension based method partitions the dataset into several smaller parts called sub-dimensions, which may or may not be disjoint [5]. It clusters the dataset based on these sub-dimensions. In our previous study, a voting system was used to combine the class results of all sub-dimensions: we assigned two objects $x_1$ and $x_2$ to the same group if more than half of their sub-dimensions $x_{1j}$ and $x_{2j}$ belonged to the same group. Experimental results show that the method is effective [1]. However, the enormous number of features in real-world microarray datasets makes it difficult to select the optimal sub-dimensions. One remedy is to reduce the dimensionality. In classification, the sub-dimensions do not contribute equally. Some may be corrupted or less relevant than others and can be discarded without degrading the performance of the system. In this paper, we employ a feature evaluation and selection technique to identify the less important sub-dimensions, so that the number of sub-dimensions can be reduced without affecting the classification accuracy.
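The voting rule described above can be sketched in a few lines. This is only an illustration: the function names `same_group` and `majority_vote` are hypothetical, and the inputs are assumed to be lists holding one group label per sub-dimension.

```python
from collections import Counter


def same_group(labels_x1, labels_x2):
    """True if two objects share a group in more than half of the sub-dimensions.

    labels_x1[j] and labels_x2[j] are the group labels assigned to the two
    objects by the classifier working on sub-dimension j (assumed input layout).
    """
    agreements = sum(a == b for a, b in zip(labels_x1, labels_x2))
    return agreements > len(labels_x1) / 2


def majority_vote(sub_dimension_labels):
    """Return the label chosen by more than half of the sub-dimensions, else None."""
    label, votes = Counter(sub_dimension_labels).most_common(1)[0]
    return label if votes > len(sub_dimension_labels) / 2 else None
```

With five sub-dimensions, two objects whose labels agree in three of them would be placed in the same group, whereas agreement in exactly half of an even number of sub-dimensions would not be enough under this strict "more than half" reading.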
The aim of feature selection is to identify, within an original candidate set, the features that carry the most or the least useful information. Feature selection algorithms have been well researched. In our study, we combine an entropy-based measure with the sub-dimension method. Entropy-based methods have been used in many areas, such as mathematics, communication theory, and economics. In 1948, Shannon [6] first introduced the basic entropy and information gain concepts to the information domain. "Entropy is a measure of the amount of uncertainty in the outcome of a random experiment, or equivalently, a measure of the information obtained when the outcome is observed." [7] In our study, the entropy can be regarded as a measure of the contribution that a single sub-dimension makes to the overall classification. To assess the performance of the proposed method, the normal probabilistic neural network (PNN) and the sub-dimension based PNN are used for experimental comparison. In this paper, we first briefly review the structure of the PNN, discuss the sub-dimension formulation, and introduce the entropy concept. Then, we describe the proposed method and present experimental results from six datasets.
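As a concrete illustration of the entropy concept quoted above, the following sketch computes the Shannon entropy of a list of class labels, estimating each class probability by its relative frequency. The base-2 logarithm and the function name `shannon_entropy` are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical class distribution of `labels`."""
    n = len(labels)
    counts = Counter(labels)
    # Each class probability is estimated as (objects in class) / (all objects).
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A set of objects that all fall in one class has zero entropy (no uncertainty), while objects split evenly between two classes give one bit.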

Methodology:
Please see supplementary material.

Discussion:
Experiments with the proposed method were performed on four microarray datasets: the yeast cell cycle data, the sporulation data, the rodrigues data, and the annot data [11]-[14].
To verify the proposed method, we also present experimental results on two other datasets: the wine data and the Wisconsin diagnostic breast cancer (wdbc) data. For each dataset, we ran the steps in section II 30 times and averaged the results to evaluate the performance.

Real world data
To evaluate the performance of the proposed method on noisy data, we added white Gaussian noise (wgn) randomly to the features of the entire datasets as a form of corruption. The wine dataset contains 178 objects in three groups, with 13 features. In our experiment, we used 78 objects as training samples and the remaining 100 objects for testing. As shown in Table 1 (supplementary material), the sub-dimension based PNN classifies 90 of the 100 test objects correctly, compared with 71 for the normal PNN. With 89% accuracy, the proposed method performs comparably to the sub-dimension based PNN.
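The noise corruption step can be sketched as follows. The paper does not state the noise level or how many feature values were corrupted, so `sigma` and `fraction` are hypothetical parameters introduced only for this illustration, as is the function name.

```python
import random


def add_white_gaussian_noise(rows, sigma, fraction, seed=None):
    """Return a copy of `rows` with zero-mean Gaussian noise added to random entries.

    rows     : list of feature vectors (lists of floats)
    sigma    : standard deviation of the noise (assumed, not from the paper)
    fraction : probability that any single feature value is corrupted (assumed)
    """
    rng = random.Random(seed)
    noisy = []
    for row in rows:
        noisy.append([
            value + rng.gauss(0.0, sigma) if rng.random() < fraction else value
            for value in row
        ])
    return noisy
```

The input is left unmodified, so the clean and corrupted versions of a dataset can be classified side by side.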
The wdbc dataset has 576 objects in two classes, with 30 features; 276 objects are used as training samples and 300 as testing samples. As with the wine data, the proposed method gives close results on the wdbc dataset, with 279 correct classifications compared with 280 for the sub-dimension based PNN, and is superior to the normal PNN.

Microarray data
The yeast cell cycle dataset, consisting of 6220 genes, was published by Cho and colleagues [11]. As in the study of the sub-dimension method [5], we adopted 384 genes and normalized each gene expression profile to zero mean and unit variance. The dataset has 17 time points and five cell cycle phases: the G1 phase, late G1 phase, S phase, G2 phase, and M phase. The results are given in Table 3 (supplementary material). The proposed method correctly classifies 149 of the 200 testing samples and the sub-dimension based PNN correctly classifies 150, a difference of only 0.5%. The sporulation dataset contains 6118 genes with seven features. As in [5], after pre-processing we use only the 1136 genes for which the root mean square of the log2-transformed data is greater than 1.13. The dataset has seven phases: metabolic, early I, early II, early middle, middle, mid-late, and late. We use 736 genes for training and the remaining 400 genes for testing. As shown in Table 4 (supplementary material), the proposed method works well, with an accuracy of 48.5% (194 out of 400) compared with 49.5% for the sub-dimension based PNN.
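The two pre-processing steps mentioned above (per-gene normalization to zero mean and unit variance, and the root-mean-square filter on log2-transformed data) might look like the sketch below. The use of the population variance, the assumption that the filter input is a list of positive expression ratios, and the function names are all choices made for this example.

```python
import math


def standardize_profile(profile):
    """Scale one gene expression profile to zero mean and unit variance."""
    n = len(profile)
    mean = sum(profile) / n
    # Population variance (divide by n); the paper does not say which form was used.
    std = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    if std == 0.0:
        return [0.0] * n  # a constant profile carries no shape information
    return [(v - mean) / std for v in profile]


def rms_log2(ratios):
    """Root mean square of the log2-transformed values of one gene."""
    logs = [math.log2(r) for r in ratios]
    return math.sqrt(sum(v * v for v in logs) / len(logs))


def passes_rms_filter(ratios, threshold=1.13):
    """Keep a gene only if its log2 RMS exceeds the threshold (1.13 in the text)."""
    return rms_log2(ratios) > threshold
```

A gene whose ratios swing between 4 and 0.25 has a log2 RMS of 2 and would be kept, while a gene whose ratios stay at 1 has an RMS of 0 and would be filtered out.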
The Rodriguez dataset is available elsewhere [13]. It contains 974 genes clustered into nine groups, with 47 features; 500 of the genes are used for testing. Table 5 (supplementary material) shows that the proposed method achieves the same recognition accuracy as the sub-dimension based PNN (82.4%). For comparison, the normal PNN achieves 79.6% accuracy. Similar results are obtained on the Annton dataset, which contains 639 genes in five classes, with 47 features, of which half are in the test set. As expected, the proposed method achieves almost the same success on the test set as the sub-dimension based PNN, at 73% accuracy. The normal PNN obtains only 283 correct out of 400 testing data. As shown in the tables (supplementary material), the proposed method performs very closely to the sub-dimension based PNN, which uses all sub-dimension features.

Conclusion:
Instead of considering all features of a dataset in a single classifier, our previous paper [1] applied PNN classification to individual sub-dimensions. However, the number of combinations of sub-dimensions is large, which makes the overall system computationally too complicated. In this paper, a feature evaluation and selection technique based on an entropy definition is used to measure the contribution of each sub-dimension. The sub-dimension with the lowest contribution to the overall classification is discarded. Experiments on two real-world datasets and four microarray datasets show that the proposed technique performs remarkably better than the normal PNN and about as well as the sub-dimension based PNN, while the system complexity is significantly reduced and the classification speed is increased. Feature evaluation and selection are especially effective and convenient when the number of input features is large and the datasets are noisy. When the sub-dimensions are ranked by their information gain G, the importance of a sub-dimension decreases as G decreases. Good selections occur particularly at the top of the ranking. However, how many sub-dimensions should be considered important is a critical issue that needs further investigation.
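The ranking-and-discarding step summarized above can be sketched as follows, assuming the information gain G of each sub-dimension has already been computed. The function name, the dictionary layout, and the `n_discard` parameter are hypothetical; in particular, how many sub-dimensions to drop is left open, as noted in the conclusion.

```python
def select_sub_dimensions(gains, n_discard=1):
    """Rank sub-dimensions by information gain and drop the lowest-ranked ones.

    gains     : dict mapping a sub-dimension identifier to its gain G
    n_discard : how many of the lowest-gain sub-dimensions to remove (assumed)
    Returns the kept identifiers, ordered from highest to lowest gain.
    """
    ranked = sorted(gains, key=gains.get, reverse=True)
    return ranked[:max(0, len(ranked) - n_discard)]
```

For example, with made-up gains `{"s1": 0.9, "s2": 0.1, "s3": 0.5}` and one sub-dimension discarded, the call keeps `s1` and `s3`, and only those would then take part in the vote.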

References
where $f_A(X)$ and $f_B(X)$ are the probability density functions for categories $\theta_A$ and $\theta_B$. The main task of implementing Equation (1) is to estimate the probability density function for each class from a set of known training patterns. As shown in papers [1] and [2], a particular estimate of the probability density function is
$$f_A(X) = \frac{1}{(2\pi)^{p/2}\sigma^{p}} \cdot \frac{1}{m} \sum_{i=1}^{m} \exp\left[-\frac{(X - X_{Ai})^{T}(X - X_{Ai})}{2\sigma^{2}}\right] \qquad (2)$$
where $p$ is the dimension of the input pattern $X$; $m$ is the total number of training patterns; $X_{Ai}$ is the $i$-th training pattern from category $\theta_A$; and $\sigma$ is the smoothing parameter.
$f_A(X)$ is the sum of multivariate Gaussian distributions centered at each training sample. It could be any smooth density function, not limited to the Gaussian.
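A minimal sketch of such a density estimate is given below, assuming the standard Gaussian Parzen-window form: one Gaussian centered on each training pattern of a category, averaged over the patterns. The function and argument names are illustrative only.

```python
import math


def parzen_density(x, patterns, sigma):
    """Gaussian Parzen-window estimate of a class density at point x.

    x        : input pattern (list of floats)
    patterns : training patterns of one category (list of lists)
    sigma    : smoothing parameter
    """
    p = len(x)  # dimension of the input pattern
    normalizer = (2.0 * math.pi) ** (p / 2.0) * sigma ** p
    total = 0.0
    for pattern in patterns:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, pattern))
        total += math.exp(-squared_distance / (2.0 * sigma ** 2))
    return total / (normalizer * len(patterns))
```

A small `sigma` gives a spiky estimate that hugs the training samples, while a large `sigma` smooths the estimate across neighbouring samples.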
The PNN network consists of input units, two hidden layers, and output units. Figure 1 shows the PNN structure for a two-group classification. The input units in the PNN correspond to the input features. The first hidden layer consists of pattern units. In each unit, the dot product of the input pattern $X$ with a weight vector $W_i$ is formed, $Z_i = X \cdot W_i$, and a nonlinear operation is applied to $Z_i$:
$$g(Z_i) = \exp\left[\frac{Z_i - 1}{\sigma^{2}}\right] \qquad (3)$$
If both $X$ and $W_i$ are normalized to unit length, Equation (3) becomes:
$$\exp\left[-\frac{(W_i - X)^{T}(W_i - X)}{2\sigma^{2}}\right]$$
The summation units, which form the second hidden layer, simply sum the inputs from the corresponding pattern units according to the training process. The two hidden layers are connected in such a way that each pattern unit in the first layer feeds only one appropriate node in the second layer.
The output units, or decision units, simply produce a binary output, as indicated in Figure 1. The PNN uses the training patterns to estimate the probability distribution of each class during training, and during testing it classifies the input according to the weighted average of the closest training examples. In this paradigm, learning for small and moderate-sized databases is faster because no iterative process is needed. However, the entire training dataset needs to be stored, and large networks require large databases. These are the disadvantages of the PNN [8].
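The layer structure described above can be illustrated with a small sketch: pattern units take the dot product of the normalized input with each normalized training pattern and apply an exponential activation, summation units average the activations per class, and the decision unit picks the class with the largest sum. Equal priors and losses across classes, the default value of `sigma`, and all names are assumptions of this example, and inputs are assumed to be non-zero vectors.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def pnn_classify(x, training, sigma=0.5):
    """Classify x with a basic PNN.

    training : dict mapping a class label to its list of training patterns
    sigma    : smoothing parameter (the default is an arbitrary choice)
    """
    x = normalize(x)
    scores = {}
    for label, patterns in training.items():
        activations = []
        for pattern in patterns:
            w = normalize(pattern)                  # weight vector of one pattern unit
            z = sum(a * b for a, b in zip(x, w))    # dot product Z = X . W
            activations.append(math.exp((z - 1.0) / sigma ** 2))
        scores[label] = sum(activations) / len(activations)  # summation unit
    return max(scores, key=scores.get)              # decision unit
```

There is no iterative training here: "training" amounts to storing the patterns, which is why the whole training set must be kept in memory.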

Sub-dimension:
The sub-dimension method [4] can be implemented by dividing the database into smaller parts and applying the classification procedure to each part. Let the dataset be a matrix with $i$ objects. Instead of considering all features as evidence for classification, the sub-dimension based PNN algorithm takes each sub-dimension as the input pattern and applies the PNN classification to each sub-dimension separately. The observable benefit of this approach is that the result for each sub-dimension is hardly affected by features in other sub-dimensions. In a previous study, we simply used the majority decision to determine the class label: we concluded that an object $x$ is closer to $x_1$ than to $x_2$ when more than half of its sub-dimensions $x_j$ are closer to $x_{1j}$ than to $x_{2j}$.
The entropy can be formulated by
$$H = -\sum_{k} p_k \log p_k$$
where $p_k$ is the probability that one object falls in class $k$. It can be estimated from
$$p_k = \frac{n_k}{n}$$
where $n_k$ is the number of objects falling in class $k$ and $n$ is the total number of objects. The corresponding information gain of the $j$-th sub-dimension clustering is