Evaluation of data integration strategies based on kernel method of clinical and microarray data.

The cancer classification problem is one of the most challenging problems in bioinformatics. The data provided by the Netherlands Cancer Institute consist of 295 breast cancer patients; 101 patients have distant metastases and 194 patients do not. Combinations of feature sets based on the kernel method are investigated for classifying patients with and without distant metastases. Single data sets are compared with three data integration strategies and with weighted data integration strategies based on the kernel method. The Least Squares Support Vector Machine (LS-SVM) is chosen as the classifier because it can handle very high-dimensional features, for instance microarray data. The experimental results show that weighted late integration and the use of only microarray data perform almost the same. In this case, a data integration strategy is not always better than using a single data set. Classification performance depends strongly on the features used to represent the object.


Background:
In bioinformatics, an object can be represented by several heterogeneous data sets. Combining heterogeneous data sets with the kernel method is considered to produce better classification results [1, 2, 3]. One example of bioinformatics data that contains several heterogeneous data sets is the data of breast cancer patients, which consist of microarray data and clinical data [4]. The objective is to classify breast cancer patients as being with or without distant cancer. Distant cancer, also called distant metastasis, indicates that the cancer has spread from the primary tumor to distant organs or distant lymph nodes [5]. The Support Vector Machine (SVM), introduced by Vapnik, is a powerful classifier that is often used in bioinformatics applications [6]. A difficulty with SVM is finding proper parameters to build the model, which takes time when the data set or the feature set is large, as with microarray data. The Least Squares Support Vector Machine (LS-SVM) is a modified SVM that is claimed to be faster in the training phase [7]. The main idea of SVM is to find the best hyperplane that separates the data into two classes, positive and negative. Using the kernel method, mixed data can be implicitly mapped by a kernel function into a higher-dimensional space in which they can be separated linearly [6, 8].
The means of combining several heterogeneous data sets are called data integration strategies. The various feature sets can be integrated in a simple manner based on the kernel method [1]. For visualization, the feature dimension of the microarray data is reduced to 2 and 3 (Figures 1 and 2), and the feature dimension of the clinical data is reduced from 13 to 2 and 3 (Figures 3 and 4). The blue and red circles represent the two classes. The visualizations show that the data of the two classes are mixed.

Methodology:
The kernel-based data integration strategies are classified into three classes: early integration, intermediate integration and late integration [2]. Early integration is a concatenation of the feature sets, intermediate integration is a combination of the kernel matrices, and late integration is a combination of the models. Three common kernel functions are used in this experiment: the linear kernel, the polynomial kernel and the Radial Basis Function (RBF) kernel. The equations for combining the kernel matrices in intermediate integration are given by Equation (1) for the linear kernel, Equation (2) for the polynomial kernel and Equation (3) for the RBF kernel in the Supplementary material. These kernel functions are applied to LS-SVM. The difference between LS-SVM and SVM is that LS-SVM uses a squared loss function instead of the hinge loss function. The feature sets may differ in their level of significance, which motivates giving a weight to each feature set. In this paper, this is called weighted data integration. Each feature set is weighted with a particular real number that controls the contribution of that feature set to the final conclusion. The implementation of the weighted data integration strategy is straightforward: each kernel matrix is multiplied by a real value. In this case of two feature sets (i.e. the microarray data set and the clinical data set), the weights of the two data sets sum to 1.0. The equations of weighted intermediate integration are given by Equation (4) for the linear kernel, Equation (5) for the polynomial kernel and Equation (6) for the RBF kernel in the Supplementary material.
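The strategies described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and Equations (1) to (6) of the Supplementary material are not reproduced here. The sketch makes several assumptions: the RBF kernel is written as exp(-||x - z||^2 / σ²), which is one common convention; LS-SVM is trained by solving the standard linear system of its dual formulation; weighted late integration is taken to be a weighted sum of the decision values of two separately trained models; and the data are synthetic stand-ins with 70 and 13 features. All function names, the regularization value and the kernel widths are illustrative.

```python
import numpy as np

def linear_kernel(A, B):
    """K(x, z) = x . z"""
    return A @ B.T

def polynomial_kernel(A, B, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (A @ B.T + coef0) ** degree

def rbf_kernel(A, B, sigma2=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma2) (assumed convention)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_fit(K, y, gamma):
    """Train an LS-SVM classifier by solving one linear system.

    K is the (n, n) training kernel matrix, y holds labels in {-1, +1},
    gamma is the regularization parameter. Returns (alpha, b).
    """
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]

def lssvm_decision(K_new_train, y_train, alpha, b):
    """Decision values f(x) = sum_i alpha_i * y_i * K(x, x_i) + b."""
    return K_new_train @ (alpha * y_train) + b

# Synthetic stand-in data (NOT the breast cancer data set).
rng = np.random.default_rng(0)
n = 40
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)
X_micro = rng.normal(size=(n, 70)) + 0.5 * y[:, None]  # "microarray" features
X_clin = rng.normal(size=(n, 13)) + 0.2 * y[:, None]   # "clinical" features
w, gamma = 0.74, 3.0  # illustrative weight and regularization value

# Early integration: concatenate the feature sets, then build one kernel.
X_early = np.hstack([X_micro, X_clin])
K_early = rbf_kernel(X_early, X_early, sigma2=83.0)

# Weighted intermediate integration: weighted sum of the kernel matrices,
# with weights w and 1 - w that sum to 1.0.
K_micro = rbf_kernel(X_micro, X_micro, sigma2=70.0)
K_clin = rbf_kernel(X_clin, X_clin, sigma2=13.0)
K_inter = w * K_micro + (1.0 - w) * K_clin
alpha, b = lssvm_fit(K_inter, y, gamma)
scores_inter = lssvm_decision(K_inter, y, alpha, b)

# Weighted late integration (assumed form): train one model per data set
# and combine their decision values with the same weights.
alpha_m, b_m = lssvm_fit(K_micro, y, gamma)
alpha_c, b_c = lssvm_fit(K_clin, y, gamma)
scores_late = (w * lssvm_decision(K_micro, y, alpha_m, b_m)
               + (1.0 - w) * lssvm_decision(K_clin, y, alpha_c, b_c))
```

Because a weighted sum of valid kernel matrices with non-negative weights is again a valid kernel matrix, the combined matrix can be passed to the same training routine as a single kernel. For unseen patients, the same weighted sum would be applied to the kernel values between test and training samples before calling `lssvm_decision`.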

Discussion:
The results of the experiment scenario are shown in Table 1 (see supplementary material).

Conclusion:
The experiment scenario is designed to evaluate the performance of data integration strategies against the use of a single data set in the case of breast cancer classification. The issue of the significance level of each feature set is also included, as the weighted data integration strategy. The experiment scenario contains 33 models that are evaluated. The complete results are shown in Table 1 (see supplementary material). The data are taken from the public domain, from the Netherlands Cancer Institute, and contain a microarray data set with a clinical data set as supplementary data.
Based on the experimental results, the microarray model has a slightly higher kappa statistic than weighted late integration, but weighted late integration has a slightly higher AUC than the microarray model. This shows that the data integration strategy performs almost the same as using a single data set. Further analysis suggests that the clinical data set performs relatively poorly. The experiment using microarray data performs similarly to weighted late integration because the clinical data set in weighted late integration cannot contribute much to the classification result. In general, classification performance is governed by the feature sets that are used.

This research is closely related to the work of Daemen A et al., "A kernel-based integration of genome-wide data for clinical decision support." Daemen A et al. concluded that the accuracy of cancer prediction increased when multiple data sets were integrated [3]. The main contribution of this paper is to evaluate classification performance for distant cancer classification with and without data integration strategies.

The data sets cover 295 breast cancer patients and are in the public domain, provided by the Netherlands Cancer Institute. The data consist of 101 patients with and 194 patients without distant metastases. The data set is split into a training set and a validation set. The training set contains 148 patients, of whom 47 are with and 101 are without distant metastases, and the validation set contains 147 patients, of whom 54 are with and 93 are without distant metastases. The patients are women younger than 53 years with tumors smaller than 5 cm [4]. Microarray technology is a very important tool for monitoring the genome-wide expression levels of genes in an organism. The microarray data contain 24,496 spots, or features, from which 70 features that form a good-prognosis signature can be selected. The clinical data contain only 13 variables: Diameter of tumor (Numeric), T1_T2 (Binary; ≤2cm or >2cm), pN (3 classes; pN0, 1-3, or ≥4), Number of positive lymph nodes (Numeric), Mastectomy (Binary; yes or no), Estrogen Receptor (Binary; positive or negative), Tumor grade (3 classes; poorly, intermediate or well differentiated), Age (Numeric), Chemotherapy (Binary; yes or no), Hormonal therapy (Binary; yes or no), St. Gallen criteria (Binary; chemotherapy or no chemotherapy), National Institutes of Health (NIH) consensus criteria (Binary; chemotherapy or no chemotherapy), and NIH risk (3 classes; low, intermediate or high) [4]. To visualize the distribution of the data sets, Kernel Dimensionality Reduction (KDR) [9] is used to reduce the feature dimension of the microarray data from 70 to 2 and 3, as shown in Figures 1 and 2.

Figure 1: Visualization of the microarray data features reduced to 2 dimensions.

Figure 2: Visualization of the microarray data features reduced to 3 dimensions.

Figure 3: Visualization of the clinical data features reduced to 2 dimensions.

Figure 4: Visualization of the clinical data features reduced to 3 dimensions.

Table 1 (see supplementary material).
The best AUC, 0.7493, is obtained by weighted late integration using two RBF kernel functions, configured with a weight of 0.74 for the microarray data set, a weight of 0.26 for the clinical data set, a gamma parameter of 3.317 for the microarray data set, a gamma parameter of 1.685 for the clinical data set, σ² of 39.686 for the microarray data set and σ² of 13.276 for the clinical data set. The best kappa statistic, 0.4725, is obtained by the model that uses only the microarray data set with an RBF kernel function, configured with a gamma parameter of 3.380 and σ² of 46.918.
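The two evaluation measures reported here, the kappa statistic and the AUC, can be computed as in the following minimal sketch. It assumes Cohen's kappa for the agreement between true and predicted labels and the rank-based (Mann-Whitney) form of the AUC, with ties counted as one half. The labels, scores, decision threshold and function names are illustrative toy values and are not taken from the experiment.

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)  # observed agreement
    # Agreement expected by chance from the marginal label frequencies.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return (p_o - p_e) / (1.0 - p_e)

def auc(y_true, scores, positive=1):
    """Probability that a random positive scores above a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == positive]
    neg = scores[y_true != positive]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Toy example: +1 = with distant metastases, -1 = without.
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])
scores = np.array([0.9, 0.4, 0.6, 0.5, 0.1, -0.3, -0.2, 0.2])
y_pred = np.where(scores > 0, 1, -1)  # threshold the decision values at 0

kappa_value = cohen_kappa(y_true, y_pred)
auc_value = auc(y_true, scores)
print(f"kappa = {kappa_value:.4f}, AUC = {auc_value:.4f}")
```

The kappa statistic depends on the chosen decision threshold, whereas the AUC depends only on the ranking of the scores. This is one reason why two models can be ordered differently by the two measures.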