Genes Selection Comparative Study in Microarray Data Analysis

In response to the rapid development of DNA microarray technologies, many algorithms for selecting differentially expressed genes have been developed, and several studies have compared them. However, it remains unclear how these methods compare with one another, especially when they are run with different software tools. Here, we consider three commonly used approaches for selecting differentially expressed genes, namely Fold Change, T-test and SAM, as implemented in the Bioinformatics Matlab Toolbox and in R/BioConductor. We used two datasets generated with Affymetrix technology to present the results of these methods and software tools in the gene selection process. The results, in terms of sensitivity and specificity, indicate that SAM performs better than Fold Change and T-test in R/BioConductor, whereas no practical differences were observed between the three gene selection methods in the Bioinformatics Matlab Toolbox. The ROC curves show that, on the one hand, SAM in R/BioConductor is favored for microarray gene selection over the other methods and, on the other hand, the results of the three gene selection methods in the Bioinformatics Matlab Toolbox remain comparable for the two datasets used.


Background:
In many cases, the purpose of a microarray experiment is to compare gene expression levels between two different conditions. Usually, one sample is considered the reference or control and the other the experiment. In all such comparative studies, the main goal is to determine the genes that are differentially expressed between the two samples being compared. In the early days, the simple Fold Change (FC) method was used; it involves selecting genes for which the ratio of the experiment and control values is a certain distance from the mean experiment/control ratio [1]. Another possible approach to gene selection is to use univariate statistical tests, such as the T-test. This approach essentially relies on the classical hypothesis testing methodology [2]. Analysis of variance (ANOVA) is a particularly interesting approach to microarray data analysis and to the selection of differentially expressed genes (DEGs): the idea behind ANOVA is to build an explicit model of all sources of variance that affect the measurements and to use the data to estimate the variance [3]. Another approach is model-based maximum likelihood estimation [4], which is very general and very powerful. Significance Analysis of Microarrays (SAM) and the moderated t-statistic based on empirical Bayes are two tests designed to address the weaknesses of the simple T-test when the number of samples is small [5,6]. More recently, Storey et al. [7] developed a new approach based on the Optimal Discovery Procedure (ODP), which aims to maximize the expected number of true positive genes for each fixed level of expected false positives. Several other methods have been used for DEG selection, and the literature is abundant [8,9,10]. Despite this wealth of literature on DEG selection methods, there is no clear-cut guideline regarding the choice of methods and software. In this context, a comparative study could guide practitioners toward a suitable tool.
Here, we comparatively review three DEG selection methods, Fold Change, T-test and SAM, using two software tools, R/Bioconductor and the Bioinformatics Matlab Toolbox. To conduct our study, we used two datasets generated with Affymetrix technology [11], since we showed in our previous study that there is no significant difference, in terms of differential analysis, between Affymetrix and other technologies such as Agilent [12]. The aim is to evaluate, by means of sensitivity and specificity, the impact of the statistical tools on the DEGs selected. In the next section, we justify the choice of statistical tools and give a concise description of the DEG selection workflow followed for all proposed algorithms and tools. Finally, we discuss the results of the algorithms and tools on well-known datasets.

Methodology:
To assess the performance of a gene selection method, we need a set of criteria that can qualify the outcome of the selection process. In our case, the performance of a gene selection method was calculated in terms of specificity and sensitivity and presented as Receiver Operating Characteristic (ROC) curves. In a binary decision situation, such as changed or unchanged, the results can always be divided into four categories: truly changed genes that are reported as changed (True Positives: TP), unchanged genes that are reported as changed (False Positives: FP), truly changed genes that are reported as unchanged (False Negatives: FN), and unchanged genes that are reported as unchanged (True Negatives: TN). Based on these quantities, one can define the sensitivity, TP/(TP+FN), and the specificity, TN/(TN+FP), which qualify the performance of the DEG selection methods and tools used.
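
The four categories and the two derived measures can be illustrated with a minimal sketch. The function names and the toy labels below are hypothetical and are not taken from the study; the sketch only assumes that, for each gene, both the true status and the reported status are available as booleans.

```python
def confusion_counts(truth, called):
    """Count TP, FP, FN, TN for paired booleans (True = changed)."""
    tp = fp = fn = tn = 0
    for is_changed, is_reported in zip(truth, called):
        if is_changed and is_reported:
            tp += 1  # truly changed, reported as changed
        elif not is_changed and is_reported:
            fp += 1  # unchanged, reported as changed
        elif is_changed:
            fn += 1  # truly changed, reported as unchanged
        else:
            tn += 1  # unchanged, reported as unchanged
    return tp, fp, fn, tn


def sensitivity_specificity(truth, called):
    """Return (sensitivity, specificity) = (TP/(TP+FN), TN/(TN+FP))."""
    tp, fp, fn, tn = confusion_counts(truth, called)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one changed and one unchanged gene")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (invented labels): 3 changed genes, 5 unchanged genes.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
sens, spec = sensitivity_specificity(truth, called)
```

In this toy case two of the three changed genes are recovered and one unchanged gene is wrongly reported, so the sensitivity is 2/3 and the specificity is 4/5.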

Selected algorithms for comparison
Several studies have addressed the comparison and choice of gene selection methods. The most recent study in this context compares gene selection methods and classifies them according to several parameters [13]. It shows that all DEG selection methods may be classified into three categories: methods based on deterministic FC; methods based on the T-test, multiple T-tests and P-values; and methods using random permutations. Based on this classification, we selected three methods, one from each class, and we compare the genes selected by each method under R/Bioconductor and the Bioinformatics Matlab Toolbox. The first approach is FC, selected from the first class. This is the most intuitive approach to finding genes that are differentially regulated [1]. Typically, an arbitrary threshold is chosen and the difference is considered significant if it is larger than the threshold. The FC method is often used because of its simplicity. The T-test, chosen from the second class, is extensively used in gene expression analysis. It assesses whether the difference between two groups or samples is significant [2]. In our study, the P-values calculated by the T-test were used without any multiple testing correction. The DEGs were obtained by applying two criteria: P-value < 0.05 and FC >= 2. The last approach, taken from the third class, is the SAM method [5]. SAM uses a statistic called the "relative difference" for each gene. This is very similar to a t-statistic with equal variance, except that the "gene-specific scatter" in the denominator is offset by a fudge factor. This factor is an exchangeability factor chosen to minimize the coefficient of variation, and it is computed with a sliding-window approach across the data.
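
To make the relation between the three statistics concrete, the following minimal sketch computes, for a single gene, the fold change, the equal-variance t-statistic and a SAM-style relative difference. It is an illustration under stated assumptions, not the implementation used in R/Bioconductor or in the Bioinformatics Matlab Toolbox: the function names, the expression values and the fixed fudge factor `s0 = 5.0` are hypothetical, the data are assumed to be on the raw (non-log) scale, and the data-driven choice of the fudge factor is replaced by a constant. Converting the t-statistic into a P-value requires the Student t distribution and is omitted here.

```python
import math
from statistics import mean


def pooled_standard_error(control, experiment):
    """Gene-specific scatter: standard error of the difference of means,
    using the pooled (equal-variance) estimate."""
    n1, n2 = len(control), len(experiment)
    m1, m2 = mean(control), mean(experiment)
    sum_sq = (sum((x - m1) ** 2 for x in control)
              + sum((x - m2) ** 2 for x in experiment))
    return math.sqrt((1 / n1 + 1 / n2) * sum_sq / (n1 + n2 - 2))


def gene_scores(control, experiment, s0=0.0):
    """Return fold change, t-statistic and relative difference for one gene.

    s0 is the fudge factor added to the denominator; here it is a fixed,
    illustrative constant (an assumption of this sketch).
    """
    diff = mean(experiment) - mean(control)
    s = pooled_standard_error(control, experiment)
    return {
        "fold_change": mean(experiment) / mean(control),
        "t": diff / s if s > 0 else math.copysign(math.inf, diff),
        "d": diff / (s + s0),
    }


# Invented expression values for one gene, three replicates per condition.
scores = gene_scores([100, 110, 90], [220, 200, 210], s0=5.0)
passes_fc = scores["fold_change"] >= 2  # FC criterion from the text
```

With these invented values the fold change is 2.1, so the gene passes the FC >= 2 criterion, and the relative difference is smaller in magnitude than the t-statistic because the fudge factor enlarges the denominator.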

Results & Discussions:
The aim of this paper is to understand how gene selection algorithms differ from one another and how they are influenced by the use of different software. In this study, we compare the results of three gene selection methods, SAM, FC and T-test, in both R/BioConductor and the Bioinformatics Matlab Toolbox using two datasets. We first evaluate the influence of the statistical tools on the number of selected genes using Venn diagrams; we then evaluate the sensitivity and specificity of each method and tool using ROC curves. The first result is shown in Figure 1. A Venn diagram shows the overlap between the three gene selection methods and the genes identified as differentially expressed in each dataset (denoted by "identified genes" or "id" in Figure 1). The lists of genes reported as differentially expressed differ between the methods, software and datasets used. In terms of methods, SAM identified 80% to 89% of all genes in each dataset under R/Bioconductor, relative to the T-test and FC methods. In contrast, the gene lists produced by each method in the Bioinformatics Matlab Toolbox are close to one another across the three methods, with a reduced number of genes. In terms of software, the number of DEGs selected remained large in R/BioConductor but was considerably reduced in the Bioinformatics Matlab Toolbox (for example, in the spike-in dataset, 1224 genes were selected by R/BioConductor versus 101 genes by the Matlab toolbox for the same data). From this first analysis we can conclude that the Bioinformatics Matlab Toolbox selects a reduced number of genes compared to R/BioConductor. However, it remains unclear whether the selected genes are relevant. For this reason, we used ROC curves to compare the performance of the three methods and the two software tools on each dataset in terms of sensitivity and specificity (Figure 2). Figure 2A represents the ROC curve using R/BioConductor for the breast cancer dataset.
This figure shows that SAM and T-test report differentially expressed genes in a stricter manner than FC. This is not surprising, given that the former two methods take into account the variance of an observation between different conditions, which is not the case for FC. As the figure shows, SAM is more stringent than the T-test; the explanation lies in the "fudge factor", which reduces the bias from genes with large variances that do not necessarily have biological significance. FC has a substantial false positive rate, amounting to almost half of its results. Figure 2B represents the ROC curve using R/BioConductor for the platinum spike-in dataset; it shows that SAM is preferable to the T-test and FC. Figure 2C represents the ROC curve using the Bioinformatics Matlab Toolbox for the breast cancer dataset; it shows close results between the three gene selection methods, especially between SAM and T-test. Figure 2D represents the ROC curve using the Bioinformatics Matlab Toolbox for the platinum spike-in dataset and also shows that the false positive rate remains comparable for the three methods.
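
The construction of a ROC curve of the kind discussed above can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 2: it assumes that every gene has a score from a selection method (larger meaning stronger evidence of change) and that the true status of every gene is known. The function names, scores and labels are hypothetical.

```python
def roc_points(scores, truth):
    """Return (false positive rate, true positive rate) pairs obtained by
    lowering the score threshold step by step; tied scores share one point."""
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one changed and one unchanged gene")
    ranked = sorted(zip(scores, truth), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_changed) in enumerate(ranked):
        if is_changed:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last gene of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Invented scores for five genes; three are truly changed.
gene_scores_example = [9.0, 7.5, 6.0, 3.0, 1.0]
gene_truth_example = [True, True, False, True, False]
curve = roc_points(gene_scores_example, gene_truth_example)
auc = area_under_curve(curve)
```

Here the sensitivity is the true positive rate and the false positive rate equals 1 minus the specificity, so each point on the curve corresponds to one choice of selection threshold. A method whose curve lies closer to the upper-left corner, or equivalently has a larger area under the curve, separates changed from unchanged genes better on that dataset.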

Conclusion:
A fundamental step in microarray studies is the identification of a small subset of DEGs from among the tens of thousands of genes probed on the microarray. DEG lists must be concordant to satisfy the scientific requirement of reproducibility, and must also be specific and sensitive to be scientifically relevant. The Bioinformatics Matlab Toolbox yields a reduced list of DEGs with the different methods considered, relative to R/BioConductor. Concerning the gene selection methods, SAM in R/BioConductor is favored for microarray gene selection over the other methods, notably FC, whose results are uncertain. In contrast, the ROC curves obtained with the Bioinformatics Matlab Toolbox show no significant difference between the three gene selection methods, with reproducible results. We conclude that the SAM approach is overly conservative in R/BioConductor, whereas FC has extremely low power in the two datasets used. On the other hand, the three gene selection methods display high power in the Bioinformatics Matlab Toolbox. Therefore, the latter would be chosen if a small list of DEGs is preferred. In conclusion, a DEG list should be chosen in a manner that concurrently satisfies scientific objectives in terms of the inherent tradeoffs between reproducibility, specificity and sensitivity.