An adaptive alpha spending algorithm improves the power of statistical inference in microarray data analysis.

The adaptive alpha-spending algorithm incorporates additional contextual evidence about differential expression (including correlations among genes) to adjust the initial p-values, yielding the alpha-spending adjusted p-values. The algorithm is so named because of its similarity to the alpha-spending algorithm used in interim analysis of clinical trials, in which a stage-specific significance level is assigned to each stage of the trial. We show that the Bonferroni correction applied to the alpha-spending adjusted p-values approximately controls the Family-Wise Error Rate (FWER) under the complete null hypothesis. Using simulations, we also show that the alpha-spending algorithm yields increased power over the unadjusted p-values while controlling the False Discovery Rate (FDR). We found greater benefits of the alpha-spending algorithm with increasing sample sizes and increasing correlation among genes. The use of the alpha-spending algorithm will result in microarray experiments that make more efficient use of their data and may help conserve resources.


Background:
Microarray technology has become a widely used and effective research tool in modern molecular biology. It can produce a snapshot of the expression levels of thousands of genes simultaneously at a very low cost per data point. However, researchers are often more interested in how biological pathways respond to changes in experimental conditions than in changes in the expression levels of individual genes. The total flux through a pathway can change dramatically through subtle changes in the expression levels of genes involved in that pathway [1].
Thus, the prevalence of microarray technology in the research of complex metabolic disorders makes the problem of identifying genes with subtle differential expression increasingly important. Unfortunately, identifying such genes is challenging because of the huge number of genes involved, the noisiness of the data, and the very small sample sizes (often not more than 5 observed expression levels per gene and/or per treatment group).
Most approaches for identifying differentially expressed genes may be of limited power because they neither take into account nor capitalize on dependencies among genes. As an alternative, we propose an adaptive alpha-spending algorithm that takes the dependencies of expression levels among genes into account explicitly by assigning a gene-specific significance level to each gene. The algorithm is so named because of its similarity to alpha-spending algorithms in interim analysis of clinical trials [2].
Interim analysis is often carried out at multiple times in a clinical trial, for example to check adherence to the protocol or for economic and ethical reasons. Because the same null hypothesis is tested multiple times in interim analysis, failing to correct for multiple testing inflates the type 1 error. The alpha-spending algorithm controls multiplicity by assigning a stage-specific significance level to each stage of the clinical trial such that the stage-specific significance levels sum to the overall significance level, i.e., $\sum_{i=1}^{k} \hat{\alpha}_i = \alpha$, with $k$ the number of stages, $\hat{\alpha}_i$ the stage-specific significance level for the $i$-th stage, and $\alpha$ the global significance level. The stage-specific significance level is given by $\hat{\alpha}_i = \alpha(t_i) - \alpha(t_{i-1})$ with $t_0 = 0$, where $\alpha(\cdot)$ is a monotonic nondecreasing function with $\alpha(0) = 0$ and $\alpha(1) = \alpha$, called the alpha-spending function, and $t_i$ is the fraction of information accrued in the clinical trial at stage $i$, a quantity between 0 and 1 that is often defined as a function of the accrued and planned sample sizes.
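As an illustration of the clinical-trial spending scheme, the following minimal sketch computes stage-specific significance levels, assuming the usual convention that each stage spends the increment of the spending function between consecutive information fractions. The power-family form $\alpha(t) = \alpha t^{\gamma}$ and the names `spending` and `stage_levels` are assumptions chosen for illustration; the text does not say which spending function is intended.

```python
def spending(t, alpha=0.05, gamma=1.0):
    """Hypothetical power-family spending function.

    Satisfies spending(0) = 0 and spending(1) = alpha and is
    nondecreasing in t for gamma > 0.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("information fraction must lie in [0, 1]")
    return alpha * t ** gamma


def stage_levels(info_fractions, alpha=0.05, gamma=1.0):
    """Stage-specific levels as increments of the spending function.

    info_fractions must be nondecreasing; if the last entry is 1,
    the returned levels sum to alpha.
    """
    levels = []
    previous = 0.0
    for t in info_fractions:
        current = spending(t, alpha, gamma)
        levels.append(current - previous)
        previous = current
    return levels


# Four equally spaced looks at the data, the last one at full information.
levels = stage_levels([0.25, 0.5, 0.75, 1.0], alpha=0.05, gamma=2.0)
print(levels)
print(sum(levels))
```

With `gamma` greater than 1, little of the overall level is spent at early looks and most is kept for the final analysis; `gamma = 1` spends it evenly in the information fraction.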
The key assumption underlying our alpha-spending algorithm is that if the expression levels of two genes are positively (negatively) correlated, then one of the two genes is an activator (repressor) of the other. This assumption is incorporated into the algorithm by making the gene-specific significance levels proportional to the linear regression predictor computed from the correlation matrix of the observed differential expression levels and the observed differential expression levels of the other genes. For instance, if a particular gene A is highly positively correlated with many up-regulated genes, this provides additional contextual evidence that gene A is also up-regulated. This contextual evidence is fed back into the algorithm by assigning a higher significance level to gene A. As in alpha-spending in clinical trials, the gene-specific significance levels $\hat{\alpha}_1, \ldots, \hat{\alpha}_k$ are computed subject to a constraint on their sum, which provides a mechanism for controlling the number of false positives. The alpha-spending algorithm approximately controls the FWER in the weak sense. By this we mean that, under the global null hypothesis that all genes are non-differentially expressed, the Bonferroni correction applied to the alpha-spending adjusted p-values controls the FWER. This approximate weak control follows directly from Bonferroni's inequality applied to the population quantities $\alpha_i$, of which the $\hat{\alpha}_i$ can be regarded as estimates. The alpha-spending adjusted p-values will be derived in the next section.
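The Bonferroni argument can be written generically. The thresholds $c_i$ below are placeholder symbols rather than the authors' notation, and the block only restates the standard union bound under the global null, assuming each null p-value $P_i$ satisfies $P(P_i \le c) \le c$:

```latex
\mathrm{FWER}
  = P\Bigl(\bigcup_{i=1}^{k} \{P_i \le c_i\}\Bigr)
  \le \sum_{i=1}^{k} P(P_i \le c_i)
  \le \sum_{i=1}^{k} c_i .
```

Any fixed set of per-gene thresholds whose sum is at most $\alpha$ therefore gives $\mathrm{FWER} \le \alpha$. When the thresholds are estimated from the same data, the bound need not hold exactly, which is consistent with the control being described as approximate.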

Methodology:
The gene-specific significance levels are based on a prediction equation similar to the linear regression prediction of $y$ from $x$, in which the regression term can be interpreted as the predictive information from the observed values of $x$ for $y$.

Similar to this predictive information, we derive predictive information $\hat{\pi}_i$ for the unknown population differential expression level $\delta_i$ of gene $i$ from the observed differential expression levels of the other genes. If, for instance, gene $i$ is strongly positively correlated with highly up-regulated genes, then there is contextual evidence that gene $i$ is also up-regulated, and this contextual evidence is reflected in a relatively large positive value of $\hat{\pi}_i$. Because the p-value for the differential expression of a gene depends on the absolute value of its differential expression, our gene-specific significance level is defined in terms of the absolute value of $\hat{\pi}_i$.
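To make the idea of contextual evidence concrete, the following sketch predicts each gene's standardized differential expression from the other genes through the correlation matrix, using the usual linear-regression (conditional-mean) formula. It is an illustration under stated assumptions, not the authors' estimator: the function name `contextual_prediction`, the small ridge term `ridge` (added because sample correlation matrices are singular when genes outnumber samples), and the use of a full leave-one-out regression are all hypothetical.

```python
import numpy as np


def contextual_prediction(z, corr, ridge=1e-8):
    """Leave-one-out linear prediction of each z[i] from the other entries.

    z    : standardized differential expression levels, shape (k,)
    corr : correlation matrix among genes, shape (k, k)
    """
    z = np.asarray(z, dtype=float)
    corr = np.asarray(corr, dtype=float)
    k = z.size
    pred = np.empty(k)
    for i in range(k):
        others = np.delete(np.arange(k), i)
        # Correlations among the remaining genes, lightly regularized.
        block = corr[np.ix_(others, others)] + ridge * np.eye(k - 1)
        # Regression coefficients of gene i on the remaining genes.
        beta = np.linalg.solve(block, corr[others, i])
        pred[i] = beta @ z[others]
    return pred


# Three genes: genes 0 and 1 are strongly correlated, gene 2 is independent.
corr = np.array([[1.0, 0.8, 0.0],
                 [0.8, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
z = np.array([2.5, 2.0, 0.1])
pi_hat = contextual_prediction(z, corr)
print(pi_hat)
```

In this toy input, gene 1 receives a large positive prediction because it is strongly correlated with the up-regulated gene 0, whereas the uncorrelated gene 2 receives a prediction of zero. The explicit loop is written for readability and would be slow for hundreds of genes.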

Discussion:
We have proposed an adaptive alpha-spending algorithm for finding differentially expressed genes in microarray data sets, in which observed dependencies among genes are incorporated by assigning a gene-specific significance level to each gene. We think this procedure may increase the power to find differentially expressed genes. The constraint on the gene-specific significance levels provides a mechanism for controlling the number of false positives. We have shown that the alpha-spending algorithm provides approximate weak control of the FWER.
To further investigate the power of the alpha-spending procedure and its ability to control the number of false positives, we conducted a simulation study with a relatively small number of genes ($k = 700$) and two treatment groups of equal sample size. The alpha-spending algorithm was applied to the equal-variances t-test for comparing the two groups, using the within-group correlation among genes as contextual information. We assessed the Per Comparison Error Rate (PCER) under the complete null, i.e., all genes are non-differentially expressed, as well as under the partial null, in which some but not all genes are non-differentially expressed. We also evaluated the False Discovery Rate (FDR) defined in [4] under the partial null only, and the power improvement in special circumstances. The PCER is the expected number of false positives divided by the total number of genes tested. The FDR is the expected proportion of false positives among the genes declared differentially expressed. All simulated microarray data sets were generated from a multivariate normal distribution.
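Error rates of this kind can be estimated from simulated data sets roughly as follows. The sketch uses the standard definitions (PCER as false positives per test, FDR as the expected proportion of false positives among rejections, taken as zero when nothing is rejected). The function names and the toy inputs are hypothetical and are not taken from the simulation study.

```python
import numpy as np


def false_positive_counts(rejected, is_null):
    """Return (V, R): false positives and total rejections in one data set."""
    rejected = np.asarray(rejected, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    return int(np.sum(rejected & is_null)), int(np.sum(rejected))


def estimate_error_rates(rejections, is_null):
    """Average per-data-set error proportions over simulated data sets.

    rejections : boolean array, shape (n_datasets, k)
    is_null    : boolean array, shape (k,), True for non-differential genes
    """
    rejections = np.asarray(rejections, dtype=bool)
    k = rejections.shape[1]
    pcer_terms, fdp_terms = [], []
    for row in rejections:
        v, r = false_positive_counts(row, is_null)
        pcer_terms.append(v / k)
        # The false discovery proportion is zero when nothing is rejected.
        fdp_terms.append(v / r if r > 0 else 0.0)
    return float(np.mean(pcer_terms)), float(np.mean(fdp_terms))


# Toy example: four genes (the last one truly differential), three data sets.
is_null = np.array([True, True, True, False])
rejections = np.array([[True, False, False, True],
                       [False, False, False, True],
                       [False, False, False, False]])
pcer, fdr = estimate_error_rates(rejections, is_null)
print(pcer, fdr)
```

Because the false discovery proportion depends on the number of rejections in each data set, its average can be much noisier than the PCER estimate when that number varies widely across data sets.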
Our simulation study confirms that the alpha-spending algorithm controls the PCER and FDR in many practical situations. Under the complete null, the PCER was controlled for all genes overall as well as for the group of uncorrelated genes. For the group of correlated genes, the PCER tended to be inflated (Table 1). Under the partial null, the PCER was controlled in all simulation parameter settings and the FDR was controlled in most of them (Figure 1). The observed PCER decreased with increasing group size and correlation, but this relationship was not seen in the observed FDR. On average, the alpha-spending algorithm improved the power, and this improvement grew with increasing group size or increasing correlation. The power improvement can be up to 47% for $\rho = 0.7$ and $n = 6$.
Table 1: Observed PCER for the alpha-spending post-processed p-values, estimated for correlated genes, uncorrelated genes, and all genes under the complete null hypothesis that all genes are non-differentially expressed. The number of genes in each simulation was 700, and nominal alpha levels of 0.01, 0.05, and 0.1 were used for identifying differential genes. In each simulation parameter setting $(n, \rho)$ the observed PCER was estimated from 100 simulated data sets.
The decrease in PCER with increasing group size and correlation was not found for FDR, possibly because of simulation error in estimating FDR as a consequence of the large variation in the number of genes declared differentially expressed across simulated data sets (see Figure 2). The inflation of the PCER among correlated genes under the global null may be explained by the gene-specific significance levels being inflated for the correlated genes and deflated for the uncorrelated genes. This highlights that the alpha-spending algorithm may be more likely to produce spurious findings when there are strong correlations among many non-differentially expressed genes.
A topic of future research is to investigate whether this situation can be ameliorated by developing adjustment procedures for the gene-specific significance levels in which lower gene-specific significance levels are assigned to genes with lower observed differential expression levels. Another topic of future research is the improvement of the power of the alpha-spending algorithm by the application of Empirical Bayes (EB) techniques [5] to the estimation of differential expression levels [6], correlations among differential expression levels [7], and the standard error of differential expression levels [8].
A simulation study reported that the mean squared error of EB estimates of differential expression levels is as low as 0.05 times that of the ordinary least squares estimators.

Conclusion:
We have proposed an adaptive alpha-spending algorithm for finding differentially expressed genes in microarray data sets, in which observed dependencies among genes are incorporated by assigning a gene-specific significance level to each gene. We have shown that the alpha-spending algorithm approximately controls the FWER under the complete null. In a simulation study of two-group comparisons with equal group sizes, we have illustrated that the alpha-spending algorithm, applied to the ordinary t-test, controls the PCER and FDR and improves the power under special circumstances. However, there may be situations in which the PCER is inflated, as was shown for the correlated genes under the complete null.