Exploratory methods for checking quality of microarray data.

In microarray experiments many undesirable systematic variations are commonly observed. Often investigators analyzing microarray data need to make subjective decisions about the quality of the experiment, by examining its chip image and a simple scatter plot. Thus, a more rigorous but simple method is desirable to determine the quality of microarray data. We propose two exploratory methods to investigate the quality of microarray experiments with replicated chips. The first method is based on correlations among chips and the second on the actual intensity values for each gene. The proposed methods are illustrated using a real microarray data set. The methods provide an initial estimation for determining the quality of microarray experiments.


as of each individual chip, they do not provide a rigorous statistical criterion to detect chips of poor quality. At an earlier stage of analysis, each microarray slide is often examined graphically, using scatter plots between chips to look for large variability (or low reproducibility) and any unusual patterns. However, such examinations rely on subjective human pattern recognition, and chips of poor quality can frequently enter the subsequent analysis, resulting in unreliable inference for the whole microarray study. Therefore, in this study we are concerned with checking the overall quality of microarray experiments and identifying outlying chips that have much lower reproducibility than the other chips.

There have been several approaches for checking reproducibility in microarray experiments. For example, Parmigiani et al. [1] defined an integrative correlation between two experiments that are conducted separately to answer the same biological question. This integrative correlation is calculated for each gene and is called the gene's reproducibility score. King et al. [2] used correlations, the rate of two-fold changes, and principal component analysis to check the reproducibility of gene expression measurements. Park et al. [3] proposed diagnostic plots for identifying outlying slides. In this paper, we propose an exploratory method to check the quality of microarray data using two different approaches.


Methodology:

We first describe the approach based on the correlations between chips and then describe the other approach based on the actual intensity values.


Correlation Based Approach

With $n_i$ chips available in the $i$th treatment, there are $n_i(n_i-1)/2$ pairwise correlation coefficients available. The chip-wise correlation plot shows the distribution of these correlations for each chip. The x-axis represents chips and the y-axis represents the distribution of pairwise within-treatment correlation coefficients. If a certain chip has low reproducibility, it is expected to show a different pattern of correlation coefficients.
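The chip-wise correlations described above can be sketched as follows. This is a minimal illustration, assuming Pearson correlation and a small made-up data set; the function names `pearson` and `chipwise_within_correlations` and the toy intensities are hypothetical and do not come from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of intensities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def chipwise_within_correlations(chips):
    """Map each chip index to its correlations with the other chips
    of the same treatment (n - 1 values per chip)."""
    per_chip = {j: [] for j in range(len(chips))}
    for j, k in combinations(range(len(chips)), 2):
        r = pearson(chips[j], chips[k])
        per_chip[j].append(r)
        per_chip[k].append(r)
    return per_chip


# Hypothetical toy data: one treatment, 4 chips, 5 genes per chip.
# The last chip is deliberately unlike the others.
treatment = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.1, 2.9, 4.2, 5.0],
    [0.9, 1.8, 3.2, 3.9, 5.1],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
within = chipwise_within_correlations(treatment)
n_pairs = sum(len(v) for v in within.values()) // 2  # each pair is stored twice
```

Plotting the values in `within[j]` against the chip index `j` (for example as box plots) would give one possible rendering of the chip-wise correlation plot; a chip whose values sit well below the others is a candidate outlier.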


Summary correlation plot

For each chip, there are two summary correlation coefficients available: $(r^w_{ij}, r^b_{ij})$. We propose a summary correlation plot in which the x-axis represents $1 - r^w_{ij}$ and the y-axis represents $1 - r^b_{ij}$. Each chip can then be represented as a point in this plot. If there is an outlying chip, its point will be located farther from the origin than the points of the other chips. The closer a chip is to the origin, the more reproducible it is.
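The coordinates of the summary correlation plot can be computed along the following lines. This is a sketch under assumptions: Pearson correlation is used, the averages are plain arithmetic means, and the names `summary_points` and `data` as well as the toy intensities are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of intensities."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summary_points(data):
    """data maps a treatment label to its list of chips.
    Returns {(treatment, chip index): (1 - r_w, 1 - r_b)}."""
    points = {}
    for i, chips in data.items():
        # Chips from every other treatment, used for the between-group average.
        others = [c for t, cs in data.items() if t != i for c in cs]
        for j, chip in enumerate(chips):
            rw = [pearson(chip, c) for k, c in enumerate(chips) if k != j]
            rb = [pearson(chip, c) for c in others]
            points[(i, j)] = (1 - sum(rw) / len(rw), 1 - sum(rb) / len(rb))
    return points


# Hypothetical toy data: two treatments, three chips each, five genes.
data = {
    "A": [
        [1.0, 2.0, 3.0, 4.0, 5.0],
        [1.1, 2.1, 2.9, 4.2, 5.0],
        [5.0, 1.0, 4.0, 2.0, 3.0],  # deliberately unlike the other A chips
    ],
    "B": [
        [2.0, 4.0, 1.0, 5.0, 3.0],
        [2.1, 3.9, 1.2, 5.1, 2.8],
        [1.9, 4.2, 0.9, 4.8, 3.1],
    ],
}
points = summary_points(data)
```

In this toy example the third chip of treatment A has the largest x-coordinate within its group, which is the pattern the plot is meant to expose.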


Kolmogorov-Smirnov test

To check whether the experiment is reproducible within the same treatment, we compare the distributions of correlations, $R^w_i$ for $i = 1, \cdots, I$. After the z-transformation, $z = \log((1+r)/(1-r))$, we apply the one-sided Kolmogorov-Smirnov test (K-S test) to all pairs $(R^w_i, R^w_{i'})$ for $i \neq i'$. The alternative hypothesis of this test is that the distribution of correlation coefficients derived from the $i$th treatment is greater than that of the $i'$th treatment; in terms of distribution functions, the distribution function of $R^w_i$ is less than the distribution function of $R^w_{i'}$.
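A minimal sketch of this comparison is given below. It applies the z-transformation as written in the text and computes the one-sided two-sample K-S statistic directly from the empirical distribution functions. The p-value uses the standard large-sample approximation $\exp(-2\,\frac{mn}{m+n}D^2)$, which is an assumption of this sketch and may be rough for small samples; the function names and the toy correlations are hypothetical.

```python
from math import exp, log


def z_transform(r):
    """z = log((1 + r) / (1 - r)), as given in the text."""
    return log((1 + r) / (1 - r))


def ks_one_sided(x, y):
    """D = max_t (F_y(t) - F_x(t)).
    D is large when the distribution function of x lies below that of y,
    that is, when x tends to take larger values than y."""
    d = 0.0
    for t in sorted(set(x) | set(y)):
        fx = sum(v <= t for v in x) / len(x)
        fy = sum(v <= t for v in y) / len(y)
        d = max(d, fy - fx)
    return d


def ks_one_sided_pvalue(d, m, n):
    """Large-sample approximation to the one-sided p-value (an assumption)."""
    return min(1.0, exp(-2.0 * m * n / (m + n) * d * d))


# Hypothetical within-treatment correlations for two treatments.
r_high = [0.95, 0.96, 0.97, 0.98]
r_low = [0.60, 0.70, 0.80, 0.90]
z_high = [z_transform(r) for r in r_high]
z_low = [z_transform(r) for r in r_low]

d_forward = ks_one_sided(z_high, z_low)   # tests "high" greater than "low"
d_reverse = ks_one_sided(z_low, z_high)   # tests "low" greater than "high"
p_forward = ks_one_sided_pvalue(d_forward, len(z_high), len(z_low))
```

Because the z-transformation is monotone, it does not change the K-S statistic itself; it is kept here only to mirror the procedure described in the text.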

Intensity Based Approach

The correlation based approach checks the treatment-wise and the chip-wise qualities. Therefore, it is not suitable for making any decisions concerning a specific gene.

For a specific gene $g$, we develop test procedures for checking its reproducibility. If the $i$th treatment group is highly reproducible, the intensity values from the same gene in this group should be similar. For simplicity, we assume that $Y_{ijg}$ has mean $\mu_{ijg}$ with common variance $\sigma^2_{ig}$.

To check the quality of gene $g$, we test whether the mean intensities within a treatment are the same or not. The hypothesis of interest is as follows:

$$H_0: \mu_{i1g} = \mu_{i2g} = \cdots = \mu_{i n_i g}.$$

For testing this hypothesis, the analysis of variance (ANOVA) model is commonly used [4]. In our case, however, there is no replicate data available to calculate the within sums of squares. Thus, a traditional ANOVA model is not applicable. Instead, we use the Local-Pooled-Error (LPE) estimate of $\sigma^2_{ig}$. The LPE is based on the idea that genes with similar intensity values will have similar variabilities within the same treatment. In each treatment, all genes with similar intensities are pooled together to estimate variances.
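The pooling idea can be illustrated with a deliberately simplified stand-in. The sketch below is not the LPE procedure itself: it assumes equal-sized intensity bins formed by sorting genes on their mean intensity, and it averages the per-gene sample variances within each bin. The function name `pooled_variance_by_intensity`, the parameter `n_bins`, and the toy data are hypothetical.

```python
from statistics import mean, variance


def pooled_variance_by_intensity(Y, n_bins=2):
    """Y: list of genes, each a list of intensities across the chips of one
    treatment. Returns one variance estimate per gene; genes that fall in the
    same intensity bin share the same pooled estimate.
    Simplified stand-in for LPE, not the published method."""
    order = sorted(range(len(Y)), key=lambda g: mean(Y[g]))
    size = -(-len(Y) // n_bins)  # ceiling division: genes per bin
    sigma2 = [0.0] * len(Y)
    for start in range(0, len(order), size):
        group = order[start:start + size]
        # Equal numbers of chips per gene, so the mean of the sample
        # variances equals the pooled variance within the bin.
        pooled = mean(variance(Y[g]) for g in group)
        for g in group:
            sigma2[g] = pooled
    return sigma2


# Hypothetical toy data: 4 genes measured on 3 chips of one treatment.
Y = [
    [1.0, 1.2, 0.8],    # low intensity
    [1.1, 0.9, 1.0],    # low intensity
    [8.0, 9.0, 10.0],   # high intensity
    [9.5, 8.5, 9.0],    # high intensity
]
sigma2 = pooled_variance_by_intensity(Y, n_bins=2)
```

The point of the design is that each gene borrows information from genes of similar intensity, so a variance estimate is available even when a single gene has few measurements.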


Prediction Model

We apply the following two step procedure for each gene.

Step 1: Estimate $\sigma^2_{ig}$ using LPE.

Step 2: Use the following statistic:

$$X^2_{i,g} = \sum_j \left( \frac{Y_{ijg} - \bar{Y}_{i \cdot g}}{\sigma_{ig}} \right)^2$$

If $X^2_{i,g}$ is sufficiently large, we can conclude that gene $g$ does not have high reproducibility, that is, there are chips in which gene $g$ has quite different intensity values in the $i$th treatment. In that case, gene $g$ is called discordant. Otherwise, gene $g$ is called concordant.

Since we test $G$ genes simultaneously, we may need to consider multiple testing issues. In our procedure, we control the false discovery rate (FDR) [6, 7] using q-values. With a predetermined cutoff value, we decide whether gene $g$ is concordant or not.
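The statistic of Step 2 can be sketched for a single gene in a single treatment as follows. The sketch assumes that a variance estimate is already available from Step 1 and reads the statistic as the sum over chips of squared standardized deviations from the gene's mean intensity; the function name `discordance_statistic`, the variance value, and the intensities are hypothetical. No cutoff is applied here, since the cutoff in the text is determined through q-values.

```python
def discordance_statistic(y, sigma2):
    """X^2 = sum_j ((y_j - mean(y)) / sigma)^2 for one gene in one treatment.
    y: intensities of the gene across the chips of the treatment.
    sigma2: variance estimate for this gene (assumed to come from Step 1)."""
    ybar = sum(y) / len(y)
    return sum((v - ybar) ** 2 for v in y) / sigma2


sigma2_ig = 0.04  # hypothetical variance estimate for this gene

# Hypothetical intensities of two genes across five chips.
similar_gene = [7.9, 8.1, 8.0, 8.2, 7.8]
deviant_gene = [7.9, 8.1, 8.0, 8.2, 10.5]  # one chip is very different

x2_similar = discordance_statistic(similar_gene, sigma2_ig)
x2_deviant = discordance_statistic(deviant_gene, sigma2_ig)
```

A single deviating chip inflates the statistic considerably, which is why a large value points to a discordant gene.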

After deciding whether each gene is concordant or discordant, we calculate the gamma value as a summary measure of concordance.


Example

In this section, the proposed methods are applied to murine B-cell data. To study gene expression profiles in murine B-cell development, total cellular RNA was extracted from five consecutive B-lymphocyte lineage sub-populations (pre-BI cells, large pre-BII cells, small pre-BII cells, immature B-cells, and mature B-cells), and gene expression profiles from these five consecutive stages of mouse B-cell development were generated with more than five replicates [8]. The murine B-cell data show lower sensitivity (0.66) and specificity (0.02). For further exploratory analysis, we apply the proposed methods.

In the chip-wise correlation plot (Figure 1), most treatments except small pre-BII cells (chip 23 to chip 27) show high chip-wise correlations. The chip-wise correlations of the small pre-BII cell treatment have a highly skewed distribution, and the third replicate has very small correlations compared to the other chips in the same group. Therefore, we can conclude that this third replicate is problematic and has to be checked or treated before any further analysis. In the summary correlation plot (Figure 2), the murine B-cell data show one outlier, chip 25. All the chips except those in the small pre-BII group are located in the upper triangle, and chip 25 is far from the other chips. This supports the result from the chip-wise correlation plot (Figure 1).

In Table 1, the last column of $P_{KS}$ and $P_W$ shows lower p-values than the others. Therefore, we can conclude that the distribution function of the within correlations in the small pre-BII group is greater than those of the other groups. Also, the mean of the within correlations in the small pre-BII group is less than the means of the other groups.

Next, we apply the test based on intensities within each treatment. We set the FDR at 5%. Table 2 shows the results of the intensity based tests. The murine B-cell data show quite different patterns across treatments. In particular, the gamma of the small pre-BII treatment is the lowest among the five treatments. Therefore, we can conclude that the murine B-cell data set is less reproducible.

We can conclude that the murine B-cell data show lower reproducibility, sensitivity and specificity. Therefore, it is not clear whether a further statistical test procedure can successfully detect true differences among the five consecutive stages, especially for small pre-BII cells. This is mainly due to one outlying chip (chip 25), as shown in Figure 3. Therefore, the analyst should check the experimental procedure and tissues used for this chip before any further statistical analysis.




Discussion:

At the initial stage of microarray data analysis, exploratory data analysis (EDA) provides the first contact with the data. The techniques of EDA consist of a number of informal steps such as checking the quality of the data, calculating simple summary statistics, and constructing appropriate graphs.

Bioinformation 1(10): 423-428 (2007), ISSN 0973-2063, open access, www.bioinformation.net. © 2007 Biomedical Informatics Publishing Group.

$R^w_{ij}$ represents the within-group correlations of the $j$th chip in the $i$th treatment with the other chips in treatment $i$, and $R^b_{ij}$ represents the between-group correlations of the $j$th chip in treatment $i$ with the chips from the different treatments. $R^w_i$ and $R^b_i$ denote all within-group correlations and between-group correlations for the $i$th treatment, respectively. Using these correlation measures, we can check reproducibility. Let $r^w_{ij}$ be the average of all components of $R^w_{ij}$ and $r^b_{ij}$ be the average of all components of $R^b_{ij}$. If the chips are homogeneous within the same treatment, then $R^w_{ij}$ would be close to 1. Thus, the specificity can be defined i