Meta analysis of Chronic Fatigue Syndrome through integration of clinical, gene expression, SNP and proteomic data.

We start by constructing gene-gene association networks based on about 300 genes whose expression values vary among the groups of CFS patients and controls. Connected components (modules) from these networks are further inspected for their ability to predict symptom severity, the genotypes of two single nucleotide polymorphisms (SNPs) known to be associated with symptom severity, and the intensity of the ten most discriminative protein features. We use two different network construction methods and select the genes identified by both for added validation. Our analysis identified eleven genes which may play important roles in certain aspects of CFS or related symptoms. In particular, the gene WASF3 (also known as WAVE3) possibly regulates brain cytokines involved in the mechanism of fatigue through the p38 MAPK regulatory pathway.


SNP data:
Forty-two single nucleotide polymorphisms (SNPs) in 10 different genes were genotyped. For the purposes of this analysis, we selected two SNPs, hCV245410 (on gene TPH2) and hCV7911132 (on gene SLC6A4), which were previously identified [2] as associated with CFS severity.

Proteomic data:
Protein spectra are available for 63 subjects in the study. Serum was originally separated into 6 fractions, of which we use the last four, and then applied to three different SELDI surfaces, giving a total of 12 different settings. Experiments were repeated twice, and we averaged the two spectra for each subject. We removed the first 4000 m/z values from our analysis, which roughly corresponds to m/z values smaller than 1700 Da. We then divided each spectrum into bins of size 10 and took the maximum intensity value in each bin. This reduced the data by a factor of 10, leaving 2650 m/z values for further analysis. To de-noise the data, we estimated the standard deviation for each m/z bin and took the median of these as a measure of the noise standard deviation σ. Intensity values smaller than 3σ were considered to be pure noise; if this happened in all samples, the m/z value was removed from the analysis. Finally, the data were log-transformed.
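The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `preprocess_spectra` and its parameters are hypothetical, the standard deviation of each bin is taken across subjects, and values below the 3σ threshold are floored at the threshold before the log transform (the text does not say how such values were handled).

```python
import numpy as np

def preprocess_spectra(spectra, skip=4000, bin_size=10, k=3.0):
    """Bin, de-noise and log-transform averaged spectra.

    spectra: (n_subjects, n_points) array of raw intensities,
             one row per subject (already averaged over replicates).
    Returns the log-transformed matrix and the mask of retained bins.
    """
    x = np.asarray(spectra, dtype=float)[:, skip:]  # drop the low-mass region
    n_bins = x.shape[1] // bin_size
    # Maximum intensity within each bin of `bin_size` consecutive points.
    x = x[:, :n_bins * bin_size].reshape(x.shape[0], n_bins, bin_size).max(axis=2)
    # Noise level: median over bins of the per-bin standard deviation
    # (ASSUMPTION: the standard deviation is taken across subjects).
    sigma = np.median(x.std(axis=0, ddof=1))
    threshold = k * sigma
    # Discard bins whose intensity is below the threshold in every sample.
    keep = (x >= threshold).any(axis=0)
    x = x[:, keep]
    # ASSUMPTION: remaining sub-threshold values are floored at the
    # threshold so that the logarithm is always defined.
    return np.log(np.maximum(x, threshold)), keep
```

With the settings described in the text (`skip=4000`, `bin_size=10`), a spectrum of 30,500 points would yield the 2650 bins mentioned above before noise filtering.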

Statistical analysis:
The first step of the statistical analysis was to identify a set of genes differentially expressed between the groups of subjects. The disease status of each subject came from the clinical portion of the CFS data (the Intake Classific variable). All subjects included in the microarray study were classified into 5 groups: Ever CFS (45 subjects who ever experienced CFS); Non-fatigued (34 controls who never experienced CFS); Ever ISF (45 subjects who are fatigued but cannot be classified as CFS because of insufficient symptoms); Ever ISF-MDDm (20 subjects experiencing ISF with melancholic depression); and Ever CFS-MDDm (19 subjects experiencing CFS along with melancholic depression). An ANOVA F-test was carried out for each probe to detect differential expression across the five groups, and 286 probes were identified as differentially expressed (p-value < 0.01). Since we are not interested in determining the differentially expressed genes per se, no multiplicity correction was applied. The reduced microarray data, consisting of 286 probes and 163 samples (subjects), were used for the further statistical analysis discussed below.
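The probe selection step can be illustrated with SciPy's one-way ANOVA. The sketch below assumes an expression matrix with probes in rows and samples in columns; the function name `select_probes` and the argument names are hypothetical and are not taken from the study.

```python
import numpy as np
from scipy.stats import f_oneway

def select_probes(expr, groups, alpha=0.01):
    """Return indices of probes with ANOVA p-value below alpha.

    expr:   (n_probes, n_samples) expression matrix.
    groups: length n_samples array of group labels.
    No multiplicity correction is applied, as in the text.
    """
    expr = np.asarray(expr, dtype=float)
    groups = np.asarray(groups)
    # One sub-matrix of samples per group, probes kept in rows.
    samples = [expr[:, groups == g] for g in np.unique(groups)]
    # axis=1 runs the F-test across samples for every probe at once.
    _, pvals = f_oneway(*samples, axis=1)
    return np.flatnonzero(pvals < alpha), pvals
```

Applied to the five disease groups, the returned index array would play the role of the 286 selected probes.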

Network construction and identification of associated gene sets:
To better understand the relationships among the selected 286 probes in terms of interactions/associations, we employ two computational network inference techniques. The first method is based on Partial Least Squares (PLS) regression [4], while the second is based on Partial Correlations (PC) [5]. The two approaches share a number of characteristics: both compute association scores whose magnitude reflects the strength of the interaction between genes, and both use a local false discovery rate (local fdr) Empirical Bayes procedure to adjust for multiple hypothesis testing. The results of applying the PLS and PC network reconstruction techniques to the reduced microarray data are summarized in the first three columns of Table 1 (for PLS) and Table 2 (for PC). The networks themselves are shown in Figures 1 and 2, respectively. Tables 1 and 2 have the same structure. The first column shows the number of genes in each distinct gene association module (connected component) within the network. Gene association modules were defined as clusters of 4 or more connected genes such that genes in two distinct components are not connected by an edge; this differs from the definition used in [2]. The tables are sorted by the second column, which displays each module's average association score as a percentage of the largest average association score (that of the first module in each table). The exact definition of the association score depends on the method used. For example, for the PC method, the association score of an edge is the partial correlation between the connected gene pair. Finally, the third column lists all the genes belonging to each module. Genes shown in red appear in both tables.
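To make the notions of a partial-correlation score and a gene association module concrete, the following sketch computes partial correlations from the (pseudo-)inverse of the sample covariance matrix and extracts connected components with at least 4 genes. It is only an illustration: the estimator used by the PC method of [5] and the local fdr edge selection are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def partial_correlations(expr):
    """Partial correlation matrix from an (n_samples, n_genes) array.

    ASSUMPTION: a pseudo-inverse of the sample covariance is used, which
    is a simplification when there are more genes than samples.
    """
    prec = np.linalg.pinv(np.cov(expr, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)  # r_ij = -p_ij / sqrt(p_ii * p_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def modules(adj, min_size=4):
    """Connected components of a boolean adjacency matrix having at
    least `min_size` nodes, found by depth-first search."""
    seen, found = set(), []
    for start in range(adj.shape[0]):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in np.flatnonzero(adj[v]):
                w = int(w)
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(comp) >= min_size:
            found.append(sorted(comp))
    return found
```

A simple way to try this is to declare an edge wherever the absolute partial correlation exceeds some cut-off (for example `adj = np.abs(pcor) > 0.5` with the diagonal set to `False`); that cut-off is a stand-in for the local fdr procedure, not a value from the study.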

SNP association:
Carrying out an analysis similar to that for symptom severity, we study how effectively each gene cluster (module) can predict the genotypes of the two SNPs, hCV245410 and hCV7911132, which were identified in [2] as associated with symptom severity. Again, we fit multiple log-linear models and compute the p-values for the likelihood ratio tests. The p-values for the two SNPs are shown in columns 5 and 6 of Tables 1 and 2.

Integration of proteomic data:
We ran a number of well-regarded classifiers (Random Forest, LDA, and others) on the class information, hoping to identify the features with the greatest classification ability; however, this approach was abandoned because none of the classifiers produced acceptable classification error rates under cross-validation. As an alternative, we performed a t-test for each m/z value to compare case and control samples and ranked the discriminating features by the magnitude of their p-values. We then fitted regression models to predict the intensity values of the ten most discriminating features from the expression values of the genes in the two modules (from PLS and PC, respectively) identified by our analysis.
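The feature ranking and the subsequent regression can be sketched as follows. Several details are assumptions, since the text does not specify them: a pooled-variance two-sample t-test is used, the regression is ordinary least squares, and each model is summarized by the p-value of its overall F-test. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind, f as f_dist

def top_features(intensity, is_case, n_top=10):
    """Rank m/z features by two-sample t-test p-value.

    intensity: (n_subjects, n_features) matrix of log intensities.
    is_case:   boolean array, True for cases and False for controls.
    """
    intensity = np.asarray(intensity, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    _, pvals = ttest_ind(intensity[is_case], intensity[~is_case], axis=0)
    order = np.argsort(pvals)[:n_top]  # smallest p-values first
    return order, pvals[order]

def regression_pvalue(y, genes):
    """Overall F-test p-value for the least squares regression of one
    feature intensity `y` on the module's gene expression columns."""
    y = np.asarray(y, dtype=float)
    genes = np.asarray(genes, dtype=float)
    n, p = genes.shape
    X = np.column_stack([np.ones(n), genes])  # add an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    return float(f_dist.sf(f_stat, p, n - p - 1))
```

Calling `regression_pvalue` once per selected feature and per module would give one p-value for each of the ten features.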

Conclusion:
It is possible and perhaps desirable to integrate information from various experimental platforms in order to understand complex disorders. The findings in this study are based on data mining approaches using clinical, gene expression, SNP and proteomic data. The predictive models obtained here may explain certain aspects of CFS and may pave the way for further experimental validation.

Figure 1: Gene-Gene Association Network constructed using the PLS method

Figure 2: Gene-Gene Association Network constructed using the PC method

Prediction of symptom severity:
After identifying clusters of associated/interacting genes, we investigate the ability of each module to predict the CFS severity level. For that purpose, we fit a log-linear model for each gene module, regressing the clinical variable Cluster on the expression profiles of the genes included in the module. The overall ability of a given module to predict CFS severity can be judged by the likelihood ratio test comparing the full model (all genes in the module included as covariates) with the null model, which includes no covariates. The p-values obtained from the tests are shown in the fourth column of Tables 1 and 2 (see Supplementary material). Small p-values indicate that gene association modules are effective in predicting the symptom severity categories.
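The likelihood ratio test for a single module can be sketched as below. The sketch assumes that the log-linear model for a categorical outcome is fitted as a baseline-category multinomial logit model and that the test statistic is referred to a chi-square distribution; both are assumptions about details the text leaves open. The function name `module_lr_test` is hypothetical, and the categories are assumed to be coded 0, 1, ..., K-1 with every category observed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import chi2

def _neg_loglik(beta_flat, X, y, n_classes):
    """Negative log-likelihood of a baseline-category multinomial logit."""
    n, p = X.shape
    beta = beta_flat.reshape(p, n_classes - 1)
    # Class 0 is the reference category; its linear predictor is fixed at 0.
    eta = np.column_stack([np.zeros(n), X @ beta])
    return -float(np.sum(eta[np.arange(n), y] - logsumexp(eta, axis=1)))

def module_lr_test(expr, y):
    """Likelihood ratio test of the full model (all module genes as
    covariates) against the null model with no covariates.

    expr: (n_subjects, n_genes) expression values of the module genes.
    y:    integer category labels 0..K-1, each occurring at least once.
    Returns (statistic, degrees of freedom, p-value).
    """
    expr = np.asarray(expr, dtype=float)
    y = np.asarray(y, dtype=int)
    n, n_genes = expr.shape
    counts = np.bincount(y)
    n_classes = counts.size
    # The intercept-only model has a closed-form maximum likelihood fit.
    ll_null = float(np.sum(counts * np.log(counts / n)))
    # Full model: start the optimizer at the null solution.
    X = np.column_stack([np.ones(n), expr])
    beta0 = np.zeros((n_genes + 1, n_classes - 1))
    beta0[0] = np.log(counts[1:] / counts[0])
    fit = minimize(_neg_loglik, beta0.ravel(),
                   args=(X, y, n_classes), method="BFGS")
    stat = max(2.0 * (-fit.fun - ll_null), 0.0)
    df = n_genes * (n_classes - 1)
    return stat, df, float(chi2.sf(stat, df))
```

The same function could be reused with a SNP genotype as the outcome, since the SNP analysis in the text follows the same scheme.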

The genes have a good predictive ability, as can be seen from Table 3 (see Supplementary material).

Discussion:
Two gene association modules (indicated by asterisks) are of interest based on their ability to predict symptom severity, at least one of the SNP genotypes, and the intensity of the identified proteomic features. The first cluster comes from the PLS-reconstructed network and the other from the PC-reconstructed network.

Table 1 :
Gene association modules discovered by the PLS-based network inference method. For each module (in rows) we list the number of genes, relative association strength, gene names, and p-values for the three log-linear models discussed in the text. Genes in red have also been included in modules by the PC method.

Table 2 :
Gene association modules discovered by the PC-based network inference method.

Table 3 :
P-values for regression models predicting the intensity of the 10 most discriminative features in the protein spectra from the PLS and PC modules marked with an asterisk in Tables 1 and 2. The spectra correspond to plate IMAC30, fraction 4, high laser for the PLS module and to plate H50, fraction 6, low laser for the PC module.

Table 4 :
Common genes from the two clusters (PLS and PC) identified as predictive of disease severity status and SNP hCV245410 genotype. GO annotations and pathways were obtained from the existing literature.