Linking co-expression modules with phenotypes

Quantifying the association between a co-expression module and a clinical trait of interest requires dimensionality reduction to summarize each module as a one-dimensional (1D) vector. However, such methods are often associated with information loss, the extent of which depends on the percentage of variance captured by the reduced 1D vector. Therefore, it is of interest to describe a method using analysis of rank (AOR) to assess the association between a module and a clinical trait of interest. The method works with clinical traits represented as binary class labels and can be adapted to traits measured on a continuous scale by dividing the samples into two groups around the median value. Application of the AOR method to test data of muscle gene expression profiles identifies modules significantly associated with diabetes status.


Background:
In recent years, transcriptomic data analysis has witnessed a shift from gene-level analysis to gene module-level analysis [1,2]. Module-level analysis aims to identify sets of co-expressed/co-regulated genes from transcriptomic data [3,4]. Further downstream analysis includes the identification of intra-modular hubs, investigation of the relationships between co-expression modules, and comparison of the topology of different networks [5,6]. Generally, a module as a whole is summarized either with a meta-gene representing the average expression of its genes in each sample or with a module eigengene, which can be considered the best summary of the standardized module expression data [7]. The module eigengene of a given module is defined as the first principal component of the standardized expression profiles [8]. To find modules that relate to a clinical trait of interest, the module eigengenes are correlated with the trait [9, 10]. Therefore, it is of interest to describe a method using analysis of rank (AOR) to assess the association between a module and a clinical trait of interest.
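The eigengene summary described above can be illustrated with a minimal sketch. It assumes a genes-by-samples NumPy array for a single module and takes the first right singular vector of the gene-wise standardized matrix as the eigengene. The function name `module_eigengene` and the returned variance fraction are illustrative choices, not part of any package.

```python
import numpy as np

def module_eigengene(expr):
    """Summarize one module (genes x samples) as a 1D vector over samples.

    Assumes no gene is constant across samples (otherwise its standard
    deviation is zero and the standardization below divides by zero).
    """
    expr = np.asarray(expr, dtype=float)
    # Standardize each gene across samples (mean 0, standard deviation 1).
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, ddof=1, keepdims=True)
    # Rows of vt are orthonormal directions in sample space; the first one
    # corresponds to the largest singular value.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    eigengene = vt[0]
    # Fraction of the total sum of squares captured by the first component.
    var_explained = s[0] ** 2 / np.sum(s ** 2)
    return eigengene, var_explained
```

The sign of an eigengene obtained this way is arbitrary, so only the magnitude of its correlation with a trait is meaningful unless the sign is fixed by a convention. The smaller the captured variance fraction, the more module information is lost in the 1D summary.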

Materials and methods:
The AOR algorithm to assess module-trait association
The method applies where the clinical trait is a distinct binary class (positive or negative). Given the total number of samples and the number of those samples belonging to the negative class, the method examines the expression pattern of each module across the samples and identifies modules that have a significant association with the clinical trait of interest. It is based on the observation that the samples of the two classes show a clear distinction in their distribution of ranks when they are arranged according to the expression values of a truly differentially expressed gene. In module-level analysis, however, we deal with the set of genes assigned to the module rather than with a single gene. Therefore, to obtain a sample ranking based on the expression of the module as a whole, a matrix was created with as many rows as there are genes in the module and as many columns as there are samples. Each row of the matrix contains the samples arranged in ascending order of the expression values of the corresponding gene. A position vector was then created by calculating the frequency of negative-class samples in each column of the matrix. For a module that is not related to the clinical trait, the largest frequencies will be uniformly distributed across the position vector, whereas a module significantly related to the clinical trait will cause the larger frequencies to concentrate towards one end of the position vector. The indices of the largest frequencies were summed to obtain a score Gs. A significantly lower score represents lower expression of the module in negative-class samples, and a significantly higher score represents higher expression of the module in negative-class samples.
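The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The function name `aor_score`, the parameter `k` (how many of the largest frequencies enter the score), its default of the number of negative-class samples, and the tie-breaking rule are all assumptions made here.

```python
import numpy as np

def aor_score(expr, is_negative, k=None):
    """Rank-based module score from a genes x samples expression array.

    is_negative: boolean vector, True for negative-class samples.
    k: number of largest position-vector entries whose positions are summed
       (ASSUMPTION: defaults to the number of negative-class samples).
    """
    expr = np.asarray(expr, dtype=float)
    is_negative = np.asarray(is_negative, dtype=bool)
    if k is None:
        k = int(is_negative.sum())
    # Each row lists sample indices in ascending order of that gene's expression.
    order = np.argsort(expr, axis=1, kind="stable")
    # 1 where the sample occupying a rank position is negative class, else 0.
    index_matrix = is_negative[order].astype(int)
    # Negative-class frequency at each rank position (column sums).
    position = index_matrix.sum(axis=0)
    # 1-based positions of the k largest frequencies; ties are broken in
    # favour of the lower position (an arbitrary choice for this sketch).
    top = np.argsort(-position, kind="stable")[:k] + 1
    return int(top.sum()), position
```

With four samples of which the first two are negative class and have the lowest expression for every gene, the position vector concentrates at the left end and the score takes its smallest possible value, 1 + 2 = 3. If the negative-class samples instead have the highest expression, the score is 3 + 4 = 7.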

Calculation of score Gs
A module $M$ was defined as a collection of genes $g_i$ ($i = 1, \dots, n$) having similar expression patterns across the samples $s_j$ ($j = 1, \dots, m$). The expression information of $g_i \in M$ across the samples can be stored in a matrix $E$, where the value $e_{ij}$ represents the expression of the $i$-th gene in the $j$-th sample. A sample was assigned to set $S_0$ if it belongs to the negative class and to set $S_1$ if it belongs to the positive class.
The expression matrix was converted to an index matrix $I$, where the value $I_{ij}$ was set as per Eq. (1), in which $O_i$ contains the samples arranged in ascending order according to the expression values of $g_i$. Each column of the index matrix was summed to create a position vector $P$. The score Gs was calculated by adding the indices of the largest values in the position vector $P$.
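Under assumed notation, the index matrix, position vector and score described in this subsection could be written as follows. The symbols $I$, $O_i$, $S_0$, $P$, $T_k$ and $k$ are labels chosen here for illustration. The number $k$ of largest entries that enter the score is an assumption, and taking $k$ equal to the number of negative-class samples is one natural choice.

```latex
% Assumed notation: O_i(j) is the sample at rank j when the samples are
% sorted in ascending order of the expression of gene g_i; S_0 is the set
% of negative-class samples; n genes, m samples.
I_{ij} =
\begin{cases}
1, & O_i(j) \in S_0 \\
0, & \text{otherwise}
\end{cases}
\qquad (1)

P_j = \sum_{i=1}^{n} I_{ij}, \qquad j = 1, \dots, m

G_s = \sum_{j \in T_k} j, \qquad
T_k = \{\, \text{positions of the } k \text{ largest entries of } P \,\}
```

Under this form, each row of $I$ contains exactly $|S_0|$ ones, so the entries of $P$ always sum to $n \cdot |S_0|$ and only their arrangement along the rank positions carries information about the trait.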
Assessing the significance of score Gs
A null distribution of the score Gs was calculated by assuming that the module is not associated with the clinical trait. In the case of no association, the largest values will be uniformly distributed across the position vector. Therefore, to assess the significance of the score Gs of a module containing a given number of genes, the following steps were performed:
[1] A false module was created by randomly selecting the same number of genes from the total set of genes for which expression data are available.
[2] The score Gs for the false module was calculated.
[3] Steps 1 and 2 were repeated 1000 times to generate a null distribution of the score Gs.
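The three steps can be sketched as a short resampling loop. The sketch assumes a genes-by-samples array `all_expr` holding every gene with expression data, and a compact scoring helper that sums the 1-based positions of the `k` largest position-vector entries. The names `aor_score` and `null_scores`, the `seed` argument, and the default for `k` are illustrative assumptions.

```python
import numpy as np

def aor_score(expr, is_negative, k):
    """Sum of the 1-based positions of the k largest negative-class frequencies."""
    order = np.argsort(expr, axis=1, kind="stable")   # samples by ascending expression
    position = is_negative[order].sum(axis=0)          # position vector
    return int((np.argsort(-position, kind="stable")[:k] + 1).sum())

def null_scores(all_expr, is_negative, module_size, n_iter=1000, k=None, seed=0):
    """Null distribution of the score from randomly assembled false modules."""
    all_expr = np.asarray(all_expr, dtype=float)
    is_negative = np.asarray(is_negative, dtype=bool)
    if k is None:
        k = int(is_negative.sum())  # ASSUMPTION: k = number of negative-class samples
    rng = np.random.default_rng(seed)
    scores = np.empty(n_iter)
    for it in range(n_iter):
        # Step 1: draw a false module of the same size as the real module.
        rows = rng.choice(all_expr.shape[0], size=module_size, replace=False)
        # Step 2: score the false module.
        scores[it] = aor_score(all_expr[rows], is_negative, k)
    # Step 3: the collected scores form the null distribution.
    return scores
```

Sampling genes rather than permuting class labels keeps the sample labels fixed, so the null distribution reflects how the score behaves for arbitrary gene sets of the same size in the same data set.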
A module with a score lying outside the interval μ ± 1.96σ was considered significantly associated with negative-class samples, where μ and σ are the mean and standard deviation of the null distribution, respectively. The score Gs was further converted to a scaled score and a standardised score. The score was scaled to have a value between -1 and +1 using Eq. (2), which is based on the minimum and maximum permissible values of the score Gs. The standardised score was calculated by subtracting the observed score from the mean of the null distribution and dividing by the standard deviation of the null distribution.
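The two derived scores and the significance rule can be sketched as follows. Several choices here are assumptions, not statements of the published formulas. The scaling is taken to be a linear min-max map onto [-1, +1]. The permissible bounds are derived by assuming the score sums the positions of `k` entries among `n_samples` rank positions. The standardised score is given the conventional sign, positive when the observed score exceeds the null mean, and should be negated if the opposite convention is intended.

```python
import numpy as np

def score_bounds(n_samples, k):
    """Smallest and largest possible sums of k distinct positions in 1..n_samples."""
    g_min = k * (k + 1) // 2                  # positions 1, ..., k
    g_max = k * (2 * n_samples - k + 1) // 2  # positions n_samples-k+1, ..., n_samples
    return g_min, g_max

def scaled_score(gs, n_samples, k):
    """ASSUMED linear min-max scaling of the score onto [-1, +1].

    Requires 0 < k < n_samples, otherwise the bounds coincide.
    """
    g_min, g_max = score_bounds(n_samples, k)
    return 2.0 * (gs - g_min) / (g_max - g_min) - 1.0

def standardised_score(gs, null):
    """Distance of the observed score from the null mean, in null standard deviations.

    ASSUMED sign convention: positive when gs is above the null mean.
    """
    null = np.asarray(null, dtype=float)
    return (gs - null.mean()) / null.std(ddof=1)

def is_significant(gs, null, z_crit=1.96):
    """True when gs lies outside mean +/- z_crit * sd of the null distribution."""
    return abs(standardised_score(gs, null)) > z_crit
```

Because the significance rule depends only on the absolute value of the standardised score, it is unaffected by the sign convention. The convention matters only when the signs of the scaled and standardised scores are compared.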

Results:
The R-package WGCNA [11] was used to identify sets of co-expressed genes from the muscle transcriptome of healthy (NGT) and diabetic (T2D) subjects [12]. In total, WGCNA identified 30 modules of tightly co-expressed genes. To assess the association between the modules and diabetes status, we applied the AOR method and the module eigengene method (as implemented in the WGCNA package) to the expression data of each module. Using a cutoff of p < 0.001, thirteen modules were found to be significantly associated with diabetes status by the AOR method, whereas the module eigengene method did not produce a statistically significant result for any module (Figure 1). This shows that diabetes status has a significant effect on muscle gene expression.

Discussion:
In this study, the analysis of rank (AOR) method has been used to assess the association between clinical traits and co-expression modules. Application of the method to muscle gene expression profiles identified significant associations between modules and diabetes status, highlighting the value of the AOR method in identifying hidden patterns across gene expression profiles. When interpreting the results of the AOR method, both types of score, scaled and standardised, should be taken into consideration along with the obtained p-value. A significant p-value for a module-trait association obtained using the AOR method merely indicates that more negative-class samples than expected by chance are concentrated at the beginning or end of the position vector. The same sign for the scaled and standardised scores of a module supports deregulation of the genes belonging to that module between negative- and positive-class samples. Opposite signs for the two scores, however, suggest expression heterogeneity within the negative-class samples with respect to the genes belonging to the module.

Figure 1: Heatmap representation of the module-trait correlation matrix. The modules were identified from muscle expression data of normal and diabetic subjects using the WGCNA R-package. The colour represents the extent of correlation between each module and diabetes status (the clinical trait). a) Module-trait association as quantified using the eigengene method available in the WGCNA package. b) Module-trait association as quantified using the AOR method, with cells coloured by standardised Gs score. c) Module-trait association as quantified using the AOR method, with cells coloured by scaled Gs score.

Conclusion:
The AOR method helps to identify hidden patterns in gene expression data and can provide deeper insights into disease biology by discovering co-expression modules linked to clinical traits.

Conflict of interest:
The author(s) declared no conflicts of interest with respect to the research, authorship, and publication.