Differences in protein-protein association networks for lung adenocarcinoma: A retrospective study

Various methods to determine the connectivity scores between groups of proteins associated with lung adenocarcinoma are examined. Proteins act together to perform a wide range of functions within biological processes. Hence, identification of key proteins and their interactions within protein networks can provide invaluable information on disease mechanisms. Differential network analysis provides a means of identifying differences in the interactions among proteins between two networks. We use connectivity scores based on the method of partial least squares to quantify the strength of the interactions between each pair of proteins. These scores are then used to perform permutation-based statistical tests, which examine whether there are significant differences between the network connectivity scores for individual proteins or classes of proteins. Expression data from a study on lung adenocarcinoma are used in this study. Connectivity scores are computed for a group of 109 subjects who were in complete remission as well as for a group of 51 subjects whose cancer had progressed. The distributions of the connectivity scores are similar for the two networks, yet subtle but statistically significant differences are identified and their impact is discussed.


Background:
For some prevalent types of cancer, such as lung adenocarcinoma, where the effectiveness of standard chemotherapy is limited, alternative treatments based on targeting critical genes are desired [1]. Consequently, it is important to examine how the interactions of the proteins encoded by these genes differ between patients whose cancer progresses differently. Identifying such proteins or groups of proteins helps to identify potential targets for new treatments. In order to identify a protein or a class of proteins which is differentially expressed, we need a formal statistical framework. Previously, a framework for differential network analysis was developed [2] and applied to microarray data from a pair of networks. In this paper, the methods from that framework are adapted to analyze protein expression data. They include tests for differential connectivity for individual proteins relative to all other proteins as well as tests for differential connectivity within a class of proteins.

Methodology: Dataset used
Herein, the methods are applied to data from a study on lung adenocarcinoma. The data set is freely available in the International Cancer Genome Consortium (ICGC) data repository [3]. It was also featured as one of the challenge data sets at the recent Critical Assessment of Massive Data Analysis (CAMDA) conference [4]. We used version 14 of the data, which includes expression values of 174 protein antibodies. Some antibody IDs correspond to the same gene so the data set only includes protein expression values for 139 genes. There are protein expression values for each of the 160 subjects, 109 of whom are in the complete remission group and 51 of whom are in the progression group.

Model
To quantify the strength of pairwise interactions between protein expression values within a group, we use connectivity scores based on the method of partial least squares. Specifically, the scores for each protein are computed by fitting a regression model with all of the other proteins as covariates using estimates obtained from partial least squares. For each network, this creates a $p \times p$ square matrix of coefficients for each pair of proteins, where $p$ is the number of proteins common to both networks. Finally, the matrix for the $k$th network is symmetrized to obtain the connectivity scores $s^{k}_{ij}$, $k = 1, 2$, where $i$ and $j$ refer to the row and column numbers of the matrix of scores. See [2] and [5] for a complete description of the algorithm for obtaining association/interaction scores based on partial least squares and [6] for discussion of a freely available R package dna which provides a flexible implementation of the methods.
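As a rough illustration of the scoring step, the following Python sketch regresses each protein on all of the others with a basic PLS1 (NIPALS) fit and then symmetrizes the coefficient matrix. It is a minimal sketch under stated assumptions, not the implementation in the dna package: the function names, the standardization of each protein, the default of three latent components, and the symmetrization by averaging the two coefficients for each pair are all assumptions made here for illustration; see [2] and [5] for the actual algorithm.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def standardize(columns):
    """Scale each column to mean 0 and unit variance (assumes no constant column)."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out.append([(v - mean) / sd for v in col])
    return out

def pls1_coefficients(xcols, y, n_components):
    """PLS1 (NIPALS) coefficients of y on the predictor columns in xcols.

    Inputs are assumed to be centered. Returns one coefficient per column.
    """
    xcols = [col[:] for col in xcols]   # working copy, deflated at each step
    m, n = len(xcols), len(y)
    beta = [0.0] * m
    rs, ps = [], []                     # stored weight and loading vectors
    for _ in range(n_components):
        w = [dot(col, y) for col in xcols]          # direction X^T y
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:                            # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(w[j] * xcols[j][i] for j in range(m)) for i in range(n)]
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in xcols]     # X loadings
        q = dot(y, t) / tt                          # y loading
        # r expresses the latent component in terms of the ORIGINAL predictors
        r = w[:]
        for r_prev, p_prev in zip(rs, ps):
            c = dot(p_prev, w)
            r = [rv - c * rp for rv, rp in zip(r, r_prev)]
        rs.append(r)
        ps.append(p)
        beta = [b + q * rv for b, rv in zip(beta, r)]
        # deflate the predictors by the part explained by this component
        xcols = [[x - ti * pj for x, ti in zip(col, t)]
                 for col, pj in zip(xcols, p)]
    return beta

def connectivity_scores(data, n_components=3):
    """data: list of subjects, each a list of p expression values.

    Returns a symmetric p x p matrix of scores with a zero diagonal.
    Symmetrization by simple averaging is an assumption of this sketch.
    """
    p = len(data[0])
    cols = standardize([[row[j] for row in data] for j in range(p)])
    b = [[0.0] * p for _ in range(p)]
    for i in range(p):
        others = [j for j in range(p) if j != i]
        coef = pls1_coefficients([cols[j] for j in others], cols[i], n_components)
        for j, c in zip(others, coef):
            b[i][j] = c
    return [[0.0 if i == j else (b[i][j] + b[j][i]) / 2 for j in range(p)]
            for i in range(p)]
```

Calling `connectivity_scores` once on the expression values of each group would give the two score matrices that are compared by the tests. Choosing the number of latent components is itself a modeling decision that this sketch leaves to the caller.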

Figure 1:
Flow diagram illustrating the test for differential connectivity for an individual protein, starting with the expression values from both groups. The values from both groups are pooled. Then the group labels are randomly permuted 1000 times to form new pairs of groups, one pair for each permuted data set. The connectivity scores are computed for each actual and permuted group. These connectivity scores are then used to calculate the test statistic for both the observed and permuted data sets. Finally, a p-value is determined by comparing the observed test statistic with the values of the test statistic based on the permuted data sets and is used to decide whether there is a significant difference in the scores for the protein between the two groups.
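The permutation procedure summarized in the caption can be sketched in a few lines of Python. This is an illustrative sketch only: `permutation_p_value` and its parameters are hypothetical names, the `statistic` argument stands for any function that computes connectivity scores for two groups and returns the test statistic, and the p-value convention used here (the proportion of permuted statistics at least as large as the observed one) is an assumption rather than a detail confirmed by the text.

```python
import random

def permutation_p_value(group1, group2, statistic, n_perm=1000, seed=0):
    """Permutation test for a difference between two groups of subjects.

    group1, group2: lists of subjects (each subject can be any object,
                    for example a list of expression values).
    statistic:      function (group_a, group_b) -> float; larger values
                    indicate a larger difference between the groups.
    Returns (observed statistic, permutation p-value).
    """
    rng = random.Random(seed)
    observed = statistic(group1, group2)
    pooled = list(group1) + list(group2)   # pool the subjects from both groups
    n1 = len(group1)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                # randomly reassign the group labels
        permuted = statistic(pooled[:n1], pooled[n1:])
        if permuted >= observed:
            exceed += 1
    # Assumed convention: fraction of permuted statistics >= the observed one
    return observed, exceed / n_perm
```

In the setting of Figure 1, `statistic` would recompute the connectivity scores for both permuted groups before evaluating the test statistic, so that step dominates the running time when 1000 permutations are used.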

Statistical tests
Next, formal statistical tests can be formulated based on these connectivity scores, similar to the framework proposed in [2]. Let $s^{k}_{ij}$ denote the connectivity score between proteins $i$ and $j$ in network $k$, $k = 1, 2$, and let $p$ denote the number of proteins. To test the differential connectivity of the scores corresponding to protein $i$, compared with all other proteins, we use the mean absolute difference statistic
$$d(i) = \frac{1}{p-1} \sum_{j \neq i} \left| s^{1}_{ij} - s^{2}_{ij} \right|.$$
Alternately, to test the differential connectivity of the scores within a particular subset of proteins, we use a similar statistic. Without loss of generality, suppose that $F$ is the first $f$ proteins. Then, to test the differential connectivity of the proteins within the subset of proteins in $F$, we use the test statistic
$$\Delta(F) = \frac{1}{f(f-1)} \sum_{\substack{i, j \in F \\ i \neq j}} \left| s^{1}_{ij} - s^{2}_{ij} \right|.$$
The procedure for obtaining the p-value and performing a permutation test based on this statistic is analogous to the procedure for testing the differential connectivity of an individual protein illustrated in Figure 1.
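For concreteness, the two mean absolute difference statistics can be computed from a pair of score matrices as in the Python sketch below. The function names are hypothetical, the matrices are assumed to be symmetric lists of lists with matching protein order, and the normalizing constants (the number of other proteins for the individual test, the number of ordered pairs for the class test) follow the usual definition of a mean and are assumptions of this sketch.

```python
def individual_statistic(s1, s2, i):
    """Mean absolute difference in scores between protein i and every other protein.

    s1, s2: p x p connectivity score matrices for the two networks.
    """
    p = len(s1)
    total = sum(abs(s1[i][j] - s2[i][j]) for j in range(p) if j != i)
    return total / (p - 1)

def class_statistic(s1, s2, members):
    """Mean absolute difference in scores over all pairs within a class of proteins.

    members: indices of the proteins in the class (at least two).
    """
    f = len(members)
    total = sum(abs(s1[i][j] - s2[i][j])
                for i in members for j in members if i != j)
    return total / (f * (f - 1))


# Toy score matrices for three proteins (made-up numbers, for illustration only)
s_remission = [[0.0, 0.5, 0.2],
               [0.5, 0.0, 0.1],
               [0.2, 0.1, 0.0]]
s_progression = [[0.0, 0.1, 0.2],
                 [0.1, 0.0, 0.4],
                 [0.2, 0.4, 0.0]]

d0 = individual_statistic(s_remission, s_progression, 0)
delta_all = class_statistic(s_remission, s_progression, [0, 1, 2])
```

Either function can serve as the statistic in a permutation test once the score matrices have been recomputed for each permuted pair of groups.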

Results & Discussion:
The connectivity scores were computed for each group of proteins. A histogram illustrating the distributions of the connectivity scores for the proteins in each network is shown in Figure 2. Overall, the distributions for the two networks appear very similar, which is to be expected since all of the patients were lung adenocarcinoma patients.
However, differences were identified when the networks were tested formally with the statistical framework, which shows how essential these formal tests are when trying to detect important differences in association. More notably, the pairs of proteins corresponding to the largest connectivity scores are very similar for both networks, as shown in Also, the proteins HER2_pY1248 are connected in both networks, but HER2_pY1248 is also connected to Src_pY416 in the complete remission network.
Although the overall distributions of connectivity scores are similar and many of the pairwise interactions between proteins with the top connectivity scores are similar, there are some differences between the networks which are statistically significant. The individual proteins with the smallest p-values based on the test for differential connectivity are listed in It should be noted that different proteins are differentially expressed when the groups are analyzed marginally by traditional statistical tests. For instance, the two-sample t-test identifies ER-alpha (p-value = .02), Cyclin_B1 (p-value = .02), YB-1 (p-value = .02), GATA3 (p-value = .04), and XBP1 (p-value = .04) as significantly different at level .05. Thus, differential network analysis identifies important differences in protein expression values not found by analyzing each protein alone. Also, there are 7 genes (EIF4EBP1, EGFR, SRC, PRKCA, GSK3A|GSK3B, CDKN1B, and AKT1|AKT2|AKT3) in the data set which encode 3 or more proteins. The test for differential connectivity within a class of proteins was performed for each of these genes, but only one gene exhibited significant differences between connectivity scores: the test statistic for the group of protein antibodies Akt_pT308, Akt_pS473, and Akt encoded by AKT1|AKT2|AKT3 is ∆ = 0.0298 with corresponding p-value 0.022. The role that AKT plays in lung adenocarcinoma is discussed in [14]. On the other hand, the null hypothesis of no differential connectivity was not rejected (p-value = 0.114) for, as an example, the three protein antibodies (p27, p27_pT157, p27_pT198) encoded by CDKN1B.

Conclusions:
A method has been presented for testing whether proteins and groups of proteins interact differently with other proteins in two groups. The method was applied to protein expression data on lung adenocarcinoma. The analysis shows that the two networks are similar in appearance. However, the method was successful at identifying proteins with statistically significant differences in the connectivity scores between the complete remission and progression groups. The connectivity scores within the group of proteins encoded by AKT1|AKT2|AKT3 were found to be significantly different between the two classes. These proteins are known to be associated with cancer. Thus, we describe a method for analyzing expression levels to help identify processes which differ between cancer patients with different progressions. This type of analysis finds application in gene discovery for cancer treatment.