Towards the early detection of ductal carcinoma (a common type of breast cancer) using biomarkers linked to the PPAR(γ) signaling pathway

Breast cancer is a leading cause of morbidity and mortality among women, affecting about 12% of females worldwide. The underlying alterations in gene expression, molecular mechanisms and metabolic pathways responsible for the incidence and progression of breast tumorigenesis are not yet completely understood. In the present study, potential biomarker genes involved in early progression, and therefore useful for the early diagnosis of breast cancer, are detailed. Gene regulation and expression profiles of Ductal Carcinoma In-Situ (DCIS), Invasive Ductal Carcinoma (IDC) and healthy samples were analyzed to follow their expression patterns, employing normalization, statistical calculation, DEG annotation and Protein-Protein Interaction (PPI) network analysis. We performed a comparative study of differentially expressed genes (DEGs) among Healthy vs DCIS, Healthy vs IDC and DCIS vs IDC. We found MCM10 and SLC12A8 to be consistently over-expressed and LEP, SORBS1, SFRP1, PLIN1, FABP4, RBP4, CD300LG, ID4, CRYAB, ECRG4, G0S2, FMO2, ADAMTS5, CAV1, CAV2, ABCA8, MAMDC2, IGFBP6, CLDN11 and TGFBR3 to be under-expressed in all 3 comparisons, categorized for pre-invasive and invasive ductal breast carcinoma. These genes were further studied for active pathways, among which the PPAR(γ) signaling pathway was found to be significantly involved. The gene expression profile database can be a potential tool in the early diagnosis of breast cancer.


Background:
Breast cancer is the second leading cause of mortality in females at the global level [1]. Around 2 million cases of breast cancer were reported in the year 2018 [2]. Earlier studies have reported that the frequency of registered cases has increased alarmingly in rural areas as compared with urban areas [3][4][5]. The incidence of various types of cancer by the year 2030 has been predicted to be 1.7 million, with an expected 17 million deaths per year [6,7]. Not all cancers are fatal; rather, they are treatable if diagnosed at an early stage. Breast cancer is considered a complex cancer type due to its heterogeneity [8], with a wide range of risk factors involved, including a high-fat diet, alcohol intake, obesity, genetic risk and family history [9].

©Biomedical Informatics (2019)

Breast cancer can be broadly classified into two categories based on the location of tumor origin in the breast, specifically ductal and lobular carcinomas [10]. A ductal tumor develops in the ducts and accounts for approximately 80% of reported breast cancer cases. The second most common category is the lobular tumor, which develops inside the breast lobules and is found in 10-15% of diagnosed tumors [11]. Of these two major categories, the ductal tumor has been more widely studied and provides deeper insight into the molecular mechanisms and genetic basis of the disease. Ductal Carcinoma In-Situ (DCIS) is a pre-invasive type of breast cancer whose frequency has increased alarmingly from 1-5% to 10-15% in recent years [12]. Microarray analysis helps to identify genetic alterations that might be responsible for cancerous changes by measuring gene expression values for the disease under study. The study schema is given in Figure 1.

Methodology:
Affymetrix microarray data:
The breast cancer-specific microarray datasets were downloaded from the GEO (Gene Expression Omnibus, https://www.ncbi.nlm.nih.gov/geo/) database under the accession IDs GSE21422 and GSE5764. The microarray data, in CEL file format, are based on the GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array platform for both datasets. The downloaded dataset with accession ID GSE21422 (set-1) consists of […] breast cancer for tumorigenesis. In the current analysis we selected three stages, i.e. healthy, DCIS and IDC patient gene expression, with the aim of identifying a set of genes that play a pivotal role in the initiation and progression of cancer, as well as genes involved in the further development of DCIS to IDC.
Data pre-processing:
The microarray data analysis was carried out to identify potential marker genes related to breast cancer development in the different stages stated above. It enabled us to identify a set of genes that play an essential role and can act as a tool for the early detection of ductal carcinoma. From set-1 and set-2, we included nine samples for each condition, and the CEL files were downloaded for further processing. The three cases under study are thus the gene expression analysis of DCIS vs healthy samples, IDC vs healthy samples, and DCIS vs IDC. The sample arrays were analyzed in R (v3.2.5) using Bioconductor packages for microarray data analysis, which are freely available from Bioconductor (https://www.bioconductor.org/), such as limma and affy, together with a few in-house pipelines for microarray data analysis. The data were normalized using the Robust Multi-array Average (RMA) procedure, which uses quantile normalization to minimize the impact of noise. RMA provides precise estimates of probe expression through three steps: background correction, normalization of probes and summarization. Normalization brings the expression data closer together and makes them less scattered, which balances the data so that meaningful biological comparisons can be made [22]. Expression values for 19,902 genes were obtained from the pre-processing of both datasets. Probe IDs with no corresponding locus or gene ID were excluded from further processing.
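The quantile normalization step inside RMA can be illustrated with a minimal sketch. This is not the affy implementation used in the study: the function name `quantile_normalize` and the list-of-lists input layout are assumptions made for illustration, background correction and probe-set summarization are omitted, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of arrays (one list of intensities per sample).

    Hypothetical helper for illustration only. All samples must have the
    same number of probes. Ties are broken by original position, which is
    a simplification of what production implementations do.
    """
    n_samples = len(samples)
    n_probes = len(samples[0])

    # Sort each sample, then average across samples rank by rank.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / n_samples
        for rank in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the mean intensity of its rank.
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After this step every sample contains the same set of values, only in a different order, which is why the post-normalization intensity distributions coincide across arrays.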

Differentially expressed genes (DEGs) analysis:
Genes differentially expressed between DCIS patients and healthy individuals were analyzed with the limma package (v3.26.8), and a t-test was then applied to the normalized data to obtain the final table of significantly differentially expressed genes. A gene was selected as a DEG if it met the criteria of adjusted p-value < 0.05 and a |log2FC| cut-off of 1. P-values were adjusted using the Benjamini & Hochberg false discovery rate (FDR) method.
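The selection criteria above can be sketched in a few lines. This is an illustrative reimplementation, not the limma code path used in the study: the function names `benjamini_hochberg` and `select_degs` are hypothetical, the inputs are assumed to be per-gene log2 fold changes and raw p-values that have already been computed, and the strict inequality on |log2FC| is an assumption based on the thresholds quoted in the Discussion.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, fc_cutoff=1.0, alpha=0.05):
    """Split genes into up- and downregulated DEGs (hypothetical helper)."""
    adjusted = benjamini_hochberg(pvalues)
    up, down = [], []
    for gene, fc, q in zip(genes, log2fc, adjusted):
        if q >= alpha:
            continue  # not significant after FDR adjustment
        if fc > fc_cutoff:
            up.append(gene)
        elif fc < -fc_cutoff:
            down.append(gene)
    return up, down
```

For example, `select_degs(["geneA", "geneB"], [2.1, -0.3], [0.001, 0.2])` would keep only the placeholder `geneA`, as an upregulated gene; the gene names and values here are toy inputs, not results from the study.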

Gene ontology (GO), annotation and pathway enrichment analyses:
The Database for Annotation, Visualization and Integrated Discovery (DAVID), an open-source tool for Gene Ontology (GO) annotation of differentially expressed genes, was used for gene annotation. GO terms are categorized into biological process, cellular component and molecular function. A biological process refers to complex changes at the level of the cell that are mediated by one or more gene products; a cellular component is a part of the cell or its extracellular environment that may contain a gene product; and a molecular function describes the potential of a molecule to execute a function [23], which may be actively responsible for variation in gene expression. Pathways in which the DEGs are involved were identified using KEGG (Kyoto Encyclopedia of Genes and Genomes).
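Pathway and GO enrichment of this kind is commonly framed as an over-representation test. The sketch below computes a one-sided hypergeometric p-value for the overlap between a DEG list and a pathway. It is an assumption-laden illustration rather than the DAVID procedure: DAVID reports a modified Fisher exact statistic (the EASE score), and the function name `enrichment_pvalue` and its parameters are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_degs, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: number of genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_degs:       number of DEGs submitted
    n_overlap:    DEGs that are annotated to the pathway
    """
    total = comb(n_background, n_degs)
    upper = min(n_pathway, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

With a toy background of 10 genes, 4 of them in the pathway, and 3 DEGs of which 2 are in the pathway, the function returns 40/120, i.e. one third. A small p-value for a real pathway would indicate that more DEGs fall in that pathway than expected by chance.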

Discussion:
The pre- and post-normalization hybridization intensities across all the samples in the dataset are shown in Figure 2. The pre-normalization plots show a non-uniform data distribution. The intensity of the data was adjusted using a quantile normalization algorithm, and the expression values were determined using the affy package in R. After normalization, the data distributions have uniform intensity within the same intervals and the same density center. This illustrates the non-uniformity of gene expression in the raw data, which made it necessary to normalize the data before processing in order to minimize the probability of error in the results.
A threshold on the log2 fold change (logFC ~ 1) was used to obtain the DEGs in the three comparisons, i.e. Healthy vs DCIS, Healthy vs IDC and DCIS vs IDC. In Figure 3, each circular dot represents one gene with its corresponding -log10(p-value); blue dots mark the genes that pass the cut-off, highlighting the most significant differentially expressed genes. LogFC > 1 represents upregulated genes and logFC < -1 corresponds to downregulated genes. This method filters the genes on the scale of unstandardized signals (e.g. log2 fold change) against noise-adjusted/standardized signals (e.g. -log10(p-value)) and helps us determine the curated set of genes that are significantly expressed. It also helps to visualize those genes and present them more interactively.

Pathways analysis:
The PPAR family, which is related to fat metabolism, includes PPARα, PPARγ and PPARδ, which serve different functions in cancer: PPARα and PPARγ inhibit tumor progression, while PPARδ promotes tumor development [24]. Although the results have not always been consistent, PPARγ, a hormone receptor that plays a role in regulating adipocyte differentiation and insulin signaling, has shown an association with breast cancer risk [25]. Thus, the PPAR signaling pathway is considered to be actively involved in tumorigenesis. Among the 20 downregulated genes, FABP4/aP2, SORBS1/CAP and PLIN1/Perilipin were significantly enriched in the peroxisome proliferator-activated receptor (PPAR) pathway (Figure 5); notably, these genes were downregulated in both pre-invasive (DCIS) and invasive ductal carcinoma (IDC) in our results. PLIN1 has been reported to inhibit cancer cell proliferation, migration and invasion in human breast cancer, in a study that also found its mRNA expression to be significantly reduced [26].
The upregulated genes are SLC12A8 and MCM10, which showed high expression in both DCIS and IDC. An earlier study reports that high expression of SLC12A8 is associated with a better prognosis in breast and pancreatic ductal carcinoma, and it is therefore considered a significant gene that contributes to the personalized treatment of breast cancer [27]. MCM10 (Minichromosome Maintenance 10 Replication Initiation Factor) is a protein-coding gene that was found to be upregulated in breast tumor tissues, including triple-negative breast cancer. MCM10 might induce breast cancer metastasis via the Wnt/β-catenin pathway, which makes it a potential diagnostic tool as well as a promising target for breast cancer [28].
We studied the genes that are differentially expressed in all 3 comparisons, i.e. Healthy vs DCIS, Healthy vs IDC and DCIS vs IDC. By comparing the first 2 sets, we obtained the genes that were differentially expressed in both Healthy vs DCIS and Healthy vs IDC: in total, 308 upregulated and 355 downregulated genes. For this set of common DEGs, we studied their expression and identified the genes showing consistent upregulation or downregulation. We obtained 2 upregulated and 20 downregulated genes and carried these 22 DEGs forward to Gene Enrichment Analysis, Pathway Analysis (Table 1) and Gene/PPI network analysis. We found MCM10 and SLC12A8 to be upregulated in all 3 comparisons and LEP, SORBS1, SFRP1, PLIN1, FABP4, RBP4, CD300LG, ID4, CRYAB, ECRG4, G0S2, FMO2, ADAMTS5, CAV1, CAV2, ABCA8, MAMDC2, IGFBP6, CLDN11 and TGFBR3 to be downregulated (Figure 4). Some of the identified DEGs were found to be directly related to breast cancer. In addition, identified DEGs such as CAV1, FABP4, FMO2, G0S2, MAOB, PCK1, PLIN1 and RBP4 showed a close association with alcoholism and with endometrial and bladder cancer, which suggests that alcohol consumption might be a crucial factor in ductal carcinoma. Additionally, abrupt growth of endometrial cells can lead to malignant tumor growth or endometrial hyperplasia, which can further lead to cancer.
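The filtering for consistently regulated genes can be expressed as a set intersection followed by a sign check. The sketch below assumes each comparison has already been reduced to a mapping from DEG name to log2 fold change; the function name `consistent_degs` and this input layout are hypothetical, and the sign of each fold change depends on how the contrast was defined, which the text does not specify.

```python
def consistent_degs(comparisons):
    """Find DEGs shared by every comparison with the same direction.

    comparisons: dict mapping a comparison name to a dict of
                 {gene: log2 fold change} for the DEGs of that comparison.
    Returns (upregulated, downregulated) as sorted lists of gene names.
    """
    tables = list(comparisons.values())

    # Genes called differentially expressed in every comparison.
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)

    # Keep only genes whose fold change has the same sign everywhere.
    up = sorted(g for g in shared if all(t[g] > 0 for t in tables))
    down = sorted(g for g in shared if all(t[g] < 0 for t in tables))
    return up, down
```

Genes that are shared but change direction between comparisons are dropped by both filters, which mirrors the reduction from the larger common DEG set to the short list of consistently regulated genes.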

Conclusion:
Gene expression-based analysis has been used for disease biomarker discovery for a decade, providing routes to better diagnosis, novel drug design strategies and biomarker identification, which lead to improved clinical treatment efficacy. The expression analysis results reported in the present study can be exploited for the identification of novel breast cancer-specific biomarkers and drug targets. In recent years, the increasing incidence of breast cancer has demanded extensive research into the progression mechanisms of this disease. The breast cancer-associated genes were found to be mainly involved in the PPAR signaling pathway, which consists of nuclear hormone receptors that play different roles in tumor development and in cancer progression. We have identified genes in the PPARγ pathway that are actively involved in adipocyte differentiation, and evidence for the association of these genes with breast cancer has also been reported in earlier studies. Thus, the PPAR pathway, which involves fatty acid-activated nuclear hormone receptors and regulates cancer cell proliferation and survival, can be considered to have a major role in tumorigenesis. We have also identified genes involved in an alcoholism-related pathway and in other cancer types, which suggests that this set of genes can be considered as potential markers for the early detection of breast cancer and its progression from Ductal Carcinoma In-Situ to Invasive Ductal Carcinoma. Furthermore, this study can be extended to RNA-Seq data from breast cancer patients to validate the existing results, and exome sequencing data can be used for mutation analysis of the identified significantly expressed genes, for a better understanding of the biological mechanisms underlying breast cancer occurrence and progression.