Detection of G-type density in promoter sequences of colon cancer oncogenes and tumor suppressor genes

Guanine-rich regions are present throughout the human genome. Previous studies have shown that G-rich sequences and motifs may be significant for gene activity and function. We set out to identify G-rich motifs in the promoters of oncogenes and tumor suppressor genes (TSGs). We used a set of the 100 most common oncogenes and TSGs for this analysis. For each gene we collected a 600 nt promoter sequence spanning -500 to +100 relative to the transcription start site (TSS). Using a computer program, we calculated G densities from the numbers and locations of G runs with a 100 nt moving window. We included runs of 2 to 7 guanines. The analysis shows that G density increases from -500 toward +100 and is highest within ±100 nt of the TSS. The G densities were compared with expression data for the selected oncogenes and TSGs in patients with colon cancer (n=174).


Background:
Guanine-rich regions are a relatively unexplored part of the human genome. Although there are some algorithms to detect special motifs, such as the G-quadruplex, no algorithms exist to detect other types of G-rich motifs. It was first reported in 1910 that guanylic acid forms a gel at high concentrations [1], which suggested that G-rich sequences may form higher-order structures. About 50 years later, Gellert used X-ray diffraction to show that guanylic acids can assemble into tetrameric structures [1]. G-rich sequences are found in functional regions of many genomes. For example, G-rich regions have the potential to form G4 structures, which are located at telomeres, promoters, and mitotic and meiotic double-strand break (DSB) sites [2]. Naturally occurring G-rich sequences, which can form G-quadruplexes through non-Watson-Crick base pairing stabilized by cyclic Hoogsteen hydrogen bonding, have been implicated in several genomic activities, such as transcription pausing, FMRP binding, mRNA stability, and translation initiation as well as repression [3].
Although the G-quadruplex (G4) motif has been analyzed as a non-B-form DNA secondary structure, there may be others that have not yet been named [4]. G-quadruplexes are known to be involved in different human cancers and may therefore be targeted for therapeutic purposes [5, 6, 7]. The folding properties of DNA allow it to form various inter- and intramolecular secondary structures. Although these structures may seem to be in vitro artefacts, bioinformatic analyses reveal that DNA sequences capable of forming them are conserved [2].
Several types of guanine-rich regions and motifs are known. Z-DNA motifs are mostly associated with transcription start sites in eukaryotic genomes [8]. Cruciform structures are located close to replication origins, breakpoint junctions and promoters in various organisms. Triplexes cause genomic instability by inducing double-strand breaks that result in translocations [9]. Repeat expansion may be related to human genetic disorders [10]. G4 structures adopt different topologies and are separated into various groups depending on the orientation of the DNA strands. It is unclear how many G-rich sequences form stable G4 structures in vivo, but G4 DNA motifs are common in G-rich micro- and minisatellites, upstream and downstream of TSSs, often near promoters, at transcription factor binding sites, and at mitotic and meiotic DSB sites [11,12,13,14]. Telomerase activity in most human cancers can be influenced by G4 structures, because various small ligands target and bind these regions, as has been tested in different experiments [15]. G4 motifs are found within 1,000 nt upstream of the TSS in 50% of human genes [16]. Bioinformatic algorithms find that the promoters of human oncogenes and regulatory genes contain more G4 motifs than the promoters of housekeeping and tumor suppressor genes [14]. G-rich sequences or G4 structures located in or near promoter regions may cause supercoiling, which can have both positive and negative effects on transcription. The location of the G motif is thus a very powerful factor for transcription.
Approximately 400,000 putative G4 motifs are found in the human genome. The motifs are frequently located within the promoter regions of oncogenes, suggesting that G4 motifs may play a key role in the regulation of cellular activities such as transcription, translation, telomere maintenance, and replication [2]. Evidence for the importance of G4 motifs in the regulation of gene transcription came from studies of the v-myc viral oncogene homolog (MYC), a transcription factor that regulates the expression of different genes altered in human cancer and that is deregulated in around half of tumors [17]. Guanine-rich nucleic acid sequences that form G-quadruplex structures are key regulators of some biological processes and are targeted by therapeutic agents such as Quarfloxin, a fluoroquinolone [18,19].
Guanine numbers and densities are very distinctive features of the genome. The aim of this study is to find repeating G motifs consisting of 2, 3, 4, 5, 6, and 7 guanines in the promoter sequences of selected genes important for carcinogenesis (50 tumor suppressor genes and 50 oncogenes). Previous studies have analyzed G-quadruplexes in oncogenes, but not other types of motifs or other genes such as tumor suppressor genes [20]. In this paper, the promoter sequences of both oncogenes and tumor suppressor genes (TSGs) are therefore the candidates for finding G-type densities.

Guanine density detection
Guanine nucleotide counts have been reported in other studies, especially for genomic locations such as telomeres, promoters, exons and introns [24]. G types, comprising GG, GGG, GGGG, GGGGG, GGGGGG and GGGGGGG, have been used to detect guanine density (GD) in promoter sequences [25]. For each promoter sequence, GD was detected by a computational program created for this study. The program searches for G types in the sequence between -500 and +100, in groups of 100 nucleotides, for both oncogenes and TSGs (Figure 2). According to the guanine density of each group, the oncogene and TSG promoter sequence profiles are characterized and listed in Table 1 (see supplementary material).
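The counting step described above can be sketched in Python. This is only a minimal illustration, not the program used in the study: it assumes that a G type is a maximal run of consecutive guanines, that the groups are six non-overlapping 100 nt bins, and that a run is assigned to the bin in which it starts. The function names `g_type_counts` and `bin_totals` are hypothetical.

```python
import re
from collections import Counter

UPSTREAM = 500   # nucleotides before the TSS (from the text)
BIN_SIZE = 100   # width of each positional group (from the text)
MIN_RUN = 2      # shortest G type counted (GG)

def g_type_counts(promoter, upstream=UPSTREAM, bin_size=BIN_SIZE):
    """Count maximal G runs per bin.

    Returns {(start, end): Counter({run_length: count})}, with positions
    given relative to the TSS, e.g. (-500, -400) ... (0, 100).
    Assumption: a run belongs to the bin in which it starts.
    """
    seq = promoter.upper()
    counts = {}
    for start in range(0, len(seq), bin_size):
        end = min(start + bin_size, len(seq))
        counts[(start - upstream, end - upstream)] = Counter()
    labels = list(counts)  # dicts keep insertion order, so bins stay sorted
    # The greedy pattern G{2,} matches each maximal run of two or more Gs once.
    for match in re.finditer("G{%d,}" % MIN_RUN, seq):
        bin_index = match.start() // bin_size
        counts[labels[bin_index]][len(match.group())] += 1
    return counts

def bin_totals(counts, max_run=7):
    """Total number of G types (G2..G<max_run>) in each bin, in bin order."""
    return [
        sum(n for length, n in per_bin.items() if length <= max_run)
        for per_bin in counts.values()
    ]
```

Longer runs (for example G8 or G9) remain available in the per-bin counters, so they can be reported separately from the G2 to G7 totals if needed.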
Our results indicate that the G profiles of oncogenes and TSGs show increasingly high density between -100 and 0 and reach maximum density after the transcription start site (TSS) (Table 1 and Table 2, respectively). The G types G2, G3, G4, G5, G6 and G7 increase in number along the promoter and reach their maximum level after the TSS, between 0 and +100. The G types show diverse densities in the different groups of the promoter sequences. In particular, the G2, G3, G4 and G5 types are detected more often than the other types. In addition, in the -100 to 0 segment the G6 type appears 8 times in both oncogenes and TSGs, which is especially rare. G2 is the most commonly found type, followed by G3, G4 and G5, in all of the groups. Unexpectedly, the G7 type is found. When the promoters of oncogenes are analyzed alone, the G3, G5, G6 and G7 types reach their maximum between -100 and 0, and the G2 and G4 types reach their maximum between 0 and +100. In the TSG promoters, the G4, G5, G6 and G7 types reach their maximum between -100 and 0, and the G2 and G3 types between 0 and +100. The maximum G density is found within 100 nucleotides before and after the TSS (Figure 3).
G-type density was compared before and after the TSS: the average G-type count of the five 100-nucleotide groups between -500 and 0 was compared with the G-type count of the group between 0 and +100. Before the TSS, averaged over the groups starting from -500, the oncogenes have 336 and the TSGs have 332 G types; after the TSS, up to +100, the oncogenes have 402 and the TSGs have 435 G types (Figure 4). Surprisingly, the -200 to -100 segment of both promoter sets contains distinct G types, such as G9 and G11 in oncogenes and two occurrences of G8 in TSGs. The number of G types increases from the -400 to -300 segment to the -300 to -200 segment and then decreases slightly. However, the last segment before the TSS and the 100 nucleotides after the TSS have the highest numbers of G types in the overall comparison.
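The before/after comparison reduces to simple arithmetic on the per-group totals, sketched here. The helper `upstream_vs_downstream` is a hypothetical name, and it assumes its input is the six group totals ordered from -500 to +100. Only the four averages quoted in the text are reused at the end; no per-group values are invented.

```python
def upstream_vs_downstream(bin_totals):
    """Compare the mean of the upstream groups with the downstream group.

    bin_totals: G-type totals ordered from the -500..-400 group to the
    0..+100 group (the last element is the group after the TSS).
    Returns (upstream_mean, downstream_total, downstream / upstream_mean).
    """
    *upstream, downstream = bin_totals
    upstream_mean = sum(upstream) / len(upstream)
    return upstream_mean, downstream, downstream / upstream_mean

# Ratios implied by the averages reported in the text (before vs after TSS).
reported = {"oncogenes": (336, 402), "TSG": (332, 435)}
for gene_set, (before, after) in reported.items():
    print(f"{gene_set}: after/before = {after / before:.2f}")
```

With the reported figures, the group after the TSS holds roughly 1.2 times (oncogenes) and 1.3 times (TSGs) as many G types as the average upstream group.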

Gene expression comparison
Since G profiles are found in promoter sites, which are important for the regulation of gene expression, we decided to compare the G profiles of selected cancer-related genes with their expression in colon cancer patients. The expression data were downloaded from The Cancer Genome Atlas (TCGA) data portal, which supplies many different types of patient genomic data, including gene expression, microRNA, RNA-seq, methylation, mutation and others. We downloaded expression data for 17,815 genes from 174 colon cancer patients and 46 controls. The average expression level of each of the 17,815 genes was determined and compared with the control data (Supplementary 2). Abnormal fold changes were found for the 50 selected oncogenes and 50 selected TSGs. According to the fold change, high-expressed and low-expressed genes are profiled with their G density in Table 3 (see supplementary material).
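A fold-change comparison of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: it assumes linear-scale, non-negative expression values, adds a pseudocount to avoid division by zero, and uses a hypothetical log2 threshold of 1.0 to label genes as high or low expressed. The text does not state which cutoff was applied, and the function names are invented for the example.

```python
import math
from statistics import mean

def log2_fold_changes(tumor, control, pseudocount=1.0):
    """Per-gene log2 fold change of mean tumor vs mean control expression.

    tumor, control: {gene: [expression value per sample]}.
    Assumption: values are on a linear scale and non-negative.
    """
    changes = {}
    for gene in tumor.keys() & control.keys():
        tumor_mean = mean(tumor[gene]) + pseudocount
        control_mean = mean(control[gene]) + pseudocount
        changes[gene] = math.log2(tumor_mean / control_mean)
    return changes

def classify(changes, threshold=1.0):
    """Label each gene 'high', 'low' or 'unchanged' (hypothetical cutoff)."""
    labels = {}
    for gene, value in changes.items():
        if value >= threshold:
            labels[gene] = "high"
        elif value <= -threshold:
            labels[gene] = "low"
        else:
            labels[gene] = "unchanged"
    return labels
```

The resulting labels could then be joined with the per-gene G-type counts to build a table along the lines of Table 3.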

Discussion:
Guanine numbers and densities are very distinctive features of the genome. In this study, we presented the G densities of motifs consisting of 2, 3, 4, 5, 6, and 7 guanines in the promoter sequences of selected genes important for carcinogenesis (50 tumor suppressor genes and 50 oncogenes). Previous studies presented different algorithms and methods to find guanine-rich regions and potential motifs, including different G-score methods for G-quadruplex calculation [20]. However, no previous study has shown the densities of other G repeats such as GG, GGG, GGGG, GGGGG, GGGGGG, GGGGGGG and GGGGGGGG. Our study is the first to compare G repeats in the promoters of tumor suppressor genes and oncogenes.
The promoter sequences were separated into groups of 100 nucleotides, from -500 to +100. Our results showed that the G profiles of oncogenes and TSGs show increasingly high density between -100 and 0 and reach maximum density after the transcription start site (TSS) (Table 1 and Table 2, respectively). G density increases from -500 toward +100 and is highest within ±100 nt of the TSS. The G densities were compared with expression data for the selected oncogenes and tumor suppressor genes in patients with colon cancer (n=174, Table 3).

The relation between gene expression and guanine-type density
Since G profiles are found in promoter sites, we decided to compare the G profiles of selected cancer-related genes with their expression in colon cancer patients. TCGA colon cancer data from 174 patients and 46 controls were compared. According to the fold changes, high- and low-expressed genes were determined relative to the controls. All types of G repeats in 18 highly expressed genes from both oncogenes and TSGs were analyzed.

Conclusions:
This study describes a method, implemented as a computer program, for quantitatively evaluating the conservation of different guanine types and densities. Guanine types, G (2-7), can be identified by the guanine density (GD) program in order to detect potential sequence motifs that are conserved in the promoters of oncogenes and TSGs. The program quickly and efficiently identifies regions of conserved guanine-type density, where there is a relatively high probability of sequence conservation. The program reported in this study is applicable to the analysis of large datasets.
Our results show that depending on the exact locations of the guanine types and density, the gene promoter sequences demonstrate conserved characteristics. In other words, the G densities increase closer to the transcriptional start site of both oncogenes and TSGs. The G density is the highest within the 100 base pairs proximal to the transcriptional start site. Identifying common conserved GDs may help us validate these findings on larger datasets to show the role of G densities in pathogenesis and disease.
Moreover, the G-type density demonstrates that the location and number of G repeats are conserved in the promoter sequences of oncogenes and TSGs. This paper may help elucidate the potential role of specific G types in therapeutic and diagnostic pursuits.