Computational analysis reveals abundance of potential glycoproteins in Archaea, Bacteria and Eukarya.

Glycosylation is the most common type of post-translational modification (PTM) and is known to affect protein stability, folding and activity. Inactivity of enzymes mediating glycosylation can result in serious disorders including colon cancer and brain disorders. Of the five main types of glycosylation, N-linked glycosylation is the most abundant and is characterized by the addition of a sugar group to an asparagine residue at the N-X-S/T motif. The enzyme mediating this transfer is known as oligosaccharyltransferase (OST). It has been hypothesized previously that a significant number of proteins are glycoproteins. In this study, we used Python scripts to quantify the representation of glycoproteins by scanning all proteome sequence data available at the ExPASy server for potential glycoproteins and for OST, the enzyme that plays the critical role in glycosylation. Our results suggest that more than 50% of proteins carry the N-X-S/T motif, i.e. they could be potential glycoproteins. Furthermore, approximately 28-36% (about one third) of proteins possess signature motifs that are characteristic features of OST. Quantifying this bias for each domain reveals that both the number of proteins tagged with the N-X-S/T motif and the average number of motifs per protein are significantly higher in eukaryotes than in prokaryotes. In light of these results, we conclude that there is a significant bias in the representation of glycoproteins in the proteomes of all species, most pronounced in eukaryotes, and we propose that glycosylation is the most common and ubiquitous PTM in cells, especially in eukaryotes.


Background:
Glycosylation is the most common type of post-translational modification (PTM) in proteomes and is thought to be one factor that enhances proteome diversity [1]. There are five main types of glycosylation: O-linked, N-linked, C-linked, P-linked and G-linked [2]. N-linked and O-linked glycosylation are the most abundant types [3]. O-linked glycosylation is characterized by the attachment of carbohydrate units to the hydroxyl (OH) group of Serine (S), Threonine (T), Tyrosine (Y), Hydroxyproline, or Hydroxylysine side chains [4]. In contrast, N-linked glycosylation generally means the attachment of carbohydrate units to the amide nitrogen of the asparagine in an Asn-Xaa-Ser/Thr (N-X-S/T) motif, where Xaa is any amino acid except Proline [5].
Species of the three domains of life possess unique biological characteristics that provide a basis for discriminating among them, but some biological characteristics are common to all of them [6]. One such common function is glycosylation, which is known to help prokaryotes invade host cells and induce pathogenicity [7]. Glycosylation is a ubiquitous form of PTM necessary for most cellular organisms [8]. It is also known to affect the stability of proteins, and some proteins need to be glycosylated in order to fold properly [9]. In the absence of glycosylation, immature proteins do not fold properly, which shows that it is an essential co-translational event for correct protein folding [10]. Furthermore, cells of the immune system employ glycosylation strategies for cell adhesion [11]. Problems with the glycosylation mechanism can result in serious disorders, including colon cancer and brain diseases. Understanding glycosylation is therefore of central importance [12].
In N-linked glycosylation, the N of the N-X-S/T motif serves as the acceptor site for the addition of glycan chains. The Xaa position can hold any amino acid except Proline, but it has also been shown that negatively charged residues at Xaa result in partial glycosylation, whereas positively charged residues are favorable [13]. It has also been shown experimentally that, owing to several structural constraints, only 66% of N-X-S/T motifs are glycosylated [14].
The enzyme responsible for charging proteins with sugar groups is generally known as oligosaccharyltransferase (OST) in eukaryotes. Bacterial homologs of OST are known as PglB, whereas archaeal enzymes are referred to as AglB. The C-terminal region of these enzymes has a signature motif of 5 residues with the sequence Trp-Trp-Asp-[Tyr/Asn/Phe/Trp]-Gly (WWD[YNFW]G) that is of central importance for the activity of OSTs. Any mutation in this motif results in deactivation of OSTs and consequently loss of glycosylation activity [15]. Other important motifs that have been identified in OSTs include the Met-Xaa-Xaa-Ile/Val/Met (MxxI/V/M) and Asp-Xaa-Xaa-Lys (DxxK) motifs [16], abbreviated here as the MI and DK motifs. The MI motif is found to be involved in enhancing the functional activity of OSTs. The DK motif in eukaryotes, especially yeast, has been shown to be important for the survival of the organism because it is involved in metal ion binding to the OSTs.
The presence of a large number of potential glycoproteins in proteomes is intriguing and points towards the importance of glycosylation for cells. In this study, we used Python scripts to quantify the distribution of glycoproteins in the proteomes of all cellular organisms. We also explored the number of possible OST enzymes present in proteomes by searching for their signature motifs in target proteins.
Our results support the findings Apweiler made 12 years ago and show that a significant proportion of all cellular proteomes is devoted to the all-important function of glycosylation. Eukaryotes carry the largest number of potential glycoproteins, followed by Bacteria and Archaea. This tendency is not surprising given the rich protein repertoires of eukaryotes.
The current study, based on the calculation of the total number of glycoproteins, should help in understanding the functional aspects of proteomes that have remained conserved in all three domains of life, and it reveals notable patterns in proteomes.
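The motif definitions given in this Background translate directly into regular expressions. The following is a minimal sketch of such a scan, not the script used in the study: the names `MOTIFS`, `parse_fasta`, `count_motifs` and `summarize` are hypothetical, overlapping occurrences are counted through lookahead (an assumption, since the counting rule is not stated), and the mean is taken over all proteins rather than only over motif carriers (also an assumption).

```python
import re

# Lookahead patterns count overlapping occurrences (assumption: the original
# script may have counted non-overlapping matches only).
MOTIFS = {
    "NXS/T": re.compile(r"(?=N[^P][ST])"),      # Asn-Xaa-Ser/Thr, Xaa != Pro
    "WWD[YNFW]G": re.compile(r"(?=WWD[YNFW]G)"),  # catalytic motif of OSTs
    "MI": re.compile(r"(?=M..[IVM])"),          # MxxI/V/M
    "DK": re.compile(r"(?=D..K)"),              # DxxK
}


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_motifs(sequence):
    """Return the number of occurrences of each motif in one protein."""
    return {name: len(p.findall(sequence)) for name, p in MOTIFS.items()}


def summarize(records):
    """Per motif: percentage of proteins carrying it and mean count per protein."""
    totals = {name: 0 for name in MOTIFS}
    carriers = {name: 0 for name in MOTIFS}
    n_proteins = 0
    for _header, sequence in records:
        n_proteins += 1
        for name, count in count_motifs(sequence).items():
            totals[name] += count
            if count > 0:
                carriers[name] += 1
    if n_proteins == 0:
        return {}
    return {
        name: {
            "percent_with_motif": 100.0 * carriers[name] / n_proteins,
            "mean_per_protein": totals[name] / n_proteins,
        }
        for name in MOTIFS
    }
```

Calling `summarize(parse_fasta(text))` on the text of a pooled FASTA file would give one summary per superkingdom. The N-X-P-S/T exclusion is handled by the `[^P]` character class, so a stretch such as `NPT` is not counted as a sequon.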

A Computational Program: Motif Percentage Calculations:
To calculate statistics related to the characterization of glycoproteins and OSTs in proteomes, a Python script was developed that first combines all the proteomes belonging to a superkingdom into a single file. For instance, the proteome data of each eukaryote is stored in a separate FASTA file, and the first step in the execution of our script is to pool the proteomes and generate separate files for Archaea, Bacteria, and Eukarya. This step is not necessary when handling global statistics for the Swiss-Prot and TrEMBL databases. The next step is to scan all the proteins for the presence of motifs linked to N-glycosylation. Specifically, the output reports the following statistics: total occurrences of N-X-S/T [...] (Figure 1A). Eukaryotes appear to have the highest number of potential glycoproteins, followed by Bacteria and then Archaea.
To test whether the observed differences between the three superkingdoms (Archaea, Bacteria, and Eukarya) are statistically significant, we conducted an Analysis of Variance for a Completely Randomized Design (ANOVA CRD) on a sample of 1,000 proteins from each of the three domains of life. The selected sample was parsed using Python, and the number of times the N-X-S/T motif is present in each protein was counted. All proteins in the sample had lengths in the range of 1000-1500 amino acids. The computed ANOVA CRD statistics are shown in Table 1 (see Supplementary material). The p-value for the differences between the three superkingdoms, computed in R at the 95% confidence level, was <2.2e-16, indicating that the results are significant at P<0.0002. Therefore, the null hypothesis (the mean counts of the N-X-S/T motif per protein are equal for all superkingdoms) is rejected. Tukey's Honestly Significant Difference (HSD) test was used to detect pairwise differences between the three superkingdoms in the sample data. All three superkingdoms differ significantly from each other at 95% confidence intervals, as shown in Figure 1B: none of the intervals spans zero, which implies that the superkingdoms differ significantly. The difference for the eukaryotes and bacteria pair is larger than for the other two pairs. The MI motif follows a similar pattern, i.e. eukaryotes show the highest number. However, the numbers for the DK motif deviate slightly from the general pattern.
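The one-way ANOVA described above reduces to a few sums over the per-protein motif counts. The sketch below computes the quantities that appear as columns in Table 1 (degrees of freedom, sums of squares, mean squares and F-ratio) using only the Python standard library. It is an illustration under stated assumptions, not the authors' analysis, which was run in R: the function name `one_way_anova` is hypothetical, the example counts are invented, and the p-value and Tukey's HSD intervals are left out because they require the F and studentized range distributions from a statistics package.

```python
def one_way_anova(groups):
    """One-way ANOVA for a completely randomized design.

    groups maps a group name (e.g. a superkingdom) to a list of
    per-protein N-X-S/T counts. Returns the ANOVA table entries.
    """
    all_values = [v for values in groups.values() for v in values]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their group mean
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Invented toy counts, for illustration only (not data from the study).
toy_counts = {
    "Archaea": [1, 2, 3],
    "Bacteria": [2, 3, 4],
    "Eukarya": [5, 6, 7],
}
toy_table = one_way_anova(toy_counts)
```

With 1,000 proteins in each of three groups, as in the study's sample, the degrees of freedom would be 2 between groups and 2,997 within groups.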
Eukaryotes still have the highest percentage of DK motifs (60.89%), followed by Archaea (50.88%) and then Bacteria (47.33%), as shown in Table 2 (see Supplementary material). Global analysis of Swiss-Prot and TrEMBL revealed that nearly 50.88% of proteins carry the DK motif in Swiss-Prot, whereas 45.25% of proteins are tagged with the DK motif in TrEMBL (Figure 1C). Figure 1D reports the average number of times a motif is present in an individual protein. The mean values are reported for all 4 motifs. On average, more than one NXS/T sequon was recorded in each potential acceptor protein in all three domains. According to our results, the tendency to carry glycosylation sites is higher in eukaryotes (~4 sequons per protein) than in the other two domains of life, archaea and bacteria (2.56 and 2.23, respectively). In contrast, one to two MxxI and DxxK motifs are present per protein, with eukaryotes again being the highest. Eukaryotes thus possess a higher content of potential glycoproteins. The literature also indicates that the highest percentage of glycosylated proteins is found in eukaryotes as compared to prokaryotes [18], implying that glycosylation is an important mechanism for the healthy survival of eukaryotic organisms.
Furthermore, co-occurrence of the motif pairs (WWD[YNFW]G+DK, WWD[YNFW]G+MI and MI+DK) was calculated to find the total number and percentage of proteins that can potentially act as OSTs. According to the results, the highest percentages for the WWD[YNFW]G+DK (0.056%) and WWD[YNFW]G+MI (0.057%) motif pairs were observed in the archaeal database, suggesting that these proteins can be potential OSTs. The percentage of the DK+MI co-occurrence pair was higher than that of the WWD[YNFW]G+DK and WWD[YNFW]G+MI pairs in all five databases, especially in eukaryotes, i.e. 40.44% (211381/522631) (Table 3, see Supplementary material). However, proteins possessing the MI+DK motif pair cannot be declared OSTs, as they lack the catalytic motif WWD[YNFW]G. Therefore, overall 28-36% of proteins can act as potential OSTs.
These potential OSTs may also possess an NXS/T sequon, i.e. a self-glycosylation site. Signature motifs in the catalytic center help to differentiate such OSTs from glycoproteins. To filter out the OSTs that possess catalytic center motifs and additionally a self-glycosylation site (NXS/T), three different combinations of catalytic center motifs together with the NXS/T motif were examined. WWD[YNFW]G+MI+DK+NXS/T was the first combination whose percentage was calculated. Archaeal proteomes showed a higher percentage, i.e. 0.052314%, for co-occurrence of all four motifs (Table 4). The percentage calculation for the DK motif (Table 2, see Supplementary material) and the literature suggest that it might be one of the components of the catalytic center of OSTs, but the MI motif has a higher chance of occurrence. Therefore, two sub-combinations of the WWD[YNFW]G+MI+DK+NXS/T combination were made: WWD[YNFW]G+NXS/T+MI (DK motif eliminated) and WWD[YNFW]G+NXS/T+DK (MI motif eliminated). The total numbers and percentages for these two sub-combinations were broadly similar, except in bacteria, where the difference was somewhat larger than in the other four databases. An interesting observation from the computed results was that the percentage occurrence of WWD[YNFW]G+MI [...] as shown in [17].
In 1999, Rolf Apweiler analyzed protein sequence data and proposed that approximately half the proteins in a proteome are glycoproteins [12].
BIOINFORMATION, open access, ISSN 0973-2063 (online), 0973-8894 (print). Bioinformation 6(9): 352-355 (2011). © 2011 Biomedical Informatics

Figure 1: (a) Total Protein Entries in all Five ExPASy Databases, (b) Differences in mean levels of treatment for all three datasets using Tukey's HSD test, (c) Percentage of four defined motifs in ExPASy's databases, (d) Average number of defined motifs per protein in ExPASy's databases.

Table 1: Analysis of Variance (ANOVA) on a sample dataset of 1,000 proteins each from the three superkingdoms
Source | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Squares (MS) | F-ratio | Computed P-value

Table 2: Statistics for all five databases at ExPASy: (a) Total number of proteins.

Table 3: Co-occurrences of WWD[YNFW]G+DK, WWD[YNFW]G+MI and MI+DK motifs in all five databases of ExPASy

Table 4: Self-glycosylation sites (NXS/T) in potential OSTs in all five ExPASy databases as of mid-2010