Analysis of amino acid patterns in receptor tyrosine kinases using the Boolean association rule

Cancers are characterized by unrestricted cell division and by independence from growth factors and other external signals. The eukaryotic parental cells of tumors, on the other hand, constitute tissues and higher structures such as organs and systems, and are capable of performing various functions in a highly coordinated fashion. Hence, cancer cells may be considered entities capable of incessant growth and cell division but lacking any evolutionarily advanced intracellular or intercellular regulation. Since receptor tyrosine kinases (RTKs) are highly altered and exist in deregulated or constitutively active forms in cancer cells, achieved through various epigenetic mechanisms, we hypothesize that the functional RTKs in cancer cells resemble their counterparts in more primitive species. Analysis of RTK sequences from various species and from cancer is therefore expected to test this hypothesis. Association rules in data mining can reveal hidden biological information. This study uses the Boolean association rule to mine the occurrence pattern of glycine, arginine and alanine in RTKs of invertebrates, vertebrates and cancer-related vertebrate RTKs, based on protein sequence information. The results reveal that vertebrate cancer RTKs resemble prokaryote and invertebrate RTKs, showing an increasing trend in glycine and alanine and a decreasing trend in arginine composition. The proportions of sequences in vertebrates : invertebrates : prokaryotes : vertebrate cancer with glycine (>=6.1) were 42.86 : 50.0 : 85.71 : 100%, with alanine (>=6.2) were 10.72 : 66.67 : 85.71 : 100%, and with arginine (>=5.9) were 21.43 : 16.67 : 14.29 : 0%, respectively. In conclusion, the results of this study support our hypothesis that cancer cells may resemble lower organisms, since cancer cells are functionally unresponsive to external signals and the various regulatory mechanisms typically found in higher eukaryotes are largely absent.


Background:
Data mining techniques can be applied to study the behavior of different amino acids in protein sequences. Association rule mining is a widely used data mining technique. It involves counting frequent patterns (or associations) in large databases and reporting all that exist above a minimum frequency threshold known as the 'support' [1].
The receptor tyrosine kinase (RTK) pathway plays critical roles in the growth and division of cells. The RTK family comprises numerous cell-surface receptors that mediate cell growth, differentiation, migration and metabolism [2]. RTKs have an extracellular portion to which polypeptide ligands bind, a single-pass transmembrane helix, and a cytoplasmic portion containing a protein tyrosine kinase domain that catalyses phosphoryl transfer from ATP to tyrosine (Tyr) residues in protein substrates [3]. In cancer cells, mutations in the genes encoding RTKs and various epigenetic mechanisms such as alternative splicing lead to inappropriate activation of kinases, resulting in uncontrolled cell division [4]. Amino acid restriction sends normal cells into a quiescent mode, their growth and division cycles being shut down in a reversible manner. Tumour cells usually fail to move out of cycle, and the resulting imbalance generally leads to cell death in a matter of days [5]. Our preliminary studies reveal that the percentage of each amino acid (except glycine, arginine and alanine) is approximately the same in most RTK protein sequences irrespective of species or taxon, whether the sequences are vertebrate, invertebrate or cancer sequences. Glycine is a nonpolar neutral amino acid with a hydropathy index of -0.4. Glycine was found to reduce tumour growth in rats. Dietary glycine prevented increases in cell proliferation, a key event in cancer development, suggesting that it may be an effective anti-cancer agent [6]. This study attempts to analyse the variations in the occurrence of the amino acids glycine, arginine and alanine in RTKs of invertebrates, vertebrates and cancers using the association rule mining technique.

Methodology:
Data Source and Data Selection
The complete RTK protein sequences were collected from the NCBI databases (www.ncbi.nlm.nih.gov/) and Swiss-Prot. There are 28 vertebrate sequences, 6 invertebrate sequences, 7 prokaryote sequences and 2 cancer sequences. The shortest vertebrate and invertebrate sequences have lengths of 1045 and 799, respectively. The two cancer sequences, human cancer and mouse cancer, have the same length, 1620. The ProtParam tool on the ExPASy server (web.expasy.org/protparam/) was used to calculate the protein parameters. The ProtParam results showed interesting features for glycine, arginine and alanine; hence these three amino acids were selected as the feature set for association rule mining.
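The percentage composition that drives the feature selection can be illustrated with a minimal sketch. This is not the ProtParam implementation; it only reproduces the basic idea of counting residues in a one-letter sequence. The function name `composition_percent` and the toy sequence are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def composition_percent(sequence: str) -> dict[str, float]:
    """Return the percentage of each residue in a one-letter protein sequence.

    Hypothetical helper for illustration only; ProtParam reports many more
    parameters than this.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


# Toy 10-residue sequence (not a real RTK): 2 G, 3 A, 1 R, 2 L, 2 K.
toy_composition = composition_percent("GAGARALKLK")
glycine_pct = toy_composition["G"]
alanine_pct = toy_composition["A"]
arginine_pct = toy_composition["R"]
```

For a real sequence, the three values of interest would be read from the keys `G` (glycine), `A` (alanine) and `R` (arginine).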
In this study, we consider each amino acid as an item, the protein sequence as a basket that contains items, and each taxon or species as one transaction. Association rule mining was applied to these transactions to obtain meaningful associations among the amino acids and to determine how frequently each amino acid is present in the transactions. The quantitative values of the items were mapped to Boolean values, and Boolean association rule mining was then applied to study the behaviour of the amino acids in the sequences.

Association rule
An 'association rule' is a pair of disjoint itemsets. If LHS and RHS denote the two disjoint itemsets, the association rule is written as LHS→RHS, i.e. LHS and RHS are sets of items, the RHS set being likely to occur whenever the LHS set occurs. The 'support' of the association rule LHS→RHS with respect to a transaction set T is the number of transactions in T that contain LHS ∪ RHS divided by the total number of transactions |T|. The 'confidence' of the rule LHS→RHS with respect to T is the ratio support(LHS ∪ RHS) / support(LHS) [9].
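The two measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool used in the study: the function names, the toy transactions and the convention of returning 0 when the LHS never occurs are all assumptions made here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(LHS ∪ RHS) / support(LHS) for the rule LHS -> RHS."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        # Assumed convention: a rule whose LHS never occurs gets confidence 0.
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Toy transactions with abstract items, for illustration only.
toy_transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]

rule_support = support(toy_transactions, {"a", "b"})        # 2 of 4 transactions
rule_confidence = confidence(toy_transactions, {"a"}, {"b"})  # 2 of the 3 with "a"
```

Multiplying either value by 100 gives the percentage form in which support and confidence are reported in this paper.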

Boolean association rule
Boolean values were used to represent the presence or absence of an item in a transaction. '0' represents the absence of a particular item in the transaction and '1' represents its presence.

Data Optimization
The quantitative values in the glycine, arginine and alanine columns are converted to Boolean form, i.e. 0 and 1; see Table 1 (supplementary material). Every amino acid column is divided into two groups; this grouping is necessary for the conversion to Boolean form. The grouping is based on the quantitative values, their variation and their range, for example glycine<6.1% and glycine>=6.1%. In most of the transactions the glycine percentage is either well below 6.1% or above it, so 6.1% can be used as the boundary for this classification. A '1' in the table represents the presence and a '0' the absence of that item in that particular transaction. Six items in total are considered (i.e. glycine<6.1%, glycine>=6.1%, arginine<5.9%, arginine>=5.9%, alanine<=6.2% and alanine>6.2%) (Figure 1).
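The conversion to Boolean items can be sketched as follows. The cut-offs and item names follow the item list in this section (note that alanine is split as <=6.2% / >6.2% there); the function name `to_boolean_items` and the example percentages are hypothetical and are not values from Table 1.

```python
# Cut-offs as listed in this section.
THRESHOLDS = {"glycine": 6.1, "arginine": 5.9, "alanine": 6.2}


def to_boolean_items(percent):
    """Map percentage compositions to 0/1 flags, two items per amino acid."""
    row = {}
    for amino_acid, cut in THRESHOLDS.items():
        value = percent[amino_acid]
        if amino_acid == "alanine":
            # Alanine uses <= / > in the item list of this section.
            low = value <= cut
            row[f"{amino_acid}<={cut}%"] = int(low)
            row[f"{amino_acid}>{cut}%"] = int(not low)
        else:
            low = value < cut
            row[f"{amino_acid}<{cut}%"] = int(low)
            row[f"{amino_acid}>={cut}%"] = int(not low)
    return row


# Hypothetical composition of one sequence, for illustration only.
example_row = to_boolean_items({"glycine": 7.0, "arginine": 5.0, "alanine": 6.2})
```

Each sequence then contributes exactly one '1' per amino acid, so the two items of a pair are mutually exclusive within a transaction.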

Discussion:
Association rules are widely used in market basket analysis and can also reveal biologically relevant associations between different genes or between environmental effects and gene expression [9]. The results show that the glycine composition is greater than or equal to 6.1% in 42.86% of normal vertebrate sequences, 50% of invertebrate sequences, 85.71% of prokaryote sequences and 100% of cancer sequences, reflecting an increasing trend of glycine from normal vertebrate to cancerous RTK proteins (Table 1). Similarly, alanine and arginine show increasing and decreasing trends, respectively, from normal vertebrate sequences to cancer sequences. Correspondingly, the confidence values show that if arginine is less than 5.9%, then alanine is always less than or equal to 6.2% in vertebrates. Similarly, if glycine is less than 6.1%, then with 93.75% confidence alanine is less than or equal to 6.2% and arginine is less than 5.9%. These rules describe how one amino acid is associated with another.
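As a quick consistency check, the glycine support percentages quoted above can be related to whole-sequence counts using the group sizes given in the Methodology (28, 6, 7 and 2 sequences). The counts below are inferred by rounding and are an assumption of this sketch; they are not read from Table 1, and the helper name `implied_count` is hypothetical.

```python
# (number of sequences, reported support in % for glycine >= 6.1)
reported_glycine = {
    "vertebrates": (28, 42.86),
    "invertebrates": (6, 50.0),
    "prokaryotes": (7, 85.71),
    "cancer": (2, 100.0),
}


def implied_count(n_sequences, support_percent):
    """Nearest whole number of sequences matching a reported support value."""
    return round(n_sequences * support_percent / 100.0)


implied = {
    group: implied_count(n, pct) for group, (n, pct) in reported_glycine.items()
}

for group, (n, pct) in reported_glycine.items():
    k = implied[group]
    # Recompute the percentage from the inferred count for comparison.
    print(f"{group}: {k}/{n} = {100.0 * k / n:.2f}% (reported {pct}%)")
```

Under this reading, each reported percentage corresponds to a whole number of sequences in its group, which is what a support value computed over a small transaction set should look like.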
Both the human and mouse cancer sequences possess similar characteristics (Table 1). It can be seen that the support is either 0 or 100. This reflects that, in both cancer sequence transactions, a particular item is either present or absent. The confidence levels are zero for the above-mentioned combinations. It can be assumed that the association between different items varies from vertebrates to cancer. RTK activity in resting, normal cells is tightly controlled. When they are mutated or structurally altered, RTKs become potent oncoproteins: abnormal activation of RTKs in transformed cells has been shown to be involved in the development and progression of many human cancers [10]. Consequently, RTKs and their growth-factor ligands have become rational targets for therapeutic intervention using humanized antibodies and small-molecule drugs. Although a complete understanding of RTK function and dysfunction in diverse tissues and multiple biological processes is still to be achieved, studies of members of this family have already had a significant impact on cancer therapy [10].
Analysis of the data makes clear that glycine and alanine show an increasing trend from normal vertebrate sequences to cancer sequences, whereas arginine shows a decreasing trend. The association among these three amino acids can be established as follows: the support for glycine<6.1 decreases when the support for arginine<5.9 increases. It is also evident that there is a trend in the increase/decrease of amino acid composition from vertebrates to cancer sequences.

Conclusion:
In this paper, the Boolean association rule mining technique has been applied to find differences in the frequency of occurrence of a few important amino acids in RTKs of different species. The analysis shows that, with respect to the three amino acids glycine, arginine and alanine, cancer sequences are more similar to invertebrate and prokaryote sequences, which may suggest that cancer RTKs de-evolve.

Arginine is a polar, positively charged amino acid with a hydropathy index of -4.5. It is involved in a number of biosynthetic pathways that significantly influence carcinogenesis and tumour biology [7]. Alanine is a neutral nonpolar amino acid with a hydropathy index of 1.8. Elevated rates of glucose and alanine turnover and of gluconeogenesis from alanine were detected in patients who had advanced lung cancer with weight loss [8].