An ANN model for the identification of deleterious nsSNPs in tumor suppressor genes.

Human genetic variations primarily result from single nucleotide polymorphisms (SNPs) that occurs approximately every 1000 bases in the overall human population. The non-synonymous SNPs (nsSNPs), lead to amino acid changes in the protein product may account for nearly half of the known genetic variations linked to inherited human diseases and cancer. One of the main problems of medical genetics today is to identify nsSNPs that underlie disease-related phenotypes in humans. An attempt was made to develop a new approach to predict such nsSNPs. This would enhance our understanding of genetic diseases and helps to predict the disease. We detect nsSNPs and all possible and reliable alleles by ANN, a soft computing model using potential SNP information. Reliable nsSNPs are identified, based on the reconstructed alleles and on sequence redundancy. The model gives good results with mean specificity (95.85&), sensitivity (97.40&) and accuracy (96.25&). Our results indicate that ANNs can serve as a useful method to analyze quantitative effect of nsSNPs on protein function and would be useful for large-scale analysis of genomic nsSNP data. Availability The database is available for free at http://www.snp.mirworks.in


Background:
Single Nucleotide Polymorphism (SNP) represents the most abundant class of genetic variations in the human genome. Non-synonymous SNPs (nsSNPs), which cause the changes of amino acid residues in proteins, account for almost half of all DNA mutations and may be functionally neutral or deleterious [1,2]. The disease causing variations may cause deleterious effects on proteins. They may inactivate the functional sites or interaction sites of enzymes or impact the folding of proteins and may significantly destabilize the stability of proteins, or change the solubility of proteins [3,4]. So these variations represent critical molecular markers for dissecting the biological mechanisms underlying complex diseases, as well as for Pharmacogenomic studies. Such markers have become very popular for all kinds of genetic analysis, and disease like cancer. Cancer suppressors and Oncogenes play an important role in the control of the cell cycle, apoptosis, angiogenesis, and development processes that are under pressure of purifying selection [5]. Therefore, protein-damaging mutations in cancer-related genes would be expected to be under the pressure of purifying selection and thus to have a lower population frequency. The relationships between the genotype and phenotype of nsSNPs in tumor suppressor genes have received a plenty of research attentions because of their prevalence in the drug responses and cancer therapy [6,7]. Tumor suppressor genes are normal genes that slow down cell division, repair DNA mistakes, and apoptosis or programmed cell death. Non synonymous SNPs (nsSNPs) are the main cause of these mutations and to mining nsSNPs from cancer related genes considered as a laborious process and done only by site directed mutagenesis experiments and gene knock out/knock in experiments. Recently, some groups have tried to evaluate the deleterious nsSNPs based on 3-dimensional (3D) structure information of proteins and homology based SIFT (Sorting Intolerant from Tolerant) algorithm (http://sift.jcvi.org/, [1]). Some other methods based on site entropy calculations, relative stability changes were also developed for predicting deleterious nsSNPs [8]. These methods based on protein sequence have been demonstrated that the accuracy is the same as other methods using tertiary structure information. However, the theoretical prediction methods for deleterious nsSNPs are still in its infancy since the 3D structural information of most proteins are still unavailable. To overcome these, our primary challenge is that how to accurately predict those potentially deleterious nsSNPs. Deleterious nsSNPs prediction for the tumor suppressor genes has received great focus from experimental researchers. In this work, we suggest a computational model used to predict deleterious nsSNPs in tumor suppressor genes. Evolutionary conservation features and changes in the physicochemical properties of amino acid are used as parameters for ANN. The method predicts deleterious SNPs from a dbSNP id or a SNP sequence. Both fasta and raw formats are acceptable as input sequence. ANN verification is done for predicted deleterious SNPs. A database search is also included for known deleterious nsSNPs.

Physicochemical properties:
In this study many properties of amino acids are considered and filtered out based on the relevance of our prediction. Finally, following physicochemical properties of amino acid sequence are considered. They are, changes in hydrophobicity, free energy of transfer from inside to outside of a globular protein, hybridization potential based on free energy of transfer (kcal/mol), hydrophilicity, optimized matching hydrophobicity, molecular weight of amino acids, average flexibility index and solvent accessibility of amino acids. From the parameters input matrix are created for the prediction system.

ANN prediction:
The prediction of the deleterious SNP is carried out with an adaptive artificial neural network (ANN). In this study, ANN was chosen as the tool for the prediction, as they are powerful classifiers whose ability to cope with complex data and their potential for modeling data of high non-linearity [12, 13]. We used a feed forward Multi Layer Perceptron (MLP), with four layers. From the selected twelve parameters (four evolutionary conservation and eight physicchemical properties), undoubtedly the input layer of the neural network must have twelve neurons. In the hidden layers, various combinations were tried out and we got the best results as eight neurons in the first hidden layer and four in the second hidden layer, by trial and error. The output layer has one neuron. Back propagation algorithm was used to train the network. Training can be performed with use of several optimization schemes and there is access to exact partial derivatives of network outputs versus its inputs. The learning rate and momentum were initially set at 0.2 and 0.8 respectively. The training dataset is divided into two subsets. First subset (70% of the total training data) was used to train the neural network. Second subset was used to stop the training process once the model had reached the performance conditions like optimal error value thus preventing over training. Once the training is stopped, the efficiency of the model was further assessed by presenting another data subset, to determine the performance for unseen cases which were not involved in the training process [14]. Optimization was done by repeating the process with different data subsets. The optimization needs nearly 500 epochs for this network.

Discussion:
A vast numbers of SNPs are seen in human chromosome and presently there are more than a million SNPs in dbSNP that can be screened for association with diseases. We used a good dataset for human nsSNP predictions from the Swiss-Prot annotated 'disease' and 'polymorphism' variants of known human proteins. Variants annotated neutral polymorphisms may have an unknown association with disease. For prediction type problems, a prediction can be either positive or negative. These counts falls into four categories: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). These contents are used to calculate sensitivity (true positive rates), specificity (1false positive rates) and total prediction accuracy for assessment of the prediction system. A test dataset is prepared which consists of protein sequences, obtained from the Swiss-Prot database and NCBI human genome protein sequence. The mean specificity (95.85%), sensitivity (97.40%) and accuracy (96.25%) of the ANN prediction were obtained from the test results.
Using novel combination of parameters, the ANN predictor performs well in terms of specificity, sensitivity and accuracy. When compared with the previous theoretical studies, Ridge Partial Least Square (RPLS) and Linear Discriminant Analysis (LDA), our method has better performance measures. The prediction accuracy of the RPLS model and LDA model are 84.8% and 80.4% respectively [1]. This improvement in accuracy of our model is due to combined feature framework into an ANN predictor. Our method uses amino acid sequence properties and other theoretical prediction methods for deleterious nsSNP prediction requires 3D structural information of proteins, but 3D structural information of most of the proteins are still unavailable.

Figure 1:
Organizational chart of deleterious nsSNP prediction system. The input to the method is in two forms: dbSNP id and amino acid sequence. If the sequence is dbSNP id, search the database for similar entries. The output contains general information of the gene and nsSNP, information on the variant and sequence based information.
We investigated the role of two parameter sets (evolutionary features and physicochemical properties) in the ANN predictor. A measure of how each feature set contributes to the prediction performance of ANN can be calculated in the course of training. We estimated the relative importance of two feature sets in prediction. For that purpose, input layer of the ANN was redesigned by removing one feature set from the input vector field. The performance of the ANN was evaluated by giving a new feature vector after omitting a feature set whose performance being evaluated. Same dataset is used for training and evaluation of the ANN. The test result is compared with mean accuracy of the ANN predictor. The accuracy decreased to 23.4% and 17.5% from the original ANN predictor's accuracy. Physicochemical attributes gives the information sufficient to identify the amino acid substitution involved. But results are better when evolutionary features and physicochemical properties are combined together. This demonstrates that each parameter in the feature set makes a significant positive contribution to the overall performance of the model, though predictability. Thus, any good predictor of result which relies upon a single set characteristic will fall short of the accuracy obtainable by a combination of characteristics.
The system architecture of the prediction system is given in Figure 1 The database search and a prediction models are incorporated in the web server. The interface of the deleterious nsSNP web server is shown in Figure 2.

Figure 2:
The interface for prediction and results. SNP_IDs are available at NCBI site.

Conclusion:
The functional analysis of deleterious nsSNPs may offer lead to understand the genetic differences between individuals and disease states. It eventually improves the medical treatments by allowing the prediction of genetically related disease risk and drug response. Therefore, computational prediction of deleterious nsSNPs gives great importance and promotes the development of pharmacogenetics and pharmacogenomics. At protein structure level, in-vitro experimental data have shown that the deleterious nsSNPs might contribute to changes of the structure, stability and function of proteins. At amino acid sequence level, the deleterious nsSNPs might be expected to produce a least conservative replacement. In this work, parameters of position-specific features and scale of amino acid change were calculated and is used to predict deleterious nsSNPs in the tumor suppressor genes using ANN, combining physicochemical properties of amino acids. Available genes and nsSNPs are included in the database search.