Screening and structural evaluation of deleterious Non-Synonymous SNPs of ePHA2 gene involved in susceptibility to cataract formation

Age-related cataract is clinically and genetically heterogeneous disorder affecting the ocular lens, and the leading cause of vision loss and blindness worldwide. Here we screened nonsynonymous single nucleotide polymorphisms (nsSNPs) of a novel gene, EPHA2 responsible for age related cataracts. The SNPs were retrieved from dbSNP. Using I-Mutant, protein stability change was calculated. The potentially functional nsSNPs and their effect on protein was predicted by PolyPhen and SIFT respectively. FASTSNP was used for functional analysis and estimation of risk score. The functional impact on the EPHA2 protein was evaluated by using SWISSPDB viewer and NOMAD-Ref server. Our analysis revealed 16 SNPs as nonsynonymous out of which 6 nsSNPs, namely rs11543934, rs2291806, rs1058371, rs1058370, rs79100278 and rs113882203 were found to be least stable by I-Mutant 2.0 with DDG value of > −1.0. nsSNPs, namely rs35903225, rs2291806, rs1058372, rs1058370, rs79100278 and rs113882203 showed a highly deleterious tolerance index score of 0.00 by SIFT server. Four nsSNPs namely rs11543934, rs2291806, rs1058370 and rs113882203 were found to be probably damaging with PSIC score of ≥ 2. 0 by Polyp hen server. Three nsSNPs namely, rs11543934, rs2291806 and rs1058370 were found to be highly polymorphic with a risk score of 3-4 with a possible effect of Non-conservative change and splicing regulation by FASTSNP. The total energy and RMSD value was higher for the mutant-type structure compared to the native type structure. We concluded that the nsSNP namely rs2291806 as the potential functional polymorphic that is likely to have functional impact on the EPHA2 gene.

associated with autosomal dominant cataracts and recently it was implicated in age-related cortical cataracts in humans and mice [8][9][10]. EPHA2 interacts with membrane-bound ephrin ligands, which play an important role in morphogenesis and in numerous developmental processes [11]. Recently, several EPHA2 single nucleotide polymorphisms were found to be associated with age-related cortical cataracts in different worldwide Caucasian populations [8,9,12]. It is estimated that around 90% of human genetic variations are differences in single bases of DNA, known as single nucleotide polymorphisms (SNPs) [13,14]. Among them, non-synonymous single nucleotide polymorphisms (nsSNPs), that cause amino acid changes in proteins have the potential to affect both protein structure and protein function [15,18]. These nsSNPs can result in inherited diseases and drug sensitivity [19,21]. These nsSNPs affect gene expression by modifying DNA and transcription factor binding [22,23] and inactivate active sites of enzymes or change splice sites, thereby produce defective gene products [24,25]. Some of the mutations in nsSNP sites are not associated with any changes in phenotype and are considered functional neutral, but others bringing deleterious effects to protein function and are responsible for many human genetic diseases [26,27]. There are several databases with these variations of SNPs, such as the human genome variation database, HGVBase [28], National Center for Biotechnology Information (NCBI) database, dbSNP [29], and SWISSPROT [30]. The large size of these databases presents a challenging hurdle for annotating the effects of all nsSNPs by experimental approaches. Therefore, computational methods that can quickly distinguish diseasingcausing nsSNPs from neutral nsSNPs are in urgent need.
Epidemiologic association studies focus a great amount of effort into identifying SNPs in genes that may have an association with disease risk. Often, the SNPs that have an association with disease are those that are known as nonsynonymous SNPs, which result in an amino acid substitution. Many molecular epidemiologic studies focus on studying SNPs found in coding regions in hopes of finding significant association between SNPs and disease susceptibility, but often find little or no association [31]. With the availability of high-throughput SNP detection techniques, the population of nsSNPs is increasing rapidly, providing a platform for studying the relationship between genotypes and phenotypes of human diseases. Our ability to better select a nsSNP for an association study can be enhanced by first examining the potential impact an amino acid variant may have on the function of the encoded protein with the use of different SNP detection programs like, I-Mutant, Sort Intolerant from Tolerant (SIFT) and Polymorphism Phenotype (PolyPhen) [31]. Discovering the deleterious nsSNPs out of a pool of all the SNPs will be very useful for epidemiological population based studies. So the main aim of this study is to identify deleterious nsSNPs associated with EPHA2 gene. We undertook this work mainly to perform computational analysis of the nsSNPs in EPHA2 gene and to identify the possible mutations and proposed modeled structures for the mutant proteins and compared them with the native protein. We report that, the mutation namely, E825K in the native protein of EPHA2 gene could be a candidate of major concern for cataract formation.

Screening of SNPs
We used dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) to identify SNPs and their related protein sequence for the EPHA2 gene [29].
Protein stability prediction I-Mutant 2.0 software was used to predict nsSNP causing protein stability change [34]. I-Mutant 2.0 is a support vector machine (SVM)-based tool for the automatic prediction of protein stability change upon single amino acid substitution. The protein stability change was predicted from the EPHA2 protein sequence (NP_004422). The software computed the predicted free energy change value or sign (DDG) which is calculated from the unfolding Gibbs free energy value of the mutated protein minus unfolding Gibbs free energy value of the native protein (kcal/mol). A positive DDG value indicates that the mutated protein possess high stability and vice versa.

Analysis of functional effect on protein
There are many web-based resources available that allow one to predict whether nonsynonymous coding SNPs may have functional effects on proteins. We chose SIFT [35] to perform protein conservation analysis and predict the phenotypic effect of amino acid substitutions. SIFT is based on the premise that protein evolution is correlated with protein function. Variants that occur at conserved alignment positions are expected to be tolerated less than those that occur at diverse positions. The algorithm uses a modified version of PSIBLAST [36] and Dirichlet mixture regularization [37] to construct a multiple sequence alignment of proteins that can be globally aligned to the query sequence and belong to the same clade. The underlying principle of this program is that it generates alignments with a large number of homologous sequences and assigns scores to each residue, ranging from zero to one. SIFT scores [38] were classified as intolerant (0.00-0.05), potentially intolerant (0.051-0.10), borderline (0.101-0.20), or tolerant (0.201-1.00). Higher the tolerance index of a particular amino acid substitution, lesser is its likely impact.

Simulation for functional change in coding nsSNPs
PolyPhen [39] is a computational tool for identification of potentially functional nsSNPs. Predictions are based on a combination of phylogenetic, structural and sequence annotation information characterizing a substitution and its position in the protein. For a given amino acid variation, PolyPhen performs several steps: (a) extraction of sequencebased features of the substitution site from the UniProt database, (b) calculation of profile scores for two amino acid variants, (c) calculation of structural parameters and contacts of a substituted residue. PolyPhen scores were classified as 'benign' or 'probably damaging' [38]. Input options for the PolyPhen server are protein sequence or accession number together with sequence position with two amino acid variants. We submitted the query in the form of protein sequence with mutational position and two amino acid variants. PolyPhen searches for three-dimensional protein structures, multiple alignments of homologous sequences and amino acid contact information in several protein structure databases. Then, it calculates position-specific independent counts (PSIC) scores for each of two variants, and computes the difference of the PSIC scores of the two variants. The higher a PSIC score difference, the higher functional impact a particular amino acid substitution is likely to have. PolyPhen scores of >2.0 are expected to be "Probably damaging" to protein structure and function and PolyPhen scores of 1.99-1.50 are expected to be "Possibly damaging" to protein function [33].

Analysis of functional nsSNPs by FASTSNP
The Functional Analysis and Selection Tool for Single Nucleotide Polymorphism (FASTSNP) is a web server which connects many softwares and databases for processing analysis [40]. We used FASTSNP for the prediction of the functional effect of nsSNPs and an estimation of their risk score. The FASTSNP uses a decision tree for prioritizing the functional effect and estimating risk score. The nsSNPs were submitted for FASTSNP analysis and output files were displayed as a decision tree.

Total energy and RMSD calculations
Structure analysis was performed for evaluating the structural stability of native and mutant protein. We used the web resource SAAPdb [41] to identify the protein coded by EPHA2 gene (PDB id 1MQB). The mutation position and residue were in complete agreement with the results obtained with SIFT and PolyPhen programs. The mutation was performed by using SWISSPDB viewer and energy minimization for 3D structures was performed by NOMAD-Ref server [42]. This server use Gromacs as default forcefield for energy minimization based on the methods of steepest descent, conjugate gradient and L-BFGS methods [43]. We used conjugate gradient method for optimizing the 3D structures. Deviation between the two structures is evaluated by their RMSD values. Divergence in mutant structure with native structure is due to mutation, deletions, and insertions [44] and the deviation between the two structures is evaluated by their RMSD values which could affect stability and functional activity [45].

SNP dataset
The EPHA2 gene investigated in this work contained a total of 181 SNPs, of which 16 were nsSNPs, 9 were synonymous SNPs, and none were in noncoding regions. The rest were in the intron region. We selected nonsynonymous coding SNPs for our investigation Table 1 (see supplementary material).

Stability of nsSNPs
More negative is the DDG value, less stable the given point mutation is likely to be, as predicted by I-Mutant 2.0 server. We obtained 8 nsSNPs that were found to be less stable by this server as shown in Table 1. Out of 8 nsSNPs, 6 nsSNPs, namely rs11543934, rs2291806, rs1058371, rs1058370, rs79100278 and rs113882203 showed a DDG value of > −1.0. The remaining nsSNPs showed a DDG value of < −1.0, as depicted in Table 1. Since these six nsSNPs showed higher DDG value, we considered these nsSNPs to be least stable and deleterious.

Prediction of deleterious nsSNPs by SIFT
Sixteen nsSNPs retrieved from EPHA2 gene, submitted independently to the SIFT showed 8 nsSNPs to be deleterious, having the tolerance index score of ≤ 0.05. The results are shown in Table 1 (see supplementary material). We observed that, out of 8 deleterious nsSNPs, 6 nsSNPs showed a highly deleterious tolerance index score of 0.00 Table 1 (see supplementary material). We found that among these 6 nsSNPs that are seen to be deleterious according to SIFT, 4 nsSNPs namely rs2291806, rs1058370, rs79100278 and rs113882203 were also found less stable by I-Mutant 2.0 server.

Damaged nsSNPs by PoluPhen
To identify the EPHA2 nsSNPs that affected protein structure, the EPHA2 nsSNPs were analyzed for predicting a possible impact of amino acids on the structure and function of the protein using PolyPhen server. The EPHA2 protein sequence (NP_004422) with each nsSNP position and their 2 amino acid variants was submitted as input for analyzing the protein structural change due to amino acids. Our result showed, 4 nsSNPs namely rs11543934, rs2291806, rs1058370 and rs113882203 to be probably damaging with PSIC score of ≥ 2. 0. The nsSNPs namely, rs2291806, rs1058370 and rs113882203 which were observed to be the cause of protein less stability by I-Mutant 2.0 server and SIFT were also predicted to be probably damaging by PolyPhen server. In addition two nsSNPs were highly predicted as possibly damaging and the remaining as benign Table 1

Functional effect and estimated risk of nsSNPs
In order to identify nsSNP with a high possibility of having a functional effect, FASTSNP was applied for the detection of nsSNP influence on cellular and molecular biological function e.g. transcriptional and splicing regulation. In addition the estimation of risk score was also calculated by FASTSNP. The functional effect and estimated risk of EPHA2 nsSNPs are shown in Table 2 (see supplementary material). Three nsSNPs exhibited medium-high risk score (risk score = 3-4). The functional nsSNPs were rs11543934, rs2291806 and rs1058370. The eight nsSNPs showed low-medium risk (risk score=2-3). The risk score of the remaining four nsSNPs was not predicted. The most important finding detected by FASTSNP were the two nsSNPs namely rs2291806 and rs1058370 to have high possible functional effect. These were also found polymorphic by I-Mutant 2.0; SIFT as well as by PolyPhen server.

Structural modeling of mutant protein
Mapping the deleterious nsSNPs into protein structure information was obtained from Single Amino Acid Polymorphism database (SAAPdb) [19]. The available structure for EPHA2 gene is reported to be having a PDB id 1MQB. According to this resource, the mutations mainly occurred for 1MQB at one SNP id, namely rs2291806. The mutation was at the residue position E825K. The mutation for 1MQB at the position 825 was performed by SWISS-PDB viewer to get modeled structures. Then, energy minimizations was performed by NOMAD-Ref server [32] for the native type protein (PDB 1MQB) and the mutant type protein (1MQB) E825K. It can be seen from Table 3 that total energy for the native type structure (PDB 1MQB) and the mutant type structure 1MQB E825K are found to be −7476.263 and −13238.058 kcal/mol, respectively. Table 3 (see supplementary  material), also shows that the RMSD values between the native type (1MQB) and the mutant type (1MQB) E825K is 2.36 A˚. Since the RMSD values and total energy after energy minimization are very high for the mutant type structure as compared to the native type structure (1MQB), we may presume that, this mutation causes a significant change in the mutant structure of the protein with respect to the native protein structure. This mutation was also predicted to be functionally significant based on I-Mutant, SIFT, PolyPhen and FASTSNP results.

Discussion:
Our analysis revealed 16 SNPs as nonsynonymous out of which 6 nsSNPs, namely rs11543934, rs2291806, rs1058371, rs1058370, rs79100278 and rs113882203 were found to be least stable by I-Mutant 2.0 with DDG value of > −1.0. nsSNPs, namely rs35903225, rs2291806, rs1058372, rs1058370, rs79100278 and rs113882203 showed a highly deleterious tolerance index score of 0.00 by SIFT server. Four nsSNPs namely rs11543934, rs2291806, rs1058370 and rs113882203 were found to be probably damaging with PSIC score of ≥ 2. 0 by PolyPhen server. Three nsSNPs namely, rs11543934, rs2291806 and rs1058370 were found to be highly polymorphic with a risk score of 3-4 with a possible effect of Non-conservative change and splicing regulation by FASTSNP. The total energy and RMSD value was higher for the mutant-type structure compared to the native type structure. A major interest in human genetics is to distinguish mutations that are functionally neutral from those that contribute to disease. Amino acid substitutions currently account for approximately half of the known gene lesions responsible for human inherited disease. Therefore, the identification of nsSNPs that affect protein functions and relate to disease is an important task. The effect of many nsSNPs will probably be neutral as natural selection will have removed mutations on essential positions. Assessment of non-neutral SNPs is mainly based on phylogenetic information (i.e. correlation with residue conservation) extended to a certain degree with structural approaches. However, there is increasing evidence that many human disease genes are the result of exonic or noncoding mutations affecting regulatory regions [33,46]. Much attention has been focused on modeling by different methods the possible phenotypic effect of SNPs that cause amino acid changes, and only recently has interest focused on functional SNPs affecting regulatory regions or the splicing process. Moreover, because of their widespread distribution on the species genome, SNPs become particularly important and valuable as genetic makers in the research for the diseases and corresponding drug [47]. Currently, millions of human SNPs have reported by high-throughput methods. The vast number of SNPs causes a challenge for biologists and bioinformaticians although they provide lot information about the relationships between individuals [47]. Besides numerous ongoing efforts to identify millions of these SNPs, there is now also a focus on studying associations between disease risk and these genetic variations using a molecular epidemiological approach. This plethora of SNPs points out a major difficulty faced by scientists in planning costly population-based genotyping, which is to choose target SNPs that are most likely to affect phenotypic functions and ultimately contribute to disease development [33,47,48].
Currently, most molecular studies are focusing on SNPs located in coding and regulatory regions, yet many of these studies have been unable to detect significant associations between SNPs and disease susceptibility. To develop a coherent approach for prioritizing SNP selection for genotyping in molecular studies, an evolutionary perspective to SNP screening is applied. The hypothesis is that, amino acids conserved across species are more likely to be functionally significant. Therefore, SNPs that change these amino acids might be more likely to be associated with disease susceptibility. It is becoming clear that application of the molecular evolutionary approach may be a powerful tool for prioritizing SNPs to be genotyped in future molecular epidemiological studies [33,48]. Therefore, our analysis will provide useful information in selecting SNPs of EPHA2 gene that is likely to have potential functional impact.

Conclusion:
In our analysis, we found out that nsSNP (rs2291806) as less stable, deleterious, probably damaging and have high risk score. The mutant protein structures of this nsSNP also showed very high energy and RMSD values compared to the native type structure. We therefore concluded that this nsSNP as potentially functional polymorphic. To those conducting largescale population-based epidemiologic studies, the idea of prioritizing nsSNPs in the investigation of association of SNPs with disease risk is of great interest. The use of these servers to select potentially polymorphic nsSNPs for epidemiology studies can be an efficient way to explore the role of genetic variation in disease risk and to curtail cost. Furthermore, predicted impact of these nsSNPs can be tested with the use of animal models or cell lines to determine if functionality of the protein has indeed been altered.