Modelling and Characterization of Glial Fibrillary Acidic Protein

Glial Fibrillary Acidic Protein (GFAP) is an intermediate-filament (IF) protein that maintains the astrocytes of the Central Nervous System in Human. This is differentially expressed during serological studies in inflamed condition such as Rheumatoid Arthritis (RA). Therefore, it is of interest to glean molecular insight using a model of GFAP (49.88 kDa) due to its crystallographic nonavailability. The present study has been taken into consideration to construct computational protein model using Modeller 9.11. The structural relevance of the protein was verified using Gromacs 4.5 followed by validation through PROCHECK, Verify 3D, WHAT-IF, ERRAT and PROVE for reliability. The constructed three dimensional (3D) model of GFAP protein had been scrutinized to reveal the associated functions by identifying ligand binding sites and active sites. Molecular level interaction study revealed five possible surface cavities as active sites. The model finds application in further computational analysis towards drug discovery in order to minimize the effect of inflammation.


Background:
Glial Fibrillary acidic protein (GFAP) is an intermediate filament protein having molecular weight 49.88 kDa and is found to be present in the glial cells of the Central Nervous System (CNS) [1]. It was first isolated from human multiple sclerosis plaques in 1971 [2]. This protein is used as a classical marker for astrocytes in the vertebral central nervous system [3] and it plays role in the de-differentiation of the axon and maintaining the strength of the astrocytes [4]. In our previous study we have identified this protein to be significantly up regulated during the chronic inflammation of Rheumatoid Arthritis (RA) [5]. RA is an autoimmune inflammatory disease that affects the joints in systemic manner. In order to understand the etiology of this disease, the understanding of the insights of protein structure may play an important role. But lack of crystallographic or NMR deduced model structure of GFAP restricted the complete understanding of protein role towards RA. To better understand the function of a protein, 3 dimensional (3D) structure of protein is needed. In case of unavailability of experimentally determined protein structure, we made an attempt to build a model of the protein of interest using in-silico methods [6,7]. From an earlier study of the protein we came to an assured point of the rod domain of GFAP being conserved [8]. In the present study, we generated a model based on comparative protein modeling. The model generated has been subjected for its Molecular Dynamic (MD) and protein quality analysis using Gromacs 4.5. This robust check helped us in validating the schematically prepared protein model.

Methodology
The whole experiment had been carried out in Windows7 and Ubuntu 10.1 Operating systems. Comparative modeling has been carried out by Modeller 9.11 package [9] in Windows while molecular dynamic simulation was done in Gromacs 4.5 [10] in the Ubuntu platform. The models were visualized in Pymol, the active sites were predicted using Molegro Virtual Docker [11].

Sequence Retrieval
The sequence of Glial Fibrillary Acidic Protein (GFAP), accession no AAB22581.1 GI: 251802; of Homo sapiens was retrieved in FASTA format from NCBI protein database.

Template Identification
The program BLAST-P [12] has been used to detect similar crystallographic protein structures of GFAP. The parameters set for the search include all default and maintained the search for highly similar sequences/structure. The BLOSUM 62 [13] matrix was used for scoring the similarity. The better identical sequences having low gap percentage and high identity as well as higher number of positives, were chosen for templates. The template structures were then downloaded from the RCSB-PDB. The identification of the templates was based on the results of the query coverage, since there were no homologous proteins available in the PDB database. So the templates were selected on the basis of following criteria: 1) The proteins that share common ancestor with GFAP; 2) Short query coverage must have high identities; 3) Low identities must have high positives and large query coverage. The sequences of the chosen templates including the query were then subjected for multiple sequence alignment and 2D secondary structure alignment for analysis of secondary structural variation among the proteins.

Modeling of GFAP and quality analysis studies
Modeling was performed using Modeller 9.11. This program models protein tertiary structure by satisfaction of spatial restraint using standard parameters sets. The generated threedimensional model includes all non-hydrogen main-chain and side-chain atoms. Generated model has been refined using energy minimization techniques to optimize stereochemistry and to remove bumps and steric clashes among non-bonded interactions using the commands of Modeller 9.11. Scripts generated five default models of the protein with different DOPE (Discrete Optimized Protein Energy) score. The model with lowest DOPE score was selected due to less steric clashes. This model is further refined by energy minimization by steepest descent method by applying AMBER03 Force Field [14] and solvating the protein in water. The force of 1000 kj/mol was applied for the purpose. The process of energy minimization was carried out using GROMACS 4.5 [15] for 5 nano second for observing the stability, computing the velocity temperature coupling and relaxing the system by pressure temperature coupling. The protein behavior was measured by evaluating Root mean square deviation (RMSD), Root mean square fluctuation (RMSF), and Radius of gyration (Rg). RMSD is the measure of query structure deviation and fluctuation from the crystal structure respectively [16]. The radius of gyration of a protein is a measure of its compactness. If a protein is stably folded, it is likely to maintain a relatively steady value of Rg. If a protein unfolds, its Rg will change over time. Final structure was validated by Ramachandran plot using Rampage. It is an online tool which validates the dihedral angles present among the amino acids based on the Ramachandran plot [17].

Figure 2: a)
Plot representing the NVT ensemble (Temperature (k) vs Time (ps)) Image is showing that system attain 310K temperature during the initial run and remain stable during equilibration process; b) Plots depicts the NPT graph (pressure (bar) vs time (ps)) Figure is indicating that the pressure is fluctuating widely during the equilibration phase; c) Plot (rmsd (nm) vs time (ns)). The region at 0.8 nm indicating that system tends to be equilibrated in dynamic behaviour and can be used to analyze the molecular features; d) Plot showing the radius of Gyration (-Rg (nm) vs Time (ns)). Plot is representing the stability of protein over the course of given time   Figure is showing marked differences after overlapping the structure before and after the energy minimization. Structure was more stable after the energy minimization.

Results & discussion:
In the present communication we have identified the 3D structure of GFAP that can be used to reveal its role in pathogenesis of RA in near future. Due to the absence of crystallographic structure of this protein, it is necessary to predict the 3D structure in order to understand the function of a particular protein. In case of GFAP, there are some already predicted structures in various servers and databases but due to lack of complete sequence coverage we have to predict the complete structure of this protein. Similarity search for the template identification have been shown in Table 1 (see supplementary material).
The search resulted that the GFAP protein sequence is similar to vimentin, lamin, keratin 5 and keratin 14 of human and also similar to the Staphylococcus aueris, Francisella tularensis and Bacillus subtilis. Phylogenetic tree was constructed using clustalW for global alignment followed by the proml of phylip which reveals that the vimentin and GFAP were probably sharing common ancestors. This envisages that these two proteins probably arose from genes that are paralogous. Although 3S4R, 3SSU, 3G1E, and 1GK7 share the common ancestors with GFAP, but the length aligned was very poor i.e. 90, 90, 36 and 37 amino acids respectively out of 432 amino acids. But it covers query length with maximum identity percentage of 59%, 60%, 70% and 71% respectively. Therefore, the query coverage of 3S4R, 3SSU, 3G1E, and 1GK7 when aligned to GFAP sequence is 21%, 21%, 8% and 8% respectively which is not sufficient for the purpose of modeling. Hence, 17 suitable templates for the present study have been selected based on the previously discussed criteria (Figure 1).

Model generation
A multiple sequence alignment (MSA) generated using a python script 'salign.py'. The MSA was converted into a sequence profile that lists the likelihood of the 20 standard amino acid residue types at every position in a given MSA. Alignments based on sequence profiles rather than single sequences have been shown to be significantly more accurate [18]. The 'salign' command of salign.py generated a matrix of pairwise alignment, as shown in the Table 2 (see supplementary material), where the raw quality score of the multiple alignment was found to be 17.9. Quality Score (QS) is the average number of structurally equivalent residue pairs. It had RMS cut-off of 1.000. Two residues are considered to be equivalent when they have closer than RMS CUTOFF. There were 136 number of unique protein pairs. The multiple alignment file has been written in protein information resource (PIR) format.
Modeller generated 5 models on the basis of DOPE score that has been considered as the best scoring function [19]. The structure showing highest stability with least score (-27480.207031) was selected for further analysis. The model generated may have steric hindrance or might possess sidechain bumps that affect the backbone folding. Therefore, stereochemistry was satisfied with energy minimization allowing repositioning of the amino acids in the 3D space solvating the model in water. The solvation was virtually executed in GROMACS by generating a cubic box that places the model at the center of the box (c) and at least 1.0 nm from the box edge (-d 1.0). Solute box specified with distance 1.0 nm means that there are at least 2.0 nm between periodic images of a protein. The whole system was found to possess a net charge of -13e. Since life does not exist at a net charge, we had to add ions to our system. Thus 13 positive Na + ions were added to neutralize the system.
The energy minimization was carried out using steepest descent method of Gromacs 4.5 package, applying Amber 03 Force Field [15] and the force applied was 1000kj/mol. This process is converged at 2593 steps and the energy was minimized to -3.2263365e+06. The DOPE score of this new model was found to be -32518.039063 indicating the more stable model compared to earlier model. The model was superimposed to determine the structural differences using standard method of computing which apparently resulted to be 0.817 that infer minor changes in positioning of the atoms because of the energy minimization. An attempt was also made to study the stability of the model in the biological environment. The whole process was studied in three steps:

NVT (constant Number of particles, Volume, and Temperature)
The system was equilibrated in NVT ensemble (constant Number of particles, Volume, and Temperature) where temperature 310 K for 100-ps. The temperature coupling algorithm used for NVT simulation was Berendsen-thermostat. Temperature progression analysis can be depicted in graph. From the plot (Figure 2a), it is clear that the temperature of the system quickly reaches the target value (310 K), and remains stable during equilibration process.

NPT (constant Number of particles, Volume, and Pressure)
The resulted final model after NVT simulation was then subjected for NPT simulation. Equilibration of pressure is conducted under an NPT ensemble, wherein the Number of particles, Pressure, and Temperature are all constant. The pressure applied for the study was 1-bar. The pressure progression computed by Gromacs was stated as a graph plot (Figure 2b). When this plot was analyzed we found that the pressure value fluctuated widely over the course of 100-ps equilibration phase. Thus, by measuring the average of the running data we kept the pressure value as 1.05.

Production Molecular Dynamics
It is used as a post-processing tool to strip out coordinates, correct for periodicity, or manually alter the trajectory (time units, frame frequency, etc) for both the least squares fit and for RMSD calculation. The output plot shows that the RMSD relative to the structure present in the minimized form and system has been equilibrated. Further, we calculated RMSD relative to the crystal structure. The report generated in both the plots (Figure 2c) revealed that RMSD tends to levels off to ~0.8 nm (1 Å). This indicates that the structure tends to stabilize itself in due course of time. Subtle differences between the plots indicate that the structure at t = 0 nano second (ns) is slightly different from this crystal structure. This is to be expected, since it has been energy-minimized, and because the position restraints are not completely perfect. The analysis of the Rg for GFAP in our simulation: we can see from the reasonably invariant R g values that the protein remains very stable, in its compact (folded) form over the course of 5 ns at 310 K in 1bar pressure (Figure 2d). This result was expected, that proves the stability of the model.
The generated models consists of 17 (seventeen) α-Helices (H), 9(nine) β-Sheets (B), 23 Loops (L) and 2(two) very short Turns (T). The N-terminal of the protein consists of a loop of 19 amino acids and the C-terminal is a long tail of alternate helices and loops (Figure 3). The overall shape of the protein seems to be '∞' shape. Here the helix number 7 (seven) is supposed to be stabilizing the two parts of the protein. The movement of the protein is supposed to be controlled by seventh, eighth, ninth and tenth helices. The comparative visualization of the amino acids act on each and every position of superimposed structures was energy minimized (Figure 4) and subtle variations are recorded during the production MD for 5ns. These variation show that residues in alpha helices or beta sheets have low stability to remain in the earlier stretch form. However, Rampage gave a favorable result upon submission of both the protein structures. Rampage results of secondary protein models before and after molecular dynamics. The results obtained from the initial model were 386 residues in favoured region; 20 residues in allowed region and 24 residues in outlier regions. In contrast, the final model consist 340 residues in favored area; 66 residues in allowed region and 23 residues at outlier region ( Figure 5). It signifies that after the first round molecular dynamic study of the model for 5ns it tend to form a stable state with less steric hindrance compared to the previous model. Thus, upon an increment in the duration, the model might be able to show a permanent stable confirmation. The probable ligand binding pockets were predicted using Molegro Virtual Docker (MVD). There are 5 pockets on the surface of the protein (Figure 6). Residues involved in the surface pockets have been listed in Table 3  28.1 Sub_ID = Subject ID, %idty = Percentage of Identity, % +ve = percentage of Positives, aln_len = alignment length, q. start= position of the starting residue of the query, q. end= position of the ending residue of the query