Pre-docking filter for protein and ligand 3D structures.

Virtual drug screening using protein-ligand docking techniques is a time-consuming process, which requires high computational power for binding affinity calculation. There are millions of chemical compounds available for docking. Eliminating compounds that are unlikely to exhibit high binding affinity from the screening set should speed-up the virtual drug screening procedure. We performed docking of 6353 ligands against twenty-one protein X-ray crystal structures. The docked ligands were ranked according to their calculated binding affinities, from which the top five hundred and the bottom five hundred were selected. We found that the volume and number of rotatable bonds of the top five hundred docked ligands are similar to those found in the crystal structures and corresponded with the volume of the binding sites. In contrast, the bottom five hundred set contains ligands that are either too large to enter the binding site, or too small to bind with high specificity and affinity to the binding site. A pre-docking filter that takes into account shapes and volumes of the binding sites as well as ligand volumes and flexibilities can filter out low binding affinity ligands from the screening sets. Thus, the virtual drug screening procedure speed is increased.


Background:
Virtual screening techniques are becoming increasingly more important in drug discovery.A popular method for virtual screening is molecular docking [1, 2], which selects small-molecule structures from databases such as ChemBank [3], ChemPDB [4], KEGG [5], and NCI [6] and docks them into the protein binding site [7].These processes involve the prediction of binding energies and analysis of molecular binding modes, which are time consuming and computationally expensive.The twodimensional (2D) fingerprint technique, a virtual screening method which measures the structural similarity of molecules has been developed to address the above problems [8].The similarity search is based upon the "similar property principle", which states that molecules that are structurally similar are likely to have similar properties [9].This technique uses a ligand with known chemical properties, inhibitory activities, or binding modes for a target of interest as a reference for searching similar ligands in the database regardless of the shape and size of the protein binding site.The accuracy of this method depends on which similarity coefficient is used [10-12], and the Tanimoto coefficient is most popularly employed [13].Based on the "lock-and-key" principle, we propose a novel pre-docking procedure that matches the sizes of the ligand with the protein binding site, and optimizes the gridbox size before docking.This simple procedure dramatically reduces the size of screening ligand sets, significantly reducing time and effort required for virtual drug screening.

Preparation of ligand structures
The ligands used in this study were downloaded from Ligand.Info Meta-Database [14].Ligands set-1 consisted of 2344 structures from ChemBank and set-2 consisted of 4009 structures from ChemPDB.The downloaded ligands in the SDF format were first converted to the Protein Data Bank (PDB) format using Open Babel [15].The Gasteiger charges and rotatable bonds were then assigned to the PDB ligands using AutoDockTool [16].All rotatable bonds were allowed to move freely.

Preparation of protein structures
Twenty-one protein X-ray crystal structures from the Protein Data Bank [17] were downloaded.The proteins and their PDB structure identifiers (PDB ID) are given in Table 1 (supplementary material).Of the twenty-one protein structures, sixteen have co-crystallized ligands (Xray ligand) in the binding site.The ligand contained in each protein structure was removed from the binding site and saved to a new file.The missing atoms in each protein structure were searched for and fixed using SwissPDB [18].The Gasteiger charges and the solvation term were then added to the protein structure using the AutoDockTool.Ligand SMILES string similarity search The ligands extracted from the X-ray crystal structure obtained from the previous step were converted to the SMILES string format, and used as an input for similarity search against the ligands in the ChemBank and the ChemPDB sets in step of the Preparation of ligand structures using Tanimoto coefficient cutoffs of 0.5, 0.6, and 0.7, respectively.

Calculation of ligand molecular volume
The volumes of the ligands in the screening set were calculated using Mol_Volume version 1.0 [19].The van der Waals radii value for each atom type was derived from the CHARMM 22 force field.The radius of the spherical probe (R_PROBE) was set to 2.0 Å, and the GRID_STEP was set to 0.5 Å.The volumes of the ligands extracted from the X-ray crystal structures were calculated using the same protocol.The extracted ligand name (X-ray ligand) and its calculated volumes are shown in Table 2 (supplementary material).

Calculation of protein binding site molecular volume
Protein binding site volumes were calculated using the CASTp server (http://sts-fw.bioengr.uic.edu/castp)[20].The solvent probe radius used for volume calculation was 1.4 Å. CASTp identifies all surface pockets with the chosen volume values, and then displays them on the computer screen.Pockets calculated by CASTp that matched the pocket resolved by crystallography were selected, and the volume of that calculated pocket was taken as the volume of the protein binding site.

Protein-ligand docking Grid-box generation
The grid parameter file of each protein was generated using AutoDockTool.A grid-box was generated that was large enough to cover the entire protein binding site and accommodate all ligands to move freely.The number of grid points in x, y, and z-axes were 60×60×60.The distance between two connecting grid points was 0.375 Å.The center of the ligand in the X-ray crystal structure was used as the center of the grid-box.For protein structures that do not have ligands in the binding site, the center of the binding site was estimated from the structure and taken as the center of the grid-box.

Ligand docking
AutoDock4 and a Lamarckian Genetic Algorithm (LGA) [21] were used for protein-fixed ligand-flexible docking calculations.Ten search attempts (ga_run parameter) were performed for each ligand.The maximum number of energy evaluations before the termination of LGA run was 2500000 and the maximum number of generations of the LGA run before termination was 27000.Other docking parameters were set to the software's default values.After docking, the ligands were ranked according to their protein-ligand affinity (calculated inhibitory constant, Ki).

Discussion: Size and shape of protein binding site
In this study, we categorized protein binding sites according to their sizes and shapes.Protein binding sites were classified as small (less than 1200 Å 3 ) or large (greater than or equals to 1200 Å 3 ).Protein binding site shapes were classified as either simple or complex.The protein binding site classifications are shown in Table 1 (supplementary material).Sixteen protein structures had co-crystallized ligands bound in the binding site.The calculated volumes show that the majority of ligands are larger (305-5922 Å 3 ) than the binding sites (1040-2690 Å 3 ) in particular those in the small binding site group.However, the average volumes of the ligands (1684 Å 3 ) and the binding sites (1638 Å 3 ) are very similar.The protein binding site typically accommodates 50-70% of the ligand, with the remainder of the ligand occupying pockets adjacent to the binding site.For example, 50% of the GDP ligand (volume = 1460 Å 3 ) was contained within the small binding pocket (volume = 594 Å 3 ) of the "Filamenting temperature-sensitive mutant Z" protein (PDB ID: 1RQ7), while the rest of GDP ligand occupied pockets close to or floating over the binding site.This suggests that the optimal ligand size can potentially exceed the binding site volume.

X-ray ligand docking and ranking
To verify the docking procedure utilized in this work, we re-docked the original X-ray ligand back to its corresponding protein binding site.The X-ray ligands along with all other ligands in the screening set were ranked according to the calculated Ki.The X-ray ligands were ranked in the top ten percentiles and were also able to move back to the original positions with the root mean square deviations of less than 3 Å.

Docked ligand size and flexibility
The top 500 and the bottom 500 ligands ranked according to the Ki value for each protein were selected for further analysis.A scatter plot of the molecular volumes and the number of active torsion bonds for these ligands is shown in Figure 1.The top 500 ligands are clearly coincident with the sixteen X-ray ligands, which occupy volumes of 800-2800 Å 3 , whereas the bottom 500 ligands occupy volumes outside this range, with 95% much smaller (300-900 Å 3 ).There does not appear to be any correlation between the number of active torsion bonds and calculated Ki; however, the majority of the top 500 ligands have twenty or fewer active torsion bonds.These data suggest that ligands with high binding affinity are constrained by their size (volume 800-2800 Å 3 ) and flexibility (20 or fewer active torsion bonds).For untested ligands, these parameters could be useful to prioritize docking calculations, so that priority is given to ligands of optimal size and flexibility.

Optimal size of the grid-box
In this study, a very large grid-box (10830 Å 3 , 22.125 Å on each side) was used because we wanted to ensure that the grid-box could cover the entire binding site, and that all ligands in the screening sets had enough space to enter and move freely in the grid-box.The volume of the grid-box was 10830 Å 3 while the volume of the largest protein binding site was only 5921.8 Å 3 (PDB ID: 1N8W).We hypothesized that using a very large grid-box would allow the binding of some ligands to extend beyond the actual binding site, with non-specific binding into adjacent pockets.We tested this hypothesis by generating minimal Figure 1: The volume and the number of active torsion bonds of the top 500 (red) and the bottom 500 docked ligands (blue) ranked according to the calculated inhibitory constant (Ki).Of the twenty-one protein X-ray crystal structures used in this study, sixteen structures had ligand bound in the binding site.The top 500 ligands generally had structural profiles in terms of volume and number of active torsion bond similar to those of the X-ray ligands (green) while the bottom 500 ligands were, on average, 800-900 Å 3 smaller than the X-ray ligands.
Filtering of the top 500 docked ligands was performed to test how ligands occupy space beyond the protein binding site.Six thresholds of decreasing stringency, allowing progressively more of the ligand atoms to be outside the minimal grid-box were used (Table 4 in supplementary material).It is clear that on average, the top 500 ligands cannot fit entirely within the minimal grid-boxes extending outside them, since 10.8% of the ligands were rejected, even when a very relaxed 30% threshold was employed (Table 4 in supplementary material).Visual inspection of the docked structures revealed that the protein binding sites contain at least one opening space, and parts of the docked ligands were always outside of the minimal box on this side.The rejected ligands might be either too large or too long to fit entirely within the minimal box, or their chemical properties may not match perfectly well with the binding pocket so that parts of their structures bind preferentially with adjacent pockets.
Reducing the grid-box size would significantly reduce CPU time for docking calculation, an important consideration for drug-discovery when potentially millions of compounds are screened.However, it is clear from our data that this would also increase the false negative rate, leading to some high binding-affinity ligands to be missed.These false-negatives would likely include molecules with long linear shapes, or with branches which extend beyond the target binding site and bind to adjacent pockets, in particular on the opening space side (see above).We propose that the optimal grid-box size allows approximately two-thirds of ligand molecule to occupy the target binding site, with the remaining one third able to bind with adjacent pockets.Grid-boxes of this size provide the optimal balance between the number of screening ligands and the CPU time required for docking.

SMILES strings similarity of the docked ligands and the X-ray ligands
We further explored whether the top 500 and the bottom 500 ligands docked on each protein were chemically similar to the X-ray ligand extracted from the proteinligand co-crystal structures using the SMILES strings similarity search.The results show that, in general, more of the top 500 ligands matched with the X-ray ligands than of the bottom 500 ligands (Table 5 in supplementary material).On the other hand, even at a Tanimoto coefficient of 0.5, only thirteen ligands in the top500 list matched with the X-ray ligand.This indicates that most of the potential hits were chemically dissimilar to the X-ray ligands, yet similar in size as discussed above.Conversely, at a Tanimoto coefficient of 0.5, six out of the bottom 500 ligands matched with the X-ray ligand, suggesting that these ligands although similar to the X-ray ligand have chemical properties that are unfavorable to interactions with the binding site.1: Shape and size of the protein binding sites is shown.The binding sites were divided into groups according to their shapes (simple and complex) and sizes (small; volume <1200 Å 3 and large; volume ≥1200 Å 3 ).Table 5: SMILES string similarity search of X-ray ligands in sixteen protein-ligand co-crystal X-ray structures against the top 500 and the bottom 500 docked ligands using Tanimoto coefficients (TC) of 0.5, 0.6, and 0.7, respectively.

Table 2 :
Details of protein and ligand X-ray crystal structures used in this study.

Table 3 :
The coordinates of the grid-box center and the number of grid points on the x, y, and z axes in grid-boxes of minimum size is given.The distance between grid points was 0.375 Å.

Table 4 :
Percent of ligands rejected from the top500 ligand list docked with twenty-one protein binding sites and minimal grid-boxes (see Table3).Six thresholds for rejection were tested, in which the percentage of ligand atoms outside of the grid-box was varied.