In this project you will explore the link between 3D
structure and function by analyzing the environment around ion binding sites.
For clarification, we define a site as a 3D location in a protein
structure.
CAsites: A list of calcium binding
sites (positive set).
CAnonsites: A list of sites that do not
bind calcium (negative set).
All data files are of the following form (tab delimited):
PDBid Xposition Yposition Zposition
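As a minimal sketch of reading such a file, the snippet below parses tab-delimited records into tuples. The function name `parse_site_lines` and the sample records are hypothetical, and it assumes the files have no header row.

```python
from typing import Iterable, List, Tuple

Site = Tuple[str, float, float, float]

def parse_site_lines(lines: Iterable[str]) -> List[Site]:
    """Parse tab-delimited 'PDBid X Y Z' records, skipping blank lines."""
    sites = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pdb_id, x, y, z = line.split("\t")[:4]
        sites.append((pdb_id, float(x), float(y), float(z)))
    return sites

# Made-up records for illustration only (not taken from CAsites).
example = ["1ABC\t10.5\t-3.25\t7.0\n", "\n", "2XYZ\t0.0\t1.0\t2.0\n"]
sites = parse_site_lines(example)
print(sites)
```

With a real file you would pass an open file handle instead of the example list, for instance `parse_site_lines(open(path))`.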
Download the necessary PDB files from SitePDBFiles.zip and NonsitePDBFiles.zip.
There will be two parts to this assignment and you may therefore wish to split your code accordingly.
For the first part, you will construct a simple model of calcium binding sites. A "binding site" is a region of a protein where a particular molecule or atom (the "ligand") can stably associate with that protein. Our model will consist of the probabilities of finding certain amino acid residues at various distances from a binding site. This is a reasonable model because the identity of the amino acids near a putative binding site and the precise distance that they are from that site play a large role in determining the local chemical/physical milieu near the site. This chemical environment, in turn, determines the propensity for the ligand to spend time in that region of the protein.
For each site or nonsite in the provided lists, examine the amino acid makeup in concentric "shells" (the 3D equivalent of a ring) around that site or nonsite. We will use a model with five shells, each 1.5 angstroms thick. That is, for a given location, we will examine five different regions of space: the region between 0 and 1.5 angstroms away from the location (shell 0), the region between 1.5 and 3 angstroms away from the location (shell 1), and so on. (Imagine spherical Russian nesting dolls...) Pedants among you may assume that the inner boundary of each shell is inclusive and the outer boundary is exclusive. It doesn't really matter since nothing will fall exactly on the boundary anyway.
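The shell assignment described above can be sketched as follows. The helper name `shell_index` is an assumption; it uses the inclusive-inner, exclusive-outer convention from the text and returns `None` for anything at or beyond 7.5 angstroms.

```python
import math

SHELL_WIDTH = 1.5  # angstroms, from the assignment text
NUM_SHELLS = 5

def shell_index(site, atom):
    """Return the shell number (0..4) of an atom around a site, or None if outside all shells."""
    distance = math.dist(site, atom)  # Euclidean distance (Python 3.8+)
    index = int(distance // SHELL_WIDTH)
    return index if index < NUM_SHELLS else None

print(shell_index((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))  # distance 1.0
print(shell_index((0.0, 0.0, 0.0), (3.2, 0.0, 0.0)))  # distance 3.2
print(shell_index((0.0, 0.0, 0.0), (9.0, 0.0, 0.0)))  # beyond the last shell
```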
Our model will then consist of a table. For each shell (0 through 4) and each amino acid, record the number of sites and (separately) the number of non-sites that have one or more of that amino acid in that shell. (For those keeping track at home, this is a 3-dimensional table of dimension 5x20x2, containing integer counts.) When you're done, the table will be able to answer questions like "How many sites had alanine residues between 3 and 4.5 angstroms (shell 2) from the center of the site?" or "How many non-sites had glycine residues less than 1.5 angstroms away (shell 0) from the center?"
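One possible way to hold these counts is a dictionary keyed by (shell, residue), with one such dictionary for sites and one for non-sites. This is only a sketch; `new_table` and `add_location` are hypothetical names. The important detail is that a location contributes at most 1 to each cell, no matter how many residues of that type fall in the shell.

```python
AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
               "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
NUM_SHELLS = 5

def new_table():
    """Create a 5 x 20 table of zero counts keyed by (shell, residue)."""
    return {(shell, aa): 0 for shell in range(NUM_SHELLS) for aa in AMINO_ACIDS}

def add_location(table, residues_in_shells):
    """Record one location, given the (shell, residue) pairs observed around it.

    Duplicates are collapsed with set(), so each feature counts once per location.
    Pairs that are not in the table (e.g. non-standard residues) are ignored.
    """
    for key in set(residues_in_shells):
        if key in table:
            table[key] += 1

site_counts = new_table()
add_location(site_counts, [(2, "ALA"), (2, "ALA"), (3, "GLY")])  # two ALA in shell 2
add_location(site_counts, [(2, "ALA")])
print(site_counts[(2, "ALA")], site_counts[(3, "GLY")])
```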
For this assignment, an amino acid is "in" a particular shell if its alpha carbon atom falls into that region of space. You can ignore the other atoms.
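A rough sketch of pulling alpha carbons out of a PDB file is shown below. It relies on the fixed-column layout of ATOM records (atom name in columns 13-16, residue name in 18-20, coordinates in 31-54). The function name `alpha_carbons` is an assumption, and the sketch ignores alternate locations and multiple models. Restricting to ATOM records matters here because calcium ions appear as HETATM records whose atom name is also "CA".

```python
def alpha_carbons(pdb_lines):
    """Yield (residue_name, (x, y, z)) for each alpha carbon in ATOM records."""
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[17:20], xyz

# Synthetic records built with the PDB column widths (values are made up).
TEMPLATE = "%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f"
records = [
    TEMPLATE % ("ATOM", 1, " N", "ALA", "A", 1, 10.000, 12.500, 1.000),
    TEMPLATE % ("ATOM", 2, " CA", "ALA", "A", 1, 11.104, 13.207, 2.100),
    TEMPLATE % ("HETATM", 3, "CA", " CA", "A", 2, 5.000, 5.000, 5.000),  # calcium ion
]
ca_atoms = list(alpha_carbons(records))
print(ca_atoms)
```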
Then for each shell, calculate the fraction of the sites and non-sites that contain each type of residue, and the fraction of sites and non-sites that do not contain each type of residue. This fraction (or "frequency") gives us an estimate of the probability that a given amino acid will appear at a given distance from a site or a non-site. However, when we estimate probabilities from frequencies we usually need to perform smoothing. Very simply, smoothing means adding pseudocounts ("fudge factors" if you will) to your data to ensure that none of the probabilities are zero. (In general, probabilities of zero are very dangerous in modeling. You can see why mathematically in the formula for S below.) Smoothing models can get arbitrarily complex, but in this case we are going to be doing something very simple. When calculating the frequencies, add 0.1 to each count of sites (non-sites) with or without a residue in a certain position, and then instead of dividing by the total number of sites/non-sites, divide by the total + 0.2. (We add 0.2 to the denominator to account for the fact that we've increased the total count by 0.2 by adding pseudocounts of 0.1 to both the count of locations with that characteristic and the count of those without.)
For example, if residue LYS appears in shell 2 of only two non-binding sites, the estimated probability (or "smoothed frequency") of LYS in shell 2 for non-binding sites will be (2 + 0.1) / (number-of-non-binding-sites + 0.2). The estimated probability of LYS NOT being in shell 2 for non-binding sites is then ( (number-of-non-binding-sites - 2) + 0.1 ) / (number-of-non-binding-sites + 0.2). Note that these two events are mutually exclusive and jointly exhaustive, so their probabilities must sum to 1. The enterprising among you will either use this fact as a check on their math, or as a way to simplify their code.
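The smoothing arithmetic above can be sketched in a few lines. The function name `smoothed_probs` and the total of 50 non-sites are made up for illustration; the real total comes from your data.

```python
PSEUDOCOUNT = 0.1  # added to both the "with" and "without" counts

def smoothed_probs(count_with, total):
    """Return (P(residue present), P(residue absent)) with pseudocount smoothing."""
    denominator = total + 2 * PSEUDOCOUNT
    p_present = (count_with + PSEUDOCOUNT) / denominator
    p_absent = ((total - count_with) + PSEUDOCOUNT) / denominator
    return p_present, p_absent

# LYS seen in shell 2 of 2 non-sites, out of a hypothetical 50 non-sites.
p_present, p_absent = smoothed_probs(2, 50)
print(p_present, p_absent, p_present + p_absent)
```

Because the two values sum to 1, computing `p_absent` as `1 - p_present` would work equally well.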
In the second part, you will make use of the frequencies just collected to identify potential CA-binding sites. We treat the presence of a particular residue type in a particular shell as a feature that may help in identifying a site. We combine all the features into a scoring function using a Naïve Bayes approach. Specifically, given a location, we iterate over all the features and compute subscores using:
![]()
The score is the sum of the subscores over all the features, of which there should be 5*20 = 100:

Locations with higher scores stand a higher chance of being Ca-binding sites.
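The sketch below shows one way such a score could be computed. It assumes the usual Naive Bayes log-odds form, in which each feature's subscore is log(P(observation | site) / P(observation | non-site)), using the "present" probabilities when the residue is found in the shell and the "absent" probabilities otherwise. That form is an assumption on our part, so check it against the formulas given for this assignment. The name `score_location` and the toy two-feature model are hypothetical.

```python
import math

def score_location(present, p_site, p_nonsite):
    """Sum log-odds subscores over every feature in the model.

    present:   set of (shell, residue) features observed at the location
    p_site:    dict mapping feature -> smoothed P(feature present | site)
    p_nonsite: dict mapping feature -> smoothed P(feature present | non-site)
    """
    total = 0.0
    for feature, ps in p_site.items():
        pn = p_nonsite[feature]
        if feature in present:
            total += math.log(ps / pn)
        else:
            total += math.log((1.0 - ps) / (1.0 - pn))
    return total

# Toy model with two features instead of the full 100 (numbers are made up).
p_site = {(1, "ASP"): 0.6, (2, "GLY"): 0.3}
p_nonsite = {(1, "ASP"): 0.1, (2, "GLY"): 0.3}
with_asp = score_location({(1, "ASP")}, p_site, p_nonsite)
without_asp = score_location(set(), p_site, p_nonsite)
print(with_asp, without_asp)
```

A probability of exactly zero would make either the logarithm or the division undefined, which is why the smoothed frequencies are used here.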
To understand a bit of the statistical reasoning that underlies this, see our brief writeup.
Use this scoring method to score all the sites and non-sites given above (the ones used in the training). Also, write code to scan a PDB file along a 3D grid and output the 100 top-scoring positions along the grid. The grid points should be spaced at 2-angstrom intervals and should span the whole protein. The easiest way to do this is to scan from (min(x)-6, min(y)-6, min(z)-6) up to (max(x)+6, max(y)+6, max(z)+6), where the max and the min are taken over the coordinates of all the residues in the PDB file. Use the code to scan the following PDB file: 1GGZ.pdb, and report the most likely locations of potential CA-binding sites in the protein. Note that our method is likely to produce several high-scoring locations around the true site.
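A minimal sketch of the grid scan follows. The names `axis_points` and `top_grid_points` are hypothetical, `score_fn` stands in for whatever scoring function you wrote in the previous step, and the toy coordinates and scoring function exist only to make the example runnable.

```python
import heapq
import math
from itertools import product

def axis_points(lo, hi, step=2.0, pad=6.0):
    """Grid coordinates along one axis, from lo - pad up to at most hi + pad."""
    start, stop = lo - pad, hi + pad
    count = int((stop - start) // step) + 1
    return [start + i * step for i in range(count)]

def top_grid_points(coords, score_fn, k=100, step=2.0, pad=6.0):
    """Score every grid point spanning coords and return the k best as (score, point)."""
    xs, ys, zs = zip(*coords)
    grid = product(axis_points(min(xs), max(xs), step, pad),
                   axis_points(min(ys), max(ys), step, pad),
                   axis_points(min(zs), max(zs), step, pad))
    return heapq.nlargest(k, ((score_fn(point), point) for point in grid))

# Toy example: two "residue" coordinates and a score that peaks at (4, 2, 0).
coords = [(0.0, 0.0, 0.0), (4.0, 2.0, 0.0)]
toy_score = lambda point: -math.dist(point, (4.0, 2.0, 0.0))
best = top_grid_points(coords, toy_score, k=3)
print(best[0])
```

Using `heapq.nlargest` avoids storing and sorting the score of every grid point, which can help on larger proteins.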
1. Create 5 tab-delimited text files that give the fraction of the sites and non-sites containing each type of residue for each shell. Each row will represent one residue, and the columns will be sites and non-sites. Your files should look exactly like this one (with different numbers, of course), and should be named "shell1.txt", "shell2.txt", etc. Use ASCII text and tabs to delimit columns, and give numbers to 4 decimal places. (If you were to plot this data in Excel, it would look something like this: ExampleOutput.gif.)
2. Produce two files for sites and non-sites that are identical to CAsites and CAnonsites, but contain an additional column that specifies the score for each of the given sites and non-sites.
Tar up the following files and submit them to CWP under "Programming Code
Submission":
Take the quiz at CWP under the "Assignments and Projects" directory.