Biomedical Informatics 214 (also listed as Computer Science 274)
Representations and Algorithms for Computational Molecular Biology

Spring 2007

Project 3
Identification of Functional Sites in Protein Structures
Due electronically by 5:00pm, Monday, 5/14/07

Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone.

Contents:
  1. Introduction
  2. Program specification
  3. Data input
  4. Deliverables
  5. Submission instructions
  6. Tips and examples



Files to download:
All files linked from this page appear in these two locations:
  1. http://helix-web.stanford.edu/bmi214/project3/files/
  2. http://helix-web.stanford.edu/bmi214/project3/example_files/



Introduction

 

Proteins function by interacting with other molecules, such as small chemicals, ions, and proteins. The 3D position or region in the protein 3D structure where this interaction takes place is called the "functional site" of the protein. Given a protein which is known to interact with some molecule, we want to predict where the functional site is.

 

One commonly accepted hypothesis is that the functional site provides some favorable environment for the interaction to occur. Therefore, if we can learn from the known interactions what are the features of the interacting environment, then we can, given a protein with unknown functional site, check which position in the protein has those features, and that position will be a candidate functional site. We can also use statistical analysis to measure how likely it is to be a true functional site.

 

What are some examples of "features of the interacting environment"? This could be many things: examples are the presence of amino acids of a certain type, hydrophobicity, positive or negative charge, number of oxygen atoms, polarity, density of water molecules in the region, etc.

 

Program specification

 

In this project, you are asked to implement a method to predict the functional sites of calcium binding proteins according the rationale described above. (A "binding site" is a region of a protein where a particular molecule or atom (the "ligand") can stably associate with that protein.)


We have a set of proteins with known calcium binding sites. The features of the interacting environment we want to look at are the amino acids composition around the binding sites. After learning, your method should be able to identify positions in a protein  which are highly likely to be the functional sites - perhaps one position, perhaps a few. The method consists of two phases:

 

1. Learning phase

 

In order to describe the interacting environment, we divide the environment of a functional site into 5 concentric shells (a shell is the 3D equivalent of a ring) with the functional site as the center. Each shell is 1.5 angstroms thick. That is, for a given functional site, shell 0 (yes, starts from 0!) is the space between 0 to 1.5 angstroms away from the functional site, shell 1 is the space between 1.5 to 3 angstoms away from the functional site, and so on (see the picture on the right). You may assume the inner boundary of each shell is inclusive and the outer boundary is exclusive, though it doesn't really matter since atoms seldom fall exactly on the boundary anyway. 

 

Protein structures are most commonly described using PDB (Protein Data Bank) files. PDB files contain information about the protein, experimental methods used to obtain the structure, and 3D coordinates (x,y,z) for atoms in the structure. A functional site is also described as a point in 3D space using 3D coordinates.

  

As mentioned in the introduction, there are many properties that can be used as features. But in this project, we only use the presence of each amino acid (20 amino acids in total) in each shell. Therefore, there are 20*5 features. An amino acid is in a shell if its alpha carbon atom falls into that shell space. You can ignore other atoms. (In PDB files, the alpha carbon atom is named as "CA".)

 

We will learn what characterizes a functional site by comparing known functional sites (referred to as "sites") to non-functional sites (which we will refer to as "non-sites"). Terminology:

  • Sites= sites known to have a certain function, such as to bind calcium
  • Non-sites= sites we know are not functionally active, in the case of calcium binding this would be sites known NOT to bind calcium


In order to decide the likelihood that an unknown site is in fact a functional site, we need to first learn the probability of each feature in sites, and in non-sites.


The binding sites, non-sites and corresponding PDB files can be downloaded from the Data Input section. The probability of a feature (say, having amino acid LYS in shell 2) in sites can be estimated by the frequency of sites with this feature--the number of sites containing LYS in shell 2 divided by the total number of sites. A similiar calculation is done for the probability of a feature in non-sites. We also need to compute the probability of not having a feature in sites and non-sites, which is just 1 minus the probability of having the feature. (Hence, 4 probabilities can and will be calculated and used.)
 
However, when we estimate probabilities from frequencies, we usually need to add in a "fudge factor" by adding pseudo-counts, a process known as "smoothing". This will ensure that none of the probabilities are zero (in general, probabilities of zero are very dangerous in modeling). Smoothing models can get arbitrarily complex, but in this case we will do something very simple. When calculating the frequencies, add 0.1 to the numerator (which is the count of sites or non-sites with/without a residue in a certain position), and add 0.2 to the denominator (which is the total number of sites or non-sites). For example, if amino acid LYS appears in shell 2 of only two non-sites, the estimated probability (or "smoothed frequency") of LYS in shell 2 for non-sites will be (2 + 0.1) / (# non-sites + 0.2). The estimated probability of LYS not being in shell 2 for non-sites is then 1-[(2 + 0.1)/(# non-sites + 0.2)], or (# non-sites - 2 + 0.1)/(# non-sites + 0.2).

 

So, by the end of phase 1, you would have obtained 20*5*4 probabilities:

 

For 1<=i<=20, 1<=j<=5: 

  • Probability of having amino acid i in shell j in sites: P(AA i in shell j | site)
  • Probability of not having amino acid i in shell j in sites: P(AA i not in shell j | site)
  • Probability of having amino acid i in shell j in non-sites: P(AA i in shell j | non-site)
  • Probability of not having amino acid i in shell j in non-sites: P(AA i not in shell j | non-site)



2. Prediction phase

 

In this phase, you need to use the probabilities of features obtained above to predict whether a position (query position) is a calcium binding site. We combine all features into a scoring function using a Naive Bayes approach. Intuitively, if the probability of having a certain amino acid in a certain shell is high in sites, but low in non-sites, and if this amino acid is found in that shell around the query position, then the query position is highly likely to be a functional site.

 

Thus, the scoring function for a feature k (say, the feature is "amino acid i in shell j") is:

 

If amino acid i is in shell j: Sk = log(P(AA i in shell j | site)) - log(P(AA i in shell j | non-site))

Otherwise: Sk = log(P(AA i not in shell j | site)) - log(P(AA i not in shell j | non-site))

 
The final score for a query position is the sum score over all features (remember that we have 20*5 features in total): S = sum(Sk) for 1<=k<=100


What if we only know a protein interacts with some molecules, but have no knowledge about where the functional site might be? In this case, we need to come up with a set of candidate query positions. If we don't have any a priori knowledge, what we can do is build a 3D grid over the protein structure, and then check each grid point. You are asked to use this grid method to find the 100 top scoring positions in protein 1GGZ (pdb file: 1GGZ.pdb). The grid points should be spaced at 2 angstrom intervals and should span the whole protein: from (min(x)-6,min(y)-6,min(z)-6)    up to   (max(x)+6, max(y)+6, max(z)+6)

where the max and min are taken over the coordinates of all the atoms in the PDB file.
That is:
  1. Find the min and max values of x, y and z.
  2. Start from x=min(x)-6, y=min(y)-6, z=min(z)-6, increase each axis by 2, stop when x>max(x)+6, or y>max(y)+6, or z>max(z)+6. Be sure and cover the entire grid at all 2 angstrom increments in each direction.




We also ask that you perform the following, for grading purposes:

Use the scoring method described here to score all the sites and non-sites used in part 1 (the learning phase, where we calculated probabilities).

Note that this is NOT a step in the algorithm, or related in any way to prediction of unknown calcium-binding sites. This is just a step we ask you to do so that we can see your scores and grade them.


 

 


 

 

Data Input

  • CAsites.txt: A list of calcium binding sites (positive set)
  • CAnonsites.txt: A list of sites that do not bind calcium (negative set).
    Positive and negative site files are of the following format (tab delimited):
    PDBid    Xposition    Yposition    Zposition

  • SitePDBFiles.zip: and
  • NonsitePDBFiles.zip
    contain the necessary PDB files associated with each site

<< back to top


Deliverables

Note: All numbers (frequencies, coordinates, and scores) should be to 3 decimal places

1. (15%) 5 tab-delimited text files that give the probabilities (smoothed frequencies) of the sites and non-sites containing each type of residue for each shell (e.g. shell_0.txt, shell_1.txt, etc). Note: shell numbers should start from 0 to 4. Each row will represent one residue, and the columns will be sites and non-sites. Your file should look exactly like this one (shell_example.txt) (with different numbers, of course), and should be named "shell_0.txt", "shell_1.txt", etc. (If you were to plot this data in Excel, the resulting frequency histogram would look something like this:  (shell_example.gif).

2. (10%) Two files for sites and non-sites that are identical to CAsites.txt and CAnonsites.txt, but contain an additional column with the score for each of the given sites and non-sites (these files are to be named "CAsites_scores.txt" and "CAnonsites_scores.txt"). The files should be tab-delimited too.

3. (10%) One tab-delimited text file with the top 100 scoring positions for 1GGZ.pdb, called "1GGZ_100.txt". It should have (PDBfile, Xposition, Yposition, Zposition, score) for each position (example output).

4. (30%) Runnable commented source code. Don't forget all files should include your name.

5. (5%) README file called "README.txt" with your pseudocode and instructions for running your code

6. (30%) Answers to the Project 3 quiz ( download PDF here ). Please submit your answers only (no copy of the questions) in a text file named exactly "YourSunetId_quiz3.txt". For example, if you Sunet Id is peggyyao, your file name should be peggyyao_quiz3.txt. In the file, please write your Sunet Id and your full name.
 
Note: We will deduct 20% of full score if you do not follow any of the instructions! If you are not clear what we mean, please ask us.
Submission instructions

Log in to an elaine machine:

ssh sunetid@elaine.stanford.edu

Place all of your data files from 1-6 above into ONE directory, e.g.

~/biomedin214/p3/submit

Please only include those listed above. Do not tar them up. Delete executable binaries if you have them (e.g. if you used Java/C/C++); we will compile your source code.

cd into your submit folder:

cd ~/biomedin214/p3/submit

Run the class submission script:

/usr/class/biomedin214/bin/submit p3

Note the space before p3. Be sure to run the script from within your submission folder (where all your files to be submitted are)!

You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submissions. Your submissions date will be the final submission received, and late days will be charged accordingly. Don't forget the quiz!

<< back to top


Tips and examples

Example files
To help you test your code, we're providing an example set of files analogous to all of the input we're giving you and the output we expect from you for the actual project. Run your code with the example input files and check your results against the example outputs. For scores, we expect a bit of variability depending on how the math is implemented and how your programming language handles numbers and rounding, so don't be concerned if the scores you get are slightly off from the example output. As long as the variation is consistent and is within ~ 2 units, you're fine. What's important are the relative scores and frequencies within your output (e.g. for the structure scan, as long as your top 10 scoring points are the same and in the same order, the scores don't have to match exactly).

Example input files | Example output files | Site PDB files | Nonsite PDB files

Note: This example is adapted from another calcium model that was automatically generated from a 1-D EF-hand motif. EF-hands are very common secondary structures that are known to bind calcium in proteins. 1-D motifs, or short sequence patterns, are often used to describe known functional sites in proteins (ligand binding sites, enzyme active sites, etc) and DNA (transcription factor binding sites). Although 3-D motifs can provide better descriptions of many protein sites, it's sometimes hard to find examples of 3-D motifs for training. Work by Mike Liang in the Altman lab focused on automating the process of creating training sets and building models by using 1-D motifs (which are plentiful) as a starting point. Learn more by reading his paper below.

Liang MP, Brutlag DL, Altman RB. (2003) Automated construction of structural motifs for predicting functional sites on protein structures. Pacific Symposium on Biocomputing pp.204-215. (download PDF here)

Parsing PDB files
You should not be spending most of your time parsing files! PDB files can be painful to parse, but they also have a strict structure that you should exploit. Unlike many other files, PDB files are NOT tab or space delimited; rather, the different fields are specified by column numbers (i.e. indices). For example, the X coordinate of an atom in the ATOM section of the file will always be in columns 31-38. The first 6 columns denote the section of the file - for example, EXPDTA denotes the line(s) containing the experimental method used. Note that column indices start at 1 and not 0 as in most programming languages, and that not all 6 columns need be occupied by characters (6 is just the max width of that field). All of the information you will need should be in the ATOM section of the PDB files.

See the PDB's detailed description of the PDB file format, and section 9 especially.


For more information:
If you would like to understand more of the statistical reasoning that underlies the scores used by the Naive Bayes approach, see this writeup: (stat_explain.html)


<< back to top