|
|
Biomedical Informatics 214 (also listed as Computer Science 274) Representations and Algorithms for Computational Molecular Biology Spring 2007 |
|
Project 3
Identification of Functional Sites in Protein Structures
Due electronically by 5:00pm, Monday, 5/14/07
Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone. Contents:
Files to download: All files linked from this page appear in these two locations:
Introduction
Proteins function by interacting with other molecules, such as small chemicals, ions, and proteins. The 3D position or region in the protein 3D structure where this interaction takes place is called the "functional site" of the protein. Given a protein which is known to interact with some molecule, we want to predict where the functional site is.
One commonly accepted hypothesis is that the functional site provides some favorable environment for the interaction to occur. Therefore, if we can learn from the known interactions what are the features of the interacting environment, then we can, given a protein with unknown functional site, check which position in the protein has those features, and that position will be a candidate functional site. We can also use statistical analysis to measure how likely it is to be a true functional site.
What are some examples of "features of the interacting environment"? This could be many things: examples are the presence of amino acids of a certain type, hydrophobicity, positive or negative charge, number of oxygen atoms, polarity, density of water molecules in the region, etc.
In this project, you are asked to implement a method to predict the
functional sites of calcium binding proteins according the rationale
described above. (A "binding site" is a region of a protein where a
particular molecule or atom (the "ligand") can stably associate with
that protein.)
We have a set of proteins with known calcium binding sites. The features of the interacting environment we want to look at are the amino acids composition around the binding sites. After learning, your method should be able to identify positions in a protein which are highly likely to be the functional sites - perhaps one position, perhaps a few. The method consists of two phases:
1. Learning phase
In order to describe the interacting environment, we divide the environment of a functional site into 5 concentric shells (a shell is the 3D equivalent of a ring) with the functional site as the center. Each shell is 1.5 angstroms thick. That is, for a given functional site, shell 0 (yes, starts from 0!) is the space between 0 to 1.5 angstroms away from the functional site, shell 1 is the space between 1.5 to 3 angstoms away from the functional site, and so on (see the picture on the right). You may assume the inner boundary of each shell is inclusive and the outer boundary is exclusive, though it doesn't really matter since atoms seldom fall exactly on the boundary anyway.
Protein structures are most commonly described using PDB (Protein Data Bank) files. PDB files contain information about the protein, experimental methods used to obtain the structure, and 3D coordinates (x,y,z) for atoms in the structure. A functional site is also described as a point in 3D space using 3D coordinates.
As mentioned in the introduction, there are many properties that can be used as features. But in this project, we only use the presence of each amino acid (20 amino acids in total) in each shell. Therefore, there are 20*5 features. An amino acid is in a shell if its alpha carbon atom falls into that shell space. You can ignore other atoms. (In PDB files, the alpha carbon atom is named as "CA".)
We will learn what characterizes a functional site by comparing known
functional sites (referred to as "sites") to non-functional sites (which
we will refer to as "non-sites"). Terminology:
In order to decide the likelihood that an unknown site is in fact a functional site, we need to first learn the probability of each feature in sites, and in non-sites.
The binding sites, non-sites and corresponding PDB files can
be downloaded from the
Data
Input section. The probability of a feature (say, having amino
acid LYS in shell 2) in sites can be estimated by the frequency of
sites with this feature--the number of sites containing LYS in shell 2
divided by the total number of sites. A similiar calculation is done for
the probability of a feature in non-sites. We also need to compute the
probability of not having a feature in sites and non-sites,
which is just 1 minus the probability of having the feature. (Hence, 4
probabilities can and will be calculated and used.)
So, by the end of phase 1, you would have obtained 20*5*4 probabilities:
For 1<=i<=20, 1<=j<=5:
2. Prediction phase
In this phase, you need to use the probabilities of features obtained above to predict whether a position (query position) is a calcium binding site. We combine all features into a scoring function using a Naive Bayes approach. Intuitively, if the probability of having a certain amino acid in a certain shell is high in sites, but low in non-sites, and if this amino acid is found in that shell around the query position, then the query position is highly likely to be a functional site.
Thus, the scoring function for a feature k (say, the feature is "amino acid i in shell j") is:
If amino acid i is in shell j: Sk = log(P(AA i in shell j | site)) - log(P(AA i in shell j | non-site)) Otherwise: Sk = log(P(AA i not in shell j | site)) - log(P(AA i not in shell j | non-site))
What if we only know a protein interacts with some molecules, but have
no knowledge about where the functional site might be? In this case, we
need to come up with a set of candidate query positions. If we don't
have any a priori knowledge, what we can do is build a 3D grid over the
protein structure, and then check each grid point. You are asked to
use this grid method to find the 100 top scoring positions in protein
1GGZ (pdb file: 1GGZ.pdb). The grid points should be spaced at 2
angstrom intervals and should span the whole protein: from
(min(x)-6,min(y)-6,min(z)-6) up to (max(x)+6,
max(y)+6, max(z)+6) That is:
We also ask that you perform the following, for grading purposes: Use the scoring method described here to score all the sites and non-sites used in part 1 (the learning phase, where we calculated probabilities). Note that this is NOT a step in the algorithm, or related in any way to prediction of unknown calcium-binding sites. This is just a step we ask you to do so that we can see your scores and grade them.
Data Input
<< back to top
Deliverables
Note: All numbers (frequencies, coordinates, and scores) should be to 3 decimal places 1. (15%) 5 tab-delimited text files that give the probabilities (smoothed frequencies) of the sites and non-sites containing each type of residue for each shell (e.g. shell_0.txt, shell_1.txt, etc). Note: shell numbers should start from 0 to 4. Each row will represent one residue, and the columns will be sites and non-sites. Your file should look exactly like this one (shell_example.txt) (with different numbers, of course), and should be named "shell_0.txt", "shell_1.txt", etc. (If you were to plot this data in Excel, the resulting frequency histogram would look something like this: (shell_example.gif). 2. (10%) Two files for sites and non-sites that are identical to CAsites.txt and CAnonsites.txt, but contain an additional column with the score for each of the given sites and non-sites (these files are to be named "CAsites_scores.txt" and "CAnonsites_scores.txt"). The files should be tab-delimited too. 3. (10%) One tab-delimited text file with the top 100 scoring positions for 1GGZ.pdb, called "1GGZ_100.txt". It should have (PDBfile, Xposition, Yposition, Zposition, score) for each position (example output). 4. (30%) Runnable commented source code. Don't forget all files should include your name. 5. (5%) README file called "README.txt" with your pseudocode and instructions for running your code 6. (30%) Answers to the Project 3 quiz ( download PDF here ). Please submit your answers only (no copy of the questions) in a text file named exactly "YourSunetId_quiz3.txt". For example, if you Sunet Id is peggyyao, your file name should be peggyyao_quiz3.txt. In the file, please write your Sunet Id and your full name.
Note: We will deduct 20% of full score if you do not
follow any of the instructions! If you are not clear what we mean,
please ask us.
Submission instructions
Log in to an elaine machine:
ssh sunetid@elaine.stanford.edu
Place all of your data files from 1-6 above into ONE directory, e.g.
~/biomedin214/p3/submit
Please only include those listed above. Do not tar them up. Delete executable binaries if you have them (e.g. if you used Java/C/C++); we will compile your source code. cd into your submit folder:
cd ~/biomedin214/p3/submit
Run the class submission script:
/usr/class/biomedin214/bin/submit p3
Note the space before p3. Be sure to run the script from within your submission folder (where all your files to be submitted are)! You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submissions. Your submissions date will be the final submission received, and late days will be charged accordingly. Don't forget the quiz! << back to top
Tips and examples
Example files To help you test your code, we're providing an example set of files analogous to all of the input we're giving you and the output we expect from you for the actual project. Run your code with the example input files and check your results against the example outputs. For scores, we expect a bit of variability depending on how the math is implemented and how your programming language handles numbers and rounding, so don't be concerned if the scores you get are slightly off from the example output. As long as the variation is consistent and is within ~ 2 units, you're fine. What's important are the relative scores and frequencies within your output (e.g. for the structure scan, as long as your top 10 scoring points are the same and in the same order, the scores don't have to match exactly). Example input files | Example output files | Site PDB files | Nonsite PDB files Note: This example is adapted from another calcium model that was automatically generated from a 1-D EF-hand motif. EF-hands are very common secondary structures that are known to bind calcium in proteins. 1-D motifs, or short sequence patterns, are often used to describe known functional sites in proteins (ligand binding sites, enzyme active sites, etc) and DNA (transcription factor binding sites). Although 3-D motifs can provide better descriptions of many protein sites, it's sometimes hard to find examples of 3-D motifs for training. Work by Mike Liang in the Altman lab focused on automating the process of creating training sets and building models by using 1-D motifs (which are plentiful) as a starting point. Learn more by reading his paper below. Liang MP, Brutlag DL, Altman RB. (2003) Automated construction of structural motifs for predicting functional sites on protein structures. Pacific Symposium on Biocomputing pp.204-215. (download PDF here) Parsing PDB files You should not be spending most of your time parsing files! PDB files can be painful to parse, but they also have a strict structure that you should exploit. Unlike many other files, PDB files are NOT tab or space delimited; rather, the different fields are specified by column numbers (i.e. indices). For example, the X coordinate of an atom in the ATOM section of the file will always be in columns 31-38. The first 6 columns denote the section of the file - for example, EXPDTA denotes the line(s) containing the experimental method used. Note that column indices start at 1 and not 0 as in most programming languages, and that not all 6 columns need be occupied by characters (6 is just the max width of that field). All of the information you will need should be in the ATOM section of the PDB files. See the PDB's detailed description of the PDB file format, and section 9 especially. For more information: If you would like to understand more of the statistical reasoning that underlies the scores used by the Naive Bayes approach, see this writeup: (stat_explain.html) << back to top |
|