| Biomedical Informatics 214 (also listed as Computer Science
274) Representations and Algorithms for Computational Molecular Biology Spring 2006 |
|
|
Project 3 Identification of Functional Sites in Protein Structures Due electronically by 5:00pm, Monday, 5/15/06 Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone. Contents:
Introduction In this project you will explore the link between 3D
structure and function by analyzing the environment around ion binding sites.
For clarification, we define a site as a 3D location (x, y, z) in a protein
structure. Protein structures are most commonly described using PDB (Protein Data Bank) files. PDB files contain information about the protein, experimental methods used to obtain the structure, and coordinates for atoms in the structure (recall that a protein is made up of amino acids which are made up of atoms). For this project, you will be using these structure files to characterize the amino acid environment around given positive and negative calcium binding sites. Data Input
CAsites: A list of calcium binding sites (positive set) CAnonsites: A list of sites that do not bind calcium (negative set). Positive and negative site files are of the following form (tab delimited): PDBid Xposition Yposition Zposition Download the necessary PDB files associated with each site from SitePDBFiles.zip and NonsitePDBFiles.zip. << back to top Program Specifications There will be two parts to this assignment and you may therefore wish to split your code accordingly. For the first part, you will construct a simple model of calcium binding sites. A "binding site" is a region of a protein where a particular molecule or atom (the "ligand") can stably associate with that protein. Our model will consist of the probabilities of finding certain amino acid residues at various distances from a binding site. This is a reasonable model because the identity of the amino acids near a putative binding site and the precise distance that they are from that site play a large role in determining the local chemical/physical environment near the site. This chemical environment, in turn, determines the propensity for the ligand to spend time in that region of the protein.
For each site or nonsite in the provided lists, examine the amino acid makeup in concentric "shells" (the 3D equivalent of a ring) around that site or nonsite. We will use a model with five shells, each 1.5 angstroms thick. That is, for a given location, we will examine five different regions of space: the region between 0 and 1.5 angstroms away from the location (shell 0), the region between 1.5 and 3 angstroms away from the location (shell 1), and so on (see figure at right). Pedants among you may assume that the inner boundary of each shell is inclusive and the outer boundary is exclusive. It doesn't really matter since nothing will fall exactly on the boundary anyway.
Our model will then consist of a table. For each shell (0 through 4) and each amino acid (there are 20 of these), record the number of sites and (separately) the number of non-sites that have one or more of that amino acid in that shell. This will result in a 3-dimensional 5x20x2 table containing integer counts. When you're done, the table will be able to answer questions like, "How many sites had alanine residues between 3 and 4.5 angstroms (shell 2) from the center of the site?" or "How many non-sites had glycine residues less than 1.5 angstroms (shell 0) from the center?" For this assignment, an amino acid is "in" a particular shell if its alpha carbon atom falls into that region of space. You can ignore the other atoms. (Check the notes from the intro to biology section for quick background on amino acids and protein structure.) Then, for each shell, calculate the fractino of the sites and non-sites that contain each type of residue, and the fraction of sites and non-sites that do not contain each type of residue. This fraction (or "frequency") give us an estimate of the probability that a given amino acid will appear at a given distance from a site or a non-site. However, when we estimate probabilities from frequencies we usually need to add in a "fudge factor" by adding pseudo-counts to our data. This will ensure that none of the probabilities are zero (In general, probabilities of zero are very dangerous in modeling. You can see why this is mathematically in the formula for S below). Smoothing models can get arbitrarily complex, but in this case we will do something very simple. When calculating the frequencies, add 0.1 to each count of sites and non-sites with or without a residue in a certain position, and then instead of dividing by the total number of sites or non-sites, divide by the total + 0.2. For example, if residue LYS (lysine) appears in shell 2 of only two non-sites, the estimated probaiblity (or "smoothed frequency") of LYS in shell 2 for non-sites will be (2 + 0.1) / (# nonsites + 0.2). The estimated probability of LYS not being in shell 2 for non-sites is then (# nonsites - 2 + 0.1) / (# non-sites + 0.2). Note that these two probabilities are disjoint and mutually exhaustive, and thus must sum to 1. The enterprising among you will either use this fact as a check on their math, or as a way to simplify their code. In the second part, you will make use of the frequencies just collected to identify potential calcium binding sites. We treat the presence of a particular residue type in a particular shell as a feature that may help in identifying a site. We combine all features into a scoring function using a Naive Bayes approach. Specifically, given a location, we iterate over all the features and computer subscores using: ![]() The score is a sum over all the features, of which there should be 5x20 = 100: ![]() Locations with higher scores stand a higher chance of being calcium binding sites. To understand a bit of the statistical reasoning that underlies this, see our brief writeup. Use this scoring method to score all the sites and non-sites used in training (CAsites and CAnonsites). Also, write code to scan PDB file using a 3D grid over the structure and output the 100 top scoring positions from the grid (we would want to do this if we wanted to know where potential calcium binding sites might be in a protein). The grid points should be spaced at 2 Angstrom intervals and should span the whole protein. The easiest way to do this is to scan from (min(x)-6, min(y)-6, min(z)-6) up to (max(x)+6, max(y)+6, max(z)+6), where the max and min are taken over the coordinates of all the atoms in the PDB file. Use the code to scan the following PDB file: 1GGZ, and report the most likely locations of potential calcium binding sites in the protein. Note that our method is likely to produce several high scoring locations around the true site. << back to top Deliverables Note: All numbers (frequencies, coordinates, and scores) should be to 3 decimal places 1. (15%) 5 tab-delimited text files that give the fraction of the sites and non-sites containing each type of residue for each shell (e.g. shell0.txt, shell1.txt, etc). Each row will represent one residue, and the columns will be sites and non-sites. Your file should look exactly like this one (with different numbers, of course), and should be named "shell_0.txt", "shell_1.txt", etc. (If you were to plot this data in Excel, the resulting frequency histogram would look something like this. 2. (10%) Two files for sites and nonsites that are identical to CAsites and CAnonsites, but contain an additional column with the score for each of the given sites and non-sites (e.g. CAsites_scores.txt and CAnonsites_scores.txt). 3. (10%) One tab-delimited text file with the top 100 scoring positions for 1GGZ.pdb, called 1GGZ_100.txt. It should have (PDBfile, Xposition, Yposition, Zposition, score) for each position (example output). 4. (30%) Runnable commented source code 5. (5%) README file with your pseudocode and instructions for running your code 6. (30%) Answers to the Project 3 quiz ( doc | PDF ) Submission instructions Log in to an elaine machine: ssh sunetid@elaine.stanford.edu Place all of your data files from 1-6 above into one directory, e.g. ~/biomedin214/p3/submit Please only include those listed above. Delete executable binaries if you have them (e.g. if you used Java/C/C++); we will compile your source code. cd into your submit folder: cd ~/biomedin214/p3/submit Run the class submission script: /usr/class/biomedin214/bin/submit p3 Note the space before p3. Be sure to run the script from within your submission folder (where all your files to be submitted are)! You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submissions. Your submissions date will be the final submission received, and late days will be charged accordingly. Don't forget the quiz! << back to top Tips and examples Example files To help you test your code, we're providing an example set of files analogous to all of the input we're giving you and the output we expect from you for the actual project. Run your code with the example input files and check your results against the example outputs. For scores, we expect a bit of variability depending on how the math is implemented and how your programming language handles numbers and rounding, so don't be concerned if the scores you get are slightly off from the example output. As long as the variation is consistent and is within ~ 2 units, you're fine. What's important are the relative scores and frequencies within your output (e.g. for the structure scan, as long as your top 10 scoring points are the same and in the same order, the scores don't have to match exactly). Example input files | Example output files | Site PDB files | Nonsite PDB files Note: This example is adapted from another calcium model that was automatically generated from a 1-D EF-hand motif. EF-hands are very common secondary structures that are known to bind calcium in proteins. 1-D motifs, or short sequence patterns, are often used to describe known functional sites in proteins (ligand binding sites, enzyme active sites, etc) and DNA (transcription factor binding sites). Although 3-D motifs can provide better descriptions of many protein sites, it's sometimes hard to find examples of 3-D motifs for training. Work by Mike Liang in the Altman lab focused on automating the process of creating training sets and building models by using 1-D motifs (which are plentiful) as a starting point. Learn more by reading his paper below. Liang MP, Brutlag DL, Altman RB. (2003) Automated construction of structural motifs for predicting functional sites on protein structures. Pacific Symposium on Biocomputing pp.204-215. (PDF) Parsing PDB files You should not be spending most of your time parsing files! PDB files can be painful to parse, but they also have a strict structure that you should exploit. Unlike many other files, PDB files are NOT tab or space delimited; rather, the different fields are specified by column numbers (i.e. indices). For example, the X coordinate of an atom in the ATOM section of the file will always be in columns 31-38. The first 6 columns denote the section of the file - for example, EXPDTA denotes the line(s) containing the experimental method used. Note that column indices start at 1 and not 0 as in most programming languages, and that not all 6 columns need be occupied by characters (6 is just the max width of that field). All of the information you will need should be in the ATOM section of the PDB files. See the PDB's detailed description of the PDB file format, and section 9 especially. << back to top | |