MIS 214/CS 274 Project 2
Threading and Distances

Due at midnight on Thursday, May 18, 2000

Objectives:

Don’t forget to submit your answers!

Introduction:

Threading:

As discussed in the course, one way to predict the 3D structure of an amino acid sequence is to find other sequences to which it is highly similar (e.g., with which it has a high alignment score). As long as the two sequences are highly similar, we can be reasonably confident that the 3-dimensional structures of their proteins have not diverged very much. This suggests that, if the two sequences are similar enough, we can simply replace the amino acids of one protein (whose structure is known) with the amino acids of another protein (whose structure is unknown). The borrowed structure is usually a good fit when the percentage of identical residues is approximately 30% or higher.
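To make the 30% rule of thumb concrete, here is a minimal sketch of how percent identity could be computed from two already-aligned sequences. The function name, the use of "-" as the gap character, and the choice to count only gap-free columns in the denominator are assumptions for illustration; other definitions of percent identity divide by a different length.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns whose residues are identical.

    Assumes both strings come from the same alignment (equal length)
    and that '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Made-up example: 4 gap-free columns, 3 of them identical.
print(percent_identity("AAAA", "AAAT"))
```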

When the percent identity is below 30%, it becomes difficult to distinguish good matches from random matches. However, it has been shown that two sequences with much less than 25% identity can still adopt essentially identical shapes. It seems that although almost all of the amino acids change, the environment in which the amino acids find themselves is quite similar, and so the same structure results. To capture this idea, Bowie et al. (Science 253, p. 164-170, July 12, 1991, course reader) published a paper in which they took a known structure and, instead of looking at the amino acid at position i, looked at the properties of the environment occupied by the amino acid at position i. The properties they chose to consider were the type of secondary structure, the degree of hydrophobicity, and whether the amino acid was on the surface or in the interior of the molecule. A known structure of myoglobin and an alignment of all the globins enabled them to assess two things: (1) the type of environment occupied by each amino acid in myoglobin, and (2) the tendency for different amino acids to be in certain environments. They found 18 environments suitable for describing the different types of places in which amino acids find themselves. They were able to estimate a fairly reliable score matrix (which associates an amino acid with an environment type) because many globin sequences are known and each sequence could be aligned with the known structure to determine which environment a particular amino acid occupied.

In this project, you will use the dynamic programming code developed in Project 1 (or download a working version) to see how good threading is as a method for predicting structure versus standard alignment of the two sequences. You will decide whether a protein is "GLOBIN" or is "NOT GLOBIN" based on its threading score.

We have gathered together a bunch of sequences for you in the file "p2_sequences.txt". Please see the annotations in this file. The convention we use is UPPERCASE LETTERS represent amino acids (e.g., VALQRK...) and lowercase letters represent environments (letters a...r). At the top of the p2_sequences.txt file is the amino acid sequence from one particular myoglobin which we will refer to as the "original" sequence. Also in the file, you'll find a string of environment types associated with each of the amino acids in that structure. The file "p2_thread_template.txt" contains the score matrix for amino acids and environments, as well as the environment string. Just paste in an amino acid sequence to use it.
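As a rough illustration of what threading computes, the sketch below aligns an uppercase amino acid sequence against a lowercase environment string by global dynamic programming. It is not the Project 1 code and does not read the template files: the function name, the linear gap penalty of -2, and the tiny two-letter score table are all made up for the example. Your own program may use a different gap model or a local alignment.

```python
def thread_score(sequence, environments, score, gap=-2.0):
    """Best global alignment score of amino acids against environments.

    score maps (amino_acid, environment) -> number.
    gap is a linear per-position gap penalty (an assumed value).
    """
    n, m = len(sequence), len(environments)
    # prev[j]: best score for the residues seen so far vs. environments[:j].
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            pair = score[(sequence[i - 1], environments[j - 1])]
            curr[j] = max(prev[j - 1] + pair,   # residue i placed in environment j
                          prev[j] + gap,        # residue i left unplaced
                          curr[j - 1] + gap)    # environment j left empty
        prev = curr
    return prev[m]


# Hypothetical toy score table: NOT the matrix from p2_thread_template.txt.
toy = {("V", "a"): 1.0, ("V", "b"): -1.0,
       ("K", "a"): -1.0, ("K", "b"): 1.0}

print(thread_score("VK", "ab", toy))  # both residues fit their environments
```

Only two rows of the table are kept in memory at a time, which is enough when the score alone is needed; recovering the alignment itself would require keeping the full table or traceback pointers.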

You will compare the ability of threading and regular dynamic programming to distinguish globins from non-globins. (Note: The scores will NOT be directly comparable, but the degree to which globins have higher scores than non-globin controls will be used to assess distinguishing power.) The file "p2_standard_template.txt" contains the PAM-250 match matrix, and can also be used by simply pasting an amino acid sequence of interest into it. Threading usually works well, but it is limited because you:

  1. Need to know at least one 3D structure in the family of interest.

  2. Need to have a large number of variants aligned to that structure so you can compute amino acid affinities for environments (although this information may be reusable across proteins).

  3. Still are not modeling the detailed interactions that may make or break a protein.

  4. Depend on the environment not changing over evolution, although it may in fact change.

To get credit:

  1. Show all alignment scores computed

  2. Construct histograms and find reasonable cut-off scores

  3. List your prediction of "GLOBIN" or "NOT GLOBIN" for the unknowns

  4. Answer the questions
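One simple way to pick a cut-off from the known globin and control scores is sketched below: try each midpoint between neighbouring scores and keep the one that misclassifies the fewest known sequences. This is only one reasonable choice, not a required method, and the scores in the example are invented. The function names are hypothetical.

```python
def best_cutoff(globin_scores, control_scores):
    """Return (cutoff, errors) minimizing misclassified known sequences.

    A sequence is called GLOBIN when its score is >= cutoff.
    """
    values = sorted(set(globin_scores) | set(control_scores))
    if len(values) < 2:
        raise ValueError("need at least two distinct scores")
    # Candidate cut-offs: midpoints between neighbouring distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for c in candidates:
        missed = sum(1 for s in globin_scores if s < c)         # globins called NOT GLOBIN
        false_hits = sum(1 for s in control_scores if s >= c)   # controls called GLOBIN
        errors = missed + false_hits
        if best is None or errors < best[1]:
            best = (c, errors)
    return best


def classify(score, cutoff):
    """Label an unknown sequence from its alignment score."""
    return "GLOBIN" if score >= cutoff else "NOT GLOBIN"


# Invented scores, for illustration only.
globins = [52.0, 47.5, 40.0]
controls = [12.0, 20.5, 31.0]
cutoff, errors = best_cutoff(globins, controls)
print(cutoff, errors, classify(38.0, cutoff))
```

When the two score distributions overlap, no cut-off gives zero errors, and the histogram helps you judge which kind of error you would rather make.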

Distances:

1. Write a subroutine to compute the distance between two atoms. The backbone atom positions of protein 2tir are in the file 2tir.dat. The file contains only the backbone atoms: carbon, C-alpha, and nitrogen (labeled c, a, and n, respectively). Each line gives, in order, the amino acid type, amino acid sequence number, atom type, and x, y, and z coordinates.

2. Using any technology available to you, create a plot of the distance matrix for 2tir. Specifically, plot a two-dimensional matrix in which the sequence of the protein runs along the top row and down the left column. At each position, row i and column j, place a dot (or a * or some other mark) if the C-alpha atoms of the two amino acids at positions i and j of the protein are at a distance of 6 Å or less. For example, you could use the program "gnuplot", which is installed on all leland machines.

3. Can you recognize the helix and sheet from your dot plot?
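The steps above could be sketched roughly as follows. The sketch assumes the six fields on each line are separated by whitespace, which you should check against the actual 2tir.dat. The function names and the sample lines are made up for illustration, and the text rendering stands in for whatever plotting tool you choose.

```python
import math


def parse_backbone(lines):
    """Parse lines of 'residue_type seq_number atom_type x y z'.

    Assumes whitespace-separated fields; other lines are skipped.
    """
    atoms = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # blank or malformed line
        res, num, atom, x, y, z = fields
        atoms.append((res, int(num), atom, (float(x), float(y), float(z))))
    return atoms


def distance(p, q):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def contact_map(atoms, cutoff=6.0):
    """Boolean matrix: True where two C-alpha atoms lie within cutoff."""
    ca = [xyz for _, _, atom, xyz in atoms if atom.lower() == "a"]
    n = len(ca)
    return [[distance(ca[i], ca[j]) <= cutoff for j in range(n)]
            for i in range(n)]


def render(matrix):
    """Draw the matrix as text, '*' for a contact and '.' otherwise."""
    return "\n".join("".join("*" if hit else "." for hit in row)
                     for row in matrix)


# Invented coordinates, NOT taken from 2tir.dat.
sample = [
    "ALA 1 n 0.0 0.0 0.0",
    "ALA 1 a 1.0 0.0 0.0",
    "ALA 1 c 2.0 0.0 0.0",
    "GLY 2 a 4.0 0.0 0.0",
    "VAL 3 a 12.0 0.0 0.0",
]
print(render(contact_map(parse_backbone(sample))))
```

For the real file, the same functions could be applied to the lines read from 2tir.dat, and the (i, j) pairs marked True could be written out as points for gnuplot instead of being drawn as text.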

Early Start for Project 3:

In Project 3 you will be asked to apply two machine learning techniques to the analysis of microarray data. This project will require matrix manipulation and, although it can be done in any programming language (or even MS Excel if you are really good at it), we encourage you to use MATLAB (or something similar). MATLAB is an excellent tool for quickly developing mathematically intense programs, which you might find very useful in the future, so it is worth taking a look at while taking this course. You can run MATLAB interactively or you can write scripts for it to execute. Stanford has a site license and it is installed on all Sweet Hall machines. MATLAB has many useful functions, and reading a tutorial would be a good way to get started. Or you could wait until Project 3 is posted (around 5/17) to decide whether it is worth the effort to learn another language. If you have questions, please write to Soumya. We will decide whether to host an informal MATLAB lab session based on your email responses.

Don’t forget to submit your answers for Project 2!



For questions, please write to Josh Stuart.