|
|
Biomedical Informatics 214 (also listed as Computer Science
274) Representations and Algorithms for Computational Molecular Biology Spring 2007 |
Due electronically by 5:00pm, Sat April 21, 2007
Note that programming projects should be completed individually. You may
discuss algorithms with others, but the coding should be done alone.
Remember to consult the
FAQ periodically
as it will be updated with answers to common questions.
Contents:
Implement the dynamic programming algorithm with affine gap penalty, as described in class and on page 29 in the Durbin et al textbook (equations 2.16). Your program should be implemented such that it can be used for either global or local alignment (the input file has an indicator on which alignment it wants).
A copy of the pages on dynamic programming in the Durbin et al textbook appear on the course website. Your code should follow exactly the equations on page 29, i.e., there should be 3 scoring matrices, M, Ix, and Iy. A few things to note:
1. Sequence x (or sequence 1) is on the top of the scoring matrix, while sequence y (or sequence 2) is on the left side of the scoring matrix. Therefore, the i-th character in x corresponds to the i-th column in the scoring matrix, while the j-th character in y corresponds to the j-th row in the scoring matrix. In the equations on page 29 (equations 2.16), the first index is the column index, and the second index is the row index. This is opposite to what we usually see in programming languages. In most programming languages, the first index of a matrix is the row index, and the second index is the column index. Hence, M(i-1,j) in the equation should be written as something like M[j][i-1] in your code.
2. The Durbin et al textbook assumes same gap penalty for both sequences. However, we have different values for the two sequences. Therefore, in the Ix equation, d should be dy, and e should be ey, because Ix means inserting a gap in sequence y. Similarly, in the Iy equation, d should be dx, and e should be ex.
3. Be sure that you have a strong conceptual understanding of the
algorithm before you attempt to encode it. It is wise to begin with
pseudocode for this project, especially since you are required to
submit it with your code (see "Submission Contents" section
below).
Hint: remember the differences between local alignment and global alignment:
The input to the program is a file with lines containing the following information.
AATGC AGGC 0 0 0 0 0 4 ATGC 4 ATGC 1 1 A A 1 1 2 A T 0 1 3 A G 0 1 4 A C 0 2 1 T A 0 2 2 T T 1 2 3 T G 0 2 4 T C 0 3 1 G A 0 3 2 G T 0 3 3 G G 1 3 4 G C 0 4 1 C A 0 4 2 C T 0 4 3 C G 0 4 4 C C 1
And an annotated version of example input file. (This is just for your reference; your program will not need to read such files.)
AATGC ;sequence A (length = # of columns) AGGC ;sequence B (length = # of rows) 0 ;0 = global, 1 = local 0 0 0 0 ;gap open penalty for A (dx), gap extension for A (ex), open for B (dy), extend B (ey) 4 ;number of letters in alphabet for sequence A = NA ATGC ;letters in alphabet for sequence A 4 ;number of letters in alphabet for sequence B = NB ATGC ;letters in alphabet for sequence B 1 1 A A 1 ; 1 2 A T 0 ;These are the entries in the match matrix between 1 3 A G 0 ; alphabet for A and alphabet for B. Letters appear 1 4 A C 0 ; in same order as lines 6 and 8. 2 1 T A 0 ; There are NA * NB entries in this matrix, and each 2 2 T T 1 ; entry is a row. 2 3 T G 0 ; First col is row number in match matrix 2 4 T C 0 ; Second col is col number in match matrix 3 1 G A 0 ; Third col is letter from alphabet A 3 2 G T 0 ; Fourth col is letter from alphabet B 3 3 G G 1 ; Fifth col is match score between the two letters 3 4 G C 0 ; 4 1 C A 0 ; Thus, this matrix gives positive score only for 4 2 C T 0 ; exact matches. There are many other matrices 4 3 C G 0 ; also possible. 4 4 C C 1
For simplicity, you may assume that the input sequences will never be longer than 300 letters and the symbol alphabet size is never larger than 26.
The resulting output file for the above example:
3.0 ATGC AGGC AATG_C A__GGC ATG_C A_GGC AATGC AG_GC AATGC A_GGC
ATGC AGGCwith the leading A in sequence 1 excluded because it is aligned with an end gap. This is still a global alignment!
An alternative representation of the first alignment is
AATGC _AGGCwhich is equivalent, and still has a score of 3.0. Your program may produce either type of output, so long as end gaps are not penalized.
Examples: Test your program on the provided example input files and check your output against the provided corresponding sample output files, which may be downloaded here.
Quiz: You must also run your program against several quiz files, which may be downloaded here. The questions for Project 1 Quiz are based on these files.
Your submission will include four main items:
Log in to an elaine machine:
ssh sunetid@elaine.stanford.edu
Place all of your relevant files (the four sections outlined above) into one directory. E.g.
~/biomedin214/p1/submitDo not include irrelevant files; please only include those listed above. Delete executable binaries if you have them (e.g. if you used java/C/C+); we will compile your source code.
cd into your submit folder.
cd ~/biomedin214/p1/submit
Run the class submission script:
/usr/class/biomedin214/bin/submit p1Note the space before p1. Be sure to run the script from within your submission folder! (The folder containing the four main items.)
You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submission. Your submission date will be the final submission received, and late days will be charged accordingly.
References (see the
course website for
pdfs)
Books: Both the
Durbin et al. text and
Mount text have good
descriptions of these
algorithms.
Gotoh article (1982)