Biomedical Informatics 214 (also listed as Computer Science 274)
Representations and Algorithms for Computational Molecular Biology

Spring 2007

Programming Project 1

Due electronically by 5:00pm, Sat April 21, 2007

Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone.
Remember to consult the FAQ periodically as it will be updated with answers to common questions.

Contents:

  1. Introduction
  2. Program specifications
  3. Submission contents
  4. Submission instructions
  5. References

Introduction

Implement the dynamic programming algorithm with affine gap penalty, as described in class and on page 29 in the Durbin et al textbook (equations 2.16). Your program should be implemented such that it can be used for either global or local alignment (the input file has an indicator on which alignment it wants).

 

A copy of the pages on dynamic programming in the Durbin et al textbook appear on the course website. Your code should follow exactly the equations on page 29, i.e., there should be 3 scoring matrices, M, Ix, and Iy. A few things to note:

 

1. Sequence x (or sequence 1) is on the top of the scoring matrix, while sequence y (or sequence 2) is on the left side of the scoring matrix. Therefore, the i-th character in x corresponds to the i-th column in the scoring matrix, while the j-th character in y corresponds to the j-th row in the scoring matrix. In the equations on page 29 (equations 2.16), the first index is the column index, and the second index is the row index. This is opposite to what we usually see in programming languages. In most programming languages, the first index of a matrix is the row index, and the second index is the column index. Hence, M(i-1,j) in the equation should be written as something like M[j][i-1] in your code.

 

2. The Durbin et al textbook assumes same gap penalty for both sequences. However, we have different values for the two sequences. Therefore, in the Ix equation, d should be dy, and e should be ey, because Ix means inserting a gap in sequence y. Similarly, in the Iy equation, d should be dx, and e should be ex.

 

3. Be sure that you have a strong conceptual understanding of the algorithm before you attempt to encode it. It is wise to begin with pseudocode for this project, especially since you are required to submit it with your code (see "Submission Contents" section below). 
 

Hint: remember the differences between local alignment and global alignment:

  1. Local never records a score in the score matrix below 0.
  2. When looking for best match, local looks for the best score anywhere in the matrix, while global looks only in last row and last column.
  3. Local requires negative scores in the match matrix (note: the "match matrix" is the matrix which specifies what the score is for matching one amino acid or nucleotide with another, this is NOT the same as the "score matrix" which stores the scores for best alignments at any position; the "score matrix" is the one you are building up from top left to bottom right). Without negative scores for mismatches an alignment's score would never decrease and the local alignment would never terminate.

Program specifications

Input

The input to the program is a file with lines containing the following information.

  1. A sequence of letters indicating sequence A
  2. A sequence of letters indicating sequence B
  3. An indication of whether local (1) or global (0) alignment is sought.
  4. A set of gap penalties for introducing gaps into A or B.
  5. The symbol alphabets to be used (e.g. ATGC for DNA strands and 21 single-letter abbreviations for the proteins, but there could be other symbols)
  6. Lines showing the score between an element in B and one in A.

An example input file:

 

AATGC
AGGC
0
0 0 0 0
4
ATGC
4
ATGC
1 1 A A 1
1 2 A T 0
1 3 A G 0
1 4 A C 0
2 1 T A 0
2 2 T T 1
2 3 T G 0
2 4 T C 0
3 1 G A 0
3 2 G T 0
3 3 G G 1
3 4 G C 0
4 1 C A 0
4 2 C T 0
4 3 C G 0
4 4 C C 1

And an annotated version of example input file. (This is just for your reference; your program will not need to read such files.)

 

AATGC         ;sequence A (length = # of columns)
AGGC          ;sequence B (length = # of rows)
0             ;0 = global,  1 = local
0 0 0 0       ;gap open penalty for A (dx), gap extension for A (ex), open for B (dy), extend B (ey)
4             ;number of letters in alphabet for sequence A = NA
ATGC          ;letters in alphabet for sequence A
4             ;number of letters in alphabet for sequence B = NB
ATGC          ;letters in alphabet for sequence B
1 1 A A 1     ;
1 2 A T 0     ;These are the entries in the match matrix between
1 3 A G 0     ; alphabet for A and alphabet for B.  Letters appear
1 4 A C 0     ; in same order as lines 6 and 8.
2 1 T A 0     ; There are NA * NB entries in this matrix, and each
2 2 T T 1     ;    entry is a row.
2 3 T G 0     ; First col is row number in match matrix
2 4 T C 0     ; Second col is col number in match matrix
3 1 G A 0     ; Third col is letter from alphabet A  
3 2 G T 0     ; Fourth col is letter from alphabet B
3 3 G G 1     ; Fifth col is match score between the two letters
3 4 G C 0     ;
4 1 C A 0     ; Thus, this matrix gives positive score only for
4 2 C T 0     ; exact matches.  There are many other matrices
4 3 C G 0     ; also possible.
4 4 C C 1

For simplicity, you may assume that the input sequences will never be longer than 300 letters and the symbol alphabet size is never larger than 26.

 

Output

The output of the program is a file with lines containing the following information:

 

  1. score for the best alignment.
  2. sequence A (with necessary gaps)
  3. sequence B (with necessary gaps) such that aligned characters from A and B are in the same column.

The resulting output file for the above example:

 

3.0

ATGC
AGGC

AATG_C
A__GGC

ATG_C
A_GGC

AATGC
AG_GC

AATGC
A_GGC


Implementation notes:



Inputs/Outputs

Examples: Test your program on the provided example input files and check your output against the provided corresponding sample output files, which may be downloaded here.

Quiz: You must also run your program against several quiz files, which may be downloaded here. The questions for Project 1 Quiz are based on these files.



Submission contents

Your submission will include four main items:

 

  1. Source code.

     

  2. README.txt. This should be a text file that includes four sections.

     

  3. A file called project1_quiz, containing answers to Project 1 Quiz. This file must be

     

  4. A folder called "alignments" which contains your inputs and outputs for the aligned sequences resulting from the quiz questions. This folder should contain files named
    alignment1.input through alignment10.input
    and
    alignment1.output through alignment10.output



Submission instructions


Log in to an elaine machine:

ssh sunetid@elaine.stanford.edu

Place all of your relevant files (the four sections outlined above) into one directory. E.g.

~/biomedin214/p1/submit
Do not include irrelevant files; please only include those listed above. Delete executable binaries if you have them (e.g. if you used java/C/C+); we will compile your source code.

cd into your submit folder.

cd ~/biomedin214/p1/submit

Run the class submission script:

/usr/class/biomedin214/bin/submit p1
Note the space before p1. Be sure to run the script from within your submission folder! (The folder containing the four main items.)

You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submission. Your submission date will be the final submission received, and late days will be charged accordingly.





References (see the course website for pdfs)

Books: Both the Durbin et al. text and Mount text have good descriptions of these algorithms.
Gotoh article (1982)








Back to main page.


biomedin214-spr0607-staff@lists.stanford.edu