MIS 214/CS 274 Programming Project 1

Dynamic Programming Sequence Alignment

Due 11:59 pm Monday, April 17 2000

Don't forget to visit the Test Cases and Submission Page which is required for this assignment.

Introduction

The first programming project is to implement the dynamic programming algorithm for sequence alignment, parameterized in such a way that it can be used for either global or local sequence alignment. You will then use it to do a database search in the next assignment.

You should create a pseudocode description of the algorithm for yourself first, based on the discussion in the Gribskov handout (or Needleman-Wunsch article). You do not have to implement the caching strategy (described in the Gotoh article) that takes the complexity from O(N3) to O(N2). The bold student may, for extra credit, attempt to write the more efficient version, using the scheme reported by Gotoh (see handout).

Remember, the difference between local alignment and global alignment are simple:

    1. local requires negative scores in the match matrix
    2. local never records a score in the score matrix below 0.0
    3. when looking for best match, local looks for best score anywhere in matrix (global looks only in last row and last col for best score).
Program Specifications

The input to the program is a file with lines containing the following information: (See sample files align-example.input and align-example.input-annotated):

    1. A sequence of letters indicating sequence A
    2. A sequence of letters indicating sequence B
    3. An indication of whether local (1) or global (0) alignment is sought.
    4. A set of gap penalties for introducing gaps into A or B.
    5. The symbol alphabets to be used (e.g. ATGC for DNA strands and 21 single-letter abbreviations for the proteins, but there could be other symbols)
    6. A matrix which shows the score for a match between an element in B and one in A.
For simplicity, you could assume that the input sequences will never be longer than 300 letters and the symbol alphabet size is never larger than 26. alignDNA-template.dat and alignProtein-template.dat are file templates for you to insert the appropriate DNA or protein sequences to create inputs for your program.

The output of the program is a file with lines containing the following information: (See align-example.output)

    1. score for the alignment.
    2. sequence A (with necessary gaps)
    3. sequence B (with necessary gaps) such that aligned characters from A and B are in the same column.
If you want to report more than one equivalent alignment, then you can leave a blank line and then print the alignments in a similar manner. See align-all-example.output.

To Get Credit:
 

1. Submit your pseudocode description (in a readme.txt file which will be submitted) of the algorithm that describes what you coded on one page or less. We will use this as a primary way to debug the conceptual basis for your code, since we allow so many different languages.

2. Submit a listing of your entire source code. We expect to see comments and explanations of what is going on. Especially if you expect partial credit for "almost working" code.

3. Submit the results of the code on test cases and answer additional questions that will be posted later.

4. 5% extra for an O(n^2) algorithm.

Test Cases and Submission Page



For questions, please contact Josh Stuart.