What is "average number of gap characters"? Is that the number of real characters per gap, the average number of gaps per real character, or the average number of gaps per (real character + gaps)? Thanks. -- John Bauer
I read that as average number of gaps between all the alignments. Ex: if alignment 1 has 3 gaps in first sequence and alignment 2 has 4 gaps in first sequence. Then the average number of gaps for sequence 1 is 3.5. -- Harendra G
That's kind of what I thought, but I wasn't sure. Suppose there are an average of x gaps in a sequence of y characters... do they want x, x/y, y/x, x/(x+y), y/(x+y), or some other ratio? I guess there are several equivalent ways of expressing the same information, and hopefully any of them will be accepted. -- John Bauer
Harendra's interpretation is correct, but if you interpreted it differently but still report a reasonable average-gaps-for-sequence1 metric, this will be accepted (if it's clear how you computed it and it's correct). -- Jesse
Help with biological questions... Can someone please point me to some good resources on answering the biological questions in the quiz? Thanks --Stefan K
I'm not using actual references to come up with my answers to #19 or #20. I'm just thinking about... what changes in a DNA sequence would mean the DNA sequences are still similar, but the proteins are extremely different, and on the other hand, what changes in a protein sequence would mean the proteins are still similar. -- John Bauer
Ok easy question, what is this question asking? "Please fill in the three matrices (the insertion, deletion, and match matrices). Be sure to include tracebacks (all possible ones) for aligning the sequence in alignment0.input. Name all your matrices and use the according names/symbols for the tracebacks." i.e. what's the output that is wanted? the match matrix, the gap costs, and what else? -- Stefan K
Sorry for the confusion. Ideally, it should look something like we did in class for the score matrices M, Ix, and Iy. When it says "show all tracebacks" it just means show all arrows/pointers on paths that lead to the maximum scores. (you don't have to draw ALL pointers, just ones that will be used in some traceback of a highest scoring alignment). So ideally it should look like http://www.cs.rice.edu/~nakhleh/COMP571/Slides/Lecture4.pdf page 23 (red arrows only), but you'll have some inter-matrix pointers since you have 3 matrices.
If its too tough to illustrate, you can write out pointers like Ix(1,2)->M(2,3) --Jesse
I would like to echo Stefan's confusion here. My plan, upon reading this question, was to show my M, Ix, and Iy matrices, with arrows drawn on them to show the traceback pointers. But the phrase "include tracebacks...for aligning the sequence" could suggest that the paths for the resulting final alignments should be shown too. And we have been using the term "match matrix" for the static character-to-character value map that we read in; I wouldn't think we would show that. Indeed, what IS this question asking? -- Keith B.
I would like clarification on this too. For example, this document (http://www.cs.rice.edu/~nakhleh/COMP571/Slides/Lecture4.pdf) has slides with matrices annotated with tracebacks. By all possible ones do you want something like example on page 12/27 where every single arrow is draw (even non optimal ones) or do you want something more like the red arrow only but all the possible red arrows (optimal ones). If you wanted optimal traceback since our arrows would jump between matrices, should we also draw that? Something similar to (http://bix.ucsd.edu/bioalgorithms/presentations/Ch06_Alignment.ppt) on slide 41/45. -- Harendra G
Is this acceptable for tracing out an alignment? It shows the path and is amenable to notepad. ==> pIx[2][3] = 'M' T/- ==> pM[2][2] = 'M' A/G ==> pM[1][1] = 'M' A/A to give AAT.../AG-... where p means a pointer matrix; 'M' means the value in that box came from matrix M. I also included score matrices (M,Ix,Iy) and pointer matrices (pM,pIx,pIy) - Grace Tang
For Quiz alignment3.input, I have an alignment which matches the given output except my alignment lacks the first character in each sequence: an E-G match which has a zero value. This confuses me, since, if the traceback algorithm is to stop when it gets to a zero value, how can it proceed to add another zero valued score?
Indeed, Jesse a few posts ago said "It should be apparent that the difference between your output and the solution output is that yours has a leading prefix that contributes a net of 0 points to the total alignment score. We ask that in this situation, you only report the shorter one." In this case, MY alignment is the shorter one, compared to the expected answer, which has a leading prefix which contributes 0 points to the total score.
Is there something here that I am missing? -- Keith B.
Quiz alignment 3 is a global alignment, and so one of the strings has the start with a location on the boundary. -- John Bauer
Right it's global alignment. In this case since it's free to mismatch E-G, the alignment goes from 0,0 (* aligned to *) through 1,1 ( E aligned to G), but you have to output these aligned characters since it's global --jesse
Thank you both for directing me to the blindingly obvious. Perhaps I've been crawling around the details of this project for too long. :) A simple tweak to my traceback function did the trick. -- Keith B.
Following up on an earlier question: if all of our answers to the quiz must be in project1_quiz.txt, how are we supposed to include our answer to question 1, which requires potentially scanning a hand-drawn picture or creating something in word? Can we attach it as a separate PDF? -- John Melas-Kyriazi
We prefer to have everything in one file. If you feel you cannot do this in a text file, a pdf of everything is sufficient. -- TiffanyChen
I have all of my alignments matching the expected output but one, but there are some weird discrepancies between the max score from the matrices and the scores from the resulting matches. Usually, the matrix score is less than alignment score, but for the one that doesn't work, it is greater, which is probably why it isn't working. I've got my Epsilon compares in, so it can't be that. Does anyone have any hints on tweaking a mostly running system to get the scores to come out right? Also, what values are people using for Epsilon? -- Keith B.
I have been using .001, which seems to work well. Perhaps try staying consistent with your comparisons: i.e. always use (x - epsilon <= y). If that doesn't fix the problem, perhaps the discrepancies in float operations are not the issue? -- John Melas-Kyriazi
You may have your ex/dx and ey/dy switched. I had the same problem as you and fixed it by switching my ex/dx and ey/dy. I think Nick (read above) had the same confusion also. --Jenny Chen
Yeah, I would try switching ex/dx and ey/dy. That fixed mine. --Nick Perry
Yup, switching the gap penalties cleared everything up. Thank you, folks. -- Keith B.
Just to clarify, are Questions 15 and 16 on the quiz asking for us to discuss the differences between alignment8 vs. alignment9, or the individual alignments generated in alignment9? -- Victoria Y.
8 versus 9 -- TiffanyChen
Are we allowed to discuss the answers to the questions in the quiz with other students? - Kriti B
Just go by the guidelines in the Partner Policy -- TiffanyChen
For local alignments can we assume that we will only be starting the traceback from one location? - Kriti B.
No. You should start the traceback from all locations that have the same score... and be sure not to check exact match for doubles, but check approximate match instead. -- John Bauer
Thanks! Ya it took me a while to figure out that I needed to do approximate matches. - Kriti B
In running on the example0 set (no gap penalties), I get the five expected results plus two more:
AA_TGC A_TGC
A_G_GC AG_GC
These both have the same score as the other five, so what am I missing in not excluding them? - Keith B.
You should not alternate between skipping in A and skipping in B. -- John Bauer
Yeah, you should not be transitioning directly from Ix to Iy or vice versa. Doing so introduces these alternating gaps you're observing -- Jesse
About Alignment9 in Quiz. We have the official results like this: ...TRK_LAQKLGIEQPTL TRAAL__MMGINRGTL and TRK_LAQKLGIEQPTLYWHVK TRAAL__MMGINRGTL__RKK
Should we regard the second pais as a real alignment? I mean they both reach the max score, but the second one just has an additional tail. If we reverse the sequences and do the alignment, I guess it is 0 at L-L and we should not continue. -- Jianbin W.
Technically, we should only consider the shorter one. But since the example alignments have contained both, we'll allow the extra tail alignment too. So if you match the example alignments you're fine, but if you go ahead and implement the code to remove the extraneous alignment, you will not be penalized for not matching the example alignments. --Jesse
My recursive algorithm doesn't output all the possible alignments for some of the test cases, but the ones that it does output are correct and the maximum score it calculates is also correct. I'm pretty sure I'm doing tracebacks for the maximum value if it shows up multiple times, as well as "hopping" to multiple matrices if the path branches off. Would anyone like to suggest other possible things I could be doing wrong? (I know, there are probably a ton of things I could be doing wrong...) -- Victoria Y.
If the maximum value x shows up multiple times, are you making sure to go backwards for each cell with a value of at least (x - epsilon)? Are you doing the same thing when building the tables? It would be easier to tell what possible errors are happening if you say which test cases are missing which results. -- John Bauer
Thanks for the suggestion. Unless I'm doing the epsilon factor incorrectly, I don't think that's the problem. My code doesn't work for the alignments from alignment_example3, alignment_example4, and alignment8. For alignment_example4, I only have these two
AAAGTTGCA__TTGGCATTG
ATAC_TGCACGTT_GCGTTG
AAAGTTGCA__TTGGCATTG
ATAC_TGCACGTTG_CGTTG
For alignment_example3 I only have
GTTCAAA_TCTG_CATGT
GCCCACCCTC_GACGTGT
GTTCAAATCT_G_CATGT
GCCCACC_CTCGACGTGT
And for alignment8 I only have
GACCCGCAAGCTGGCGCAGAAGCTGGGAATAGAACAGCCG____ACACTTTACTGGCATGTGAAA
GACCCG__TGCT_GCGCTGATGATGGGCAT____CAACCGTGGTACGCT______GC__GTAAAA
GACCCGCAAGCTGGCGCAGAAGCTGGGAATAGAACAGCCG____ACACTTTACTGGCATGTGAAA
GACCCG__TGCTG_CGCTGATGATGGGCAT____CAACCGTGGTACGCT______GC__GTAAAA
-- Victoria Y.
Keeping the rest of the file the same, what do you get if you change the first two lines to ATTATTA/AGAGA or ATTATTATTA/AGAGAGA? The first example should have 4 results, and the second should have 8. Regardless, it seems like the best way to debug is to first print out the tables and make sure the links are correct where the splits should be. If they are, move on to finding some way of printing out the recursion as you do it, making sure you recurse every time there is a split. -- John Bauer
I had exacly the same problem and after I switched to abs(x - epsilon) everything went well. It is easy to compare your results and the "offical answer" and find the path you missed. Like example 4, you missed a choice after the alignment between A7B6, examine the pointer at this position and really compare the numbers at A6B5. Hopefully it helps. -- Jianbin W.
You were both totally right about the epsilon thing! There was part of my code where I discovered that I forgot to do it, which I've fixed now. Thanks for the help! -- Victoria Y.
I am still a little confused about the way we should do the alignment. Should we build a pointer matrix for each of the 3 score matrices(M, Ix and Iy) (so that we have 3 pointer matrices)? And when we trace back the alignment, we actually keep changing from one score matrix to another according to the 3 pointer matrices?Then why do we need the final score matrix(max of M, Ix and Iy)? Just for the alignment start point? Thanks. -- Jianbin
The way I did it was by making 3 pointer matrices as you described, for each of the score matrices. I did not use a "final score" matrix - do we actually need that for anything? -- Victoria Y.
I agree with Victoria; I did not need a "final" score matrix for anything, I simply used the three I had built already, starting from the M Matrix with the traceback. And yes, you are "hopping" from pointer matrix to pointer matrix during your traceback. -- John Melas-Kyriazi
I asked them about my possible output for problem 1, and they made it sound like those three matrices and the links were what they wanted, not the "final score" matrix such as alignment_example2.score. In fact, that file isn't even on the website any more. I think it might have been a red herring. -- John Bauer
You do not need a final score matrix, although you do need to determine a maximum value (as explained in Durbin and Eddy) to commence your traceback(s). Please remember that we request traceback details in the problem set. -- TiffanyChen
So in the instructions, you discuss the "gap penalty" line of the text file.
I quote: ";gap open penalty for A (dx), gap extension for A (ex), open for B (dy), extend B (ey)". Here's what I don't get... when you are opening a gap on A, that means that you are aligning B with a gap: this seems like dy, not dx. Similarly, when you are opening a gap on B, you are aligning A with a gap -- this is what I refer to as dx (see p. 30 of Durbin-Eddy). Can someone clear this up?
Thanks! -- John Melas-Kyriazi
-UPDATE- when I use my system, rather than the described one, I get the correct sequence alignments and scores for the example problems.
My Ix tables use dy & ey, and my Iy table uses dx and ex. I am treated Ix as matching a character in A against no character from B, which is the description in Durbin. Perhaps what they mean when they say "gap open penalty for A (dx)" is that dx/ex is when you match a character in B against a blank spot in A. Instruction step #2 is a bit more clear. -- John Bauer
Think of it this way - When you align a letter in B with a gap in A, you have introduced a gap in A. Therefore, this alignment should result in the penalty for opening a gap in A. Because we know (from the nice picture in Durbin) that Iy is the matrix representing a gap being put in sequence A, then the Iy matrix, as John above noted, should have the dx, ex values (the penalties for gaps in A). Similar logic results in having Ix use the ey,dy values. I would double check your pointer matrix; if indeed you switched ey/dy for ex/dx, you may have the Ix and Iy switched as well. This would mean in your pointer matrix that when you introduced a gap in A, you went to your Ix matrix, which does not match the implied meaning of these matrices. -- Nick Perry
Upon processing alignment8.input (of for_quiz), I get some extra alignments that don't appear in the supplied output but have maximum score. An example is ... AC__GACCCGCAAGCTGGCGCAGAAGCTGGGAATAGAACAGCCG_ACACTTTACTGGCATGTGAAA
ACCAGACCCG__TGCT_GCGCTGATGATGGGCAT_CAACCGTGGTACGCT______GC__GTAAAA
... with scores ...
1.0 1.0 -1.7 -0.3 1.0 1.0 1.0 1.0 1.0 1.0 -1.6 -0.2 -0.5 1.0 1.0 1.0 -1.6 1.0 1.0 1.0 1.0 -0.5 1.0 1.0 -0.5 1.0 -0.5 1.0 1.0 1.0 1.0 -0.5 1.0 1.0 -1.6 -0.5 1.0 1.0 1.0 -0.5 1.0 -0.5 -0.5 1.0 -1.7 1.0 1.0 -0.5 1.0 1.0 -1.6 -0.2 -0.2 -0.2 -0.2 -0.2 1.0 1.0 -1.6 -0.2 1.0 1.0 -0.5 1.0 1.0 1.0
... yielding a total score of (21.4)
NOTE that the sequences have the same length even though they appear not to (due to font issues)
I have more such examples.
Could anyone confirm this? -- kkatsiap
The same thing is happening for me, and I'm getting the other test cases correct. -- John Bauer
Same here. All the examples I have are for alignment8.input (for_quiz)
Do any of the TA's have comments on this? -- kkatsiap
Your alignments look correct. However, the reason for the difference is that the solution code was written to only output unique alignments -- ones that do not appear as superstrings of other alignments. It should be apparent that the difference between your output and the solution output is that yours has a leading prefix that contributes a net of 0 points to the total alignment score. We ask that in this situation, you only report the shorter one. Sorry for not specifying this sooner. We've made changes to the project instructions to reflect this.
To do this, you should terminate your traceback upon reaching ANY zero value the M matrix during local alignment -- that is, if you are in a position that has pointers to a zero in the M matrix and a zero in the Ix or Iy matrix, you should only traverse the zero in the M matrix and stop there. It should make sense to you that path ending at the 0 in M represents a suffix of some alignment passing through the 0 in the Ix or Iy matrix. Practically speaking, this means that the alignment8 in the for_quiz folder is correct and you should match it. --Jesse
I found my bug. What tripped me up was a numerical error... I was terminating local matches early when they had a score of 0.0, but one particular set of circumstances led to a prefix with a score of +epsilon. Suggestion: make sure *every* numerical operation accounts for such errors. -- John Bauer
I have this same bug, John. I'm not sure what you mean by "every numerical operation" though. Could you elaborate a bit? -- Jenny Chen
What I meant was every time you do a compare, not just an equality comparison (a == b) but also (a <= b) or (a < b), there should be an epsilon involved. For example, (a <= 0) --> (a <= epsilon).
Is the python numpy package supported on the cardinal machines? I took the advice that was in the python notes of trying numpy and installed it on my home machine, but I'm finding it's not there on the cardinal machines... :( - Stefan K
Did you try looking for a solution like this: http://www.nabble.com/install-without-admin-rights-td19833476.html? -- Jesse
Wouldn't this only work for a single user? When you guys test it the code will still not function unless the same libraries are installed. The only other option is submitting a version of numpy with your submission. --Harendra
True. You can use it if you find a way to bundle it with your submission (but please don't do this if it's a huge source base) --Jesse
Argh... lame, hmm is there anyway we can petition to have it on the cardinal machines? -- Stefan K
I don't think numpy is that crucial for this project. Most of the stuff can be done with python's standard lists...they will be slower than numpy's array operations/slightly harder to manipulate, but I doubt we will be aligning very large lists anyway. --Harendra
My code produces the same output that we're given for all the alignments (examples and for_quiz) except for example #6 where the top scoring alignments have a score of 168.3 (instead of 168.4). The alignments ARE different so it's not just an algebra rounding error. Does anyone else get this also? UPDATE: I found my bug. Thanks :) --Jenny Chen
I'm getting the official answer. A rounding error would be very unlikely to get that big a difference. -- John Bauer
For the input file making the match matrix score, can we assume the match values are always going to be integers? - Stefan Krawczyk
No. -- TiffanyChen
Make sure you take this into consideration when you are doing an (A == B) with floating point values. You to check for equality, you should make sure that abs(A -B) < epsilon with epsilon some very small number -- Jesse
We don't need to consider cases of repeated matches, overlap matches and hybrid match conditions (mentioned in Durbin), right? - Evelyn L.
We will be discussing this in class today. Remember, the assignment asks for ends-free alignment, so we will have to take some modifications into account. You won't need to worry about repeated matches or hybrid matches, but overlap matches correspond to global ends free alignment which we are doing. -- TiffanyChen
I would also like to know about matrix initialization. In addition, for example #2, we are provided with the score_matrix. From what I can gather, it appears that this matrix contains the maximum values from M, Ix, and Iy. Am I correct? - David G.
I get the same result. Is that what we're supposed to turn in for question 1? -- John Bauer
We'll get back to you on this -- Jesse
Yes, the score matrix is the max values from M, Ix and Iy. I took this class for a couple weeks last year and I remember the TAs telling us that. --Jenny Chen
I'm confused about how we should initialize the boundary values for the score matrices M, Ix, and Iy. Do we use zeros in all of them? -- Victoria Y.
We should be covering this in class Tuesday. -- Jesse
I think this is the correct answer. However, what happened for me was I found that the end of matches would then use the same output multiple times, as if the same sequence had started in each of M, Ix, and Iy. I got around this problem by putting a small negative number in the boundaries of Ix & Iy. -- John Bauer
Does this mean that along the top row of each of the matrices, for example, where there are no values for M(i,j-1), we initialize the entire row to zeros, or should we define them as multiples of the gap penalties or something like in the example we did in class? I guess I'm just not clear about how the values for the top row and left column should be handled in all the matrices. -- Victoria Y.
We should be covering this in class Tuesday and we'll provide more details after that. -- Jesse
If I understand the situation you're asking about, you are wondering why we would use all 0s for the top and left of the M matrix instead of negative numbers like in class. In the original algorithm, we wanted to penalize the algorithm for not lining up the starts of the two chains. Here (for the global match) we don't care if the algorithm drops several elements of one chain, as long as it uses the correct start for the next chain. That's what it means by "ends free". -- John Bauer
In case anyone is still wondering about this, I initialized the top row/leftmost columns that are impossible to reach with really large negative numbers (ideally infinity, but -1000000 works). So for the M matrix, you shouldn't be able to reach its top row or leftmost column so the negative infinities make sure it'll never happen. For Ix and Iy, you put the infinities in the appropriate row or column, depending on which matrix it is. -- Albert
For question 1 on the quiz, what do you want us to turn in? Do we need to turn in a set of numbers? Should we write out by hand the numbers in our matrices, or should we have our program print out its internal matrices? -- John Bauer
You can use Word, Powerpoint, or something similar to construct the matrix and place arrows over it. Or, you can draw and scan if you like. Or you can do some kind of ASCII method if you can format it nicely -- Jesse
Can we assume the sequences and the alphabets will not contain whitespace characters, so we can use Python methods such as .strip()? -- John Bauer
Yes -- Jesse
Is there a canonical order for the matches? Also, it isn't really specified what exactly is output for the matches. Is that every possible sequence with the same score? -- John Bauer
You do not need to have them in the same order as the output. To compare your output to the example output, try concatenating the pairs of lines for the alignment and then do line sorting on both files to compare them. -- Jesse
If our programs might be tested against data with unequal alphabets, or non-symmetric cost matrices, would you post an example of such an input file so we can test that we are reading everything in the expected order? -- John Bauer
2
AT
3
TGC
Your match matrix would be rectangular. It should be trivial for you to use a non-symmetric match matrix (just change the values in the input file). -- Jesse
I guess my question was if we could get a sample input & output to compare to, just to make sure we read the input specification correctly. Thanks. -- John Bauer
A link for sample input and output is provided in the instructions -- Jesse
