In the instructions, you specifically say to only include in our output folder the results of tests for knn for k = 1, 5, 10, 20, 40, and 50, but not 30. Just a heads up -- I'm not sure if this is a typo or deliberate, but I followed the instructions exactly. -- John Melas-Kyriazi

Yes that's fine, no need to include 30. - Florian

For the pseudocode, did you want our pseudocode to accurately reflect our function names and variables, or just be high level? - Harendra

It would be nice but it's not required. The pseudocode is to see how you thought about the problem. - Florian

For question 5b, it seems odd that we're comparing questions 3 and 5 vs. 4 and 5. Is there a typo? - Evelyn

Thanks for pointing this out, you are correct. You should compare questions 4 and 5. We will take this typo into consideration when grading. - Florian

Can you clarify this point on the quiz part I: "make sure that the two cancer types are equally represented"? This sounds different from the instructions on Web site. The instructions specify that each training set should have the same proportions (eg 33 ALL, 21 AML), the quiz seems to suggest equal representation (eg 21 ALL, 21 AML). - Guy

Sorry for the confusion, either answer will be accepted. - Florian

I don't mean to rag on you guys here, but since you recommend Python, why not use it to parse similar looking outputs to check if they're the same? I don't think 0.68 should be treated differently from .68, not when a call to float(x) shows they're the same. Furthermore, since Python makes it so much easier to output the 0.68, why not make that the accepted answer? Anyway, what is the accepted format for a floating point number that is 1? "1" or "1.0"? What about 0? Thanks in advance. -- John Bauer

Either 1 or 1.0 is acceptable. 0, 0.0 is acceptable. Sorry but we can't change requirements this late in the game. - Florian

In that case, can you make very explicit requirements for what the strings should look like? Thanks. It looks like what you want is: 1) 2 digits of precision on floating point numbers 2) If the 1/100s place is 0, only output 1 digit of precision 3) If the number is larger than 0.0, but smaller than 1.0, drop the leading 0 -- John Bauer

''.strip('0') seems like a good way to handle most of the cases and handle 1 and 0 specially...for python... - Harendra G.

Don't worry too much about special cases with the formatting. If a test fails we will look at it manually. - Florian
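
A minimal Python sketch of the three rules John Bauer lists above, for anyone who wants a starting point. The staff did not confirm these rules as the exact specification, so treat them as one reading of the expected format; `format_score` is a hypothetical helper name, not part of the assignment.

```python
def format_score(x):
    """Format a score in [0, 1] using the three rules proposed above (an assumption)."""
    s = f"{x:.2f}"         # rule 1: two digits of precision, e.g. "0.68", "0.50", "1.00"
    s = s.rstrip("0")      # rule 2: drop trailing zeros, "0.50" -> "0.5", "1.00" -> "1."
    if s.endswith("."):
        s += "0"           # keep one digit after the point: "1." -> "1.0", "0." -> "0.0"
    if s.startswith("0.") and s != "0.0":
        s = s[1:]          # rule 3: drop the leading zero for values between 0 and 1
    return s


print(format_score(0.68), format_score(0.5), format_score(1), format_score(0))
```

This prints 1 and 0 as "1.0" and "0.0", which Florian lists as acceptable above.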

For KMeans, is it fair to assume that K, the number of centers, will correspond to the number of centers provided in a centroid file? For example "kmeans 3 yeast.dat 50 yeast_centers.txt" can we assume that yeast_centers.txt will have exactly 3 centers in it? -- Stefan K

Yes you can assume that. - Florian

For KMeans, if a centroid does not have any genes assigned to it at some point during the run of the program, how are we supposed to update its location? Can we just leave it put? Thanks! -- John Melas-Kyriazi

This is reasonable. Later on after other centroids move, some points may become assigned to it once again -- Jesse
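
One way to sketch the "leave it put" approach in Python. The data layout (points and centroids as lists of floats, `assignments` as one cluster index per point) and the name `update_centroids` are assumptions for illustration only.

```python
def update_centroids(points, assignments, centroids):
    """Return new centroids; a centroid with no assigned points keeps its old position."""
    k = len(centroids)
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    # Accumulate per-cluster coordinate sums and point counts.
    for point, cluster in zip(points, assignments):
        counts[cluster] += 1
        for d in range(dims):
            sums[cluster][d] += point[d]
    new_centroids = []
    for c in range(k):
        if counts[c] == 0:
            new_centroids.append(list(centroids[c]))  # empty cluster: leave it put
        else:
            new_centroids.append([s / counts[c] for s in sums[c]])
    return new_centroids
```

For example, `update_centroids([[0, 0], [2, 2]], [0, 0], [[5, 5], [9, 9]])` moves the first centroid to the mean of its two points and leaves the second where it was.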

There seem to be contradicting responses to the question of making our code general or specific. Are we allowed to implement for the specific case of 28 AML and 44 ALL patients and 4-fold cross validation? Or should our code allow for any number of patients and any-fold cross validation? - Linda

Code input should be as described in the instructions. You can hard-code values that aren't given as input on the console but you should write code so it's easy for a programmer to simply change these values. - Florian

Sorry to be so persistent, but I'd really like a "yes" or "no" on the specific numbers. Thanks! - Linda

Yes, you are allowed to hard-code these numbers. - Florian

Another question for k-NN - The methodology in the instructions says that our program should "given an expression vector for an unclassified patient, find the K closest expression vectors from the classified patients and let them vote on the classification for the query patient...It should return a vector of predictions -- one prediction for each patient in the test set." Does our program actually need to do these things? Because the rest of the instructions imply that it should just do the n-fold cross validation and generate output as shown in the example? - Kriti

The output of your program should be exactly as given in the example. The function within your program that handles KNN should return a vector of predictions so you can investigate it if needed. - Florian

For k-NN, how 'exact' do you all want the output to be? E.g., is it OK if accuracy prints as 0.68, or does it HAVE to be .68? - Kriti

Sorry if that output format isn't very convenient but we'd like you to please have it exactly as described. - Florian

I guess the question is: do you want the answer to a certain number of decimal places, or something else? - Stefan K

Follow-up to Kriti's question: say we get 1.0 for sensitivity. How should that be displayed (if you don't want 0.68, etc.)? Also, how to display 0.0? I think this is a general source of confusion. Thanks! -- John Melas-Kyriazi

I am also still confused how you want to have the output formatted. Thanks! -- Nafis Jamal

Question about k-NN. If I understand the instructions correctly, prediction is based solely on the positive dataset, i.e. if p is 0.5 and k is 10, we predict the positive dataset if more than 5 of the 10 nearest neighbors are positive. Otherwise, we predict the negative dataset. However, this doesn't account for cases where the positive neighbors outnumber the negative ones but still fall short of the fraction p -- for example p = 0.9, k = 10, (+) = 8 and (-) = 2. What should be done in this case? Would it be acceptable to perform a different check for that case, i.e. in the above example pick the (+) dataset by virtue of 8 being greater than 2? Leaving it as is would pick the (-) dataset and this generates pretty large margins of inaccuracy. Also, what is the acceptable range for accuracy? --David K.

You don't have to implement any further checks, the user is responsible for picking a reasonable value for p. For accuracy - since there is a degree of randomness involved answers will fluctuate a bit, we'll take that into account of course. If you implement the algorithm correctly you'll get full credit! - Florian
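
A small Python sketch of the plain threshold vote discussed here, with no extra majority check. Which class counts as "positive" is an assumption (ALL is used below only as a placeholder), and the strict `>` comparison is one possible choice for how an exact tie at p is handled.

```python
def vote(neighbor_labels, p, positive="ALL", negative="AML"):
    """Predict `positive` only if more than a fraction p of the neighbors are positive."""
    positives = sum(1 for label in neighbor_labels if label == positive)
    # Strict comparison: a fraction exactly equal to p falls to the negative class.
    return positive if positives / len(neighbor_labels) > p else negative


# David K.'s example: p = 0.9, k = 10, 8 positive and 2 negative neighbors.
print(vote(["ALL"] * 8 + ["AML"] * 2, 0.9))
```

With p = 0.9 the example prints the negative class even though 8 of 10 neighbors are positive, which is the behavior Florian says is acceptable.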

When we evaluate the cross validation accuracy vs. K, should we keep the random seed for partitioning the same? Will there be any bias if we use different random partitioning for different K values? And for the output file, what does it mean by "output should be to <stdout>"? Should we just output the evaluation results like sensitivity or also the prediction for each patient? -- Jianbin

You will want to keep the partitions the same for each of the runs. The output should be to the console (standard output) - most languages' print statement does just that. The output should be exactly (!) as it's written in the description for the project. If anything is unclear, please ask! - Florian
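
A minimal sketch of keeping the partitions identical across runs by seeding a dedicated random generator. The function name `make_folds` and the seed value are illustrative assumptions; this version also ignores class labels, so it would need to be applied per class to preserve the ALL/AML proportions in each fold.

```python
import random


def make_folds(n_samples, n_folds, seed):
    """Split sample indices 0..n_samples-1 into n_folds groups, reproducibly."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle on every run
    # Deal the shuffled indices out round-robin so fold sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


folds = make_folds(72, 4, seed=100)
print([len(fold) for fold in folds])
```

Calling `make_folds` once and reusing `folds` for every value of K (or calling it again with the same seed) gives each K the same training and test sets.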

Random comment about k nearest neighbors. If you sort, each new classification is O(n log n). If, instead, you keep a heap with the closest k neighbors as you measure each of the n distances, you can make the time for each classification O(n log k). This is a common interview question, by the way... -- John Bauer
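
A sketch of the heap idea in Python using the standard `heapq` module. Euclidean distance and the `(vector, label)` layout of the training data are assumptions; the assignment may specify a different distance.

```python
import heapq
import math


def k_nearest(query, training, k):
    """Return the k closest training points as (distance, label), nearest first.

    training is a list of (vector, label) pairs.
    """
    heap = []  # max-heap on distance (stored negated), never holds more than k items
    for i, (vector, label) in enumerate(training):
        dist = math.dist(query, vector)  # Euclidean distance (assumed metric)
        item = (-dist, i, label)         # i keeps tuples comparable when distances tie
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif dist < -heap[0][0]:         # closer than the farthest neighbor kept so far
            heapq.heapreplace(heap, item)
    return [(-neg, label) for neg, _, label in sorted(heap, reverse=True)]
```

Each of the n points costs at most one O(log k) heap operation, which gives the O(n log k) bound. The standard library's `heapq.nsmallest` offers a ready-made alternative for selecting the k smallest items.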

The quiz says that the FAQ contains information on how to save answers as PDFs. Does this FAQ still exist, and more importantly, what did it say about PDFs? -- Sasen

The FAQ on the website has been updated. -- TiffanyChen

For the K-means assignment, what if (in the case where random centroid points are generated) there are centroids to which no points have been assigned? How do we move such a centroid for the next iteration? --Jenny

Please see answer below for Divij's question. -- TiffanyChen

It is still possible for a centroid to have none of the data points associated with it. What I was doing was randomly restarting any abandoned centroids by picking a random data point. -- John Bauer

This is correct, you will want to make sure that you end up with the right number of categories in the end! - Florian
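
A possible Python sketch of the random-restart idea John Bauer describes: any centroid with no assigned points is moved onto a randomly chosen data point. The names and data layout are assumptions for illustration.

```python
import random


def reseed_empty_centroids(points, assignments, centroids, rng):
    """Replace each centroid that has no assigned points with a random data point."""
    assigned = set(assignments)  # cluster indices that own at least one point
    return [
        list(centroid) if c in assigned else list(rng.choice(points))
        for c, centroid in enumerate(centroids)
    ]


rng = random.Random(0)  # seeded only so the example is repeatable
print(reseed_empty_centroids([[0, 0], [1, 1]], [0, 0], [[0.5, 0.5], [9, 9]], rng))
```

The second centroid in the example owns no points, so it is restarted on one of the two data points while the first centroid is left alone.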

For the K-means what does it mean by "ribosomal genes"? Should we include mitochondrial ribosomal genes? Also how about these three #1479, 1480, 1516? -- Jianbin

Although there are potentially other ribosomal genes, we ask you to simply look at the last ones in the file. Do not be surprised, however, if those other genes cluster with the ones we have purposefully given you. -- TiffanyChen

For the K-means clustering assignment, can the cluster centers be any random number or do they have to be within the range of the values of the dataset? -- Divij Gupta

Check to make sure your centers are within the min and max of your dataset. -- TiffanyChen
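
A short sketch of generating random starting centers that stay inside the data range. It draws each coordinate between that dimension's own minimum and maximum, which is one assumed reading of "within the min and max of your dataset"; `random_centers` is a hypothetical helper name.

```python
import random


def random_centers(points, k, rng):
    """Draw k random centers, each coordinate within that dimension's data range."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return [[rng.uniform(lows[d], highs[d]) for d in range(dims)] for _ in range(k)]


centers = random_centers([[0.0, 10.0], [2.0, 14.0], [1.0, 12.0]], 3, random.Random(1))
print(len(centers))
```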

This time we do not have any example answer to check our program like last time, do we? It seems that the example output in K-means is way off the correct clustering. -- Jianbin

There are no sample answers for this project. The example output is random and only demonstrates how output should be formatted. - Florian

How should we break ties for knn? Suppose we are doing 10 nearest neighbors at 0.5, and 5 are ALL and 5 are AML... should we take the closest data point, should we take a random one, is this ALL or AML, or does it not matter? -- John Bauer

You can choose either way as long as it is documented properly. Good questions everybody. - Florian

I notice that specificity and accuracy are increased if we take AML for ties rather than ALL. -- John Melas-Kyriazi

Is it okay to write my code knowing that there are 44 ALL patients and 28 AML patients? Or should my code be generalized to read in any number of ALL and AML patients? Also, should the validation code be generalized to do any n-fold validation, or can we code it to only do 4-fold validation? --Jenny Chen

Yes that is fine. It is good practice to have these values clearly marked and commented so someone reading your code knows what to change. - Florian

Are we allowed to share our results for part I with our classmates? For example, can I say that my results (using python) for random.seed(100) are ...? -- John Bauer

You should not be comparing results with other students. You can discuss ideas and work with other students but solutions must be individual and you need to cite people you discussed your work with. - Florian

For part I, when we average the results, how should we do the averaging? Should we report specificity = sum(specificity) / 4 or specificity = sum(tn) / (sum(tn) + sum(fp))? -- John Bauer

You may use either approach but it should be commented in the code and readme file. - Florian
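
The two averaging options in John Bauer's question can give different numbers when the folds contain different counts of negatives. A small Python sketch, with each fold given as a hypothetical `(tn, fp)` pair and made-up counts:

```python
def specificity_mean_of_folds(folds):
    """Average the per-fold specificities: sum(specificity) / number of folds."""
    return sum(tn / (tn + fp) for tn, fp in folds) / len(folds)


def specificity_pooled(folds):
    """Pool the counts first: sum(tn) / (sum(tn) + sum(fp))."""
    total_tn = sum(tn for tn, _ in folds)
    total_fp = sum(fp for _, fp in folds)
    return total_tn / (total_tn + total_fp)


folds = [(3, 1), (8, 0)]  # illustrative counts only, not real results
print(specificity_mean_of_folds(folds), specificity_pooled(folds))
```

Here the per-fold average is (0.75 + 1.0) / 2 = 0.875, while the pooled value is 11 / 12. When every fold has the same number of negatives the two agree. The per-fold version also fails on a fold with no negatives at all, which is worth keeping in mind when choosing.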

In class today Russ mentioned an alternative metric for KNN where the "vote" from an adjacent point is weighted by its distance from the unclassified point. This makes a lot of sense, but using this method makes the given "p" input value somewhat meaningless (after all, an adjacent point could be arbitrarily close). Should we try to implement this method using a cumulative ALL and AML vote, or just stick with the prescribed "if more than p percent of the adjacent points are KNN, then it's KNN"? -- Alex Blackstock

For this assignment we'll go with the simpler version of just having the k closest points vote fully on the classification as described in the project description. - Florian

Any recommendations on charting packages that we should use for the quiz answers? I see that Matplotlib and Reportlab are both available for Python, but I don't see that they are preinstalled on the cardinal cluster. I suppose I could just install them on my home machine and take the answers from there, but it'd be nice to integrate the charting into my knn/kmeans Python scripts, which means that they should somehow be available on the cardinal cluster. -- Mark Auburn

Don't worry about integrating charting/graphing into your scripts. Question 3 of the quiz gives you an artificial test-case that you can interpret as points on a plane and you can see your clustering algorithm at work and can interpret the results graphically. For the quiz answer we are looking for a simple drawing of the points given and how your algorithm clusters them. - Florian

Should we expect our code for knn to be tested for problems that aren't in the examples or quizzes? Should we expect that in general? -- John Bauer

Yes, we want you to implement the programs as described in the project description. The quiz asks you to run your programs using some sample data which may be helpful in finding problems with your code. - Florian

For the knn, are we supposed to actually run the algorithm on any test data? It doesn't seem like we are provided with any. Do we just do the cross-validation (which uses the knn algorithm) and do the quiz as instructed? -- David G.

You will only need to answer questions 1 and 2 on the quiz for knn. These deal with n-fold cross-validation which uses the algorithm: it uses n-1 groups as the training set and tests it using a single set. For this you will be using AML.dat and ALL.dat as your sample data. - Florian


Page last modified on May 02, 2009, at 09:08 PM