Biomedical Informatics 214 (also listed as Computer Science 274)
Representations and Algorithms for Computational Molecular Biology

Spring 2007

Programming Project 2

Due electronically by 5:00pm, Monday April 30, 2007

Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone.
Remember to consult the FAQ periodically as it will be updated with answers to common questions.

Contents:

  1. Introduction
  2. K-nearest neighbors
  3. K-means
  4. Submission contents
  5. Submission instructions

Introduction

In this project you will be implementing both K-nearest neighbor (a supervised machine learning algorithm) and K-means (an unsupervised machine learning algorithm) to analyze microarray data.


I. K nearest neighbors

Expression data was taken from 72 patients with acute leukemia, to see if expression alone can differentiate between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Although symptoms are similar for these two diseases, the response to treatment is quite variable between the two cancer types.  For the first part of this project, you will uset K-nearest neighbors to classify patients with acute leukemia.

Methodology:

The basic idea is the following: given an expression vector for an unclassified patient, find the K "closest" expression vectors from the classified patients and let them vote on the classification for the query patient. If some p percent or more of the K neighbors are ALL patients then classify the unknown patient as ALL, otherwise classify it as AML. Your function should take in the positive training set (ALL patient data), a negative training set (AML patient data), a positive integer for K, and a value between 0 and 1 for p. It should return a vector of predictions -- one prediction for each patient in the test set. Use Euclidean distance as your distance metric.

Deliverables:

You will need to submit a script or source for an executable called "knn" (obviously the extension will vary by language) to perform k-nearest neighbors classification with the two  input file names, k and p (in that order) as command line arguments.  For example for the leukemia dataset the command line arguments should read: "knn ALL.dat AML.dat k p".  Output should be to both <stdout> and a file called "knn.out" exactly as below, where 10 and .75 are used for k and p respectively (other #'s are made up):

k: 10
p: .75
accuracy: .68
sensitivity: .85
specificity: .9


Cross validation:

To assess the accuracy of a classifier, we typically use an approach known as n-fold cross-validation. In this strategy we:

  1. Divide the training set into n groups randomly (Divide the positive set into n groups and the negative set into n groups and then combine them so that each of the final n groups have equal proportions of ALL and AML patients)
  2. We use one of the n subsets as our test set and the other n-1 as our training set. 
  3. We repeat step (2) n times. We assess the test accuracy (sensitivity, specificity, or a combination of the two) by averaging over all the runs. In your runs you should use 4-fold cross-validation in your computations of accuracy.
Some useful items:
                                                                                                Sensitivity  =  TP/(TP+FN)
                                                                                                Specificity  =  TN/(TN+FP)
                                                                                                Accuracy   =  (TP+TN)/total

Where TP = number of true positives, FN = number of false negatives, FP = the number of false positives, and TN = the number of true negatives in the test set.    

Data:

The dataset you will use for classification consists of microarray expression data for 72 leukemia patients. Experiments are described in detail in the paper:

T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, D. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science. 99:531-537.

The expression data for the 28 AML patients can be found in the tab delimited file AML.dat. Each column in the file represents a different AML patient. Each of the 7129 rows correspond to a different gene. The expression data for the 44 ALL patients can be found in  ALL.dat. The genes in this data set are in the same order as those in the AML data set. A description of the 7129 genes can be found in human_genes.txt.


Implementation notes for KNN part:



II. K-means

Introduction:

Analysis of microarray data can be on the experiment level (cluster/classify experiments), at the gene level (cluster/classify genes), or both (see biclustering if interested).  In part I you performed k-nearest neighbors to classify microarray experiments based on gene expression. In this section, you will perform K-means clustering on the genes, clustering together genes who have similar expression profiles over a multitude of experimental conditions. 

K-means clustering:

Implement the K-means clustering method discussed in class. The basic idea of K-means clustering is the following:

The data consists of microarrays spanning 79 experiments, including measurements of expression taken after environmental changes were imposed on the yeast. For example, some of these conditions were starvation (causing the yeast to form spores), changing the sugar supply (causing the yeast to ferment rather than respire), and synchronizing the cells to force them to pass through the stages of cell division at the same time. The experiments are described in detail in the paper:

Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. Cluster analysis and display of genome-wide expression patterns. PNAS. 95:14863-14868.

The expression patterns for the 2467 genes can be found in the tab delimited data file called yeast.dat. Each column in the file represents one experimental condition. The file yeast_experiments.txt lists the experiments to which each column corresponds. The common gene names and a short and a short functional annotation for the 2467 genes can be found in yeast_gene_names.txt.


Deliverables:

You will need to submit a script or source for an executable called "kmeans" (obviously the extension will vary by language) to perform k-means clustering with the number of clusters (K), filename for the tab delimited data file, the maximum number of iterations allowed and an optional 4th parameter for reading cetroids from a file.  If the filename for the centroids is not specified then the K initial centroids should be generated randomly. The initial centroid file should contain a Kx79 matrix in tab delimited form.  Sample centroid input filefor K=3 can be found here (test_centroids.txt).

Output should be writted to a file called "kmeans.out".   The output should contain the cluster assignment for each gene by index (i.e the first gene in the list is gene #1 etc...):

For example (K=3):
1    2
2    1
3    3
4    1
5    2
6    1
7    1
8    2
9    1

where the first column is the gene number and the second column designates a cluster assignment (the numbering of the clusters is arbitrary as long as two genes in the same cluster have the same cluster number).


Implementation notes for k-means part:


Submission contents:

What you will need to submit:

  1. Source code for knn and kmeans.

  2. A README file. Please call it README.txt This should include three sections.

  3. A file called project2_quiz.pdf, containing answers to Project 2 Quiz. This file should be submitted as a pdf file called project2_quiz.pdf



Submission instructions


Log in to an elaine machine:

ssh sunetid@elaine.stanford.edu

Place all of your relevant files (the three sections outlined above) into one directory. E.g.

~/biomedin214/p2/submit
Do not include irrelevant files; please only include those listed above. Delete executable binaries if you have them (e.g. if you used java/C/C+); we will compile your source code.

cd into your submit folder.

cd ~/biomedin214/p2/submit

Run the class submission script:

/usr/class/biomedin214/bin/submit p2
Note the space before p2. Be sure to run the script from within your submission folder! (The folder containing the three main items.)

You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submission. Your submission date will be the final submission received, and late days will be charged accordingly.