| Biomedical Informatics 214 (also listed as Computer Science
274) Representations and Algorithms for Computational Molecular Biology Spring 2007 |
Due electronically by 5:00pm, Monday April 30, 2007
Note that programming projects should be completed
individually. You may discuss algorithms with others, but the coding
should be done alone.
Remember to consult the FAQ periodically as it will be updated with answers to common questions.
Contents:
In this project you will be implementing both K-nearest neighbor (a supervised machine learning algorithm) and K-means (an unsupervised machine learning algorithm) to analyze microarray data.
Methodology:
The basic idea is the following: given an expression
vector for an unclassified patient, find the K "closest" expression
vectors from the classified patients and let them vote on the classification
for the query patient. If some p percent or more of the K neighbors are
ALL patients then classify the unknown patient as ALL, otherwise classify it
as AML. Your function should take in the positive training set (ALL patient
data), a negative training set (AML patient data), a positive integer for
K, and a value between 0 and 1 for p. It should return a vector of predictions
-- one prediction for each patient in the test set. Use Euclidean distance as
your distance metric.
Deliverables:
You will need to submit a script or source for an executable called "knn" (obviously the extension will vary by language) to perform k-nearest neighbors classification with the two input file names, k and p (in that order) as command line arguments. For example for the leukemia dataset the command line arguments should read: "knn ALL.dat AML.dat k p". Output should be to both <stdout> and a file called "knn.out" exactly as below, where 10 and .75 are used for k and p respectively (other #'s are made up):
k: 10
p: .75
accuracy: .68
sensitivity: .85
specificity: .9
Cross validation:
To assess the accuracy of a classifier, we typically
use an approach known as n-fold cross-validation. In this strategy we:
Where TP = number of true positives, FN = number
of false negatives, FP = the number of false positives, and TN = the number
of true negatives in the test set.
Data:
The dataset you will use for classification consists of microarray expression data for 72 leukemia patients. Experiments are described in detail in the paper:
T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, D. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science. 99:531-537.
The expression data for the 28 AML patients can
be found in the tab delimited file AML.dat.
Each column in the file represents a different AML patient. Each of the
7129 rows correspond to a different gene. The expression data for the
44 ALL patients can be found in ALL.dat.
The genes in this data set are in the same order as those in the AML data
set. A description of the 7129 genes can be found in human_genes.txt.
where k is the number of neighbors that have a "vote", and p
is the minimum fraction of "votes" needed in order to classify a sample as positive.
In the case when n=4:
Divide the data into 4 equally (or as close to equally) sized sets.
Make sure that the ratio of ALL to AML patients are the same in each set.
You can do this by splitting the ALL and AML data into 4 sets separately
and then combining them. Now take three (n-1) of the four subsets and combine
them to make the training set. Next run KNN on this training set using the
remaining 1 (out of the 4) set to test the classification.
We want to repeat this four times. Each time one of the 4 is left
out as the test set, and the remaining 3 are used as the training set. A
diagram can be found here.
Note: You only need to randomize and split the data once. Once
the data is split you alternate which single subset is used as the test set.
Introduction:
Analysis of microarray data can be on the experiment level (cluster/classify experiments), at the gene level (cluster/classify genes), or both (see biclustering if interested). In part I you performed k-nearest neighbors to classify microarray experiments based on gene expression. In this section, you will perform K-means clustering on the genes, clustering together genes who have similar expression profiles over a multitude of experimental conditions.
K-means clustering:
Implement the K-means clustering method discussed
in class. The basic idea of K-means clustering is the following:
The data consists of microarrays spanning 79 experiments,
including measurements of expression taken after environmental changes
were imposed on the yeast. For example, some of these conditions were
starvation (causing the yeast to form spores), changing the sugar supply
(causing the yeast to ferment rather than respire), and synchronizing the
cells to force them to pass through the stages of cell division at the
same time. The experiments are described in detail in the paper:
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. Cluster analysis and display of genome-wide expression patterns. PNAS. 95:14863-14868.
The expression patterns for the 2467 genes can be found in the tab delimited data file called yeast.dat. Each column in the file represents one experimental condition. The file yeast_experiments.txt lists the experiments to which each column corresponds. The common gene names and a short and a short functional annotation for the 2467 genes can be found in yeast_gene_names.txt.You will need to submit a script or source for an
executable called "kmeans" (obviously the extension will vary by language)
to perform k-means clustering with the number of clusters (K), filename
for the tab delimited data file, the maximum number of iterations allowed
and an optional 4th parameter for reading cetroids from a file. If
the filename for the centroids is not specified then the K initial centroids
should be generated randomly. The initial centroid file should contain a
Kx79 matrix in tab delimited form. Sample centroid input filefor K=3 can be
found here (test_centroids.txt).
Output should be writted to a file called "kmeans.out".
The output should contain the cluster assignment for each gene by
index (i.e the first gene in the list is gene #1 etc...):
For example (K=3):
1 2
2 1
3 3
4 1
5 2
6 1
7 1
8 2
9 1
where the first column is the gene number and the
second column designates a cluster assignment (the numbering of the clusters
is arbitrary as long as two genes in the same cluster have the same cluster
number).
This argument specifies the maximum number of iterations we allow before cutting off if convergence is not acheived.
It is useful for testing (and more importantly for us grading) purposes.
Also it allows you to get "pretty good" results in cases when it is taking a long time to converge.
For this project we generally want it set pretty high (I will test it on 50 iterations).
NOTE: Do not rely on the maximum iteration parameter to stop the program. If k-means converges before the maximum number of itereations is reached then it should stop then.
where k is the number of centroids, yeast.dat is the tab
delimited file containing expression data for 2467 genes in 79 experiments,
max.it is the maximum number of iterations allowed, and testCentroids.txt
is an optional parameter specifying a file from which intital centroids
should be read. If this filename is not present then centroids should be
generated randomly.
What you will need to submit:
Log in to an elaine machine:
ssh sunetid@elaine.stanford.edu
Place all of your relevant files (the three sections outlined above) into one directory. E.g.
~/biomedin214/p2/submitDo not include irrelevant files; please only include those listed above. Delete executable binaries if you have them (e.g. if you used java/C/C+); we will compile your source code.
cd into your submit folder.
cd ~/biomedin214/p2/submit
Run the class submission script:
/usr/class/biomedin214/bin/submit p2Note the space before p2. Be sure to run the script from within your submission folder! (The folder containing the three main items.)
You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submission. Your submission date will be the final submission received, and late days will be charged accordingly.