| Biomedical Informatics 214 (also listed as Computer Science
274) Representations and Algorithms for Computational Molecular Biology Spring 2006 |
Due electronically by 5:00pm, Wed April 26, 2006
Note that programming projects should be completed individually. You may discuss algorithms with others, but the coding should be done alone.
Contents:
In this project you will be implementing both K-nearest neighbor (a supervised machine learning algorithm) and K-means (an unsupervised machine learning algorithm) to analyze microarray data.
Methodology:
The basic idea is the following: given an expression
vector for an unclassified patient, find the K "closest" expression
vectors from the classified patients and let them vote on the classification
for the query patient. If some p percent or more of the K neighbors are
ALL patients then classify the unknown gene as ALL, otherwise classify it
as AML. Your function should take in the positive training set (ALL patient
data), a negative training set (AML patient data), a positive integer for
K, and a value between 0 and 1 for p. It should return a vector of predictions
-- one prediction for each row in the test set. Use Euclidean distance as
your distance metric.
Deliverables:
You will need to submit a script or source for an executable called "knn" (obviously the extension will vary by language) to perform k-nearest neighbors classification with the two input file names, k and p (in that order) as command line arguments. For example for the leukemia dataset the command line arguments should read: "knn ALL.dat AML.dat k p". Output should be to both <stdout> and a file called "knn.out" exactly as below, where 10 and .75 are used for k and p respectively (other #'s are made up):
k: 10
p: .75
accuracy: .68
sensitivity: .85
specificity: .9
Cross validation:
To assess the accuracy of a classifier, we typically
use an approach known as n-fold cross-validation. In this strategy we:
Where TP = number of true positives, FN = number of false negatives, FP = the number of false positives, and TN = the number of true negatives in the test set.
Data:
The dataset you will use for classification consists of microarray expression data for 72 leukemia patients. Experiments are described in detail in the paper:
T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, D. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science. 99:531-537.
The expression data for the 28 AML patients can
be found in the tab delimited file AML.dat.
Each column in the file represents a different AML patient. Each of the
7129 rows correspond to a different gene. The expression data for the
44 ALL patients can be found in ALL.dat.
The genes in this data set are in the same order as those in the AML data
set. A description of the 7129 genes can be found in human_genes.txt.
Introduction:
Analysis of microarray data can be on the experiment level (cluster/classify experiments), at the gene level (cluster/classify genes), or both (see biclustering if interested). In part I you performed k-nearest neighbors to classify microarray experiments based on gene expression. In this section, you will perform K-means clustering on the genes, clustering together genes who have similar expression profiles over a multitude of experimental conditions.
K-means clustering:
Implement the K-means clustering method discussed
in class. The basic idea of K-means clustering is the following:
The data consists of microarrays spanning 79 experiments,
including measurements of expression taken after environmental changes
were imposed on the yeast. For example, some of these conditions were
starvation (causing the yeast to form spores), changing the sugar supply
(causing the yeast to ferment rather than respire), and synchronizing the
cells to force them to pass through the stages of cell division at the
same time. The experiments are described in detail in the paper:
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. Cluster analysis and display of genome-wide expression patterns. PNAS. 95:14863-14868.
The expression patterns for the 2467 genes can be found in the tab delimited data file called yeast.dat. Each column in the file represents one experimental condition. The file yeast_experiments.txt lists the experiments to which each column corresponds. The common gene names and a short and a short functional annotation for the 2467 genes can be found in yeast_gene_names.txt.You will need to submit a script or source for an
executable called "kmeans" (obviously the extension will vary by language)
to perform k-means clustering with the number of clusters (K), filename
for the tab delimited data file, the maximum number of iterations allowed
and an optional 4th parameter for reading cetroids from a file. If
the filename for the centroids is not specified then the K initial centroids
should be generated randomly. The initial centroid file should contain a
Kx79 matrix in tab delimited form. Sample centroid input filefor K=3 can be
found here.
Output should be writted to a file called "kmeans.out".
The output should contain the cluster assignment for each gene by
index (i.e the first gene in the list is gene #1 etc...):
For example (K=3):
1 2
2 1
3 3
4 1
5 2
6 1
7 1
8 2
9 1
where the first column is the gene number and the
second column designates a cluster assignment (the numbering of the clusters
is arbitrary as long as two genes in the same cluster have the same cluster
number).
What you will need to submit:
Log in to an elaine machine:
ssh sunetid@elaine.stanford.edu
Place all of your relevant files (the four sections outlined above) into one directory. E.g.
~/biomedin214/p2/submitDo not include irrelevant files; please only include those listed above. Delete executable binaries if you have them (e.g. if you used java/C/C+); we will compile your source code.
cd into your submit folder.
cd ~/biomedin214/p1/submit
Run the class submission script:
/usr/class/biomedin214/bin/submit p2Note the space before p2. Be sure to run the script from within your submission folder! (The folder containing the four main items.)
You may submit multiple times; simply re-run the script. Each new submission overwrites the previous submission. Your submission date will be the final submission received, and late days will be charged accordingly.