BMI 214/CS 274 Programming Project 2

Microarrays

Due 11:55PM, Wednesday, May 4, 2005

Introduction

In this project you will be implementing both K-nearest neighbor (a supervised machine learning algorithm) and K-means (an unsupervised machine learning algorithm) to analyze microarray data.

The ribosome is a large complex of many proteins that facilitates the translation of mRNA into protein; they are the cellular machinery responsible for linking together the correct sequence of amino acids from a sequence of codons. The proportion of each protein present in the complex is coordinated in the cell to ensure the correct number of subunits are available to construct complete ribosome molecules. The cell often regulates the amount of protein by controlling the transcription level of the protein's gene. Therefore, we might expect many of the ribosomal genes to be coordinately regulated -- they should have similar mRNA expression levels. Also, given a number of known ribosomal proteins and their expression patterns across a wide range of experiments, we might be able to find other ribosomal proteins by comparison. Other proteins that have similar expression patterns to the known ribosomal genes may also be part of the complex or may be involved in the process of protein synthesis.

(Note: In this assignment, when we refer to ribosomal proteins we mean those proteins present in the ribosome located in the cytoplasm, not those proteins found in the mitochondrial ribosome. It is important to make this distinction since we do not expect the levels of mitochondrial and cytoplasmic ribosomes to be coordinated.)

Some useful items:

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Accuracy =(TP+TN)/total

Deliverables

What you will need to submit:

  1. A script or source for an executable called "knn" (obviously the extension will vary by language) to perform k-nearest neighbors classification with k and p (in that order) as command line arguments. Output should be to both <stdout> and a file called "knn.out" exactly as below, where 10 and .75 are used for k and p respectively (other #'s are made up):

    k: 10
    p: .75
    accuracy: .68
    sensitivity: .85
    specificity: .92

    Per email
    Re input files- the files won't change- you'll always be using ribo.dat and nonribo.dat. However, hard coding files is not exaclty good coding practice, so we don't really want to encourage that. You have 2 options- please specifiy in your readme which of the following you have done.

    a. you can accept the input files on the command line, e.g. "knn ribo.dat nonribo.dat k p"- in that order (this is the preferred method- makes grading easier for us)

    b. you can assume that ribo.dat and nonribo.dat live in the same directory as your code. If you choose this option please DO NOT submit those files with your code- we'll copy them into the appropriate directory when we do the grading.

  2. A second script or source for an executable called "kmc" to perform k-means clustering. You will use this to answer questions, and we will check code and comments and make sure it runs, but we will not systematically grade any output.
  3. A README document as descibed at the bottom of this page
  4. Cwp quiz. (You might want to look at the quiz before writing your programs to see what sorts of questions will be asked!)
Remember, your code MUST compile and run on leland machines!

Data

The data consists of 79 microarray experiments where the expression levels of 2467 genes in S. Cervisiae (baker's yeast) were measured across 79 different conditions or time points. Of the genes represented on the microarrays, 121 were previously characterized as ribosomal genes -- the remainder are various other genes with known functions. The 79 experiments include measurements of expression taken after environmental changes were imposed on the yeast. For example, some of these conditions were starvation (causing the yeast to form spores), changing the sugar supply (causing the yeast to ferment rather than respire), and synchronizing the cells to force them to pass through the stages of cell division at the same time. The experiments are described in detail in the paper:

Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. Cluster analysis and display of genome-wide expression patterns. PNAS. 95:14863-14868.

The current list of known ribosomal genes that were used in this homework can be found at the MIPS database.

The expression patterns for the 121 ribosomal genes can be found in the tab-delimited data file ribo.dat. Each column in the file represents one experimental condition (there are 79 columns in all). The expression patterns for the 2346 non-ribosomeal genes can be found in the data file nonribo.dat. The columns in this file are organized the same way, with 79 columns corresponding to the same order of experiments as in ribosome file. The file experiments.txt lists to which experiment each column corresponds. The common gene name and a short functional annotation for both the ribosomal set and the non-ribosomal set can be found in the files ribo-names.txt and nonribo-names.txt respectively.

K-nearest neighbor classification with microarray data

Implement the K-nearest neighbor strategy discussed in the TA session. The basic idea is the following: given an expression vector for an unclassified gene, find the K "closest" expression vectors from the classified genes and let them vote on the classification for the query gene. If some p percent or more of the K neighbors are ribosomal genes, classify the unknown gene as ribosomal, otherwise classify it as non-ribosomal. Your function should take in a positive training set, a negative training set, a test set, a positive integer for K, and a value between 0 and 1 for p. It should return a vector of predictions -- one prediction for each row in the test set. Use Euclidean distance as your distance metric.

Cross-Validation
To assess the accuracy of a classifier, we typically use an approach known as n-fold cross-validation. In this strategy we:

  1. Divide the training set into n groups randomly (Divide the positive set into n groups and the negative set into n groups).
  2. We use one of the n subsets as our test set and the other n-1 as our training set.
  3. We repeat step 2 n times.

We assess the test accuracy (sensitivity, specificity, or a combination of the two) by averaging over all the runs. In your runs you should use 6-fold cross-validation in your computations of accuracy.

K-means clustering with microarray data

Implement the K-means clustering method discussed in class. The basic idea of K-means clustering is the following:

Your program should take a dataset, an interger for K, and a set of centers (or if not specified, pick the centers randomly), and return a vector indicating to which cluster each data point in the dataset belongs to.

To Get Credit:

  1. Submit your source code and a readme file to CWP under the "Programming Code Submission" directory.
    1. Source code: We expect to see comments and explanations of what is going on. Especially if you expect partial credit for "almost working" code.
    2. Readme file: Include
      • Your name and CWP login name
      • Language used and instruction on how to run your program
      • Pseudocode description of the algorithm (one page or less). We will use this as a primary way to debug the conceptual basis for your code, especially for partial credit.
  2. Take the quiz and provide the test case output at CWP under the "Assignments and Projects" directory.

CWP


Question? Contact: http://helix-web.stanford.edu/bmi214/files-2005/project2/bmi214-spr05-staff@lists.stanford.edu