Introduction
In this project you will be implementing both K-nearest neighbor (a supervised machine learning algorithm) and K-means (an unsupervised machine learning algorithm) to analyze microarray data.
The ribosome is a large complex of many proteins that facilitates the translation of mRNA into protein; they are the cellular machinery responsible for linking together the correct sequence of amino acids from a sequence of codons. The proportion of each protein present in the complex is coordinated in the cell to ensure the correct number of subunits are available to construct complete ribosome molecules. The cell often regulates the amount of protein by controlling the transcription level of the protein's gene. Therefore, we might expect many of the ribosomal genes to be coordinately regulated -- they should have similar mRNA expression levels. Also, given a number of known ribosomal proteins and their expression patterns across a wide range of experiments, we might be able to find other ribosomal proteins by comparison. Other proteins that have similar expression patterns to the known ribosomal genes may also be part of the complex or may be involved in the process of protein synthesis.
(Note: In this assignment, when we refer to ribosomal proteins we mean those proteins present in the ribosome located in the cytoplasm, not those proteins found in the mitochondrial ribosome. It is important to make this distinction since we do not expect the levels of mitochondrial and cytoplasmic ribosomes to be coordinated.)
Some useful items:
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Accuracy
=(TP+TN)/total
Deliverables
What you will need to submit:
k: 10
p: .75
accuracy: .68
sensitivity: .85
specificity:
.92
| Per email Re input files- the files won't change- you'll always be using ribo.dat and nonribo.dat. However, hard coding files is not exaclty good coding practice, so we don't really want to encourage that. You have 2 options- please specifiy in your readme which of the following you have done. a. you can accept the input files on the command line, e.g. "knn ribo.dat nonribo.dat k p"- in that order (this is the preferred method- makes grading easier for us) b. you can assume that ribo.dat and nonribo.dat live in the same directory as your code. If you choose this option please DO NOT submit those files with your code- we'll copy them into the appropriate directory when we do the grading.
|
Data
The data consists of 79 microarray experiments where the expression levels of
2467 genes in S. Cervisiae (baker's yeast) were measured across 79 different
conditions or time points. Of the genes represented on the microarrays, 121 were
previously characterized as ribosomal genes -- the remainder are various other
genes with known functions. The 79 experiments include measurements of
expression taken after environmental changes were imposed on the yeast. For
example, some of these conditions were starvation (causing the yeast to form
spores), changing the sugar supply (causing the yeast to ferment rather than
respire), and synchronizing the cells to force them to pass through the stages
of cell division at the same time. The experiments are described in detail in
the paper:
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.
Cluster analysis and display of genome-wide expression patterns. PNAS.
95:14863-14868.
The current list of known ribosomal genes that were used in this homework can be found at the MIPS database.
The expression patterns for the 121 ribosomal genes can be found in the tab-delimited data file ribo.dat. Each column in the file represents one experimental condition (there are 79 columns in all). The expression patterns for the 2346 non-ribosomeal genes can be found in the data file nonribo.dat. The columns in this file are organized the same way, with 79 columns corresponding to the same order of experiments as in ribosome file. The file experiments.txt lists to which experiment each column corresponds. The common gene name and a short functional annotation for both the ribosomal set and the non-ribosomal set can be found in the files ribo-names.txt and nonribo-names.txt respectively.
K-nearest neighbor classification with microarray data
Implement the K-nearest neighbor strategy discussed in the TA session. The basic idea is the following: given an expression vector for an unclassified gene, find the K "closest" expression vectors from the classified genes and let them vote on the classification for the query gene. If some p percent or more of the K neighbors are ribosomal genes, classify the unknown gene as ribosomal, otherwise classify it as non-ribosomal. Your function should take in a positive training set, a negative training set, a test set, a positive integer for K, and a value between 0 and 1 for p. It should return a vector of predictions -- one prediction for each row in the test set. Use Euclidean distance as your distance metric.
Cross-Validation
To assess the accuracy of a classifier, we typically use
an approach known as n-fold cross-validation. In this strategy we:
We assess the test accuracy (sensitivity, specificity, or a combination of the two) by averaging over all the runs. In your runs you should use 6-fold cross-validation in your computations of accuracy.
K-means clustering with microarray data
Implement the K-means clustering method discussed in class. The basic idea of K-means clustering is the following:
Your program should take a dataset, an interger for K, and a set of centers (or if not specified, pick the centers randomly), and return a vector indicating to which cluster each data point in the dataset belongs to.
To Get Credit: