PRINCIPAL COMPONENTS ANALYSIS TO SUMMARIZE MICROARRAY EXPERIMENTS:
APPLICATION TO SPORULATION TIME SERIES
Soumya Raychaudhuri, Joshua M. Stuart, and Russ B. Altman
Stanford Medical Informatics
Stanford University, 251 Campus Drive, MSOB X-215,
Stanford CA 94305-5479
{sxr, stuart, altman}@smi.stanford.edu
The enormous amount of data produced by microarray experiments
can be unwieldy. A given series of microarray experiments produces observations
of differential expression for thousands of genes across multiple conditions.
These large data sets can be summarized with principal components analysis
(PCA), a statistical technique that allows the key variables (or combinations
of variables) in a multidimensional data set to be identified. Principal
components analysis determines those key variables in the data that best
explain the differences in the observations. Here we show the utility of
applying PCA to expression data, where the experimental conditions are
the variables, and the gene expression measurements are the observations.
Thus, each component defines a linear combination of the experimental conditions
that can be used to distinguish genes parsimoniously. Examination of the
components also provides insight into what underlying factors are actually
being measured in the experiment. We applied PCA to the publicly released
yeast sporulation data set (Chu et al. 1998). In that work, gene expression
was measured at 7 time points. PCA on the time points suggests that much of the observed
variability in the experiment can be summarized in just 2 components; that is,
2 variables capture most of the information. These underlying factors appear
to represent (1) overall induction level and (2) change in induction level
over time. A visualization of our results is available at http://www.smi.stanford.edu/projects/helix/PCArray.
The links below open VRML files that plot each gene in the data set as a small line segment in a 2D or 3D plot. Each gene is hotlinked to the corresponding open reading frame (ORF) in the Saccharomyces Genome Database.
1. VRML source file with all yeast genes projected onto the first two principal component axes.
2. VRML source file with all yeast genes projected onto the first three principal component axes.
Viewing VRML files requires a browser plug-in, such as those available
at http://home.netscape.com/plugins/3d_and_animation.html.
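
The setup described in the abstract, with experimental conditions as the variables and genes as the observations, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes mean-centering of each time point and computes the components by singular value decomposition. The data are synthetic stand-ins generated from two made-up factors (a per-gene level and a per-gene trend over time), not the sporulation measurements. The names `pca`, `expression`, `level`, and `trend` are hypothetical.

```python
import numpy as np

def pca(data):
    """PCA of an (observations x variables) matrix via SVD.

    Returns (components, explained_ratio, scores):
      components      rows are unit-length principal axes in variable space
      explained_ratio fraction of total variance captured by each component
      scores          observations projected onto the components
    """
    # Assumption: center each variable (time point) across all genes.
    centered = data - data.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (data.shape[0] - 1)
    explained_ratio = variances / variances.sum()
    scores = centered @ vt.T
    return vt, explained_ratio, scores

# Synthetic stand-in data: 500 hypothetical genes measured at 7 time points.
rng = np.random.default_rng(0)
n_genes, n_times = 500, 7
t = np.linspace(-1.0, 1.0, n_times)                 # arbitrary time axis
level = rng.normal(0.0, 2.0, size=(n_genes, 1))     # per-gene overall level
trend = rng.normal(0.0, 1.0, size=(n_genes, 1))     # per-gene change over time
noise = rng.normal(0.0, 0.1, size=(n_genes, n_times))
expression = level + trend * t + noise

components, explained_ratio, scores = pca(expression)
print("variance explained by first two components:", explained_ratio[:2].sum())

# Coordinates of each gene on the first two principal component axes,
# i.e. the kind of 2D projection the visualization plots.
gene_coords_2d = scores[:, :2]
```

Because the synthetic data are built from only two factors plus small noise, the first two components should capture nearly all of the variance. On real expression data that fraction has to be read from `explained_ratio`, not assumed.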