Mines ParisTech machine learning practical session: Graphical Models, 29 April 2011

We will use R within Emacs+ESS. To set up your R installation, open a terminal and type:

~thocking/bin/mines-ess-setup.sh
emacs graphical-models.R &

1 Understanding the k-means algorithm

Goal: understand how k-means attempts to find k clusters in an unlabeled point cloud.

The animation package in R can be used to visualize how k-means works.

Exercise 1: execute the following code in the R interpreter and use the interactive controls to play the animation and step through frames. If you run the code several times, do you get the same result?

## install the animation package if it is not already available
if(!require(animation)){
  install.packages("animation")
  library(animation)
}
ani.start(loop=FALSE)
kmeans.ani()
ani.stop()

By default, the kmeans.ani() function generates points distributed uniformly at random in a square. Try a different number of clusters using a command like kmeans.ani(centers=5).
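To make the two alternating steps of k-means explicit (assign each point to its nearest center, then move each center to the mean of its points), here is a minimal sketch in base R. The function simple.kmeans and its arguments are hypothetical names used only for illustration; they are not part of the animation package, and in practice you would call stats::kmeans().

```r
## Minimal k-means sketch (illustrative helper, not from the animation package).
simple.kmeans <- function(x, k, max.iter=100){
  x <- as.matrix(x)
  ## initialize the centers with k distinct data points chosen at random
  centers <- x[sample(nrow(x), k), , drop=FALSE]
  cluster <- rep(0L, nrow(x))
  for(iter in seq_len(max.iter)){
    ## assignment step: squared distance from every point to every center
    d <- vapply(seq_len(k), function(j){
      rowSums((x - matrix(centers[j,], nrow(x), ncol(x), byrow=TRUE))^2)
    }, numeric(nrow(x)))
    new.cluster <- max.col(-d, ties.method="first")
    ## stop when no point changes cluster
    if(all(new.cluster == cluster)) break
    cluster <- new.cluster
    ## update step: move each center to the mean of its points
    for(j in seq_len(k)){
      if(any(cluster == j)){
        centers[j,] <- colMeans(x[cluster == j, , drop=FALSE])
      }
    }
  }
  list(cluster=cluster, centers=centers)
}

set.seed(1)
x <- cbind(X1=runif(50), X2=runif(50))
fit <- simple.kmeans(x, 3)
print(table(fit$cluster))
```

Because the initial centers are sampled at random, calling simple.kmeans() again without resetting the seed may give a different partition.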

Exercise 2: apply k-means to the iris data. How many clusters do you think are a good fit for these data?

## visualize the 4d, 3-class iris data using a scatter plot matrix:
library(lattice)
splom(iris[,1:4])

## visualize k-means clustering on the iris data
library(animation)
ani.start(loop=FALSE)
result <- kmeans.ani(iris[,1:4])
ani.stop()
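One common heuristic for choosing the number of clusters, offered here as a sketch rather than as the intended answer, is to compare the total within-cluster sum of squares for several values of k and look for the point where adding clusters stops helping much. The code uses stats::kmeans() rather than kmeans.ani(); the range 1 to 6 and nstart=10 are arbitrary choices.

```r
## total within-cluster sum of squares for k = 1, ..., 6 on the iris measurements
set.seed(1)
k.values <- 1:6
wss <- sapply(k.values, function(k){
  ## nstart=10 keeps the best of 10 random initializations
  kmeans(iris[,1:4], centers=k, nstart=10)$tot.withinss
})
print(data.frame(k=k.values, wss=round(wss, 1)))
```

Calling plot(k.values, wss, type="b") draws the corresponding curve.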

2 Expectation Maximization (EM) for Gaussian Mixture Models (GMMs)

Goal: visualize and understand the Expectation Maximization (EM) algorithm.

We will use a modified version of this code from Wikipedia to visualize how EM is used to fit GMMs.

Exercise 3: Execute this code to show an animation of the level curves of the 2d Gaussian mixture that is fit to these data using EM. If you change the initial parameter estimates (theta in the code), does it change the results of the algorithm?

Exercise 4: edit the code from Exercise 3 to fit a mixture of 3 Gaussian distributions to the data. Do you think this is a better fit for these data?
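For reference, the E and M steps can be written in a few lines for the simplest case, a mixture of two univariate Gaussians. This is a sketch under stated assumptions, not the code used in the session: the function em.gmm.1d, the structure of the theta list, the starting values, and the choice of R's built-in faithful$waiting data are all illustrative.

```r
## EM for a two-component univariate Gaussian mixture (illustrative sketch).
em.gmm.1d <- function(x, theta, n.iter=50){
  loglik <- numeric(n.iter)
  for(iter in seq_len(n.iter)){
    ## E step: weighted densities and responsibilities under the current theta
    p1 <- theta$prop * dnorm(x, theta$mean[1], theta$sd[1])
    p2 <- (1 - theta$prop) * dnorm(x, theta$mean[2], theta$sd[2])
    ## log-likelihood of the current theta, before it is updated
    loglik[iter] <- sum(log(p1 + p2))
    r1 <- p1 / (p1 + p2)
    r2 <- 1 - r1
    ## M step: weighted maximum likelihood estimates
    theta$prop <- mean(r1)
    theta$mean <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
    theta$sd <- c(sqrt(sum(r1 * (x - theta$mean[1])^2) / sum(r1)),
                  sqrt(sum(r2 * (x - theta$mean[2])^2) / sum(r2)))
  }
  list(theta=theta, loglik=loglik)
}

x <- faithful$waiting
## initial parameter estimates (arbitrary starting point)
theta0 <- list(prop=0.5, mean=c(50, 90), sd=c(10, 10))
fit <- em.gmm.1d(x, theta0)
print(fit$theta)
```

The log-likelihood stored in fit$loglik never decreases from one iteration to the next, which is the defining property of EM. Restarting from a different theta0 is a quick way to check whether the algorithm ends up at the same estimates.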

3 An application of Hidden Markov Models (HMMs)

Goal: Use HMMs to analyze chromosomes for copy number variations.

Human cells normally contain 2 copies of each of 23 chromosomes. Cancer cells often carry amplifications or deletions of chromosomal regions that contain key genes. In this study, we will examine copy number measurements along the chromosomes of cancer cells and attempt to identify regions of amplification and deletion.
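As a rough guide to reading such measurements, the short calculation below shows the log ratio one would expect for a single-copy loss or gain relative to the normal 2 copies. It assumes the measurements are base-2 log ratios of measured copy number to normal copy number, which is not stated in this handout; real data are noisy and the observed shifts may be smaller.

```r
## expected log2 ratio relative to the normal 2 copies (idealized values)
copies <- c(loss=1, normal=2, gain=3)
expected.logratio <- log2(copies / 2)
print(round(expected.logratio, 3))
```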

Exercise 5: use the following code to download and plot the cancer cell chromosomal copy number data. Do you notice any regions of amplification or deletion just by looking?

if(!require(GLAD,quietly=TRUE)){
  source("http://bioconductor.org/biocLite.R")
  biocLite("GLAD")
  library(GLAD)
}
data(snijders)
gm13330$Clone <- gm13330$BAC
profiles <- as.profileCGH(gm13330)$profileValues
profiles$Chromosome <- factor(profiles$Chromosome)
library(lattice)
lattice.options(default.args=list(as.table=TRUE))
xyplot(LogRatio~PosBase|Chromosome,profiles,
       strip=strip.custom(strip.names=TRUE))

To start, let's just focus on chromosome 1:

c1 <- subset(profiles,Chromosome==1)
xyplot(LogRatio~PosBase,c1)

Exercise 6: identify regions of gain or loss on chromosome 1 using an HMM. The hidden states will represent copy number: normal, gain, or loss. Each hidden state will have an associated Gaussian distribution. Use the following code, which starts with 2 states, to fit an HMM to the data from chromosome 1. What happens when you change the nStates parameter of HMMFit()?

if(!require(RHmm,quietly=TRUE)){
  install.packages("RHmm")
  library(RHmm)
}
set.seed(2)
## From help(HMMFit): If you fit the model with only one sample, obs
## is ... a vector (for univariate distributions).
c1.model <- HMMFit(c1$LogRatio,nStates=2)
c1.states <- viterbi(c1.model,c1$LogRatio)
means <- c1.model$HMM$distribution$mean[c1.states$states]
make.df <- function(x,signal){
  data.frame(PosBase=c1$PosBase,LogRatio=x,signal)
}
c1.results <- rbind(make.df(c1$LogRatio,"data"),
                    make.df(means,"mean of gaussian of hidden state"))
xyplot(LogRatio~PosBase,c1.results,groups=signal,type="l",
       auto.key=list(lines=TRUE,points=FALSE))
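The Viterbi algorithm finds the most likely sequence of hidden states given the observations and the model parameters. As an illustration of the underlying dynamic program, here is a base R sketch for an HMM with Gaussian emissions. It is not RHmm's implementation: the function viterbi.gauss, the simulated data, and all parameter values are made up for the example, and the sketch assumes at least 2 observations.

```r
## Viterbi decoding for a Gaussian HMM, in log space (illustrative sketch).
viterbi.gauss <- function(obs, init, trans, means, sds){
  n <- length(obs)
  k <- length(means)
  ## log.emit[i, s] = log density of obs[i] under state s
  log.emit <- vapply(seq_len(k), function(s){
    dnorm(obs, means[s], sds[s], log=TRUE)
  }, numeric(n))
  score <- matrix(-Inf, n, k) # best log probability of a path ending in state s
  back <- matrix(0L, n, k)    # previous state on that best path
  score[1,] <- log(init) + log.emit[1,]
  for(i in 2:n){
    for(s in seq_len(k)){
      cand <- score[i-1,] + log(trans[,s])
      back[i,s] <- which.max(cand)
      score[i,s] <- max(cand) + log.emit[i,s]
    }
  }
  ## trace back from the best final state
  states <- integer(n)
  states[n] <- which.max(score[n,])
  for(i in (n-1):1){
    states[i] <- back[i+1, states[i+1]]
  }
  states
}

## simulated signal: normal, then a gained region, then normal again
set.seed(1)
true.states <- rep(c(1, 2, 1), c(30, 20, 30))
state.means <- c(0, 0.5)
obs <- rnorm(length(true.states), state.means[true.states], 0.1)
## trans[r, s] = probability of moving from state r to state s
trans <- matrix(c(0.95, 0.05,
                  0.05, 0.95), 2, 2, byrow=TRUE)
decoded <- viterbi.gauss(obs, init=c(0.5, 0.5), trans, state.means, sds=c(0.1, 0.1))
print(table(true.states, decoded))
```

The high probability of staying in the same state (0.95 here) is what discourages the decoded path from switching states on a single noisy point.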

Exercise 7: fit an HMM to the whole genome using the following code. Does it give the result you expected?

## From help(HMMFit): If you fit the model with more than one sample,
## obs is a list of samples. Each element of obs is then a vector (for
## univariate distributions) ... The samples do not need to have the
## same length.
## keep the chromosome labels, so list element i always matches chroms[i]
chroms <- as.character(unique(profiles$Chromosome))
logratio.list <- lapply(chroms,function(chr){
  subset(profiles,Chromosome==chr)$LogRatio
})
full.model <- HMMFit(logratio.list,nStates=3)
full.states <- viterbi(full.model,logratio.list)
full.results <- do.call(rbind,lapply(seq_along(chroms),function(i){
  df <- subset(profiles,Chromosome==chroms[i])
  mean.df <- df
  ## replace each measurement by the mean of its decoded hidden state
  mean.df$LogRatio <-
    full.model$HMM$distribution$mean[full.states$states[[i]]]
  rbind(data.frame(df,signal="data"),
        data.frame(mean.df,signal="mean of 3 state model"))
}))
xyplot(LogRatio~PosBase|Chromosome,full.results,groups=signal,type="l",
       auto.key=list(points=FALSE,lines=TRUE))

Remember that EM only converges to a local maximum of the log-likelihood, and which one it reaches depends on the starting values. If the fitted states are not the ones you expected, try different initial parameters. The following code initializes the three state means at 0.5, 0 and -0.5 (gain, normal, loss):

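init <- RHmm:::HMMKMeans(logratio.list,3)
init$distribution$mean <- c(0.5,0,-0.5)
full.model <- HMMFit(logratio.list,nStates=3,control=list(initPoint=init))
## decode again with the refitted model, then re-run the full.results
## and xyplot code from Exercise 7 to see the new segmentation
full.states <- viterbi(full.model,logratio.list)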

Author: Toby Hocking

Date: 2011-05-16 19:30:54 CEST
