Mines ParisTech machine learning practical session: Model selection and clustering, 13 May 2011.

Prof. Moutarde's slides: ppt, pdf, print pdf


In this TP we will explore why overfitting occurs in clustering models, implement several criteria for automatically choosing the number of clusters, and compare several clustering algorithms on a collection of data sets.

We will make use of R within Emacs+ESS. To set up your R installation, please open a terminal and type

~thocking/bin/mines-ess-setup.sh
emacs graphical-models.R &


1 Introduction: exploring overfitting

Goal: understand why overfitting occurs, and why we need criteria other than maximum likelihood for model selection.

Exercise 1: randomly generate some 1-dimensional data from a mixture of 3 Gaussians.

set.seed(1)
## simulate some data and plot them
sim <- c(rnorm(20),rnorm(20,5),rnorm(20,8))
plotdata <- function(...){
  hist(sim,freq=FALSE,breaks=10,...)
  points(cbind(sim,0),pch=4)
}
plotdata()

Question: How can we design an algorithm that automatically detects the number of clusters in these N=60 data?

Possible answer: fit a series of models, and pick the one which is most likely.

Exercise 2: fit a single Gaussian distribution to these data. Simply calculate the sample mean and standard deviation of your data; these 2 parameters define a Gaussian distribution. Now calculate the log-likelihood of your data under this model.

## maximum likelihood estimation by hand:
mu <- mean(sim)
sigma <- sd(sim)
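## note: sd() divides by n-1, so sigma is the sample standard
## deviation, which differs slightly from the maximum likelihood estimate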
## plot the estimated distribution:
curve(dnorm(x,mu,sigma),col="red",add=TRUE)
## calculate the log-likelihood by hand:
print(sum(log(dnorm(sim,mu,sigma))))

Exercise 3: In fact, the mclust R package has a function that does the same thing. Install and load the mclust package and use the Mclust() function to fit a 1-component Gaussian Mixture.

if(!require(mclust)){
  install.packages("mclust")
  library(mclust)
}
## G=1 specifies 1 mixture component
## modelNames="E" specifies equal variance mixture components
m1 <- Mclust(sim,G=1,modelNames="E")
print(m1$loglik)

Question: does this agree with the log-likelihood we calculated by hand in Exercise 2?

Exercise 4: calculate the log-likelihood of a model with 2 mixture components.

m2 <- Mclust(sim,G=2,modelNames="E",control=emControl(equalPro=TRUE))
curve(dens(modelName=m2$modelName,data=x,parameters=m2$parameters),
      col="blue",add=TRUE,n=500)
print(m2$loglik)

Question: does the model with 1 or 2 components have a higher log-likelihood? Why?

Exercise 5: what happens to the log-likelihood when we add even more mixture components? Take a look at the following animation and plot to find out.

fit.gmm <- function(G){
  fit <- Mclust(sim,G=G,modelNames="E",control=emControl(equalPro=TRUE))
  plotdata(main=sprintf("mixture with %d components",G))
  curve(dens(modelName=fit$modelName,data=x,parameters=fit$parameters),
        col="blue",add=TRUE,n=500)
  fit$loglik
}
if(!require(animation)){
  install.packages("animation")
  library(animation)
}
ani.start(loop=FALSE,interval=0.5)
log.likelihood <- sapply(1:30,fit.gmm)
ani.stop()
plot(log.likelihood,
     main="adding mixture components increases log likelihood",
     xlab="number of mixture components")

Remember that we use the Expectation Maximization algorithm to find a model which locally maximizes the log-likelihood of the data. So we aren't guaranteed to find the best model, which is why the log-likelihood sometimes decreases when mixture components are added.

In general it is clear that

  • adding more clusters to the model can only increase the maximum achievable log-likelihood, so the model with N clusters (one component per data point) is the "most likely."
  • using the model with N clusters is not likely to generalize well to new data, so we say that it "overfits."
  • if we want to avoid overfitting, we need some other criterion to pick the number of clusters with the best generalization ability.

2 Several model selection criteria for clustering

The main difficulty in clustering is choosing the "correct" number of clusters, which is a specific instance of the model selection problem.

Goal: learn some methods of automatically picking the number of clusters of a data set, and write R code that implements this.

This is a topic of ongoing research, so the current state-of-the-art is to say that the optimal number of clusters is problem-dependent. However, there have been some efforts toward automatically picking the "correct" number of clusters. Several methods are discussed in Hastie, Tibshirani, Friedman "Elements of Statistical Learning" (ESL).

Group Exercise: form a group of 2-5 students and assign each student to a model selection criterion below. Each student should write 2 functions

  • cross.validate.10fold(x,k), bic(x,k), etc. that calculates the model selection statistic for a Gaussian Mixture Model with k components on the matrix of data x.
  • cross.validate.10fold.guess(x) which repeatedly calls cross.validate.10fold(x,k) for several choices of k, then returns the class guesses for the chosen model as an integer vector (a sketch of this interface follows the list). We will use this function in the next section to try to automatically detect the number of clusters in several datasets.
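
As a minimal sketch of the expected interface, here criterion(x,k) stands for whichever statistic you implement (it is just a placeholder name, not an existing function), and Mclust() is used to refit the chosen model:

## try several numbers of components, pick the best, return class guesses
criterion.guess <- function(x,kmax=10){
  scores <- sapply(1:kmax,function(k)criterion(x,k))
  best.k <- which.max(scores) ## assumes bigger is better for your criterion
  Mclust(as.matrix(x),G=best.k)$classification
}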

2.1 Bayesian Information Criterion (BIC)

Chapter 7 of ESL is devoted to the more general topic of model selection and 7.7 introduces the Bayesian Information Criterion (BIC).

Since the 1970s, several model selection criteria (AIC, BIC, Cp, etc.) have been developed. These are motivated using mathematical arguments about the asymptotic behavior of distributions in the exponential family.
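
As a concrete starting point, here is a small sketch of a bic(x,k) function. It simply reuses the BIC value that Mclust() already computes (on mclust's scale, BIC = 2 log L - (number of free parameters) log n, so larger values are better), and checks it by hand on the one-component model m1 fitted in section 1, which has 2 free parameters (mean and variance):

bic <- function(x,k){
  Mclust(as.matrix(x),G=k)$bic
}
## sanity check by hand: these two values should agree for m1
print(2*m1$loglik - 2*log(length(sim)))
print(m1$bic)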

2.2 Cross-validation

The idea is developed in general in section 7.10 of ESL. However, a more specific description is given by Padhraic Smyth in Clustering using Monte Carlo Cross-Validation. The basic idea is to split the data set into train and test sets. We fit the probabilistic clustering model using the training points, and then calculate the log-likelihood of the test points under the fitted model. We pick the number of clusters k which maximizes the log-likelihood of the test data.

How to choose the specific train/test splits? One way is to randomly divide your data into F folds of equal size. Then for each fold f, treat fold f as the test data, and the other folds as the training data. Then you calculate the average log-likelihood of the test data over these F folds. This procedure is called F-fold cross-validation.
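
A minimal sketch of F-fold cross-validation for a k-component Gaussian mixture might look like the following; the function name follows the group exercise, and dens() from mclust is used to evaluate the density of the held-out points:

cross.validate.10fold <- function(x,k,n.folds=10){
  n <- if(is.matrix(x))nrow(x) else length(x)
  extract <- function(keep)if(is.matrix(x))x[keep,,drop=FALSE] else x[keep]
  fold <- sample(rep(1:n.folds,length.out=n))
  test.loglik <- sapply(1:n.folds,function(f){
    fit <- Mclust(extract(fold!=f),G=k)
    ## log-likelihood of the held-out fold under the fitted model
    sum(log(dens(modelName=fit$modelName,data=extract(fold==f),
                 parameters=fit$parameters)))
  })
  mean(test.loglik)
}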

Another way to split is by randomly dividing the data in half, and treating one half as train and the other half as test. Repeat several times and take the average to get a mean log-likelihood of the test data. This procedure is called Monte Carlo cross-validation.
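
The Monte Carlo variant only changes how the splits are drawn; a rough sketch (again, the name is just illustrative):

monte.carlo.cv <- function(x,k,reps=20){
  n <- if(is.matrix(x))nrow(x) else length(x)
  extract <- function(keep)if(is.matrix(x))x[keep,,drop=FALSE] else x[keep]
  mean(sapply(1:reps,function(r){
    train <- sample(n,floor(n/2)) ## random half for training
    fit <- Mclust(extract(train),G=k)
    sum(log(dens(modelName=fit$modelName,data=extract(-train),
                 parameters=fit$parameters)))
  }))
}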

2.3 The Gap statistic

In ESL 14.3 Cluster analysis they discuss several methods of clustering and in 14.3.11 Practical issues they discuss the Gap statistic for automatically picking the number of clusters.
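
A rough sketch of the gap statistic for a single k, using the within-cluster dispersion from kmeans() and B reference data sets drawn uniformly over the range of each feature (see ESL 14.3.11 for the full procedure, including the standard-error rule for choosing k):

gap.statistic <- function(x,k,B=20){
  x <- as.matrix(x)
  W <- function(d,k)sum(kmeans(d,k,nstart=10)$withinss)
  log.W.ref <- sapply(1:B,function(b){
    ## reference data: uniform over the range of each column
    ref <- apply(x,2,function(col)runif(nrow(x),min(col),max(col)))
    log(W(ref,k))
  })
  mean(log.W.ref) - log(W(x,k))
}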

3 Choosing the right clustering method

The other major problem in clustering is choosing a clustering model that accurately captures the shape of the data. Not all data come in the form of Gaussian distributions, and so in this section we will explore some other clustering methods that have been proposed to deal with these more exotic data.

There is a database of several simple clustering problems available at http://www.uni-marburg.de/fb12/datenbionik/data

Goal: Apply several clustering methods to several different data sets to get an idea about which clusterings work for which kinds of data.

Exercise 1: download and load the data sets into R. The following code downloads and unzips the archive into your ~/R/FCPS directory, reads each data set and its "real" labels, and stores them as a list in the problems variable.

if(!file.exists("~/R"))dir.create("~/R")
setwd("~/R")
if(!file.exists("FCPS")){
  zipfile <- "FCPS.zip"
  dataurl <- "http://www.uni-marburg.de/fb12/datenbionik/downloads/FCPS"
  download.file(dataurl,zipfile)
  system(paste("unzip",zipfile))
}
setwd("FCPS/01FCPSdata")
lrnfiles <- Sys.glob("*.lrn")
bases <- gsub(".lrn","",lrnfiles)
problems <- lapply(bases,function(N){
  print(N)
  lrnfile <- sprintf("%s.lrn",N)
  clsfile <- sprintf("%s.cls",N)
  x <- read.table(lrnfile,skip=4,header=FALSE,row.names=1)
  colnames(x) <- rownames(x) <- NULL
  skip <- switch(N,Chainlink=3,WingNut=0,1)
  y <- read.table(clsfile,skip=skip,header=FALSE,row.names=1)[,1]
  list(data=as.matrix(x),class=data.frame(truth=y))
})
names(problems) <- bases
problems$iris <- list(data=as.matrix(iris[,-5]),
                      class=data.frame(truth=as.integer(iris$Species)))
if(!require(kernlab)){
  install.packages("kernlab")
  library(kernlab)
}
data(spirals)
sc <- specc(spirals,centers=2)
problems$spiral <- list(data=spirals,class=data.frame(truth=sc@.Data))
problems <- lapply(problems,structure,class="clusterprob")

Exercise 2: run these 2 simple clustering algorithms on each of the data sets.

### This should be a named list of clustering functions.
### Each function should take a matrix of data to cluster 
### and return a vector of class guesses as integers.
clusterings <-
  list(kmeans2=function(x){
    fit <- kmeans(x,2)
    fit$cluster
  },spectral2=function(x){
    sc <- specc(x, centers=2)
    sc@.Data
  })
## run all algorithms on all problems
for(method in names(clusterings)){
  cat(method)
  clusterfun <- clusterings[[method]]
  for(dataname in names(problems)){
    cat("",dataname)
    mat <- problems[[dataname]]$data
    if(length(grep("spectral",method))&&nrow(mat)>300){
      cat("...") ##skip, would take too long!
    }else{ ## store results in $class element
      problems[[dataname]]$class[,method] <- clusterfun(mat)
    }
  }
  cat("\n")
}

Exercise 3: make a table that shows how well each algorithm worked on each dataset. We use the normalized Rand index to measure how well a clustering agrees with the true labels. It equals 1 for perfect agreement and is near 0 for a completely random clustering (it can even be slightly negative), so bigger values mean better clusterings. Here we multiply by 100 and round to look at the table, so a perfect score is 100.

norm.rand.index <- function(klass,guess){
  n <- table(klass,guess)
  ch2 <- function(x)x*(x-1)/2
  sumi <- sum(ch2(rowSums(n)))
  sumj <- sum(ch2(colSums(n)))
  expected <- sumi*sumj/ch2(sum(n))
  numerator <- sum(ch2(n))-expected
  denominator <- (sumi+sumj)/2-expected
  numerator/denominator
### measure of correspondence between partitions (Hubert and Arabie
### 1985) 1=perfect, 0=completely random, or just the same label for
### every point.
}
goodness <- sapply(names(clusterings),function(method){
  sapply(problems,function(L){
    cl <- L$class
    if(method%in%names(cl)){
      norm.rand.index(cl[,"truth"],cl[,method])
    }else NA
  })
})
print(round(goodness*100))

Question: which clustering method seemed to work for which data sets?

Exercise 4: to understand the scores, it is useful to look at the class guesses by plotting the data. Load the following function into R and then plot the class guesses using plot(problems$iris), plot(problems$spiral), etc.

plot.clusterprob <- function(L){
  require(lattice)
  comb <- do.call(rbind,lapply(names(L$class),function(N){
    data.frame(method=N,class=L$class[,N],L$data)
  }))
  if(ncol(L$data)==2){
    xyplot(X2~X1|method,comb,groups=class,aspect="iso")
  }else if(ncol(L$data)==3 && require(rgl)){
    for(N in colnames(L$class)){
      rgl.open()
      rgl.bg(color="white")
      plot3d(L$data,col=L$class[,N],aspect=1,main=N)
    }
  }else{
    splom(~comb[,-(1:2)]|method,comb,groups=class,aspect="iso")
  }
}

Exercise 5: go back to the clusterings variable we defined in Exercise 2. Add some more clustering algorithms, using the automatic selection procedures you developed earlier, and the following standard algorithms:

  • K-means is implemented in the kmeans() function.
  • Gaussian Mixture Models using library(mclust).
  • Hierarchical clustering is implemented by the hclust(d,method="average") function, where d is a dissimilarity matrix as produced by dist() and method is the linkage criterion; use cutree() to extract class guesses (see the sketch after this list).
  • Spectral clustering is implemented in specc() from library(kernlab).
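
For example, new entries could be added to clusterings like this (a sketch only; the cluster counts are hard-coded to 2 here, but this is exactly where your automatic selection procedures can be plugged in):

clusterings$gmm2 <- function(x){
  ## Gaussian mixture with 2 components, class guesses from Mclust
  Mclust(as.matrix(x),G=2)$classification
}
clusterings$hclust.average2 <- function(x){
  ## average-linkage hierarchical clustering cut into 2 clusters
  cutree(hclust(dist(x),method="average"),k=2)
}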

Question: are you able to define an algorithm that works for all the data sets?

Author: Toby HOCKING
