--- title: "Introduction to supervised changepoint detection" author: Toby Dylan Hocking, McGill University, Montreal, Canada output: html_document: toc: true toc_depth: 2 --- See the GitHub repository for links to source code and exercises: https://github.com/tdhock/change-tutorial Before executing the code in this tutorial Rmd, make sure to install the required packages: ```{r packages} packages.R <- if(file.exists("packages.R")){ "packages.R" }else{ "https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R" } source(packages.R) options(width=100) ``` # What is the difference between unsupervised and supervised changepoint detection? There are two features of a data set that make it possible to do a supervised changepoint analysis: * The data consist of **several sequences** with similar signal/noise patterns, which are treated as separate changepoint detection problems. * Some data sequences have **labels** which indicate specific regions where the model should predict presence or absence of changepoints. The goal of the supervised analysis is to find a model that learns from the limited labeled data, and provides accurate changepoint predictions for all of the data (a different number of changepoints for each separate data sequence). ## Neuroblastoma data set We will use the neuroblastoma data set to explore supervised changepoint detection. ```{r neuroblastoma} data(neuroblastoma, package="neuroblastoma") str(neuroblastoma) lapply(neuroblastoma, head, 10) ``` The neuroblastoma data set consists of two data.frames: * profiles contains the noisy data (logratio), which is approximate DNA copy number (measured at chromosome, position), for a particular patient (profile.id). Every unique combination of (profile.id, chromosome) defines a separate multiple changepoint detection problem. The logratio measures how many copies of each part of the human genome are present in the patient's tumor. Normally there are two copies of each chromosome (logratio=0), but in cancer samples there can be gains (changes up) and losses (changes down) of specific regions or entire chromosomes. * annotations contains the labels, which indicate presence (breakpoint) or absence (normal) of changepoints in specific regions (profile.id, chromosome, min, max) of the data. These data come from children who were treated at the Institut Curie (Paris, France). For six children we also have follow-up data on whether they recovered (ok) or had a relapse, several years after treatment: ```{r clinical} followup <- data.frame( profile.id=paste(c(10, 8, 4, 6, 11, 1)), status=c("ok", "relapse", "relapse", "ok", "ok", "relapse")) followup ``` ## Unsupervised: no labels, model selection via theory When there are no labels, the input for a changepoint analysis just uses the noisy DNA copy number data in `neuroblastoma$profiles$logratio`. We begin by selecting the profiles of the six patients for which we have follow-up data. 
```{r six}
rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
  some <- subset(all.profiles, paste(profile.id) %in% some.ids)
  status.profile <- followup[paste(some$profile.id), "status.profile"]
  some$status.profile <- ifelse(
    is.na(status.profile),
    paste(some$profile.id),
    status.profile)
  data.table(some)
}
six.profiles <- someProfiles(neuroblastoma$profiles)
```

For these six patients, the total number of separate changepoint detection problems is 24 chromosomes x 6 patients = 144 problems. We plot each problem below in a separate facet (panel).

```{r plotProfiles, fig.width=10}
library(ggplot2)
gg.unsupervised <- ggplot()+
  ggtitle("unsupervised changepoint detection = only noisy data sequences")+
  theme(
    panel.margin=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
  geom_point(aes(position/1e6, logratio),
    data=six.profiles,
    shape=1)+
  scale_x_continuous(
    "position on chromosome (mega bases)",
    breaks=c(100, 200))+
  scale_y_continuous(
    "logratio (approximate DNA copy number)",
    limits=c(-1,1)*1.1)
print(gg.unsupervised)
```

Note the facet titles on the right:

* The top three profiles are "ok" because those children completely recovered.
* The bottom three profiles are "relapse" because the neuroblastoma cancer came back, and required another treatment at the hospital.

**Question:** can you visually identify any differences between the "ok" and "relapse" profiles?

How do we choose the number of changepoints? In the unsupervised setting, our only option is to use theoretical/statistical arguments, which give information criteria such as the Akaike Information Criterion (AIC, penalty=2) or the Bayesian/Schwarz Information Criterion (BIC/SIC, penalty=log n). More on this later.

## Supervised: learn a penalty function with minimal incorrect labels

In supervised changepoint detection, there are labels which indicate presence and absence of changepoints in particular data subsets. For the same set of 6 profiles, we superimpose the labels in the plot below.

```{r plotLabels, fig.width=10}
six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
  "1change"="#ff7d7d",
  breakpoint="#a445ee",
  normal="#f6c48f"
)
gg.supervised <- gg.unsupervised+
  ggtitle(paste(
    "supervised changepoint detection = data + labels",
    "that indicate specific regions with or without changepoints"))+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=six.labels)+
  scale_fill_manual("label", values=change.colors)+
  theme(legend.position="bottom")
print(gg.supervised)
```

**New function:** `geom_tallrect` is like `geom_rect`, but the ymin is always the bottom of the panel and the ymax is always the top of the panel. These are useful for drawing labeled intervals of x values.

It is clear from this plot that the neuroblastoma data satisfy the two criteria which are necessary for a supervised changepoint analysis:

* The data consist of **several sequences** with similar signal/noise patterns, which are treated as separate changepoint detection problems.
* Some data sequences have **labels** which indicate specific regions where the model should predict presence or absence of changepoints.

As we will see later in the tutorial, the labels can be used for choosing the best model and parameters.
The main idea is that the changepoint model should be consistent with the labels:

* Every breakpoint/positive label is a region where the model should predict at least one changepoint. If the model predicts no changepoints in a region with a breakpoint label, that is considered a false negative.
* Every normal/negative label is a region where the model should predict no changepoints. If the model predicts one or more changepoints in a region with a normal label, that is considered a false positive.
* The overall goal is to find a changepoint model that minimizes the number of incorrectly predicted labels, which is defined as the sum of false positives and false negatives.

## Label error of an unsupervised changepoint model

In this section we show that typical unsupervised changepoint models result in prediction errors with respect to the labels in the neuroblastoma data. We begin by using the `changepoint::cpt.mean` function to compute an unsupervised changepoint model for each data sequence. To compute the number of incorrect labels later, make sure to save a description of the model in a "tidy" data.frame (observations on rows and variables on columns). In the code below we

* fit the model inside `by=list(profile.id, status.profile, chromosome)`, which adds those three columns to the resulting `data.table`.
* add the `pen.name` column to identify the model.
* use the `changes` list column to save the changepoint positions (in the same units as the label positions: bases on the chromosome).

```{r unsupervised-models}
pen.name <- "SIC0"
(unsupervised.models <- six.profiles[, {
  fit.pelt <- changepoint::cpt.mean(
    logratio, penalty=pen.name, method="PELT")
  end <- fit.pelt@cpts
  before.change <- end[-length(end)]
  after.change <- before.change+1L
  data.table(
    pen.name,
    pen.value=fit.pelt@pen.value,
    changes=list(
      as.integer((position[before.change]+position[after.change])/2)
    ))
}, by=list(profile.id, status.profile, chromosome)])
```

Next, we convert the list column `changes` to tall format (one row for each changepoint).

```{r unsupervised-changes}
(unsupervised.changes <- unsupervised.models[, data.table(
  change=changes[[1]]
), by=list(profile.id, status.profile, chromosome, pen.name)])
```

**New function:** we now introduce the `penaltyLearning::labelError` function, which computes the number of incorrect labels for a set of predicted changepoints. Its inputs are 3 data.frames:

* `models` one row per changepoint model.
* `labels` one row for each label, with columns for label location and type.
* `changes` one row for each predicted changepoint, with a column for the predicted changepoint position.

```{r unsupervised-error-list}
unsupervised.error.list <- penaltyLearning::labelError(
  unsupervised.models, six.labels, unsupervised.changes,
  problem.vars=c("status.profile", "chromosome"),
  model.vars="pen.name",
  change.var="change")
```

The result of `labelError` is a list of two data tables:

* `model.errors` has one row per (sequence,model), with the total number of incorrect labels. We use these totals in the `ggtitle` of the plot below.
* `label.errors` has one row per (sequence,model,label), with the fp/fn status of each label. We use these to visualize false negative and false positive predictions using the `geom_tallrect` with `aes(linetype=status)` below.
```{r gg-unsupervised-bic, fig.width=10}
gg.supervised+
  theme(legend.box="horizontal")+
  unsupervised.error.list$model.errors[, ggtitle(paste0(
    "Unsupervised ", pen.name, " penalty has ",
    sum(errors), " incorrect labels: ",
    sum(fp), " FP + ", sum(fn), " FN"))]+
  geom_vline(aes(
    xintercept=change/1e6),
    color="green",
    size=1,
    linetype="dashed",
    data=unsupervised.changes)+
  geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=unsupervised.error.list$label.errors)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))
```

Note that in the plot above we also plotted the predicted changepoints using a green dashed `geom_vline`. Labels with inconsistent changepoints are shown with a black border in the `geom_tallrect`, with `scale_linetype_manual` to show the type of prediction error:

* **False positives** are labels with too many predicted changepoints, drawn with a solid black border (none here but we will see some later).
* **False negatives** are labels with too few predicted changepoints, drawn with a dotted black border.
* **Correct** labels with no border have an acceptable number of predicted changepoints.

**Remark:** the result above (BIC/SIC predicts too few changepoints, resulting in false negatives) is specific to the neuroblastoma data set. However, in general you can expect that unsupervised penalty functions will commit changepoint detection errors which are visually obvious. This is the main motivation of supervised changepoint analysis: we can use labels to learn a penalty function with improved changepoint detection accuracy.

**Exercise 1 during class:** try changing the `pen.name` variable to another theoretically-motivated unsupervised penalty. Other options from `help(cpt.mean)` are SIC (without a zero), MBIC, AIC, and Hannan-Quinn. Can you find a penalty with improved changepoint detection accuracy? Also try changing the `some.ids` variable (to some profile IDs from 1 to 603) to see the changepoints and label errors for another set of profiles.

## Creating labels via visual inspection

To quickly create labels for many data sequences, I recommend using a GUI in which you can use the mouse to draw label rectangles on the scatterplots. I wrote [annotate_regions.py](https://pypi.python.org/pypi/annotate_regions/1.0) for labeling the neuroblastoma data set, and [SegAnnDB](https://github.com/tdhock/SegAnnDB) for labeling general DNA copy number profile data. You may want to write another GUI for your particular data -- make sure it supports visualizing the data sets and interactive labeling.

Even if you don't have GUI software, you can always just plot the data in R, and then write the labels in the R code. Begin by plotting one data set by itself to see the details of the signal and noise.

```{r zoom}
zoom.profile <- 4 #Exercise: change this value!
zoom.chromosome <- 14 #Exercise: change this value!
zoom.pro <- subset(
  neuroblastoma$profiles,
  profile.id==zoom.profile & chromosome==zoom.chromosome)
zoom.gg <- ggplot()+
  geom_point(aes(position/1e6, logratio),
    data=zoom.pro,
    shape=1)+
  scale_y_continuous(
    "logratio (approximate DNA copy number)")
print(zoom.gg)
```

Based on your visual interpretation of the plot above, you can create a set of labels.

* For each region that contains at least one changepoint, create a "breakpoint" label (no predicted changes inside is a false negative).
* For each region that contains no changepoints, create a "normal" label (one or more predicted changes inside is a false positive).
* For each region that contains exactly one changepoint, create a "1change" label (no predicted changes inside is a false negative, and two or more is a false positive).

The region for each label can be as small or as large as you want. Because the goal of learning is to minimize the number of incorrectly predicted labels, it is important to make sure that each label is a correct representation of the changepoint model you want. Avoid labeling any regions for which you are not sure about the label. Save your labels to a data.frame with the label type and positions, as below.

```{r manual-label-fun}
label <- function(min, max, annotation){
  data.frame(
    profile.id=paste(zoom.profile),
    chromosome=paste(zoom.chromosome),
    min, max, annotation)
}
zoom.labels <- rbind(# Exercise: change these values!
  label(70e6, 90e6, "1change"),
  label(20e6, 60e6, "normal"))
zoom.gg.lab <- zoom.gg+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=zoom.labels)+
  scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
```

The plot above shows the noisy data along with the labels. Note that it is possible to have multiple labels per data sequence, even though that is not the case for the neuroblastoma data set.

# A typical supervised changepoint analysis

In this section we explain how to perform a supervised changepoint analysis on a labeled data set.

## Overview of supervised changepoint computations

A typical supervised changepoint analysis consists of the following computations:

* [Changepoint models of increasing complexity](#changepoint-models-of-increasing-complexity). For each of several labeled segmentation problems (data sequences that are separate but have a similar signal/noise pattern), use your favorite changepoint detection package to compute a *sequence* of models of increasing complexity (say from 0 to 20 changepoints).
* [Label error](#label-error). Use `penaltyLearning::labelError` to compute the number of incorrect labels for each labeled segmentation problem and each changepoint model. The goal of learning is to minimize the number of incorrectly predicted labels.
* [Model selection](#model-selection). Use `penaltyLearning::modelSelection` to compute the exact path of models that will be selected for every possible non-negative penalty value.
* [Outputs](#outputs). Use `penaltyLearning::targetIntervals` to compute a target interval of log(penalty) values that predicts the minimum number of incorrect labels for each segmentation problem. Create a target interval matrix (one row for each segmentation problem, 2 columns) which can be used as the output in the supervised machine learning problem.
* [Inputs](#inputs). Compute a feature matrix (segmentation problems x features) using `penaltyLearning::featureMatrix`. Or do it yourself using simple statistics of each segmentation problem (quantiles, mean, number of data points, estimated variance, etc) -- see the sketch just after this list.
* [Learning and prediction](#learning-and-prediction). Use `survival::survreg` or `penaltyLearning::IntervalRegressionCV` to learn a penalty function.
* [Evaluation](#evaluation). Use `penaltyLearning::ROChange` to compute ROC curves using labels in a test set (that were not used for training). The AUC (area under the ROC curve) and the percent incorrect labels in the test set can be used to evaluate the prediction accuracy of your model.
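As a small illustration of the do-it-yourself option in the [Inputs](#inputs) step, the sketch below computes a few simple statistics of one data sequence. The `simple.features` helper and the simulated data are hypothetical (not part of `penaltyLearning`); they only show the kind of feature vector you could compute by hand.

```{r simple-features-sketch}
## Hypothetical helper: simple statistics of one data sequence that
## could serve as features for predicting log(penalty).
simple.features <- function(data.vec){
  c(n=length(data.vec),
    log.n=log(length(data.vec)),
    loglog.n=log(log(length(data.vec))),
    mean=mean(data.vec),
    var=var(data.vec),
    quartiles=quantile(data.vec, c(0.25, 0.5, 0.75)))
}
## Toy usage with simulated data:
simple.features(rnorm(100))
```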
The mathematical details are described in [our ICML'13 paper, Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression](http://proceedings.mlr.press/v28/hocking13.html).

## Changepoint models of increasing complexity

To begin we will perform the supervised changepoint computations on just one of the labeled neuroblastoma data sequences:

```{r zoom-gg-lab}
zoom.gg.lab <- ggplot()+
  geom_point(aes(position/1e6, logratio),
    data=zoom.pro,
    shape=1)+
  scale_y_continuous(
    "logratio (approximate DNA copy number)")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=zoom.labels)+
  scale_fill_manual("label", values=change.colors)
print(zoom.gg.lab)
```

We use the code below to compute a sequence of maximum likelihood Gaussian models with $s\in\{1,\dots,7\}$ segments. In contrast, a typical unsupervised analysis will include computation of *one* changepoint model. In supervised analysis we need to compute *a sequence of changepoint models of increasing complexity* in order to determine which are consistent and inconsistent with the labels. Remember, the goal of machine learning is to predict a model with complexity consistent with the labels (minimize the number of incorrectly predicted labels).

**New function:** `Segmentor3IsBack::Segmentor` computes maximum likelihood segmentations from 1 to Kmax segments. Inputs are the data sequence to segment, the likelihood (model=2 means normal/Gaussian), and the maximum number of segments (Kmax). For a supervised changepoint analysis, Kmax should be rather large, but how do you know if it is large enough? You should check to make sure there is at least one model with a false positive (a labeled region with too many predicted changepoints), if that is possible (not possible for data sequences with only breakpoint labels). These false positives in the training labels/models are used to learn a penalty function that avoids predicting too many changepoints. More details below in the [Outputs](#outputs) section.

```{r zoomSegmentor}
max.segments <- 7
(fit <- Segmentor3IsBack::Segmentor(
  zoom.pro$logratio, model=2, Kmax=max.segments))
```

The Segmentor `fit` object is an S4 class with slots

* `breaks` an integer matrix of the end positions of each segment.
* `parameters` a numeric matrix of the mean of each segment.

Next, we convert the model to tidy format (a `data.frame` with one row per segment). The code below is specific to the `Segmentor` function; if you use a different changepoint detection package/function, you will have to write some other code to tidy the data.

```{r zoom-segs}
zoom.segs.list <- list()
zoom.loss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
  end <- fit@breaks[n.segments, 1:n.segments]
  data.before.change <- end[-n.segments]
  data.after.change <- data.before.change+1
  pos.before.change <- as.integer(
    (zoom.pro$position[data.before.change]+
     zoom.pro$position[data.after.change])/2)
  start <- c(1, data.after.change)
  chromStart <- c(zoom.pro$position[1], pos.before.change)
  chromEnd <- c(pos.before.change, max(zoom.pro$position))
  seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
  zoom.segs.list[[n.segments]] <- data.table(
    profile.id=paste(zoom.profile),
    chromosome=paste(zoom.chromosome),
    n.segments, # model complexity.
    start, # in data points.
    end,
    chromStart, # in bases on chromosome.
    chromEnd,
    mean=seg.mean.vec)
  data.mean.vec <- rep(seg.mean.vec, end-start+1)
  stopifnot(length(data.mean.vec)==nrow(zoom.pro))
  zoom.loss.vec[n.segments] <- sum((zoom.pro$logratio-data.mean.vec)^2)
}
(zoom.segs <- do.call(rbind, zoom.segs.list))
```

Above we show the data.table of optimal segment means. Below we compute the data.table of predicted changepoint positions, then plot the models.

```{r zoom-changes}
(zoom.changes <- zoom.segs[1 < start, data.table(
  profile.id, chromosome, n.segments,
  changepoint=chromStart)])
zoom.gg.models <- zoom.gg.lab+
  theme(
    panel.margin=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(n.segments ~ .)+
  geom_vline(aes(
    xintercept=changepoint/1e6),
    data=zoom.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=zoom.segs,
    size=1,
    color="green")
print(zoom.gg.models)
```

The plot above shows a panel for each model from 1 to 7 segments (0 to 6 changepoints). Each segment mean is drawn using a green horizontal `geom_segment`, and each changepoint is drawn using a green dashed `geom_vline`.

For more details about the `Segmentor` function, read the references below. To save time during class we will skip to the next section, "Label error."

* Let there be $n$ separate data sequences to segment. For example in the previous section we plotted $n=144$ data sequences, each a separate changepoint detection problem.
* For any data sequence $i\in\{1,\dots,n\}$, let $d_i$ be the number of data points in data sequence $i$, and let $\mathbf z_i\in\mathbb R^{d_i}$ be the vector of data points to segment for that sequence.
* The Segment Neighborhood problem with $s$ segments for data sequence $i$ is to find the most likely $s-1$ changepoints:

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j)
\text{ subject to }
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1})=s-1.
$$

For the Normal homoscedastic model we have

* the optimization variable $\mathbf m = (m_1 \cdots m_{d_i})$ is a vector of $d_i$ real-valued segment mean variables.
* the loss function $\ell$ is the square loss $\ell(z,m)=(z-m)^2$.
* the indicator function $I$ is 1 for every change-point where $m_{j} \neq m_{j+1}$, and 0 otherwise.
* the objective function is the total loss over all $d_i$ data points, which is convex.
* the constraint is that the mean $\mathbf m$ must have exactly $s-1$ changes, which is non-convex.
* overall the optimization problem is non-convex, which means that gradient-based algorithms are not guaranteed to compute the global minimum. However the dynamic programming algorithms discussed below can be used to compute the global minimum in a reasonable amount of time.

A few references about the algorithm implemented in the `Segmentor` function:

* the "Segment Neighborhood" problem was originally described by [Auger and Lawrence (1989)](https://www.ncbi.nlm.nih.gov/pubmed/2706400). They proposed an $O(S_{\text{max}} d^2)$ algorithm for computing the most likely models with $s\in\{1,\dots,S_{\text{max}}\}$ segments, for $d$ data points. An R implementation of their algorithm is available as `changepoint::cpt.mean(method="SegNeigh")` but it is very slow for large $d$, due to the quadratic time complexity.
* [Rigaill (2010) discovered a Pruned Dynamic Programming Algorithm (PDPA)](https://arxiv.org/abs/1004.0887) which solves the same "Segment Neighborhood" problem.
Its $O(S_{\text{max}}d\log d)$ time complexity makes it much faster for larger data sets. The original R implementation `cghseg:::segmeanCO` only works for the Gaussian/normal likelihood. * [Cleynen et al (2014) generalized the PDPA to other loss/likelihood models](https://almob.biomedcentral.com/articles/10.1186/1748-7188-9-6). Their R function `Segmentor3IsBack::Segmentor` implements the $O(S_{\text{max}}d\log d)$ PDPA for Poisson, Normal homoscedastic (uniform variance -- change in mean), Negative Binomial, Normal Heteroscedastic (uniform mean -- change in variance), and Exponential likelihood models. **Exercise after class:** instead of using `Segmentor` to compute a sequence of models, try using `changepoint::cpt.mean(penalty="CROPS")`, [arXiv:1412.3617](https://arxiv.org/abs/1412.3617). Then re-do the rest of the analyses below. Another option is `cghseg:::segmeanCO`. Using any of these three packages you should be able to get the same results. ## Label error Next, we compute the number of incorrect labels for each model, using the `penaltyLearning::labelError` function that we saw above. We then plot the label errors. ```{r zoomlabelerror} zoom.models <- data.table( profile.id=paste(zoom.profile), chromosome=paste(zoom.chromosome), loss=zoom.loss.vec, n.segments=as.numeric(1:max.segments)) zoom.error.list <- penaltyLearning::labelError( zoom.models, zoom.labels, zoom.changes, change.var="changepoint", problem.vars=c("profile.id", "chromosome")) zoom.gg.models+ penaltyLearning::geom_tallrect(aes( xmin=min/1e6, xmax=max/1e6, linetype=status), data=zoom.error.list$label.errors, fill=NA)+ scale_linetype_manual("error type", values=c( correct=0, "false negative"=3, "false positive"=1)) ``` It is clear from the plot above that each model has a different number of labels which are correctly predicted. Models with too few changepoints cause false negatives, and models with too many changepoints cause false positives. However there is a subset of models with minimal incorrect labels. The goal of a supervised changepoint detection algorithm is to predict one of those models. ## Model selection Rather than using a regression model to directly predict the integer-valued number of segments/changepoints, we will instead learn a regression model that predicts the real-valued log(penalty). The model selection function is a mapping from penalty values to model size (in segments/changepoints). Notice that `zoom.models` contains a `loss` column, which is the sum of squared residuals of each model. Below we use that to compute the models that will be selected for every possible penalty value. **New function:** In the code below we use the `penaltyLearning::modelSelection` function, which takes as input a `data.frame` with one row per changepoint model. There should be at least the following columns (but there can be others): * IDs for the data sequence (here, `profile.id` and `chromosome`). * complexity, such as number of changepoints/segments. This column name can be specified via the `complexity` argument, as below. * loss, such as the residual sum of squares or negative log likelihood. This should decrease as model complexity increases, and the column name can be specified via the `loss` argument (default is `"loss"` which is present in `zoom.models` so no need to specify in the code below). The output is a data.frame with one row per model that is selected for at least one penalty value. There are columns for the min/max penalty that will select each model, as shown below. 
```{r zoomSelection}
print(zoom.models)
(zoom.selection <- penaltyLearning::modelSelection(
  zoom.models, complexity="n.segments"))
ggplot()+
  geom_segment(aes(
    min.log.lambda, n.segments,
    xend=max.log.lambda, yend=n.segments),
    size=1,
    data=zoom.selection)+
  xlab("log(penalty) = log(lambda)")+
  scale_y_continuous(breaks=zoom.selection$n.segments)
```

We plot the model selection function in the figure above, which shows that it is a piecewise constant non-increasing function. Intuitively, the larger the penalty, the fewer the number of changepoints/segments (and vice versa). More concretely, let $L_{i,s}$ be the loss (sum of squared residuals) of the model with $s\in\{1,\dots,s_{\text{max}}\}$ segments for data sequence $i$. The model selection function is defined for every non-negative penalty $\lambda\geq 0$ as

$$
s_i^*(\lambda) = \operatorname{arg\,min}_s L_{i,s} + \lambda s.
$$

## Outputs

The output in the machine learning problem is an interval of log(penalty) values that achieves minimum incorrect labels. Remember that we have computed `zoom.error.list$model.errors` which contains the number of incorrect labels $e_i(s)\in\{0,1,2,\dots\}$ for this data sequence $i$ and every number of segments $s$. We can thus compute the label error as a function of the penalty,

$$
E_i(\lambda) = e_i[s^*_i(\lambda)].
$$

In R this computation amounts to a join with `zoom.selection` (which came from `penaltyLearning::modelSelection`).

```{r zoomploterrors}
zoom.error.join <- zoom.error.list$model.errors[J(zoom.selection), on=list(
  profile.id, chromosome, n.segments, loss)]
zoom.errors.tall <- data.table::melt(
  zoom.error.join,
  measure.vars=c("n.segments", "errors"))
zoom.gg.errors <- ggplot()+
  geom_segment(aes(
    min.log.lambda, value,
    xend=max.log.lambda, yend=value),
    size=1,
    data=zoom.errors.tall)+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "lines"))+
  facet_grid(variable ~ ., scales="free")+
  scale_y_continuous("", breaks=0:max.segments)+
  xlab("log(penalty) = log(lambda)")
print(zoom.gg.errors)
```

The code above plots the number of incorrect labels $E_i(\lambda)$, clearly showing that it is a piecewise constant function that takes integer values (like the 0-1 loss in binary classification). The goal of the machine learning algorithm is to provide predictions that minimize this function (on test data).

Below we compute the output for the machine learning problem, a target interval of penalty values that yield changepoint models with minimal incorrect labels.

**New function:** the `penaltyLearning::targetIntervals` function computes the upper and lower limits of the target interval, which is the largest range of log(penalty) values that results in minimum incorrectly predicted labels. Its input parameter is a data.frame with one row per model (from `penaltyLearning::modelSelection`), with an additional column `errors` for the number of incorrect labels. The `problem.vars` argument indicates the data sequence ID columns. The function returns a data.table with one row per labeled data sequence, with columns for the lower and upper bounds of the target interval (`min.log.lambda`, `max.log.lambda`).
```{r zoomplottargets} print(zoom.error.join) (zoom.target <- penaltyLearning::targetIntervals( zoom.error.join, problem.vars=c("profile.id", "chromosome"))) zoom.target.tall <- data.table::melt( zoom.target, measure.vars=c("min.log.lambda", "max.log.lambda"), variable.name="limit")[is.finite(value)] zoom.gg.errors+ geom_text(aes( ifelse(limit=="min.log.lambda", value-1, value+1), errors+1, label=paste( "false", ifelse(limit=="min.log.lambda", "positives", "negatives"), "\ntoo", ifelse(limit=="min.log.lambda", "many", "few"), "changes"), hjust=ifelse( limit=="min.log.lambda", 0, 1)), data=data.frame(zoom.target.tall, variable="errors"), vjust=-1)+ geom_point(aes( value, errors, fill=limit), shape=21, size=4, data=data.frame(zoom.target.tall, variable="errors"))+ scale_fill_manual("limit", values=c( min.log.lambda="black", max.log.lambda="white"))+ theme( legend.position="bottom", legend.box="horizontal") ``` The plot above emphasizes the target interval, which is used as the output in the machine learning problem: * the black dot shows the lower limit of the target interval. If a smaller log(penalty) is predicted, then there are too many changepoints (false positives). * the white dot shows the upper limit of the target interval. If a larger log(penalty) is predicted, then there are too few changepoints (false negatives). The target interval is different for each labeled data sequence, as we show below. We will compute the target interval for each of the labeled data sequences in the six profiles from the beginning of the tutorial. The first step is to compute changepoints and model selection functions, which we do via a for loop over each data sequence below. ```{r sixSegmentor} six.problems.dt <- unique(six.profiles[, list( profile.id, chromosome, status.profile)]) setkey(six.profiles, profile.id, chromosome) six.segs.list <- list() six.selection.list <- list() for(problem.i in 1:nrow(six.problems.dt)){ meta <- six.problems.dt[problem.i,] pro <- six.profiles[meta] max.segments <- min(nrow(pro), 10) fit <- Segmentor3IsBack::Segmentor( pro$logratio, model=2, Kmax=max.segments) rss.vec <- rep(NA, max.segments) for(n.segments in 1:max.segments){ end <- fit@breaks[n.segments, 1:n.segments] seg.mean.vec <- fit@parameters[n.segments, 1:n.segments] data.before.change <- end[-n.segments] data.after.change <- data.before.change+1 pos.before.change <- as.integer( (pro$position[data.before.change]+pro$position[data.after.change])/2) start <- c(1, data.after.change) chromStart <- c(pro$position[1], pos.before.change) chromEnd <- c(pos.before.change, max(pro$position)) data.mean.vec <- rep(seg.mean.vec, end-start+1) rss.vec[n.segments] <- sum((pro$logratio-data.mean.vec)^2) six.segs.list[[paste(problem.i, n.segments)]] <- data.table( meta, n.segments, start, end, chromStart, chromEnd, mean=seg.mean.vec) } loss.dt <- data.table( meta, n.segments=1:max.segments, loss=rss.vec) six.selection.list[[problem.i]] <- penaltyLearning::modelSelection( loss.dt, complexity="n.segments") } six.selection <- do.call(rbind, six.selection.list) six.segs <- do.call(rbind, six.segs.list) six.changes <- six.segs[1 < start] ``` Inside of each iteration of the loop above, we used * `Segmentor3IsBack::Segmentor` to compute the optimal changepoint models, saving the tidy result data.frame of segments to an element of `six.segs.list`. * `penaltyLearning::modelSelection` to compute the model selection functions, saving the result to an element of `six.selection.list`. 
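Before computing label errors for these models, here is a quick sanity check (a sketch; the `lambda` value below is arbitrary) of the model selection computation, using the single-sequence objects from the previous sections: brute-force minimization of the penalized criterion over `zoom.models` should select the same model as the `zoom.selection` interval containing log(lambda).

```{r check-selection-sketch}
lambda <- 5 # arbitrary example penalty
## Brute force: number of segments minimizing loss + lambda * s.
zoom.models[which.min(loss + lambda*n.segments), n.segments]
## Same answer from the modelSelection output: the model whose
## penalty interval contains log(lambda).
zoom.selection[
  min.log.lambda < log(lambda) & log(lambda) < max.log.lambda,
  n.segments]
```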
Below, we use * `penaltyLearning::labelError` to compute the number of incorrect labels for each segmentation model. * `penaltyLearning::targetIntervals` to compute the target interval of penalty values for each labeled data sequence. ```{r sixPlotErrors} six.error.list <- penaltyLearning::labelError( six.selection, six.labels, six.changes, problem.vars=c("profile.id", "chromosome")) six.targets <- penaltyLearning::targetIntervals( six.error.list$model.errors, problem.vars=c("profile.id", "chromosome")) ymax.err <- 1.2 six.targets.tall <- data.table::melt( six.targets, measure.vars=c("min.log.lambda", "max.log.lambda"), variable.name="limit", value.name="log.lambda")[is.finite(log.lambda)] six.gg.errors <- ggplot()+ theme_bw()+ theme(panel.margin=grid::unit(0, "lines"))+ facet_grid(profile.id ~ chromosome, labeller=function(df){ if("chromosome" %in% names(df)) df$chromosome <- paste0("chr", df$chromosome) if("profile.id" %in% names(df)) df$profile.id <- paste("profile", df$profile.id) df })+ geom_segment(aes( min.log.lambda, errors, xend=max.log.lambda, yend=errors), size=1, data=six.error.list$model.errors)+ geom_point(aes( log.lambda, errors, fill=limit), shape=21, size=3, alpha=0.5, data=six.targets.tall)+ scale_fill_manual("limit", values=c( min.log.lambda="black", max.log.lambda="white"))+ theme( legend.position="bottom", legend.box="horizontal")+ scale_y_continuous( "incorrectly predicted labels", limits=c(0,ymax.err), breaks=c(0,1))+ scale_x_continuous( "predicted log(penalty) = log(lambda)") print(six.gg.errors) ``` The plot above shows the error function and target intervals for each labeled data sequence. Each column of panels/facets represents a different chromosome, and each row represents a different profile/patient. It is clear from the plot above that a model with minimal incorrect labels will predict a different penalty for each data sequence. **Question:** discuss with your neighbor for 1 minute, why is there only one limit per data sequence? (only one target limit point per panel in the plot above) **Exercise 2 during class:** create a set of labels via visual inspection, and visualize how changing the labels affects the computation of the target interval. ## Inputs To train a machine learning model, we first need to compute a vector $\mathbf x_i\in\mathbb R^p$ of exactly $p$ features for each data sequence $i$. We then learn a function $f(\mathbf x_i)=\log\lambda_i\in\mathbb R$. To compute a feature matrix, use the code below. **New function:** the `penaltyLearning::featureMatrix` function computes a feature matrix for some data sequences. Inputs are a data.frame of data sequences (`six.profiles` below), along with column names of IDs (`problem.vars`) and data (`data.var`). Output is a numeric matrix with one row per data sequence and one column per feature. ```{r sixFeatureMatrix} print(six.profiles) feature.mat <- penaltyLearning::featureMatrix( six.profiles, problem.vars=c("profile.id", "chromosome"), data.var="logratio") str(feature.mat) ``` The feature matrix is 144 rows (one for each data sequence) by 365 columns/features. Which features should we use to predict log(penalty) values? If you have a lot of labels, you can use `penaltyLearning::IntervalRegressionCV` which uses L1-regularization to automatically select relevant features -- more details in the next section. For now we will discuss the simpler unsupervised BIC/SIC model, which corresponds to solving the following changepoint optimization problem. 
$$ \operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}} \sum_{j=1}^{d_i} \ell(z_{ij}, m_j) + \underbrace{(\log d_i)}_{\lambda_i} \sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}). $$ The BIC/SIC penalty is $\lambda_i = \log d_i$, where $d_i$ is the number of data points to segment in sequence $i$. In the changepoint package version 2.2 this is `penalty="SIC0"`. In our framework, we can write the BIC/SIC criterion in terms of $p=1$ feature, $x_i = \log\log d_i$ -- this is the column `n.loglog` of the feature matrix. The unsupervised BIC/SIC penalty function is thus a linear model $\log\lambda_i = f(x_i)=w x_i + \beta$ with intercept $\beta=0$ and slope $w=1$. ```{r six-BIC-dt} slope <- 1 intercept <- 0 bic.line.dt <- data.table( slope, intercept, model="BIC/SIC") six.BIC.dt <- six.profiles[, list( feature=log(log(.N)), model="BIC/SIC" ), by=list(profile.id, chromosome)] six.BIC.dt[, pred.log.lambda := feature*slope + intercept] ``` The plot below visualizes the target interval limits in this feature space. ```{r sixLimitsPlot} six.features.tall <- six.BIC.dt[six.targets.tall, on=list( profile.id, chromosome)] six.gg.limits <- ggplot()+ geom_point(aes( feature, log.lambda, fill=limit), shape=21, data=six.features.tall)+ scale_fill_manual("limit", values=c( min.log.lambda="black", max.log.lambda="white"), breaks=c("max.log.lambda", "min.log.lambda"))+ scale_x_continuous( "feature")+ scale_y_continuous( "log(penalty)=log(lambda)") print(six.gg.limits) ``` In the plot above, each point represents a finite limit of a target interval. The goal of learning is to find a regression function which is below the white dots (upper limits) and above the black dots (lower limits). This is a machine learning problem called "regression with left, right, and interval censored outputs" or "interval regression" for short. Note that this is not the same problem as binary classification, for which the vertical axis is a second feature. In this context we can plot the BIC/SIC model as a line with slope 1 and intercept 0, and we can visualize its errors using the vertical line segments below. ```{r sixBIC} six.features.targets <- six.BIC.dt[six.targets, on=list(profile.id, chromosome)] six.features.targets[, residual := 0] six.features.targets[{ is.finite(min.log.lambda)|is.finite(max.log.lambda) }, residual := { penaltyLearning::targetIntervalResidual( cbind(min.log.lambda, max.log.lambda), pred.log.lambda) }] model.colors <- c( "BIC/SIC"="#1B9E77",#green survreg="#D95F02",#orange IntervalRegressionCV="#7570B3")#dark purple six.gg.bic <- six.gg.limits+ scale_color_manual(values=model.colors)+ geom_abline(aes( slope=slope, intercept=intercept, color=model), size=1, data=bic.line.dt)+ geom_segment(aes( feature, pred.log.lambda, xend=feature, yend=pred.log.lambda-residual, color=model), data=six.features.targets) print(six.gg.bic) ``` **New function:** the `penaltyLearning::targetIntervalResidual` function computes the residual of a predicted log(penalty) with respect to a target interval (zero if prediction is inside the interval, positive for penalty too large, negative for penalty too small). The first argument is a 2-column numeric matrix of target intervals, and the second argument is a numeric vector of predicted log(penalty) values. The output is a numeric vector of residuals, useful for showing the errors in regression plots such as the one above. 
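To make the sign convention concrete, here is a toy usage sketch of `targetIntervalResidual`; the interval limits and predicted values below are made up for illustration.

```{r residual-toy-sketch}
## Three toy target intervals (lower, upper limits of log penalty):
toy.targets <- rbind(
  c(-Inf, 1), # only an upper limit (left censored)
  c(0, 3),    # both limits finite (interval censored)
  c(2, Inf))  # only a lower limit (right censored)
## Predictions: inside the interval, above the upper limit, below the
## lower limit. Per the description above, the residuals should be
## zero, positive, and negative, respectively.
penaltyLearning::targetIntervalResidual(toy.targets, c(0, 5, 1))
```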
We can see from the plot above that there are 5 data sequences for which the BIC/SIC predicts a penalty value outside of the target interval (penalty too large = too few changepoints). It is clear that the number of errors could be reduced by simply decreasing the intercept (moving the line down). However there is no line in this feature space that perfectly separates the white and black points.

We can also visualize the number of incorrect labels for the BIC/SIC model by plotting the data, segmentation and labels below. Note that this BIC/SIC model was computed using `Segmentor3IsBack::Segmentor`, but it yields exactly the same model as `changepoint::cpt.mean` (compare with previous figure above).

```{r figBIC, fig.width=10}
six.BIC.dt[, pred := pred.log.lambda]
six.BIC.selection <- data.table(six.selection)[six.BIC.dt, on=list(
  profile.id, chromosome,
  min.log.lambda < pred,
  max.log.lambda > pred)]
six.BIC.labels <- six.error.list$label.errors[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.changes <- six.changes[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.segs <- six.segs[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("BIC/SIC model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=six.BIC.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.BIC.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=six.BIC.segs,
    size=1,
    color="green")+
  theme(legend.box="horizontal")
```

**Question:** discuss with your neighbor for a minute, why are there only four false negatives in this plot, but there are five predictions outside the target interval in the previous plot with the regression line? Hint: look at the plot of the error as a function of penalty.

Another way to visualize the predicted penalty values is with the dots in the error function plot below.

```{r six-gg-errors-bic}
six.BIC.errors <- six.error.list$model.errors[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.gg.errors.bic <- six.gg.errors+
  geom_point(aes(
    pred.log.lambda, errors, color=model),
    size=2,
    data=six.BIC.errors)+
  scale_color_manual(values=model.colors)
print(six.gg.errors.bic)
```

It is clear from the plot above that

* Most of the BIC/SIC penalty values occur where the error function is zero. These are predictions which are correct with respect to the labeled regions.
* Some of the penalty values occur where the error function is non-zero. These are prediction errors with respect to the labels.

**Exercise after class:** can you think of any other features which would be useful for predicting log(penalty) values? In other words, what features of a data sequence should be related to the number of changepoints we should predict? Some ideas are quantiles/mean/sd of the observed data points. Change the definition of `feature` in `six.BIC.dt` above, and re-do the plots in this section. Using your feature, is it possible to learn an affine function with fewer errors -- are the white and black dots linearly separable?

**Exercise after class:** In the changepoint package version 2.2, `penalty="SIC0"` means $\lambda_i = \log d_i$ (same as the BIC/SIC we explored in this section).
Verify that the predicted penalty values in this section are the same, by adding a `geom_point` for predicted log(penalty) values to the plot with the regression line above. Hint: when we computed `changepoint::cpt.mean` in the first part of the tutorial, we saved the predicted penalty in the `pen.value` column of `unsupervised.models`. Join that data.table with `six.features.targets` to get the predicted penalty for the labeled data sequences.

**Exercise after class:** In the changepoint package version 2.2, `penalty="SIC"` means $\lambda_i = 2\log d_i$ (twice as much penalty as the BIC/SIC in this section). What is the slope/intercept of this model in the same feature space? Re-do the plots in this section. Is the model more or less accurate? To check your work, use a `geom_point` with predictions from `changepoint::cpt.mean`, as in the previous exercise.

## Learning and prediction

In the previous section we explored the unsupervised BIC/SIC penalty, which uses 1 feature, and does not learn any parameters using the labels. In this section, we will explain two supervised penalty learning methods that can be used to increase changepoint detection accuracy.

### survreg (un-regularized interval regression)

We begin by using the same feature as the BIC/SIC penalty, but relaxing the assumption that the intercept $\beta=0$ and slope $w=1$. Instead of taking these parameter values for granted based on the theoretical arguments of the BIC/SIC penalty, we can use the `survival::survreg` function to optimize these parameters based on the labels. The penalty function is $\log\lambda_i = f(x_i) = \beta + w\log\log d_i$, which is equivalent to solving the following penalized changepoint detection problem,

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j)
+\underbrace{e^\beta(\log d_i)^w}_{\lambda_i}
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}),
$$

where the intercept $\beta$ and the slope $w$ are defined as the solution to the following optimization problem,

$$
\operatorname{minimize}_{
\mathbf w\in\mathbb R^p, \beta\in\mathbb R, \sigma\in\mathbb R_+
}
\sum_{i=1}^n -\log[
F(\overline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)-
F(\underline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)
],
$$

where

* the optimization variables are the coefficients or weight vector $\mathbf w\in\mathbb R^p$, the bias/intercept $\beta\in\mathbb R$, and the scale $\sigma\in\mathbb R_+$. In this section for the BIC/SIC penalty we have $p=1$ feature, but `survreg` can be used with more features.
* $(\underline y_i,\overline y_i)$ is the target interval of log(penalty) values.
* $\mathbf x_i\in\mathbb R^p$ is the feature vector for data sequence $i$.
* $\log\lambda_i= f(\mathbf x_i)=\mathbf w^\intercal \mathbf x_i+\beta$ is the predicted log(penalty) value for data sequence $i$. The `survreg` Normal model is $\log \lambda_i \sim N(\mathbf w^\intercal \mathbf x_i+\beta, \sigma^2)$.
* $F$ is the likelihood: a cumulative distribution function such as the normal, logistic, or Weibull (specified using the `dist` argument).
* The objective function to minimize is the total negative log likelihood, which is convex with respect to the weight vector $\mathbf w$ and the bias/intercept $\beta$, but not jointly convex with respect to the scale parameter $\sigma$. Thus overall the optimization problem is non-convex, meaning that the gradient-based algorithms that `survreg` uses are only guaranteed to reach a local minimum (not the globally most likely model).
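To make this objective concrete, the sketch below evaluates a single term of the negative log likelihood for one interval-censored observation; the interval limits, predicted mean, and scale are made-up values. The same `pnorm` difference appears in the surrogate loss code later in this section.

```{r survreg-term-sketch}
## One observation: target interval (lower, upper) of log(penalty),
## predicted mean mu = w'x + beta, and scale sigma.
lower <- 1.5
upper <- 3
mu <- 2
sigma <- 0.5
## Negative log likelihood of this interval-censored observation
## under the normal model, -log[F(upper) - F(lower)]:
-log(pnorm(upper, mu, sigma) - pnorm(lower, mu, sigma))
```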
To perform a fairer comparison, we will hold out chromosome 11 as a test set, and only train the model using the other labeled chromosomes.

```{r fit-survreg-only-chr11}
library(survival)
train.dt <- six.features.targets[chromosome!=11]
```

**New function:** the `survival::survreg` function is used below with three arguments:

* a formula `output ~ inputs` as in other R modeling functions such as `lm` -- the difference is that the left/output must be a `Surv` object. When learning a penalty function for changepoint detection, we always use `type="interval2"` because all outputs are either left, right, or interval censored (target intervals of log penalty values).
* the data.frame where the input/output variables can be found.
* the distribution to use, in this case normal/Gaussian.

```{r fit-survreg}
(fit.survreg <- survreg(
  Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
  train.dt, dist="gaussian"))
```

It is clear that the coefficients of the learned model are different from the BIC/SIC model (slope=1, intercept=0). We can visualize the regression line in the feature space via the code below:

```{r six-gg-survreg}
survreg.line.dt <- data.table(t(coef(fit.survreg)), model="survreg")
six.survreg.pred <- data.table(six.BIC.dt)
six.survreg.pred[, model := "survreg"]
six.survreg.pred[, pred.log.lambda := predict(fit.survreg, six.survreg.pred)]
six.survreg.res <- six.survreg.pred[six.features.targets, on=list(
  profile.id, chromosome)]
six.survreg.res[, residual := 0]
six.survreg.res[{
  is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
  penaltyLearning::targetIntervalResidual(
    cbind(min.log.lambda, max.log.lambda),
    pred.log.lambda)
}]
six.gg.survreg <- six.gg.bic+
  geom_abline(aes(
    slope=feature, intercept=`(Intercept)`, color=model),
    size=1,
    data=survreg.line.dt)+
  geom_segment(aes(
    feature, pred.log.lambda,
    xend=feature, yend=pred.log.lambda-residual,
    color=model),
    data=six.survreg.res)
print(six.gg.survreg)
```

It is clear from the plot above that the learned survreg penalty reduces the number of incorrectly predicted target intervals from 5 to 2.

The plot below shows the predicted errors and the smooth surrogate loss minimized by the gradient-based algorithm implemented in the survreg function (such an algorithm cannot minimize the non-convex error function directly).

```{r six-gg-errors-survreg}
six.survreg.pred[, pred.log.penalty := pred.log.lambda]
six.survreg.selection <- data.table(six.selection)[six.survreg.pred, on=list(
  profile.id, chromosome,
  min.log.lambda < pred.log.lambda,
  max.log.lambda > pred.log.lambda)]
six.survreg.errors <- six.error.list$model.errors[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
log.pen.vec <- six.error.list$model.errors[, seq(
  min(max.log.lambda), max(min.log.lambda), l=100)]
six.survreg.surrogate <- six.survreg.res[, {
  ## For censored observations, survreg finds a linear function that
  ## predicts mean values that maximize the normal CDF (pnorm) --
  ## equivalent to minimizing the negative log likelihood below.
  neg.log.lik <- -log(
    pnorm(max.log.lambda, log.pen.vec, fit.survreg$scale)-
    pnorm(min.log.lambda, log.pen.vec, fit.survreg$scale))
  data.table(
    model="survreg",
    log.penalty=log.pen.vec,
    surrogate.loss=neg.log.lik
  )[neg.log.lik < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg <- six.gg.errors.bic+
  geom_point(aes(
    pred.log.penalty, errors, color=model),
    size=2,
    data=six.survreg.errors)+
  geom_line(aes(
    log.penalty, surrogate.loss, color=model),
    data=six.survreg.surrogate)
print(six.gg.errors.survreg)
```

Note in the plot above:

* four `survreg` predicted penalties achieve 0 error where the BIC/SIC has 1 error.
* one BIC/SIC prediction achieves 0 error where `survreg` has 1 error.
* the smooth surrogate loss is a good approximation of the non-convex error function.

We can also visualize the survreg model predictions by plotting the data sequences via the code below.

```{r figsurvreg, fig.width=10}
six.survreg.labels <- six.error.list$label.errors[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.changes <- six.changes[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.segs <- six.segs[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("survreg 1 feature model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=six.survreg.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.survreg.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=six.survreg.segs,
    size=1,
    color="green")+
  theme(legend.box="horizontal")
```

The false positive and false negative predictions are evident in the plot above.

**Exercise 3 during class:** the survreg model above assumed a Gaussian distribution for the log(penalty) values. Would using another distribution increase accuracy? Hint: try re-doing the analyses in this section with the `survreg(dist="logistic")` model. For other choices, see `help(survreg)`. Also try other input features from `feature.mat`.

**Exercise after class:** for one of the neuroblastoma data sequences plotted above, verify that `changepoint::cpt.mean(data.vec, penalty="Manual", pen.value=PREDICTED.PENALTY, method="PELT")` and `fpop::Fpop(data.vec, lambda=PREDICTED.PENALTY)` yield the same changepoints. Fpop implements a "functional pruning optimal partitioning" algorithm which is guaranteed to prune more sub-optimal changepoints (i.e. be faster) than the PELT algorithm. Read [On optimal multiple changepoint algorithms for large data](http://link.springer.com/article/10.1007/s11222-016-9636-3) for more info.

**Exercise after class:** try re-doing the analyses in this section using `penaltyLearning::IntervalRegressionUnregularized`, which learns an un-regularized model by minimizing the squared hinge loss with respect to the target interval limits. Is there any difference in prediction accuracy?

### IntervalRegressionCV (L1-regularized interval regression)

In the previous sections, we assumed that the log(penalty) is an affine function of a given single feature. We chose that feature based on the theoretical arguments of the BIC/SIC penalty. In this section, we show how a multi-variate affine penalty function can be learned using L1 regularization.
This technique is great when you don't have a theoretical prior for the penalty. You can compute as many features as you like (365 using `penaltyLearning::featureMatrix`), and the L1 regularization will ignore any irrelevant features automatically (feature selection). The penalty function we learn is $\log\lambda_i = f(\mathbf x_i) = \beta + \mathbf w^\intercal \mathbf x_i$, which is equivalent to solving the following penalized changepoint detection problem,

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j) +
\underbrace{\exp(\beta+\mathbf w^\intercal \mathbf x_i)}_{\lambda_i}
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}),
$$

where the intercept $\beta$ and the weight vector $\mathbf w$ are defined as the solution to the following optimization problem,

$$
\operatorname{minimize}_{
\mathbf w\in\mathbb R^p, \beta\in\mathbb R
}
\gamma ||\mathbf w||_1 +
\frac{1}{n}
\sum_{i=1}^n
\phi(\mathbf w^\intercal \mathbf x_i+\beta-\underline y_i)+
\phi(\overline y_i-\mathbf w^\intercal \mathbf x_i-\beta),
$$

where

* the data and optimization variables are the same as in the un-regularized interval regression problem described in the previous section (except there is no scale $\sigma$).
* the L1-norm $||\mathbf w||_1=\sum_{j=1}^p |w_j|$ is a regularization term which is used to avoid overfitting. It forces some of the $w_j$ weights to be exactly zero (feature selection).
* the hyper-parameter $\gamma>0$ is the degree of L1-regularization (chosen via an internal cross-validation loop). This means that the model performs automatic feature selection. You can thus use as many features $p$ as you want, and the model will automatically ignore any which do not increase prediction accuracy.
* $\phi(x)=(x-1)^2I(x<1)$ is the squared hinge loss ($I$ is the indicator function: 1 if the argument is true, and 0 otherwise).
* The objective function to minimize is the L1-norm of the weight vector plus the mean squared hinge loss (both are convex). Overall the optimization problem is convex, so the global minimum is guaranteed to be found in a reasonable amount of time using gradient-based algorithms.

This is a machine learning technique that automatically selects features that result in good prediction accuracy. First we create a feature and target matrix for the data in the train set:

```{r train-feature-mat}
finite.feature.mat <- feature.mat[, apply(is.finite(feature.mat), 2, all)]
train.ids <- train.dt[, paste(profile.id, chromosome) ]
train.feature.mat <- finite.feature.mat[train.ids, ]
train.target.mat <- train.dt[, cbind(min.log.lambda, max.log.lambda)]
str(train.feature.mat)
```

The training set consists of 30 examples, each with 256 features (the finite-valued subset of the 365 computed features). Before training, we register the multiprocess future plan (the internal cross-validation loop to choose the regularization parameter will be computed in parallel).

```{r require-future}
if(require(future)){
  options(mc.cores=parallel::detectCores()/2)
  plan(multiprocess)
}
```

We use `set.seed` below in order to make the results reproducible (there is randomness in the fold assignment in the internal cross-validation loop). Then, we use `IntervalRegressionCV` to fit the regularized linear model.

**New function:** the `penaltyLearning::IntervalRegressionCV` function computes an L1-regularized interval regression model. Its two main inputs are a numeric input feature matrix (n observations x p features), and a numeric output target interval matrix (n observations x 2).
Its return value is a model fit object of class `IntervalRegression`, which has `predict`, `print`, `plot`, and `coef` methods. ```{r fit-cv} set.seed(1) any.finite <- apply(is.finite(train.target.mat), 1, any) fit.cv <- penaltyLearning::IntervalRegressionCV( train.feature.mat[any.finite,], train.target.mat[any.finite,]) coef(fit.cv) ``` It is clear from the output above that even though the model was trained using 256 possible input features, it has selected only a few of them to use in the prediction function. We visualize the predicted penalty values as dots in the error curve plots below: ```{r six-gg-errors-cv} six.cv.pred <- data.table(six.survreg.pred) six.cv.pred[, pred.log.penalty := { pred.feat.mat <- finite.feature.mat[paste(profile.id, chromosome), ] predict(fit.cv, pred.feat.mat) }] six.cv.pred[, model := "IntervalRegressionCV"] six.cv.pred[, pred.log.lambda := pred.log.penalty] six.cv.selection <- data.table(six.selection)[six.cv.pred, on=list( profile.id, chromosome, min.log.lambda < pred.log.lambda, max.log.lambda > pred.log.lambda)] six.cv.errors <- six.error.list$model.errors[six.cv.selection, on=list( profile.id, chromosome, n.segments), nomatch=0L] six.cv.surrogate <- six.survreg.res[, { ## IntervalRegressionCV minimizes the squared hinge loss. data.table( model="IntervalRegressionCV", log.penalty=log.pen.vec, surrogate.loss=penaltyLearning::squared.hinge(log.pen.vec-min.log.lambda)+ penaltyLearning::squared.hinge(max.log.lambda-log.pen.vec) )[surrogate.loss < ymax.err] }, by=list(profile.id, chromosome)] six.gg.errors.survreg+ geom_point(aes( pred.log.penalty, errors, color=model), size=2, data=six.cv.errors)+ geom_line(aes( log.penalty, surrogate.loss, color=model), data=six.cv.surrogate) ``` The plot above also shows the surrogate loss that `IntervalRegressionCV` minimizes using a gradient descent algorithm. Note how it is very similar to the `survreg` surrogate loss -- both are smooth functions which increase quadratically as the predicted value goes past the target interval limit. Finally, we can also visualize the `IntervalRegressionCV` model in terms of the data sequences, labels, and predicted segmentation: ```{r figcv, fig.width=10} six.cv.labels <- six.error.list$label.errors[six.cv.selection, on=list( profile.id, chromosome, n.segments), nomatch=0L] six.cv.changes <- six.changes[six.cv.selection, on=list( profile.id, chromosome, n.segments), nomatch=0L] six.cv.segs <- six.segs[six.cv.selection, on=list( profile.id, chromosome, n.segments)] gg.supervised+ ggtitle("IntervalRegressionCV multi-feature model changepoints and label errors")+ penaltyLearning::geom_tallrect(aes( xmin=min/1e6, xmax=max/1e6, linetype=status), fill=NA, data=six.cv.labels)+ scale_linetype_manual("error type", values=c( correct=0, "false negative"=3, "false positive"=1))+ geom_vline(aes( xintercept=chromStart/1e6), data=six.cv.changes, color="green", size=1, linetype="dashed")+ geom_segment(aes( chromStart/1e6, mean, xend=chromEnd/1e6, yend=mean), data=six.cv.segs, size=1, color="green")+ theme(legend.box="horizontal") ``` The plot above clearly shows the advantage of the machine learning approach: all labels are correctly predicted, including all labels on the test set (chromosome 11). **Exercise after class:** by default the `IntervalRegressionCV` function chooses the regularization parameter with minimum error, as estimated using an internal cross-validation loop. Another option is to choose the least complex model within 1 standard deviation. 
Re-fit the model using `IntervalRegressionCV(reg.type="1sd")`, view the error as a function of model complexity via `plot(fit.cv)`, then re-do the model prediction error plots. Is this model more or less accurate? Why? Hint: use `coef(fit.cv)` to see how many features are included in the model.

### Other methods

**Exercise after class:** try one of these other interval regression algorithms, and see if you can learn a more accurate penalty function.

* [iregnet](https://github.com/anujkhare/iregnet), elastic net regularized Accelerated Failure Time models.
* [mmit](https://github.com/aldro61/mmit), max margin interval trees.
* [trtf](https://r-forge.r-project.org/R/?group_id=1322), transformation trees and forests.

## Evaluation

Since we did not use the chromosome 11 labels to train any of the models above, we can use those labels to evaluate the prediction accuracy of the learned models.

* In fact, we have already **qualitatively** evaluated the accuracy of the predictions, by visual inspection of the plots in the last section. Qualitative evaluation via visual inspection is also possible in the context of an unlabeled, unsupervised changepoint analysis.
* However, because we have labels, we can **quantitatively** evaluate the prediction accuracy by computing the number of incorrect labels and AUC (area under the ROC curve). This is a major advantage of the supervised machine learning approach.

Below we create a list with an element for each of the three models we want to compare. Then we use a `for` loop to compute ROC curves for each model, with respect to the test data on chromosome 11.

**New function:** the `penaltyLearning::ROChange` function computes Receiver Operating Characteristic curves for penalty function learning problems. It inputs the `model.errors` data.table from `penaltyLearning::labelError`, and a data.table with predicted log(penalty) values in the `pred.log.lambda` column. It outputs a list which describes the ROC curves. Below we use the `roc` element, a data.table with one row for each point on the ROC curve. We also use the `thresholds` element, a data.table with one row giving the error rates at the predicted threshold. For more info read `help(ROChange)`.

```{r six-auc}
model.pred.list <- list(
  IntervalRegressionCV=six.cv.pred,
  "BIC/SIC"=six.BIC.dt,
  survreg=six.survreg.pred)
roc.list <- list()
auc.list <- list()
for(model in names(model.pred.list)){
  model.pred <- model.pred.list[[model]]
  model.roc <- penaltyLearning::ROChange(
    six.error.list$model.errors, model.pred[chromosome==11],
    problem.vars=c("profile.id", "chromosome"))
  roc.list[[model]] <- data.table(model, model.roc$roc)
  auc.list[[model]] <- with(model.roc, data.table(
    model, auc, thresholds[threshold=="predicted"]))
}
roc <- do.call(rbind, roc.list)
(auc <- do.call(rbind, auc.list))
```

The table above shows the error/accuracy of the three models with respect to the labels in the test data (chromosome 11). It is clear that IntervalRegressionCV is slightly more accurate (higher AUC and fewer incorrect labels). The ROC curves are shown below:

```{r six-gg-roc}
ggplot()+
  scale_color_manual(values=model.colors)+
  geom_path(aes(
    FPR, TPR, color=model, linetype=model),
    size=1,
    data=roc)+
  geom_point(aes(
    FPR, TPR, color=model),
    size=3,
    data=auc)+
  coord_equal()
```

However, with so few labels, and only one test set, it is difficult to tell whether there is any significant difference between the models.
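Before moving to a more rigorous comparison, we can also summarize each model's ROC table numerically (a sketch we added, using only the `model`, `FPR`, and `TPR` columns plotted above):

```{r roc-summary-sketch}
## Sketch: number of points on each ROC curve, plus the extreme
## TPR/FPR values attainable by varying the prediction threshold.
roc[, .(
  points=.N,
  max.TPR=max(TPR),
  min.FPR=min(FPR)
), by=model]
```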
Below we use K-fold cross-validation to evaluate the models:

* we assign each labeled data sequence to one of K folds.
* for each fold, we set aside the corresponding data in that fold as a test set.
* we then train each of the models using all of the other folds (not using the test fold at all during this training phase).
* after each model has been trained, we compute predicted penalty values for the test data, then accuracy metrics (percent incorrect labels and AUC).
* we then repeat the process for each of the folds, and check to see if any method is better on average over all of the test sets.

Since the neuroblastoma data set is relatively big (3418 labeled data sequences), we have pre-computed the Segmentor models, which we load using the code below. (To re-compute them on your own computer, run [Segmentor.models.R](https://github.com/tdhock/change-tutorial/blob/master/Segmentor.models.R).)

```{r load-Segmentor-models}
if(!file.exists("Segmentor.models.RData")){
  download.file(
    "http://members.cbio.ensmp.fr/~thocking/change-tutorial/Segmentor.models.RData",
    "Segmentor.models.RData")
}
load("Segmentor.models.RData")
```

Before performing cross-validation, we can compute the error functions, features, and target intervals for all of the labeled data.

```{r labeled-feature-mat}
nb.selection <- Segmentor.models$loss[, {
  penaltyLearning::modelSelection(
    .SD, complexity="n.segments")
}, by=list(profile.id, chromosome)]
nb.error.list <- penaltyLearning::labelError(
  nb.selection,
  neuroblastoma$annotations,
  Segmentor.models$segs[1 < start],
  problem.vars=c("profile.id", "chromosome")
)
nb.target.dt <- penaltyLearning::targetIntervals(
  nb.error.list$model.errors,
  problem.vars=c("profile.id", "chromosome"))
labeled.profiles <- data.table(
  neuroblastoma$profiles)[nb.target.dt[, list(
    profile.id, chromosome)], on=list(
      profile.id, chromosome)]
labeled.feature.mat <- penaltyLearning::featureMatrix(
  labeled.profiles,
  problem.vars=c("profile.id", "chromosome"),
  data.var="logratio")
```

Rather than training on all features (which is possible, but takes a bit too long for this tutorial), we will just use the following subset:

```{r some-feature-names}
(some.feature.names <- c(
  colnames(labeled.feature.mat)[1:5],
  rownames(coef(fit.cv))[-1]))
```

Next we assign a fold to each target:

```{r nb-target-dt}
set.seed(1)
n.folds <- 5
nb.target.dt[, fold := sample(rep(1:n.folds, l=.N))]
```

Then we loop over test folds. In each iteration of the `for` loop, we create `train.target.dt` and `train.features` by excluding data in the test fold. We then train each model using these data, and compute prediction accuracy for each model using `penaltyLearning::ROChange`.
```{r cv-auc-list} cv.roc.list <- list() cv.auc.list <- list() for(test.fold in 1:n.folds){ train.target.dt <- nb.target.dt[fold!=test.fold] train.id.vec <- train.target.dt[, paste(profile.id, chromosome)] train.features <- labeled.feature.mat[train.id.vec, ] train.target.mat <- train.target.dt[, cbind(min.log.lambda, max.log.lambda)] print(test.fold) set.seed(1) fit <- penaltyLearning::IntervalRegressionCV( train.features[, some.feature.names], train.target.mat) train.target.dt$feature <- train.features[, "n.loglog"] fit.survreg <- survreg( Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature, train.target.dt, dist="gaussian") ## compute predictions test.target.dt <- nb.target.dt[fold==test.fold] test.id.vec <- test.target.dt[, paste(profile.id, chromosome)] test.feature.vec <- labeled.feature.mat[test.id.vec, "n.loglog"] pred.list <- list( "BIC/SIC"=test.feature.vec, survreg=predict(fit.survreg, data.table(feature=test.feature.vec)), IntervalRegressionCV=predict(fit, labeled.feature.mat[test.id.vec,])) for(model in names(pred.list)){ pred.dt <- data.table( test.target.dt, pred.log.lambda=as.numeric(pred.list[[model]])) pred.roc <- penaltyLearning::ROChange( nb.error.list$model.errors, pred.dt, problem.vars=c("profile.id", "chromosome")) cv.roc.list[[paste(test.fold, model)]] <- data.table( test.fold, model, pred.roc$roc) cv.auc.list[[paste(test.fold, model)]] <- with(pred.roc, data.table( test.fold, model, auc, thresholds[threshold=="predicted"])) } } cv.auc <- do.call(rbind, cv.auc.list) cv.roc <- do.call(rbind, cv.roc.list) ``` Note that in the code above, we used `cv.roc.list` to save the ROC curve for each model and test fold, and we used `cv.auc.list` to save AUC. We plot the ROC curves below in order to visualize what kinds of errors the models commit (false positives or false negatives). ```{r cv-roc-curves} ggplot()+ scale_color_manual(values=model.colors)+ geom_path(aes( FPR, TPR, color=model, group=paste(model, test.fold)), data=cv.roc)+ geom_point(aes( FPR, TPR, color=model), fill="white", shape=21, data=cv.auc)+ coord_equal(xlim=c(0, 0.5), ylim=c(0.5, 1)) ``` It is clear from the ROC curves above that the BIC/SIC model has a relatively low true positive rate (high false negative rate -- many breakpoint/positive labels with no predicted changepoints). Below we compute and plot two accuracy metrics, AUC and percent incorrectly predicted labels. ```{r cvAccuracyPlot} cv.auc[, accuracy.percent := 100-error.percent] cv.auc.tall <- melt(cv.auc, measure.vars=c("accuracy.percent", "auc")) cv.means <- cv.auc.tall[, list( mean=mean(value), sd=sd(value) ), by=list(model, variable)] levs <- cv.means[variable=="auc"][order(mean), model] cv.auc.tall[, model.fac := factor(model, levs)] ggplot()+ theme_bw()+ theme(panel.margin=grid::unit(0, "lines"))+ facet_grid(. ~ variable, scales="free")+ scale_color_manual(values=model.colors)+ guides(color="none")+ geom_point(aes( value, model.fac, color=model), data=cv.auc.tall)+ xlab("")+ ylab("model") ``` It is clear from the plot above that survreg is significantly more accurate than BIC/SIC, and IntervalRegressionCV is slightly more accurate than survreg. **Remark 1:** the accuracy rates above (BIC/SIC predicts 92% correct labels whereas survreg and IntervalRegressionCV predict 97-98% correct labels) are specific to the neuroblastoma data set. However, in general you can expect that unsupervised penalty functions like BIC/SIC will be less accurate than supervised methods such as IntervalRegressionCV. 
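To check whether these differences are consistent across test folds, one possibility (a sketch we added; with only 5 folds, treat the p-value as a rough guide) is a paired comparison of the fold-wise AUC values:

```{r cv-auc-paired-sketch}
## Sketch: one row per test fold, one column per model; positive
## differences favor IntervalRegressionCV over BIC/SIC on the same fold.
auc.wide <- dcast(cv.auc, test.fold ~ model, value.var="auc")
auc.wide[, auc.diff := IntervalRegressionCV - `BIC/SIC`]
auc.wide[]
t.test(auc.wide$IntervalRegressionCV, auc.wide$`BIC/SIC`, paired=TRUE)
```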
Such improved accuracy is the main motivation of supervised changepoint analysis: we can use labels to learn a penalty function with better changepoint detection accuracy.

**Remark 2:** To learn the most accurate model, it is best to use all training data (instead of holding out some of the labels as a test set). The [figure-neuroblastoma-predictions.R](https://github.com/tdhock/change-tutorial/blob/master/figure-neuroblastoma-predictions.R) script fits an IntervalRegressionCV model to the entire neuroblastoma data set, uses the model to predict changepoints for all data sequences (even those without labels), then [plots the predicted changepoints along with the train errors](figure-neuroblastoma-predictions.png). Some overfitting/extrapolation errors are evident. For example, some chrY sequences have too many predicted changes --- this could be fixed by adding some negative/normal labels on those sequences.

**Conclusion:** in **labeled data sequences with a common signal/noise pattern**, supervised learning algorithms can be used to predict penalty values and changepoints which are **more accurate than unsupervised methods**. An advantage of the supervised learning approach is that we can **quantitatively evaluate** the learned models by counting the number of incorrectly predicted labels in a held-out test set.

**Exercise after class:** is the logistic or Gaussian distribution a more accurate model for the log(penalty) values? In the code block above, add some code for computing the model predictions and error rates for `survreg(dist="logistic")`, then re-do the plots in this section -- there should be 4 models instead of 3 in the last plot. A starting sketch is given below.
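Here is a sketch of the lines one might add inside the fold loop above (assuming the variable names used in that loop; not evaluated here):

```{r logistic-sketch, eval=FALSE}
## Sketch: after fitting fit.survreg inside the fold loop, fit a second
## AFT model with a logistic distribution on the same single feature,
fit.logistic <- survreg(
  Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
  train.target.dt, dist="logistic")
## then add its test predictions to pred.list, so that the inner loop
## over names(pred.list) computes its ROC curves and AUC as well:
pred.list[["survreg.logistic"]] <- predict(
  fit.logistic, data.table(feature=test.feature.vec))
```

Note that `model.colors` would also need an entry for the new model before re-doing the plots.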