---
title: "Introduction to supervised changepoint detection"
author: Toby Dylan Hocking, McGill University, Montreal, Canada
output:
  html_document:
    toc: true
    toc_depth: 2
---

Before starting this tutorial, make sure to install the required
packages:

```{r packages}
packages.R <- if(file.exists("packages.R")){
  "packages.R"
}else{
  "https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
options(width=100)
```

# What is the difference between unsupervised and supervised changepoint detection?

There are two features of a data set that make it possible to do a
supervised changepoint analysis:

* The data consist of **several sequences** with similar signal/noise
  patterns, which are treated as separate changepoint detection
  problems.
* Some data sequences have **labels** which indicate specific regions
  where the model should predict presence or absence of changepoints.

The goal of the supervised analysis is to find a model that learns
from the limited labeled data, and provides accurate changepoint
predictions for all of the data (a different number of changepoints
for each separate data sequence).

## Neuroblastoma data set

We will use the neuroblastoma data set to explore supervised
changepoint detection.

```{r neuroblastoma}
data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
lapply(neuroblastoma, head, 10)
```

The neuroblastoma data set consists of two data.frames:

* profiles contains the noisy data (logratio), which is approximate
  DNA copy number (measured at chromosome, position), for a
  particular patient (profile.id). Every unique combination of
  (profile.id, chromosome) defines a separate multiple changepoint
  detection problem. The logratio measures how many copies of each
  part of the human genome are present in the patient's
  tumor. Normally there are two copies of each chromosome
  (logratio=0), but in cancer samples there can be gains and losses
  of specific regions or entire chromosomes.
* annotations contains the labels, which indicate presence
  (breakpoint) or absence (normal) of changepoints in specific
  regions (profile.id, chromosome, min, max) of the data.

These data come from children who were treated at the Institut Curie
(Paris, France). For six children we also have follow-up data on
whether they recovered (ok) or had a relapse, several years after
treatment:

```{r clinical}
followup <- data.frame(
  profile.id=c(10L, 8L, 4L, 6L, 11L, 1L),
  status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
```

## Unsupervised: no labels, model selection via theory

When there are no labels, the input for a changepoint analysis just
uses the noisy DNA copy number data in
`neuroblastoma$profiles$logratio`. We begin by selecting the profiles
of the six patients for which we have follow-up data.

```{r six}
rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
  some <- subset(all.profiles, paste(profile.id) %in% some.ids)
  status.profile <- followup[paste(some$profile.id), "status.profile"]
  some$status.profile <- ifelse(
    is.na(status.profile), paste(some$profile.id), status.profile)
  data.table(some)
}
six.profiles <- someProfiles(neuroblastoma$profiles)
```

For these six patients, the total number of separate changepoint
detection problems is 24 chromosomes x 6 patients = 144 problems. We
plot each problem below in a separate facet (panel).
```{r plotProfiles, fig.width=10}
library(ggplot2)
gg.unsupervised <- ggplot()+
  ggtitle("unsupervised changepoint detection = only noisy data sequences")+
  theme(
    panel.margin=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
  geom_point(aes(position/1e6, logratio),
             data=six.profiles,
             shape=1)+
  scale_x_continuous(
    "position on chromosome (mega bases)",
    breaks=c(100, 200))+
  scale_y_continuous(
    "logratio (approximate DNA copy number)",
    limits=c(-1,1)*1.1)
print(gg.unsupervised)
```

Note the facet titles on the right:

* The top three profiles are "ok" because those children completely
  recovered.
* The bottom three profiles are "relapse" because the neuroblastoma
  cancer came back, and required another treatment at the hospital.

**Question:** can you visually identify any differences between the
"ok" and "relapse" profiles?

How to choose the number of changepoints? In the unsupervised
setting, our only option is to use theoretical/statistical arguments,
which give information criteria such as AIC (penalty=2) or BIC/SIC
(penalty=log n). More on this later -- a minimal sketch of applying
such a penalty appears below.
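For example, the changepoint package (used again in the exercises
later in this tutorial, and assumed to be installed by packages.R)
implements several classical penalties. The sketch below is a minimal
unsupervised example, not part of the original analysis; the choice
of profile 4, chromosome 2 is arbitrary.

```{r unsupervisedSketch}
## Unsupervised changepoint detection sketch: segment one data
## sequence using the BIC/SIC penalty (penalty="SIC0" in changepoint
## version 2.2, as discussed later in this tutorial).
one.pro <- subset(six.profiles, profile.id==4 & chromosome==2)
fit.unsup <- changepoint::cpt.mean(
  one.pro$logratio, penalty="SIC0", method="PELT")
changepoint::cpts(fit.unsup) # indices of the detected changepoints
```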
## Supervised: learn a penalty function with minimal incorrect labels

In supervised changepoint detection, there are labels which indicate
presence and absence of changepoints in particular data subsets. For
the same set of 6 profiles, we superimpose the labels in the plot
below.

```{r plotLabels, fig.width=10}
six.labels <- someProfiles(neuroblastoma$annotations)
gg.supervised <- gg.unsupervised+
  ggtitle(paste(
    "supervised changepoint detection = data + labels",
    "that indicate specific regions with or without changepoints"))+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=six.labels)+
  scale_fill_manual("label", values=penaltyLearning::change.colors)+
  theme(legend.position="bottom")
print(gg.supervised)
```

It is clear from this plot that the neuroblastoma data satisfy the
two criteria which are necessary for a supervised changepoint
analysis:

* The data consist of **several sequences** with similar signal/noise
  patterns, which are treated as separate changepoint detection
  problems.
* Some data sequences have **labels** which indicate specific regions
  where the model should predict presence or absence of changepoints.

As we will see later in the tutorial, the labels can be used for
choosing the best model and parameters. The main idea is that the
changepoint model should be consistent with the labels:

* Every breakpoint/positive label is a region where the model should
  predict at least one changepoint. If the model predicts no
  changepoints in a region with a breakpoint label, that is
  considered a false negative.
* Every normal/negative label is a region where the model should
  predict no changepoints. If the model predicts one or more
  changepoints in a region with a normal label, that is considered a
  false positive.
* The overall goal is to find a changepoint model that minimizes the
  number of incorrectly predicted labels, which is defined as the sum
  of false positives and false negatives.

**Exercise:** plot data and labels for a different set of
profiles. Hint: assign a new value to `some.ids`, then re-run the
code above.

## Creating labels via visual inspection

Labels can be created using prior knowledge or visual inspection. In
this section we explain how to create labels via visual inspection.

**Exercise:** create a set of labels via visual inspection for one
un-labeled data sequence in the neuroblastoma data set. Begin by
plotting one data sequence by itself to see the details of the signal
and noise.

```{r zoom}
zoom.profile <- 4 #Exercise: change this value!
zoom.chromosome <- 14 #Exercise: change this value!
zoom.pro <- subset(
  neuroblastoma$profiles,
  profile.id==zoom.profile & chromosome==zoom.chromosome)
zoom.gg <- ggplot()+
  geom_point(aes(position/1e6, logratio),
             data=zoom.pro,
             shape=1)+
  scale_y_continuous(
    "logratio (approximate DNA copy number)")
print(zoom.gg)
```

Then add some new labels by creating a `data.frame` in R that encodes
where you saw regions with and without changepoints. Hints:

* in the neuroblastoma data set there is only one label per data
  sequence, but you can add as many as you want. Try adding two
  labels to the same data sequence.
* in addition to breakpoint/normal labels, you can use the "1change"
  label to indicate a region that should have exactly 1 changepoint
  (2 or more implies a false positive, 0 implies a false negative).
* it will simplify the code to use a helper function as below:

```{r manual-label-fun}
label <- function(min, max, annotation){
  data.frame(
    profile.id=zoom.profile,
    chromosome=zoom.chromosome,
    min, max, annotation)
}
zoom.labels <- rbind(
  label(70e6, 80e6, "1change"),
  label(20e6, 60e6, "normal"))
zoom.gg.lab <- zoom.gg+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=zoom.labels)+
  scale_fill_manual("label", values=penaltyLearning::change.colors)
print(zoom.gg.lab)
```

To quickly create labels for many data sequences, I recommend using a
GUI in which you can use the mouse to draw label rectangles on the
scatterplots, such as
[annotate_regions.py](https://pypi.python.org/pypi/annotate_regions/1.0)
or [SegAnnDB](https://github.com/tdhock/SegAnnDB).

# A typical supervised changepoint analysis

In this section we explain how to perform a supervised changepoint
analysis on a labeled data set.

## Overview of supervised changepoint computations

A typical supervised changepoint analysis consists of the following
computations:

* [Changepoints and model
  selection](#changepoints-and-model-selection). For each of several
  labeled segmentation problems (data sequences that are separate but
  have a similar signal/noise pattern), use your favorite changepoint
  detection package to compute a sequence of models of increasing
  complexity (say from 0 to 20 changepoints). For each segmentation
  problem, make sure to save the changepoint positions of each model,
  and use `penaltyLearning::modelSelection` to compute the exact path
  of models that will be selected for every possible non-negative
  penalty value.
* [Label error](#label-error). Use `penaltyLearning::labelError` to
  compute the number of incorrect labels for each labeled
  segmentation problem and each changepoint model. The goal of
  learning is to minimize the number of incorrectly predicted labels.
* [Outputs](#outputs). Use `penaltyLearning::targetIntervals` to
  compute a target interval of log(penalty) values that achieves the
  minimum number of incorrect labels for each segmentation
  problem. Create a target interval matrix (one row for each
  segmentation problem, 2 columns) which can be used as the output in
  the supervised machine learning problem.
* [Inputs](#inputs). Compute a feature matrix (one row per
  segmentation problem, one column per feature) using
  `penaltyLearning::featureMatrix`.
  Or do it yourself using simple statistics of each segmentation
  problem (quantiles, mean, number of data points, estimated
  variance, etc.).
* [Learning and prediction](#learning-and-prediction). Use
  `survival::survreg` or `penaltyLearning::IntervalRegressionCV` to
  learn a penalty function.
* [Evaluation](#evaluation). Use `penaltyLearning::ROChange` to
  compute ROC curves using labels in a test set (that were not used
  for training). The AUC (area under the ROC curve) and the percent
  incorrect labels in the test set can be used to evaluate the
  prediction accuracy of your model.

The mathematical details are described in [our ICML'13 paper,
Learning Sparse Penalties for Change-point Detection using Max Margin
Interval Regression](http://proceedings.mlr.press/v28/hocking13.html).

## Changepoints and model selection

In this section we will begin by computing maximum likelihood
Gaussian changepoint models using the `Segmentor` function.

* Let there be $n$ separate data sequences to segment. For example in
  the previous section we plotted $n=144$ data sequences, each a
  separate changepoint detection problem.
* For any data sequence $i\in\{1,\dots,n\}$, let $d_i$ be the number
  of data points in data sequence $i$, and let $\mathbf z_i\in\mathbb
  R^{d_i}$ be the vector of data points to segment for that sequence.
* The Segment Neighborhood problem with $s$ segments for data
  sequence $i$ is to find the most likely $s-1$ changepoints:

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j)
\text{ subject to }
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1})=s-1.
$$

For the Normal homoscedastic model we have

* the optimization variable $\mathbf m = (m_1 \cdots m_{d_i})$ is a
  vector of $d_i$ real-valued segment mean variables.
* the loss function $\ell$ is the square loss $\ell(z,m)=(z-m)^2$.
* the indicator function $I$ is 1 for every change-point where
  $m_{j} \neq m_{j+1}$, and 0 otherwise.
* the non-negative penalty constant $\lambda\geq 0$ appears only in
  the penalized version of this problem, which we will use later; the
  constrained problem above has no penalty term.
* the objective function is the total loss over all $d_i$ data
  points, which is convex.
* the constraint is that the mean $\mathbf m$ must have exactly $s-1$
  changes, which is non-convex.
* overall the optimization problem is non-convex, which means that
  gradient-based algorithms are not guaranteed to compute the global
  minimum. However the dynamic programming algorithms discussed below
  can be used to compute the global minimum in a reasonable amount of
  time.

A few references about the algorithm implemented in the `Segmentor`
function:

* the "Segment Neighborhood" problem was originally described by
  [Auger and Lawrence
  (1989)](https://www.ncbi.nlm.nih.gov/pubmed/2706400). They proposed
  an $O(S_{\text{max}} d^2)$ algorithm for computing the most likely
  models with $s\in\{1,\dots,S_{\text{max}}\}$ segments, for $d$ data
  points. An R implementation of their algorithm is available as
  `changepoint::cpt.mean(method="SegNeigh")` but it is very slow for
  large $d$, due to the quadratic time complexity.
* [Rigaill (2010) discovered a Pruned Dynamic Programming Algorithm
  (PDPA)](https://arxiv.org/abs/1004.0887) which solves the same
  "Segment Neighborhood" problem. Its $O(S_{\text{max}}d\log d)$ time
  complexity makes it much faster for larger data sets. The original
  R implementation `cghseg:::segmeanCO` only works for the
  Gaussian/normal likelihood.
* [Cleynen et al (2014) generalized the PDPA to other loss/likelihood
  models](https://almob.biomedcentral.com/articles/10.1186/1748-7188-9-6).
  Their R function `Segmentor3IsBack::Segmentor` implements the
  $O(S_{\text{max}}d\log d)$ PDPA for Poisson, Normal homoscedastic
  (constant variance -- change in mean), Negative Binomial, Normal
  heteroscedastic (constant mean -- change in variance), and
  Exponential likelihood models.

**Exercise:** for the data sequence $i$ that you labeled in the
previous section, compute the Normal homoscedastic Segment
Neighborhood models for $s\in\{1,\dots,10\}$ segments using the code
below. Then create the rest of the figures in this section using your
labeled data.

```{r zoomSegmentor}
max.segments <- 10
(fit <- Segmentor3IsBack::Segmentor(
  zoom.pro$logratio, model=2, Kmax=max.segments))
```

The Segmentor `fit` object is an S4 class with slots

* `breaks` the end position of each segment.
* `parameters` the mean of each segment.

Next, we convert the model to tidy format (a `data.frame` with one
row per segment). The code below is specific to the `Segmentor`
function; if you use a different changepoint detection
package/function, you will have to write some other code to tidy the
data.

```{r zoomTidy}
zoom.segs.list <- list()
zoom.loss.vec <- rep(NA, max.segments)
for(n.segments in 1:max.segments){
  end <- fit@breaks[n.segments, 1:n.segments]
  data.before.change <- end[-n.segments]
  data.after.change <- data.before.change+1
  pos.before.change <- as.integer(
    (zoom.pro$position[data.before.change]+
     zoom.pro$position[data.after.change])/2)
  start <- c(1, data.after.change)
  chromStart <- c(zoom.pro$position[1], pos.before.change)
  chromEnd <- c(pos.before.change, max(zoom.pro$position))
  seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
  zoom.segs.list[[n.segments]] <- data.frame(
    profile.id=zoom.profile,
    chromosome=zoom.chromosome,
    n.segments,
    start, end,
    chromStart, chromEnd,
    mean=seg.mean.vec,
    row.names=NULL)
  data.mean.vec <- rep(seg.mean.vec, end-start+1)
  stopifnot(length(data.mean.vec)==nrow(zoom.pro))
  zoom.loss.vec[n.segments] <- sum((zoom.pro$logratio-data.mean.vec)^2)
}
```

Notice that we computed `zoom.loss.vec`, which is the sum of squared
residuals of each model. Below we use that to construct a data.frame
with one row for each model, which we can use with `modelSelection`
to compute the models that will be selected for every possible
penalty value.

```{r zoomSelection}
zoom.models <- data.frame(
  profile.id=zoom.profile,
  chromosome=zoom.chromosome,
  loss=zoom.loss.vec,
  n.segments=as.numeric(1:max.segments))
(zoom.selection <- penaltyLearning::modelSelection(
  zoom.models, complexity="n.segments"))
```

Let $L_{i,s}$ be the loss (sum of squared residuals) of the model
with $s\in\{1,\dots,S_{\text{max}}\}$ segments for data sequence
$i$. The model selection function is defined for every non-negative
penalty $\lambda\geq 0$ as

$$
s_i^*(\lambda) = \operatorname{arg\,min}_s L_{i,s} + \lambda s.
$$

We plot the function in the figure below, which shows that it is a
piecewise constant non-increasing function.

```{r zoomSelectionPlot}
ggplot()+
  geom_segment(aes(
    min.lambda, n.segments,
    xend=max.lambda, yend=n.segments),
    size=1,
    data=zoom.selection)+
  xlab("penalty constant = lambda")+
  scale_y_continuous(breaks=zoom.selection$n.segments)
ggplot()+
  geom_segment(aes(
    min.log.lambda, n.segments,
    xend=max.log.lambda, yend=n.segments),
    size=1,
    data=zoom.selection)+
  xlab("log(penalty) = log(lambda)")+
  scale_y_continuous(breaks=zoom.selection$n.segments)
```

The figure above shows the same function in the $\log\lambda$ space
(which is defined on the entire real line).
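As a quick sanity check (a minimal sketch using the objects computed
above; the penalty value 1 is arbitrary), the selected model for any
single penalty can also be computed directly from the definition of
$s_i^*(\lambda)$, and should agree with the interval that
`modelSelection` reports for that penalty:

```{r zoomSelectionCheck}
## Brute-force evaluation of the model selection function for one
## penalty value, compared against the modelSelection intervals.
lambda <- 1
(s.star <- with(zoom.models, n.segments[which.min(loss + lambda*n.segments)]))
subset(zoom.selection, min.lambda < lambda & lambda < max.lambda)
```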
As can be seen from the definition of $s_i^*(\lambda)$ and the
figures above, a bigger penalty $\lambda$ results in a model with
fewer segments/changes (and vice versa). Note that there are some
models that are not selected for any penalty values (e.g. 3 or 6
segments for this data set).

The predicted changepoint positions are every `chromStart` on
segments after the first:

```{r zoomSegsChanges}
(zoom.segs <- do.call(rbind, zoom.segs.list[zoom.selection$n.segments]))
(zoom.changes <- subset(zoom.segs, 1 < start))
```

We use the code below to visualize the models along with the noisy
data sequence:

```{r zoomPlotdata}
zoom.gg.models <- zoom.gg.lab+
  theme(
    panel.margin=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(n.segments ~ .)+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=zoom.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=zoom.segs,
    size=1,
    color="green")
print(zoom.gg.models)
```

## Label error

Next, we compute the number of incorrect labels for each model. The
`labelError` function inputs 3 data.frames:

* `models` one row per changepoint model, with columns for loss and
  model complexity.
* `labels` one row for each label, with columns for label location
  and type.
* `changes` one row for each predicted changepoint, with a column for
  the predicted changepoint position.

```{r zoomlabelerror}
zoom.error.list <- penaltyLearning::labelError(
  zoom.selection, zoom.labels, zoom.changes,
  problem.vars=c("profile.id", "chromosome"))
str(zoom.error.list)
```

The result of `labelError` is a list of two data tables:

* `model.errors` has one row per (sequence,model), with the total
  number of incorrect labels. This is used as the input for the next
  step (computing a target interval of penalty values which achieves
  minimum incorrect labels for each data sequence).
* `label.errors` has one row per (sequence,model,label), with the
  fp/fn status of each label. This can be used to visualize false
  negative and false positive predictions in particular data
  sequences.

**Exercise:** visualize the false negatives and false positives in
the data set that you labeled, using the code below.

```{r zoomplotlabels}
zoom.gg.models+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    data=zoom.error.list$label.errors,
    fill=NA)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))
```

* **False positives** are labels with too many predicted
  changepoints, drawn with a solid black border.
* **False negatives** are labels with too few predicted changepoints,
  drawn with a dotted black border.
* **Question:** which models are consistent with the labels in your
  data?

## Outputs

Now that we have computed `zoom.error.list$model.errors`, which
contains the number of incorrect labels $e_i(s)\in\{0,1,2,\dots\}$
for this data sequence $i$ and every number of segments $s$, we can
visualize the label error as a function of the penalty,

$$
E_i(\lambda) = e_i[s^*_i(\lambda)].
$$

The code below makes a figure of $E_i(\lambda)$, clearly showing that
it is a piecewise constant function that takes integer values (like
the 0-1 loss in binary classification).
```{r zoomploterrors}
zoom.errors.tall <- data.table::melt(
  zoom.error.list$model.errors,
  measure.vars=c("n.segments", "errors"))
zoom.gg.errors <- ggplot()+
  geom_segment(aes(
    min.log.lambda, value,
    xend=max.log.lambda, yend=value),
    size=1,
    data=zoom.errors.tall)+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "lines"))+
  facet_grid(variable ~ ., scales="free")+
  scale_y_continuous("", breaks=0:max.segments)+
  xlab("log(penalty) = log(lambda)")
print(zoom.gg.errors)
```

Below we compute the output for the machine learning problem, a
target interval of penalty values that yield changepoint models with
minimal incorrect labels.

```{r zoomplottargets}
zoom.target <- penaltyLearning::targetIntervals(
  zoom.error.list$model.errors,
  problem.vars=c("profile.id", "chromosome"))
zoom.target.tall <- data.table::melt(
  zoom.target,
  measure.vars=c("min.log.lambda", "max.log.lambda"),
  variable.name="limit")[is.finite(value)]
zoom.gg.errors+
  geom_point(aes(
    value, errors, fill=limit),
    shape=21,
    size=4,
    data=data.frame(zoom.target.tall, variable="errors"))+
  scale_fill_manual("limit", values=c(
    min.log.lambda="black",
    max.log.lambda="white"))
```

The plot above emphasizes the target interval, which is used as the
output in the machine learning problem:

* the black dot shows the lower limit of the target interval. If a
  smaller log(penalty) is predicted, then there are too many
  changepoints (false positives).
* the white dot shows the upper limit of the target interval. If a
  larger log(penalty) is predicted, then there are too few
  changepoints (false negatives).

The target interval is different for each labeled data sequence. We
will show that by computing the target interval for each of the
labeled data sequences in the six profiles that we showed in the
beginning of the tutorial. The first step is to compute changepoints
and model selection functions, which we do via a for loop over each
data sequence below.
```{r sixSegmentor}
six.problems.dt <- unique(six.profiles[, list(
  profile.id, chromosome, status.profile)])
setkey(six.profiles, profile.id, chromosome)
six.segs.list <- list()
six.selection.list <- list()
for(problem.i in 1:nrow(six.problems.dt)){
  meta <- six.problems.dt[problem.i,]
  pro <- six.profiles[meta]
  max.segments <- min(nrow(pro), 10)
  fit <- Segmentor3IsBack::Segmentor(
    pro$logratio, model=2, Kmax=max.segments)
  rss.vec <- rep(NA, max.segments)
  for(n.segments in 1:max.segments){
    end <- fit@breaks[n.segments, 1:n.segments]
    seg.mean.vec <- fit@parameters[n.segments, 1:n.segments]
    data.before.change <- end[-n.segments]
    data.after.change <- data.before.change+1
    pos.before.change <- as.integer(
      (pro$position[data.before.change]+pro$position[data.after.change])/2)
    start <- c(1, data.after.change)
    chromStart <- c(pro$position[1], pos.before.change)
    chromEnd <- c(pos.before.change, max(pro$position))
    data.mean.vec <- rep(seg.mean.vec, end-start+1)
    rss.vec[n.segments] <- sum((pro$logratio-data.mean.vec)^2)
    six.segs.list[[paste(problem.i, n.segments)]] <- data.table(
      meta,
      n.segments,
      start, end,
      chromStart, chromEnd,
      mean=seg.mean.vec)
  }
  loss.dt <- data.table(
    meta,
    n.segments=1:max.segments,
    loss=rss.vec)
  six.selection.list[[problem.i]] <- penaltyLearning::modelSelection(
    loss.dt, complexity="n.segments")
}
six.selection <- do.call(rbind, six.selection.list)
six.segs <- do.call(rbind, six.segs.list)
six.changes <- six.segs[1 < start]
```

Inside of each iteration of the loop above, we used

* `Segmentor3IsBack::Segmentor` to compute the optimal changepoint
  models, saving the tidy result data.frame of segments to an element
  of `six.segs.list`.
* `penaltyLearning::modelSelection` to compute the model selection
  functions, saving the result to an element of `six.selection.list`.

Below, we use

* `penaltyLearning::labelError` to compute the number of incorrect
  labels for each segmentation model.
* `penaltyLearning::targetIntervals` to compute the target interval
  of penalty values for each labeled data sequence.

```{r sixPlotErrors}
six.error.list <- penaltyLearning::labelError(
  six.selection, six.labels, six.changes,
  problem.vars=c("profile.id", "chromosome"))
six.targets <- penaltyLearning::targetIntervals(
  six.error.list$model.errors,
  problem.vars=c("profile.id", "chromosome"))
ymax.err <- 1.2
six.targets.tall <- data.table::melt(
  six.targets,
  measure.vars=c("min.log.lambda", "max.log.lambda"),
  variable.name="limit",
  value.name="log.lambda")[is.finite(log.lambda)]
six.gg.errors <- ggplot()+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "lines"))+
  facet_grid(profile.id ~ chromosome, labeller=function(df){
    if("chromosome" %in% names(df))
      df$chromosome <- paste0("chr", df$chromosome)
    if("profile.id" %in% names(df))
      df$profile.id <- paste("profile", df$profile.id)
    df
  })+
  geom_segment(aes(
    min.log.lambda, errors,
    xend=max.log.lambda, yend=errors),
    size=1,
    data=six.error.list$model.errors)+
  geom_point(aes(
    log.lambda, errors, fill=limit),
    shape=21,
    size=3,
    alpha=0.5,
    data=six.targets.tall)+
  scale_fill_manual("limit", values=c(
    min.log.lambda="black",
    max.log.lambda="white"))+
  scale_y_continuous(
    "incorrectly predicted labels",
    limits=c(0,ymax.err),
    breaks=c(0,1))+
  scale_x_continuous(
    "predicted log(penalty) = log(lambda)")
print(six.gg.errors)
```

The plot above shows the error function and target intervals for each
labeled data sequence. Each column of panels/facets represents a
different chromosome, and each row represents a different
profile/patient.
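The target intervals can also be inspected numerically; a quick look
(a minimal sketch, assuming the `six.targets` data table computed
above) shows one row per labeled data sequence:

```{r sixTargetsPrint}
## One row per labeled data sequence: the interval of log(penalty)
## values that achieves the minimum number of incorrect labels.
six.targets[, list(
  profile.id, chromosome, min.log.lambda, max.log.lambda, errors)]
```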
It is clear from the plot above that a model with minimal incorrect
labels will predict a different penalty for each data sequence.

**Question:** discuss with your neighbor for 1 minute, why is there
only one limit per data sequence? (only one target limit point per
panel in the plot above)

## Inputs

To train a machine learning model, we first need to compute a vector
$\mathbf x_i\in\mathbb R^p$ of exactly $p$ features for each data
sequence $i$. We then learn a function
$f(\mathbf x_i)=\log\lambda_i\in\mathbb R$. To compute a feature
matrix, use the code below:

```{r sixFeatureMatrix}
feature.mat <- penaltyLearning::featureMatrix(
  six.profiles,
  problem.vars=c("profile.id", "chromosome"),
  data.var="logratio")
str(feature.mat)
```

The feature matrix is 144 rows (one for each data sequence) by 365
columns/features. Which features should we use to predict
log(penalty) values? If you have a lot of labels, you can use
`penaltyLearning::IntervalRegressionCV`, which uses L1-regularization
to automatically select relevant features -- more details in the next
section. For now we will discuss the simpler unsupervised BIC/SIC
model.

**Example:** The BIC/SIC penalty is $\lambda_i = \log d_i$, where
$d_i$ is the number of data points to segment in sequence $i$. In the
changepoint package version 2.2 this is `penalty="SIC0"`. In this
framework the BIC/SIC criterion is a linear model
$f(\mathbf x_i)=\mathbf w^\intercal \mathbf x_i + \beta$ with
intercept $\beta=0$ and slope $w=1$ for $p=1$ feature,
$\mathbf x_i=\log\log d_i$ -- this is the column `n.loglog` of the
feature matrix. The BIC/SIC log penalty is

$$
\log\lambda_i = f(\mathbf x_i) = \beta + w\log\log d_i = \log\log d_i.
$$

This is equivalent to solving the following penalized changepoint
detection problem,

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j)
+\underbrace{(\log d_i)}_{\lambda_i}
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}).
$$
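As a quick check that `n.loglog` is indeed $\log\log d_i$, we can
compare a direct computation against the feature matrix (a minimal
sketch, assuming the `feature.mat` and `zoom.pro` objects computed
above; feature matrix rows are assumed to be named by pasting
profile.id and chromosome, as used later in this tutorial):

```{r loglogCheck}
## The BIC/SIC log(penalty) for a sequence with d data points is
## log(log(d)); compare with the n.loglog feature column for the
## zoomed data sequence.
log(log(nrow(zoom.pro)))
feature.mat[paste(zoom.profile, zoom.chromosome), "n.loglog"]
```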
The plot below visualizes the target interval limits in this feature
space.

```{r sixLimitsPlot}
six.BIC.dt <- six.profiles[, list(
  feature=log(log(.N)),
  model="BIC/SIC"
), by=list(profile.id, chromosome)]
six.BIC.dt[, pred.log.lambda := feature]
six.features.tall <- six.BIC.dt[six.targets.tall, on=list(
  profile.id, chromosome)]
six.gg.limits <- ggplot()+
  geom_point(aes(
    feature, log.lambda, fill=limit),
    shape=21,
    data=six.features.tall)+
  scale_fill_manual("limit", values=c(
    min.log.lambda="black",
    max.log.lambda="white"))+
  scale_x_continuous(
    "feature")+
  scale_y_continuous(
    "log(penalty)=log(lambda)")
print(six.gg.limits)
```

In the plot above, each point represents a finite limit of a target
interval. The goal of learning is to find a regression function which
is below the white dots (upper limits) and above the black dots
(lower limits). Note that this is not the same problem as binary
classification, which the plot below makes clear.

```{r sixTargetsPlot}
six.features.targets <- six.BIC.dt[six.targets, on=list(
  profile.id, chromosome)]
six.gg.limits+
  geom_segment(aes(
    feature, min.log.lambda,
    xend=feature, yend=max.log.lambda),
    data=six.features.targets)
```

In the plot above, each line segment represents the target interval
of penalty values that achieves the minimal number of incorrect
labels for a labeled data sequence. The goal of learning is to find a
function that intersects as many intervals as possible. In this
context we can plot the BIC/SIC model as a line with slope 1 and
intercept 0, and we can visualize its errors using the vertical line
segments below.

```{r sixBIC}
six.features.targets[, residual := 0]
six.features.targets[{
  is.finite(min.log.lambda)|is.finite(max.log.lambda)
}, residual := {
  penaltyLearning::targetIntervalResidual(
    cbind(min.log.lambda, max.log.lambda),
    pred.log.lambda)
}]
bic.line.dt <- data.table(
  slope=1, intercept=0, model="BIC/SIC")
model.colors <- c(
  "BIC/SIC"="#1B9E77",#green
  survreg="#D95F02",#orange
  IntervalRegressionCV="#7570B3",#dark purple
  IntervalRegressionCVmargin="#BEAED4")#light purple
six.gg.bic <- six.gg.limits+
  scale_color_manual(values=model.colors)+
  geom_abline(aes(
    slope=slope, intercept=intercept, color=model),
    size=1,
    data=bic.line.dt)+
  geom_segment(aes(
    feature, pred.log.lambda,
    xend=feature, yend=pred.log.lambda-residual,
    color=model),
    data=six.features.targets)
print(six.gg.bic)
```

We can see from the plot above that there are 5 data sequences for
which the BIC/SIC predicts a penalty value outside of the target
interval (penalty too large = too few changepoints). It is clear that
the number of errors could be reduced by simply decreasing the
intercept (moving the line down). However there is no line in this
feature space that perfectly separates the white and black points.

We can also visualize the number of incorrect labels for the BIC/SIC
model by plotting the data, segmentation and labels below.

```{r figBIC, fig.width=10}
six.BIC.selection <- data.table(six.selection)[six.BIC.dt, on=list(
  profile.id, chromosome,
  min.log.lambda < feature,
  max.log.lambda > feature)]
six.BIC.labels <- six.error.list$label.errors[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.changes <- six.changes[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.BIC.segs <- six.segs[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("BIC/SIC model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=six.BIC.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.BIC.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=six.BIC.segs,
    size=1,
    color="green")+
  theme(legend.box="horizontal")
```

**Remark:** the result above (BIC/SIC predicts too few changepoints,
resulting in false negatives) is specific to the neuroblastoma data
set. However, in general you can expect that unsupervised penalty
functions will commit changepoint detection errors which are visually
obvious. This is the main motivation of supervised changepoint
analysis: we can use labels to learn a penalty function with improved
changepoint detection accuracy.

**Question:** discuss with your neighbor for a minute, why are there
only four false negatives in this plot, but there are five
predictions outside the target interval in the previous plot with the
regression line? Hint: look at the plot of the error as a function of
penalty.

Another way to visualize the predicted penalty values is with the
dots in the error function plot below.
```{r six-gg-errors-bic}
six.BIC.errors <- six.error.list$model.errors[six.BIC.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.gg.errors.bic <- six.gg.errors+
  geom_point(aes(
    pred.log.lambda, errors, color=model),
    size=2,
    data=six.BIC.errors)+
  scale_color_manual(values=model.colors)
print(six.gg.errors.bic)
```

It is clear from the plot above that

* Most of the BIC/SIC penalty values occur where the error function
  is zero. These are predictions which are correct with respect to
  the labeled regions.
* Some of the penalty values occur where the error function is
  non-zero. These are prediction errors with respect to the labels.

**Exercise (easy):** In the changepoint package version 2.2,
`penalty="SIC0"` means $\lambda_i = \log d_i$ (same as the BIC/SIC we
explored in this section). Use the `changepoint::cpt.mean` function
to compute the optimal changepoints for one of the neuroblastoma data
sequences. Verify that the penalty and changepoints are the same as
we computed above.

**Exercise (medium):** In the changepoint package version 2.2,
`penalty="SIC"` means $\lambda_i = 2\log d_i$ (twice as much penalty
as the BIC/SIC in this section). What is the slope/intercept of this
model in the same feature space? Re-do the plots in this section. Is
the model more or less accurate?

**Exercise (hard):** can you think of any other features which would
be useful for predicting log(penalty) values? In other words, what
features of a data sequence should be related to the number of
changepoints we should predict? Change the definition of `feature` in
`six.BIC.dt` above, and re-do the plots in this section. Using your
feature, is it possible to learn an affine function with fewer errors
-- are the white and black dots linearly separable?

## Learning and prediction

In the previous section we explored the unsupervised BIC/SIC penalty,
which uses 1 feature, and does not learn any parameters using the
labels. In this section, we will explain two supervised penalty
learning methods that can be used to increase changepoint detection
accuracy.

### survreg (un-regularized interval regression)

We begin by using the same feature as the BIC/SIC penalty, but
relaxing the assumption that the intercept $\beta=0$ and slope
$w=1$. Instead of taking these parameter values for granted based on
the theoretical arguments of the BIC/SIC penalty, we can use the
`survival::survreg` function to optimize these parameters based on
the labels. The penalty function is
$\log\lambda_i = f(\mathbf x_i) = \beta + w\log\log d_i$, which is
equivalent to solving the following penalized changepoint detection
problem,

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j)
+\underbrace{e^\beta(\log d_i)^w}_{\lambda_i}
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}),
$$

where the intercept $\beta$ and the slope $w$ are defined as the
solution to the following optimization problem,

$$
\operatorname{minimize}_{
\mathbf w\in\mathbb R^p, \beta\in\mathbb R, \sigma\in\mathbb R_+
}
\sum_{i=1}^n -\log[
F(\overline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)-
F(\underline y_i, \mathbf w^\intercal \mathbf x_i+\beta, \sigma)
],
$$

where

* the optimization variables are the coefficients or weight vector
  $\mathbf w\in\mathbb R^p$, the bias/intercept $\beta\in\mathbb R$,
  and the scale $\sigma\in\mathbb R_+$. In this section for the
  BIC/SIC penalty we have $p=1$ feature, but `survreg` can be used
  with more features.
* $(\underline y_i,\overline y_i)$ is the target interval of
  log(penalty) values.
* $\mathbf x_i\in\mathbb R^p$ is the feature vector for data sequence
  $i$.
* $\log\lambda_i= f(\mathbf x_i)=\mathbf w^\intercal \mathbf
  x_i+\beta$ is the predicted log(penalty) value for data sequence
  $i$. The `survreg` Normal model is
  $\log \lambda_i \sim N(\mathbf w^\intercal \mathbf x_i+\beta,
  \sigma^2)$.
* $F$ is the likelihood: a cumulative distribution function such as
  the normal, logistic, or Weibull (specified using the `dist`
  argument).
* The objective function to minimize is the total negative log
  likelihood, which is convex with respect to the weight vector
  $\mathbf w$ and the bias/intercept $\beta$, but not jointly convex
  with respect to the scale parameter $\sigma$. Thus overall the
  optimization problem is non-convex, meaning that the gradient-based
  algorithms that `survreg` uses are only guaranteed to reach a local
  minimum (not the globally most likely model).

To perform a fairer comparison, we will hold out chromosome 11 as a
test set, and only train the model using the other labeled
chromosomes.

```{r fit-survreg}
library(survival)
train.dt <- six.features.targets[chromosome!=11]
(fit.survreg <- survreg(
  Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
  train.dt, dist="gaussian"))
```

It is clear that the coefficients of the learned model are different
from the BIC/SIC model. We can visualize the regression line in the
feature space via the code below:

```{r six-gg-survreg}
survreg.line.dt <- data.table(t(coef(fit.survreg)), model="survreg")
six.survreg.pred <- data.table(six.BIC.dt)
six.survreg.pred[, model := "survreg"]
six.survreg.pred[, pred.log.lambda := predict(fit.survreg, six.survreg.pred)]
six.survreg.res <- six.survreg.pred[six.features.targets, on=list(
  profile.id, chromosome)]
six.survreg.res[, residual := penaltyLearning::targetIntervalResidual(
  cbind(min.log.lambda, max.log.lambda),
  pred.log.lambda)]
six.gg.survreg <- six.gg.bic+
  geom_abline(aes(
    slope=feature, intercept=`(Intercept)`, color=model),
    size=1,
    data=survreg.line.dt)+
  geom_segment(aes(
    feature, pred.log.lambda,
    xend=feature, yend=pred.log.lambda-residual,
    color=model),
    data=six.survreg.res)
print(six.gg.survreg)
```

The plot below shows the predicted errors and the smooth surrogate
loss minimized by the gradient descent algorithm implemented by the
survreg function (gradient descent cannot minimize the non-convex
error function directly).

```{r six-gg-errors-survreg}
six.survreg.pred[, pred.log.penalty := pred.log.lambda]
six.survreg.selection <- data.table(six.selection)[six.survreg.pred, on=list(
  profile.id, chromosome,
  min.log.lambda < pred.log.lambda,
  max.log.lambda > pred.log.lambda)]
six.survreg.errors <- six.error.list$model.errors[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
log.pen.vec <- six.error.list$model.errors[, seq(
  min(max.log.lambda), max(min.log.lambda), l=100)]
six.survreg.surrogate <- six.survreg.res[, {
  ## For censored observations, survreg finds a linear function that
  ## predicts mean values that maximize the normal CDF (pnorm) --
  ## equivalent to minimizing the negative log likelihood below.
  neg.log.lik <- -log(
    pnorm(max.log.lambda, log.pen.vec, fit.survreg$scale)-
    pnorm(min.log.lambda, log.pen.vec, fit.survreg$scale))
  data.table(
    model="survreg",
    log.penalty=log.pen.vec,
    surrogate.loss=neg.log.lik
  )[neg.log.lik < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg <- six.gg.errors.bic+
  geom_point(aes(
    pred.log.penalty, errors, color=model),
    size=2,
    data=six.survreg.errors)+
  geom_line(aes(
    log.penalty, surrogate.loss, color=model),
    data=six.survreg.surrogate)
print(six.gg.errors.survreg)
```

Note in the plot above:

* four `survreg` predicted penalties achieve 0 error where the
  BIC/SIC has 1 error.
* one BIC/SIC prediction achieves 0 error where `survreg` has 1
  error.
* the smooth surrogate loss is a good approximation of the non-convex
  error function.

We can also visualize the survreg model predictions by plotting the
data sequences via the code below.

```{r figsurvreg, fig.width=10}
six.survreg.labels <- six.error.list$label.errors[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.changes <- six.changes[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.survreg.segs <- six.segs[six.survreg.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("survreg 1 feature model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=six.survreg.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.survreg.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=six.survreg.segs,
    size=1,
    color="green")+
  theme(legend.box="horizontal")
```

The false positive and false negative predictions are evident in the
plot above.

**Exercise (easy):** the survreg model above assumed a Gaussian
distribution for the log(penalty) values. Would using another
distribution increase accuracy? Hint: try re-doing the analyses in
this section with the `survreg(dist="logistic")` model.

**Exercise (medium):** for one of the neuroblastoma data sequences
plotted above, verify that `changepoint::cpt.mean(data.vec,
penalty="Manual", pen.value=PREDICTED.PENALTY, method="PELT")` and
`fpop::Fpop(data.vec, lambda=PREDICTED.PENALTY)` yield the same
changepoints (a starting sketch appears after these exercises). Fpop
implements a "functional pruning optimal partitioning" algorithm
which is guaranteed to prune more sub-optimal changepoints (i.e. to
be faster) than the PELT algorithm. Read [On optimal multiple
changepoint algorithms for large
data](http://link.springer.com/article/10.1007/s11222-016-9636-3) for
more info.

**Exercise (hard):** try re-doing the analyses in this section using
`penaltyLearning::IntervalRegressionUnregularized`, which learns an
un-regularized model by minimizing the squared hinge loss with
respect to the target interval limits. Is there any difference in
prediction accuracy?
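Here is a starting point for the medium exercise (a minimal sketch,
not a full solution: the data sequence is the one labeled above, the
penalty is the one predicted by the survreg model for that sequence,
and `t.est` is assumed to hold the segment ends returned by Fpop,
including the last data point):

```{r peltFpopSketch}
## Compare PELT and FPOP changepoints for one data sequence, using
## the penalty that the survreg model predicts for that sequence.
data.vec <- zoom.pro$logratio
pred.penalty <- exp(six.survreg.pred[
  profile.id==zoom.profile & chromosome==zoom.chromosome,
  pred.log.lambda])
pelt.fit <- changepoint::cpt.mean(
  data.vec, penalty="Manual", pen.value=pred.penalty, method="PELT")
fpop.fit <- fpop::Fpop(data.vec, lambda=pred.penalty)
## The two vectors below should be the same.
changepoint::cpts(pelt.fit)
head(fpop.fit$t.est, -1)
```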
### IntervalRegressionCV (L1-regularized interval regression)

In the previous sections, we assumed that the log(penalty) is an
affine function of a given single feature. In this section, we show
how a multi-variate affine penalty function can be learned using L1
regularization.

The penalty function we learn is
$\log\lambda_i = f(\mathbf x_i) = \beta + \mathbf w^\intercal \mathbf x_i$,
which is equivalent to solving the following penalized changepoint
detection problem,

$$
\operatorname{minimize}_{\mathbf m\in\mathbb R^{d_i}}
\sum_{j=1}^{d_i} \ell(z_{ij}, m_j) +
\underbrace{\exp(\beta+\mathbf w^\intercal \mathbf x_i)}_{\lambda_i}
\sum_{j=1}^{d_i-1} I(m_{j} \neq m_{j+1}),
$$

where the intercept $\beta$ and the weight vector $\mathbf w$ are
defined as the solution to the following optimization problem,

$$
\operatorname{minimize}_{
\mathbf w\in\mathbb R^p, \beta\in\mathbb R
}
\gamma ||\mathbf w||_1 +
\sum_{i=1}^n
\phi(\mathbf w^\intercal \mathbf x_i+\beta-\underline y_i)+
\phi(\overline y_i-\mathbf w^\intercal \mathbf x_i-\beta),
$$

where

* the data and optimization variables are the same as in the
  un-regularized interval regression problem described in the
  previous section (except there is no scale $\sigma$).
* the L1-norm $||\mathbf w||_1=\sum_{j=1}^p |w_j|$ is a
  regularization term which helps to avoid overfitting. It forces
  some of the $w_j$ weights to be exactly equal to zero (feature
  selection).
* the hyper-parameter $\gamma>0$ is the degree of L1-regularization
  (chosen via an internal cross-validation loop). This means that the
  model performs automatic feature selection. You can thus use as
  many features $p$ as you want, and the model will automatically
  ignore any which do not increase prediction accuracy.
* $\phi(x)=(x-1)^2I(x<1)$ is the squared hinge loss ($I$ is the
  indicator function: 1 if the argument is true, and 0 otherwise).
* The objective function to minimize is the L1-norm of the weight
  vector plus the total squared hinge loss (both are convex). Overall
  the optimization problem is convex, so the global minimum is
  guaranteed to be found in a reasonable amount of time using
  gradient-based algorithms.

This is a machine learning technique that automatically selects
features that result in good prediction accuracy. First we create a
feature and target matrix for the data in the train set:

```{r train-feature-mat}
finite.feature.mat <- feature.mat[, apply(is.finite(feature.mat), 2, all)]
train.feature.mat <- train.dt[, finite.feature.mat[paste(profile.id, chromosome),] ]
train.target.mat <- train.dt[, cbind(min.log.lambda, max.log.lambda)]
str(train.feature.mat)
```

The training set consists of 30 examples, each with 256
features. Before training, we register the multiprocess future plan
(the internal cross-validation loop to choose the regularization
parameter will be computed in parallel).

```{r require-future}
if(require(future)){
  options(mc.cores=parallel::detectCores()/2)
  plan(multiprocess)
}
```

We use `set.seed` below in order to make the results reproducible
(there is randomness in the fold assignment in the internal
cross-validation loop). Then, we use `IntervalRegressionCV` to fit
the regularized linear model.

```{r fit-cv}
set.seed(1)
fit.cv <- penaltyLearning::IntervalRegressionCV(
  train.feature.mat, train.target.mat)
coef(fit.cv)
```

It is clear from the output above that even though the model was
trained using 256 possible input features, it has selected only a few
of them to use in the prediction function.
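To list the selected features by name, we can inspect the row names
of the coefficient matrix (a small sketch, assuming `fit.cv` from
above; `coef(fit.cv)` returns one row per selected feature plus an
`(Intercept)` row, as also used later in this tutorial):

```{r selectedFeatures}
## Names of the features with nonzero learned weights: all rows of
## the coefficient matrix except the intercept.
weight.mat <- coef(fit.cv)
rownames(weight.mat)[-1]
```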
We visualize the predicted penalty values as dots in the error curve
plots below:

```{r six-gg-errors-cv}
six.cv.pred <- data.table(six.survreg.pred)
six.cv.pred[, pred.log.penalty := {
  pred.feat.mat <- finite.feature.mat[paste(profile.id, chromosome), ]
  predict(fit.cv, pred.feat.mat)
}]
six.cv.pred[, model := "IntervalRegressionCV"]
six.cv.pred[, pred.log.lambda := pred.log.penalty]
six.cv.selection <- data.table(six.selection)[six.cv.pred, on=list(
  profile.id, chromosome,
  min.log.lambda < pred.log.lambda,
  max.log.lambda > pred.log.lambda)]
six.cv.errors <- six.error.list$model.errors[six.cv.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.cv.surrogate <- six.survreg.res[, {
  ## IntervalRegressionCV minimizes the squared hinge loss.
  data.table(
    model="IntervalRegressionCV",
    log.penalty=log.pen.vec,
    surrogate.loss=penaltyLearning::squared.hinge(log.pen.vec-min.log.lambda)+
      penaltyLearning::squared.hinge(max.log.lambda-log.pen.vec)
  )[surrogate.loss < ymax.err]
}, by=list(profile.id, chromosome)]
six.gg.errors.survreg+
  geom_point(aes(
    pred.log.penalty, errors, color=model),
    size=2,
    data=six.cv.errors)+
  geom_line(aes(
    log.penalty, surrogate.loss, color=model),
    data=six.cv.surrogate)
```

The plot above also shows the surrogate loss that
`IntervalRegressionCV` minimizes using a gradient descent
algorithm. Note how it is very similar to the `survreg` surrogate
loss -- both are smooth functions which increase quadratically as the
predicted value goes past the target interval limit.

Finally, we can also visualize the `IntervalRegressionCV` model in
terms of the data sequences, labels, and predicted segmentation:

```{r figcv, fig.width=10}
six.cv.labels <- six.error.list$label.errors[six.cv.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.cv.changes <- six.changes[six.cv.selection, on=list(
  profile.id, chromosome, n.segments), nomatch=0L]
six.cv.segs <- six.segs[six.cv.selection, on=list(
  profile.id, chromosome, n.segments)]
gg.supervised+
  ggtitle("IntervalRegressionCV multi-feature model changepoints and label errors")+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, linetype=status),
    fill=NA,
    data=six.cv.labels)+
  scale_linetype_manual("error type", values=c(
    correct=0,
    "false negative"=3,
    "false positive"=1))+
  geom_vline(aes(
    xintercept=chromStart/1e6),
    data=six.cv.changes,
    color="green",
    size=1,
    linetype="dashed")+
  geom_segment(aes(
    chromStart/1e6, mean,
    xend=chromEnd/1e6, yend=mean),
    data=six.cv.segs,
    size=1,
    color="green")+
  theme(legend.box="horizontal")
```

The plot above clearly shows the advantage of the machine learning
approach: all labels are correctly predicted, including all labels on
the test set (chromosome 11).

**Exercise:** by default the `IntervalRegressionCV` function chooses
the regularization parameter with minimum error, as estimated using
an internal cross-validation loop. Another option is to choose the
least complex model within 1 standard deviation. Re-fit the model
using `IntervalRegressionCV(reg.type="1sd")` (a starting sketch
appears below), view the error as a function of model complexity via
`plot(fit.cv)`, then re-do the model prediction error plots. Is this
model more or less accurate? Why? Hint: use `coef(fit.cv)` to see how
many features are included in the model.
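A minimal sketch of the re-fit for this exercise (the accuracy
comparison and the prediction error plots are left as the exercise):

```{r fitCV1sd}
## Re-fit, choosing the simplest model within one standard deviation
## of the minimum validation error, then inspect the error curve and
## the selected features.
set.seed(1)
fit.cv.1sd <- penaltyLearning::IntervalRegressionCV(
  train.feature.mat, train.target.mat, reg.type="1sd")
plot(fit.cv.1sd)
coef(fit.cv.1sd)
```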
## Evaluation

Since we did not use the chromosome 11 labels to train any of the
models above, we can use those labels to evaluate the prediction
accuracy of the learned models.

* In fact, we have already **qualitatively** evaluated the accuracy
  of the predictions by visual inspection of the plots in the last
  section. Qualitative evaluation via visual inspection is also
  possible in the context of an unlabeled unsupervised changepoint
  analysis.
* However because we have labels, we can **quantitatively** evaluate
  the prediction accuracy by computing the number of incorrect labels
  and AUC (area under the ROC curve). This is a major advantage of
  the supervised machine learning approach.

Below we create a list with an element for each of the three models
we want to compare. Then we use a `for` loop to go through each
model, computing its ROC curve and AUC on the test labels (chromosome
11) via `penaltyLearning::ROChange`.

```{r six-auc}
model.pred.list <- list(
  IntervalRegressionCV=six.cv.pred,
  "BIC/SIC"=six.BIC.dt,
  survreg=six.survreg.pred)
roc.list <- list()
auc.list <- list()
for(model in names(model.pred.list)){
  model.pred <- model.pred.list[[model]]
  model.roc <- penaltyLearning::ROChange(
    six.error.list$model.errors,
    model.pred[chromosome==11],
    problem.vars=c("profile.id", "chromosome"))
  roc.list[[model]] <- data.table(model, model.roc$roc)
  auc.list[[model]] <- with(model.roc, data.table(
    model, auc, thresholds[threshold=="predicted"]))
}
roc <- do.call(rbind, roc.list)
(auc <- do.call(rbind, auc.list))
```

The table above shows the error/accuracy of the three models with
respect to the labels in the test data (chromosome 11). It is clear
that the IntervalRegressionCV is slightly more accurate (higher AUC
and fewer incorrect labels). The ROC curves are shown below:

```{r six-gg-roc}
ggplot()+
  scale_color_manual(values=model.colors)+
  geom_path(aes(
    FPR, TPR, color=model, linetype=model),
    size=1,
    data=roc)+
  geom_point(aes(
    FPR, TPR, color=model),
    size=3,
    data=auc)
```

However, with so few labels, and only one test set, it is
difficult/impossible to tell if there is any significant difference
between models. Below we use K-fold cross-validation to evaluate the
models:

* we assign each labeled data sequence to one of K folds.
* for each fold, we set aside the corresponding data in that fold as
  a test set.
* we then train each of the models using all of the other folds (not
  using the test fold at all during this training phase).
* after each model has been trained, we compute predicted penalty
  values for the test data, then accuracy metrics (percent incorrect
  labels and AUC).
* we then repeat the process for each of the folds, and check to see
  if any method is better on average over all of the test sets.

Since the neuroblastoma data set is relatively big (3418 labeled data
sequences), we have pre-computed the Segmentor models, which we load
using the code below. (to re-compute them on your computer, run
[Segmentor.models.R](https://github.com/tdhock/change-tutorial/blob/master/Segmentor.models.R))

```{r load-Segmentor-models}
if(!file.exists("Segmentor.models.RData")){
  download.file(
    "http://members.cbio.ensmp.fr/~thocking/change-tutorial/Segmentor.models.RData",
    "Segmentor.models.RData")
}
load("Segmentor.models.RData")
```

Before performing cross-validation, we can compute the error
functions, features and target intervals for all of the labeled data.
```{r labeled-feature-mat}
nb.selection <- Segmentor.models$loss[, penaltyLearning::modelSelection(
  .SD, complexity="n.segments"), by=list(
    profile.id, chromosome)]
nb.error.list <- penaltyLearning::labelError(
  nb.selection,
  neuroblastoma$annotations,
  Segmentor.models$segs[1 < start],
  problem.vars=c("profile.id", "chromosome")
)
nb.target.dt <- penaltyLearning::targetIntervals(
  nb.error.list$model.errors,
  problem.vars=c("profile.id", "chromosome"))
labeled.profiles <- data.table(
  neuroblastoma$profiles)[nb.target.dt[, list(
    profile.id, chromosome)], on=list(
      profile.id, chromosome)]
labeled.feature.mat <- penaltyLearning::featureMatrix(
  labeled.profiles,
  problem.vars=c("profile.id", "chromosome"),
  data.var="logratio")
```

Rather than training on all features (which is possible but takes a
bit too long for this tutorial), we will just use the following
subset:

```{r some-feature-names}
(some.feature.names <- c(
  colnames(labeled.feature.mat)[1:5],
  rownames(coef(fit.cv))[-1]))
```

Next we assign a fold to each target,

```{r nb-target-dt}
set.seed(1)
n.folds <- 5
nb.target.dt[, fold := sample(rep(1:n.folds, l=.N))]
```

Then we loop over test folds. In each iteration of the for loop, we
create `train.target.dt` and `train.features` by excluding data in
the test fold. We then train each model using these data, and compute
prediction accuracy for each model using
`penaltyLearning::ROChange`.

```{r cv-auc-list}
cv.roc.list <- list()
cv.auc.list <- list()
for(test.fold in 1:n.folds){
  train.target.dt <- nb.target.dt[fold!=test.fold]
  train.id.vec <- train.target.dt[, paste(profile.id, chromosome)]
  train.features <- labeled.feature.mat[train.id.vec, ]
  train.target.mat <- train.target.dt[, cbind(min.log.lambda, max.log.lambda)]
  print(test.fold)
  set.seed(1)
  fit <- penaltyLearning::IntervalRegressionCV(
    train.features[, some.feature.names], train.target.mat)
  ## fit.margin <- penaltyLearning::IntervalRegressionCVmargin(
  ##   train.features[, some.feature.names], train.target.mat)
  train.target.dt$feature <- train.features[, "n.loglog"]
  fit.survreg <- survreg(
    Surv(min.log.lambda, max.log.lambda, type="interval2") ~ feature,
    train.target.dt, dist="gaussian")
  ## compute predictions
  test.target.dt <- nb.target.dt[fold==test.fold]
  test.id.vec <- test.target.dt[, paste(profile.id, chromosome)]
  test.feature.vec <- labeled.feature.mat[test.id.vec, "n.loglog"]
  pred.list <- list(
    "BIC/SIC"=test.feature.vec,
    survreg=predict(fit.survreg, data.table(feature=test.feature.vec)),
    ##IntervalRegressionCVmargin=predict(fit.margin, labeled.feature.mat[test.id.vec,]),
    IntervalRegressionCV=predict(fit, labeled.feature.mat[test.id.vec,]))
  for(model in names(pred.list)){
    pred.dt <- data.table(
      test.target.dt,
      pred.log.lambda=as.numeric(pred.list[[model]]))
    pred.roc <- penaltyLearning::ROChange(
      nb.error.list$model.errors,
      pred.dt,
      problem.vars=c("profile.id", "chromosome"))
    cv.roc.list[[paste(test.fold, model)]] <- data.table(
      test.fold, model, pred.roc$roc)
    cv.auc.list[[paste(test.fold, model)]] <- with(pred.roc, data.table(
      test.fold, model, auc, thresholds[threshold=="predicted"]))
  }
}
cv.auc <- do.call(rbind, cv.auc.list)
cv.roc <- do.call(rbind, cv.roc.list)
```

Note that in the code above, we used `cv.roc.list` to save the ROC
curve for each model and test fold, and we used `cv.auc.list` to save
AUC. Below we compute and plot two accuracy metrics, AUC and percent
incorrectly predicted labels.
```{r cvAccuracyPlot}
cv.auc[, accuracy.percent := 100-error.percent]
cv.auc.tall <- melt(cv.auc, measure.vars=c("accuracy.percent", "auc"))
cv.means <- cv.auc.tall[, list(
  mean=mean(value),
  sd=sd(value)
), by=list(model, variable)]
levs <- cv.means[variable=="auc"][order(mean), model]
cv.auc.tall[, model.fac := factor(model, levs)]
ggplot()+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "lines"))+
  facet_grid(. ~ variable, scales="free")+
  scale_color_manual(values=model.colors)+
  guides(color="none")+
  geom_point(aes(
    value, model.fac, color=model),
    data=cv.auc.tall)+
  xlab("")+
  ylab("model")
```

It is clear from the plot above that survreg is significantly more
accurate than BIC/SIC, and IntervalRegressionCV is slightly more
accurate than survreg.

**Remark:** the accuracy rates above (BIC/SIC predicts 92% correct
labels whereas survreg and IntervalRegressionCV predict 97-98%
correct labels) are specific to the neuroblastoma data set. However,
in general you can expect that unsupervised penalty functions like
BIC/SIC will be less accurate than supervised methods such as
IntervalRegressionCV. This is the main motivation of supervised
changepoint analysis: we can use labels to learn a penalty function
with improved changepoint detection accuracy.

We can also plot the ROC curves, to visualize what kinds of errors
the models commit (false positives or false negatives).

```{r cv-roc-curves}
ggplot()+
  scale_color_manual(values=model.colors)+
  geom_path(aes(
    FPR, TPR, color=model, group=paste(model, test.fold)),
    data=cv.roc)+
  geom_point(aes(
    FPR, TPR, color=model),
    fill="white",
    shape=21,
    data=cv.auc)+
  coord_equal(xlim=c(0, 0.5), ylim=c(0.5, 1))
```

It is clear from the ROC curves above that the BIC/SIC model has a
relatively low true positive rate (high false negative rate -- many
breakpoint/positive labels with no predicted changepoints).

**Conclusion:** in **labeled data sequences with a common
signal/noise pattern**, supervised learning algorithms can be used to
predict penalty values and changepoints which are **more accurate
than unsupervised methods**. An advantage of the supervised learning
approach is that we can **quantitatively evaluate** the learned
models by counting the number of incorrectly predicted labels in a
held-out test set.
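Finally, to illustrate how a learned penalty function is used in
practice, the sketch below computes features for one data sequence,
predicts a log(penalty) value, then solves for the corresponding
changepoints. This is a minimal sketch, not part of the original
analysis: the chosen (profile.id, chromosome) is arbitrary
(substitute any sequence present in the data), `fit` is the model
trained in the last cross-validation fold above, and `t.est` is
assumed to hold the segment ends returned by Fpop.

```{r predictNewSequence}
## End-to-end prediction for one data sequence: compute features,
## predict log(penalty), then decode the optimal changepoints.
new.pro <- data.table(neuroblastoma$profiles)[
  profile.id==2 & chromosome==3]
new.feature.mat <- penaltyLearning::featureMatrix(
  new.pro,
  problem.vars=c("profile.id", "chromosome"),
  data.var="logratio")
pred.log.penalty <- as.numeric(predict(fit, new.feature.mat))
fpop.fit <- fpop::Fpop(new.pro$logratio, exp(pred.log.penalty))
fpop.fit$t.est # segment ends; the last value is the number of data points
```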