## Install and attach packages ```{r} if(FALSE){ install.packages(c("data.table", "R.utils", "ggplot2", "torch")) torch::torch_tensor(pi) rmarkdown::render("2023-04-19-deep-learning.Rmd") } library(data.table) library(ggplot2) ``` In R one of the biggest strengths is the variety of packages that are available for different kinds of data analysis. In this tutorial we use packages: * `data.table` for reading data from disk, and converting into a format useful for visualization. * `ggplot2` for visualizing the data and results. * `torch` for machine learning using linear models and neural networks. In R there are two methods for accessing objects which are exported from packages: * double-colon syntax, `pkg::fun` means to get `fun` which is exported from `pkg`, for example `data.table::fread` is a function for reading text data files into R. This is useful for teaching/understanding because it is explicit (easy to see which package each object comes from), so it should be preferred for most use cases. * We can also use `library(package)` to attach all of the exported functions in package. For example after doing `library(data.table)` you can just write `fread` (without `data.table::`). This is useful for convenience (no need to use the double colon syntax), but it hides which package each object comes from (potentially confusing), so it should be used sparingly. ## Download, read, visualize zip code data Download zip code image data set. ```{r} url.vec <- c( zip.gz="https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.train.gz") for(f in names(url.vec)){ u <- url.vec[[f]] if(!file.exists(f)){ download.file(u, f) } } ``` Read data from gzipped plain text (one row per image, columns separated by spaces) into R as data table. ```{r} zip.dt <- fread("zip.gz") ``` Convert data table to two multi-dimensional arrays (one for inputs, one for outputs). ```{r} zip.feature.dt <- zip.dt[,-1] n.features <- ncol(zip.feature.dt) (zip.pixels <- sqrt(n.features)) n.images <- nrow(zip.dt) zip.X.array <- array( unlist(zip.feature.dt), c(n.images, 1, zip.pixels, zip.pixels)) str(zip.X.array) zip.y.array <- array(zip.dt$V1, n.images) str(zip.y.array) table(zip.y.array) ``` Visualize one image. ```{r} image(zip.X.array[1,,,]) ``` Below we convert several digits to long/tall form for display. Note data table syntax below, `by=observation` is like a for loop over observations, and the data tables returned for each value of `observation` are combined into a single result table, `zip.some`, with all of the images. ```{r} (zip.some <- data.table(observation=1:12)[, { data.table( label=zip.y.array[observation], col=rep(1:zip.pixels, zip.pixels), row=rep(1:zip.pixels, each=zip.pixels), intensity=as.numeric(zip.X.array[observation,,,])) }, by=observation]) ``` Display images using a panel/facet for each observation. ```{r} breaks <- c(1, zip.pixels) ggplot()+ geom_tile(aes( x=col, y=row, fill=intensity), data=zip.some)+ facet_wrap(observation + label ~ ., labeller=label_both)+ scale_x_continuous(breaks=breaks)+ scale_y_reverse(breaks=breaks)+ coord_equal()+ theme_bw()+ theme(panel.spacing=grid::unit(0, "lines"))+ scale_fill_gradient(low="black", high="white") ``` ## Convert R data to torch Torch is a machine learning library which is popular for deep learning, because it provides automatic differentiation (also known as auto-grad or automatic gradients), which makes it easy to learn parameters in deep neural networks. * In auto-grad systems, you code the forward calculation (predictions and loss), * using operations for which the gradients of outputs with respect to inputs are known (already coded by the creators of the torch library), * so the auto-grad system can use the chain rule (back-propagation) to derive the gradients (you do not have to explicitly code the gradients, which can be error/bug-prone), * and the gradients are used for learning (take steps in the negative gradient direction to decrease loss). * For more information about auto-grad, with a demo of how to code a simple auto-grad system from scratch in R, and an explanation of how to code interactive visualizations of the learning, see Chapter 18, Neural Networks, in [The animint2 Manual](https://rcdata.nau.edu/genomic-ml/animint2-manual/Ch18-neural-networks.html). The popular python module `torch` uses a C++ library `libtorch` under the hood, which R package `torch` also uses, so it has most of the same functionality/concepts, and similar naming conventions. The main data structure in torch is a tensor, which is analogous to the R multi-dimensional array. Below we convert R arrays to torch tensors. ```{r} (zip.y.tensor <- torch::torch_tensor(zip.y.array+1L, torch::torch_long())) str(zip.y.tensor) typeof(zip.y.array) (zip.X.tensor <- torch::torch_tensor(zip.X.array)) str(zip.X.tensor) typeof(zip.X.array) ``` An important difference between R and torch is that R is very permissive about types (double/numeric can be used for almost anything), whereas torch is very strict (single precision float typically used for inputs, long int used for output). ## Linear model in torch Before explaining deep learning (neural networks), we first explain linear models. To implement a linear model in torch, we first need to flatten each 2d image back to a 1d vector of pixel intensity values, ```{r} flatten <- torch::nn_flatten() (zip.X.flat <- flatten(zip.X.tensor)) str(zip.X.flat) ``` A linear model for classification is defined by two kinds of learned parameters. First, a matrix of real-valued weights, with one row for each output class, and one column for each input feature. Second, an intercept/bias, which is a vector of real numbers, one for each output class. ```{r} (n.classes <- length(unique(zip.y.array))) torch::torch_manual_seed(1)#for reproducibility. linear <- torch::nn_linear(n.features, n.classes) str(linear$parameters) ``` Calling the linear object with the flattened data as input, yields a matrix of predicted scores as output (one row per image, one column per class, larger scores mean more likely to predict that class). ```{r} (zip.linear.pred <- linear(zip.X.flat)) str(zip.linear.pred) ``` Rather than using flatten and linear operations in separate lines of code, below we combine them into a sequential object, ```{r} (sequential <- torch::nn_sequential(flatten, linear)) zip.sequential.pred <- sequential(zip.X.tensor) str(zip.sequential.pred) ``` What are the current predicted classes? We have not yet done any learning, and the initialization is random for the weight/bias parameters, so the initial predictions are very bad, as we see below. ```{r} pred.class.vec <- apply(zip.sequential.pred, 1, which.max) zip.y.int <- as.integer(zip.y.tensor) table(prediction=pred.class.vec, label=zip.y.int) ``` Above is the confusion matrix (predicted classes on rows, true label on columns, entries are counts of images), and below is the error rate, which is the diagonal of the confusion matrix. ```{r} is.error <- pred.class.vec != zip.y.int (percent.error <- 100*mean(is.error)) ``` Above is the linear model prediction error at initialization, below is the baseline featureless model error (always predict most frequent class label), ```{r} (zip.y.tab <- table(zip.y.int)) (most.frequent.count <- zip.y.tab[which.max(zip.y.tab)]) (most.frequent.label <- as.integer(names(most.frequent.count))) 100*mean(zip.y.int!=most.frequent.label) ``` ## Learning in linear model To improve predictions, we need to use a gradient descent learning algorithm. First we compute a differentiable loss (cross entropy loss in this case, multi-class classification), then we compute gradients using the backward method, then we update parameters by taking steps in the negative gradient direction. Below we define the loss and optimizer, ```{r} loss.fun <- torch::nn_cross_entropy_loss() step.size <- 0.1 optimizer <- torch::optim_sgd(sequential$parameters, lr=step.size) ``` Below we compute predictions and loss, ```{r} zip.sequential.pred <- sequential(zip.X.tensor) (loss.tensor <- loss.fun(zip.sequential.pred, zip.y.tensor)) ``` Below we set gradients to zero, then call backward to compute and store gradients of the loss with respect to model parameters. ```{r} optimizer$zero_grad() linear$bias$grad loss.tensor$backward() linear$bias$grad ``` Below we use the gradients to update the model parameters. ```{r} linear$bias optimizer$step() linear$bias ``` And that's the basic idea of gradient descent learning! Just keep updating the model parameters until you get good predictions! ## Cross-validation Keep in mind our goal is generalization, meaning we want to get good predictions on new data/images that the algorithm was not trained on. We therefore need to use cross-validation for two purposes: * testing: hold out some samples as a test set, to evaluate predictions after all parameters have been learned on the other/train data. * training: split the train set into subtrain and validation sets. Subtrain set is used to compute gradients and update model parameters (weights/bias), and validation set is used to learn model complexity hyper-parameters (step size, number of steps/epochs, etc). First we need to split the whole data set into train and test, using cross-validation. ```{r} cv_list <- function(index.vec, keep, hold.out, n.folds=3){ uniq.folds <- 1:n.folds fold.vec <- sample(rep(uniq.folds, l=length(index.vec))) out.list <- list() for(fold in uniq.folds){ is.held.out <- fold.vec==fold fold.list <- list() fold.list[[hold.out]] <- index.vec[is.held.out] fold.list[[keep]] <- index.vec[!is.held.out] out.list[[paste(hold.out,"fold",fold)]] <- fold.list } out.list } set.seed(1) train.test.cv.list <- cv_list(1:n.images, "train", "test") str(train.test.cv.list) ``` Above we used K-fold cross-validation to create several train/test splits, represented in R as a list of lists of indices. Below we split the train set into subtrain and validation sets. ```{r} test.fold <- 1 train.test.index.list <- train.test.cv.list[[test.fold]] sub.val.cv.list <- cv_list( train.test.index.list$train, "subtrain", "validation") str(sub.val.cv.list) ``` Below we fix one subtrain/validation split, ```{r} validation.fold <- 1 sub.val.index.list <- sub.val.cv.list[[validation.fold]] ``` We compute the average loss and gradient over the entire subtrain set below, ```{r} get_pred_loss <- function(model, index.vec){ X.subset <- zip.X.tensor[index.vec,,,,drop=FALSE] y.subset <- zip.y.tensor[index.vec] pred.subset <- model(X.subset) list(pred=pred.subset, loss=loss.fun(pred.subset, y.subset)) } optimizer$zero_grad() (subtrain.loss <- get_pred_loss(sequential, sub.val.index.list$subtrain)$loss) subtrain.loss$backward() optimizer$step() ``` Above is the mean loss over the subtrain set, which can be used for gradients (deterministic optimization), or for monitoring convergence (subtrain loss should decrease with each step/epoch, if step size is chosen appropriately). For monitoring convergence and avoiding overfitting, we would like to compute the loss and error rate, as in the function below. ```{r} loss_one_set <- function(set.indices, model){ torch::with_no_grad({#for efficiency, gradients not necessary. set.label.vec <- zip.y.tensor[set.indices] set.pred.loss <- get_pred_loss(model, set.indices) set.pred.class <- apply(set.pred.loss$pred, 1, function(prob.vec){ if(any(is.na(prob.vec)))NA_integer_ else which.max(prob.vec) }) error.rate <- if(any(is.na(set.pred.class))){ NA_real_ }else{ mean(as.numeric(set.label.vec != set.pred.class)) } rbind( data.table(variable="loss",value=as.numeric(set.pred.loss$loss)), data.table(variable="error.percent",value=100*error.rate)) }) } loss_one_set(sub.val.index.list$subtrain, sequential) ``` To compute loss for each set, we use the function below, ```{r} loss_each_set <- function(list.of.indices, model){ data.table(set=names(list.of.indices))[, { loss_one_set(list.of.indices[[set]], model) }, by=set] } loss_each_set(sub.val.index.list, sequential) ``` ## Batching In previous sections we have computed loss and gradients over all the observations/images in the subtrain set (deterministic optimization). For learning it is more common to compute gradients with respect to a random sample of observations/images (stochastic optimization), as below, ```{r} get_batch_list <- function(index.vec, batch_size=200){ n_data <- length(index.vec) n_batches <- ceiling(n_data / batch_size) batch.vec <- sample(rep(1:n_batches,each=batch_size)[1:n_data]) split(index.vec, batch.vec) } big.batches <- get_batch_list(sub.val.index.list$subtrain, 1000) str(big.batches) ``` After creating the batches above, we use a for loop below to compute the loss and gradient for each batch. ```{r} for(batch.number in seq_along(big.batches)){ batch.indices <- big.batches[[batch.number]] batch.loss <- get_pred_loss(sequential, batch.indices)$loss print(batch.loss) optimizer$zero_grad() batch.loss$backward() optimizer$step() } ``` We use the function below to implement the same logic as above (take a step for each subtrain batch). It modifies the parameters of the input `model` in-place. ```{r} take_steps <- function(subtrain.indices, model){ optimizer <- torch::optim_sgd(model$parameters, lr=step.size) batch.list <- get_batch_list(subtrain.indices) for(batch.number in seq_along(batch.list)){ batch.indices <- batch.list[[batch.number]] batch.loss <- get_pred_loss(model, batch.indices)$loss optimizer$zero_grad() batch.loss$backward() optimizer$step() } } ``` ## Gradient descent learning loop over epochs Below we can do a gradient descent learning for loop over epochs, while monitoring the validation loss, to learn how many epochs results in the best predictions (smallest loss/error). ```{r} max_epochs <- 100 torch::torch_manual_seed(1)#control weight/bias initialization. set.seed(1)#control random batching. linear.model <- torch::nn_sequential( torch::nn_flatten(), torch::nn_linear(n.features, n.classes)) loss.dt.list <- list() for(epoch in seq(0, max_epochs)){ epoch.loss.dt <- loss_each_set(sub.val.index.list, linear.model) loss.dt.list[[paste(epoch)]] <- data.table(epoch, epoch.loss.dt) take_steps(sub.val.index.list$subtrain, linear.model) } (loss.dt <- rbindlist(loss.dt.list)) ``` The code above first initializes a new linear model, then for each epoch, it computes the loss on each set, and takes steps using gradients computed on batches from the subtrain set. The loss/error values at each epoch are combined into the data table displayed above, and plotted below. ```{r} plot_loss <- function(loss.dt){ (min.dt <- loss.dt[, .SD[which.min(value)], by=.(set, variable)]) gg <- ggplot()+ ggtitle(paste("Linear model, validation fold", validation.fold))+ facet_grid(variable ~ ., scales="free")+ theme_bw()+ theme(panel.spacing=grid::unit(0, "lines"))+ geom_line(aes( epoch, value, color=set), data=loss.dt)+ geom_point(aes( epoch, value, shape=point), data=data.table(point="min", min.dt))+ scale_y_continuous("loss/error value (lower for better predictions)")+ scale_x_continuous( "epoch (gradient descent passes through subtrain data)", limits=c(0, max_epochs*1.2), breaks=seq(0, max_epochs, by=10)) directlabels::direct.label(gg, "right.polygons") } plot_loss(loss.dt) ``` Since the loss/error values are so large when the number of epochs of learning is so small, it makes the max on the y axis too large to see details for larger epoch numbers, so we exclude the first ten epochs of data in the plot below. ```{r} plot_loss(loss.dt[epoch>=10]) ``` It is clear from the plot above that * the subtrain loss/error consistently decrease as the number of learning epochs increases, which means that the step size has been chosen appropriately (see interactive figure about gradient descent for regression for examples of what it looks like when step size is too small/large), * the validation loss/error has the characteristic U shape of a regularization paramaeter, meaning that we should use early stopping regularization, at about 50-60 epochs. ## Averaging over several subtrain/validation splits Instead of trusting a single random subtrain/validation split to determine the best number of epochs (sensitive to random split), we typically use K-fold cross-validation, and average the loss/error curves over several validation sets (less sensitive, more stable). ```{r} gradient_descent <- function(index.list, model, n_epochs, gradient.set){ loss.dt.list <- list() for(epoch in seq(1, n_epochs)){ take_steps(index.list[[gradient.set]], model) epoch.loss.dt <- loss_each_set(index.list, model) loss.dt.list[[paste(epoch)]] <- data.table(epoch, epoch.loss.dt) } rbindlist(loss.dt.list) } new_linear_model <- function(){ torch::nn_sequential( torch::nn_flatten(), torch::nn_linear(n.features, n.classes)) } torch::torch_manual_seed(1)#control weight/bias initialization. set.seed(1)#control random batching. fold.loss.dt.list <- list() for(validation.fold in seq_along(sub.val.cv.list)){ fold.model <- new_linear_model() one.index.list <- sub.val.cv.list[[validation.fold]] fold.result <- gradient_descent( one.index.list, fold.model, max_epochs, "subtrain") fold.loss.dt.list[[validation.fold]] <- data.table( validation.fold, fold.result) } (fold.loss.dt <- rbindlist(fold.loss.dt.list)) ``` The result above is a table of loss/error for each epoch and validation fold. It can be visualized as three separate curves, as below. ```{r} gg <- ggplot()+ ggtitle("Linear model")+ facet_grid(variable ~ validation.fold, scales="free", labeller=label_both)+ theme_bw()+ theme(panel.spacing=grid::unit(0, "lines"))+ geom_line(aes( epoch, value, color=set), data=fold.loss.dt[epoch>=10])+ scale_y_continuous("loss/error value (lower for better predictions)")+ scale_x_continuous( "epoch (gradient descent passes through subtrain data)", limits=c(0, max_epochs*1.5), breaks=seq(0, max_epochs, by=10)) directlabels::direct.label(gg, "right.polygons") ``` Typically to select the best number of epochs, we first compute the mean validation loss curves (over the three folds), as below. ```{r} (mean.curves.dt <- fold.loss.dt[, .( mean=mean(value) ), by=.(variable, set, epoch)]) ``` Then we take the min of the validation loss curve, as below. ```{r} (selected.dt <- mean.curves.dt[set=="validation"&variable=="loss"][which.min(mean)]) gg <- ggplot()+ ggtitle("Linear model, mean curves over subtrain/validation splits")+ facet_grid(variable ~ ., scales="free")+ theme_bw()+ theme(panel.spacing=grid::unit(0, "lines"))+ geom_line(aes( epoch, mean, color=set), data=mean.curves.dt[epoch>=10])+ geom_point(aes( epoch, mean), data=selected.dt)+ scale_y_continuous("loss/error value (lower for better predictions)")+ scale_x_continuous( "epoch (gradient descent passes through subtrain data)", limits=c(0, max_epochs*1.5), breaks=seq(0, max_epochs, by=10)) directlabels::direct.label(gg, "right.polygons") ``` Finally, after having learned the best number of epochs, we re-initialize the model, and run gradient descent on the train set (not subtrain), using that number of epochs. ```{r} final.model <- new_linear_model() final.loss <- gradient_descent( train.test.index.list, final.model, selected.dt$epoch, "train") final.loss[epoch==max(epoch)] ``` ## Neural networks Everything we have done above for the linear model generalizes easily to neural networks; the only change is in the definition of the model. For example, a fully connected (dense weight matrices) neural network with several hidden layers, and the same number of hidden units per layer, can be defined as below. ```{r} n.hidden.units <- 100 new_fully_connected_units <- function(units.per.layer){ seq.args <- list(torch::nn_flatten()) for(output.i in seq(2, length(units.per.layer))){ input.i <- output.i-1 seq.args[[length(seq.args)+1]] <- torch::nn_linear( units.per.layer[[input.i]], units.per.layer[[output.i]]) if(output.i