Install and attach packages

if(FALSE){#not run when rendering; run these lines interactively once for setup.
  install.packages(c("data.table", "R.utils", "ggplot2", "torch"))
  torch::torch_tensor(pi)#checks that the libtorch backend is installed.
  rmarkdown::render("2023-04-19-deep-learning.Rmd")
}
library(data.table)
library(ggplot2)

One of R's biggest strengths is the variety of packages available for different kinds of data analysis. In this tutorial we use data.table (for reading and manipulating data, with R.utils needed to read gzipped files), ggplot2 (for visualization), and torch (for machine learning).

In R there are two methods for accessing objects which are exported from packages: attaching the package with library(), after which exported objects can be used directly (as we do with data.table and ggplot2), or prefixing the object with the package name and two colons, which works without attaching (as we do with torch).
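
For example, both lines below create the same kind of object; the first works because we attached data.table with library() above, whereas the second uses the double-colon syntax and would work even without attaching the package.

dt.from.attached <- data.table(x=1)#works because library(data.table) was called above.
dt.from.namespace <- data.table::data.table(x=1)#works without attaching the package.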

Download, read, visualize zip code data

Download zip code image data set.

url.vec <- c(
  zip.gz="https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.train.gz")
for(f in names(url.vec)){
  u <- url.vec[[f]]
  if(!file.exists(f)){
    download.file(u, f)
  }
}

Read data from gzipped plain text (one row per image, columns separated by spaces) into R as a data table.

zip.dt <- fread("zip.gz")

Convert data table to two multi-dimensional arrays (one for inputs, one for outputs).

zip.feature.dt <- zip.dt[,-1]
n.features <- ncol(zip.feature.dt)
(zip.pixels <- sqrt(n.features))
## [1] 16
n.images <- nrow(zip.dt)
zip.X.array <- array(
  unlist(zip.feature.dt),
  c(n.images, 1, zip.pixels, zip.pixels))
str(zip.X.array)
##  num [1:7291, 1, 1:16, 1:16] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
zip.y.array <- array(zip.dt$V1, n.images)
str(zip.y.array)
##  num [1:7291(1d)] 6 5 4 7 3 6 3 1 0 1 ...
table(zip.y.array)
## zip.y.array
##    0    1    2    3    4    5    6    7    8    9 
## 1194 1005  731  658  652  556  664  645  542  644

Visualize one image.

image(zip.X.array[1,,,])

Below we convert several digits to long/tall form for display. Note the data table syntax below: by=observation acts like a for loop over observations, and the data tables returned for each value of observation are combined into a single result table, zip.some, which contains all of the images.

(zip.some <- data.table(observation=1:12)[, {
  data.table(
    label=zip.y.array[observation],
    col=rep(1:zip.pixels, zip.pixels),
    row=rep(1:zip.pixels, each=zip.pixels),
    intensity=as.numeric(zip.X.array[observation,,,]))
}, by=observation])
##       observation label col row intensity
##    1:           1     6   1   1        -1
##    2:           1     6   2   1        -1
##    3:           1     6   3   1        -1
##    4:           1     6   4   1        -1
##    5:           1     6   5   1        -1
##   ---                                    
## 3068:          12     0  12  16        -1
## 3069:          12     0  13  16        -1
## 3070:          12     0  14  16        -1
## 3071:          12     0  15  16        -1
## 3072:          12     0  16  16        -1

Display images using a panel/facet for each observation.

breaks <- c(1, zip.pixels)
ggplot()+
  geom_tile(aes(
    x=col, y=row, fill=intensity),
    data=zip.some)+
  facet_wrap(observation + label ~ ., labeller=label_both)+
  scale_x_continuous(breaks=breaks)+
  scale_y_reverse(breaks=breaks)+
  coord_equal()+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  scale_fill_gradient(low="black", high="white")

Convert R data to torch

Torch is a machine learning library which is popular for deep learning, because it provides automatic differentiation (also known as auto-grad or automatic gradients), which makes it easy to learn parameters in deep neural networks.
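
For example, below is a minimal autograd sketch: we compute y = x^2 at x = 3, then ask torch for the gradient, dy/dx = 2x = 6.

x <- torch::torch_tensor(3, requires_grad=TRUE)
y <- x^2
y$backward()
x$grad#should be 6.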

The popular Python module torch is built on a C++ library, libtorch, which the R package torch also uses under the hood, so the two have most of the same functionality/concepts, and similar naming conventions.

The main data structure in torch is a tensor, which is analogous to the R multi-dimensional array. Below we convert R arrays to torch tensors.

(zip.y.tensor <- torch::torch_tensor(zip.y.array+1L, torch::torch_long()))
## torch_tensor
##   7
##   6
##   5
##   8
##   4
##   7
##   4
##   2
##   1
##   2
##   8
##   1
##   2
##   2
##   8
##   8
##   5
##   9
##   1
##   2
##   5
##   9
##   8
##   5
##   9
##   8
##   4
##   8
##   5
##   2
## ... [the output was truncated (use n=-1 to disable)]
## [ CPULongType{7291} ]
str(zip.y.tensor)
## Long [1:7291]
typeof(zip.y.array)
## [1] "double"
(zip.X.tensor <- torch::torch_tensor(zip.X.array))
## torch_tensor
## (1,1,.,.) = 
##  Columns 1 to 9 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.7970  0.2780
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.2570  0.9090  1.0000
##  -1.0000 -1.0000 -1.0000 -1.0000 -0.9380  0.1000  0.9500  1.0000  0.8770
##  -1.0000 -1.0000 -1.0000 -0.6830  0.5400  1.0000  1.0000  0.3000 -0.8240
##  -1.0000 -0.9920 -0.4100  0.8250  1.0000  0.9220 -0.1620 -0.9610 -1.0000
##  -0.6310  0.2970  1.0000  1.0000  0.7780 -0.4390 -1.0000 -1.0000 -0.9050
##   0.8620  1.0000  0.9860  0.5620 -0.7150 -1.0000 -1.0000 -1.0000  0.1450
##  -0.1670  0.3070 -0.5650 -1.0000 -1.0000 -1.0000 -1.0000 -0.5500  0.9770
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.9870  0.4850  1.0000
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.7140  0.9960  1.0000
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.8320  0.8670  1.0000
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000  0.0920  0.9900
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.7450
##  -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## 
## Columns 10 to 16 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
##  -0.9500 -0.6300 -0.6770 -0.9030 -1.0000 -1.0000 -1.0000
##   0.8470  1.0000  1.0000  0.7920 -0.4520 -1.0000 -1.0000
##   1.0000  1.0000  1.0000  1.0000  0.8280 -0.4830 -1.0000
##   0.3270  0.0680  0.7530  1.0000  1.0000  0.8130 -0.9740
##  -1.0000 -0.9250  0.3410  1.0000  1.0000  1.0000 -0.4290
##  -1.0000  0.1130  1.0000  1.0000  1.0000  1.0000  0.3040
##   0.3550  0.9600  0.7070  0.5360  1.0000  1.0000  0.8230
##   1.0000  0.3080 -0.9420  0.1840  1.0000  1.0000  1.0000
##   0.6550 -0.8840 -1.0000  0.8120  1.0000  1.0000  0.4820
##  -0.1090 -1.0000 -1.0000  0.8370  1.0000  1.0000 -0.4740
##  -0.1850 -0.0750  0.5450  0.9780  1.0000  0.2190 -0.9910
## ... [the output was truncated (use n=-1 to disable)]
## [ CPUFloatType{7291,1,16,16} ]
str(zip.X.tensor)
## Float [1:7291, 1:1, 1:16, 1:16]
typeof(zip.X.array)
## [1] "double"

An important difference between R and torch is that R is very permissive about types (double/numeric can be used for almost anything), whereas torch is very strict (single precision float is typically used for inputs, and long integer for output class labels).
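
For example, the minimal sketch below (using the tensor dtype field) shows that R doubles become single precision float tensors by default, whereas a long integer tensor must be requested explicitly, as we did for the class labels above.

torch::torch_tensor(c(1.5, 2.5))$dtype#Float by default.
torch::torch_tensor(c(1, 2), torch::torch_long())$dtype#Long, requested explicitly.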

Linear model in torch

Before explaining deep learning (neural networks), we first explain linear models. To implement a linear model in torch, we first need to flatten each 2d image back to a 1d vector of pixel intensity values,

flatten <- torch::nn_flatten()
(zip.X.flat <- flatten(zip.X.tensor))
## torch_tensor
## Columns 1 to 10-1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.9600
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.9220
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -0.9960 -0.5530 -0.9350 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.6010 -0.8260 -0.9020
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -0.6250  0.6200 -0.4540
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000
## ... [the output was truncated (use n=-1 to disable)]
## [ CPUFloatType{7291,256} ]
str(zip.X.flat)
## Float [1:7291, 1:256]

A linear model for classification is defined by two kinds of learned parameters. First, a matrix of real-valued weights, with one row for each output class, and one column for each input feature. Second, an intercept/bias, which is a vector of real numbers, one for each output class.

(n.classes <- length(unique(zip.y.array)))
## [1] 10
torch::torch_manual_seed(1)#for reproducibility.
linear <- torch::nn_linear(n.features, n.classes)
str(linear$parameters)
## List of 2
##  $ weight:Float [1:10, 1:256]
##  $ bias  :Float [1:10]
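
The predicted scores are computed as the matrix product of the inputs with the transposed weight matrix, plus the bias. Below is a minimal sketch verifying that (assuming torch_matmul and the tensor t() transpose method); it should give the same values as calling the linear object, as we do below.

manual.pred <- torch::torch_matmul(
  zip.X.flat, linear$weight$t()) + linear$bias
str(manual.pred)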

Calling the linear object with the flattened data as input yields a matrix of predicted scores as output (one row per image, one column per class; a larger score means the model is more likely to predict that class).

(zip.linear.pred <- linear(zip.X.flat))
## torch_tensor
## Columns 1 to 6 8.3722e-02  4.0556e-01 -2.1138e-01 -1.4984e-01 -8.6050e-01  8.7388e-02
##  1.5125e-01  3.1454e-01  7.8810e-01 -7.8996e-01  2.6592e-01  1.1697e-01
## -5.5851e-01 -9.3562e-02 -4.3682e-01  4.9550e-01  2.0649e-01 -2.1504e-01
## -1.0399e+00  8.6033e-01 -6.8142e-02  2.4952e-01  6.5793e-02  1.0810e-01
##  8.1950e-02  1.9625e-01  6.1302e-01 -3.7451e-02  1.5670e-01 -1.7778e-01
## -8.1263e-01  1.4397e-01  8.8294e-02 -1.7617e-01  3.0649e-01  5.3261e-01
## -4.3530e-01  2.6380e-03  1.5146e-01  9.2552e-02 -3.7762e-01  2.6511e-01
## -3.2516e-01  5.2262e-01  1.0229e-02  8.4166e-02 -5.1177e-01 -4.0212e-01
##  2.8215e-01  9.8304e-02  4.7157e-01  1.8188e-01 -5.1238e-01  2.2311e-02
## -1.8938e-01  6.6965e-01 -7.9038e-02  1.8598e-01 -5.6698e-01 -4.3820e-01
##  1.6045e-01  1.5566e-01  9.5257e-02  6.5550e-01  5.5383e-01  9.3469e-01
## -4.3449e-01  1.6752e-01  2.1992e-01 -1.6706e-01 -8.0140e-02  1.5735e-01
## -2.1069e-01  5.1881e-01  9.3807e-02  2.1360e-01 -4.7565e-01 -6.1004e-01
## -2.2263e-01  5.3057e-01  7.2251e-02  2.2420e-01 -5.2688e-01 -5.7601e-01
## -8.1430e-01  6.1648e-01 -7.8843e-02  3.6643e-01  1.0322e-01  5.3075e-01
## -7.0662e-01  5.7915e-02  8.3268e-02 -1.1987e-01  4.4936e-01 -9.4456e-03
## -1.3251e-01  2.2643e-01 -2.5267e-01  6.8703e-01 -1.1746e-01 -4.9890e-01
##  1.5346e-01  6.5914e-01  2.3883e-01  1.1236e+00 -1.2175e-01  2.3762e-01
##  2.5917e-01  2.5903e-01 -4.0113e-01  4.0930e-01  7.9426e-02  2.2673e-01
## -2.2484e-01  4.4847e-01  6.4907e-01  4.0093e-01 -1.6742e-01  1.1783e-01
## -1.5034e-01  1.6825e-01  3.4139e-01  2.9184e-01 -1.1814e+00 -1.5261e-01
##  1.4874e-01  7.8191e-01  3.5824e-01  7.3465e-01 -4.4917e-01 -5.9663e-02
## -1.4576e+00  1.0044e+00  1.5806e-01  1.5137e-01  1.3848e-01  3.1238e-01
## -1.1245e+00 -1.1530e-01 -5.5778e-01  8.9810e-03 -2.4352e-01 -6.6138e-01
##  5.3762e-01 -4.9214e-01  5.9736e-01 -1.1622e-01  9.6620e-01 -9.5962e-02
## -1.1260e+00  3.1140e-01 -2.6743e-01  1.9571e-03  5.1711e-01 -2.4674e-02
## -1.7920e-01  7.9927e-01  6.7342e-01  4.3888e-02 -5.6315e-01  2.6405e-01
## -9.5916e-01  4.4528e-01 -4.3726e-01  1.2315e-01  3.3845e-01  3.6729e-01
## -4.9499e-01  2.8039e-01 -3.3418e-01  8.1388e-02  6.0696e-01 -3.5467e-01
## -2.8936e-01  7.5328e-01  7.9180e-02  3.3542e-01 -6.8469e-01 -4.2743e-01
## ... [the output was truncated (use n=-1 to disable)]
## [ CPUFloatType{7291,10} ][ grad_fn = <AddmmBackward0> ]
str(zip.linear.pred)
## Float [1:7291, 1:10]
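
The scores can be mapped to class probabilities using the softmax function; below is a minimal sketch (using torch::nnf_softmax), in which each row of probabilities sums to one. Note that the cross entropy loss used for learning below applies this softmax internally, so the model itself only needs to output scores.

zip.linear.prob <- torch::nnf_softmax(zip.linear.pred, dim=2)
str(zip.linear.prob)
torch::torch_sum(zip.linear.prob, dim=2)[1:5]#each row sums to one.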

Rather than using flatten and linear operations in separate lines of code, below we combine them into a sequential object,

(sequential <- torch::nn_sequential(flatten, linear))
## An `nn_module` containing 2,570 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_flatten> #0 parameters
## • 1: <nn_linear> #2,570 parameters
zip.sequential.pred <- sequential(zip.X.tensor)
str(zip.sequential.pred)
## Float [1:7291, 1:10]

What are the current predicted classes? We have not yet done any learning, and the initialization is random for the weight/bias parameters, so the initial predictions are very bad, as we see below.

pred.class.vec <- apply(zip.sequential.pred, 1, which.max)
zip.y.int <- as.integer(zip.y.tensor)
table(prediction=pred.class.vec, label=zip.y.int)
##           label
## prediction   1   2   3   4   5   6   7   8   9  10
##         1  203   2  32   7   3  18   7   0   7   0
##         2   30 149  63  62  23  72  47  40  21  61
##         3  106  39 165 146  41 113 185  10 100  12
##         4   25   9  43  70  77  48   4  34  20  45
##         5    4   0  60   7  11   8   0   3  36   5
##         6  141   0  34   6  12  79  59   2  28  15
##         7  130   0   7   6   0   2  14   0   1   0
##         8  449   0 123 283 347  99  98 258 247 239
##         9   44   0  53  34  20  38  19   0  20   0
##         10  62 806 151  37 118  79 231 298  62 267

Above is the confusion matrix (predicted classes on rows, true labels on columns, entries are counts of images), and below is the error rate, which is the percent of images for which the predicted class differs from the label (the off-diagonal entries of the confusion matrix).

is.error <- pred.class.vec != zip.y.int
(percent.error <- 100*mean(is.error))
## [1] 83.04759

Above is the linear model prediction error at initialization, below is the baseline featureless model error (always predict most frequent class label),

(zip.y.tab <- table(zip.y.int))
## zip.y.int
##    1    2    3    4    5    6    7    8    9   10 
## 1194 1005  731  658  652  556  664  645  542  644
(most.frequent.count <- zip.y.tab[which.max(zip.y.tab)])
##    1 
## 1194
(most.frequent.label <- as.integer(names(most.frequent.count)))
## [1] 1
100*mean(zip.y.int!=most.frequent.label)
## [1] 83.62365

Learning in linear model

To improve predictions, we need to use a gradient descent learning algorithm. First we compute a differentiable loss (the cross entropy loss in this case, since this is multi-class classification), then we compute gradients using the backward method, then we update the parameters by taking a step in the negative gradient direction. Below we define the loss and optimizer,

loss.fun <- torch::nn_cross_entropy_loss()
step.size <- 0.1
optimizer <- torch::optim_sgd(sequential$parameters, lr=step.size)

Below we compute predictions and loss,

zip.sequential.pred <- sequential(zip.X.tensor)
(loss.tensor <- loss.fun(zip.sequential.pred, zip.y.tensor))
## torch_tensor
## 2.18719
## [ CPUFloatType{} ][ grad_fn = <NllLossBackward0> ]

Below we set gradients to zero, then call backward to compute and store gradients of the loss with respect to model parameters.

optimizer$zero_grad()
linear$bias$grad
## torch_tensor
## [ Tensor (undefined) ]
loss.tensor$backward()
linear$bias$grad
## torch_tensor
## 0.01 *
## -8.9679
## -2.0456
##  1.0085
##  1.4969
## -1.2661
##  1.5369
## -1.9076
##  4.6543
##  1.2375
##  4.2530
## [ CPUFloatType{10} ]

Below we use the gradients to update the model parameters.

linear$bias
## torch_tensor
## 0.01 *
##  4.7054
## -3.1756
## -0.6997
## -3.6115
## -4.5345
## -4.8868
## -4.4242
## -3.7722
##  2.0563
## -0.8581
## [ CPUFloatType{10} ][ requires_grad = TRUE ]
optimizer$step()
## NULL
linear$bias
## torch_tensor
## 0.01 *
##  5.6022
## -2.9711
## -0.8005
## -3.7611
## -4.4079
## -5.0405
## -4.2334
## -4.2376
##  1.9325
## -1.2834
## [ CPUFloatType{10} ][ requires_grad = TRUE ]
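
The update rule is new parameter = old parameter - step size * gradient. Using the first bias element and its gradient printed above, we can check that by hand.

0.047054 - 0.1 * (-0.089679)#bias and gradient values copied from the printouts above.
#result is about 0.056022, matching the first element of the updated bias.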

And that’s the basic idea of gradient descent learning! Just keep updating the model parameters until you get good predictions!
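
For example, below is a minimal sketch that takes a few more gradient steps on the full data set, using the objects defined above; the printed loss should keep decreasing if the step size is appropriate.

for(step.i in 1:5){
  optimizer$zero_grad()
  step.loss <- loss.fun(sequential(zip.X.tensor), zip.y.tensor)
  step.loss$backward()
  optimizer$step()
  cat(sprintf("step=%d loss=%.4f\n", step.i, as.numeric(step.loss)))
}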

Cross-validation

Keep in mind our goal is generalization, meaning we want to get good predictions on new data/images that the algorithm was not trained on. We therefore need to use cross-validation for two purposes: to split the whole data set into train and test sets (the test set is only used to evaluate the learned model), and to split the train set into subtrain and validation sets (the validation set is used to select the number of epochs).

First we need to split the whole data set into train and test, using cross-validation.

cv_list <- function(index.vec, keep, hold.out, n.folds=3){
  uniq.folds <- 1:n.folds
  fold.vec <- sample(rep(uniq.folds, l=length(index.vec)))
  out.list <- list()
  for(fold in uniq.folds){
    is.held.out <- fold.vec==fold
    fold.list <- list()
    fold.list[[hold.out]] <- index.vec[is.held.out]
    fold.list[[keep]] <- index.vec[!is.held.out]
    out.list[[paste(hold.out,"fold",fold)]] <- fold.list
  }
  out.list
}
set.seed(1)
train.test.cv.list <- cv_list(1:n.images, "train", "test")
str(train.test.cv.list)
## List of 3
##  $ test fold 1:List of 2
##   ..$ test : int [1:2431] 4 6 7 11 15 18 21 22 27 28 ...
##   ..$ train: int [1:4860] 1 2 3 5 8 9 10 12 13 14 ...
##  $ test fold 2:List of 2
##   ..$ test : int [1:2430] 2 3 14 17 19 24 25 31 32 34 ...
##   ..$ train: int [1:4861] 1 4 5 6 7 8 9 10 11 12 ...
##  $ test fold 3:List of 2
##   ..$ test : int [1:2430] 1 5 8 9 10 12 13 16 20 23 ...
##   ..$ train: int [1:4861] 2 3 4 6 7 11 14 15 17 18 ...

Above we used K-fold cross-validation to create several train/test splits, represented in R as a list of lists of indices. Below we split the train set into subtrain and validation sets.

test.fold <- 1
train.test.index.list <- train.test.cv.list[[test.fold]]
sub.val.cv.list <- cv_list(
  train.test.index.list$train, "subtrain", "validation")
str(sub.val.cv.list)
## List of 3
##  $ validation fold 1:List of 2
##   ..$ validation: int [1:1620] 8 9 13 16 17 19 35 41 42 45 ...
##   ..$ subtrain  : int [1:3240] 1 2 3 5 10 12 14 20 23 24 ...
##  $ validation fold 2:List of 2
##   ..$ validation: int [1:1620] 5 12 14 20 24 26 31 32 36 47 ...
##   ..$ subtrain  : int [1:3240] 1 2 3 8 9 10 13 16 17 19 ...
##  $ validation fold 3:List of 2
##   ..$ validation: int [1:1620] 1 2 3 10 23 25 34 37 38 39 ...
##   ..$ subtrain  : int [1:3240] 5 8 9 12 13 14 16 17 19 20 ...

Below we fix one subtrain/validation split,

validation.fold <- 1
sub.val.index.list <- sub.val.cv.list[[validation.fold]]

We compute the average loss and gradient over the entire subtrain set below,

get_pred_loss <- function(model, index.vec){
  X.subset <- zip.X.tensor[index.vec,,,,drop=FALSE]
  y.subset <- zip.y.tensor[index.vec]
  pred.subset <- model(X.subset)
  list(pred=pred.subset, loss=loss.fun(pred.subset, y.subset))
}
optimizer$zero_grad()
(subtrain.loss <- get_pred_loss(sequential, sub.val.index.list$subtrain)$loss)
## torch_tensor
## 1.81546
## [ CPUFloatType{} ][ grad_fn = <NllLossBackward0> ]
subtrain.loss$backward()
optimizer$step()
## NULL

Above is the mean loss over the subtrain set, which can be used for gradients (deterministic optimization), or for monitoring convergence (subtrain loss should decrease with each step/epoch, if step size is chosen appropriately). For monitoring convergence and avoiding overfitting, we would like to compute the loss and error rate, as in the function below.

loss_one_set <- function(set.indices, model){
  torch::with_no_grad({#for efficiency, gradients not necessary.
    set.label.vec <- zip.y.tensor[set.indices]
    set.pred.loss <- get_pred_loss(model, set.indices)
    set.pred.class <- apply(set.pred.loss$pred, 1, function(prob.vec){
      if(any(is.na(prob.vec)))NA_integer_ else which.max(prob.vec)
    })
    error.rate <- if(any(is.na(set.pred.class))){
      NA_real_
    }else{
      mean(as.numeric(set.label.vec != set.pred.class))
    }
    rbind(
      data.table(variable="loss",value=as.numeric(set.pred.loss$loss)),
      data.table(variable="error.percent",value=100*error.rate))
  })
}
loss_one_set(sub.val.index.list$subtrain, sequential)
##         variable     value
## 1:          loss  1.575817
## 2: error.percent 40.771605

To compute loss for each set, we use the function below,

loss_each_set <- function(list.of.indices, model){
  data.table(set=names(list.of.indices))[, {
    loss_one_set(list.of.indices[[set]], model)
  }, by=set]
}
loss_each_set(sub.val.index.list, sequential)
##           set      variable     value
## 1: validation          loss  1.561952
## 2: validation error.percent 39.629630
## 3:   subtrain          loss  1.575817
## 4:   subtrain error.percent 40.771605

Batching

In previous sections we have computed loss and gradients over all the observations/images in the subtrain set (deterministic optimization). For learning it is more common to compute gradients with respect to a random sample of observations/images (stochastic optimization), as below,

get_batch_list <- function(index.vec, batch_size=200){
  n_data <- length(index.vec)
  n_batches <- ceiling(n_data / batch_size)
  batch.vec <- sample(rep(1:n_batches,each=batch_size)[1:n_data])
  split(index.vec, batch.vec)
}
big.batches <- get_batch_list(sub.val.index.list$subtrain, 1000)
str(big.batches)
## List of 4
##  $ 1: int [1:1000] 2 10 12 23 25 34 46 47 60 74 ...
##  $ 2: int [1:1000] 14 26 32 36 37 55 64 69 70 82 ...
##  $ 3: int [1:1000] 1 5 20 24 38 39 49 56 57 62 ...
##  $ 4: int [1:240] 3 31 80 86 160 201 228 265 317 361 ...

After creating the batches above, we use a for loop below to compute the loss and gradient for each batch.

for(batch.number in seq_along(big.batches)){
  batch.indices <- big.batches[[batch.number]]
  batch.loss <- get_pred_loss(sequential, batch.indices)$loss
  print(batch.loss)
  optimizer$zero_grad()
  batch.loss$backward()
  optimizer$step()
}
## torch_tensor
## 1.56357
## [ CPUFloatType{} ][ grad_fn = <NllLossBackward0> ]
## torch_tensor
## 1.4151
## [ CPUFloatType{} ][ grad_fn = <NllLossBackward0> ]
## torch_tensor
## 1.26867
## [ CPUFloatType{} ][ grad_fn = <NllLossBackward0> ]
## torch_tensor
## 1.23161
## [ CPUFloatType{} ][ grad_fn = <NllLossBackward0> ]

We use the function below to implement the same logic as above (take a step for each subtrain batch). It modifies the parameters of the input model in-place.

take_steps <- function(subtrain.indices, model){
  optimizer <- torch::optim_sgd(model$parameters, lr=step.size)
  batch.list <- get_batch_list(subtrain.indices)
  for(batch.number in seq_along(batch.list)){
    batch.indices <- batch.list[[batch.number]]
    batch.loss <- get_pred_loss(model, batch.indices)$loss
    optimizer$zero_grad()
    batch.loss$backward()
    optimizer$step()
  }
}

Gradient descent learning loop over epochs

Below we run a gradient descent learning for loop over epochs, while monitoring the validation loss, to learn the number of epochs that results in the best predictions (smallest loss/error).

max_epochs <- 100
torch::torch_manual_seed(1)#control weight/bias initialization.
set.seed(1)#control random batching.
linear.model <- torch::nn_sequential(
  torch::nn_flatten(),
  torch::nn_linear(n.features, n.classes))
loss.dt.list <- list()
for(epoch in seq(0, max_epochs)){
  epoch.loss.dt <- loss_each_set(sub.val.index.list, linear.model)
  loss.dt.list[[paste(epoch)]] <- data.table(epoch, epoch.loss.dt)
  take_steps(sub.val.index.list$subtrain, linear.model)
}
(loss.dt <- rbindlist(loss.dt.list))
##      epoch        set      variable      value
##   1:     0 validation          loss  2.1905527
##   2:     0 validation error.percent 83.7654321
##   3:     0   subtrain          loss  2.1951709
##   4:     0   subtrain error.percent 82.9320988
##   5:     1 validation          loss  0.6515086
##  ---                                          
## 400:    99   subtrain error.percent  2.3148148
## 401:   100 validation          loss  0.2156566
## 402:   100 validation error.percent  6.1728395
## 403:   100   subtrain          loss  0.1040122
## 404:   100   subtrain error.percent  2.2839506

The code above first initializes a new linear model, then for each epoch, it computes the loss on each set, and takes steps using gradients computed on batches from the subtrain set. The loss/error values at each epoch are combined into the data table displayed above, and plotted below.

plot_loss <- function(loss.dt){
  (min.dt <- loss.dt[, .SD[which.min(value)], by=.(set, variable)])
  gg <- ggplot()+
    ggtitle(paste("Linear model, validation fold", validation.fold))+
    facet_grid(variable ~ ., scales="free")+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    geom_line(aes(
      epoch, value, color=set),
      data=loss.dt)+
    geom_point(aes(
      epoch, value, shape=point),
      data=data.table(point="min", min.dt))+
    scale_y_continuous("loss/error value (lower for better predictions)")+
    scale_x_continuous(
      "epoch (gradient descent passes through subtrain data)",
      limits=c(0, max_epochs*1.2),
      breaks=seq(0, max_epochs, by=10))
  directlabels::direct.label(gg, "right.polygons")
}
plot_loss(loss.dt)

The loss/error values in the first few epochs are so large that they stretch the y axis, hiding the details at larger epoch numbers, so we exclude the first ten epochs of data in the plot below.

plot_loss(loss.dt[epoch>=10])

It is clear from the plot above that the subtrain loss keeps decreasing as the number of epochs increases, whereas the validation loss reaches a minimum and then starts to increase (overfitting), so we should stop training at the number of epochs which minimizes the validation loss.

Averaging over several subtrain/validation splits

Instead of trusting a single random subtrain/validation split to determine the best number of epochs (sensitive to random split), we typically use K-fold cross-validation, and average the loss/error curves over several validation sets (less sensitive, more stable).

gradient_descent <- function(index.list, model, n_epochs, gradient.set){
  loss.dt.list <- list()
  for(epoch in seq(1, n_epochs)){
    take_steps(index.list[[gradient.set]], model)
    epoch.loss.dt <- loss_each_set(index.list, model)
    loss.dt.list[[paste(epoch)]] <- data.table(epoch, epoch.loss.dt)
  }
  rbindlist(loss.dt.list)
}
new_linear_model <- function(){
  torch::nn_sequential(
    torch::nn_flatten(),
    torch::nn_linear(n.features, n.classes))
}
torch::torch_manual_seed(1)#control weight/bias initialization.
set.seed(1)#control random batching.
fold.loss.dt.list <- list()
for(validation.fold in seq_along(sub.val.cv.list)){
  fold.model <- new_linear_model()
  one.index.list <- sub.val.cv.list[[validation.fold]]
  fold.result <- gradient_descent(
    one.index.list, fold.model, max_epochs, "subtrain")
  fold.loss.dt.list[[validation.fold]] <- data.table(
    validation.fold, fold.result)
}
(fold.loss.dt <- rbindlist(fold.loss.dt.list))
##       validation.fold epoch        set      variable       value
##    1:               1     1 validation          loss  0.65150863
##    2:               1     1 validation error.percent 14.19753086
##    3:               1     1   subtrain          loss  0.65193254
##    4:               1     1   subtrain error.percent 13.79629630
##    5:               1     2 validation          loss  0.46156967
##   ---                                                           
## 1196:               3    99   subtrain error.percent  2.25308642
## 1197:               3   100 validation          loss  0.22860475
## 1198:               3   100 validation error.percent  6.23456790
## 1199:               3   100   subtrain          loss  0.09782747
## 1200:               3   100   subtrain error.percent  2.31481481

The result above is a table of loss/error values for each epoch, set, and validation fold. It can be visualized as one panel of curves for each validation fold, as below.

gg <- ggplot()+
  ggtitle("Linear model")+
  facet_grid(variable ~ validation.fold, scales="free", labeller=label_both)+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  geom_line(aes(
    epoch, value, color=set),
    data=fold.loss.dt[epoch>=10])+
  scale_y_continuous("loss/error value (lower for better predictions)")+
  scale_x_continuous(
    "epoch (gradient descent passes through subtrain data)",
    limits=c(0, max_epochs*1.5),
    breaks=seq(0, max_epochs, by=10))
directlabels::direct.label(gg, "right.polygons")

Typically to select the best number of epochs, we first compute the mean validation loss curves (over the three folds), as below.

(mean.curves.dt <- fold.loss.dt[, .(
  mean=mean(value)
), by=.(variable, set, epoch)])
##           variable        set epoch       mean
##   1:          loss validation     1  0.6871229
##   2: error.percent validation     1 14.6913580
##   3:          loss   subtrain     1  0.6708252
##   4: error.percent   subtrain     1 14.4238683
##   5:          loss validation     2  0.4831940
##  ---                                          
## 396: error.percent   subtrain    99  2.2016461
## 397:          loss validation   100  0.2152812
## 398: error.percent validation   100  5.9053498
## 399:          loss   subtrain   100  0.1007051
## 400: error.percent   subtrain   100  2.2222222

Then we take the min of the mean validation loss curve, as below.

(selected.dt <- mean.curves.dt[set=="validation"&variable=="loss"][which.min(mean)])
##    variable        set epoch      mean
## 1:     loss validation    61 0.2129522
gg <- ggplot()+
  ggtitle("Linear model, mean curves over subtrain/validation splits")+
  facet_grid(variable ~ ., scales="free")+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  geom_line(aes(
    epoch, mean, color=set),
    data=mean.curves.dt[epoch>=10])+
  geom_point(aes(
    epoch, mean),
    data=selected.dt)+
  scale_y_continuous("loss/error value (lower for better predictions)")+
  scale_x_continuous(
    "epoch (gradient descent passes through subtrain data)",
    limits=c(0, max_epochs*1.5),
    breaks=seq(0, max_epochs, by=10))
directlabels::direct.label(gg, "right.polygons")

Finally, after having learned the best number of epochs, we re-initialize the model, and run gradient descent on the train set (not subtrain), using that number of epochs.

final.model <- new_linear_model()
final.loss <- gradient_descent(
  train.test.index.list, final.model, selected.dt$epoch, "train")
final.loss[epoch==max(epoch)]
##    epoch   set      variable     value
## 1:    61  test          loss 0.1851047
## 2:    61  test error.percent 5.0185109
## 3:    61 train          loss 0.1253111
## 4:    61 train error.percent 3.0041152

Neural networks

Everything we have done above for the linear model generalizes easily to neural networks; the only change is in the definition of the model.

For example, a fully connected (dense weight matrices) neural network with several hidden layers, and the same number of hidden units per layer, can be defined as below.

n.hidden.units <- 100
new_fully_connected_units <- function(units.per.layer){
  seq.args <- list(torch::nn_flatten())
  for(output.i in seq(2, length(units.per.layer))){
    input.i <- output.i-1
    seq.args[[length(seq.args)+1]] <- torch::nn_linear(
      units.per.layer[[input.i]], 
      units.per.layer[[output.i]])
    if(output.i<length(units.per.layer)){
      seq.args[[length(seq.args)+1]] <- torch::nn_relu()
    }
  }
  do.call(torch::nn_sequential, seq.args)
}
new_fully_connected <- function(n.hidden.layers){
  units.per.layer <- c(
    n.features,
    rep(n.hidden.units,n.hidden.layers),
    n.classes)
  new_fully_connected_units(units.per.layer)
}

new_fully_connected(0)# linear model.
## An `nn_module` containing 2,570 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_flatten> #0 parameters
## • 1: <nn_linear> #2,570 parameters
new_fully_connected(1)# neural network with one hidden layer (not deep).
## An `nn_module` containing 26,710 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_flatten> #0 parameters
## • 1: <nn_linear> #25,700 parameters
## • 2: <nn_relu> #0 parameters
## • 3: <nn_linear> #1,010 parameters
new_fully_connected(2)# "deep" neural network.
## An `nn_module` containing 36,810 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_flatten> #0 parameters
## • 1: <nn_linear> #25,700 parameters
## • 2: <nn_relu> #0 parameters
## • 3: <nn_linear> #10,100 parameters
## • 4: <nn_relu> #0 parameters
## • 5: <nn_linear> #1,010 parameters
new_fully_connected(3)# "deeper" neural network.
## An `nn_module` containing 46,910 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_flatten> #0 parameters
## • 1: <nn_linear> #25,700 parameters
## • 2: <nn_relu> #0 parameters
## • 3: <nn_linear> #10,100 parameters
## • 4: <nn_relu> #0 parameters
## • 5: <nn_linear> #10,100 parameters
## • 6: <nn_relu> #0 parameters
## • 7: <nn_linear> #1,010 parameters
## "deep" with about same # of params as conv net.
(dense.network <- new_fully_connected(8))
## An `nn_module` containing 97,410 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_flatten> #0 parameters
## • 1: <nn_linear> #25,700 parameters
## • 2: <nn_relu> #0 parameters
## • 3: <nn_linear> #10,100 parameters
## • 4: <nn_relu> #0 parameters
## • 5: <nn_linear> #10,100 parameters
## • 6: <nn_relu> #0 parameters
## • 7: <nn_linear> #10,100 parameters
## • 8: <nn_relu> #0 parameters
## • 9: <nn_linear> #10,100 parameters
## • 10: <nn_relu> #0 parameters
## • 11: <nn_linear> #10,100 parameters
## • 12: <nn_relu> #0 parameters
## • 13: <nn_linear> #10,100 parameters
## • 14: <nn_relu> #0 parameters
## • 15: <nn_linear> #10,100 parameters
## • 16: <nn_relu> #0 parameters
## • 17: <nn_linear> #1,010 parameters
dense.network(zip.X.tensor[1:2,])
## torch_tensor
## 0.01 *
##  0.7152 -8.4251 -5.3038 -2.3340  2.4992 -9.4213 -8.9800  0.9778  5.0086  7.9127
##   0.7090 -8.4154 -5.3267 -2.3038  2.5110 -9.4004 -9.0140  1.0051  4.9921  7.9313
## [ CPUFloatType{2,10} ][ grad_fn = <AddmmBackward0> ]
linear.model(zip.X.tensor[1:2,])
## torch_tensor
## Columns 1 to 8  5.0040  -1.3596   1.3680  -7.8165   2.2423   5.0716  13.7993  -7.6465
##   1.5468  -2.3510   2.5304   2.3051  -3.8605  11.6086  -3.1735  -3.2827
## 
## Columns 9 to 10  1.7977 -12.3845
##   1.6323  -6.1035
## [ CPUFloatType{2,10} ][ grad_fn = <AddmmBackward0> ]

Note that in the code above, specifying more than one hidden layer results in what is considered a deep neural network (deep learning).
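
The parameter counts printed above can be verified by noting that each nn_linear layer has one weight per (input, output) pair, plus one bias per output; for example with 256 inputs, 100 hidden units per layer, and 10 classes:

c(first=256*100+100, hidden=100*100+100, last=100*10+10)
#25700, 10100, 1010: same as the per-module counts printed above.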

In the code below, we define another kind of deep neural network, with a sparse/convolutional layer:

new_convolutional <- function(){
  seq2flat <- torch::nn_sequential(
    torch::nn_conv2d(in_channels = 1, out_channels = 20, kernel_size = 3),
    torch::nn_relu(),
    torch::nn_max_pool2d(kernel_size = 2),
    torch::nn_flatten())
  one.flat <- seq2flat(zip.X.tensor[1,,,,drop=FALSE])
  n.flat <- length(one.flat)
  torch::nn_sequential(
    seq2flat, 
    torch::nn_linear(n.flat, n.hidden.units),
    torch::nn_relu(),
    torch::nn_linear(n.hidden.units, n.classes))
}
(convolutional.network <- new_convolutional())
## An `nn_module` containing 99,310 parameters.
## 
## ── Modules ─────────────────────────────────────────────────────────────────────
## • 0: <nn_sequential> #200 parameters
## • 1: <nn_linear> #98,100 parameters
## • 2: <nn_relu> #0 parameters
## • 3: <nn_linear> #1,010 parameters

We combine all of the models into the list below,

new_model_list <- list(
  linear=new_linear_model,
  convolutional=new_convolutional,
  one_dense_hidden_layer=function()new_fully_connected(1),
  eight_dense_hidden_layers=function()new_fully_connected(8))

The code below runs the full model training (including learning number of epochs), for each model and test fold.

loss_each_split <- function(new_model, train.indices){
  cv.list <- cv_list(train.indices, "subtrain", "validation")
  fold.loss.dt.list <- list()
  for(validation.fold in seq_along(cv.list)){
    fold.model <- new_model()
    one.index.list <- cv.list[[validation.fold]]#use the splits created above for this train set.
    fold.result <- gradient_descent(
      one.index.list, fold.model, max_epochs, "subtrain")
    fold.loss.dt.list[[validation.fold]] <- data.table(
      validation.fold, fold.result)
  }
  rbindlist(fold.loss.dt.list)
}
train_model <- function(new_model, tt.index.list){
  fold.loss.dt <- loss_each_split(new_model, tt.index.list$train)
  mean.curves.dt <- fold.loss.dt[, .(
    mean=mean(value)
  ), by=.(variable, set, epoch)]
  selected.dt <- mean.curves.dt[
    set=="validation"&variable=="loss"][which.min(mean)]
  final.model <- new_model()
  final.loss <- gradient_descent(
    tt.index.list, final.model, selected.dt$epoch, "train")
  list(
    model=final.model,
    final.loss=final.loss,
    fold.loss=fold.loss.dt,
    mean.curves=mean.curves.dt)
}
test.loss.list <- list()
for(test.fold in seq_along(train.test.cv.list)){
  tt.index.list <- train.test.cv.list[[test.fold]]
  train.label.tab <- table(zip.y.int[tt.index.list$train])
  featureless.pred <- names(train.label.tab)[which.max(train.label.tab)]
  test.label <- zip.y.int[tt.index.list$test]
  is.test.error <- test.label != featureless.pred
  test.loss.list[[paste(test.fold,"featureless")]] <- data.table(
    test.fold, model.name="featureless", epoch=NA_integer_, set="test",
    variable="error.percent",
    value=100*mean(is.test.error))
  for(model.name in names(new_model_list)){
    out.name <- paste(test.fold,model.name)
    print(out.name)
    new_model <- new_model_list[[model.name]]
    result <- train_model(new_model, tt.index.list)
    test.loss.list[[out.name]] <- data.table(
      test.fold, model.name,
      result$final.loss[epoch==max(epoch)&set=="test"])
  }
}
## [1] "1 linear"
## [1] "1 convolutional"
## [1] "1 one_dense_hidden_layer"
## [1] "1 eight_dense_hidden_layers"
## [1] "2 linear"
## [1] "2 convolutional"
## [1] "2 one_dense_hidden_layer"
## [1] "2 eight_dense_hidden_layers"
## [1] "3 linear"
## [1] "3 convolutional"
## [1] "3 one_dense_hidden_layer"
## [1] "3 eight_dense_hidden_layers"
(test.loss <- rbindlist(test.loss.list))
##     test.fold                model.name epoch  set      variable       value
##  1:         1               featureless    NA test error.percent 83.66927190
##  2:         1                    linear    70 test          loss  0.18385553
##  3:         1                    linear    70 test error.percent  4.97737557
##  4:         1             convolutional    62 test          loss  0.06202247
##  5:         1             convolutional    62 test error.percent  2.01563143
##  6:         1    one_dense_hidden_layer    65 test          loss  0.13344371
##  7:         1    one_dense_hidden_layer    65 test error.percent  3.78445084
##  8:         1 eight_dense_hidden_layers    99 test          loss  0.93828464
##  9:         1 eight_dense_hidden_layers    99 test error.percent 30.93377211
## 10:         2               featureless    NA test error.percent 83.82716049
## 11:         2                    linear    62 test          loss  0.20835865
## 12:         2                    linear    62 test error.percent  5.34979424
## 13:         2             convolutional    64 test          loss  0.08277620
## 14:         2             convolutional    64 test error.percent  2.42798354
## 15:         2    one_dense_hidden_layer    71 test          loss  0.16659287
## 16:         2    one_dense_hidden_layer    71 test error.percent  4.36213992
## 17:         2 eight_dense_hidden_layers    97 test          loss  0.55776656
## 18:         2 eight_dense_hidden_layers    97 test error.percent 14.19753086
## 19:         3               featureless    NA test error.percent 83.37448560
## 20:         3                    linear    81 test          loss  0.20346425
## 21:         3                    linear    81 test error.percent  5.26748971
## 22:         3             convolutional    46 test          loss  0.09574776
## 23:         3             convolutional    46 test error.percent  2.42798354
## 24:         3    one_dense_hidden_layer    82 test          loss  0.16080725
## 25:         3    one_dense_hidden_layer    82 test error.percent  3.95061728
## 26:         3 eight_dense_hidden_layers    93 test          loss  0.36736831
## 27:         3 eight_dense_hidden_layers    93 test error.percent  5.72016461
##     test.fold                model.name epoch  set      variable       value

The for loops above are embarrassingly parallel (each iteration can be done independently, in any order), so great speedups could be obtained by running them in parallel, for example on a computer cluster such as NAU Monsoon, see my tutorial here.
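
For example, below is a minimal sketch of one way to run the test folds in parallel on a unix system, assuming the parallel package (fork-based mclapply); note that it returns a list of per-fold results rather than the combined data table built in the loop above.

one_test_fold <- function(test.fold){
  tt.index.list <- train.test.cv.list[[test.fold]]
  lapply(new_model_list, train_model, tt.index.list)
}
all.fold.results <- parallel::mclapply(
  seq_along(train.test.cv.list), one_test_fold, mc.cores=3)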

ggplot()+
  geom_point(aes(
    value, model.name),
    shape=1,
    data=test.loss)+
  facet_grid(. ~ variable, labeller=label_both, scales="free")

Above we plot the test loss/error for all models; below we exclude the outliers (featureless and eight_dense_hidden_layers), to more easily see the differences between the other models.

test.some <- test.loss[
  !model.name %in% c("featureless","eight_dense_hidden_layers")]
ggplot()+
  geom_point(aes(
    value, model.name),
    shape=1,
    data=test.some)+
  facet_grid(. ~ variable, labeller=label_both, scales="free")

Package versions used

sessionInfo()
## R Under development (unstable) (2023-03-29 r84123)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
## 
## Matrix products: default
## BLAS:   /projects/genomic-ml/lib64/R/lib/libRblas.so 
## LAPACK: /projects/genomic-ml/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Phoenix
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  utils     datasets  grDevices methods   base     
## 
## other attached packages:
## [1] ggplot2_3.4.2     data.table_1.14.8
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.0.5              gtable_0.3.3           jsonlite_1.8.4        
##  [4] dplyr_1.1.1            compiler_4.4.0         highr_0.10            
##  [7] tidyselect_1.2.0       Rcpp_1.0.10            callr_3.7.3           
## [10] jquerylib_0.1.4        directlabels_2021.2.24 scales_1.2.1          
## [13] fastmap_1.1.1          R6_2.5.1               labeling_0.4.2        
## [16] generics_0.1.3         knitr_1.42             tibble_3.2.1          
## [19] munsell_0.5.0          bslib_0.4.2            pillar_1.9.0          
## [22] R.utils_2.12.2         rlang_1.1.0            utf8_1.2.3            
## [25] cachem_1.0.7           xfun_0.38              quadprog_1.5-8        
## [28] sass_0.4.5             bit64_4.0.5            cli_3.6.1             
## [31] withr_2.5.0            magrittr_2.0.3         ps_1.7.3              
## [34] processx_3.8.0         digest_0.6.31          grid_4.4.0            
## [37] torch_0.10.0           lifecycle_1.0.3        coro_1.0.3            
## [40] R.methodsS3_1.8.2      R.oo_1.25.0            vctrs_0.6.1           
## [43] evaluate_0.20          glue_1.6.2             farver_2.1.1          
## [46] fansi_1.0.4            colorspace_2.1-1       rmarkdown_2.21        
## [49] tools_4.4.0            pkgconfig_2.0.3        htmltools_0.5.5