## Install and attach packages

```{r}
if(FALSE){
  install.packages(c("data.table", "R.utils", "ggplot2", "torch"))
  torch::torch_tensor(pi)
  rmarkdown::render("2023-04-19-deep-learning.Rmd")
}
library(data.table)
library(ggplot2)
```

In R one of the biggest strengths is the variety of packages that are
available for different kinds of data analysis. In this tutorial we
use packages:

* `data.table` for reading data from disk, and converting into a
  format useful for visualization.
* `ggplot2` for visualizing the data and results.
* `torch` for machine learning using linear models and neural
  networks.

In R there are two methods for accessing objects which are exported
from packages:

* double-colon syntax, `pkg::fun` means to get `fun` which is exported
  from `pkg`, for example `data.table::fread` is a function for
  reading text data files into R. This is useful for
  teaching/understanding because it is explicit (easy to see which
  package each object comes from), so it should be preferred for most
  use cases.
* We can also use `library(package)` to attach all of the exported
  functions in package. For example after doing `library(data.table)`
  you can just write `fread` (without `data.table::`). This is useful
  for convenience (no need to use the double colon syntax), but it
  hides which package each object comes from (potentially confusing),
  so it should be used sparingly.

## Download, read, visualize zip code data

Download zip code image data set.

```{r}
url.vec <- c(
  zip.gz="https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.train.gz")
for(f in names(url.vec)){
  u <- url.vec[[f]]
  if(!file.exists(f)){
    download.file(u, f)
  }
}
```

Read data from gzipped plain text (one row per image, columns
separated by spaces) into R as data table.

```{r}
zip.dt <- fread("zip.gz")
```

Convert data table to two multi-dimensional arrays (one for inputs,
one for outputs).

```{r}
zip.feature.dt <- zip.dt[,-1]
n.features <- ncol(zip.feature.dt)
(zip.pixels <- sqrt(n.features))
n.images <- nrow(zip.dt)
zip.X.array <- array(
  unlist(zip.feature.dt),
  c(n.images, 1, zip.pixels, zip.pixels))
str(zip.X.array)
zip.y.array <- array(zip.dt$V1, n.images)
str(zip.y.array)
table(zip.y.array)
```

Visualize one image.

```{r}
image(zip.X.array[1,,,])
```

Below we convert several digits to long/tall form for display. Note
data table syntax below, `by=observation` is like a for loop over
observations, and the data tables returned for each value of
`observation` are combined into a single result table, `zip.some`,
with all of the images.

```{r}
(zip.some <- data.table(observation=1:12)[, {
  data.table(
    label=zip.y.array[observation],
    col=rep(1:zip.pixels, zip.pixels),
    row=rep(1:zip.pixels, each=zip.pixels),
    intensity=as.numeric(zip.X.array[observation,,,]))
}, by=observation])
```

Display images using a panel/facet for each observation.

```{r}
breaks <- c(1, zip.pixels)
ggplot()+
  geom_tile(aes(
    x=col, y=row, fill=intensity),
    data=zip.some)+
  facet_wrap(observation + label ~ ., labeller=label_both)+
  scale_x_continuous(breaks=breaks)+
  scale_y_reverse(breaks=breaks)+
  coord_equal()+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  scale_fill_gradient(low="black", high="white")
```

## Convert R data to torch

Torch is a machine learning library which is popular for deep
learning, because it provides automatic differentiation (also known as
auto-grad or automatic gradients), which makes it easy to learn
parameters in deep neural networks. 

* In auto-grad systems, you code the forward calculation (predictions
  and loss),
* using operations for which the gradients of outputs with respect to
  inputs are known (already coded by the creators of the torch
  library),
* so the auto-grad system can use the chain rule (back-propagation) to
  derive the gradients (you do not have to explicitly code the
  gradients, which can be error/bug-prone),
* and the gradients are used for learning (take steps in the negative
  gradient direction to decrease loss).
* For more information about auto-grad, with a demo of how to code a
  simple auto-grad system from scratch in R, and an explanation of how
  to code interactive visualizations of the learning, see Chapter
  18, Neural Networks, in [The animint2
  Manual](https://rcdata.nau.edu/genomic-ml/animint2-manual/Ch18-neural-networks.html).

The popular python module `torch` uses a C++ library `libtorch` under
the hood, which R package `torch` also uses, so it has most of the
same functionality/concepts, and similar naming conventions.

The main data structure in torch is a tensor, which is analogous to
the R multi-dimensional array. Below we convert R arrays to torch
tensors.

```{r}
(zip.y.tensor <- torch::torch_tensor(zip.y.array+1L, torch::torch_long()))
str(zip.y.tensor)
typeof(zip.y.array)
(zip.X.tensor <- torch::torch_tensor(zip.X.array))
str(zip.X.tensor)
typeof(zip.X.array)
```

An important difference between R and torch is that R is very
permissive about types (double/numeric can be used for almost
anything), whereas torch is very strict (single precision float
typically used for inputs, long int used for output).

## Linear model in torch

Before explaining deep learning (neural networks), we first explain
linear models. To implement a linear model in torch, we first need to
flatten each 2d image back to a 1d vector of pixel intensity values,

```{r}
flatten <- torch::nn_flatten()
(zip.X.flat <- flatten(zip.X.tensor))
str(zip.X.flat)
```

A linear model for classification is defined by two kinds of learned
parameters. First, a matrix of real-valued weights, with one row for
each output class, and one column for each input feature. Second, an
intercept/bias, which is a vector of real numbers, one for each output
class.

```{r}
(n.classes <- length(unique(zip.y.array)))
torch::torch_manual_seed(1)#for reproducibility.
linear <- torch::nn_linear(n.features, n.classes)
str(linear$parameters)
```

Calling the linear object with the flattened data as input, yields a
matrix of predicted scores as output (one row per image, one column
per class, larger scores mean more likely to predict that class).

```{r}
(zip.linear.pred <- linear(zip.X.flat))
str(zip.linear.pred)
```

Rather than using flatten and linear operations in separate lines of
code, below we combine them into a sequential object,

```{r}
(sequential <- torch::nn_sequential(flatten, linear))
zip.sequential.pred <- sequential(zip.X.tensor)
str(zip.sequential.pred)
```

What are the current predicted classes? We have not yet done any
learning, and the initialization is random for the weight/bias
parameters, so the initial predictions are very bad, as we see below.

```{r}
pred.class.vec <- apply(zip.sequential.pred, 1, which.max)
zip.y.int <- as.integer(zip.y.tensor)
table(prediction=pred.class.vec, label=zip.y.int)
```

Above is the confusion matrix (predicted classes on rows, true label
on columns, entries are counts of images), and below is the error
rate, which is the diagonal of the confusion matrix.

```{r}
is.error <- pred.class.vec != zip.y.int
(percent.error <- 100*mean(is.error))
```

Above is the linear model prediction error at initialization, below
is the baseline featureless model error (always predict most
frequent class label),

```{r}
(zip.y.tab <- table(zip.y.int))
(most.frequent.count <- zip.y.tab[which.max(zip.y.tab)])
(most.frequent.label <- as.integer(names(most.frequent.count)))
100*mean(zip.y.int!=most.frequent.label)
```

## Learning in linear model

To improve predictions, we need to use a gradient descent learning
algorithm. First we compute a differentiable loss (cross entropy loss
in this case, multi-class classification), then we compute gradients
using the backward method, then we update parameters by taking steps
in the negative gradient direction. Below we define the loss and optimizer,

```{r}
loss.fun <- torch::nn_cross_entropy_loss()
step.size <- 0.1
optimizer <- torch::optim_sgd(sequential$parameters, lr=step.size)
```

Below we compute predictions and loss,

```{r}
zip.sequential.pred <- sequential(zip.X.tensor)
(loss.tensor <- loss.fun(zip.sequential.pred, zip.y.tensor))
```

Below we set gradients to zero, then call backward to compute and
store gradients of the loss with respect to model parameters.

```{r}
optimizer$zero_grad()
linear$bias$grad
loss.tensor$backward()
linear$bias$grad
```

Below we use the gradients to update the model parameters.

```{r}
linear$bias
optimizer$step()
linear$bias
```

And that's the basic idea of gradient descent learning! Just keep
updating the model parameters until you get good predictions!

## Cross-validation

Keep in mind our goal is generalization, meaning we want to get good
predictions on new data/images that the algorithm was not trained
on. We therefore need to use cross-validation for two purposes:

* testing: hold out some samples as a test set, to evaluate
  predictions after all parameters have been learned on the
  other/train data.
* training: split the train set into subtrain and validation
  sets. Subtrain set is used to compute gradients and update model
  parameters (weights/bias), and validation set is used to learn model
  complexity hyper-parameters (step size, number of steps/epochs, etc).
  
First we need to split the whole data set into train and test, using
cross-validation.

```{r}
cv_list <- function(index.vec, keep, hold.out, n.folds=3){
  uniq.folds <- 1:n.folds
  fold.vec <- sample(rep(uniq.folds, l=length(index.vec)))
  out.list <- list()
  for(fold in uniq.folds){
    is.held.out <- fold.vec==fold
    fold.list <- list()
    fold.list[[hold.out]] <- index.vec[is.held.out]
    fold.list[[keep]] <- index.vec[!is.held.out]
    out.list[[paste(hold.out,"fold",fold)]] <- fold.list
  }
  out.list
}
set.seed(1)
train.test.cv.list <- cv_list(1:n.images, "train", "test")
str(train.test.cv.list)
```

Above we used K-fold cross-validation to create several train/test
splits, represented in R as a list of lists of indices. Below we split
the train set into subtrain and validation sets.

```{r}
test.fold <- 1
train.test.index.list <- train.test.cv.list[[test.fold]]
sub.val.cv.list <- cv_list(
  train.test.index.list$train, "subtrain", "validation")
str(sub.val.cv.list)
```

Below we fix one subtrain/validation split,

```{r}
validation.fold <- 1
sub.val.index.list <- sub.val.cv.list[[validation.fold]]
```

We compute the average loss and gradient over the entire subtrain set below,

```{r}
get_pred_loss <- function(model, index.vec){
  X.subset <- zip.X.tensor[index.vec,,,,drop=FALSE]
  y.subset <- zip.y.tensor[index.vec]
  pred.subset <- model(X.subset)
  list(pred=pred.subset, loss=loss.fun(pred.subset, y.subset))
}
optimizer$zero_grad()
(subtrain.loss <- get_pred_loss(sequential, sub.val.index.list$subtrain)$loss)
subtrain.loss$backward()
optimizer$step()
```

Above is the mean loss over the subtrain set, which can be used for
gradients (deterministic optimization), or for monitoring convergence
(subtrain loss should decrease with each step/epoch, if step size is
chosen appropriately). For monitoring convergence and avoiding
overfitting, we would like to compute the loss and error rate, as in
the function below.

```{r}
loss_one_set <- function(set.indices, model){
  torch::with_no_grad({#for efficiency, gradients not necessary.
    set.label.vec <- zip.y.tensor[set.indices]
    set.pred.loss <- get_pred_loss(model, set.indices)
    set.pred.class <- apply(set.pred.loss$pred, 1, function(prob.vec){
      if(any(is.na(prob.vec)))NA_integer_ else which.max(prob.vec)
    })
    error.rate <- if(any(is.na(set.pred.class))){
      NA_real_
    }else{
      mean(as.numeric(set.label.vec != set.pred.class))
    }
    rbind(
      data.table(variable="loss",value=as.numeric(set.pred.loss$loss)),
      data.table(variable="error.percent",value=100*error.rate))
  })
}
loss_one_set(sub.val.index.list$subtrain, sequential)
```

To compute loss for each set, we use the function below,

```{r}
loss_each_set <- function(list.of.indices, model){
  data.table(set=names(list.of.indices))[, {
    loss_one_set(list.of.indices[[set]], model)
  }, by=set]
}
loss_each_set(sub.val.index.list, sequential)
```

## Batching

In previous sections we have computed loss and gradients over all the
observations/images in the subtrain set (deterministic
optimization). For learning it is more common to compute gradients
with respect to a random sample of observations/images (stochastic
optimization), as below,

```{r}
get_batch_list <- function(index.vec, batch_size=200){
  n_data <- length(index.vec)
  n_batches <- ceiling(n_data / batch_size)
  batch.vec <- sample(rep(1:n_batches,each=batch_size)[1:n_data])
  split(index.vec, batch.vec)
}
big.batches <- get_batch_list(sub.val.index.list$subtrain, 1000)
str(big.batches)
```

After creating the batches above, we use a for loop below to compute
the loss and gradient for each batch.

```{r}
for(batch.number in seq_along(big.batches)){
  batch.indices <- big.batches[[batch.number]]
  batch.loss <- get_pred_loss(sequential, batch.indices)$loss
  print(batch.loss)
  optimizer$zero_grad()
  batch.loss$backward()
  optimizer$step()
}
```

We use the function below to implement the same logic as above (take a
step for each subtrain batch). It modifies the parameters of the input
`model` in-place.

```{r}
take_steps <- function(subtrain.indices, model){
  optimizer <- torch::optim_sgd(model$parameters, lr=step.size)
  batch.list <- get_batch_list(subtrain.indices)
  for(batch.number in seq_along(batch.list)){
    batch.indices <- batch.list[[batch.number]]
    batch.loss <- get_pred_loss(model, batch.indices)$loss
    optimizer$zero_grad()
    batch.loss$backward()
    optimizer$step()
  }
}
```

## Gradient descent learning loop over epochs

Below we can do a gradient descent learning for loop over epochs,
while monitoring the validation loss, to learn how many epochs results
in the best predictions (smallest loss/error).

```{r}
max_epochs <- 100
torch::torch_manual_seed(1)#control weight/bias initialization.
set.seed(1)#control random batching.
linear.model <- torch::nn_sequential(
  torch::nn_flatten(),
  torch::nn_linear(n.features, n.classes))
loss.dt.list <- list()
for(epoch in seq(0, max_epochs)){
  epoch.loss.dt <- loss_each_set(sub.val.index.list, linear.model)
  loss.dt.list[[paste(epoch)]] <- data.table(epoch, epoch.loss.dt)
  take_steps(sub.val.index.list$subtrain, linear.model)
}
(loss.dt <- rbindlist(loss.dt.list))
```

The code above first initializes a new linear model, then for each
epoch, it computes the loss on each set, and takes steps using
gradients computed on batches from the subtrain set. The loss/error
values at each epoch are combined into the data table displayed above,
and plotted below.

```{r}
plot_loss <- function(loss.dt){
  (min.dt <- loss.dt[, .SD[which.min(value)], by=.(set, variable)])
  gg <- ggplot()+
    ggtitle(paste("Linear model, validation fold", validation.fold))+
    facet_grid(variable ~ ., scales="free")+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    geom_line(aes(
      epoch, value, color=set),
      data=loss.dt)+
    geom_point(aes(
      epoch, value, shape=point),
      data=data.table(point="min", min.dt))+
    scale_y_continuous("loss/error value (lower for better predictions)")+
    scale_x_continuous(
      "epoch (gradient descent passes through subtrain data)",
      limits=c(0, max_epochs*1.2),
      breaks=seq(0, max_epochs, by=10))
  directlabels::direct.label(gg, "right.polygons")
}
plot_loss(loss.dt)
```

Since the loss/error values are so large when the number of epochs of
learning is so small, it makes the max on the y axis too large to see
details for larger epoch numbers, so we exclude the first ten epochs
of data in the plot below.

```{r}
plot_loss(loss.dt[epoch>=10])
```

It is clear from the plot above that 

* the subtrain loss/error consistently decrease as the number of
  learning epochs increases, which means that the step size has been
  chosen appropriately (see interactive figure about gradient descent
  for regression for examples of what it looks like when step size is
  too small/large),
* the validation loss/error has the characteristic U shape of a
  regularization paramaeter, meaning that we should use early stopping
  regularization, at about 50-60 epochs.
  
## Averaging over several subtrain/validation splits

Instead of trusting a single random subtrain/validation split to
determine the best number of epochs (sensitive to random split), we
typically use K-fold cross-validation, and average the loss/error
curves over several validation sets (less sensitive, more stable).

```{r}
gradient_descent <- function(index.list, model, n_epochs, gradient.set){
  loss.dt.list <- list()
  for(epoch in seq(1, n_epochs)){
    take_steps(index.list[[gradient.set]], model)
    epoch.loss.dt <- loss_each_set(index.list, model)
    loss.dt.list[[paste(epoch)]] <- data.table(epoch, epoch.loss.dt)
  }
  rbindlist(loss.dt.list)
}
new_linear_model <- function(){
  torch::nn_sequential(
    torch::nn_flatten(),
    torch::nn_linear(n.features, n.classes))
}
torch::torch_manual_seed(1)#control weight/bias initialization.
set.seed(1)#control random batching.
fold.loss.dt.list <- list()
for(validation.fold in seq_along(sub.val.cv.list)){
  fold.model <- new_linear_model()
  one.index.list <- sub.val.cv.list[[validation.fold]]
  fold.result <- gradient_descent(
    one.index.list, fold.model, max_epochs, "subtrain")
  fold.loss.dt.list[[validation.fold]] <- data.table(
    validation.fold, fold.result)
}
(fold.loss.dt <- rbindlist(fold.loss.dt.list))
```

The result above is a table of loss/error for each epoch and
validation fold. It can be visualized as three separate curves, as
below.

```{r}
gg <- ggplot()+
  ggtitle("Linear model")+
  facet_grid(variable ~ validation.fold, scales="free", labeller=label_both)+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  geom_line(aes(
    epoch, value, color=set),
    data=fold.loss.dt[epoch>=10])+
  scale_y_continuous("loss/error value (lower for better predictions)")+
  scale_x_continuous(
    "epoch (gradient descent passes through subtrain data)",
    limits=c(0, max_epochs*1.5),
    breaks=seq(0, max_epochs, by=10))
directlabels::direct.label(gg, "right.polygons")
```

Typically to select the best number of epochs, we first compute the
mean validation loss curves (over the three folds), as below.

```{r}
(mean.curves.dt <- fold.loss.dt[, .(
  mean=mean(value)
), by=.(variable, set, epoch)])
```

Then we take the min of the validation loss curve, as below.

```{r}
(selected.dt <- mean.curves.dt[set=="validation"&variable=="loss"][which.min(mean)])
gg <- ggplot()+
  ggtitle("Linear model, mean curves over subtrain/validation splits")+
  facet_grid(variable ~ ., scales="free")+
  theme_bw()+
  theme(panel.spacing=grid::unit(0, "lines"))+
  geom_line(aes(
    epoch, mean, color=set),
    data=mean.curves.dt[epoch>=10])+
  geom_point(aes(
    epoch, mean),
    data=selected.dt)+
  scale_y_continuous("loss/error value (lower for better predictions)")+
  scale_x_continuous(
    "epoch (gradient descent passes through subtrain data)",
    limits=c(0, max_epochs*1.5),
    breaks=seq(0, max_epochs, by=10))
directlabels::direct.label(gg, "right.polygons")
```

Finally, after having learned the best number of epochs, we
re-initialize the model, and run gradient descent on the train set
(not subtrain), using that number of epochs.

```{r}
final.model <- new_linear_model()
final.loss <- gradient_descent(
  train.test.index.list, final.model, selected.dt$epoch, "train")
final.loss[epoch==max(epoch)]
```

## Neural networks

Everything we have done above for the linear model generalizes easily
to neural networks; the only change is in the definition of the model.

For example, a fully connected (dense weight matrices) neural network with several hidden
layers, and the same number of hidden units per layer, can be defined
as below.

```{r}
n.hidden.units <- 100
new_fully_connected_units <- function(units.per.layer){
  seq.args <- list(torch::nn_flatten())
  for(output.i in seq(2, length(units.per.layer))){
    input.i <- output.i-1
    seq.args[[length(seq.args)+1]] <- torch::nn_linear(
      units.per.layer[[input.i]], 
      units.per.layer[[output.i]])
    if(output.i<length(units.per.layer)){
      seq.args[[length(seq.args)+1]] <- torch::nn_relu()
    }
  }
  do.call(torch::nn_sequential, seq.args)
}
new_fully_connected <- function(n.hidden.layers){
  units.per.layer <- c(
    n.features,
    rep(n.hidden.units,n.hidden.layers),
    n.classes)
  new_fully_connected_units(units.per.layer)
}

new_fully_connected(0)# linear model.
new_fully_connected(1)# neural network with one hidden layer (not deep).
new_fully_connected(2)# "deep" neural network.
new_fully_connected(3)# "deeper" neural network.

## "deep" with about same # of params as conv net.
(dense.network <- new_fully_connected(8))
dense.network(zip.X.tensor[1:2,])
linear.model(zip.X.tensor[1:2,])
```

Note that in the code above if we specify more than one hidden layer,
that is considered a deep neural network (deep learning).

In the code below, we define another kind of deep neural network, with
a sparse/convolutional layer:

```{r}
new_convolutional <- function(){
  seq2flat <- torch::nn_sequential(
    torch::nn_conv2d(in_channels = 1, out_channels = 20, kernel_size = 3),
    torch::nn_relu(),
    torch::nn_max_pool2d(kernel_size = 2),
    torch::nn_flatten())
  one.flat <- seq2flat(zip.X.tensor[1,,,,drop=FALSE])
  n.flat <- length(one.flat)
  torch::nn_sequential(
    seq2flat, 
    torch::nn_linear(n.flat, n.hidden.units),
    torch::nn_relu(),
    torch::nn_linear(n.hidden.units, n.classes))
}
(convolutional.network <- new_convolutional())
```

We combine all of the models into the list below,

```{r}
new_model_list <- list(
  linear=new_linear_model,
  convolutional=new_convolutional,
  one_dense_hidden_layer=function()new_fully_connected(1),
  eight_dense_hidden_layers=function()new_fully_connected(8))
```

The code below runs the full model training (including learning number
of epochs), for each model and test fold.

```{r}
loss_each_split <- function(new_model, train.indices){
  cv.list <- cv_list(train.indices, "subtrain", "validation")
  fold.loss.dt.list <- list()
  for(validation.fold in seq_along(cv.list)){
    fold.model <- new_model()
    one.index.list <- sub.val.cv.list[[validation.fold]]
    fold.result <- gradient_descent(
      one.index.list, fold.model, max_epochs, "subtrain")
    fold.loss.dt.list[[validation.fold]] <- data.table(
      validation.fold, fold.result)
  }
  rbindlist(fold.loss.dt.list)
}
train_model <- function(new_model, tt.index.list){
  fold.loss.dt <- loss_each_split(new_model, tt.index.list$train)
  mean.curves.dt <- fold.loss.dt[, .(
    mean=mean(value)
  ), by=.(variable, set, epoch)]
  selected.dt <- mean.curves.dt[
    set=="validation"&variable=="loss"][which.min(mean)]
  final.model <- new_model()
  final.loss <- gradient_descent(
    tt.index.list, final.model, selected.dt$epoch, "train")
  list(
    model=final.model,
    final.loss=final.loss,
    fold.loss=fold.loss.dt,
    mean.curves=mean.curves.dt)
}
test.loss.list <- list()
for(test.fold in seq_along(train.test.cv.list)){
  tt.index.list <- train.test.cv.list[[test.fold]]
  train.label.tab <- table(zip.y.int[tt.index.list$train])
  featureless.pred <- names(train.label.tab)[which.max(train.label.tab)]
  test.label <- zip.y.int[tt.index.list$test]
  is.test.error <- test.label != featureless.pred
  test.loss.list[[paste(test.fold,"featureless")]] <- data.table(
    test.fold, model.name="featureless", epoch=NA_integer_, set="test",
    variable="error.percent",
    value=100*mean(is.test.error))
  for(model.name in names(new_model_list)){
    out.name <- paste(test.fold,model.name)
    print(out.name)
    new_model <- new_model_list[[model.name]]
    result <- train_model(new_model, tt.index.list)
    test.loss.list[[out.name]] <- data.table(
      test.fold, model.name,
      result$final.loss[epoch==max(epoch)&set=="test"])
  }
}
(test.loss <- rbindlist(test.loss.list))
```

The for loops above are embarassingly parallel (can be done independently,
in any order), so great speedups could be obtained if you run in them
parallel, for example on a computer cluster such as NAU Monsoon, [see
my tutorial
here](https://tdhock.github.io/blog/2020/monsoon-batchtools/).

```{r}
ggplot()+
  geom_point(aes(
    value, model.name),
    shape=1,
    data=test.loss)+
  facet_grid(. ~ variable, labeller=label_both, scales="free")
```

Above is all data, below excludes outliers.

```{r}
test.some <- test.loss[
  !model.name %in% c("featureless","eight_dense_hidden_layers")]
ggplot()+
  geom_point(aes(
    value, model.name),
    shape=1,
    data=test.some)+
  facet_grid(. ~ variable, labeller=label_both, scales="free")
```

## Package versions used

```{r}
sessionInfo()
```