Using R on the CBIO cluster

The main idea is that qsub script.sh will run all the commands in script.sh on a cluster node. I explain first how to set up R, then how to set up your job in terms of 3 scripts.

1 Easy setup using my binaries

  • Ask J-P for a login/pass for the CBIO server, cbio.ensmp.fr
  • Open a terminal and do ssh your_login@cbio.ensmp.fr to open a secure shell on the main CBIO server.
  • Do ssh thalassa to connect to the main cluster server, where you can launch jobs using qsub. It is important to compile programs (e.g. R and its packages) on this computer, since it is binary-incompatible with cbio.
  • I have compiled R, several packages, emacs, and ESS on thalassa, so you should be able to use them if you add the following to your ~/.bashrc file:
export PATH=~/bin:/cbio/donnees/thocking/bin:$PATH

2 Compile your own R + packages

  • You may want to use a more recent version of R, or some new optimization packages, etc., so you may want to install your own version of R + packages under your $HOME.
  • CBIO does not allow HTTP connections from the server, so you can't download the R sources directly or use install.packages.
  • To compile R, first download R-devel.tar.gz to your local computer. Then use scp R-devel.tar.gz your_login@cbio.ensmp.fr:. to transfer it to your home directory on CBIO. From the terminal connected to thalassa, you should then be able to run the standard commands to install R under your home directory:
tar -xzf R-devel.tar.gz
cd R-devel
./configure --prefix=$HOME
make
make install
  • For installing packages, note that you have to manually download the packages from CRAN, along with all their dependencies! Use scp as above to transfer them all to the CBIO server, then use R CMD INSTALL pkg_version.tar.gz to install each package to ~/lib64/R/library. The sketch below shows one way to gather the tarballs on your local computer.
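
Since the server can't reach CRAN, you can collect a package and its recursive dependencies on your local computer, then scp the tarballs over. This is only a sketch of that step; bams is the package from this document, but the tarballs directory is just an example, and you may still need to prune the list by hand:

## Run on your LOCAL computer, which has internet access.
db <- available.packages()
pkgs <- "bams"  ## the package you actually want
## all of its recursive CRAN dependencies:
deps <- unlist(tools::package_dependencies(pkgs, db=db, recursive=TRUE))
## drop packages that already ship with R itself:
deps <- setdiff(deps, rownames(installed.packages(priority="high")))
## download source tarballs for everything into ./tarballs:
dir.create("tarballs", showWarnings=FALSE)
download.packages(unique(c(deps, pkgs)), destdir="tarballs", type="source")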

3 Interactive R usage on thalassa using ESS

  • I don't think the cluster has libreadline-dev, so the ~/bin/R that I compiled has no command line editing or completion facilities. That sucks for testing code interactively on the server, which I definitely want to do.
  • But in fact that's no problem, since I don't use R from the command line. I use iESS to run R from within Emacs. In a terminal, I do emacs code.R to start emacs with code.R in R-mode. I suspend emacs using C-z and resume it from the terminal using fg. When I am done, C-x C-c quits emacs.
  • If there is not a line at the bottom that says something like (ESS[S] [R] Rox), then add this to your ~/.emacs file and restart emacs:
(add-to-list 'load-path "/cbio/donnees/thocking/R/ess-svn")
(autoload 'R "ess-site.el" "ESS" t)
(autoload 'R-mode "ess-site.el" "ESS" t)
(autoload 'r-mode "ess-site.el" "ESS" t)
(autoload 'Rd-mode "ess-site.el" "ESS" t)
(autoload 'noweb-mode "ess-site.el" "ESS" t)
(add-to-list 'auto-mode-alist '("\\.R$" . R-mode))
(add-to-list 'auto-mode-alist '("\\.r$" . R-mode))
(add-to-list 'auto-mode-alist '("\\.Rd$" . Rd-mode))
(add-to-list 'auto-mode-alist '("\\.Rnw$" . noweb-mode))
(setq ess-eval-visibly-p nil)
(setq ess-ask-for-ess-directory nil)
(require 'ess-eldoc "ess-eldoc" t)

From emacs, I use C-c C-n to evaluate a line of R code. Then emacs will split the screen and show what that R code does in a buffer called *R*. Here are a few more keyboard shortcuts valid in R-mode buffers:

Shortcut            Function
C-c C-n             Send 1 line of R code
C-SPACE             Mark current position
C-c C-r             Send code between mark and point
C-c C-l             Send entire buffer
C-c TAB or C-c C-i  Complete

Furthermore, in the *R* buffer you can do everything you can do in the terminal, and more:

Shortcut  Function
M-p       Recall previous command
M-n       Recall next command
C-p       Move cursor up
C-n       Move cursor down
C-r       Search text backward
M-r       Search commands backward

4 Creating jobs

  • First, think about how you want to parallelize your job. For example, in my work on breakpoint annotation model smoothing, I wanted to run a bunch of smoothing models on each array CGH profile.
  • Make R script #1 that implements what you want to do for 1 chunk of your data. For example, you can look at /cbio/donnees/thocking/bioviz/neuroblastoma/doc/bams/inst/article/smooth-one.R, which has the following code:
## profile id passed on the command line, e.g. R --vanilla --args '1':
clin.id <- commandArgs(trailingOnly=TRUE)
data(neuroblastoma,package="neuroblastoma")
## keep only the profile and the annotations for this id:
one <- subset(neuroblastoma$profiles,profile.id==clin.id)
these.labels <- subset(neuroblastoma$annotations,profile.id==clin.id)
print(these.labels)

library(bams)
run.smoothers(one, these.labels)
  • Note that on the first line we use the commandArgs function to capture the trailing command line arguments when R is invoked. The idea is that the command line R --vanilla --args '1' < smooth-one.R specifies that we want to process profile id 1; the script loads and processes only that subset of the data, then saves the results to a file somewhere, in this case in the ~/smooth directory. Once you have written this script, open a shell on thalassa and try it to make sure it works. A generic skeleton for such a script is sketched below.
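
A minimal skeleton for a script #1 of your own, assuming your chunk of data is identified by one command line argument; process.one and the ~/results directory are hypothetical names, not the ones used in smooth-one.R:

chunk.id <- commandArgs(trailingOnly=TRUE)[1]
process.one <- function(id){ ## stand-in: replace with your real computation
  data.frame(chunk=id, value=rnorm(1))
}
result <- process.one(chunk.id)
## one output file per chunk, so that parallel jobs never collide:
out.dir <- path.expand("~/results")
dir.create(out.dir, showWarnings=FALSE)
save(result, file=file.path(out.dir, sprintf("chunk%s.RData", chunk.id)))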
  • One common error is having code and data sets unavailable to cluster nodes. In particular, you may have to hard-code some absolute paths to data and code files, as I have done implicitly here by loading the data from package neuroblastoma and the code from package bams (use the .libPaths function to see and edit where R looks for packages, as shown below).
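
For example, if your packages are installed under your home directory, you can make every cluster node look there first by putting something like this at the top of script #1 (the library path is the one from my install above; yours may differ):

.libPaths()  ## show where R currently looks for packages
.libPaths(c("~/lib64/R/library", .libPaths()))  ## put your own library first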
  • Then write R script #2 that will launch the previous script several times using qsub. Take /home/donnees/thocking/bioviz/neuroblastoma/doc/bams/inst/article/smoothing-commands.R for example:
path.to.R <- R.home(file.path("bin","R"))
## re-install the package so that cluster nodes see the latest code:
install.cmd <- sprintf("%s CMD INSTALL ../..",path.to.R)
system(install.cmd)

data(neuroblastoma,package="neuroblastoma")
cids <- levels(neuroblastoma$profiles$profile.id)
## one command line per profile id, with full paths:
commands <-
  sprintf("%s --vanilla --args '%s' < %s",
          path.to.R,
          cids,
          system.file(file.path("article","smooth-one.R"),package="bams"))
system(commands[1]) ## TEST if one of these works without qsub!
for(cmd in commands){
  f <- tempfile()          ## write each command line to its own script...
  writeLines(cmd,f)
  qsubcmd <- sprintf("qsub %s",f)
  system(qsubcmd)          ## ...and submit it to the cluster
}
  • Note that the commands variable is a character vector containing all the command lines to execute, with full path names to ensure that each cluster node can locate the files:
> head(commands)
[1] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '1' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[2] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '2' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[3] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '4' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[4] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '5' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[5] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '6' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[6] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '7' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
  • Note that I run this code interactively from the /home/donnees/thocking/bioviz/neuroblastoma/doc/bams/inst/article directory, using system(commands[1]) and examining the results to check that everything works.
  • Finally, write each command line to its own shell script, which I do here using tempfile. Then use qsub on each script file to launch the job, as sketched below.
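
If you want recognizable job names and a fixed place for each job's output, you can pass a few options to qsub when launching. The -N, -o and -e options are standard Sun Grid Engine flags, but check your cluster's qsub documentation; the ~/smooth-logs directory is just an example:

## variant of the loop above, naming each job after its profile id:
log.dir <- path.expand("~/smooth-logs")
dir.create(log.dir, showWarnings=FALSE)
for(i in seq_along(commands)){
  f <- tempfile()
  writeLines(commands[i], f)
  qsubcmd <- sprintf("qsub -N smooth%s -o %s -e %s %s",
                     cids[i], log.dir, log.dir, f)
  system(qsubcmd)
}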

5 Monitoring jobs

  • The main command you can use to monitor the cluster is qstat. In particular, qstat -u "*"|head will show you the first few current jobs of all users, in chronological order. The following command continuously displays the number of remaining jobs until you quit with C-c (tail -n +3 skips the 2-line qstat header):
watch 'qstat -u "*"|tail -n +3|wc -l'
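
You can also poll from within R, for example to make a script wait until all your jobs have finished before post-processing. This is only a sketch; it assumes, as above, that qstat prints a 2-line header, and that plain qstat lists only your own jobs:

jobs.left <- function(){
  lines <- suppressWarnings(system("qstat", intern=TRUE))
  max(length(lines) - 2, 0)  ## subtract the header lines
}
while(jobs.left() > 0){
  cat(jobs.left(), "jobs remaining\n")
  Sys.sleep(60)  ## check once a minute
}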

6 Publishing results

  • I maintain a webpage that benchmarks several smoothing algorithms for the task of breakpoint detection and update it using results run on thalassa.
  • The source files are in my ~/public_html/neuroblastoma/ directory. Anything you put in your public_html directory is accessible on the web.
  • Looking at neuroblastoma/Makefile shows how the web page is put together. In particular, the results in accuracy-table.html depend on zzz.stats.RData, which is a symlink to ~/bioviz/neuroblastoma/doc/bams/inst/article/zzz.stats.RData.
  • After running smoothing models on the cluster, I run R script #3, ~/bioviz/neuroblastoma/doc/bams/inst/article/make.all.stats.R, to analyze the results in the ~/smooth directory, then save them in zzz.stats.RData (see the sketch after this list).
  • Then I type make in the ~/public_html/neuroblastoma/ directory to update the results based on the new zzz.stats.RData file.
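
I have not reproduced make.all.stats.R here, but the overall shape of such a script #3 is simple: load every result file the jobs wrote, combine, and save. A minimal sketch, assuming one RData file per profile, each containing an object named result (these names are illustrative, not the ones in my actual script):

result.files <- Sys.glob("~/smooth/*.RData")
all.results <- lapply(result.files, function(f){
  load(f)  ## defines result in this function's environment
  result
})
## combine and save for the web page Makefile to pick up:
zzz.stats <- do.call(rbind, all.results)
save(zzz.stats, file="zzz.stats.RData")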

Author: Toby HOCKING <thocking@thalassa.local>

Date: 2012-08-01 11:18:51 CEST
