Using R on the CBIO cluster
The main idea is that `qsub script.sh` will process all the commands in `script.sh` on a cluster node. I explain first how to set up R, then how to set up your job in terms of 3 scripts:
- Script #1 executes the job for 1 chunk of data, saves results to the filesystem, and can be run from the command line.
- Script #2 constructs command lines that execute script #1 for each chunk of data.
- Script #3 analyzes the results that were saved to the filesystem, and converts them to a format useful for later analyses.
1 Easy setup using my binaries
- Ask J-P for a login/pass for the CBIO server, `cbio.ensmp.fr`.
- Open a terminal and do `ssh your_login@cbio.ensmp.fr` to open a secure shell on the main CBIO server.
- Do `ssh thalassa` to connect to the main cluster server, where you can launch jobs using `qsub`. It is important to compile programs (i.e. R + packages) on this computer, since it is binary-incompatible with `cbio`.
- I have compiled R, several packages, emacs, and ESS on `thalassa`, so you should be able to use them if you add the following to your `~/.bashrc` file:

```
export PATH=~/bin:/cbio/donnees/thocking/bin:$PATH
```
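After opening a new shell (or running `source ~/.bashrc`), a quick sanity check is to start `R` and confirm which build and library paths you picked up; the exact output depends on the installation, so this is only an illustration:

```r
## Inside R on thalassa: check which build the PATH resolved to.
R.version.string  # version string of the running R build
.libPaths()       # directories this R searches for installed packages
```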
2 Compile your own R + packages
- You may want to use a more recent version of R, or some new optimization packages, etc., so you may want to install your own version of R + packages under your `$HOME`.
- CBIO does not allow http connections from the server, so you can't download R sources directly nor use `install.packages`.
- To compile R, first download R-devel.tar.gz to your local computer. Then use `scp R-devel.tar.gz your_login@cbio.ensmp.fr:.` to transfer it to your home directory on CBIO. Now from the terminal that is connected to `thalassa` you should be able to do the standard commands to install R to your home directory:

```
tar -xzf R-devel.tar.gz
cd R-devel
./configure --prefix=$HOME
make
make install
```
- For installing packages, note that you have to manually download the packages from CRAN, along with all their dependencies! Use `scp` as above to transfer all of them to the CBIO server, then use `R CMD INSTALL pkg_version.tar.gz` to install these packages to `~/lib64/R/library`.
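Working out the dependency list by hand is tedious. On your local computer (which does have internet access), recent versions of R can compute it for you; this is a sketch assuming you want the `bams` package, with a hypothetical `~/cran-src` destination directory:

```r
## Run on your LOCAL computer, not on thalassa.
db <- available.packages()    # CRAN package metadata
pkgs <- "bams"                # the package you want to install
deps <- tools::package_dependencies(pkgs, db, recursive = TRUE)[[1]]
## Base packages ship with R itself, so drop them from the download list.
deps <- setdiff(deps, rownames(installed.packages(priority = "base")))
dir.create("~/cran-src", showWarnings = FALSE)
download.packages(c(pkgs, deps), destdir = "~/cran-src", type = "source")
## Then: scp ~/cran-src/*.tar.gz your_login@cbio.ensmp.fr:.
```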
3 Interactive R usage on thalassa using ESS
- I don't think the cluster has libreadline-dev, so the ~/bin/R that I compiled has no command line editing or completion facilities. That sucks for testing code interactively on the server, which I definitely want to do.
- But in fact that's no problem, since I don't use R from the command line. I use iESS to use R from within Emacs. In a terminal, I do `emacs code.R` to start emacs with `code.R` in `R-mode`. I suspend emacs using `C-z` and resume it from the terminal using `fg`. When I am done, `C-x C-c` quits emacs.
- If there is not a line at the bottom that says something like `(ESS[S] [R] Rox)`, then add this to your `~/.emacs` file and restart emacs:

```
(add-to-list 'load-path "/cbio/donnees/thocking/R/ess-svn")
(autoload 'R "ess-site.el" "ESS" t)
(autoload 'R-mode "ess-site.el" "ESS" t)
(autoload 'r-mode "ess-site.el" "ESS" t)
(autoload 'Rd-mode "ess-site.el" "ESS" t)
(autoload 'noweb-mode "ess-site.el" "ESS" t)
(add-to-list 'auto-mode-alist '("\\.R$" . R-mode))
(add-to-list 'auto-mode-alist '("\\.r$" . R-mode))
(add-to-list 'auto-mode-alist '("\\.Rd$" . Rd-mode))
(add-to-list 'auto-mode-alist '("\\.Rnw$" . noweb-mode))
(setq ess-eval-visibly-p nil)
(setq ess-ask-for-ess-directory nil)
(require 'ess-eldoc "ess-eldoc" t)
```
From emacs, I use `C-c C-n` to evaluate a line of R code. Then emacs will split the screen and show what that R code does in a buffer called `*R*`. Here are a few more keyboard shortcuts valid in `R-mode` buffers:

| Shortcut | Function |
|---|---|
| `C-c C-n` | Send 1 line of R code |
| `C-SPACE` | Mark current position |
| `C-c C-r` | Send code between mark and point |
| `C-c C-l` | Send entire buffer |
| `C-c TAB` or `C-c C-i` | Complete |
Furthermore, in the `*R*` buffer you can do everything you can do in the terminal, and more:

| Shortcut | Function |
|---|---|
| `M-p` | Recall previous command |
| `M-n` | Recall next command |
| `C-p` | Move cursor up |
| `C-n` | Move cursor down |
| `C-r` | Search text back |
| `M-r` | Search commands back |
4 Creating jobs
- First, think about how you want to parallelize your job. For example, in my work on breakpoint annotation model smoothing, I wanted to run a bunch of smoothing models on each array CGH profile.
- Make R script #1 that implements what you want to do for 1 chunk of your data. For example, you can look at `/cbio/donnees/thocking/bioviz/neuroblastoma/doc/bams/inst/article/smooth-one.R`, which has the following code:

```r
clin.id <- commandArgs(trailingOnly=TRUE)  # profile id passed on the command line
data(neuroblastoma, package="neuroblastoma")
one <- subset(neuroblastoma$profiles, profile.id==clin.id)  # just 1 profile
these.labels <- subset(neuroblastoma$annotations, profile.id==clin.id)
print(these.labels)
library(bams)
run.smoothers(one, these.labels)  # fits smoothers, saves results to ~/smooth
```
- Note that on the first line we use the `commandArgs` function to capture the trailing command line arguments when R is invoked. The idea is that we use the command line `R --vanilla --args '1' < smooth-one.R` to specify that we want to process profile id 1, and in this script we load and process only that subset of data, then save results to a file somewhere, in this case the `~/smooth` directory (a sketch of the save step is below). Once you have written this script, open a shell on thalassa and try it to make sure it works.
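Here the saving is done inside `run.smoothers`; if your own script #1 computes a result object instead, the save step might look like this minimal sketch (the one-file-per-profile naming under `~/smooth` is just a convention, and `result` is a stand-in):

```r
## Hypothetical end of a script #1: write this chunk's results to ~/smooth.
result <- list(profile.id = clin.id)  # stand-in for the real computed results
dir.create("~/smooth", showWarnings = FALSE)
save(result, file = file.path("~/smooth", sprintf("%s.RData", clin.id)))
```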
- One common error is having code and data sets unavailable to cluster nodes. In particular, you may have to code some absolute paths to data and code files, as I have done implicitly here by loading the data from package `neuroblastoma` and the code in package `bams` (use the `.libPaths` function to see and edit where R looks for packages, as shown below).
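For example, if you installed packages under `~/lib64/R/library` as in section 2, you can make sure every script sees them by prepending that directory to the search path:

```r
## Put the personal library first so library(bams) etc. resolve on cluster nodes.
.libPaths(c("~/lib64/R/library", .libPaths()))
.libPaths()  # show the updated search path
```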
- Then write R script #2 that will launch the previous script several times using `qsub`. Take `/home/donnees/thocking/bioviz/neuroblastoma/doc/bams/inst/article/smoothing-commands.R` for example:

```r
path.to.R <- R.home(file.path("bin","R"))
install.cmd <- sprintf("%s CMD INSTALL ../..", path.to.R)
system(install.cmd)
data(neuroblastoma, package="neuroblastoma")
cids <- levels(neuroblastoma$profiles$profile.id)
commands <- sprintf("%s --vanilla --args '%s' < %s",
                    path.to.R, cids,
                    system.file(file.path("article","smooth-one.R"), package="bams"))
system(commands[1]) ## TEST if one of these works without qsub!
for(cmd in commands){
  f <- tempfile()
  qsubcmd <- sprintf("qsub %s", f)
  writeLines(cmd, f)
  system(qsubcmd)
}
```
- Note that the `commands` variable is a character vector containing all the command lines to execute, with full path names to ensure that each cluster node can locate the files:

```
> head(commands)
[1] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '1' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[2] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '2' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[3] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '4' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[4] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '5' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[5] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '6' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
[6] "/cbio/donnees/thocking/lib64/R/bin/R --vanilla --args '7' < /cbio/donnees/thocking/lib64/R/library/bams/article/smooth-one.R"
```
- Note that I run this code interactively from the `/home/donnees/thocking/bioviz/neuroblastoma/doc/bams/inst/article` directory, and use `system(commands[1])` and examine the results to check that it is working.
- Finally, write each command line to its own one-line shell script (the command itself invokes R), which I do here using `tempfile`. Then use `qsub` on each file to launch the job.
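The bare `qsub %s` above relies on SGE defaults, so each job's stdout and stderr land in `.o`/`.e` files in your home directory. If that gets messy, standard SGE options can name the jobs and redirect their logs; this variant of the loop is an untested sketch (the `~/qsub-logs` directory is hypothetical):

```r
## Variant of the submission loop: -N names each job, -o/-e send
## stdout/stderr to one log directory instead of cluttering $HOME.
dir.create("~/qsub-logs", showWarnings = FALSE)
for(i in seq_along(commands)){
  f <- tempfile()
  writeLines(commands[i], f)
  qsubcmd <- sprintf("qsub -N smooth%s -o ~/qsub-logs -e ~/qsub-logs %s",
                     cids[i], f)
  system(qsubcmd)
}
```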
5 Monitoring jobs
- The main command you can use to monitor the cluster is `qstat`. In particular, `qstat -u "*" | head` will show you the first few current jobs for all users, in chronological order. The following command will display the number of jobs remaining, updating until you quit with `C-c` (an R alternative is sketched at the end of this section):

```
watch 'qstat -u "*"|tail -n +3|wc -l'
```
- You can store options you want to use by default with `qstat` in your `~/.sge_qstat` file.
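If you would rather watch from within R (say, in the `*R*` buffer), a minimal sketch is to poll `qstat` with `system`; the two header lines dropped by `tail -n +3` above are subtracted here (assumes `qstat` is on the PATH):

```r
## Poll the queue once a minute; interrupt with C-c C-c in the *R* buffer.
repeat{
  job.lines <- system('qstat -u "*"', intern = TRUE)
  n.jobs <- max(length(job.lines) - 2, 0)  # subtract the 2 header lines
  cat(format(Sys.time()), ":", n.jobs, "jobs remaining\n")
  if(n.jobs == 0) break
  Sys.sleep(60)
}
```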
6 Publishing results
- I maintain a webpage that benchmarks several smoothing algorithms for the task of breakpoint detection, and update it using results computed on thalassa.
- The source files are in my `~/public_html/neuroblastoma/` directory. Anything you put in your `public_html` directory is accessible on the web.
- Looking at `neuroblastoma/Makefile` shows how the web page is put together. In particular, the results in `accuracy-table.html` depend on `zzz.stats.RData`, which is a symlink to `~/bioviz/neuroblastoma/doc/bams/inst/article/zzz.stats.RData`.
- After running smoothing models on the cluster, I run the R code in R script #3, `~/bioviz/neuroblastoma/doc/bams/inst/article/make.all.stats.R`, to analyze the results in the `~/smooth` directory, then save them in `zzz.stats.RData` (a sketch of such a script follows this list).
- Then I type `make` in the `~/public_html/neuroblastoma/` directory to update the results based on the new `zzz.stats.RData` file.
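I do not show `make.all.stats.R` here, but as a rough sketch, a script #3 of this kind loads every result file that script #1 wrote under `~/smooth`, combines them, and saves the summary; the object names below are hypothetical and depend on what your script #1 saved:

```r
## Hypothetical script #3: gather per-chunk results into one summary file.
result.files <- Sys.glob("~/smooth/*.RData")
all.results <- lapply(result.files, function(f){
  e <- new.env()
  load(f, envir = e)  # assume each file defines one object named `result`
  e$result
})
## How to combine depends on the result format; rbind is one possibility.
zzz.stats <- do.call(rbind, all.results)
save(zzz.stats, file = "zzz.stats.RData")
```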
Date: 2012-08-01 11:18:51 CEST