The difficulty of reproducible research using R
I used R to prepare the figures and tables for my PhD thesis. R is a great tool for statistical data analysis, but I found that it is quite difficult to make the research truly reproducible. In this note I discuss some barriers to truly reproducible research with R, and suggest possible solutions.
First, I will briefly explain how I put together my PhD thesis. I typed the text in a .tex file using Emacs, and I used pdflatex to compile it to a PDF. I wanted to be able to edit the figures and tables, so I described them using R code that outputs PDF, PNG, or TikZ .tex image files. I used a Makefile, so after changing one figure I only needed to remake that figure and then re-build the main PDF.
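To make this concrete, here is a minimal sketch of what one such figure script might look like. The file name and the plotted data are hypothetical and are not taken from the thesis; the only point is that the script's output is an image file that a Makefile rule can depend on.

```r
## Hypothetical figure script: writes one PDF image that the main
## .tex file could include. The output path is an assumption; a
## temporary directory is used so the sketch runs anywhere.
out.file <- file.path(tempdir(), "figure-example.pdf")

## Open a PDF graphics device, draw the plot, then close the device
## so that the file is completely written to disk.
pdf(out.file, width = 5, height = 4)
plot(1:10, (1:10)^2, type = "b", xlab = "x", ylab = "x squared")
dev.off()
```

A Makefile rule would then list the R script as a prerequisite of the PDF, so that editing the script triggers a rebuild of only that figure.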
The main problem is that my figures and tables are described using R code, and the meaning of R code changes between versions of R. Furthermore, I use many R packages, and the behavior of the functions defined in these packages changes between package versions. So when I try to compile my PhD thesis on several different computers, I may get several different results, or an error. To solve this problem we need some way to describe which versions of R and of each package are necessary to run the code.
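As a starting point, the versions in use on a given computer can be recorded from within R itself. The following is a minimal sketch using only base R functions; the package names are placeholders (two packages that ship with R), not the ones used in the thesis.

```r
## Record the version of the running R interpreter as a string.
r.version <- as.character(getRversion())

## Record the installed version of each package of interest.
## "stats" and "utils" are placeholders that are always installed.
pkgs <- c("stats", "utils")
pkg.versions <- sapply(pkgs, function(p) as.character(packageVersion(p)))

print(r.version)
print(pkg.versions)
```

Calling sessionInfo() prints similar information for every package attached in the current session.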
The solution I adopted for my PhD thesis is to download the source code of R and all the packages that I use, and put them on my web site. So to re-build my PhD thesis, I just need to re-compile R and install the packages. The main disadvantage of this approach is that there is a lot of code to download and store (more than 200 MB).
Another solution involves checking the version of R and each package and stopping with an error if it is not the correct version. That could be accomplished by using the following declaration at the beginning of each R code file.
works_with_R("2.15.1",directlabels="2.8",tikzDevice="0.6.2")
This declaration is simple and readable. It means that this R code works with R version 2.15.1, package directlabels version 2.8, and package tikzDevice version 0.6.2. Even if this function is not defined, anyone who reads this code should be able to figure out what it means. I use the following definition in my .Rprofile:
works_with_R <- function(Rvers, ...){
  ## Stop with an error unless the installed version matches exactly.
  pkg_need_have <- function(pkg, need, have){
    if(need != have){
      stop("need ", pkg, " version ", need, ", have ", have)
    }
  }
  pkg_need_have("R", Rvers, getRversion())
  pkg.vers <- list(...)
  for(pkg in names(pkg.vers)){
    pkg_need_have(pkg, pkg.vers[[pkg]], packageVersion(pkg))
    ## Attach the package once its version has been checked.
    require(pkg, character.only=TRUE)
  }
}
A variant of this is to take the specified package version, download that version from CRAN, and install it. This relies on the fact that CRAN archives all the old versions of packages, and that we can download and install these old packages. Writing a package that implements this would be non-trivial. It also presents a chicken-and-egg problem: which version of that package should be required?
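The download-and-install variant could be sketched roughly as follows. Both function names are hypothetical, and the sketch assumes that old source tarballs live under src/contrib/Archive/<package>/ on the CRAN server. The most recent release of a package is usually stored in src/contrib/ instead, which this sketch does not handle.

```r
## Build the URL of an archived source tarball on CRAN.
## Assumption: old versions are stored under src/contrib/Archive/<pkg>/.
cran_archive_url <- function(pkg, version,
                             cran = "https://cran.r-project.org"){
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz",
          cran, pkg, pkg, version)
}

## Hypothetical helper: install exactly the requested version, unless
## that version is already installed. Returns TRUE (invisibly) if an
## installation was attempted, FALSE otherwise.
install_exact_version <- function(pkg, version){
  have <- tryCatch(packageVersion(pkg), error = function(e) NULL)
  if(!is.null(have) && have == version){
    return(invisible(FALSE))
  }
  ## Compiling a source package needs the usual build tools.
  install.packages(cran_archive_url(pkg, version),
                   repos = NULL, type = "source")
  invisible(TRUE)
}

## Show the URL that would be downloaded, without installing anything.
print(cran_archive_url("directlabels", "2.8"))
```

A call such as install_exact_version("directlabels", "2.8") would then attempt the installation. This sketch ignores dependencies, which would also need pinned versions, and that is part of why a complete implementation is non-trivial.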
Date: 2012-11-09 10:56:56 CET