Tricks for efficiently reading large text files into R

There are 4 golden rules. The first two (nrows and colClasses) are explained in detail on the help page of read.table (type ?read.table in R):

  1. Use wc -l data.txt on the command line to see how many lines are in the file, then use nrows=1231238977 or whatever the count is. Without nrows, R does not know in advance how much memory to allocate. It starts by allocating vectors of some default size (10000 elements, or something like that), then when it sees there is more data than that, it copies everything into new vectors of size 20000, and so on. All this reallocation wastes lots of time and can be avoided if you just tell R how many lines there are (and thus what size of vectors to allocate). A mild overestimate is fine, so there is no need to subtract the header line.
  2. Use head data.txt on the command line to see what the data types are in each of the columns, then use colClasses=c("integer","numeric","NULL","factor"). This means the table has 4 columns: the first is an integer, the second is a real number, the third is skipped and the fourth is a categorical variable. If you don't need a column, specify "NULL" so it is never stored, which saves lots of memory and time.
  3. Use the save function to save intermediate results in .RData files. Compared to reading text files using read.table, RData files are orders of magnitude faster to load back into R using the load function. Only save the things you need for computations later.
  4. Finally, avoid doing operations on very large vectors when possible. For example, in my PhD work I had to do a bunch of things (calculate annotation error, fit segmentation models, etc.) for each of 575 copy number profiles independently. If I store all the profiles in one data.frame with 4616846 rows and a column for profile.id, then calculations are much slower than if I split it into a list of 575 data.frames, each with about 8000 rows on average (4616846 / 575). However, the speedups from this trick depend very much on the number of rows in the big table and the number of groups you are splitting into, so you have to try a few approaches and see what works fastest.
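
The four rules can be sketched on a small made-up file as follows. This is only an illustration under stated assumptions: the temporary file, the column names (profile.id, logratio, comment, annotation) and the sizes are hypothetical stand-ins for a real data.txt, and the per-profile mean stands in for whatever computation you actually need.

```r
## Hypothetical example data: a small tab-separated file standing in
## for a large data.txt (column names and sizes are made up).
set.seed(1)
n <- 1000L
example <- data.frame(
  profile.id = rep(1:10, each = 100),
  logratio   = rnorm(n),
  comment    = "unused",
  annotation = sample(c("normal", "gain", "loss"), n, replace = TRUE))
data.file <- tempfile(fileext = ".txt")
write.table(example, data.file, sep = "\t", quote = FALSE, row.names = FALSE)

## Rule 1: tell read.table how many rows to expect. For a big file
## this number would come from wc -l; here we already know it.
## Rule 2: declare the column types, using "NULL" to skip the third
## column entirely.
profiles <- read.table(
  data.file, header = TRUE, sep = "\t",
  nrows = n,
  colClasses = c("integer", "numeric", "NULL", "factor"))

## Rule 3: save the parsed table so that later sessions can call
## load instead of parsing the text file again.
rdata.file <- tempfile(fileext = ".RData")
save(profiles, file = rdata.file)
rm(profiles)
load(rdata.file)  # restores the object named profiles

## Rule 4: split the big table into a list of small data.frames, one
## per profile, then process each one independently.
profile.list <- split(profiles, profiles$profile.id)
mean.by.profile <- sapply(profile.list, function(p) mean(p$logratio))
```

Whether the split version is actually faster than working on the big data.frame depends on your data, so wrapping both variants in system.time is a cheap way to check.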

Author: Toby HOCKING <thocking@cbio.ensmp.fr>

Date: 2012-11-30 12:34:56 CET
