Tricks for efficiently reading large text files into R
There are 4 golden rules; the first two are explained in detail on the help page of read.table:
- Use wc -l data.txt on the command line to see how many lines are in the file, then use nrows=1231238977 or whatever the count is. When reading big files, R usually reads 10000 lines at a time, or something like that: it starts by allocating a vector of size 10000, then when it sees there is more data than that, it recopies everything into a new vector of size 20000, and so on. All this reallocation wastes lots of time and can be avoided if you just tell R how many lines there are (and thus what size of vectors to allocate). See the read.table sketch after this list.
- Use head data.txt on the command line to see what the data types are in each of the columns, then use colClasses=c("integer","numeric","NULL","factor"). This means the table has 4 columns: the first is an integer, the second is a real number, the third you just ignore, and the fourth is a categorical variable. If you don't need a column, specify "NULL" for it; this saves lots of memory and time. The sketch after this list shows colClasses together with nrows.
- Use the save function to save intermediate results in .RData files. Compared to reading text files using read.table, .RData files are orders of magnitude faster to load back into R using the load function. Only save the things you need for computations later. See the save/load sketch below.
- Finally, avoid doing large vector operations when possible. For example, in my PhD work I had to do a bunch of things (calculate annotation error, segmentation models, etc.) for each of 575 copy number profiles independently. If I store all the profiles in 1 data.frame with 4616846 rows and a column for profile.id, then calculations are much slower than if I split it into a list of 575 data.frames, each with 4000 rows on average. However, the speedups from this trick depend very much on the number of rows in the big table and the number of groups you are splitting into, so you kind of have to try a few approaches and see what works fastest. See the split sketch below.
Date: 2012-11-30 12:34:56 CET