This chapter has focused on the R language for simple programming tasks.
The tools described in this chapter are the core tools for working with the fundamental data structures in R. In specific areas of research, particularly where data sets have a special format, there may be R packages that provide more sophisticated and convenient tools for working with that format.
A good example of this is the zoo package for working with time series data. Other examples are the packages within the Bioconductor project that provide tools for working with the results of microarray experiments.
The choice of R as a data processing tool was based on the fact that R is a relatively easy programming language to learn and it has good facilities for manipulating data. R is also an excellent environment for data analysis, so learning to process data with R means that it is possible to prepare and analyze a data set within a single system.
However, there are two major disadvantages to working with data using R: R is slower than many other programming languages, and R holds all data in RAM, so it can be awkward to work with extremely large data sets.
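As a rough illustration of the memory point, the following sketch uses the standard object.size() function to report how much RAM a single vector occupies. The vector length is an arbitrary choice made for this example.

```r
# A numeric vector of one million values (the length is arbitrary)
x <- numeric(1e6)

# Each double-precision value takes 8 bytes, so the vector
# occupies roughly 8 million bytes of RAM
bytes <- as.numeric(object.size(x))
print(object.size(x), units = "MB")

# Ten vectors of this size would need about ten times as much memory
bytes * 10 / 2^20
```

Scaling this arithmetic up to the number of values in a data set gives a quick estimate of whether the data are likely to fit in RAM.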
There are a number of R packages that enhance R's ability to cope gracefully with very large data sets. One approach is to store the data in a relational database and use the packages that allow R to talk to the database software, as described in Section 9.7.8. Several other packages solve the problem by storing data in mass storage rather than RAM and loading data into RAM only as required; two examples are the filehash and ff packages.
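The general idea of loading data into RAM only as required can be sketched with base R alone, by reading a file through a connection a few lines at a time. This is not how the filehash or ff packages are used; it is only a minimal illustration of the principle, and the file contents and chunk size are made up for the example.

```r
# Write a small example file (a stand-in for a file too large for RAM)
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:10), path)

# Open a connection so that each read continues where the last one stopped
con <- file(path, open = "r")
total <- 0
chunk_size <- 3   # arbitrary; a real task would use a much larger chunk
repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Only the current chunk is held in RAM at this point
    total <- total + sum(as.numeric(lines))
}
close(con)
total
```

Because each chunk is discarded before the next one is read, the memory required depends on the chunk size rather than on the size of the file.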
If R is too slow for a particular task, it may be more appropriate to use a different data processing tool.
There are many alternative programming languages, such as C, Perl, Python, and Java, that can run much faster than R. The trade-off is that using one of these languages is likely to involve writing more code, and more complex code, in exchange for the gain in speed.
It is also worth mentioning that many simple software tools exist, especially for processing text files. In particular, on Linux systems, there are programs such as sort for sorting files, grep for finding patterns in files, cut for removing pieces from each line of a file, and sed and awk for more powerful rearrangements of file contents. These tools are not as flexible or powerful as R, but they are extremely fast and will work on files of virtually any size. The Cygwin project makes these tools available on Windows.
Paul Murrell
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 New Zealand License.