A computer program is often described as “data structures + algorithms.”
Most of this chapter has been concerned with data structures--how information is stored in computer memory--and a number of tools that allow us to convert one data structure into another sort of data structure in a single step.
Algorithms are the series of steps that we take in order to carry out more complex tasks. Algorithms are what we use to combine smaller tasks into computer programs.
The difference between data structures and algorithms is like the difference between the ingredients that we need to cook a meal and the recipe that we use to cook the meal. The focus of this book is on data structures. Much of what we need to achieve when processing a data set can be performed in only a few steps, with only one or two expressions. This book is mostly about boiling an egg rather than baking a soufflé.
However, as we perform more complex data processing tasks, it becomes useful to know something about how to combine and manage a larger number of expressions.
In this section, we will look at some slightly more advanced ideas for writing computer code. This treatment will barely scratch the surface of the topics available; the aim is to provide a very brief introduction to some useful ideas for writing larger amounts of more complex programming code.
We will use a case study to motivate the need for these more advanced techniques.
The data for the 2006 JSM Data Expo (Section 5.2.8) were obtained from NASA's Live Access Server as a set of 505 text files.
Seventy-two of those files contain near-surface air temperature measurements, with one file for each month of recordings. Each file contains average temperatures for the relevant month at 576 different locations. Figure 9.17 shows the first few lines of the temperature file for the first month, January 1995.
VARIABLE : Mean Near-surface air temperature (kelvin) FILENAME : ISCCPMonthly_avg.nc FILEPATH : /usr/local/fer_data/data/ SUBSET : 24 by 24 points (LONGITUDE-LATITUDE) TIME : 16-JAN-1995 00:00 113.8W 111.2W 108.8W 106.2W 103.8W 101.2W 98.8W ... 27 28 29 30 31 32 33 ... 36.2N / 51: 272.1 270.3 270.3 270.9 271.5 275.6 278.4 ... 33.8N / 50: 282.2 282.2 272.7 272.7 271.5 280.0 281.6 ... 31.2N / 49: 285.2 285.2 276.1 275.0 278.9 281.6 283.7 ... 28.8N / 48: 290.7 286.8 286.8 276.7 277.3 283.2 287.3 ... 26.2N / 47: 292.7 293.6 284.2 284.2 279.5 281.1 289.3 ... 23.8N / 46: 293.6 295.0 295.5 282.7 282.7 281.6 285.2 ... ... |
With the data expo files stored in a local directory, NASA/Files, the complete set of 72 filenames for the files containing temperature recordings can be generated by the following code (only the first six filenames are shown).
> nasaAirTempFiles <- file.path("NASA", "Files", paste("temperature", 1:72, ".txt", sep="")) > head(nasaAirTempFiles)
[1] "NASA/Files/temperature1.txt" [2] "NASA/Files/temperature2.txt" [3] "NASA/Files/temperature3.txt" [4] "NASA/Files/temperature4.txt" [5] "NASA/Files/temperature5.txt" [6] "NASA/Files/temperature6.txt"
|
We will conduct a simple task with these data: calculating the near-surface air temperature for each month, averaged over all locations. In other words, we will calculate a single average temperature from each file. The result will be a vector of 72 monthly averages.
As a starting point, the following code reads in the data from just a single file, using the first filename, temperature1.txt, and averages all of the temperatures in that file.
> tempDF <- read.fwf(nasaAirTempFiles[1], skip=7, widths=c(-12, rep(7, 24))) > tempMatrix <- as.matrix(tempDF) > mean(tempMatrix)
[1] 295.1849
|
The call to read.fwf() ignores the first 7 lines of the file and the first 12 characters on each of the remaining lines of the file. This just leaves the temperature values, which are stored in RAM as a data frame, tempDF. This data frame is then converted into a matrix, tempMatrix, so that, in the final step, the mean() function can calculate the average across all of the temperature values.
At this point, we have simple code that uses the functions and data structures that have been described in previous sections. The next step is to repeat this task for all 72 air temperature files.
One way to do that is to write out 72 almost identical copies of the same piece of code, but, unsurprisingly, there are ways to work much more efficiently. The next section describes how.
In a programming language, if our code contains more than one expression, the expressions are run one at a time, in the order that they appear.
For example, consider the following two expressions from the code in the previous section.
tempMatrix <- as.matrix(tempDF) mean(tempMatrix)
In this code, the expression mean(tempMatrix) relies on the fact that the previous expression has already been run and that there is a matrix data structure stored in RAM for it to access via the symbol tempMatrix.
However, programming languages also provide ways to modify this basic rule and take control of the order in which expressions are run. One example of this is the idea of a loop, which is a way to allow a collection of expressions to be run repeatedly.
Returning to our example of calculating an average value for each of 72 different files, the following code shows how to calculate these averages using a loop in R.
> avgTemp <- numeric(72) > for (i in 1:72) { tempDF <- read.fwf(nasaAirTempFiles[i], skip=7, widths=c(-12, rep(7, 24))) tempMatrix <- as.matrix(tempDF) avgTemp[i] <- mean(tempMatrix) }
|
The first expression is just a set-up step that creates a vector of 72 zeroes. This vector will be used to store the 72 average values as we calculate them.
The components of the actual loop are shown below.
keywords: | for (i in 1:72) { | |
parentheses: | for (i in 1:72 ) { | |
loop symbol: | for ( i in 1:72) { | |
loop symbol values: | for (i in 1:72) { | |
open bracket: | for (i in 1:72) { | |
loop body: |
|
|
close bracket: | } |
The keywords, for and in, the brackets, and the parentheses will be the same in any loop that we write. The most important bits are the loop symbol, in this case the symbol i, and the values that the loop symbol can take, in this case, 1:72.
The idea of a loop is that the expressions in the body of the loop are run several times. Specifically, the expressions in the body of the loop are run once for each value of the loop symbol. Also, each time the loop repeats, a different value is assigned to the loop symbol.
In this example, the value 1 is assigned to the symbol i and the expressions in the body of the loop are run. Then, the value 2 is assigned to the symbol i and the body of the loop is run again. This continues until we assign the value 72 to the symbol i and run the body of the loop, and then the loop ends.
If we look closely at the expressions in the body of the loop, we can see that the loop symbol, i, is used twice. In the first case, nasaAirTempFiles[i], this means that each time through the loop we will read a different air temperature file. In the second case, avgTemp[i], this means that the average that we calculate from the air temperature file will be stored in a different element of the vector avgTemp, so that by the end of the loop, we have all 72 averages stored together in the vector.
The important idea is that, although the code in the body of the loop remains constant, the code in the body of the loop can do something slightly different each time the loop runs because the loop symbol changes its value every time the loop runs.
The overall result is that we read all 72 files, calculate an average from each one, and store the 72 averages in the vector avgTemp, which is shown below.
> avgTemp
[1] 295.1849 295.3175 296.3335 296.9587 297.7286 298.5809 [7] 299.1863 299.0660 298.4634 298.0564 296.7019 295.9568 [13] 295.3915 296.1486 296.1087 297.1007 298.3694 298.1970 [19] 298.4031 298.0682 298.3148 297.3823 296.1304 295.5917 [25] 295.5562 295.6438 296.8922 297.0823 298.4793 299.3575 [31] 299.7984 299.7314 299.6090 298.4970 297.9872 296.8453 [37] 296.9569 296.9354 297.0240 298.0668 299.1821 300.7290 [43] 300.6998 300.3715 300.1036 299.2269 297.8642 297.2729 [49] 296.8823 297.4288 297.5762 298.2859 299.1076 299.1938 [55] 299.0599 299.5424 298.9135 298.2849 297.0981 296.2639 [61] 296.1943 296.5868 297.5510 298.6106 299.7425 299.5219 [67] 299.7422 300.3411 299.5781 298.6965 297.0830 296.3813
|
When we have to write code to perform the same task several times, one approach, as described in the previous section, is to write a loop.
In this section, we will look at another option: writing functions.
In our example, we want to calculate an overall average for each of 72 separate files. One way to look at this task is that there are 72 filenames stored in a character vector data structure and we want to perform a calculation for each element of this data structure. As we saw in Section 9.8.7, there are several R functions that are designed to perform a task on each element of a data structure.
What we can do in this case is use sapply() to call a function once for each of the filenames in the vector nasaAirTempFiles.
The call to the sapply() function that we would like to make is shown below.
sapply(nasaAirTempFiles, calcMonthAvg)
The bad news is that the function calcMonthAvg() does not exist! There is no predefined function that can read a NASA air temperature file and return the average of all of the values in the file.
The good news is that we can very easily create this function ourselves.
Here is code that creates a function called calcMonthAvg() to perform this task.
> calcMonthAvg <- function(filename) { tempDF <- read.fwf(filename, skip=7, widths=c(-12, rep(7, 24))) tempMatrix <- as.matrix(tempDF) mean(tempMatrix) }
|
This code creates a function and assigns it to the symbol calcMonthAvg. In most examples in this chapter, we have stored data structures in RAM, but in this case, we have stored a function in RAM; we have stored some computer code in RAM. Just like with any other symbol, we can retrieve the function from RAM by referring to the symbol calcMonthAvg.
> calcMonthAvg
|
function (filename) { tempDF <- read.fwf(filename, skip = 7, widths = c(-12, rep(7, 24))) tempMatrix <- as.matrix(tempDF) mean(tempMatrix) }
However, because this is a function, we can do more than just retrieve its value; we can also call the function. This is done just like any other function call we have ever made. We append parentheses to the symbol name and include arguments within the parentheses. The following code calls our new function to calculate the average for the first air temperature file.
> calcMonthAvg(nasaAirTempFiles[1])
[1] 295.1849
|
In order to understand what happens when we call our function, we need to take a closer look at the components of the code that was used to create the function.
function name: | calcMonthAvg <- function(filename) { | |
keyword: | calcMonthAvg <- function(filename) { | |
parentheses: | calcMonthAvg <- function (filename ) { | |
argument symbol: | calcMonthAvg <- function( filename) { | |
open bracket: | calcMonthAvg <- function(filename) { | |
function body: |
|
|
close bracket: | } |
The keyword function, the brackets, and the parentheses will be the same for every function that we write. The important bits are the name we choose for the function and the argument symbols that we choose for the function.
The idea of a function is that the expressions that make up the body of the function will be run when the function is called. Also, when the function is called, the value that is provided as the argument in the function call will be assigned to the argument symbol.
In this case, the argument symbol is filename and this symbol is used once in the body of the function, in the call to read.fwf(). This means that the argument to the function is used to select which file the function will read and calculate an average value for.
The value returned by a function is the value of the last expression within the function body. In this case, the return value is the result of the call to the mean() function.
As a simple example, the following code is a call to our calcMonthAvg() function that calculates the overall average temperature from the contents of the file temperature1.txt.
> calcMonthAvg(nasaAirTempFiles[1])
[1] 295.1849
|
That function call produces exactly the same result as the following code.
> tempDF <- read.fwf(nasaAirTempFiles[1], skip=7, widths=c(-12, rep(7, 24))) > tempMatrix <- as.matrix(tempDF) > mean(tempMatrix)
|
The advantage of having defined the function calcMonthAvg() is that it is now possible to calculate the monthly averages from all of the temperature files using sapply().
We supply the vector of filenames, nasaAirTempFiles, and our new function, calcMonthAvg(), and sapply() calls our function for each filename. The USE.NAMES argument is employed here to avoid having large, messy names on each element of the result.
> sapply(nasaAirTempFiles, calcMonthAvg, USE.NAMES=FALSE)
[1] 295.1849 295.3175 296.3335 296.9587 297.7286 298.5809 [7] 299.1863 299.0660 298.4634 298.0564 296.7019 295.9568 [13] 295.3915 296.1486 296.1087 297.1007 298.3694 298.1970 [19] 298.4031 298.0682 298.3148 297.3823 296.1304 295.5917 [25] 295.5562 295.6438 296.8922 297.0823 298.4793 299.3575 [31] 299.7984 299.7314 299.6090 298.4970 297.9872 296.8453 [37] 296.9569 296.9354 297.0240 298.0668 299.1821 300.7290 [43] 300.6998 300.3715 300.1036 299.2269 297.8642 297.2729 [49] 296.8823 297.4288 297.5762 298.2859 299.1076 299.1938 [55] 299.0599 299.5424 298.9135 298.2849 297.0981 296.2639 [61] 296.1943 296.5868 297.5510 298.6106 299.7425 299.5219 [67] 299.7422 300.3411 299.5781 298.6965 297.0830 296.3813
|
This is the same as the result that was produced from an explicit loop (see page ), but it uses only a single call to sapply().
The previous example demonstrates that it is useful to be able to define our own functions for use with functions like apply(), lapply(), and sapply(). However, there are many other good reasons for being able to write functions. In particular, functions are useful for organizing code, simplifying code, and for making it easier to maintain code.
For example, the code below reproduces the loop that we wrote in Section 9.11.2, which calculates all 72 monthly averages.
for (i in 1:72) { tempDF <- read.fwf(nasaAirTempFiles[i], skip=7, widths=c(-12, rep(7, 24))) tempMatrix <- as.matrix(tempDF) avgTemp[i] <- mean(tempMatrix) }
We can use the calcMonthAvg() function to make this code much simpler and easier to read.
for (i in 1:72) { avgTemp[i] <- calcMonthAvg(nasaAirTempFiles[i]) }
This sort of use for functions is just an extension of the ideas of laying out and documenting code for the benefit of human readers that we saw in Section 2.4.
A further advantage that we can obtain from writing functions is the ability to reuse our code.
The calcMonthAvg() function nicely encapsulates the code for calculating the overall average temperature from one of these NASA files. If we ever need to perform this calculation in another analysis, rather than writing the code again, we can simply make use of this function again.
In this sense, functions are another example of the DRY principle because a function allows us to create and maintain a single copy of a piece of information--in this case, the computer code to perform a specific task.
All of the admonitions about writing code in small pieces and changing one thing at a time, which were discussed first in Section 2.6.2, are even more important when writing code in a programming language because the code tends to be more complex.
A specific point that can be made about writing R code, especially code that involves many expressions, is that each individual expression can be run, one at a time, in the R environment. If the final result of a collection of R code is not correct, the first thing to do is to run the individual expressions, one at a time, in order, and check that the result of each expression is correct. This is an excellent way to track down where a problem is occurring.
Matters become far more complicated once we begin to write our own functions, or even if we write loops. If the code within a function is not performing as expected, the debug() function is very useful because it allows us to run the code within a function one expression at a time.
Another useful simple trick, for both functions and loops, is to use the functions from Section 9.10 to write messages to the screen that show the values from intermediate calculations.
Functions allow a piece of code to be reused.
In their simplest form, functions are like macros, simply a recording of a series of steps.
The best way to debug R code is to run it one expression at a time.
Paul Murrell
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 New Zealand License.