Thursday 03 December 2015
Consider the following short R script that shows the development of a collapse() function to produce a single object from two input objects.
# Function to collapse two objects into
# a single object
collapse <- function(x, y) {
    c(x, y)
}

# Special collapse operator to provide
# syntactic sugar
"%c%" <- collapse

# Tests
test1 <- collapse(1:3, 3:5)
test2 <- 1:3 %c% 3:5
identical(test1, test2)
The scenario that we want to focus on in this document is one where we are developing this R code in a somewhat ad hoc fashion. We record our work in a text file, and then cut-and-paste or otherwise submit sections of the code to R. For example, first we define the collapse() function itself ...
> collapse <- function(x, y) {
+     c(x, y)
+ }
... then we define a special operator for convenience ...
> "%c%" <- collapse
... and finally we run some tests to check that the function and operator work as expected ...
> test1 <- collapse(1:3, 3:5)
> test2 <- 1:3 %c% 3:5
> identical(test1, test2)
[1] TRUE
This approach of submitting code piecemeal from a text file is not the safest way to develop code; for example, it would be safer to source() the entire script at once after every edit. However, submitting code piecemeal is something that we find ourselves doing, and it is something that we see students doing, so we will assume that this ad hoc approach is not entirely unfamiliar to at least some R users, at least some of the time. The "Source on Save" feature in RStudio provides further evidence that there is some demand from users for this sort of protection.
Submitting code piecemeal is not best practice because it makes it very easy to shoot ourselves in the foot. For example, consider a scenario where we decide to modify the definition of the collapse() function so that the script now looks like the following.
# Function to collapse two objects into
# a single object
collapse <- function(x, y) {
    unique(c(x, y))
}

# Special collapse operator to provide
# syntactic sugar
"%c%" <- collapse

# Tests
test1 <- collapse(1:3, 3:5)
test2 <- 1:3 %c% 3:5
identical(test1, test2)
After modifying our script, we resubmit the collapse() function to R ...
> collapse <- function(x, y) {
+     unique(c(x, y))
+ }
... and rerun the tests ...
> test1 <- collapse(1:3, 3:5)
> test2 <- 1:3 %c% 3:5
> identical(test1, test2)
[1] FALSE
... which now give different results because the special operator, %c%, is still working from the old definition of the collapse() function (but you had noticed that already, right?).
That was a very stupid thing to do, and it would have been avoided if we had source()d the entire script in again, but when we develop code in an ad hoc fashion, this sort of thing sometimes happens.
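As a minimal sketch of the safer alternative, the following writes the modified script to a temporary file and evaluates it in one go with source(). The temporary file is only a device for this illustration; in practice the script would already exist on disk.

```r
# Write the modified script to a temporary file; the file name is arbitrary
# and only used for this illustration.
script <- tempfile(fileext = ".R")
writeLines(c(
    'collapse <- function(x, y) {',
    '    unique(c(x, y))',
    '}',
    '"%c%" <- collapse',
    'test1 <- collapse(1:3, 3:5)',
    'test2 <- 1:3 %c% 3:5'
), script)

# source() evaluates every expression in file order, so %c% is redefined
# after collapse() and before the tests are run.
source(script)
identical(test1, test2)  # TRUE
```

Because the whole file is evaluated from top to bottom, there is no opportunity to skip the redefinition of the %c% operator.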
The purpose of this document is to explore an idea for protecting ourselves against this sort of stupidity.
The essence of the problem is that it is possible to submit R expressions from a script in the wrong order (in an order that does not correspond to the order of the expressions within the script file). In the example above, our mistake was to run the tests before redefining the %c% operator (when the tests come after the definition of the %c% operator in the script).
The proposed solution is to monitor every R expression as it is submitted and detect when expressions are evaluated in the wrong order. We do this by recording a time stamp whenever a value is assigned to a symbol; this gives every symbol an "age". We also record which other symbols are involved in every assignment, and this gives every symbol a list of "dependents" (the symbols whose values were used to compute it). This allows us to determine whether an expression involves a "stale" symbol - a symbol that is older than its dependents.
In the example above, when we tried to run the expression ...
> test2 <- 1:3 %c% 3:5
... we would get a warning that the %c% operator is older than the collapse() function, upon which it depends.
It turns out not to be too hard to produce a simple demonstration of the solution. However, to do so, we need to simplify the example problem even further. We will work with the following R script.
a <- 1
b <- a + 1
c <- b + 1
The first thing we need is a place to record time stamps and dependencies.
> timeDB <- new.env()
> depDB <- new.env()
Now, we define a new function, safeAssign() to carry out "safe" assignments. This function does the same work as assign(), plus it records a time stamp and a set of dependents for the symbol being assigned to.
> safeAssign <- function(x, value, env=.GlobalEnv) {
+     x <- deparse(substitute(x))
+     expr <- substitute(value)
+     assign(x, value, envir=env)
+     assign(x, as.numeric(proc.time()[3]), envir=timeDB)
+     dependents <- all.vars(expr)
+     assign(x, dependents, envir=depDB)
+ }
Here is this function in action and the resulting time stamps and dependencies that are recorded.
> safeAssign(a, 1)
> safeAssign(b, a + 1)
> safeAssign(c, b + 1)
We can see that the assignments happened in the order 'a', then 'b', then 'c'.
> sapply(ls(timeDB), get, timeDB)
    a     b     c 
0.457 0.458 0.458 
We can see that 'a' has no dependencies, 'b' is dependent on 'a', and 'c' is dependent on 'b'.
> sapply(ls(depDB), get, depDB)
$a
character(0)

$b
[1] "a"

$c
[1] "b"
With that information recorded, we can write a function that determines whether a symbol is "stale" (whether it is older than its dependents or any of its dependents are stale).
> stale <- function(x) {
+     deps <- get(x, depDB)
+     any(get(x, timeDB) < sapply(deps, get, timeDB)) ||
+         any(sapply(deps, stale))
+ }
None of our symbols are currently stale, either because they have no dependents, or because they are younger than their dependents ...
> sapply(c("a", "b", "c"), stale)
    a     b     c 
FALSE FALSE FALSE 
... but if we assign a new value to 'a' (using safeAssign()), 'b' and 'c' become stale ...
> safeAssign(a, 2)
> sapply(c("a", "b", "c"), stale)
    a     b     c 
FALSE  TRUE  TRUE 
We can work this stale() function into safeAssign() so that it warns when attempting an assignment that uses a stale symbol.
> safeAssign <- function(x, value, env=.GlobalEnv) {
+     x <- deparse(substitute(x))
+     expr <- substitute(value)
+     assign(x, value, envir=env)
+     assign(x, as.numeric(proc.time()[3]), envir=timeDB)
+     dependents <- all.vars(expr)
+     staleDeps <- sapply(dependents, stale)
+     if (any(staleDeps))
+         warning(staleWarnMsg(dependents[staleDeps]))
+     assign(x, dependents, envir=depDB)
+ }
Now if we try to do something stupid, like reassigning 'c' without first reassigning 'b', we get a warning.
> safeAssign(c, b + 1)
Warning in safeAssign(c, b + 1): Dependent 'b' is stale!
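The staleWarnMsg() helper called by safeAssign() is not shown in this document. The following is a hypothetical stand-in, not the version from the 'safemode' package: the singular wording is copied from the warning above, and the plural wording is an assumption.

```r
# Hypothetical stand-in for staleWarnMsg(): builds the warning text from a
# character vector of stale symbol names.
staleWarnMsg <- function(deps) {
    quoted <- paste0("'", deps, "'")
    if (length(quoted) == 1) {
        paste0("Dependent ", quoted, " is stale!")
    } else {
        # Plural form (assumed wording): "Dependents 'a', 'b' and 'c' are stale!"
        last <- quoted[length(quoted)]
        rest <- paste(quoted[-length(quoted)], collapse = ", ")
        paste0("Dependents ", rest, " and ", last, " are stale!")
    }
}

staleWarnMsg("b")           # "Dependent 'b' is stale!"
staleWarnMsg(c("y", "z"))   # "Dependents 'y' and 'z' are stale!"
```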
There are several major problems with the naive solution presented in the previous section.
For a start, it is unlikely that users are going to want to change to using a safeAssign() function in place of the normal <- or = assignment operators.
Even if users were prepared to do this, assignments of the form ...
> names(x) <- "a"
... or ...
> x[1] <- 2
... would be hard to support.
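To see why, it may help to isolate the first step of safeAssign(), which turns its unevaluated first argument into a name for assign(). In this small sketch, lhsName() is a hypothetical helper introduced only for illustration.

```r
# lhsName() is a hypothetical helper that isolates the first step of
# safeAssign(): turning the unevaluated first argument into a name.
lhsName <- function(x) deparse(substitute(x))

lhsName(a)         # "a": a plain symbol, which assign() can handle
lhsName(names(x))  # "names(x)": a call, not the name of a variable
lhsName(x[1])      # "x[1]": likewise
```

With a call on the left-hand side, assign() would create a variable literally called "names(x)" or "x[1]" rather than modifying 'x'.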
An alternative could be to define a special operator, say %<-%, so that code could be of the form ...
a %<-% 1
b %<-% a + 1
c %<-% b + 1
... which would require of users only a simple search-and-replace. However, this quickly runs into problems of its own, including the fact that operator precedence places special operators ahead of common arithmetic operators, so we get problems like this ...
> a %<-% 1
> b %<-% a + 1
Error in b %<-% a + 1: non-numeric argument to binary operator
... and the solution, bracketing the right-hand side of the assignment, is again not something users are likely to embrace willingly.
> b %<-% { a + 1 }
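The precedence problem can be seen by inspecting how R parses the unbracketed expression. This sketch only uses quote(), so it does not depend on any particular definition of %<-%.

```r
# quote() captures the expression without evaluating it, so %<-% does not
# need to be defined to see how the parser groups the operators.
expr <- quote(b %<-% a + 1)

as.character(expr[[1]])  # "+": the outermost call is the addition
deparse(expr[[2]])       # "b %<-% a": the assignment is its first operand
deparse(expr[[3]])       # "1"
```

In other words, the expression is treated as (b %<-% a) + 1, so the assignment never sees the "+ 1".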
A further problem is that this naive solution only detects problems on assignment, not on each use of a symbol. For example, if we reassign 'a', just using 'b' is not enough to trigger a warning; we must make an assignment involving 'b'.
> a %<-% 1
> b + 1
[1] 3
> c %<-% { b + 1 }
Warning in c %<-% {: Dependent 'b' is stale!
The script we are working with is also extremely simple and identifying the dependent symbols will be much harder for more complex code.
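As a small illustration of that difficulty, all.vars() (which the naive safeAssign() relies on) reports every symbol that is not used as a function name, including symbols that the code will never look up as ordinary variables.

```r
# all.vars() reports every symbol that is not used as a function name,
# regardless of how the enclosing function will treat it.
all.vars(quote(b + 1))                    # "b"
all.vars(quote(lm(mpg ~ disp, mtcars)))   # "mpg" "disp" "mtcars"
all.vars(quote(quote(x + 1)))             # "x"
```

In the second case 'mpg' and 'disp' are columns of 'mtcars', and in the third case 'x' is quoted and never evaluated, so treating them as dependencies would produce spurious warnings.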
In this section, we look at a more comprehensive attempt at a solution.
The same basic approach is used to record time stamps and dependencies so that we can determine whether a symbol is stale, but more sophisticated tools are used to determine the dependencies between symbols.
Firstly, the scriptInfo() function from the 'CodeDepends' package is used to identify the symbols involved in an expression, including which are "inputs" and which are "outputs". The package is designed for use with R scripts, so we have to provide the R expression as text to the readScript() function, then feed the result to scriptInfo() ...
> library(CodeDepends)
> sc <- readScript("", txt="y <- x + 1")
> scriptInfo(sc)
An object of class "ScriptInfo"
[[1]]
An object of class "ScriptNodeInfo"
Slot "files":
character(0)

Slot "strings":
character(0)

Slot "libraries":
character(0)

Slot "inputs":
[1] "x"

Slot "outputs":
[1] "y"

Slot "updates":
character(0)

Slot "functions":
    + 
FALSE 

Slot "removes":
character(0)

Slot "formulaVariables":
character(0)

Slot "sideEffects":
character(0)
This information allows us to determine whether an assignment took place and which symbols were the target(s) of the assignment. Importantly, scriptInfo() can identify situations like assignment to a subset and multiple assignments ...
> sc <- readScript("", txt="x[1] <- 2")
> scriptInfo(sc)[[1]]@updates
[1] "x"
> sc <- readScript("", txt="x <- y <- 1")
> scriptInfo(sc)[[1]]@outputs
[1] "y" "x"
In addition to scriptInfo() from 'CodeDepends', we make use of the findGlobals() function from the 'codetools' package. In this case, the function requires a function (or closure) as input ...
> library(codetools)
> f <- function() { }
> body(f) <- expression(y <- x + 1)
> findGlobals(f)
[1] "<-" "+" "x"
This function is useful because it will not identify symbols that are involved in a formula (or are otherwise quoted), which helps us to avoid creating unnecessary dependencies ...
> body(f) <- expression(lm(mpg ~ disp, mtcars))
> findGlobals(f)
[1] "~" "lm" "mtcars"
> body(f) <- expression(y <- quote(x + 1))
> findGlobals(f)
[1] "<-" "quote"
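The results above mix function names with variable names. As a sketch of how the two might be separated, findGlobals() has a 'merge' argument; with merge=FALSE it returns a list with 'functions' and 'variables' components. The globalVars() wrapper below is a hypothetical helper for illustration and is not necessarily how 'safemode' uses findGlobals().

```r
library(codetools)

# globalVars() is a hypothetical helper: it wraps an unevaluated expression
# in a function so that findGlobals() can analyse it, then keeps only the
# global variables (not the global functions).
globalVars <- function(expr) {
    f <- function() NULL
    body(f) <- expr
    environment(f) <- globalenv()
    findGlobals(f, merge = FALSE)$variables
}

globalVars(quote(y <- x + 1))
globalVars(quote(lm(mpg ~ disp, mtcars)))
globalVars(quote(y <- quote(x + 1)))
```

For these three expressions, the only variables reported should be 'x', 'mtcars', and nothing at all, respectively, which matches the merged results shown above once the function names are set aside.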
Another major difference with our more complete solution is that, instead of defining a function or a special operator, we create a "safe mode" sub-session to handle all input from the user. This approach is based on Ross Ihaka's script() function.
A complete explanation of the code for the 'safemode' package is available in the literate document that is installed with the package. This section now just focuses on a demonstration of some of the useful features of this "safe mode."
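The general idea of a sub-session that sees every expression before it is evaluated can be sketched as follows. This is only an illustration under simplifying assumptions, not the code from 'safemode' or from the script() function: miniSession() is a hypothetical name, it takes its input from a character vector rather than from the command line, and it assumes that every line is a complete expression.

```r
# miniSession() is a hypothetical, non-interactive stand-in for a sub-session:
# it receives each line of input, parses it, and evaluates it itself.
miniSession <- function(lines, env = new.env(parent = globalenv())) {
    for (line in lines) {
        for (expr in parse(text = line)) {
            # A "safe mode" would inspect 'expr' here, before evaluation,
            # for example to check for stale symbols.
            result <- withVisible(eval(expr, env))
            # Mimic the top-level behaviour of printing visible results.
            if (result$visible) print(result$value)
        }
    }
    invisible(env)
}

session <- miniSession(c("a <- 1", "b <- a + 1", "b"))
```

The important point is that, because the sub-session does its own parsing and evaluation, it has access to every expression in unevaluated form and can analyse it before anything happens.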
We start a "safe mode" session by calling the safemode() function. The prompt changes to safe> to indicate that we are in "safe mode".
> library(safemode)
> safemode()
There now follows a set of examples:
We can reproduce our original problem to show that "safe mode" can detect that the %c% operator is stale ...
safe> collapse <- function(x, y) {
safe+     c(x, y)
safe+ }
safe> "%c%" <- collapse
safe> test1 <- collapse(1:3, 3:5)
safe> test2 <- 1:3 %c% 3:5
safe> identical(test1, test2)
[1] TRUE
safe> collapse <- function(x, y) {
safe+     unique(c(x, y))
safe+ }
safe> test1 <- collapse(1:3, 3:5)
safe> test2 <- 1:3 %c% 3:5
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol '%c%' is stale!
We can detect problems on any use of a stale symbol, not just on assignments. Furthermore, "safe mode" works with normal R expressions, so an arithmetic expression on the right-hand side of an assignment is not a problem.
safe> a <- 1
safe> b <- a + 1
safe> a <- 2
safe> b
[1] 2
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'b' is stale!
We can handle assignments that only "update" the left-hand side of an assignment, for example, assigning to a subset and assigning new names to an object ...
safe> x <- 1
safe> y <- x + 1
safe> names(x) <- "a"
safe> y
[1] 2
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'y' is stale!
safe> y <- x + 1
safe> y
a 
2 
safe> x[1] <- 2
safe> y
a 
2 
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'y' is stale!
An important detail is that "safe mode" only records dependencies between symbols that we have assigned in the current session (which reflects the fact that we are interested in preventing users from evaluating expressions within their own code in an inappropriate order) ...
safe> x <- pi
safe> pi <- 3.14
safe> x
[1] 3.141593
safe> x <- pi
safe> pi <- "pie"
safe> x
[1] 3.14
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'x' is stale!
We can detect that symbol 'z' is stale when it depends on 'y', which depends on 'x', and 'y' is stale ...
safe> x <- 1
safe> y <- x + 1
safe> z <- y + 1
safe> x <- 2
safe> z
[1] 3
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'z' is stale!
We can report that more than one dependency is stale at once ...
safe> x <- 1
safe> y <- x + 1
safe> z <- y + 1
safe> k <- y + z
safe> x <- 2
safe> k <- y + z
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbols 'y' and 'z' are stale!
We can handle a function with a global variable ...
safe> x <- 1
safe> f <- function() { x + 1 }
safe> f()
[1] 2
safe> x <- 2
safe> f()
[1] 3
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'f' is stale!
We can handle a multiple assignment ...
safe> x <- y <- 1
safe> z <- y + 1
safe> y <- 2
safe> z
[1] 2
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'z' is stale!
We can handle a compound expression ...
safe> x <- 1; y <- x + 1
safe> x <- 2
safe> y
[1] 2
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'y' is stale!
We can cope with the right-assignment operator ...
safe> 1 -> x
safe> x + 1 -> y
safe> 2 -> x
safe> y
[1] 2
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'y' is stale!
We can work with a for loop ...
safe> x <- rnorm(10)
safe> sum <- 0
safe> for (i in 1:10) {
safe+     sum <- sum + x[i]
safe+ }
safe> sum
[1] 1.871742
safe> x <- rnorm(10)
safe> sum
[1] 1.871742
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'sum' is stale!
It is possible to create dependency loops, but "safe mode" reports these without, for example, going into an infinite loop itself (it is up to the user to break the dependency loop) ...
safe> x <- 1
safe> y <- x
safe> x <- y
safe> x
[1] 1
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'x' is stale!
safe> y
[1] 1
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'y' is stale!
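One way that a recursive staleness check can be protected against dependency loops is to remember which symbols are already being examined. The following is a minimal sketch of that idea, not the code used by 'safemode'; staleSafe() is a hypothetical name, and the time stamp and dependency databases are passed in as arguments so that the example is self-contained.

```r
# staleSafe() is a hypothetical variation on a recursive staleness check:
# 'visited' records the symbols already on the current path, so a
# dependency loop cannot cause infinite recursion.
staleSafe <- function(x, timeDB, depDB, visited = character()) {
    if (x %in% visited) return(FALSE)   # already being examined: stop here
    visited <- c(visited, x)
    deps <- get(x, envir = depDB)
    # Only consider dependents that have a recorded time stamp
    tracked <- vapply(deps, exists, logical(1), envir = timeDB, inherits = FALSE)
    for (d in deps[tracked]) {
        if (get(x, envir = timeDB) < get(d, envir = timeDB) ||
            staleSafe(d, timeDB, depDB, visited)) {
            return(TRUE)
        }
    }
    FALSE
}

# Reproduce the loop above with made-up time stamps:
# 'y <- x' happened at time 2 and 'x <- y' happened at time 3.
times <- new.env()
deps <- new.env()
assign("x", 3, envir = times); assign("x", "y", envir = deps)
assign("y", 2, envir = times); assign("y", "x", envir = deps)

staleSafe("x", times, deps)  # TRUE
staleSafe("y", times, deps)  # TRUE
```

Both symbols are reported as stale, as in the "safe mode" session above, and the check terminates even if the two time stamps are identical.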
Finally, a demonstration that "safe mode" works properly in a negative-result sense (it does not report problems where there are none) ...
safe> x <- 1
safe> f <- function(x) { x + 1 }
safe> f(1)
[1] 2
safe> x <- 2
safe> f(1)
[1] 2
To exit "safe mode", we type q().
The 'safemode' package provides a safemode() function that creates a "safe mode" session in R. In "safe mode", all symbols have an "age" (a last-modified time stamp) and a set of dependent symbols, and a warning is issued whenever a symbol is used in an expression and its age exceeds the age of any of its dependents (i.e., there is a warning whenever a "stale" symbol is used in an expression).
The idea of "safe mode" solves a problem that should not exist (in an ideal world). If people developed R scripts in a strictly disciplined fashion, "stale" symbols would not arise. However, the reality is that this sort of problem can occur, probably does occur, and the 'safemode' package demonstrates an idea for protecting users from harming themselves in this way.
This is not a tool that would be useful for developers of R packages, but it might help to prevent some nasty slip ups when developing scripts in a more casual fashion for small one-off projects.
The code analysis involved in determining dependencies between symbols is based on heuristics (and some special cases), so it can be defeated by things like non-standard evaluation. In the following example, "safe mode" incorrectly creates a dependency between 'mpg' and 'x' ...
safe> mpg <- 1
safe> x <- subset(mtcars, mpg < 15)
safe> mpg <- 2
safe> x
                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
Warning message: In withCallingHandlers(warning(staleWarnMsg(tracked[staleDeps])), ... : Symbol 'x' is stale!
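The underlying difficulty is that, from the unevaluated expression alone, 'mpg' looks just like any other variable; only the behaviour of subset() at run time makes it refer to a column of 'mtcars'. The sketch below uses all.vars() as a simple stand-in for the more sophisticated analysis in 'safemode'.

```r
# Seen purely as code, 'mpg' is just another symbol in the call ...
all.vars(quote(subset(mtcars, mpg < 15)))  # "mtcars" "mpg"

# ... but subset() evaluates its second argument within the data frame,
# where 'mpg' is a column, so the global 'mpg' is never consulted.
"mpg" %in% names(mtcars)  # TRUE
```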
Because safemode creates its own sub-session, there may be problems with anything else that tries to manipulate the R command line. For example, within "safe mode", auto-completion does not work. Also, "safe mode" has been tested mostly on the raw R terminal interface; there may be interactions with GUIs that provide a more sophisticated interface (though a simple test showed that it might be ok within the RStudio Console window and when submitting expressions with Ctrl-Enter from an R script within RStudio).
There are several R packages that provide code analysis. The 'safemode' package has similar aims to the 'CodeDepends' package and 'safemode' relies heavily on sophisticated functions from 'CodeDepends' (and 'codetools'). The main difference is that 'safemode' is focused on a more dynamic scenario and acts as the user enters expressions, rather than being aimed at static analysis of script files.
The 'lintr' package is integrated with several editors and IDEs for R to provide dynamic code checking, but it has more of an emphasis on syntax checking and code style.
The next step will be to try out "safe mode" in a more realistic setting, rather than just on toy examples. For example, we will get students in an undergraduate course to use it in computer labs to see if it does help to catch any problems.
If "safe mode" proves to have some worth, a number of improvements immediately suggest themselves for future development:
There could be some global options to control "safe mode" behaviour, such as customising the safe> prompt and generating errors rather than warnings for stale symbols.
It might be useful to be able to deliberately clear the time stamp and dependency databases (though this happens automatically on exit from "safe mode").
It might be useful to have a function that produces a summary of the status of all tracked symbols in "safe mode" and possibly a function that reports, if there are stale symbols, which symbols need updating (and in which order).
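As a rough sketch of the last of these ideas, the following hypothetical function takes the recorded dependencies and a set of stale symbols and returns the stale symbols in an order in which they could be reassigned, with each symbol listed after the stale symbols that it depends on. The function name, its arguments, and the use of a plain named list for the dependencies are all assumptions made for illustration.

```r
# updateOrder() is a hypothetical helper.
#   deps:  named list; deps[[s]] holds the symbols that symbol 's' depends on
#   stale: character vector of symbols that are currently stale
updateOrder <- function(deps, stale) {
    seen <- character()
    ordered <- character()
    visit <- function(s) {
        # The 'seen' check also stops dependency loops from recursing forever
        if (s %in% seen) return(invisible(NULL))
        seen <<- c(seen, s)
        # Stale symbols that 's' depends on must be updated before 's'
        for (d in intersect(deps[[s]], stale)) visit(d)
        ordered <<- c(ordered, s)
    }
    for (s in stale) visit(s)
    ordered
}

# The example from earlier: 'x' has been reassigned, so 'y', 'z' and 'k'
# are stale.
deps <- list(x = character(), y = "x", z = "y", k = c("y", "z"))
updateOrder(deps, c("k", "z", "y"))  # "y" "z" "k"
```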
Something else to look at is whether the time spent monitoring and checking for "stale" symbols can become burdensome and ways to make that more efficient.
Finally, an alternative approach to this problem, rather than monitoring expressions as they are entered, might be to monitor and check for staleness entirely within the script file (within an IDE environment). For example, whenever a line is edited, highlight every other line in the file that needs to be re-evaluated as a result of that change; put another way, highlight all lines within a script that have become "stale".
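A very simplified sketch of this script-based idea is shown below. It assumes that every line of the script is a single assignment of the form 'target <- value' and uses all.vars() to find the symbols involved; staleLines() is a hypothetical name, and a real implementation would need the more careful code analysis described earlier.

```r
# staleLines() is a hypothetical helper: given the lines of a script and the
# number of the line that was edited, return the numbers of the later lines
# that need to be re-evaluated.
staleLines <- function(lines, edited) {
    changed <- character()   # symbols whose values may have changed
    stale <- integer()
    for (i in seq_along(lines)) {
        expr <- parse(text = lines[i])[[1]]
        target <- all.vars(expr[[2]])   # left-hand side of the assignment
        inputs <- all.vars(expr[[3]])   # right-hand side of the assignment
        if (i == edited) {
            changed <- c(changed, target)
        } else if (any(inputs %in% changed)) {
            stale <- c(stale, i)
            changed <- c(changed, target)
        }
    }
    stale
}

staleLines(c("a <- 1", "b <- a + 1", "c <- b + 1"), edited = 1)  # 2 3
staleLines(c("a <- 1", "b <- a + 1", "c <- b + 1"), edited = 2)  # 3
```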
The 'safemode' package is available from GitHub. It can be installed with the following ...
library(devtools)
install_github("pmur002/safemode", subdir="pkg")
There is also a tarball for installing on Linux and a zip file for installing on Windows, for version 0.1-0.
Ross Ihaka's script() function (upon which "safe mode" is based) is available here.
Luke Tierney's 'codetools' package is distributed with R, but is also available from CRAN.
Duncan Temple Lang's 'CodeDepends' package is available from the Omegahat repository. However, to get all of the examples in this document to run (particularly the ones involving a character value on the left-hand side of an assignment), you will need a fork of 'CodeDepends' from GitHub. Hopefully, this will eventually turn up in Duncan's GitHub repo. There is a zip file for the forked 'CodeDepends' for installing on Windows.
Jim Hester's 'lintr' package is available from CRAN or GitHub.
The source and additional resources used to build this document are available from the parent page for this document.