Currently Topical Topic: R Efficiency

(aka I don't think that R code is as efficient as you think it is)

Quick Conclusion

sample with dbinom beats rbinom in generating random numbers from a binomial distribution.

Main Post

A few years back, in a Stochastic Processes paper, we were examining Gambler's Ruin or some such problem that required generating what was essentially a Bernoulli random variable (in truth, we needed to generate -1 or 1 with equal probability). I remember there being a discussion about an efficient way to generate such numbers in R, and I suggested using rbinom(n, 1, 0.5) * 2 - 1, which doesn't seem too bad, right? rbinom generates 0's and 1's (and surely rbinom must be quite efficient?), and we can transform these to -1's and 1's with * 2 - 1.

Armed with my current R knowledge, however, I wondered if there was a faster (perhaps even easier) way. The results may (or may not) surprise you. We could use rbinom as I suggested all those years ago:


> rbinom(100, 1, 0.5) * 2 - 1
  [1] -1  1  1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1  1  1 -1 -1  1  1  1  1  1  1 -1
 [26]  1  1 -1  1  1  1 -1 -1  1 -1 -1 -1 -1 -1  1  1 -1 -1  1 -1  1  1  1  1 -1
 [51]  1 -1  1  1 -1  1  1 -1  1 -1  1  1 -1  1  1 -1  1 -1  1  1 -1 -1 -1 -1 -1
 [76] -1  1  1  1  1 -1  1  1 -1  1  1  1 -1 -1  1  1 -1  1  1  1 -1 -1 -1  1 -1

An alternative is to use sample.


> sample(c(-1, 1), 100, TRUE) # TRUE for sampling with replacement
  [1]  1 -1  1  1 -1 -1  1  1  1 -1  1  1  1 -1 -1  1 -1 -1  1 -1  1 -1 -1  1  1
 [26] -1 -1  1  1  1 -1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1  1 -1  1 -1  1
 [51]  1 -1 -1  1  1  1  1 -1 -1  1 -1 -1  1 -1 -1 -1  1 -1  1 -1  1  1  1  1  1
 [76]  1 -1 -1  1 -1 -1 -1  1 -1  1  1  1 -1 -1  1  1  1 -1 -1  1  1  1 -1  1  1

So how do these stack up against each other?


> system.time(rbinom(10^8, 1, 0.5) * 2 - 1)
   user  system elapsed
  10.11    0.38   10.49
> system.time(sample(c(-1, 1), 10^8, TRUE))
   user  system elapsed
   2.86    0.07    2.94

For reference:


> system.time(rbinom(10^8, 1, 0.5))
   user  system elapsed
   9.75    0.38   10.13
> system.time(sample(c(0, 1), 10^8, TRUE))
   user  system elapsed
   2.73    0.19    2.92

So sample beats rbinom quite resoundingly, even for generating a binomial with n = 1 and p = 0.5.
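As an aside, a single system.time call can be a little noisy, so before reading too much into any one number it's worth repeating each timing and taking the median. A quick base-R sketch (the helper time_median is my own, not a built-in):

```r
## Repeat a timing several times and report the median elapsed time,
## since one system.time() run can be noisy.
time_median <- function(f, times = 5) {
  median(replicate(times, system.time(f())[["elapsed"]]))
}

time_median(function() rbinom(10^6, 1, 0.5) * 2 - 1)
time_median(function() sample(c(-1, 1), 10^6, TRUE))
```

Passing a function (rather than a bare expression) matters here: an expression argument would be evaluated once as a promise and then cached, so only the first replication would actually be timed.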

Now here's where it really gets interesting: what if it's no longer a simple Bernoulli, and we try a binomial with, say, n = 2 and p = 0.5?


> system.time(rbinom(10^7, 2, 0.5))
   user  system elapsed
   0.95    0.00    1.00

We can pass the binomial probabilities (via dbinom) as sampling weights to sample to generate the same distribution:


> system.time(sample(0:2, 10^7, TRUE, dbinom(0:2, 2, 0.5)))
   user  system elapsed
   0.34    0.01    0.36

So sample beats rbinom even for larger n!
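It's worth a quick sanity check that the two approaches really do draw from the same distribution. A small sketch (separate from the benchmarks above) comparing empirical proportions against the theoretical dbinom probabilities:

```r
set.seed(1)  # for reproducibility
n <- 2; pr <- 0.5; N <- 10^5

## Empirical proportions of 0, 1, 2 from each generator
p_rbinom <- prop.table(table(factor(rbinom(N, n, pr), levels = 0:n)))
p_sample <- prop.table(table(factor(sample(0:n, N, TRUE, dbinom(0:n, n, pr)),
                                    levels = 0:n)))

## Both rows should be close to the theoretical c(0.25, 0.5, 0.25)
round(rbind(theory = dbinom(0:n, n, pr),
            rbinom = p_rbinom,
            sample = p_sample), 3)
```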

Now the question is, how long does this relationship hold? Behold, a function to test it:


sampletestfunc <- function(startn = 2, maxn = 30, pr = 0.5){
  ## Test whether sample (with dbinom weights) is faster than rbinom
  samplebetter <- TRUE
  n <- startn
  while(samplebetter && n < maxn){
    binomt <- system.time(rbinom(10^6, n, pr))[3]
    samplet <- system.time(sample(0:n, 10^6, TRUE, dbinom(0:n, n, pr)))[3]
    samplebetter <- binomt > samplet
    cat("n =", n, samplebetter, binomt, "vs", samplet, "\n")
    n <- n + 1
  }
}

The results of this seem to suggest that sample beats rbinom resoundingly until n grows so large that the tail probabilities become very small and the sample function produces a warning.

Try:

sampletestfunc()
sampletestfunc(pr = 0.2)
sampletestfunc(pr = 0.7)
sampletestfunc(100, 110)
sampletestfunc(1000, 1010)
sampletestfunc(10000, 10010) ## Warning produced by sample
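To get a sense of just how small those probabilities get, note that for large n the tails of dbinom underflow to exactly zero in double precision (a quick illustration, separate from the benchmarks above; whether this underflow is the direct cause of the warning is a separate question):

```r
## 0.5^10000 is far below .Machine$double.xmin, so the tail underflows to 0
dbinom(0, 10000, 0.5)

## The mode is still comfortably representable (roughly 0.008)
dbinom(5000, 10000, 0.5)

## Many of the 10001 weights passed to sample are exactly zero
sum(dbinom(0:10000, 10000, 0.5) == 0)
```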

The more you know...