Currently Topical Topic: R Efficiency

(aka I don't think that R code is as efficient as you think it is)

Quick Conclusion

sample with dbinom beats rbinom in generating random numbers from a binomial distribution.

Main Post

A few years back, in a Stochastic Processes paper, we were examining Gambler's Ruin or some such problem that required generating what was essentially a Bernoulli random variable (in truth, we needed to generate -1 or 1 with equal probability). I remember there being a discussion about an efficient way to generate such numbers in R, and I suggested using rbinom(n, 1, 0.5) * 2 - 1, which doesn't seem too bad, right? rbinom generates 0's and 1's (and surely rbinom must be quite efficient?), and we can transform these to -1's and 1's with * 2 - 1.

Armed with my current R knowledge, however, I wondered if there was a faster (perhaps even easier) way. The results may (or may not) surprise you. We could use rbinom as I suggested all those years ago:


> rbinom(100, 1, 0.5) * 2 - 1
  [1] -1  1  1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1  1  1 -1 -1  1  1  1  1  1  1 -1
 [26]  1  1 -1  1  1  1 -1 -1  1 -1 -1 -1 -1 -1  1  1 -1 -1  1 -1  1  1  1  1 -1
 [51]  1 -1  1  1 -1  1  1 -1  1 -1  1  1 -1  1  1 -1  1 -1  1  1 -1 -1 -1 -1 -1
 [76] -1  1  1  1  1 -1  1  1 -1  1  1  1 -1 -1  1  1 -1  1  1  1 -1 -1 -1  1 -1

An alternative is to use sample.


> sample(c(-1, 1), 100, TRUE) # TRUE for sampling with replacement
  [1]  1 -1  1  1 -1 -1  1  1  1 -1  1  1  1 -1 -1  1 -1 -1  1 -1  1 -1 -1  1  1
 [26] -1 -1  1  1  1 -1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1  1 -1  1 -1  1
 [51]  1 -1 -1  1  1  1  1 -1 -1  1 -1 -1  1 -1 -1 -1  1 -1  1 -1  1  1  1  1  1
 [76]  1 -1 -1  1 -1 -1 -1  1 -1  1  1  1 -1 -1  1  1  1 -1 -1  1  1  1 -1  1  1

So how do these stack up against each other?


> system.time(rbinom(10^8, 1, 0.5) * 2 - 1)
   user  system elapsed
  10.11    0.38   10.49
> system.time(sample(c(-1, 1), 10^8, TRUE))
   user  system elapsed
   2.86    0.07    2.94

For reference:


> system.time(rbinom(10^8, 1, 0.5))
   user  system elapsed
   9.75    0.38   10.13
> system.time(sample(c(0, 1), 10^8, TRUE))
   user  system elapsed
   2.73    0.19    2.92

So sample beats rbinom quite resoundingly, even for generating a binomial with n = 1 and p = 0.5.
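As an aside, a single system.time call can be a little noisy, so before reading too much into any one number it's worth repeating each timing and taking the median. A quick base-R sketch (the helper time_median is my own, not a built-in):

```r
## Repeat a timing several times and report the median elapsed time,
## since one system.time() run can be noisy.
time_median <- function(f, times = 5) {
  median(replicate(times, system.time(f())[["elapsed"]]))
}

time_median(function() rbinom(10^6, 1, 0.5) * 2 - 1)
time_median(function() sample(c(-1, 1), 10^6, TRUE))
```

Passing a function (rather than a bare expression) matters here: an expression argument would be evaluated once as a promise and then cached, so only the first replication would actually be timed.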

Now here's where it really gets interesting: what if it's no longer a simple Bernoulli, and we try a binomial with, say, n = 2 and p = 0.5?


> system.time(rbinom(10^7, 2, 0.5))
   user  system elapsed
   0.95    0.00    1.00

We can pass the binomial probabilities (via dbinom) as sampling weights to sample to generate the same distribution:


> system.time(sample(0:2, 10^7, TRUE, dbinom(0:2, 2, 0.5)))
   user  system elapsed
   0.34    0.01    0.36

So sample beats rbinom even for larger n!
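It's worth a quick sanity check that the two approaches really do draw from the same distribution. A small sketch (separate from the benchmarks above) comparing empirical proportions against the theoretical dbinom probabilities:

```r
set.seed(1)  # for reproducibility
n <- 2; pr <- 0.5; N <- 10^5

## Empirical proportions of 0, 1, 2 from each generator
p_rbinom <- prop.table(table(factor(rbinom(N, n, pr), levels = 0:n)))
p_sample <- prop.table(table(factor(sample(0:n, N, TRUE, dbinom(0:n, n, pr)),
                                    levels = 0:n)))

## Both rows should be close to the theoretical c(0.25, 0.5, 0.25)
round(rbind(theory = dbinom(0:n, n, pr),
            rbinom = p_rbinom,
            sample = p_sample), 3)
```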

Now the question is, how long does this relationship hold? Behold, a function to test it:


sampletestfunc <- function(startn = 2, maxn = 30, pr = 0.5){
  ## Test whether sample (with dbinom weights) is faster than rbinom
  samplebetter <- TRUE
  n <- startn
  while(samplebetter && n < maxn){
    binomt <- system.time(rbinom(10^6, n, pr))[3]
    samplet <- system.time(sample(0:n, 10^6, TRUE, dbinom(0:n, n, pr)))[3]
    samplebetter <- binomt > samplet
    cat("n =", n, samplebetter, binomt, "vs", samplet, "\n")
    n <- n + 1
  }
}

The results of this seem to suggest that sample beats rbinom resoundingly until n grows so large that the tail probabilities become very small and the sample function produces a warning.

Try:

sampletestfunc()
sampletestfunc(pr = 0.2)
sampletestfunc(pr = 0.7)
sampletestfunc(100, 110)
sampletestfunc(1000, 1010)
sampletestfunc(10000, 10010) ## Warning produced by sample
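To get a sense of just how small those probabilities get, note that for large n the tails of dbinom underflow to exactly zero in double precision (a quick illustration, separate from the benchmarks above; whether this underflow is the direct cause of the warning is a separate question):

```r
## 0.5^10000 is far below .Machine$double.xmin, so the tail underflows to 0
dbinom(0, 10000, 0.5)

## The mode is still comfortably representable (roughly 0.008)
dbinom(5000, 10000, 0.5)

## Many of the 10001 weights passed to sample are exactly zero
sum(dbinom(0:10000, 10000, 0.5) == 0)
```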

The more you know...