by Paul Murrell http://orcid.org/0000-0002-3224-8858
This document is licensed under a Creative Commons Attribution 4.0 International License.
This document describes a proof-of-concept for producing R demonstration videos in a fully-automated manner. The "script" for the video consists of a text file containing code chunks paired with text commentary. The video is produced by running the code while recording a screen capture, using text-to-speech software to record audio of the commentary, then combining video and audio with appropriate timings and pauses.
As part of some paranoid preparations for a conference presentation at NZSA 2016, I wanted to create some short videos of R code demonstrations. The presentation slides included samples of R code and I was planning to run the R code samples live as part of the presentation. As a backup, in case the R code did not work on the day, I wanted to produce videos of the R code working.
I have dabbled with creating short videos before and drew two main conclusions from the experience: my decision not to pursue a career in Hollywood was a sound one; and generating videos is an expensive process in terms of human time.
The cost of generating a video is of course compounded by the fact that, inevitably, any single video must be generated multiple times. For example, any mistake in typing R code, or in narrating the commentary requires a new recording. Furthermore, any changes to the R code or to the commentary at a later date force additional recordings.
This situation closely mirrors the creation of figures and R output in written reports, where manual cutting-and-pasting of images and R output used to waste a lot of time. That problem has been solved through the adoption of literate documents and tools like Sweave and knitr that include R code within the document itself, with automated processing taking care of embedding images and inserting R output in the final report.
This document describes an attempt to bring that same level of automation to the generation of videos that demonstrate R code samples.
The task of creating a video of running R code combined with audio commentary was broken into the following sub-tasks: writing a description of the video (R code chunks plus text commentary); recording audio of the text commentary; setting up the desktop and feeding code to an R session; recording a video of the R code running on screen; and combining the audio and video into a complete movie.
Each of these sub-tasks is addressed in a separate section below. This is only a proof of concept, which means that, reflecting the nature of my work environment, the solutions are mostly based on Ubuntu Linux commands.
The first step is to create a description of the video, consisting of R code chunks and text commentary. The examples in this document are based on a simple XML document structure for the video description. An example of a simple video description file, demo.xml, is shown below.
R code is contained within an <action> element and text commentary is within a <dialogue> element; a <shot> indicates a pairing of code and commentary that should start at the same time; a <scene> may contain one or more <shot>s; and the complete <script> consists of the scenes run one after another.
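A minimal file with this structure might look something like the following (the code and commentary here are purely illustrative):

    <script>
      <scene>
        <shot>
          <action>
            plot(mpg ~ disp, mtcars)
          </action>
          <dialogue>
            This code draws a scatter plot of fuel efficiency against
            engine displacement for the mtcars data set.
          </dialogue>
        </shot>
      </scene>
    </script>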
XML is used as the description format because it is so easy and reliable to extract components from XML. For example, the following code extracts all of the dialogue content from demo.xml.
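A sketch of that extraction, here using the 'xml2' package:

    library(xml2)
    ## Read the video description and pull out all <dialogue> content
    script <- read_xml("demo.xml")
    dialogue <- xml_text(xml_find_all(script, "//dialogue"))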
It is also very easy and reliable to transform from one XML document to another. The slides for my presentation were XML (HTML), which meant that it was easy to extract R code examples, plus (hidden) text commentary, to produce an XML video description with the structure above.
Having extracted the text commentary from the video description, the text-to-speech program espeak can be used to automate the recording of the text commentary (to a WAV file).
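A sketch of that step, assuming the commentary has been extracted into the character vector dialogue (as above); espeak's -w option writes the synthesized speech to a WAV file rather than playing it:

    ## Record the first piece of commentary to a WAV file
    system(paste("espeak -w dialogue.wav", shQuote(dialogue[1])))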
For timing purposes, we need to determine the duration of each audio segment. This can be achieved with the 'tuneR' package in R, by dividing the number of samples by the sampling rate.
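For example, a sketch using 'tuneR' (the file name follows on from the espeak step above):

    library(tuneR)
    recording <- readWave("dialogue.wav")
    ## duration (in seconds) = number of samples / sampling rate
    audioLength <- length(recording@left) / recording@samp.rate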
Also for timing purposes, the recording may need padding with silence at the end; this can also be achieved with 'tuneR'. The following code generates 3 seconds of silence, makes the recording ready for appending (by finding a spot where the recording is zero), and appends the silence to the end of the recording.
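A sketch of those steps with 'tuneR' (the three-second pad and file names are illustrative):

    ## Three seconds of silence with the same parameters as the recording
    quiet <- silence(duration = 3, xunit = "time",
                     samp.rate = recording@samp.rate, bit = recording@bit)
    ## prepComb() trims the end of the recording to a zero crossing so the
    ## join is clean; bind() then appends the silence
    padded <- bind(prepComb(recording, where = "end"), quiet)
    writeWave(padded, "dialogue-padded.wav")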
Recording a video of R code running on screen requires a combination of several tools. The first step is to arrange the screen. For the purposes of my presentation, this meant clearing the desktop and arranging a terminal window to run R within.
A tiny package called 'wmctrl' has been created to provide an R interface to the wmctrl command, which allows us to perform this sort of task. For example, the following code clears the desktop (minimizes all windows).
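As a rough sketch, the effect is the same as invoking the underlying command directly via system():

    ## "Show desktop" mode, which minimizes all open windows
    system("wmctrl -k on")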
The openWindow function from 'wmctrl' can be used to run a program that opens a window. For example, the following code opens a terminal. The return value from openWindow is a unique identifier for the window and we can use this to locate and size the window: the code first ensures that the window is not maximized in either direction and then sets the position and size of the terminal window.
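A sketch of the calls involved; the openWindow() argument and the geometry values are illustrative, and the resizing step is shown via the underlying wmctrl command rather than the corresponding package function:

    library(wmctrl)
    ## Open a terminal window; the return value identifies that window
    term <- openWindow("xterm")
    ## Make sure the window is not maximized in either direction ...
    system(paste("wmctrl -i -r", term,
                 "-b remove,maximized_vert,maximized_horz"))
    ## ... then set its position (x, y) and size (width, height)
    system(paste("wmctrl -i -r", term, "-e 0,0,0,800,600"))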
We can clean up the desktop once we are finished by calling closeWindow.
The next step is to run R in a terminal window. This is straightforward: we just pass further arguments to openWindow.
However, in addition to running R, we want to feed R code to the R session. For this task, another tiny package called 'xdotool' has been created as a simple wrapper around the xdotool command. In the following code, the first step is to ensure that the terminal with R running has focus. The typestring function then simulates key strokes, which are consumed by the R command line.
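A sketch of those steps (the extra openWindow() argument and the R code being typed are illustrative):

    library(wmctrl)
    library(xdotool)
    ## Open a terminal that immediately runs R
    rterm <- openWindow("xterm", "-e R")
    ## Give that window the focus, then simulate typing an R expression
    focusWindow(rterm)
    typestring("plot(mpg ~ disp, mtcars)\n")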
An additional detail about this example is that the result of feeding that code to R is an R graphics window. Further functions from 'wmctrl' allow us to get information about that window and control its position as well.
The 'xdotool' package also provides functions for moving the mouse and simulating mouse clicks.
The next step is to capture an R session as a video. A tiny package called 'ffmpeg' has been created as a simple wrapper around the ffmpeg command, which can help with recording and transforming videos. The code below records 5 seconds of screen activity and saves it to the file video-feedR.webm in WebM format (with the VP8 video codec); these are patent-unencumbered formats that should be supported by most modern browsers.
For this video, we feed the code from demo.xml. The screenInput function is used to tell ffmpeg to capture the screen as input for the video. The fileOutput function is used to describe the name and format of the output file that ffmpeg should generate. The argument wait=FALSE means that the R session will not block while the video is recording, so that the subsequent focusWindow and typestring calls will be executed during the video recording.
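A sketch of that recording step; the screenInput() and fileOutput() argument details are assumptions, and code holds the code chunk extracted from demo.xml:

    library(ffmpeg)
    ## The code chunk to demonstrate, extracted from demo.xml
    code <- xml_text(xml_find_all(script, "//action"))[1]
    ## Record 5 seconds of the screen; wait = FALSE returns immediately,
    ## so the typing below happens while the recording is in progress
    ffmpeg(screenInput(duration = 5),
           fileOutput("video-feedR.webm"),
           wait = FALSE)
    focusWindow(rterm)
    typestring(code)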
The feeding of code to the R session can be made more complex. For example, code can be drip-fed one character at a time to simulate a person typing the code. We can also type spaces as fast as possible (it is tedious to watch indenting being typed one character at a time). The following code lightly processes the code chunk from demo.xml to break it into blocks of whitespace and non-whitespace, and makes use of the delay argument to typestring to slow down the typing (of non-space characters). It also creates a video of the resulting activity on screen.
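A sketch of that processing (the delay value is illustrative; the delays are in milliseconds):

    ## Split the chunk into alternating runs of whitespace and non-whitespace
    pieces <- regmatches(code, gregexpr("\\s+|\\S+", code))[[1]]
    for (piece in pieces) {
        if (grepl("^\\s", piece)) {
            typestring(piece)               # type spaces as fast as possible
        } else {
            typestring(piece, delay = 100)  # slow down "real" typing
        }
    }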
The final step is to combine the audio and video streams. Simply "muxing" an audio file with a video file is straightforward with the ffmpeg function from the 'ffmpeg' package. In the following code, there are two input files, one audio stream and one video stream, which are combined into a single video file. The WAV audio input is re-encoded using the Vorbis audio codec and the VP8 video input is included without re-encoding.
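A sketch of that call; fileInput() and the codec arguments are assumed names, shown only to illustrate the idea:

    ffmpeg(fileInput("dialogue-padded.wav"),
           fileInput("video-feedR.webm"),
           fileOutput("video-with-audio.webm",
                      audioCodec = "libvorbis",  # re-encode WAV audio as Vorbis
                      videoCodec = "copy"))      # include the VP8 video as-is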
Several shots can be combined one after the other using concatInput with ffmpeg. The following code just concatenates the video-with-audio.webm movie with itself.
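A sketch of that call (the argument form for concatInput() is an assumption):

    ffmpeg(concatInput(rep("video-with-audio.webm", 2)),
           fileOutput("video-combined.webm"))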
The only difficulty with combining multiple <shot>s arises with aligning the audio and video segments. In the simple case above, the audio is slightly longer than the video, but ffmpeg automatically pads the video to match. Things get more complicated when we have multiple shots of text commentary and code chunks, because we want there to be pauses in either the audio or the video so that the two streams are aligned at the start of each shot.
The basic algorithm for aligning shots is as follows:
    for each shot {
        audioLength = audio duration
        codeLength = (numChars - numSpaces) * delayBetweenChars
        if (audioLength > codeLength) {
            pause after typing code
        } else {
            add silence to end of audio
        }
    }
A tiny package called 'director' has been created to provide a convenient wrapper around the steps described above. This package supports a slightly more complex XML format for the script description. An example is provided in the file demo-2.xml.
The 'director' package provides a single main function, shootVideo, that produces a movie from the XML script. The return value of the function is a list containing paths to the video files that are produced: the complete movie, plus the individual scenes.
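A sketch of its use (the component names in the result are assumptions):

    library(director)
    ## Produce the complete movie, plus one video per scene, from the script
    result <- shootVideo("demo-2.xml")
    result$movie    # path to the complete movie
    result$scenes   # paths to the individual scene videos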
The XML script file has the following structure:
The script consists of a <stage> element followed by one or more <scene> elements.
The <stage> element has x, y, width, and height attributes that describe the area of the screen that will be captured on video. The <stage> contains one or more <location> elements.
Each <location> element has id, program, x, y, width, and height attributes. These are used to open a window at the specified location, using the named program. The id attribute defaults to a numeric index. The dimensions of the <stage> default to the bounding box of the <location> elements.
An example of a <location> element is shown below.
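For instance (attribute values illustrative):

    <location id="console" program="xterm"
              x="0" y="0" width="800" height="600"/>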
A <scene> element has an id attribute that defaults to a numeric index. A separate video is produced for each scene and the scenes are strung together to produce the complete movie. A <scene> element contains one or more <shot> elements.
A <shot> element has location and duration attributes. The location should match the id of one of the <location> elements within the <stage> element; this is used to give focus to the correct window for typing code. The duration gives the shot duration in seconds; if specified, this overrides the duration of the audio and code within the shot.
A <shot> element contains zero or one <action> elements and zero or one <dialogue> elements. It is possible for a <shot> to be empty and just have a duration attribute (as shown below); this produces a pause in the video.
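For example (the value is illustrative):

    <shot duration="3"/>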
An <action> element contains code that is typed into the window identified by the location attribute of the parent <shot> element. Because code is just typed in a window, the code does not have to be R code. The example below is used to type R in a terminal to start R.
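A sketch of such a shot (the location id is illustrative):

    <shot location="console">
      <action>R</action>
    </shot>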
An <action> element may have an echo attribute and, if this is set to "FALSE", the code is not sent to the window with focus. The code is run, but only within the R session that is coordinating the video recording. In the example, demo-2.xml, this is used to reposition an R graphics window (as shown below).
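A sketch of the idea; the function called here is a hypothetical stand-in for the 'wmctrl' code that actually locates and moves the graphics window:

    <shot location="console">
      <action echo="FALSE">
        ## hypothetical helper that uses 'wmctrl' to move the graphics window
        repositionGraphicsWindow()
      </action>
    </shot>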
An <action> element may also have keydelay and linedelay attributes. These control the delay between typing (non-space) characters and the delay after each newline in a chunk of code (both in milliseconds). The former can be used to speed up the typing of code and the latter can be useful to pause after an expression that may take a moment to execute (as shown below).
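For example (the values are illustrative), a keydelay of 0 types the code as fast as possible, while a linedelay of 2000 pauses for two seconds after the newline so that a slow expression can finish:

    <action keydelay="0" linedelay="2000">
      plot(x, y)
    </action>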
The original motivation for this work was to automate the generation of a video of code examples for a conference presentation.
It turned out to be a good idea to produce the video because the live demonstration of the code examples did not work during the conference presentation and I was able to fall back on the video. However, the true value of automating the creation of the video was reflected in the fact that I generated the video several times. Because the video generation was automated, it was quick and easy to make changes to the code in the examples, to the text commentary, to the timing, or even to the order of the scenes and shots in the video.
In summary, this work has already proven its value for my own purposes.
The only other time that I have attempted to produce short videos of R code demonstrations was to support teaching. I have done this only sparingly in the past because of the time and effort cost involved in generating videos, but even more because of the cost of updating and maintaining them. I am hopeful that this automated approach to generating videos will encourage me to make more videos to support my teaching.
Another benefit of automating video generation from a simple script is that it becomes easy to share and reuse videos. It also becomes easy to place a video script under version control (e.g., host the video script on GitHub). The ability to easily (and programmatically) modify and regenerate a video may facilitate the production of multiple versions of a video, e.g., translations of the text commentary into other languages. The approach could also be expanded to the automatic generation of "scene selection" menus, etc.
The major limitation of this proof-of-concept solution is that it is Linux-only at this stage. The ffmpeg program is cross-platform, so there is some hope that the 'ffmpeg' package might be portable to other platforms, but replacements would need to be sought for 'wmctrl' and 'xdotool' on Windows, for example.
Another issue with this solution is that, although recording is automated, the desktop is unusable for the duration of the recording; because the desktop itself is being captured, I have to step away from the keyboard and mouse while recording is happening. This problem can be avoided by running the entire recording off-screen, using something like Xvfb. This is the approach taken for the Docker build of this document (see the Resources section).
The audio recording and manipulation component of this work was based on the existing 'tuneR' package. The 'wmctrl' and 'xdotool' packages were created because there did not seem to be any existing package that could help with setting up the desktop and simulating key strokes. The 'RSelenium' package exists for simulating interactions with a web browser, but that is specific to keyboard and mouse activity in a browser window.
Several existing packages provide a limited interface to ffmpeg, including 'animation' (for running still frames together to create a video) and 'imager' (for loading images from a video), but none appeared to provide a generic interface like the 'ffmpeg' package is attempting to do. The 'rDVR' package provides the ability to capture a video of the screen, but it is based on the Monte Media Library rather than ffmpeg and it does not automate the running of R code or the audio recording of text commentary like the 'director' package does.
In the Python world, there is a 'MoviePy' module (which is based on the Python module 'imageio', which is a front-end for ffmpeg). 'MoviePy' provides similar support for scripting videos (see, for example, the star worms tutorial). This module is a more low-level and general-purpose tool than 'director', with no prescribed "script" format and no specific focus on running code examples with text commentary. It might provide an excellent basis for creating a Python equivalent of 'director'.
There are many ways in which this proof-of-concept could be expanded and improved:
The quality of the audio could be improved by exploring options for the espeak text-to-speech step. For example, the MBROLA project provides some more natural-sounding voices that should work with espeak.
The quality of the video could similarly be improved by exploring the options that ffmpeg provides.
There are many more features of ffmpeg that could be added to the interface of the 'ffmpeg' package (e.g., video filters to manipulate video content).
The examples and discussion in this document relate to 'wmctrl' version 0.1-2, 'xdotool' version 0.1, 'ffmpeg' version 0.1-1, and 'director' version 0.1.
This document was generated within a Docker container (see Resources section below).
wmctrl: http://tripie.sweb.cz/utils/wmctrl/
xdotool: http://www.semicomplete.com/projects/xdotool/
eSpeak: http://espeak.sourceforge.net/index.html
FFmpeg: https://ffmpeg.org/
MoviePy: http://zulko.github.io/moviepy/
"Video for Everybody": http://camendesign.com/code/video_for_everybody
"Dive Into HTML5", Video on the Web: http://diveintohtml5.info/video.html