This report describes an update to the R package 'dvir' to add support for the LuaTeX engine. The immediate advantage of this support is the ability to draw typeset text in R with a wider variety of fonts and font features.

1. The 'dvir' package

The idea behind the 'dvir' package (Murrell, 2020, Murrell, 2018) is to typeset text using Donald Knuth's TeX system (Knuth, 1986), but render the result in R. The aim is to render the result using R graphics, while obtaining output identical to that produced by a "normal" TeX renderer like pdflatex.

For example, the following code typesets a simple equation using TeX and renders it in R. The image below the code is text drawn in R using the locations and fonts dictated by TeX.

The approach used by the 'dvir' package involves generating DVI output from TeX code. This DVI output provides precise information about every individual character of text, including where to draw each character and which font to use. The 'dvir' package reads the DVI information and converts it to 'grid' grobs (graphical objects) for rendering in R.

One of the major limitations of 'dvir' version 0.1-0 was that it focused only on the TeX Computer Modern fonts. These fonts produce nice mathematical equations, but can be quite limiting for normal text.

2. The LuaTeX engine

The LuaTeX engine (Hagen et al., 2020) differs from the original TeX engine in several important ways:¹ it works with Unicode text (so it is easy to specify a very wide range of characters); it provides support for modern font technologies like TrueType and OpenType; and it has an embedded scripting language (called Lua).

The first two of those features mean that, if we can add support for the LuaTeX engine to 'dvir', we will be able to produce a much wider range of typeset text in R graphics.

3. Adding LuaTeX support to 'dvir'

From a user perspective, there is not much to tell. This is the "ecstasy"; when everything works as planned, we get R graphics output that contains TeX-quality typesetting.

As a simple example, the following LuaTeX document, luatex-demo.tex, contains simple text that makes use of a Lato Light (non-Computer-Modern) font, within a paragraph that is 3 inches wide.

The following R code runs LuaTeX to typeset the text and generate DVI output, reads the DVI output into R and generates 'grid' grobs, then draws the grobs on an R graphics device.

The new features of this code, compared to the previous version of 'dvir', are the new arguments to grid.latex: engine, to specify that we want to use the LuaTeX engine, rather than the standard TeX engine; and preamble and postamble, which allow us to use a complete LuaTeX document as input, rather than automatically wrapping the input with LaTeX begin/end code.

The next section goes into the technical detail of getting LuaTeX support in 'dvir'. It is safe to skip ahead to the Examples Section, which contains more elaborate demonstrations of LuaTeX output within R graphics.

4. Technical details

This section describes the technical details of adding LuaTeX support. This is the "agony".

This detailed description serves two purposes: when things do not work as planned, these details may provide some suggestions for what went wrong (and perhaps even how to fix it); and this is an important record of the internal design, which is complicated, and will certainly be forgotten if not recorded properly.

DVI output from LuaTeX

The normal way to use LuaTeX is to run the lualatex program on a LuaTeX document, which produces a typeset PDF document. However, the 'dvir' package consumes DVI files, which we can get from LuaTeX by running the dvilualatex program instead.

The following bash code generates a DVI file, luatex-demo.dvi, from the simple LuaTeX document, luatex-demo.tex, which was shown above.

The DVI file mostly consists of the usual instructions that adjust the drawing location, like right3 and down3, and the usual instructions to draw a character, like set_char_83 (an 'S' character). However, one major difference between this DVI output from dvilualatex and the DVI output from latex is the fnt_def instruction, which defines a font to use for drawing text, and in particular its fontname parameter.

With the TeX engine, and standard Computer Modern fonts, the fontname is just the name of a font, like cmr12 (meaning Computer Modern Roman at 12 pt size). The fontname in the DVI above contains a font name, plus additional information about how the font is being used.

So the first complication for 'dvir' with LuaTeX support is that we must pull out the information that we need from the fontname parameter in fnt_def instructions. For now, we will just grab the font name at the start, in this case, LatoLight. We will come back to some of the other information later.
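The extraction step can be sketched as follows. This is a minimal illustration in Python, not the R code used by 'dvir'. The helper name font_name_from_fntdef and the example string are hypothetical, and the layout assumed here (a font name, then a colon, then a list of options) is only an assumption about what a LuaTeX fontname parameter looks like.

```python
def font_name_from_fntdef(fontname: str) -> str:
    """Return the leading font name from a fnt_def fontname parameter."""
    # Assumed layout (for this sketch only): "<font name>:<option>;<option>;..."
    cleaned = fontname.strip().strip('"')
    name, _, _options = cleaned.partition(":")
    return name


# Hypothetical fontname value; the real DVI output may differ in detail.
example = '"LatoLight:mode=node;script=latn;language=dflt;"'
print(font_name_from_fntdef(example))

# A plain TeX-style fontname has no extra information and is returned unchanged.
print(font_name_from_fntdef("cmr12"))
```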

Resolving LuaTeX font names

Having a font name is actually sufficient to draw some text on some graphics devices in R. For example, the Cairo graphics devices can make use of just the font name, as shown below.

However, this does not work for all LuaTeX text or for all R graphics devices; as we will see, what we really need is the actual font file that corresponds to the font that LuaTeX used. Fortunately, LuaTeX includes a tool called luaotfload-tool that can help us.

This font file will allow us to find out much more about the font than just its name and that will help us to use the font for more complex text.

Determining which character to draw

The simple LuaTeX example above only makes use of ASCII characters. This means that all of the DVI instructions to draw a character in that example are of the form set_char_n. The n gives the ASCII encoding for the character: e.g., set_char_83 means a capital 'S' character.

However, LuaTeX documents support Unicode text, so we can have, for example, text with accents or diacritics, like the c-cedilla, ç, in the document below.

... and the start of this DVI file is very much like the previous one (just with the 'D' from "Du" instead of the 'S' from "Some").

However, if we look at the DVI information for the text around the word "français", we see a set_char1 instruction (a single byte character outside the range 0-127), with a parameter e7.

This shows how LuaTeX expresses Unicode text (within the range 128-255). The value e7 is a UTF16BE encoding for the Unicode character "LATIN SMALL LETTER C WITH CEDILLA" (the full encoding is 00e7, but the 00 is dropped to allow the DVI to just record a single byte).

The next example demonstrates a full 2-byte UTF16BE encoding. This time we have a LuaTeX document that contains the character sequence "fi" (in the word "fine").

If we look at the DVI information around the word "fine", we see two interesting points: the "fi" has been reduced down to a single character; and that character is expressed as a set_char2 instruction (a 2-byte character), with parameter fb 01.

What has happened is that LuaTeX has replaced the two characters 'f' followed by 'i' with a single ligature character that combines the 'f' and the 'i' (to deal with the fact that the top of the 'f' and the dot on the 'i' may interfere with each other). The value fb 01 is the UTF16BE encoding for the Unicode character 'LATIN SMALL LIGATURE FI'.

So another part of supporting LuaTeX in 'dvir' is being able to convert these UTF16BE encodings into something that R understands. The 'dvir' package uses the iconv function to perform this conversion.
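As an illustration of that conversion, the following sketch decodes the two parameters seen so far (e7 and fb 01) and shows the resulting UTF-8 bytes. It is written in Python rather than R, so it stands in for the iconv call rather than reproducing it; the function name dvi_param_to_text is hypothetical.

```python
def dvi_param_to_text(param: bytes) -> str:
    """Decode a 1- or 2-byte DVI character parameter as UTF-16BE."""
    # A single byte such as e7 stands for 00 e7, so pad on the left to two bytes.
    padded = param.rjust(2, b"\x00")
    return padded.decode("utf-16-be")


c_cedilla = dvi_param_to_text(bytes([0xE7]))          # from set_char1
fi_ligature = dvi_param_to_text(bytes([0xFB, 0x01]))  # from set_char2

# Show each character alongside its UTF-8 encoding in hexadecimal.
print(c_cedilla, c_cedilla.encode("utf-8").hex())
print(fi_ligature, fi_ligature.encode("utf-8").hex())
```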

These UTF-8 character values are sufficient for drawing on Cairo-based R graphics devices at least, because the Cairo graphics device handles UTF-8 text.

The next example increases the complexity significantly. In this LuaTeX document, we have the character sequence "ti" (in the word "timely").

If we look at the DVI information around the word "timely", we again see that the "ti" has been reduced down to a single character (a ligature) and this time we have a set_char3 instruction (a 3-byte character), with parameter 0f 02 d5.

Similar to the "fi" example, the "ti" character sequence has been reduced to a single ligature. However, this example is more complicated because the "ti" ligature does not exist in Unicode. The bytes 0f 02 d5 do not represent an encoding for a Unicode character.

This lack of Unicode representation presents two problems: how do we map the 0f 02 d5 value to an R character value? and how do we express that character to an R graphics device?

Non-Unicode characters in LuaTeX DVI output

The fact that LuaTeX has generated DVI information with 3 bytes (with 0f as the first byte) is an indication that the character we need to draw is not a Unicode character.

The remaining two bytes, in this case 02 d5, are an integer index, in this case 725, that means we should use the 725th non-Unicode character in the current font. To be more accurate, it means that we should use the 725th non-Unicode glyph in the current font.
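The arithmetic can be checked with a short sketch (in Python, for illustration only; the function name is hypothetical). It treats a leading 0f byte as the non-Unicode marker, as described above, and reads the remaining two bytes as a big-endian integer.

```python
def non_unicode_glyph_index(param: bytes) -> int:
    """Return the index carried by a 3-byte set_char3 parameter."""
    if len(param) != 3 or param[0] != 0x0F:
        raise ValueError("expected three bytes starting with 0f")
    # The last two bytes form a big-endian integer: 02 d5 -> 2 * 256 + 213.
    return int.from_bytes(param[1:], "big")


index = non_unicode_glyph_index(bytes([0x0F, 0x02, 0xD5]))
print(index)  # 725
```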

If the expression "the 725th non-Unicode glyph in the current font" seems confusing to you, you are not alone. We need to unpack it a little to understand what is going on.

First of all, a character is a concept, while a glyph is a concrete symbol representing that concept. The character 'A' is represented by different glyphs in different fonts; the 'A' in a serif font like Times New Roman looks different to the 'A' in a monospace font like Courier.

A font is a collection of glyphs, most of which are shapes representing characters. A font also contains (or refers to) an encoding, which maps each glyph to a numeric value. For example, the Lato Light font contains a glyph representing the character 'S' and an encoding that maps the 'S' glyph to the number 83. It also contains a glyph representing the "fi" ligature and an encoding that maps the "fi" ligature glyph to the number 64257 (fb 01 in hexadecimal form).

Normally, when we draw text, we specify the characters to draw, the font to use, and (often implicitly) an encoding. The encoding maps the characters to the correct glyph within the font.

If we ignore the encoding, a font consists of just a collection of glyphs, from 1 to the number of glyphs in the font. We do not usually access fonts this way, but the TTX tool (part of the fonttools project; van Rossum et al., 2020) can extract this information for us. The following code extracts the names and order of the glyphs (the GlyphOrder table) from the font file Lato-Light.ttf to a new file called Lato-Light-GlyphOrder.ttx.

Looking at the start of the Lato-Light-GlyphOrder.ttx file, we can see that the glyph for 'A' is the fourth glyph in the font.

What we can also see from this TTX output is that each glyph has a name. In some cases, the name is quite familiar, e.g., "A" or "S", and in other cases, the name is less familiar, but still useful, because it points us to the Unicode code point for the glyph, e.g., "uniFB01" for the "fi" ligature. But there is a third set of glyph names that are totally inscrutable. These names are of the form "glyphi", where i simply reflects the rank of the glyph within the font.

If we treat glyphs with an inscrutable name as non-Unicode glyphs, we can find the 725th non-Unicode glyph within the font. The "ti" glyph is the 2472nd glyph in the font. (We have two off-by-one adjustments here, at least one of which is just accounting for zero-based versus one-based indexing.)
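The counting idea can be sketched on a toy glyph order. This is an illustration in Python under simplifying assumptions: the glyph list is invented, the pattern used to recognise inscrutable names is an assumption, and the off-by-one adjustments mentioned above are not reproduced.

```python
import re

# Assumption for this sketch: an "inscrutable" name is "glyph" followed by digits.
INSCRUTABLE = re.compile(r"glyph\d+")


def nth_non_unicode_glyph(glyph_order, n):
    """Return (position, name) of the n-th glyph with an inscrutable name.

    Both n and the returned position are one-based.
    """
    count = 0
    for position, name in enumerate(glyph_order, start=1):
        if INSCRUTABLE.fullmatch(name):
            count += 1
            if count == n:
                return position, name
    raise IndexError("font has fewer than n non-Unicode glyphs")


# Toy glyph order (hypothetical); names of the form glyphNNNNN use a
# zero-based index, so glyph00005 sits at one-based position 6.
toy_order = [".notdef", "A", "uniFB01", "glyph00003", "S", "glyph00005", "glyph00006"]
result = nth_non_unicode_glyph(toy_order, 2)
print(result)
```

In the toy list, the second non-Unicode glyph is glyph00005, which is the sixth glyph overall, mirroring the way glyph02471 is the 2472nd glyph in the real font.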

Non-Unicode characters in R graphics output

Having established which glyph within the font LuaTeX is referring to, we are still left with the problem of actually accessing that glyph from within R. When we draw text in R, what gets sent to an R graphics device is a character value, not a glyph number. We need a way to specify a character value that selects the glyph that we want.

The approach taken by 'dvir' to solve this problem involves creating a new mini-font that just contains the non-Unicode glyph (with an encoding that maps the glyph to an ASCII character). This requires several steps.

We can extract a single glyph from a font using pyftsubset from fonttools. The following code extracts glyph 2471 from Lato-Light.ttf and saves it in a new font called Lato-Light-glyph02471.ttf. The name-IDs argument will be explained later.

This new font contains no encoding, so the glyph within it is still inaccessible. Furthermore, the new font has exactly the same name as the old font, which will make it difficult to specify this new font separately from the original font. To rename the font and add the encoding, we can convert the font to an XML format, with TTX, modify the font name (edit the name table), insert an encoding (a cmap table), and then convert back to TrueType format.

In order to insert a cmap table, we need to know the name of the glyph in the new font. Unfortunately, this is not the same as the name of the glyph in the original font. The following code shows the GlyphOrder table for the new font.

There are a couple of surprises: there is a .notdef glyph that we did not ask for (we always get this); none of the glyphs are called glyph02471; and there are four glyphs besides .notdef, not just the one we asked for.

This has happened because the glyph that we want is actually composed from several other glyphs in the original font. We can see this by looking at the glyf table of the original font.

When there are no component elements, the glyph we want will be the only glyph in the subsetted font, so we can just use glyph00001 from the new font. But when the glyph we want is composed from other glyphs, the subsetted font will contain several glyphs, in this case four (the glyph we want plus the three that it was composed from).

There is now the problem of figuring out which of these four glyphs is the glyph that we want. To do this, we need to look at the names of the four glyphs and find out what order they are arranged in within the original font (and assume that that order is retained in the subsetted font).

In this case, the answer is quite simple because the names of the glyphs tell us their order; glyph02471 is the third out of these four glyphs (2471 comes after 294 and 1528, but before 3025).

In general, we can match the names of the glyphs from the glyf table to glyph names in the GlyphOrder table to determine their order.
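A minimal sketch of that matching step is shown below, in Python for illustration. It assumes, as the text does, that the subsetted font keeps the glyphs in their original relative order, and that .notdef occupies the first slot of the subset. The glyph names used for the components are hypothetical, as is the function name.

```python
def subset_glyph_name(original_order, kept_names, target):
    """Name that `target` is assumed to receive in the subsetted font."""
    # Keep only the retained glyphs, in their original relative order.
    kept_in_order = [name for name in original_order if name in kept_names]
    # .notdef takes index 0 in the subset, so the first kept glyph is glyph00001.
    index = kept_in_order.index(target) + 1
    return f"glyph{index:05d}"


# Toy stand-in for the original GlyphOrder table (hypothetical names).
original_order = [f"glyph{i:05d}" for i in range(3100)]
kept = {"glyph00294", "glyph01528", "glyph02471", "glyph03025"}

new_name = subset_glyph_name(original_order, kept, "glyph02471")
print(new_name)  # glyph00003
```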

In summary, the glyph that we want in the subsetted font has the name glyph00003. The R code below inserts a cmap table that maps the number 65 (hex 41, ASCII for character "A") to glyph00003 in the new font.
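For readers unfamiliar with the TTX XML format, the following Python sketch builds a cmap element of the kind described. It is not the R code used by 'dvir'; the choice of a single format 4 subtable for the Windows Unicode platform (platformID 3, platEncID 1) is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET


def make_cmap(code: int, glyph_name: str) -> ET.Element:
    """Build a TTX-style cmap element mapping one code to one glyph."""
    cmap = ET.Element("cmap")
    ET.SubElement(cmap, "tableVersion", version="0")
    # Assumption: one format 4 subtable, Windows platform, Unicode BMP encoding.
    subtable = ET.SubElement(
        cmap, "cmap_format_4", platformID="3", platEncID="1", language="0"
    )
    # TTX writes codes in hexadecimal, e.g. 65 becomes "0x41".
    ET.SubElement(subtable, "map", code=hex(code), name=glyph_name)
    return cmap


cmap = make_cmap(65, "glyph00003")
print(ET.tostring(cmap, encoding="unicode"))
```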

A similar approach can be used to give the new font a different name, by editing the name table within the XML. The name-IDs argument that we saw earlier in the call to pyftsubset exports the existing font name, so in this step we are just modifying the name table rather than creating a new one from scratch. The following code changes the "Font Family" name; in the full code, we also modify the "Full" font name, which includes possible modifiers like "Bold" or "Light", and the PostScript name for the font.

Finally, we write the modified XML to a new file and reverse the conversion from XML back to a TrueType font.

We can use fc-scan to see that the new font has a different name from the original font.

The last step (for Cairo graphics devices in R) is to make sure that Fontconfig (Packard, 2020, Packard, 2002) can see the new font. This can be done by creating a configuration file, as shown below, and placing that in a directory that Fontconfig can see.
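A configuration file of this kind can be very small. The Python sketch below generates one that adds a single font directory; the directory path is hypothetical, and the exact content of the file used by 'dvir' may differ.

```python
from xml.sax.saxutils import escape


def fontconfig_dir_config(font_dir: str) -> str:
    """Return a minimal Fontconfig file that adds one font directory."""
    return (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE fontconfig SYSTEM "fonts.dtd">\n'
        "<fontconfig>\n"
        f"  <dir>{escape(font_dir)}</dir>\n"
        "</fontconfig>\n"
    )


# Hypothetical directory holding the generated mini-fonts.
config_text = fontconfig_dir_config("/tmp/dvir-fonts")
print(config_text)
```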

We also need to force Fontconfig to look at this new configuration, which we can do using the 'systemfonts' package (Pedersen et al., 2020).

From R, we can now select this font and draw the "ti" ligature by asking the font for an 'A' character.

Because R graphics can only use one font for drawing a piece of text, we have to draw this special character as an individual piece of text. However, this is how 'dvir' works generally, because the DVI output that it works from contains information about every individual character, so this is not a problem (or at least not a new problem).

Character metrics

The DVI output that 'dvir' is working from contains two types of adjustments to the drawing location: explicit moves, e.g., the space between words and kerning adjustments between letters; and implicit moves based on instructions to draw a character. The latter adjusts the drawing location so that we can just draw the next character alongside this character (if this is not the end of a word and there is no kerning).

We can see this in the DVI from our very first example. The set_char_83 instruction draws an 'S', which implicitly adjusts the current location to just after the 'S'. The right2 makes an explicit kerning adjustment. We then draw an 'o', which implicitly adjusts the location, and then an 'm', which implicitly adjusts the location.
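The bookkeeping involved can be sketched as a small loop. This Python illustration uses a simplified instruction list and invented character widths and kerning values (all hypothetical); it only shows how implicit and explicit moves combine.

```python
def trace_positions(ops, widths):
    """Return the horizontal position at which each character is drawn."""
    x = 0
    drawn = []
    for op, value in ops:
        if op == "set_char":
            char = chr(value)
            drawn.append((char, x))
            x += widths[char]  # implicit move: advance by the character width
        elif op == "right":
            x += value         # explicit move: kerning or inter-word space
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return drawn


# Simplified version of the start of "Some": S, a kerning adjustment, o, m.
ops = [("set_char", 83), ("right", -5), ("set_char", 111), ("set_char", 109)]
widths = {"S": 50, "o": 55, "m": 85}  # hypothetical widths in arbitrary units

positions = trace_positions(ops, widths)
print(positions)
```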

In order to make the implicit adjustments, we must figure out the width of a character (so that we can shift the current location to just after the character).

R provides functions to calculate this font metric information, e.g., grid::stringWidth and grid::grobWidth, but on some Cairo graphics devices the information is not accurate enough (for drawing individual characters).

The following code demonstrates this problem. We create a 'grid' text grob containing the letter 'o' with the Lato Light font. We open a PDF graphics device (and load the 'extrafont' package so that PDF graphics devices can use system fonts like Lato Light) and calculate the width of the letter "o". ("bigpts" are PostScript's 1/72in, as compared to TeX "pts", which are 1/72.27in.)

Now we do the same thing on a Cairo PDF graphics device, but we get a different width (because the Cairo graphics devices only get metric values to the nearest "pixel").

What this means is that, in order to position text exactly the same as LuaTeX has described in its DVI output, we cannot use the Cairo graphics metric information.

Two solutions to this problem have been implemented. In the previous version of 'dvir', using the TeX engine, a PDF graphics device is used to calculate character metrics even on Cairo PDF devices. In this new version of 'dvir', when we use the LuaTeX engine and a non-Computer Modern font, character metrics are calculated by extracting the information from the font file directly, because using a PDF graphics device to calculate character metrics would be difficult given the very wide range of character input (Unicode) that we are trying to support.

Again, the first step is to use TTX to extract metric information to an XML format. This time we are extracting the hmtx (horizontal metrics) table.

We can see that the resulting file contains width information for each glyph in the font.

One complication is determining the scale of those widths. This requires looking at the head table from the font, in particular the unitsPerEm information.

The character width, in points, is the font size (in whole points) multiplied by the width metric (scaled to 1000 unitsPerEm and then scaled to a unit square). The following code and output show how we can obtain the correct width of an "o" character.
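The scaling can be written out as a short calculation. The sketch below is in Python for illustration, and the advance width, unitsPerEm value, and font size are hypothetical numbers rather than values read from Lato-Light.ttf.

```python
def char_width_points(advance: float, units_per_em: float, font_size_pt: float) -> float:
    """Convert an hmtx advance width to a width in points."""
    # Normalise the advance to a 1000-units-per-em design grid ...
    scaled = advance * 1000 / units_per_em
    # ... then to a unit square, and multiply by the font size.
    return font_size_pt * scaled / 1000


# Hypothetical values: advance 1100 in a 2000-unit em, drawn at 10pt.
width = char_width_points(advance=1100, units_per_em=2000, font_size_pt=10)
print(width)  # 5.5
```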

All that remains is to locate the correct character metric. We can see from Lato-Light-hmtx.ttx that this requires finding the correct glyph name for the character.

The fontTools documentation describes how the glyph names in the TTX files are determined and 'dvir' attempts to mimic that in order to obtain an appropriate glyph name. For example, an 'o' character will be a set_char_111 (6F in hex) in DVI and we can use the Adobe Glyph List to map 006F to the glyph name "o".

Because the glyph names in TTX output do not always use the Adobe Glyph List names, we also generate glyph names of the form uniXXXX, where XXXX is the relevant Unicode code point (in hexadecimal). Furthermore, in some fonts, a single glyph may be used for multiple characters and, in that case, the font may not contain a glyph with the expected name. To help with this case, we also generate a glyph name from the Unicode cmap table within a font (if it exists).

For example, a "-" (dash or hyphen or "hyphen-minus") character will be a set_char_45 (2D in hex) in DVI. The Adobe Glyph List maps 002D to the name "hyphen", so that is the first name that we will try. Unfortunately, a font may not contain a glyph named "hyphen".

We also try the glyph name uni002D, but in this case, there is no glyph of that name either.

Finally, we look in the Unicode cmap table to see which glyph is being used for the code point 002D.

This tells us that the glyph uni00AD (a "soft hyphen") is being used for code point 002D ("hyphen-minus"), so we look for that name as well. And there is metric information for that glyph.
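The three-step fallback can be summarised in a short sketch. This is a Python illustration with toy lookup tables (the table contents and width values are hypothetical, and the function names are not part of 'dvir'); it tries the Adobe Glyph List name, then the uniXXXX name, then the name from the font's Unicode cmap table.

```python
def candidate_glyph_names(code_point, agl, cmap):
    """Glyph names to try, in order, for a Unicode code point."""
    names = []
    if code_point in agl:
        names.append(agl[code_point])        # 1. Adobe Glyph List name
    names.append(f"uni{code_point:04X}")     # 2. uniXXXX form
    if code_point in cmap:
        names.append(cmap[code_point])       # 3. name from the font's cmap
    return names


def lookup_width(code_point, agl, cmap, hmtx):
    """Return (glyph name, width) for the first candidate found in hmtx."""
    for name in candidate_glyph_names(code_point, agl, cmap):
        if name in hmtx:
            return name, hmtx[name]
    raise KeyError(f"no metric found for U+{code_point:04X}")


# Toy tables (hypothetical): this font has no "hyphen" or "uni002D" glyph.
agl = {0x2D: "hyphen", 0x6F: "o"}
cmap = {0x2D: "uni00AD", 0x6F: "o"}
hmtx = {"o": 1100, "uni00AD": 600}

hyphen_metric = lookup_width(0x2D, agl, cmap, hmtx)
print(hyphen_metric)
```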

Armed with all of that information, we can now consume DVI output from LuaTeX and draw the result in R. Of course, the 'dvir' package wraps all of that detail within a more convenient interface, e.g., the grid.latex function. The next section provides more demonstrations of the use of that interface.

5. Examples

As demonstrated by the simple example in the Section Adding LuaTeX support to 'dvir', one benefit of adding LuaTeX support to the 'dvir' package is that we can easily make use of fonts beyond the Computer Modern family. To be clear, this is extending the range of fonts that we can use to draw text with TeX-quality typesetting in R graphics; it was already possible to draw a normal R character value with a wide range of fonts, but that sort of text is not typeset with any sophistication.

The next example makes it clearer that we are typesetting text (using LuaTeX) rather than just relying on R's text-drawing facilities. The LuaTeX document below describes text with several interesting features: the main font is Economica (a Google font that has been downloaded, in TrueType format, to a local file); the first line of text is bold; and the remaining text is typeset in a paragraph with steadily decreasing line length.

The following code combines that typeset text with an R plot (in R). First, we draw a simple barplot. Then we call grid.latex from the 'dvir' package to typeset the text using LuaTeX and draw the result within the R plot. Some of the clues that tell us this is typeset text rather than just R drawing a character value are the stretching of space between words (e.g., on the first line of non-bold text) and the hyphenation of some words at line breaks.

The next example demonstrates that, in addition to providing an easy way to select a wider range of fonts, adding LuaTeX support to 'dvir' provides access to LuaTeX's sophisticated font handling features. For example, the following LuaTeX document typesets the same piece of text twice: once with the standard Computer Modern font (actually LuaTeX selects the Latin Modern font, but that is very similar to Computer Modern) and once using the Lato Light font with so-called "discretionary" ligatures (ch, ct, ck) and "old-style" numerals (with varying heights and alignments) selected. These font features are not accessible from R's standard text drawing functions.

The final example comes from Figure 4.1 in Thomas Rahlf's "Data Visualisation with R" (Rahlf, 2017). In the original example in the book, an R plot is included within a LaTeX document with the LaTeX text overlaid on the plot using the LaTeX package 'overpic' (i.e., using LaTeX to combine the plot and text). The example below demonstrates an R-driven alternative, where we overlay text on a plot using R to combine the two.

The R code for the plot is provided in a separate file (downloaded from the book web site). The result of running the code is shown below; note the use of the Lato Light font for labelling.

The LaTeX code that describes the text to overlay on the plot is shown below (also taken from the book web site). Important points to note here are the use of Lato Light font and the fact that the text is typeset in a two-column format.

The following code performs the overlay of the text on the plot using 'dvir'. A little bit of set up is necessary to convert the original 'graphics' plot to a 'grid' plot (using the 'gridGraphics' package; Murrell and Wen, 2020, Murrell, 2015), but the drawing of the text requires just a single function call to grid.latex.

One important thing to note is how the text is positioned relative to the plot; the top-left of the text is exactly 1cm in from the top-left corner of the plot region. This very explicit and precise positioning is in contrast to the LaTeX-driven approach of the original figure in Rahlf's book. The relevant piece of LaTeX code that performs the positioning in that case is shown below; the position (60,128), which dictates the location of the text on the plot, is 6cm in and 12.8cm up from the bottom-left of the entire plot image; it is not relative to the plot region within the plot. That is very much a trial-and-error location compared to the deliberate and expressive positioning that is possible when combining the text with the plot in R.

6. Discussion

The 'dvir' package provides a way to include typeset text within an R plot. The new version of the 'dvir' package, which adds support for the LuaTeX engine, provides access to a much wider range of typeset text, including a wider range of fonts.

On the downside, many limitations of the original package remain. Drawing is very slow and support is limited to specific R graphics devices (so far). Even worse, LuaTeX support is only offered for Cairo-based graphics devices (at this point). Furthermore, the package has only been tested (and is only expected to run) on Linux; it only makes use of Linux-based font-related tools like Fontconfig and fonttools. This package remains a proof-of-concept.

On the other hand, for anyone working on Linux and/or prepared to make use of Docker, it might provide a way to produce graphical results that are not obtainable any other way.

When the package does not produce the desired result, the most likely source of problems is fonts. LuaTeX may make use of system fonts or TeX fonts. For system fonts, 'dvir' uses 'extrafont' to specify the font to R graphics. If a font is not found, try running extrafont::font_import first to make sure that 'extrafont' knows about all of your fonts and/or install additional font packages on your system. The extrafont::font_import function can also be used to make sure that 'dvir' knows about local fonts that are not installed on the system (like the Economica font used in one example in this report).

7. Technical requirements

The examples and discussion in this document relate to version 0.2-0 of the 'dvir' package and the development version of the 'systemfonts' package (for the reset_font_cache function).

This report was generated within a Docker container (see the Resources section below).

8. Resources

How to cite this document

Murrell, P. (2020). "The Agony and the Ecstacy: Adding LuaTeX support to 'dvir'" Technical Report 2020-02, Department of Statistics, The University of Auckland.

9. References

Footnotes

¹ Some terminology: A TeX document consists of a combination of text to typeset and (low-level) TeX commands that describe the typesetting. A LaTeX document consists of a combination of text to typeset and (higher-level) LaTeX commands that describe the typesetting. A LuaTeX document consists of a combination of text to typeset and (higher-level) LaTeX commands that describe the typesetting and (optionally) Lua code to do crazy things. A TeX document is processed by a TeX engine, an executable program, to produce a typeset document, in some format. Most TeX engines provide a LaTeX variant that processes LaTeX documents. The original TeX engine provides a latex program to typeset a LaTeX document in DVI format. The pdfTeX engine provides a pdflatex program to typeset a LaTeX document in PDF format. The LuaTeX engine provides a lualatex program to typeset a LuaTeX document in PDF format and a dvilualatex program to typeset a LuaTeX document in DVI format.

The Agony and the Ecstacy: Adding LuaTeX support to 'dvir'
