Paul Murrell
The University of Auckland
This talk is about how I am currently creating research publications and why I think the process and tools that I use are absolutely genius.
I acknowledge that not everyone will have the technical background or interest or freedom to indulge in these technologies.
Document (noun): a computer file that contains text that you have written.
http://www.merriam-webster.com/dictionary/document
Document (verb): support or accompany with documentation.
https://en.oxforddictionaries.com/definition/document
I will use the word "document" to describe the piece of work that I create to document my research. This is instead of more loaded words like report (which implies something less formal or worthy) or article (which implies something that must be published in a journal).
I care about not wasting my time and effort. I care about producing something good. I care about everyone being able to access my work and pass it on to others. I care about other people being able to repeat what I did and build on what I did. SIX things I care about!
Working Efficiently and Effectively
This section covers tools that I use to *create* a document.
<h2>Namespaces</h2>
<p>
It is now possible to add SVG content to a web page with
the 'DOM' package.
</p>
<rcode>
page = htmlPage()
appendChild(page, svg, ns="SVG")
</rcode>
<p class="img">
<a href="svgPage.html">
<img style="border: solid 1px" src="svgPage.png"/>
</a>
</p>
This is an example of the sort of text file that I write to create a document. EVERYTHING within the file is text. Images are references to external files. The file consists of content plus a description of the structure of the content (e.g., what is a heading, what is normal text and what is code, ...). This is the same idea as a LaTeX file or a Markdown file, but NOT the same as a Word document.
I do not write the final document; I write a text description of the final document, which is processed by various software tools to generate the final document. MANY programs can create and modify text files, so you are not tied to a particular piece of software to write your document AND you are not placing any burden on anyone else who wants to view or use your document. It is easy to write code to generate text, which means we can automate the generation of parts of a document. This is the basis for tools like 'Sweave' and 'knitr' in R. Writing text also means that I can write exactly what I mean and I can write complicated things - this becomes more important when I layer XML structure on top ...
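As a rough illustration of "code that generates text", the sketch below builds a few HTML table rows from data. It is written in Python purely for illustration (the tools named above are R-based), and the function name `table_rows` and the sample records are hypothetical.

```python
from html import escape


def table_rows(records):
    """Return one HTML <tr> line per (name, value) pair."""
    return "\n".join(
        # escape() protects characters such as < and & that have
        # special meaning in HTML.
        f"<tr><td>{escape(str(name))}</td><td>{escape(str(value))}</td></tr>"
        for name, value in records
    )


# Hypothetical data; in a literate document this would come from an analysis.
print(table_rows([("radius", 50), ("a<b", 1)]))
```

Because the result is plain text, it can be pasted or piped straight into the text description of the document.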
<h2>Namespaces</h2>
<p>
It is now possible to add SVG content to a web page with
the 'DOM' package.
</p>
<rcode>
page = htmlPage()
appendChild(page, svg, ns="SVG")
</rcode>
<p class="img">
<a href="svgPage.html">
<img style="border: solid 1px" src="svgPage.png"/>
</a>
</p>
This is an example of the sort of XML that I write to create a document. It is basically HTML, plus some tags that I made up for my own convenience. It is VERY easy to transform this to pure HTML. It is also very easy to transform it to other things (e.g., .Rhtml).
One weakness of text files is that the content can be unstructured, which makes it hard to process. XML solves this problem by using tags to label the structure of the document. Tools like XPath and XSLT take advantage of that structure to provide excellent processing tools. This means that we can do more than just process from XML to a publication format (e.g., HTML). We can also process to other formats, e.g., these slides come from an XML file that has been transformed to HTML, but the same XML file also gets transformed to a .txt file of speaker notes (and a modified HTML document for printing as handouts). The fact that XML is an Open Standard means that there are lots of editors and processing tools for working with XML (so a low burden for me and for anyone I want to share or collaborate with). In effect, I create a document by writing code and the computer creates the document from my code. BUT the code that I am writing is in a very simple language. I use XML rather than Markdown because Markdown does not give me enough control over the final HTML format and because Markdown is limited to generating the final format, whereas XML allows for a much wider range of transformations. However, Markdown also works if XML is too much, especially if you have the freedom to select Markdown as part of your workflow and you have the freedom to make use of the final format that Markdown produces for you. LaTeX just does not process as easily as XML and it only easily transforms to PDF.
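To make the "one XML source, several outputs" idea concrete, here is a minimal sketch that pulls speaker notes out of a slide file as plain text. It uses Python's standard library rather than XSLT, and the element names `slide` and `notes` are hypothetical (they are not necessarily the tags used for these slides).

```python
import xml.etree.ElementTree as ET


def speaker_notes(xml_text):
    """Return the notes of each slide as numbered plain-text lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for number, slide in enumerate(root.iter("slide"), start=1):
        for note in slide.findall("notes"):
            # Gather all text inside the notes and collapse runs of whitespace.
            text = " ".join("".join(note.itertext()).split())
            lines.append(f"Slide {number}: {text}")
    return "\n".join(lines)


# Hypothetical source document with two slides.
source = """<talk>
  <slide><h2>Namespaces</h2><notes>EVERYTHING in the file is text.</notes></slide>
  <slide><h2>XML</h2><notes>Tags label the <em>structure</em>.</notes></slide>
</talk>"""
print(speaker_notes(source))
```

The slide content is simply ignored here; a second function over the same file could just as easily keep the slides and drop the notes.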
<xsl:template match="//rcode">
<xsl:comment>begin.rcode
<xsl:apply-templates/>end.rcode</xsl:comment>
</xsl:template>
$ xsltproc replace.xsl document.xml > document.Rhtml
This is an example of the XPath and XSLT code used to transform an XML document (to an .Rhtml file). The XPath bit is '//rcode', which matches ANY rcode element ANYWHERE in the document. The XSLT bit is everything else, which says, IF you get a match to '//rcode', start an XML comment, followed by 'begin.rcode', followed by the content of the rcode element, followed by 'end.rcode', followed by the end of the XML comment. The 'xsltproc' program can be used to apply the XSLT code to the XML document.
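For anyone who does not use XSLT, the same rule can be approximated with Python's standard library. This is only a sketch, not the stylesheet used for the talk: it assumes that each rcode element contains nothing but text, and it passes every other element through unchanged. The function name `rcode_to_comments` is hypothetical.

```python
import xml.etree.ElementTree as ET


def rcode_to_comments(xml_text):
    """Replace every <rcode> element with a begin.rcode/end.rcode comment."""
    root = ET.fromstring(xml_text)
    # Take a snapshot of the elements first so the tree can be edited safely.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "rcode":
                comment = ET.Comment(
                    "begin.rcode\n" + (child.text or "") + "end.rcode"
                )
                comment.tail = child.tail  # keep the text that followed <rcode>
                parent[index] = comment
    return ET.tostring(root, encoding="unicode")


source = """<body>
<h2>Namespaces</h2>
<rcode>
page = htmlPage()
appendChild(page, svg, ns="SVG")
</rcode>
</body>"""
print(rcode_to_comments(source))
```

The loop over every element plays the role of the '//rcode' pattern, and the comment that is built plays the role of the xsl:comment element.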
<h2>Namespaces</h2>
<p>
It is now possible to add SVG content to a web page with
the 'DOM' package.
</p>
<!--begin.rcode
page = htmlPage()
appendChild(page, svg, ns="SVG")
end.rcode-->
<p class="img">
<a href="svgPage.html">
<img style="border: solid 1px" src="svgPage.png"/>
</a>
</p>
This is an example of the .Rhtml file that is produced by processing the XML document that I wrote using XPath and XSLT.
A literate document allows code chunks to be included in the document so that, when the document is processed, the code can be run to generate some of the document content.
$ Rscript -e 'library(knitr); knit("document.Rhtml")'
The process is now: I write an XML document, then I transform it with XPath and XSLT to a .Rhtml document, then I transform it again with 'knitr' to generate an HTML document.
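The two processing steps can be chained in a small driver script. The sketch below is one possible way to do that, using the commands shown earlier (with the '-o' option of 'xsltproc' in place of the shell redirect); the function names and the file stem "document" are hypothetical, and it assumes that 'xsltproc' and R with 'knitr' are installed.

```python
import subprocess


def pipeline_commands(stem, stylesheet="replace.xsl"):
    """Return the commands that turn stem.xml into stem.html, in order."""
    return [
        # Step 1: XML -> .Rhtml (XPath and XSLT)
        ["xsltproc", "-o", f"{stem}.Rhtml", stylesheet, f"{stem}.xml"],
        # Step 2: .Rhtml -> .html (knitr runs the code chunks)
        ["Rscript", "-e", f'library(knitr); knit("{stem}.Rhtml")'],
    ]


def build(stem):
    """Run each step, stopping at the first one that fails."""
    for command in pipeline_commands(stem):
        subprocess.run(command, check=True)


# Usage (requires the external tools): build("document")
```

A Makefile would do the same job; the point is only that every step is a plain command that can be scripted.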
<h2>Namespaces</h2>
<p>
It is now possible to add SVG content to a web page with
the 'DOM' package.
</p>
<pre>
page = htmlPage()
appendChild(page, svg, ns="SVG")
</pre>
<pre class="knitr r">
<svg xmlns="http://www.w3.org/2000/svg">
<circle r="50"/>
</svg>
</pre>
<p class="img">
<a href="svgPage.html">
<img style="border: solid 1px" src="svgPage.png"/>
</a>
</p>
This is an example of the HTML that is produced by processing the .Rhtml file that was produced by processing the literate XML document that I wrote.
The final document that I produce is an HTML document. HTML does not do typesetting as well as LaTeX does, but you can still produce a nice-looking result (like these slides), usually by making use of someone else's efforts with CSS. HTML is a great publication format because it is an Open Standard, so there are lots of (free) viewers (including web browsers), so no burden is placed on your audience. Where HTML beats other options (like PDF) is that it is easy to produce dynamic and interactive effects in HTML. HTML is also text and (informal) XML, so it inherits all of their nice editing and processing properties. For example, suppose that the tool I used to generate the final HTML document (from my XML document) does not produce EXACTLY what I want; with HTML, I can easily tweak the final result with further processing (much more easily than I could if I had generated a PDF document as my final document). The relevance of HTML as a publication format is demonstrated by the fact that traditional publishers now offer HTML versions of articles online. Some web browsers now have "native" support for viewing PDFs, but not as part of a web page. PDF is not a web format.
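As an example of "further processing", the sketch below adds a CSS class to every element of a given type in a generated HTML fragment. It assumes that the HTML happens to be well-formed XML (which real-world HTML often is not), and the function name `add_class` and the class name "code" are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_class(html_text, tag, class_name):
    """Add class_name to the class attribute of every <tag> element."""
    root = ET.fromstring(html_text)
    for element in root.iter(tag):
        classes = element.get("class", "").split()
        if class_name not in classes:
            classes.append(class_name)
        element.set("class", " ".join(classes))
    return ET.tostring(root, encoding="unicode")


generated = '<div><pre>page = htmlPage()</pre><pre class="knitr r">output</pre></div>'
print(add_class(generated, "pre", "code"))
```

A tweak like this is a few lines of code on an HTML file; the equivalent change to a finished PDF would be much harder.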
This is an example of an HTML report with an interactive feature: click on the plus to toggle visibility of the code chunks.
If the publication format is HTML, then the best format for images is SVG. It is vector (rather than raster), so it looks good at any size, it is an Open Standard so there is lots of software support (all browsers now have native support), PLUS it is XML with all the benefits previously mentioned about structured text. It is also easy to add dynamic and interactive features to an SVG image, which is handy if you are writing a document that describes the creation of dynamic and interactive statistical graphics.
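Because SVG is XML text, an image can be generated with the same tools as the rest of the document. Here is a minimal sketch that draws a single circle; the function name `circle_svg` and the default fill colour are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialise SVG elements without a namespace prefix.
ET.register_namespace("", SVG_NS)


def circle_svg(radius, fill="steelblue"):
    """Return the text of an SVG image containing one filled circle."""
    size = str(2 * radius)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": size, "height": size})
    ET.SubElement(
        svg,
        f"{{{SVG_NS}}}circle",
        {"cx": str(radius), "cy": str(radius), "r": str(radius), "fill": fill},
    )
    return ET.tostring(svg, encoding="unicode")


print(circle_svg(50))
```

The result is a short piece of text that can be saved as a .svg file or embedded directly in an HTML page.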
This is an example of an SVG image (within an HTML document) with an interactive feature: drag the blue rectangle to scroll the window shown in the large plot.
This process is similar to the R Markdown workflow that we have our 20x students use (right?), EXCEPT that I am working at a lower level, so it is more flexible, and I understand more about what is happening, and I have more of a focus on HTML as the final document format and SVG as the graphics format.
Open Access and Sharing
Having created a document using efficient and effective tools, we now turn to the issue of disseminating the document. The tools in this section focus on making it as efficient as possible for others to access and make use of my work.
Electronic distribution implies BOTH an electronic format (rather than hard copy) AND availability on the world wide web. The best way (the only sensible way) to publish a document is in electronic format (rather than print). Copies are (virtually) free, copying is fast and *exact*, and we gain features like colour and interactivity. It is not controversial to claim that an electronic format is good, but how many of you are still preparing documents for print (e.g., PDFs in A4 page format)? I enjoy writing a document with the screen as my main format (e.g., freely using colour and interactive content). It is possible to format HTML nicely for print, but the screen format is now the primary concern. JSS was originally aiming for HTML, but backslid to LaTeX/PDF.
Distribution of an electronic document via the web means that the publication can be accessed from virtually anywhere virtually instantly. Distribution on the web can be just a matter of placing material on a public web server, but existing search engines and social media can further increase the visibility and discoverability of the material. People care about these things - the web gives them to us for free. Again, it is not crazy to suggest that putting a document on the web is good, but how many of you are writing for the web (e.g., producing HTML documents or interactive SVGs)? I enjoy writing documents for the web.
If you are conducting publicly funded research, the results have already been paid for. It makes sense to provide the results as openly as possible. The CC BY licence fits this situation perfectly.
... the owner of the copyright in this work, hereby grants and assigns exclusively to ASA, IMS, and IFNA all rights in and to the above manuscript now or hereafter protected by the copyright laws of the United States and all foreign countries, including, but not limited to, the right to prepare derivative works and the right to publish the manuscript in computer-based formats.
(emphasis mine)
In the traditional publishing model the copyright is signed over to the publisher and they limit access in order to charge for access (business model). Note that the transfer of copyright is quite substantial and quite persistent. My children are unlikely to live long enough to see my article enter the public domain. This is not good. I enjoy publishing documents with a CC BY licence.
You are free to: Share — copy and redistribute the material in any medium or format Adapt — remix, transform, and build upon the material for any purpose, even commercially. Under the following terms: Attribution — You must give appropriate credit ... No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
(emphasis mine)
Contrast the expressions used here with those used in the traditional publisher contract. These words are compatible with sharing and unrestricted reuse. This is good.
Publishing outside of a traditional journal makes it possible to really take advantage of the available tools. We are not restricted by journal format rules, we are not hindered by the slow peer review process, we are not restricted by journal copyright assertions. If we no longer need to publish through a journal, we can think about escaping other artificial constraints like bundling articles into volumes or issues. A publication can be published on its own. For example, the Journal of Statistical Software ONLY publishes in electronic format and it publishes individual articles. However, JSS and the R Journal still dictate the format (PDF) and still have a slow review process. I withdrew an article from JSS after 2 years in the review process because the software had changed so much that the article had essentially become a bunch of lies!
http://stattech.wordpress.fos.auckland.ac.nz/
If you don't want to run your own web server, the department has a Technical Blog where you can easily publish a document. My latest publication is a technical report published on the department's technical blog. I have now authored (or co-authored) 29 publications on the technical blog. DIY publishing allows for a greater variety of publishing models - rather than a one-size-fits-all journal article, we can have shorter or longer pieces of work. Smaller publications allow for documenting smaller pieces of work, such as student projects. More than 10 of my publications on the technical blog are based on student projects (BScHons, Masters, or Summer Scholarship). There are several more that are single-authored student publications (PhD or research assistant). I could also have used something like arXiv, though that is still focused on preprints of print articles.
In addition to making a document available *now*, we should be concerned with the document *remaining* available. There are independent services like figshare. These promise to provide persistent storage and increase visibility for works. UoA has its own figshare portal. UoA library has ResearchSpace. This has the advantage of having a reasonable chance of existing for as long as UoA exists.
Reproducibility and Reuse
We have already talked about literate documents and tools like 'knitr', which make it easy to create a document that can be reproduced by someone other than the original author. This part of the talk describes some other important pieces that we can share. The focus is still on others having access to my work, but we have moved on to allowing others to do more than just *view* the work.
In addition to the final HTML document, I also distribute ...
EVERYTHING needed to reproduce the report is available online under a permissive licence. The open standard text formats and permissive licences also make it easy to create new work based on these materials.
This shows the list of materials provided for one of my recent publications. A lot of these materials have nice features like the fact that they are all text files. However, even with all of these resources, there is no guarantee that someone else has all of the tools available to work with them (I work on Linux). This is where Docker comes in ...
Dockerfile
# Base image
FROM ubuntu:16.04
MAINTAINER Paul Murrell <paul@stat.auckland.ac.nz>
# Install additional software
RUN apt-get update && apt-get install -y \
xsltproc \
r-base=3.3.1* \
wget \
libxml2-dev \
libcurl4-openssl-dev \
libssl-dev
A Dockerfile is a text description of a computer (operating system and installed programs); a sort of virtual machine.
Docker image
docker build -t pmur002/dom-v0.2 .
docker run pmur002/dom-v0.2 make DOM-v0.2.html
docker login -u=pmur002
docker push pmur002/dom-v0.2
We can build a Docker image from a Dockerfile, then we can run that image (virtual machine) and run a command within the virtual machine. We can also publish the Docker image on the internet (DockerHub) so that others can easily access and reuse the virtual machine (e.g., to reproduce the research document).
Recognition
This section is about "recognition" in the sense of avoiding confusion and ambiguity with regard to who wrote a document. There are a couple of slides at the end that briefly address "recognition" in the sense of being rewarded for work (with fame and professional advancement).
ORCiD provides a unique identifier for every researcher, so that I cannot be confused with someone else, just in case there is another Paul Murrell in the world who happens to specialise in producing slow statistical graphics software.
Adopting modern tools leads to self-sufficiency,
which leads to ...
Technology is allowing us to do more and more of the publishing process ourselves. DIY publishing means greater access to and freedom to explore these technologies. Another way to look at it is that these technologies allow you to work (on your publication) more like a programmer - yet another reason for computing to have a greater presence in our curriculum!
If you are a publicly-funded researcher
and your primary goal is to disseminate
the results of your work
as efficiently
and effectively and rapidly
and widely and persistently
and reproducibly as possible
(and you have a weakness/affinity for computer tech)
there are some great tools available to
help you to
achieve your goal.
My *primary* concern is publishing and sharing my work. The problem of *measuring* my worth is *secondary*. I believe that is the correct order. Recently, I have been conducting small pieces of work, consisting of adding new features to an R package. Within a very short time, I can develop the R package, make it available, write a document describing the changes and demonstrating their use (if only as a reminder to myself of what I have done!), and make that document available. The entire research cycle can occur within the space of a couple of weeks.
Publish for Pleasure by Paul Murrell is licensed under a Creative Commons Attribution 4.0 International License.
If you want independent peer review, there are already services offering that (for a price), e.g., Rubriq.
Once it becomes easy to publish, the number of publications is less meaningful. The emphasis is more on quality, as measured by things like citations, or other measures of how much your work is used and valued by others.