What do you need to connect people with data?
Access to data
Domain knowledge
Data Science skills
Statistical Graphics skills
Graphical Design skills
Few individuals possess all of these
Some problems:
Some solutions:
Open Data, Open Government, Open Access
Education and experience
Software that does everything for you
Software that lets everyone do a small piece
Allows more ways to contribute
Allows small contributions
Allows contributions to be combined
"Birth rates - DFMA (Annual-Dec)",""
"","Total Population"
1855,39.25
1856,37.81
1857,39.47
1858,38.24
...
birthrate-file.R
# Read in original data source from Stats NZ ...
# 'brsrcfile' ("LTD404701_20140509_101154_10.csv")
# ... and tidy it to produce nicer CSV ...
# "birthrate.csv"
lines <- readLines(brsrcfile)
# Drop any line that does not start with a digit
writeLines(lines[grep("^[0-9]", lines)], "birthrate.csv")
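For readers more comfortable in Python, the same tidying step can be sketched as below. This is only an illustration, not part of the original pipeline: the function names `tidy_lines` and `tidy_file` are hypothetical, and the sample rows are the ones shown above.

```python
import re


def tidy_lines(lines):
    """Keep only the lines that start with a digit (the year,rate rows)."""
    return [line for line in lines if re.match(r"[0-9]", line)]


def tidy_file(src_path, dest_path="birthrate.csv"):
    """Read the raw export and write the tidied CSV (hypothetical helper)."""
    with open(src_path) as src:
        lines = src.read().splitlines()
    with open(dest_path, "w") as dest:
        dest.write("\n".join(tidy_lines(lines)) + "\n")


# The first rows of the raw file, as shown on the earlier slide
sample = [
    '"Birth rates - DFMA (Annual-Dec)",""',
    '"","Total Population"',
    "1855,39.25",
    "1856,37.81",
]
print(tidy_lines(sample))
```

As in the R version, the two descriptive header lines are dropped because they do not begin with a digit.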
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="fileSystem"/>
  <description><![CDATA[This module provides a CSV file call ...
  <output name="brsrcfile" type="external" ref="data/LTD404701_20140509_101154_10.csv"/>
</module>
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="R"/>
  <description><![CDATA[This module takes a CSV file and pro ...
  <input name="brsrcfile" type="external"/>
  <output name="brfile" type="external" ref="birthrate.csv"/>
  <source ref="src/birthrate-file.R"><![CDATA[]]></source>
</module>
<pipeline xmlns="http://www.openapi.org/2014/" version="0.1">
  <component name="brsource"/>
  <component name="birthrate"/>
  <component name="brplot-R"/>
  <pipe>
    <start component="brsource" name="brsrcfile"/>
    <end component="birthrate" name="brsrcfile"/>
  </pipe>
  <pipe>
    <start component="birthrate" name="brfile"/>
    <end component="brplot-R" name="brfile"/>
  </pipe>
</pipeline>
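A pipeline like this is essentially a small dependency graph: each pipe says that the "end" component needs an output of the "start" component. The sketch below shows one way a glue system might derive a run order from the XML. It is an assumption for illustration only, not how 'oaglue' or 'conduit' actually work, and `run_order` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # Python 3.9+

# Namespace used by the pipeline XML shown above
NS = {"oa": "http://www.openapi.org/2014/"}

PIPELINE_XML = """<pipeline xmlns="http://www.openapi.org/2014/" version="0.1">
  <component name="brsource"/>
  <component name="birthrate"/>
  <component name="brplot-R"/>
  <pipe>
    <start component="brsource" name="brsrcfile"/>
    <end component="birthrate" name="brsrcfile"/>
  </pipe>
  <pipe>
    <start component="birthrate" name="brfile"/>
    <end component="brplot-R" name="brfile"/>
  </pipe>
</pipeline>"""


def run_order(pipeline_xml):
    """Return component names in an order that respects every pipe."""
    root = ET.fromstring(pipeline_xml)
    sorter = TopologicalSorter()
    for comp in root.findall("oa:component", NS):
        sorter.add(comp.get("name"))
    for pipe in root.findall("oa:pipe", NS):
        start = pipe.find("oa:start", NS).get("component")
        end = pipe.find("oa:end", NS).get("component")
        sorter.add(end, start)  # 'end' can only run after 'start'
    return list(sorter.static_order())


print(run_order(PIPELINE_XML))
```

Because the pipeline only declares connections, with no loops or conditionals, a plain topological sort is enough to decide what runs when.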
library(oaglue)
p <- readPipeline("birthrate-pipe")
results <- runPipeline(p)
          compname         name        type       format formatType
brsrcfile "birthrate-pipe" "brsrcfile" "external" ""     "text"
brfile    "birthrate-pipe" "brfile"    "external" ""     "text"
brsvg     "birthrate-pipe" "brsvg"     "external" ""     "text"
          ref
brsrcfile "data/LTD404701_20140509_101154_10.csv"
brfile    "birthrate-pipe/Components/birthrate/birthrate.csv"
brsvg     "birthrate-pipe/Components/brplot-R/birthrate-R.svg"
Andrew Balemi wanted to add an annotation to the Wiki New Zealand plot of the NZ birth rate to mark the end of World War II (the onset of the baby boom).
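# birthrate.csv has no header row, so read every line as data
births <- read.csv(brfile, header=FALSE, col.names=c("year", "births"))
svg("birthrate-R.svg")
plot(births, type="l")
# Vertical line at the end of World War II (2 September 1945), as a fractional year
abline(v=1945 + as.numeric(as.Date("1945-09-02") - as.Date("1945-01-01"))/365)
dev.off()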
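The position of the vertical line is a date converted to a fractional year, so that it lands on the same numeric axis as the yearly data. The arithmetic can be checked with a small sketch in Python; `fractional_year` is a hypothetical helper name, and the 365-day divisor simply mirrors the R expression.

```python
from datetime import date


def fractional_year(d, days_per_year=365):
    """Year plus the fraction of the year elapsed by date d."""
    elapsed_days = (d - date(d.year, 1, 1)).days
    return d.year + elapsed_days / days_per_year


# 2 September 1945 is 244 days after 1 January 1945
end_of_war = fractional_year(date(1945, 9, 2))
print(round(end_of_war, 3))  # 1945.668
```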
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="R"/>
  <description><![CDATA[This module reads a CSV file and pro ...
  <input name="brfile" type="external"/>
  <output name="brsvg" type="external" ref="birthrate-R.svg"/>
  <source ref="src/birthrate-plot-custom.R"><![CDATA[]]></source>
</module>
<pipeline xmlns="http://www.openapi.org/2014/" version="0.1">
  <component name="brsource"/>
  <component name="birthrate"/>
  <component name="brplot-R"/>
  <component name="brplot-R-custom"/>
  <pipe>
    <start component="brsource" name="brsrcfile"/>
    <end component="birthrate" name="brsrcfile"/>
  </pipe>
  <pipe>
    <start component="birthrate" name="brfile"/>
    <end component="brplot-R" name="brfile"/>
  </pipe>
  <pipe>
    <start component="birthrate" name="brfile"/>
    <end component="brplot-R-custom" name="brfile"/>
  </pipe>
</pipeline>
Scripts can be in any programming language
import numpy as np
import matplotlib.pyplot as plt

# birthrate.csv has no header row: column 1 is the year, column 2 the birth rate
year, births = np.loadtxt(brfile, unpack=True, delimiter=",")
# The x values are plain years, not matplotlib date numbers, so use plot()
plt.plot(year, births, "r-")
plt.grid(True)
plt.savefig("birthrate-py.svg")
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="python"/>
  <description><![CDATA[This module reads a CSV file and pro ...
  <input name="brfile" type="external"/>
  <output name="brsvg" type="external" ref="birthrate-py.svg"/>
  <source ref="src/birthrate-plot.py"><![CDATA[]]></source>
</module>
<pipeline xmlns="http://www.openapi.org/2014/" version="0.1">
  <component name="brsource"/>
  <component name="birthrate"/>
  <component name="brplot-R"/>
  <component name="brplot-py"/>
  <pipe>
    <start component="brsource" name="brsrcfile"/>
    <end component="birthrate" name="brsrcfile"/>
  </pipe>
  <pipe>
    <start component="birthrate" name="brfile"/>
    <end component="brplot-R" name="brfile"/>
  </pipe>
  <pipe>
    <start component="birthrate" name="brfile"/>
    <end component="brplot-py" name="brfile"/>
  </pipe>
</pipeline>
What if all of the New Zealand youth (18-24) who did NOT vote in 2011 voted for the Internet Party in 2014?
What do I need?
Access to data
Domain knowledge
Data Science skills
Statistical Graphics skills
Graphical Design skills
The accessibility of the data is problematic
What if there was a nice web site that already made pictures of these data?
What if those pictures were part of a modular and reusable framework?
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="fileSystem"/>
  <output name="nvfile" type="external" ref="data/non-voters.csv"/>
</module>
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="fileSystem"/>
  <output name="popfile" type="external" ref="data/TABLECODE7511_Data_821b2c90-79e3-4462-9994-4ae796f6e654.csv"/>
</module>
<?xml version="1.0"?>
<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="R"/>
  <input name="nvfile" type="external"/>
  <input name="popfile" type="external"/>
  <output name="nonvoters" type="internal"/>
  <output name="pop2013" type="internal"/>
  <output name="pop2013grouped" type="internal"/>
  <source ref="src/tidy.R"><![CDATA[]]></source>
</module>
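A module file is just a declaration of a platform, named inputs, named outputs, and a script to run, so any tool can inspect it without running anything. The sketch below reads those fields from the module above using Python's standard XML parser. It is an illustration only: `describe_module` is a hypothetical function, not part of 'oaglue' or 'conduit'.

```python
import xml.etree.ElementTree as ET

# Namespace used by the module XML shown above
NS = {"oa": "http://www.openapi.org/2014/"}

MODULE_XML = """<module xmlns="http://www.openapi.org/2014/" version="0.1">
  <platform name="R"/>
  <input name="nvfile" type="external"/>
  <input name="popfile" type="external"/>
  <output name="nonvoters" type="internal"/>
  <output name="pop2013" type="internal"/>
  <output name="pop2013grouped" type="internal"/>
  <source ref="src/tidy.R"><![CDATA[]]></source>
</module>"""


def describe_module(module_xml):
    """Summarise what a module needs, what it produces, and what it runs."""
    root = ET.fromstring(module_xml)
    return {
        "platform": root.find("oa:platform", NS).get("name"),
        "inputs": [e.get("name") for e in root.findall("oa:input", NS)],
        "outputs": [e.get("name") for e in root.findall("oa:output", NS)],
        "source": root.find("oa:source", NS).get("ref"),
    }


info = describe_module(MODULE_XML)
print(info["inputs"], "->", info["outputs"])
```

A summary like this is the kind of information a search or recommendation site for modules could index.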
What if you could search for an existing script?
Or request a script?
Or request a module wrapper for an existing script?
Or write your own wrapper on an existing script?
The purple wedge shows what proportion of the overall vote the Internet Party would get.
Where did I get 5.51 from?
My statement is informed
We can share and remix the data and the code
Our discussion can be informed
We want to connect people with data
more informed individuals
more informed discussion
We propose a framework that enables small contributions
more ways to contribute
more contributors
more contributions
We have made a start
modules
pipelines
glue systems
Two (partly compatible) experimental frameworks are under development (both on GitHub): 'oaglue' and 'conduit'
These are R packages that have functions to read, write, and run modules and pipelines.
These slides and the resources used to create them, including example data, scripts, modules, and pipelines, are available at: https://www.stat.auckland.ac.nz/~paul/Talks/OpenAPI2014/.
Bradley Drayton helped with some early exploratory work supported by a Faculty of Science Research Development Grant.
Ashley Noel Hinton was supported by a Faculty of Science Research Development Grant.
The birth rate data, 2011 election data, and population estimates are all from Statistics NZ.
The 2011 election results pie chart came from the Rock Enrol web site.
As simple as possible
no looping constructs or conditionals
As few concepts as possible
modules, pipelines, glue system
As independent as possible
script author, module author, pipeline author can all be different people
As open as possible
open source, open data, open access
As portable as possible
cross platform, language agnostic
Multiple glue systems
Repositories of modules and pipelines
Request sites for modules and pipelines
Search, ratings, and recommendations for modules and pipelines
Tools for creating modules and pipelines
(including GUIs)
Automated pipeline creation
(from simple script markup)
Debugging modules and pipelines
Documentation of modules and pipelines
Reliability of modules and pipelines
Availability of required packages/libraries
Speed of execution, strong typing, security
Bundling modules and pipelines (and resources)
Relative/absolute and remote/local resources
Creating general-purpose modules
Capturing "non-functional" scripts as modules
Modules for "annotating" data
Why should people contribute?