STATS 784 Statistical Data Mining

Welcome to STATS 784. This webpage is supplementary to Canvas. Rather than making lots of announcements I tend to put potentially useful material here.

Notices

Some office hours during the exam period:

Fri  9 June 15:00--16:00
Thu 15 June 15:00--16:00
Fri 16 June 15:00--      online only
Mon 19 June 15:00--16:00
Wed 21 June 16:00--      online only
Fri 23 June 15:00--      online only

I'll log in to Zoom at the beginning of the hour and log off soon after that if nobody is around.

Thanks to those who filled in the SET Course and Teaching Evaluations. Your feedback will be invaluable the next time the course is taught.
Typo on p.87 of ch.1: 0.0756 should be 0.756. Sorry!

Assignments

Please show your working in general when doing assignments. Interpret/comment on any plots that you produce. Handwritten assignments are acceptable; if so you need to upload scans or photos---but upload all .R files too. Any computer code that you use should be displayed so that the marker can see it and its output. I highly recommend knitr .Rnw files or Rmarkdown .Rmd files for reproducibility of research; they are fantastic software tools. Please knit them into .pdf if possible (LaTeX might need to be installed on your machine); however, HTML or .doc should be okay too.

You must not copy each other; the penalty is high. And please don't submit ChatGPT output.

General Info

Often, presentation marks will be given. Marks will be deducted for messy and hideously complicated code. Have you ever noticed that experts can do things really compactly and elegantly? In contrast, amateurs write much more code to try to do the same thing, and it's unreadable and unmaintainable.
In your assignments I will sometimes change a question slightly depending on your UPI. Please don't confuse your UPI with your ID number. Your UPI looks like jcha054 so the first four characters are not digits but the last three characters are.
The S language implemented in R is elegant, and most things can be done nice and neatly. Messy and long-winded code just shows everybody that you are an amateur, a novice. Here is a list of common mistakes... they represent bad style. In contrast, some good style is presented. Please avoid these mistakes in your assignments otherwise marks will be deducted.
- New variables derived from variables in a data frame should be created within that data frame too, e.g.,
```
  feuro.df <- transform(feuro.df, bmi = weight / height^2)  # Create 'bmi'
```
  or
```
  feuro.df$bmi <- with(feuro.df, weight / height^2)
```
  or
```
  feuro.df$bmi <- feuro.df$weight / feuro.df$height^2
```
  are okay, but
```
bmi <- feuro.df$weight / feuro.df$height^2
```
  is a standalone vector in the workspace, which is bad style. I like transform() best. And to access a variable in a data frame without a dollar sign, use with(), e.g.,
```
with(feuro.df, bmi)
myss <- with(feuro.df, smooth.spline(age, bmi))
```
  To obtain a subset of a data frame, use subset(), e.g.,
```
subgp1.df <- subset(feuro.df, age < 30 & chol > 1.5)
```
  To sort by a certain variable in a data frame, use order(), e.g.,
```
ooo <- with(feuro.df, order(age))
feuro.df <- feuro.df[ooo, ]
```
  Sorting is a good idea if lines() is used.
- Plots should have a suitable title, xlab and ylab. If the variables are in a data frame then
```
  plot(y ~ x, data = d.df, main = "main", xlab = "xlab", ylab = "ylab")
```
  is the most basic of plots. This is bad:
```
plot(d.df$x, d.df$y)
```
  The font should be readable, e.g., not too small. And the aspect ratio of plots should be okay. If colour is used then make sure it can be read okay when printed in black and white. Use different plotting symbols, use different line types, add a legend, etc. I find that, 90% of the time, plot() followed by points() and lines() is sufficient. For example,
```
plot(runif(99), runif(99), main = "mymain", las = 1, sub = "mysub",
     xlab = "myxlab", ylab = "myylab", col = "blue")
points(runif(9), runif(9), col = "red", pch = "+", cex = 2)
lines(sort(runif(9)), runif(9), col = "limegreen", lty = "dashed")
```
  Occasionally matpoints() and matlines() are useful for adding multiple lines and points in colour.
- Extract quantities from fitted objects using generic functions, e.g., fitted(fit), predict(fit), resid(fit), coef(fit). Don't use fit$fitted as it's bad style.
- Similar to plots, S formulas should be like
```
glm(y ~ x2 + x3, poisson, data = d.df)
```
  It should not be any of
```
glm(d.df$y ~ x2 + x3, poisson, data = d.df)
glm(d.df$y ~ d.df$x2 + d.df$x3, poisson)
```
  (Why?)
- If the VGAM package is used, followed by the mgcv package, then remember to detach("package:VGAM") beforehand. That's because s() is different between packages.
- Don't give variables any of the following names: c, t, s, C, gamma, beta, D, I, F, T. The reason is that R already has functions or objects with those names, and masking them can break code elsewhere.
- If attach() is used then there should be a matching detach(). However, attach() is not recommended at all.
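As a minimal sketch of two of the points above (extractor functions and name masking), consider the following. The data frame d.df and its values are made up purely for illustration; they are not from the course. The first part shows one practical reason to prefer extractor functions: when a model is fitted with na.action = na.exclude, fitted() and resid() pad their results with NA so that they line up with the rows of the data frame, whereas the raw component does not. The second part shows that T is an ordinary variable that can be overwritten.

```r
# Hypothetical toy data (not from the course): one response value is missing.
d.df <- data.frame(x = 1:6, y = c(1.1, NA, 2.9, 4.2, 4.8, 6.1))

# na.exclude drops the incomplete row for fitting but remembers where it was.
fit <- lm(y ~ x, data = d.df, na.action = na.exclude)

# Extractor functions return one value per row of d.df, padded with NA,
# so the results can be stored back in the data frame.
d.df <- transform(d.df, fv = fitted(fit), res = resid(fit))

# Direct component access skips that padding. Note also that fit$fitted
# only works through partial matching of the component name 'fitted.values'.
n.extractor <- length(fitted(fit))        # one value per row of d.df
n.component <- length(fit$fitted.values)  # one value per complete row only

# Masking a built-in name: T is just a variable, so it can be overwritten.
T <- 0
masked <- isTRUE(T)    # T no longer means TRUE
rm(T)                  # remove the mask; T refers to the base object again
restored <- isTRUE(T)
```

Here n.extractor is 6 and n.component is 5, so only the extractor output can be added to d.df without further work. Writing TRUE and FALSE in full avoids the masking problem altogether.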

Miscellanea

Here is a list of links collected over the years. They are FYI only.

What editor do data scientists use? One used by hard core developers is emacs. It can be used for R, LaTeX, C, .Rnw, and many other file types. It is highly extensible, powerful, runs on different operating systems, and will probably serve your entire career even if you start learning it today. There is potentially so much to learn that you'll never learn it all. I use it, and even though I probably know only about 5% of it, that's more than enough. On another note, many hard core developers use operating systems such as Linux or Mac: they are often command driven and are excellent for software development because all the tools are there or are easily installed.
A short article from BBC business section about data science.
Stats NZ custom data request, which is a free service for researchers: see this link.
An influential article entitled 50YearsofDataScience.pdf by Donoho.
An article entitled Taking R to its limits: 70+ tips. My rejoinder to his comments on the slower speed of VGAM is that the package focusses on generality rather than speed. It offers a lot more features than the two alternatives he mentions, hence is not as fast. If one of those two models is all the modeller is interested in, by all means use the alternative, but there are huge advantages using a larger framework, e.g., http://www.jstatsoft.org/v32/i10/, which is a "Highly Cited Paper" according to the Web of Science ("As of May/June 2017, this highly cited paper received enough citations to place it in the top 1% of its academic field based on a highly cited threshold for the field and publication year.") Some possibly useful R packages: Rfast and Rfast2.
Looking for a data science job? Here are some job application tips.
To learn SAS Enterprise Miner by yourself try the following file: cross-sell.swf. It refers to an old version so it is probably out-of-date. It is some sort of media file, and you might have to apply some plugin for your web browser to get it working. Firefox under Linux has no problems. It might be very useful but be warned: it is boring!
R is no. 5 in the IEEE list of top 10 programming languages for 2016.
The Global Terrorism Database is an interesting data set...
LAWA connects you with New Zealand's environment through sharing scientific data: http://www.lawa.org.nz. In other words, it helps you find good fishing spots!
Miscellaneous data sets available from https://communities.sas.com/docs/DOC-4361.
Big data sets available from http://aws.amazon.com/public-data-sets/.
27 free books on DM analysis is a superset of 9 free books on DM analysis.
http://www.wikinewzealand.org is new to me and probably to you too.
Some bedtime reading: Big data: are we making a big mistake? and The Sexiest Job of the 21st Century is Tedious, and that Needs to Change. Thanks to Matt Regan for pointing these out.
What am I doing here? 10 Tech Skills That Will Earn You Over $US100,000.
Here is the James, Witten, Hastie and Tibshirani (2013) book: An Introduction to Statistical Learning (9.9 MB). Some parts of it are suitable for background reading.
Here is an article, High-tech hand behind tiller, about the America's Cup yacht race in 2013.
In Feb 2014, Nature started calling for submissions for a repository called Scientific Data that is a new open access, peer reviewed, online-only, publication for descriptions of scientifically valuable datasets. It introduces a new type of content called the Data Descriptor designed to make your data more discoverable, interpretable and reusable. It will launch in May 2014.
For an article entitled "Scientists losing data at a rapid rate: Decline can mean 80% of data are unavailable after 20 years" see this link. See also this link.
These late-2013 The Economist articles are short and make some good points: Link1, Link2. Evidently there is now a Society of Data Miners.
This The Economist article about making data public is good: A new goldmine. This is the Shakespeare Review.
These articles were sent to me from Louise and are worth reading. They are from the Intelligence Unit of The Economist and are: BigDataandConsumerProductsCompanies.pdf, Fosteringadata-drivenculture.pdf, Insearchofinsightandforesight.pdf, Roleofpeopleintechnology-drivenorganisations.pdf. The article Privacy Uncovered is about the sharing and storage of personal data online.
This article, ReadingInBigData.pdf, is an interesting read; it measures the time it takes to read in big spreadsheets into R. Thanks to the author, Glenn Thomas, for allowing its distribution.
This datablog link has interesting data. For example, here is their Olympics data; see in particular the height and weight data.
Here is a presentation to the Statistics Dept at Auckland University in Nov 2012: 201211DataminePresentationtoStatsDeptAucklandUniversity.pptx.
Secret sifter article shows what the Americans are up to in terms of terrorist detection.
A news clip on Obama and big data.
An Auckland data mining company that might have job openings: http://www.datamine.com.
"To exploit the data flood, America will need many more like her. A report last year by the McKinsey Global Institute, the research arm of the consulting firm, projected that the United States needs 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers, whether retrained or hired." Interesting? Read more here.

And from Big data: The next frontier for innovation, competition, and productivity:

There will be a shortage of talent necessary for organizations
to take advantage of big data. By 2018, the United States alone
could face a shortage of 140,000 to 190,000 people with deep
analytical skills as well as 1.5 million managers and analysts
with the know-how to use the analysis of big data to make effective
decisions.

And What is a data scientist? I like the quote from Jeremy Howard, President & Chief Scientist, Kaggle:

The four qualities of a great data scientist are creativity,
tenacity, curiosity, and deep technical skills. They use skills in
data gathering and data munging, visualization, machine learning,
and computer programming to make data driven decisions and data
driven products. They prefer to let the data do the talking.

Interested in ecology? The recently launched journal Dataset Papers in Ecology, a part of Datasets International, has short articles that describe a piece of experimental or observational data that an author has collected. The underlying data will be made freely available for readers to download.
Here is some info re. Oracle R Enterprise and Hadoop.
Here is a simple article written for statistically illiterate people: muzhu.pdf.
Want to solve data mining problems and win a big prize? See kaggle.com.
Here is a plenary talk by Chris Bishop.
Here is an excellent resource for classifiers: http://home.comcast.net/~tom.fawcett/public_html/ML-gallery/pages.
There is an interesting article in The Economist called "Data, Data, Everywhere": around 27 Feb 2010. It mentions on p.7 that "A free programming language called R lets companies examine and present big data sets...".
This 2012 New York Times article gives some thoughts about The Age of Big Data: click here.
This New York Times article gives some thoughts about data and computing: click here.
This New York Times article gives some excellent background on some of the history and current trends in statistical software (not just SAS and R) (a must for 301). The article has a strong data mining flavour.
Revolution has a whitepaper called Big Data Analysis with Revolution R Enterprise which is interesting.
Another popular data mining piece of software is RapidMiner.
There is an interesting article from Feb 2011 in the RSS Significance magazine by Hal Varian.
There is an interesting article from Feb 2011 called "Tips for statisticians starting a career in business" from Amstat News.
The Open Data Project was a mid-2011 initiative by the NZ government.
Here are Free Statistical Tools on the WEB.
This article comes from the business section of the Herald which was prompted by World Statistics Day (2010-10-20). Or maybe it was in the National Business Review. It features our one and only David Scott.
For graphs see this article (it comes from the RSS).
A ten year campaign to build a society in which our lives and choices are enriched by an understanding of statistics: http://www.getstats.org.uk.
Q: Why is R better than Excel for teaching statistics? A: click here.
An interesting article entitled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete".
What is data science? According to Mike Driscoll (@dataspora), statistics is the "grammar of data science."
An article on R and SPSS.
Revolution.
Here are some links referred to in the notes:
- http://www.assda.edu.au/,
- http://www.nzssds.org.nz/,
- http://www.iassistdata.org/,
- http://www.cessda.org,
- http://www.iq.harvard.edu,
- Harvard MIT Data Center,
- Dataverse Network,
- http://www.data.gov.uk,
- http://www.data.gov,
- http://www.matthewckeller.com/html/memory.html,
- http://www.stat.auckland.ac.nz/~ihaka/courses/120,
- http://www.stat.auckland.ac.nz/~ihaka/courses/787,
- http://www.stats.ox.ac.uk/~ripley,
Here are some disasters caused by numerical errors.
Here are some links to mainly USA data.
Here is Google Public Data Explorer.
Here is an article entitled "Government opens data to public" regarding data release in UK: http://news.bbc.co.uk/2/hi/technology/8311627.stm.
Here are some chapters from Hastie, Tibshirani and Friedman (2009). Parts of these chapters are useful as background reading. (Note that the book is available electronically at UoA Voyager. In fact, through http://www.springerlink.com/content/ng8j76 or http://dx.doi.org/10.1007/b94608.)
- ch04LinearMethodsForClassification.pdf
- ch09AdditiveModelsTrees.pdf
- ch11NeuralNetworks.pdf
- ch13PrototypeMethods.pdf
- ch10BoostingAdditiveTrees.pdf. This is of less importance; it gives some information about boosting etc.
FYI here is a two day course on data mining, inference and prediction: this link.
Here is The Korean Datamining Society.
Here is a 2009 article from the New York Times entitled For Today's Graduate, Just One Word: Statistics.
FYI, what has Oprah got to do with SAS? (It's like asking what a fish has got to do with a bicycle). Whatever your answer, here is Oprah. Warning: it is 13 Mb.
FYI, here is a talk from a Professor Dennis Lin on data mining.
This 2009 NY Times article refers briefly to R. It also has some interesting things to say about the usefulness of data analysis and the demand for statisticians. Note the last sentence!
There are many resources for learning R. The following "R: A self-learn tutorial" should be useful if your knowledge of R is limited. The last page is dated.
FYI, here is a talk given by Trevor Hastie about 3 years ago entitled "Modern Trends in Data Mining". Apart from Support Vector Machines and the LASSO, most of the talk reinforces ideas given in 784 and provides good background reading.
In Feb 2008 a new journal called Statistical Analysis and Data Mining was published. Here is an article entitled "Data Mining Research: Current Status and Future Opportunities" from Issue 2; it is of mediocre quality.
FYI, here is an article "Introduction to Data Mining and Knowledge Discovery, Third Edition" by Two Crows Corporation. They say it has been used as a teaching tool at graduate schools of business including Stanford, M.I.T., and Harvard. It is at a very elementary level.
The Official R homepage is http://www.R-project.org and it is best for you to use http://cran.stat.auckland.ac.nz for downloads.

deparse1 <- function(expr, collapse = " ", width.cutoff = 500L, ...)
  paste(deparse(expr, width.cutoff, ...), collapse = collapse)


deparse1 <- deparse

Handouts and Files

Here is some info about rpart(): minitech.pdf and techrep.pdf. I will cover a bit of minitech.pdf in class - it's like a vignette and is recommended reading, especially as you will need to run rpart(). Also, here is car.R, outcar2023.txt, kyphosis.R, outkyphosis2023.txt; scanKyphosisImprovementRPART.pdf gives details behind some of the output.
Here is an additional R script, for Section 5.5.6: ROCR_roceg.R. And here are baggingeg.R and boostingeg.R.
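For those who want to try rpart() before opening the handouts, here is a minimal sketch of a classification tree. It is not a copy of kyphosis.R, and the object names fit, pred and conf.tab are my own choices for illustration. It assumes the rpart package is installed (it is a recommended package that normally comes with R) and uses the kyphosis data frame that ships with it.

```r
library(rpart)  # recommended package; assumed to be installed

# kyphosis ships with rpart: Kyphosis (absent/present), Age, Number, Start.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

printcp(fit)  # complexity parameter table, useful when considering pruning

# Predicted classes on the training data (resubstitution, so optimistic).
pred <- predict(fit, type = "class")

# Confusion table of observed versus predicted classes.
conf.tab <- with(kyphosis, table(observed = Kyphosis, predicted = pred))
conf.tab
```

Because the predictions are made on the same data used to grow the tree, the table flatters the model; minitech.pdf discusses cross-validation and pruning.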

Previous Exams and Term Tests

Exams and tests from previous years are available from Canvas. Please note that the course has evolved over time, therefore there are several topics examined in previous years that are not applicable this year. Also, during the years of Covid-19, tests and exams were done online so there were no recall questions. Furthermore, note that my emphasis in most questions is either problem solving or understanding the concepts and material. Rote memorization simply won't work.

Data Sets

NZ sea temperatures around mid-2024: NZ sea temperatures data.
Here are some Australian homicide data that have been looked at before by a statistician.
Here are some medical data sets that look useful, especially for biostatisticians.
Twenty data sets you should know.
Google has a data set search: https://toolbox.google.com/datasetsearch.
Statistics NZ has data sets on a wide range of topics including health, finance and industry sectors: http://archive.stats.govt.nz/browse_for_stats.aspx.
Education Counts has numerous data sets covering many facets of education, including early childhood, Maori/Pacific education, special education, primary/secondary and tertiary and international education: https://www.educationcounts.govt.nz/statistics.
The ProCivicStat group has numerous suggestions to inspire students to investigate social phenomena. Topics include malnutrition, gender pay gap, income levels, natural disasters, traffic accidents, demography, air pollution and racial bias, to name a few: http://www.procivicstat.org/.
From end of Nov 2016 this policedata website has interesting crime data for NZ. It appears to be based on SAS.
Some possible R packages containing big data sets that look interesting: UScensus2000blkgrp.
http://www.scotland.gov.uk and http://www.statisticsauthority.gov.uk have a pre-release access policy.
Free data for London sounds interesting. It may contain the locations of the bombs that dropped on London during WWII. If not, I recall that data being available elsewhere.
The Southern California Earthquake Data Center has interesting data.
Here is the Human Mortality Database (HMD).
Here are some more websites where potential data mining data sets are available.
Please let me know if any of these are of particular interest to you; we might then use it.

Handouts

Here is a history of handouts released or to be released on Canvas. They are all examinable in general.

27 Feb  Quiz 1 released (2%)
27 Feb  Study guide
27 Feb  Chapter 1: Introduction to Data Mining
27 Feb  Tutorial 1

Last modified: Thu 11 Jul 2024 12:07:30 NZST

Return to Thomas Yee's home page

Return to Departmental Homepage