Fri 9 June 15:00--16:00
Thu 15 June 15:00--16:00
Fri 16 June 15:00-- online only
Mon 19 June 15:00--16:00
Wed 21 June 16:00-- online only
Fri 23 June 15:00-- online only
I'll log in to Zoom at the beginning of the hour and log off soon after that if nobody is around.
Please show your working when doing assignments. Interpret or comment on any plots that you produce. Handwritten assignments are acceptable, but then you need to upload scans or photos, and all .R files too. Any computer code that you use should be displayed so that the marker can see it and its output. I highly recommend knitr .Rnw files or Rmarkdown .Rmd files for reproducibility of research; they are fantastic software tools. Please knit them into .pdf if possible (LaTeX might need to be installed on your machine); however, HTML or .doc should be okay too.
You must not copy each other; the penalty is high. And please don't submit ChatGPT output.
feuro.df <- transform(feuro.df, bmi = weight / height^2)  # Create 'bmi'
or
feuro.df$bmi <- with(feuro.df, weight / height^2)
or
feuro.df$bmi <- feuro.df$weight / feuro.df$height^2
are okay, but
bmi <- feuro.df$weight / feuro.df$height^2
is a standalone vector in the workspace, which is bad style. I like transform() best. And to access a variable in a data frame without a dollar sign, use with(), e.g.,
with(feuro.df, bmi)
myss <- with(feuro.df, smooth.spline(age, bmi))
To obtain a subset of a data frame, use subset(), e.g.,
subgp1.df <- subset(feuro.df, age < 30 & chol > 1.5)
To sort by a certain variable in a data frame, use order(), e.g.,
ooo <- with(feuro.df, order(age))
feuro.df <- feuro.df[ooo, ]
Sorting is a good idea if lines() is used.
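The data frame idioms above can be tried on a small made-up data set. This is only a sketch: toy.df and its values are invented for illustration (the real feuro.df is not reproduced here), and the units (kg and metres) are an assumption.

```r
# Hypothetical toy data standing in for feuro.df (all values are made up)
toy.df <- data.frame(age    = c(45, 23, 61, 34, 29, 52),
                     weight = c(70, 58, 82, 65, 90, 77),              # kg (assumed)
                     height = c(1.70, 1.62, 1.80, 1.68, 1.85, 1.75),  # m (assumed)
                     chol   = c(1.2, 1.8, 2.1, 1.4, 1.6, 1.9))

# Create 'bmi' inside the data frame rather than as a standalone vector
toy.df <- transform(toy.df, bmi = weight / height^2)

# Keep only the rows with age < 30 and chol > 1.5
subgp1.df <- subset(toy.df, age < 30 & chol > 1.5)

# Sort by age so that lines() joins the points from left to right
ooo <- with(toy.df, order(age))
toy.df <- toy.df[ooo, ]

plot(bmi ~ age, data = toy.df, las = 1, xlab = "Age", ylab = "BMI")
with(toy.df, lines(age, bmi, col = "blue"))
```

Without the order() step, lines() would join the points in row order and the line would zigzag back and forth across the plot.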
plot(y ~ x, data = d.df, main = "main", xlab = "xlab", ylab = "ylab")
is the most basic of plots. This is bad:
plot(d.df$x, d.df$y)
The font should be readable, e.g., not too small. And the aspect ratio of plots should be sensible. If colour is used then make sure the plot can still be read when printed in black and white. Use different plotting symbols, use different line types, add a legend, etc. I find that plot() followed by points() and lines() is sufficient about 90% of the time. For example,
plot(runif(99), runif(99), main = "mymain", las = 1, sub = "mysub",
     xlab = "myxlab", ylab = "myylab", col = "blue")
points(runif(9), runif(9), col = "red", pch = "+", cex = 2)
lines(sort(runif(9)), runif(9), col = "limegreen", lty = "dashed")
Occasionally matpoints() and matlines() are useful for adding multiple lines and points in colour.
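As a rough illustration of matlines() and matpoints(), combined with the earlier advice on line types, plotting symbols and legends, here is a sketch on made-up data. The objects x, ymat and idx are hypothetical names introduced only for this example.

```r
# Made-up data: three curves evaluated on a common grid
x <- seq(0, 2 * pi, length.out = 50)
ymat <- cbind(sin(x), cos(x), sin(2 * x))  # one column per curve

# Set up an empty plot first, then add all the columns at once
matplot(x, ymat, type = "n", las = 1, main = "Three curves",
        xlab = "x", ylab = "y")
matlines(x, ymat, lty = 1:3, lwd = 2,
         col = c("blue", "red", "limegreen"))

# Mark every 7th grid point, with a different symbol for each curve
idx <- seq(1, length(x), by = 7)
matpoints(x[idx], ymat[idx, ], pch = 1:3,
          col = c("blue", "red", "limegreen"))

# Line types and symbols keep the curves distinguishable in black and white
legend("bottomleft", legend = c("sin(x)", "cos(x)", "sin(2x)"),
       lty = 1:3, pch = 1:3, col = c("blue", "red", "limegreen"))
```

Each column of ymat is treated as a separate series, so lty, pch and col are recycled across columns rather than across observations.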
glm(y ~ x2 + x3, poisson, data = d.df)
It should not be any of
glm(d.df$y ~ x2 + x3, poisson, data = d.df)
glm(d.df$y ~ d.df$x2 + d.df$x3, poisson)
(Why?)
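One possible reason, offered as a sketch rather than as the complete answer: when the formula contains d.df$x2, predict() cannot substitute new data for that variable. The simulated data below are made up, and new.df, good and bad are hypothetical names used only here.

```r
# Simulated data; the names d.df, x2, x3 mirror those used above,
# but the values are made up.
set.seed(123)
d.df <- data.frame(x2 = runif(50), x3 = runif(50))
d.df <- transform(d.df, y = rpois(50, lambda = exp(1 + x2 - 0.5 * x3)))

good <- glm(y ~ x2 + x3, poisson, data = d.df)
bad  <- glm(d.df$y ~ d.df$x2 + d.df$x3, poisson)

# Both fits give the same estimates, only the coefficient labels differ
all.equal(unname(coef(good)), unname(coef(bad)))

new.df <- data.frame(x2 = c(0.2, 0.8), x3 = c(0.5, 0.5))

# 'good' finds x2 and x3 in new.df: one prediction per row of new.df
p.good <- predict(good, newdata = new.df, type = "response")
length(p.good)

# 'bad' looks up d.df$x2 and d.df$x3 again, so new.df is ignored
# (R warns that the number of rows does not match)
p.bad <- suppressWarnings(predict(bad, newdata = new.df, type = "response"))
length(p.bad)  # one value per row of d.df, not per row of new.df
```

The mixed form glm(d.df$y ~ x2 + x3, poisson, data = d.df) does fit, but the d.df$ prefix is redundant once data = d.df is supplied, and the response is then labelled d.df$y in the output.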
There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
The four qualities of a great data scientist are creativity, tenacity, curiosity, and deep technical skills. They use skills in data gathering and data munging, visualization, machine learning, and computer programming to make data driven decisions and data driven products. They prefer to let the data do the talking.
deparse1 <- function(expr, collapse = " ", width.cutoff = 500L, ...)
  paste(deparse(expr, width.cutoff, ...), collapse = collapse)
deparse1 <- deparse
Exams and tests from previous years are available from Canvas. Please note that the course has evolved over time, so several topics examined in previous years are not applicable this year. Also, during the years of Covid-19, tests and exams were done online so there were no recall questions. Furthermore, note that my emphasis in most questions is on either problem solving or understanding the concepts and material. Rote memorization simply won't work.
27 Feb Quiz 1 released (2%)
27 Feb Study guide
27 Feb Chapter 1: Introduction to Data Mining
27 Feb Tutorial 1