Preface

0.1 Purpose and Prerequisites

There are one thousand and one introductory courses on data science using the statistical software R. This is another one of those. Written for the course “WiMa-Praktikum” at Ulm University in the summer semester of 2021, these lecture notes are my own take on teaching a selection of topics in R and data science that I picked up while using R and reading a couple of those one thousand and one introductory courses. The corresponding lecture videos can be found on YouTube. Notable resources that heavily influenced this text include

  • R for Data Science - Visualize, Model, Transform, Tidy and Import Data by Hadley Wickham & Garrett Grolemund (freely available online)
  • Introduction to Data Science - Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry (freely available online)
  • Statistical Inference via Data Science - A ModernDive into R and the Tidyverse by Chester Ismay & Albert Y. Kim (freely available online)
  • Data Science in a Box (freely available online)
  • Tidy Modeling with R by Max Kuhn & Julia Silge (freely available online)
  • An Introduction to Statistical Learning with Applications in R by James et al. (freely available online)
  • RStudio blog
  • R weekly
The goal of these lecture notes is to cover the basics within what I refer to as the data science workflow1:

Figure 0.1: Data science workflow as sketched by Hadley Wickham & Garrett Grolemund. Source: R4DS (https://r4ds.had.co.nz/introduction.html), licensed under CC BY-NC-ND 3.0 US (https://creativecommons.org/licenses/by-nc-nd/3.0/us/)

That means we need to cover, at least partially, what each step in Figure 0.1 means and how it can be implemented in R. This is quite a lot of material for one semester, and depending on prior programming skills, the amount of programming involved in the exercises can result in a time-consuming workload for the student.

Consequently, a major prerequisite for this course is high resilience in coping with failure (which is inevitable when learning a new programming language) and the willingness to learn new material independently (e.g. for dealing with programming errors). Other than that, students should be familiar with elementary probability theory.

0.2 Schedule

The current preliminary schedule is as follows:

| Week | Chapters | Description |
|------|----------|-------------|
| 1 | Preliminaries | We get started with some technical aspects of R such as installation, vectors, scripts, notebooks and general handling of R and RStudio. Also, we will rely on the version control system git/GitHub throughout this course, so we need to cover the technicalities of that too. |
| 2 | Data Exploration | We want to explore a couple of data sets visually and generate some insights into data from pictures alone. |
| 3 | Wrangling Data | We take data sets and compute our own variables of interest from the present variables. In between, we visualise our new variables to get a better understanding. Also, we discover aspects of our data through the lens of popular statistical quantities. Broadly speaking, we will learn how to wrangle data. |
| 4 | Random Number Generators | In this week, we want to talk about the simulation of random numbers. We approach the topic from a theoretical point of view and implement simulation approaches in R. To do so, we extend our R skillset with if conditions and for loops. |
| 5 | The Monte Carlo Method | Now that we’ve got random number generation all figured out, using the Law of Large Numbers we can take it one step further. We generate a large quantity of random numbers and use them to solve some problems that are hard to tackle analytically. This is known as the Monte Carlo Method. |
| 6 | Hypothesis Testing | We use the ‘infer’ package to draw rigorous conclusions about our data using confidence intervals and hypothesis tests. Widespread quantities such as ‘p-values’ and others will be discussed. |
| 7 | Linear Models | Linear models are the mother of all statistical models. Well, maybe that is a bit of an exaggeration but still, the ideas behind linear models are a fundamental building block of a lot of other models. Thus, we will look at these models more closely and apply them to some data sets. |
| 8 | Linear Models on Multiple Data Sets | We continue to talk about linear models and learn how to apply them to multiple data sets. More broadly, this chapter aims to extend our R repertoire by powerful concepts such as lists, map functions and nested tibbles in R. |
| 9 | Classification | We learn to apply linear regression to categorical outcomes and extend our hodgepodge1 of models by introducing logistic regression as a classical method to solve classification problems. |
| 10 | How to Build a Model | We tackle the infamous Titanic data set2 and work our way through general ideas of statistical modelling and common obstacles. |
| 11 | Random Forests | OK, we got a glimpse of what awaits us when we try to come up with a statistical model using a data set. Let’s do it again. This time, we switch from using logistic regression to using random forests. |
| 12 | Communication I | TBA, probably a bit about reports and Shiny dashboards |
| 13 | Communication II | TBA, probably a bit about reports and Shiny dashboards |
| 14 | Overflow / student project | Placeholder in case some topics take more time than expected or time for student project at the end of the semester |

1 While trying to translate the German word ‘Sammelsurium’, I stumbled upon gems like ‘hodgepodge’, ‘mingle-mangle’ and ‘mishmash’, which sound rather delightful to me. So much so that it took me an embarrassingly long time to figure out how to add a footnote to share my discovery. Well, I learned a couple of new words and R commands.
2 See https://www.kaggle.com/c/titanic
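
To give a first taste of what weeks 4 and 5 are about, here is a minimal sketch of the Monte Carlo idea in base R. It is only an illustration and not necessarily the example used later in the course: the target quantity (the number pi), the seed and the sample size are all my own assumptions. We draw uniform random points in the unit square and use the fraction that lands inside the quarter circle, which by the Law of Large Numbers approaches pi / 4.

```r
# Monte Carlo sketch (illustrative assumption: we estimate pi).
set.seed(2021)   # arbitrary seed, only to make the run reproducible
n <- 100000      # number of simulated points (arbitrary choice)

# Draw n points uniformly at random from the unit square [0, 1] x [0, 1].
x <- runif(n)
y <- runif(n)

# A point lies inside the quarter circle of radius 1 if x^2 + y^2 <= 1.
inside <- x^2 + y^2 <= 1

# The share of points inside approximates the area pi / 4,
# so multiplying by 4 gives an estimate of pi.
pi_hat <- 4 * mean(inside)
pi_hat
```

Increasing n should, on average, move the estimate closer to pi, which is exactly the behaviour the Law of Large Numbers promises.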

0.3 Session info

These lecture notes were built with:

#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2021-06-21                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version     date       lib source                            
#>  assertthat     0.2.1       2019-03-21 [1] CRAN (R 4.0.3)                    
#>  backports      1.2.1       2020-12-09 [1] CRAN (R 4.0.3)                    
#>  bookdown       0.21.4      2021-01-12 [1] Github (rstudio/bookdown@cc0f6d4) 
#>  broom          0.7.6       2021-04-05 [1] CRAN (R 4.0.5)                    
#>  bslib          0.2.3.9000  2021-01-12 [1] Github (rstudio/bslib@80a5059)    
#>  cellranger     1.1.0       2016-07-27 [1] CRAN (R 4.0.3)                    
#>  cli            2.5.0       2021-04-26 [1] CRAN (R 4.0.3)                    
#>  colorspace     2.0-1       2021-05-04 [1] CRAN (R 4.0.5)                    
#>  crayon         1.4.1       2021-02-08 [1] CRAN (R 4.0.4)                    
#>  data.table     1.13.6      2020-12-30 [1] CRAN (R 4.0.3)                    
#>  DBI            1.1.0       2019-12-15 [1] CRAN (R 4.0.3)                    
#>  dbplyr         2.1.1       2021-04-06 [1] CRAN (R 4.0.5)                    
#>  digest         0.6.27      2020-10-24 [1] CRAN (R 4.0.3)                    
#>  downlit        0.2.1       2020-11-04 [1] CRAN (R 4.0.3)                    
#>  dplyr        * 1.0.6       2021-05-05 [1] CRAN (R 4.0.5)                    
#>  ellipsis       0.3.2       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  evaluate       0.14        2019-05-28 [1] CRAN (R 4.0.3)                    
#>  fansi          0.5.0       2021-05-25 [1] CRAN (R 4.0.5)                    
#>  farver         2.1.0       2021-02-28 [1] CRAN (R 4.0.4)                    
#>  fastmap        1.1.0       2021-01-25 [1] CRAN (R 4.0.5)                    
#>  forcats      * 0.5.1       2021-01-27 [1] CRAN (R 4.0.5)                    
#>  fs             1.5.0       2020-07-31 [1] CRAN (R 4.0.3)                    
#>  gapminder    * 0.3.0       2017-10-31 [1] CRAN (R 4.0.5)                    
#>  generics       0.1.0       2020-10-31 [1] CRAN (R 4.0.3)                    
#>  gganimate    * 1.0.7       2020-10-15 [1] CRAN (R 4.0.5)                    
#>  ggplot2      * 3.3.3       2020-12-30 [1] CRAN (R 4.0.4)                    
#>  ggrepel      * 0.9.1       2021-01-15 [1] CRAN (R 4.0.3)                    
#>  gifski         0.8.6       2018-09-28 [1] CRAN (R 4.0.3)                    
#>  glue         * 1.4.2       2020-08-27 [1] CRAN (R 4.0.3)                    
#>  gtable         0.3.0       2019-03-25 [1] CRAN (R 4.0.3)                    
#>  haven          2.3.1       2020-06-01 [1] CRAN (R 4.0.3)                    
#>  highr          0.9         2021-04-16 [1] CRAN (R 4.0.5)                    
#>  hms            1.0.0       2021-01-13 [1] CRAN (R 4.0.5)                    
#>  htmltools      0.5.1.9003  2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#>  htmlwidgets    1.5.3       2020-12-10 [1] CRAN (R 4.0.3)                    
#>  httr           1.4.2       2020-07-20 [1] CRAN (R 4.0.5)                    
#>  jquerylib      0.1.3       2020-12-17 [1] CRAN (R 4.0.3)                    
#>  jsonlite       1.7.2       2020-12-09 [1] CRAN (R 4.0.3)                    
#>  kableExtra   * 1.3.1       2020-10-22 [1] CRAN (R 4.0.3)                    
#>  knitr          1.33        2021-04-24 [1] CRAN (R 4.0.5)                    
#>  lazyeval       0.2.2       2019-03-15 [1] CRAN (R 4.0.3)                    
#>  lifecycle      1.0.0       2021-02-15 [1] CRAN (R 4.0.4)                    
#>  lubridate    * 1.7.10      2021-02-26 [1] CRAN (R 4.0.4)                    
#>  magrittr       2.0.1       2020-11-17 [1] CRAN (R 4.0.3)                    
#>  modelr         0.1.8       2020-05-19 [1] CRAN (R 4.0.3)                    
#>  munsell        0.5.0       2018-06-12 [1] CRAN (R 4.0.3)                    
#>  mvtnorm      * 1.1-1       2020-06-09 [1] CRAN (R 4.0.3)                    
#>  nycflights13 * 1.0.2       2021-04-12 [1] CRAN (R 4.0.5)                    
#>  pillar         1.6.1       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  pkgconfig      2.0.3       2019-09-22 [1] CRAN (R 4.0.3)                    
#>  plotly       * 4.9.3       2021-01-10 [1] CRAN (R 4.0.3)                    
#>  prettyunits    1.1.1       2020-01-24 [1] CRAN (R 4.0.3)                    
#>  progress       1.2.2       2019-05-16 [1] CRAN (R 4.0.3)                    
#>  purrr        * 0.3.4       2020-04-17 [1] CRAN (R 4.0.3)                    
#>  R6             2.5.0       2020-10-28 [1] CRAN (R 4.0.3)                    
#>  rappdirs       0.3.1       2016-03-28 [1] CRAN (R 4.0.3)                    
#>  Rcpp           1.0.6       2021-01-15 [1] CRAN (R 4.0.4)                    
#>  readr        * 1.4.0       2020-10-05 [1] CRAN (R 4.0.5)                    
#>  readxl         1.3.1       2019-03-13 [1] CRAN (R 4.0.3)                    
#>  reprex         2.0.0       2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang          0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)      
#>  rmarkdown      2.8.1       2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#>  rstudioapi     0.13        2020-11-12 [1] CRAN (R 4.0.3)                    
#>  rvest          1.0.0       2021-03-09 [1] CRAN (R 4.0.5)                    
#>  sass           0.2.0.9005  2021-01-07 [1] Github (rstudio/sass@403002c)     
#>  scales         1.1.1       2020-05-11 [1] CRAN (R 4.0.3)                    
#>  sessioninfo    1.1.1       2018-11-05 [1] CRAN (R 4.0.3)                    
#>  stringi        1.5.3       2020-09-09 [1] CRAN (R 4.0.3)                    
#>  stringr      * 1.4.0       2019-02-10 [1] CRAN (R 4.0.3)                    
#>  tibble       * 3.1.2       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  tidyr        * 1.1.3       2021-03-03 [1] CRAN (R 4.0.5)                    
#>  tidyselect     1.1.1       2021-04-30 [1] CRAN (R 4.0.3)                    
#>  tidyverse    * 1.3.1       2021-04-15 [1] CRAN (R 4.0.5)                    
#>  tweenr         1.0.1       2018-12-14 [1] CRAN (R 4.0.3)                    
#>  utf8           1.2.1       2021-03-12 [1] CRAN (R 4.0.3)                    
#>  vctrs          0.3.8       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  viridisLite    0.4.0       2021-04-13 [1] CRAN (R 4.0.5)                    
#>  webshot        0.5.2       2019-11-22 [1] CRAN (R 4.0.3)                    
#>  withr          2.4.2       2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun           0.22        2021-03-11 [1] CRAN (R 4.0.5)                    
#>  xml2           1.3.2       2020-04-23 [1] CRAN (R 4.0.5)                    
#>  yaml           2.2.1       2020-02-01 [1] CRAN (R 4.0.3)                    
#> 
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library
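
The listing above has the layout printed by the ‘sessioninfo’ package, which also appears in the package list. A minimal sketch for producing a comparable overview on your own machine could look like this. The fallback to base R's sessionInfo() is my own addition for the case that ‘sessioninfo’ is not installed, and the object name `info` is purely illustrative.

```r
# Collect information about the current R session.
# Prefer the 'sessioninfo' package if it is installed,
# otherwise fall back to sessionInfo() from base R (package 'utils').
if (requireNamespace("sessioninfo", quietly = TRUE)) {
  info <- sessioninfo::session_info()
} else {
  info <- utils::sessionInfo()
}

# Print R version, platform details and attached/loaded packages.
print(info)
```

The package versions you see will most likely differ from the ones listed here, so small differences in behaviour between your results and the notes are possible.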