14 Choose Your Own Data Science Adventure

Recently on the Tidyverse blog, Julia Silge, data scientist and software engineer at RStudio, wrote an article about “choos[ing] your own tidymodels adventure”. This piece is designed to help people discover the tidymodels framework, to direct them towards helpful resources and to demonstrate a couple of use cases.

In the same spirit, I want to take the opportunity in the last chapter of these lecture notes to direct your attention to a few R/data science-related topics we did not cover but which may nevertheless pique your interest. Many of these topics were chosen because I wish I knew more about them myself. Thus, you may consider this chapter a kind of bucket list of R/data science-related adventures I want to go on.

Finally, as these lecture notes are coming to a close, I realize that I actually enjoy writing about R and data science topics, which is why I decided to make this a hobby of mine. Thus, if you wish to follow me while I explore the R/data science universe, you may check out the blog I am currently creating.

14.1 Math/Statistics

By its nature, data science uses a good amount of tools from mathematics and more specifically from the field of stochastics (which includes statistics). Consequently, if one wants to apply a statistical technique as part of some data analysis, it is advisable to understand the theoretical background too. Thus, it helps to have some good resources to look up stuff.

Personally, I have had good experiences with the following classics:

The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. which you can find here
An Introduction to Statistical Learning with Applications in R by James et al. which you can find here
Regression: Models, Methods and Applications by Fahrmeir et al.

Honestly, I cannot say that I have read any of these books from cover to cover but I found these books to be helpful resources anyway. Further, I do not think that it is advisable to aim at reading the books in full. Simply pick out a topic you are currently interested in and start reading the corresponding chapter.

14.2 Data Analysis/Tidymodels and Visualizations

The best way to learn more about data analysis and model building is of course to practice more data analysis and model building. As discussed before, Kaggle offers a vast number of data sets to practice on. Simply find a data set or challenge that interests you and then start exploring.

If you want to learn more about tidymodels and some of its technical aspects, you may refer to Tidy Modeling with R by Max Kuhn and Julia Silge. Also, if you are simultaneously reading the intro book by James et al. from above, then you may be delighted to hear that ISLR tidymodels Labs by Emil Hvitfeldt demonstrates how to use tidymodels to solve the labs in the intro book.

Alternatively, if you are not interested in creating statistical models right now but still enjoy exploring data sets, you may practice your data wrangling and visualization skills instead. This is exactly the purpose of the weekly Tidy Tuesday event. And if you need inspiration on what you can even “do” with a data set, I recommend checking out David Robinson’s weekly Tidy Tuesday screencasts.

Finally, if you want to learn more about data visualization and good visualization practice, the Tidy Tuesday repository is a good place to start. There, you will find links to free books such as Fundamentals of Data Viz by Claus Wilke or Kieran Healy’s Data Visualization. Also, if you want to explore how to create animations in R, check out the packages animation and gganimate.

14.3 Rtistry

If you enjoy working on visualizations and feel particularly artsy, then you might also enjoy creating art with R. I am not kidding. There is a whole sub-community within the R community dedicated to generative R.

On Twitter under the hashtag #rtistry, you can get a glimpse of the beautiful pictures and animations people create using R. For instance, there I found this nifty little animation which was created using the code that is written in the tweet⁹².

N=120
t=(0:4e4)/1e4
x=1i^t
l=c(-2,2)
gifski::save_gif(lapply(1:N,\(b){
a=sin(9*pi*(t+4/3*b/N)/3)^40
z=x+x^4/2-exp(-3i*pi*(t+2/3*b/N))/2+a/5*x^1e3
par(bg="black",mar=rep(1,4))
plot(z,cex=0.1,col=hsv(.6+sin(t)/4,a,1),xlim=l,ylim=l)
}),w=600,h=600,d=1/40)
#rtistry #rstats #mathart pic.twitter.com/s7AsBQLalq
— George Savva, June 27, 2021

If you want to see more generative art, check out the flametree package or art by Danielle Navarro.

14.4 Write Your Own Package

When you work on a (large) project in R, chances are good that at some point you will have coded a lot of functions tailored to your specific purposes. For convenience, it could be helpful to arrange them in your very own R package. Even if you do not plan to share any of these functions on CRAN or anywhere else, a package may offer the right kind of infrastructure for your project’s purposes. An introduction to how to create R packages can be found in R Packages by Hadley Wickham and Jenny Bryan.

14.5 Text Mining

In these lecture notes we have rarely worked with data that comes in the form of text. Usually, we worked with tabular data that contains only very little text. Nevertheless, R offers tools to work with text data, such as stringr, stringi and tidytext.

For an introduction to stringr you may want to check out the corresponding chapter of R for Data Science by Hadley Wickham & Garrett Grolemund. Further, for an introduction to text mining with R, consider giving Text Mining with R by Julia Silge & David Robinson a try.
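
To give a small taste of what stringr can do, here is a minimal sketch. The vector `notes` and its two sentences are made up for illustration. The sketch normalizes the text, splits it into words and counts occurrences of a word.

```r
library(stringr)

# Two made-up example sentences (illustrative data only)
notes <- c("R is fun. R is free!", "Text mining with R")

# Normalize: convert to lower case, then drop punctuation
clean <- str_remove_all(str_to_lower(notes), "[[:punct:]]")

# Split each sentence into words at runs of whitespace
words <- str_split(clean, "\\s+")

# Number of words per sentence
lengths(words)

# How often does the stand-alone word "r" occur in each sentence?
str_count(clean, "\\br\\b")
```

Real text mining usually continues from here with a tidy one-word-per-row format, which is what the tidytext package is built around.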

14.6 Web Scraping

The internet is full of useful (and not so useful) information. In fact, there is so much information that one might be inclined to try to extract that information automatically. For instance, imagine that you would like to download the Apple Inc. stock price each morning at 9 am from Yahoo! Finance to track the price over time, or to download every comic from xkcd.com.

With the help of the package RSelenium or the tidyverse’s rvest package you could automate this process by writing an R script that actually goes online and extracts that price/picture for you.
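
To illustrate the basic rvest workflow without going online, the following sketch parses a tiny made-up HTML page. The page content, the CSS class `comic` and the file paths are all invented for this example. For a real website you would start from `read_html()` with the page's URL instead, and the selectors would depend on that page's structure.

```r
library(rvest)

# A tiny made-up page; for a real site you would use read_html("<URL>") instead
page <- minimal_html('
  <h1>Comics</h1>
  <div class="comic"><img src="/img/one.png" alt="First comic"></div>
  <div class="comic"><img src="/img/two.png" alt="Second comic"></div>
')

# Select all <img> tags inside elements with the (hypothetical) class "comic"
imgs <- html_elements(page, "div.comic img")

# Extract attributes from the selected nodes
img_src <- html_attr(imgs, "src")
img_alt <- html_attr(imgs, "alt")

# Collect the results in a data frame
data.frame(alt = img_alt, src = img_src)
```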

14.7 More Shiny

In these lecture notes we have barely scratched the surface of what the shiny package can do. If you are interested in creating applications which are interactive and accessible from the internet, shiny can be a great starting point from within R.

As the things you want to accomplish with your app become more advanced, you will likely need to learn other programming languages (see below) to extend what shiny can do. Until then, you may wish to master shiny first.

14.8 Take a Look under R’s Hood

As you work more and more with R, you will find yourself wondering how R does things. Chances are also good that you will want to improve your R programming skills simply to do things faster and more effectively.

Then, you may want to take a look at Advanced R by Hadley Wickham. There, you will find a huge amount of information on all things relating to R. What’s more, this book will allow you to achieve incredible things in R.

For instance, have you ever wondered how the select() function from the dplyr package is able to select the correct column? If you’re thinking “No, but what’s so special about this?”, then you may want to notice that it is actually not that simple to define your own select() function, even with the help of dplyr’s select() itself.

This is because defining an appropriate function to select a column from, say, the iris data set cannot be done like this:

my_select <- function(x) {dplyr::select(iris, x)}

Now, if you want to use the function the same way you would use dplyr::select(), i.e. by simply passing, say, Sepal.Width (notice no "") to your new function, you get an error:

my_select(Sepal.Width)
# Error: object 'Sepal.Width' not found

This error appears because R will try to evaluate the argument as a variable from your current environment. But of course this variable is not present in your environment and only present within the iris data set. Therefore, what dplyr::select() accomplishes is that it lets R know to evaluate the input argument only later on, i.e. when the variable is “available” from the data set.
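
The idea of delayed evaluation can be sketched in base R. Note that this is only a simplified analogue and not how dplyr implements select(). The helper name `my_pull` is made up for this example. It captures the expression the caller typed and evaluates it with the data set's columns in scope.

```r
# Simplified base R analogue of delayed evaluation (not dplyr's actual code)
my_pull <- function(data, col) {
  # Capture the expression the caller typed for `col` without evaluating it
  col_expr <- substitute(col)
  # Evaluate it later, with the columns of `data` available as variables
  eval(col_expr, envir = data, enclos = parent.frame())
}

head(my_pull(iris, Sepal.Width), 3)
#> [1] 3.5 3.0 3.2
```

Here, Sepal.Width is never looked up in the calling environment, which is why no error occurs.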

Finally, with help from the rlang package, our own function can be modified using {{ }} (pronounced curly-curly), so that it works the way we intend it to.

my_select <- function(x) {dplyr::select(iris, {{x}})}
head(my_select(Sepal.Width), 10)
#>    Sepal.Width
#> 1          3.5
#> 2          3.0
#> 3          3.2
#> 4          3.1
#> 5          3.6
#> 6          3.9
#> 7          3.4
#> 8          3.4
#> 9          2.9
#> 10         3.1

As another example, consider the scenario where you have a variable TMax and a data set that contains a column TMax. Now, what you may want to do is filter the data set such that it delivers only results where the TMax column equals the TMax variable. However, notice that the straightforward approach does not work as planned:

library(dplyr)  # provides tibble(), filter() and %>%

TMax <- 1000
tib <- tibble(
  TMax = c(500, 1000, 2000)
)

tib %>%
  filter(TMax == TMax)
#> # A tibble: 3 x 1
#>    TMax
#>   <dbl>
#> 1   500
#> 2  1000
#> 3  2000

Here, the second TMax was understood as referring to the column TMax instead of the variable TMax with value 1000, which is why the filtering returned the complete tibble. Instead, if we only want to get the row with value 1000 in it, we will need to let filter() know that it is supposed to look for that second variable elsewhere and not in tib.

This is achieved via !! (pronounced bang-bang) and works as follows:

tib %>%
  filter(TMax == !!TMax)
#> # A tibble: 1 x 1
#>    TMax
#>   <dbl>
#> 1  1000
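
As a side note, the same disambiguation can be spelled out more explicitly with the .data and .env pronouns that dplyr's data-masking functions make available. The following is a minimal sketch that rebuilds the small example from above so that it runs on its own.

```r
library(dplyr)

TMax <- 1000
tib <- tibble(TMax = c(500, 1000, 2000))

# .data$TMax refers to the column in tib,
# .env$TMax refers to the variable defined outside the tibble
res <- tib %>%
  filter(.data$TMax == .env$TMax)
res
```

Which of the two spellings to prefer is largely a matter of taste. The pronouns make it visible at a glance where each name is looked up.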

Needless to say, functionalities like these can go a long way to help you define functions you may need in your data analysis. So, if you want to start learning how this and more works, take a look at Advanced R.

14.9 Other Programming Languages

Technically, learning a different programming language is not an R adventure but still it will probably help you to become a better data scientist. Further, it is best to get the idea of using only one language out of your system early on as a lot of things are achieved through a combined effort of multiple languages.

Of course, there are hundreds of languages out there and we cannot mention all of them here. So, instead consider this short list of programming languages I personally wish to learn more about.

14.9.1 Python

Python is probably the most obvious language to consider learning after having mastered R. In fact, there is an ongoing “debate” whether R or Python is THE most important i.e. best i.e. generally awesome programming language for data science purposes. Personally, I believe that these debates are similar to any dispute like

Git vs. SVN
Windows vs. Linux vs. Mac vs. others (see here)
QWERTY vs. DVORAK
Vim vs. Emacs (see editor war)
Cats vs. dogs (I am not surprised that Wikipedia has an article on cat people and dog people)

Long story short, I believe it cannot hurt to be not only fluent in R but also know some Python or vice versa (depending on what your “primary” language is) as there are probably aspects where one language might be more convenient to use than the other. This is why the reticulate package is able to call Python commands from R and you have Python support in RStudio.

Good resources to start learning Python can be found at Kaggle as they offer Python courses related to data science online for free. Additionally, I can recommend Automate the Boring Stuff with Python by Al Sweigart, which is free to read online and is a more general introduction to Python.

14.9.2 HTML, CSS and JavaScript

The most common languages your web browser uses to interact with the internet are HTML, CSS and JavaScript. Therefore, if you want to do some internet-related stuff like setting up a blog or a personal website, knowledge of these languages is helpful. Similarly, if you want to scrape the web (i.e. extract information from websites), you will run into these languages too.

Further, a standard file format many databases or APIs use is JSON which is derived from JavaScript. Finally, the JavaScript library D3.js is a great tool to visualize data. Check out beautiful examples here.
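
To get a feeling for what JSON looks like from within R, here is a minimal sketch using the jsonlite package (a package not otherwise covered in these notes). The JSON string, including the city names and values, is made up for illustration.

```r
library(jsonlite)

# A made-up JSON string, similar to what an API might return
json_txt <- '[
  {"city": "Ulm",    "tmax": 21.5},
  {"city": "Berlin", "tmax": 19.0}
]'

# An array of records is simplified to a data frame
weather <- fromJSON(json_txt)
weather

# And back again: convert an R object to JSON text
toJSON(weather, pretty = TRUE)
```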

14.9.3 SQL

SQL is a popular programming language to interact with large databases. Since data is often stored in a database instead of in a single file, it might be necessary to learn how to communicate with said database via SQL.
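
You do not have to leave R to try out SQL. The following sketch assumes that the DBI and RSQLite packages are installed. It creates a temporary in-memory SQLite database, copies the built-in mtcars data set into it and sends a small SQL query. A real project would of course connect to an existing database instead.

```r
library(DBI)

# Temporary in-memory SQLite database (assumes the RSQLite package is installed)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data set into the database as a table
dbWriteTable(con, "mtcars", mtcars)

# Ask the database for the average mpg per number of cylinders
res <- dbGetQuery(con, "
  SELECT cyl, AVG(mpg) AS avg_mpg
  FROM mtcars
  GROUP BY cyl
  ORDER BY cyl
")
res

# Close the connection when done
dbDisconnect(con)
```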
