14 Choose Your Own Data Science Adventure
Recently on the Tidyverse blog, Julia Silge, data scientist and software engineer at RStudio, wrote an article about “choos[ing] your own tidymodels adventure”. This piece is designed to help people discover the tidymodels framework, to direct them towards helpful resources, and to demonstrate a couple of use cases.
In the same spirit, I want to take the opportunity in these lecture notes’ last chapter to direct your attention to a few R/data science-related topics we did not cover but which may nevertheless pique your interest. A lot of these topics are chosen because I wish I knew more about them myself. Thus, you may consider this chapter as some form of bucket list of R/data science-related adventures I want to go on.
Finally, as these lecture notes are coming to a close, I realize that I actually enjoy writing about R and data science topics, which is why I decided to make this a hobby of mine. Thus, if you wish to follow me while I explore the R/data science universe, you may check out the blog I am currently creating.
14.1 Math/Statistics
By its nature, data science uses a good amount of tools from mathematics and more specifically from the field of stochastics (which includes statistics). Consequently, if one wants to apply a statistical technique as part of some data analysis, it is advisable to understand the theoretical background too. Thus, it helps to have some good resources to look up stuff.
Personally, I have had good experiences with the following classics:
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. which you can find here
- An Introduction to Statistical Learning with Applications in R by James et al. which you can find here
- Regression: Models, Methods and Applications by Fahrmeir et al.
Honestly, I cannot say that I have read any of these books from cover to cover but I found these books to be helpful resources anyway. Further, I do not think that it is advisable to aim at reading the books in full. Simply pick out a topic you are currently interested in and start reading the corresponding chapter.
14.2 Data Analysis/Tidymodels and Visualizations
The best way to learn more about data analysis and model building is of course to practice more data analysis and model building. As discussed before, Kaggle offers a vast amount of data sets to practice on. Simply find a data set or challenge that interests you and then start exploring.
If you want to learn more about tidymodels and some of its technical aspects, you may refer to Tidy Modeling with R by Max Kuhn and Julia Silge. Also, if you are simultaneously reading the intro book by James et al. from above, then you may be delighted to hear that ISLR tidymodels Labs by Emil Hvitfeldt demonstrates how to use tidymodels to solve the labs in the intro book.
Alternatively, if you are not interested in creating statistical models right now but still enjoy exploring data sets, you may practice your data wrangling and visualization skills instead. This is exactly the purpose of the weekly Tidy Tuesday event. And if you need inspiration on what you can even “do” with a data set, I recommend checking out David Robinson’s weekly Tidy Tuesday screencasts.
Finally, if you want to learn more about data visualization, the Tidy Tuesday repository is a good place to start.
There, you will find links to free books such as Fundamentals of Data Viz by Claus Wilke or Kieran Healy’s Data Visualization.
Also, if you want to explore how to create animations in R, it might be advisable to check out the packages animation and gganimate.
14.3 Rtistry
If you enjoy working on visualizations and feel particularly artsy, then you might also enjoy creating art with R. I am not kidding. There is a whole sub-community within the R community dedicated to generative art.
On Twitter under the hashtag #rtistry, you can get a glimpse of what beautiful pictures or animations people create using R. For instance, there I found this nifty little animation which was created using the code that is written in the tweet.
N=120
t=(0:4e4)/1e4
x=1i^t
l=c(-2,2)
gifski::save_gif(lapply(1:N,\(b){
a=sin(9*pi*(t+4/3*b/N)/3)^40
z=x+x^4/2-exp(-3i*pi*(t+2/3*b/N))/2+a/5*x^1e3
par(bg="black",mar=rep(1,4))
plot(z,cex=0.1,col=hsv(.6+sin(t)/4,a,1),xlim=l,ylim=l)
}),w=600,h=600,d=1/40)
#rtistry #rstats #mathart pic.twitter.com/s7AsBQLalq
# Tweet by George Savva, June 27, 2021
If you want to see more generative art, check out the flametree package or art by Danielle Navarro.
14.4 Write Your Own Package
When you work on a (large) project in R, chances are good that at some point you will have coded a lot of functions tailored to your specific purposes. For convenience, it could be helpful to arrange them in your very own R package. Even if you do not plan to share any of these functions on CRAN or anywhere else, a package may offer the right kind of infrastructure for your project’s purposes. An introduction to how to create R packages can be found in R Packages by Hadley Wickham and Jenny Bryan.
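As a rough illustration of how little is needed to get started, the following sketch bundles one helper function into a package skeleton inside a temporary directory. It uses base R's utils::package.skeleton(), which is not the workflow taught in R Packages (the book builds on devtools and usethis). The function to_celsius() and the package name mytools are made up for this example.

```r
# Hypothetical helper function that we want to bundle into a package
to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}

# Remember where the helper lives and create the skeleton in a temporary
# directory so that nothing in the current project is touched
here <- environment()
pkg_parent <- tempdir()

utils::package.skeleton(
  name = "mytools",      # hypothetical package name
  list = "to_celsius",   # names of the objects to include
  environment = here,    # where to look up those objects
  path = pkg_parent,
  force = TRUE           # overwrite a skeleton from an earlier run
)

# The skeleton contains, among others, a DESCRIPTION file and an R/ folder
pkg_files <- list.files(file.path(pkg_parent, "mytools"))
pkg_files
```

From here on, one would edit the DESCRIPTION file and the generated help files by hand, which is exactly the kind of chore the tools described in R Packages are meant to take off your hands.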
14.5 Text Mining
In these lecture notes we have rarely worked with data that comes in the form of texts.
Usually, we worked with tabular data which contain only very little text.
Nevertheless, R offers tools to work with them such as stringr, stringi and tidytext.
For an introduction to stringr you may want to check out the corresponding chapter of R for Data Science by Hadley Wickham & Garrett Grolemund.
Further, for an introduction to text mining with R, consider giving Text Mining with R by Julia Silge & David Robinson a try.
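To give a first impression of what working with text looks like, here is a minimal sketch that counts words with stringr. The two sentences are invented for this example, and the regular expression is a deliberately crude definition of a "word". Proper text mining, as covered in the book above, handles tokenization far more carefully.

```r
library(stringr)

# Two made-up example sentences standing in for real text data
reviews <- c(
  "Great product, works great!",
  "Not great. Would not buy again."
)

# Lower-case the text and pull out every run of letters as one "word"
words <- unlist(str_extract_all(str_to_lower(reviews), "[a-z]+"))

# Count how often each word occurs, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```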
14.6 Web Scraping
The internet is full of useful (and not so useful) information. In fact, there is so much information that one might be inclined to try to extract that information automatically. For instance, imagine that you would like to download the Apple Inc. stock price from Yahoo! Finance each morning at 9 am to track the price over time, or that you would like to download every comic from xkcd.com.
With the help of the package RSelenium or the tidyverse’s rvest package you could automate this process by writing an R script that actually goes online and extracts that price/picture for you.
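The extraction step can be sketched with rvest without going online at all. The HTML snippet below is invented for this example and does not mirror the markup of any real website, so the CSS selectors .price and img.comic are assumptions too. For a real page you would have to inspect its source to find suitable selectors and check whether the site permits scraping.

```r
library(rvest)

# Made-up HTML standing in for a downloaded page; with a real site you
# would start from read_html() applied to the page's URL instead
page <- minimal_html('
  <div id="quote">
    <span class="symbol">AAPL</span>
    <span class="price">123.45</span>
  </div>
  <img class="comic" src="https://example.com/comic.png">
')

# Select elements via CSS selectors, then extract their text or attributes
price <- page %>%
  html_element(".price") %>%
  html_text2() %>%
  as.numeric()

comic_url <- page %>%
  html_element("img.comic") %>%
  html_attr("src")
```

The image address stored in comic_url could then be handed to download.file() to save the picture locally.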
14.7 More Shiny
In these lecture notes we have barely scratched the surface of what the shiny package can do. If you are interested in creating applications which are interactive and accessible from the internet, shiny can be a great starting point from within R.
As the things you want to accomplish with your app become more advanced, you will likely need to learn more about other programming languages (see below) to extend what shiny can do. But until then, you may wish to master shiny first.
14.8 Take a Look under R’s Hood
As you work more and more with R, you will find yourself wondering about how R does things. Also, chances are good that you will want to improve your R programming skills to simply do things faster and more effectively.
Then, you may want to take a look at Advanced R by Hadley Wickham. There, you will find a huge amount of information on all things relating to R. What’s more, this book will allow you to achieve incredible things in R.
For instance, have you ever wondered how the select() function from the dplyr package is able to select the correct column?
If you’re thinking “No, but what’s so special about this?”, then you may want to note that it is actually not that simple to define your own select() function, even with the help of the dplyr package.
This is because defining an appropriate function to select a column from, say, the iris data set cannot be done like this:
my_select <- function(x) {dplyr::select(iris, x)}
Now, if you want to use the function the same way you would use dplyr::select(), i.e. simply passing, say, Sepal.Width (notice no "") to your new function, it would look like this:
my_select(Sepal.Width)
# Error: object 'Sepal.Width' not found
This error appears because R will try to evaluate the argument as a variable from your current environment.
But of course this variable is not present in your environment and only present within the iris data set.
Therefore, what dplyr::select() accomplishes is that it lets R know to evaluate the input argument only later on, i.e. when the variable is “available” from the data set.
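The idea of evaluating an argument only later can be imitated in base R. The following sketch is merely an analogy and not how dplyr is implemented (dplyr relies on tidy evaluation from rlang). The function name my_pull() is made up for this example.

```r
# A base-R analogy (not dplyr's actual implementation):
# capture the argument as an unevaluated expression and evaluate it
# later, with the columns of the data set available as variables
my_pull <- function(data, column) {
  column_expr <- substitute(column)  # the expression, e.g. Sepal.Width
  eval(column_expr, envir = data, enclos = parent.frame())
}

head(my_pull(iris, Sepal.Width), 3)
#> [1] 3.5 3.0 3.2
```

Here, substitute() prevents R from looking up Sepal.Width right away, and eval() performs the lookup once the columns of iris are available.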
Finally, with help from the rlang package, our own function can be modified using {{ }} (pronounced curly-curly), so that it works the way we intend it to.
my_select <- function(x) {dplyr::select(iris, {{x}})}
head(my_select(Sepal.Width), 10)
#> Sepal.Width
#> 1 3.5
#> 2 3.0
#> 3 3.2
#> 4 3.1
#> 5 3.6
#> 6 3.9
#> 7 3.4
#> 8 3.4
#> 9 2.9
#> 10 3.1
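The same {{ }} pattern carries over to other dplyr verbs. As a sketch, the hypothetical helper mean_by() below (the name and its arguments are made up for this example) forwards two unquoted column names, one for grouping and one for averaging.

```r
library(dplyr)

# Hypothetical helper: mean of one column within the groups of another
mean_by <- function(data, group, value) {
  data %>%
    group_by({{ group }}) %>%
    summarise(mean_value = mean({{ value }}), .groups = "drop")
}

# Both column names are passed without quotes, just like in dplyr itself
res <- mean_by(iris, Species, Sepal.Width)
res
```

The result has one row per species, so a single small function replaces a group_by()/summarise() pair that would otherwise be copied around.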
As another example, consider the scenario when you have a variable TMax and a data set that contains a column TMax.
Now, what you may want to do is filter the data set such that it delivers only results where the TMax column equals the TMax variable.
However, notice that the straightforward approach does not work as planned:
library(dplyr)  # provides tibble(), filter() and %>%
TMax <- 1000
tib <- tibble(
  TMax = c(500, 1000, 2000)
)
tib %>%
  filter(TMax == TMax)
#> # A tibble: 3 x 1
#> TMax
#> <dbl>
#> 1 500
#> 2 1000
#> 3 2000
Here, the second TMax was understood as referring to the column TMax instead of the variable TMax with value 1000, which is why the filtering returned the complete tibble.
Instead, if we only want to get the row with value 1000 in it, we will need to let filter() know that it is supposed to look for that second variable elsewhere and not in tib.
This is achieved via !! (pronounced bang-bang) and works as follows:
tib %>%
  filter(TMax == !!TMax)
#> # A tibble: 1 x 1
#> TMax
#> <dbl>
#> 1 1000
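A more explicit way to resolve this kind of ambiguity is to use the .data and .env pronouns, which are available inside dplyr verbs. The sketch below repeats the set-up from above so that it runs on its own. Whether you prefer this or !! is largely a matter of taste.

```r
library(dplyr)

TMax <- 1000
tib <- tibble(TMax = c(500, 1000, 2000))

# .data$TMax refers to the column of tib,
# .env$TMax refers to the variable defined outside of tib
res <- tib %>%
  filter(.data$TMax == .env$TMax)
res
```

Spelling out where each TMax comes from makes the intent visible to anyone reading the code later.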
Needless to say, functionalities like these can go a long way to help you define functions you may need in your data analysis. So, if you want to start learning how this and more works, take a look at Advanced R.
14.9 Other Programming Languages
Technically, learning a different programming language is not an R adventure but still it will probably help you to become a better data scientist. Further, it is best to get the idea of using only one language out of your system early on as a lot of things are achieved through a combined effort of multiple languages.
Of course, there are hundreds of languages out there and we cannot mention all of them here. So, instead consider this short list of programming languages I personally wish to learn more about.
14.9.1 Python
Python is probably the most obvious language to consider learning after having mastered R. In fact, there is an ongoing “debate” whether R or Python is THE most important i.e. best i.e. generally awesome programming language for data science purposes. Personally, I believe that these debates are similar to any dispute like
- Git vs. SVN
- Windows vs. Linux vs. Mac vs. others (see here)
- QWERTY vs. DVORAK
- Vim vs. Emacs (see editor war)
- Cats vs. dogs (I am not surprised that Wikipedia has an article on cat people and dog people)
Long story short, I believe it cannot hurt to be not only fluent in R but also know some Python or vice versa (depending on what your “primary” language is) as there are probably aspects where one language might be more convenient to use than the other.
This is also why the reticulate package exists: it lets you call Python commands from R, and RStudio supports Python as well.
Good resources to start learning Python can be found at Kaggle as they offer Python courses related to data science online for free. Additionally, I can recommend Automate the Boring Stuff with Python by Al Sweigart which is free to read online and is a more general introduction to Python.
14.9.2 HTML, CSS and JavaScript
The most common languages your web browser uses to interact with the internet are HTML, CSS and JavaScript. Therefore, if you want to do some internet-related stuff like setting up a blog or a personal website, knowledge about these languages is helpful. Similarly, if you want to scrape the web (i.e. extract information from websites), you will run into these languages too.
Further, a standard file format many databases or APIs use is JSON which is derived from JavaScript. Finally, the JavaScript library D3.js is a great tool to visualize data. Check out beautiful examples here.
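If you run into JSON while staying in R, the jsonlite package can translate between JSON text and R objects. The following is a minimal sketch; the JSON string is made up for this example and only mimics the shape of what an API might return.

```r
library(jsonlite)

# Made-up JSON text, similar in shape to what a web API might return
json_text <- '[
  {"station": "A", "TMax": 500},
  {"station": "B", "TMax": 1000}
]'

# An array of JSON objects is simplified to a data frame by default
stations <- fromJSON(json_text)
stations

# Convert the data frame back to a JSON string
json_again <- toJSON(stations)
```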