A two-hour introduction to data analysis in R

A few weeks ago I gave a two-hour Introduction to R workshop for the Master of Engineering Management students at Duke. The session was organized by the student-led Career Development and Alumni Relations committee within this program. The slides for the workshop can be found here and the source code is available on GitHub.

Why might this be of interest to you?

  • The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
  • The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against these datasets; in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
    • Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
    • Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not so large that it slows things down (which can be an issue when visualizing much larger datasets), but it’s large enough that wrangling the data by hand would be too much trouble.
    • Con: The visualizations don’t reveal particularly useful insights into the data. While this is not strictly necessary for teaching syntax, it would have been a welcome cherry on top…
  • The raw dataset has a feature I love: it’s been damaged, most likely by being opened in Excel! One of the variables in the dataset is the age group of the biker (BikeAge_Gr). Here is the age distribution of bikers as it appears in the original data:
##    BikeAge_Gr crash_count
##    (chr)      (int)
## 1  0-5        60
## 2  10-Jun     421
## 3  15-Nov     747
## 4  16-19      605
## 5  20-24      680
## 6  25-29      430
## 7  30-39      658
## 8  40-49      920
## 9  50-59      739
## 10 60-69      274
## 11 70         12
## 12 70+        58

Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing something more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.
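For example, here is a minimal sketch of that recoding step, assuming the crash data have been read into a data frame called bike (a hypothetical name) and using dplyr together with stringr:

library(dplyr)
library(stringr)

# Recode the date-mangled age groups back to their intended labels
bike <- bike %>%
  mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10"),
         BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"))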

I should admit that I don’t have evidence of Excel causing this issue. However, this is my best guess, since “helping” the user by formatting date-like fields is standard Excel behaviour. There may be other software out there that also does this that I’m unaware of…

If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.

TIL what happens if you use %>% instead of + in ggplot2

This post is about the ggplot2 and dplyr packages, so let’s start with loading them:
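library(ggplot2)
library(dplyr)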


I can’t be the first person to make the following mistake:

ggplot(mtcars, aes(x = wt, y = mpg)) %>%
  geom_point()

Can you spot the mistake in the code above? Look closely at the end of the first line.

The operator at the end of that line should be the + used in ggplot2 for layering, not the %>% operator used for piping with dplyr. The corrected code looks like this:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

So what happens if you accidentally use the pipe operator instead of the +? You get the following error:

Error in get(x, envir = this, inherits = inh)(this, ...) : 
 Mapping should be a list of unevaluated mappings created by aes or aes_string

My Google search for this error did not turn up my careless mistake as a potential cause. Since many people use these two packages together, I’m guessing such a mix-up of operators can’t be too uncommon (right? I can’t be the only one…). So I’m leaving this post here for the next person who makes the same mistake.


Very brief first day of class activity in R

The new academic year has started for most of us. I try to do a range of activities on the first day of my introductory statistics course, and one of them is an incredibly brief activity that simply shows students what R is and what the RStudio window looks like. Here it is:

Generate a random number between 1 and 5, and introduce yourself to that many people sitting around you:

sample(1:5, size = 1)

It’s a good opportunity to have students access RStudio once, talk about random sampling, and break up the class session by having students introduce themselves to their fellow classmates. I usually do the activity too, and use it as an opportunity to personally introduce myself to a few students and to meet them.

If you’re interested in everything else I’m doing in my introductory statistics course you can find the course materials for this semester at http://bit.ly/sta101_f15 and find the source code for all publicly available materials like slides, labs, etc. at https://github.com/mine-cetinkaya-rundel/sta101_f15. Both of these will be updated throughout the semester. Feel free to grab whatever you find useful.

R packages for undergraduate stat ed

The other day on the isostat mailing list Doug Andrews asked the following question:

Which R packages do you consider the most helpful and essential for undergrad stat ed? I ask in great part because it would help my local IT guru set up the way our network makes software available in our computer classrooms, but also just from curiosity.

Doug asked for a top 10 list, and a few people have already chimed in with great suggestions. I thought those not on the list might also have good ideas, so, with Doug’s permission, I’m reposting the question here.

Here is my top 10 (ok, 12) list:
(Links go to vignettes or pages I find to be quickest / most useful references for those packages, but if you know of better resources, let me know and I’ll update.)

  1. knitr / rmarkdown – for reproducible data analysis with literate programming, great set of tools that students can use from day 1 in intro stats all the way through to writing their undergrad theses
  2. dplyr – for most data manipulation tasks, with the added benefit of piping (via magrittr)
  3. ggplot2 – easy faceting allows for graphing multivariate relationships more easily than with base R (lattice is also good for that, but IMO ggplot2 graphics look more modern and lattice has a much steeper learning curve)
  4. openintro – or packages that come with the textbooks you use, great for pulling up any dataset from the text and building on it in class (a new version coming soon to fully complement 3rd edition of OpenIntro Statistics)
  5. mosaic – for consistent syntax for functions used in intro stat
  6. googlesheets – for loading data directly from Google spreadsheets
  7. lubridate – if you ever need to work with any date fields
  8. stringr – for text parsing and manipulation
  9. rvest – for scraping data off the web
  10. readr / data.table – for loading large datasets & default stringsAsFactors = FALSE

And the following suggestions from Randall Pruim complement this list nicely:

  • readxl – for reading Excel data
  • tidyr – for converting between wide and long formats and for the very useful extract_numeric()
  • ggvis – ggplot2 “done right” and tuned for interactive graphics
  • htmlwidgets – this is actually a collection of packages for plots: see leaflet for maps and dygraphs for time series, for example

Note that most of these packages are for data manipulation and visualization. Methods-specific packages that are useful / essential for a particular undergraduate program will depend on the focus of that program. Some packages that have come up in the discussion so far are:

  • lme4 – for mixed models
  • pwr – for showing sample size and power calculations
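If it helps with classroom or lab setup, here is a minimal sketch of installing all of the packages mentioned above in one go (package names as listed; trim or extend to suit your program):

# Install the packages listed above (a sketch; adjust as needed)
pkgs <- c("knitr", "rmarkdown", "dplyr", "ggplot2", "openintro", "mosaic",
          "googlesheets", "lubridate", "stringr", "rvest", "readr", "data.table",
          "readxl", "tidyr", "ggvis", "htmlwidgets", "leaflet", "dygraphs",
          "lme4", "pwr")
install.packages(pkgs)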

This blog post is meant to provide a space for continuing this discussion, so I’ll ask the question one more time: Which R packages do you consider the most helpful and essential for undergrad stat ed? Please add your responses to the comments.


PS: Thanks to Michael Lopez for suggesting that I post this list somewhere.
PPS: I should really be working on my fast-approaching JSM talk.

“Mail merge” with RMarkdown

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns out this is a pretty straightforward task. Below is a very simple minimal working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 CSV file; an excerpt is shown below. I got the names from here.

name,meeting_time
Peggy Kallas,9:00 AM
Ezra Zanders,9:15 AM
Hope Mogan,9:30 AM
Nathanael Scully,9:45 AM
Mayra Cowley,10:00 AM
Ethelene Oglesbee,10:15 AM

R script: mail_merge_script.R

## Packages
library(rmarkdown)

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop
for (i in 1:nrow(personalized_info)){
  rmarkdown::render(input = "mail_merge_handout.Rmd",
                    output_format = "pdf_document",
                    output_file = paste("handout_", i, ".pdf", sep = ""),
                    output_dir = "handouts/")
}

RMarkdown: mail_merge_handout.Rmd

---
output: pdf_document
---

```{r echo=FALSE}
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]
```

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files looks something like this


with the name and meeting time fields being different in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.
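For example, a hypothetical variation of the render() call inside the loop that produces Word documents instead:

# Same loop body, but rendering to Word (a sketch; the file extension changes too)
rmarkdown::render(input = "mail_merge_handout.Rmd",
                  output_format = "word_document",
                  output_file = paste("handout_", i, ".docx", sep = ""),
                  output_dir = "handouts/")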

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post; I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts is what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students — wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about it now? I was at USCOTS these last few days, and led a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands-on exercises and promote it as a tool for reproducible data analysis, and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also discussed briefly further tips for documentation and organization as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how, while IRBs are not required for data collected by students for class projects, discussing the ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and educate our students about it. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he has moved from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing issues surrounding data ethics and IRBs in the introductory statistics course (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by her involvement with the honor council of her university, where she realized that while the council keeps impeccable records of reported cases, it doesn’t have any information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example, thinking about how to code “any” academic misconduct based on various questions that ask whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. It’s my experience that students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seems to hit both of those notes.

Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct, and an outcome of their own choosing, and they presented their findings in an optional (some form of extra credit) research paper. One example that Shannon gave for the latter task was defining a “serious offender”: is it a student who commits a one-time serious offense, or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that the hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment Shannon mentioned that the administration at her school was concerned that students finding out about high rates of academic offenses (the survey showed that about 60% of students had committed a “major” academic offense) might make students think that it’s ok, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking to expand this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like a lot of work (though the presenters assured us that it’s not), projects like these can remind students that data and statistics can be powerful activism tools.

Data Analysis and Statistical Inference starts tomorrow on Coursera

It has been (and still is) lots of work putting this course together, but I’m incredibly excited about the opportunity to teach (and learn from) the masses! Course starts tomorrow (Feb 17, 2014) at noon EST.


A huge thanks also goes out to my student collaborators who helped develop, review, and revise much of the course materials (and who will be taking on the role of Community TAs on the course discussion forums) and to Duke’s Center for Instructional Technology, which pretty much runs the show.

This course is also part of the Reasoning, Data Analysis and Writing Specialization, along with Think Again: How to Reason and Argue and English Composition 1: Achieving Expertise. This interdisciplinary specialization is designed to strengthen students’ ability to engage with others’ ideas and communicate productively with them by analyzing their arguments, identifying the inferences they are drawing, and understanding the reasons that inform their beliefs. After taking all three courses, students complete an in-depth capstone project where they choose a controversial topic and write an article-length essay in which they use their analysis of the data to argue for their own position about that topic.

Let’s get this party started!

Conditional probabilities and kitties

I was at the vet yesterday, and just like with any doctor’s visit experience, there was a bit of waiting around — time for re-reading all the posters in the room.


And this is what caught my eye on the information sheet about feline heartworm (I’ll spare you the images):


The question asks: “My cat is indoor only. Is it still at risk?”

The way I read it, this question is asking about the risk of an indoor only cat being heartworm positive. To answer this question we would want to know P(heartworm positive | indoor only).

However the answer says: “A recent study found that 27% of heartworm positive cats were identified as exclusively indoor by their owners”, which is P(indoor only | heartworm positive) = 0.27.

Sure, this gives us some information, but it doesn’t actually answer the original question. The original question is asking about the reverse of this conditional probability.
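To see why the direction matters, here is a quick back-of-the-envelope calculation with Bayes’ theorem, using made-up numbers for everything except the 27% from the flyer:

# Hypothetical inputs, just to illustrate the arithmetic
p_hw        <- 0.10   # P(heartworm positive), made-up prevalence
p_indoor_hw <- 0.27   # P(indoor only | heartworm positive), from the flyer
p_indoor    <- 0.60   # P(indoor only), made-up

# Bayes' theorem: P(HW+ | indoor only) = P(indoor only | HW+) * P(HW+) / P(indoor only)
p_hw_indoor <- p_indoor_hw * p_hw / p_indoor
p_hw_indoor
## [1] 0.045

With these (entirely hypothetical) inputs, the answer to the owner’s actual question would be 4.5%, quite different from the 27% quoted on the sheet.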

When we talk about Bayes’ theorem in my class and work through examples about the sensitivity and specificity of medical tests, I always tell my students that doctors are actually pretty bad at these. Looks like I’ll need to add vets to my list too!

My first Shiny experience – CLT applet

When introducing the Central Limit Theorem for the first time in class, I used to use applets like the SOCR Sampling Distribution Applet or the OnlineStatBook Sampling Distribution Applet. If you are reading this post on Google Chrome, chances are those previous links did not work for you. If you’re on another browser, they may have worked, but you may have also seen warnings like this one:


Last year when I tried using one of these applets in class, and had students pull it up on their own computers as well, it was chaos. Between warnings like this and no simple way for everyone, on their various computers and operating systems, to update Java, most students got frustrated. As a class we had to give up playing with the applet, and the students just watched me go through the demonstrations on the screen.

In an effort to make things a little easier this year, I searched to see if I could find something similar created using Shiny. This one, created by Tarik Gouhier, looked pretty promising. However it wasn’t exactly what I was looking for. For example, it’s pretty safe to assume that my students have never heard of the Cauchy distribution, and I didn’t want to present something that might confuse them further.

Thanks to the code being available on GitHub, I was able to re-write the applet to match the functionality of the previous CLT applets: http://rundel.dyndns.org:3838/CLT.
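For anyone curious what such an app involves, below is a minimal sketch of a CLT demo in Shiny (not the actual applet, just the general idea): pick a population distribution and a sample size, draw many samples, and plot the sampling distribution of the sample mean.

library(shiny)

ui <- fluidPage(
  titlePanel("CLT demo (sketch)"),
  sidebarLayout(
    sidebarPanel(
      selectInput("dist", "Population distribution",
                  choices = c("normal", "uniform", "right skewed")),
      sliderInput("n", "Sample size", min = 2, max = 500, value = 30),
      sliderInput("reps", "Number of samples", min = 100, max = 5000, value = 1000)
    ),
    mainPanel(plotOutput("pop"), plotOutput("sampling"))
  )
)

server <- function(input, output) {
  # Return a random number generator for the chosen population
  rpop <- reactive({
    switch(input$dist,
           "normal"       = function(n) rnorm(n),
           "uniform"      = function(n) runif(n),
           "right skewed" = function(n) rexp(n))
  })

  output$pop <- renderPlot({
    hist(rpop()(10000), breaks = 50, main = "Population distribution", xlab = "")
  })

  output$sampling <- renderPlot({
    means <- replicate(input$reps, mean(rpop()(input$n)))
    hist(means, breaks = 50, main = "Sampling distribution of the sample mean", xlab = "")
  })
}

shinyApp(ui = ui, server = server)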


I’m sure I’ll make some edits to the applet after I class-test it today. Among planned improvements are:

  • an intermediary step between the top (population distribution) and the bottom (sampling distribution) plots: the sample distribution.
  • sliders for input parameters (like mean and standard deviation) for the population distribution.

None of this is revolutionary, but it’s great to be able to build on someone else’s work so quickly. Plus, since all of the code is in R, which the students are learning anyway, those who are particularly motivated can dive deeper and see the connection between the demonstration and what they’re doing in lab.

If you use such demonstrations in your class and have suggestions for improvements, leave a comment below. If you’d like to customize the applet for your use, the code is linked on the applet page, and I’ll be transitioning it to GitHub as I work on creating a few more such applets.

(I should also thank Colin Rundel who helped with the implementation and is temporarily hosting the applet on his server until I get my Shiny Server set up — I filled out the registration form last night but I’m not yet sure what the next step is supposed to be.)