TIL what happens if you use %>% instead of + in ggplot2

This post is about the ggplot2 and dplyr packages, so let’s start with loading them:

library(ggplot2)
library(dplyr)

I can’t be the first person to make the following mistake:

ggplot(mtcars, aes(x = wt, y = mpg)) %>%
  geom_point()

Can you spot the mistake in the code above? Look closely at the end of the first line.

The operator should be the + used in ggplot2 for layering, not the %>% operator used in dplyr for piping, like this:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

So what happens if you accidentally use the pipe operator instead of the +? You get the following error:

Error in get(x, envir = this, inherits = inh)(this, ...) : 
 Mapping should be a list of unevaluated mappings created by aes or aes_string

My Google search for this error did not turn up my careless mistake as a potential cause. Since many people use these two packages together, I’m guessing such a mix-up of operators can’t be too uncommon (right? I can’t be the only one…). So I’m leaving this post here for the next person who makes the same mistake.
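For what it’s worth, the two operators can happily live in the same chain as long as each does its own job: %>% passes the data along, and + adds layers once ggplot() has been called. Here is a minimal sketch, assuming ggplot2 and dplyr are installed; the filter on cyl is a made-up step for illustration only and is not part of the example above.

```r
library(ggplot2)
library(dplyr)

# Pipe the data through dplyr verbs, then switch to + once ggplot() is called.
p <- mtcars %>%
  filter(cyl != 8) %>%               # hypothetical filter, for illustration only
  ggplot(aes(x = wt, y = mpg)) +     # the data arrive via the pipe
  geom_point()                       # layers are added with +, not %>%

# Typing p (or print(p)) at the console draws the scatterplot.
```

Because %>% binds more tightly than +, everything up to and including the ggplot() call is evaluated first, and the layer is then added to the resulting plot object.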


Data News: Fitbit + iHealth, and Open Justice data

The LA Times reported today, along with several other sources, that the California Department of Justice has launched a new “open justice” data initiative. On their portal, the “Justice Dashboard”, you can view Arrest Rates, Deaths in Custody, or Law Enforcement Officers Killed or Assaulted.

I chose, for my first visit, to look at Deaths in Custody.  At first, I was disappointed with the quality of the data provided.  Instead of data, you see some nice graphical displays, mostly univariate but a few with two variables, addressing issues and questions that are probably on many people’s minds.  (Alarmingly, the second most common cause of death for people in custody is homicide by a law enforcement officer.)

However, if you scroll to the bottom, you’ll see that you can, in fact, download relatively raw data, in the form of a spreadsheet in which each row is a person in custody who died. Variables include dates of birth and death, gender, race, custody status, offense, reporting agency, and many others. Altogether, there are 38 variables and over 15,000 observations. The data set comes with a nice codebook, too.

FitBit vs. the iPhone

On to a cheerier topic. This quarter I will be teaching regression, and once again my FitBit provided inspiration. If you teach regression, you know one of the awful secrets of statistics: there are no linear associations. Well, they are few and far between. And so I was pleased when a potentially linear association sprang to mind: how well do FitBit step counts predict the Health app counts?

Health is an iOS 8 app. It was automatically installed on your iPhone, whether you wanted it or not. (I speak from the perspective of an iPhone 6 user, with iOS 8 installed.) Apparently, whether you know it or not, your steps are being counted. If you have an Apple Watch, you know about this. But if you don’t, it happens invisibly, until you open the app. Or buy the watch.

How can you access these data? I did so by downloading the free app QS (for “Quantified Self”). The Quantified Self people have a website directing you to hundreds of apps you can use to learn more about yourself than you probably should. Once the app is installed, you simply open it, choose which variables you wish to download, click ‘submit’, and a csv file is emailed to you (or whomever you wish).

The FitBit data can only be downloaded if you have a premium account.  The FitBit premium website has a ‘custom option’ that allows you to download data for any time period you choose, but currently, due to an acknowledged bug, no matter which dates you select, only one month of data will be downloaded. Thus, you must download month by month.  I downloaded only two months, July and August, and at some point in August my FitBit went through the wash cycle, and then I misplaced it.  It’s around here, somewhere, I know. I just don’t know where.  For these reasons, the data are somewhat sparse.

I won’t bore you with details, but by applying functions from the lubridate package in R and using the gsub function to remove commas (because FitBit inexplicably inserts commas into its numbers and, I almost forgot, adds a superfluous title to the document, which requires that you use the “skip = 1” option in read.table), it was easy to merge a couple of months of FitBit data with the Health data. And so here’s how they compare:
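For the curious, the cleaning steps just described might look something like the sketch below. The rows, the column names (Date, Steps, iOS.Steps), and the date formats are all made up for illustration, since the actual export layouts aren’t shown here, and read.csv stands in for read.table.

```r
library(lubridate)

# Made-up stand-in for a FitBit export: a superfluous title line, then the data.
# Column names and values are assumptions for illustration only.
fitbit_raw <- 'Activities
Date,Steps
"7/1/2015","10,234"
"7/2/2015","8,051"
"7/3/2015","12,400"'

# skip = 1 drops the title line; steps arrive as text because of the commas.
fitbit <- read.csv(text = fitbit_raw, skip = 1, stringsAsFactors = FALSE)
fitbit$Steps <- as.numeric(gsub(",", "", fitbit$Steps))  # "10,234" -> 10234
fitbit$Date  <- mdy(fitbit$Date)                         # parse month/day/year

# Made-up stand-in for the Health app export, with ISO-formatted dates.
health <- data.frame(Date = ymd(c("2015-07-01", "2015-07-02", "2015-07-04")),
                     iOS.Steps = c(11020, 9100, 7000))

# Keep only the days present in both sources.
steps <- merge(fitbit, health, by = "Date")
names(steps)[names(steps) == "Steps"] <- "FitBit.Steps"
steps
```

Parsing both date columns into the same Date class before merging is what lets merge() line up the days, even though the two sources write their dates differently.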


The regression line is Predicted.iOS.Steps = 1192 + 0.9553 (FitBit.Steps), with an r-squared of 0.9223. (A residual plot shows that the relationship is not quite as linear as it looks. Damn.)
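To get a feel for what this line says, here is a small sketch that plugs numbers into it. The coefficients are the ones reported above; the function name and the 10,000-step input are arbitrary choices for illustration.

```r
# Coefficients as reported above; the input value below is an arbitrary example.
intercept <- 1192
slope     <- 0.9553

# Hypothetical helper: predicted Health app steps for a given FitBit count.
predict_ios_steps <- function(fitbit_steps) {
  intercept + slope * fitbit_steps
}

predict_ios_steps(10000)   # 1192 + 9553 = 10745

# FitBit count at which the line predicts the two devices agree exactly
# (roughly 26,667 steps); below that, the predicted iPhone count is the larger one.
intercept / (1 - slope)
```

Taken at face value, then, the fitted line has the iPhone reporting somewhat more steps than the FitBit on a typical day, which is one way into the question about slope and intercept below.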

Questions I’m thinking of posing on the first day of my regression class this quarter:

  1. Which do you think is a more reliable counter of steps?
  2. How closely in agreement are these two step-counting tools? How would you measure this?
  3. What do the slope and intercept tell us?
  4. Why is there more variability for low FitBit step counts than for high?
  5. I often lose my FitBit. Currently, for instance, I have no idea where it is.  On those days, FitBit reports “0 steps”. (I removed the 0’s from this analysis.)  Can I use this regression line to predict the values on days I lose my FitBit?  With how much precision?

I think it will be helpful to talk about these questions informally, on the first day, before they have learned more formal methods for tackling these.  And maybe I’ll add a few more months of data.

PDF and Citation Management

A new academic year looms. This means a new crop of graduate students will begin their academic training. PDF management is a critical tool that all graduate students need to use and the sooner the better. Often these tools go hand-in-hand with a citation management system, which is also critical for graduate students.

Using citation management software makes scholarly work easier and more effective. First and foremost, these tools allow you to automatically cite references for a paper in a wide range of bibliographic styles. They also allow you to organize, evaluate, annotate, and search within your citation collection and share your references with others. Often they also sync across machines and devices, allowing you to access your database wherever you are.

There are several tools available for PDF/citation management, including BibDesk, Mendeley, Papers, Sente, and Zotero.

Some of these are citation managers only (BibDesk). Many also allow you to manage your PDF files: naming, organizing, and moving your files to a central repository on your computer. Some allow for annotation within the software as well. There are several online comparisons of some of the different systems (e.g., Penn Libraries, UW Madison Library, etc.). From my experience, students tend to choose either Mendeley or Zotero; my guess is because they are free.

There is a lot to be said for free software, and both Zotero and Mendeley seem pretty solid. However, as a graduate student you should understand that you are investing in your future. This type of tool, I think it is fair to say, you will be using daily. Spending money on a tool that has the features and UI that you will want to use is perfectly ok and should even be encouraged.

Another consideration for students who are beginning the process is to find out what your advisor(s) and research groups use. Although many tools are cross-compatible, using and learning a tool is easier with a group helping you.

What Do I Use?

I use Papers. It is not free (a student license is ~$50). When I started using Papers, Mendeley and Zotero were not available. I have since used both Mendeley and Zotero for a while, but ultimately made the decision this summer to switch back to Papers. It is faster and, more importantly to me, has better search functionality, both across and within a paper.

I would like to use Sente (free for up to 100 references), but the search function is very limited. In my opinion, Sente has the best UI: it is sleek and minimalist, and reading a paper is a nice experience.

My Recommendation…

Ultimately, use what you are comfortable with, and then actually use it. Take the time to enter ALL the meta-data for PDFs as you accumulate them. Don’t imagine you will have time to do it later…you won’t. Being organized with your references from the start will keep you more productive later.


Very brief first day of class activity in R

The new academic year has started for most of us. I try to do a range of activities on the first day of my introductory statistics course, and one of them is an incredibly brief activity to just show students what R is and what the RStudio window looks like. Here it is:

Generate a random number between 1 and 5, and introduce yourself to that many people sitting around you:

sample(1:5, size = 1)

It’s a good opportunity to have students access RStudio once, talk about random sampling, and break up the class session and have them introduce themselves to their fellow classmates. I usually do the activity too, and use it as an opportunity to personally introduce myself to a few students and to meet them.
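If a student asks whether each number really is equally likely, a quick follow-up along these lines can help. This is only a sketch: the seed and the number of draws are arbitrary choices.

```r
set.seed(101)  # arbitrary seed, only so that the demonstration is repeatable

# Draw 1000 times instead of once; replace = TRUE allows repeated values.
draws <- sample(1:5, size = 1000, replace = TRUE)

# Proportion of draws landing on each value; each should be close to 0.2.
table(draws) / length(draws)
```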

If you’re interested in everything else I’m doing in my introductory statistics course you can find the course materials for this semester at http://bit.ly/sta101_f15 and find the source code for all publicly available materials like slides, labs, etc. at https://github.com/mine-cetinkaya-rundel/sta101_f15. Both of these will be updated throughout the semester. Feel free to grab whatever you find useful.

R packages for undergraduate stat ed

The other day on the isostat mailing list Doug Andrews asked the following question:

Which R packages do you consider the most helpful and essential for undergrad stat ed? I ask in great part because it would help my local IT guru set up the way our network makes software available in our computer classrooms, but also just from curiosity.

Doug asked for a top 10 list, and a few people have already chimed in with great suggestions. I thought those not on the list might also have good ideas, so, with Doug’s permission, I’m reposting the question here.

Here is my top 10 (ok, 12) list:
(Links go to vignettes or pages I find to be quickest / most useful references for those packages, but if you know of better resources, let me know and I’ll update.)

  1. knitr / rmarkdown – for reproducible data analysis with literate programming, great set of tools that students can use from day 1 in intro stats all the way through to writing their undergrad theses
  2. dplyr – for most data manipulation tasks, with the added benefit of piping (via magrittr)
  3. ggplot2 – easy faceting allows for graphing multivariate relationships more easily than with base R (lattice is also good for that, but IMO ggplot2 graphics look more modern and lattice has a much steeper learning curve)
  4. openintro – or packages that come with the textbooks you use, great for pulling up any dataset from the text and building on it in class (a new version coming soon to fully complement 3rd edition of OpenIntro Statistics)
  5. mosaic – for consistent syntax for functions used in intro stat
  6. googlesheets – for loading data directly from Google spreadsheets
  7. lubridate – if you ever need to work with any date fields
  8. stringr – for text parsing and manipulation
  9. rvest – for scraping data off the web
  10. readr / data.table – for loading large datasets & default stringsAsFactors = FALSE

And the following suggestions from Randall Pruim complement this list nicely:

  • readxl – for reading Excel data
  • tidyr – for converting between wide and long formats and for the very useful extract_numeric()
  • ggvis – ggplot2 “done right” and tuned for interactive graphics
  • htmlwidgets – this is actually a collection of packages for plots: see leaflet for maps and dygraphs for time series, for example

Note that most of these packages are for data manipulation and visualization. Methods-specific packages that are useful / essential for a particular undergraduate program might depend on the focus of that program. Some packages that have come up in the discussion so far are:

  • lme4 – for mixed models
  • pwr – for showing sample size and power calculations
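For anyone setting up a classroom machine, a rough sketch like the following checks which of the packages named above are already installed. The vector simply collects the names from the lists in this post; whether each one is still available from your repository is something to verify locally.

```r
# Packages named in the lists above (availability may have changed since).
wanted <- c("knitr", "rmarkdown", "dplyr", "ggplot2", "openintro", "mosaic",
            "googlesheets", "lubridate", "stringr", "rvest", "readr",
            "data.table", "readxl", "tidyr", "ggvis", "htmlwidgets",
            "lme4", "pwr")

# Which of them are not yet installed in this R library?
missing_pkgs <- setdiff(wanted, rownames(installed.packages()))
missing_pkgs

# Uncomment to install whatever is missing from the default repository:
# install.packages(missing_pkgs)
```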

This blog post is meant to provide a space for continuing this discussion, so I’ll ask the question one more time: Which R packages do you consider the most helpful and essential for undergrad stat ed? Please add your responses to the comments.


PS: Thanks to Michael Lopez for suggesting that I post this list somewhere.
PPS: I should really be working on my fast-approaching JSM talk.

“Mail merge” with RMarkdown

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns out this is a pretty straightforward task. Below is a very simple minimum working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 csv file; an excerpt is shown below. I got the names from here.

name               meeting_time
Peggy Kallas       9:00 AM
Ezra Zanders       9:15 AM
Hope Mogan         9:30 AM
Nathanael Scully   9:45 AM
Mayra Cowley       10:00 AM
Ethelene Oglesbee  10:15 AM

R script: mail_merge_script.R

## Packages
library(rmarkdown)

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop: render one handout per row; `i` is visible to the Rmd file
for (i in seq_len(nrow(personalized_info))) {
  rmarkdown::render(input = "mail_merge_handout.Rmd",
                    output_format = "pdf_document",
                    output_file = paste("handout_", i, ".pdf", sep = ""),
                    output_dir = "handouts/")
}

RMarkdown: mail_merge_handout.Rmd

---
output: pdf_document
---

```{r echo=FALSE}
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]
```

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files contains the same letter, with the name and meeting time fields being different in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.
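Keep in mind that the file extension in output_file should change along with the format. One way to keep the two in sync is a small helper like the one sketched below; handout_file is a made-up name for illustration, not part of rmarkdown.

```r
# Hypothetical helper: map an rmarkdown output format to a matching file name.
handout_file <- function(i, output_format = "pdf_document") {
  ext <- switch(output_format,
                pdf_document  = ".pdf",
                html_document = ".html",
                word_document = ".docx",
                stop("unsupported output format: ", output_format))
  paste0("handout_", i, ext)
}

handout_file(3, "word_document")   # "handout_3.docx"

# Inside the loop, the call would then look like this (not run here):
# rmarkdown::render(input = "mail_merge_handout.Rmd",
#                   output_format = "html_document",
#                   output_file = handout_file(i, "html_document"),
#                   output_dir = "handouts/")
```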

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post. I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts is what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students: wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two-day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about this now? I was at USCOTS these last few days, and led a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands-on exercises and promote it as a tool for reproducible data analysis and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also briefly discussed further tips for documentation and organization, as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Willful Ignorance [Book Review]

I just finished reading Willful Ignorance: The Mismeasure of Uncertainty by Herbert Weisberg. I gave this book five stars (out of five) on Goodreads.

According to Weisberg, the text can be

“regarded as two books in one. On one hand it is a history of a big idea: how we have come to think about uncertainty. On the other, it is a prescription for change, especially with regard to how we perform research in the biomedical and social sciences” (p. xi).

Willful ignorance is the idea that to deal with uncertainty, statisticians simplify the situation by filtering out or ignoring much of what we know…we willfully ignore some information in order to quantify the amount of uncertainty.

The book gives a cogent account of the history and evolution of the ideas of probability, tackling head-on the questions: what is probability, how did we come to our current understanding of probability, and how did mathematical probability come to represent uncertainty and ambiguity?

Although Weisberg presents a nice historical perspective, the book is equally philosophical. In some ways it is a more leisurely read of the material found in Hacking, and in many ways more compelling.

I learned a great deal from this book. In many places I found myself re-reading sections and spiraling back to previously read sections to read them with some new understanding. I may even try to assign parts of it to the undergraduates I am teaching this summer.

This book would make a wonderful beach read for anyone interested in randomness or uncertainty, or for any academic hipster.

Quantitatively Thinking

John Oliver said it best: April 15 combines Americans’ two most-hated things: taxes and math. I’ve been thinking about the latter recently after hearing a fascinating talk last weekend about quantitative literacy.

QL is meant to describe our ability to think with, and about, numbers. QL doesn’t include high-level math skills, but is usually meant to describe our ability to understand percentages and proportions and basic mathematical operations. This is a really important type of literacy, of course, but I fear that the QL movement could benefit from merging QL with SL (Statistical Literacy).

No surprise, that, coming from this blog.  But let me tell you why.  The speaker began by saying that many Americans can’t figure out, given the amount of gas in their tank, how many miles they have to drive before they run out of gas.

This dumbfounded me. If it were literally true, you’d see stalled cars every few blocks in Los Angeles. (Now we see them only every 3 or 4 miles.) But I also thought, wait, do I know how far I can drive before I run out of gas? My gas gauge says I have half a tank left, and I think (but am not certain) that my tank holds 16 gallons. That means I probably have 8 gallons left. I can see I’ve driven about 200 miles since I last filled up because I remembered to hit that little mileage reset button that keeps track of such things. And so I’m averaging 25 mpg. But I’m also planning a trip to San Diego in the next couple of days, and then I’ll be driving on the highway, and so my mileage will improve. And that 25 mpg is just an average, and averages have variability, but I don’t really have a sense of the variability of that mean. And this problem requires that I know my mpg in the future, and, well, of all the things you can predict, the future is the hardest. And so, I’m left to conclude that I don’t really know when my car will run out of gas.

Now while I don’t know the exact number of miles I can drive, I can estimate the value.  With a little more data I can measure the uncertainty in this estimate, too, and use that to decide, when the tank gets low, if I should push my luck (or push my car).

And that example, I think, illustrates a problem with the QL movement. The issue is not that Americans don’t know how to calculate how far they can drive before their car runs out of gas, but that they don’t know how to estimate how far they can drive. This is not just mincing words. The actual problem from which the initial startling claim was made was something like this: “Your car gets 25 mpg and you have 8 gallons left in your tank. How far can you drive before you run out of gas?” In real life, the answer is “It depends.” This is a situation that every first-year stats student should recognize contains variability. (For those of you whose car tries to tell you how many miles you have left in your tank, you’ve probably experienced that pleasing event when you begin your trip with, say, 87 miles left in your tank and end your trip 10 miles later with 88 miles left in your tank. And so you know first hand the variability in this system.) The correct response to this question is to try to estimate the miles you can drive, and to recognize the assumptions you must make to do this estimation. Instead, we are meant to go into “math mode” and recognize this not as a life-skills problem but as a Dreaded Word Problem. One sign that you are dealing with a DWP is that there are implicit assumptions that you’re just supposed to know, and you’re supposed to ignore your own experience and plow ahead so that you can get the “right” answer, as opposed to the true answer. (Which is: “it depends”.)

A better problem would provide us with data. Perhaps we would see the distances travelled on 8 gallons over the last 10 trips. Or perhaps on just 5 gallons, and then we would have to estimate how far we could go, on average, with 8 gallons. And we should be asked to state our assumptions and to consider the consequences if those assumptions are wrong. In short, we should be performing a modeling activity, and not a DWP. Here’s an example: On my last 5 trips, on 10 gallons of gas I drove 252, 184, 300, 355, 205 miles. I have 10 gallons left, and I must drive 200 miles. Do I need to fill up? Explain.**
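A first pass at the example problem might start with a few summaries like those sketched below. This is only one possible opening move, not a model answer; with five trips, any conclusion leans heavily on the assumption that the next trip resembles the last five.

```r
# The five trips from the example, each on 10 gallons of gas.
miles <- c(252, 184, 300, 355, 205)

mean(miles)          # 259.2 miles on average
sd(miles)            # trip-to-trip spread, roughly 70 miles
mean(miles < 200)    # 0.2: one of the five trips fell short of 200 miles
miles / 10           # implied miles per gallon on each trip
```

The average comfortably clears 200 miles, but one trip in five did not, and that gap between the two summaries is exactly what “Explain” is asking students to wrestle with.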

The point is that one reason QL seems to be such a problem is not that we can’t think about numbers, but that the questions that have been used to conclude that we can’t think about numbers are not reflective of real-life problems. Instead, these questions are reflective of the DWP culture. I should emphasize that this is just one reason. I’ve seen first hand that many students wrestle with proportions and basic number-sense. This sort of question, which comes up often in intro stats (“I am 5 inches taller than average. One standard deviation is 3 inches. How many standard deviations above average am I?”), is a real stumper for many students, and this is sad because by the time they get to college this sort of thing should be answerable through habit, and not require thinking through for the very first time. (Interestingly, if you change the 5 to a 6 it becomes much easier for some, but not for all.)

And so, while trying to ponder the perplexities of finding your tax bracket, be consoled that a great number of others (who really knows how many?) are feeling the same QL anxiety as you. But for a good reason: tax problems are perhaps the rare examples of DWPs that actually matter.
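For the record, the arithmetic behind that intro stats question is a single division, sketched here with a throwaway helper (sds_above is a made-up name).

```r
# Hypothetical helper: how many standard deviations is a given difference?
sds_above <- function(difference, sd) difference / sd

sds_above(5, 3)   # about 1.67 standard deviations above average
sds_above(6, 3)   # exactly 2, which may be why this version feels easier
```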

**suggestions for improving this problem are welcome!

Interpreting Cause and Effect

One big challenge we all face is understanding what’s good and what’s bad for us. And it’s harder when published research studies conflict. And so thanks to Roger Peng for posting on his Facebook page an article that led me to this article by Emily Oster: Cellphones Do Not Give You Brain Cancer, from the good folks at the 538 blog. I think this article would make for a great classroom discussion, particularly if, before you show them the article, your students brainstorm several possible experimental designs and discuss the strengths and weaknesses of each. I think it is also interesting to ask why no study similar to the Danish Cohort study was done in the US. Thinking about this might lead students to think about cultural attitudes towards wide-spread data collection.