Halloween: An Excuse for Plotting with Icons

In my course on the GLM, we are discussing residual plots this week. Given that it is also Halloween this Saturday, it seems like a perfect time to code up a residual plot made of ghosts.

Ghost plotThe process I used to create this plot is as follows:

  1. Find an icon that you want to use in place of the points on your scatterplot (or dot plot).

I used a ghost icon (created by Andrea Mazzini) obtained from The Noun Project. After downloading the icon, I used Preview to create a new PNG file that had cut out the citation text in the downloaded image. I will add the citation text at a later stage in the plot itself. This new icon was 450×450 pixels.

  1. Use ggplot to create a scatterplot of a set of data, making the size of the points 0.

Here is the code that will create the data and make the plot that I used.

plotData = data.frame(
  .fitted = c(76.5, 81.3, 75.5, 79.5, 80.1, 78.5, 79.5, 77.5, 81.2, 80.4, 78.1, 79.5, 76.6, 79.4, 75.9, 86.6, 84.2, 83.1, 82.4, 78.4, 81.6, 79.6, 80.4, 82.3, 78.6, 82.1, 76.6, 82.1, 87, 82.2, 82.1, 87.2, 80.5, 84.9, 78.5, 79, 78.5, 81.5, 77.4, 76.8, 79.4, 75.5, 80.2, 80.4, 81.5, 81.5, 80.5, 79.2, 82.2, 83, 78.5, 79.2, 80.6, 78.6, 85.9, 76.5, 77.5, 84.1, 77.6, 81.2, 74.8, 83.4, 80.4, 77.6, 78.6, 83.3, 80.4, 80.5, 80.4, 83.8, 85.1, 82.2, 84.1, 80.2, 75.7, 83, 81.5, 83.1, 78.3, 76.9, 82, 82.3, 85.8, 78.5, 75.9, 80.4, 82.3, 75.7, 73.9, 80.4, 83.2, 85.2, 84.9, 80.4, 85.9, 76.8, 83.3, 80.2, 83.1, 77.6),
  .stdresid = c(0.2, -0.3, 0.5, 1.4, 0.3, -0.2, 1.2, -1.1, 0.7, -0.1, -0.3, -1.1, -1.5, -0.1, 0, -1, 1, 0.3, -0.5, 0.5, 1.8, 1.6, -0.1, -1.3, -0.2, -0.9, 1.1, -0.2, 1.5, -0.3, -1.2, -0.6, -0.4, -3, 0.5, 0.3, -0.8, 0.8, 0.5, 1.3, 1.8, 0.5, -1.6, -2, -2.1, -0.8, 0.4, -0.9, 0.4, -0.4, 0.6, 0.4, 1.4, -1.4, 1.3, 0.4, -0.8, -0.2, 0.5, 0.7, 0.5, 0.1, 0.1, -0.8, -2.1, 0, 1.9, -0.5, -0.1, -1.4, 0.6, 0.7, -0.3, 1, -0.7, 0.7, -0.2, 0.8, 1.3, -0.7, -0.4, 1.5, 2.1, 1.6, -1, 0.7, -1, 0.9, -0.3, 0.9, -0.3, -0.7, -0.9, -0.2, 1.2, -0.8, -0.9, -1.7, 0.6, -0.5)


p = ggplot(data = plotData, aes(x = .fitted, y = .stdresid)) +
    theme_bw() + 
    geom_hline(yintercept = 0) +
    geom_point(size = 0) +
    theme_bw() +
    xlab("Fitted values") +
    ylab("Standarized Residuals") +
    annotate("text", x = 76, y = -3, label = "Ghost created by Andrea Mazzini from Noun Project")
  1. Read in the icon (which is a PNG file).

Here we use the readPNG() function from the png library to bring the icon into R.

ghost = readPNG("/Users/andrewz/Desktop/ghost.png", TRUE)
  1. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

The idea is that since we have saved our plot in the object p, we can add new layers (in our case each layer will be an additional point) by recursively adding the layer and then writing this into p. The pseudo-like code for this is:

for(i in 1:nrow(plotData)){
    p = p + 
        xmin = minimum_x_value_for_the_image, 
        xmax = maximum_x_value_for_the_image, 
        ymin = minimum_y_value_for_the_image, 
        ymax = maximum_y_value_for_the_image

In order for the image to be plotted, we first have to make it plot-able by making it a graphical object, or GROB.

The rasterGrob() function (found in the grid,/b> package) renders a bitmap image (raster image) into a graphical object or GROB which can then be displayed at a specified location, orientation, etc. Read more about using Raster images in R here.

The arguments xmin, xmax, ymin, and ymax give the horizontal and vertical locations (in data coordinates) of the raster image. In our residual plot, we want the center of the image to be located at the coordinates (.fitted, .stdresid). In the syntax below, we add a small bit to the maximum values and subtract a small bit from the minimum values to force the icon into a box that will plot the icons a bit smaller than their actual size. (#protip: play around with this value until you get a plot that looks good.)


for(i in 1:nrow(plotData)){
    p = p + annotation_custom(
      xmin = plotData$.fitted[i]-0.2, xmax = plotData$.fitted[i]+0.2, 
      ymin = plotData$.stdresid[i]-0.2, ymax = plotData$.stdresid[i]+0.2

Finally we print the plot to our graphics device using


And the result is eerily pleasant!

The African Data Initiative

Are you looking for a way to celebrate World Statistics Day? I know you are. And I can’t think of a better way than supporting the African Data Initiative (ADI).

I’m proud to have met some of the statisticians, statisticis educators and researchers who are leading this initative at an International Association of Statistics Educators Roundtable workshop in Cebu, The Phillipines, in 2012. You can read about Roger and David’s Stern’s projects in Kenya here in the journal Technology Innovations in Statistics Education. This group — represented at the workshop by father-and-son Roger and David, and at-the-time grad students Zacharaiah Mbasu and James Musyoka — impressed me with their determination to improve international statistical literacy and  with their successful and creative pragmatic implementations to adjust to the needs of the local situations in Kenya.

The ADI is seeking funds within the next 18 days to adapt two existing software packages, R and Instat+ so that there is a free, open-source, easy-to-learn statistical software package available and accessible throughout the world. While R is free and open-sourced, it is not easy to learn (particularly in areas where English literacy is low). Instat+ is, they claim, easy to learn but not open-source (and also does not run on Linux or Mac).

One of the exciting things about this project is that these solutions to statistical literacy are being developed by Africans working and researching in Africa, and are not ‘imported’ by groups or corporations with little experience implementing in the local schools. One lesson I’ve learned from my experience working with the Los Angeles Unified School District is that you must work closely with the schools for which you are developing curricula; outsider efforts have a lower chance of success. I hope you’ll take a moment –in the next 18 days–to become acquainted with this worthy project!

World Statistics Day is October 20.  The theme is Better Data. Better Lives.

TIL what happens if you use %>% instead of + in ggplot2

This post is about ggplot2 and dplyr packages, so let’s start with loading them:


I can’t be the first person to make the following mistake:

ggplot(mtcars, aes(x = wt, y = mpg)) %>%

Can you spot the mistake in the code above? Look closely at the end of the first line.

The operator should be the + used in ggplot2 for layering, not the %>% operator used in dplyr for piping, like this:

ggplot(mtcars, aes(x = wt, y = mpg)) +

So what happens if you accidentally use the pipe operator instead of the +? You get the following error:

Error in get(x, envir = this, inherits = inh)(this, ...) : 
 Mapping should be a list of unevaluated mappings created by aes or aes_string

My Google search for this error did not yield my careless mistake as a potential cause. Since many people use these two packages together, I’m guessing such mix-up of operators can’t be too uncommon (right? I can’t be the only one…). So I’m leaving this post here for the next person who makes the same mistake.


R packages for undergraduate stat ed

The other day on the isostat mailing list Doug Andrews asked the following question:

Which R packages do you consider the most helpful and essential for undergrad stat ed? I ask in great part because it would help my local IT guru set up the way our network makes software available in our computer classrooms, but also just from curiosity.

Doug asked for a top 10 list, and a few people have already chimed in with great suggestions. I thought those not on the list might also have good ideas, so, with Doug’s permission, I’m reposting the question here.

Here is my top 10 (ok, 12) list:
(Links go to vignettes or pages I find to be quickest / most useful references for those packages, but if you know of better resources, let me know and I’ll update.)

  1. knitr / rmarkdown – for reproducible data analysis with literate programming, great set of tools that students can use from day 1 in intro stats all the way through to writing their undergrad theses
  2. dplyr – for most data manipulation tasks, with the added benefit of piping (via magrittr)
  3. ggplot2 – easy faceting allows for graphing multivariate relationships more easily than with base R (lattice is also good for that, but IMO ggplot2 graphics look more modern and lattice has a much steeper learning curve)
  4. openintro – or packages that come with the textbooks you use, great for pulling up any dataset from the text and building on it in class (a new version coming soon to fully complement 3rd edition of OpenIntro Statistics)
  5. mosaic – for consistent syntax for functions used in intro stat
  6. googlesheets – for loading data directly from Google spreadsheets
  7. lubridate – if you ever need to work with any date fields
  8. stringr – for text parsing and manipulation
  9. rvest – for scraping data off the web
  10. readr / data.table – for loading large datasets & default stringsAsFactors = FALSE

And the following suggestions from Randall Prium complement this list nicely:

  • readxl – for reading Excel data
  • tidyr – for converting between wide and long formats and for the very useful extract_numeric()
  • ggvisggplot2 “done right” and tuned for interactive graphics
  • htmlwidgets – this is actually a collection of packages for plots: see leaflet for maps and dygraphs for time series, for example

Note that most of these packages are for data manipulation and visualization. Methods specific packages that are useful / essential for a particular undergraduate program might depend on the focus of that program. Some packages that so far came up in the discussion are:

  • lme4 – for mixed models
  • pwr – for showing sample size and power calculations

This blog post is meant to provide a space for continuing this discussion, so I’ll ask the question one more time: Which R packages do you consider the most helpful and essential for undergrad stat ed? Please add your responses to the comments.


PS: Thanks to Michael Lopez for suggesting that I post this list somewhere.
PPS: I should really be working on my fast-approaching JSM talk.

“Mail merge” with RMarkdown

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns our this is a pretty straightforward task. Below is a very simple minimum working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 csv file, an excerpt is shown below. I got the names from here.

name meeting_time
Peggy Kallas 9:00 AM
Ezra Zanders 9:15 AM
Hope Mogan 9:30 AM
Nathanael Scully 9:45 AM
Mayra Cowley 10:00 AM
Ethelene Oglesbee 10:15 AM

R script: mail_merge_script.R

## Packages

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop
for (i in 1:nrow(personalized_info)){
 rmarkdown::render(input = "mail_merge_handout.Rmd",
 output_format = "pdf_document",
 output_file = paste("handout_", i, ".pdf", sep=''),
 output_dir = "handouts/")

RMarkdown: mail_merge_handout.Rmd

output: pdf_document

```{r echo=FALSE}
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files look something like this


with the name and date field being different in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post, I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts are what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students — wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about this now? I was at USCOTS these last few days, and lead a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands on exercises and promote it as a tool for reproducible data analysis and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also discussed briefly further tips for documentation and organization as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Yikes…It’s Been Awile

Apparently our last blog post was in August. Dang. Where did five months go? Blog guilt would be killing me, but I swear it was just yesterday that Mine posted.

I will give a bit of review of some of the books that I read this semester related to statistics. Most recently, I finished Hands-On Matrix Algebra Using R: Active and Motivated Learning with Applications. This was a fairly readable book for those looking to understand a bit of matrix algebra. The emphasis is definitely in economics, but their are some statistics examples as well. I am not as sure where the “motivated learning” part comes in, but the examples are practical and the writing is pretty coherent.

The two books that I read that I am most excited about are Model Based Inference in the Life Sciences: A Primer on Evidence and The Psychology of Computer Programming. The latter, written in the 70’s, explored psychological aspects of computer programming, especially in industry, and on increasing productivity. Weinberg (the author) stated his purpose in the book was to study “computer programming as a human activity.” This was compelling on many levels to me, not the least of which is to better understand how students learn statistics when using software such as R.

Reading this book, along with participating in a student-led computing club in our department has sparked some interest to begin reading the literature related to these ideas this spring semester (feel free to join us…maybe we will document our conversations as we go). I am very interested in how instructor’s choose software to teach with (see concerns raised about using R in Harwell (2014). Not so fast my friend: The rush to R and the need for rigorous evaluation of data analysis and software in education. Education Research Quarterly.) I have also thought long and hard about not only what influences the choice of software to use in teaching (I do use R), but also about subsequent choices related to that decision (e.g., if R is adopted, which R packages will be introduced to students). All of these choices probably have some impact on student learning and also on students’ future practice (what you learn in graduate school is what you ultimately end up doing).

The Model Based Inference book was a shorter, readable version of Burnham and Anderson’s (2003) Springer volume on multimodel inference and information theory. I was introduced to these ideas when I taught out of Jeff Long’s, Longitudinal Data Analysis for the Behavioral Sciences Using R. They remained with me for several years and after reading Anderson’s book, I am going to teach some of these ideas in our advanced methods course this spring.

Anyway…just some short thoughts to leave you with. Happy Holidays.

Pie Charts. Are they worth the Fight?

Like Rob, I recently got back from ICOTS. What a great conference. Kudos to everyone who worked hard to organize and pull it off. In one of the sessions I was at, Amelia McNamara (@AmeliaMN) gave a nice presentation about how they were using data and computer science in high schools as a part of the Mobilize Project. At one point in the presentation she had a slide that showed a screenshot of the dashboard used in one of their apps. It looked something like this.


During the Q&A, one of the critiques of the project was that they had displayed the data as a donut plot. “Pie charts (or any kin thereof) = bad” was the message. I don’t really want to fight about whether they are good, nor bad—the reality is probably in between. (Tufte, the most cited source to the ‘pie charts are bad’ rhetoric, never really said pie charts were bad, only that given the space they took up they were, perhaps less informative than other graphical choices.) Do people have trouble reading radians? Sure. Is the message in the data obscured because of this? Most of the time, no.

plots_1Here, is the bar chart (often the better alternative to the pie chart that is offered) and the donut plot for the data shown in the Mobilize dashboard screenshot? The message is that most of the advertisements were from posters and billboards. If people are interested in the n‘s, that can be easily remedied by including them explicitly on the plot—which neither the bar plot nor donut plot has currently. (The dashboard displays the actual numbers when you hover over the donut slice.)

It seems we are wasting our breath constantly criticizing people for choosing pie charts. Whether we like it or not, the public has adopted pie charts. (As is pointed out in this blog post, Leland Wilkinson even devotes a whole chapter to pie charts in his Grammar of Graphics book.) Maybe people are reasonably good at pulling out the often-not-so-subtle differences that are generally shown in a pie chart. After all, it isn’t hard to understand (even when using a 3-D exploding pie chart) that the message in this pie chart is that the “big 3” browsers have a strong hold on the market.

The bigger issue to me is that these types of graphs are only reasonable choices when examining simple group differences—the marginals. Isn’t life, and data, more complex than that?Is the distribution of browser type the same for Mac and PC users? For males and females? For different age groups? These are the more interesting questions.

The dashboard addresses this through interactivity between the multiple donut charts. Clicking a slice in the first plot, shows the distribution of product types (the second plot) for those ads that fit the selected slice—the conditional distributions.

So it is my argument, that rather than referring to a graph choice as good or bad, we instead focus on the underlying question prompting the graph in the first place. Mobilize acknowledges that complexity by addressing the need for conditional distributions. Interactivity and computing make the choice of pie charts a reasonable choice to display this.

*If those didn’t persuade you, perhaps you will be swayed by the food argument. Donuts and pies are two of my favorite food groups. Although bars are nice too. For a more tasty version of the donut plot, perhaps somebody should come up with a cronut plot.

**The ggplot2 syntax for the bar and donut plot are provided below. The syntax for the donut plot were adapted from this blog post.

# Input the ad data
ad = data.frame(
	type = c("Poster", "Billboard", "Bus", "Digital"),
	n = c(529, 356, 59, 81)

# Bar plot
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
     geom_bar(stat = "identity", show_guide = FALSE) +

# Add addition columns to data, needed for donut plot.
ad$fraction = ad$n / sum(ad$n)
ad$ymax = cumsum(ad$fraction)
ad$ymin = c(0, head(ad$ymax, n = -1))

# Donut plot
ggplot(data = ad, aes(fill = type, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
     geom_rect(colour = "grey30", show_guide = FALSE) +
     coord_polar(theta = "y") +
     xlim(c(0, 4)) +
     theme_bw() +
     theme(panel.grid=element_blank()) +
     theme(axis.text=element_blank()) +
     theme(axis.ticks=element_blank()) +
     geom_text(aes(x = 3.5, y = ((ymin+ymax)/2), label = type)) +
     xlab("") +



Fathom Returns

The other shoe has fallen.  Last week (or so) Tinkerplots returned to the market, and now Fathom Version 2.2 (which is the foundation on which Tinkerplots is built) is  available for a free download.  Details are available on Bill Finzer‘s website.

Fathom is one of my favorite softwares…the first commercially available package to be based on learning theory, Fathom’s primary goal is to teach statistics.  After a one-minute introduction, beginning students can quickly discuss ‘findings’ across several variables.  So many classroom exercises involve only one or two variables, and Fathom taught me  that this is unfair to students and artificially holds them back.

Welcome back, Fathom!

Tinkerplots Available Again

Very exciting news for Tinkerplots users (and for those who should be Tinkerplots users).  Tinkerplots is highly visual dynamic software that lets students design and implement simulation machines, and includes many very cool data analysis tools.

To quote from TP developer Cliff Konold:

Today we are releasing Version 2.2 of TinkerPlots.  This is a special, free version, which will expire in a year  — August 31, 2015.

To start the downloading process

Go to the TinkerPlots home page and click on the Download TinkerPlots link in the right hand panel. You’ll fill out a form. Shortly after submitting it, you’ll get an email with a link for downloading.

Help others find the TinkerPlots Download page

If you have a website, blog, or use a social media site, please help us get the word out so others can find the new TinkerPlots Download page. You could mention that you are using TinkerPlots 2.2 and link to www.srri.umass.edu/tinkerplots.

Why is this an expiring version?

As we explained in this correspondence, until January of 2014, TinkerPlots was published and sold by Key Curriculum, a division of McGraw Hill Education. Their decision to cease publication caught us off guard, and we have yet to come up with an alternative publishing plan. We created this special expiring version to meet the needs of users until we can get a new publishing plan in place.

What will happen after version 2.2 expires?

By August 2015, we will either have a new publisher lined up, or we will create another free version.  What is holding us up right now is our negotiations with the University of Massachusetts Amherst, who currently owns TinkerPlots.  Once they have decided about their future involvement with TinkerPlots, we can complete our discussions with various publishing partners.

If I have versions 2.0 or 2.1 should I delete them?

No, you should keep them. You already paid for these, and they are not substantively different from version 2.2. If and when a new version of TinkerPlots is ready for sale, you may not want to pay for it.  So keep your early version that you’ve already paid for.
Cliff and Craig