A two-hour introduction to data analysis in R

A few weeks ago I gave a two-hour Introduction to R workshop for the Master of Engineering Management students at Duke. The session was organized by the student-led Career Development and Alumni Relations committee within this program. The slides for the workshop can be found here and the source code is available on GitHub.

Why might this be of interest to you?

  • The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
  • The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against these datasets; in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
    • Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
    • Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not large enough to slow things down (which can especially be an issue for visualizing much larger data) but it’s large enough that manual wrangling of the data would be too much trouble.
    • Con: The visualizations do not really reveal very useful insights into the data. While this is not absolutely necessary for teaching syntax, it would have been a welcome cherry on top…
  • The raw dataset has a feature I love: it has been damaged, most likely by being opened in Excel! One of the variables in the dataset is the age group of the biker (BikeAge_Gr). Here is the age distribution of bikers as it appears in the original data:
##    BikeAge_Gr crash_count
##    (chr)      (int)
## 1  0-5        60
## 2  10-Jun     421
## 3  15-Nov     747
## 4  16-19      605
## 5  20-24      680
## 6  25-29      430
## 7  30-39      658
## 8  40-49      920
## 9  50-59      739
## 10 60-69      274
## 11 70         12
## 12 70+        58

Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing something more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.

I should admit that I don’t have evidence of Excel causing this issue. However, this is my best guess, since “helping” the user by formatting values that look like dates is standard Excel behaviour. There may be other software out there that also does this that I’m unaware of…
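
The corrections to the two age groups described above could be sketched as follows. This is a minimal illustration on a toy character vector, not the workshop's actual code; the names `age_gr` and `fixes` are made up for the example, and the labels are the ones printed in the table above.

```r
# Toy vector holding age-group labels as they appear in the raw data
age_gr <- c("0-5", "10-Jun", "15-Nov", "16-19", "70", "70+")

# Hypothetical lookup: names are the damaged labels, values the intended ones
fixes <- c("10-Jun" = "6-10", "15-Nov" = "11-15")

# Replace only the entries that match a damaged label
damaged <- age_gr %in% names(fixes)
age_gr[damaged] <- unname(fixes[age_gr[damaged]])

age_gr
```

A lookup vector keeps the repair explicit and easy to extend if other labels turn out to be damaged in the same way.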

If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.

Fruit Plot: Plotting Using Multiple PNGs

In one of our previous posts (Halloween: An Excuse for Plotting with Icons), we gave a quick tutorial on how to plot using icons using ggplot. A reader, Dr. D. K. Samuel asked in a comment how to use multiple icons. His comment read,

…can you make a blog post on using multiple icons for such data
year, crop,yield
2000, Tomato,600
2000,Apple, 800
it will be nice to use icons for each data point. It will also be nice if the (icon) data could be colored by year.

This blog post will address this request. First, the result…


The process I used to create this plot is as follows:

  1. Find the icons that you want to use in place of the points on your scatterplot (or dot plot).

I used an apple icon (created by Creative Stall), an orange icon (created by Gui Zamarioli), and a tomato icon (created by Andrey Vasiliev); all obtained from The Noun Project.

  2. Color the icons.

After downloading the icons, I used Gimp, a free image manipulation program, to color each of the icons. I created a green version, and a blue version of each icon. (The request asked for the two different years to have different colors.) I also cropped the icons.

Given that there were only three icons, doing this manually was not much of a time burden (about 10 minutes after I selected the color palette using colorbrewer.org). Could this be done programmatically? I am not sure. A person, who is not me, might be able to write some commands to do this with ImageMagick or some other program. You might also be able to do this in R, but I sure don’t know how…I imagine it involves re-writing the values for the pixels you want to change the color of, but how you determine which of those you want is beyond me.
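
For what it is worth, here is one rough sketch of what the pixel-rewriting idea might look like in R. It assumes the icon is held as a numeric height × width × 4 (RGBA) array with values between 0 and 1, which is the form readPNG() returns for an RGBA image when its second argument (native) is FALSE. The function name `recolor_icon()` and the toy 2 × 2 "icon" are invented for illustration, and "which pixels to change" is decided simply by non-zero alpha.

```r
# Recolor every non-transparent pixel of an RGBA array (hypothetical helper)
recolor_icon <- function(img, color) {
  stopifnot(length(dim(img)) == 3, dim(img)[3] == 4)
  rgb_target <- grDevices::col2rgb(color)[, 1] / 255  # red, green, blue in [0, 1]
  visible <- img[, , 4] > 0                           # pixels with some opacity
  for (k in 1:3) {
    channel <- img[, , k]
    channel[visible] <- rgb_target[k]
    img[, , k] <- channel
  }
  img
}

# Toy "icon": a black 2 x 2 image with two visible pixels (assumption, not a real PNG)
icon <- array(0, dim = c(2, 2, 4))
icon[1, 1, 4] <- 1
icon[2, 2, 4] <- 0.5

blue_icon <- recolor_icon(icon, "blue")
# png::writePNG(blue_icon, "blue_icon.png") could then write the result to disk
```

Because the alpha channel is left untouched, the icon keeps its shape and any soft edges; only the color underneath changes.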

If you are interested in only changing the color of the icon outline, an alternative would be to download the SVGs rather than the PNGs. Opening the SVG file in a text editor gives the underlying syntax for the SVG. For example, the apple icon looks like this:

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
    <path d="M19.749,48c-1.662... />
    <path d="M24.001,14.866c-0.048, ... />
    <path d="M29.512, ... />
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>
</svg>

The three path commands draw the actual apple. The first path command draws the body of the apple, the second draws the leaf on top of the apple, and the third draws the stem. Adding fill="blue" to the end of each path command will change the color of the path from black to blue (see below).

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
    <path d="M19.749,48c-1.662 ... fill="blue" />
    <path d="M24.001,14.866c-0.048, ... fill="blue" />
    <path d="M29.512, ... fill="blue" />
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>
</svg>

This could easily be programmatically changed. Then the SVG images could also programmatically be exported to PNGs.
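
As a rough sketch of that programmatic change in R, the SVG can be treated as plain text and the fill attribute added with a fixed-string substitution. The two path elements below are simple made-up triangles standing in for the real (elided) apple paths, and `add_fill()` is a hypothetical helper. It puts the attribute at the start of each path element rather than the end, which should not matter to an SVG renderer.

```r
# Stand-in SVG: two made-up paths, not the real apple icon
svg_lines <- c(
  '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 60">',
  '    <path d="M0,0 L10,0 L10,10 Z" />',
  '    <path d="M20,20 L30,20 L30,30 Z" />',
  '</svg>'
)

# Add a fill attribute to every <path> element (hypothetical helper)
add_fill <- function(lines, color) {
  gsub("<path ", paste0('<path fill="', color, '" '), lines, fixed = TRUE)
}

blue_svg <- add_fill(svg_lines, "blue")
# writeLines(blue_svg, "blue_icon.svg") would save the recolored file
```

For the export to PNG, the rsvg package's rsvg_png() function is one option I am aware of; it is not shown here.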

  3. Read in the icons (which are PNG files).

Here we use the readPNG() function from the png library to bring the icon into R.

library(png)

# The second argument (native = TRUE) returns each image as a native raster
blue_apple = readPNG("~/Desktop/fruit-plot/blue_apple.png", TRUE)
green_apple = readPNG("~/Desktop/fruit-plot/green_apple.png", TRUE)
blue_orange = readPNG("~/Desktop/fruit-plot/blue_orange.png", TRUE)
green_orange = readPNG("~/Desktop/fruit-plot/green_orange.png", TRUE)
blue_tomato = readPNG("~/Desktop/fruit-plot/blue_tomato.png", TRUE)
green_tomato = readPNG("~/Desktop/fruit-plot/green_tomato.png", TRUE)

  4. Create the data.

Use the data.frame() function to create the data.

plotData = data.frame(
  year = c(1995, 1995, 1995, 2000, 2000, 2000),
  crop = c("tomato", "apple", "orange", "tomato", "apple", "orange"),
  yield = c(250, 300, 500, 600, 800, 900)
)

  year   crop yield
1 1995 tomato   250
2 1995  apple   300
3 1995 orange   500
4 2000 tomato   600
5 2000  apple   800
6 2000 orange   900

Next we will add a column to our data frame that maps the year to color. This uses the ifelse() function. In this example, if the logical statement plotData$year == 1995 evaluates as TRUE, then the value will be “blue”. If it evaluates as FALSE, then the value will be “green”.

plotData$color = ifelse(plotData$year == 1995, "blue", "green")

  year   crop yield color
1 1995 tomato   250  blue
2 1995  apple   300  blue
3 1995 orange   500  blue
4 2000 tomato   600 green
5 2000  apple   800 green
6 2000 orange   900 green

Now we will use this new “color” column in conjunction with the “crop” column to identify the icon that will be plotted for each row. The paste0() function concatenates its arguments with no spaces between them. Here we are concatenating the color value, an underscore, and the crop value.

plotData$icon = paste0(plotData$color, "_", plotData$crop)

  year   crop yield color         icon
1 1995 tomato   250  blue  blue_tomato
2 1995  apple   300  blue   blue_apple
3 1995 orange   500  blue  blue_orange
4 2000 tomato   600 green green_tomato
5 2000  apple   800 green  green_apple
6 2000 orange   900 green green_orange
  5. Use ggplot to create a scatterplot of the data, making the size of the points 0.

library(ggplot2)

p = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  theme_bw() +
  xlab("Year") +
  ylab("Yield")

  6. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

Similar to the previous post, we add new layers (in our case each layer will be an additional point) by repeatedly adding a layer and then writing the result back into p. The key is that the image name is now in the “icon” column of the data frame. The values in the “icon” column are character data. To make R treat these as objects we first parse the character data using the parse() function, and then we use eval() to have R evaluate the parsed expression. A description of this appears in this Stack Overflow question.


library(grid)  # provides rasterGrob()

for(i in 1:nrow(plotData)){
  p = p + annotation_custom(
    rasterGrob(eval(parse(text = plotData$icon[i]))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20,
    ymin = plotData$yield[i] - 20, ymax = plotData$yield[i] + 20
  )
}

# Show plot
print(p)
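
One possible alternative to eval(parse()) is to keep the images in a named list and look each one up by its name. The sketch below is only an illustration of that lookup: the `icons` list holds tiny color matrices as stand-ins for the PNGs read with readPNG(), and the names `icons`, `icon_names`, and `grobs` are hypothetical.

```r
library(grid)

# Stand-in "icons": 2 x 2 color matrices instead of real PNG images
icons <- list(
  blue_tomato  = matrix("blue", nrow = 2, ncol = 2),
  green_tomato = matrix("green", nrow = 2, ncol = 2)
)

# Character names, as they would appear in the "icon" column
icon_names <- c("blue_tomato", "green_tomato")

# Look each image up by name and turn it into a raster grob
grobs <- lapply(icon_names, function(nm) rasterGrob(icons[[nm]]))
```

Inside the loop, `icons[[plotData$icon[i]]]` would then take the place of the eval(parse()) call, at the cost of building the list up front.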

  7. Some issues to consider and my alternative plot.

I think that plot is what was requested, but since I cannot help myself, I would propose a few changes that I think would make this plot better. First, I would add lines to connect each fruit (apple in 1995 to apple in 2000). This would help the reader to better track the change in yield over time.

Secondly, I would actually leave the fruit color constant across years and vary the color between fruits (probably coloring them according to their real-world colors). This again helps the reader in that they can more easily identify the fruits and also helps them track the change in yield. (It also avoids a Stroop-like effect of coloring an orange some other color than orange!)

Here is the code:

# Read in PNG files
apple = readPNG("~/Desktop/fruit-plot/red_apple.png", TRUE)
orange = readPNG("~/Desktop/fruit-plot/orange_orange.png", TRUE)
tomato = readPNG("~/Desktop/fruit-plot/red_tomato.png", TRUE)

# Plot
p2 = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  geom_line(aes(group = crop), lty = "dashed") +
  theme_bw()  +
  xlab("Year") +
  ylab("Yield") +
  annotate("text", x = 1997, y = 350, label = "Tomato created by Andrey Vasiliev from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 330, label = "Apple created by Creative Stall from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 310, label = "Orange created by Gui Zamarioli from the Noun Project", size = 2, hjust = 0)

for(i in 1:nrow(plotData)){
  p2 = p2 + annotation_custom(
    rasterGrob(eval(parse(text = as.character(plotData$crop[i])))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20,
    ymin = plotData$yield[i] - 20, ymax = plotData$yield[i] + 20
  )
}

# Show plot
print(p2)

And the result…


Halloween: An Excuse for Plotting with Icons

In my course on the GLM, we are discussing residual plots this week. Given that it is also Halloween this Saturday, it seems like a perfect time to code up a residual plot made of ghosts.

The process I used to create this plot is as follows:

  1. Find an icon that you want to use in place of the points on your scatterplot (or dot plot).

I used a ghost icon (created by Andrea Mazzini) obtained from The Noun Project. After downloading the icon, I used Preview to create a new PNG file that had cut out the citation text in the downloaded image. I will add the citation text at a later stage in the plot itself. This new icon was 450×450 pixels.

  2. Use ggplot to create a scatterplot of a set of data, making the size of the points 0.

Here is the code that will create the data and make the plot that I used.

plotData = data.frame(
  .fitted = c(76.5, 81.3, 75.5, 79.5, 80.1, 78.5, 79.5, 77.5, 81.2, 80.4, 78.1, 79.5, 76.6, 79.4, 75.9, 86.6, 84.2, 83.1, 82.4, 78.4, 81.6, 79.6, 80.4, 82.3, 78.6, 82.1, 76.6, 82.1, 87, 82.2, 82.1, 87.2, 80.5, 84.9, 78.5, 79, 78.5, 81.5, 77.4, 76.8, 79.4, 75.5, 80.2, 80.4, 81.5, 81.5, 80.5, 79.2, 82.2, 83, 78.5, 79.2, 80.6, 78.6, 85.9, 76.5, 77.5, 84.1, 77.6, 81.2, 74.8, 83.4, 80.4, 77.6, 78.6, 83.3, 80.4, 80.5, 80.4, 83.8, 85.1, 82.2, 84.1, 80.2, 75.7, 83, 81.5, 83.1, 78.3, 76.9, 82, 82.3, 85.8, 78.5, 75.9, 80.4, 82.3, 75.7, 73.9, 80.4, 83.2, 85.2, 84.9, 80.4, 85.9, 76.8, 83.3, 80.2, 83.1, 77.6),
  .stdresid = c(0.2, -0.3, 0.5, 1.4, 0.3, -0.2, 1.2, -1.1, 0.7, -0.1, -0.3, -1.1, -1.5, -0.1, 0, -1, 1, 0.3, -0.5, 0.5, 1.8, 1.6, -0.1, -1.3, -0.2, -0.9, 1.1, -0.2, 1.5, -0.3, -1.2, -0.6, -0.4, -3, 0.5, 0.3, -0.8, 0.8, 0.5, 1.3, 1.8, 0.5, -1.6, -2, -2.1, -0.8, 0.4, -0.9, 0.4, -0.4, 0.6, 0.4, 1.4, -1.4, 1.3, 0.4, -0.8, -0.2, 0.5, 0.7, 0.5, 0.1, 0.1, -0.8, -2.1, 0, 1.9, -0.5, -0.1, -1.4, 0.6, 0.7, -0.3, 1, -0.7, 0.7, -0.2, 0.8, 1.3, -0.7, -0.4, 1.5, 2.1, 1.6, -1, 0.7, -1, 0.9, -0.3, 0.9, -0.3, -0.7, -0.9, -0.2, 1.2, -0.8, -0.9, -1.7, 0.6, -0.5)
)


library(ggplot2)

p = ggplot(data = plotData, aes(x = .fitted, y = .stdresid)) +
    theme_bw() +
    geom_hline(yintercept = 0) +
    geom_point(size = 0) +
    xlab("Fitted values") +
    ylab("Standardized Residuals") +
    annotate("text", x = 76, y = -3, label = "Ghost created by Andrea Mazzini from Noun Project")

  3. Read in the icon (which is a PNG file).

Here we use the readPNG() function from the png library to bring the icon into R.

library(png)

ghost = readPNG("/Users/andrewz/Desktop/ghost.png", TRUE)

  4. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

The idea is that since we have saved our plot in the object p, we can add new layers (in our case each layer will be an additional point) by repeatedly adding a layer and then writing the result back into p. The pseudocode for this is:

for(i in 1:nrow(plotData)){
    p = p +
      annotation_custom(
        the_image_as_a_grob,
        xmin = minimum_x_value_for_the_image,
        xmax = maximum_x_value_for_the_image,
        ymin = minimum_y_value_for_the_image,
        ymax = maximum_y_value_for_the_image
      )
}

In order for the image to be plotted, we first have to make it plot-able by making it a graphical object, or GROB.

The rasterGrob() function (found in the grid package) renders a bitmap image (raster image) into a graphical object or GROB which can then be displayed at a specified location, orientation, etc. Read more about using Raster images in R here.

The arguments xmin, xmax, ymin, and ymax give the horizontal and vertical locations (in data coordinates) of the raster image. In our residual plot, we want the center of the image to be located at the coordinates (.fitted, .stdresid). In the syntax below, we add a small bit to the maximum values and subtract a small bit from the minimum values to force the icon into a box that will plot the icons a bit smaller than their actual size. (#protip: play around with this value until you get a plot that looks good.)


library(grid)  # provides rasterGrob()

for(i in 1:nrow(plotData)){
    p = p + annotation_custom(
      rasterGrob(ghost),
      xmin = plotData$.fitted[i]-0.2, xmax = plotData$.fitted[i]+0.2,
      ymin = plotData$.stdresid[i]-0.2, ymax = plotData$.stdresid[i]+0.2
    )
}

Finally, we print the plot to our graphics device using

print(p)


And the result is eerily pleasant!
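
As an aside, the for() loop is not the only way to add the icon layers. ggplot2 also accepts a list of layers on the right-hand side of +, so the layers can be built with lapply() and added in one step. The sketch below assumes toy data and a plain grey 2 × 2 matrix as a stand-in for the ghost PNG; the names `toy`, `icon`, `icon_layers`, and `p_icons` are made up for the example.

```r
library(ggplot2)
library(grid)

# Toy data and a stand-in icon, both hypothetical
toy <- data.frame(.fitted = c(76, 80, 84), .stdresid = c(-1, 0.5, 1.2))
icon <- matrix("grey40", nrow = 2, ncol = 2)

# One annotation_custom() layer per row, centered on (.fitted, .stdresid)
icon_layers <- lapply(seq_len(nrow(toy)), function(i) {
  annotation_custom(
    rasterGrob(icon),
    xmin = toy$.fitted[i] - 0.2, xmax = toy$.fitted[i] + 0.2,
    ymin = toy$.stdresid[i] - 0.2, ymax = toy$.stdresid[i] + 0.2
  )
})

# Adding a list to a ggplot adds each element in turn
p_icons <- ggplot(toy, aes(x = .fitted, y = .stdresid)) +
  geom_point(size = 0) +
  icon_layers
```

The result should be the same plot as the loop produces; the list version just avoids reassigning the plot object on every iteration.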

The African Data Initiative

Are you looking for a way to celebrate World Statistics Day? I know you are. And I can’t think of a better way than supporting the African Data Initiative (ADI).

I’m proud to have met some of the statisticians, statistics educators, and researchers who are leading this initiative at an International Association for Statistical Education Roundtable workshop in Cebu, the Philippines, in 2012. You can read about Roger and David Stern’s projects in Kenya here in the journal Technology Innovations in Statistics Education. This group (represented at the workshop by father-and-son Roger and David, and then-grad students Zachariah Mbasu and James Musyoka) impressed me with their determination to improve international statistical literacy and with their successful, creative, and pragmatic implementations that adjust to the needs of the local situations in Kenya.

The ADI is seeking funds within the next 18 days to adapt two existing software packages, R and Instat+, so that there is a free, open-source, easy-to-learn statistical software package available and accessible throughout the world. While R is free and open-source, it is not easy to learn (particularly in areas where English literacy is low). Instat+ is, they claim, easy to learn but not open-source (and also does not run on Linux or Mac).

One of the exciting things about this project is that these solutions to statistical literacy are being developed by Africans working and researching in Africa, and are not ‘imported’ by groups or corporations with little experience implementing in the local schools. One lesson I’ve learned from my experience working with the Los Angeles Unified School District is that you must work closely with the schools for which you are developing curricula; outsider efforts have a lower chance of success. I hope you’ll take a moment (in the next 18 days) to become acquainted with this worthy project!

World Statistics Day is October 20. The theme is “Better Data. Better Lives.”

TIL what happens if you use %>% instead of + in ggplot2

This post is about the ggplot2 and dplyr packages, so let’s start by loading them:

library(ggplot2)
library(dplyr)


I can’t be the first person to make the following mistake:

ggplot(mtcars, aes(x = wt, y = mpg)) %>%
  geom_point()

Can you spot the mistake in the code above? Look closely at the end of the first line.

The operator should be the + used in ggplot2 for layering, not the %>% operator used in dplyr for piping, like this:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

So what happens if you accidentally use the pipe operator instead of the +? You get the following error:

Error in get(x, envir = this, inherits = inh)(this, ...) : 
 Mapping should be a list of unevaluated mappings created by aes or aes_string

My Google search for this error did not yield my careless mistake as a potential cause. Since many people use these two packages together, I’m guessing such mix-up of operators can’t be too uncommon (right? I can’t be the only one…). So I’m leaving this post here for the next person who makes the same mistake.
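
For contrast, here is a small sketch of the two operators used together correctly: %>% moves the data through the dplyr steps and into ggplot(), and + takes over once the plot object exists. The filter on cyl is just an arbitrary choice for the example.

```r
library(ggplot2)
library(dplyr)

# Pipe the data into ggplot(), then switch to + for the layers
p <- mtcars %>%
  filter(cyl == 4) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()
```

A rule of thumb that might help: everything before the ggplot() call is piped, everything after it is added.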


R packages for undergraduate stat ed

The other day on the isostat mailing list Doug Andrews asked the following question:

Which R packages do you consider the most helpful and essential for undergrad stat ed? I ask in great part because it would help my local IT guru set up the way our network makes software available in our computer classrooms, but also just from curiosity.

Doug asked for a top 10 list, and a few people have already chimed in with great suggestions. I thought those not on the list might also have good ideas, so, with Doug’s permission, I’m reposting the question here.

Here is my top 10 (ok, 12) list:
(Links go to vignettes or pages I find to be quickest / most useful references for those packages, but if you know of better resources, let me know and I’ll update.)

  1. knitr / rmarkdown – for reproducible data analysis with literate programming, great set of tools that students can use from day 1 in intro stats all the way through to writing their undergrad theses
  2. dplyr – for most data manipulation tasks, with the added benefit of piping (via magrittr)
  3. ggplot2 – easy faceting allows for graphing multivariate relationships more easily than with base R (lattice is also good for that, but IMO ggplot2 graphics look more modern and lattice has a much steeper learning curve)
  4. openintro – or packages that come with the textbooks you use, great for pulling up any dataset from the text and building on it in class (a new version coming soon to fully complement 3rd edition of OpenIntro Statistics)
  5. mosaic – for consistent syntax for functions used in intro stat
  6. googlesheets – for loading data directly from Google spreadsheets
  7. lubridate – if you ever need to work with any date fields
  8. stringr – for text parsing and manipulation
  9. rvest – for scraping data off the web
  10. readr / data.table – for loading large datasets & default stringsAsFactors = FALSE

And the following suggestions from Randall Pruim complement this list nicely:

  • readxl – for reading Excel data
  • tidyr – for converting between wide and long formats and for the very useful extract_numeric()
  • ggvis – ggplot2 “done right” and tuned for interactive graphics
  • htmlwidgets – this is actually a collection of packages for plots: see leaflet for maps and dygraphs for time series, for example

Note that most of these packages are for data manipulation and visualization. Methods-specific packages that are useful / essential for a particular undergraduate program might depend on the focus of that program. Some packages that have come up in the discussion so far are:

  • lme4 – for mixed models
  • pwr – for showing sample size and power calculations

This blog post is meant to provide a space for continuing this discussion, so I’ll ask the question one more time: Which R packages do you consider the most helpful and essential for undergrad stat ed? Please add your responses to the comments.


PS: Thanks to Michael Lopez for suggesting that I post this list somewhere.
PPS: I should really be working on my fast-approaching JSM talk.

“Mail merge” with RMarkdown

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns out this is a pretty straightforward task. Below is a very simple minimum working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 csv file; an excerpt is shown below. I got the names from here.

name               meeting_time
Peggy Kallas       9:00 AM
Ezra Zanders       9:15 AM
Hope Mogan         9:30 AM
Nathanael Scully   9:45 AM
Mayra Cowley       10:00 AM
Ethelene Oglesbee  10:15 AM

R script: mail_merge_script.R

## Packages
library(rmarkdown)

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop
## (i is visible inside the Rmd because render() evaluates the document
## in the calling environment by default)
for (i in 1:nrow(personalized_info)){
 rmarkdown::render(input = "mail_merge_handout.Rmd",
 output_format = "pdf_document",
 output_file = paste("handout_", i, ".pdf", sep=''),
 output_dir = "handouts/")
}

RMarkdown: mail_merge_handout.Rmd

---
output: pdf_document
---

```{r echo=FALSE}
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]
```

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files looks something like this


with the name and meeting time fields being different in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.
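
A variation on the loop above, sketched here under some assumptions, is to pass the personalized fields through the params argument of rmarkdown::render() instead of relying on i being visible inside the Rmd. This assumes the template's YAML header declares params named name and time and that the body refers to params$name and params$time, which the handout above does not do as written. The helper names `handout_file()` and `render_handouts()` are hypothetical.

```r
# Build the output file name for the i-th handout (hypothetical helper)
handout_file <- function(i, ext = "pdf") {
  paste0("handout_", i, ".", ext)
}

# Render one handout per row of `info`, passing the fields as params.
# Assumes the template declares `params: name` and `params: time` in its YAML.
render_handouts <- function(info,
                            template = "mail_merge_handout.Rmd",
                            out_dir = "handouts") {
  for (i in seq_len(nrow(info))) {
    rmarkdown::render(
      input = template,
      output_format = "pdf_document",
      output_file = handout_file(i),
      output_dir = out_dir,
      params = list(name = info$name[i], time = info$meeting_time[i])
    )
  }
  invisible(NULL)
}

# Usage (not run here):
# render_handouts(read.csv("meeting_times.csv", stringsAsFactors = FALSE))
```

For HTML or Word handouts, output_format would become "html_document" or "word_document", with the file extension changed to match.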

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post. I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts is what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students: wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two-day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about this now? I was at USCOTS these last few days, and led a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands-on exercises and promote it as a tool for reproducible data analysis and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also discussed briefly further tips for documentation and organization as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Yikes…It’s Been Awhile

Apparently our last blog post was in August. Dang. Where did five months go? Blog guilt would be killing me, but I swear it was just yesterday that Mine posted.

I will give a bit of review of some of the books that I read this semester related to statistics. Most recently, I finished Hands-On Matrix Algebra Using R: Active and Motivated Learning with Applications. This was a fairly readable book for those looking to understand a bit of matrix algebra. The emphasis is definitely on economics, but there are some statistics examples as well. I am not as sure where the “motivated learning” part comes in, but the examples are practical and the writing is pretty coherent.

The two books that I read that I am most excited about are Model Based Inference in the Life Sciences: A Primer on Evidence and The Psychology of Computer Programming. The latter, written in the 1970s, explored psychological aspects of computer programming, especially in industry, and ways of increasing productivity. Weinberg (the author) stated that his purpose in the book was to study “computer programming as a human activity.” This was compelling on many levels to me, not the least of which is to better understand how students learn statistics when using software such as R.

Reading this book, along with participating in a student-led computing club in our department, has sparked some interest in reading the literature related to these ideas this spring semester (feel free to join us…maybe we will document our conversations as we go). I am very interested in how instructors choose software to teach with (see concerns raised about using R in Harwell (2014). Not so fast my friend: The rush to R and the need for rigorous evaluation of data analysis and software in education. Education Research Quarterly.) I have also thought long and hard about not only what influences the choice of software to use in teaching (I do use R), but also about subsequent choices related to that decision (e.g., if R is adopted, which R packages will be introduced to students). All of these choices probably have some impact on student learning and also on students’ future practice (what you learn in graduate school is what you ultimately end up doing).

The Model Based Inference book was a shorter, readable version of Burnham and Anderson’s (2003) Springer volume on multimodel inference and information theory. I was introduced to these ideas when I taught out of Jeff Long’s Longitudinal Data Analysis for the Behavioral Sciences Using R. They remained with me for several years, and after reading Anderson’s book, I am going to teach some of these ideas in our advanced methods course this spring.

Anyway…just some short thoughts to leave you with. Happy Holidays.

Pie Charts. Are they worth the Fight?

Like Rob, I recently got back from ICOTS. What a great conference. Kudos to everyone who worked hard to organize and pull it off. In one of the sessions I was at, Amelia McNamara (@AmeliaMN) gave a nice presentation about how they were using data and computer science in high schools as a part of the Mobilize Project. At one point in the presentation she had a slide that showed a screenshot of the dashboard used in one of their apps. It looked something like this.


During the Q&A, one of the critiques of the project was that they had displayed the data as a donut plot. “Pie charts (or any kin thereof) = bad” was the message. I don’t really want to fight about whether they are good or bad; the reality is probably in between. (Tufte, the most cited source for the ‘pie charts are bad’ rhetoric, never really said pie charts were bad, only that given the space they took up they were perhaps less informative than other graphical choices.) Do people have trouble reading radians? Sure. Is the message in the data obscured because of this? Most of the time, no.

Here are the bar chart (often offered as the better alternative to the pie chart) and the donut plot for the data shown in the Mobilize dashboard screenshot. The message is that most of the advertisements were from posters and billboards. If people are interested in the n’s, that can be easily remedied by including them explicitly on the plot, which neither the bar plot nor the donut plot currently does. (The dashboard displays the actual numbers when you hover over the donut slice.)

It seems we are wasting our breath constantly criticizing people for choosing pie charts. Whether we like it or not, the public has adopted pie charts. (As is pointed out in this blog post, Leland Wilkinson even devotes a whole chapter to pie charts in his Grammar of Graphics book.) Maybe people are reasonably good at pulling out the often-not-so-subtle differences that are generally shown in a pie chart. After all, it isn’t hard to understand (even when using a 3-D exploding pie chart) that the message in this pie chart is that the “big 3” browsers have a strong hold on the market.

The bigger issue to me is that these types of graphs are only reasonable choices when examining simple group differences, that is, the marginals. Isn’t life, and data, more complex than that? Is the distribution of browser type the same for Mac and PC users? For males and females? For different age groups? These are the more interesting questions.

The dashboard addresses this through interactivity between the multiple donut charts. Clicking a slice in the first plot shows the distribution of product types (the second plot) for those ads that fit the selected slice, that is, the conditional distributions.

So my argument is that rather than referring to a graph choice as good or bad, we should instead focus on the underlying question prompting the graph in the first place. Mobilize acknowledges that complexity by addressing the need for conditional distributions. Interactivity and computing make pie charts a reasonable choice to display this.

*If those didn’t persuade you, perhaps you will be swayed by the food argument. Donuts and pies are two of my favorite food groups. Although bars are nice too. For a more tasty version of the donut plot, perhaps somebody should come up with a cronut plot.

**The ggplot2 syntax for the bar and donut plots is provided below. The syntax for the donut plot was adapted from this blog post.

library(ggplot2)

# Input the ad data
ad = data.frame(
    type = c("Poster", "Billboard", "Bus", "Digital"),
    n = c(529, 356, 59, 81)
)

# Bar plot (show.legend replaces the older show_guide argument)
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
     geom_bar(stat = "identity", show.legend = FALSE)

# Add additional columns to the data, needed for the donut plot.
ad$fraction = ad$n / sum(ad$n)
ad$ymax = cumsum(ad$fraction)
ad$ymin = c(0, head(ad$ymax, n = -1))

# Donut plot
ggplot(data = ad, aes(fill = type, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
     geom_rect(colour = "grey30", show.legend = FALSE) +
     coord_polar(theta = "y") +
     xlim(c(0, 4)) +
     theme_bw() +
     theme(panel.grid=element_blank()) +
     theme(axis.text=element_blank()) +
     theme(axis.ticks=element_blank()) +
     geom_text(aes(x = 3.5, y = ((ymin+ymax)/2), label = type)) +
     xlab("") +
     ylab("")
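
The remedy mentioned earlier, putting the n’s explicitly on the plot, could be sketched for the bar chart as follows. This is one possible way to do it, not the code behind the figures in this post; the object name `bar_with_n` and the label placement (vjust = -0.5) are arbitrary choices.

```r
library(ggplot2)

# Same ad counts as in the post
ad <- data.frame(
  type = c("Poster", "Billboard", "Bus", "Digital"),
  n = c(529, 356, 59, 81)
)

# Bar plot with the count printed just above each bar
bar_with_n <- ggplot(data = ad, aes(x = type, y = n, fill = type)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.5) +
  theme_bw()
```

The same geom_text() idea should carry over to the donut plot by pasting n into the existing label, for example label = paste(type, n).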