“Mail merge” with RMarkdown

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns out this is a pretty straightforward task. Below is a very simple minimum working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 csv file, an excerpt is shown below. I got the names from here.

name               meeting_time
Peggy Kallas       9:00 AM
Ezra Zanders       9:15 AM
Hope Mogan         9:30 AM
Nathanael Scully   9:45 AM
Mayra Cowley       10:00 AM
Ethelene Oglesbee  10:15 AM

R script: mail_merge_script.R

## Packages
library(rmarkdown)

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop: render one handout per row; the Rmd file reads the index i
for (i in seq_len(nrow(personalized_info))) {
  rmarkdown::render(input = "mail_merge_handout.Rmd",
                    output_format = "pdf_document",
                    output_file = paste("handout_", i, ".pdf", sep = ""),
                    output_dir = "handouts/")
}

RMarkdown: mail_merge_handout.Rmd

---
output: pdf_document
---

```{r echo=FALSE}
# i is the loop index defined in mail_merge_script.R
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]
```

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files contains the same letter, with the name and meeting time fields being different in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.
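As a rough sketch of that switch, the helpers below keep the file extension in step with the chosen format. The format names are real rmarkdown formats, but `format_extension()` and `handout_file()` are hypothetical names made up for illustration, not part of the original script. The render loop is guarded so that it only runs when rmarkdown and the input files are available.

```r
# Hypothetical helper: map an rmarkdown output format to a file extension.
format_extension <- function(output_format) {
  extensions <- c(pdf_document  = ".pdf",
                  html_document = ".html",
                  word_document = ".docx")
  if (!output_format %in% names(extensions)) {
    stop("Unsupported output format: ", output_format)
  }
  unname(extensions[output_format])
}

# Hypothetical helper: build the output file name for handout i.
handout_file <- function(i, output_format) {
  paste0("handout_", i, format_extension(output_format))
}

output_format <- "word_document"  # or "html_document", "pdf_document"

# Only render when rmarkdown and the input files are actually present.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    file.exists("mail_merge_handout.Rmd") &&
    file.exists("meeting_times.csv")) {
  personalized_info <- read.csv("meeting_times.csv")
  for (i in seq_len(nrow(personalized_info))) {
    rmarkdown::render(input = "mail_merge_handout.Rmd",
                      output_format = output_format,
                      output_file = handout_file(i, output_format),
                      output_dir = "handouts/")
  }
}
```

Changing the single `output_format` line is then enough to get Word or HTML handouts with matching file names.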

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post; I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the Urban Dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts is what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students: wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two-day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about this now? I was at USCOTS these last few days, and led a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands-on exercises and promote it as a tool for reproducible data analysis and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also briefly discussed further tips for documentation and organization as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Willful Ignorance [Book Review]

I just finished reading Willful Ignorance: The Mismeasure of Uncertainty by Herbert Weisberg. I gave this book five stars (out of five) on Goodreads.

According to Weisberg, the text can be

“regarded as two books in one. On one hand it is a history of a big idea: how we have come to think about uncertainty. On the other, it is a prescription for change, especially with regard to how we perform research in the biomedical and social sciences” (p. xi).

Willful ignorance is the idea that to deal with uncertainty, statisticians simplify the situation by filtering out or ignoring much of what we know…we willfully ignore some information in order to quantify the amount of uncertainty.

The book gives a cogent account of the history and evolution of the ideas of probability, tackling head-on the questions: what is probability, how did we come to our current understanding of probability, and how did mathematical probability come to represent uncertainty and ambiguity?

Although Weisberg presents a nice historical perspective, the book is equally philosophical. In some ways it is a more leisurely read of the material found in Hacking, and in many ways more compelling.

I learned a great deal from this book. In many places I found myself re-reading sections and spiraling back to previously read sections to read them with some new understanding. I may even try to assign parts of it to the undergraduates I am teaching this summer.

This book would make a wonderful beach read for anyone interested in randomness or uncertainty, or for any academic hipster.

Quantitatively Thinking

John Oliver said it best: April 15 combines Americans’ two most-hated things: taxes and math.  I’ve been thinking about the latter recently after hearing a fascinating talk last weekend about quantitative literacy.

QL is meant to describe our ability to think with, and about, numbers.  QL doesn’t include high-level math skills, but usually is meant to describe our ability to understand percentages and proportions and basic mathematical operations.  This is a really important type of literacy, of course, but I think the QL movement could benefit from merging QL with SL (Statistical Literacy).

No surprise, that, coming from this blog.  But let me tell you why.  The speaker began by saying that many Americans can’t figure out, given the amount of gas in their tank, how many miles they have to drive before they run out of gas.

This dumbfounded me.  If it were literally true, you’d see stalled cars every few blocks in Los Angeles.  (Now we see them only every 3 or 4 miles.)  But I also thought, wait, do I know how far I can drive before I run out of gas?  My gas gauge says I have half a tank left, and I think (but am not certain) that my tank holds 16 gallons.  That means I probably have 8 gallons left.  I can see I’ve driven about 200 miles since I last filled up because I remembered to hit that little mileage reset button that keeps track of such things.  And so I’m averaging 25 mpg. But I’m also planning a trip to San Diego in the next couple of days, and then I’ll be driving on the highway, and so my mileage will improve.  And that 25 mpg is just an average, and averages have variability, but I don’t really have a sense of the variability of that mean.  And this problem requires that I know my mpg in the future, and, well, of all the things you can predict, the future is the hardest.  And so, I’m left to conclude that I don’t really know when my car will run out of gas.

Now while I don’t know the exact number of miles I can drive, I can estimate the value.  With a little more data I can measure the uncertainty in this estimate, too, and use that to decide, when the tank gets low, if I should push my luck (or push my car).

And that example, I think, illustrates a problem with the QL movement.  The issue is not that Americans don’t know how to calculate how far they can drive before their car runs out of gas, but that they don’t know how to estimate how far they can drive. This is not just mincing words. The actual problem from which the initial startling claim was made was something like this: “Your car gets 25 mpg and you have 8 gallons left in your tank.  How far can you drive before you run out of gas?”  In real life, the answer is “It depends.”  This is a situation that every first-year stats student should recognize contains variability.  (For those of you whose car tries to tell you how many miles you have left in your tank, you’ve probably experienced that pleasing event when you begin your trip with, say, 87 miles left in your tank and end your trip 10 miles later with 88 miles left in your tank.  And so you know first hand the variability in this system.) The correct response to this question is to try to estimate the miles you can drive, and to recognize assumptions you must make to do this estimation.  Instead, we are meant to go into “math mode” and recognize this not as a life-skills problem but as a Dreaded Word Problem.  One sign that you are dealing with a DWP is that there are implicit assumptions that you’re just supposed to know, and you’re supposed to ignore your own experience and plow ahead so that you can get the “right” answer, as opposed to the true answer. (Which is: “it depends.”)

A better problem would provide us with data.  Perhaps we would see the distances travelled on 8 gallons the last 10 trips.  Or perhaps on just 5 gallons, and then we would have to estimate how far we could go, on average, with 8 gallons.  And we should be asked to state our assumptions and to consider the consequences if those assumptions are wrong.  In short, we should be performing a modeling activity, and not a DWP.  Here’s an example:  On my last 5 trips, on 10 gallons of gas I drove 252, 184, 300, 355, 205 miles.  I have 10 gallons left, and I must drive 200 miles.  Do I need to fill up? Explain.**
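One way to start on the example problem above is a quick summary of the five trips. This is only a sketch of a first pass, not a model answer. It assumes the five trips are representative of the upcoming one, and the variable names are made up for illustration.

```r
# Miles driven on 10 gallons over the last 5 trips (from the problem above).
miles <- c(252, 184, 300, 355, 205)
needed <- 200  # miles I must drive on the remaining 10 gallons

avg_miles   <- mean(miles)         # typical distance on 10 gallons
sd_miles    <- sd(miles)           # trip-to-trip variability
short_trips <- sum(miles < needed) # past trips that would have fallen short

cat("Mean:", avg_miles, "miles; SD:", round(sd_miles, 1), "miles\n")
cat(short_trips, "of", length(miles), "past trips covered fewer than",
    needed, "miles\n")
```

On average the 10 gallons go about 259 miles, which is enough, but one of the five trips would have come up short. So the answer is still “it depends”: on how much risk I’m willing to accept, and on whether these trips resemble the one I’m planning.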

The point is that one reason QL seems to be such a problem is not that we can’t think about numbers, but that the questions that have been used to conclude that we can’t think about numbers are not reflective of real-life problems.  Instead, these questions are reflective of the DWP culture.  I should emphasize that this is just one reason.  I’ve seen first hand that many students wrestle with proportions and basic number-sense.  The sort of question that comes up often in intro stats (“I am 5 inches taller than average.  One standard deviation is 3 inches.  How many standard deviations above average am I?”) is a real stumper for many students, and this is sad because by the time they get to college this sort of thing should be answerable through habit, and not require thinking through for the very first time. (Interestingly, if you change the 5 to a 6 it becomes much easier for some, but not for all.)
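For the record, the intro stats question is a single division, sketched here in R with variable names made up for illustration:

```r
# Height example from the text: 5 inches above average, SD of 3 inches.
inches_above_average <- 5
sd_inches <- 3

# Number of standard deviations above average.
z <- inches_above_average / sd_inches
round(z, 2)

# The easier variant mentioned in the text: 6 inches above average.
6 / sd_inches
```

The first answer is about 1.67 standard deviations. The second comes out to exactly 2, which may be why that version feels easier.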

And so, while trying to ponder the perplexities of finding your tax bracket, be consoled that a great number of others (who really knows how many others?) are feeling the same QL anxiety as you.  But for a good reason:  tax problems are perhaps the rare examples of DWPs that actually matter.

**suggestions for improving this problem are welcome!

Interpreting Cause and Effect

One big challenge we all face is understanding what’s good and what’s bad for us.  And it’s harder when published research studies conflict. And so thanks to Roger Peng for posting on his Facebook page an article that led me to this article by Emily Oster:  Cellphones Do Not Give You Brain Cancer, from the good folks at the 538 blog. I think this article would make a great classroom discussion, particularly if, before you show your students the article, they themselves brainstorm several possible experimental designs and discuss strengths and weaknesses of the designs. I think it is also interesting to ask why no study similar to the Danish Cohort study was done in the US.  Thinking about this might lead students to think about cultural attitudes towards widespread data collection.

Fitbit Revisited

Many moons ago we wrote about a bit of a kludge to get data from a Fitbit (see here). Now it looks as though there is a much better way. Cory Nissen has written an R package to scrape Fitbit data and posted it on GitHub. He also wrote a blog post on his blog Stats and Things announcing the package and demonstrating its use. While I haven’t tried it yet, it looks pretty straightforward and much easier than anything else I have seen to date.

PD follow-up

Last Saturday the Mobilize project hosted a day-long professional development meeting for about 10 high school math teachers and 10 high school science teachers.  As always, it was very impressive how dedicated the teachers were, but I was particularly impressed by their creativity as, again and again, they demonstrated that they were able to take our lessons and add dimension to them that I, at least, didn’t initially see.

One important component of Mobilize is to teach the teachers statistical reasoning.  This is important because (a) the Mobilize content is mostly involved with using data analysis as a pathway for teaching math and science and (b) the Common Core (math) and the Next Generation (science) standards include much more statistics than previous curricula.  And yet, at least for math teachers, data analysis is not part of their education.

And so I was looking forward to seeing how the teachers performed on the “rank the airlines” Model Eliciting Activity, which was designed by the CATALYST project, led by Joan Garfield at U of Minnesota.  (Unit 2, Lesson 9 from the CATALYST web site.)  Model Eliciting Activities (MEAs) are a lesson design which I’m getting really excited about, and trying to integrate into more of my own lessons.  Essentially, groups of students are given realistic and complex questions to answer.  The key is to provide some means for the student groups to evaluate their own work, so that they can iterate and achieve increasingly improved solutions.  MEAs began in the engineering-education world, and have been used increasingly in mathematics both at college and high school and middle school levels.  (A good starting point is “Model-eliciting activities (MEAs) as a bridge between engineering education research and mathematics education research”, Hamilton, Lesh, Lester, Brilleslyper, 2008.  Advances in Engineering Education.) I was first introduced to MEAs when I was an evaluator for the CATALYST project, but didn’t really begin to see their potential until Joan Garfield pointed it out to me while I was trying to find ways of enhancing our Mobilize curriculum.

In the MEA we presented to the teachers on Saturday, they were shown data on arrival time delays from 5 airlines. Each airline had 10 randomly sampled flights into Chicago O’Hare from a particular year.  The primary purpose of the MEA is to help participants develop informal ways for comparing groups when variability is present.  In this case, the variability is present in an obvious way (different flights have different arrival delays) as well as less obvious ways (the data set is just one possible sample from a very large population, and there is sample-to-sample variability which is invisible. That is, you cannot see it in the data set, but might still use the data to conjecture about it.)

Before the PD I had wondered if the math and science teachers would approach the MEA differently.  Interestingly, during our debrief, one of the math teachers wondered the same thing.  I’m not sure if we saw truly meaningful differences, but here are some things we did see.

Most of the teams immediately hit on, and struggled with, the idea of merging both the airline accuracy and the airline precision into their ranking.  However, only two teams presented rules that used both.  Interestingly, one used precision (variability) as the primary ranking and used accuracy (mean arrival delay) to break ties; another group did the opposite.

At least one team ranked only on precision, but developed a different measure of precision that was more relevant to the problem at hand:  the mean absolute deviations from 0 (rather than deviations from the mean).
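To make the three kinds of rules concrete, here is a minimal sketch in R. The delays are made-up numbers, not the MEA data, and the column and variable names are hypothetical. I also assume that “accuracy” means how close the mean delay is to zero and that “precision” is measured by the standard deviation; the teams may well have used other measures.

```r
# Made-up arrival delays in minutes (negative = early); NOT the MEA data.
delays <- list(
  A = c(-5, 0, 3, 10, 12, -2, 4, 8, 1, 9),
  B = c(-20, 35, 2, -15, 40, 5, -10, 30, 0, 13),
  C = c(6, 7, 5, 8, 6, 7, 9, 5, 6, 7)
)

summary_table <- data.frame(
  airline       = names(delays),
  mean_delay    = sapply(delays, mean),                      # "accuracy"
  sd_delay      = sapply(delays, sd),                        # "precision"
  mad_from_zero = sapply(delays, function(x) mean(abs(x))),  # deviations from 0
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Rule 1: rank on precision first, break ties with accuracy.
rank_precision_first <- summary_table$airline[
  order(summary_table$sd_delay, abs(summary_table$mean_delay))]

# Rule 2: rank on accuracy first, break ties with precision.
rank_accuracy_first <- summary_table$airline[
  order(abs(summary_table$mean_delay), summary_table$sd_delay)]

# Rule 3: rank only on the mean absolute deviation from zero.
rank_mad_zero <- summary_table$airline[order(summary_table$mad_from_zero)]

print(summary_table)
```

With these made-up numbers the rules disagree: ranking on precision first puts airline C on top, while the other two rules put airline A on top. That disagreement is exactly the kind of thing the teams had to argue about.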

One of the more interesting things that came to my attention, as a designer of curriculum, was that almost every team wrestled with what to do with outliers.  This made me realize that we do a lousy job of teaching people what to do with outliers, particularly since outliers are not very rare.   (One could argue whether, in fact, any of the observations in this MEA are outliers or not, but in order to engage in that argument you need a more sophisticated understanding of outliers than we develop in our students.  I, myself, would not have considered any of the observations to be outliers.)  For instance, I heard teams expressing concern that it wasn’t “fair” to penalize an airline that had a fairly good mean arrival time just because of one bad outlier.  Other groups wondered if the bad outliers were caused by weather delays and, if so, whether it was fair to include those data at all.   I was very pleased that no one proposed an outright elimination of outliers. (At least within my hearing.)  But my concern was that they didn’t seem to have constructive ways of thinking about outliers.

The fact that teachers don’t have a way of thinking about outliers is our fault.  I think this MEA did a great job of exposing the participants to a situation in which we really had to think about the effect of outliers in a context where they were not obvious data-entry errors.  But I wonder how we can develop more such experiences, so that teachers and students don’t fall into procedural-based, automated thinking.  (e.g. “If it is more than 1.5 times the IQR away from the median, it is an outlier and should be deleted.”  I have heard/read/seen this far too often.)
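One alternative to automatic deletion is to flag unusual points and then look at how much they change the story. The sketch below uses the usual form of the 1.5 × IQR rule (Tukey’s fences, measured from the quartiles) purely as a flagging device. The delays are made-up numbers, not the MEA data, and the variable names are hypothetical.

```r
# Made-up arrival delays in minutes for one airline; NOT the MEA data.
delays <- c(-3, 0, 2, 4, 5, 6, 8, 9, 11, 95)

# Tukey's fences: flag points more than 1.5 * IQR beyond the quartiles.
q <- quantile(delays, c(0.25, 0.75))
fence_width <- 1.5 * IQR(delays)
flagged <- delays < q[1] - fence_width | delays > q[2] + fence_width

# Compare summaries with and without the flagged flights
# instead of silently deleting them.
comparison <- data.frame(
  summary     = c("mean", "median"),
  all_flights = c(mean(delays), median(delays)),
  unflagged   = c(mean(delays[!flagged]), median(delays[!flagged]))
)

print(delays[flagged])
print(comparison)
```

Here the single flagged flight moves the mean a lot and the median hardly at all, which opens the more interesting conversation: why was that flight so late, and which summary better answers the question being asked?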

Do you have a lesson that engages students in wrestling with outliers? If so, please share!

Model Eliciting Activity: Prologue

I’m very excited/curious about tomorrow: I’m going to lead about 40 math and science teachers in a data-analysis activity, using one of the Model Eliciting Activities from the University of Minnesota Catalysts for Change Project. (One of our bloggers, Andy, was part of this project.) Specifically, we’re giving them the arrival-delay times for a random sample of 10 flights from each of five different airlines into Chicago O’Hare, and asking them to come up with rules for ranking the airlines from best to worst.

I’m curious to see what they come up with, particularly whether  the math teachers differ terribly from the science teachers. The math teachers are further along in our weekend professional development program than are the science teachers, and so I’m hoping they’ll identify the key characteristics of a distribution (all together: center, spread, shape; well, shape doesn’t play much of a role here) and use these to formulate their rankings. We’ve worked hard on helping them see distributions as a unit, and not a collection of individual points, and have seen big improvements in the teachers, most of whom have not taught statistics before.

The science teachers, I suspect, will be a little bit more deterministic in their reasoning, and, if true to my naive stereotype of science teachers, will try to find explanations for individual points. Since I haven’t worked as much with the science teachers, I’m curious to see if they’ll see the distribution as a whole, or instead try to do point-by-point comparisons.

When we initially started this project, we had some informal ideas that the science teachers would take more naturally to data analysis than would the math teachers. This hasn’t turned out to be entirely true. Many of the math teachers had taught statistics before, and so had some experience. Those who hadn’t, though, tended to be rather procedurally oriented. For example, they often just automatically dropped outliers from their analysis without any thought at all, just because they thought that that was the rule. (This has been a very hard habit to break.)

The math teachers also had a very rigid view of what was and was not data. The science teachers, on the other hand, had a much more flexible view of data. In a discussion about whether photos from a smart phone were data, a majority of math teachers said no and a majority of science teachers said yes. On the other hand, the science teachers tend to use data to confirm what they already know to be true, rather than use it to discover something. This isn’t such a problem with the math teachers, in part because they don’t have preconceptions of the data and so have nothing to confirm. In fact, we’ve worked hard with the math teachers, and with the science teachers, to help them approach a data set with questions in mind. But it’s been a challenge teaching them to phrase questions for their students in which the answers aren’t pre-determined or obvious, and which are empirically oriented. (For example: We would like them to ask something like “what activities most often led to our throwing recycling into the trash bin?” rather than “Is it wrong to throw trash into the recycling bin?” or “Do people throw trash into the recycling bin?”)

So I’ll report back soon on what happened and how it went.

Annual Review of Reading

It is that time of year…time to review the previous year; make top 10 lists; and resolve to be a better person in 2015. I will tackle the first, but only of my reading habits. In 2014 I read 46 books for a grand total of 17,480 pages. (Note: I do not count academic books for work in this list, only books I read for recreation.) This is a yearly high, at least since I have been tracking this data on GoodReads (since late 2010). You can read an older annual report of reading here.

Year  Books  Pages
2011  45     15,332
2012  29     9,203
2013  45     15,887
2014  46     17,480

Since I have accumulated four years’ worth of data, I thought I might do some comparative analysis of my reading over this time period.
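As a small first comparison, the yearly totals in the table above can be turned into an average book length per year. This is just a sketch; the data frame and column names are made up for illustration.

```r
# Yearly totals copied from the table above.
reading <- data.frame(
  year  = 2011:2014,
  books = c(45, 29, 45, 46),
  pages = c(15332, 9203, 15887, 17480)
)

# Average number of pages per book in each year.
reading$pages_per_book <- round(reading$pages / reading$books, 1)
print(reading)
```

The average book length works out to roughly 341, 317, 353, and 380 pages for 2011 through 2014, so the 2014 high in pages reflects longer books as well as one more book.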

When am I reading?

[Figure: plot2]

The trend displayed here was somewhat surprising when I looked at it, at least related to the decline in reading over the summer months. Although, reflecting on it, it maybe should not have been as surprising. There is a slight uptick around the month of May (when spring semester ends) and the decline begins in June/July. Not only do summer classes begin, but I also try to do a few house and garden projects over the summer months. This uptick and decline are still visible when a plot of the number of pages (rather than the number of books) is examined, albeit much smaller (1,700 pages in May and 1,200 pages in the summer months). This might indicate I read longer books in the summer. For example, one of the books I read this last summer was Neal “I don’t know the meaning of the word ‘brevity’” Stephenson’s Reamde, which clocked in at a mere 1,044 pages.

Was I reading books that I ultimately enjoyed?

[Figure: plot3]

I also plotted my monthly average rating (on a five-point scale) for the four years of data. This plot shows that 2014 is an anomaly. I apparently read trash in the summer (which is what you are supposed to do). The previous three years I read the most un-noteworthy books in the fall. Or, I just rated them lower because school had started again.

Am I more critical than other readers? Is this consistent throughout the year?

I also looked at how other GoodReads readers had rated those same books. The months represent when I read the book. (I didn’t look at when the book was read by other readers, although that would be interesting to see if time of year has an effect on rating.) The scale on the y-axis is the residual between my rating and the average GoodReads rating. My ratings are generally close to the average, sometimes higher, sometimes lower. There are, however, many books that I rated much lower than average. The loess smooth suggests that July–November is when I am most critical relative to other readers.
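The residual-and-smooth calculation described above can be sketched as follows. The data here are simulated, not the actual GoodReads ratings, and the data frame and column names are hypothetical. The smooth uses `loess()` from base R with its default settings, which may differ from the settings behind the plot.

```r
set.seed(1)  # simulated data only; NOT the actual ratings

# One row per book: month read, my rating, and the average GoodReads rating.
books <- data.frame(
  month      = sample(1:12, 120, replace = TRUE),
  my_rating  = sample(1:5, 120, replace = TRUE),
  avg_rating = round(runif(120, min = 3, max = 4.5), 2)
)

# Residual: positive means I rated the book higher than the average reader.
books$residual <- books$my_rating - books$avg_rating

# Loess smooth of the residual against the month the book was read.
fit <- loess(residual ~ month, data = books)

# Evaluate the smooth at each month that appears in the data.
months_read <- sort(unique(books$month))
smoothed <- predict(fit, newdata = data.frame(month = months_read))

print(data.frame(month = months_read, smoothed_residual = round(smoothed, 2)))
```

Months where the smoothed residual dips furthest below zero are the ones in which the ratings are most critical relative to other readers. Calling `plot()` on the residuals and adding the smooth with `lines()` would give a picture of the same thing.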