The birthday problem and voter fraud

I was traveling at the end of last week, which means I had some time to listen to podcasts while in transit. This American Life is always a hit for me, though sometimes I can’t listen to it in public because the stories can be too sad, and then I get all teary eyed in airports…

This past week’s was both fun and informative though. I’m talking about Episode 630: Things I Mean to Know. This post is about a specific segment of this episode: Fraud Complex. You can listen to it here, and here is the description:

We’ve all heard reports that voter fraud isn’t real. But how do we know that’s true? David Kestenbaum went on a quest to find out if someone had actually put in the work—and run the numbers—to know for certain. (17 minutes)
Source: TAL – Episode 630: Things I Mean to Know – Act One – Fraud Complex

The segment discusses a specific type of voter fraud, double voting. David Kestenbaum interviews Sharad Goel (Stanford University) for the piece, and they discuss this paper of his. Specifically, there is a discussion of the birthday problem in there. If you’re not familiar with the birthday problem, see here. Basically, it concerns the probability that, in a set of randomly chosen people, some pair of them will have the same birthday. The episode walks through applying the same logic used to solve this problem to calculate probability of having people with the same name and birthdate on voter records. However it turns out the simple calculation assuming uniform distribution of births over the year does a poor job at estimating this probability because of reasons like people born in certain times of the year to be more likely to be named a certain way (e.g. June for babies born in June, Autumn for babies born in fall, etc.). I won’t tell the whole story, because the producers of the show do a much better job at telling it.

If you’re teaching probability, or discussing the birthday problem in any way in your class, I highly recommend you have your students listen to this segment. It’s a wonderful application, and I think interesting applications tend to be hard to come by in probability theory courses.


Hand Drawn Data Visualizations

Recently the blog Brain Pickings wrote about the set of hand-drawn visualizations that Civil Rights activist W.E.B. Du Bois commissioned for the1900 World’s Fair in Paris. (In a previous post, Rob wrote about an art exhibit he saw that featured artistic interpretations of these plots.)

Every time I see these visualizations I am amazed—they are gorgeous and the detail (and penmanship) is amazing. The visualizations included bar charts, area plots, and maps—all hand-drawn! You can read more and see several of the data visualizations here.

My Ideal Bookshelf

This summer I read My Ideal Bookshelf, a book of hand-drawn illustrations depicting well-known cultural icons’ bookshelves. Each person was asked to identify a small shelf of books that represented her/him. These could be books that “changed your life, that have made you who you are today, your favorite favorites.”

My Ideal Bookshelf

I really enjoyed looking at the illustrations of the books, recognizing several that sat on my bookshelf. The accompanying text also told the back-story of why these books were chosen. Although this isn’t data visualization, per se (at least in the traditional sense), the hand-drawn illustrations were what drew me to the book in the first place.

Dear Data

Giorgia Lupi and Stefanie Posavec also created hand-drawn data visualizations for their Dear Data project. The project, as described on their website was:

Each week, and for a year, we collected and measured a particular type of data about our lives, used this data to make a drawing on a postcard-sized sheet of paper, and then dropped the postcard in an English “postbox” (Stefanie) or an American “mailbox” (Giorgia)!

They ultimately turned their visualizations into a book. You can learn more at their website or from the Data Stories podcast related to the project.

Postcards from the Dear Data project

Mapping Manhattan

Another project, that was also turned into a book, was Mapping Manhattan: A Love (and Sometimes Hate) Story in Maps by 75 New Yorkers. Becky Cooper, the project’s creator, passed out blank maps (with return postage) of Manhattan to strangers she met walking around the city and asked them to “map their Manhattan.” (Read more here.)

Map drawn by New Yorker staff writer Patricia Marx

The hand-drawn maps conveyed the personal story of each person’s map in a way that a computer drawn image never could. This was so compelling, we have thought about including a similar assignment in our Data Visualization course. (Although people don’t connect to Minneapolis as much as they do to New York.)

There is something simple about a hand drawn data visualization. Because of the time involved in creating a hand-drawn visualization, the creator needs to more carefully plan the execution of the final product. As should be obvious, this implies that this is not the best mode for exploratory work. However for expository work, hand drawn plots can be powerful and, I feel, connect to the viewer a bit more. (Maybe this connection is pure illusion…after all, I prefer typewritten letters to word-processed letters as well, despite never receiving either in today’s age.)


Data Visualization Course for First-Year Students

A little over a year ago, we decided to propose a data visualization course at the first-year level. We had been thinking about this for awhile, but never had the time to teach it given the scheduling constraints we had. When one of the other departments on campus was shut down and the faculty merged in with other departments, we felt that the time was ripe to make this proposal.

Course description of the EPsy 1261 data visualization course

In putting together the proposal, we knew that:

  • The course would be primarily composed of social science students. My department, Educational Psychology, attracts students from the College of Education and Human Development (e.g., Child Psychology, Social Work, Family Social Science).
  • To attract students, it would be helpful if the course would fulfill the University’s Liberal Education (LE) requirement for Mathematical Thinking.

This led to several challenges and long discussions about the curriculum for this course. For example:

  • Should the class focus on producing data visualizations (very exciting for the students) or on understanding/interpreting existing visualizations (useful for most social science students)?
  • If we were going to produce data visualizations, which software tool would we use? Could this level of student handle R?
  • In order to meet the LE requirement, the curriculum for the course would need to show a rigorous treatment of students actually “doing” mathematics. How could we do this?
  • Which types of visualizations would we include in the course?
  • Would we use a textbook? How might this inform the content of the course?

Software and Content

After several conversations among the teaching team, with stakeholder departments, and with colleagues teaching data visualization courses at other universities, we eventually proposed that the course:

  • Focus both on students’ being able to read and understand existing visualizations and produce a subset of these visualizations, and
  • Use R (primary tool) and RAWGraphs for the production of these plots.

Software: Use ggplot2 in R

The choice to use R was not an immediate one. We initially looked at using Tableau, but the default choices made by the software (e.g., to immediately plot summaries rather than raw data) and the cost for students after matriculating from the course eventually sealed its fate (we don’t use it). We contemplated using Excel for a minute (gasp!), but we vetoed that even quicker than Tableau. The RAWGraphs website, we felt, held a lot of promise as a software tool for the course. It had an intuitive drag-and-drop interface, and could be used to create many of the plots we wanted students to produce. Unfortunately, we were not able to get the bar graph widget to produce side-by-side bar plots easily (actually at all). The other drawback was that the drag-and-drop interactions made it a harder sell to the LE committee as a method of building students’ computational and mathematical thinking if we used it as the primary tool.

Once we settled on using R, we had to decide between using the suite of base plots, or ggplot2 (lattice was not in the running). We decided that ggplot made the most sense in terms of thinking about extensibility. Its syntax was based on a theoretical foundation for creating and thinking about plots, which also made it a natural choice for a data visualization course. The idea of mapping variables to aesthetics was also consistent with the language used in RAWGraphs, so it helped reenforce core ideas across the tools. Lastly, we felt that using the ggplot syntax would also help students transition to other tools (such as ggviz or plotly) more easily.

One thing that the teaching team completely agreed on (and was mentioned by almost everyone who we talked to who taught data visualization) was that we wanted students to be producing graphs very early in the course; giving them a sense of power and the reenforcement that they could be successful. We felt this might be difficult for students with the ggplot syntax. To ameliorate this, we wrote a course-specific R package (epsy1261; available on github) that allows students to create a few simple plots interactively by employing functionality from the manipulate package. (We could have also done this via Shiny, but I am not as well-versed in Shiny and only had a few hours to devote to this over the summer given other responsibilities.)

Interactive creation of the bar chart using the epsy1261 package. This allows students to input  minimal syntax, barchart(data), and then use interaction to create plots.

Course Content

We decided on a three-pronged approach to the course content. The first prong would be based on the production of common statistical plots: bar charts, scatterplots, and maps, and some variations of these (e.g., donut plots, treemaps, bubble charts). The second prong was focused on reading more complex plots (e.g., networks, alluvial plots), but not producing them, except maybe by hand. The third prong was a group project. This would give students a chance to use what they had learned, and also, perhaps, access other plots we had not covered. In addition, we wanted students to consider narrative in the presentation of these plots—to tell a data-driven story.

Along with this, we had hoped to introduce students to computational skills such as data summarization, tidying, and joining data sets. We also wanted to introduce concepts such as smoothing (especially for helping describe trend in scatterplots), color choice, and projection and coordinate systems (in maps). Other things we thought about were using R Markdown and data scraping.


The reality, as we are finding now that we are over a third of the way through the course, is that this amount of content was over-ambitious. We grossly under-estimated the amount of practice time these students would need, especially working with R. Two things play a role in this:

  1. The course attracted way more students than we expected for the first offering (our class size is 44) and there is a lot of heterogeneity of students’ experiences and academic background. For example, we have graduate students from the School of Design, some first years, and mostly sophomores and juniors. We also have a variety of majors including, design, the social sciences, and computer science.
  2. We hypothesize that students are not practicing much outside of class. This means they are essentially only using R twice a week for 75 minutes when they are in class. This amount of practice is too infrequent for students to really learn the syntax.

Most of the students’ computational experiences are minimal prior to taking this course. They are very competent at using point-and-click software (e.g., Google Docs), but have an abundance of trouble when forced to use syntax. The precision of case-sensitivity, commas, and parentheses is outside their wheelhouse.

I would go so far as to say that several of these students are intimidated by the computation, and completely panic on facing an error message. This has led to us having to really think through and spend time discussing computational workflows and dealing with how to “de-bug” syntax to find errors. All of this has added more time than we anticipated on the actual computing. (While this may add time, it is still educationally useful for these students.)

The teaching team meets weekly for 90 minutes to discuss and reflect on what happened in the course. We also plan what will happen in the upcoming week based on what we observed and what we see in students’ homework. As of now, we clearly see that students need more practice, and we have begun giving students the end result of a plot and asking them to re-create these.

I am still hoping to get to scatterplots and maps in the course. However, some of the other computational ideas (scraping, joining) may have to be relegated to conceptual ideas in a reading. We are also considering scrapping the project, at least for this semester. At the very least, we will change it to a more structured set of plots they need to produce rather than letting them choose the data sets, etc. Live and learn. Next time we offer the course it will be better.

*Technology note: RAWGraphs can be adapted by designing additional chart types, so in theory, if one had time, we could write our own version to be more compatible with the course. We are also considering using the ggplotgui package, which is a Shiny dashboard for creating ggplot plots.



Data Science Webinar Announcement

I’m pleased to announce that on Monday, September 11 , 9-11 am Pacific, I’ll be leading a Concord Consortium Data Science Education Webinar. Oddly, I forgot to give it a title, but it would be something like “Towards a Learning Trajectory for K-12 Data Science”. This webinar, like all Concord webinars, is intended to be highly interactive. Participants should have their favorite statistical software at the ready. A detailed abstract as well as registration information is here

At the same site you can view recent wonderful webinars by Cliff Konold, Hollylynne Lee and Tim Erickson.

Revisiting that first day of class example

About a year ago I wrote this post: 

I wasn’t teaching that semester, so couldn’t take my own advice then, but thankfully (or the opposite of thankfully) Trump’s tweets still make timely discussion.

I had two goals for presenting this example on the first day of my data science course (to an audience of all first-year undergraduates, with little to no background in computing and statistics):

  1. Give a data analysis example with a familiar context
  2. Show that if they take the time to read the code, they can probably understand what it’s doing, at least at a high level

First, I provided them some context: “The author wanted to analyze Trump’s tweets: both the text, and some other information on the tweets like when and from what device they were posted.” And I asked the students “If you wanted to do this analysis, how would you go about collecting the data?”. Some suggested manual data collection, which we all agreed is too tedious. A few suggested there should be a way to get the data from Twitter. So then we went back to the blog post, and worked our way through some of the code. (My narrative is roughly outlined in handwriting below.)

The moral of the story: You don’t need to figure out how to write a program that gets tweets from Twitter. Someone else has already done it, and packaged it up (in a package called twitteR), and made it available for you to use. Here, the important message I tried to convey was that “No, I don’t expect you to know that this package exists, or to figure out how to use it. But I hope you agree that once you know the package exists, it’s worth the effort to figure out how to use its functionality to get the tweets, instead of collecting the data manually.”

Then, we discussed the following plot in detail:

First, I asked the students to come up with a list of variables we need in our dataset so that we can make this plot: we need to know what time each tweet was posted and what device it came from and we need to know how what percentage of tweets were posted in a given hour.

Here is the breakdown of the code (again, my narrative is in the handwritten comments):

Once again, I wanted to show the students that if they take some time, they can probably figure out roughly what each line (ok, maybe not each, but most lines) of code are doing. We didn’t get into discussing what’s a geom, what’s the difference between %>% and +, what’s an aesthetic, etc. We’ll get into those, but the night semester is young…

My hope is that next time I present how to do something new in R, they’ll remember this experience of being able to mostly figure out what’s happening by taking some time staring at the code and thinking about “if I had to do this by hand, how would I go about it?”.


Citizen Statistician’s very own Mine Çetinkaya-Rundel gave one of the keynote addresses at USCOTS 2017.

The abstract for her talk, Teaching Data Science and Statistical Computation to Undergraduates, is given below.

What draws students to statistics? For some, the answer is mathematics, and for those a course in probability theory might be an attractive entry point. For others, their first exposure to statistics might be an applied introductory statistics course that focuses on methodology. This talk presents an alternative focus for a gateway to statistics: an introductory data science course focusing on data wrangling, exploratory data analysis, data visualization, and effective communication and approaching statistics from a model-based, instead of an inference-based, perspective. A heavy emphasis is placed on best practices for statistical computation, such as reproducibility and collaborative computing through literate programming and version control. I will discuss specific details of this course and how it fits into a modern undergraduate statistics curriculum as well as the success of the course in recruiting students to a statistics major.

You can view her slides at



Some Reading for the Winter Break

It has been a long while since I wrote anything for Citizen Statistician, so I thought I would scribe a post about three books that I will be reading over break.







The first book is Cathy O’Neil’s book, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy [link to Amazon]. I am currently in the midst of Chapter 3. I heard about this book on an episode of 538’s podcast, What’s the Point?, on which O’Neil was featured [Who’s Accountable When An Algorithm Makes A Bad Decision?]. The premise of this book has been something that has been on the mind of many people thinking about data science and algorithms in recent years (and probably not-so-recent years); that many algorithms, and thus the predictions stemming from them, are not transparent. This leads to many ethical and, potentially, legal issues when algorithms are then used to make decisions about recidivism, loan applications, college admissions, etc. I think this book could be the basis for a very interesting seminar. Let me know if anyone is working on something like this.

The second book I will be reading is Michael Lewis’ The Undoing Project: A Friendship That Changed Our Minds [link to Amazon]. This book is bout the friendship, collaboration, and, ultimately, disentanglement between the renowned psychologists Daniel Kahnemann and Amos Tversky. I learned about Kahnemann and Tversky’s work early in my graduate career when Joan Garfield taught a doctoral research seminar on the seminal psychological work related to probabilistic thinking and statistics education. We read not only Kahnemann and Tversky, but also Gird Gigerenzer, Ruma Falk, Maya Bar Hillel, Richard Nisbett, Efraim Fischbein, and others. Interestingly, What’s the Point? recently did two episodes on Lewis’ book as well; Michael Lewis’s New Book Examines How We Think About Thinking and Nate Silver Interviews Michael Lewis About His New Book, ‘The Undoing Project’.

The third book is Who’s #1?: The Science of Rating and Ranking [link to Amazon] by Amy Langville and Carl Meyer. I had read their earlier book, Google’s PageRank and Beyond: The Science of Search Engine Rankings, several years ago, and was quite impressed with the readability of the complex matrix algebra they presented. In Who’s #1, the authors present the mathematics underlying several ratings systems including the Massey system, Elo, Colley, Keener, etc. I am actually treating this book like a self-taught class,  working out several of their examples using R, and really trying to understand the ideas. My interest here is related to the work that I am doing with Brandon LeBeau (University of Iowa) and a current graduate student, Kyle Nickodem on estimating the coaching ability for NCAA football coaches using a hierarchical IRT model [see slides from a talk here].

Measurement error in intro stats

I have recently been asked by my doctor to closely monitor my blood pressure, and report it if it’s above a certain cutoff. Sometimes I end up reporting it by calling a nurse line, sometimes directly to a doctor in person. The reactions I get vary from “oh, it happens sometimes, just take it again in a bit” to “OMG the end of the world is coming!!!” (ok, I’m exaggerating, but you get the idea). This got me thinking: does the person I’m talking to understand measurement error? Which then got me thinking: I routinely teach intro stats courses that for some students is the only stats, and potentially only quantitative reasoning, course they might take in college, do I discuss measurement error properly in this course? I’m afraid the answer is no… It’s certainly mentioned within the context of a few case studies, but I can’t say that it gets the emphasis it deserves. I also browsed through a few intro stats books (including mine!) and not a mention of “measurement error” specifically.

I’m always hesitant to make statements like “we should teach this in intro stats” because I know most intro stats curriculum is already pretty bloated, and it’s not feasible to cover everything in one course. But this seems to be a pretty crucial concept for someone to understand in order to be able to have meaningful conversations with their health providers and make better decisions (or stay calm) about their health that I think it is indeed worth spending a little bit of time on.

Statistics with R on Coursera

18332552I held off on posting about this until we had all the courses ready, and we still have a bit more work to do on the last component, but I’m proud to announce that the specialization called Statistics with R is now on Coursera!

Some of you might know that I’ve had a course on Coursera for a while now (whatever “a while” means on MOOC-land), but it was time to refresh things a bit to align the course with other Coursera offerings — shorter, modular, etc. So I chopped up the old course into bite size chunks and made some enhancements in each component such as

  • integrating dplyr and ggplot2 syntax into the R labs,
  • restructuring the labs to be completed in R Markdown to provide better scaffolding for a data analysis project for each course,
  • adding Shiny apps to some of the labs to better demonstrate statistical concepts without burdening the learners with coding beyond the level of the course,
  • creating an R package that contains all the data, custom functions, etc. used in the course, and
  • cleaning things up a bit to make the weekly workload consistent across weeks.

The underlying code for the labs and the package can be found at Here you can also find the R code for reproducing some of the figures and analyses shown on the course slides (and we’ll keep adding to that repo in the next few weeks).

The biggest change between the old course and the new specialization though is a completely new course: Bayesian Statistics. I touched on Bayesian inference a bit in my old course, and this generated lots of discussion on the course forums from learners wanting more on this content. Being at Duke, I figured who better to offer this course but us! (If you know anything about the Statistical Science department at Duke, you probably know it’s pretty Bayesian.) Note, I didn’t say “me”, I said “us”. I was able to convince a few colleagues (David Banks, Merlise Clyde, and Colin Rundel) to join me in developing this course, and I’m glad I did! Figuring out exactly how to teach this content in an effective way without assuming too much mathematical background took lots of thinking (and re-thinking, and re-thinking). We have also managed to feature a few interviews with researchers in academia and industry, such as Jim Berger (Duke), David Dunson (Duke), Amy Herring (UNC), and Steve Scott (Google) to provide a bit more context for learners on where and why Bayesian statistics is relevant. This course launched today, and I’m looking forward to seeing the feedback from the learners.

If you’re interested in the specialization, you can find out more about it here. The courses in the specialization are:

  1. Introduction to Probability and Data
  2. Inferential Statistics
  3. Linear Regression and Modeling
  4. Bayesian Statistics
  5. Statistics Capstone Project

You can take the courses individually or sign up for the whole specialization, but to do the capstone you need to have completed the 4 courses in the specialization. The landing page for the specialization outlines in further detail how to navigate everything, and relevant dates and deadlines.

Also note that while the graded components of the course which will allow you to pursue a certificate require payment, one can audit the courses for free and watch videos, complete practices quizzes, and work on the labs.

Project TIER

Last year I was awarded a Project TIER (Teaching Integrity in Empirical Research) fellowship, and last week my work on the fellowship wrapped up with a meeting with the project leads, other fellows from last year, as well as new fellows for the next year. In a nutshell Project TIER focuses on reproducibility. Here is a brief summary of the project’s focus from their website:

For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

As part of the fellowship, beyond continuing working on integrating reproducible data analysis practices into my courses with the use of literate programming via R Markdown and version control via git/GitHub, I have also created templates two GitHub repositories that follow the Project TIER guidelines: one for use with R and the other with Stata. They both live under the Project TIER organization on GitHub. The idea is that one wishing to follow the folder structure and workflow suggested by Project TIER can make a copy of these repositories and easily organize their work following the TIER guidelines.

There is more work to be done on these of course, first of which is evolving the TIER guidelines themselves to line up better with working with git and R as well as working with tricky data (like large data, or private data, etc.). Some of these are issues the new fellows might tackle in the next year.

As part of the fellowship I also taught a workshop titled “Making your research reproducible with Project TIER, R, and GitHub” to Economics graduate students at Duke. These are students who primarily use Stata so the workshop was a first introduction to this workflow, using the RStudio interface for git and GitHub. Materials for this workshop can be found here. At the end of the workshop I got the sense that very few of these students were interested in making the switch over to R (can’t blame them honestly — if you’ve been working on your dissertation for years and you just want to wrap it up, the last thing you want to do is to have to rewrite all your code and redo your analysis in a different platform) but quite a few of them were interested in using GitHub for both version control and for showcasing their work publicly.

Also as part of the fellowship Ben Baumer (a fellow fellow?) and I have organized a session on reproducibility at JSM 2016 that I am very much looking forward to. See here for the line up.

In summary, being involved with this project was a great eye opener to the fact that there are researchers and educators out there who truly care about issues surrounding reproducibility of data analysis but who are very unlikely to switch over to R because that is not as customary for their discipline (although at least one fellow did after watching my demo on R Markdown in the 2015 meeting, that was nice to see 😁). Discussions around working with Stata made me once again very thankful for R Markdown and RStudio which make literate programming a breeze in R. And what my mean by “a breeze” is “easy to teach to and be adopted by anyone from a novice to expert R user”. It seems to me like it would be in the interest of companies like Stata to implement such a workflow/interface to support reproducibility efforts of researchers and educators using their software. I can’t see a single reason why they wouldn’t invest time (and yes, money) in developing this.

During these discussions a package called RStata also came up. This package is “[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

Overall my involvement with the fellowship was a great experience for meeting and brainstorming with faculty from non-stats disciplines (mostly from the social sciences) who regularly teach in platforms like Stata and SPSS who are also dedicated to teaching reproducible data analysis practices. I’m often the person who tries to encourage people to switch over to R, and I don’t think I’ll be stopping doing that anytime soon, but I do believe that if we want all who do data analysis to do it reproducibly, efforts must be made to (1) come up with workflows that ensure reproducibility in statistical software other than R, and (2) create tools that make reproducible data analysis easier in such software (e.g. tools similar to R Markdown designed specifically for these software).


PS: It’s been a while since I last posted here, let’s blame it on a hectic academic year. I started and never got around to finishing two posts in the past few months that I hope to finish and publish soon. One is about using R Markdown for generating course/TA evaluation reports and the other is on using Slack for managing TAs for a large course. Stay tuned.

PPS: Super excited for #useR2016 starting on Monday. The lack of axe-throwing will be disappointing (those who attended useR 2015 in Denmark know what I’m talking about) but otherwise the schedule promises a great line up!