The birthday problem and voter fraud

I was traveling at the end of last week, which meant I had some time to listen to podcasts while in transit. This American Life is always a hit for me, though sometimes I can’t listen to it in public because the stories can be too sad, and then I get all teary-eyed in airports…

This past week’s was both fun and informative though. I’m talking about Episode 630: Things I Mean to Know. This post is about a specific segment of this episode: Fraud Complex. You can listen to it here, and here is the description:


We’ve all heard reports that voter fraud isn’t real. But how do we know that’s true? David Kestenbaum went on a quest to find out if someone had actually put in the work—and run the numbers—to know for certain. (17 minutes)
Source: TAL – Episode 630: Things I Mean to Know – Act One – Fraud Complex

The segment discusses a specific type of voter fraud: double voting. David Kestenbaum interviews Sharad Goel (Stanford University) for the piece, and they discuss this paper of his. Specifically, there is a discussion of the birthday problem in there. If you’re not familiar with the birthday problem, see here. Basically, it concerns the probability that, in a set of randomly chosen people, some pair of them will have the same birthday. The episode walks through applying the same logic used to solve this problem to calculate the probability of finding people with the same name and birthdate in voter records. It turns out, however, that the simple calculation assuming a uniform distribution of births over the year does a poor job of estimating this probability, in part because people born at certain times of the year are more likely to be given certain names (e.g. June for babies born in June, Autumn for babies born in the fall, etc.). I won’t tell the whole story, because the producers of the show do a much better job of telling it.
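If you are walking through the math with students, here is a minimal sketch in R of the classic calculation, under the simplifying assumption that birthdays are independent and uniform over 365 days (the very assumption the episode shows is too simple once you move to names plus birthdates):

```r
# P(at least one shared birthday among n people), assuming birthdays are
# independent and uniform over 365 days
p_shared <- function(n) {
  1 - prod((365 - seq_len(n) + 1) / 365)
}

sapply(c(10, 23, 50), p_shared)
# roughly 0.12, 0.51, 0.97
```

Roughly speaking, the same logic with the days of the year swapped out for name-and-birthdate combinations is what the simple voter-record calculation does, and the segment explains why that estimate misses the mark.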

If you’re teaching probability, or discussing the birthday problem in any way in your class, I highly recommend you have your students listen to this segment. It’s a wonderful application, and I think interesting applications tend to be hard to come by in probability theory courses.

 

Mapping Irma, but not really…

We’re discussing data visualization nowadays in my course, and today’s topic was supposed to be mapping. However, late last night I realized I was going to run out of time, so I decided to table hands-on mapping exercises until a bit later in the course (after we do some data manipulation as well, which I think will work better).

That being said, talking about maps seemed timely, especially with Hurricane Irma developing. Here is how we went about it:

In addition to what’s on the slide, I told the students that they could assume the map is given, and that they should only think about how the forecast lines would be drawn.

Everyone came up with “we need latitude and longitude and time”. However, some teams suggested each column would represent one of the trajectories (wide data), while others came up with the idea of having an indicator column for the trajectory (long data). We sketched out on the board what these two data frames would look like, and evaluated which would be easier to plot directly with the tools we’ve learned so far (plotting in R with ggplot2).
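For concreteness, here is a rough sketch of the long-data version and how it plugs into ggplot2. The model names and coordinates below are made up for illustration; they are not actual forecast data.

```r
library(ggplot2)

# Hypothetical long-format forecast data: one row per model per time point
irma_long <- data.frame(
  model = rep(c("model_A", "model_B"), each = 3),
  time  = rep(1:3, times = 2),
  lat   = c(21.5, 23.0, 25.1, 21.5, 22.6, 24.0),
  lon   = c(-72.0, -74.5, -77.2, -72.0, -75.3, -79.0)
)

# With long data, one aesthetic mapping handles any number of trajectories
ggplot(irma_long, aes(x = lon, y = lat, group = model, color = model)) +
  geom_path()
```

With the wide version (a pair of lat/lon columns per trajectory) you would need a separate geom_path() call, or a reshape, for each trajectory, which is part of why the long format came out ahead with the tools we had covered.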

While this was a somewhat superficial activity compared to a hands-on mapping exercise, I thought it worked well for a variety of reasons:

  1. It was a timely example that grabbed students’ attention.
  2. It generated lively discussion around various ways of organizing data into data frames (which hopefully will serve as a good primer for the data manipulation unit where we’ll discuss how data don’t always come in the format you need and you might need to get it in shape first before you can visualize/analyze it).
  3. Working backwards from a visualization to source data (as opposed to from data to visualization) provided a different challenge/perspective, and a welcome break from “how do I get R to plot this?”.
  4. We got to talk about the fact that predictions based on the same source data can vary depending on the forecasting model (foreshadowing of concepts we will discuss in the modeling unit coming up later in the course).
  5. It was quick to prepare! And quick to work through in class (~5 mins of team discussion + ~10 mins of class discussion).

I also suggested to students that they read the underlying NYTimes article as well as this Upshot article if they’re interested in finding out more about modeling the path of a hurricane (or modeling anything, really) and uncertainty.

Revisiting that first day of class example

About a year ago I wrote this post: 

I wasn’t teaching that semester, so couldn’t take my own advice then, but thankfully (or the opposite of thankfully) Trump’s tweets still make timely discussion.

I had two goals for presenting this example on the first day of my data science course (to an audience of all first-year undergraduates, with little to no background in computing and statistics):

  1. Give a data analysis example with a familiar context
  2. Show that if they take the time to read the code, they can probably understand what it’s doing, at least at a high level

First, I provided them some context: “The author wanted to analyze Trump’s tweets: both the text, and some other information on the tweets like when and from what device they were posted.” Then I asked the students, “If you wanted to do this analysis, how would you go about collecting the data?” Some suggested manual data collection, which we all agreed would be too tedious. A few suggested there should be a way to get the data from Twitter. So then we went back to the blog post, and worked our way through some of the code. (My narrative is roughly outlined in handwriting below.)

The moral of the story: You don’t need to figure out how to write a program that gets tweets from Twitter. Someone else has already done it, and packaged it up (in a package called twitteR), and made it available for you to use. Here, the important message I tried to convey was that “No, I don’t expect you to know that this package exists, or to figure out how to use it. But I hope you agree that once you know the package exists, it’s worth the effort to figure out how to use its functionality to get the tweets, instead of collecting the data manually.”
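To give a flavor of what that looks like, here is a minimal sketch of fetching the tweets with twitteR. The credentials below are placeholders (you would need your own Twitter API keys), and the exact code in the original blog post may differ:

```r
library(twitteR)

# Authenticate with placeholder credentials (substitute your own API keys)
setup_twitter_oauth("consumer_key", "consumer_secret",
                    "access_token", "access_secret")

# Pull recent tweets from the account (n is capped by the Twitter API)
tweets <- userTimeline("realDonaldTrump", n = 3200)

# Convert the list of status objects into a data frame for analysis
tweets_df <- twListToDF(tweets)
```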

Then, we discussed the following plot in detail:

First, I asked the students to come up with a list of variables we need in our dataset so that we can make this plot: we need to know what time each tweet was posted, what device it came from, and what percentage of tweets were posted in a given hour.

Here is the breakdown of the code (again, my narrative is in the handwritten comments):
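(The annotated code itself is in the image; in spirit it looks roughly like the sketch below, where tweets_df is assumed to have the created timestamp from twitteR and a cleaned-up source column identifying the device. The original post’s code may differ in the details.)

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# For each device, compute the percentage of its tweets posted in each hour
tweets_df %>%
  mutate(hour = hour(with_tz(created, "EST"))) %>%
  count(source, hour) %>%
  group_by(source) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(x = hour, y = percent, color = source)) +
  geom_line() +
  labs(x = "Hour of day (EST)", y = "% of tweets", color = "Device")
```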

Once again, I wanted to show the students that if they take some time, they can probably figure out roughly what each line of code (ok, maybe not each, but most lines) is doing. We didn’t get into discussing what a geom is, what the difference between %>% and + is, what an aesthetic is, etc. We’ll get into those, but the semester is young…

My hope is that next time I present how to do something new in R, they’ll remember this experience of being able to mostly figure out what’s happening by taking some time staring at the code and thinking about “if I had to do this by hand, how would I go about it?”.

Modernizing the Undergraduate Statistics Curriculum at #JSM2017

I’m a bit late in posting this, but travel delays post-JSM left me weary, so I’m just getting around to it. Better late than never?

Wednesday at JSM featured an invited statistics education session on Modernizing the Undergraduate Statistics Curriculum. This session featured two types of speakers: those who are currently involved in undergraduate education and those who are on the receiving end of graduating majors. The speakers involved in undergraduate education presented on their recent efforts for modernizing the undergraduate statistics curriculum to provide the essential computational and problem solving skills expected from today’s modern statistician while also providing a firm grounding in theory and methods. The speakers representing industry discussed their expectations (or hopes and dreams) for new graduates and where they find gaps in the knowledge of new hires.

The speakers were Nick Horton (Amherst College), Hilary Parker (Stitch Fix), Jo Hardin (Pomona College), and Colin Rundel (Duke University). The discussant was Rob Gould (UCLA). Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments.

Modernizing the undergraduate statistics curriculum: what are the theoretical underpinnings? – Nick Horton

Hopes and dreams for statistics graduates – Hilary Parker

Expectations and Skills for Undergraduate Students Doing Research in Statistics and Data Science – Jo Hardin

Moving Away from Ad Hoc Statistical Computing Education – Colin Rundel

Discussion – Rob Gould

Novel Approaches to First Statistics / Data Science Course at #JSM2017

Tuesday morning, bright and early at 8:30am, was our session titled “Novel Approaches to First Statistics / Data Science Course”. For some students the first course in statistics may be the only quantitative reasoning course they take in college. For others, it is the first of many in a statistics major curriculum. The content of this course depends on which audience the course is aimed at as well as its place in the curriculum. However, a data-centric approach with an emphasis on computation and algorithmic thinking is essential for all modern first statistics courses. The speakers in our session presented their approaches for the various first courses in statistics and data science that they have developed and taught. The discussion also highlighted pedagogical and curricular choices they have made in deciding what to keep, what to eliminate, and what to modify from the traditional introductory statistics curriculum. The speakers in the session were Ben Baumer from Smith College, Rebecca Nugent from CMU, myself, and Daniel Kaplan from Macalester College. Our esteemed discussant was Dick DeVeaux, and our chair, the person who managed to keep this rambunctious bunch on time, was Andrew Bray from Reed College. Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments, or find us on social media!

Ben Baumer – Three Methods Approach to Statistical Inference

Rebecca Nugent – Lessons Learned in Transitioning from “Intro to Statistics” to “Reasoning with Data”

Mine Cetinkaya-Rundel – A First-Year Undergraduate Data Science Course

Daniel Kaplan – Teaching Stats for Data Science

Dick DeVeaux – Discussion

 

My JSM 2017 itinerary

JSM 2017 is almost here. I just landed in Maryland, and I finally managed to finish combing through the entire program. What a packed schedule! I like writing an itinerary post each year, mainly so I can come back to it during and after the event. I obviously won’t make it to all sessions listed for each time slot below, but my decision for which one(s) to attend during any time period will likely depend on proximity to the previous session, and potentially also proximity to the childcare area.

The sessions I selected focus on education, data science, computing, visualization, and social responsibility. In addition to talks on topics I actively work in, I also enjoy listening to talks in application areas I’m interested in, hence the last topic on this list.

If you have suggestions for other sessions (in these topics or others) that you think would be of interest, let me know in the comments!

Sun, 7/30/2017

Sunday will be mostly meetings for me, and I’m skipping any evening stuff to see Andrew Bird & Belle and Sebastian!

Mon, 7/31/2017

  • DataFest meeting: 10am – 12pm at H-Key Ballroom 9. Stop by if you’re already an ASA DataFest organizer, or if you’d like to be one in the future!
    • First hour will be discussing what worked and what didn’t, any concerns, kudos, advice for new sites, etc.
    • Second hour will be drop-in for addressing any questions regarding organizing an ASA DataFest at your institution.
  • Computing and Graphics mixer: 6 – 8pm at H-Key Ballroom 1.
  • Caucus for Women in Statistics Reception and Business Meeting: 6:30 – 8:30pm at H-Holiday Ballroom 1&2.

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

ASA President’s Invited Speaker: It’s Not What You Said. It’s What They Heard – Jo Craven McGinty, The Wall Street Journal

Tue, 8/1/2017

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

Deming Lecture: A Rake’s Progress Revisited – Fritz Scheuren, NORC-University of Chicago

Wed, 8/2/2017

  • Statistical Education Business Meeting – 6-7:30pm

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

COPSS Awards and Fisher Lecture: The Importance of Statistics: Lessons from the Brain Sciences – Robert E. Kass, Carnegie Mellon University

Thur, 8/3/2017

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

Read elsewhere: Organizing DataFest the tidy way

Part of the reason why we have been somewhat silent at Citizen Statistician is that it’s DataFest season, and that means a few weeks (months?) of all-consuming organization followed by a weekend of super fun data immersion and exhaustion… Each year that I organize DataFest I tell myself “next year, I’ll do [blah] to make my life easier”. This year I finally did it! Read about how I’ve been streamlining the process of registrations, registration confirmations, and dissemination of information prior to the event in my post titled “Organizing DataFest the tidy way” on the R Views blog.

Stay tuned for an update on ASA DataFest 2017 once all 31 DataFests around the globe have concluded!

Measurement error in intro stats

I have recently been asked by my doctor to closely monitor my blood pressure, and to report it if it’s above a certain cutoff. Sometimes I end up reporting it by calling a nurse line, sometimes directly to a doctor in person. The reactions I get vary from “oh, it happens sometimes, just take it again in a bit” to “OMG the end of the world is coming!!!” (ok, I’m exaggerating, but you get the idea). This got me thinking: does the person I’m talking to understand measurement error? Which then got me thinking: I routinely teach intro stats courses that, for some students, will be the only stats course, and potentially the only quantitative reasoning course, they take in college. Do I discuss measurement error properly in this course? I’m afraid the answer is no… It’s certainly mentioned within the context of a few case studies, but I can’t say that it gets the emphasis it deserves. I also browsed through a few intro stats books (including mine!) and found no specific mention of “measurement error”.
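To illustrate the kind of point I would want students to take away, here is a small simulation sketch in R. The numbers below (a true value of 125, measurement noise with standard deviation 8, a cutoff of 140) are made up for illustration, not clinical facts:

```r
set.seed(1)

# True blood pressure is below the cutoff, but each reading comes with noise
true_bp  <- 125
readings <- true_bp + rnorm(1000, mean = 0, sd = 8)

# How often does a single noisy reading cross the cutoff anyway?
mean(readings > 140)

# Averaging three readings shrinks the measurement error considerably
avg_of_3 <- replicate(1000, mean(true_bp + rnorm(3, mean = 0, sd = 8)))
mean(avg_of_3 > 140)
```

The takeaway for students (and patients) would simply be that a single reading above a cutoff is much less informative than a pattern of readings above it.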

I’m always hesitant to make statements like “we should teach this in intro stats” because I know the intro stats curriculum is already pretty bloated, and it’s not feasible to cover everything in one course. But measurement error seems to be such a crucial concept for having meaningful conversations with health providers and making better decisions (or staying calm) about one’s health that I think it is indeed worth spending a little bit of time on.

Slack for managing course TAs

I meant to write this post last year when I was teaching a large course with lots of teaching assistants to manage, but, well, I was teaching a large course with lots of teaching assistants to manage, so I ran out of time…

There is nothing all that revolutionary here. People have been using Slack to manage teams for a while now. I’ve even come across some articles / posts on using Slack as a course discussion forum, so use of Slack in an educational setting is not all that new either. But I have not heard of people using Slack for organizing the course and managing TAs, so I figured it might be worthwhile to write about my experience.

TL;DR: A+, would do it again!

I’ll be honest, when I first found out about Slack, I wasn’t all that impressed. First, I kept thinking it was called Slacker, and I was like, “hey, I’m no slacker!” (I totally am…). Second, I initially thought one had to use Slack in the browser, and I accidentally kept closing the tab and hence missing messages. There is a Slack app that you can run on your computer or phone, but it took me a while to realize that. Because of my rocky start with it, I didn’t think to use Slack in my teaching. I must credit my co-instructor, Anthea Monod, for the idea of using Slack for communicating with our TAs.

Between the two instructors we had 12 TAs to manage. We set up a Slack team for the course with channels like #labs, #problem_sets, #office_hours, #meetings, etc.

This setup worked really well for us for a variety of reasons:

  • Keep course management related emails out of email inbox: These really add up. At this point, any email I can keep out of my inbox is a win in my book!
  • Easily keep all TAs in the loop: Need to announce a typo in a solution key? Or give TAs a heads-up about questions they might expect in office hours? I used to handle these by emailing them all, and either I’d miss one or two people, or a TA responding to my email would forget to reply all (people never seem to reply all when they should, but they always do when they shouldn’t!).
  • Provide a space for TAs to easily communicate with each other: Our TAs used Slack to let others know they might need someone to cover for them for office hours, or teaching a section, etc. It was nice to be able to alert all of them at once, and also for everyone to see when someone responded saying they’re available to cover.
  • Keep a record of decisions made in an easily searchable space: Slack’s search is not great, but it’s better than my email’s for sure. Plus, since you’re searching only within that team’s communication, as opposed to through all your emails, it’s a lot easier to find what you’re looking for.
  • It’s fun: The #random channel was a place people shared funny tidbits or cool blog posts etc. I doubt the TAs would be emailing each other with these if this communication channel wasn’t there. It made them act more like a community than they would otherwise.
  • It’s free: At least for a reasonable amount of usage for a semester long course.

Some words of advice if you decide to use Slack for managing your own course:

  • There is a start-up cost: Not cost as in $$, but cost as in time… At the beginning of the semester you’ll need to make sure everyone gets into the team and sets up Slack on their devices. We did this during our first meeting, which was a lot more efficient than emailing reminders.
  • It takes time for people to break their emailing habits: For the first couple weeks TAs would still email me their questions instead of using Slack. It took some time and nudging, but eventually everyone shifted all course related communication to Slack.

If you’re teaching a course with TAs this semester, especially a large one with many people to manage, I strongly recommend giving Slack a try.

A timely first day of class example for Fall 2016: Trump Tweets

On the first day of an intro stats or intro data science course I enjoy giving some accessible real data examples, instead of spending the whole time going over the syllabus (which is necessary in my opinion, but somewhat boring nonetheless).

One of my favorite examples is How to Tell Someone’s Age When All You Know Is Her Name from FiveThirtyEight. As an added bonus, you can use this example to get to know some students’ names. I usually go through a few of the visualizations in this article, asking students to raise their hands if their name appears in the visualization. Sometimes I also supplement this with the Baby Name Voyager; it’s fun to have students offer up their names so we can take a look at how their popularity has changed over the years.


Another example I like is the Locals and Tourists Flickr Photos. If I remember correctly, I first saw this example in Mark Hanson’s class in grad school. These maps use data from geotags on Flickr: blue pictures are taken by locals, red pictures are by tourists, and yellow pictures might be by either. The map of Manhattan is one most students will recognize, and many know where Times Square and Central Park are, both of which show an abundance of red (tourist) pictures. And if your students watch enough Law & Order, they might also know where Rikers Island is and recognize that, unsurprisingly, no pictures are posted from that location.

However if I were teaching a class this coming Fall, I would add the following analysis of Donald Trump’s tweets to my list of examples. If you have not yet seen this analysis by David Robinson, I recommend you stop what you’re doing now and go read it. It’s linked below:

Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half

I’m not going to reiterate the post here, but the gist of it is that the @realDonaldTrump account tweets from two different phones, and that

the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.

Source: http://varianceexplained.org/r/trump-tweets/

I think this post would be a fantastic and timely first day of class example for a stats / data analysis / data science course. It shows a pretty easy-to-follow analysis, complete with the R code to reproduce it. It uses some sentiment analysis techniques that may not be the focus of an intro course, but since the context will be familiar to students, it shouldn’t be too confusing for them. It also features techniques one will likely cover in an intro course, like confidence intervals.
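If you want to preview the sentiment analysis piece before class, its core can be sketched with tidytext along the lines below. This is a rough sketch, not the post’s exact code; it assumes the tweets live in a data frame with a text column and a source (device) column:

```r
library(dplyr)
library(tidytext)

# Tokenize tweets into words, attach NRC sentiment labels, and compare
# the sentiment mix of tweets coming from each device
tweets_df %>%
  select(source, text) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(source, sentiment) %>%
  group_by(source) %>%
  mutate(share = n / sum(n)) %>%
  arrange(sentiment, source)
```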

As a bonus, many popular media outlets have covered the analysis in the last few days (e.g. see here, here, and here), and some of those articles might be easier for students to start with before delving into the analysis in the blog post. Personally, I would start by playing this clip from the CTV News Channel featuring an interview with David to provide the context first (a video always helps wake students up), and then move on to discussing some of the visualizations from the blog post.