Some Reading for the Winter Break

It has been a long while since I wrote anything for Citizen Statistician, so I thought I would scribe a post about three books that I will be reading over break.







The first book is Cathy O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy [link to Amazon]. I am currently in the midst of Chapter 3. I heard about this book on an episode of 538’s podcast, What’s the Point?, on which O’Neil was featured [Who’s Accountable When An Algorithm Makes A Bad Decision?]. The premise of the book is something that has been on the minds of many people thinking about data science and algorithms in recent years (and probably not-so-recent years): many algorithms, and thus the predictions stemming from them, are not transparent. This leads to many ethical and, potentially, legal issues when algorithms are then used to make decisions about recidivism, loan applications, college admissions, etc. I think this book could be the basis for a very interesting seminar. Let me know if anyone is working on something like this.

The second book I will be reading is Michael Lewis’ The Undoing Project: A Friendship That Changed Our Minds [link to Amazon]. This book is about the friendship, collaboration, and, ultimately, disentanglement between the renowned psychologists Daniel Kahneman and Amos Tversky. I learned about Kahneman and Tversky’s work early in my graduate career when Joan Garfield taught a doctoral research seminar on the seminal psychological work related to probabilistic thinking and statistics education. We read not only Kahneman and Tversky, but also Gerd Gigerenzer, Ruma Falk, Maya Bar-Hillel, Richard Nisbett, Efraim Fischbein, and others. Interestingly, What’s the Point? recently did two episodes on Lewis’ book as well: Michael Lewis’s New Book Examines How We Think About Thinking and Nate Silver Interviews Michael Lewis About His New Book, ‘The Undoing Project’.

The third book is Who’s #1?: The Science of Rating and Ranking [link to Amazon] by Amy Langville and Carl Meyer. I had read their earlier book, Google’s PageRank and Beyond: The Science of Search Engine Rankings, several years ago, and was quite impressed with the readability of the complex matrix algebra they presented. In Who’s #1?, the authors present the mathematics underlying several rating systems, including the Massey system, Elo, Colley, Keener, etc. I am actually treating this book like a self-taught class, working out several of their examples using R, and really trying to understand the ideas. My interest here is related to the work that I am doing with Brandon LeBeau (University of Iowa) and a current graduate student, Kyle Nickodem, on estimating the coaching ability of NCAA football coaches using a hierarchical IRT model [see slides from a talk here].
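To give a flavor of what working through a rating system in R can look like, here is a minimal sketch of an Elo-style update. It uses the common chess convention (base-10 logistic curve, 400-point scale, K = 32), which is an assumption on my part and may differ from the variant presented in the book. The function names and the two starting ratings are made up for illustration.

```r
# Expected score for a team rated r_a against a team rated r_b
# (logistic curve on a base-10, 400-point scale; assumed convention).
elo_expected <- function(r_a, r_b) {
  1 / (1 + 10^((r_b - r_a) / 400))
}

# Update both ratings after one game.
# score_a is 1 if A won, 0.5 for a tie, 0 if A lost; k sets the step size.
elo_update <- function(r_a, r_b, score_a, k = 32) {
  e_a <- elo_expected(r_a, r_b)
  delta <- k * (score_a - e_a)
  c(a = r_a + delta, b = r_b - delta)
}

# Two hypothetical teams that start level; A wins.
# Expected score is 0.5, so A gains 16 points and B loses 16.
elo_update(1500, 1500, score_a = 1)
```

Because the winner's gain equals the loser's loss, the total rating in the pool stays constant, which is one of the design choices that distinguishes Elo from least-squares systems like Massey.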

Measurement error in intro stats

I have recently been asked by my doctor to closely monitor my blood pressure, and report it if it’s above a certain cutoff. Sometimes I end up reporting it by calling a nurse line, sometimes directly to a doctor in person. The reactions I get vary from “oh, it happens sometimes, just take it again in a bit” to “OMG the end of the world is coming!!!” (ok, I’m exaggerating, but you get the idea). This got me thinking: does the person I’m talking to understand measurement error? Which then got me thinking: I routinely teach intro stats courses that for some students are the only stats, and potentially the only quantitative reasoning, courses they might take in college. Do I discuss measurement error properly in these courses? I’m afraid the answer is no… It’s certainly mentioned within the context of a few case studies, but I can’t say that it gets the emphasis it deserves. I also browsed through a few intro stats books (including mine!) and found no mention of “measurement error” specifically.

I’m always hesitant to make statements like “we should teach this in intro stats” because I know most intro stats curricula are already pretty bloated, and it’s not feasible to cover everything in one course. But this seems to be such a crucial concept for someone to understand in order to be able to have meaningful conversations with their health providers and make better decisions (or stay calm) about their health that I think it is indeed worth spending a little bit of time on.
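A short simulation is one way this could be brought into an intro course. The sketch below is only illustrative: the true pressure, the size of the measurement error, and the cutoff are all made-up numbers, and it assumes errors are independent and normally distributed, which real blood pressure readings need not be.

```r
set.seed(1)

true_bp  <- 128  # hypothetical "true" systolic pressure (mmHg)
error_sd <- 8    # assumed SD of reading-to-reading error (mmHg)
cutoff   <- 135  # hypothetical reporting cutoff (mmHg)

# 10,000 simulated single readings
single <- rnorm(10000, mean = true_bp, sd = error_sd)

# 10,000 simulated averages of three readings each
avg3 <- rowMeans(matrix(rnorm(30000, mean = true_bp, sd = error_sd), ncol = 3))

# Share of results above the cutoff even though the true value is below it
mean(single > cutoff)
mean(avg3 > cutoff)
```

Under these assumptions a single reading crosses the cutoff noticeably more often than an average of three does, which is the statistical version of “just take it again in a bit”.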

Statistics with R on Coursera

I held off on posting about this until we had all the courses ready, and we still have a bit more work to do on the last component, but I’m proud to announce that the specialization called Statistics with R is now on Coursera!

Some of you might know that I’ve had a course on Coursera for a while now (whatever “a while” means in MOOC-land), but it was time to refresh things a bit to align the course with other Coursera offerings (shorter, modular, etc.). So I chopped up the old course into bite-size chunks and made some enhancements in each component, such as

  • integrating dplyr and ggplot2 syntax into the R labs,
  • restructuring the labs to be completed in R Markdown to provide better scaffolding for a data analysis project for each course,
  • adding Shiny apps to some of the labs to better demonstrate statistical concepts without burdening the learners with coding beyond the level of the course,
  • creating an R package that contains all the data, custom functions, etc. used in the course, and
  • cleaning things up a bit to make the weekly workload consistent across weeks.

The underlying code for the labs and the package can be found at [link]. There you can also find the R code for reproducing some of the figures and analyses shown on the course slides (and we’ll keep adding to that repo in the next few weeks).

The biggest change between the old course and the new specialization though is a completely new course: Bayesian Statistics. I touched on Bayesian inference a bit in my old course, and this generated lots of discussion on the course forums from learners wanting more on this content. Being at Duke, I figured who better to offer this course but us! (If you know anything about the Statistical Science department at Duke, you probably know it’s pretty Bayesian.) Note, I didn’t say “me”, I said “us”. I was able to convince a few colleagues (David Banks, Merlise Clyde, and Colin Rundel) to join me in developing this course, and I’m glad I did! Figuring out exactly how to teach this content in an effective way without assuming too much mathematical background took lots of thinking (and re-thinking, and re-thinking). We have also managed to feature a few interviews with researchers in academia and industry, such as Jim Berger (Duke), David Dunson (Duke), Amy Herring (UNC), and Steve Scott (Google) to provide a bit more context for learners on where and why Bayesian statistics is relevant. This course launched today, and I’m looking forward to seeing the feedback from the learners.

If you’re interested in the specialization, you can find out more about it here. The courses in the specialization are:

  1. Introduction to Probability and Data
  2. Inferential Statistics
  3. Linear Regression and Modeling
  4. Bayesian Statistics
  5. Statistics Capstone Project

You can take the courses individually or sign up for the whole specialization, but to do the capstone you need to have completed the 4 courses in the specialization. The landing page for the specialization outlines in further detail how to navigate everything, and relevant dates and deadlines.

Also note that while the graded components of the courses, which allow you to pursue a certificate, require payment, you can audit the courses for free and watch videos, complete practice quizzes, and work on the labs.

Project TIER

Last year I was awarded a Project TIER (Teaching Integrity in Empirical Research) fellowship, and last week my work on the fellowship wrapped up with a meeting with the project leads, other fellows from last year, as well as new fellows for the next year. In a nutshell Project TIER focuses on reproducibility. Here is a brief summary of the project’s focus from their website:

For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

As part of the fellowship, beyond continuing to work on integrating reproducible data analysis practices into my courses with the use of literate programming via R Markdown and version control via git/GitHub, I have also created two template GitHub repositories that follow the Project TIER guidelines: one for use with R and the other with Stata. They both live under the Project TIER organization on GitHub. The idea is that anyone wishing to follow the folder structure and workflow suggested by Project TIER can make a copy of these repositories and easily organize their work following the TIER guidelines.

There is more work to be done on these of course, first of which is evolving the TIER guidelines themselves to line up better with working with git and R as well as working with tricky data (like large data, or private data, etc.). Some of these are issues the new fellows might tackle in the next year.

As part of the fellowship I also taught a workshop titled “Making your research reproducible with Project TIER, R, and GitHub” to Economics graduate students at Duke. These are students who primarily use Stata so the workshop was a first introduction to this workflow, using the RStudio interface for git and GitHub. Materials for this workshop can be found here. At the end of the workshop I got the sense that very few of these students were interested in making the switch over to R (can’t blame them honestly — if you’ve been working on your dissertation for years and you just want to wrap it up, the last thing you want to do is to have to rewrite all your code and redo your analysis in a different platform) but quite a few of them were interested in using GitHub for both version control and for showcasing their work publicly.

Also as part of the fellowship Ben Baumer (a fellow fellow?) and I have organized a session on reproducibility at JSM 2016 that I am very much looking forward to. See here for the line up.

In summary, being involved with this project was a great eye opener to the fact that there are researchers and educators out there who truly care about issues surrounding reproducibility of data analysis but who are very unlikely to switch over to R because that is not as customary for their discipline (although at least one fellow did after watching my demo on R Markdown in the 2015 meeting, which was nice to see 😁). Discussions around working with Stata made me once again very thankful for R Markdown and RStudio, which make literate programming a breeze in R. And what I mean by “a breeze” is “easy to teach to and be adopted by anyone from a novice to expert R user”. It seems to me like it would be in the interest of companies like Stata to implement such a workflow/interface to support reproducibility efforts of researchers and educators using their software. I can’t see a single reason why they wouldn’t invest time (and yes, money) in developing this.

During these discussions a package called RStata also came up. This package is “[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

Overall my involvement with the fellowship was a great experience for meeting and brainstorming with faculty from non-stats disciplines (mostly from the social sciences) who regularly teach in platforms like Stata and SPSS who are also dedicated to teaching reproducible data analysis practices. I’m often the person who tries to encourage people to switch over to R, and I don’t think I’ll be stopping doing that anytime soon, but I do believe that if we want all who do data analysis to do it reproducibly, efforts must be made to (1) come up with workflows that ensure reproducibility in statistical software other than R, and (2) create tools that make reproducible data analysis easier in such software (e.g. tools similar to R Markdown designed specifically for these software).


PS: It’s been a while since I last posted here, let’s blame it on a hectic academic year. I started and never got around to finishing two posts in the past few months that I hope to finish and publish soon. One is about using R Markdown for generating course/TA evaluation reports and the other is on using Slack for managing TAs for a large course. Stay tuned.

PPS: Super excited for #useR2016 starting on Monday. The lack of axe-throwing will be disappointing (those who attended useR 2015 in Denmark know what I’m talking about) but otherwise the schedule promises a great line up!

Tools for Managing Your Inbox

First of all, happy new year to all of our readers.

As my first contribution in 2016, I thought I would share with you a couple of tools that have helped tame my email inbox. In my continued resolution to finally achieve Inbox Zero, I have made a major dent in the last month. This is thanks to two tools: Unroll Me and Google Mail’s “Send & Archive” button.

Unroll Me

The first tool I would like to share with you is an app called Unroll Me. This app makes unsubscribing from email services or adding them to a once-a-day digest super easy…no more clicking “unsubscribe” in each individual email. After you sign up, Unroll Me examines your inbox for different email subscriptions. You then have three choices for each subscription it finds:

  1. Unsubscribe
  2. Add to Rollup
  3. Keep in Inbox

The first and third are self-explanatory. The second option adds all selected subscriptions to a digest-like email that comes once-per-day. Here is an example of my rollup:

[Screenshot: example rollup]

Unroll Me keeps a history of your rollups for easy reference, and adds new subscriptions that it finds for you to manage. You can also change the preference for any of your subscriptions at any time. Also, since Unroll Me keeps a list of emails that you have unsubscribed from, it is easy to re-subscribe at any point.

This tool has changed my inbox. Digest-like emails are great for reducing inbox clutter (it is the email equivalent of putting household odds-and-ends into a nice wicker basket). Unfortunately, many places that should have an option for this simply do not. For example, the University of Minnesota (my place of employment) has somehow auto-subscribed me to a million email lists. In general, I am not interested in about 90% of what they send, and another 9% are related to things I don’t need to see immediately. Unroll Me has allowed me to keep the email subscriptions I want to see immediately in my inbox, put those that are less important in a rollup, and eliminate the emails I couldn’t care less about completely from my sight.

Google Mail’s “Send & Archive” Button

This is a Google Mail option I learned about on Lifehacker. Why it is not a default button, I do not know. Go to Google Mail’s Settings, and under the General Setting, click the option button labelled “Show ‘Send & Archive’ button in reply”.

This will add a button to any email you reply to that allows you to send the email and archive the email message you just replied to, essentially combining two steps in one.

You also still have the send button if you want to keep the original email in your inbox.

I hope these tools might also help you. If you have further suggestions, put them in the comments to share.


PDF and Citation Management

A new academic year looms. This means a new crop of graduate students will begin their academic training. PDF management is a critical tool that all graduate students need to use and the sooner the better. Often these tools go hand-in-hand with a citation management system, which is also critical for graduate students.

Using citation management software makes scholarly work easier and more effective. First and foremost, these tools allow you to automatically cite references for a paper in a wide range of bibliographic styles. They also allow you to organize, evaluate, annotate, and search within your citation collection and share your references with others. Often they also sync across machines and devices, allowing you to access your database wherever you are.

There are several tools available for PDF/citation management, including BibDesk, Mendeley, Papers, Sente, and Zotero.

Some of these are citation managers only (BibDesk). Many also allow you to manage your PDF files: naming, organizing, and moving your files to a central repository on your computer. Some allow for annotation within the software as well. There are several online comparisons of some of the different systems (e.g., Penn Libraries, UW Madison Library, etc.). From my experience, students tend to choose either Mendeley or Zotero; my guess is because they are free.

There is a lot to be said for free software, and both Zotero and Mendeley seem pretty solid. However, as a graduate student you should understand that you are investing in your future. This type of tool, I think it is fair to say, you will be using daily. Spending money on a tool that has the features and UI that you will want to use is perfectly ok and should even be encouraged.

Another consideration for students who are beginning the process is to find out what your advisor(s) and research groups use. Although many tools are cross-compatible, learning and using a tool is easier with a group helping you.

What Do I Use?

I use Papers. It is not free (a student license is ~$50). When I started using Papers, Mendeley and Zotero were not available. I have since used both Mendeley and Zotero for a while, but ultimately made the decision this summer to switch back to Papers. It is faster and, more importantly to me, has better search functionality, both across and within papers.

I would like to use Sente (free for up to 100 references), but the search function is very limited. In my opinion, Sente has the best UI: it is sleek and minimalist, and reading a paper in it is a nice experience.

My Recommendation…

Ultimately, use what you are comfortable with and then, actually use it. Take the time to enter ALL the meta-data for PDFs as you accumulate them. Don’t imagine you will have time to do it later…you won’t. Being organized with your references from the start will keep you more productive later.


Very brief first day of class activity in R

A new academic year has started for most of us. I try to do a range of activities on the first day of my introductory statistics course, and one of them is an incredibly brief activity to just show students what R is and what the RStudio window looks like. Here it is:

Generate a random number between 1 and 5, and introduce yourself to that many people sitting around you:

sample(1:5, size = 1)

It’s a good opportunity to have students access RStudio once, talk about random sampling, and break up the class session and have them introduce themselves to their fellow classmates. I usually do the activity too, and use it as an opportunity to personally introduce myself to a few students and to meet them.

If you’re interested in everything else I’m doing in my introductory statistics course, you can find the course materials for this semester at [link] and the source code for all publicly available materials like slides, labs, etc. at [link]. Both of these will be updated throughout the semester. Feel free to grab whatever you find useful.

Interpreting Cause and Effect

One big challenge we all face is understanding what’s good and what’s bad for us. And it’s harder when published research studies conflict. So thanks to Roger Peng for posting on his Facebook page an article that led me to this article by Emily Oster: Cellphones Do Not Give You Brain Cancer, from the good folks at the 538 blog. I think this article would make a great classroom discussion, particularly if, before you show your students the article, they brainstorm several possible experimental designs and discuss the strengths and weaknesses of each. I think it is also interesting to ask why no study similar to the Danish Cohort study was done in the US. Thinking about this might lead students to think about cultural attitudes towards wide-spread data collection.

Fitbit Revisited

Many moons ago we wrote about a bit of a kludge to get data from a Fitbit (see here). Now it looks as though there is a much better way. Cory Nissen has written an R package to scrape Fitbit data and posted it on GitHub. He also wrote a post on his blog, Stats and Things, announcing the package and demonstrating its use. While I haven’t tried it yet, it looks pretty straightforward and much easier than anything else I have seen to date.

Model Eliciting Activity: Prologue

I’m very excited/curious about tomorrow: I’m going to lead about 40 math and science teachers in a data-analysis activity, using one of the Model Eliciting Activities from the University of Minnesota Catalysts for Change Project. (One of our bloggers, Andy, was part of this project.) Specifically, we’re giving them the arrival-delay times for five different airlines into Chicago O’Hare (a random sample of 10 from each airline) and asking them to come up with rules for ranking the airlines from best to worst.

I’m curious to see what they come up with, particularly whether the math teachers differ terribly from the science teachers. The math teachers are further along in our weekend professional development program than are the science teachers, and so I’m hoping they’ll identify the key characteristics of a distribution (all together: center, spread, shape; well, shape doesn’t play much of a role here) and use these to formulate their rankings. We’ve worked hard on helping them see distributions as a unit, and not a collection of individual points, and have seen big improvements in the teachers, most of whom have not taught statistics before.
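For readers who want to try something similar, here is a rough R sketch of one ranking rule built on center and spread. The delays are simulated, not the activity’s data, and the airline labels, distribution parameters, and tie-breaking rule are all invented for illustration; the teachers may well come up with something quite different.

```r
set.seed(2)

# Simulated arrival delays in minutes (negative = early), 10 flights per airline.
# Airline labels and distribution parameters are made up for illustration.
airlines <- c("A", "B", "C", "D", "E")
delays <- data.frame(
  airline = rep(airlines, each = 10),
  delay = round(rnorm(50,
                      mean = rep(c(5, 10, 0, 15, 8), each = 10),
                      sd = rep(c(10, 5, 20, 8, 12), each = 10)))
)

# Summarize each airline's distribution by center (median) and spread (IQR).
summary_tbl <- data.frame(
  airline = airlines,
  median = as.numeric(tapply(delays$delay, delays$airline, median)[airlines]),
  iqr = as.numeric(tapply(delays$delay, delays$airline, IQR)[airlines])
)

# One possible rule: smaller median delay is better; break ties with smaller IQR.
ranked <- summary_tbl[order(summary_tbl$median, summary_tbl$iqr), ]
ranked
```

A rule like this treats each airline’s sample as a whole distribution rather than comparing flights point by point, and it invites the obvious follow-up question of how much weight spread should get relative to center.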

The science teachers, I suspect, will be a little bit more deterministic in their reasoning, and, if true to my naive stereotype of science teachers, will try to find explanations for individual points. Since I haven’t worked as much with the science teachers, I’m curious to see if they’ll see the distribution as a whole, or instead try to do point-by-point comparisons.

When we initially started this project, we had some informal ideas that the science teachers would take more naturally to data analysis than would the math teachers. This hasn’t turned out to be entirely true. Many of the math teachers had taught statistics before, and so had some experience. Those who hadn’t, though, tended to be rather procedurally oriented. For example, they often just automatically dropped outliers from their analysis without any thought at all, just because they thought that that was the rule. (This has been a very hard habit to break.)

The math teachers also had a very rigid view of what was and was not data. The science teachers, on the other hand, had a much more flexible view of data. In a discussion about whether photos from a smart phone were data, a majority of math teachers said no and a majority of science teachers said yes. On the other hand, the science teachers tend to use data to confirm what they already know to be true, rather than use it to discover something. This isn’t such a problem with the math teachers, in part because they don’t have preconceptions of the data and so have nothing to confirm. In fact, we’ve worked hard with the math teachers, and with the science teachers, to help them approach a data set with questions in mind. But it’s been a challenge teaching them to phrase questions for their students in which the answers aren’t pre-determined or obvious, and which are empirically oriented. (For example: We would like them to ask something like “what activities most often led to our throwing away recycling into the trash bin?” rather than “Is it wrong to throw trash into the recycling bin?” or “Do people throw trash into the recycling bin?”)

So I’ll report back soon on what happened and how it went.