StatPREP Workshops

This last weekend I helped Danny Kaplan and Kathryn Kozak (Coconino Community College) put on a StatPREP workshop. We were also joined by Amelia McNamara (Smith College) and Joe Roith (St. Catherine University). The idea behind StatPREP is to work directly with college-level instructors, through online and community-based workshops, to develop the understanding and skills needed to work and teach with modern data.

Danny Kaplan ponders at #StatPREP

One of the most interesting aspects of these workshops was the set of tutorials and exercises that the participants worked on. These were built with the R package learnr, which allows people to create interactive tutorials via RMarkdown. These tutorials can incorporate code chunks that run directly in the browser (when the tutorial is hosted on an appropriate server) and Shiny apps, and they can also include exercises and quiz questions.

An example of a code chunk from the learnr package.
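For those who have not used learnr, here is a rough sketch of what the source of such a tutorial looks like. This is my own generic illustration, not one of the workshop tutorials: a tutorial is an RMarkdown file whose header sets output: learnr::tutorial and runtime: shiny_prerendered, exercise chunks are ordinary code chunks with the option exercise = TRUE, and quiz questions are built with learnr's quiz() and question() functions.

```r
# A minimal learnr quiz question (generic illustration, not one of the
# workshop tutorials). This code lives inside a chunk of a learnr
# RMarkdown tutorial file.
library(learnr)

quiz(
  question("Which dplyr verb picks out rows of a data frame?",
    answer("filter()", correct = TRUE),
    answer("select()", message = "select() picks out columns."),
    answer("mutate()")
  )
)

# An exercise is an ordinary chunk with the option exercise = TRUE, e.g.
# {r first-summary, exercise = TRUE}, whose starter code participants can
# edit and run directly in the browser.
```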

Within these tutorials, participants were introduced to data wrangling (via dplyr), data visualization (via ggformula), and data summarization and simulation-based inference (via functions from Project MOSAIC's mosaic package). You can see and try some of the tutorials from the workshop here. Participants, in breakout groups, also envisioned a tutorial and, with the help of the workshop presenters, turned that into the skeleton for a tutorial (some things we got working and others are just outlines…we only had a couple of hours).
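To give a flavor of the style of code taught in these tutorials, here is a small sketch in that spirit. It is my own illustration using the built-in mtcars data, not an excerpt from the workshop materials:

```r
# A sketch of the formula-driven style the workshop packages encourage,
# using built-in data rather than the workshop's datasets.
library(dplyr)
library(ggformula)  # gf_*() plotting functions
library(mosaic)     # formula-based summaries and resampling

# Data wrangling with dplyr
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# Data visualization with ggformula
gf_point(mpg ~ wt, data = mtcars)

# Simulation-based inference with mosaic: bootstrap the mean of mpg
boot <- do(1000) * mean(~mpg, data = resample(mtcars))
confint(boot)
```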

You can read more about the StatPREP workshops and opportunities here.

USCOTS 2017

Citizen Statistician’s very own Mine Çetinkaya-Rundel gave one of the keynote addresses at USCOTS 2017.

The abstract for her talk, Teaching Data Science and Statistical Computation to Undergraduates, is given below.

What draws students to statistics? For some, the answer is mathematics, and for those a course in probability theory might be an attractive entry point. For others, their first exposure to statistics might be an applied introductory statistics course that focuses on methodology. This talk presents an alternative focus for a gateway to statistics: an introductory data science course focusing on data wrangling, exploratory data analysis, data visualization, and effective communication and approaching statistics from a model-based, instead of an inference-based, perspective. A heavy emphasis is placed on best practices for statistical computation, such as reproducibility and collaborative computing through literate programming and version control. I will discuss specific details of this course and how it fits into a modern undergraduate statistics curriculum as well as the success of the course in recruiting students to a statistics major.

You can view her slides at bit.ly/uscots2017.

Read elsewhere: Organizing DataFest the tidy way

Part of the reason why we have been somewhat silent at Citizen Statistician is that it’s DataFest season, and that means a few weeks (months?) of all-consuming organization followed by a weekend of super fun data immersion and exhaustion… Each year that I organize DataFest I tell myself “next year, I’ll do [blah] to make my life easier”. This year I finally did it! Read about how I’ve been streamlining the process of registrations, registration confirmations, and dissemination of information prior to the event in my post titled “Organizing DataFest the tidy way” on the R Views blog.

Stay tuned for an update on ASA DataFest 2017 once all 31 DataFests around the globe have concluded!

Theaster Gates, W.E.B. Du Bois, and Statistical Graphics

After reading this review of a Theaster Gates show at Regen Projects in L.A., I hurried to see the show before it closed. Inspired by sociologist and civil rights activist W.E.B. Du Bois, Gates created artistic interpretations of statistical graphics that Du Bois had produced for an exhibition in Paris in 1900. Coincidentally, I had just heard about these graphics the previous week at the Data Science Education Technology conference while eavesdropping on a conversation Andy Zieffler was having with someone else. What a pleasant surprise, then, when I learned, almost as soon as I got home, about this exhibit.

I’m no art critic (but I know what I like), and I found these works to be beautiful, simple, and powerful. What startled me, when I looked for the Du Bois originals, was how little Gates had changed the graphics. Here’s one work (I apologize for not knowing the title; that’s the difference between an occasional blogger and a journalist). It hints of Mondrian, and the geometry intrigues. Up close, the colors are rich and textured.

Here’s Du Bois’s circa-1900 mosaic-type plot (from http://www.openculture.com/2016/09/w-e-b-du-bois-creates-revolutionary-artistic-data-visualizations-showing-the-economic-plight-of-african-americans-1900.html, which provides a nice overview of the exhibit for which Du Bois created his innovative graphics).

The title is “Negro business men in the United States”. The large yellow square is “Grocers”, the blue square “Undertakers”, and the green square below it is “Publishers”. More are available at the Library of Congress.

Here’s another pair. The Gates version raised many questions for me. Why were the bars irregularly sized? What was the organizing principle behind the original? Were the categories sorted in increasing order, with Gates adding some irregularities for visual interest? What variables are on the axes?

The answer is no: Gates did not vary the lengths of the bars, only the colors.

The vertical axis displays dates, ranging from 1874 to 1899 (just one year before Du Bois put the graphics together from a wide variety of sources). The horizontal axis is acres of land, with values from 334,000 to 1.1 million.

Using data to support civil rights has a long history. A colleague once remarked that there is a great unwritten book behind the role that data and statistical analysis played (and continue to play) in the gay civil rights movement (and perhaps it has been written?). And the folks at We Quant LA have a nice article demonstrating some of the difficulties in using open data to ask questions about racial profiling by the LAPD. In this day and age of alternative facts and fake news, it’s wise to be careful and precise about what we can and cannot learn from data. And it is encouraging to see the role that art can play in keeping this dialogue alive.

Some Reading for the Winter Break

It has been a long while since I wrote anything for Citizen Statistician, so I thought I would put together a post about three books that I will be reading over break.

The first book is Cathy O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy [link to Amazon]. I am currently in the midst of Chapter 3. I heard about this book on an episode of 538’s podcast, What’s the Point?, on which O’Neil was featured [Who’s Accountable When An Algorithm Makes A Bad Decision?]. The premise of the book is one that has been on the minds of many people thinking about data science and algorithms in recent years (and probably not-so-recent years): many algorithms, and thus the predictions stemming from them, are not transparent. This lack of transparency leads to many ethical and, potentially, legal issues when algorithms are used to make decisions about recidivism, loan applications, college admissions, etc. I think this book could be the basis for a very interesting seminar. Let me know if anyone is working on something like this.

The second book I will be reading is Michael Lewis’ The Undoing Project: A Friendship That Changed Our Minds [link to Amazon]. This book is about the friendship, collaboration, and, ultimately, disentanglement between the renowned psychologists Daniel Kahneman and Amos Tversky. I learned about Kahneman and Tversky’s work early in my graduate career when Joan Garfield taught a doctoral research seminar on the seminal psychological work related to probabilistic thinking and statistics education. We read not only Kahneman and Tversky, but also Gerd Gigerenzer, Ruma Falk, Maya Bar-Hillel, Richard Nisbett, Efraim Fischbein, and others. Interestingly, What’s the Point? recently did two episodes on Lewis’ book as well: Michael Lewis’s New Book Examines How We Think About Thinking and Nate Silver Interviews Michael Lewis About His New Book, ‘The Undoing Project’.

The third book is Who’s #1?: The Science of Rating and Ranking [link to Amazon] by Amy Langville and Carl Meyer. I had read their earlier book, Google’s PageRank and Beyond: The Science of Search Engine Rankings, several years ago, and was quite impressed with the readability of the complex matrix algebra they presented. In Who’s #1?, the authors present the mathematics underlying several rating systems, including the Massey, Elo, Colley, and Keener systems. I am actually treating this book like a self-taught class, working out several of their examples using R and really trying to understand the ideas. My interest here is related to the work that I am doing with Brandon LeBeau (University of Iowa) and a current graduate student, Kyle Nickodem, on estimating the coaching ability of NCAA football coaches using a hierarchical IRT model [see slides from a talk here].
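As a taste of what is in the book, here is a toy version of the Elo update, written as my own sketch rather than taken from the text: each rating moves toward the game outcome by a step proportional to how surprising that outcome was.

```r
# Toy Elo rating update (my own sketch, not code from Who's #1).
# Expected score: probability that A beats B, given current ratings.
expected_score <- function(r_a, r_b) {
  1 / (1 + 10^((r_b - r_a) / 400))
}

# New ratings after a game; score_a is 1 if A wins, 0.5 for a tie,
# 0 if A loses, and k controls how fast ratings move.
elo_update <- function(r_a, r_b, score_a, k = 32) {
  e_a <- expected_score(r_a, r_b)
  c(a = r_a + k * (score_a - e_a),
    b = r_b + k * ((1 - score_a) - (1 - e_a)))
}

elo_update(1400, 1600, score_a = 1)  # upset win: ratings move a lot
elo_update(1600, 1400, score_a = 1)  # expected win: ratings barely move
```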

Measurement error in intro stats

I have recently been asked by my doctor to closely monitor my blood pressure and report it if it’s above a certain cutoff. Sometimes I end up reporting it by calling a nurse line, sometimes directly to a doctor in person. The reactions I get vary from “oh, it happens sometimes, just take it again in a bit” to “OMG the end of the world is coming!!!” (ok, I’m exaggerating, but you get the idea). This got me thinking: does the person I’m talking to understand measurement error? Which then got me thinking: I routinely teach intro stats courses that, for some students, will be the only stats course, and potentially the only quantitative reasoning course, they take in college. Do I discuss measurement error properly in this course? I’m afraid the answer is no… It’s certainly mentioned within the context of a few case studies, but I can’t say that it gets the emphasis it deserves. I also browsed through a few intro stats books (including mine!) and found not a single mention of “measurement error” specifically.
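Even a few lines of simulation make the point. Here is a sketch one could use in class; the “true” pressure and the size of the measurement error below are invented for illustration:

```r
# Ten simulated readings on a person whose true systolic pressure is
# 135 mmHg; the error sd of 5 mmHg is an invented, illustrative value.
set.seed(42)
readings <- 135 + rnorm(10, mean = 0, sd = 5)
round(readings)

# Even though the true value is below a cutoff like 140, individual
# readings can easily land above it:
mean(readings > 140)
```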

I’m always hesitant to make statements like “we should teach this in intro stats” because I know most intro stats curricula are already pretty bloated, and it’s not feasible to cover everything in one course. But understanding measurement error seems so crucial for having meaningful conversations with one’s health providers, and for making better decisions (or staying calm) about one’s health, that I think it is indeed worth spending a little bit of time on.

Slack for managing course TAs

I meant to write this post last year when I was teaching a large course with lots of teaching assistants to manage, but, well, I was teaching a large course with lots of teaching assistants to manage, so I ran out of time…

There is nothing all that revolutionary here. People have been using Slack to manage teams for a while now. I’ve even come across some articles and posts on using Slack as a course discussion forum, so use of Slack in an educational setting is not all that new either. But I have not heard of people using Slack for organizing a course and managing TAs, so I figured it might be worthwhile to write about my experience.

TL;DR: A+, would do it again!

I’ll be honest, when I first found out about Slack, I wasn’t all that impressed. First, I kept thinking it was called Slacker, and I was like, “hey, I’m no slacker!” (I totally am…). Second, I initially thought one had to use Slack in the browser, and I accidentally kept closing the tab and hence missing messages. There is a Slack app that you can run on your computer or phone; it took me a while to realize that. Because of my rocky start with it, I didn’t think to use Slack in my teaching. I must credit my co-instructor, Anthea Monod, for the idea of using Slack for communicating with our TAs.

Between the two instructors we had 12 TAs to manage. We set up a Slack team for the course with channels like #labs, #problem_sets, #office_hours, #meetings, etc.

This setup worked really well for us for a variety of reasons:

  • Keep course-management emails out of my inbox: These really add up. At this point, any email I can keep out of my inbox is a win in my book!
  • Easily keep all TAs in the loop: Need to announce a typo in a solution key? Or give TAs a heads up about questions they might expect in office hours? I used to handle these by emailing them all, and either I’d miss one or two, or a TA responding to my email would forget to reply all (people never seem to reply all when they should, but they always do when they shouldn’t!).
  • Provide a space for TAs to easily communicate with each other: Our TAs used Slack to let the others know when they needed someone to cover office hours or a section, etc. It was nice to be able to alert all of them at once, and also for everyone to see when someone responded saying they were available to cover.
  • Keep a record of decisions made in an easily searchable space: Slack’s search is not great, but it’s better than my email’s for sure. Plus, since you’re searching only within that team’s communication, as opposed to through all your emails, it’s a lot easier to find what you’re looking for.
  • It’s fun: The #random channel was a place where people shared funny tidbits, cool blog posts, etc. I doubt the TAs would be emailing each other with these if this communication channel weren’t there. It made them act more like a community than they would have otherwise.
  • It’s free: At least for a reasonable amount of usage for a semester-long course.

Some words of advice if you decide to use Slack for managing your own course:

  • There is a start-up cost: Not cost as in $$, but cost as in time… At the beginning of the semester you’ll need to make sure everyone joins the team and sets up Slack on their devices. We did this during our first meeting; it was a lot more efficient than emailing reminders.
  • It takes time for people to break their emailing habits: For the first couple weeks TAs would still email me their questions instead of using Slack. It took some time and nudging, but eventually everyone shifted all course related communication to Slack.

If you’re teaching a course with TAs this semester, especially a large one with many people to manage, I strongly recommend giving Slack a try.

A timely first day of class example for Fall 2016: Trump Tweets

On the first day of an intro stats or intro data science course I enjoy giving some accessible real data examples, instead of spending the whole time going over the syllabus (which is necessary in my opinion, but somewhat boring nonetheless).

One of my favorite examples is How to Tell Someone’s Age When All You Know Is Her Name from FiveThirtyEight. As an added bonus, you can use this example to get to know some students’ names. I usually go through a few of the visualizations in this article, asking students to raise their hands if their name appears in the visualization. Sometimes I also supplement this with the Baby Name Voyager; it’s fun to have students offer up their names so we can take a look at how their popularity has changed over the years.
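If you want students to poke at the data behind these visualizations themselves, the SSA baby names data ships in the babynames R package. A quick sketch (my own code, not FiveThirtyEight’s):

```r
# Plot how a name's popularity has changed over time, using the babynames
# package (the same underlying SSA data the article draws on).
library(babynames)
library(dplyr)
library(ggplot2)

babynames %>%
  filter(name == "Amelia", sex == "F") %>%
  ggplot(aes(x = year, y = prop)) +
  geom_line() +
  labs(x = "Year", y = "Proportion of female births",
       title = "Popularity of the name Amelia over time")
```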

Another example I like is the Locals and Tourists Flickr Photos. If I remember correctly, I first saw this example in Mark Hansen’s class in grad school. These maps use data from geotags on Flickr: blue pictures are taken by locals, red pictures are by tourists, and yellow pictures might be by either. The map of Manhattan is one most students will recognize, since many people know where Times Square and Central Park are, both of which have an abundance of red (tourist) pictures. And if your students watch enough Law & Order, they might also know where Rikers Island is and recognize that, unsurprisingly, no pictures are posted from that location.

However, if I were teaching a class this coming fall, I would add the following analysis of Donald Trump’s tweets to my list of examples. If you have not yet seen this analysis by David Robinson, I recommend you stop what you’re doing now and go read it. It’s linked below:

Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half

I’m not going to reiterate the post here, but the gist of it is that the @realDonaldTrump account tweets from two different phones, and that

the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.

Source: http://varianceexplained.org/r/trump-tweets/

I think this post would be a fantastic and timely first-day-of-class example for a stats / data analysis / data science course. It shows a pretty easy-to-follow analysis, complete with the R code to reproduce it. It uses some sentiment analysis techniques that may not be the focus of an intro course, but since the context will be familiar to students, it shouldn’t be too confusing for them. It also features techniques one will likely cover in an intro course, like confidence intervals.
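If you want to give students a taste of the approach before sending them to the full post, a small piece of the pipeline is easy to sketch. This is not David’s actual code; it assumes a hypothetical data frame tweets with a source column (“Android” or “iPhone”) and a text column:

```r
# Compare the share of positive vs. negative words by tweet source, in
# the spirit of the post (not David Robinson's actual code). Assumes a
# data frame `tweets` with columns `source` and `text`.
library(dplyr)
library(tidytext)

tweets %>%
  unnest_tokens(word, text) %>%                        # one row per word
  anti_join(stop_words, by = "word") %>%               # drop stop words
  inner_join(get_sentiments("bing"), by = "word") %>%  # tag pos/neg words
  count(source, sentiment) %>%
  group_by(source) %>%
  mutate(share = n / sum(n))
```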

As a bonus, many popular media outlets have covered the analysis in the last few days (e.g., see here, here, and here), and some of those articles might be easier for students to digest before delving into the analysis in the blog post. Personally, I would start by playing this clip from the CTV News Channel featuring an interview with David to provide the context first (a video always helps wake students up), and then move on to discussing some of the visualizations from the blog post.

Michael Phelps’ hickies

Ok, they’re not hickies, but NPR referred to them as such, so I’m going with it… I’m talking about the cupping marks.

The NPR story can be heard (or read) here. There were two points made in this story that I think would be useful and fun to discuss in a stats course.

The first is the placebo effect. Oftentimes in intro stats courses the placebo effect is mentioned as something undesirable that must be controlled for. This is true, but in this case the “placebo effect from cupping could work to reduce pain with or without an underlying physical benefit”. While there isn’t sufficient scientific evidence for a positive physical effect of cupping, the placebo effect might be just enough to give an individual Olympian the edge to outperform others by a small margin.

This brings me to my second point: the individual effect on extreme cases vs. a statistically significant effect on a population parameter. I did a brief search on Google Scholar for studies on the effectiveness of cupping, and most use t-tests or ANOVAs to evaluate the effect on some average pain / severity-of-symptom score. If we can assume no adverse effect from cupping, might it still make sense for an individual to give the treatment a try, even if the treatment has not been shown to statistically significantly improve average pain? I think this would be an interesting, and timely, question to discuss in class when introducing a method like the t-test. Often in tests of significance on a mean, the variance of the treatment effect is viewed as a nuisance factor that is only useful for figuring out the variability of the sampling distribution of the mean, but in this case the variance of the treatment effect on individuals might also be of interest.
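To make the point concrete in class, one could simulate a treatment whose average effect is zero but whose effect varies a lot from person to person; all of the numbers below are invented, not taken from any cupping study:

```r
# Hypothetical individual treatment effects: no average benefit, but
# substantial person-to-person variability (all numbers invented).
set.seed(2016)
effect <- rnorm(30, mean = 0, sd = 4)  # change in pain score per athlete

t.test(effect)      # tests the mean effect; likely non-significant
mean(effect <= -4)  # yet a nontrivial share of individuals improve a lot
```

The t-test answers a question about the average; an individual athlete weighing a harmless treatment might care more about that second number.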

While my brief search didn’t result in any datasets on cupping, the following articles contain some summary statistics, or citations to studies that report such statistics, that one could bring into the classroom:

PS: I wanted to include a picture of these cupping marks on Michael Phelps, but I couldn’t easily find an image that was free to use or share. You can see a picture here.

PPS: Holy small sample sizes in some of the studies I came across!

JSM 2016 session on “Doing more with data”

The ASA’s most recent curriculum guidelines emphasize the increasing importance of data science, real applications, model diversity, and communication / teamwork in undergraduate education. In an effort to highlight recent efforts inspired by these guidelines, I organized a JSM session titled Doing more with data in and outside the undergraduate classroom. This session featured talks on recent curricular and extra-curricular efforts in this vein, with a particular emphasis on challenging students with real and complex data and data analysis. The speakers discussed how these pedagogical innovations aim to educate and engage the next generation, and help them acquire the statistical and data science skills necessary to succeed in a future of ever-increasing data. I’m posting the slides from this session for those who missed it as well as for those who want to review the resources linked in the slides.

Computational Thinking and Statistical Thinking: Foundations of Data Science

by Ani Adhikari and Michael I. Jordan, University of California at Berkeley

Learning Communities: An Emerging Platform for Research in Statistics

by Mark Daniel Ward, Purdue University

The ASA DataFest: Learning by Doing

by Robert Gould, University of California at Los Angeles

(See http://www.amstat.org/education/datafest/ if you’re interested in organizing an ASA DataFest at your institution.)

Statistical Computing as an Introduction to Data Science

by Colin Rundel, Duke University [GitHub]