Theaster Gates, W.E.B. Du Bois, and Statistical Graphics

After reading this review of a Theaster Gates show at Regen Projects, in L.A., I hurried to see the show before it closed. Inspired by sociologist and civil rights activist W.E.B. Du Bois, Gates created artistic interpretations of statistical graphics that Du Bois had produced for an exhibition in Paris in 1900. Coincidentally, I had heard about these graphics just the previous week at the Data Science Education Technology conference, while eavesdropping on a conversation Andy Zieffler was having with someone else. What a pleasant surprise, then, to learn about this exhibit almost as soon as I got home.

I’m no art critic (but I know what I like), and I found these works to be beautiful, simple, and powerful. What startled me, when I looked for the Du Bois originals, was how little Gates had changed the graphics. Here’s one work (I apologize for not knowing the title; that’s the difference between an occasional blogger and a journalist). It hints at Mondrian, and the geometry intrigues. Up close, the colors are rich and textured.

Here’s Du Bois’s circa-1900 mosaic-type plot (from http://www.openculture.com/2016/09/w-e-b-du-bois-creates-revolutionary-artistic-data-visualizations-showing-the-economic-plight-of-african-americans-1900.html, which provides a nice overview of the exhibit for which Du Bois created his innovative graphics).

The title is “Negro business men in the United States”. The large yellow square is “Grocers”, the blue square “Undertakers”, and the green square below it is “Publishers”. More are available at the Library of Congress.

Here’s another pair. The Gates version raised many questions for me. Why were the bars irregularly sized? What was the organizing principle behind the original? Were the categories sorted in increasing order, with Gates adding some irregularities for visual interest? What variables are on the axes?

The answer is no: Gates did not vary the lengths of the bars, only the colors.

The vertical axis displays dates, ranging from 1874 to 1899 (just one year before Du Bois put the graphics together from a wide variety of sources). The horizontal axis is acres of land, with values ranging from 334,000 to 1.1 million.

Using data to support civil rights has a long history. A colleague once remarked that there is a great unwritten book about the role that data and statistical analysis played (and continue to play) in the gay civil rights movement (and perhaps it has been written?). And the folks at We Quant LA have a nice article demonstrating some of the difficulties in using open data to ask questions about racial profiling by the LAPD. In this day and age of alternative facts and fake news, it’s wise to be careful and precise about what we can and cannot learn from data. And it is encouraging to see the role that art can play in keeping this dialogue alive.

A timely first day of class example for Fall 2016: Trump Tweets

On the first day of an intro stats or intro data science course I enjoy giving some accessible real data examples, instead of spending the whole time going over the syllabus (which is necessary in my opinion, but somewhat boring nonetheless).

One of my favorite examples is How to Tell Someone’s Age When All You Know Is Her Name from FiveThirtyEight. As an added bonus, you can use this example to get to know some students’ names. I usually go through a few of the visualizations in this article, asking students to raise their hands if their name appears in the visualization. Sometimes I also supplement this with the Baby Name Voyager; it’s fun to have students offer up their names so we can take a look at how their popularity has changed over the years.


Another example I like is the Locals and Tourists Flickr Photos. If I remember correctly, I first saw this example in Mark Hansen‘s class in grad school. These maps use data from geotags on Flickr: blue pictures are taken by locals, red pictures by tourists, and yellow pictures might be by either. The map of Manhattan is one most students will recognize, since many people know where Times Square and Central Park are, both of which show an abundance of red (tourist) pictures. And if your students watch enough Law & Order to know where Rikers Island is, they might notice that, unsurprisingly, no pictures are posted from that location.

However, if I were teaching a class this coming Fall, I would add the following analysis of Donald Trump’s tweets to my list of examples. If you have not yet seen this analysis by David Robinson, I recommend you stop what you’re doing now and go read it. It’s linked below:

Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half

I’m not going to reiterate the post here, but the gist of it is that the @realDonaldTrump account tweets from two different phones, and that

the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.

Source: http://varianceexplained.org/r/trump-tweets/

I think this post would be a fantastic and timely first day of class example for a stats / data analysis / data science course. It presents a pretty easy-to-follow analysis, complete with the R code to reproduce it. It uses some sentiment analysis techniques that may not be the focus of an intro course, but since the context will be familiar to students it shouldn’t be too confusing for them. It also features techniques one will likely cover in an intro course, like confidence intervals.
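To give a flavor of the approach, the core of a by-phone sentiment comparison like the one in the post can be sketched with the tidytext package. This is only a minimal sketch: the two example tweets and the `tweets` data frame below are made up, and the real analysis works with thousands of tweets pulled from the Twitter API. (Running `get_sentiments("nrc")` may prompt a one-time download of the lexicon.)

```r
# Minimal sketch of a by-phone sentiment comparison with tidytext;
# the tweets below are made-up stand-ins for the real data.
library(dplyr)
library(tidytext)

tweets <- data.frame(
  source = c("Android", "iPhone"),
  text   = c("The failing press is crooked. Sad!",
             "Thank you for your support, America!"),
  stringsAsFactors = FALSE
)

tweet_words <- tweets %>%
  unnest_tokens(word, text) %>%        # one row per word
  anti_join(stop_words, by = "word")   # drop common stop words

# Tag words with NRC sentiment categories and count them by phone
tweet_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(source, sentiment)
```

In the actual post, counts like these feed into confidence intervals for the difference in sentiment-word rates between the two phones.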

As a bonus, many popular media outlets have covered the analysis in the last few days (e.g. see here, here, and here), and some of those articles might be even easier for students to begin with before delving into the analysis in the blog post. Personally, I would start by playing this clip from the CTV News Channel featuring an interview with David to provide the context first (a video always helps wake students up), and then move on to discussing some of the visualizations from the blog post.

JSM 2016 session on “Doing more with data”

The ASA’s most recent curriculum guidelines emphasize the increasing importance of data science, real applications, model diversity, and communication / teamwork in undergraduate education. In an effort to highlight recent efforts inspired by these guidelines, I organized a JSM session titled Doing more with data in and outside the undergraduate classroom. This session featured talks on recent curricular and extra-curricular efforts in this vein, with a particular emphasis on challenging students with real and complex data and data analysis. The speakers discussed how these pedagogical innovations aim to educate and engage the next generation, and help them acquire the statistical and data science skills necessary to succeed in a future of ever-increasing data. I’m posting the slides from this session for those who missed it as well as for those who want to review the resources linked in the slides.

Computational Thinking and Statistical Thinking: Foundations of Data Science

by Ani Adhikari and Michael I. Jordan, University of California at Berkeley

 

Learning Communities: An Emerging Platform for Research in Statistics

by Mark Daniel Ward, Purdue University

 

The ASA DataFest: Learning by Doing

by Robert Gould, University of California at Los Angeles

(See http://www.amstat.org/education/datafest/ if you’re interested in organizing an ASA DataFest at your institution.)

 

Statistical Computing as an Introduction to Data Science

by Colin Rundel, Duke University [GitHub]

JSM 2016 session on Reproducibility in Statistics and Data Science

Will reproducibility always be this hard? Ten years after Ioannidis alleged that most scientific findings are false, reproducibility — or lack thereof — has become a full-blown crisis in science. Flagship journals like Nature and Science have published hand-wringing editorials and revised their policies in the hopes of heightening standards of reproducibility. In the statistical and data sciences, the barriers to reproducibility are far lower, given that our analyses can usually be digitally encoded (e.g., scripts, algorithms, data files, etc.). Failure to ensure the credibility of our contributions will erode “the extraordinary power of statistics,” both among our colleagues and in our collaborations with scientists of all fields. This morning’s JSM session on Reproducibility in Statistics and Data Science featured talks on recent efforts in pursuit of reproducibility. The slides of talks by the speakers and the discussant are posted below.

Note that some links point to a GitHub repo that includes the slides as well as other useful resources for the talk and for adopting reproducible frameworks in your research and teaching. I’m also including Twitter handles for the speakers, which is likely the most efficient way of getting in touch with them if you have any questions.

This session was organized by Ben Baumer and myself as part of our Project TIER fellowship. Many thanks to Amelia McNamara, who is also a Project TIER fellow, for chairing the session (and correctly pronouncing my name)!

  • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets – Jenny Bryan – repo, including slides – @JennyBryan
  • Steps Toward Reproducible Research – Karl Broman – slides – @kwbroman
  • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! – Karthik Ram – slides – @_inundata
  • Integrating Reproducibility into the Undergraduate Statistics Curriculum – Mine Çetinkaya-Rundel – repo, including slides – @minebocek
  • Discussant: Yihui Xie – slides – @xieyihui

PS: Don’t miss this gem of a repo for links to many many more JSM 2016 slides. Thanks Karl for putting it together!

A two-hour introduction to data analysis in R

A few weeks ago I gave a two-hour Introduction to R workshop for the Master of Engineering Management students at Duke. The session was organized by the student-led Career Development and Alumni Relations committee within this program. The slides for the workshop can be found here and the source code is available on GitHub.

Why might this be of interest to you?

  • The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
  • The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against these datasets; in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
    • Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
    • Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not large enough to slow things down (which can especially be an issue for visualizing much larger data) but it’s large enough that manual wrangling of the data would be too much trouble.
    • Con: The visualizations do not really reveal very useful insights into the data. While this is not absolutely necessary for teaching syntax, it would have been a welcome cherry on top…
  • The raw dataset has a feature I love — it’s been damaged, most likely by being opened in Excel! One of the variables in the dataset is the age group of the biker (BikeAge_Gr). Here is the age distribution of bikers as it appears in the original data:
 
##    BikeAge_Gr crash_count
##    (chr)      (int)
## 1  0-5        60
## 2  10-Jun     421
## 3  15-Nov     747
## 4  16-19      605
## 5  20-24      680
## 6  25-29      430
## 7  30-39      658
## 8  40-49      920
## 9  50-59      739
## 10 60-69      274
## 11 70         12
## 12 70+        58

Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing anything more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.
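For what it’s worth, the fix itself is a line or two of text processing. Here’s a minimal sketch with a toy stand-in for the data frame; the column name BikeAge_Gr matches the output above.

```r
# Toy stand-in for the crash data; BikeAge_Gr as in the output above
bike <- data.frame(BikeAge_Gr = c("0-5", "10-Jun", "15-Nov", "16-19"),
                   stringsAsFactors = FALSE)

# Undo Excel's date "help": 10-Jun should be 6-10, 15-Nov should be 11-15
bike$BikeAge_Gr <- gsub("^10-Jun$", "6-10", bike$BikeAge_Gr)
bike$BikeAge_Gr <- gsub("^15-Nov$", "11-15", bike$BikeAge_Gr)

unique(bike$BikeAge_Gr)  # "0-5" "6-10" "11-15" "16-19"
```

The anchored patterns (`^...$`) make sure only whole values are replaced, which is a nice excuse to introduce a first bit of regular expression syntax.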

I should admit that I don’t have evidence of Excel causing this issue. However, this is my best guess, since “helping” the user by formatting date fields is standard Excel behaviour. There may be other software out there that also does this that I’m unaware of…

If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.

Data News: Fitbit + iHealth, and Open Justice data

The LA Times reported today, along with several other sources, that the California Department of Justice has initiated a new “open justice” data initiative. On their portal, the “Justice Dashboard”, you can view Arrest Rates, Deaths in Custody, or Law Enforcement Officers Killed or Assaulted.

I chose, for my first visit, to look at Deaths in Custody.  At first, I was disappointed with the quality of the data provided.  Instead of data, you see some nice graphical displays, mostly univariate but a few with two variables, addressing issues and questions that are probably on many people’s minds.  (Alarmingly, the second most common cause of death for people in custody is homicide by a law enforcement officer.)

However, if you scroll to the bottom, you’ll see that you can, in fact, download relatively raw data, in the form of a spreadsheet in which each row is a person in custody who died. Variables include date of birth and death, gender, race, custody status, offense, reporting agency, and many others. Altogether, there are 38 variables and over 15,000 observations. The data set comes with a nice codebook, too.

FitBit vs. the iPhone

On to a cheerier topic. This quarter I will be teaching regression, and once again my FitBit provided inspiration. If you teach regression, you know one of the awful secrets of statistics: there are no linear associations. Well, they are few and far between. And so I was pleased when a potentially linear association sprang to mind: how well do FitBit step counts predict the Health app counts?

The Health app is an iOS 8 app. It was automatically installed on your iPhone, whether you wanted it or not. (I speak from the perspective of an iPhone 6 user with iOS 8 installed.) Apparently, whether you know it or not, your steps are being counted. If you have an Apple Watch, you know about this. But if you don’t, it happens invisibly, until you open the app. Or buy the watch.

How can you access these data? I did so by downloading the free app QS (for “Quantified Self”). The Quantified Self people have a website directing you to hundreds of apps you can use to learn more about yourself than you probably should. Once the app is installed, you simply open it, choose which variables you wish to download, click ‘submit’, and a CSV file is emailed to you (or to whomever you wish).

The FitBit data can only be downloaded if you have a premium account. The FitBit premium website has a ‘custom option’ that allows you to download data for any time period you choose, but currently, due to an acknowledged bug, no matter which dates you select, only one month of data will be downloaded. Thus, you must download month by month. I downloaded only two months, July and August; at some point in August my FitBit went through the wash cycle, and then I misplaced it. It’s around here somewhere, I know. I just don’t know where. For these reasons, the data are somewhat sparse.

I won’t bore you with the details, but by applying functions from the lubridate package in R and using the gsub function to remove commas (because FitBit inexplicably inserts commas into its numbers and, I almost forgot, adds a superfluous title to the document, which requires the “skip = 1” option in read.table), it was easy to merge a couple of months of FitBit data with the Health data. And so here’s how they compare:

[Scatterplot: Health app step counts vs. FitBit step counts]

The regression line is Predicted.iOS.Steps = 1192 + 0.9553 (FitBit.Steps), and r-squared is 0.9223. (A residual plot shows that the relationship is not quite as linear as it looks. Damn.)
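In case it’s useful for anyone following along, here’s a rough sketch of both steps, cleaning and fitting. The file contents are made up to mimic a FitBit export, the step counts are simulated, and the column names are stand-ins; only the cleaning tricks (skip = 1, gsub for commas, lubridate for dates) are as described above.

```r
library(lubridate)

# --- Cleaning: toy stand-in for a FitBit export, with a superfluous
# title line and comma-formatted step counts ---
raw <- 'Activities
Date,Steps
"7/1/2015","10,500"
"7/2/2015","8,200"'

fitbit <- read.table(text = raw, sep = ",", skip = 1,
                     header = TRUE, stringsAsFactors = FALSE)
fitbit$Steps <- as.numeric(gsub(",", "", fitbit$Steps))  # drop commas
fitbit$Date  <- mdy(fitbit$Date)                         # parse dates
# ...then merge(fitbit, health, by = "Date") after similar cleaning

# --- Fitting: `steps` simulated here in place of the merged data ---
set.seed(1)
fb <- round(runif(40, 2000, 15000))
steps <- data.frame(FitBit.Steps = fb,
                    iOS.Steps = 1192 + 0.9553 * fb + rnorm(40, sd = 800))

fit <- lm(iOS.Steps ~ FitBit.Steps, data = steps)
coef(fit)               # intercept and slope
summary(fit)$r.squared
plot(fit, which = 1)    # residual plot, to check linearity
```

The residual plot from `plot(fit, which = 1)` is exactly the diagnostic that revealed the relationship isn’t quite as linear as it looks.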

Questions I’m thinking of posing on the first day of my regression class this quarter:

  1. Which do you think is a more reliable counter of steps?
  2. How closely in agreement are these two step-counting tools? How would you measure this?
  3. What do the slope and intercept tell us?
  4. Why is there more variability for low FitBit step counts than for high ones?
  5. I often lose my FitBit. Currently, for instance, I have no idea where it is.  On those days, FitBit reports “0 steps”. (I removed the 0’s from this analysis.)  Can I use this regression line to predict the values on days I lose my FitBit?  With how much precision?

I think it will be helpful to talk about these questions informally, on the first day, before they have learned more formal methods for tackling them. And maybe I’ll add a few more months of data.

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post; I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts is what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students — wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about it now? I was at USCOTS these last few days, and led a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands-on exercises and promote it as a tool for reproducible data analysis, and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also discussed briefly further tips for documentation and organization as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Interpreting Cause and Effect

One big challenge we all face is understanding what’s good and what’s bad for us. And it’s harder when published research studies conflict. So thanks to Roger Peng for posting on his Facebook page a link that led me to this article by Emily Oster: Cellphones Do Not Give You Brain Cancer, from the good folks at the 538 blog. I think this article would make a great classroom discussion, particularly if, before showing your students the article, they brainstormed several possible experimental designs and discussed the strengths and weaknesses of each. I think it is also interesting to ask why no study similar to the Danish cohort study was done in the US. Thinking about this might lead students to consider cultural attitudes towards widespread data collection.

Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (more on that in another post).

Samuel Wilcock (Messiah College) talked about how, while IRBs are not required for data collected by students for class projects, discussion of the ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and educate our students about it. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing issues surrounding data ethics and IRBs in the introductory statistics course (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by being involved with the honor council of her university, when she realized that while the council keeps impeccable records of reported cases, they don’t have any information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example, thinking about how to code “any” academic misconduct based on various questions that ask about whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. It’s my experience that students love working with data relevant to them (or, even better, about them), and with data on personal or confidential matters, so this dataset seems to hit both of those notes.
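The “any misconduct” coding Shannon described might look something like this in R. The question names and responses here are hypothetical stand-ins, not the actual survey variables:

```r
# Toy stand-in for the survey data; the real survey has 46 questions
survey <- data.frame(cheated_exam = c("Yes", "No", "No"),
                     plagiarized  = c("No", "No", "Yes"),
                     stringsAsFactors = FALSE)

# "Any" misconduct: TRUE if a student reported any type of offense
survey$any_misconduct <- survey$cheated_exam == "Yes" |
                         survey$plagiarized == "Yes"

table(survey$any_misconduct)  # FALSE: 1, TRUE: 2
```

Even this tiny version raises the design questions students face with the real data: what counts as an offense, and how to handle missing or discrepant responses across the questions being combined.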

Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct, and an outcome of their own choosing, and to present their findings in an optional (extra credit) research paper. One example Shannon gave for the latter task was defining a “serious offender”: is it a student who commits a one-time serious offense, or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this, where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment, Shannon mentioned that the administration at her school was concerned that students finding out about the high percentage of academic offenses (the survey showed that about 60% of students had committed a “major” academic offense) might make students think that it’s OK, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University), on a project in which students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking to expand this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like a lot of work (though the presenters assured us that it’s not), projects like these can remind students that data and statistics can be powerful activism tools.