Modernizing the Undergraduate Statistics Curriculum at #JSM2017

I’m a bit late in posting this, but travel delays post-JSM left me weary, so I’m just getting around to it. Better late than never?

Wednesday at JSM featured an invited statistics education session on Modernizing the Undergraduate Statistics Curriculum. This session featured two types of speakers: those who are currently involved in undergraduate education and those who are on the receiving end of graduating majors. The speakers involved in undergraduate education presented on their recent efforts for modernizing the undergraduate statistics curriculum to provide the essential computational and problem solving skills expected from today’s modern statistician while also providing a firm grounding in theory and methods. The speakers representing industry discussed their expectations (or hopes and dreams) for new graduates and where they find gaps in the knowledge of new hires.

The speakers were  Nick Horton (Amherst College), Hilary Parker (Stitch Fix), Jo Hardin (Pomona College), and Colin Rundel (Duke University). The discussant was Rob Gould (UCLA). Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments.

Modernizing the undergraduate statistics curriculum: what are the theoretical underpinnings? – Nick Horton

Hopes and dreams for statistics graduates – Hilary Parker

Expectations and Skills for Undergraduate Students Doing Research in Statistics and Data Science – Jo Hardin

Moving Away from Ad Hoc Statistical Computing Education – Colin Rundel

Discussion – Rob Gould

Novel Approaches to First Statistics / Data Science Course at #JSM2017

Tuesday morning, bright an early at 8:30am, was our session titled “Novel Approaches to First Statistics / Data Science Course”. For some students the first course in statistics may be the only quantitative reasoning course they take in college. For others, it is the first of many in a statistics major curriculum. The content of this course depends on which audience the course is aimed at as well as its place in the curriculum. However a data-centric approach with an emphasis on computation and algorithmic thinking is essential for all modern first statistics courses. The speakers in our session presented their approaches for the various first courses in statistics and data science that they have developed and taught. The discussion also highlighted pedagogical and curricular choices they have made in deciding what to keep, what to eliminate, and what to modify from the traditional introductory statistics curriculum. The speakers in the session were Ben Baumer from Smith College, Rebecca Nugent from CMU, myself, and Daniel Kaplan from Macalester College. Our esteemed discussant was Dick DeVeaux, and our chair, the person who managed to keep this rambunctious bunch on time, was Andrew Bray from Reed College. Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments, or find us on social media!

Ben Baumer – Three Methods Approach to Statistical InferenceRebecca Nugent – Lessons Learned in Transitioning from “Intro to Statistics” to “Reasoning with Data”

Mine Cetinkaya-Rundel – A First-Year Undergraduate Data Science Course

Daniel Kaplan – Teaching Stats for Data Science

Dick DeVeaux – Discussion


My JSM 2017 itinerary

JSM 2017 is almost here. I just landed in Maryland, and I finally managed to finish combing through the entire program. What a packed schedule! I like writing an itinerary post each year, mainly so I can come back to it during and after the event. I obviously won’t make it to all sessions listed for each time slot below, but my decision for which one(s) to attend during any time period will likely depend on proximity to previous session, and potentially also proximity to childcare area.

The focus of the sessions I selected are education, data science, computing, visualization, and social responsibility. In addition to talks on topics I actively work in, I also enjoy listening to talks in application areas I’m interested in, hence the last topic on this list.

If you have suggestions for other sessions (in these topics or other) that you think would be interested, let me know in the comments!

Sun, 7/30/2017

Sunday will be mostly meetings for me, and I’m skipping any evening stuff to see Andrew Bird & Belle and Sebastian!

Mon, 7/31/2017

  • DataFest meeting: 10am – 12pm at H-Key Ballroom 9. Stop by if you’re already an ASA DataFest organizer, or if you’d like to be one in the future!
    • First hour will be discussing what worked and what didn’t, any concerns, kudos, advice for new sites, etc.
    • Second hour will be drop-in for addressing any questions regarding organizing an ASA DataFest at your institution.
  • Computing and Graphics mixer: 6 – 8pm at H-Key Ballroom 1.
  • Caucus for Women in Statistics Reception and Business Meeting: 6:30 – 8:30pm at H-Holiday Ballroom 1&2.

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

ASA President’s Invited Speaker: It’s Not What You Said. It’s What They Heard – Jo Craven McGinty, The Wall Street Journal

Tue, 8/1/2017

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00PM – 3:50 PM

4:00 PM – 5:50 PM

Deming Lecture: A Rake’s Progress Revisited – Fritz Scheuren, NORC-University of Chicago

Wed, 8/2/2017

  • Statistical Education Business Meeting – 6-7:30pm

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00PM – 3:50 PM

4:00 PM – 5:50 PM

COPSS Awards and Fisher Lecture: The Importance of Statistics: Lessons from the Brain Sciences – Robert E. Kass, Carnegie Mellon University

Thur, 8/3/2017

8:30 AM – 10:20 AM

 10:30 AM – 12:20 PM

Structuring Data in Middle School

Of the many provocative and exciting discussions at this year’s Statistics Research Teaching and Learning conference in Rotarua, NZ, one that has stuck in my mind is from Lucia Zapata-Cardona, from the Universidad de Antioquia in Columbia. Lucia discussed data from her classroom observations of a teacher at a middle school (ages 12-13) in a “Northwest Columbian city”. The class was exciting for many reasons, but the reason that I want to write about it here is because of the fact that the teacher had the students structure and store their own data.

The classroom was remarkable – to my American eyes – for the large number of students (45) and for the noise (walls were thin, the playground was immediately outside, and windows were kept open because of the heat.) Despite this, the teacher led an inquiry-based discussion, skillfully prompting the students with questions from the back of the classroom. The discussion lasted over several days.

The students had collected data about the nutritional content of the foods they eat. Challenging students with real-world, meaningful problems is an important part of Prof. Zapata-Cardona’s research, since an important goal of education is to tie the world of the classroom to the real world. Lucia was interested in examining how (and whether) the students constructed and employed statistical models to reason with the data. (Modeling was the theme of this SRTL.) What fascinated me wasn’t the modeling, but the role that the structure of the data played in the students’ reasoning.

Students were asked to collect data on the food contained in their lunchboxes so that they could answer the statistical question “How nutritious is the food we bring to school in our lunchbox?” It’s important to note that in Columbia, as Lucia explained to us, the “lunch box” doesn’t contain actual lunch (which the students eat at home), but instead includes snacks for during the day. What interested me was that the teacher let the class, after discussion, decide how they would enter and organize the data. Now I’m not sure what parameters/options the students were given. I do know that the classroom had one computer, and students took turns entering the data into this computer. And I know that the students discussed which variables they wanted to store, and how they wanted to store them.

The pivotal decision here was that the students decided that each row would represent a food, for example, Chicle. They decided to record information about serving size, calories, fats, carbs, protein, sodium, sugars, whether it was “processed” (5 g, 18, 0, 5, 0, 0, 0 and si, in case you were curious). They decided not to store information about how many students brought this food, or how many servings any individual student brought.

At this point, you may have realized that their statistical question is rather difficult, if not impossible, to answer given the format in which they stored the data. Had each case of the data been an individual lunchbox or an individual person, then the students might have made headway. Instead, they stumbled over issues about how to compare the total calories of the dataset with the total calories eaten by individuals. (After much discussion, most of the class “discovered” that the average amount was a good way of summarizing the data, but some of the more perceptive students pointed out that it wasn’t clear what the average really meant.)

Lucia’s forthcoming paper will go into the details about the good and the bad in the students’ statistical reasoning, and the ways in which they used (or failed to use) statistical models. But what was fascinating to me was the opportunity this provided for helping students understand how the structure of data affects the questions that we can ask, and how the questions we ask should first consider the structure of the data.

Too often, particularly in textbooks, there is no opportunity to reason about the structure of data. When a question is asked, the students are given appropriate data, and rarely allowed even to decide which variables to consider (since the provided data usually includes only the necessary variables), much less whether or not the data should be restructured or re-collected.

Another reason classrooms have avoided letting students structure their own data is that many real-life datasets have complicated structures. The data these students collected is really (or should have been) hierarchical. If the case is the lunchbox, a lunchbox is associated with a student and possibly with more than 1 item. If data are collected on multiple days, then there is nesting within days as well as the potential for missing variables or unequal record lengths.

Data with such a complicated structure are simply not taught in middle schools, even though, as Lucia’s case study demonstrates, they arise easily from familiar contexts.   These data are messy and complicated. Should we even open this pandora’s box for middle school students, or should it wait until they are older? Is it enough to work with the simplified “flat” format such as the one these students came up with, and just modify the statistical question? Should students be taught how to manipulate such data into different formats to answer the questions they are interested in?

You might think hierarchical formats are beyond the middle school level, but work done by Cliff Konold and Bill Finzer, in the context of using the CODAP tool, suggests that it is possible. [I can’t find an online paper to link to for this result, but there are some leads here, and I’m told it has been approved for publication so should appear soon.]

So the question is: when do we teach students to reason with hierarchical data? When do we teach students to recognize that data can be stored in different formats? When do we teach students to convert data from one format to another?

We are back to the question I asked in my last blog: what’s the learning trajectory that takes statistical beginners and teaches them the computational and statistical tools to allow them to address fundamental questions that rely on data that, on the one hand, are complex but on the other hand are found in our day-to-day lives?

Are computers needed to teach Data Science?

One of the many nice things about summer is the time and space it allows for blogging. And, after a very stimulating SRTL conference (Statistics Reasoning, Teaching and Learning) in Rotorua, New Zealand, there’s lots to blog about.

Let’s begin with a provocative posting by fellow SRTL-er Tim Erickson at his excellent blog A Best Case Scenario.  I’ve known Tim for quite awhile, and have enjoyed many interesting and challenging discussions. Tim is a creator of curricula par excellence, and has first-hand experience in what inspires and motivates students to think deeply about statistics.

The central question here is: Is computation (on a computer) necessary for learning data science? The learners here are beginners in K-12. Tim answers no, and I answer, tentatively, yes. Tim portrays me in his blog as being a bit more steadfast on this position than I really am. In truth the answer is, some; maybe; a little; I don’t know.

My own experience in the topic comes from the Mobilize project  , in which we developed the course Introduction to Data Science for students in the Los Angeles Unified School District. (I’m pleased to say that the course is expanding. This summer, five new L.A.-area school districts will begin training teachers to teach this course. )

The course relies heavily on R via Rstudio. Students begin by studying the structure of data, learning to identify cases and variables and to organize unstructured data into a “tidy” format. Next, they learn to “read” tidy datafiles into Rstudio. The course ends with students learning some predictive modeling using Classification and Regression Trees. In between, they study some inference using randomization-based methods.

To be precise, the students don’t learn straight-up R. They work within a package developed by the Mobilize team (primarily James Molyneux, Amelia McNamara, Steve Nolen, Jeroen Ooms, and Hongsuda Tangmunarunkit) called mobilizR, which is based pretty heavily on the mosaic package developed by Randall Pruim, Danny Kaplan and Nick Horton.  The idea with these packages is to provide beginners to R with a unified syntax and a set of verbs that relate more directly to the analysts’ goals. The basic structure for (almost) all commands is

WhatIWantToDo(yvariable~xvariables, dataset)

For example, to see the average walking distance recorded by a fitbit by day of the week:

 > mean(Distance~DOW,data=fitbitdec)
 Friday Monday Saturday Sunday Thursday Tuesday Wednesday 1.900000 3.690000 2.020909 2.419091 1.432727 3.378182 3.644545

The idea is to provide students with a simplified syntax that “bridges the gap” between beginners of R and more advanced users. Hopefully, this frees up some of the cognitive load required to remember and employ R commands so that students can think strategically and statistically about problems they are trying to solve.

The “bridge the gap” terminology comes from Amelia McNamara, who used the term in her PhD dissertation. One of the many really useful ideas Amelia has given us is the notion that the gap needs to be bridged. Much of “traditional” statistics education holds to the idea that statistical concepts are primarily mathematical, and, for most people, it is sufficient to learn enough of the mathematical concepts so that they can react skeptically and critically to others’ analyses. What is exciting about data science in education is that students can do their own analyses. And if students are analyzing data and discovering on their own (instead of just trying to understand others’ findings), then we need to teach them to use software in such a way that they can transition to more professional practices.

And now, dear readers, we get to the heart of the matter. That gap is really hard to bridge. One reason is that we know little to nothing about the terrain. How do students learn coding when applied to data analysis? How does the technology they use mediate that experience? How can it enhance, rather than inhibit, understanding of statistical concepts and the ability to do data analysis intelligently?

In other words, what’s the learning trajectory?

Tim rightly points to CODAP, the Common Online Data Analysis Platform,  as one tool that might help bridge the gap by providing students with some powerful data manipulation techniques. And I recently learned about, which seems another attempt to help bridge the gap.  But Amelia’s point is that it is not enough to give students the ability to do something; you have to give it to them so that they are prepared to learn the next step. And if the end-point of a statistics education involves coding, then those intermediate steps need to be developing students’ coding skills, as well as their statistical thinking. It’s not sufficient to help studemts learn statistics. They must simultaneously learn computation.

So how do we get there? One important initial step, I believe, is to really examine what the term “computational thinking” means when we apply it to data analysis. And that will be the subject of an upcoming summer blog.

StatPREP Workshops

This last weekend I helped Danny Kaplan and Kathryn Kozak (Coconino Community College) put on a StatPREP workshop. We were also joined by Amelia McNamara (Smith College) and Joe Roith (St. Catherine’s University). The idea behind StatPREP is to work directly with college-level instructors, through online and in community-based workshops, to develop the understanding and skills needed to work and teach with modern data.

Danny Kaplan ponders at #StatPREP

One of the most interesting aspects of these workshops were the tutorials and exercises that the participants worked on. These utilized the R package learnr. This package allows people to create interactive tutorials via RMarkdown. These tutorials can incorporate code chunks that run directly in the browser (when the tutorial is hosted on an appropriate server), and Shiny apps. They can also include exercises/quiz questions as well.

An example of a code chunk from the learnr package.

Within these tutorials, participants were introduced to data wrangling (via dplyr), data visualization (via ggfomula), and data summarization and simulation-based inference (via functions from Project Mosaic). You can see and try some of the tutorials from the workshop here. Participants, in breakout groups, also envisioned a tutorial, and with the help of the workshop presenters, turned that into the skeleton for a tutorial (some things we got working and others are just outlines…we only had a couple hours).

You can read more about the StatPREP workshops and opportunities here.




Citizen Statistician’s very own Mine Çetinkaya-Rundel gave one of the keynote addresses at USCOTS 2017.

The abstract for her talk, Teaching Data Science and Statistical Computation to Undergraduates, is given below.

What draws students to statistics? For some, the answer is mathematics, and for those a course in probability theory might be an attractive entry point. For others, their first exposure to statistics might be an applied introductory statistics course that focuses on methodology. This talk presents an alternative focus for a gateway to statistics: an introductory data science course focusing on data wrangling, exploratory data analysis, data visualization, and effective communication and approaching statistics from a model-based, instead of an inference-based, perspective. A heavy emphasis is placed on best practices for statistical computation, such as reproducibility and collaborative computing through literate programming and version control. I will discuss specific details of this course and how it fits into a modern undergraduate statistics curriculum as well as the success of the course in recruiting students to a statistics major.

You can view her slides at



Read elsewhere: Organizing DataFest the tidy way

Part of the reason why we have been somewhat silent at Citizen Statistician is that it’s DataFest season, and that means a few weeks (months?) of all consuming organization followed by a weekend of super fun data immersion and exhaustion… Each year that I organize DataFest I tell myself “next year, I’ll do [blah] to make my life easier”. This year I finally did it! Read about how I’ve been streamlining the process of registrations, registration confirmations, and dissemination of information prior to the event on my post titled “Organizing DataFest the tidy way” on the R Views blog.

Stay tuned for an update on ASA DataFest 2017 once all 31 DataFests around the globe have concluded!

Theaster Gates, W.E.B. Du Bois, and Statistical Graphics

After reading this review of a Theaster Gates show at Regan Projects, in L.A., I hurried to see the show before it closed. Inspired by sociologist and civil rights activist W.E.B. Du Bois, Gates created artistic interpretations of statistical graphics that Du Bois had produced for an exhibition in Paris in 1900.  Coincidentally, I had just heard about these graphics the previous week at the Data Science Education Technology conference while evesdropping on a conversation Andy Zieffler was having with someone else.  What a pleasant surprise, then, when I learned, almost as soon as I got home, about this exhibit.

I’m no art critic ( but I know what I like), and I found these works to be beautiful, simple, and powerful.  What startled me, when I looked for the Du Bois originals, was how little Gates had changed the graphics. Here’s one work (I apologize for not knowing the title. That’s the difference between an occasional blogger and a journalist.)  It hints of Mondrian, and  the geometry intrigues. Up close, the colors are rich and textured.

Here’s Du Bois’s circa-1900 mosaic-type plot (from, which provides a nice overview of the exhibit for which Du Bois created his innovative graphics)

The title is “Negro business men in the United States”. The large yellow square is “Grocers” the blue square “Undertakers”, and the green square below it is “Publishers.  More are available at the Library of Congress.

Here’s another pair.  The Gates version raised many questions for me.  Why were the bars irregularly sized? What was the organizing principle behind the original? Were the categories sorted in an increasing order, and Gates added some irregularities for visual interest?  What variables are on the axes?

The answer is, no, Gates did not vary the lengths of the bars, only the color.

The vertical axis displays dates, ranging from 1874 to 1899 (just 1 year before Du Bois put the graphics together from a wide variety of sources).  The horizontal axis is acres of land, with values from 334,000 to 1.1 million.

The history of using data to support civil rights has a long history.   A colleague once remarked that there was a great unwritten book behind the story that data and statistical analysis played (and continue to play) in the gay civil rights movement (and perhaps it has been written?)  And the folks at We Quant LA have a nice article demonstrating some of the difficulties in using open data to ask questions about racial profiling by the LAPD. In this day and age of alternative facts and fake news, it’s wise to be careful and precise about what we can and cannot learn from data. And it is encouraging to see the role that art can play in keeping this dialogue alive.

Some Reading for the Winter Break

It has been a long while since I wrote anything for Citizen Statistician, so I thought I would scribe a post about three books that I will be reading over break.







The first book is Cathy O’Neil’s book, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy [link to Amazon]. I am currently in the midst of Chapter 3. I heard about this book on an episode of 538’s podcast, What’s the Point?, on which O’Neil was featured [Who’s Accountable When An Algorithm Makes A Bad Decision?]. The premise of this book has been something that has been on the mind of many people thinking about data science and algorithms in recent years (and probably not-so-recent years); that many algorithms, and thus the predictions stemming from them, are not transparent. This leads to many ethical and, potentially, legal issues when algorithms are then used to make decisions about recidivism, loan applications, college admissions, etc. I think this book could be the basis for a very interesting seminar. Let me know if anyone is working on something like this.

The second book I will be reading is Michael Lewis’ The Undoing Project: A Friendship That Changed Our Minds [link to Amazon]. This book is bout the friendship, collaboration, and, ultimately, disentanglement between the renowned psychologists Daniel Kahnemann and Amos Tversky. I learned about Kahnemann and Tversky’s work early in my graduate career when Joan Garfield taught a doctoral research seminar on the seminal psychological work related to probabilistic thinking and statistics education. We read not only Kahnemann and Tversky, but also Gird Gigerenzer, Ruma Falk, Maya Bar Hillel, Richard Nisbett, Efraim Fischbein, and others. Interestingly, What’s the Point? recently did two episodes on Lewis’ book as well; Michael Lewis’s New Book Examines How We Think About Thinking and Nate Silver Interviews Michael Lewis About His New Book, ‘The Undoing Project’.

The third book is Who’s #1?: The Science of Rating and Ranking [link to Amazon] by Amy Langville and Carl Meyer. I had read their earlier book, Google’s PageRank and Beyond: The Science of Search Engine Rankings, several years ago, and was quite impressed with the readability of the complex matrix algebra they presented. In Who’s #1, the authors present the mathematics underlying several ratings systems including the Massey system, Elo, Colley, Keener, etc. I am actually treating this book like a self-taught class,  working out several of their examples using R, and really trying to understand the ideas. My interest here is related to the work that I am doing with Brandon LeBeau (University of Iowa) and a current graduate student, Kyle Nickodem on estimating the coaching ability for NCAA football coaches using a hierarchical IRT model [see slides from a talk here].