Mapping Irma, but not really…

We’re discussing data visualization nowadays in my course, and today’s topic was supposed to be mapping. However, late last night I realized I was going to run out of time, so I decided to table hands-on mapping exercises until a bit later in the course (after we do some data manipulation as well, which I think will work better).

That being said, talking about maps seemed timely, especially with Hurricane Irma developing. Here is how we went about it:

In addition to what’s on the slide I told the students that they can assume the map is given, and they should only think about how the forecast lines would be drawn.

Everyone came up with “we need latitude and longitude and time”. However, some teams suggested each column would represent one of the trajectories (wide data), while others came up with the idea of having an indicator column for the trajectory (long data). We sketched out on the board what these two data frames would look like, and evaluated which would be easier to plot directly using the tools we’ve learned so far (plotting in R with ggplot2).
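To make the comparison concrete, here is a minimal sketch of the two layouts the teams proposed, in base R. The coordinates and model names are invented for illustration; they are not from any actual forecast.

```r
# Wide layout: one pair of lat/lon columns per forecast model (values invented).
wide <- data.frame(
  time      = 1:3,
  lat_gfs   = c(17.0, 18.2, 19.5), lon_gfs   = c(-61.8, -64.0, -66.5),
  lat_ecmwf = c(17.0, 18.0, 19.0), lon_ecmwf = c(-61.8, -63.5, -65.8)
)

# Long layout: one row per (model, time), with an indicator column for the model.
long <- rbind(
  data.frame(model = "gfs",   time = 1:3, lat = wide$lat_gfs,   lon = wide$lon_gfs),
  data.frame(model = "ecmwf", time = 1:3, lat = wide$lat_ecmwf, lon = wide$lon_ecmwf)
)

# The long layout plots directly with a single aesthetic mapping, e.g. in ggplot2:
#   ggplot(long, aes(x = lon, y = lat, color = model)) + geom_path()
```

In the wide layout, every new forecast adds a pair of columns and the plotting code has to change; in the long layout, one `geom_path()` call handles any number of trajectories.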

While this was a somewhat superficial activity compared to a hands-on mapping exercise, I thought it worked well for a variety of reasons:

  1. It was a timely example that grabbed students’ attention.
  2. It generated lively discussion around various ways of organizing data into data frames (which hopefully will serve as a good primer for the data manipulation unit where we’ll discuss how data don’t always come in the format you need and you might need to get it in shape first before you can visualize/analyze it).
  3. Working backwards from a visualization to source data (as opposed to from data to visualization) provided a different challenge/perspective, and a welcome break from “how do I get R to plot this?”.
  4. We got to talk about the fact that predictions based on the same source data can vary depending on the forecasting model (foreshadowing of concepts we will discuss in the modeling unit coming up later in the course).
  5. It was quick to prepare! And quick to work through in class (~5 mins of team discussion + ~10 mins of class discussion).

I also suggested to students that they read the underlying NYTimes article as well as this Upshot article if they’re interested in finding out more about modeling the path of a hurricane (or modeling anything, really) and uncertainty.

Modernizing the Undergraduate Statistics Curriculum at #JSM2017

I’m a bit late in posting this, but travel delays post-JSM left me weary, so I’m just getting around to it. Better late than never?

Wednesday at JSM featured an invited statistics education session on Modernizing the Undergraduate Statistics Curriculum. This session featured two types of speakers: those who are currently involved in undergraduate education and those who are on the receiving end of graduating majors. The speakers involved in undergraduate education presented on their recent efforts for modernizing the undergraduate statistics curriculum to provide the essential computational and problem solving skills expected from today’s modern statistician while also providing a firm grounding in theory and methods. The speakers representing industry discussed their expectations (or hopes and dreams) for new graduates and where they find gaps in the knowledge of new hires.

The speakers were Nick Horton (Amherst College), Hilary Parker (Stitch Fix), Jo Hardin (Pomona College), and Colin Rundel (Duke University). The discussant was Rob Gould (UCLA). Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments.

Modernizing the undergraduate statistics curriculum: what are the theoretical underpinnings? – Nick Horton

Hopes and dreams for statistics graduates – Hilary Parker

Expectations and Skills for Undergraduate Students Doing Research in Statistics and Data Science – Jo Hardin

Moving Away from Ad Hoc Statistical Computing Education – Colin Rundel

Discussion – Rob Gould

Novel Approaches to First Statistics / Data Science Course at #JSM2017

Tuesday morning, bright and early at 8:30am, was our session titled “Novel Approaches to First Statistics / Data Science Course”. For some students the first course in statistics may be the only quantitative reasoning course they take in college. For others, it is the first of many in a statistics major curriculum. The content of this course depends on which audience the course is aimed at as well as its place in the curriculum. However, a data-centric approach with an emphasis on computation and algorithmic thinking is essential for all modern first statistics courses. The speakers in our session presented their approaches for the various first courses in statistics and data science that they have developed and taught. The discussion also highlighted pedagogical and curricular choices they have made in deciding what to keep, what to eliminate, and what to modify from the traditional introductory statistics curriculum. The speakers in the session were Ben Baumer from Smith College, Rebecca Nugent from CMU, myself, and Daniel Kaplan from Macalester College. Our esteemed discussant was Dick DeVeaux, and our chair, the person who managed to keep this rambunctious bunch on time, was Andrew Bray from Reed College. Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments, or find us on social media!

Ben Baumer – Three Methods Approach to Statistical Inference

Rebecca Nugent – Lessons Learned in Transitioning from “Intro to Statistics” to “Reasoning with Data”

Mine Cetinkaya-Rundel – A First-Year Undergraduate Data Science Course

Daniel Kaplan – Teaching Stats for Data Science

Dick DeVeaux – Discussion

 

Structuring Data in Middle School

Of the many provocative and exciting discussions at this year’s Statistics Research Teaching and Learning conference in Rotorua, NZ, one that has stuck in my mind is from Lucia Zapata-Cardona, from the Universidad de Antioquia in Colombia. Lucia discussed data from her classroom observations of a teacher at a middle school (ages 12-13) in a “Northwest Colombian city”. The class was exciting for many reasons, but the reason I want to write about it here is that the teacher had the students structure and store their own data.

The classroom was remarkable – to my American eyes – for the large number of students (45) and for the noise (walls were thin, the playground was immediately outside, and windows were kept open because of the heat). Despite this, the teacher led an inquiry-based discussion, skillfully prompting the students with questions from the back of the classroom. The discussion lasted several days.

The students had collected data about the nutritional content of the foods they eat. Challenging students with real-world, meaningful problems is an important part of Prof. Zapata-Cardona’s research, since an important goal of education is to tie the world of the classroom to the real world. Lucia was interested in examining how (and whether) the students constructed and employed statistical models to reason with the data. (Modeling was the theme of this SRTL.) What fascinated me wasn’t the modeling, but the role that the structure of the data played in the students’ reasoning.

Students were asked to collect data on the food contained in their lunchboxes so that they could answer the statistical question “How nutritious is the food we bring to school in our lunchbox?” It’s important to note that in Colombia, as Lucia explained to us, the “lunch box” doesn’t contain actual lunch (which the students eat at home), but instead includes snacks for during the day. What interested me was that the teacher let the class, after discussion, decide how they would enter and organize the data. Now I’m not sure what parameters/options the students were given. I do know that the classroom had one computer, and students took turns entering the data into this computer. And I know that the students discussed which variables they wanted to store, and how they wanted to store them.

The pivotal decision here was that the students decided that each row would represent a food, for example, Chicle. They decided to record information about serving size, calories, fats, carbs, protein, sodium, sugars, whether it was “processed” (5 g, 18, 0, 5, 0, 0, 0 and si, in case you were curious). They decided not to store information about how many students brought this food, or how many servings any individual student brought.

At this point, you may have realized that their statistical question is rather difficult, if not impossible, to answer given the format in which they stored the data. Had each case of the data been an individual lunchbox or an individual person, then the students might have made headway. Instead, they stumbled over issues about how to compare the total calories of the dataset with the total calories eaten by individuals. (After much discussion, most of the class “discovered” that the average amount was a good way of summarizing the data, but some of the more perceptive students pointed out that it wasn’t clear what the average really meant.)
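A tiny sketch makes the problem visible. With one row per food (all numbers below invented for illustration), the only average R can compute directly is across food types, which is not the per-lunchbox or per-student quantity the statistical question asks about:

```r
# One row per food, as the students structured it (numbers invented here).
foods <- data.frame(
  food      = c("chicle", "galletas", "jugo"),
  calories  = c(18, 120, 90),
  procesado = c("si", "si", "no")
)

# The only directly computable "average calories" is across food types:
mean(foods$calories)
# Without a column recording which lunchbox (or how many servings) each
# food came from, per-student intake simply isn't recoverable.
```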

Lucia’s forthcoming paper will go into the details about the good and the bad in the students’ statistical reasoning, and the ways in which they used (or failed to use) statistical models. But what was fascinating to me was the opportunity this provided for helping students understand how the structure of data affects the questions that we can ask, and how the questions we ask should first consider the structure of the data.

Too often, particularly in textbooks, there is no opportunity to reason about the structure of data. When a question is asked, the students are given appropriate data, and rarely allowed even to decide which variables to consider (since the provided data usually includes only the necessary variables), much less whether or not the data should be restructured or re-collected.

Another reason classrooms have avoided letting students structure their own data is that many real-life datasets have complicated structures. The data these students collected are really (or should have been) hierarchical. If the case is the lunchbox, a lunchbox is associated with a student and possibly with more than one item. If data are collected on multiple days, then there is nesting within days as well as the potential for missing variables or unequal record lengths.

Data with such a complicated structure are simply not taught in middle schools, even though, as Lucia’s case study demonstrates, they arise easily from familiar contexts. These data are messy and complicated. Should we even open this Pandora’s box for middle school students, or should it wait until they are older? Is it enough to work with a simplified “flat” format such as the one these students came up with, and just modify the statistical question? Should students be taught how to manipulate such data into different formats to answer the questions they are interested in?

You might think hierarchical formats are beyond the middle school level, but work done by Cliff Konold and Bill Finzer, in the context of using the CODAP tool, suggests that it is possible. [I can’t find an online paper to link to for this result, but there are some leads here, and I’m told it has been approved for publication so should appear soon.]

So the question is: when do we teach students to reason with hierarchical data? When do we teach students to recognize that data can be stored in different formats? When do we teach students to convert data from one format to another?

We are back to the question I asked in my last blog: what’s the learning trajectory that takes statistical beginners and teaches them the computational and statistical tools to allow them to address fundamental questions that rely on data that, on the one hand, are complex but on the other hand are found in our day-to-day lives?

Are computers needed to teach Data Science?

One of the many nice things about summer is the time and space it allows for blogging. And, after a very stimulating SRTL conference (Statistics Reasoning, Teaching and Learning) in Rotorua, New Zealand, there’s lots to blog about.

Let’s begin with a provocative posting by fellow SRTL-er Tim Erickson at his excellent blog A Best Case Scenario.  I’ve known Tim for quite awhile, and have enjoyed many interesting and challenging discussions. Tim is a creator of curricula par excellence, and has first-hand experience in what inspires and motivates students to think deeply about statistics.

The central question here is: Is computation (on a computer) necessary for learning data science? The learners here are beginners in K-12. Tim answers no, and I answer, tentatively, yes. Tim portrays me in his blog as being a bit more steadfast on this position than I really am. In truth the answer is, some; maybe; a little; I don’t know.

My own experience with the topic comes from the Mobilize project, in which we developed the course Introduction to Data Science for students in the Los Angeles Unified School District. (I’m pleased to say that the course is expanding. This summer, five new L.A.-area school districts will begin training teachers to teach this course.)

The course relies heavily on R via RStudio. Students begin by studying the structure of data, learning to identify cases and variables and to organize unstructured data into a “tidy” format. Next, they learn to “read” tidy datafiles into RStudio. The course ends with students learning some predictive modeling using Classification and Regression Trees. In between, they study some inference using randomization-based methods.

To be precise, the students don’t learn straight-up R. They work within a package developed by the Mobilize team (primarily James Molyneux, Amelia McNamara, Steve Nolen, Jeroen Ooms, and Hongsuda Tangmunarunkit) called mobilizR, which is based pretty heavily on the mosaic package developed by Randall Pruim, Danny Kaplan and Nick Horton.  The idea with these packages is to provide beginners to R with a unified syntax and a set of verbs that relate more directly to the analysts’ goals. The basic structure for (almost) all commands is

WhatIWantToDo(yvariable~xvariables, dataset)

For example, to see the average walking distance recorded by a fitbit by day of the week:

 > mean(Distance~DOW,data=fitbitdec)
    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
  1.900000  3.690000  2.020909  2.419091  1.432727  3.378182  3.644545

The idea is to provide students with a simplified syntax that “bridges the gap” between beginners of R and more advanced users. Hopefully, this frees up some of the cognitive load required to remember and employ R commands so that students can think strategically and statistically about problems they are trying to solve.
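For comparison, here is roughly what that one formula call abstracts away in base R. This is a sketch: the `fitbitdec` data frame and its values below are invented stand-ins, not the real data.

```r
# A toy stand-in for the fitbit data (values invented).
fitbitdec <- data.frame(
  DOW      = rep(c("Monday", "Tuesday"), each = 2),
  Distance = c(3.50, 3.88, 3.20, 3.56)
)

# The grouped mean without the formula interface:
tapply(fitbitdec$Distance, fitbitdec$DOW, mean)
```

The formula version reads almost like the question being asked (“mean Distance by DOW”), while the base R version asks students to juggle `$` extraction and argument order. That difference is exactly the cognitive load the unified `verb(y ~ x, data)` syntax is meant to remove.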

The “bridge the gap” terminology comes from Amelia McNamara, who used the term in her PhD dissertation. One of the many really useful ideas Amelia has given us is the notion that the gap needs to be bridged. Much of “traditional” statistics education holds to the idea that statistical concepts are primarily mathematical, and, for most people, it is sufficient to learn enough of the mathematical concepts so that they can react skeptically and critically to others’ analyses. What is exciting about data science in education is that students can do their own analyses. And if students are analyzing data and discovering on their own (instead of just trying to understand others’ findings), then we need to teach them to use software in such a way that they can transition to more professional practices.

And now, dear readers, we get to the heart of the matter. That gap is really hard to bridge. One reason is that we know little to nothing about the terrain. How do students learn coding when applied to data analysis? How does the technology they use mediate that experience? How can it enhance, rather than inhibit, understanding of statistical concepts and the ability to do data analysis intelligently?

In other words, what’s the learning trajectory?

Tim rightly points to CODAP, the Common Online Data Analysis Platform, as one tool that might help bridge the gap by providing students with some powerful data manipulation techniques. And I recently learned about data.world, which seems like another attempt to help bridge the gap. But Amelia’s point is that it is not enough to give students the ability to do something; you have to give it to them so that they are prepared to learn the next step. And if the end-point of a statistics education involves coding, then those intermediate steps need to develop students’ coding skills, as well as their statistical thinking. It’s not sufficient to help students learn statistics. They must simultaneously learn computation.

So how do we get there? One important initial step, I believe, is to really examine what the term “computational thinking” means when we apply it to data analysis. And that will be the subject of an upcoming summer blog.

Read elsewhere: Organizing DataFest the tidy way

Part of the reason why we have been somewhat silent at Citizen Statistician is that it’s DataFest season, and that means a few weeks (months?) of all-consuming organization followed by a weekend of super fun data immersion and exhaustion… Each year that I organize DataFest I tell myself “next year, I’ll do [blah] to make my life easier”. This year I finally did it! Read about how I’ve been streamlining the process of registrations, registration confirmations, and dissemination of information prior to the event on my post titled “Organizing DataFest the tidy way” on the R Views blog.

Stay tuned for an update on ASA DataFest 2017 once all 31 DataFests around the globe have concluded!

Slack for managing course TAs

I meant to write this post last year when I was teaching a large course with lots of teaching assistants to manage, but, well, I was teaching a large course with lots of teaching assistants to manage, so I ran out of time…

There is nothing all that revolutionary here. People have been using Slack to manage teams for a while now. I’ve even come across some articles / posts on using Slack as a course discussion forum, so use of Slack in an educational setting is not all that new either. But I have not heard of people using Slack for organizing the course and managing TAs, so I figured it might be worthwhile to write about my experience.

TL;DR: A+, would do it again!

I’ll be honest: when I first found out about Slack, I wasn’t all that impressed. First, I kept thinking it was called Slacker, and I was like, “hey, I’m no slacker!” (I totally am…). Second, I initially thought one had to use Slack in the browser, and I accidentally kept closing the tab and hence missing messages. There is a Slack app you can run on your computer or phone; it took me a while to realize that. Because of my rocky start with it, I didn’t think to use Slack in my teaching. I must credit my co-instructor, Anthea Monod, for the idea of using Slack for communicating with our TAs.

Between the two instructors we had 12 TAs to manage. We set up a Slack team for the course with channels like #labs, #problem sets, #office_hours, #meetings, etc.

This setup worked really well for us for a variety of reasons:

  • Keep course management related emails out of email inbox: These really add up. At this point, any email I can keep out of my inbox is a win in my book!
  • Easily keep all TAs in the loop: Need to announce a typo in a solution key? Or give TAs a heads up about questions they might expect in office hours? I used to handle these by emailing them all, and either I’d miss one or two or a TA responding to my email would forget to reply all (people never seem to reply all when they should, but they always do when they shouldn’t!)
  • Provide a space for TAs to easily communicate with each other: Our TAs used Slack to let others know they might need someone to cover for them for office hours, or teaching a section, etc. It was nice to be able to alert all of them at once, and also for everyone to see when someone responded saying they’re available to cover.
  • Keep a record of decisions made in an easily searchable space: Slack’s search is not great, but it’s better than my email’s for sure. Plus, since you’re searching only within that team’s communication, as opposed to through all your emails, it’s a lot easier to find what you’re looking for.
  • It’s fun: The #random channel was a place people shared funny tidbits or cool blog posts etc. I doubt the TAs would be emailing each other with these if this communication channel wasn’t there. It made them act more like a community than they would otherwise.
  • It’s free: At least for a reasonable amount of usage for a semester-long course.

Some words of advice if you decide to use Slack for managing your own course:

  • There is a start-up cost: Not cost as in $$, but cost as in time… At the beginning of the semester you’ll need to make sure everyone joins the team and sets up Slack on their devices. We did this during our first meeting; it was a lot more efficient than emailing reminders.
  • It takes time for people to break their emailing habits: For the first couple weeks TAs would still email me their questions instead of using Slack. It took some time and nudging, but eventually everyone shifted all course related communication to Slack.

If you’re teaching a course with TAs this semester, especially a large one with many people to manage, I strongly recommend giving Slack a try.

A timely first day of class example for Fall 2016: Trump Tweets

On the first day of an intro stats or intro data science course I enjoy giving some accessible real data examples, instead of spending the whole time going over the syllabus (which is necessary in my opinion, but somewhat boring nonetheless).

One of my favorite examples is How to Tell Someone’s Age When All You Know Is Her Name from FiveThirtyEight. As an added bonus, you can use this example to get to know some students’ names. I usually go through a few of the visualizations in this article, asking students to raise their hands if their name appears in the visualization. Sometimes I also supplement this with the Baby Name Voyager; it’s fun to have students offer up their names so we can take a look at how their popularity has changed over the years.


Another example I like is the Locals and Tourists Flickr Photos. If I remember correctly, I saw this example first in Mark Hanson’s class in grad school. These maps use data from geotags on Flickr: blue pictures are taken by locals, red pictures are by tourists, and yellow pictures might be by either. The map of Manhattan is one most students will recognize, since many people know where Times Square and Central Park are, both of which have an abundance of red (tourist) pictures. And if your students watch enough Law & Order, they might also know where Rikers Island is and recognize that, unsurprisingly, no pictures are posted from that location.

However, if I were teaching a class this coming Fall, I would add the following analysis of Donald Trump’s tweets to my list of examples. If you have not yet seen this analysis by David Robinson, I recommend you stop what you’re doing now and go read it. It’s linked below:

Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half

I’m not going to reiterate the post here, but the gist of it is that the @realDonaldTrump account tweets from two different phones, and that

the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.

Source: http://varianceexplained.org/r/trump-tweets/

I think this post would be a fantastic and timely first day of class example for a stats / data analysis / data science course. It shows a pretty easy to follow analysis complete with the R code to reproduce it. It uses some sentiment analysis techniques that may not be the focus of an intro course, but since the context will be familiar to students it shouldn’t be too confusing for them. It also features techniques one will likely cover in an intro course, like confidence intervals.
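As a flavor of the kind of interval estimate such an analysis might pair with this context, one could compare the proportion of tweets containing a picture or link between the two phones. The counts below are invented for illustration, not taken from the post:

```r
# Invented counts: tweets with an embedded picture/link, by source phone.
android <- c(with_link = 26,  total = 762)
iphone  <- c(with_link = 273, total = 628)

# Approximate CI for the difference in proportions (Android minus iPhone):
test <- prop.test(c(android["with_link"], iphone["with_link"]),
                  c(android["total"], iphone["total"]))
test$conf.int
```

With counts of this size the interval sits entirely below zero, i.e. the hypothetical Android account attaches pictures/links far less often. It’s the kind of comparison that lets an intro course attach a familiar tool (a two-proportion interval) to a memorable dataset.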

As a bonus, many popular media outlets have covered the analysis in the last few days (e.g. see here, here, and here), and some of those articles might be even easier on the students to begin with before delving into the analysis in the blog post. Personally, I would start by playing this clip from the CTV News Channel featuring an interview with David to provide the context first (a video always helps wake students up), and then move on to discussing some of the visualizations from the blog post.

Michael Phelps’ hickies

Ok, they’re not hickies, but NPR referred to them as such, so I’m going with it… I’m talking about the cupping marks.

The NPR story can be heard (or read) here. There were two points made in this story that I think would be useful and fun to discuss in a stats course.

The first is the placebo effect. Oftentimes in intro stats courses the placebo effect is mentioned as something undesirable that must be controlled for. This is true, but in this case the “placebo effect from cupping could work to reduce pain with or without an underlying physical benefit”. While there isn’t sufficient scientific evidence for a positive physical effect of cupping, the placebo effect might be just enough to give an individual Olympian the edge to outperform others by a small margin.

This brings me to my second point: the individual effect on extreme cases vs. a statistically significant effect on a population parameter. I did a brief search on Google Scholar for studies on the effectiveness of cupping, and most use t-tests or ANOVAs to evaluate the effect on some average pain / severity-of-symptom score. If we can assume no adverse effect from cupping, might it still make sense for an individual to give the treatment a try, even if the treatment has not been shown to statistically significantly improve average pain? I think this would be an interesting, and timely, question to discuss in class when introducing a method like the t-test. Often in tests of significance on a mean, the variance of a treatment effect is viewed as a nuisance factor that is only useful for figuring out the variability of the sampling distribution of the mean, but in this case the variance of the treatment effect on individuals might also be of interest.
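A quick simulation (entirely made-up numbers) makes the classroom point: a treatment whose mean effect fails a t-test can still deliver a large benefit to a meaningful fraction of individuals.

```r
set.seed(1)
# Hypothetical individual pain-reduction scores: small average effect,
# large person-to-person variability.
effect <- rnorm(20, mean = 0.3, sd = 2)

t.test(effect)$p.value   # non-significant at the usual 0.05 level here
mean(effect > 2)         # yet a nontrivial share see a large improvement
```

The discussion then writes itself: should an individual athlete care about the population-average test, or about the spread of individual responses?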

While my brief search didn’t result in any datasets on cupping, the following articles contain some summary statistics or citations to studies that report such statistics that one could bring into the classroom:

PS: I wanted to include a picture of these cupping marks on Michael Phelps, but I couldn’t easily find an image that was free to use or share. You can see a picture here.

PPS: Holy small sample sizes in some of the studies I came across!

JSM 2016 session on “Doing more with data”

The ASA’s most recent curriculum guidelines emphasize the increasing importance of data science, real applications, model diversity, and communication / teamwork in undergraduate education. In an effort to highlight recent efforts inspired by these guidelines, I organized a JSM session titled Doing more with data in and outside the undergraduate classroom. This session featured talks on recent curricular and extra-curricular efforts in this vein, with a particular emphasis on challenging students with real and complex data and data analysis. The speakers discussed how these pedagogical innovations aim to educate and engage the next generation, and help them acquire the statistical and data science skills necessary to succeed in a future of ever-increasing data. I’m posting the slides from this session for those who missed it as well as for those who want to review the resources linked in the slides.

Computational Thinking and Statistical Thinking: Foundations of Data Science

by Ani Adhikari and Michael I. Jordan, University of California at Berkeley

 

Learning Communities: An Emerging Platform for Research in Statistics

by Mark Daniel Ward, Purdue University

 

The ASA DataFest: Learning by Doing

by Robert Gould, University of California at Los Angeles

(See http://www.amstat.org/education/datafest/ if you’re interested in organizing an ASA DataFest at your institution.)

 

Statistical Computing as an Introduction to Data Science

by Colin Rundel, Duke University [GitHub]