JSM 2016 session on “Doing more with data”

The ASA’s most recent curriculum guidelines emphasize the increasing importance of data science, real applications, model diversity, and communication / teamwork in undergraduate education. In an effort to highlight recent efforts inspired by these guidelines, I organized a JSM session titled Doing more with data in and outside the undergraduate classroom. This session featured talks on recent curricular and extra-curricular efforts in this vein, with a particular emphasis on challenging students with real and complex data and data analysis. The speakers discussed how these pedagogical innovations aim to educate and engage the next generation, and help them acquire the statistical and data science skills necessary to succeed in a future of ever-increasing data. I’m posting the slides from this session for those who missed it as well as for those who want to review the resources linked in the slides.

Computational Thinking and Statistical Thinking: Foundations of Data Science

by Ani Adhikari and Michael I. Jordan, University of California at Berkeley


Learning Communities: An Emerging Platform for Research in Statistics

by Mark Daniel Ward, Purdue University


The ASA DataFest: Learning by Doing

by Robert Gould, University of California at Los Angeles

(See http://www.amstat.org/education/datafest/ if you’re interested in organizing an ASA DataFest at your institution.)


Statistical Computing as an Introduction to Data Science

by Colin Rundel, Duke University [GitHub]

JSM 2016 session on Reproducibility in Statistics and Data Science

Will reproducibility always be this hard?Ten years after Ioannidis alleged that most scientific findings are false, reproducibility — or lack thereof — has become a full-blown crisis in science. Flagship journals like Nature and Science have published hand-wringing editorials and revised their policies in the hopes of heightening standards of reproducibility. In the statistical and data sciences, the barriers towards reproducibility are far lower, given that our analysis can usually be digitally encoded (e.g., scripts, algorithms, data files, etc.). Failure to ensure the credibility of our contributions will erode “the extraordinary power of statistics,” both among our colleagues and in our collaborations with scientists of all fields. This morning’s JSM session on Reproducibility in Statistics and Data Science featured talks on recent efforts in pursuit of reproducibility. The slides of talks by the speakers and the discussant are posted below.

Note that some links point to a GitHub repo including slides as well as other useful resources for the talk and for adopting reproducible frameworks for your research and teaching. I’m also including Twitter handles for the speakers which is likely the most efficient way for getting in touch with them if you have any questions for them.

This session was organized by Ben Baumer and myself as part of our Project TIER fellowship. Many thanks to Amelia McNamara, who is also a Project TIER fellow, for chairing the session (and correctly pronouncing my name)!

  • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets – Jenny Bryan – repo, including slides – @JennyBryan
  • Steps Toward Reproducible Research – Karl Broman – slides – @kwbroman
  • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! – Karthik Ram – slides – @_inundata
  • Integrating Reproducibility into the Undergraduate Statistics Curriculum – Mine Çetinkaya-Rundel – repo, including slides – @minebocek
  • Discussant: Yihui Xie – slides – @xieyihui

PS: Don’t miss this gem of a repo for links to many many more JSM 2016 slides. Thanks Karl for putting it together!

My JSM 2016 itinerary


JSM 2016 is almost here. I just spent an hour going through the (very) lengthy program. I think that was time well spent, though some might argue I should have been working on my talk instead…

Here is what my itinerary looks like as of today. If you know of a session that you think I might be interested in that I missed, please let me know! And if you go to any one of these sessions and not see me there, it means I got distracted by something else (or something close by).

Sunday, July 31

Unfortunately it looks like I’ll be in meetings all Sunday, but if there is an opportunity to sneak out I would love to see the following sessions:

4PM – 5:50pm

  • Making the Most of R Tools
    • Thinking with Data Using R and RStudio: Powerful Idioms for Analysts — Nicholas Jon Horton, Amherst College ; Randall Pruim, Calvin College ; Daniel Kaplan, Macalester College
    • Transform Your Workflow and Deliverables with Shiny and R Markdown — Garrett Grolemund, RStudio
    • Discussant: Hadley Wickham, Rice University
  • Media and Statistics
    • Causal Inferences from Observational Studies: Fracking, Earthquakes, and Oklahoma — Howard Wainer, NBME
    • It’s Not What We Say, It’s Not What They Hear, It’s What They Say They Heard — Barry Nussbaum, EPA
    • Bad Statistics, Bad Reporting, Bad Impact on Patients: The Story of the PACE Trial — Julie Rehmeyer, Discover Magazine
    • Can Statisticians Enlist the Media to Successfully Change Policy? — Donald A. Berry, MD Anderson Cancer Center
    • Discussant: Jessica Utts, University of California at Irvine

I’ll also be attending the ASA Awards Celebration (6:30 – 7:30pm) this evening.

Monday, August 1

On Monday there are a couple ASA DataFest related meetings. If you organized a DataFest in 2016, or would like to organize one in 2017 (especially if you will be doing so for the first time), please join us. Both meetings will be held at Hilton Chicago Hotel, Room H-PDR3.

  • 10:30am – 2016 ASA DataFest Debrief Meeting
  • 1pm – 2017ASA DataFest Planning Meeting

8:30AM – 10:20AM

  • Applied Data Visualization in Industry and Journalism
    • Linked Brushing in R — Hadley Wickham, Rice University
    • Creating Data Visualization Tools at Facebook — Andreas Gros, Facebook
    • Cocktail Party Horror Stories About Data Vis for Clients — Lynn Cherny, Ghostweather R&D
    • Visualizing the News at FiveThirtyEight — Andrei Scheinkman, FiveThirtyEight.com
    • Teaching Data Visualization to 100k Data Scientists: Lessons from Evidence-Based Data Analysis — Jeffrey Leek, Johns Hopkins Bloomberg School of Public Health

If I could be in two places at once, I’d also love to see:

2PM – 3:50pm

I am planning on splitting my time between

4:45pm – 6:15pm

ASA President’s Invited Address – Science and News: A Marriage of Convenience — Joe Palca, NPR


I’ll be splitting my time between the Statistical Computing and Graphics Mixer (6 – 8pm) and the Duke StatSci Dinner.

Tuesday, August 2

8:30AM – 10:20am

  • Introductory Overview Lecture: Data Science
    • On Mining Big Data and Social Network Analysis — Philip S. Yu, University of Illinois at Chicago
    • On Computational Thinking and Inferential Thinking — Michael I. Jordan, University of California at Berkeley

10:30AM – 12:20pm

I’m organizing and chairing the following invited session. I think we have a fantastic line up. Hoping to see many of you in the audience!

  • Doing More with Data in and Outside the Undergraduate Classroom
    • Computational Thinking and Statistical Thinking: Foundations of Data Science — Ani Adhikari, University of California at Berkeley ; Michael I. Jordan, University of California at Berkeley
    • Learning Communities: An Emerging Platform for Research in Statistics — Mark Daniel Ward, Purdue University
    • The ASA DataFest: Learning by Doing — Robert Gould, University of California at Los Angeles
    • Statistical Computing as an Introduction to Data Science — Colin Rundel, Duke University

If I could be in two places at once, I’d also love to see:

2PM – 3:50pm

  • Interactive Visualizations and Web Applications for Analytics
    • Radiant: A Platform-Independent Browser-Based Interface for Business Analytics in R — Vincent Nijs, Rady School of Management
    • Rbokeh: An R Interface to the Bokeh Plotting Library — Ryan Hafen, Hafen Consulting
    • Composable Linked Interactive Visualizations in R with Htmlwidgets and Shiny — Joseph Cheng, RStudio
    • Papayar: A Better Interactive Neuroimage Plotter in R — John Muschelli, The Johns Hopkins University
    • Interactive and Dynamic Web-Based Graphics for Data Analysis — Carson Sievert, Iowa State University
    • HTML Widgets: Interactive Visualizations from R Made Easy! — Yihui Xie, RStudio ; Ramnath Vaidyanathan, Alteryx

If I could be in two places at once, I’d also love to see:


I’ll be splitting my time between the UCLA Statistics/Biostatistics Mixer (5-7pm), Google Cruise, and maybe a peek at the Dance Party.

Sad to be missing the ASA President’s Address – Appreciating Statistics.

Wednesday, August 3

8:30AM – 10:20am

I’m speaking at the following session co-organized by Ben Baumer and myself. If you’re interested in reproducible data analysis, don’t miss it!

  • Reproducibility in Statistics and Data Science
    • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets — Jennifer Bryan, University of British Columbia
    • Steps Toward Reproducible Research — Karl W. Broman, University of Wisconsin – Madison
    • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! — Karthik Ram, University of California at Berkeley
    • Integrating Reproducibility into the Undergraduate Statistics Curriculum — Mine Cetinkaya-Rundel, Duke University
    • Discussant: Yihui Xie, RStudio

If I could be in two places at once, I’d also love to see:

10:30AM – 12:20pm

  • The 2016 Statistical Computing and Graphics Award Honors William S. Cleveland
    • Bill Cleveland: Il Maestro of Statistical Graphics — Nicholas Fisher, University of Sydney
    • Modern Crowd-Sourcing Validates Cleveland’s 1984 Hierarchy of Graphical Elements — Dianne Cook, Monash University
    • Some Reflections on Dynamic Graphics for Data Exploration — Luke-Jon Tierney, University of Iowa
    • Carpe Datum! Bill Cleveland’s Contributions to Data Science and Big Data Analysis — Steve Scott, Google Analytics
    • Scaling Up Statistical Models to Hadoop Using Tessera — Jim Harner, West Virginia University

If I could be in two places at once, I’d also love to see:

2PM – 3:50pm

If I could be in two places at once, I’d also see:

4:45PM – 6:15pm


I’m planning on attending the Section on Statistical Education Meeting / Mixer (6-7:30pm).

Thursday, August 4

8:30AM – 10:20am

I think I have to attend a meeting at this time, but if I get a chance I’d love to see:

  • Big Data and Data Science Education
    • Teaching Students to Work with Big Data Through Visualizations — Shonda Kuiper, Grinnell College
    • A Data Visualization Course for Undergraduate Data Science Students — Silas Bergen, Winona State University
    • Intro Stats for Future Data Scientists — Brianna Heggeseth, Williams College ; Richard De Veaux, Williams College
    • An Undergraduate Data Science Program — James Albert, Bowling Green State University ; Maria Rizzo, Bowling Green State University
    • Modernizing an Undergraduate Multivariate Statistics Class — David Hitchcock, University of South Carolina ; Xiaoyan Lin, University of South Carolina ; Brian Habing, University of South Carolina
    • Business Analytics and Implications for Applied Statistics Education — Samuel Woolford, Bentley University
    • DataSurfing on the World Wide Web: Part 2 — Robin Lock, St. Lawrence University

10:30AM – 12:20pm

  • Showcasing Statistics and Public Policy
    • The Twentieth-Century Reversal: How Did the Republican States Switch to the Democrats and Vice Versa? — Andrew Gelman, Columbia University
    • A Commentary on Statistical Assessment of Violence Recidivism Risk — Peter B. Imrey, Cleveland Clinic ; Philip Dawid, University of Cambridge
    • Using Student Test Scores for Teacher Evaluations: The Pros and Cons of Student Growth Percentiles — J.R. Lockwood, Educational Testing Service ; Katherine E. Castellano, Educational Testing Service ; Daniel F. McCaffrey, Educational Testing Service
    • Discussant: David Banks, Duke University

If I could be in two places, I’d also love to see:

That’s it folks! It’s an ambitious itinerary, let’s hope I get through it all.

I probably won’t get a chance to write daily digests like I’ve tried to do in previous years at JSM, but I’ll tweet about interesting things I hear from @minebocek. I’m sure there will be lots of JSM chatter at #JSM2016 as well.

Now, somebody give me something else to look forward to, and tell me Chicago is cooler than Durham!

Project TIER

Last year I was awarded a Project TIER (Teaching Integrity in Empirical Research) fellowship, and last week my work on the fellowship wrapped up with a meeting with the project leads, other fellows from last year, as well as new fellows for the next year. In a nutshell Project TIER focuses on reproducibility. Here is a brief summary of the project’s focus from their website:

For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

As part of the fellowship, beyond continuing working on integrating reproducible data analysis practices into my courses with the use of literate programming via R Markdown and version control via git/GitHub, I have also created templates two GitHub repositories that follow the Project TIER guidelines: one for use with R and the other with Stata. They both live under the Project TIER organization on GitHub. The idea is that one wishing to follow the folder structure and workflow suggested by Project TIER can make a copy of these repositories and easily organize their work following the TIER guidelines.

There is more work to be done on these of course, first of which is evolving the TIER guidelines themselves to line up better with working with git and R as well as working with tricky data (like large data, or private data, etc.). Some of these are issues the new fellows might tackle in the next year.

As part of the fellowship I also taught a workshop titled “Making your research reproducible with Project TIER, R, and GitHub” to Economics graduate students at Duke. These are students who primarily use Stata so the workshop was a first introduction to this workflow, using the RStudio interface for git and GitHub. Materials for this workshop can be found here. At the end of the workshop I got the sense that very few of these students were interested in making the switch over to R (can’t blame them honestly — if you’ve been working on your dissertation for years and you just want to wrap it up, the last thing you want to do is to have to rewrite all your code and redo your analysis in a different platform) but quite a few of them were interested in using GitHub for both version control and for showcasing their work publicly.

Also as part of the fellowship Ben Baumer (a fellow fellow?) and I have organized a session on reproducibility at JSM 2016 that I am very much looking forward to. See here for the line up.

In summary, being involved with this project was a great eye opener to the fact that there are researchers and educators out there who truly care about issues surrounding reproducibility of data analysis but who are very unlikely to switch over to R because that is not as customary for their discipline (although at least one fellow did after watching my demo on R Markdown in the 2015 meeting, that was nice to see 😁). Discussions around working with Stata made me once again very thankful for R Markdown and RStudio which make literate programming a breeze in R. And what my mean by “a breeze” is “easy to teach to and be adopted by anyone from a novice to expert R user”. It seems to me like it would be in the interest of companies like Stata to implement such a workflow/interface to support reproducibility efforts of researchers and educators using their software. I can’t see a single reason why they wouldn’t invest time (and yes, money) in developing this.

During these discussions a package called RStata also came up. This package is “[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

Overall my involvement with the fellowship was a great experience for meeting and brainstorming with faculty from non-stats disciplines (mostly from the social sciences) who regularly teach in platforms like Stata and SPSS who are also dedicated to teaching reproducible data analysis practices. I’m often the person who tries to encourage people to switch over to R, and I don’t think I’ll be stopping doing that anytime soon, but I do believe that if we want all who do data analysis to do it reproducibly, efforts must be made to (1) come up with workflows that ensure reproducibility in statistical software other than R, and (2) create tools that make reproducible data analysis easier in such software (e.g. tools similar to R Markdown designed specifically for these software).


PS: It’s been a while since I last posted here, let’s blame it on a hectic academic year. I started and never got around to finishing two posts in the past few months that I hope to finish and publish soon. One is about using R Markdown for generating course/TA evaluation reports and the other is on using Slack for managing TAs for a large course. Stay tuned.

PPS: Super excited for #useR2016 starting on Monday. The lack of axe-throwing will be disappointing (those who attended useR 2015 in Denmark know what I’m talking about) but otherwise the schedule promises a great line up!

The African Data Initiative

Are you looking for a way to celebrate World Statistics Day? I know you are. And I can’t think of a better way than supporting the African Data Initiative (ADI).

I’m proud to have met some of the statisticians, statisticis educators and researchers who are leading this initative at an International Association of Statistics Educators Roundtable workshop in Cebu, The Phillipines, in 2012. You can read about Roger and David’s Stern’s projects in Kenya here in the journal Technology Innovations in Statistics Education. This group — represented at the workshop by father-and-son Roger and David, and at-the-time grad students Zacharaiah Mbasu and James Musyoka — impressed me with their determination to improve international statistical literacy and  with their successful and creative pragmatic implementations to adjust to the needs of the local situations in Kenya.

The ADI is seeking funds within the next 18 days to adapt two existing software packages, R and Instat+ so that there is a free, open-source, easy-to-learn statistical software package available and accessible throughout the world. While R is free and open-sourced, it is not easy to learn (particularly in areas where English literacy is low). Instat+ is, they claim, easy to learn but not open-source (and also does not run on Linux or Mac).

One of the exciting things about this project is that these solutions to statistical literacy are being developed by Africans working and researching in Africa, and are not ‘imported’ by groups or corporations with little experience implementing in the local schools. One lesson I’ve learned from my experience working with the Los Angeles Unified School District is that you must work closely with the schools for which you are developing curricula; outsider efforts have a lower chance of success. I hope you’ll take a moment –in the next 18 days–to become acquainted with this worthy project!

World Statistics Day is October 20.  The theme is Better Data. Better Lives.

Reproducibility breakout session at USCOTS

Somehow almost an entire academic year went by without a blog post, I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts are what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students — wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about this now? I was at USCOTS these last few days, and lead a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands on exercises and promote it as a tool for reproducible data analysis and
  2. to demonstrate that with the right exercises and right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one.

In the talk I also discussed briefly further tips for documentation and organization as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how while IRBs are not required for data collected by students for class projects, the discussion of ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistic teachers we ought to be aware of the process of real research and educating our students about the process. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved on from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing in the introductory statistics course issues surrounding data ethics and IRB (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year from to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by being involved with the honor council of her university, when she realized that while the council keeps impeccable records of reported cases, they don’t have any information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example thinking about how to code “any” academic misconduct based on various questions that ask about whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance practice. It’s my experience that students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seem to hit both of those notes.

Using data from the survey, students were asked to analyze two academic outcomes: whether or not student has committed any form of academic misconduct and an outcome of own choosing, and presented their findings in n optional (some form of extra credit) research paper. One example that Shannon gave for the latter task was defining a “serious offender”: is it a student who commits a one time bad offense or a student who habitually commits (maybe nor so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that the hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment Shannon mentioned that the administration at her school was concerned that students finding out about high percentages of academic offense (survey showed that about 60% of students committed a “major” academic offense) might make students think that it’s ok, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking for expanding this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it’s not), projects like these that can remind students that data and statistics can be powerful activism tools.

Community Colleges and the ASA

Rob will be be participating in this event, organized by Nicholas Horton:

CONNECTION WITH COMMUNITY COLLEGES: second in the guidelines for undergraduate statistics programs webinar series

The American Statistical Association endorses the value of undergraduate programs in statistical science, both for statistical science majors and for students in other majors seeking a minor or concentration. Guidelines for such programs were promulgated in 2000, and a new workgroup is working to update them.

To help gather input and identify issues and areas for discussion, the workgroup has organized a series of webinars to focus on different issues.

Connection with Community Colleges
Monday, October 21st, 6:00-6:45pm Eastern Time

Description: Community colleges serve a key role in the US higher education system, accounting for approximately 40% of all enrollments. In this webinar, representatives from community colleges and universities with many community college transfers will discuss the interface between the systems and ways to prepare students for undergraduate degrees and minors in statistics.

The webinar is free to attend, and a recording will be made available after the event.  To sign up, please email Rebecca Nichols (rebecca@amstat.org).

More information about the existing curriculum guidelines as well as a survey can be found at:


JSM 2013 – Days 4 and 4.5

I started off my Wednesday with the “The New Face of Statistics Education (#480)” session. Erin Blackenship from UNL talked about their second course in statistics, a math/stat course where students don’t just learn how to calculate sufficient statistics and unbiased estimators but also learn what the values they’re calculating mean in context of the data. The goal of the course is to bring together the kind of reasoning emphasized in intro stat courses with the mathematical rigor of a traditional math/stat course. Blackenship mentioned that almost 90% of the students taking the class are actuarial science students who need to pass the P exam (the first actuarial exam) therefore the probability theory must be a major component of the course. However UNL has been bridging the gap between these demands and the GAISE guidelines by introducing technology to the course (simulating empirical sampling distributions, checking distributional assumptions, numerical approximation) as well as using writing assessments to improve and evaluate student learning. For example, students are asked to explain in their own words the difference between a sufficient statistic and minimal sufficient statistic, and answers that put things in context instead of regurgitating differences are graded highly. This approach not only allows students who struggle with math to demonstrate understanding, but it also reveals shallow understanding of students who might be testing well in terms of the math by simply going through the mechanics.

In my intro stat class I used to ask similar questions on exams, but have been doing so less and less lately in the interest of time spent on grading (they can be tedious to grade). However lately I’ve been trying to incorporate more activities into the class, and I’m thinking such exercises might be quite appropriate as class activities where students work in teams to perfect their answers and perhaps even teams then grading each others’ answers.

Anyway, back to the session… Another talk in the session given by Chris Malone from Winona State was about modernizing the undergraduate curriculum. Chris made the point that we need much more than just cosmetic changes as he believes the current undergraduate curriculum is disconnected from what graduates are doing when they get their first job. His claim was that the current curriculum is designed for the student who is going on to graduate school in statistics, but that that’s only about a fifth of the students in undergraduate majors. (As an aside, I would have guessed the ratio to be even lower.) He advocated for more computing in the undergrad curriculum, a common thread among many of the education talks at JSM this year, and described a few new programs at Winona and other universities on data science. Another common thread was this discussion of “data science” vs. “statistics”, but I’m not going to go there – at least not in this post. (If you’re interested in this discussion, this Simply Statistician post initiated a good conversation on the topic in the comments section.) I started making a list of Data Science programs I found while searching online but this post seems to have a pretty exhaustive list (original post dates back to 2012 but it seems to be updated regularly).

Other notes from the day:
R visreg package looks pretty cool, though perhaps not necessarily very useful for an intro stat course where we don’t cover interactions, non-linear regression, etc.
– There is another DataFest like competition going on in the Midwest: MUDAC – maybe we should do a contributed session at JSM next year where organizers share experiences with each other and the audience to solicit more interest in their events or inspire others.

On Thursday I only attended one session: “Teaching the Fundamentals (#699)” (the very last session, mine). You can find my slides for my talk on using R Markdown to teach data analysis in R as well as to instill the importance of reproducible research early on here.

One of the other speakers in my session was Robert Jernigan, who I recognize from this video. He talked about how students confuse “diversity” and “variability” and hence have a difficult time understanding why a dataset like [60,60,60,10,10,10] has a higher standard deviation than a dataset like [10,20,30,40,50,60]. He also mentioned his blog statpics.com, which seems to have some interesting examples of images like the ones in his video on distributions.

John Walker from Cal Poly San Luis Obispo discussed his experiment on how well students can recognize normal and non-normal distributions using normal probability plots — a standard approach for checking conditions for many statistical methods. He showed that faculty do significantly better than students, which I suppose means that you do get better at this with more exposure. However the results aren’t final, and he is considering some changes to his design. I’m eager to see the final results of his experiment, especially if they come with some evidence/suggestions for what the best method to teach this skill is.