Measurement error in intro stats

I have recently been asked by my doctor to closely monitor my blood pressure, and report it if it’s above a certain cutoff. Sometimes I end up reporting it by calling a nurse line, sometimes directly to a doctor in person. The reactions I get vary from “oh, it happens sometimes, just take it again in a bit” to “OMG the end of the world is coming!!!” (ok, I’m exaggerating, but you get the idea). This got me thinking: does the person I’m talking to understand measurement error? Which then got me thinking: I routinely teach intro stats courses that, for some students, are the only stats (and potentially the only quantitative reasoning) course they will take in college. Do I discuss measurement error properly in these courses? I’m afraid the answer is no… It’s certainly mentioned within the context of a few case studies, but I can’t say that it gets the emphasis it deserves. I also browsed through a few intro stats books (including mine!) and found not a single mention of “measurement error” specifically.

I’m always hesitant to make statements like “we should teach this in intro stats” because I know most intro stats curricula are already pretty bloated, and it’s not feasible to cover everything in one course. But this seems to be such a crucial concept for having meaningful conversations with health providers and making better decisions (or staying calm) about one’s health that I think it is indeed worth spending a little bit of time on.
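
For anyone who wants to make the idea concrete in class, here is a minimal simulation sketch. It assumes readings equal a fixed true value plus independent, normally distributed measurement error. The numbers (a true systolic pressure of 130, an error standard deviation of 8, a cutoff of 140) and the function names are made up for illustration, not clinical values.

```python
import random
import statistics


def simulate_readings(true_value, error_sd, n, rng):
    """Draw n readings as the true value plus independent Gaussian measurement error."""
    return [rng.gauss(true_value, error_sd) for _ in range(n)]


def share_above(true_value, error_sd, cutoff, readings_per_report, n_reports, rng):
    """Fraction of reports (each the mean of several readings) that exceed the cutoff."""
    above = 0
    for _ in range(n_reports):
        readings = simulate_readings(true_value, error_sd, readings_per_report, rng)
        if statistics.mean(readings) > cutoff:
            above += 1
    return above / n_reports


rng = random.Random(2016)  # fixed seed so the run is repeatable

# Hypothetical values: true pressure 130, error SD 8, reporting cutoff 140.
single = share_above(130, 8, 140, readings_per_report=1, n_reports=10_000, rng=rng)
averaged = share_above(130, 8, 140, readings_per_report=3, n_reports=10_000, rng=rng)

print(f"single reading above cutoff: {single:.1%}")
print(f"mean of three above cutoff:  {averaged:.1%}")
```

Under these assumptions the true value never changes, yet a single reading still crosses the cutoff a noticeable share of the time, and averaging a few readings makes that much rarer. That is essentially the “just take it again in a bit” advice.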

Slack for managing course TAs

I meant to write this post last year when I was teaching a large course with lots of teaching assistants to manage, but, well, I was teaching a large course with lots of teaching assistants to manage, so I ran out of time…

There is nothing all that revolutionary here. People have been using Slack to manage teams for a while now. I’ve even come across some articles / posts on using Slack as a course discussion forum, so use of Slack in an educational setting is not all that new either. But I have not heard of people using Slack for organizing the course and managing TAs, so I figured it might be worthwhile to write about my experience.

TL;DR: A+, would do it again!

I’ll be honest, when I first found out about Slack, I wasn’t all that impressed. First, I kept thinking it’s called Slacker, and I was like, “hey, I’m no slacker!” (I totally am…). Second, I initially thought one had to use Slack in the browser, and accidentally kept closing the tab and hence missing messages. There is a Slack app that you can run on your computer or phone; it took me a while to realize that. Because of my rocky start with it, I didn’t think to use Slack in my teaching. I must credit my co-instructor, Anthea Monod, for the idea of using Slack for communicating with our TAs.

Between the two instructors we had 12 TAs to manage. We set up a Slack team for the course with channels like #labs, #problem_sets, #office_hours, #meetings, etc.

This setup worked really well for us for a variety of reasons:

  • Keep course management related emails out of email inbox: These really add up. At this point, any email I can keep out of my inbox is a win in my book!
  • Easily keep all TAs in the loop: Need to announce a typo in a solution key? Or give TAs a heads up about questions they might expect in office hours? I used to handle these by emailing them all, and either I’d miss one or two or a TA responding to my email would forget to reply all (people never seem to reply all when they should, but they always do when they shouldn’t!)
  • Provide a space for TAs to easily communicate with each other: Our TAs used Slack to let others know they might need someone to cover for them for office hours, or teaching a section, etc. It was nice to be able to alert all of them at once, and also for everyone to see when someone responded saying they’re available to cover.
  • Keep a record of decisions made in an easily searchable space: Slack’s search is not great, but it’s better than my email’s for sure. Plus, since you’re searching only within that team’s communication, as opposed to through all your emails, it’s a lot easier to find what you’re looking for.
  • It’s fun: The #random channel was a place people shared funny tidbits or cool blog posts etc. I doubt the TAs would be emailing each other with these if this communication channel wasn’t there. It made them act more like a community than they would otherwise.
  • It’s free: At least for a reasonable amount of usage for a semester long course.

Some words of advice if you decide to use Slack for managing your own course:

  • There is a start-up cost: Not cost as in $$, but cost as in time… At the beginning of the semester you’ll need to make sure everyone gets in the team and sets up Slack on their devices. We did this during our first meeting, which was a lot more efficient than emailing reminders.
  • It takes time for people to break their emailing habits: For the first couple weeks TAs would still email me their questions instead of using Slack. It took some time and nudging, but eventually everyone shifted all course related communication to Slack.

If you’re teaching a course with TAs this semester, especially a large one with many people to manage, I strongly recommend giving Slack a try.

A timely first day of class example for Fall 2016: Trump Tweets

On the first day of an intro stats or intro data science course I enjoy giving some accessible real data examples, instead of spending the whole time going over the syllabus (which is necessary in my opinion, but somewhat boring nonetheless).

One of my favorite examples is How to Tell Someone’s Age When All You Know Is Her Name from FiveThirtyEight. As an added bonus, you can use this example to get to know some students’ names. I usually go through a few of the visualizations in this article, asking students to raise their hands if their name appears in the visualization. Sometimes I also supplement this with the Baby Name Voyager; it’s fun to have students offer up their names so we can take a look at how their popularity has changed over the years.


Another example I like is the Locals and Tourists Flickr Photos. If I remember correctly I saw this example first in Mark Hanson’s class in grad school. These maps use data from geotags on Flickr: blue pictures are taken by locals, red pictures are by tourists, and yellow pictures might be by either. The map of Manhattan is one most students will recognize, since many people know where Times Square and Central Park are, both of which have an abundance of red (tourist) pictures. And if your students watch enough Law & Order to know where Rikers Island is, they might notice that, unsurprisingly, no pictures are posted from that location.

However, if I were teaching a class this coming Fall, I would add the following analysis of Donald Trump’s tweets to my list of examples. If you have not yet seen this analysis by David Robinson, I recommend you stop what you’re doing now and go read it. It’s linked below:

Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half

I’m not going to re-iterate the post here, but the gist of it is that the @realDonaldTrump account tweets from two different phones, and that

the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.


I think this post would be a fantastic and timely first day of class example for a stats / data analysis / data science course. It shows a pretty easy to follow analysis complete with the R code to reproduce it. It uses some sentiment analysis techniques that may not be the focus of an intro course, but since the context will be familiar to students it shouldn’t be too confusing for them. It also features techniques one will likely cover in an intro course, like confidence intervals.

As a bonus, many popular media outlets have covered the analysis in the last few days (e.g. see here, here, and here), and some of those articles might be even easier on the students to begin with before delving into the analysis in the blog post. Personally, I would start by playing this clip from the CTV News Channel featuring an interview with David to provide the context first (a video always helps wake students up), and then move on to discussing some of the visualizations from the blog post.

Michael Phelps’ hickies

Ok, they’re not hickies, but NPR referred to them as such, so I’m going with it… I’m talking about the cupping marks.

The NPR story can be heard (or read) here. There were two points made in this story that I think would be useful and fun to discuss in a stats course.

The first is the placebo effect. Oftentimes in intro stats courses the placebo effect is mentioned as something undesirable that must be controlled for. This is true, but in this case the “placebo effect from cupping could work to reduce pain with or without an underlying physical benefit”. While there isn’t sufficient scientific evidence for the positive physical effect of cupping, the placebo effect might be just enough to give an individual Olympian the edge to outperform others by a small margin.

This brings me to my second point, the individual effect on extreme cases vs. a statistically significant effect on a population parameter. I briefly did a search on Google Scholar for studies on the effectiveness of cupping and most use t-tests or ANOVAs to evaluate the effect on some average pain / severity of symptom score. If we can assume no adverse effect from cupping, might it still make sense for an individual to give the treatment a try even if the treatment has not been shown to statistically significantly improve average pain? I think this would be an interesting, and timely, question to discuss in class when introducing a method like the t-test. Often in tests of significance on a mean, the variance of a treatment effect is viewed as a nuisance factor that is only useful for figuring out the variability of the sampling distribution of the mean, but in this case the variance of the treatment effect on individuals might also be of interest.
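
To start this discussion in class, one option is a small simulation along the following lines. This is only a sketch under made-up assumptions, not an analysis of any cupping study: pain scores are normal with mean 50 and standard deviation 10, each treated person has an individual effect drawn from a normal distribution with mean -1 and standard deviation 8, and each group has 30 people. The individual effects are visible here only because the data are simulated; in a real two-group study they are not directly observed.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two independent samples."""
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    se = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return mean_diff / se


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 30                   # hypothetical group size

# Hypothetical pain scores: controls have no treatment effect.
control = [rng.gauss(50, 10) for _ in range(n)]

# Each treated person gets an individual effect: small on average, widely varying.
effects = [rng.gauss(-1, 8) for _ in range(n)]
treated = [rng.gauss(50, 10) + effect for effect in effects]

t_stat = welch_t(treated, control)

# Share of treated individuals whose own effect is a reduction of 5 points or more.
big_responders = sum(effect <= -5 for effect in effects) / n

print(f"Welch t statistic for the mean difference: {t_stat:.2f}")
print(f"share with an individual reduction >= 5:   {big_responders:.0%}")
```

Under these assumed parameters roughly 3 in 10 individuals would be expected to have a reduction of 5 points or more, even though the average effect is only 1 point. That gap is the contrast students can be asked to interpret next to the t statistic.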

While my brief search didn’t result in any datasets on cupping, the following articles contain some summary statistics or citations to studies that report such statistics that one could bring into the classroom:

PS: I wanted to include a picture of these cupping marks on Michael Phelps, but I couldn’t easily find an image that was free to use or share. You can see a picture here.

PPS: Holy small sample sizes in some of the studies I came across!

JSM 2016 session on “Doing more with data”

The ASA’s most recent curriculum guidelines emphasize the increasing importance of data science, real applications, model diversity, and communication / teamwork in undergraduate education. In an effort to highlight recent efforts inspired by these guidelines, I organized a JSM session titled Doing more with data in and outside the undergraduate classroom. This session featured talks on recent curricular and extra-curricular efforts in this vein, with a particular emphasis on challenging students with real and complex data and data analysis. The speakers discussed how these pedagogical innovations aim to educate and engage the next generation, and help them acquire the statistical and data science skills necessary to succeed in a future of ever-increasing data. I’m posting the slides from this session for those who missed it as well as for those who want to review the resources linked in the slides.

Computational Thinking and Statistical Thinking: Foundations of Data Science

by Ani Adhikari and Michael I. Jordan, University of California at Berkeley


Learning Communities: An Emerging Platform for Research in Statistics

by Mark Daniel Ward, Purdue University


The ASA DataFest: Learning by Doing

by Robert Gould, University of California at Los Angeles

(See if you’re interested in organizing an ASA DataFest at your institution.)


Statistical Computing as an Introduction to Data Science

by Colin Rundel, Duke University [GitHub]

JSM 2016 roundtable on open resources in statistics education

Monday morning at JSM 2016 Andrew Bray and I hosted a roundtable on integrating open access and open source statistics education materials. It was a fruitful discussion with participants from 2-year colleges, 4-year colleges, and industry.

In preparation for the roundtable we put together a one-page handout listing a sampling of open access and open source statistics resources, with links to the resources. The handout is below for anyone who is interested (click on the image to get to the PDF with hyperlinks), and if you think of other resources that would be useful to list here, please comment below and I’ll periodically update the list.

Open resources for stat ed

JSM 2016 session on Reproducibility in Statistics and Data Science

Ten years after Ioannidis alleged that most scientific findings are false, reproducibility (or lack thereof) has become a full-blown crisis in science. Flagship journals like Nature and Science have published hand-wringing editorials and revised their policies in the hopes of heightening standards of reproducibility. In the statistical and data sciences, the barriers towards reproducibility are far lower, given that our analysis can usually be digitally encoded (e.g., scripts, algorithms, data files, etc.). Failure to ensure the credibility of our contributions will erode “the extraordinary power of statistics,” both among our colleagues and in our collaborations with scientists of all fields. This morning’s JSM session on Reproducibility in Statistics and Data Science featured talks on recent efforts in pursuit of reproducibility. The slides of talks by the speakers and the discussant are posted below.

Note that some links point to a GitHub repo that includes slides as well as other useful resources for the talk and for adopting reproducible frameworks for your research and teaching. I’m also including Twitter handles for the speakers, since Twitter is likely the most efficient way to get in touch with them if you have any questions.

This session was organized by Ben Baumer and myself as part of our Project TIER fellowship. Many thanks to Amelia McNamara, who is also a Project TIER fellow, for chairing the session (and correctly pronouncing my name)!

  • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets – Jenny Bryan – repo, including slides – @JennyBryan
  • Steps Toward Reproducible Research – Karl Broman – slides – @kwbroman
  • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! – Karthik Ram – slides – @_inundata
  • Integrating Reproducibility into the Undergraduate Statistics Curriculum – Mine Çetinkaya-Rundel – repo, including slides – @minebocek
  • Discussant: Yihui Xie – slides – @xieyihui

PS: Don’t miss this gem of a repo for links to many many more JSM 2016 slides. Thanks Karl for putting it together!

My JSM 2016 itinerary


JSM 2016 is almost here. I just spent an hour going through the (very) lengthy program. I think that was time well spent, though some might argue I should have been working on my talk instead…

Here is what my itinerary looks like as of today. If you know of a session that you think I might be interested in that I missed, please let me know! And if you go to any one of these sessions and don’t see me there, it means I got distracted by something else (or something close by).

Sunday, July 31

Unfortunately it looks like I’ll be in meetings all Sunday, but if there is an opportunity to sneak out I would love to see the following sessions:

4pm – 5:50pm

  • Making the Most of R Tools
    • Thinking with Data Using R and RStudio: Powerful Idioms for Analysts — Nicholas Jon Horton, Amherst College ; Randall Pruim, Calvin College ; Daniel Kaplan, Macalester College
    • Transform Your Workflow and Deliverables with Shiny and R Markdown — Garrett Grolemund, RStudio
    • Discussant: Hadley Wickham, Rice University
  • Media and Statistics
    • Causal Inferences from Observational Studies: Fracking, Earthquakes, and Oklahoma — Howard Wainer, NBME
    • It’s Not What We Say, It’s Not What They Hear, It’s What They Say They Heard — Barry Nussbaum, EPA
    • Bad Statistics, Bad Reporting, Bad Impact on Patients: The Story of the PACE Trial — Julie Rehmeyer, Discover Magazine
    • Can Statisticians Enlist the Media to Successfully Change Policy? — Donald A. Berry, MD Anderson Cancer Center
    • Discussant: Jessica Utts, University of California at Irvine

I’ll also be attending the ASA Awards Celebration (6:30 – 7:30pm) this evening.

Monday, August 1

On Monday there are a couple ASA DataFest related meetings. If you organized a DataFest in 2016, or would like to organize one in 2017 (especially if you will be doing so for the first time), please join us. Both meetings will be held at Hilton Chicago Hotel, Room H-PDR3.

  • 10:30am – 2016 ASA DataFest Debrief Meeting
  • 1pm – 2017 ASA DataFest Planning Meeting

8:30am – 10:20am

  • Applied Data Visualization in Industry and Journalism
    • Linked Brushing in R — Hadley Wickham, Rice University
    • Creating Data Visualization Tools at Facebook — Andreas Gros, Facebook
    • Cocktail Party Horror Stories About Data Vis for Clients — Lynn Cherny, Ghostweather R&D
    • Visualizing the News at FiveThirtyEight — Andrei Scheinkman,
    • Teaching Data Visualization to 100k Data Scientists: Lessons from Evidence-Based Data Analysis — Jeffrey Leek, Johns Hopkins Bloomberg School of Public Health

If I could be in two places at once, I’d also love to see:

2pm – 3:50pm

I am planning on splitting my time between

4:45pm – 6:15pm

ASA President’s Invited Address – Science and News: A Marriage of Convenience — Joe Palca, NPR


I’ll be splitting my time between the Statistical Computing and Graphics Mixer (6 – 8pm) and the Duke StatSci Dinner.

Tuesday, August 2

8:30am – 10:20am

  • Introductory Overview Lecture: Data Science
    • On Mining Big Data and Social Network Analysis — Philip S. Yu, University of Illinois at Chicago
    • On Computational Thinking and Inferential Thinking — Michael I. Jordan, University of California at Berkeley

10:30am – 12:20pm

I’m organizing and chairing the following invited session. I think we have a fantastic line up. Hoping to see many of you in the audience!

  • Doing More with Data in and Outside the Undergraduate Classroom
    • Computational Thinking and Statistical Thinking: Foundations of Data Science — Ani Adhikari, University of California at Berkeley ; Michael I. Jordan, University of California at Berkeley
    • Learning Communities: An Emerging Platform for Research in Statistics — Mark Daniel Ward, Purdue University
    • The ASA DataFest: Learning by Doing — Robert Gould, University of California at Los Angeles
    • Statistical Computing as an Introduction to Data Science — Colin Rundel, Duke University

If I could be in two places at once, I’d also love to see:

2pm – 3:50pm

  • Interactive Visualizations and Web Applications for Analytics
    • Radiant: A Platform-Independent Browser-Based Interface for Business Analytics in R — Vincent Nijs, Rady School of Management
    • Rbokeh: An R Interface to the Bokeh Plotting Library — Ryan Hafen, Hafen Consulting
    • Composable Linked Interactive Visualizations in R with Htmlwidgets and Shiny — Joseph Cheng, RStudio
    • Papayar: A Better Interactive Neuroimage Plotter in R — John Muschelli, The Johns Hopkins University
    • Interactive and Dynamic Web-Based Graphics for Data Analysis — Carson Sievert, Iowa State University
    • HTML Widgets: Interactive Visualizations from R Made Easy! — Yihui Xie, RStudio ; Ramnath Vaidyanathan, Alteryx

If I could be in two places at once, I’d also love to see:


I’ll be splitting my time between the UCLA Statistics/Biostatistics Mixer (5-7pm), Google Cruise, and maybe a peek at the Dance Party.

Sad to be missing the ASA President’s Address – Appreciating Statistics.

Wednesday, August 3

8:30am – 10:20am

I’m speaking at the following session co-organized by Ben Baumer and myself. If you’re interested in reproducible data analysis, don’t miss it!

  • Reproducibility in Statistics and Data Science
    • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets — Jennifer Bryan, University of British Columbia
    • Steps Toward Reproducible Research — Karl W. Broman, University of Wisconsin – Madison
    • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! — Karthik Ram, University of California at Berkeley
    • Integrating Reproducibility into the Undergraduate Statistics Curriculum — Mine Cetinkaya-Rundel, Duke University
    • Discussant: Yihui Xie, RStudio

If I could be in two places at once, I’d also love to see:

10:30am – 12:20pm

  • The 2016 Statistical Computing and Graphics Award Honors William S. Cleveland
    • Bill Cleveland: Il Maestro of Statistical Graphics — Nicholas Fisher, University of Sydney
    • Modern Crowd-Sourcing Validates Cleveland’s 1984 Hierarchy of Graphical Elements — Dianne Cook, Monash University
    • Some Reflections on Dynamic Graphics for Data Exploration — Luke-Jon Tierney, University of Iowa
    • Carpe Datum! Bill Cleveland’s Contributions to Data Science and Big Data Analysis — Steve Scott, Google Analytics
    • Scaling Up Statistical Models to Hadoop Using Tessera — Jim Harner, West Virginia University

If I could be in two places at once, I’d also love to see:

2pm – 3:50pm

If I could be in two places at once, I’d also see:

4:45pm – 6:15pm


I’m planning on attending the Section on Statistical Education Meeting / Mixer (6-7:30pm).

Thursday, August 4

8:30am – 10:20am

I think I have to attend a meeting at this time, but if I get a chance I’d love to see:

  • Big Data and Data Science Education
    • Teaching Students to Work with Big Data Through Visualizations — Shonda Kuiper, Grinnell College
    • A Data Visualization Course for Undergraduate Data Science Students — Silas Bergen, Winona State University
    • Intro Stats for Future Data Scientists — Brianna Heggeseth, Williams College ; Richard De Veaux, Williams College
    • An Undergraduate Data Science Program — James Albert, Bowling Green State University ; Maria Rizzo, Bowling Green State University
    • Modernizing an Undergraduate Multivariate Statistics Class — David Hitchcock, University of South Carolina ; Xiaoyan Lin, University of South Carolina ; Brian Habing, University of South Carolina
    • Business Analytics and Implications for Applied Statistics Education — Samuel Woolford, Bentley University
    • DataSurfing on the World Wide Web: Part 2 — Robin Lock, St. Lawrence University

10:30am – 12:20pm

  • Showcasing Statistics and Public Policy
    • The Twentieth-Century Reversal: How Did the Republican States Switch to the Democrats and Vice Versa? — Andrew Gelman, Columbia University
    • A Commentary on Statistical Assessment of Violence Recidivism Risk — Peter B. Imrey, Cleveland Clinic ; Philip Dawid, University of Cambridge
    • Using Student Test Scores for Teacher Evaluations: The Pros and Cons of Student Growth Percentiles — J.R. Lockwood, Educational Testing Service ; Katherine E. Castellano, Educational Testing Service ; Daniel F. McCaffrey, Educational Testing Service
    • Discussant: David Banks, Duke University

If I could be in two places, I’d also love to see:

That’s it folks! It’s an ambitious itinerary, let’s hope I get through it all.

I probably won’t get a chance to write daily digests like I’ve tried to do in previous years at JSM, but I’ll tweet about interesting things I hear from @minebocek. I’m sure there will be lots of JSM chatter at #JSM2016 as well.

Now, somebody give me something else to look forward to, and tell me Chicago is cooler than Durham!

Statistics with R on Coursera

I held off on posting about this until we had all the courses ready, and we still have a bit more work to do on the last component, but I’m proud to announce that the specialization called Statistics with R is now on Coursera!

Some of you might know that I’ve had a course on Coursera for a while now (whatever “a while” means in MOOC-land), but it was time to refresh things a bit to align the course with other Coursera offerings: shorter, modular, etc. So I chopped up the old course into bite-size chunks and made some enhancements in each component such as

  • integrating dplyr and ggplot2 syntax into the R labs,
  • restructuring the labs to be completed in R Markdown to provide better scaffolding for a data analysis project for each course,
  • adding Shiny apps to some of the labs to better demonstrate statistical concepts without burdening the learners with coding beyond the level of the course,
  • creating an R package that contains all the data, custom functions, etc. used in the course, and
  • cleaning things up a bit to make the weekly workload consistent across weeks.

The underlying code for the labs and the package can be found at Here you can also find the R code for reproducing some of the figures and analyses shown on the course slides (and we’ll keep adding to that repo in the next few weeks).

The biggest change between the old course and the new specialization though is a completely new course: Bayesian Statistics. I touched on Bayesian inference a bit in my old course, and this generated lots of discussion on the course forums from learners wanting more on this content. Being at Duke, I figured who better to offer this course but us! (If you know anything about the Statistical Science department at Duke, you probably know it’s pretty Bayesian.) Note, I didn’t say “me”, I said “us”. I was able to convince a few colleagues (David Banks, Merlise Clyde, and Colin Rundel) to join me in developing this course, and I’m glad I did! Figuring out exactly how to teach this content in an effective way without assuming too much mathematical background took lots of thinking (and re-thinking, and re-thinking). We have also managed to feature a few interviews with researchers in academia and industry, such as Jim Berger (Duke), David Dunson (Duke), Amy Herring (UNC), and Steve Scott (Google) to provide a bit more context for learners on where and why Bayesian statistics is relevant. This course launched today, and I’m looking forward to seeing the feedback from the learners.

If you’re interested in the specialization, you can find out more about it here. The courses in the specialization are:

  1. Introduction to Probability and Data
  2. Inferential Statistics
  3. Linear Regression and Modeling
  4. Bayesian Statistics
  5. Statistics Capstone Project

You can take the courses individually or sign up for the whole specialization, but to do the capstone you need to have completed the 4 courses in the specialization. The landing page for the specialization outlines in further detail how to navigate everything, and relevant dates and deadlines.

Also note that while the graded components of the courses, which allow you to pursue a certificate, require payment, one can audit the courses for free and watch videos, complete practice quizzes, and work on the labs.

Project TIER

Last year I was awarded a Project TIER (Teaching Integrity in Empirical Research) fellowship, and last week my work on the fellowship wrapped up with a meeting with the project leads, other fellows from last year, as well as new fellows for the next year. In a nutshell Project TIER focuses on reproducibility. Here is a brief summary of the project’s focus from their website:

For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

As part of the fellowship, beyond continuing to work on integrating reproducible data analysis practices into my courses with the use of literate programming via R Markdown and version control via git/GitHub, I have also created two template GitHub repositories that follow the Project TIER guidelines: one for use with R and the other with Stata. They both live under the Project TIER organization on GitHub. The idea is that anyone wishing to follow the folder structure and workflow suggested by Project TIER can make a copy of these repositories and easily organize their work following the TIER guidelines.
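
To give a flavor of what starting from a fixed folder structure looks like, here is a small scaffolding sketch. The folder names are placeholders I made up, loosely based on the file types mentioned in the description above (data files, command files, metadata). They are not the official TIER folder names, and this script is not part of the template repositories.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names for illustration only; check the TIER
# documentation for the layout the protocol actually specifies.
FOLDERS = ["data/original", "data/analysis", "command-files", "documents"]


def scaffold(root, folders=FOLDERS):
    """Create the project folders under root, each with a placeholder README."""
    root = Path(root)
    created = []
    for name in folders:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():  # never overwrite notes that already exist
            readme.write_text(f"Describe the contents of {name} here.\n")
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold(Path(tmp) / "my-project"):
            print(folder)
```

Copying a template repository does the same job with less effort, and it also brings along the documentation and version history.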

There is more work to be done on these of course, first of which is evolving the TIER guidelines themselves to line up better with working with git and R as well as working with tricky data (like large data, or private data, etc.). Some of these are issues the new fellows might tackle in the next year.

As part of the fellowship I also taught a workshop titled “Making your research reproducible with Project TIER, R, and GitHub” to Economics graduate students at Duke. These are students who primarily use Stata so the workshop was a first introduction to this workflow, using the RStudio interface for git and GitHub. Materials for this workshop can be found here. At the end of the workshop I got the sense that very few of these students were interested in making the switch over to R (can’t blame them honestly — if you’ve been working on your dissertation for years and you just want to wrap it up, the last thing you want to do is to have to rewrite all your code and redo your analysis in a different platform) but quite a few of them were interested in using GitHub for both version control and for showcasing their work publicly.

Also as part of the fellowship Ben Baumer (a fellow fellow?) and I have organized a session on reproducibility at JSM 2016 that I am very much looking forward to. See here for the line up.

In summary, being involved with this project was a great eye opener to the fact that there are researchers and educators out there who truly care about issues surrounding reproducibility of data analysis but who are very unlikely to switch over to R because that is not as customary for their discipline (although at least one fellow did after watching my demo on R Markdown in the 2015 meeting, that was nice to see 😁). Discussions around working with Stata made me once again very thankful for R Markdown and RStudio, which make literate programming a breeze in R. And what I mean by “a breeze” is “easy to teach to and be adopted by anyone from a novice to expert R user”. It seems to me like it would be in the interest of companies like Stata to implement such a workflow/interface to support reproducibility efforts of researchers and educators using their software. I can’t see a single reason why they wouldn’t invest time (and yes, money) in developing this.

During these discussions a package called RStata also came up. This package is “[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

Overall my involvement with the fellowship was a great experience for meeting and brainstorming with faculty from non-stats disciplines (mostly from the social sciences) who regularly teach in platforms like Stata and SPSS and who are also dedicated to teaching reproducible data analysis practices. I’m often the person who tries to encourage people to switch over to R, and I don’t think I’ll be stopping doing that anytime soon, but I do believe that if we want all who do data analysis to do it reproducibly, efforts must be made to (1) come up with workflows that ensure reproducibility in statistical software other than R, and (2) create tools that make reproducible data analysis easier in such software (e.g. tools similar to R Markdown designed specifically for such software).


PS: It’s been a while since I last posted here, let’s blame it on a hectic academic year. I started and never got around to finishing two posts in the past few months that I hope to finish and publish soon. One is about using R Markdown for generating course/TA evaluation reports and the other is on using Slack for managing TAs for a large course. Stay tuned.

PPS: Super excited for #useR2016 starting on Monday. The lack of axe-throwing will be disappointing (those who attended useR 2015 in Denmark know what I’m talking about) but otherwise the schedule promises a great line up!