Blog Guilt and a Categorical Data Course

I read a blog post entitled On Not Writing, and it felt a little close to home. The author, an academic in a non-tenure-track position, writes,

If you have the luxury to have time to write, do you write scholarship with the hope of forwarding an academic career, or do you write something you might find more fun, and hope to publish it another way?*

The footnote read, “Of course, all of this writing presupposes that the stacks of papers get graded.” Ouch. Too close to home. I sent this on to some of my non-tenure-track peers, and Rob responded that I had tapped into his blog guilt. My blog guilt had already been at an all-time high, and so I vowed that I would immediately post something to Citizen Statistician. Well, that was several weeks ago, but I am finally posting.

Fall semester I taught a PhD seminar on categorical data analysis that I had proposed the previous spring. As with many first-time offerings, the amount of work was staggering and intellectually wonderful. The course notes, assignments, etc. are all available at the course website (which also doubled as the syllabus).

The course, like so many advanced seminars, had very few students taking it for a grade but quite a few auditors. The course projects were a blast to read and resulted in at least two pre-dissertation papers, a written prelim paper, and, so far, two articles that have been submitted to journals!

After some reflection, there are some things I will do differently when I teach this again (likely an every-other-year offering):

  • I would like to spend more time on the classification methods. Although we talked about them a little, the beginning modeling took waaaay more time than I anticipated and I need to re-think that a bit.
  • I would like to cover mixed-effects models for binary outcomes in the future. This wasn’t possible this semester since we only had a regression course as the prerequisite. Now there is a new prerequisite course which includes linear mixed-effects models with continuous outcomes, so at least students will have been exposed to those types of models. That course also includes a much more in-depth introduction to likelihood, so that should also open up some time.
  • I will not teach the ordinal models in the future. Yuck. Disaster.
  • I probably won’t use the Agresti book in the future. While it is quite technical and comprehensive, it is expensive and the students did not like it for the course. I don’t know what I will use instead. Agresti will remain on a resources list.
  • The propensity score methods (PSM) were a hit with the students and those will be included again. I will also probably put together an assignment based on those.
  • I would like to add in survival analysis.

There are a ton of other topics that could be cool, but with limited time they probably aren’t feasible. In general, my thought was to spend the first half of the course introducing and using the logistic and multinomial models and the second half on advanced applications (PSM, classification, etc.).

If anyone has any great ideas or suggestions, please leave comments. Also, I am always on the lookout for some datasets that were used in journal articles or are particularly relevant.


Data Analysis and Statistical Inference starts tomorrow on Coursera

It has been (and still is) lots of work putting this course together, but I’m incredibly excited about the opportunity to teach (and learn from) the masses! Course starts tomorrow (Feb 17, 2014) at noon EST.


A huge thanks also goes out to my student collaborators, who helped develop, review, and revise much of the course materials (and who will be taking on the role of Community TAs on the course discussion forums), and to Duke’s Center for Instructional Technology, which pretty much runs the show.

This course is also part of the Reasoning, Data Analysis and Writing Specialization, along with Think Again: How to Reason and Argue and English Composition 1: Achieving Expertise. This interdisciplinary specialization is designed to strengthen students’ ability to engage with others’ ideas and communicate productively with them by analyzing their arguments, identifying the inferences they are drawing, and understanding the reasons that inform their beliefs. After taking all three courses, students complete an in-depth capstone project where they choose a controversial topic and write an article-length essay in which they use their analysis of the data to argue for their own position about that topic.

Let’s get this party started!

Warning: Mac OS 10.9 Mavericks and R Don’t Play Nicely

For some reason I was compelled to update my Mac’s OS and R on the same day. (I know…) It didn’t go well on several counts, and I mostly blame Apple. Here are the details.

  • I updated R to version 3.0.2 “Frisbee Sailing”
  • I updated my OS to 10.9 “Mavericks”

When I went to use R, things were going fine until I mistyped a command. Rather than giving some sort of syntax error, R responded with,

> *** caught segfault *** 
> address 0x7c0, cause 'memory not mapped' 
> 
> Possible actions: 
> 1: abort (with core dump, if enabled) 
> 2: normal R exit 
> 3: exit R without saving workspace 
> 4: exit R saving workspace 
> Selection:

Unlike most of my experiences with computing, this one I was able to replicate many times. After a day of panic and no luck on Google, I was finally able to find a post on one of the Google Groups from Simon Urbanek responding to someone with a similar problem. He points out that there are a couple of solutions, one of which is to wait until Apple gets things stabilized. (This is an issue since, if you have ever tried to go back to a previous OS on a Mac, you will know that it might take several days of pain and swearing.)

The second solution he suggests is to install the nightly build or rebuild the GUI. To install the nightly build, visit the R for Mac OS X Developer’s page. Or, in Terminal, issue the following commands,

# check out the source for the R.app Mac GUI
svn co https://svn.r-project.org/R-packages/trunk/Mac-GUI
cd Mac-GUI
# build a debug version of the GUI, then launch it
xcodebuild -configuration Debug
open build/Debug/R.app

I tried both and this worked fine…until I needed to load a package. Then I was given an error that the package couldn’t be found. Now I realize that you can download the packages you need from source and compile them yourself, but I was trying to figure out how to deal with students who were in a similar situation. (This is not an option for most social science students.)
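For reference, installing a package from source rather than from a pre-built binary looks roughly like this in R. The package name below is just a placeholder, and source installs of packages with compiled code also need Apple’s command-line developer tools, which is exactly the kind of hurdle most social science students won’t want to clear:

# Install a package from source instead of the pre-built binary.
# "yourpackage" is a placeholder name; packages containing C or Fortran
# code will also need Xcode's command-line tools to compile.
install.packages("yourpackage", type = "source")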

The best solution, it turned out, is to use RStudio, which my students pretty much all use anyway. (My problem is that I am a Sublime Text 2 user.) This allowed the newest version of R to run on the new Mac OS. But, as is pointed out on the RStudio blog,

As a result of a problem between Mavericks and the user interface toolkit underlying RStudio (Qt) the RStudio IDE is very slow in painting and user interactions when running under Mavericks.

I re-downloaded the latest stable release of the R GUI about an hour ago, and so far it seems to be working fine with Mavericks (no abort message yet), so this whole post may be moot.

Crime data and bad graphics

I’m working on the 2nd edition of our textbook, Gould & Ryan, and was looking for some examples of bad statistical graphics. Last time, I used FBI data and created a good and a bad graphic from the data. This time, I was pleased to see that the FBI provided its own bad graphic.

[Figure: FBI crime graph]

This shows a dramatic decrease in crime over the last 5 years.  (Not sure why 2012 data aren’t yet available.) Of course, this graph is only a bad graph if the purpose is to show the rate of decrease.  If you look at it simply as a table of numbers, it is not so bad.

Here’s the graph on the appropriate scale.

[Figure: FBI crime graph on the appropriate scale]

Still, a decrease worth bragging about.  But, alas, somewhat less dramatic.
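If you want to see the scaling effect for yourself, here is a minimal sketch in R with ggplot2. The counts below are made-up placeholder numbers, not the FBI’s figures:

# Hypothetical counts for illustration only (not the FBI's data).
library(ggplot2)
crimes <- data.frame(year = 2007:2011, count = c(1400, 1360, 1300, 1250, 1210))
p <- ggplot(crimes, aes(x = year, y = count)) + geom_line() + geom_point()
p                          # default y-axis: the decline looks dramatic
p + expand_limits(y = 0)   # y-axis anchored at zero: same data, less drama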

Statistics, the government shutdown, and causality

There’s a statistical meme that is making its way into pundits’ discussions (as we might politely call them) that is of interest to statistics educators. There are several variations, but the basic theme is this: because of the government shutdown, people are unable to benefit from the new drugs they receive by participating in clinical trials. The L.A. Times went so far as to publish an editorial from a gentleman who claimed that he was cured by his participation in a clinical trial.

Now if they had said that future patients are prevented from benefiting from what is learned from a clinical trial, then they’d have nailed it. Instead, they seem to be overlooking the fact that some patients will be randomized to the control group, and probably get the same treatment as if there were no trial at all. And in many trials (a majority?), the result will be that the experimental treatment had little or no effect beyond the traditional treatment. And in a very small number of cases, the experimental treatment will be found to have serious side effects. And so the pundits should really be telling us that the government shutdown prevents patients from a small probability of benefiting from an experimental treatment.

All snarkiness aside, I think the prevalence of this meme points to the subtleties of interpreting probabilistic experiments, in which outcomes contain much variability, and so conclusions must be stated in terms of group characteristics. This came out in the SRTL discussion in Minnesota this summer, when Maxine Pfannkuch, Pip Arnold, and Stephanie Budgett of the University of Auckland presented their work leading towards a framework for describing students’ understanding of causality. I don’t remember the example they used very well, but it was similar to this (and was a real-life study): patients were randomized to receive either fish oil or vegetable oil in their diet. The goal of the study was to determine if fish oil lowered cholesterol. At the end of the study, the fish oil group had a slightly lower average cholesterol level. A typical interpretation was, “If I take fish oil, my cholesterol will go down.”

One problem with this interpretation is that it ignores the within-group variation. Some of the patients in the fish oil group saw their cholesterol go up; some saw little or no change. The study’s conclusion is about group means, not about individuals. (There were other problems, too. This interpretation ignores the existence of the control group: we don’t really know if fish oil improves cholesterol compared to your current diet; we know only that it tends to go down in comparison to a vegetable-oil diet. Also, we know the effects only for those who participated in the study. We assume they were not special people, but possibly the results won’t hold for other groups.)
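A quick simulation makes the group-means-versus-individuals point concrete. This is just a sketch with invented numbers, not the Auckland study’s data:

# Hypothetical changes in cholesterol for two randomized groups.
set.seed(1)
fish_oil <- rnorm(100, mean = -5, sd = 15)   # fish-oil group: mean change is negative
veg_oil  <- rnorm(100, mean =  0, sd = 15)   # vegetable-oil group: mean change near zero
mean(fish_oil) - mean(veg_oil)   # the group difference the study reports
mean(fish_oil > 0)               # yet a sizable fraction of the fish-oil group went up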

Understanding causality in probabilistic settings (or any setting) is a challenge for young students and even adults. I’m very excited to see such a distinguished group of researchers begin to help us understand. Judea Pearl, at UCLA, has done much to encourage statisticians to think about the importance of teaching causal inference. Recently, he helped the American Statistical Association establish the Causality in Statistics Education prize, won this year by Felix Elwert, a sociologist at the University of Wisconsin-Madison. We still have a ways to go before we understand how best to teach this topic at the undergraduate level, and even further before we understand how to teach it at earlier levels. But, as the government shutdown has shown, understanding probabilistic causality is an important component of statistical literacy.

Paint and Patch


The other day I was painting the trim on our house and it got me reminiscing. The year was 2005. The conference was JSM. The location was Minneapolis. I had just finished my third year of graduate school and was slated to present in a Topic Contributed session at my first JSM. The topic was Implementing the GAISE Guidelines in College Statistics Courses. My presentation was entitled Using GAISE to Create a Better Introductory Statistics Course.

We had just finished doing a complete course revision for our undergraduate course based on the work we had been doing with our NSF-funded Adapting and Implementing Innovative Material in Statistics (AIMS) project. We had rewritten the entire curriculum, including all of our assessments and course activities.

The discussant for the session was Robin Lock. In his remarks about the presentations, Lock compared the restructuring of a statistics course to the remodeling of a house. He described how some teachers restructure their courses according to a plan, doing a complete teardown and rebuild. He brought the entire room to laughter as he described most teachers’ attempts, however, as “paint and patch”: fixing a few things that didn’t work quite so well, but mostly just sprucing things up.

The metaphor works. I have been thinking about it for the last eight years. Sometimes paint-and-patch is exactly what is needed. It is pretty easy and not very time-consuming. On the other hand, if the structure underneath is rotten, no amount of paint-and-patch is going to work. There are times when it is better to tear down and rebuild.

As another academic year approaches, many of us are considering the changes to be made in courses we will soon be teaching. Is it time for a rebuild? Or will just a little touch-up do the trick?

A Course in Data and Computing Fundamentals


Daniel Kaplan and Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which was offered this semester at Macalester College. This course is part of a larger research and teaching effort funded by the Howard Hughes Medical Institute (HHMI) to help students understand the fundamentals and structures of data, especially big data. [Read more about the project in Macalester Magazine.]

The course introduces students to R and covers topics such as merging data sources, data formatting and cleaning, clustering, and text mining. Within the course, the more specific goals are:

  • Introducing students to the basic ideas of data presentation
    • Graphics modalities
    • Transforming and combining data
    • Summarizing patterns with models
    • Classification and dimension reduction
  • Developing the skills students need to make effective data presentations
    • Access to tabular data
    • Re-organization of tabular data for combining different sources
    • Proficiency with basic techniques for modeling, classification, and dimension reduction.
    • Experience with choices in data presentation
  • Developing the confidence students need to work with modern tools
    • Computer commands
    • Documentation and work-flow

Kaplan and Shoop have put their entire course online using RPubs (the web publishing system hosted by RStudio).
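To give a flavor of one of those topics, here is a minimal sketch of combining two tabular data sources in base R. The data frames and column names are hypothetical examples of my own, not materials from the course:

# Two hypothetical tables that share an "id" column.
students <- data.frame(id = 1:3, name = c("Ana", "Ben", "Cal"))
scores   <- data.frame(id = c(1, 2, 2, 3), score = c(88, 91, 85, 79))
merge(students, scores, by = "id")   # join the two tables on the shared id column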

Open Access Textbooks

In an effort to reduce costs for students, the College of Education and Human Development at the University of Minnesota has created this catalog of open textbooks. Open textbooks are complete textbooks released under a Creative Commons, or similar, license. Instructors can customize open textbooks to fit their course needs by remixing, editing, and adding their own content. Students can access free digital versions or purchase low-cost print copies of open textbooks.

The searchable catalog, which includes a few statistics books, can be accessed at https://open.umn.edu.

NCTM Essential Understandings

NCTM has finally published books on statistics in its Essential Understandings (EU) series. This is a rather traditional approach to statistics, given the context of this blog. But, since I’m a co-author (along with Roxy Peck and Stephen Miller), why not point you to it?

http://www.nctm.org/catalog/product.aspx?ID=13804

And while the book is not computational in theme, it does address a central issue of this blog: universal statistical knowledge.

A grades 6-9 version is due out any moment. Stay tuned.

iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible: accessible statistical software. “Accessible” in (at least) two senses: affordable and ready-to-use. This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland and is intended for kids to use along with the Census@Schools data. Alas, the software is greatly hampered on a Mac, but even there it has many features which kids and teachers will appreciate. Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy entry. Kids can quickly upload data and see basic boxplots and summary statistics without much effort. (There are some movies on the homepage to help you get started, but it’s pretty much an intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods. Below are summaries of FitBit data collected this fall quarter, separated into days I taught in a classroom (Lecture==1) and days I did not. It’s depressingly clear that teaching is good for me. (It didn’t hurt that my classroom was almost a half mile from my office.)

[Figure: FitBit step counts by teaching day, dotplots with boxplots and bootstrap intervals]

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions. The green horizontal lines are 95% bootstrap confidence intervals for the medians.
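For students (or teachers) curious about what is going on under the hood, here is a rough sketch of a percentile bootstrap interval for a median done by hand in R. The step counts are made-up numbers, not my actual FitBit data, and this is only an illustration of the idea, not iNZight’s implementation:

# Hypothetical daily step counts, for illustration only.
set.seed(1)
steps <- round(rnorm(60, mean = 9000, sd = 2500))
boot_medians <- replicate(5000, median(sample(steps, replace = TRUE)))
quantile(boot_medians, c(0.025, 0.975))   # 95% percentile bootstrap CI for the median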

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)
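The idea of binning a numeric subsetting variable can be mimicked in base R with cut(); this is just an illustration of the concept with hypothetical values, not how iNZight actually does it:

# Bin a numeric variable into categories, then count days per bin.
stairs <- c(5, 12, 18, 25, 33, 41, 8, 27)   # hypothetical flights of stairs climbed per day
stair_bins <- cut(stairs, breaks = 3)       # three equal-width bins
table(stair_bins)                           # how many days fall in each bin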

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, 3-D rotating plots, and scatterplot matrices. PC users can view some cool visualizations that emphasize the variability in re-sampling.