iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible: accessible statistical software. “Accessible” in (at least) two senses: affordable and ready-to-use. This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced at the University of Auckland, and is intended for kids to use along with the Census@School data. Alas, the software is greatly hampered on a Mac, but even there it has many features that kids and teachers will appreciate. Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy entry. Kids can quickly upload data and see basic boxplots and summary statistics without much effort. (There are some movies on the homepage to help you get started, but the interface is pretty much intuitive.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods. Below are summaries of FitBit data collected this Fall quarter, separated into days I taught in a classroom (Lecture==1) and days I did not. It’s depressingly clear that teaching is good for me. (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions. The green horizontal lines are 95% bootstrap confidence intervals for the medians.

[Figure: stepsfitbitgraph]
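For readers curious about what a bootstrap confidence interval for a median involves, here is a minimal sketch in Python. It uses the simple percentile method, which is an assumption on my part (iNZight may compute its intervals differently), and the step counts are made up for illustration; they are not the FitBit data summarized above. The function name `bootstrap_median_ci` is hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement, record each resample's median, then sort.
    medians = sorted(
        statistics.median(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = medians[round(alpha * n_resamples)]
    upper = medians[round((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Made-up daily step counts, for illustration only.
lecture_days = [9800, 11200, 10400, 12100, 9900, 10850, 11700, 10300]
other_days = [6100, 7400, 5200, 8300, 6900, 7100, 5800, 6500]

lecture_ci = bootstrap_median_ci(lecture_days)
other_ci = bootstrap_median_ci(other_days)
print("lecture days:", lecture_ci)
print("other days:  ", other_ci)
```

With samples this small the intervals are rough, but the idea carries over: if the two intervals do not overlap, the difference in medians is hard to attribute to resampling noise alone.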

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)
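The binning step can be sketched roughly as follows. This is only an illustration under stated assumptions: I use equal-width bins (iNZight may well choose its bins another way), the helper names `bin_edges` and `bin_index` are hypothetical, and the records are invented.

```python
from collections import defaultdict
import statistics


def bin_edges(values, n_bins):
    """Equal-width bin edges spanning the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, edges):
    """Index of the bin holding value; the last bin includes its upper edge."""
    n_bins = len(edges) - 1
    for i in range(n_bins):
        if value < edges[i + 1]:
            return i
    return n_bins - 1


# Invented records: (stairs climbed, steps, taught a lecture that day?)
days = [
    (2, 5200, False), (3, 9900, True), (5, 6100, False), (6, 10400, True),
    (9, 7400, False), (10, 11200, True), (14, 8300, False), (16, 12100, True),
]

edges = bin_edges([stairs for stairs, _, _ in days], n_bins=2)

# Group step counts by (bin of the subsetting variable, lecture indicator).
groups = defaultdict(list)
for stairs, steps, lecture in days:
    groups[(bin_index(stairs, edges), lecture)].append(steps)

for (b, lecture), steps in sorted(groups.items()):
    label = f"stairs {edges[b]:g} to {edges[b + 1]:g}"
    kind = "lecture" if lecture else "no lecture"
    print(label, "|", kind, "| median steps:", statistics.median(steps))
```

Each (bin, lecture) group corresponds to one boxplot in a pair; the slider in the software is, in effect, stepping through the bin index.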

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, rotating 3D plots, and scatterplot matrices. PC users can view some cool visualizations that emphasize the variability in re-sampling.

Computing Skills, Nunchaku Skills, Bow Skills…

I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (though not necessarily to all of my colleagues) that students need computing skills. First, a little background…

I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take academic jobs, jobs at testing companies (e.g., Pearson, College Board), or consulting gigs.

I have attended conferences, and read blogs and papers, in which it has been suggested that students learn computing skills. I am convinced of this need at 100.4%, 95% CI = [100.3%, 100.5%].

The more practical issue is which computing skills these students should learn, and how deeply. And how should they learn them (e.g., in a class, on their own, as part of an independent study)?

The latter question is important enough to merit its own post (later), so I will not address that here. Below I will begin a list of the computing skills that I believe the Quantitative Methods students should learn, and I hope readers will add to it. I use the word computing rather broadly as a matter of intention. I also do not list these in any particular order at this point, other than how they come to mind.

  • At least one programming language (probably R)
    • In my mind two or three would be better depending on the content focus of the student (Python, C++, Perl)
  • LaTeX
  • Knitr/Sweave (I used to say Sweave, but Knitr is easier initially)
  • HTML/HTML5
  • CSS
  • KML
    • I think students should also know about PHP and JavaScript. Perhaps they don’t have to be fluent in them, but they are important to know about. For example, to learn D3 (a visualization toolkit) it would behoove a student to learn JavaScript.
  • Markdown/R Markdown. These are, again, easy to learn and could ease students’ transition to Knitr. They could also lead to learning and using Slidify.
  • Regular Expressions
  • SQL
  • XML
  • JSON
  • XPATH
  • BibTeX (or some program to work with references….Mendeley, EndNote, something…)
  • Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (MPLUS, LISREL, OpenMX, AMOS, ConQuest, WinSteps, BUGS,  etc.)
  • Unix/Linux and Shell Scripting

I think students could learn many of these at a more basic level: the fundamentals, and how to use them to solve simpler problems. That way there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about the specific computing skills that are important for their own research.

What have I missed?

Planting seeds of reproducibility with knitr and markdown

I attended useR! 2012 this past summer and one of the highlights of the conference was a presentation by Yihui Xie and JJ Allaire on knitr. As an often frustrated user of Sweave, I was very impressed with how they streamlined the process of integrating R with LaTeX and other document types, and I was excited to take advantage of the tools. It also occurred to me that these tools, especially the simpler markdown language, could be useful to the students in my introductory statistics course.

For context, I teach a large introductory statistics class with mostly first and second year social science majors. The course has a weekly lab component where students start using R from day one. The emphasis is on concepts and interpretation, as a way of reinforcing lecture material, but students are also expected to be comfortable enough with R to analyze novel data.

So why should students use knitr? Almost all students in this class have no programming experience and are unfamiliar with command line interfaces. As such, they tend towards a trial-and-error approach where they repeat the same command over and over again with minor modifications until it yields a reasonable result. This exploratory phase is important, as it helps them become comfortable with R. However, it comes with frustrating side effects: it clutters the console and workspace, which leads to errors that are difficult to debug (e.g., inadvertently overwriting needed variables) and makes it difficult for the students to reproduce their own results.

As of this semester, my students are using knitr and markdown to complete their lab reports. In an effort to make the transition from standard word processors as painless as possible, we provide templates containing precise formatting that shows the students where to embed code vs. written responses. Throughout the semester, the amount of instruction decreases as the students become more comfortable with the language and the overall formatting of the lab write-ups.

This is still a work in progress, but after five labs my impressions are very positive. Students are impressed that their plots show up “magically” in the reports, and enjoy being able to complete their analysis and the write-up entirely in RStudio. This eliminates the need to copy and paste plots and outputs into a word processor, and makes revisions far less error prone. Another benefit is that this approach forces them to keep their analysis organized, which helps keep the frustration level down.

And the cherry on top: lab reports created using markdown are much easier for me and the teaching assistants to grade, since code, output, and write-up are automatically organized in a logical order and all reports follow the same structure.

There were, of course, some initial issues:

  • Not immediately realizing that it is essential to embed the code inside chunks delimited by ```{r} and ``` in order for it to be processed.
  • Running the code in the console first and then copying and pasting it into the markdown document leaves stray > and + signs, which cause cryptic errors.
  • The resulting document can be quite lengthy with all of the code, R output, plots, and written responses, making it less likely that students will thoroughly review it and catch errors.
  • Certain mistakes in R code (such as an extraneous space in a variable name) prevent the document from compiling (other errors will result in a compiled document with the error output). This is perhaps the most frustrating problem since it makes it difficult for the students to identify the source of the error.
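The stray-prompt problem in particular is mechanical enough to clean up automatically. Here is a minimal sketch in Python of one way to do it; it is not a feature of knitr or RStudio, and the function name `strip_console_prompts` is hypothetical. It assumes every pasted line that begins with "> " or "+ " is a console prompt, which is not always true: a legitimate code line that starts with "+ " would be altered too, and pasted output lines (such as "[1] 6100") are left in place.

```python
import re

# A leading R console prompt: ">" for a new command or "+" for a
# continuation line, followed by a single space.
PROMPT = re.compile(r"^[>+] ")


def strip_console_prompts(pasted):
    """Remove a leading console prompt from each line of pasted text."""
    return "\n".join(PROMPT.sub("", line) for line in pasted.splitlines())


pasted = "> steps <- c(5200, 6100,\n+   7400)\n> median(steps)"
print(strip_console_prompts(pasted))
```

Only a prompt at the very start of a line is removed, so comparisons such as `x > 1` inside the code are untouched.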

With guidance from peers, teaching assistants, and myself, the students quickly develop the skills necessary to troubleshoot these issues, and after 5 weeks, such errors have all but vanished.

It’s not all sunshine and lollipops, though. There are some missing features that would make knitr / RStudio more user friendly in this context:

  • Write to PDF: The markdown document creates an HTML file when compiled, which is useful for creating webpages, but a PDF output would be much more useful for students turning in printed reports. (Suggestion: Pandoc, ref: stackoverflow.)
  • Smart page breaks: Since the resulting HTML document is not meant to be printed on letter sized pages, plots and R output can be split between two pages when printed, which is quite messy.
  • Word count:  A word count feature that only counts words in the content, and not in the code, would be immensely useful for setting and sticking to length requirements. (Suggestion: Marked, ref: this post.)

Tools for resolving some of these issues are out there, but none of them can currently be easily integrated into the RStudio workflow.
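As an illustration of the word count idea, here is a minimal sketch in Python that counts only the words outside fenced code chunks. It is a rough approximation under stated assumptions, not an existing RStudio or knitr feature: the function name `count_prose_words` is hypothetical, every line starting with three backticks is assumed to open or close a chunk, and inline code and markdown markup (such as heading symbols) are counted as words.

```python
FENCE = "`" * 3  # three backticks, the chunk delimiter in markdown


def count_prose_words(markdown_text):
    """Count whitespace-separated words outside fenced code chunks."""
    in_chunk = False
    words = 0
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_chunk = not in_chunk  # a fence line opens or closes a chunk
            continue
        if not in_chunk:
            words += len(line.split())
    return words


report = "\n".join([
    "The median is higher on lecture days.",
    FENCE + "{r}",
    "median(steps[lecture == 1])",
    FENCE,
    "Teaching seems to help.",
])
print(count_prose_words(report))  # prints 11
```

The two prose lines contribute 7 and 4 words; the chunk and its fences contribute nothing.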

All of the above amount to mostly logistical practicalities of using knitr in an introductory course, but there is also a larger pedagogical argument for it: introducing reproducible research in a practical and painless way. Reproducible research is not something that many first or second year undergraduate students are aware of; after all, very few of them are actually engaged in research activities at that point in their academic careers. At best, students are usually aware that reproducibility is one of the central tenets of the scientific method, but have given very little thought to what that involves, either as a researcher producing work that others will want to replicate or as someone attempting to reproduce another author’s work. In the context of completing simple lab assignments and projects with knitr, students experience first hand the benefits and the frustrations of reproducible research, which is hopefully a lesson they’ll take away from the class, regardless of how much R or statistics they remember.

PS: If you’re interested in the nuts and bolts, you can review the labs as well as knitr templates here.