Thinking with technology

Just finished a stimulating, thought-provoking week at SRTL (the Statistical Reasoning, Thinking, and Literacy research forum), this year held in Two Harbors, Minnesota, right on Lake Superior. SRTL gathers statistics education researchers, most of whom come with cognitive or educational psychology credentials, every two years. It’s more of a forum for thinking and collaborating than a platform for presenting findings, and this means there’s much lively, constructive discussion about works in progress.

I had meant to post my thoughts daily, but (a) the internet connection was unreliable and (b) there was just too much to digest. One recurring theme that really resonated with me was the way students interact with technology when thinking about statistics.
Much of the discussion centered on young learners, and most of the researchers (but not all) were in classrooms in which the students used TinkerPlots 2. TinkerPlots is a dynamic software system that lets kids build their own chance models. (It also lets them build their own graphics more or less from scratch.) They do this either by dropping “balls” into “urns” and labeling the balls with characteristics, or by building spinners whose areas they can shade different colors. They can connect a series of spinners and urns to create sequences of independent or dependent events, and can collect the outcomes of their trials. Most importantly, they can carry out a large number of trials very quickly and graph the results.
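
For readers who haven’t seen TinkerPlots, here is a rough translation of one such chance model into R (my own sketch, not anything TinkerPlots produces; the urn contents and spinner shading are made up):

```r
# A TinkerPlots-style chance model sketched in R: an "urn" is a vector
# of labeled balls, a "spinner" is a sample() with unequal probabilities.
set.seed(1)
urn <- c("red", "red", "blue")   # two red balls, one blue

spin <- function() {             # a spinner with 1/4 of its area shaded "win"
  sample(c("win", "lose"), size = 1, prob = c(0.25, 0.75))
}

# One trial chains the urn draw and the spin (independent events here)
trial <- function() paste(sample(urn, size = 1), spin())

# The payoff: thousands of trials in an instant, then graph the results
outcomes <- replicate(10000, trial())
barplot(table(outcomes) / length(outcomes), ylab = "Proportion of trials")
```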

What I found fascinating was the way in which students would come to judgments about situations, and then build a model that they thought would “prove” their point. After running some trials, when things didn’t go as expected, they would go back and assess their model. Sometimes they’d realize that they had made a mistake, and they’d fix it. Other times, they’d see there was no mistake, and then realize that they had been thinking about the problem wrong. Sometimes, they’d even come up with explanations for why they had been thinking about it incorrectly.

Janet Ainley put it very succinctly (more succinctly and precisely than my re-telling): this technology imposes a sort of discipline on students’ thinking. Using the technology is easy enough that they can be creative, but the technology is rigid enough that their mistakes are made apparent. This means that mistakes are cheap, and repairs are easily made. And so the technology itself becomes a form of communication that forces students into a greater level of precision than they can put into words.

I suppose that mathematics plays the same role, in that speaking with mathematics imposes great precision on the speaker. But that language takes time to learn, and few students reach a level of proficiency that allows them to use it to construct new ideas. TinkerPlots, and software like it, gives students the ability to use a language to express new ideas with very little expertise. It was impressive to see 15-year-olds build models that incorporated both deterministic trends and fairly sophisticated random variability. More impressive still, the students were able to use these models to solve problems. In fact, I’m not sure they really knew they were building models at all, since their focus was on the problem solving.
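
To give a flavor of what those models amount to outside of TinkerPlots, here is a hypothetical R translation of one (my example, not the students’ actual model): a deterministic trend with random variability layered on top.

```r
# Hypothetical sketch: a deterministic linear trend plus random noise,
# the two ingredients the students combined in their models.
set.seed(2)
day   <- 1:100
trend <- 50 + 0.8 * day                 # deterministic component
plant <- trend + rnorm(100, sd = 10)    # add random variability

plot(day, plant, ylab = "Simulated growth")
lines(day, trend, col = "red", lwd = 2) # the underlying trend
```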

TinkerPlots is aimed at a younger audience than the one I teach. But for me, the take-home message is to remember that statistical software isn’t simply a tool for calculation, but a tool for thinking.

iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible: accessible statistical software. “Accessible” in (at least) two senses: affordable and ready-to-use. This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland and is intended for kids to use along with the Census@School data. Alas, the software is greatly hampered on a Mac, but even there it has many features that kids and teachers will appreciate. Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy entry. Kids can quickly upload data and see basic boxplots and summary statistics without much effort. (There are some movies on the homepage to help you get started, but it’s a pretty intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods. Below are summaries of FitBit data collected this Fall quarter, separated into days I taught in a classroom (Lecture==1) and days I did not. It’s depressingly clear that teaching is good for me. (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions. The green horizontal lines are 95% bootstrap confidence intervals for the medians. [Figure: dotplots and boxplots of daily FitBit steps, by Lecture]
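
For anyone curious what iNZight is doing there, a percentile bootstrap interval for a median takes only a few lines of base R. A minimal sketch, with made-up step counts standing in for my FitBit data:

```r
# 95% percentile bootstrap confidence interval for a median, the same
# quantity iNZight draws; 'steps' here is simulated stand-in data.
set.seed(3)
steps <- round(rnorm(60, mean = 9000, sd = 2500))

boot_medians <- replicate(5000, median(sample(steps, replace = TRUE)))
quantile(boot_medians, probs = c(0.025, 0.975))
```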

iNZight also lets students easily subset data, even by numerical variables. For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying a separate pair of boxplots for each bin. There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another. (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you examine a numerical-numerical relationship.)
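
Under the hood, that slider is just binning the numeric subsetting variable. The same idea in plain R, with hypothetical stand-ins for my FitBit variables, might look like:

```r
# Sketch of iNZight-style subsetting by a binned numeric variable;
# 'flights' (stairs climbed), 'lecture', and 'steps' are all made up.
set.seed(4)
flights <- sample(5:40, 120, replace = TRUE)
lecture <- factor(sample(c("No", "Yes"), 120, replace = TRUE))
steps   <- 7000 + 1500 * (lecture == "Yes") + 100 * flights +
           rnorm(120, sd = 1500)

flight_bin <- cut(flights, breaks = 3)   # automatic bins, like the slider
boxplot(steps ~ lecture + flight_bin, las = 2, ylab = "Steps per day")
```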

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, 3-D rotating plots, and scatterplot matrices. PC users can also view some cool visualizations that emphasize the variability in resampling.

Computing Skills, Nunchaku Skills, Bow Skills…

I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (though not necessarily to all of my colleagues) that students need computing skills. First, a little background…

I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take academic jobs, jobs at testing companies (e.g., Pearson, College Board), or consulting gigs.

I have been to conferences, and read blogs and papers, in which the suggestion that students learn computing skills has been put forward. I am convinced of this need at 100.4%, 95% CI = [100.3%, 100.5%].

The more practical issues are which computing skills these students should learn, and how deeply. And how should they learn them (e.g., in a class, on their own, as part of an independent study)?

The latter question is important enough to merit its own post (later), so I will not address it here. Below I begin a list of the computing skills that I believe Quantitative Methods students should learn, and I hope readers will add to it. I use the word computing rather broadly, and intentionally so. The list is in no particular order, other than how the items came to mind.

  • At least one programming language (probably R)
    • In my mind, two or three would be better, depending on the student’s content focus (Python, C++, Perl)
  • LaTeX
  • Knitr/Sweave (I used to say Sweave, but Knitr is easier initially)
  • HTML/HTML5
  • CSS
  • KML
    • I think students should also know about PHP and JavaScript. Perhaps they don’t have to be fluent in them, but they are important to know about. For example, to learn D3 (a visualization library) it would behoove a student to learn JavaScript.
  • Markdown/R Markdown. These are, again, easy to learn and could help students transition to knitr. They could also lead to learning and using Slidify.
  • Regular Expressions
  • SQL
  • XML
  • JSON
  • XPATH
  • BibTeX (or some program to manage references… Mendeley, EndNote, something…)
  • Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (MPLUS, LISREL, OpenMX, AMOS, ConQuest, WinSteps, BUGS,  etc.)
  • Unix/Linux and Shell Scripting

I think students could learn many of these at a lesser level: the basics, plus using them to solve simpler problems. That way there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about the specific computing skills that are important for their own research.

What have I missed?

Planting seeds of reproducibility with knitr and markdown

I attended useR! 2012 this past summer and one of the highlights of the conference was a presentation by Yihui Xie and JJ Allaire on knitr. As an often frustrated user of Sweave, I was very impressed with how they streamlined the process of integrating R with LaTeX and other document types, and I was excited to take advantage of the tools. It also occurred to me that these tools, especially the simpler markdown language, could be useful to the students in my introductory statistics course.

For context, I teach a large introductory statistics class with mostly first- and second-year social science majors. The course has a weekly lab component where students start using R from day one. The emphasis is on concepts and interpretation, as a way of reinforcing lecture material, but students are also expected to be comfortable enough with R to analyze novel data.

So why should students use knitr? Almost all students in this class have no programming experience and are unfamiliar with command-line interfaces. As such, they tend towards a trial-and-error approach, repeating the same command over and over again with minor modifications until it yields a reasonable result. This exploratory phase is important, as it helps them become comfortable with R. However, it comes with frustrating side effects, such as cluttering the console and workspace, which leads to errors that are difficult to debug (e.g., inadvertently overwriting needed variables) and makes it difficult for students to reproduce their own results.

As of this semester, my students are using knitr and markdown to complete their lab reports. In an effort to make the transition from standard word processors as painless as possible, we provide templates with precise formatting that shows the students where to embed code vs. written responses. Over the semester, the instructions are gradually pared back as the students become more comfortable with the language and the overall formatting of the lab write-ups.
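
A stripped-down version of one of these templates (my reconstruction for illustration; the actual templates are linked in the PS below) might look like:

````markdown
Lab 1: Introduction to data
===========================

### Exercise 1

Replace this sentence with your written answer to Exercise 1.

```{r exercise1}
# Replace this comment with the R code for Exercise 1.
```
````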

This is still a work in progress, but after five labs my impressions are very positive. Students are impressed that their plots show up “magically” in the reports, and enjoy being able to complete their analysis and write-up entirely in RStudio. This eliminates the need to copy and paste plots and output into a word processor, and makes revisions far less error-prone. Another benefit is that this approach forces them to keep their analysis organized, which helps keep the frustration level down.

And the cherry on top: lab reports created using markdown are much easier for me and the teaching assistants to grade, since code, output, and write-up are automatically organized in a logical order and all reports follow the same structure.

There were, of course, some initial issues:

  • Not immediately realizing that it is essential to embed the code inside chunks delimited by ```{r} and ``` in order for it to be processed.
  • Running the code in the console first and then copying and pasting it into the markdown document leaves stray > and + signs, which produce cryptic errors (see the sketch after this list).
  • The resulting document can be quite lengthy with all of the code, R output, plots, and written responses, making it less likely that students will thoroughly review it and catch errors.
  • Certain mistakes in R code (such as an extraneous space in a variable name) prevent the document from compiling (other errors still yield a compiled document containing the error output). This is perhaps the most frustrating problem, since it makes it difficult for students to identify the source of the error.
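
The copy-paste problem is easier to show than to describe. A made-up example: the first chunk below is what a console copy-paste looks like (and it will not compile), while the second is what the chunk should contain.

````markdown
```{r}
# Pasted straight from the console: the > and + prompts are not R code,
# so knitr stops with a cryptic parse error on the very first line.
> total <- sum(steps)
> total /
+   length(steps)
```

```{r}
# The same code with the prompts stripped out: this chunk runs cleanly
# (assuming a variable named 'steps' exists in the lab data).
total <- sum(steps)
total / length(steps)
```
````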

With guidance from peers, teaching assistants, and me, the students quickly develop the skills necessary to troubleshoot these issues, and after five weeks such errors have all but vanished.

It’s not all sunshine and lollipops, though; there are some missing features that would make knitr/RStudio more user-friendly in this context:

  • Write to PDF: The markdown document creates an HTML file when compiled, which is useful for creating webpages, but a PDF output would be much more useful for students turning in printed reports. (Suggestion: Pandoc, ref: stackoverflow; see the sketch after this list.)
  • Smart page breaks: Since the resulting HTML document is not meant to be printed on letter sized pages, plots and R output can be split between two pages when printed, which is quite messy.
  • Word count:  A word count feature that only counts words in the content, and not in the code, would be immensely useful for setting and sticking to length requirements. (Suggestion: Marked, ref: this post.)
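
For the first item, one stopgap that already works (assuming Pandoc and a LaTeX distribution are installed; the file name here is hypothetical) is to convert the knitted markdown file from within R:

```r
# Convert knitr's markdown output to PDF by shelling out to pandoc.
# Assumes the pandoc command-line tool and a LaTeX installation are
# available on the system; "lab1.md" is a hypothetical file name.
system("pandoc lab1.md -o lab1.pdf")
```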

Tools for resolving some of these issues are out there, but none of them can currently be easily integrated into the RStudio workflow.

All of the above amount to mostly logistical practicalities for using knitr in an introductory course, but there is also a larger pedagogical argument for it: introducing reproducible research in a practical and painless way. Reproducible research is not something that many first- or second-year undergraduate students are aware of; after all, very few of them are actually engaged in research activities at that point in their academic careers. At best, students are usually aware that reproducibility is one of the central tenets of the scientific method, but have given very little thought to what it involves, either as a researcher producing work that others will want to replicate, or as someone attempting to reproduce another author’s work. In the context of completing simple lab assignments and projects with knitr, students experience firsthand the benefits and the frustrations of reproducible research, which is hopefully a lesson they’ll take away from the class regardless of how much R or statistics they remember.

PS: If you’re interested in the nuts and bolts, you can review the labs as well as knitr templates here.