A Course in Data and Computing Fundamentals

Daniel Kaplan and Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which was offered this semester at Macalester College. This course is part of a larger research and teaching effort funded by the Howard Hughes Medical Institute (HHMI) to help students understand the fundamentals and structures of data, especially big data. [Read more about the project in Macalester Magazine.]

The course introduces students to R and covers topics such as merging data sources, data formatting and cleaning, clustering and text mining. Within the course, the more specific goals are:

  • Introducing students to the basic ideas of data presentation
    • Graphics modalities
    • Transforming and combining data
    • Summarizing patterns with models
    • Classification and dimension reduction
  • Developing the skills students need to make effective data presentations
    • Access to tabular data
    • Re-organization of tabular data for combining different sources
    • Proficiency with basic techniques for modeling, classification, and dimension reduction.
    • Experience with choices in data presentation
  • Developing the confidence students need to work with modern tools
    • Computer commands
    • Documentation and work-flow

Kaplan and Shoop have put their entire course online using RPubs (the web publishing system hosted by RStudio).

Open Access Textbooks

In an effort to reduce costs for students, the College of Education and Human Development at the University of Minnesota has created this catalog of open textbooks. Open textbooks are complete textbooks released under a Creative Commons, or similar, license. Instructors can customize open textbooks to fit their course needs by remixing, editing, and adding their own content. Students can access free digital versions or purchase low-cost print copies of open textbooks.

The searchable catalog, which includes a few statistics books, can be accessed at https://open.umn.edu.

NCTM Essential Understandings

NCTM has finally published books on statistics in its EU series. This is a rather traditional approach to statistics, given the context of this blog. But, since I’m a co-author (along with Roxy Peck and Stephen Miller), why not point you to it?

http://www.nctm.org/catalog/product.aspx?ID=13804

And while the book is not computational in theme, it does address a central issue of this blog: universal statistical knowledge.

A grades 6-9 version is due out any moment. Stay tuned.

iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible: accessible statistical software. “Accessible” in (at least) two senses: affordable and ready-to-use. This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland, and is intended for kids to use along with the Census@School data. Alas, the software is greatly hampered on a Mac, but even there it has many features which kids and teachers will appreciate. Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy entry. Kids can quickly upload data and see basic boxplots and summary statistics, without much effort. (There are some movies on the homepage to help you get started, but it’s pretty much an intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods. Below are summaries of FitBit data collected this Fall quarter, and separated into days I taught in a classroom (Lecture==1) and days I did not. It’s depressingly clear that teaching is good for me. (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions. The green horizontal lines are 95% bootstrap confidence intervals for the medians.
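
For readers who want to see what a bootstrap interval for a median involves, here is a minimal sketch in Python. It assumes the simple percentile method; I don't know which bootstrap variant iNZight actually uses, and the function name and the step counts are made up for illustration (they are not the FitBit data).

```python
import random
import statistics

def bootstrap_median_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, same size as the data, and record each median.
    medians = sorted(
        statistics.median(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = medians[round(alpha * n_resamples)]
    upper = medians[round((1 - alpha) * n_resamples) - 1]
    return lower, upper

# Made-up daily step counts, for illustration only.
lecture_days = [9800, 11200, 10450, 12100, 9900, 10800, 11750]
low, high = bootstrap_median_ci(lecture_days)
print(f"median = {statistics.median(lecture_days)}, 95% CI = ({low}, {high})")
```

The interval is read straight off the sorted resampled medians, which is why this version needs no distributional assumptions.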

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, three-d rotating plots, and scatterplot matrices.  PC users can view some cool visualizations that emphasize the variability in re-sampling.

A walk in Venice Beach

For various reasons, I decided to walk this weekend from my house to Venice Beach, a distance of about four and a half miles. The weather was beautiful, and I thought a walk would help clear my mind. I had recently heard a story on NPR in which it was reported that Thoreau kept data on when certain flowers opened, a record now used to help understand the effects of global warming. Some of these flowers were as far as 5 miles from Thoreau’s home, which made me think that if he could walk 5 miles to collect data, so could I. Inspired also, perhaps, by the UCLA Mobilize project, I made a decision to take a photo every 5 minutes. The rule was simple: I would set my phone’s timer for 5 minutes. When it rang, no matter where I was, I would snap a picture.

I decided I would take just one picture, so that I would be forced to exercise some editorial decision making. That way, the data would reflect my own state of mind, in some sense.  Later in the walk, I cheated, because it’s easier to take many pictures than to decide on one.  I also sometimes cheated by taking pictures of things when it wasn’t the right time.  Here’s the last picture I decided to take, at the end of my walk (I took a cab home. I am that lazy) on Abbot Kinney.

Brick mural, on Abbot Kinney

This exercise brought up a dilemma I often encounter when touristing–do you take intimate, close-up pictures of interesting features, like the above, or do you take pictures of the environment, to give people an idea of the surroundings?  This latter is almost always a bad idea, particularly if all you’ve got is an iPhone 4; it really is difficult to improve on Google Street View.  It is, however, extremely tempting, despite the fact that it leads to pictures like this:

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North

But my subject-matter choices were also limited in other ways.  For one, it was fairly hot, as this temperature plot (http://www.friendlyforecast.com/usa/archive) shows.

temp plot

The heat kept me on the shady side of the street, and the sun meant that I usually had to shoot across the street, although there were some exceptions:

(The object on the left is what we once called a “pay phone”. The only public phone I encountered that day, in fact, which added to the mystery of this storefront which had a colorful mural, but no name or address marker.)

During the walk I stopped at a farmer’s market and at a used book sale at the Mar Vista Library (bought an Everyman’s Library book about Beethoven and the score to Bach’s Cantata #4). I watched toddler-aged girls fight and cry and dance outside a ballet studio, drank a too-expensive cup of coffee at Intelligentsia Coffee (but it was good), and bought my sister, for her birthday, a terrarium at a make-your-own terrarium shop.

Books.

What to do with these data? One challenge is to see what can be gleaned from the photos. The only trend that jumped out at me, while reviewing these, was the fact that I was in line at that coffee shop for a very long time, as this series of photos (taken every 5 minutes, remember) attests:

Closer

waiting for the hand-pour-brewed coffee to actually be poured

So at the risk of overthinking this post, I’ll just come right to the point (finally):  how do we provide tools to make it easier for people to make sense of these data?

Rather than organize my partial answer in a thoughtful way, and thus spend weeks writing it down, let me just make a list.  I will organize the list, however, by sub-category.

Gathering the Data

  • The iPhone, of course, stores date and time stamps, as well as location stamps, whenever I snapped a photo.  And lots of other data, called exif data.  I can look at some of these using Preview or iPhoto,  but trying to extract the data for my own use is hard.  Does anyone know a way of getting a datafile that has the time, date, GPS coordinates for my pictures?  (And any other photo meta-data, for that matter.)  I browsed through a discussion on stackoverflow, and for me the take-home message was “no.” I did find a way to view the data; first, load the iPhone photos into iPhoto. Then export to hard drive, being sure to check the ‘include location information’ box. Then, open with Preview, open the Inspector (command-i or choose from drop-down menu), and then click on the GPS tab.  From there it is a simple matter of typing everything in, photo by photo, into another file.
  • Weather data is easily found to supplement the story, as the above graph shows.
  • OpenPaths provides free location data, and even stores it for you.  It allows you to export nice csv files, such as this file
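
Once the GPS values are in hand, however they were obtained, turning them into an analysis-ready file is the easy part. Photo metadata stores each coordinate as degrees, minutes, and seconds plus a hemisphere letter, so the main step is converting to signed decimal degrees. Here is a minimal Python sketch; the function names, file name, timestamp, and coordinates are all hypothetical stand-ins for values typed in from Preview.

```python
import csv
import io

def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if ref in ("S", "W") else value

def photos_to_csv(photos, out):
    """Write one row per photo: file name, timestamp, latitude, longitude."""
    writer = csv.writer(out)
    writer.writerow(["file", "timestamp", "lat", "lon"])
    for p in photos:
        writer.writerow([
            p["file"],
            p["timestamp"],
            round(dms_to_decimal(*p["lat_dms"], p["lat_ref"]), 6),
            round(dms_to_decimal(*p["lon_dms"], p["lon_ref"]), 6),
        ])

# Hypothetical values, not real photo metadata.
photos = [{
    "file": "photo_a.jpg", "timestamp": "2012-12-15 11:05:00",
    "lat_dms": (34, 0, 0.0), "lat_ref": "N",
    "lon_dms": (118, 27, 0.0), "lon_ref": "W",
}]
buffer = io.StringIO()
photos_to_csv(photos, buffer)
print(buffer.getvalue())
```

Writing to a file instead of the in-memory buffer gives a csv that can sit alongside the OpenPaths export.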

Displaying the Data

  •  Well, you can always paste photos and graphs into a long, rambling narrative.
  • iPhoto is apparently one of the few programs that has access to your exif data, and the “Places” feature will, with some playing around, let you show where you’ve been. It’s tedious, and you can’t easily share the results (maybe not at all).  But it does let you click on a location pin and see the picture taken there, which is fun.
  • StatCrunch has a new feature that lets you easily communicate with Google Maps. You provide latitude, longitude and optional other data, and it makes a map.  There are some funny formatting requirements:  data must be in this form  lat lon|color|other_variable
    Hopefully, StatCrunch will add a feature that lets you easily move from the usual flat-file format for data to this format.  In the meantime, I had to export my StatCrunch OpenPaths data to Excel (could have used R, but I’m rusty with the string commands), and then re-import as a new data set.
  • Venice Walk Open Paths map on StatCrunch-1
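
The move from a flat file to the lat lon|color|other_variable form is just string pasting, so it can be scripted rather than done by hand in a spreadsheet. Here is a minimal Python sketch. The column names (lat, lon, date), the coordinates, and the function name are assumptions for illustration; check them against your own export, and note that I have not verified how StatCrunch treats empty fields.

```python
import csv
import io

def to_map_field(lat, lon, color="", other=""):
    """Build one 'lat lon|color|other_variable' string from flat columns."""
    return f"{lat} {lon}|{color}|{other}"

# Hypothetical flat file; real exports may use different column names.
flat_file = (
    "lat,lon,date\n"
    "34.0195,-118.4912,2012-12-15 10:00:00\n"
    "33.9908,-118.4659,2012-12-15 11:30:00\n"
)

for row in csv.DictReader(io.StringIO(flat_file)):
    print(to_map_field(row["lat"], row["lon"], color="blue", other=row["date"]))
```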

Making Sense of It All

But the true challenge is how do we make sense of it all?  How do we merge these data in such a way that unexpected patterns, and the deeper truths behind them, can be revealed? At the very least, is there a single, comprehensive data display that would allow you to more fully appreciate my experience?  If (and when) I do this walk again, how can I compare the data from the two different walks?

Some other themes:  our data should be ours to do with as we please. OpenPaths has it right; iPhone has it wrong wrong wrong.  Another theme: maps are now a natural and familiar way of storing and displaying data.  StatCrunch has taken some steps in the right direction in attempting to provide a smooth pathway between data and map, but more is needed.  Perhaps there’s a friendly, flexible, open-source mapping tool out there somewhere that would encourage our data-conscious citizens to share their lives through maps?

If you’re still reading, you can view all of the pictures on Flickr.

assessment, research, teaching

A new report released by CAUSE is well worth reading: Connecting Research to Practice in a Culture of Assessment for Introductory College-level Statistics, www.causeweb.org/research/guidelines/ResearchReport_Dec_2012.pdf

Read it.  We’ll discuss later.  Pop quiz.

I haven’t yet read it myself (in my eagerness to publicize it as quickly as possible),  but of particular interest to this blog is the role that data science plays, or does not play.  For instance, Question 1 under Research Priority 1 is “What core learning outcomes  employed in a particular profession do individuals  need to develop in order to perform well in that profession (e.g., the outcomes that are common and those that are unique to disciplines such as psychology, biology, and economics?)”

I recently had a discussion with someone in a data-heavy business, and was struck by how core statistical concepts were seen as just one of many necessary core skills, with the rest of the skills requiring computing, psychology, and communication.  It is fashionable in statistical circles to be somewhat dismissive of claims that computation takes precedence over statistics, but in at least this case, I think that this paints an unfair portrait.  The data scientist in question held statistics in high esteem, and was well aware of the pitfalls of being lured by transitory patterns, as compelling as they might at first glance seem.  His use of statistics came at the ‘high end’, employing very modern data smoothing techniques, multivariate models, and a need for sophisticated understanding of model evaluation.  But he, like many in his field I suspect, came to statistics in a round-about way, after becoming successful in computer science and then studying statistics to close the gap.  I doubt he considered himself a statistician, but instead one who frequently found statistical tools and concepts to be useful for getting things done.

We’re in a very exciting position, as educators, to dream about how to develop future data scientists who incorporate statistics with computation from the very beginning of their conception of statistics.  But one thing to keep in mind, is that part of the excitement of this new age of statistics is that many of the careers we’re preparing our students for don’t yet exist.  It seems so many of the data challenges that are raised in a general realm of endeavor such as marketing,  the arts, genetics,  law, have solutions that don’t live purely in one field.  And so when we ask ourselves about skills needed in particular professions, let’s do so with our eyes open to the fact that the profession that many of us have in mind —data science—doesn’t really yet exist.

Data science is, or will be, a specialist’s field.  But this blog is devoted to considering the data science skills needed by all students.  I think, therefore, that the issues raised by this report concerning the ‘core’ skills are very important.  A data scientist may have a specialist’s collection of skills, in the aggregate, but many of these skills and understandings, in isolation, will need to be part of our core education.  This report encourages us all to think seriously about precisely which skills and understandings those should be.

Stats in School

Just read a great paper by Anna Bargagliotti in the current Journal of Stats Education, “How well do the NSF Funded Elementary Mathematics Curricula align with the GAISE report recommendations?”  The answer: it depends.  Anna compares three math curricula designed to meet the Common Core Standards for grades K-12: “Investigations in Number, Data, and Space”, “Math Trailblazers”, and “Everyday Mathematics.”  Anna compared them to the Guidelines for Assessment and Instruction in Statistics Education K-12 report, which, to quote her paper, “defines a statistically literate person as one who is able to formulate questions, collect and analyze data, and interpret results.”  I personally feel the “analyze data” component is the most important, since this is a skill all students should acquire, and a skill that requires a strong understanding of statistical concepts and methods.

The GAISE report identifies three developmental levels, labeled A, B and C.  Since Anna is concerned with earlier grades, she considers only levels A and B.  Level A is “below” level B in some sense, but the levels might overlap, and students might advance to level B on some topics while still studying at the Level A on others.  Levels aren’t associated with particular grades, but, roughly speaking, one might expect Level A to occupy most of a child’s K-6 years, and level B much of middle school and early high school.  For example, in Level A, students investigate situations in which they are not expected to go beyond the sample at hand.  In Level B, they begin to informally consider what the sample at hand has to say about a larger context.  In Level C, they learn formal methods for inference.

Two of the curricula, “Investigations” and “Trailblazers”, according to Anna’s paper, move students from Level A to B and have strong data analysis components. The third, “Everyday”, favors probability, seems to ignore data analysis, and is so weighted towards computation that it was difficult to determine whether it was teaching at Level A or B.  (Well, that’s my reading of Anna’s findings.  There is room for more nuance there, but that’s one advantage of a blog over an academic paper: we can ignore nuance.)

Now here’s the depressing part: one of these curricula is used by 3 million students.  If you guessed “Everyday Mathematics” go to the head of the class.  Trailblazers is used by a healthy number, too: 500,000.  But that’s only 1/6 the size of Everyday.  So while the good news is that the Common Core provides students with the opportunity to learn some truly useful and needed statistics, the bad news is that most of them continue to be taught probability at the expense of data analysis.

Turning Tables into Graphs

We have just finished another semester, and before my mind completely turns to rubble, I want to share what I believe to be a fairly good assignment. What I present below was given as parts of two separate assignments this semester, but upon reflection I think it would be better as one.

—–

Read the article Let’s Practice What We Preach: Turning Tables into Graphs (full reference given below). In this article, Gelman, Pasarica, & Dodhia suggest that presentations of results using graphs are more effective than results presented in tables.

Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach: Turning tables into graphs. The American Statistician, 56(2), 121–130.

Find an article in a journal that presents results (or data) in a table. Re-create the data in a tabular format using R (or Excel).

  1. Use the functions in ggplot2 to produce a plot that conveys the same message as the original table.
  2.  Include the original table (this can be a screenshot or web-link) and citation, along with your plot.
  3. Write a few sentences describing why the plot you produced provides a better presentation of the results or data (be sure to use recommendations from the article in making your case).

In the second part of this assignment, you will write a tutorial for the process you followed for turning a table into a plot using R Markdown and will publish that tutorial on RPubs.

There are several resources for learning R Markdown.

Your tutorial should be written so that a student who was just learning ggplot could follow your directions easily. Include instructions for obtaining the data, getting it into a useable tabular format, manipulating the data so it can be used with ggplot, and well-commented instructions for creating your final plot. (Think of the level of detail you would want in a tutorial when you were first learning ggplot!)

It should also include:

  • a citation or link to the website/journal that published the original table
  • a view of your final data (full or a subset depending on size)
  • all commands necessary to create your final plot (with appropriate explanation), and
  • the final plot

When you knit the .Rmd document it should compile without errors.

—–
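
For the “manipulating the data so it can be used with ggplot” step, the usual hurdle is that a journal table arrives in wide form (one column per condition), while plotting tools generally want long form (one row per value). Here is a minimal sketch of that reshaping, written in Python rather than R purely for illustration; the same idea applies in R. The function name `wide_to_long` and the table values are made up.

```python
def wide_to_long(rows, id_column, variable_name="variable", value_name="value"):
    """Reshape wide rows (one column per measure) into long rows (one row per value)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every long row instead
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows

# Made-up table in the wide layout typical of published results.
table = [
    {"group": "A", "pre": 10, "post": 14},
    {"group": "B", "pre": 12, "post": 13},
]
for long_row in wide_to_long(table, "group", variable_name="time", value_name="score"):
    print(long_row)
```

In the long layout, the former column names become a variable that can be mapped to an axis or a color.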

Students commented that they learned a lot about the use of ggplot during the initial assignment (this was the second assignment in the course). The Markdown part of the assignment I gave as an extra credit assignment at the end of the class, but in retrospect, I should have made it required and done it very early on.

Here are a couple of the tutorials that I have received so far:

  • These students took a table of characteristics of survey participants published in the Journal of Ethnic and Cultural Diversity in Social Work and turned it into a bar graph.  http://rpubs.com/TSK_2012/3184
  • These students took data about trends and topics discussed in Seventeen Magazine‘s Traumarama articles from 1994-2007 and turned it into a line plot. http://rpubs.com/opalc123/3155
  • These students took a table of data related to approval ratings and turned them into a box-and-whiskers plot. http://www.rpubs.com/GeorgeBrisse/3217
  • These students’ work is a great example of how data initially presented in a table is much easier to process in a graph. The data, from a table published in the Journal of Deaf Studies and Deaf Education, show the academic status and progress of deaf and hard-of-hearing students in general education classrooms.  http://rpubs.com/mens0055/3211
  • These students used a stacked bar chart to show data about the sample sizes for different stages for 12 problem behaviors published in Health Psychology. http://rpubs.com/nikedenise/3256
  • These students created a line graph representing pre- and post-training scores for consonant, vowel, sentence, and gender perception scores in cochlear implant users to examine whether an auditory training program improves performance. http://rpubs.com/koern030/3255

Accessing your 11.0 iTunes library

One of the themes of this blog is to make statistics relevant and exciting to students by helping them understand the data that’s right under their noses.   Or inside their ears.  The iTunes library is a great place to start.

For a while, iTunes made it easy to get your data onto your hard drive in a convenient, analysis-ready form. Then they made it hard.  Then (10.7) they made it easy again. Now, in 11.0, it is once again ‘hard’.  Prior to version 11.0, these instructions would do the trick: Open up iTunes, Control-click on the “Music” library and choose Export.  And a tab-delimited text file with all iTunes library data appears.

Now, iTunes 11.0 provides only an xml library.  This is a shame for us teachers, since the data is now one step further removed from student access. In particular, it’s a shame because the data structure is not terribly complex; a flat file should do the trick. (If you want the xml file, select File>Library>Export.)

But all is not lost. With one simple work-around, you can get your data. First, create a smart playlist that has all of your songs.  I did this by including in the list all songs added before today’s date.  Now control-click on the name of the playlist, and choose Export.  Save the file wherever you wish, and you now have a tab-delimited file. (It does take a few minutes, if your library is anything near the size of my own. Not bragging.)

So now we can finally get to the main point of this post.  Which is to point out that almost all of the datasets I give my students, whether they are in intro stats or higher, have a small number of variables.  And even if not, the questions almost all involve using only a small number of variables.

But if students are to become Citizen Statisticians, they must learn to think more like scientists.  They must learn how to pose questions and create new measures.  I wonder what most of our students would do, when confronted with the 27 or so variables iTunes gives them.  Make histograms?  Fine, a good place to start.  But what real-world question are they answering with a histogram? Do students really care about the variability in length of their song tracks?

I suggest that one interesting question to ask students to explore is to see if their listening habits have changed over some period of time.  Now I know younger students won’t have much time to look back across.  But I think this is a meaningful question, and one that’s not answered in an obvious way.  More precisely, to answer this question requires thinking about what it means to have a ‘listening habit’, and to question how such a habit might be captured in the given data.

I’m not sure what my students would think of.  Or, frankly, what I would think of.  At the very least, the answer to any such question will require wrestling with the date variables.  One basic question might be how many songs I’ve added per year. This isn’t that easy in many software packages, because I have to loop over the years and count the number of entries. Another question I might want to answer: What proportion of songs remain unplayed each year? (In other words, am I wasting space storing music I don’t listen to?)  Has the mix of genres changed over time, or are my tastes relatively unchanged?
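
To make the looping-and-counting concrete, here is a minimal Python sketch that works on a tab-delimited export. Several things are assumptions to check against your own file: the column names “Date Added” and “Plays”, the date format, and my reading of “unplayed each year” as the share of songs added in a given year that have never been played. The sample rows are made up.

```python
import csv
import io
from collections import Counter
from datetime import datetime

DATE_FORMAT = "%m/%d/%y %I:%M %p"  # assumption; the export's format may differ by locale

def songs_added_per_year(rows, date_column="Date Added"):
    """Count how many tracks were added in each calendar year."""
    counts = Counter()
    for row in rows:
        counts[datetime.strptime(row[date_column], DATE_FORMAT).year] += 1
    return dict(sorted(counts.items()))

def unplayed_share_per_year(rows, date_column="Date Added", plays_column="Plays"):
    """For each year, the share of tracks added that year with no recorded plays."""
    added, unplayed = Counter(), Counter()
    for row in rows:
        year = datetime.strptime(row[date_column], DATE_FORMAT).year
        added[year] += 1
        plays = row[plays_column].strip()
        if not plays or int(plays) == 0:  # a blank play count is treated as zero
            unplayed[year] += 1
    return {year: unplayed[year] / added[year] for year in sorted(added)}

# Made-up rows standing in for a real export.
sample = (
    "Name\tDate Added\tPlays\n"
    "Song A\t3/14/10 9:05 AM\t12\n"
    "Song B\t7/2/10 4:30 PM\t\n"
    "Song C\t1/5/11 8:00 AM\t0\n"
    "Song D\t11/20/12 6:15 PM\t3\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(songs_added_per_year(rows))
print(unplayed_share_per_year(rows))
```

For a real export, replace the in-memory sample with an opened file; the text encoding of the export is something to check, since it may not be plain ASCII.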

Speaking of genres…unless you’ve been really careful about your genre-field, you’re in for a mess. I thought I was careful, but here’s what I’ve got (as seen from Fathom):

If you want to see questions asked by some people, download the free software SuperAnalyzer (This links to the Mac version via CNET).  Below is a graph that shows the growth of my library over time, for example. (Thanks to Anelise Sabbag for pointing this out to me during a visit to U of Minnesota last year, and to Elizabeth Fry and Laura Ziegler for their endorsements of the app.)

And the most common words in the titles:

So let me know what you want to do with your iTunes library. Or what your students have done.  What was frustrating? Impossible? Easier than expected?

Inference for the population by the population — what does that even mean?

In an effort to integrate more hands-on data analysis in my introductory statistics class, I’ve been assigning students a project early on in the class where they answer a research question of interest to them using a hypothesis test and/or confidence interval. One goal of this project is getting the students to decide which methods to use in which situations, and how to properly apply them. But there’s more to it: students define their own research question and find an appropriate dataset to answer that question with. The analysis and findings are then presented in a cohesive research paper.

Settling on a research question that can be answered using limited methods (tests for one or two means or proportions, ANOVA, or chi-square) is the first half of the battle. Some of the research questions students come up with require methods much more involved than simple hypothesis testing or parameter estimation. These students end up having to dial back and narrow down the focus of the research topic to meet the assignment guidelines. I think that this is a useful exercise as it helps them evaluate what they have and have not learned.

The next step is finding data, and this can be quite time consuming. Some students choose research questions about the student body and collect data via in-person surveys at the student center or Facebook polls. A few students even go so far as to conduct experiments on their friends. A huge majority look for data online, which initially appears to be the path of least resistance. However, finding raw data that is suitable for statistical inference, i.e. data from a random sample, is not a trivial task.

I (purposefully) do not give much guidance on where to look for data. In the past, even casually mentioning one source has resulted in more than half the class using that source, so I find it best to give them free rein during this exploration stage (unless someone is really struggling).

Some students use data from national surveys like the BRFSS or the GSS. The data come from a (reasonably) representative sample, and are a perfect candidate for applying statistical inference methods. One problem with such data is that they rarely come in plain text format (they are more often distributed as SAS or SPSS files, etc.), and importing such data into R can be a challenge for novice R users, even with step-by-step instructions.

On the other hand, many students stumble upon resources like the World Bank Database, OECD, the US Census, etc., where data are presented in much more user friendly formats. The drawback is that these are essentially population data, e.g. country indicators like human development index for all countries, and there is really no need for hypothesis testing or parameter estimation when the parameter is already known. To complicate matters further, some of the tables presented are not really “raw data” but instead summary tables, e.g. median household income for all states calculated based on random sample data from each state.
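
The distinction can be made concrete with a small simulation, sketched here in Python. Everything in it is invented for illustration: the “population” is 190 random values standing in for an indicator measured on every country, and the interval uses the simple normal approximation (z = 1.96) rather than any particular method from my course.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for a mean (normal approximation)."""
    center = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return center - z * standard_error, center + z * standard_error

# Invented "population": one indicator value for every unit.
rng = random.Random(1)
population = [rng.uniform(0.3, 0.95) for _ in range(190)]

# With every unit in hand, the parameter is simply computed; there is nothing to infer.
population_mean = statistics.mean(population)

# With only a random sample, the mean is an estimate and the interval
# expresses the uncertainty that comes from sampling.
sample = rng.sample(population, 30)
low, high = mean_confidence_interval(sample)

print(f"population mean (known exactly): {population_mean:.3f}")
print(f"sample mean: {statistics.mean(sample):.3f}, 95% CI: ({low:.3f}, {high:.3f})")
```

Running the interval function on the full population would still produce numbers, which is exactly why students need to ask what uncertainty, if any, those numbers describe.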

One obvious way to avoid this problem is to make the assignment stricter by requiring that chosen data must come from a (reasonably) random sample. However, this stricter rule would give students much less freedom in the research question they can investigate, and the projects tend to be much more engaging and informative when students write about something they genuinely care about.

Limiting data sources also has the effect of increasing the time spent finding data, and hence decreasing the time students spend actually analyzing the data and writing up results. Providing a list of resources for curated datasets (e.g. DASL) would certainly diminish time spent looking for data, but I would argue that the learning that happens during the data exploration process is as valuable as (if not more valuable than) being able to conduct a hypothesis test.

Another approach (one that I have been taking) is allowing the use of population data but requiring a discussion of why it is actually not necessary to do statistical inference in these circumstances. This approach lets the students pursue their interests, but interpretations of p-values and confidence intervals calculated based on data from the entire population can get quite confusing. In addition, it has the side-effect of sending the message “it’s ok if you don’t meet the conditions, just say so, and carry on.” I don’t think this is the message we want students to walk away with from an introductory statistics course. Instead, we should be insisting that they don’t just blindly carry on with the analysis if conditions aren’t met. The “blindly” part is (somewhat) addressed by the required discussion, but the “carry on with the analysis” part is still there.

So is this assignment a disservice to students because it might leave some with the wrong impression? Or is it still a valuable experience regardless of the caveats?