DataFest 2013

DataFest is growing larger and larger.  This year, we hosted an event at Duke (Mine organized this) with teams from NCSU and UNC, and at UCLA (Rob organized) with teams from Pomona College, Cal State Long Beach, University of Southern California, and UC Riverside.  We are very grateful to Vaclav Petricek at eHarmony for providing us with the data, which consisted of roughly one million “user-candidate” pairs, and a couple of hundred variables including “words friends would use to describe you”, ideal characteristics in a partner, the importance of those characteristics, and the all-important ‘did she email him’ and ‘did he email her’ variables.

The students had a great time, and worked hard for 48 hours to prepare short presentations for the judges.  This is the third year we’ve done this, and I’m increasingly impressed with the growing technical skills of the students.  (Which makes our life a lot easier, as far as providing help goes.)  Or maybe it’s just that I’ve been lucky enough to get more and more “VIP Consultants” (statisticians from off-campus) and talented and dedicated grad students to help out, so that I can be comfortably oblivious to the technical struggles.  Or all of the above.

One thing I noticed that will definitely require some adjustment to our curriculum:  Our students had a hard time generating interesting questions from these data.  Part of the challenge is to look at a large, rich dataset and think “What can I show the world that the world would like to know?”  Too many students went directly to model-fitting, without making visuals or engaging in the content of the materials (a surprise, since we thought they would find this material much more easily-engageable than last year’s micro-lending transaction data), or strategizing around some Big Questions.  They managed to pull it off in the end, most of them, but would have done better to brainstorm some good questions to follow, and would have done much better to start with the visuals.

One of the fun parts of DataFest is the presentations.  Students have only 5 minutes and 2 slides to convince the judges of their worthiness.  At UCLA, because we were concerned about having too many teams for the judges to endure, we had two rounds.  First, a “speed dating” round in which participants had only 60 seconds and one slide.  We surprised them by announcing, at the start, that to move on to the next round, they would have to merge their team with one other team, and so these 60-second presentations should be viewed as pitches to potential partners.  We had hoped that teams would match on similar themes or something, and this did happen; but many matches were between teams of friends.  The “super teams” were then allowed to make a 5-minute presentation, and awards were given to these large teams. The judges gave two awards for Best Insight (one to a super-team from Pomona College and another to a super-team from UCLA) and a Best Visualization (to the super-team from USC).  We did have two inter-collegiate super-teams (UCLA/Cal State Long Beach and UCLA/UCR) make it to the final round.

If you want to host your own DataFest, drop a line to Mine or me and we can give you lots of advice.  And if you sit on a large, interesting data set we can use for next year, definitely drop us a line!

Participatory Sensing

The Mobilize project, which I recently joined, centers a high school data-science curriculum around participatory sensing data.  What is participatory sensing, you ask?

I’ve recently been trying to answer this question, with mixed success.  As the name suggests, PS data has to do with data collected from sensors, and so it has a streaming aspect to it.  I like to think of it as observations on a living object.  Like all living objects, whatever this thing is that’s being observed, it changes, sometimes slowly, sometimes rapidly. The ‘participatory’ means that it takes more than one person to measure it. (But I’m wondering if you would allow ‘participatory’ to mean that the student participates in her own measurements/collection?) Initially, in Mobilize, PS meant using specially equipped smart-phones as sensors.  Students could snap pictures of snack-wrappers, or record their mood at a given moment, or record the mood of their snack food.  A problem with relying on phones is that, as it turns out, teenagers aren’t always that good with expensive equipment.  And there’s an equity issue, because what some people consider a common household item, others consider rare and precious.  And smart-phones, although growing in prevalence, are still not universally adopted by high school students, or even college students.

If we ditch the gadgetry, any human being can serve as a sensor.  Asking a student to pause at a certain time of day to record, say, the noise level, or the temperature, or their frame of mind, or their level of hunger, is asking that student to be  a sensor.  If we can teach the student how to find something in the accumulated data about her life that she didn’t know, and something that she finds useful, then she’s more likely to develop what I heard Bill Finzer call a “data habit of mind”.  She’ll turn to data next time she has a question or problem, too.

Nothing in this process is trivial.  Recording data on paper is one thing, but recording it in a data file requires teaching students about flat-files (which, again, is something I’ve learned from Bill, are not necessarily intuitive), and teaching students about delimiters between variables, and teaching them, basically, how to share so that someone else can upload and use their data.  Many of my intro-stats college students don’t know how to upload a data file into the computer, so I now teach it explicitly, with high, but not perfect, rates of success.  And that’s the easy part.  How do we help them learn something of value about themselves or their world?
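
For readers who want to see what a flat file with delimiters looks like in practice, here is a minimal sketch in Python. The variable names (`student`, `date`, `hours_slept`) and the three diary rows are made up for illustration; nothing here comes from Mobilize itself. Each row is one observation, each column one variable, and a comma separates the variables.

```python
import csv
import io

# Hypothetical diary entries (invented for illustration):
# one row per observation, one column per variable.
FIELDS = ["student", "date", "hours_slept"]
rows = [
    {"student": "A", "date": "2013-03-01", "hours_slept": 7.5},
    {"student": "A", "date": "2013-03-02", "hours_slept": 6.0},
    {"student": "B", "date": "2013-03-01", "hours_slept": 8.0},
]


def to_flat_file(records, fieldnames, delimiter=","):
    """Return the records as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_flat_file(text, delimiter=","):
    """Read delimited text back into dictionaries (every value is a string)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


text = to_flat_file(rows, FIELDS)
print(text)
shared_back = from_flat_file(text)
```

Reading the file back returns every value as text, which is one of the small surprises a student meets when someone else tries to use their data.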

I’m open to suggestions here. Please.  One step seems to be to point them towards a larger context in which to make sense of their data.  This larger context could be a social network, or a community, or larger datasets collected on large populations.  And so students might need to learn how to compare their (rather paltry by comparison) data stream to a large national database (which will be more of a snapshot/panel approach, rather than a data-stream).  Or they will need to learn to merge their data with their classmates’, and learn about looking for signals among variation, and comparing groups.

This is scary stuff.  Traditionally, we teach students how to make sense of *our* data.  And this is less scary because we’ve already made sense of the data and we know how to point the student towards making the “right” conclusions.  But these PS data have not before been analyzed.  Even if we, the teachers, may have seen similar data, we have not seen these data.  The student is really and truly functioning as a researcher, and the teacher doesn’t know the conclusion.  What’s more disorienting, the teacher doesn’t have control of the method.  Traditionally, when we talk about ‘shape’ of a distribution, we trot out data sets that show the shapes we want the students to see.  But if the students are gathering their own data, is the shape of a distribution necessarily useful? (It gets scarier at a meta-level: many teachers are novice statisticians, and so how do we teach the teachers to be prepared to react to novel data?)

So I’ll sign off with some questions.  Suppose my classroom collects data on how many hours they sleep a night for, say, one month. We create a data file to include each student’s data.  Students do not know any statistics–this is their first data experience.  What is the first thing we should show them?  A distribution? Of what? What concepts do students bring to the table that will help them make sense of longitudinal data?  If we don’t start with distributions, should we start with an average curve? With an overlay of multiple time-series plots (“spaghetti plots”)?  And what’s the lesson, or should be the lesson, in examining such plots?
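
To make the two candidate displays concrete, here is a rough sketch in Python of the numbers behind each one: the per-student series (the strands of a spaghetti plot) and the night-by-night average curve. The students, nights, and hours are invented, and the function names are my own; this is one possible way to organize such a file, not a recommendation about what to show first.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (invented): (student, night number, hours slept).
records = [
    ("A", 1, 7.0), ("A", 2, 6.0), ("A", 3, 8.0),
    ("B", 1, 8.0), ("B", 2, 7.0), ("B", 3, 6.0),
    ("C", 1, 6.0), ("C", 2, 8.0),
]


def per_student_series(data):
    """Group the records into one time-ordered series per student."""
    series = defaultdict(list)
    for student, night, hours in data:
        series[student].append((night, hours))
    return {student: sorted(points) for student, points in series.items()}


def average_curve(data):
    """Average the hours slept across students, separately for each night."""
    by_night = defaultdict(list)
    for _student, night, hours in data:
        by_night[night].append(hours)
    return [(night, mean(values)) for night, values in sorted(by_night.items())]


strands = per_student_series(records)
curve = average_curve(records)
print(strands)
print(curve)
```

In this made-up example the average curve is flat while every individual strand moves, which is exactly the sort of contrast the questions above are asking about.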

Thursday Next

From Jasper Fforde’s latest Thursday Next novel (The Woman Who Died a Lot):

The Office for Ultimate Risk is one of the many departments within the Ministry of National Statistics. Although it was originally an “experimental” department, the statisticians at Ultimate Risk proved their worth by predicting the entire results of three football World Cups in succession, a finding that led to the discontinuation of football as a game and the results being calculated instead.

Wall Street Journal

Carl Bialik of the Wall Street Journal has a nice article about the growth of statistics.  The print version differs substantially from the online version in content, though not in message.

Missing from this message is the urgent need to have more teachers, at all levels, trained in statistics.  I’m currently at a meeting of the joint committee on education of the American Statistical Association and the National Council of Teachers of Mathematics.  A recurring theme is that K-12 stats education has arrived, but professional development lags far behind.  Agreed that our future needs statisticians.  But we also need teachers who know statistics, 3rd graders who know statistics, unskilled workers who know statistics, and, well, everyone needs some statistics.  And by “statistics”, we do not mean the ability to plug numbers into memorized formulas.  Instead, we mean “reasoning with data.”

Waiting for Tempo

I got pretty excited about a new calendar app, in part because I love these productivity tools and in part because I really hate the calendar that comes with the iPhone.  Tempo, as it is called, seemed nifty because it integrates data on your phone into the calendar so, for instance, you can get directions to your next meeting easily, alert people that you’re going to be late, have documents related to your appointment automatically opened, and other features that will either save lots of time and hassles or themselves become time-sinks and hassles.

But I need not have worried.  When I downloaded the app, I was informed that there is a bit of a queue for activating accounts.  I was currently 32,310th in line, with 2 people behind me.  So how much longer would I wait?

I logged on from time to time to assess my place in line.  The data are here: tempowaitlist.

I’m in Alexandria, VA, in town for a meeting of an ASA/NCTM education committee.  Although it’s late here, it still feels early to us PST-ers, and so I thought I’d take a look at the data.  I’m almost to the point where half of the line is ahead of me, although mostly this is due to so many people lining up behind me.  (17,505 have queued up behind me; surely a sign that I’ve joined the right app?)  This graph shows my progress so far, in terms of how many are ahead of me.  You can, if boredom strikes, estimate how many more seconds before I arrive at the front of the line. I’m somewhat encouraged by a hint of a sudden acceleration yesterday, but my excitement is no doubt reading too much into noise.

When I get there, I’ll let you know if it was worth the wait.

Number of People Ahead of me in line, as of Feb 28.
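
If boredom does strike, one simple way to make the estimate is to fit a straight line to the number of people ahead over time and see where it crosses zero. The sketch below does this with ordinary least squares in Python. The four observations are placeholders I made up, NOT the real waitlist data, and the straight-line assumption is mine (a sudden acceleration would break it).

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def days_until_front(days, ahead):
    """Days from the last observation until the fitted line reaches zero."""
    slope, intercept = fit_line(days, ahead)
    if slope >= 0:
        return None  # the line is not shrinking, so there is no finite estimate
    zero_crossing = -intercept / slope
    return zero_crossing - days[-1]


# Placeholder observations (hypothetical, not the real tempowaitlist data):
# day of observation and number of people ahead in line.
days = [0, 1, 2, 3]
ahead = [32000, 31000, 30000, 29000]

remaining_days = days_until_front(days, ahead)
remaining_seconds = remaining_days * 24 * 60 * 60
print(remaining_days, remaining_seconds)
```

Swapping in the actual observations from the data file would give the real answer, for whatever a linear extrapolation is worth.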

NCTM Essential Understandings

NCTM has finally published books on statistics in its EU series. This is a rather traditional approach to statistics, given the context of this blog. But, since I’m a co-author (along with Roxy Peck and Stephen Miller), why not point you to it?

http://www.nctm.org/catalog/product.aspx?ID=13804

And while the book is not computational in theme, it does address a central issue of this blog: universal statistical knowledge.

A grades 6-9 version is due out any moment. Stay tuned.

Extreme Fitbit

I have been mourning the loss this week of my FitBit.  No idea where it went.  That’s the problem with small, portable data collection devices.  The very feature that makes them useable makes them lose-able.  Then I came across this possible solution

http://www.engadget.com/2013/01/21/australian-firefighters-test-data-transmitting-pills/

which raises entirely new questions about edible lines of data collection devices.

 

Your Flowing Data Defended

I had the privilege last week of listening to the dissertation defense of UCLA Stat’s newest PhD: Nathan Yau.  Congratulations, Nathan!

Nathan runs the very popular and fantastic blog Flowing Data, and his dissertation is about, in part, the creation of his app Your Flowing Data.  Essentially, this is a tool for collecting and analyzing personal data–data about you and your life.

One aspect of the thesis I really liked is a description of types of insight he found from a paper by Pousman, Stasko and Mateas (2007): Casual Information Visualization: Depictions of Data in Everyday Life. (IEEE Transactions on Visualization and Computer Graphics, 13(6): 1145-1152.)  Nathan quotes four types of insights:

  • Analytic Insight.  Nathan describes these as ‘traditional’ statistical insights obtained from statistical models.
  • Awareness insight. “…remaining aware of data streams such as the weather, news…” People are simply aware that these everyday streams exist and so know to seek them for information when needed
  • Social Insight. Involvement in social networks helps people define a place for themselves in relation to particular social contexts.
  • Reflective Insight.  Viewers take a step back from data and can reflect on something they were perhaps unaware of, or have an emotional reaction.

With respect to my Walk to Venice Beach, I think it would be interesting to see how experiences such as that can be leveraged into insights in these categories.  Although these insights are not hierarchical, it would also be interesting to see how these fit into understandings of statistical thinking and reasoning.  For example, some stats ed researchers are grappling with the role of ‘informal’ vs. ‘formal’ statistical inference, and I see the last three insights as supporting informal inference (when inference is called for at all).

Nathan has lots to say about the role that developers can play in assisting people in gaining insight from data.  Our job, I believe, is to think carefully about the role that educators can play in strengthening these insights.  We spend too much time on the first insight, I think, and not enough time on the others.  But the others are what students will remember and use from their stats class.

iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible:  accessible statistical software.  “Accessible” in (at least) two senses:  affordable and ready-to-use.  This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland, and is intended for kids to use along with the Census@School data.  Alas, the software is greatly hampered on a Mac, but even there has many features which kids and teachers will appreciate.  Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy-entry.  Kids can quickly upload data and see basic boxplots and summary statistics, without much effort. (There are some movies on the homepage to help you get started, but it’s pretty much an intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods.  Below are summaries of FitBit data collected this Fall quarter, and separated into days I taught in a classroom (Lecture==1) and days I did not.  It’s depressingly clear that teaching is good for me.  (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions.  The green horizontal lines are 95% bootstrap confidence intervals for the medians.
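
For the curious, here is a rough idea of what a 95% bootstrap confidence interval for a median involves, sketched in Python. I am assuming the simple percentile method; I do not know which variant iNZight actually uses. The step counts are invented stand-ins, not my FitBit data.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the median (an assumed method)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample the data with replacement many times, keeping each median.
    medians = sorted(
        median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


# Invented daily step counts (hypothetical, for illustration only).
steps = [4200, 5100, 6300, 7000, 7400, 8100, 8800, 9500, 10200, 12500]

low, high = bootstrap_median_ci(steps)
print(low, high)
```

Running it separately on the lecture days and the non-lecture days would give one interval per group, like the pair of green lines in the graphic.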

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)
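
Here is a small Python sketch of that kind of subsetting: bin a numerical variable, then summarize the teaching and non-teaching days within each bin. The equal-width bins are my assumption (I do not know how iNZight chooses its bins), and the field names and six records are invented.

```python
from collections import defaultdict
from statistics import median


def bin_index(value, low, high, n_bins):
    """Equal-width bin number (0-based) for a value between low and high."""
    if high == low:
        return 0
    position = int((value - low) / (high - low) * n_bins)
    return min(position, n_bins - 1)  # the maximum falls in the last bin


def medians_by_bin(records, n_bins=2):
    """Median steps for each (stairs bin, lecture) combination."""
    stairs = [r["stairs"] for r in records]
    low, high = min(stairs), max(stairs)
    groups = defaultdict(list)
    for r in records:
        key = (bin_index(r["stairs"], low, high, n_bins), r["lecture"])
        groups[key].append(r["steps"])
    return {key: median(values) for key, values in sorted(groups.items())}


# Invented records (hypothetical): stairs climbed, lecture day or not, steps.
records = [
    {"stairs": 2, "lecture": 0, "steps": 4000},
    {"stairs": 3, "lecture": 1, "steps": 9000},
    {"stairs": 4, "lecture": 0, "steps": 5000},
    {"stairs": 10, "lecture": 1, "steps": 12000},
    {"stairs": 12, "lecture": 0, "steps": 7000},
    {"stairs": 11, "lecture": 1, "steps": 11000},
]

summary = medians_by_bin(records)
print(summary)
```

Each key in the result is a (bin, lecture) pair, so comparing the two lecture values within one bin corresponds to looking at one pair of boxplots on the slider.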

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, three-d rotating plots, and scatterplot matrices.  PC users can view some cool visualizations that emphasize the variability in re-sampling.

Data Diary Assignment

My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened.  I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’  (We had talked a bit in class about what that meant, and about what devices were storing data.)  They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.

The results were interesting. The vast majority “got” it.  The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”

But those were very few (maybe 2 or 3).  The rest were quite thoughtful.  The sort of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (Facebook postings, Google searches), and classroom activities (clickers, enrollments).  Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted. They pointed out that with just one day’s data, someone could have a pretty clear idea of their social structure.  And by pooling the class’s data or the campus’s data, a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future.  They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.

Here’s my question for you:  what’s the next step?  Where do we go from here to build on this lesson?  And to what purpose?