Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how, while IRB approval is not required for data collected by students for class projects, the discussion of ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and educate our students about that process. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved on from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing in the introductory statistics course issues surrounding data ethics and IRB (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by being involved with the honor council of her university, when she realized that while the council keeps impeccable records of reported cases, they don’t have any information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example thinking about how to code “any” academic misconduct based on various questions that ask about whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. It’s my experience that students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seems to hit both of those notes.

Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct and an outcome of their own choosing, and to present their findings in an optional (some form of extra credit) research paper. One example that Shannon gave for the latter task was defining a “serious offender”: is it a student who commits a one-time bad offense or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that the hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment Shannon mentioned that the administration at her school was concerned that students finding out about high percentages of academic offense (survey showed that about 60% of students committed a “major” academic offense) might make students think that it’s ok, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking to expand this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it’s not), projects like these can remind students that data and statistics can be powerful activism tools.

Personal Data Apps

Fitbit, you know I love you and you’ll always have a special place in my pocket.  But now I have to make room for the Moves app to play a special role in my capture-the-moment-with-data existence.

Moves is a free iOS 7 app.  It eats up some extra battery power and in exchange records your location, merges this with various databases, syncs it up to other databases, and produces some very nice “story lines” that remind you about the day you had and, as a bonus, can motivate you to improve your activity levels.  I’ve attached two example storylines that do not make it too embarrassingly clear how little exercise I have been getting. (I have what I consider legitimate excuses, and once I get the dataset downloaded, maybe I’ll add them as covariates.)  One of the timelines is from a day that included an evening trip to Disneyland. The other is a Saturday spent running errands and capped with dinner at a friend’s.  It’s pretty easy to tell which day is which.

But there’s more.  Moves has an API, thus allowing developers to tap into their datastream to create apps.  There’s an app that exports the data for you (although I haven’t really had success with it yet) and several that create journals based on your Moves data.  You can also merge Foursquare, Twitter, and all the usual suspects.

I think it might be fun to have students discuss how one could go from the data Moves collects to creating the storylines it makes.  For instance, how does it know I’m in a car, and not just a very fast runner?  Actually, given LA traffic, a better question is how it knows I’m stuck in traffic and not just strolling down the freeway at a leisurely pace? (Answering these questions requires another type of inference than what we normally teach in statistics. )  Besides journals, what apps might they create with these data and what additional data would they need?
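As a starting point for that discussion, here is a naive sketch in R of one way to guess the mode of travel from location fixes: compute the speed between consecutive points and cut it into categories. The data, the helper name `haversine_m`, and the speed cutoffs are all made up for illustration; I have no idea how Moves actually does this, and it presumably uses much more than speed.

```r
# Great-circle distance in meters between points given in decimal degrees
haversine_m <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371000 * asin(pmin(1, sqrt(a)))
}

# Made-up location fixes, one per minute (not real Moves data)
fixes <- data.frame(
  time = as.POSIXct("2013-06-01 09:00:00", tz = "UTC") + 60 * (0:4),
  lat  = c(34.0000, 34.0007, 34.0014, 34.0100, 34.0186),
  lon  = rep(-118.4500, 5)
)

# Distance and elapsed time between each pair of consecutive fixes
n <- nrow(fixes)
dist_m <- haversine_m(fixes$lat[-n], fixes$lon[-n], fixes$lat[-1], fixes$lon[-1])
secs <- as.numeric(difftime(fixes$time[-1], fixes$time[-n], units = "secs"))
speed_kmh <- dist_m / secs * 3.6

# Hypothetical cutoffs: up to 7 km/h is walking, up to 20 km/h running, else vehicle
travel_mode <- cut(speed_kmh, breaks = c(-Inf, 7, 20, Inf),
                   labels = c("walk", "run", "vehicle"))
data.frame(speed_kmh = round(speed_kmh, 1), travel_mode)
```

A rule this simple would happily label a car crawling through LA traffic as a walker, which is exactly why the question makes for a good discussion.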

Participatory Sensing

The Mobilize project, which I recently joined, centers a high school data-science curriculum around participatory sensing data.  What is participatory sensing, you ask?

I’ve recently been trying to answer this question, with mixed success.  As the name suggests, PS data has to do with data collected from sensors, and so it has a streaming aspect to it.  I like to think of it as observations on a living object.  Like all living objects, whatever this thing is that’s being observed, it changes, sometimes slowly, sometimes rapidly. The ‘participatory’ means that it takes more than one person to measure it. (But I’m wondering if you would allow ‘participatory’ to mean that the student participates in her own measurements/collection?) Initially, in Mobilize, PS meant using specially equipped smart-phones as sensors.  Students could snap pictures of snack-wrappers, or record their mood at a given moment, or record the mood of their snack food.  A problem with relying on phones is that, as it turns out, teenagers aren’t always that good with expensive equipment.  And there’s an equity issue, because what some people consider a common household item, others consider rare and precious.  And smart-phones, although growing in prevalence, are still not universally adopted by high school students, or even college students.

If we ditch the gadgetry, any human being can serve as a sensor.  Asking a student to pause at a certain time of day to record, say, the noise level, or the temperature, or their frame of mind, or their level of hunger, is asking that student to be  a sensor.  If we can teach the student how to find something in the accumulated data about her life that she didn’t know, and something that she finds useful, then she’s more likely to develop what I heard Bill Finzer call a “data habit of mind”.  She’ll turn to data next time she has a question or problem, too.

Nothing in this process is trivial.  Recording data on paper is one thing, but recording it in a data file requires teaching students about flat-files (which, again something I’ve learned from Bill, is not necessarily intuitive), and teaching students about delimiters between variables, and teaching them, basically, how to share so that someone else can upload and use their data.  Many of my intro-stats college students don’t know how to upload a data file into the computer, so I now teach it explicitly, with high, but not perfect, rates of success.  And that’s the easy part.  How do we help them learn something of value about themselves or their world?
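To make the flat-file idea concrete, here is a small sketch in R. The diary values and column names are invented for illustration; the point is only the shape (one row per observation, one column per variable) and the delimiter that separates the variables when the file is written out.

```r
# A made-up week of one student's recordings: one row per observation,
# one column per variable (a "flat file")
diary <- data.frame(
  date        = as.Date("2013-03-04") + 0:4,
  hours_sleep = c(7.5, 6, 8, 6.5, 7),
  mood        = c("ok", "tired", "good", "tired", "ok")
)

# Write it out with commas as the delimiter between variables
path <- file.path(tempdir(), "diary.csv")
write.csv(diary, path, row.names = FALSE)

# What the shared file looks like as plain text
cat(readLines(path), sep = "\n")

# Someone else can now read it back in
shared <- read.csv(path, stringsAsFactors = FALSE)
str(shared)
```

Looking at the raw text of the file next to the data table may help students see what the delimiter is doing.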

I’m open to suggestions here. Please.  One step seems to be to point them towards a larger context in which to make sense of their data.  This larger context could be a social network, or a community, or larger datasets collected on large populations.  And so students might need to learn how to compare their (rather paltry by comparison) data stream to a large national database (which will be more of a snapshot/panel approach, rather than a data-stream).  Or they will need to learn to merge their data with their classmates’, and learn about looking for signals among variation, and comparing groups.

This is scary stuff.  Traditionally, we teach students how to make sense of *our* data.  And this is less scary because we’ve already made sense of the data and we know how to point the student towards making the “right”  conclusions.  But these PS data have not before been analyzed.  Even if we, the teachers, have seen similar data, we have not seen these data.  The student is really and truly functioning as a researcher, and the teacher doesn’t know the conclusion.  What’s more disorienting, the teacher doesn’t have control of the method.  Traditionally, when we talk about ‘shape’ of a distribution, we trot out data sets that show the shapes we want the students to see.  But if the students are gathering their own data, is the shape of a distribution necessarily useful? (It gets scarier at a meta-level: many teachers are novice statisticians, and so how do we teach the teachers to be prepared to react to novel data?)

So I’ll sign off with some questions.  Suppose my classroom collects data on how many hours they sleep a night for, say, one month. We create a data file to include each student’s data.  Students do not know any statistics–this is their first data experience.  What is the first thing we should show them?  A distribution? Of what? What concepts do students bring to the table that will help them make sense  of longitudinal data?  If we don’t start with distributions, should we start with an average curve? With an overlay of multiple time-series plots (“spaghetti plots”)?  And what’s the lesson, or should be the lesson, in examining such plots?
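For anyone who wants to try those last two displays, here is a rough sketch in R. The sleep data are simulated (eight hypothetical students, thirty nights, random noise around seven hours), so the picture says nothing about real students; it only shows how a spaghetti plot and an average curve can be drawn from one data file.

```r
set.seed(1)  # simulated data, purely illustrative
n_students <- 8
n_nights   <- 30

# Long format: one row per student per night (night varies fastest)
sleep <- expand.grid(night = 1:n_nights, student = 1:n_students)
sleep$hours <- round(rnorm(nrow(sleep), mean = 7, sd = 1), 1)

# Nights-by-students matrix: each column is one student's series
hours_wide <- matrix(sleep$hours, nrow = n_nights, ncol = n_students)

# Spaghetti plot: one thin line per student
matplot(1:n_nights, hours_wide, type = "l", lty = 1, col = "grey60",
        xlab = "Night", ylab = "Hours of sleep")

# Average curve: the class mean for each night, drawn on top
avg_curve <- rowMeans(hours_wide)
lines(1:n_nights, avg_curve, lwd = 3)
```

With pure noise like this the average curve is flat and boring, which might itself be a useful first lesson: the class average hides how much each individual line jumps around.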

Your Flowing Data Defended

I had the privilege last week of listening to the dissertation defense of UCLA Stat’s newest PhD: Nathan Yau.  Congratulations, Nathan!

Nathan runs the very popular and fantastic blog Flowing Data, and his dissertation is about, in part, the creation of his app Your Flowing Data.  Essentially, this is a tool for collecting and analyzing personal data–data about you and your life.

One aspect of the thesis I really liked is a description of types of insight he found from a paper by Pousman, Stasko and Mateas (2007): Casual information visualization: Depictions of data in everyday life. (IEEE Transactions on Visualization and Computer Graphics, 13(6): 1145-1152.)  Nathan quotes four types of insights:

  • Analytic Insight.  Nathan describes these as ‘traditional’ statistical insights obtained from statistical models.
  • Awareness Insight. “…remaining aware of data streams such as the weather, news…” People are simply aware that these everyday streams exist and so know to seek them for information when needed.
  • Social Insight. Involvement in social networks helps people define a place for themselves in relation to particular social contexts.
  • Reflective Insight.  Viewers take a step back from data and can reflect on something they were perhaps unaware of, or have an emotional reaction.

With respect to my Walk to Venice Beach, I think it would be interesting to see how experiences such as that can be leveraged into insights in these categories.  Although these insights are not hierarchical, it would also be interesting to see how these fit into understandings of statistical thinking and reasoning.  For example, some stats ed researchers are grappling with the role of ‘informal’ vs. ‘formal’ statistical inference, and I see the last three insights as supporting informal inference (when inference is called for at all).

Nathan has lots to say about the role that developers can play in assisting people in gaining insight from data.  Our job, I believe, is to think carefully about the role that educators can play in strengthening these insights.  We spend too much time on the first insight, I think, and not enough time on the others.  But the others are what students will remember and use from their stats class.

A walk in Venice Beach

For various reasons, I decided to walk this weekend from my house to Venice Beach, a distance of about four and a half miles.  The weather was beautiful, and I thought a walk would help clear my mind.  I had recently heard a story on NPR in which it was reported that Thoreau kept data on when certain flowers opened, a record now used to help understand the effects of global warming.  Some of these flowers were as far as 5 miles from Thoreau’s home.  Which made me think that if he could walk 5 miles to collect data, so could I.  Inspired also, perhaps, by the UCLA Mobilize project, I made a decision to take a photo every 5 minutes.  The rule was simple: I would set my phone’s timer for 5 minutes. When it rang, no matter where I was, I would snap a picture.

I decided I would take just one picture, so that I would be forced to exercise some editorial decision making. That way, the data would reflect my own state of mind, in some sense.  Later in the walk, I cheated, because it’s easier to take many pictures than to decide on one.  I also sometimes cheated by taking pictures of things when it wasn’t the right time.  Here’s the last picture I decided to take, at the end of my walk (I took a cab home. I am that lazy) on Abbot Kinney.

Brick mural, on Abbot Kinney

This exercise brought up a dilemma I often encounter when touristing–do you take intimate, close-up pictures of interesting features, like the above, or do you take pictures of the environment, to give people an idea of the surroundings?  This latter is almost always a bad idea, particularly if all you’ve got is an iPhone 4; it really is difficult to improve on Google Street View.  It is, however, extremely tempting, despite the fact that it leads to pictures like this:

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North

But my subject-matter choices were also limited in other ways.  For one, it was fairly hot, as this temperature plot (http://www.friendlyforecast.com/usa/archive) shows.

temp plot

The heat kept me on the shady side of the street, and the sun meant that I usually had to shoot across the street, although there were some exceptions:

(The object on the left is what we once called a “pay phone”. The only public phone I encountered that day, in fact, which added to the mystery of this storefront which had a colorful mural, but no name or address marker.)

During the walk I stopped at a farmer’s market and at a used book sale at the Mar Vista Library (bought an Everyman’s Library book about Beethoven and the score to Bach’s Cantata #4). I watched toddler-aged girls fight and cry and dance outside a ballet studio, drank a too-expensive cup of coffee at Intelligentsia Coffee (but it was good), and bought my sister, for her birthday, a terrarium at a make-your-own terrarium shop.

Books.

What to do with these data?  One challenge is to see what can be gleaned from the photos.  The only trend that jumped out at me, while reviewing these, was the fact that I was in line at that coffee shop for a very long time, as this series of photos (taken every 5 minutes, remember) attests:

Closer

waiting for the hand-pour-brewed coffee to actually be poured

So at the risk of overthinking this post, I’ll just come right to the point (finally):  how do we provide tools to make it easier for people to make sense of these data?

Rather than organize my partial answer in a thoughtful way, and thus spend weeks writing it down, let me just make a list.  I will organize the list, however, by sub-category.

Gathering the Data

  • The iPhone, of course, stores date and time stamps, as well as location stamps, whenever I snap a photo.  And lots of other data, called EXIF data.  I can look at some of these using Preview or iPhoto, but trying to extract the data for my own use is hard.  Does anyone know a way of getting a datafile that has the time, date, and GPS coordinates for my pictures?  (And any other photo meta-data, for that matter.)  I browsed through a discussion on Stack Overflow, and for me the take-home message was “no.” I did find a way to view the data: first, load the iPhone photos into iPhoto. Then export to hard drive, being sure to check the ‘include location information’ box. Then, open with Preview, open the Inspector (command-i or choose from drop-down menu), and then click on the GPS tab.  From there it is a simple matter of typing everything in, photo by photo, into another file.
  • Weather data is easily found to supplement the story, as the above graph shows.
  • OpenPaths provides free location data, and even stores it for you.  It allows you to export nice csv files, such as this file.

Displaying the Data

  • Well, you can always paste photos and graphs into a long, rambling narrative.
  • iPhoto is apparently one of the few programs that does have access to your EXIF data, and the “Places” feature will, with some playing around, let you show where you’ve been. It’s tedious, and you can’t easily share the results (maybe not at all).  But it does let you click on a location pin and see the picture taken there, which is fun.
  • StatCrunch has a new feature that lets you easily communicate with Google Maps. You provide latitude, longitude and optional other data, and it makes a map.  There are some funny formatting requirements:  data must be in this form  lat lon|color|other_variable
    Hopefully, StatCrunch will add a feature that lets you easily move from the usual flat-file format for data to this format.  In the meantime, I had to export my StatCrunch OpenPaths data to Excel (could have used R, but I’m rusty with the string commands), and then re-import as a new data set.
  • Venice Walk Open Paths map on StatCrunch-1
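For the record, the R route from a flat file to the lat lon|color|other_variable form is short. This is a sketch with a few made-up rows and invented column names (not my actual OpenPaths export), and I have only checked that it builds strings in the form described above, not how StatCrunch handles them.

```r
# A few made-up rows in the usual flat-file layout
walk <- data.frame(
  lat   = c(34.0195, 34.0101, 33.9930),
  lon   = c(-118.4290, -118.4432, -118.4700),
  color = c("red", "red", "blue"),
  note  = c("start", "library", "Abbot Kinney")
)

# Build one string per row in the form  lat lon|color|other_variable
walk$map_string <- paste0(walk$lat, " ", walk$lon, "|", walk$color, "|", walk$note)
walk$map_string
```

Note that R drops trailing zeros when it turns the numbers into text; wrapping the coordinates in `format()` or `sprintf()` would fix the number of decimals if that matters.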

Making Sense of It All

But the true challenge is how do we make sense of it all?  How do we merge these data in such a way that unexpected patterns, the kind that reveal deeper truths, can emerge? At the very least, is there a single, comprehensive data display that would allow you to more fully appreciate my experience?  If (and when) I do this walk again, how can I compare the data from the two different walks?

Some other themes:  our data should be ours to do with as we please. OpenPaths has it right; iPhone has it wrong wrong wrong.  Another theme: maps are now a natural and familiar way of storing and displaying data.  StatCrunch has taken some steps in the right direction in attempting to provide a smooth pathway between data and map, but more is needed.  Perhaps there’s a friendly, flexible, open-source mapping tool out there somewhere that would encourage our data-conscious citizens to share their lives through maps?

If you’re still reading, you can view all of the pictures on Flickr.

My Year of Reading in Review

Two years ago, I made a New Year’s Resolution to read more books. At that point I joined GoodReads to hold myself accountable. I read 47 books that year (at least that I recorded). In 2012, I didn’t re-make that resolution, and my reading productivity dropped to 29 (really 26 since I quit reading 3 books). While the number of books is lower, I did some minor analyses on these books based on data I scraped from GoodReads and Amazon.

One summary I created was to examine the number of books I read per month. I also wanted to account for the fact that some books are a lot shorter than others, so in addition I looked at the average number of pages I read per month as well.
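A summary like this takes only a few lines in R. The sketch below uses a tiny made-up reading log with invented column names (`date_read`, `pages`), not the real spreadsheet, and it takes “average pages” to mean the average length of the books finished in each month, which is only one possible reading.

```r
# Made-up reading log (column names are assumptions, not the real spreadsheet's)
reading <- data.frame(
  date_read = as.Date(c("2012-01-15", "2012-05-03", "2012-05-20",
                        "2012-05-28", "2012-12-22", "2012-12-30")),
  pages     = c(320, 210, 450, 180, 600, 240)
)

# Month as a factor with all twelve levels, so empty months are kept
reading$month <- factor(format(reading$date_read, "%m"),
                        levels = sprintf("%02d", 1:12))

# Number of books finished in each month (months with no books show as 0)
books_per_month <- table(reading$month)

# Average pages per book among the books finished in each month
# (months with no books come out as NA)
avg_pages <- tapply(reading$pages, reading$month, mean)

books_per_month
round(avg_pages)
```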

Number of books and average pages read per month in 2012.

It is clear that May and December are prolific reading months for me. My interpretation is that these are the months that semesters end, and very often I retreat into the pages of a book or two to escape for a bit.

How do I rate these books I read? Are Amazon and GoodReads raters giving the books I read the same rating?

Average rating of the books I read in 2012 for Amazon and GoodReads raters. Size and color of the points indicate my GoodReads rating.

I gave mostly 3/5 and 4/5 stars to the books I read. It is clear from the plot that there is an overall positive relationship: books that are rated highly by GoodReads raters are, on average, the same books being rated highly by Amazon raters–and vice-versa.

What book did I give a rating of 5/5 to? The synopsis of my reading year, including the answer to this question, is available here [2012-Annual-Reading-Synopsis].

I have also made the spreadsheet data (from both 2011 and 2012) available publicly on GoogleDocs [data].

Here is the R code to access that data.

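# Download the published Google spreadsheet as CSV text (needs the RCurl package)
library(RCurl)
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?key=0AvanLJO1M39wdENZajR0RHJMSmZTWWtLNzhHMi1ySUE&single=true&gid=0&output=csv")
# Parse the CSV text into a data frame
books <- read.csv(textConnection(myCsv))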

Data Privacy for Kids

The L.A. Times ran an interesting article about the new Federal Trade Commission report (downloads), “Mobile Apps for Kids: Disclosures Still Not Making the Grade”, which followed up on a February 2012 report and concluded that “Yes, many apps included interactive features or shared kids’ information with third parties without disclosing these practices to parents.”

I think this issue is intriguing on many levels, but of central concern is the fact that as we go about our daily business (or play, as the case may be), we leave a data trail, sometimes unwittingly.   Quite often unknowingly. Perhaps we’ve reached the point where there’s no going back, and we must accept the fact that when we engage with the mobile community, we engage in a data-exchange.  But it seems an easy thing to set standards so that, maybe, developers are required to produce logs of the data transaction.  And third-party developers could write apps that let parents examine this exchange. (Without, of course, sharing this information with a third party.)  It would be interesting and fun, I would think, to create a visualization of the data flow in and out of your device across a period of time.

The report indicated that the most commonly shared datum was the device ID number.  Sounds innocent, but, as we all know, it’s the crosstabs that kill.  The device ID is a unique number, and is associated with other variables, such as the device operating system, primary language, the carrier, and other information.  It can also be used to access personal data, such as the name, address, and email address of the user.  While some apps share only the device ID, and thus may seem safe, other apps send more data to the same companies.  And so companies that receive data from multiple apps can build up a more detailed picture of the user.  And these companies share data with each other, and so can create even richer pictures.

There are some simple ways of introducing the topic into a stats course.  The report essentially conducted a random sample (well, let’s say it had some random sampling components) of apps.  And it reports estimated percentages. But never, of course, confidence intervals.  And so you can ask your students a question such as “The FTC randomly sampled 400 apps that were marketed to “kids” from the Google and iTunes app store.  60% of the apps in the sample transmitted the Device ID to a third-party.  Find a 95% confidence interval for the proportion of all apps….”  Or, “only 20% of the apps that transmitted private information to a third-party disclosed this fact to the user.  Find a 95% CI….”
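For the first question, the usual large-sample interval is p̂ ± z·sqrt(p̂(1 − p̂)/n). Here is a quick check in R using the 400 and 60% from the example question (these are the numbers in the exercise as posed above, so treat them as classroom figures rather than a careful summary of the report).

```r
n     <- 400    # sample size from the example question
p_hat <- 0.60   # sample proportion from the example question

# Large-sample (Wald) 95% interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z  <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * z * se
round(ci, 3)   # 0.552 0.648
```

So students should land on roughly 55% to 65%. The second question needs the number of apps that transmitted private information, which the exercise would have to supply.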

The report contains some nice comparisons with the 2011 data concerning the types of “kids” apps available, as well as a discussion of the type of advertising that appears.  (An amusing example that shouldn’t be amusing is an app targeted at kids that displays advertising for a dating web site: “1000+ singles!”.)  It reminds me of something my sister told me when her kids were young:  she could always tell when they found something troubling on the computer because suddenly they would grow very quiet.

Accessing your 11.0 iTunes library

One of the themes of this blog is to make statistics relevant and exciting to students by helping them understand the data that’s right under their noses.   Or inside their ears.  The iTunes library is a great place to start.

For a while, iTunes made it easy to get your data onto your hard drive in a convenient, analysis-ready form. Then they made it hard.  Then (10.7) they made it easy again. Now, in 11.0, it is once again ‘hard’.  Prior to version 11.0, these instructions would do the trick: Open up iTunes, Control-click on the “Music” library and choose Export.  And a tab-delimited text file with all iTunes library data appears.

Now, iTunes 11.0 provides only an XML library.  This is a shame for us teachers, since the data is now one step further removed from student access. In particular, it’s a shame because the data structure is not terribly complex; a flat file should do the trick. (If you want the XML file, select File>Library>Export.)

But all is not lost: with one simple workaround, you can get your data. First, create a smart playlist that has all of your songs.  I did this by including in the list all songs added before today’s date.  Now control-click on the name of the playlist, and choose Export.  Save the file wherever you wish, and you now have a tab-delimited file. (It does take a few minutes, if your library is anything near the size of my own. Not bragging.)

So now we can finally get to the main point of this post.  Which is to point out that almost all of the datasets I give my students, whether they are in intro stats or higher, have a small number of variables.  And even if not, the questions almost all involve using only a small number of variables.

But if students are to become Citizen Statisticians, they must learn to think more like scientists.  They must learn how to pose questions and create new measures.  I wonder what most of our students would do, when confronted with the 27 or so variables iTunes gives them.  Make histograms?  Fine, a good place to start.  But what real-world question are they answering with a histogram? Do students really care about the variability in length of their song tracks?

I suggest that one interesting question to ask students to explore is to see if their listening habits have changed over some period of time.  Now I know younger students won’t have much time to look back across.  But I think this is a meaningful question, and one that’s not answered in an obvious way.  More precisely, to answer this question requires thinking about what it means to have a ‘listening habit’, and to question how such a habit might be captured in the given data.

I’m not sure what my students would think of.  Or, frankly, what I would think of.  At the very least, the answer to any such question will require wrestling with the date variables and so require some work with dates.  Some basic questions might be to see how many songs I’ve added per year. This isn’t that easy in many software packages, because I have to loop over the year and count the number of entries. Another question that I might want to know: What proportion of songs remain unplayed each year? (In other words, am I wasting space storing music I don’t listen to?)  Has the mix of genres changed over time, or are my tastes relatively unchanged?
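In R, at least, the first two of those questions can be handled without an explicit loop. The sketch below uses a tiny made-up stand-in for the exported playlist; the column names (`Date.Added`, `Plays`) and the date format are my assumptions about what the export looks like, so check them against your own file, which you would read in with something like `read.delim()`.

```r
# A tiny made-up stand-in for the exported playlist; column names and
# date format are assumptions and may not match a real export exactly
tunes <- data.frame(
  Name       = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  Date.Added = c("3/14/09 9:12 PM", "7/2/09 8:00 AM", "1/5/10 4:30 PM",
                 "11/20/11 10:15 AM", "12/1/11 6:45 PM"),
  Plays      = c(12, NA, 0, 3, NA),
  stringsAsFactors = FALSE
)

# Parse the date part (the trailing time of day is ignored) and pull out the year
added <- as.Date(tunes$Date.Added, format = "%m/%d/%y")
tunes$year_added <- as.integer(format(added, "%Y"))

# How many songs were added in each year
songs_per_year <- table(tunes$year_added)

# Proportion of each year's additions that have never been played
# (treating a missing or zero play count as unplayed)
unplayed <- is.na(tunes$Plays) | tunes$Plays == 0
prop_unplayed <- tapply(unplayed, tunes$year_added, mean)

songs_per_year
prop_unplayed
```

The date wrestling is still there, but it is one line of parsing rather than a loop over years.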

Speaking of genres…unless you’ve been really careful about your genre-field, you’re in for a mess. I thought I was careful, but here’s what I’ve got (as seen from Fathom):

If you want to see questions asked by some people, download the free software SuperAnalyzer (This links to the Mac version via CNET).  Below is a graph that shows the growth of my library over time, for example. (Thanks to Anelise Sabbag for pointing this out to me during a visit to U of Minnesota last year, and to Elizabeth Fry and Laura Ziegler for their endorsements of the app.)

And the most common words in the titles:

So let me know what you want to do with your iTunes library. Or what your students have done.  What was frustrating? Impossible? Easier than expected?

More on FitBit data

First the good news:

Your data belongs to you!

And now the bad: It costs you $50/year for your data to truly belong to you.  For a ‘premium’ membership, you can visit your data as often as you choose.  If only Andy had posted sooner, I would have saved $50.  But, dear readers, in order to explore all avenues, I spent the bucks.  And here’s some data (screenshot–I don’t want you analyzing *my* data!)

It’s pretty easy and painless.  Next I’ll try Andy’s advice, and see if I can save $50 next year.