participatory sensing

The Mobilize project, which I recently joined, centers a high school data-science curriculum around participatory sensing data.  What is participatory sensing, you ask?

I’ve recently been trying to answer this question, with mixed success.  As the name suggests, PS data are collected from sensors, and so they have a streaming aspect.  I like to think of them as observations on a living object.  Like all living objects, whatever is being observed changes, sometimes slowly, sometimes rapidly.  The ‘participatory’ means that it takes more than one person to measure it.  (But I’m wondering if you would allow ‘participatory’ to mean that the student participates in her own measurements/collection?)  Initially, in Mobilize, PS meant using specially equipped smart-phones as sensors.  Students could snap pictures of snack-wrappers, or record their mood at a given moment, or record the mood of their snack food.  A problem with relying on phones is that, as it turns out, teenagers aren’t always that good with expensive equipment.  And there’s an equity issue, because what some people consider a common household item, others consider rare and precious.  And smart-phones, although growing in prevalence, are still not universally adopted by high school students, or even college students.

If we ditch the gadgetry, any human being can serve as a sensor.  Asking a student to pause at a certain time of day to record, say, the noise level, or the temperature, or their frame of mind, or their level of hunger, is asking that student to be  a sensor.  If we can teach the student how to find something in the accumulated data about her life that she didn’t know, and something that she finds useful, then she’s more likely to develop what I heard Bill Finzer call a “data habit of mind”.  She’ll turn to data next time she has a question or problem, too.

Nothing in this process is trivial.  Recording data on paper is one thing, but recording it in a data file requires teaching students about flat-files (which, again something I’ve learned from Bill, are not necessarily intuitive), teaching them about delimiters between variables, and teaching them, basically, how to share so that someone else can upload and use their data.  Many of my intro-stats college students don’t know how to upload a data file into the computer, so I now teach it explicitly, with high, but not perfect, rates of success.  And that’s the easy part.  How do we help them learn something of value about themselves or their world?
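
To make the flat-file idea concrete, here is a minimal sketch in Python.  The column names and values are invented for illustration and are not Mobilize data.  Each row is one observation, each column is one variable, and a comma serves as the delimiter between variables.

```python
import csv
import io

# Hypothetical flat file: one row per observation, one column per variable,
# with a comma as the delimiter between variables.
FLAT_FILE = """student,date,time,noise_level,hunger
ana,2013-03-04,08:15,3,2
ana,2013-03-04,12:30,5,4
ben,2013-03-04,08:20,2,1
"""

def read_flat_file(text, delimiter=","):
    """Return the rows of a delimited flat file as a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

rows = read_flat_file(FLAT_FILE)
for row in rows:
    print(row["student"], row["time"], row["noise_level"])
```

Changing the `delimiter` argument (to a tab, say) is all it takes to read a differently delimited file, which is one way to show students that the delimiter is a convention rather than part of the data.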

I’m open to suggestions here. Please.  One step seems to be to point them towards a larger context in which to make sense of their data.  This larger context could be a social network, or a community, or larger datasets collected on large populations.  And so students might need to learn how to compare their (rather paltry by comparison) data stream to a large national database (which will be more of a snapshot/panel approach, rather than a data-stream).  Or they will need to learn to merge their data with their classmates’ data, and learn about looking for signals among variation, and comparing groups.

This is scary stuff.  Traditionally, we teach students how to make sense of *our* data.  And this is less scary because we’ve already made sense of the data and we know how to point the student towards making the “right” conclusions.  But these PS data have not before been analyzed.  Even if we the teachers may have seen similar data, we have not seen these data.  The student is really and truly functioning as a researcher, and the teacher doesn’t know the conclusion.  What’s more disorienting, the teacher doesn’t have control of the method.  Traditionally, when we talk about ‘shape’ of a distribution, we trot out data sets that show the shapes we want the students to see.  But if the students are gathering their own data, is the shape of a distribution necessarily useful? (It gets scarier at a meta-level: many teachers are novice statisticians, and so how do we teach the teachers to be prepared to react to novel data?)

So I’ll sign off with some questions.  Suppose my classroom collects data on how many hours they sleep a night for, say, one month. We create a data file to include each student’s data.  Students do not know any statistics–this is their first data experience.  What is the first thing we should show them?  A distribution? Of what? What concepts do students bring to the table that will help them make sense of longitudinal data?  If we don’t start with distributions, should we start with an average curve? With an overlay of multiple time-series plots (“spaghetti plots”)?  And what’s the lesson, or should be the lesson, in examining such plots?
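
For readers who want to experiment with these questions, here is a small sketch of the two summaries mentioned above, using only the Python standard library.  The students, nights, and hours are all hypothetical.  `per_student_series` produces the strands a spaghetti plot would draw, and `average_curve` gives the mean across students for each night.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (student, night, hours_slept).
records = [
    ("ana", 1, 7.5), ("ana", 2, 6.0), ("ana", 3, 8.0),
    ("ben", 1, 6.5), ("ben", 2, 7.0), ("ben", 3, 5.5),
    ("cy", 1, 9.0), ("cy", 2, 8.0), ("cy", 3, 7.5),
]

def per_student_series(recs):
    """Group records into one time series per student (the 'spaghetti')."""
    series = defaultdict(list)
    for student, night, hours in sorted(recs, key=lambda r: (r[0], r[1])):
        series[student].append((night, hours))
    return dict(series)

def average_curve(recs):
    """Mean hours slept across students, night by night."""
    by_night = defaultdict(list)
    for _, night, hours in recs:
        by_night[night].append(hours)
    return {night: mean(vals) for night, vals in sorted(by_night.items())}

spaghetti = per_student_series(records)
curve = average_curve(records)
print(spaghetti["ben"])
print(curve)
```

Even in this toy example the average curve hides the fact that the individual strands cross, which may itself be a lesson worth drawing out.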

Thursday Next

From Jasper Fforde’s latest Thursday Next novel (The Woman Who Died a Lot):

The Office for Ultimate Risk is one of the many departments within the Ministry of National Statistics. Although it was originally an “experimental” department, the statisticians at Ultimate Risk proved their worth by predicting the entire results of three football World Cups in succession, a finding that led to the discontinuation of football as a game and the results being calculated instead.

Prediction Accuracy of the NCAA Bracket: Results

 

In  Emanuel Derman’s book Models. Behaving. Badly, the author lays out a Modeler’s Hippocratic Oath.

  • I will remember that I didn’t make the world, and it doesn’t satisfy my equations.
  • Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
  •  I will never sacrifice reality for elegance without explaining why I have done so.
  • Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
  • I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.

Since I have no desire to instill false comfort with the non-replicable, fuzzy mascot/alphabetical order model that I used to predict the NCAA tournament, I report my results after two days in.

  • Overall the model has correctly predicted 19 of the 32 games. This is not any better than chance (p = .108, one-sided).
    • Conditionally, the model performed best in the East Regional (6/8, p = .035, one-sided). It was worst in the West and South Regionals (4/8 in both, p-value not reported due to complete stupidity of the model.). The performance in the Midwest Regional, like so many things Midwest, was so-so (5/8, p = .145, one-sided).
  • The model has not, as yet, “busted” my bracket.
    • I still have 11 of my predicted Sweet Sixteen teams alive in my bracket.
    • I still have 7 of my predicted Elite Eight teams alive in my bracket.
    • I still have 4 of my predicted Final Four teams alive in my bracket.
  • The president also has 19 out of 32 correct predictions in his bracket. Thirteen of his Sweet Sixteen, 7 of his Elite Eight, and 4 of his Final Four predictions are still alive.
  • According to my prediction about the Minnesota/UCLA game being a good matchup…it pretty much was. Both teams played terribly. It was so close that neither team scored a field goal in the first five minutes of the game.
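
The post does not say how the p-values in the list above were computed.  As a sketch only, assuming each game is a fair coin flip under the null (success probability 0.5, my assumption), an exact binomial tail can be computed as below.  Under that assumption the reported values (.108, .035, .145) match the strict upper tail P(X > k).  The inclusive tail P(X ≥ k), which is the more usual one-sided p-value, comes out larger.

```python
from math import comb

def tail_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the inclusive upper tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def tail_greater(k, n, p=0.5):
    """P(X > k) for X ~ Binomial(n, p): the strict upper tail."""
    return tail_at_least(k + 1, n, p)

# (correct picks, games played), taken from the list above.
results = {"Overall": (19, 32), "East": (6, 8), "Midwest": (5, 8)}

for label, (k, n) in results.items():
    print(f"{label}: P(X > {k}) = {tail_greater(k, n):.3f}, "
          f"P(X >= {k}) = {tail_at_least(k, n):.3f}")
```

Either way, the conclusion in the post stands: the overall record is not distinguishable from chance.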

To recover from not only watching this game, but also from the mundanity of this blog post, I offer you comfort in the visualization of Nate Silver’s bracket of predictions.

Screen Shot 2013-03-23 at 8.32.24 AM

This Day in Statistics

I was looking to find an add-on Google Calendar that included important days in the history of statistics. They have one for seemingly everything under the sun, except this. So I created one and made it public in honor of the International Year of Statistics. I will continually add to it as I find time.

Feel free to add it. As always, it is available in the following formats

Want me to add an important birthday? Add the info into the comments section. Want to be an author on the calendar so you can add all 100 statisticians that you know I forgot? Send me an email and I will add you on.

Wall Street Journal

Carl Bialik of the Wall Street Journal has a nice article about the growth of statistics.  The print version differs substantially from the online version in content, though not in message.

Missing from this message is the urgent need to have more teachers, at all levels, trained in statistics.  I’m currently at a meeting of the joint committee on education of the American Statistical Association and the National Council of Teachers of Mathematics.  A recurring theme is that K-12 stats education has arrived, but professional development lags far behind.  Agreed that our future needs statisticians.  But we also need teachers who know statistics, 3rd graders who know statistics, unskilled workers who know statistics, and, well, everyone needs some statistics.  And by “statistics”, we do not mean the ability to plug numbers into memorized formulas.  Instead, we mean “reasoning with data.”

Dear Gmail…

I recently added a free application/service that analyzes my email called Gmail Meter. This service sends me a comprehensive weekly report full of summaries and plots that indicate how I use Gmail.

The first thing I learned is that Wednesdays are for emailing and I seem to respond in a timely manner, on average, to emails sent to me…when I actually respond (I have a 24.58% response rate. Yikes!) Wednesdays I only teach one class (at 4:40pm) this semester, but I have a morning meeting so I am on campus and generally have time to respond to emails that I may not have gotten to.

Summary of my Gmail

The plot of my daily email traffic shows that most email is sent to me during the day (typical work hours), while my email times tend to be prior to classes in the morning and after my evening courses. Also, it is clear I am sending far less than I receive. It appears I am doing my part to lower my email footprint!

I seem to be more prompt in my email responses (for the most part) than others who respond to me. What is interesting is that people who respond to me are primarily either very quick (<4hrs) or take more than a day to get back to me. This fits with the behavior I expect from most academics.

In the emails I send, I tend to be terse. Generally, I try to avoid sending long emails since, when long emails are sent to me, I tend to get cranky. (I recognize that sometimes it can’t be avoided.) I actually am quite pleased that the mode here is less than 10 words. (Again, yay for my footprint!)

I am not quite as happy to see that the mode for emails sent to me is the category indicating more than 200 words. Some of this is because of the university committees I sit on. For example, the University of Minnesota Senate sends many emails. These emails often are lengthy because of the inclusion of bylaws and articles to the University Constitution that we will be voting on. That being said, I agree with this email charter which begs us all to keep it short.

What kind of media attachments are taking up space in my Gmail box? It seems that most are Microsoft Word documents. Again, given my collaboration with other academics and feedback to students this makes sense to me. Since I have a Mac and most of my colleagues still work on PC, I send many documents as PDF files. My guess is that if this were sent to me a few years ago, the number of attachments would have been even higher. Our research group has slowly worked toward using sites like Dropbox to share documents. (Next stop…some versioning system.)

Now for the plot that made me stop and write this post. Almost 90% of the email I received this week hit the trash can. Also a small percentage is still in my inbox. I am trying to achieve Inbox Zero, but just haven’t made it yet. I am currently down to xxx emails in my inbox. I signed up for the Mailbox app which should help with this goal when I check email on my phone, but like the Tempo app that Rob signed up for, there is a reservation system in place. Unlike Rob, my spot in the Mailbox line is nowhere near the bottom (last I looked 632,889 people in front of me) despite having reserved my place in line several weeks ago.

I also receive information on the week’s top emailers to me (Joan) and the top recipients of my mail (one of my students); top conversation threads; and a scatterplot of the number of words per email in a thread versus the rank of the email in the thread (was it the 1st email sent, 2nd, etc.). As one might expect there is a strong, negative relationship here. It also produces a word cloud based on the subjects and bodies of all messages sent or received directly. Lastly, it conditions emails received with attachments on whether they came from inside or outside the organization (University of Minnesota).

It is not clear that you can obtain the raw data, although it is not clear that you can’t either. There are of course ways to obtain the meta-data that Gmail Meter is using by scraping it with a language such as Python (see here). My guess is that you could also do this with R (perhaps using the curl and XML packages). They have several feature requests for making Gmail Meter more customizable which would make it even cooler.
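
As a rough illustration of what working with the meta-data might look like once the messages are in hand, here is a sketch using Python’s standard `email` module. It is not Gmail Meter’s method, and it does not fetch anything from Gmail: the three messages are hypothetical and built in the code itself. The word-count bins are also my own assumption; only the “less than 10” and “more than 200” categories appear in the report described above.

```python
from collections import Counter
from email import message_from_string

def make_raw(sender, subject, body):
    """Build a tiny raw message (hypothetical data, for illustration only)."""
    return f"From: {sender}\nTo: me@example.edu\nSubject: {subject}\n\n{body}"

RAW_MESSAGES = [
    make_raw("joan@example.edu", "Agenda", "See you at noon."),
    make_raw("joan@example.edu", "Draft", "Comments attached. " * 20),
    make_raw("senate@example.edu", "Bylaws", "whereas " * 250),
]

def word_count_bin(n_words):
    """Assign a message to an (assumed) word-count category."""
    if n_words < 10:
        return "<10"
    if n_words <= 200:
        return "10-200"
    return ">200"

def summarize(raw_messages):
    """Count messages per sender and per word-count category."""
    senders, bins = Counter(), Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders[msg["From"]] += 1
        bins[word_count_bin(len(msg.get_payload().split()))] += 1
    return senders, bins

senders, bins = summarize(RAW_MESSAGES)
print(senders.most_common(1))
print(dict(bins))
```

Real mail would need more care (multipart bodies, quoted replies, encodings), so treat this as a starting point rather than a recipe.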

Waiting for Tempo

I got pretty excited about a new calendar app, in part because I love these productivity tools and in part because I really hate the calendar that comes with the iPhone.  Tempo, as it is called, seemed nifty because it integrates data on your phone into the calendar so, for instance, you can get directions to your next meeting easily, alert people that you’re going to be late, have documents related to your appointment automatically opened, and other features that will either save lots of time and hassles or themselves become time-sinks and hassles.

But I need not have worried.  When I downloaded the app, I was informed that there is a bit of a queue for activating accounts.  I was currently 32,310th in line, with 2 people behind me.  So how much longer would I wait?

I logged on from time to time to assess my place in line.  The data are here: tempowaitlist.

I’m in Alexandria, VA, in town for a meeting of an ASA/NCTM education committee.  Although it’s late here, it still feels early to us PST-ers, and so I thought I’d take a look at the data.  I’m almost to the point where half of the line is ahead of me, although mostly this is due to so many people lining up behind me.  (17,505 have queued up behind me; surely a sign that I’ve joined the right app?)  This graph shows my progress so far, in terms of how many are ahead of me.  You can, if boredom strikes, estimate how many more seconds before I arrive at the front of the line.  I’m somewhat encouraged by a hint of a sudden acceleration yesterday, but my excitement is no doubt reading too much into noise.

When I get there, I’ll let you know if it was worth the wait.

Number of People Ahead of me in line, as of Feb 28.
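
If boredom does strike, one simple way to make the estimate is to fit a straight line to the number of people ahead and see where it crosses zero.  The sketch below does this with ordinary least squares.  The observations in it are hypothetical placeholders (only the starting position of 32,310 comes from the post), so substitute the real waitlist data before trusting any number it prints.

```python
# Hypothetical observations: (hours since joining, people ahead in line).
# Only the first value, 32,310, is taken from the post.
observations = [(0, 32310), (24, 32000), (48, 31650), (72, 31400), (96, 31050)]

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_until_front(points):
    """Hours after the last observation until the fitted line reaches zero.

    Returns None if the line is not getting shorter.
    """
    slope, intercept = fit_line(points)
    if slope >= 0:
        return None
    return -intercept / slope - points[-1][0]

remaining = hours_until_front(observations)
if remaining is not None:
    print(f"About {remaining * 3600:,.0f} more seconds, if the trend is linear")
```

A straight line ignores exactly the sort of sudden acceleration mentioned above, so this is a baseline at best.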


Extreme Fitbit

I have been mourning the loss this week of my FitBit.  No idea where it went.  That’s the problem with small, portable data collection devices.  The very feature that makes them useable makes them lose-able.  Then I came across this possible solution

http://www.engadget.com/2013/01/21/australian-firefighters-test-data-transmitting-pills/

which raises entirely new questions about edible lines of data collection devices.

 

Data Diary Assignment

My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened.  I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’  (We had talked a bit in class about what that meant, and about what devices were storing data.)  They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.

The results were interesting. The vast majority “got” it.  The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”

But those were very few (maybe 2 or 3).  The rest were quite thoughtful.  The sort of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (facebook postings, google searches), and classroom activities (clickers, enrollments).  Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted.  They pointed out that with just one day’s data, someone could have a pretty clear idea of their social structure.  And by pooling the class’s data or the campus’s data, someone could get a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future.  They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.

Here’s my question for you:  what’s the next step?  Where do we go from here to build on this lesson?  And to what purpose?

A walk in Venice Beach

For various reasons, I decided to walk this weekend from my house to Venice Beach, a distance of about four and a half miles.  The weather was beautiful, and I thought a walk would help clear my mind.  I had recently heard a story on NPR in which it was reported that Thoreau kept data on when certain flowers opened, a record now used to help understand the effects of global warming.  Some of these flowers were as far as 5 miles from Thoreau’s home.  Which made me think, that if he could walk 5 miles to collect data, so could I.  Inspired also, perhaps, by the UCLA Mobilize project, I made a decision to take a photo every 5 minutes.  The rule was simple: I would set my phone’s timer for 5 minutes. When it rang, no matter where I was, I would snap a picture.

I decided I would take just one picture, so that I would be forced to exercise some editorial decision making. That way, the data would reflect my own state of mind, in some sense.  Later in the walk, I cheated, because it’s easier to take many pictures than to decide on one.  I also sometimes cheated by taking pictures of things when it wasn’t the right time.  Here’s the last picture I decided to take, at the end of my walk (I took a cab home. I am that lazy) on Abbot Kinney.


Brick mural, on Abbot Kinney

This exercise brought up a dilemma I often encounter when touristing–do you take intimate, close-up pictures of interesting features, like the above, or do you take pictures of the environment, to give people an idea of the surroundings?  This latter is almost always a bad idea, particularly if all you’ve got is an iPhone 4; it really is difficult to improve on Google Street View.  It is, however, extremely tempting, despite the fact that it leads to pictures like this:

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North


But my subject-matter choices were also limited in other ways.  For one, it was fairly hot, as this temperature plot (http://www.friendlyforecast.com/usa/archive) shows.

temp plot

The heat kept me on the shady side of the street, and the sun meant that I usually had to shoot across the street, although there were some exceptions:

(The object on the left is what we once called a “pay phone”. It was the only public phone I encountered that day, in fact, which added to the mystery of this storefront, which had a colorful mural but no name or address marker.)

During the walk I stopped at a farmer’s market and at a used book sale at the Mar Vista Library (bought an Everyman’s Library book about Beethoven and the score to Bach’s Cantata #4). I watched toddler-aged girls fight and cry and dance outside a ballet studio, drank a too-expensive cup of coffee at Intelligentsia Coffee (but it was good), and bought my sister, for her birthday, a terrarium at a make-your-own terrarium shop.

Books.


What to do with these data?  One challenge is to see what can be gleaned from the photos.  The only trend that jumped out at me, while reviewing these, was the fact that I was in line at that coffee shop for a very long time, as this series of photos (taken every 5 minutes, remember) attests:


Closer


waiting for the hand-pour-brewed coffee to actually be poured


So at the risk of overthinking this post, I’ll just come right to the point (finally):  how do we provide tools to make it easier for people to make sense of these data?

Rather than organize my partial answer in a thoughtful way, and thus spend weeks writing it down, let me just make a list.  I will organize the list, however, by sub-category.

Gathering the Data

  • The iPhone, of course, stores date and time stamps, as well as location stamps, whenever I snap a photo.  And lots of other data, called EXIF data.  I can look at some of these using Preview or iPhoto, but trying to extract the data for my own use is hard.  Does anyone know a way of getting a datafile that has the time, date, and GPS coordinates for my pictures?  (And any other photo meta-data, for that matter.)  I browsed through a discussion on stackoverflow, and for me the take-home message was “no.”  I did find a way to view the data: first, load the iPhone photos into iPhoto.  Then export to hard drive, being sure to check the ‘include location information’ box.  Then, open with Preview, open the Inspector (command-i or choose from drop-down menu), and click on the GPS tab.  From there it is a simple matter of typing everything in, photo by photo, into another file.
  • Weather data is easily found to supplement the story, as the above graph shows.
  • OpenPaths provides free location data, and even stores it for you.  It allows you to export nice csv files, such as this file.
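
On the question in the first bullet above, one possible route is to script the "typing everything in" step.  The sketch below covers only the last part: once a photo’s time stamp and GPS values have been read, it converts them to a flat CSV file.  EXIF stores each coordinate as degrees, minutes, and seconds plus a hemisphere letter, so the conversion to signed decimal degrees is the piece worth automating.  Reading the tags from the image files themselves would need an image library or a command-line tool, which I have not shown.  The file names and values here are hypothetical.

```python
import csv
import io

def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds plus a hemisphere
    reference ('N', 'S', 'E', 'W') to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

# Hypothetical metadata, as if already read from two photos.
photos = [
    {"file": "photo_001.jpg", "taken": "2013:02:16 11:05:00",
     "lat": (34, 0, 36.0, "N"), "lon": (118, 25, 48.0, "W")},
    {"file": "photo_002.jpg", "taken": "2013:02:16 11:10:00",
     "lat": (34, 0, 18.0, "N"), "lon": (118, 26, 24.0, "W")},
]

def to_csv(photo_records):
    """Write file name, time stamp, and decimal coordinates as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["file", "taken", "lat", "lon"])
    for p in photo_records:
        writer.writerow([
            p["file"],
            p["taken"],
            round(dms_to_decimal(*p["lat"]), 6),
            round(dms_to_decimal(*p["lon"]), 6),
        ])
    return out.getvalue()

csv_text = to_csv(photos)
print(csv_text)
```

The result is an ordinary flat file with one row per photo, which can then be loaded into any statistics package.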

Displaying the Data

  • Well, you can always paste photos and graphs into a long, rambling narrative.
  • iPhoto is apparently one of the few programs that does have access to your EXIF data, and the “Places” feature will, with some playing around, let you show where you’ve been. It’s tedious, and you can’t easily share the results (maybe not at all).  But it does let you click on a location pin and see the picture taken there, which is fun.
  • StatCrunch has a new feature that lets you easily communicate with google maps. You provide latitude, longitude and optional other data, and it makes a map.  It has some funny formatting requirements:  data must be in this form  lat lon|color|other_variable
    Hopefully, StatCrunch will add a feature that lets you easily move from the usual flat-file format for data to this format.  In the meantime, I had to export my StatCrunch OpenPaths data to Excel (could have used R, but I’m rusty with the string commands), and then re-import as a new data set.
  • Venice Walk OpenPaths map on StatCrunch
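
The move from a flat file to the lat lon|color|other_variable form described above is a short string-building job.  Here is a sketch in Python rather than Excel or R.  The column names (`lat`, `lon`, `date`) and the sample rows are assumptions for illustration, as is the choice of a single fixed color; check them against your own export before using this.

```python
import csv
import io

# Hypothetical flat file; the column names are assumed, not taken from
# an actual OpenPaths export.
FLAT_CSV = """lat,lon,date
34.0100,-118.4300,2013-02-16 11:05:00
33.9900,-118.4600,2013-02-16 12:40:00
"""

def to_map_strings(csv_text, color="red", other_column="date"):
    """Build one 'lat lon|color|other_variable' string per row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [f"{r['lat']} {r['lon']}|{color}|{r[other_column]}" for r in rows]

map_strings = to_map_strings(FLAT_CSV)
for line in map_strings:
    print(line)
```

Each output line can then be pasted or imported as a single column in the new data set.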

Making Sense of It All

But the true challenge is how do we make sense of it all?  How do we merge these data so that unexpected patterns, and the deeper truths behind them, can emerge?  At the very least, is there a single, comprehensive data display that would allow you to more fully appreciate my experience?  If (and when) I do this walk again, how can I compare the data from the two different walks?

Some other themes:  our data should be ours to do with as we please. OpenPaths has it right; iPhone has it wrong wrong wrong.  Another theme: maps are now a natural and familiar way of storing and displaying data.  StatCrunch has taken some steps in the right direction in attempting to provide a smooth pathway between data and map, but more is needed.  Perhaps there’s a friendly, flexible, open-source mapping tool out there somewhere that would encourage our data-conscious citizens to share their lives through maps?

If you’re still reading, you can view all of the pictures on Flickr.