Extreme Fitbit

I have been mourning the loss this week of my FitBit.  No idea where it went.  That’s the problem with small, portable data collection devices.  The very feature that makes them useable makes them lose-able.  Then I came across this possible solution:

http://www.engadget.com/2013/01/21/australian-firefighters-test-data-transmitting-pills/

which raises entirely new questions about edible lines of data collection devices.

 

Your Flowing Data Defended

I had the privilege last week of listening to the dissertation defense of UCLA Stat’s newest PhD: Nathan Yau.  Congratulations, Nathan!

Nathan runs the very popular and fantastic blog Flowing Data, and his dissertation is about, in part, the creation of his app Your Flowing Data.  Essentially, this is a tool for collecting and analyzing personal data–data about you and your life.

One aspect of the thesis I really liked is a description of types of insight he found in a paper by Pousman, Stasko and Mateas (2007): Casual Information Visualization: Depictions of Data in Everyday Life. (IEEE Transactions on Visualization and Computer Graphics, 13(6): 1145-1152.)  Nathan quotes four types of insights:

  • Analytic Insight.  Nathan describes these as ‘traditional’ statistical insights obtained from statistical models.
  • Awareness Insight. “…remaining aware of data streams such as the weather, news…” People are simply aware that these everyday streams exist and so know to seek them for information when needed.
  • Social Insight. Involvement in social networks helps people define a place for themselves in relation to particular social contexts.
  • Reflective Insight.  Viewers take a step back from data and can reflect on something they were perhaps unaware of, or have an emotional reaction.

With respect to my Walk to Venice Beach, I think it would be interesting to see how experiences such as that can be leveraged into insights in these categories.  Although these insights are not hierarchical, it would also be interesting to see how they fit into understandings of statistical thinking and reasoning.  For example, some stats ed researchers are grappling with the role of ‘informal’ vs. ‘formal’ statistical inference, and I see the last three insights as supporting informal inference (when inference is called for at all).

Nathan has lots to say about the role that developers can play in assisting people in gaining insight from data.  Our job, I believe, is to think carefully about the role that educators can play in strengthening these insights.  We spend too much time on the first insight, I think, and not enough time on the others.  But the others are what students will remember and use from their stats class.

Happy Birthday Florence Henderson

As a celebration of Florence Henderson’s 79th birthday (on February 14), I have created this scatterplot to use in my regression course.

ValDay

The plot depicts the relationship between time spent on mathematics homework outside of school (expressed as z-scores) and mathematics achievement scores (expressed as T-scores, M=50, SD=10) for 200 8th-graders taken from the 1988 National Education Longitudinal Study. The color–in a display of very poor data science–is just randomly applied to the observations rather than meaning anything substantial. (Blogger’s Note: I think it fits with the spirit of Valentine’s Day…a gratuitous, yet meaningless, gesture intended to make the receiver feel all gooshy.)

I created the plot using the Valentine package (available here) which applies a Valentine’s Day theme to ggplot. I also applied a picture of Cupid into the background of the plot and used hearts instead of points to plot the observations. Lastly, I changed the default color and fill on the regression smoother to more aptly fit the color scheme.

Below I will explain the how-to of creating this plot.

Reading in the NELS Data

First, I read the NELS data into R. These data and their codebook are available via my regression course website.

nels <- read.csv("http://www.tc.umn.edu/~zief0002/Data/NELS.csv")

Using Hearts Instead of Points

I first needed to find an image of a heart that I liked. For icons of all sorts, I generally use The Noun Project. (This particular heart can be found here.) All of the images at The Noun Project are SVG files. This makes them very useful for display in browsers. To use them in ggplot, I converted the SVG image to a PNG file using the free image manipulation program GIMP. (Perhaps you can use the SVG format directly without converting it, but I have never done that, so I don’t know.)

Using GIMP, I also replaced the black color with the hexadecimal color “ad97c6” by selecting Color > Map > Color Exchange. (Double-clicking the color swatch under “To Color” allows for color entry in hexadecimal.) After this I saved the heart as “heart1.png”, and repeated the process four more times, but using the colors “e58cbc” (heart2.png), “f2935b” (heart3.png), “9fc8b6” (heart4.png), and “eddc74” (heart5.png). (Note: These colors are the same colors I chose for the Valentine’s Day theme color and fill palettes and are based on the colors of the candy hearts you get in elementary school.)

I then used the png package to read the five PNG files into R. (Note: For anyone who doesn’t want to go through the hassle of coloring the hearts and reformatting both them and Cupid, I have made those files available for download here.)

library(png)
h1 <- readPNG("/Users/andrewz/Desktop/heart1.png", TRUE)
h2 <- readPNG("/Users/andrewz/Desktop/heart2.png", TRUE)
h3 <- readPNG("/Users/andrewz/Desktop/heart3.png", TRUE)
h4 <- readPNG("/Users/andrewz/Desktop/heart4.png", TRUE)
h5 <- readPNG("/Users/andrewz/Desktop/heart5.png", TRUE)

To randomly assign each observation to one of the five hearts (h1–h5), I used the sample() function inside the paste() function to concatenate the letter “h” and a random value from 1–5. I then used the do() function from the mosaic package to “do” this 200 times. Lastly, I appended this vector to the nels data frame, in the process, coercing it to characters (to be sure it isn’t appended as a factor–which will be needed later for use in the get() function).

library(mosaic)
heart <- do(200) * paste("h", sample(1:5, size = 1), sep="")
nels$Heart <- as.character(heart[, 1])
head(nels)
#  ID   Homework   Math Heart
#1  1 -0.3329931 42.432    h4
#2  2 -0.2136822 53.698    h2
#3  3 -1.0077991 49.205    h2
#4  4  0.2059000 53.698    h1
#5  5 -0.1177185 55.980    h5
#6  6  0.1413540 65.331    h3

This is all that is needed until we get to actually creating the plot.
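For readers who would rather not load mosaic, the same kind of random assignment can be sketched in base R. This is an alternative under stated assumptions, not the code used for the plot: the data frame nels_demo and the seed are hypothetical stand-ins, there only to make the example self-contained and reproducible.

```r
# Hypothetical stand-in for the nels data frame (200 rows), for illustration only
set.seed(214)  # arbitrary seed so the draw is reproducible
nels_demo <- data.frame(ID = 1:200)

# Draw all 200 labels in one vectorized call; replace = TRUE lets hearts repeat
nels_demo$Heart <- paste0("h", sample(1:5, size = nrow(nels_demo), replace = TRUE))

# paste0() returns a character vector, so no as.character() coercion is needed
table(nels_demo$Heart)
```

Because sample() draws every label at once, there is no need to repeat a single draw 200 times, and the column is character from the start, which is what get() expects later on.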

Adding a Cupid into the Plot’s Background

I obtained the image of Cupid by doing a Google search on “Cupid, Public Domain”. (The actual image I used is available here.) Since the image was a JPEG, I again converted the image to a PNG (this time using Preview) and changed the image size to have a width of 800 pixels. I also used the Instant Alpha tool in Preview’s toolbar to make the white background transparent. As a sidebar, this could all be done in GIMP as well.

I again used the readPNG() function to read the image file into R, but this time setting the native= argument to FALSE. This will represent the image as an array rather than rasterizing it. I chose to do this so that I could make the image more transparent before rasterizing it.

cupid <- readPNG("/Users/andrewz/Desktop/cupid.png", FALSE)
w <- matrix(rgb(cupid[ , , 1], cupid[ , , 2], cupid[ , , 3], cupid[ , , 4] * 0.2), nrow = dim(cupid)[1])
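To see what that last line is doing, here is a minimal sketch on a tiny made-up image. The 2 x 2 array img is hypothetical and simply stands in for the array that readPNG() returns when native = FALSE (rows x columns x RGBA channels, values between 0 and 1).

```r
# Hypothetical 2 x 2 RGBA image with values in [0, 1], standing in for cupid.png:
# every pixel is pure red and fully opaque
img <- array(0, dim = c(2, 2, 4))
img[, , 1] <- 1  # red channel
img[, , 4] <- 1  # alpha channel (1 = opaque)

# Scale alpha to 20% and convert each pixel to a hex color string;
# rgb() returns a plain vector, so matrix() restores the image's row/column layout
faded <- matrix(
  rgb(img[, , 1], img[, , 2], img[, , 3], img[, , 4] * 0.2),
  nrow = dim(img)[1]
)
faded
```

Each entry of faded is "#FF000033": full red, with the last two hex digits (33, i.e. 51 out of 255) encoding the 20% opacity. A character matrix of colors like this is something rasterGrob() can draw directly.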

Creating the Plot

Finally, we are ready to create the plot. In the initial calls to ggplot, I use the rasterGrob() function from the grid package to rasterize Cupid which is placed in the plot using the annotation_custom() layer. The color= and fill= arguments to the geom_smooth() layer set the beautiful magenta/pink color for the regression smoother. The theme_valentine() layer sets ggplot’s theme to use some specialized fonts (see the Valentine package on GitHub). The size of the points in the geom_point() layer is set to 0, to leave a blank canvas for us to add our hearts. This is assigned into the object p.

library(ggplot2)
library(grid)
library(Valentine)
p <- ggplot(data = nels, aes(x = Homework, y = Math)) +
	geom_point(size = 0) +
	annotation_custom(rasterGrob(w)) +
	theme_valentine() +
	geom_smooth(method = "lm", color = "#ec008c", fill = "#ec008c", lwd = 1.5) +
	ggtitle("HAPPY BIRTHDAY FLORENCE HENDERSON")

The hearts (points) are now added to the plot by cycling through a for loop. Each time we are going to rasterize the heart and add it to the plot. The optional arguments in the annotation_custom() layer set the horizontal and vertical position of the hearts in the plot.

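# Add each heart as its own raster annotation, centered on the observation
for (i in seq_len(nrow(nels))) {
    p <- p + annotation_custom(
      rasterGrob(get(nels$Heart[i])),
      xmin = nels$Homework[i] - 0.5, xmax = nels$Homework[i] + 0.5,
      ymin = nels$Math[i] - 0.5, ymax = nels$Math[i] + 0.5
      )
    }

# Display the finished plot
p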

iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible:  accessible statistical software.  “Accessible” in (at least) two senses:  affordable and ready-to-use.  This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland, and is intended for kids to use along with the Census@School data.  Alas, the software is greatly hampered on a Mac, but even there has many features which kids and teachers will appreciate.  Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy entry.  Kids can quickly upload data and see basic boxplots and summary statistics, without much effort. (There are some movies on the homepage to help you get started, but it’s pretty much an intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods.  Below are summaries of FitBit data collected this Fall quarter, and separated into days I taught in a classroom (Lecture==1) and days I did not.  It’s depressingly clear that teaching is good for me.  (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions.  The green horizontal lines are 95% bootstrap confidence intervals for the medians.

stepsfitbitgraph
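For anyone curious about what a bootstrap interval for a median involves, here is a minimal sketch in R. It uses the percentile method, which is one common variant; iNZight may well compute its intervals differently. The step counts are simulated for illustration and are not the FitBit data summarized above.

```r
# Simulated daily step counts (NOT the FitBit data from the post)
set.seed(2013)
steps <- round(rnorm(40, mean = 9000, sd = 2500))

# Resample with replacement many times and record each resample's median
boot_medians <- replicate(
  5000,
  median(sample(steps, size = length(steps), replace = TRUE))
)

# Percentile interval: the middle 95% of the bootstrap medians
ci <- quantile(boot_medians, probs = c(0.025, 0.975))
ci
```

The idea is that the spread of the resampled medians approximates how much the median would vary from sample to sample, which is what a line like the green ones on the plot is meant to convey.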

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, three-d rotating plots, and scatterplot matrices.  PC users can view some cool visualizations that emphasize the variability in re-sampling.

Data Diary Assignment

My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened.  I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’  (We had talked a bit in class about what that meant, and about what devices were storing data.)  They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.

The results were interesting. The vast majority “got” it.  The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”

But those were very few (maybe 2 or 3).  The rest were quite thoughtful.  The sort of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (Facebook postings, Google searches), and classroom activities (clickers, enrollments).  Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted. They pointed out that with just one day’s data, someone could have a pretty clear idea of their social structure.  And by pooling the class’s data or the campus’s data, someone could get a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future.  They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.

Here’s my question for you:  what’s the next step?  Where do we go from here to build on this lesson?  And to what purpose?

A walk in Venice Beach

For various reasons, I decided to walk this weekend from my house to Venice Beach, a distance of about four and a half miles.  The weather was beautiful, and I thought a walk would help clear my mind.  I had recently heard a story on NPR in which it was reported that Thoreau kept data on when certain flowers opened, a record now used to help understand the effects of global warming.  Some of these flowers were as far as 5 miles from Thoreau’s home.  Which made me think that if he could walk 5 miles to collect data, so could I.  Inspired also, perhaps, by the UCLA Mobilize project, I made a decision to take a photo every 5 minutes.  The rule was simple: I would set my phone’s timer for 5 minutes. When it rang, no matter where I was, I would snap a picture.

I decided I would take just one picture, so that I would be forced to exercise some editorial decision making. That way, the data would reflect my own state of mind, in some sense.  Later in the walk, I cheated, because it’s easier to take many pictures than to decide on one.  I also sometimes cheated by taking pictures of things when it wasn’t the right time.  Here’s the last picture I decided to take, at the end of my walk (I took a cab home. I am that lazy) on Abbot Kinney.

mural.

Brick mural, on Abbot Kinney

This exercise brought up a dilemma I often encounter when touristing–do you take intimate, close-up pictures of interesting features, like the above, or do you take pictures of the environment, to give people an idea of the surroundings?  This latter is almost always a bad idea, particularly if all you’ve got is an iPhone 4; it really is difficult to improve on Google Street View.  It is, however, extremely tempting, despite the fact that it leads to pictures like this:

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North

But my subject-matter choices were also limited in other ways.  For one, it was fairly hot, as this temperature plot (http://www.friendlyforecast.com/usa/archive) shows.

temp plot

The heat kept me on the shady side of the street, and the sun meant that I usually had to shoot across the street, although there were some exceptions:

IMG_1345

(The object on the left is what we once called a “pay phone”. It was the only public phone I encountered that day, in fact, which added to the mystery of this storefront, which had a colorful mural but no name or address marker.)

During the walk I stopped at a farmer’s market and at a used book sale at the Mar Vista Library (bought an Everyman’s Library book about Beethoven and the score to Bach’s Cantata #4). I watched toddler-aged girls fight and cry and dance outside a ballet studio, drank a too-expensive cup of coffee at Intelligentsia Coffee (but it was good), and bought my sister, for her birthday, a terrarium at a make-your-own terrarium shop.

Books.

What to do with these data?  One challenge is to see what can be gleaned from the photos.  The only trend that jumped out at me, while reviewing these, was the fact that I was in line at that coffee shop for a very long time, as this series of photos (taken every 5 minutes, remember) attests:

IMG_1369

Closer

Waiting for the hand-pour-brewed coffee to actually be poured

So at the risk of overthinking this post, I’ll just come right to the point (finally):  how do we provide tools to make it easier for people to make sense of these data?

Rather than organize my partial answer in a thoughtful way, and thus spend weeks writing it down, let me just make a list.  I will organize the list, however, by sub-category.

Gathering the Data

  • The iPhone, of course, stores date and time stamps, as well as location stamps, whenever I snap a photo.  It also stores lots of other data, called EXIF data.  I can look at some of these using Preview or iPhoto, but trying to extract the data for my own use is hard.  Does anyone know a way of getting a datafile that has the time, date, and GPS coordinates for my pictures?  (And any other photo meta-data, for that matter.)  I browsed through a discussion on Stack Overflow, and for me the take-home message was “no.” I did find a way to view the data: first, load the iPhone photos into iPhoto. Then export to the hard drive, being sure to check the ‘include location information’ box. Then, open with Preview, open the Inspector (command-i or choose from the drop-down menu), and click on the GPS tab.  From there it is a simple matter of typing everything in, photo by photo, into another file.
  • Weather data is easily found to supplement the story, as the above graph shows.
  • OpenPaths provides free location data, and even stores it for you.  It allows you to export nice csv files, such as this file.

Displaying the Data

  • Well, you can always paste photos and graphs into a long, rambling narrative.
  • iPhoto is apparently one of the few programs that do have access to your EXIF data, and the “Places” feature will, with some playing around, let you show where you’ve been. It’s tedious, and you can’t easily share the results (maybe not at all).  But it does let you click on a location pin and see the picture taken there, which is fun.
  • StatCrunch has a new feature that lets you easily communicate with Google Maps. You provide latitude, longitude and optional other data, and it makes a map.  There are some funny formatting requirements:  data must be in this form:  lat lon|color|other_variable
    Hopefully, StatCrunch will add a feature that lets you easily move from the usual flat-file format for data to this format.  In the meantime, I had to export my StatCrunch OpenPaths data to Excel (could have used R, but I’m rusty with the string commands), and then re-import it as a new data set.
  • Venice Walk Open Paths map on StatCrunch-1
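For the record, the string commands needed for the StatCrunch format are not too bad in R. The sketch below assumes a flat file with one row per location; the data frame walk, its column names, and its values are all made up for illustration, and the only thing taken from StatCrunch is the lat lon|color|other_variable layout described above.

```r
# Hypothetical flat-file rows, e.g. from an OpenPaths CSV export;
# the column names and values here are made up for illustration
walk <- data.frame(
  lat   = c(34.0195, 33.9983),
  lon   = c(-118.4108, -118.4721),
  color = c("red", "blue"),
  note  = c("start", "coffee"),
  stringsAsFactors = FALSE
)

# Build "lat lon|color|other_variable": a space joins the coordinates,
# and a vertical bar separates the remaining fields
walk$statcrunch <- paste(paste(walk$lat, walk$lon), walk$color, walk$note, sep = "|")
walk$statcrunch
```

The inner paste() uses its default separator (a space) for the coordinates, and the outer one glues on the remaining fields with sep = "|". The new column could then be written out with write.csv() and loaded as a new data set.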

Making Sense of It All

But the true challenge is how do we make sense of it all?  How do we merge these data in such a way that unexpected patterns, and the deeper truths behind them, can emerge? At the very least, is there a single, comprehensive data display that would allow you to more fully appreciate my experience?  If (and when) I do this walk again, how can I compare the data from the two different walks?

Some other themes:  our data should be ours to do with as we please. OpenPaths has it right; iPhone has it wrong wrong wrong.  Another theme: maps are now a natural and familiar way of storing and displaying data.  StatCrunch has taken some steps in the right direction in attempting to provide a smooth pathway between data and map, but more is needed.  Perhaps there’s a friendly, flexible, open-source mapping tool out there somewhere that would encourage our data-conscious citizens to share their lives through maps?

If you’re still reading, you can view all of the pictures on Flickr.

Miscellany that I have Read and been Thinking about this Last Week

I read a piece last night called 5 Ways Big Data Will Change Lives In 2013. I really wasn’t expecting much from it, just scrolling through accumulated articles on Zite. However, as with so many things, there were some gems to be had. I learned of Aadhar.

Aadhar is an ambitious government Big Data project aimed at becoming the world’s largest biometric database by 2014, with a goal of capturing about 600 million Indian identities…[which] could help India’s government and businesses deliver more efficient public services and facilitate direct cash transfers to some of the world’s poorest people — while saving billions of dollars each year.

The part that made me sit up and take notice was this line, “India’s Aadhar collects sensitive information, such as fingerprints and retinal scans. Yet people volunteer because the potential incentives can make the data privacy and security pitfalls look miniscule — especially if you’re impoverished.”

I have been reading and hearing about concerns of data privacy for quite a while, yet nobody that I have been reading (or listening to) has once suggested what the circumstances are that would have citizens forgo all sense of privacy. Poverty, especially extreme poverty, is one of those circumstances. As a humanist, I am all for delivering resources in the most efficient ways possible, which inevitably involve technology. But, as a Citizen Statistician, I am all too aware of how a huge database of biometric data could be used (or mis-used as it were). It especially concerns me that our impoverished citizens, who are more likely to be in the database, will be more at risk of being taken advantage of.

A second headline that caught my eye was France Looks At Possibility Of Taxing Internet Companies For Data Mining. France is pointing out that companies such as Google and Facebook are making enormous sums of money by mining and using citizens’ personal information, so why shouldn’t that be seen as a taxable asset? While this is a reasonable question, the article also points out that one potential consequence of such taxation is that the “free” model (at least monetarily) that these companies currently use might cease to exist.

Related to both of these articles, I also read a blog post about a seminar being offered in the Computer Science department at the University of Utah entitled Accountability in Data Mining. The professor of the course wrote in the post,

I’m a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be “in scope”. I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it’s a 1-credit seminar that meets once a week, I obviously can’t cover everything I’d like, but I’d like to flesh out the readings with related work that people can peruse later.

It is about time some university offered such a course. I think this will be ultimately useful (and probably should be required) content to include in every statistics course taught. In making decisions using data, who is accountable for those decisions, and the consequences thereof?

Lastly, I would be remiss not to include a link to what might be the article I resonated with most: It’s not 1989. The author points out that the excuse “I’m not good with computers” is not acceptable any longer, especially for educators. He makes a case for a minimum level of technological competency that teachers should have in this day and age. I especially agree with the last point,

Every teachers must have a willingness to continue to learn! Technology is ever evolving, and excellent teachers must be life-long learners. (Particularly in the realm of technology!)

The lack of ability with computers that I see on a day-to-day basis in several students and faculty (even the base-level literacy that the author wants) is frightening and saddening at the same time. I would love to see colleges and universities give all incoming students a computer literacy test at the same time as they take their math placement test. If you can’t copy-and-paste you should be sent to a remedial course to obtain the skills you need to acquire before taking any courses at the institution.

Clueful

Since posting last month about data-sharing concerns with some popular apps, I’ve learned about Cluefulapp.com which, apparently, helps us see how our data are used by iOS apps. For instance, according to Cluefulapp, Google Maps can read my address book, uses my iPhone’s unique ID, encrypts stored data, “could” track my location, and uses an anonymous identifier.

Waze is somewhat similar.  It “could” track my location [the quotes are because I wonder what they mean by could; does it?], connects to Twitter and Facebook, can read my address book (but does it?), and uses an anonymous identifier.

Still, I wonder if it goes far enough.  Google Maps seems relatively safe, until you think about what might happen if a third-party could merge this data with another app and learn more.  Say a restaurant owner sees several anonymous identifiers at his restaurant.  A look at Facebook reveals a small number of people who ‘checked in’ at that restaurant—perhaps their identifiers are among them?  Some of those people checked in at other places after the restaurant, and sure enough, the same identifier appears elsewhere.  Now the restaurant knows who the person is.

I’m not sure whether this is a likely scenario, but it seems the next step is a device that puts together a profile of how you might appear in public, if someone merged the data from all of the apps used in a day.  Or perhaps this already exists?

Introducing Statistics: A Graphic Guide

Source: introducingbooks.com

Over the winter break I was travelling in the UK and I came across this little book called “Introducing Statistics: A Graphic Guide” by Eileen Magnello and Borin Van Loon at the gift shop in the Tate Modern museum in London. The book was published in 2009, and Significance magazine already reviewed it here, so I won’t repeat their comments. I hadn’t heard about the book before, so I picked it up, along with a copy of Introducing Post-Modernism (they were 2 for £10, I had to get two, obviously).

I think the book would be more appropriately named “an illustrated guide”, since the images are mostly illustrations of statisticians with speech bubbles instead of graphics that help visualize the concepts being discussed. The most unexpected are the images of the author herself. The first time I came across one of those I was thinking “who is this lady in the pant-suit standing next to Karl Pearson?”. Needless to say, the illustrations sometimes distract from the text, but they’re fun and nicely drawn.

The book does a very good job of describing the differences between vital statistics and mathematical statistics, and what the terms “statistic” and “variability” mean. Therefore, while the audience of the book is not clear, it could be a perfect gift for parents of statisticians who still don’t quite understand what their offspring do. Or really anyone who is interested in statistics, but has no real formal experience with it.

While the book tells the early history of statistics well, the introduction of statistical concepts follows a strange order. It is useful for gaining familiarity with some terminology and simple statistical distributions and tests, but it would be quite difficult to acquire a thorough understanding of these concepts from the book’s introduction. However, I’m guessing this is not the intent of the book, anyway.

The book is part of a series called Introducing Books, which contains about 80 graphical guides from Introducing Aesthetics to Marxism to Wittgenstein. The museum shop where I got the book carried only about 10 of these titles, and I was happy to see that Introducing Statistics was one of them.