Data Diary Assignment

My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened.  I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’  (We had talked a bit in class about what that meant, and about what devices were storing data.)  They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.

The results were interesting. The vast majority “got” it.  The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”

But those were very few (maybe 2 or 3).  The rest were quite thoughtful.  The sort of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (facebook postings, google searches), and classroom activities (clickers, enrollments).  Many of the students were  to my reckoning, sophisticated, about the sort of portrait that could be painted. They pointed out that with just one day’s data, someone could have a pretty clear idea of their social structure.  And by pooling the classes data or the campus’s data, a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future.  They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.

Here’s my question for you:  what’s the next step?  Where do we go from here to build on this lesson?  And to what purpose?

A walk in Venice Beach

For various reasons, I decided to walk this weekend from my house to Venice Beach, a distance of about four and a half miles.  The weather was beautiful, and I thought a walk would help clear my mind.  I had recently heard a story on NPR in which it was reported that Thoreau kept data on when certain flowers opened, a record now used to help understand the effects of global warming.  Some of these flowers were as far as 5 miles from Thoreau’s home.  Which made me think, that if he could walk 5 miles to collect data, so could I.  Inspired also, perhaps, by the UCLA Mobilize project, I made a decision to take a photo every 5 minutes.  The rule was simple: I would set my phone’s timer for 5 minutes. When it rang, no matter where I was, I would snap a picture.

I decided I would take just one picture, so that I would be forced to exercise some editorial decision making. That way, the data would reflect my own state of mind, in some sense.  Later in the walk, I cheated, because it’s easier to take many pictures than to decide on one.  I also sometimes cheated by taking pictures of things when it wasn’t the right time.  Here’s the last picture I decided to take, at the end of my walk (I took a cab home. I am that lazy) on Abbot Kinney.


Brick mural, on Abbot Kinney

This exercise brought up a dilemma I often encounter when touristing–do you take intimate, close-up pictures of interesting features, like the above, or do you take pictures of the environment, to give people an idea of the surroundings?  This latter is almost always a bad idea, particularly if all you’ve got is an iPhone 4; it really is difficult to improve on Google Street View.  It is, however, extremely tempting, despite the fact that it leads to pictures like this:

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North

But my subject-matter choices were also limited in other ways.  For one, it was fairly hot, as this temperature plot ( shows.

temp plot

The heat kept me on the shady side of the street, and the sun meant that I usually had to shoot across the street, although there were some exceptions:

IMG_1345(The object on the left is what we once called a “pay phone”. The only public phone I encountered that day, in fact, which added to the mystery of this storefront which had a colorful mural, but no name or address marker.)

During the walk I stopped at a farmer’s market and at a used book sale at the Mar Vista Library (bought an Everyman’s Library book about Beethoven and the score to Bach’s Cantata #4.) I watched toddler-aged girls fight and cry and dance outside a ballet studio, drank a too-expensive cup of coffee at Intelligentia coffee (but it was good), and bought my sister, for her birthday,  a terrarium at a make-your-own terrarium shop.



What to do with these data?  One challenge is to see what can be gleaned  from the photos.  The only trend that jumped out at me, while reviewing these, was the fact that I was in line at that coffee shop for a very long time, as this series of photos (taken every 5 minutes, remember), attest:




waiting for the hand-pour-briewed coffee to actually be poured

waiting for the hand-pour-briewed coffee to actually be poured

So at the risk of overthinking this post, I’ll just come right to the point (finally):  how do we provide tools to make it easier for people to make sense of these data?

Rather than organize my partial answer in a thoughtful way, and thus spend weeks writing it down, let me just make a list.  I will organize the list, however, by sub-category.

Gathering the Data

  • The iPhone, of course, stores date and time stamps, as well as location stamps, whenever I snapped a photo.  And lots of other data, called exif data.  I can look at some of these using Preview or iPhoto,  but trying to extract the data for my own use is hard.  Does anyone know a way of getting a datafile that has the time, date, GPS coordinates for my pictures?  (And any other photo meta-data, for that matter.)  I browsed through a discussion on stackoverflow, and for me the take-home message was “no.” I did find a way to view the data; first, load the iPhone photos into iPhoto. Then export to hard drive, being sure to check the ‘include location information’ box. Then, open with Preview, open the Inspector (command-i or choose from drop-down menu), and then click on the GPS tab.  From there it is a simple matter of typing everything in, photo by photo, into another file.
  • Weather data is easily found to supplement the story, as the above graph shows.
  • OpenPaths provides free location data, and even stores if for you.  It allows you to export nice csv files, such as this file

Displaying the Data

  •  Well, you can always paste photos and graphs into along, rambling narrative.
  • iPhoto is apparently one of the few softwares that does have access to your exif data, and the “Places” feature will, with some playing around, let you show where you’ve been. It’s tedious, and you can’t easily share the results (maybe not at all).  But it does let you click on a location pin and see the picture taken there, which is fun.
  • StatCrunch has a new feature that lets you easily communicate with google maps. You provide latitude, longitude and optional other data, and it makes a map.  some funny formatting requirements:  data must be in this form  lat lon|color|other_variable
    Hopefully, StatCrunch will add a feature that let’s you easily move from the usual flat-file format for data to this format.  In the meantime, I had to export my StatCrunch OpenPaths data to excel, (could have used R, but I’m rusty with the string commands), and then re-import as a new data set.
  • Venice Walk Open Paths map on StatCrunch-1

Making Sense of It All

But the true challenge is how do we make sense of it all?  How do we merge these data in such a way that unexpected patterns that reveal deeper truths can be revealed? At the very least, is there a single, comprehensive data display that would allow you to more fully appreciate my experience?  If (and when) I do this walk again, how can I compare the data from the two different walks?

Some other themes:  our data should be ours to do with as we please. OpenPaths has it right; iPhone has it wrong wrong wrong.  Another theme: maps are now a natural and familiar way of storing and displaying data.  StatCrunch has taken some steps in the right direction in attempting to provide a smooth pathway between data and map, but more is needed.  Perhaps there’s a friendly, flexible, open-source mapping tool out there somewhere that would encourage our data-concious citizens to share their lives through maps?

If you’re still reading, you can view all of the pictures on flikr.

Miscellany that I have Read and been Thinking about this Last Week

I read a piece last night called 5 Ways Big Data Will Change Lives In 2013. I really wasn’t expecting much from it, just scrolling through accumulated articles on Zite. However, as with so many things, there were some gems to be had. I learned of Aadhar.

Aadhar is an ambitious government Big Data project aimed at becoming the world’s largest biometric database by 2014, with a goal of capturing about 600 million Indian identities…[which] could help India’s government and businesses deliver more efficient public services and facilitate direct cash transfers to some of the world’s poorest people — while saving billions of dollars each year.

The part that made me sit up and take notice was this line, “India’s Aadhar collects sensitive information, such as fingerprints and retinal scans. Yet people volunteer because the potential incentives can make the data privacy and security pitfalls look miniscule — especially if you’re impoverished.”

I have been reading and hearing about concerns of data privacy for quite awhile, yet nobody that I have been reading (or listening to) has once suggested what the circumstances are that would have citizens forego all sense of privacy. Poverty, especially extreme poverty, is one of those circumstances. As a humanist, I am all for facilitating resources in the most efficient ways possible, which inevitably involve technology. But, as a Citizen Statistician, I am all too aware of how a huge database of biometric data could be used (or mis-used as it were). It especially concerns me that our impoverished citizens, who are more likely to be in the database, will be more at risk for being taken advantage of.

A second headline that caught my eye was France Looks At Possibility Of Taxing Internet Companies For Data Mining. France is pointing out that companies such as Google and Facebook are making enormous sums of money dollars by mining and using citizens’ personal information, so why shouldn’t that be seen as a taxable asset? While this is a reasonable question, the article also points out that one potential consequence of such taxation is that the “free” model (at least monetarily) that these companies currently use might cease to exist.

Related to both of these articles, I also read a blog post about a seminar being offered in the Computer Science department at the University of Utah entitled Accountability in Data Mining. The professor of the course wrote in the post,

I’m a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be “in scope”. I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it’s a 1-credit seminar that meets once a week, I obviously can’t cover everything I’d like, but I’d like to flesh out the readings with related work that people can peruse later.

It is about time some university offered such a course. I think this will be ultimately useful (and probably should be required) content to include in every statistics course taught. In making decisions using data, who is accountable for those decisions, and the consequences thereof?

1331746205255_562228Lastly, I would be remiss to not include a link to what might be the article I resonated to most: It’s not 1989. The author points out that the excuse “I’m not good with computers” is not acceptable any longer, especially for educators. He makes a case for a minimum level of technological competency that teachers should have in today’s day and age. I especially agree with the last point,

Every teachers must have a willingness to continue to learn! Technology is ever evolving, and excellent teachers must be life-long learners. (Particularly in the realm of technology!)

The lack of ability with computers that I see on a day-to-day basis in several students and faculty (even the base-level literacy that the author wants) is frightening and saddening at the same time. I would love to see colleges and universities give all incoming students a computer literacy test at the same time as they take their math placement test. If you can’t copy-and-paste you should be sent to a remedial course to obtain the skills you need to acquire before taking any courses at the institution.


Since posting last month about data-sharing concerns with some popular apps, I’ve since learned about which, apparently, helps us see how are data are used by iOS apps. For instance, according to Cluefulapp, Google Maps can read my address book, uses my iPhone’s unique ID, encruypts stored data, “could” track my location, and uses an anonymous identifier.

Waze is somewhat similar.  It “could” track my location [quotes are because I wonder what they mean by could—does it?], connects to twitter and facebook, can read my address book (but does it?), and uses an anonymous identifier.

Still, I wonder if it goes far enough.  Google Maps seems relatively safe, until you think about what might happen if a third-party could merge this data with another app and learn more.  Say a restaurant owner sees several anonymous identifiers at his restaurant.  A look at Facebook reveals a small number of people who ‘checked in’ at that restaurant—perhaps their identifiers are among them?  Some of those people checked in at other places after the restaurant, and sure enough, the same identifier appears elsewhere.  Now the restaurant knows who the person is.

I’m not sure whether this is a likely scenario, but it seems the next step is a device that puts together a profile of how you might appear in public, if someone merged the data from all of the apps used in a day.  Or perhaps this already exists?

Introducing Statistics: A Graphic Guide



Over the winter break I was travelling in the UK and I came across this little book called “Introducing Statistics: A Graphic Guide” by Ellen Magnello and Borin Van Loon at the gift shop in the Tate Modern museum in London. The book is published in 2009, and Significance magazine already reviewed it here, so I won’t repeat their comments. I hadn’t heard about the book before, so I picked it up, along with a copy of Introducing Post-Modernism (they were 2 for £10, I had to get two, obviously).

I think the book would be more appropriately named “an illustrated guide”, since the images are mostly illustrations of statisticians with speech bubbles instead of graphics that help visualize the concepts being discussed. The most unexpected are the images of the author herself. The first time I came across one of those I was thinking “who is this lady in the pant-suit standing next to Karl Pearson?”. Needless to say, the illustrations sometimes distract from the text, but they’re fun and nicely drawn.

The book does a very good job of describing the differences between vital statistics and mathematical statistics, and what the terms “statistic” and “variability” mean. Therefore, while the audience of the book is not clear, it could be a perfect gift for parents of statisticians who still don’t quite understand what their offspring do. Or really anyone who is interested in statistics, but has no real formal experience with it.

While the book tells the early history of statistics well, the introduction of statistical concepts follow a strange order. It is useful for gaining familiarity with some terminology and simple statistical distributions and tests, but it would be quite difficult to acquire a thorough understanding of these concepts from the book’s introduction. However, I’m guessing this is not the intent of the book, anyway.

The book is part of a series called Introducing Books, which contain about 80 graphical guides from Introducing Aesthetics to Marxism to Wittgenstein. The museum shop where I got the book carried only about 10 of these titles, and I was happy to see that Introducing Statistics was one of them.

My Year of Reading in Review

Two years ago, I made a New Year’s Resolution to read more books. At that point I joined GoodReads to hold myself accountable. I read 47 books that year (at least that I recorded). In 2012, I didn’t re-make that resolution, and my reading productivity dropped to 29 (really 26 since I quit reading 3 books). While the number of books is lower, I did some minor analyses on these books based on data I scraped from GoodReads and Amazon.

One summary I created was to examine the number of books I read per month. I also wanted to account for the fact that some books are a lot shorter than others, so in addition I looked at the average number of pages I read per month as well.

Number of books and average pages read per month in 2012.

Number of books and average pages read per month in 2012.

It is clear that May and December are prolific reading months for me. My interpretation is that these are the months that semesters end, and very often I retreat into the pages of a book or two to escape for a bit.

How do I rate these books I read? Are Amazon and GoodReads raters giving the books I read the same rating?

Average rating of the books I read in 2012 for Amazon and GoodReads raters. Size and color of the points indicate my GoodReads rating.

Average rating of the books I read in 2012 for Amazon and GoodReads raters. Size and color of the points indicate my GoodReads rating.

I gave mostly 3/5 and 4/5 stars to the books I read. It is clear from the plot that there is an overall positive relationship: books that are rated highly by GoodReads raters are, on average, the same books being rated highly by Amazon raters–and vice-versa.

What book did I give a rating of 5/5 to? The synopsis of my reading year, including the answer to this question, is available here [2012-Annual-Reading-Synopsis].

I have also made the spreadsheet data (from both 2011 and 2012) available publicly on GoogleDocs [data].

Here is the R code to access that data.

[sourcecode language=”r”]
myCsv <- getURL("")
books <- read.csv(textConnection(myCsv))

Big Data for Little People

On New Year’s Eve, All Tech Considered, had a segment looking ahead to interesting technology in the coming year.  One of the themes was Big Data, but I particularly liked the way they sold it: “Big Data For Little People.”  The basic idea being that much of big data is owned by big companies, which crunch the data for their own purposes.  But the NPR folks are seeing a trend for applications that crunch big data and bring the results to your smartphone or other app.

This is nothing new, of course, but I welcome more of it.  This year I started using Waze to help me commute.  Waze runs on my smartphone and analyzes the locations and velocities of other Waze users in order to predict which route will get me to my destination quickest.  If an accident happens down the road, Waze redirects me around it.  Despite a predilection for directing me to cross Wilshire Blvd (10 lanes of traffic) without the assistance of a stop light or stop sign, and despite the fact that every so often I realize that I’m being routed so as to collect velocity data on an as-yet-untested road, Waze does lots of smart things and definitely saves time, particularly on highways.

Of course, credit card companies have been providing a service like this for years, and having been a victim of identity theft twice this year (most recently on New Year’s Eve), I’ve grown to appreciate how rapidly these companies detect unusual patterns in my card usage.  Other apps (I use AllClear ID) monitor your other financial transactions as well, and alert you to potential fraud.  I also recall an app that compares your bank statements to others’ and pinpoints places where you might save money.  (Was it called PiggyBank?)

The reporters on NPR (Steve Henn and Laura Sydell) predict ‘smart wallets’ that will advise you on the wisdom of purchases you are making or considering, as well as traffic-prediction apps.

What big data for little people apps would you like to see?