Personal Data Apps

Fitbit, you know I love you and you’ll always have a special place in my pocket.  But now I have to make room for the Moves app to play a special role in my capture-the-moment-with-data existence.

Moves is a free iOS 7 app.  It eats up some extra battery power and, in exchange, records your location, merges this with various databases, syncs it up to other databases, and produces some very nice “story lines” that remind you about the day you had and, as a bonus, can motivate you to improve your activity levels.  I’ve attached two example storylines that do not make it too embarrassingly clear how little exercise I have been getting. (I have what I consider legitimate excuses, and once I get the dataset downloaded, maybe I’ll add them as covariates.)  One of the timelines is from a day that included an evening trip to Disneyland. The other is a Saturday spent running errands and capped with dinner at a friend’s.  It’s pretty easy to tell which day is which.

[Two example Moves storyline screenshots]

But there’s more.  Moves has an API, allowing developers to tap into its data stream to create apps.  There’s an app that exports the data for you (although I haven’t really had success with it yet) and several that create journals based on your Moves data.  You can also merge Foursquare, Twitter, and all the usual suspects.
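For the curious, here is a rough sketch of what tapping into that data stream might look like. The endpoint path, query parameters, and JSON field names are my assumptions about the Moves storyline API (and you would need your own OAuth access token), so check the developer documentation before relying on any of it:

```python
# A sketch only: the endpoint, parameters, and response fields below are
# assumptions about the Moves storyline API -- verify against the developer docs.
import requests
from collections import defaultdict

ACCESS_TOKEN = "YOUR_OAUTH_ACCESS_TOKEN"  # obtained via the Moves OAuth 2.0 flow

resp = requests.get(
    "https://api.moves-app.com/api/1.1/user/storyline/daily",  # assumed endpoint
    params={"pastDays": 7},
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
)
resp.raise_for_status()

# Tally how many minutes were spent on each activity over the past week
seconds_by_activity = defaultdict(int)
for day in resp.json():
    for segment in day.get("segments") or []:
        for activity in segment.get("activities", []):
            seconds_by_activity[activity["activity"]] += activity.get("duration", 0)

for name, seconds in sorted(seconds_by_activity.items()):
    print(f"{name}: {seconds / 60:.1f} minutes")
```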

I think it might be fun to have students discuss how one could go from the data Moves collects to the storylines it creates.  For instance, how does it know I’m in a car and not just a very fast runner?  Actually, given LA traffic, a better question is how it knows I’m stuck in traffic and not just strolling down the freeway at a leisurely pace. (Answering these questions requires a different type of inference from what we normally teach in statistics.)  Besides journals, what apps might students create with these data, and what additional data would they need?
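To make the discussion concrete, here is a deliberately naive toy classifier, the kind of first guess students might propose, that labels a segment by its average speed alone. The thresholds are invented for illustration; a real system would also use accelerometer readings, dwell times, road maps, and more:

```python
# A naive first pass: classify a GPS segment by average speed alone.
# Thresholds are made up for illustration; Moves' actual classifier is unknown
# to me and surely uses far more than speed.
def classify_segment(distance_m, duration_s):
    """Guess an activity from the average speed over a GPS segment."""
    if duration_s <= 0:
        return "unknown"
    kmh = (distance_m / 1000) / (duration_s / 3600)
    if kmh < 7:
        return "walking"
    elif kmh < 16:
        return "running"
    elif kmh < 30:
        return "cycling (or LA freeway traffic?)"
    else:
        return "driving / transport"

print(classify_segment(2000, 600))    # 2 km in 10 minutes -> running
print(classify_segment(15000, 600))   # 15 km in 10 minutes -> driving / transport
```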

An Accidental Statistician

I just finished reading An Accidental Statistician: The Life and Memories of George E. P. Box. The book reads as though he is recounting his memories (it is aptly named) rather than like a formal biography. I enjoyed the stories and vignettes of his work and his intersections with other statisticians. The book also includes pictures of many famous statisticians (George’s friends and family—Fisher was his father-in-law for a bit) in social situations. My favorite was the picture of Dr. Frank Wilcoxon on his motorcycle (see below).

[Photo: Frank Wilcoxon on his motorcycle]

There were some very interesting and funny anecdotes. For example, when George recounted a trip to Israel, he was told to get to the airport very early because of the intense security measures. After standing in a non-moving line for several hours, he apparently quipped that he had never before physically seen a stationary process.

My favorite sections of the book were the stories he told about writing Statistics for Experimenters, his book—along with William (Bill) Hunter and Stu Hunter—on experimental design. He wrote about how the book evolved from mimeographed notes for a course he had taught into the published version. It took them several years to finish writing the book, only for it to be met with horrible reviews. (Note: This makes me feel slightly better about the year it took to write our book.)

In a chapter written about Bill Hunter (who was one of George’s graduate students at the University of Wisconsin), George relates that Bill started his PhD in 1960. After he finished (in 1963!), he was hired almost immediately by Wisconsin as an assistant professor. Three years later he was made associate professor, and in 1969 (nine years after he started his PhD) he was made full professor. Unbelievable!

Box, Hunter, and Hunter

Facebook Analytics

WolframAlpha has a tool that will analyze your Facebook network. I saw this a while ago, but HollyLynne reminded me of it recently, and I tried it out. You need to give the app(?) permission to access your account (which I am sure means access to your data for Wolfram), after which you are given all sorts of interesting, pretty info. Note, you can also opt to have Wolfram track your data over time in order to determine how your network is changing.

Some of the analyses are kind of informative, but others are not. Consider this scatterplot(???)-type plot entitled “Weekly Distribution”. Tufte could include it in his next book of worthless graphs.

[Wolfram “Weekly Distribution” plot]

There are other analyses that are more useful. For example, I learned that my post announcing the Citizen Statistician blog was the most liked post I have, while the post showing photographic evidence that I held a baby as far back as 1976 was the most commented.

This plot was also interesting…too bad it is a pie chart (sigh).

[Wolfram pie chart]

There is also a ton of other information, such as which friend has the most friends (Jayne at 1819), your youngest and oldest friends based on the reported birthdays, photos that are tagged the most, word clouds of your posts, etc.

This network was my favorite of them all. It shows the social insiders and outsiders in my network of friends, and identifies social connectors, neighbors, and gateways.

[Wolfram friend-network graph: social insiders, outsiders, connectors, neighbors, and gateways]
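If students wanted to build something like this themselves, one standard starting point is betweenness centrality: a “connector” is someone who sits on many of the shortest paths between other friends. Here is a minimal sketch with a made-up friend list, using the networkx package:

```python
# Made-up friend list: two tight clusters bridged by one person ("Gina").
# Betweenness centrality flags the bridges -- the "connectors"/"gateways".
import networkx as nx

friendships = [
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),   # cluster 1
    ("Dan", "Erin"), ("Erin", "Frank"), ("Dan", "Frank"),     # cluster 2
    ("Carol", "Gina"), ("Gina", "Dan"),                       # Gina bridges them
]
G = nx.Graph(friendships)

betweenness = nx.betweenness_centrality(G)
for name, score in sorted(betweenness.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```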

Once again, it is kind of a cool tool that works with your existing data, but there does not seem to be a way to obtain that data in a workable format.

Miscellany That I Have Read and Been Thinking About This Last Week

I read a piece last night called 5 Ways Big Data Will Change Lives In 2013. I really wasn’t expecting much from it, just scrolling through accumulated articles on Zite. However, as with so many things, there were some gems to be had. I learned of Aadhar.

Aadhar is an ambitious government Big Data project aimed at becoming the world’s largest biometric database by 2014, with a goal of capturing about 600 million Indian identities…[which] could help India’s government and businesses deliver more efficient public services and facilitate direct cash transfers to some of the world’s poorest people — while saving billions of dollars each year.

The part that made me sit up and take notice was this line, “India’s Aadhar collects sensitive information, such as fingerprints and retinal scans. Yet people volunteer because the potential incentives can make the data privacy and security pitfalls look miniscule — especially if you’re impoverished.”

I have been reading and hearing about concerns over data privacy for quite a while, yet nobody that I have been reading (or listening to) has once suggested what circumstances would lead citizens to forgo all sense of privacy. Poverty, especially extreme poverty, is one of those circumstances. As a humanist, I am all for delivering resources in the most efficient ways possible, which inevitably involves technology. But, as a Citizen Statistician, I am all too aware of how a huge database of biometric data could be used (or misused, as it were). It especially concerns me that our impoverished citizens, who are more likely to be in the database, will be at greater risk of being taken advantage of.

A second headline that caught my eye was France Looks At Possibility Of Taxing Internet Companies For Data Mining. France points out that companies such as Google and Facebook are making enormous sums of money by mining and using citizens’ personal information, so why shouldn’t that be treated as a taxable asset? While this is a reasonable question, the article also points out that one potential consequence of such taxation is that the “free” (at least monetarily) model these companies currently use might cease to exist.

Related to both of these articles, I also read a blog post about a seminar being offered in the Computer Science department at the University of Utah entitled Accountability in Data Mining. The professor of the course wrote in the post,

I’m a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be “in scope”. I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it’s a 1-credit seminar that meets once a week, I obviously can’t cover everything I’d like, but I’d like to flesh out the readings with related work that people can peruse later.

It is about time some university offered such a course. I think this content will ultimately be useful (and probably should be required) in every statistics course taught. In making decisions using data, who is accountable for those decisions and for their consequences?

Lastly, I would be remiss not to include a link to what might be the article that resonated with me most: It’s not 1989. The author points out that the excuse “I’m not good with computers” is no longer acceptable, especially for educators. He makes a case for a minimum level of technological competency that teachers should have in this day and age. I especially agree with the last point,

Every teacher must have a willingness to continue to learn! Technology is ever evolving, and excellent teachers must be life-long learners. (Particularly in the realm of technology!)

The lack of ability with computers that I see on a day-to-day basis in many students and faculty (who often lack even the base-level literacy the author wants) is frightening and saddening at the same time. I would love to see colleges and universities give all incoming students a computer literacy test at the same time they take their math placement test. If you can’t copy and paste, you should be sent to a remedial course to acquire the skills you need before taking any courses at the institution.

Nate Silver’s New Book

I’ve been reading and greatly enjoying Nate Silver’s book, The Signal and the Noise: Why So Many Predictions Fail—and Some Don’t. I’d recommend the book based on the introduction and first chapter alone. (And, no, that’s not because that’s all I’ve read so far. It’s because they’re that good.) If you’re the sort who skips introductions, I strongly suggest you become a new sort and read this one. It’s a wonderful essay about the dangers of too much information and the need to make sense of it. Silver makes the point that, historically, when we’ve been faced with more information than we can handle, we tend to pick and choose which ‘facts’ we wish to believe. Sounds like a presidential debate, no?

Another thing to like about the book is the argument it provides against the Wired Magazine view that Big Data means the end of scientific theory. Chapter by chapter, Silver describes the very important role that theory and modeling play in making (successful) predictions. In fact, a theme of the book is that prediction is a human endeavor, despite the attention data scientists pay to automated algorithmic procedures. “Before we can demand more of our data, we need to demand more of ourselves.” In other words, the Data Deluge requires us to find useful information, not just any old information. (Which is where we educators come in!)

The first chapter makes a strong argument that the financial crisis was, to a great extent, a failure to understand the fundamentals of statistical modeling, in particular a failure to realize that models are not the thing they model. Models are shaped by data but run on assumptions, and when the assumptions are wrong, the predictions fail. Chillingly, Silver points out that recoveries from financial crises tend to be much, much slower than recoveries from economic crises and that, in fact, some economies never recover.

Other chapters talk about baseball, weather, earthquakes, poker and more. I particularly enjoyed the weather chapter because, well, who doesn’t enjoy talking about the weather? For me, perhaps because we are in the midst of elections, it also raised questions about the role of the U.S. federal government in supporting the economy. Weather prediction plays a big role in our economic infrastructure, even though many people tend to be dismissive of our ability to predict the weather. So it was interesting to see that, in fact, the government agencies do predict the weather better than the private prediction firms (such as The Weather Channel), and much better than the local news channels. In fact, as Silver explains, the marketplace rewards poor predictions (at least when it comes to predicting rain). For me, this underlines the importance of a ‘neutral’ party.

As I think about preparing students for the Deluge, I think that teaching prediction should take priority over teaching inference. Inference is important, but it is a specialized skill, and so is not needed by all. Prediction, on the other hand, is inherently important, and has been for millennia. Yes, prediction is a type of inference, but prediction and inference are not the same thing. As Silver points out, estimating a candidate’s level of support is different from predicting whether or not the candidate will win. (Which leads me to propose a new slogan: “Prediction: Inference for Tomorrow!” Or “Prediction: Inference for Procrastinators!”)
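A toy calculation makes the distinction concrete. Suppose a poll estimates a candidate’s support at 52% with a standard error of 2 points (numbers invented for illustration). The inference is the interval for true support; the prediction is the probability of actually winning:

```python
# Invented numbers: estimated two-party support of 52% with a 2-point standard error.
from scipy.stats import norm

support_hat = 0.52
se = 0.02

# Inference: an approximate 95% confidence interval for the candidate's true support
ci_low, ci_high = support_hat - 1.96 * se, support_hat + 1.96 * se

# Prediction: the chance the candidate actually wins (true support above 50%),
# under a rough normal approximation
p_win = 1 - norm.cdf(0.50, loc=support_hat, scale=se)

print(f"95% CI for support: ({ci_low:.3f}, {ci_high:.3f})")   # (0.481, 0.559)
print(f"Estimated probability of winning: {p_win:.2f}")       # about 0.84
```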

Much of this may be beyond the realm of introductory statistics, since some of the predictive models are complex.  But the basics are important for intro stats students.  All students should understand what a statistical model is and what it is not.  Equally importantly, they should understand how to evaluate a model.  And I don’t mean that they should learn about r-squared (or only about r-squared).  They should learn about the philosophy of measuring model performance.  In other words, intro stats students should understand why many predictions fail, but some don’t, and how to tell the difference.
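One way to get at that philosophy, sketched below with simulated data, is to have students compare how a model scores on the data it was fit to versus data it has never seen; the in-sample r-squared tells only part of the story:

```python
# Simulated data: a noisy linear relationship, fit on half the data and
# scored on the other half.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 3 + 2 * x[:, 0] + rng.normal(scale=4, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(x_train, y_train)

print("in-sample r-squared:", round(r2_score(y_train, model.predict(x_train)), 3))
print("out-of-sample RMSE: ", round(mean_squared_error(y_test, model.predict(x_test)) ** 0.5, 3))
```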

So let’s talk specifics.  Post your comments on how you teach your students about prediction and modeling.