One big challenge we all face is understanding what’s good and what’s bad for us. It’s harder when published research studies conflict. So thanks to Roger Peng for posting on his Facebook page an article that led me to this article by Emily Oster: Cellphones Do Not Give You Brain Cancer, from the good folks at the 538 blog. I think this article would make a great classroom discussion, particularly if, before seeing the article, your students brainstormed several possible experimental designs and discussed the strengths and weaknesses of each. I think it is also interesting to ask why no study similar to the Danish Cohort study was done in the US. Thinking about this might lead students to consider cultural attitudes towards widespread data collection.
The L.A. Times ran an article on data privacy today, which, I think it’s fair to say, puts “Big Data” in approximately the same category as fire. In the right hands, it can do good. But…
What do we fear more? Losing data privacy to our government, or to corporate entities? On the one hand, we (still) have oversight over our government. On the other hand, the government is (still) more powerful than most corporate entities, and so perhaps better situated to frighten.
In these times of Snowden and the NSA, the L.A. Times ran an interesting story about just what tracking various internet companies perform. And it’s alarming. (“They’re watching your every move,” July 10, 2013. Interestingly, the story does not seem to appear on their website as of this posting.) Like the government, most of these companies claim that (a) their ‘snooping’ is algorithmic; no human sees the data and (b) their data are anonymized. And yet…
To my knowledge, businesses aren’t required to adhere to, or even acknowledge, any standards or practices for dealing with private data. Thus, a human could snoop on particular data. We are left to ponder what that human will do with the information. In the best-case scenario, the human would be fired, as happened, according to the L.A. Times, when Google fired an engineer for snooping on the emails of some teenage girls.
But the data are anonymous, you say? Well, there’s anonymous and then there’s anonymous. As Latanya Sweeney taught us in the 90’s, knowing a person’s zipcode, gender, and date of birth is sufficient to uniquely identify 87% of Americans. And the L.A. Times reports a similar study where just four hours of anonymized tracking data was sufficient to identify 95% of all individuals examined. So while your name might not be recorded, by merging enough data files, they will know it is you.
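To make the point concrete, here is a minimal sketch in Python of how one might measure how identifying a combination of fields is. The toy records and the function name `fraction_unique` are invented for illustration; this is not Sweeney's data or her method, just a count of how many rows are one of a kind.

```python
from collections import Counter

def fraction_unique(records, keys):
    """Share of records whose combination of `keys` values appears exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(combos)

# Hypothetical toy records: no names, only "harmless" fields.
records = [
    {"zip": "90024", "gender": "F", "dob": "1991-03-14"},
    {"zip": "90024", "gender": "F", "dob": "1991-03-14"},
    {"zip": "90024", "gender": "M", "dob": "1990-07-02"},
    {"zip": "90095", "gender": "F", "dob": "1992-11-30"},
    {"zip": "90095", "gender": "M", "dob": "1990-07-02"},
]

# Gender alone singles out nobody; the three fields together single out most rows.
print(fraction_unique(records, ["gender"]))
print(fraction_unique(records, ["zip", "gender", "dob"]))
```

In this toy file no one is unique on gender alone, but three of the five rows are unique once zipcode, gender, and date of birth are combined.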
This article fits in really nicely with a fascinating, revelatory book I’m currently midway through: Jaron Lanier’s Who Owns The Future? A basic theme of this book is that internet technology devalues products and goods (files) and values services (software). One process through which this happens is that we humans accept the marvelous free stuff that the internet provides (free Google searches, free Amazon shipping, easily pirated music files) in exchange for allowing companies to snoop. The companies turn our aggregated data into dollars by selling to advertisers.
A side effect of this, Lanier explains, is that there is a loss of social freedom. At some point, a service such as Facebook gets to be so large that failing to join means that you are losing out on possibly rich social interactions. (Yes, I know there are those who walk among us who refuse to join Facebook. But these people are probably not reading this blog, particularly since our tracking ‘bots tell us that most of our readers come from Facebook referrals. Oops. Was I allowed to reveal that?) So perhaps you shouldn’t complain about being snooped on since you signed away your privacy rights. (You did read the entire user agreement, right? Raise your hand if you did. Thought so.) On the other hand, if you don’t sign, you become a social pariah. (Well, an exaggeration. For now.)
Recently, I installed Ghostery, which tracks the automated snoopers that follow me during my browsing. Not only “tracks”, but also blocks. Go ahead and try it. It’s surprising how many different sources are following your every on-line move.
I have mixed feelings about blocking this data flow. The data-snooping industry is big business, and is responsible, in part, for the boom of stats majors and, more importantly, the boom in stats employment. And so indirectly, data-snooping is paying for my income. Lanier has an interesting solution: individuals should be paid for their data, particularly when it leads to value. This means the era of ‘free’ is over: we might end up paying for searches and for reading Wikipedia. But he makes a persuasive case that the benefits exceed the costs. (Well, I’m only half-way through the book. But so far, the case is persuasive.)
My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened. I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’ (We had talked a bit in class about what that meant, and about what devices were storing data.) They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.
The results were interesting. The vast majority “got” it. The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”
But those were very few (maybe 2 or 3). The rest were quite thoughtful. The sort of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (Facebook postings, Google searches), and classroom activities (clickers, enrollments). Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted. They pointed out that with just one day’s data, someone could have a pretty clear idea of their social structure. And by pooling the class’s data or the campus’s data, someone could get a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future. They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.
Here’s my question for you: what’s the next step? Where do we go from here to build on this lesson? And to what purpose?
Since posting last month about data-sharing concerns with some popular apps, I’ve learned about Cluefulapp.com which, apparently, helps us see how our data are used by iOS apps. For instance, according to Cluefulapp, Google Maps can read my address book, uses my iPhone’s unique ID, encrypts stored data, “could” track my location, and uses an anonymous identifier.
Waze is somewhat similar. It “could” track my location [quotes are because I wonder what they mean by could—does it?], connects to twitter and facebook, can read my address book (but does it?), and uses an anonymous identifier.
Still, I wonder if it goes far enough. Google Maps seems relatively safe, until you think about what might happen if a third party could merge this data with data from another app and learn more. Say a restaurant owner sees several anonymous identifiers at his restaurant. A look at Facebook reveals a small number of people who ‘checked in’ at that restaurant. Perhaps their identifiers are among them? Some of those people checked in at other places after the restaurant, and sure enough, the same identifier appears elsewhere. Now the restaurant knows who the person is.
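The restaurant scenario amounts to intersecting lists of names. Here is a rough Python sketch of it, with made-up identifiers, places, and names, and with the strong assumption that the person checked in publicly at every place where their identifier was seen. The function `candidates` is hypothetical, not part of any real app or service.

```python
# Hypothetical data: anonymous device identifiers seen at each place,
# and the names of people who publicly checked in there.
seen_ids = {
    "restaurant": {"id-17", "id-42", "id-88"},
    "cinema": {"id-42", "id-63"},
}
checkins = {
    "restaurant": {"Ana", "Ben", "Cho"},
    "cinema": {"Ben", "Dee"},
}

def candidates(identifier, seen_ids, checkins):
    """Names consistent with every place where `identifier` was observed."""
    places = [p for p, ids in seen_ids.items() if identifier in ids]
    if not places:
        return set()
    names = set(checkins[places[0]])
    for place in places[1:]:
        names &= checkins[place]  # keep only names that also appear here
    return names

print(candidates("id-42", seen_ids, checkins))  # {'Ben'}
print(candidates("id-17", seen_ids, checkins))  # still three possible names
```

One sighting leaves three candidates; a second sighting at another place narrows it to one.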
I’m not sure whether this is a likely scenario, but it seems the next step is a device that puts together a profile of how you might appear in public, if someone merged the data from all of the apps used in a day. Or perhaps this already exists?
The L.A. Times ran an interesting article about the new Federal Trade Commission report, “Mobile Apps for Kids: Disclosures Still Not Making the Grade”. The report followed up on a February 2012 report and concluded that “Yes, many apps included interactive features or shared kids’ information with third parties without disclosing these practices to parents.”
I think this issue is intriguing on many levels, but of central concern is the fact that as we go about our daily business (or play, as the case may be), we leave a data trail, sometimes unwittingly. Quite often unknowingly. Perhaps we’ve reached the point where there’s no going back, and we must accept the fact that when we engage with the mobile community, we engage in a data-exchange. But it seems it would be easy to set standards so that, maybe, developers are required to produce logs of the data transactions. And third-party developers could write apps that let parents examine this exchange. (Without, of course, sharing this information with a third party.) It would be interesting and fun, I would think, to create a visualization of the data flow in and out of your device across a period of time.
The report indicated that the most commonly shared datum was the device ID number. Sounds innocent, but, as we all know, it’s the crosstabs that kill. The device ID is a unique number, and is associated with other variables, such as the device operating system, primary language, the carrier, and other information. It can also be used to access personal data, such as the name, address, and email address of the user. While some apps share only the device ID, and thus may seem safe, other apps send more data to the same companies. And so companies that receive data from multiple apps can build up a more detailed picture of the user. And these companies share data with each other, and so can create even richer pictures.
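Here is a small Python sketch of why the device ID matters: it acts as a merge key. The reports, field names, and values below are invented for illustration and are not taken from the FTC report; `build_profiles` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical reports sent by different apps to the same third party.
reports = [
    {"device_id": "A1", "os": "iOS 6", "language": "en"},
    {"device_id": "A1", "carrier": "ExampleTel", "location": "Los Angeles"},
    {"device_id": "A1", "email": "kid@example.com"},
    {"device_id": "B2", "os": "Android 4.1"},
]

def build_profiles(reports):
    """Merge all fields reported for each device ID into one record."""
    profiles = defaultdict(dict)
    for report in reports:
        fields = {k: v for k, v in report.items() if k != "device_id"}
        profiles[report["device_id"]].update(fields)
    return dict(profiles)

profiles = build_profiles(reports)
print(sorted(profiles["A1"]))  # every field any app reported for device A1
```

No single report says much, but the merged record for device A1 holds the operating system, language, carrier, location, and email address together.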
There are some simple ways of introducing the topic into a stats course. The report essentially conducted a random sample (well, let’s say it had some random sampling components) of apps. And it reports estimated percentages. But never, of course, confidence intervals. And so you can ask your students a question such as “The FTC randomly sampled 400 apps that were marketed to ‘kids’ from the Google and iTunes app stores. 60% of the apps in the sample transmitted the device ID to a third party. Find a 95% confidence interval for the proportion of all apps….” Or, “only 20% of the apps that transmitted private information to a third party disclosed this fact to the user. Find a 95% CI….”
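For the first classroom question, here is a minimal Python sketch of the usual large-sample (Wald) interval, using the question's own numbers (400 apps, 60%). It assumes a simple random sample, and the function name `wald_ci` is just a label chosen for this example.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Large-sample (Wald) confidence interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

low, high = wald_ci(0.60, 400)
print(f"{low:.3f} to {high:.3f}")  # 0.552 to 0.648
```

So students should land on an interval of roughly 55% to 65%. The second question does not state a sample size, so students would need to be given one before they could do the same calculation.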
The report contains some nice comparisons with the 2011 data concerning the types of “kids” apps available, as well as a discussion of the type of advertising that appears. (An amusing example that shouldn’t be amusing is an app targeted at kids that displays advertising for a dating web site: “1000+ singles!”) Reminds me of something my sister told me when her kids were young: she could always tell when they found something troubling on the computer because suddenly they would grow very quiet.
The L.A. Times today (Monday, November 19) ran an editorial about the benefits and costs of Big Data. I truly believe that statisticians should teach introductory students (and all students, really) about data privacy. But who feels they have a realistic handle on the nature of these threats and the size of the risk? I know I don’t. Does anyone teach this in their class? Let’s hear about it! In the meantime, you might enjoy reading (or re-reading) a classic on the topic by Latanya Sweeney: k-Anonymity: a model for protecting privacy.
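For readers who want a feel for the idea before reading the paper: a table is k-anonymous if every combination of quasi-identifier values is shared by at least k rows. The Python sketch below uses invented rows, and its `generalize` step is a simple coarsening chosen for illustration, not a procedure taken from Sweeney's paper.

```python
from collections import Counter

def k_of(rows, quasi_ids):
    """Largest k for which `rows` is k-anonymous on `quasi_ids`:
    the size of the smallest group of rows sharing the same values."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(groups.values())

def generalize(row):
    """Coarsen a row: keep 3 ZIP digits and only the birth year."""
    return {
        "zip": row["zip"][:3] + "**",
        "gender": row["gender"],
        "dob": row["dob"][:4],
    }

# Hypothetical toy rows.
rows = [
    {"zip": "90024", "gender": "F", "dob": "1991-03-14"},
    {"zip": "90025", "gender": "F", "dob": "1991-08-01"},
    {"zip": "90095", "gender": "M", "dob": "1990-07-02"},
    {"zip": "90034", "gender": "M", "dob": "1990-12-25"},
]
quasi_ids = ["zip", "gender", "dob"]

print(k_of(rows, quasi_ids))                           # 1: every row is unique
print(k_of([generalize(r) for r in rows], quasi_ids))  # 2: each row now has a twin
```

The trade-off is visible even at this scale: the coarsened table protects each person a little better, but it is also less useful for analysis.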