Nathan Yau has a new book coming out, this one about working with data. Pre-order now!
Since posting last month about data-sharing concerns with some popular apps, I’ve learned about Cluefulapp.com which, apparently, helps us see how our data are used by iOS apps. For instance, according to Cluefulapp, Google Maps can read my address book, uses my iPhone’s unique ID, encrypts stored data, “could” track my location, and uses an anonymous identifier.
Waze is somewhat similar. It “could” track my location (I use quotes because I wonder what they mean by could: does it or doesn’t it?), connects to Twitter and Facebook, can read my address book (but does it?), and uses an anonymous identifier.
Still, I wonder if it goes far enough. Google Maps seems relatively safe, until you think about what might happen if a third party could merge this data with data from another app and learn more. Say a restaurant owner sees several anonymous identifiers at his restaurant. A look at Facebook reveals a small number of people who ‘checked in’ at that restaurant—perhaps their identifiers are among them? Some of those people checked in at other places after the restaurant, and sure enough, the same identifier appears elsewhere. Now the restaurant owner knows who the person is.
I’m not sure whether this is a likely scenario, but it seems the next step is a device that puts together a profile of how you might appear in public, if someone merged the data from all of the apps used in a day. Or perhaps this already exists?
Two years ago, I made a New Year’s Resolution to read more books. At that point I joined GoodReads to hold myself accountable. I read 47 books that year (at least that I recorded). In 2012, I didn’t re-make that resolution, and my reading productivity dropped to 29 (really 26, since I abandoned 3 books partway through). While the number of books is lower, I did some minor analyses on these books based on data I scraped from GoodReads and Amazon.
One summary examines the number of books I read per month. Because some books are much shorter than others, I also looked at the average number of pages per month.
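These per-month summaries take only a few lines in R. Here is a sketch, with a tiny made-up data frame standing in for the scraped GoodReads data (the column names `date.read` and `pages` are illustrative, not the names in my actual spreadsheet):

```r
# Toy stand-in for the scraped GoodReads data (column names are illustrative)
books <- data.frame(
  date.read = as.Date(c("2012-05-01", "2012-05-20", "2012-08-11", "2012-12-10")),
  pages     = c(300, 150, 220, 450)
)
books$month <- format(books$date.read, "%m")   # month (01-12) each book was finished
table(books$month)                             # books read per month
tapply(books$pages, books$month, mean)         # average pages per book, by month
```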
It is clear that May and December are prolific reading months for me. My interpretation is that these are the months that semesters end, and very often I retreat into the pages of a book or two to escape for a bit.
How do I rate these books I read? Are Amazon and GoodReads raters giving the books I read the same rating?
I gave mostly 3/5 and 4/5 stars to the books I read. It is clear from the plot that there is an overall positive relationship: books rated highly by GoodReads raters are, on average, the same books rated highly by Amazon raters, and vice versa.
What book did I give a rating of 5/5 to? The synopsis of my reading year, including the answer to this question, is available here [2012-Annual-Reading-Synopsis].
I have also made the spreadsheet data (from both 2011 and 2012) available publicly on GoogleDocs [data].
Here is the R code to access that data.
library(RCurl)
# Download the published Google spreadsheet as CSV, then read it into a data frame
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?key=0AvanLJO1M39wdENZajR0RHJMSmZTWWtLNzhHMi1ySUE&single=true&gid=0&output=csv")
books <- read.csv(textConnection(myCsv))
On New Year’s Eve, All Tech Considered had a segment looking ahead to interesting technology in the coming year. One of the themes was Big Data, but I particularly liked the way they sold it: “Big Data For Little People.” The basic idea is that much of big data is owned by big companies, which crunch the data for their own purposes. But the NPR folks see a trend toward applications that crunch big data and bring the results to your smartphone or other device.
This is nothing new, of course, but I welcome more of it. This year I started using Waze to help me commute. Waze runs on my smartphone and analyzes the locations and velocities of other Waze users in order to predict which route will get me to my destination quickest. If an accident happens down the road, Waze redirects me around it. Despite a predilection for directing me to cross Wilshire Blvd (10 lanes of traffic) without the assistance of a stop light or stop sign, and despite the fact that every so often I realize that I’m being routed so as to collect velocity data on an as-yet-untested road, Waze does lots of smart things and definitely saves time, particularly on highways.
Of course, credit card companies have been providing a service like this for years, and having been a victim of identity theft twice this year (most recently on New Year’s Eve), I’ve grown to appreciate how rapidly these companies detect unusual patterns in my card usage. Other apps (I use AllClear ID) monitor your other financial transactions as well, and alert you to potential fraud. I also recall an app that compares your bank statements to others’ and pinpoints places where you might save money. (Was it called PiggyBank?)
The reporters on NPR (Steve Henn and Laura Sydell) predict ‘smart wallets’ that will advise you on the wisdom of purchases you are making or considering, as well as traffic-prediction apps.
What big data for little people apps would you like to see?
This article at Slate is interesting for a number of reasons. First, it offers a link to a data set listing names and dates of the 325 people known to have been killed by guns since December 14, 2012. Slate is to be congratulated for providing data in a format that is easy for statistical software to read. (Still, some cleaning is required. For example, ages include a mix of numbers and categorical values.) Second, the data are the result of an informal data compilation by an unknown tweeter, although s/he is careful to give sources for each report. (And, as Slate points out, deaths are more likely to be unreported than falsely reported.) Data include names, dates, city and state, longitude/latitude, and age of victim. Third, data such as these become richer when paired with other data, and I think it would be a great classroom project to create a data-driven story in which students used additional data sources to provide deeper context for these data. An obvious choice is to extend the dataset back in time, possibly using official crime data (but I am probably hopelessly naive in thinking this is a simple task).
The L.A. Times ran an interesting article about the new Federal Trade Commission report, “Mobile Apps for Kids: Disclosures Still Not Making the Grade,” which followed up on a February 2012 report and concluded that “Yes, many apps included interactive features or shared kids’ information with third parties without disclosing these practices to parents.”
I think this issue is intriguing on many levels, but of central concern is the fact that as we go about our daily business (or play, as the case may be), we leave a data trail, sometimes unwittingly and quite often unknowingly. Perhaps we’ve reached the point where there’s no going back, and we must accept that when we engage with the mobile community, we engage in a data exchange. But it seems an easy thing to set standards so that, say, developers are required to produce logs of the data transaction. Third-party developers could then write apps that let parents examine this exchange (without, of course, sharing this information with yet another third party). It would be interesting and fun, I think, to create a visualization of the data flow in and out of your device across a period of time.
The report indicated that the most commonly shared datum was the device ID number. Sounds innocent, but, as we all know, it’s the crosstabs that kill. The device ID is a unique number associated with other variables, such as the device’s operating system, primary language, carrier, and other information. It can also be used to access personal data, such as the user’s name, address, and email address. While some apps share only the device ID, and thus may seem safe, other apps send more data to the same companies. Companies that receive data from multiple apps can therefore build up a more detailed picture of the user, and because these companies share data with each other, they can create even richer pictures.
There are some simple ways of introducing the topic into a stats course. The report essentially conducted a random sample of apps (well, let’s say it had some random-sampling components), and it reports estimated percentages, but never, of course, confidence intervals. So you can ask your students a question such as: “The FTC randomly sampled 400 apps marketed to ‘kids’ from the Google and iTunes app stores. 60% of the apps in the sample transmitted the device ID to a third party. Find a 95% confidence interval for the proportion of all apps….” Or: “Only 20% of the apps that transmitted private information to a third party disclosed this fact to the user. Find a 95% CI….”
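For the first question, the arithmetic is easy to check in R using the large-sample normal approximation (the numbers below are the ones in the prompt):

```r
# 95% CI for the proportion of sampled kids' apps transmitting the device ID
n     <- 400
p.hat <- 0.60
se    <- sqrt(p.hat * (1 - p.hat) / n)          # standard error of p-hat
ci    <- p.hat + c(-1, 1) * qnorm(0.975) * se   # normal-approximation interval
round(ci, 3)                                    # about (0.552, 0.648)
```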
The report contains some nice comparisons with the 2011 data concerning the types of “kids” apps available, as well as a discussion of the type of advertising that appears. (An amusing example that shouldn’t be amusing is an app targeted at kids that displays advertising for a dating web site: “1000+ singles!”) Reminds me of something my sister told me when her kids were young: she could always tell when they found something troubling on the computer because suddenly they would grow very quiet.
One of the themes of this blog is to make statistics relevant and exciting to students by helping them understand the data that’s right under their noses. Or inside their ears. The iTunes library is a great place to start.
For a while, iTunes made it easy to get your data onto your hard drive in a convenient, analysis-ready form. Then they made it hard. Then (in 10.7) they made it easy again. Now, in 11.0, it is once again ‘hard’. Prior to version 11.0, these instructions would do the trick: open up iTunes, control-click on the “Music” library, and choose Export. A tab-delimited text file with all iTunes library data appears.
Now, iTunes 11.0 provides only an XML library. This is a shame for us teachers, since the data is now one step further removed from student access. In particular, it’s a shame because the data structure is not terribly complex—a flat file should do the trick. (If you want the XML file, select File > Library > Export.)
But all is not lost: with one simple workaround, you can get your data. First, create a smart playlist that contains all of your songs. I did this by including in the list all songs added before today’s date. Now control-click on the name of the playlist and choose Export. Save the file wherever you wish, and you now have a tab-delimited file. (It does take a few minutes if your library is anywhere near the size of my own. Not bragging.)
So now we can finally get to the main point of this post. Which is to point out that almost all of the datasets I give my students, whether they are in intro stats or higher, have a small number of variables. And even if not, the questions almost all involve using only a small number of variables.
But if students are to become Citizen Statisticians, they must learn to think more like scientists. They must learn how to pose questions and create new measures. I wonder what most of our students would do, when confronted with the 27 or so variables iTunes gives them. Make histograms? Fine, a good place to start. But what real-world question are they answering with a histogram? Do students really care about the variability in length of their song tracks?
I suggest that one interesting question to ask students to explore is to see if their listening habits have changed over some period of time. Now I know younger students won’t have much time to look back across. But I think this is a meaningful question, and one that’s not answered in an obvious way. More precisely, to answer this question requires thinking about what it means to have a ‘listening habit’, and to question how such a habit might be captured in the given data.
I’m not sure what my students would think of. Or, frankly, what I would think of. At the very least, the answer to any such question will require wrestling with the date variables and so require some work with dates. Some basic questions might be to see how many songs I’ve added per year. This isn’t that easy in many software packages, because I have to loop over the year and count the number of entries. Another question that I might want to know: What proportion of songs remain unplayed each year? (In other words, am I wasting space storing music I don’t listen to?) Has the mix of genres changed over time, or are my tastes relatively unchanged?
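As a sketch of what the per-year questions might look like in R, here is a toy data frame standing in for a real iTunes export (the column names `Date.Added` and `Plays` are my guesses at what an export would contain; check `names()` on your own file):

```r
# Toy stand-in for an iTunes export; Date.Added and Plays are assumed column names
songs <- data.frame(
  Date.Added = as.Date(c("2010-03-01", "2010-07-15", "2011-01-02",
                         "2011-06-30", "2011-12-25")),
  Plays      = c(0, 12, 3, 0, 0)
)
songs$year <- format(songs$Date.Added, "%Y")      # extract the year added
table(songs$year)                                 # songs added per year
tapply(songs$Plays == 0, songs$year, mean)        # proportion never played, by year
```

No explicit loop over years is needed: `table()` and `tapply()` do the grouping and counting in one call each.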
Speaking of genres… unless you’ve been really careful about your genre field, you’re in for a mess. I thought I was careful. But here’s what I’ve got (as seen from Fathom):
If you want to see questions asked by some people, download the free software SuperAnalyzer (This links to the Mac version via CNET). Below is a graph that shows the growth of my library over time, for example. (Thanks to Anelise Sabbag for pointing this out to me during a visit to U of Minnesota last year, and to Elizabeth Fry and Laura Ziegler for their endorsements of the app.)
And the most common words in the titles:
So let me know what you want to do with your iTunes library. Or what your students have done. What was frustrating? Impossible? Easier than expected?
Participating in the “hangout” hosted by Jess Hemerly’s Policy By the Numbers blog was fun, but even better was learning about this cool blog. It’s very exciting to meet people from so many different backgrounds and from so many varied interests who share an interest in data accessibility. One feature of PBtN that I think many of our readers will find particularly useful is the weekly roundup of data in the news. Check it out!
In an effort to integrate more hands-on data analysis in my introductory statistics class, I’ve been assigning students a project early on in the class where they answer a research question of interest to them using a hypothesis test and/or confidence interval. One goal of this project is getting the students to decide which methods to use in which situations, and how to properly apply them. But there’s more to it — students define their own research question and find an appropriate dataset to answer that question with. The analysis and findings are then presented in a cohesive research paper.
Settling on a research question that can be answered using limited methods (one or two mean or proportion testing, ANOVA, or chi-square) is the first half of the battle. Some of the research questions students come up with require methods much more involved than simple hypothesis testing or parameter estimation. These students end up having to dial back and narrow down the focus of the research topic to meet the assignment guidelines. I think that this is a useful exercise as it helps them evaluate what they have and have not learned.
The next step is finding data, and this can be quite time consuming. Some students choose research questions about the student body and collect data via in-person surveys at the student center or Facebook polls. A few students even go so far as to conduct experiments on their friends. A huge majority look for data online, which initially appears to be the path of least resistance. However finding raw data that is suitable for statistical inference, i.e. data from a random sample, is not a trivial task.
I (purposefully) do not give much guidance on where to look for data. In the past, even casually mentioning one source has resulted in more than half the class using that source, so I find it best to give them free rein during this exploration stage (unless someone is really struggling).
Some students use data from national surveys like the BRFSS or the GSS. The data come from a (reasonably) representative sample, and are a perfect candidate for applying statistical inference methods. One problem with such data is that they rarely come in plain-text format (they are distributed as SAS, SPSS, etc. files), and importing them into R can be a challenge for novice R users, even with step-by-step instructions.
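For students working in R, the foreign package (which ships with R) is one route. A minimal sketch, assuming a locally downloaded SPSS file (the file name here is hypothetical):

```r
library(foreign)  # ships with R; reads SPSS, Stata, and other formats
# File name is hypothetical -- substitute the .sav file you downloaded
if (file.exists("brfss2011.sav")) {
  brfss <- read.spss("brfss2011.sav", to.data.frame = TRUE)
  str(brfss)  # check that variable names and value labels came through
}
```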
On the other hand, many students stumble upon resources like the World Bank database, the OECD, the US Census, etc., where data are presented in much more user-friendly formats. The drawback is that these are essentially population data, e.g. country indicators like the human development index for all countries, and there is really no need for hypothesis testing or parameter estimation when the parameter is already known. To complicate matters further, some of the tables presented are not really “raw data” but instead summary tables, e.g. median household income for all states calculated based on random sample data from each state.
One obvious way to avoid this problem is to make the assignment stricter by requiring that chosen data must come from a (reasonably) random sample. However, this stricter rule would give students much less freedom in the research question they can investigate, and the projects tend to be much more engaging and informative when students write about something they genuinely care about.
Limiting data sources also has the effect of increasing the time spent finding data, and hence decreasing the time students spend actually analyzing the data and writing up results. Providing a list of resources for curated datasets (e.g. DASL) would certainly diminish time spent looking for data, but I would argue that the learning that happens during the data exploration process is just as valuable (if not more so) than being able to conduct a hypothesis test.
Another approach (one that I have been taking) is allowing the use of population data but requiring a discussion of why it is actually not necessary to do statistical inference in these circumstances. This approach lets the students pursue their interests, but interpretations of p-values and confidence intervals calculated based on data from the entire population can get quite confusing. In addition, it has the side effect of sending the message “it’s ok if you don’t meet the conditions, just say so, and carry on.” I don’t think this is the message we want students to walk away with from an introductory statistics course. Instead, we should be insisting that they don’t just blindly carry on with the analysis if conditions aren’t met. The “blindly” part is (somewhat) addressed by the required discussion, but the “carry on with the analysis” part is still there.
So is this assignment a disservice to students because it might leave some with the wrong impression? Or is it still a valuable experience regardless of the caveats?
The L.A. Times today (Monday, November 19) ran an editorial about the benefits and costs of Big Data. I truly believe that statisticians should teach introductory students (and all students, really) about data privacy. But who feels they have a realistic handle on the nature of these threats and the size of the risk? I know I don’t. Does anyone teach this in their class? Let’s hear about it! In the meantime, you might enjoy reading (or re-reading) a classic on the topic by Latanya Sweeney: k-Anonymity: a model for protecting privacy.