This article at Slate is interesting for a number of reasons. First, it offers a link to a data set listing names and dates of the 325 people known to have been killed by guns since December 14, 2012. Slate is to be congratulated for providing data in a format that is easy for statistical software to read. (Still, some cleaning is required. For example, ages include a mix of numbers and categorical values.) Second, the data are the result of an informal data compilation by an unknown tweeter, although s/he is careful to give sources for each report. (And, as Slate points out, deaths are more likely to be unreported than falsely reported.) Data include names, dates, city and state, longitude/latitude, and age of victim. Third, data such as these become richer when paired with other data, and I think it would be a great classroom project to create a data-driven story in which students used additional data sources to provide deeper context for these data. An obvious choice for such data is to extend the dataset back in time, possibly using official crime data (but I am probably hopelessly naive in thinking this is a simple task).
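To give a feel for the kind of cleaning involved, here is a minimal Python sketch that separates numeric ages from categorical ones. The raw values and the function name `clean_age` are made up for illustration; I have not checked which labels the Slate file actually uses.

```python
def clean_age(raw):
    """Split a raw age field into (numeric_age, age_label).

    Strings made only of digits become ints; anything else is kept
    as a categorical label. Empty fields become (None, None).
    """
    text = (raw or "").strip()
    if text.isdigit():
        return int(text), None
    return None, text or None


# Hypothetical raw values, NOT copied from the Slate data set.
raw_ages = ["34", "Teen", " 7 ", "", "Adult"]
cleaned = [clean_age(a) for a in raw_ages]

for raw, (age, label) in zip(raw_ages, cleaned):
    print(repr(raw), "->", age, label)
```

Keeping the number and the label in separate columns means students can still summarize the numeric ages without throwing away the rows that only report a category.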
A new report released by CAUSE is well worth reading: Connecting Research to Practice in a Culture of Assessment for Introductory College-level Statistics, www.causeweb.org/research/guidelines/ResearchReport_Dec_2012.pdf
Read it. We’ll discuss later. Pop quiz.
I haven’t yet read it myself (in my eagerness to publicize it as quickly as possible), but of particular interest to this blog is the role that data science plays, or does not play. For instance, Question 1 under Research Priority 1 is “What core learning outcomes employed in a particular profession do individuals need to develop in order to perform well in that profession (e.g., the outcomes that are common and those that are unique to disciplines such as psychology, biology, and economics?)”
I recently had a discussion with someone in a data-heavy business, and was struck by how core statistical concepts were seen as just one of many necessary core skills, the rest of the skills requiring computing, psychology, and communication. It is fashionable in statistical circles to be somewhat dismissive of claims that computation takes precedence over statistics, but in at least this case, I think that this paints an unfair portrait. The data scientist in question held statistics in high esteem, and was well aware of the pitfalls of being lured by transitory patterns, as compelling as they might at first glance seem. His use of statistics came at the ‘high end’, employing very modern data smoothing techniques, multivariate models, and a need for sophisticated understanding of model evaluation. But he, like many in his field I suspect, came to statistics in a round-about way, after becoming successful in computer science and then studying statistics to close the gap. I doubt he considered himself a statistician, but instead one who frequently found statistical tools and concepts to be useful for getting things done.
We’re in a very exciting position, as educators, to dream about how to develop future data scientists who incorporate statistics with computation from the very beginning of their conception of statistics. But one thing to keep in mind is that part of the excitement of this new age of statistics is that many of the careers we’re preparing our students for don’t yet exist. It seems so many of the data challenges that are raised in a general realm of endeavor such as marketing, the arts, genetics, or law have solutions that don’t live purely in one field. And so when we ask ourselves about skills needed in particular professions, let’s do so with our eyes open to the fact that the profession that many of us have in mind, data science, doesn’t really yet exist.
Data science is, or will be, a specialist’s field. But this blog is devoted to considering the data science skills needed by all students. I think, therefore, that the issues raised by this report concerning the ‘core’ skills are very important. A data scientist may have a specialist’s collection of skills, in the aggregate, but many of these skills and understandings, in isolation, will need to be part of our core education. This report encourages us all to think seriously about precisely which skills and understandings those should be.
Just read a great paper by Anna Bargagliotti in the current Journal of Stats Education, “How well do the NSF Funded Elementary Mathematics Curricula align with the GAISE report recommendations?”. The answer: it depends. Anna compares three math curricula designed to meet the Common Core Standards for grades K-12: “Investigations in Number, Data, and Space”, “Math Trailblazers”, and “Everyday Mathematics.” Anna compared them to the Guidelines for Assessment and Instruction in Statistics Education K-12 report, which, to quote her paper, “defines a statistically literate person as one who is able to formulate questions, collect and analyze data, and interpret results.” I personally feel the “analyze data” component is the most important, since this is a skill all students should acquire, and a skill that requires a strong understanding of statistical concepts and methods.
The GAISE report identifies three developmental levels, labeled A, B and C. Since Anna is concerned with earlier grades, she considers only levels A and B. Level A is “below” level B in some sense, but the levels might overlap, and students might advance to level B on some topics while still studying at Level A on others. Levels aren’t associated with particular grades, but, roughly speaking, one might expect Level A to occupy most of a child’s K-6 years, and level B much of middle school and early high school. For example, in Level A, students investigate situations in which they are not expected to go beyond the sample at hand. In Level B, they begin to informally consider what the sample at hand has to say about a larger context. In Level C, they learn formal methods for inference.
Two of the curricula, “Investigations” and “Trailblazers”, according to Anna’s paper, move students from Level A to B and have strong data analysis components. The third, “Everyday”, favors probability, seems to ignore data analysis, and is so weighted towards computation that it was difficult to determine whether it was teaching at Level A or B. (Well, that’s my reading of Anna’s findings. There is room for more nuance there, but that’s one advantage of a blog over an academic paper: we can ignore nuance.)
Now here’s the depressing part: one of these curricula is used by 3 million students. If you guessed “Everyday Mathematics” go to the head of the class. Trailblazers is used by a healthy number, too: 500,000. But that’s only 1/6 the size of Everyday. So while the good news is that the Common Core provides students with the opportunity to learn some truly useful and needed statistics, the bad news is that most of them continue to be taught probability at the expense of data analysis.
We have just finished another semester, and before my mind completely turns to rubble, I want to share what I believe to be a fairly good assignment. What I present below was originally split across two separate assignments that I gave this semester, but upon reflection I think it would work better as one.
Read the article Let’s Practice What We Preach: Turning Tables into Graphs (full reference given below). In this article, Gelman, Pasarica, & Dodhia suggest that presentations of results using graphs are more effective than results presented in tables.
Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach: Turning tables into graphs. The American Statistician, 56(2), 121–130.
Find an article in a journal that presents results (or data) in a table. Re-create the data in a tabular format using R (or Excel).
- Use the functions in ggplot2 to produce a plot that conveys the same message as the original table.
- Include the original table (this can be a screenshot or web-link) and citation, along with your plot.
- Write a few sentences describing why the plot you produced provides a better presentation of the results or data (be sure to use recommendations from the article in making your case).
In the second part of this assignment, you will write a tutorial for the process you followed for turning a table into a plot using R Markdown and will publish that tutorial on RPubs.
There are several resources for learning R Markdown.
- RStudio’s [documentation] for writing a document with R Markdown
- Yihui’s [screencast] introducing R Markdown
- An [example/tutorial] from PSU
Your tutorial should be written so that a student who was just learning ggplot could follow your directions easily. Include instructions for obtaining the data, getting it into a useable tabular format, manipulating the data so it can be used with ggplot, and well-commented instructions for creating your final plot. (Think of the level of detail you would want in a tutorial when you were first learning ggplot!)
It should also include:
- a citation or link to the website/journal that published the original table
- a view of your final data (full or a subset depending on size)
- all commands necessary to create your final plot (with appropriate explanation), and
- the final plot
When you knit the .Rmd document it should compile without errors.
Students commented that they learned a lot about the use of ggplot during the initial assignment (this was the second assignment in the course). The Markdown part of the assignment I gave as an extra credit assignment at the end of the class, but in retrospect, I should have made it required and done it very early on.
Here are a couple of the tutorials that I have received so far:
- These students took a table of characteristics of survey participants published in the Journal of Ethnic and Cultural Diversity in Social Work and turned it into a bar graph. http://rpubs.com/TSK_2012/3184
- These students took data about trends and topics discussed in Seventeen Magazine’s Traumarama articles from 1994-2007 and turned it into a line plot. http://rpubs.com/opalc123/3155
- These students took a table of data related to approval ratings and turned them into a box-and-whiskers plot. http://www.rpubs.com/GeorgeBrisse/3217
- These students’ work is a great example of how data initially presented in a table is much easier to process in a graph. The data, from a table published in the Journal of Deaf Studies and Deaf Education, show the academic status and progress of deaf and hard-of-hearing students in general education classrooms. http://rpubs.com/mens0055/3211
- These students used a stacked bar chart to show data about the sample sizes for different stages for 12 problem behaviors published in Health Psychology. http://rpubs.com/nikedenise/3256
- These students created a line graph representing pre- and post-training scores for consonant, vowel, sentence, and gender perception scores in cochlear implant users to examine whether an auditory training program improves performance. http://rpubs.com/koern030/3255
The L.A. Times ran an interesting article about the new Federal Trade Commission report, “Mobile Apps for Kids: Disclosures Still Not Making the Grade”. The report followed up on a February 2012 report, and concluded that “Yes, many apps included interactive features or shared kids’ information with third parties without disclosing these practices to parents.”
I think this issue is intriguing on many levels, but of central concern is the fact that as we go about our daily business (or play, as the case may be), we leave a data trail, sometimes unwittingly. Quite often unknowingly. Perhaps we’ve reached the point where there’s no going back, and we must accept the fact that when we engage with the mobile community, we engage in a data-exchange. But it seems an easy thing that standards should be set so that, maybe, developers are required to produce logs of the data transaction. And third-party developers could write apps that let parents examine this exchange. (Without, of course, sharing this information with a third party.) It would be interesting and fun, I would think, to create a visualization of the data flow in and out of your device across a period of time.
The report indicated that the most commonly shared datum was the device ID number. Sounds innocent, but, as we all know, it’s the crosstabs that kill. The device ID is a unique number, and is associated with other variables, such as the device operating system, primary language, the carrier, and other information. It can also be used to access personal data, such as the name, address, and email address of the user. While some apps share only the device ID, and thus may seem safe, other apps send more data to the same companies. And so companies that receive data from multiple apps can build up a more detailed picture of the user. And these companies share data with each other, and so can create even richer pictures.
There are some simple ways of introducing the topic into a stats course. The report essentially conducted a random sample (well, let’s say it had some random sampling components) of apps. And it reports estimated percentages. But never, of course, confidence intervals. And so you can ask your students a question such as “The FTC randomly sampled 400 apps that were marketed to “kids” from the Google and iTunes app stores. 60% of the apps in the sample transmitted the Device ID to a third-party. Find a 95% confidence interval for the proportion of all apps….” Or, “only 20% of the apps that transmitted private information to a third-party disclosed this fact to the user. Find a 95% CI….”
The report contains some nice comparisons with the 2011 data concerning the types of “kids” apps available, as well as a discussion of the type of advertising that appears. (An amusing example, which shouldn’t be amusing, is an app targeted at kids that displays advertising for a dating web site: “1000+ singles!” It reminds me of something my sister told me when her kids were young: she could always tell when they found something troubling on the computer because suddenly they would grow very quiet.)
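For anyone who wants to check student answers to the first question, here is a minimal Python sketch of the usual large-sample (Wald) interval for a proportion, using the 400 apps and 60% from the question. The function name `proportion_ci` is my own, and the choice of the Wald interval is an assumption; other intervals (Wilson, for instance) would give slightly different endpoints.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Large-sample (Wald) confidence interval for a proportion."""
    # Critical value from the standard normal, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Numbers taken from the classroom question: 400 apps, 60% sent the Device ID.
low, high = proportion_ci(0.60, 400)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

With these inputs the interval runs from roughly 0.552 to 0.648. The same function handles the second question once students have decided what the sample size is for the apps that transmitted private information.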
One of the themes of this blog is to make statistics relevant and exciting to students by helping them understand the data that’s right under their noses. Or inside their ears. The iTunes library is a great place to start.
For a while, iTunes made it easy to get your data onto your hard drive in a convenient, analysis-ready form. Then they made it hard. Then (10.7) they made it easy again. Now, in 11.0, it is once again ‘hard’. Prior to version 11.0, these instructions would do the trick: Open up iTunes, Control-click on the “Music” library and choose Export. And a tab-delimited text file with all iTunes library data appears.
Now, iTunes 11.0 provides only an xml library. This is a shame for us teachers, since the data is now one step further removed from student access. In particular, it’s a shame because the data structure is not terribly complex; a flat file should do the trick. (If you want the xml file, select File>Library>Export.)
But all is not lost. With one simple work-around, you can get your data. First, create a smart playlist that has all of your songs. I did this by including in the list all songs added before today’s date. Now control-click on the name of the playlist, and choose Export. Save the file wherever you wish, and you now have a tab-delimited file. (It does take a few minutes, if your library is anything near the size of my own. Not bragging.)
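Once the file is exported, reading it takes only a few lines in most environments. Here is a minimal Python sketch using the standard `csv` module. The column names in the sample text and the helper name `read_tab_delimited` are assumptions for illustration, so check them against the header row of your own export.

```python
import csv
import io


def read_tab_delimited(stream):
    """Return a list of dicts, one per track, from a tab-delimited stream."""
    return list(csv.DictReader(stream, delimiter="\t"))


# A tiny stand-in for an exported playlist; column names are assumed.
sample = (
    "Name\tArtist\tGenre\tPlays\n"
    "Song A\tArtist X\tJazz\t12\n"
    "Song B\tArtist Y\tRock\t\n"
)
tracks = read_tab_delimited(io.StringIO(sample))
print(len(tracks), "tracks; columns:", list(tracks[0]))
```

For a real export, pass an open file instead of the `StringIO` object. The file's text encoding is something to check: if the default fails, try opening it with a different `encoding` argument.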
So now we can finally get to the main point of this post. Which is to point out that almost all of the datasets I give my students, whether they are in intro stats or higher, have a small number of variables. And even if not, the questions almost all involve using only a small number of variables.
But if students are to become Citizen Statisticians, they must learn to think more like scientists. They must learn how to pose questions and create new measures. I wonder what most of our students would do, when confronted with the 27 or so variables iTunes gives them. Make histograms? Fine, a good place to start. But what real-world question are they answering with a histogram? Do students really care about the variability in length of their song tracks?
I suggest that one interesting question to ask students to explore is to see if their listening habits have changed over some period of time. Now I know younger students won’t have much time to look back across. But I think this is a meaningful question, and one that’s not answered in an obvious way. More precisely, to answer this question requires thinking about what it means to have a ‘listening habit’, and to question how such a habit might be captured in the given data.
I’m not sure what my students would think of. Or, frankly, what I would think of. At the very least, the answer to any such question will require wrestling with the date variables and so require some work with dates. Some basic questions might be to see how many songs I’ve added per year. This isn’t that easy in many software packages, because I have to loop over the year and count the number of entries. Another question that I might want to know: What proportion of songs remain unplayed each year? (In other words, am I wasting space storing music I don’t listen to?) Has the mix of genres changed over time, or are my tastes relatively unchanged?
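As a rough illustration of two of these questions, here is a Python sketch that counts songs added per year and, for each year, the share of those songs that have never been played. The records, the column names ("Date Added", "Plays"), and the date format are all made up for the example. I am also reading "unplayed each year" as "unplayed among the songs added that year", which is only one possible interpretation.

```python
from collections import Counter
from datetime import datetime

# Made-up records standing in for rows of an exported library.
songs = [
    {"Name": "Song A", "Date Added": "2010-03-01", "Plays": "12"},
    {"Name": "Song B", "Date Added": "2010-07-15", "Plays": "0"},
    {"Name": "Song C", "Date Added": "2011-01-20", "Plays": ""},
    {"Name": "Song D", "Date Added": "2011-05-02", "Plays": "3"},
    {"Name": "Song E", "Date Added": "2011-11-30", "Plays": "0"},
    {"Name": "Song F", "Date Added": "2012-02-14", "Plays": "5"},
]


def year_added(song, fmt="%Y-%m-%d"):
    """Extract the year from the (assumed) 'Date Added' field."""
    return datetime.strptime(song["Date Added"], fmt).year


# Songs added per year: no explicit loop over years is needed.
added_per_year = Counter(year_added(s) for s in songs)

# Treat a blank play count as zero plays.
unplayed_per_year = Counter(
    year_added(s) for s in songs if int(s["Plays"] or 0) == 0
)

prop_unplayed = {
    year: unplayed_per_year[year] / n for year, n in added_per_year.items()
}

for year in sorted(added_per_year):
    print(year, added_per_year[year], round(prop_unplayed[year], 2))
```

Even in this toy version, students have to make decisions (what counts as unplayed, which date matters), which is the point of the exercise.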
Speaking of genres…unless you’ve been really careful about your genre-field, you’re in for a mess. I thought I was careful, but here’s what I’ve got (as seen from Fathom):
If you want to see questions asked by some people, download the free software SuperAnalyzer (This links to the Mac version via CNET). Below is a graph that shows the growth of my library over time, for example. (Thanks to Anelise Sabbag for pointing this out to me during a visit to U of Minnesota last year, and to Elizabeth Fry and Laura Ziegler for their endorsements of the app.)
And the most common words in the titles:
So let me know what you want to do with your iTunes library. Or what your students have done. What was frustrating? Impossible? Easier than expected?