Conditional probabilities and kitties

I was at the vet yesterday, and just like with any doctor’s visit experience, there was a bit of waiting around — time for re-reading all the posters in the room.


And this is what caught my eye on the information sheet about feline heartworm (I’ll spare you the images):


The question asks: “My cat is indoor only. Is it still at risk?”

The way I read it, this question is asking about the risk of an indoor-only cat being heartworm positive. To answer this question we would want to know P(heartworm positive | indoor only).

However, the answer says: “A recent study found that 27% of heartworm positive cats were identified as exclusively indoor by their owners”, which is P(indoor only | heartworm positive) = 0.27.

Sure, this gives us some information, but it doesn’t actually answer the original question. The original question is asking about the reverse of this conditional probability.
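
To make the reversal concrete, here is a minimal sketch in R of what Bayes’ theorem would need in order to answer the poster’s question. Only the 0.27 comes from the poster; the overall prevalence (p_hw) and the share of cats that are indoor only (p_indoor) are made-up values, chosen purely for illustration.

```r
# P(indoor only | heartworm positive), the figure quoted on the poster
p_indoor_given_hw <- 0.27

# HYPOTHETICAL inputs, not taken from the poster or from any study
p_hw     <- 0.05  # assumed overall prevalence of heartworm among cats
p_indoor <- 0.30  # assumed share of all cats that are indoor only

# Bayes' theorem: P(HW | indoor) = P(indoor | HW) * P(HW) / P(indoor)
p_hw_given_indoor <- p_indoor_given_hw * p_hw / p_indoor
p_hw_given_indoor
```

With these made-up inputs the result is 0.045. Change p_hw or p_indoor and the answer moves, which is why 0.27 alone cannot answer the original question.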

When we talk about Bayes’ theorem in my class and work through examples about sensitivity and specificity of medical tests, I always tell my students that doctors are actually pretty bad at these. Looks like I’ll need to add vets to my list too!

Thinking with technology

Just finished a stimulating, thought-provoking week at SRTL (the Statistical Reasoning, Thinking and Literacy conference), this year held in Two Harbors, Minnesota, right on Lake Superior. SRTL gathers statistics education researchers, most of whom come with cognitive or educational psychology credentials, every two years. It’s more of a forum for thinking and collaborating than it is a platform for presenting findings, and this means there’s much lively, constructive discussion about works in progress.

I had meant to post my thoughts daily, but (a) the internet connection was unreliable and (b) there was just too much to digest. One recurring theme that really resonated with me was the ways students interact with technology when thinking about statistics.
Much of the discussion centered on young learners, and most of the researchers, but not all, were in classrooms in which the students used TinkerPlots 2. TinkerPlots is a dynamic software system that lets kids build their own chance models. (It also lets them build their own graphics more-or-less from scratch.) They do this by either dropping “balls” into “urns” and labeling the balls with characteristics, or through spinners which allow them to shade different areas different colors. They can connect series of spinners and urns in order to create sequences of independent or dependent events, and can collect outcomes of their trials. Most importantly, they can carry out a large number of trials very quickly and graph the results.
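
For readers who have not seen this kind of chance model, here is a rough R analogue of a spinner connected to two urns. It is only a sketch of the idea, not how TinkerPlots works internally, and the spinner areas and urn contents are invented for the example.

```r
set.seed(1)  # reproducibility of this illustration only

n_trials <- 10000

# Stage 1: a spinner shaded 25% "red" and 75% "blue" (hypothetical areas)
spinner <- sample(c("red", "blue"), n_trials, replace = TRUE,
                  prob = c(0.25, 0.75))

# Stage 2: the urn we draw from depends on the spinner (a dependent event)
urn_red  <- c("win", "win", "lose")           # urn used after "red"
urn_blue <- c("win", "lose", "lose", "lose")  # urn used after "blue"
draw <- ifelse(spinner == "red",
               sample(urn_red,  n_trials, replace = TRUE),
               sample(urn_blue, n_trials, replace = TRUE))

# Collect the outcomes of all trials and graph them
results  <- table(spinner, draw)
prop_win <- mean(draw == "win")
barplot(prop.table(table(draw)), ylab = "Proportion of trials")
```

Under these assumed settings the long-run proportion of wins should be near 0.25 × 2/3 + 0.75 × 1/4, about 0.35, so a student whose intuition said otherwise has something specific to go back and check.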

What I found fascinating was the way in which students would come to judgements about situations, and then build a model that they thought would “prove” their point. After running some trials, when things didn’t go as expected, they would go back and assess their model. Sometimes they’d realize that they had made a mistake, and they’d fix it. Other times, they’d see there was no mistake, and then realize that they had been thinking about it wrong. Sometimes, they’d come up with explanations for why they had been thinking about it incorrectly.

Janet Ainley put it very succinctly. (More succinctly and precisely than my re-telling.)  This technology imposes a sort of discipline on students’ thinking. Using the technology is easy enough that they can be creative, but the technology is rigid enough that their mistakes are made apparent.  This means that mistakes are cheap, and attempts to repair mistakes are easily made.  And so the technology itself becomes a form of communication that forces students into a level of greater precision than they can put in words.

I suppose that mathematics plays the same role in that speaking with mathematics imposes great precision on the speaker.  But that language takes time to learn, and few students reach a level of proficiency that allows them to use the language to construct new ideas.  But TinkerPlots, and software like it, gives students the ability to use a language to express new ideas with very little expertise.  It was impressive to see 15-year-olds build models that incorporated both deterministic trends and fairly sophisticated random variability.  More impressive still, the students were able to use these models to solve problems.  In fact, I’m not sure they really knew they were building models at all, since their focus was on the problem solving.

TinkerPlots is aimed at a younger audience than the one I teach.  But for me, the take-home message is to remember that statistical software isn’t simply a tool for calculation, but a tool for thinking.

DataFest 2013

DataFest is growing larger and larger.  This year, we hosted an event at Duke (Mine organized this) with teams from NCSU and UNC, and at UCLA (Rob organized) with teams from Pomona College, Cal State Long Beach, University of Southern California, and UC Riverside.  We are very grateful to Vaclav Petricek at eHarmony for providing us with the data, which consisted of roughly one million “user-candidate” pairs, and a couple of hundred variables including “words friends would use to describe you”, ideal characteristics in a partner, the importance of those characteristics, and the all-important ‘did she email him’ and ‘did he email her’ variables.

The students had a great time, and worked hard for 48 hours to prepare short presentations for the judges.  This is the third year we’ve done this, and I’m increasingly impressed with the growing technical skills of the students.  (Which makes our life a lot easier, as far as providing help goes.)  Or maybe it’s just that I’ve been lucky enough to get more and more “VIP Consultants” (statisticians from off-campus) and talented and dedicated grad students to help out, so that I can be comfortably oblivious to the technical struggles.  Or all of the above.

One thing I noticed that will definitely require some adjustment to our curriculum:  Our students had a hard time generating interesting questions from these data.  Part of the challenge is to look at a large, rich dataset and think “What can I show the world that the world would like to know?”  Too many students went directly to model-fitting, without making visuals or engaging in the content of the materials (a surprise, since we thought they would find this material much more easily-engageable than last year’s micro-lending transaction data), or strategizing around some Big Questions.  They managed to pull it off in the end, most of them, but would have done better to brainstorm some good questions to follow, and would have done much better to start with the visuals.

One of the fun parts of DataFest is the presentations.  Students have only 5 minutes and 2 slides to convince the judges of their worthiness.  At UCLA, because we were concerned about having too many teams for the judges to endure, we had two rounds.  First, a “speed dating” round in which participants had only 60 seconds and one slide.  We surprised them by announcing, at the start, that to move on to the next round, they would have to merge their team with one other team, and so these 60-second presentations should be viewed as pitches to potential partners.  We had hoped that teams would match on similar themes or something, and this did happen; but many matches were between teams of friends.  The “super teams” were then allowed to make a 5-minute presentation, and awards were given to these large teams. The judges gave two awards for Best Insight (one to a super-team from Pomona College and another to a super-team from UCLA) and a Best Visualization (to the super-team from USC).  We did have two inter-collegiate super-teams (UCLA/Cal State Long Beach and UCLA/UCR) make it to the final round.

If you want to host your own DataFest, drop a line to Mine or me and we can give you lots of advice.  And if you sit on a large, interesting data set we can use for next year, definitely drop us a line!

Datasets handpicked by students

I’m often on the hunt for datasets that will not only work well with the material we’re covering in class, but will (hopefully) pique students’ interest. One sure choice is to use data collected from the students, as it is easy to engage them with data about themselves. However, I think it is also important to open their eyes to the vast amount of data collected and made available to the public. It’s always a guessing game whether a particular dataset will actually be interesting to students, so learning from the datasets they choose to work with seems like a good idea.

Below are a few datasets that I haven’t seen in previous project assignments. I’ve included the research question the students chose to pursue, but most of these datasets have multiple variables, so you might come up with different questions.

1. Religious service attendance and moral beliefs about contraceptive use: The data are from a February 2012 Pew Research poll. To download the dataset, you will be prompted to fill out some information, and you will then receive a zipped folder including the questionnaire, methodology, the “topline” (distributions of some of the responses), as well as the raw data in SPSS format (.sav file). Below I’ve provided some code to load this dataset in R, and then to clean it up a bit. Most of the code should apply to any dataset released by Pew Research.

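# load packages: foreign provides read.spss(), stringr provides str_replace()
library(foreign)
library(stringr)

# read data
d_raw = as.data.frame(read.spss("Feb12 political public.sav"))

# clean up: strip interviewer notes from the responses
# (after lapply(), d is a list of character vectors, one per variable)
d = lapply(d_raw, function(x) str_replace(x, " \\[OR\\]", ""))
d = lapply(d, function(x) str_replace(x, "\\[VOL. DO NOT READ\\] ", ""))
d = lapply(d, function(x) str_replace(x, "\222", "'"))
d = lapply(d, function(x) str_replace(x, " \\(VOL.\\)", ""))

# relabel party identification
d$partysum = factor(d$partysum)
levels(d$partysum) = c("Refused","Democrat","Independent","Republican","No preference","Other party")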

The student who found this dataset was interested in examining the relationship between religious service attendance and views on contraceptive use. The code provided below can be used to organize the levels of these variables in a meaningful way, and to take a quick peek at a contingency table.

# variables of interest
d$attend = factor(d$attend, levels = c("More than once a week","Once a week", "Once or twice a month", "A few times a year", "Seldom", "Never", "Don't know/Refused"))
d$q40a = factor(d$q40a, levels = c("Morally acceptable","Morally wrong", "Not a moral issue", "Depends on situation", "Don't know/Refused"))
table(d$attend, d$q40a)

2. Social network use and reading: Another student was interested in the relationship between number of books read in the last year and social network use. This dataset is provided by the Pew Internet and American Life Project. You can download a .csv version of the data file at–Search-Social-Networking-Sites-and-Politics.aspx. The questionnaire can also be found at this website. One of the variables of interest, number of books read in the past 12 months (q2), is  recorded using the following scheme:

  • 0: none
  • 1-96: exact number
  • 97: 97 or more
  • 98: don’t know
  • 99: refused

This could be used to motivate a discussion about the importance of doing exploratory data analysis prior to jumping into running inferential tests (like asking “Why are there no people who read more than 99 books?”) and also pointing out the importance of checking the codebook.
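
As a small illustration of why the codebook matters here, the sketch below recodes the special values of q2 following the scheme listed above. The responses are invented stand-ins, not values from the Pew file, and the variable names books and top_coded are just for this example.

```r
# HYPOTHETICAL responses standing in for the q2 column of the Pew file
q2 <- c(0, 3, 12, 97, 98, 5, 99, 40)

# A naive summary treats the codes 97, 98 and 99 as actual book counts
naive_mean <- mean(q2)

# Recode per the codebook: 98 (don't know) and 99 (refused) are missing
books <- ifelse(q2 %in% c(98, 99), NA, q2)

# 97 means "97 or more", so flag it as top-coded rather than exact
top_coded <- !is.na(books) & books == 97

clean_mean <- mean(books, na.rm = TRUE)
c(naive = naive_mean, clean = clean_mean)
table(top_coded)
```

Even in this toy example the naive mean is noticeably larger than the cleaned one, because two non-answers were counted as 98 and 99 books.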

3. Parental involvement and disciplinary actions at schools: The 2007-2008 School Survey on Crime and Safety, conducted by the National Center for Education Statistics, contains school level data on crime and safety. The dataset can be downloaded from NCES. The SPSS-formatted version of the data file (.sav) can be loaded in R using the read.spss() function in the foreign package (used above in the first data example). The variables of interest for the particular research question the student proposed are parent involvement in school programs (C0204) and number of disciplinary actions (DISTOT08), but the dataset can be used to explore other interesting characteristics of schools, like type of security guards, whether guards are armed with firearms, etc.

4. Dieting in school-aged children: The Health Behavior in School-Aged Children is an international survey on health-risk behaviors of children in grades 6 through 10. The 2005-2006 US dataset is available online. You will need to log in to download the dataset, but you can do so using a Google or a Facebook account. There are multiple versions of the dataset posted, and the Delimited version (.tsv) can be easily loaded in R using the read.delim() function. The student who found this dataset was interested in exploring the relationship between race of the student (Q6_COMP) and whether or not the student is on a diet to lose weight (Q30). The survey also asks questions on body image, substance use, bullying, etc. that may be interesting to explore.

One common feature of the above datasets is that they are all observational/survey based as it’s more challenging to find experimental (raw) datasets online. Any suggestions?

Participatory sensing

The Mobilize project, which I recently joined, centers a high school data-science curriculum around participatory sensing data.  What is participatory sensing, you ask?

I’ve recently been trying to answer this question, with mixed success.  As the name suggests, PS data has to do with data collected from sensors, and so it has a streaming aspect to it.  I like to think of it as observations on a living object.  Like all living objects, whatever this thing is that’s being observed, it changes, sometimes slowly, sometimes rapidly. The ‘participatory’ means that it takes more than one person to measure it. (But I’m wondering if you would allow ‘participatory’ to mean that the student participates in her own measurements/collection?) Initially, in Mobilize, PS meant using specially equipped smart-phones as sensors.  Students could snap pictures of snack-wrappers, or record their mood at a given moment, or record the mood of their snack food.  A problem with relying on phones is that, as it turns out, teenagers aren’t always that good with expensive equipment.  And there’s an equity issue, because what some people consider a common household item, others consider rare and precious.  And smart-phones, although growing in prevalence, are still not universally adopted by high school students, or even college students.

If we ditch the gadgetry, any human being can serve as a sensor.  Asking a student to pause at a certain time of day to record, say, the noise level, or the temperature, or their frame of mind, or their level of hunger, is asking that student to be a sensor.  If we can teach the student how to find something in the accumulated data about her life that she didn’t know, and something that she finds useful, then she’s more likely to develop what I heard Bill Finzer call a “data habit of mind”.  She’ll turn to data next time she has a question or problem, too.

Nothing in this process is trivial.  Recording data on paper is one thing, but recording it in a data file requires teaching students about flat-files (which, again something I’ve learned from Bill, are not necessarily intuitive), and teaching students about delimiters between variables, and teaching them, basically, how to share so that someone else can upload and use their data.  Many of my intro-stats college students don’t know how to upload a data file into the computer, so I now teach it explicitly, with high, but not perfect, rates of success.  And that’s the easy part.  How do we help them learn something of value about themselves or their world?

I’m open to suggestions here. Please.  One step seems to be to point them towards a larger context in which to make sense of their data.  This larger context could be a social network, or a community, or larger datasets collected on large populations.  And so students might need to learn how to compare their (rather paltry by comparison) data stream to a large national database (which will be more of a snapshot/panel approach, rather than a data-stream).  Or they will need to learn to merge their data with their classmates’, and learn about looking for signals among variation, and comparing groups.

This is scary stuff.  Traditionally, we teach students how to make sense of *our* data.  And this is less scary because we’ve already made sense of the data and we know how to point the student towards making the “right” conclusions.  But these PS data have not before been analyzed.  Even if we, the teachers, may have seen similar data, we have not seen these data.  The student is really and truly functioning as a researcher, and the teacher doesn’t know the conclusion.  What’s more disorienting, the teacher doesn’t have control of the method.  Traditionally, when we talk about ‘shape’ of a distribution, we trot out data sets that show the shapes we want the students to see.  But if the students are gathering their own data, is the shape of a distribution necessarily useful? (It gets scarier at a meta-level: many teachers are novice statisticians, and so how do we teach the teachers to be prepared to react to novel data?)

So I’ll sign off with some questions.  Suppose my classroom collects data on how many hours they sleep a night for, say, one month. We create a data file to include each student’s data.  Students do not know any statistics; this is their first data experience.  What is the first thing we should show them?  A distribution? Of what? What concepts do students bring to the table that will help them make sense of longitudinal data?  If we don’t start with distributions, should we start with an average curve? With an overlay of multiple time-series plots (“spaghetti plots”)?  And what’s the lesson, or should be the lesson, in examining such plots?
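
To make the last two options concrete, here is a sketch in R of a spaghetti plot with an average curve drawn on top. The sleep data are simulated (10 imaginary students, 30 days, an assumed mean of 7 hours), so the picture says nothing about real students; it only shows what the two displays would look like side by side.

```r
set.seed(42)  # simulated data only; no real student measurements

n_students <- 10
n_days     <- 30

# Rows are days, columns are students; values are hours of sleep
sleep <- matrix(rnorm(n_days * n_students, mean = 7, sd = 1.2),
                nrow = n_days, ncol = n_students)

# Spaghetti plot: one thin grey line per student
matplot(1:n_days, sleep, type = "l", lty = 1, col = "grey60",
        xlab = "Day", ylab = "Hours of sleep")

# Average curve: the class mean for each day, drawn on top in black
avg <- rowMeans(sleep)
lines(1:n_days, avg, lwd = 3)
```

Seeing both at once at least raises the question of what the average hides: the thick line is far smoother than any individual student’s line.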

Data Diary Assignment

My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened.  I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’  (We had talked a bit in class about what that meant, and about what devices were storing data.)  They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.

The results were interesting. The vast majority “got” it.  The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”

But those were very few (maybe 2 or 3).  The rest were quite thoughtful.  The sort of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (Facebook postings, Google searches), and classroom activities (clickers, enrollments).  Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted. They pointed out that with just one day’s data, someone could have a pretty clear idea of their social structure.  And by pooling the class’s data or the campus’s data, a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future.  They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.

Here’s my question for you:  what’s the next step?  Where do we go from here to build on this lesson?  And to what purpose?

Miscellany that I have Read and been Thinking about this Last Week

I read a piece last night called 5 Ways Big Data Will Change Lives In 2013. I really wasn’t expecting much from it, just scrolling through accumulated articles on Zite. However, as with so many things, there were some gems to be had. I learned of Aadhar.

Aadhar is an ambitious government Big Data project aimed at becoming the world’s largest biometric database by 2014, with a goal of capturing about 600 million Indian identities…[which] could help India’s government and businesses deliver more efficient public services and facilitate direct cash transfers to some of the world’s poorest people — while saving billions of dollars each year.

The part that made me sit up and take notice was this line, “India’s Aadhar collects sensitive information, such as fingerprints and retinal scans. Yet people volunteer because the potential incentives can make the data privacy and security pitfalls look miniscule — especially if you’re impoverished.”

I have been reading and hearing about concerns of data privacy for quite a while, yet nobody that I have been reading (or listening to) has once suggested what the circumstances are that would have citizens forgo all sense of privacy. Poverty, especially extreme poverty, is one of those circumstances. As a humanist, I am all for facilitating resources in the most efficient ways possible, which inevitably involve technology. But, as a Citizen Statistician, I am all too aware of how a huge database of biometric data could be used (or mis-used as it were). It especially concerns me that our impoverished citizens, who are more likely to be in the database, will be more at risk for being taken advantage of.

A second headline that caught my eye was France Looks At Possibility Of Taxing Internet Companies For Data Mining. France is pointing out that companies such as Google and Facebook are making enormous sums of money by mining and using citizens’ personal information, so why shouldn’t that be seen as a taxable asset? While this is a reasonable question, the article also points out that one potential consequence of such taxation is that the “free” model (at least monetarily) that these companies currently use might cease to exist.

Related to both of these articles, I also read a blog post about a seminar being offered in the Computer Science department at the University of Utah entitled Accountability in Data Mining. The professor of the course wrote in the post,

I’m a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be “in scope”. I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it’s a 1-credit seminar that meets once a week, I obviously can’t cover everything I’d like, but I’d like to flesh out the readings with related work that people can peruse later.

It is about time some university offered such a course. I think this will be ultimately useful (and probably should be required) content to include in every statistics course taught. In making decisions using data, who is accountable for those decisions, and the consequences thereof?

Lastly, I would be remiss to not include a link to what might be the article I resonated with most: It’s not 1989. The author points out that the excuse “I’m not good with computers” is not acceptable any longer, especially for educators. He makes a case for a minimum level of technological competency that teachers should have in this day and age. I especially agree with the last point,

Every teachers must have a willingness to continue to learn! Technology is ever evolving, and excellent teachers must be life-long learners. (Particularly in the realm of technology!)

The lack of ability with computers that I see on a day-to-day basis in several students and faculty (even the base-level literacy that the author wants) is frightening and saddening at the same time. I would love to see colleges and universities give all incoming students a computer literacy test at the same time as they take their math placement test. If you can’t copy-and-paste you should be sent to a remedial course to acquire the skills you need before taking any courses at the institution.

Gun deaths and data

This article at Slate is interesting for a number of reasons.  First, it offers a link to a data set listing names and data of the 325 people known to have been killed by guns since December 14, 2012.  Slate is to be congratulated for providing data in a format that is easy for statistical software to read.  (Still, some cleaning is required.  For example, ages include a mix of numbers and categorical values.) Second, the data are the result of an informal data compilation by an unknown tweeter, although s/he is careful to give sources for each report.  (And, as Slate points out, deaths are more likely to be unreported than falsely reported.)  Data include names, dates, city and state, longitude/latitude, and age of victim.  Third, data such as these become richer when paired with other data, and I think it would be a great classroom project to create a data-driven story in which students used additional data sources to provide deeper context for these data.  An obvious choice for such data is to extend the dataset back in time, possibly using official crime data (but I am probably hopelessly naive in thinking this is a simple task).
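
As an example of the kind of cleaning the age column might need, here is a minimal sketch in R. The values below are invented to imitate a column that mixes numbers and labels; they are not taken from the Slate file, and the labels “Teen” and “Adult” are assumptions about what such categories might look like.

```r
# HYPOTHETICAL values imitating an age column that mixes numbers and labels
age_raw <- c("34", "17", "Teen", "52", "Adult", "", "8")

# Keep the numeric entries; anything else becomes NA
# (the coercion warning is expected here, so it is suppressed on purpose)
age_num <- suppressWarnings(as.numeric(age_raw))

# Keep the non-numeric labels in their own column instead of discarding them
age_label <- ifelse(is.na(age_num) & age_raw != "", age_raw, NA)

data.frame(age_raw, age_num, age_label)
```

Splitting the column this way keeps the exact ages usable for numeric summaries while preserving whatever the categorical entries can still tell us.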


Accessing your 11.0 iTunes library

One of the themes of this blog is to make statistics relevant and exciting to students by helping them understand the data that’s right under their noses.   Or inside their ears.  The iTunes library is a great place to start.

For a while, iTunes made it easy to get your data onto your hard drive in a convenient, analysis-ready form. Then they made it hard.  Then (10.7) they made it easy again. Now, in 11.0, it is once again ‘hard’.  Prior to version 11.0, these instructions would do the trick: open up iTunes, Control-click on the “Music” library, and choose Export.  A tab-delimited text file with all iTunes library data would then appear.

Now, iTunes 11.0 provides only an XML library.  This is a shame for us teachers, since the data is now one step further removed from student access. In particular, it’s a shame because the data structure is not terribly complex; a flat file should do the trick. (If you want the XML file, select File>Library>Export.)

But all is not lost. With one simple work-around, you can get your data. First, create a smart playlist that has all of your songs.  I did this by including in the list all songs added before today’s date.  Now control-click on the name of the playlist, and choose Export.  Save the file wherever you wish, and you now have a tab-delimited file. (It does take a few minutes, if your library is anything near the size of my own. Not bragging.)

So now we can finally get to the main point of this post.  Which is to point out that almost all of the datasets I give my students, whether they are in intro stats or higher, have a small number of variables.  And even if not, the questions almost all involve using only a small number of variables.

But if students are to become Citizen Statisticians, they must learn to think more like scientists.  They must learn how to pose questions and create new measures.  I wonder what most of our students would do, when confronted with the 27 or so variables iTunes gives them.  Make histograms?  Fine, a good place to start.  But what real-world question are they answering with a histogram? Do students really care about the variability in length of their song tracks?

I suggest that one interesting question to ask students to explore is to see if their listening habits have changed over some period of time.  Now I know younger students won’t have much time to look back across.  But I think this is a meaningful question, and one that’s not answered in an obvious way.  More precisely, to answer this question requires thinking about what it means to have a ‘listening habit’, and to question how such a habit might be captured in the given data.

I’m not sure what my students would think of.  Or, frankly, what I would think of.  At the very least, the answer to any such question will require wrestling with the date variables and so require some work with dates.  Some basic questions might be to see how many songs I’ve added per year. This isn’t that easy in many software packages, because I have to loop over the years and count the number of entries. Another question that I might want to know: What proportion of songs remain unplayed each year? (In other words, am I wasting space storing music I don’t listen to?)  Has the mix of genres changed over time, or are my tastes relatively unchanged?
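
In R, at least, the per-year counts can be had without an explicit loop. The sketch below uses a tiny invented library with made-up column names (date_added, plays); a real export would be read with read.delim(), and its actual column names and date format would need to be checked first.

```r
# HYPOTHETICAL mini-library standing in for an exported playlist
lib <- data.frame(
  name       = c("A", "B", "C", "D", "E", "F"),
  date_added = as.Date(c("2010-03-01", "2010-07-15", "2011-01-20",
                         "2011-11-02", "2011-12-25", "2012-05-09")),
  plays      = c(12, 0, 3, 0, 0, 7)
)

# Extract the year once; grouping functions then do the counting
lib$year <- format(lib$date_added, "%Y")

# Songs added per year
added_per_year <- table(lib$year)

# Proportion of each year's additions that have never been played
unplayed_per_year <- tapply(lib$plays == 0, lib$year, mean)

added_per_year
unplayed_per_year
```

The same grouping idea would extend to the genre question: tabulate genre by year and compare the rows.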

Speaking of genres…unless you’ve been really careful about your genre-field, you’re in for a mess. I thought I was careful, but here’s what I’ve got (as seen from Fathom):

If you want to see questions asked by some people, download the free software SuperAnalyzer (This links to the Mac version via CNET).  Below is a graph that shows the growth of my library over time, for example. (Thanks to Anelise Sabbag for pointing this out to me during a visit to U of Minnesota last year, and to Elizabeth Fry and Laura Ziegler for their endorsements of the app.)

And the most common words in the titles:

So let me know what you want to do with your iTunes library. Or what your students have done.  What was frustrating? Impossible? Easier than expected?