Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how, while IRBs are not required for data collected by students for class projects, discussing the ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and educate our students about it. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing issues surrounding data ethics and IRB in the introductory statistics course (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by her involvement with the honor council of her university, where she realized that while the council keeps impeccable records of reported cases, it has no information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example, thinking about how to code “any” academic misconduct based on various questions that ask whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. In my experience students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seems to hit both of those notes.
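
As a small illustration of the kind of variable creation involved, here is a minimal sketch in R of coding an “any misconduct” indicator from several yes/no survey items. The column names (cheat_exam, copy_hw, plagiarize) are hypothetical stand-ins, not the actual survey variables:

# Hypothetical yes/no survey items (the real survey has many more)
misconduct <- data.frame(
     cheat_exam = c("Yes", "No", "No"),
     copy_hw    = c("No",  "No", "Yes"),
     plagiarize = c("No",  "No", "No")
     )

# Flag a student for "any" misconduct if they answered Yes to at least one item
misconduct$any <- apply(misconduct == "Yes", 1, any)

table(misconduct$any)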

Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct, and an outcome of their own choosing, and to present their findings in an optional research paper (for some form of extra credit). One example Shannon gave for the latter task was defining a “serious offender”: is it a student who commits one serious offense, or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment Shannon mentioned that the administration at her school was concerned that students finding out about the high percentage of academic offenses (the survey showed that about 60% of students had committed a “major” academic offense) might make students think that it’s OK, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking to expand this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it’s not), projects like these can remind students that data and statistics can be powerful activism tools.

Pie Charts. Are they worth the Fight?

Like Rob, I recently got back from ICOTS. What a great conference. Kudos to everyone who worked hard to organize and pull it off. In one of the sessions I was at, Amelia McNamara (@AmeliaMN) gave a nice presentation about how they were using data and computer science in high schools as a part of the Mobilize Project. At one point in the presentation she had a slide that showed a screenshot of the dashboard used in one of their apps. It looked something like this.

[Screenshot of the Mobilize dashboard]

During the Q&A, one of the critiques of the project was that they had displayed the data as a donut plot. “Pie charts (or any kin thereof) = bad” was the message. I don’t really want to fight about whether they are good or bad—the reality is probably in between. (Tufte, the most cited source for the ‘pie charts are bad’ rhetoric, never really said pie charts were bad, only that, given the space they take up, they are perhaps less informative than other graphical choices.) Do people have trouble reading radians? Sure. Is the message in the data obscured because of this? Most of the time, no.

[Bar chart and donut plot of the ad data]

Here are the bar chart (often offered as the better alternative to the pie chart) and the donut plot for the data shown in the Mobilize dashboard screenshot. The message is that most of the advertisements were from posters and billboards. If people are interested in the n‘s, that is easily remedied by including them explicitly on the plot—which neither the bar plot nor the donut plot currently does. (The dashboard displays the actual numbers when you hover over a donut slice.)

It seems we are wasting our breath constantly criticizing people for choosing pie charts. Whether we like it or not, the public has adopted pie charts. (As is pointed out in this blog post, Leland Wilkinson even devotes a whole chapter to pie charts in his Grammar of Graphics book.) Maybe people are reasonably good at pulling out the often-not-so-subtle differences that are generally shown in a pie chart. After all, it isn’t hard to understand (even when using a 3-D exploding pie chart) that the message in this pie chart is that the “big 3” browsers have a strong hold on the market.

The bigger issue to me is that these types of graphs are only reasonable choices when examining simple group differences—the marginals. Isn’t life, and data, more complex than that? Is the distribution of browser type the same for Mac and PC users? For males and females? For different age groups? These are the more interesting questions.
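
A quick sketch of that distinction in R, using a made-up two-way table of browser by operating system (the counts are invented, just to show marginal versus conditional distributions):

# Hypothetical counts of browser use by operating system
browsers <- matrix(c(60, 25, 15,
                     35, 50, 15),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(os = c("Mac", "PC"),
                                   browser = c("Chrome", "Firefox", "IE")))

# Marginal distribution of browser -- what a single pie (or donut) chart shows
prop.table(colSums(browsers))

# Conditional distribution of browser within each OS -- the more interesting comparison
prop.table(browsers, margin = 1)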

The dashboard addresses this through interactivity between the multiple donut charts. Clicking a slice in the first plot shows the distribution of product types (the second plot) for the ads that fall within the selected slice—the conditional distributions.

So my argument is that rather than referring to a graph choice as good or bad, we should instead focus on the underlying question prompting the graph in the first place. Mobilize acknowledges that complexity by addressing the need for conditional distributions. Interactivity and computing make pie charts a reasonable choice for displaying this.

*If those didn’t persuade you, perhaps you will be swayed by the food argument. Donuts and pies are two of my favorite food groups. Although bars are nice too. For a more tasty version of the donut plot, perhaps somebody should come up with a cronut plot.

**The ggplot2 syntax for the bar and donut plots is provided below. The syntax for the donut plot was adapted from this blog post.

# Input the ad data
ad = data.frame(
	type = c("Poster", "Billboard", "Bus", "Digital"),
	n = c(529, 356, 59, 81)
	)

# Bar plot
library(ggplot2)
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
     geom_bar(stat = "identity", show_guide = FALSE) +
     theme_bw()

# Add additional columns to the data, needed for the donut plot.
ad$fraction = ad$n / sum(ad$n)
ad$ymax = cumsum(ad$fraction)
ad$ymin = c(0, head(ad$ymax, n = -1))

# Donut plot
ggplot(data = ad, aes(fill = type, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
     geom_rect(colour = "grey30", show_guide = FALSE) +
     coord_polar(theta = "y") +
     xlim(c(0, 4)) +
     theme_bw() +
     theme(panel.grid=element_blank()) +
     theme(axis.text=element_blank()) +
     theme(axis.ticks=element_blank()) +
     geom_text(aes(x = 3.5, y = ((ymin+ymax)/2), label = type)) +
     xlab("") +
     ylab("")

 

 

Data Privacy (L.A. Times)

The L.A. Times ran an article on data privacy today, which, I think it’s fair to say, puts “Big Data” in approximately the same category as fire. In the right hands, it can do good. But…

http://www.latimes.com/nation/politics/politicsnow/la-pn-white-house-big-data-privacy-report-20140501,0,5624003.story

Conditional probabilities and kitties

I was at the vet yesterday, and just like with any doctor’s visit, there was a bit of waiting around — time for re-reading all the posters in the room.


And this is what caught my eye on the information sheet about feline heartworm (I’ll spare you the images):

[Photo of the Q&A on the information sheet]

The question asks: “My cat is indoor only. Is it still at risk?”

The way I read it, this question is asking about the risk of an indoor only cat being heartworm positive. To answer this question we would want to know P(heartworm positive | indoor only).

However the answer says: “A recent study found that 27% of heartworm positive cats were identified as exclusively indoor by their owners”, which is P(indoor only | heartworm positive) = 0.27.

Sure, this gives us some information, but it doesn’t actually answer the original question. The original question is asking about the reverse of this conditional probability.
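
To see how different the two conditional probabilities can be, here is a back-of-the-envelope Bayes’ theorem calculation in R. The 0.27 comes from the information sheet; the prevalence of heartworm and the proportion of indoor-only cats are made-up placeholders, only there to show the mechanics:

# From the information sheet: P(indoor only | heartworm positive)
p_indoor_given_pos <- 0.27

# Hypothetical inputs (placeholders, not real estimates)
p_pos    <- 0.01   # P(heartworm positive): overall prevalence among cats
p_indoor <- 0.60   # P(indoor only): proportion of cats kept exclusively indoors

# Bayes' theorem: P(heartworm positive | indoor only)
p_indoor_given_pos * p_pos / p_indoor
# roughly 0.0045 with these made-up inputs -- a far cry from 0.27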

When we talk about Bayes’ theorem in my class and work through examples about the sensitivity and specificity of medical tests, I always tell my students that doctors are actually pretty bad at these. Looks like I’ll need to add vets to my list too!

Ranked Choice Voting

The city of Minneapolis recently elected a new mayor. This is not newsworthy in and of itself; what is newsworthy is the method they used—ranked choice voting. Ranked choice voting is a method of voting that allows voters to rank multiple candidates in order of preference. In the Minneapolis mayoral election, voters ranked up to three candidates.

The interesting part of this whole thing was that it took over two days for the election officials to declare a winner. It turns out that the official procedure for calculating the winner of the ranked-choice vote involved cutting and pasting spreadsheets in Excel.

The technology coordinator at E-Democracy, Bill Bushey, posted the challenge of writing a program to calculate the winner of a ranked-choice election to the Twin Cities Javascript and Python meetup groups. Winston Chang also posted it to the Twin Cities R Meetup group. While not a super difficult problem, it is complicated enough that it can make for a nice project—especially for new R programmers. (In fact, our student R group is doing this.)

The algorithm, as described by Bill Bushey, is as follows (a minimal R sketch implementing these steps appears after the worked example below):

  1. Create a data structure that represents a ballot with voters’ 1st, 2nd, and 3rd choices
  2. Count up the number of 1st choice votes for each candidate. If a candidate has 50% + 1 votes, declare that candidate the winner.
  3. Else, select the candidate with the lowest number of 1st choice votes, remove that candidate completely from the data structure, make the 2nd choice of any voter who voted for the removed candidate the new 1st choice (and the old 3rd choice the new 2nd choice).
  4. Goto 2

As an example consider the following sample data:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    Frank
    2    Frank     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

In this data, James has the most 1st choice votes (4) but it is not enough to win the election (a candidate needs 6 votes = 50% of 10 votes cast + 1 to win). So at this point we determine the least voted for candidate…Frank, and delete him from the entire structure:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    <del>Frank</del>
    2    <del>Frank</del>     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

Then, the 2nd choice of any voter who voted for Frank becomes the new “1st” choice. This applies only to Voter #2 in the sample data. Thus Fred becomes Voter #2’s 1st choice and James becomes Voter #2’s 2nd choice:

Voter  Choice1  Choice2  Choice3
    1    James     Fred
    2     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

James still has the most 1st choice votes, but not enough to win (he still needs 6 votes!). Fred now has the fewest 1st choice votes, so he is eliminated and removed from every ballot, and the remaining choices are moved up:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              
    7    Laura
    8    James
    9    David    Arnie
   10    David

James now has five 1st choice votes, but still not enough to win. Laura has the fewest 1st choice votes, so she is eliminated and removed from every ballot, and the remaining choices are moved up:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4     
    5    David 
    6    James              
    7    
    8    James
    9    David    Arnie
   10    David

James retains his lead with five first-place votes…but now he is declared the winner. Since Voters #4 and #7 have no choices left on their ballots, they no longer count in the number of voters. Thus, to win, a candidate needs only 5 votes (50% of the 8 remaining 1st choice votes, plus 1).
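
Here is a minimal R sketch of the elimination procedure described above, run on the sample ballots. It is only a sketch — it ignores overvotes, ties, and the official rule-set’s edge cases:

# Sample ballots; NA means no choice was given
ballots <- matrix(c(
     "James", "Fred",  "Frank",
     "Frank", "Fred",  "James",
     "James", "James", "James",
     "Laura", NA,      NA,
     "David", NA,      NA,
     "James", NA,      "Fred",
     "Laura", NA,      NA,
     "James", NA,      NA,
     "David", "Arnie", NA,
     "David", NA,      NA
     ), ncol = 3, byrow = TRUE)

repeat {
     # Current 1st choice = first remaining (non-NA) candidate on each ballot
     first <- apply(ballots, 1, function(b) na.omit(b)[1])
     active <- !is.na(first)                  # exhausted ballots no longer count
     counts <- table(first[active])
     threshold <- floor(sum(active) / 2) + 1  # 50% of active ballots + 1

     cat("Counts:", paste(names(counts), counts, collapse = ", "), "\n")
     if (max(counts) >= threshold) {
          cat("Winner:", names(which.max(counts)), "\n")
          break
     }

     # Eliminate the candidate with the fewest 1st choice votes from every ballot
     loser <- names(which.min(counts))
     ballots[which(ballots == loser)] <- NA
}

With these ballots the loop eliminates Frank, Fred, and Laura in turn, then declares James the winner with 5 of the 8 active ballots, matching the walk-through above.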

The actual data from Minneapolis includes over 80,000 votes for 36 different candidates. There are also ballot issues such as overvoting, which occurs when a voter gives multiple candidates the same ranking, and undervoting, which occurs when a voter does not select a candidate.

The animated GIF below shows the results after each round of elimination for the Minneapolis mayoral election.

2013 Mayoral Race

The Minneapolis mayoral data is available on GitHub as a CSV file (along with some other smaller sample files to hone your programming algorithm). There is also a frequently asked questions webpage available from the City of Minneapolis regarding ranked choice voting.

In addition, you can listen to the Minnesota Public Radio broadcast in which they discussed the problems with the vote counting. The folks at the R Users Group Meeting were featured, and Winston brought the house down when, commenting on the R program that computed the winner within a few seconds, he said, “it took me about an hour and a half to get something usable, but I was watching TV at the time.”

See the R syntax I used here.

 

Personal Data Apps

Fitbit, you know I love you and you’ll always have a special place in my pocket.  But now I have to make room for the Moves app to play a special role in my capture-the-moment-with-data existence.

Moves is a free iOS 7 app.  It eats up some extra battery power and, in exchange, records your location, merges this with various databases, syncs it up to other databases, and produces some very nice “storylines” that remind you about the day you had and, as a bonus, can motivate you to improve your activity levels.  I’ve attached two example storylines that do not make it too embarrassingly clear how little exercise I have been getting. (I have what I consider legitimate excuses, and once I get the dataset downloaded, maybe I’ll add them as covariates.)  One of the timelines is from a day that included an evening trip to Disneyland. The other is a Saturday spent running errands and capped with dinner at a friend’s.  It’s pretty easy to tell which day is which.

[Two example Moves storylines]

But there’s more.  Moves has an API, allowing developers to tap into its data stream to create apps.  There’s an app that exports the data for you (although I haven’t really had success with it yet) and several that create journals based on your Moves data.  You can also merge in Foursquare, Twitter, and all the usual suspects.

I think it might be fun to have students discuss how one could go from the data Moves collects to the storylines it creates.  For instance, how does it know I’m in a car, and not just a very fast runner?  Actually, given LA traffic, a better question is how it knows I’m stuck in traffic and not just strolling down the freeway at a leisurely pace. (Answering these questions requires a different type of inference than what we normally teach in statistics.)  Besides journals, what apps might students create with these data, and what additional data would they need?
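
Purely as a sketch of that kind of inference (Moves’ actual algorithm is not public, and these speed thresholds are invented), a naive first pass at labeling transport mode from average speed between location pings might look like:

# Hypothetical trip segments with average speeds in km/h
segments <- data.frame(speed_kmh = c(4, 11, 55, 2, 23))

# Crude rule-based labels; the cut points are made up for illustration
segments$mode <- cut(segments$speed_kmh,
                     breaks = c(0, 7, 25, Inf),
                     labels = c("walking", "cycling", "driving"))

segments
# 55 km/h gets labeled "driving" -- unless you really are a very fast runner --
# and LA traffic can easily push "driving" speeds down into the "walking" range.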

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct 30) encouraging City Hall to make its data available to the public.  As you know, fellow Citizens, we’re all in favor of making data public, particularly if the public has already picked up the bill and if no individual’s dignity will be compromised.  For me this editorial comes at a time when I’ve been feeling particularly down about the quality of public data.  As I’ve been looking around for data to update my book and for the Mobilize project, I’m convinced that data are getting harder, not easier, to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples.  (We compare the participatory data on gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up to date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheapest gas.)  It used to be that you could type in a zip code and have access to a nice data set that showed current prices, the names and locations of gas stations, the dates of the last reported prices, and the usernames of the people who reported them.  Now, you can scroll through an unsorted list of cities and states and get the same information only for the 15 cheapest and most expensive stations.

About two years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US.  I’ve looked and looked, but the data.gov AirData site now requires that I enter the name of each city one at a time, and download very raw data for each city separately.  Now, raw data are a good thing, and I’m glad to see them offered. But is it really so difficult to also provide some sensibly aggregated data sets?

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.

Thinking with technology

Just finished a stimulating, thought-provoking week at SRTL—the Statistics Research Teaching and Learning conference—this year held in Two Harbors, Minnesota, right on Lake Superior. SRTL gathers statistics education researchers, most of whom come with cognitive or educational psychology credentials, every two years. It’s more of a forum for thinking and collaborating than a platform for presenting findings, and this means there’s much lively, constructive discussion about works in progress.

I had meant to post my thoughts daily, but (a) the internet connection was unreliable and (b) there was just too much to digest. One recurring theme that really resonated with me was the way students interact with technology when thinking about statistics.

Much of the discussion centered on young learners, and most of the researchers — but not all — were in classrooms in which the students used TinkerPlots 2. TinkerPlots is a dynamic software system that lets kids build their own chance models. (It also lets them build their own graphics more or less from scratch.) They do this either by dropping “balls” into “urns” and labeling the balls with characteristics, or through spinners that allow them to shade different areas different colors. They can connect series of spinners and urns to create sequences of independent or dependent events, and can collect the outcomes of their trials. Most importantly, they can carry out a large number of trials very quickly and graph the results.

What I found fascinating was the way in which students would come to judgements about situations and then build a model that they thought would “prove” their point. After running some trials, when things didn’t go as expected, they would go back and assess their model. Sometimes they’d realize that they had made a mistake, and they’d fix it. Other times, they’d see there was no mistake, and then realize that they had been thinking about it wrong. Sometimes, they’d come up with explanations for why they had been thinking about it incorrectly.

Janet Ainley put it very succinctly. (More succinctly and precisely than my re-telling.) This technology imposes a sort of discipline on students’ thinking. Using the technology is easy enough that they can be creative, but the technology is rigid enough that their mistakes are made apparent. This means that mistakes are cheap, and attempts to repair them are easily made. And so the technology itself becomes a form of communication that forces students into a greater level of precision than they could put into words.

I suppose that mathematics plays the same role, in that speaking with mathematics imposes great precision on the speaker. But that language takes time to learn, and few students reach a level of proficiency that allows them to use the language to construct new ideas. TinkerPlots, and software like it, gives students the ability to use a language to express new ideas with very little expertise. It was impressive to see 15-year-olds build models that incorporated both deterministic trends and fairly sophisticated random variability. More impressive still, the students were able to use these models to solve problems. In fact, I’m not sure they really knew they were building models at all, since their focus was on the problem solving.

TinkerPlots is aimed at a younger audience than the one I teach. But for me, the take-home message is to remember that statistical software isn’t simply a tool for calculation, but a tool for thinking.

Here’s Looking At You!

What do we fear more?  Losing data privacy to our government, or to corporate entities?  On the one hand, we (still) have oversight over our government.  On the other hand, the government is (still) more powerful than most corporate entities, and so perhaps better situated to frighten.

In these times of Snowden and the NSA, the L.A. Times ran an interesting story about just how much tracking various internet companies perform. And it’s alarming. (“They’re watching your every move,” July 10, 2013; interestingly, the story does not seem to appear on their website as of this posting.) Like the government, most of these companies claim that (a) their ‘snooping’ is algorithmic—no human sees the data—and (b) their data are anonymized. And yet…

To my knowledge, businesses aren’t required to adhere to, or even acknowledge, any standards or practices for dealing with private data. Thus, a human could snoop on particular data. We are left to ponder what that human will do with the information. In the best-case scenario, the snooper gets fired, as happened, according to the L.A. Times, when Google fired an engineer for snooping on the emails of some teenage girls.

But the data are anonymous, you say? Well, there’s anonymous and then there’s anonymous. As Latanya Sweeney taught us in the 90s, knowing a person’s zip code, gender, and date of birth is sufficient to uniquely identify 85% of Americans. And the L.A. Times reports a similar study in which just four hours of anonymized tracking data were sufficient to identify 95% of all individuals examined. So while your name might not be recorded, by merging enough data files, they will know it is you.
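
A toy sketch in R of how that merging works (every value below is invented): an “anonymized” file with no names can be joined to a public record on the quasi-identifiers alone:

# "Anonymized" records: no names, just quasi-identifiers plus the sensitive bit
anon <- data.frame(zipcode = c("90024", "90210"),
                   gender  = c("F", "M"),
                   dob     = c("1985-03-14", "1990-07-01"),
                   visited = c("site_a.com", "site_b.com"))

# A public file (say, a voter roll) that does carry names
public <- data.frame(zipcode = c("90024", "90210"),
                     gender  = c("F", "M"),
                     dob     = c("1985-03-14", "1990-07-01"),
                     name    = c("Alice", "Bob"))

# Merging on zip code + gender + date of birth re-attaches the names
merge(anon, public, by = c("zipcode", "gender", "dob"))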

This article fits in really nicely with a fascinating, revelatory book I’m currently midway through: Jaron Lanier‘s Who Owns The Future? A basic theme of the book is that internet technology devalues products and goods (files) and values services (software). One process through which this happens is that we humans accept the marvelous free stuff the internet provides (free Google searches, free Amazon shipping, easily pirated music files) in exchange for allowing companies to snoop. The companies turn our aggregated data into dollars by selling it to advertisers.

A side effect of this, Lanier explains, is a loss of social freedom. At some point, a service such as Facebook gets to be so large that failing to join means losing out on possibly rich social interactions. (Yes, I know there are those who walk among us who refuse to join Facebook. But these people are probably not reading this blog, particularly since our tracking ‘bots tell us that most of our readers come from Facebook referrals. Oops. Was I allowed to reveal that?) So perhaps you shouldn’t complain about being snooped on, since you signed away your privacy rights. (You did read the entire user agreement, right? Raise your hand if you did. Thought so.) On the other hand, if you don’t sign, you become a social pariah. (Well, an exaggeration. For now.)

Recently, I installed Ghostery, which tracks the automated snoopers that follow me during my browsing.  Not only “tracks”, but also blocks.  Go ahead and try it.  It’s surprising how many different sources are following your every on-line move.

I have mixed feelings about blocking this data flow. The data-snooping industry is big business, and is responsible, in part, for the boom in stats majors and, more importantly, the boom in stats employment. So, indirectly, data-snooping is paying my income. Lanier has an interesting solution: individuals should be paid for their data, particularly when it leads to value. This means the era of ‘free’ is over—we might end up paying for searches and for reading Wikipedia. But he makes a persuasive case that the benefits exceed the costs. (Well, I’m only halfway through the book. But so far, the case is persuasive.)