Ranked Choice Voting

The city of Minneapolis recently elected a new mayor. That is not newsworthy in and of itself; what is newsworthy is the method they used: ranked choice voting. Ranked choice voting is a voting method that allows voters to rank multiple candidates in order of preference. In the Minneapolis mayoral election, voters ranked up to three candidates.

The interesting part of this whole thing was that it took over two days for the election officials to declare a winner. It turns out that the official procedure for calculating the winner of the ranked-choice vote involved cutting and pasting spreadsheets in Excel.

The technology coordinator at E-Democracy, Bill Bushey, posted the challenge of writing a program to calculate the winner of a ranked-choice election to the Twin Cities Javascript and Python meetup groups. Winston Chang also posted it to the Twin Cities R Meetup group. While not a super difficult problem, it is complicated enough that it can make for a nice project—especially for new R programmers. (In fact, our student R group is doing this.)

The algorithm, as described by Bill Bushey, is:

  1. Create a data structure that represents a ballot with voters’ 1st, 2nd, and 3rd choices
  2. Count up the number of 1st choice votes for each candidate. If a candidate has 50% + 1 votes, declare that candidate the winner.
  3. Otherwise, select the candidate with the fewest 1st choice votes, remove that candidate completely from the data structure, and make the 2nd choice of any voter who voted for the removed candidate their new 1st choice (and the old 3rd choice the new 2nd choice).
  4. Go to step 2.
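
Here is a minimal R sketch of this procedure (not the official tabulation code, and not the solution linked at the end of this post). It assumes the ballots sit in a data frame or character matrix with one column per ranking and NA or empty strings for blank cells, and it ignores real-world wrinkles such as ties, overvotes, and undervotes.

# A minimal sketch of the elimination algorithm. Assumes `ballots` has one
# row per voter and one column per ranking (1st, 2nd, 3rd choice), with NA
# or "" for blank cells. Ties and over/undervotes are not handled.
rcv_winner <- function(ballots) {
  b <- as.matrix(ballots)
  repeat {
    # Each voter's current 1st choice: the highest-ranked candidate
    # still left on their ballot
    first <- apply(b, 1, function(r) {
      r <- r[!is.na(r) & r != ""]
      if (length(r) > 0) r[1] else NA
    })

    counts <- table(first)                    # 1st choice votes per candidate
    threshold <- floor(sum(counts) / 2) + 1   # 50% of active ballots + 1

    if (max(counts) >= threshold) {
      return(names(counts)[which.max(counts)])
    }

    # Otherwise eliminate the candidate with the fewest 1st choice votes
    # by blanking out every ranking they received
    loser <- names(counts)[which.min(counts)]
    b[!is.na(b) & b == loser] <- NA
  }
}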

As an example consider the following sample data:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    Frank
    2    Frank     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

In this data, James has the most 1st choice votes (4), but it is not enough to win the election (a candidate needs 6 votes = 50% of the 10 votes cast + 1). So at this point we find the candidate with the fewest 1st choice votes (Frank) and delete him from the entire structure:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    ~~Frank~~
    2    ~~Frank~~ Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

Then, the 2nd choice of any voter who voted for Frank becomes their new 1st choice. In the sample data this is only Voter #2. Thus Fred would become Voter #2's 1st choice and James would become Voter #2's 2nd choice:

Voter  Choice1  Choice2  Choice3
    1    James     Fred
    2     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

James still has the most 1st choice votes, but not enough to win (he still needs 6 votes!). Fred now has the fewest 1st choice votes, so he is eliminated, and the remaining choices of any voter who ranked him are moved up:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              
    7    Laura
    8    James
    9    David    Arnie
   10    David

James now has five 1st choice votes, but that is still not enough to win. Laura has the fewest 1st choice votes, so she is eliminated; since the voters who chose her listed no other candidates, their ballots are exhausted:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4     
    5    David 
    6    James              
    7    
    8    James
    9    David    Arnie
   10    David

James retains his lead with five 1st choice votes, and this time it is enough. Since Voters #4 and #7 did not list a 2nd or 3rd choice, their exhausted ballots no longer count toward the number of voters, so a candidate now needs only 5 votes = 50% of the 8 active ballots + 1. James is declared the winner.
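
As a quick check, the rcv_winner() sketch from above reproduces this result when the sample ballots are entered with NA for the blank cells:

# The sample ballots from the walkthrough, with blank cells as NA
ballots <- data.frame(
  choice1 = c("James", "Frank", "James", "Laura", "David",
              "James", "Laura", "James", "David", "David"),
  choice2 = c("Fred", "Fred", "James", NA, NA, NA, NA, NA, "Arnie", NA),
  choice3 = c("Frank", "James", "James", NA, NA, "Fred", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

rcv_winner(ballots)  # "James"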

The actual data from Minneapolis includes over 80,000 votes for 36 different candidates. There are also ballot issues such as overvoting, which occurs when a voter gives multiple candidates the same ranking, and undervoting, which occurs when a voter does not rank a candidate.

The animated GIF below shows the results after each round of elimination for the Minneapolis mayoral election.

2013 Mayoral Race

The Minneapolis mayoral data is available on GitHub as a CSV file (along with some other smaller sample files to hone your programming algorithm). There is also a frequently asked questions webpage available from the City of Minneapolis regarding ranked choice voting.

In addition, you can listen to the Minnesota Public Radio broadcast in which they discussed the problems with the vote counting. The folks at the R Users Group Meeting were featured, and Winston brought the house down when, commenting on the R program that computed the winner within a few seconds, he said, “it took me about an hour and a half to get something usable, but I was watching TV at the time”.

See the R syntax I used here.

 

Facebook Analytics

WolframAlpha has a tool that will analyze your Facebook network. I saw this a while ago, but HollyLynne reminded me of it recently, and I tried it out. You need to give the app(?) permission to access your account (which I am sure means Wolfram gets access to your data), after which you are given all sorts of interesting, pretty info. Note that you can also opt to have Wolfram track your data over time in order to determine how your network is changing.

Some of the analyses are kind of informative, but others are not. Consider this scatterplot(???)-type plot entitled “Weekly Distribution”. Tufte could include this in his next book of worthless graphs.

[WolframAlpha “Weekly Distribution” plot]

There are other analyses that are more useful. For example, I learned that my post announcing the Citizen Statistician blog was the most liked post I have, while the post showing photographic evidence that I held a baby as far back as 1976 was the most commented.

This plot was also interesting…too bad it is a pie chart (sigh).

[Pie chart from the WolframAlpha Facebook report]

There is also a ton of other information, such as which friend has the most friends (Jayne at 1819), your youngest and oldest friends based on the reported birthdays, photos that are tagged the most, word clouds of your posts, etc.

This network was my favorite of them all. It shows the social insiders and outsiders in my network of friends, and identifies social connectors, neighbors, and gateways.

[Friend network graph showing social insiders, outsiders, connectors, neighbors, and gateways]

Once again, it is kind of a cool tool that works with your existing data, but there does not seem to be a way to obtain the underlying data in a workable format.

NCAA Basketball Visualization

It is time for the NCAA Basketball Tournament. Sixty-four teams dream big (er…I mean 68…well, actually by now, 64), and schools like Iona and Florida Gulf Coast University (go Eagles!) are hoping that Robert Morris's astounding victory in the N.I.T. isn't just a flash in the pan.

My favorite part is filling out the bracket–see it below. (Imagine that…a statistician’s favorite part of the whole thing is making predictions.) Even President Obama filled out a bracket [see it here].

Andy's Bracket

My method for making predictions involves a complicated formula based on “coolness” factors of team mascots, alphabetical order (but only conditional on particular seedings), waving of hands, and guesswork. But that is only because I didn’t have access to my student Rodrigo Zamith’s latest blog post until today.

Rodrigo has put together side-by-side visualizations of many of the pertinent basketball statistics (e.g., points scored, rebounds, etc.) using the R package ggplot2. This would have been very helpful in my decisions where the mascot measure failed me and I was left with a toss-up (e.g., Oklahoma vs. San Diego State).
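
To give a rough sense of that kind of side-by-side comparison, here is a hedged ggplot2 sketch; the team names and per-game numbers below are placeholders I made up, not Rodrigo's data or code.

library(ggplot2)

# Placeholder per-game averages for two teams (made-up values)
stats <- data.frame(
  team   = rep(c("Minnesota", "UCLA"), each = 3),
  metric = rep(c("Points", "Rebounds", "Assists"), times = 2),
  value  = c(69.8, 35.1, 13.2, 74.5, 38.6, 14.9)
)

# One panel per statistic, with the two teams side by side
ggplot(data = stats, aes(x = team, y = value, fill = team)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ metric, scales = "free_y") +
  theme(legend.position = "none")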

Preview of the March 22 Game between Minnesota and UCLA

Rodrigo has also made the data available, not only for the 2012-2013 season but for the previous two seasons as well. Check it out at Rodrigo’s blog!

Now, all I have to do is hang tight until the 8:57pm (CST) game on March 22. Judging from the comparisons, it will be tight.

 

Gun deaths and data

This article at Slate is interesting for a number of reasons.  First, it offers a link to a data set listing the names and dates of the 325 people known to have been killed by guns since December 14, 2012.  Slate is to be congratulated for providing data in a format that is easy for statistical software to read.  (Still, some cleaning is required.  For example, ages include a mix of numbers and categorical values.)  Second, the data are the result of an informal data compilation by an unknown tweeter, although s/he is careful to give sources for each report.  (And, as Slate points out, deaths are more likely to go unreported than to be falsely reported.)  The data include names, dates, city and state, longitude/latitude, and age of victim.  Third, data such as these become richer when paired with other data, and I think it would be a great classroom project to create a data-driven story in which students used additional data sources to provide deeper context for these data.  An obvious choice is to extend the dataset back in time, possibly using official crime data (but I am probably hopelessly naive in thinking this is a simple task).
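
For instance, cleaning up the ages might look something like this in R; the file name and column name here are assumptions about how you saved the Slate data, not the actual ones.

# Hypothetical file and column names -- adjust to match the Slate download
deaths <- read.csv("slate_gun_deaths.csv", stringsAsFactors = FALSE)

# Ages arrive as text; categorical labels become NA (with a warning)
# and can be inspected or recoded separately
deaths$age_num <- suppressWarnings(as.numeric(deaths$age))

# A quick look at the age distribution of the victims
hist(deaths$age_num, main = "Ages of victims", xlab = "Age")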

[Plot of the ages of the gun-death victims]

Contributions to the 2012 Presidential Election Campaigns

With fewer than two weeks left until the US presidential elections, motivating class discussion with data related to the candidates, elections, or politics in general is quite easy. So for yesterday’s lab we used data released by the Federal Election Commission on contributions made to the 2012 presidential campaigns. I came across the data last week via a post on The Guardian Datablog. The post has a nice interactive feature for analyzing data from all contributions. The students started the lab by exploring the data using this applet, and then moved on to analyzing the data in R.

The original dataset can be found here. You can download data for all contributions (a ~620 MB csv file) or contributions by state (~16 MB for North Carolina, for example). The complete dataset has information on over 3.3 million contributions. The students worked with a random sample of 10,000 observations from this dataset. I chose not to use the entire population data because (1) it’s too large to work with efficiently in an introductory stats course, and (2) we’re currently covering inference, so a setting where you start with a random sample and infer something about the population felt more natural.

While the data come in csv format, loading the data into R is slightly problematic. For some reason, all rows except the header row end with a comma, so naively loading the data with the read.csv function results in an error: R sees the extra comma as indicating an additional column and complains that the header row does not have the same length as the rest of the dataset. Below are a couple of ways to resolve this problem:

  • One solution is to simply open the csv file in Excel and resave it. This eliminates the spurious commas at the end of each line, making it possible to load the data using the read.csv function. However, this solution is not practical for the large file of all contributions.
  • Another solution for loading the population data (somewhat quickly) and taking a random sample is presented below:
# Read the raw file as text and sample n data lines (skipping the header row)
x = readLines("P00000001-ALL.csv")
n = 10000 # desired sample size
s = sample(2:length(x), n)

# Parse the sampled lines, drop the empty column created by the trailing
# comma, and re-attach the column names from the header row
header = strsplit(x[1], ",")[[1]]
d = read.csv(textConnection(x[s]), header = FALSE)
d = d[, -ncol(d)]
colnames(d) = header

Our lab focused on comparing average contribution amounts among elections and candidates. But these data could also be used to compare contributions from different geographies (city, state, zip code), or to explore characteristics of contributions from individuals of various occupations, individuals vs. PACs, etc.
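
As a hedged illustration of that comparison, starting from the sampled data frame d created above (the column names cand_nm, contb_receipt_amt, and election_tp are my guesses at the FEC header; check them against header before running):

# Column names below are assumed from the FEC file; verify against `header`
# Average contribution amount by candidate
aggregate(contb_receipt_amt ~ cand_nm, data = d, FUN = mean)

# Average contribution amount by election type (e.g., primary vs. general)
aggregate(contb_receipt_amt ~ election_tp, data = d, FUN = mean)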

If you’re interested, there should still be enough time for you to squeeze an analysis/discussion of these data into your class before the elections. But even if not, the data should still be interesting after November 6.

Red Bull Stratos Mission Data

Yesterday (October 14, 2012), Felix Baumgartner made history by becoming the first person to break the speed of sound during a free fall. He also set some other records (e.g., longest free fall, etc.) during the Red Bull Stratos Mission, which was broadcast live on the internet. Kind of cool, but imagine the conversation that took place when someone dreamed this one up…

Red Bull Creative Person: What if we got some idiot to float up into the stratosphere in a space capsule and then had him step out of it and free fall for four minutes, breaking the sound barrier?

Another Red Bull Creative Person: Great idea! Let’s also broadcast it live on the internet.

Well anyway, after the craziness ensued, it was suggested on Facebook that, “I think this data should be on someone’s blog!” Rising to the bait, I immediately looked at the mission page, but the data was no longer there. Thank goodness for Wikipedia [Red Bull Stratos Mission Data]. The data can be copied and pasted into an Excel sheet, or read into R using the readHTMLTable() function from the XML package.

# readHTMLTable() is in the XML package
library(XML)

mission <- readHTMLTable(
  doc = "http://en.wikipedia.org/wiki/Red_Bull_Stratos/Mission_data",
  header = TRUE
  )

We can then write it to an external file (I called it Mission.csv and put it on my desktop) using the write.csv() function.

# Write the table to a CSV file on the desktop
write.csv(mission,
  file = "/Users/andrewz/Desktop/Mission.csv",
  row.names = FALSE,
  quote = FALSE
  )

Opening the new file in a text editor, we see some issues to deal with (these are also apparent from looking at the data on the Wikipedia page).

  • The first line is the first table header, Elevation Data, which spanned three columns in the Wikipedia page. Delete it.
  • The last row repeats the variable names. Delete it.
  • Change the variable names in the current first row to be statistical software compliant (e.g., remove the commas and spaces from each variable). My first row looks like the following:
Time,Elevation,DeltaTime,Speed
  • Remove the commas from the values in the last column; embedded commas are trouble in a comma-separated values (CSV) file.
  • There are nine rows that have parentheses around their value in the last column. I don’t know what this means. For now, I will remove those values. (A scripted version of this cleanup is sketched below.)
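
If you would rather script this cleanup than edit the file by hand, a rough R sketch along these lines could work. It assumes, as the list above suggests, that only the last column contains embedded commas and parenthesized values:

# Read the exported file back in as plain text (same path as above)
raw <- readLines("/Users/andrewz/Desktop/Mission.csv")

# Drop the spanning "Elevation Data" header line and the repeated
# variable names in the last row
raw <- raw[-c(1, length(raw))]

# Replace the header row with software-friendly names
raw[1] <- "Time,Elevation,DeltaTime,Speed"

# The first three fields are clean; any extra commas belong to the last
# column, so rebuild each data row without them and strip the
# parenthesized values
fields <- strsplit(raw[-1], ",")
rows <- vapply(fields, function(f) {
  last <- paste(f[-(1:3)], collapse = "")   # drop embedded commas
  last <- gsub("\\(.*\\)", "", last)        # drop parenthesized values
  paste(c(f[1:3], last), collapse = ",")
}, character(1))

mission <- read.csv(textConnection(c(raw[1], rows)))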

The file can be downloaded here.

Then you can plot (or analyze) away to your heart’s content.

# read in data to R
mission <- read.csv(file = "/Users/andrewz/Desktop/Mission.csv")

# Load ggplot2 library
library(ggplot2)

# Plot speed vs. time
ggplot(data = mission, aes(x = Time, y = Speed)) +
  geom_line()

# Plot elevation vs. time
ggplot(data = mission, aes(x = Time, y = Elevation)) +
  geom_line()

Since I have no idea what these really represent other than what the variable names tell me, I cannot interpret these very well. Perhaps someone else can.

TV Show hosts

A little bit ago [July 19, 2012, so I'm a little behind], the L.A. Times ran an article about whether TV hosts are pulling their own weight, salary-wise. (What is the real value of TV stars and personalities?)  I took their data table and put it in CSV format, and added a column called “eponymous”, which indicates whether the show is named after the host.  (This apparently doesn’t explain the salary variation.)  A later letter to the editor pointed out that the analysis doesn’t take into account how frequently the show must be recorded, and hence how often the host must come to work.  Your students might enjoy adding this variable and analyzing the data to see if it explains anything. Maybe this is a good candidate for ‘enrichment’ via Google Refine?  TV salaries from LA Times
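
For a quick start in R, something like the following could work; the file and variable names are guesses at how you might store the table, not the actual ones in the posted CSV.

# Hypothetical file and variable names -- adjust to match the posted CSV
hosts <- read.csv("tv_host_salaries.csv", stringsAsFactors = FALSE)

# Compare salaries for shows that are vs. are not named after their host
boxplot(salary ~ eponymous, data = hosts,
        xlab = "Show named after host?", ylab = "Salary")

# Once a shows-per-week variable (here, episodes_per_week) is added,
# a simple model can check whether it explains the salary variation
# lm(salary ~ eponymous + episodes_per_week, data = hosts)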

More on FitBit data

First the good news:

 

Your data belongs to you!

And now the bad: it costs you $50/year for your data to truly belong to you.  For a ‘premium’ membership, you can visit your data as often as you choose.  If only Andy had posted sooner, I would have saved $50.  But, dear readers, in order to explore all avenues, I spent the bucks.  And here’s some data (a screenshot; I don’t want you analyzing *my* data!).

It’s pretty easy and painless.  Next I’ll try Andy’s advice, and see if I can save $50 next year.