Data News: Fitbit + iHealth, and Open Justice data

The LA Times reported today, along with several other sources, that the California Department of Justice has initiated a new “open justice” data initiative. On their portal, the “Justice Dashboard”, you can view Arrest Rates, Deaths in Custody, or Law Enforcement Officers Killed or Assaulted.

I chose, for my first visit, to look at Deaths in Custody.  At first, I was disappointed with the quality of the data provided.  Instead of data, you see some nice graphical displays, mostly univariate but a few with two variables, addressing issues and questions that are probably on many people’s minds.  (Alarmingly, the second most common cause of death for people in custody is homicide by a law enforcement officer.)

However, if you scroll to the bottom, you’ll see that you can, in fact, download relatively raw data, in the form of a spreadsheet in which each row is a person who died in custody. Variables include date of birth and death, gender, race, custody status, offense, and reporting agency, among many others. Altogether, there are 38 variables and over 15,000 observations. The data set comes with a nice codebook, too.

FitBit vs. the iPhone

On to a cheerier topic. This quarter I will be teaching regression, and once again my FitBit provided inspiration. If you teach regression, you know one of the awful secrets of statistics: there are no linear associations. Well, there are, but they are few and far between. And so I was pleased when a potentially linear association sprang to mind: how well do FitBit step counts predict the Health app’s step counts?

The Health app is an iOS 8 app. It was automatically installed on your iPhone, whether you wanted it or not. (I speak from the perspective of an iPhone 6 user with iOS 8 installed.) Apparently, whether you know it or not, your steps are being counted. If you have an Apple Watch, you know about this. But if you don’t, it happens invisibly, until you open the app. Or buy the watch.

How can you access these data? I did so by downloading the free app QS (for “Quantified Self”). The Quantified Self people have a website directing you to hundreds of apps you can use to learn more about yourself than you probably should. Once the app is installed, you simply open it, choose which variables you wish to download, click ‘submit’, and a CSV file is emailed to you (or to whomever you wish).

The FitBit data can only be downloaded if you have a premium account.  The FitBit premium website has a ‘custom option’ that allows you to download data for any time period you choose, but currently, due to an acknowledged bug, no matter which dates you select, only one month of data will be downloaded. Thus, you must download month by month.  I downloaded only two months, July and August, and at some point in August my FitBit went through the wash cycle, and then I misplaced it.  It’s around here, somewhere, I know. I just don’t know where.  For these reasons, the data are somewhat sparse.

I won’t bore you with details, but by applying functions from the lubridate package in R and using the gsub function to remove commas (because FitBit inexplicably inserts commas into its numbers and, I almost forgot, adds a superfluous title to the document, which requires the skip = 1 option in read.table), it was easy to merge a couple of months of FitBit data with the Health data. And so here’s how they compare:


The fitted regression line is Predicted.iOS.Steps = 1192 + 0.9553 × FitBit.Steps, with an r-squared of 0.9223. (A residual plot shows that the relationship is not quite as linear as it looks. Damn.)
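For readers who want to try something similar, here is a rough sketch of the whole read/merge/fit pipeline. The file names, column names, and date formats are illustrative assumptions, not the exact ones from my exports:

library(lubridate)

# FitBit export: skip the superfluous title line, then strip the commas that
# FitBit inserts into its numbers (file and column names here are assumptions)
fitbit <- read.table("fitbit_july_august.csv", sep = ",", skip = 1,
                     header = TRUE, stringsAsFactors = FALSE)
fitbit$Steps <- as.numeric(gsub(",", "", fitbit$Steps))
fitbit$Date  <- mdy(fitbit$Date)

# Health app export, as emailed by the QS app
health <- read.csv("health_export.csv", stringsAsFactors = FALSE)
health$Date <- ymd(health$Date)

# Merge the two sources by date and drop the days the FitBit reported 0 steps
steps <- merge(fitbit, health, by = "Date", suffixes = c(".fitbit", ".ios"))
steps <- subset(steps, Steps.fitbit > 0)

# Fit the regression of Health app counts on FitBit counts and check residuals
fit <- lm(Steps.ios ~ Steps.fitbit, data = steps)
summary(fit)          # slope, intercept, R-squared
plot(fit, which = 1)  # residuals vs. fitted values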

Questions I’m thinking of posing on the first day of my regression class this quarter:

  1. Which do you think is a more reliable counter of steps?
  2. How closely in agreement are these two step-counting tools? How would you measure this?
  3. What do the slope and intercept tell us?
  4. Why is there more variability for low FitBit step counts than for high?
  5. I often lose my FitBit. Currently, for instance, I have no idea where it is.  On those days, FitBit reports “0 steps”. (I removed the 0’s from this analysis.)  Can I use this regression line to predict the values on days I lose my FitBit?  With how much precision?

I think it will be helpful to talk about these questions informally, on the first day, before they have learned more formal methods for tackling them. And maybe I’ll add a few more months of data.

Ranked Choice Voting

The city of Minneapolis recently elected a new mayor. This is not newsworthy in and of itself; what is newsworthy is the method they used: ranked-choice voting. Ranked-choice voting is a method of voting that allows voters to rank multiple candidates in order of preference. In the Minneapolis mayoral election, voters ranked up to three candidates.

The interesting part of this whole thing was that it took over two days for the election officials to declare a winner. It turns out that the official procedure for calculating the winner of the ranked-choice vote involved cutting and pasting spreadsheets in Excel.

The technology coordinator at E-Democracy, Bill Bushey, posted the challenge of writing a program to calculate the winner of a ranked-choice election to the Twin Cities JavaScript and Python meetup groups. Winston Chang also posted it to the Twin Cities R Meetup group. While not a super difficult problem, it is complicated enough to make for a nice project, especially for new R programmers. (In fact, our student R group is doing this.)

The algorithm, as described by Bill Bushey, is:

  1. Create a data structure that represents a ballot with voters’ 1st, 2nd, and 3rd choices
  2. Count up the number of 1st choice votes for each candidate. If a candidate has 50% + 1 votes, declare that candidate the winner.
  3. Else, select the candidate with the lowest number of 1st choice votes, remove that candidate completely from the data structure, make the 2nd choice of any voter who voted for the removed candidate the new 1st choice (and the old 3rd choice the new 2nd choice).
  4. Goto 2

As an example consider the following sample data:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    Frank
    2    Frank     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

In this data, James has the most 1st choice votes (4) but it is not enough to win the election (a candidate needs 6 votes = 50% of 10 votes cast + 1 to win). So at this point we determine the least voted for candidate…Frank, and delete him from the entire structure:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    <del>Frank</del>
    2    <del>Frank</del>     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

Then the 2nd choice of any voter who voted for Frank becomes the new “1st” choice. This applies only to Voter #2 in the sample data. Thus Fred becomes Voter #2’s 1st choice and James becomes Voter #2’s 2nd choice:

Voter  Choice1  Choice2  Choice3
    1    James     Fred
    2     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

James still has the most 1st choice votes, but not enough to win (he still needs 6 votes!). Fred has the fewest 1st choice votes, so he is eliminated, and his voter’s 2nd and 3rd choices are moved up:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              
    7    Laura
    8    James
    9    David    Arnie
   10    David

James now has five 1st choice votes, but still not enough to win. Laura has the fewest 1st choice votes, so she is eliminated. Her voters’ 2nd and 3rd choices would move up, but Voters #4 and #7 listed no other choices:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    5    David 
    6    James              
    8    James
    9    David    Arnie
   10    David

James retains his lead with five 1st choice votes…but now he is declared the winner. Since Voters #4 and #7 do not have a 2nd or 3rd choice, they no longer count in the number of voters. Thus, to win, a candidate needs only 5 votes = 50% of the 8 remaining 1st choice votes + 1.
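Here is a minimal R sketch of this elimination loop, using the ten sample ballots above. It is just one way to code it (the solutions from the meetup groups and our student R group will look different), with empty strings standing in for missing choices:

# Sample ballots: each row is a voter; columns are the 1st, 2nd, and 3rd choices ("" = none)
ballots <- rbind(
  c("James", "Fred",  "Frank"),
  c("Frank", "Fred",  "James"),
  c("James", "James", "James"),
  c("Laura", "",      ""),
  c("David", "",      ""),
  c("James", "",      "Fred"),
  c("Laura", "",      ""),
  c("James", "",      ""),
  c("David", "Arnie", ""),
  c("David", "",      "")
)

repeat {
  active <- ballots[ballots[, 1] != "", , drop = FALSE]  # voters who still rank someone
  tally  <- table(active[, 1])                           # 1st choice counts
  needed <- floor(nrow(active) / 2) + 1                  # 50% of remaining voters + 1

  if (max(tally) >= needed) {
    cat("Winner:", names(which.max(tally)), "with", max(tally), "votes\n")
    break
  }

  # Eliminate the candidate with the fewest 1st choice votes everywhere on the ballots
  loser <- names(which.min(tally))
  ballots[ballots == loser] <- ""

  # Shift each voter's remaining choices up to fill the gaps
  ballots <- t(apply(ballots, 1, function(b) {
    b <- b[b != ""]
    c(b, rep("", 3 - length(b)))
  }))
}

Run on the sample data, this declares James the winner with 5 of the 8 remaining 1st choice votes, matching the hand calculation above.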

The actual data from Minneapolis include over 80,000 votes for 36 different candidates. There are also ballot issues such as overvoting, which occurs when a voter gives multiple candidates the same ranking, and undervoting, which occurs when a voter does not select a candidate.

The animated GIF below shows the results after each round of elimination for the Minneapolis mayoral election.

2013 Mayoral Race

The Minneapolis mayoral data is available on GitHub as a CSV file (along with some other smaller sample files to hone your programming algorithm). There is also a frequently asked questions webpage available from the City of Minneapolis regarding ranked choice voting.

In addition, you can listen to the Minnesota Public Radio broadcast in which the problems with the vote counting were discussed. The folks at the R Users Group Meeting were featured, and Winston brought the house down when, commenting on the R program that computed the winner within a few seconds, he said, “it took me about an hour and a half to get something usable, but I was watching TV at the time.”

See the R syntax I used here.


Facebook Analytics

WolframAlpha has a tool that will analyze your Facebook network. I saw this a while ago, but HollyLynne reminded me of it recently, and I tried it out. You need to give the app(?) permission to access your account (which I am sure means access to your data for Wolfram), after which you are given all sorts of interesting, pretty info. Note that you can also opt to have Wolfram track your data in order to determine how your network is changing.

Some of the displays are kind of informative, but others are not. Consider this scatterplot(???)-type plot entitled “Weekly Distribution”. Tufte could include it in his next book of worthless graphs.


There are other analyses that are more useful. For example, I learned that my post announcing the Citizen Statistician blog was the most liked post I have, while the post showing photographic evidence that I held a baby as far back as 1976 was the most commented.

This plot was also interesting…too bad it is a pie chart (sigh).


There is also a ton of other information, such as which friend has the most friends (Jayne at 1819), your youngest and oldest friends based on the reported birthdays, photos that are tagged the most, word clouds of your posts, etc.

This network was my favorite of them all. It shows the social insiders and outsiders in my network of friends, and identifies social connectors, neighbors, and gateways.


Once again, kind of a cool tool that works with the existing data, but there does not seem to be a way to obtain the data in a workable format.

NCAA Basketball Visualization

It is time for the NCAA Basketball Tournament. Sixty-four teams dream big (er…I mean 68…well actually, by now, 64), and schools like Iona and Florida Gulf Coast University (go Eagles!) are hoping that Robert Morris’s astounding victory in the N.I.T. isn’t just a flash in the pan.

My favorite part is filling out the bracket–see it below. (Imagine that…a statistician’s favorite part of the whole thing is making predictions.) Even President Obama filled out a bracket [see it here].

Andy's Bracket

My method for making predictions? I use a complicated formula that involves “coolness” factors of team mascots, alphabetical order (but only conditional on particular seedings), waving of hands, and guesswork. But that was only because I didn’t have access to my student Rodrigo Zamith’s latest blog post until today.

Rodrigo has put together side-by-side visualizations of many of the pertinent basketball statistics (e.g., points scored, rebounds, etc.) using the R package ggplot2. This would have been very helpful in my decisions where the mascot measure failed me and I was left with a toss-up (e.g., Oklahoma vs. San Diego State).

Preview of the March 22 Game between Minnesota and UCLA

Rodrigo has also made the data available on his blog, not only for the 2012-2013 season but for the previous two seasons as well. Check it out at Rodrigo’s blog!

Now, all I have to do is hang tight until the 8:57pm (CST) game on March 22. Judging from the comparisons, it will be tight.


Gun deaths and data

This article at Slate is interesting for a number of reasons. First, it offers a link to a data set listing the names and dates of the 325 people known to have been killed by guns since December 14, 2012. Slate is to be congratulated for providing data in a format that is easy for statistical software to read. (Still, some cleaning is required. For example, ages include a mix of numbers and categorical values.) Second, the data are the result of an informal compilation by an unknown tweeter, although s/he is careful to give sources for each report. (And, as Slate points out, deaths are more likely to be un-reported than falsely reported.) Data include names, dates, city and state, longitude/latitude, and age of victim. Third, data such as these become richer when paired with other data, and I think it would be a great classroom project to create a data-driven story in which students used additional data sources to provide deeper context for these data. An obvious choice is to extend the dataset back in time, possibly using official crime data (but I am probably hopelessly naive in thinking this is a simple task).
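As a small illustration of the cleaning involved, the age column can be coerced to numbers, with the categorical entries becoming NA. The file and column names below are my assumptions about the layout, not the ones Slate uses:

# Read the Slate data and coerce the age column to numeric;
# non-numeric (categorical) entries become NA, which we can then count
gundeaths <- read.csv("slate_gun_deaths.csv", stringsAsFactors = FALSE)
gundeaths$age_num <- suppressWarnings(as.numeric(gundeaths$age))
table(is.na(gundeaths$age_num))   # how many ages could not be parsed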


Contributions to the 2012 Presidential Election Campaigns

With fewer than two weeks left till the US presidential elections, motivating class discussion with data related to the candidates, elections, or politics in general is quite easy. So for yesterday’s lab we used data released by The Federal Election Commission on contributions made to 2012 presidential campaigns. I came across the data last week, via a post on The Guardian Datablog. The post has a nice interactive feature for analyzing data from all contributions. The students started the lab by exploring the data using this applet, and then moved on to analyzing the data in R.

The original dataset can be found here. You can download data for all contributions (~620 MB csv file), or contributions by state (~16 MB for North Carolina, for example). The complete dataset has information on over 3.3 million contributions. The students worked with a random sample of 10,000 observations from this dataset. I chose not to use the entire population data because (1) it’s too large to work with efficiently in an introductory stats course, and (2) we’re currently covering inference, so a setting where you start with a random sample and infer something about the population felt more natural.

While the data come in csv format, loading the data into R is slightly problematic. For some reason, all rows except the header row end with a comma, and hence naively loading the data into R using the read.csv function results in an error: R sees the extra comma as indicating an additional column and complains that the header row does not have the same length as the rest of the dataset. Below are a couple of ways to resolve this problem:

  • One solution is to simply open the csv file in Excel, and resave. This eliminates the spurious commas at the end of each line, making it possible to load the data using the read.csv function. However this solution is not ideal for the large dataset of all contributions.
  • Another solution for loading the population data (somewhat quickly) and taking a random sample is presented below:
# Read the raw lines and draw a random sample of rows (excluding the header)
x = readLines("P00000001-ALL.csv")
n = 10000 # desired sample size
s = sample(2:length(x), n)

# Parse the header separately, since it lacks the trailing comma
header = strsplit(x[1], ",")[[1]]

# Read the sampled lines, drop the extra column created by the trailing comma,
# and attach the header names
d = read.csv(textConnection(x[s]), header = FALSE)
d = d[, -ncol(d)]
colnames(d) = header

Our lab focused on comparing average contribution amounts among elections and candidates. But these data could also be used to compare contributions from different geographies (city, state, zip code), or to explore characteristics of contributions from individuals of various occupations, individuals vs. PACs etc.
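For instance, with the sample d from above, a comparison of average contribution amounts by candidate might look like the line below. The column names cand_nm and contb_receipt_amt are my recollection of the FEC file layout, so check names(d) before relying on them:

# Average contribution amount for each candidate (column names assumed)
aggregate(contb_receipt_amt ~ cand_nm, data = d, FUN = mean)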

If you’re interested, there should still be enough time for you to squeeze an analysis/discussion of these data in your class before the elections. But even if not, the data should still be interesting after November 6.

Red Bull Stratos Mission Data

Yesterday (October 14, 2012), Felix Baumgartner made history by becoming the first person to break the speed of sound during a free fall. He also set some other records (e.g., longest free fall) during the Red Bull Stratos Mission, which was broadcast live on the internet. Kind of cool, but imagine the conversation that took place when someone daydreamed this one up…

Red Bull Creative Person: What if we got some idiot to float up into the stratosphere in a space capsule and then had him step out of it and free fall for four minutes, breaking the sound barrier?

Another Red Bull Creative Person: Great idea! Let’s also broadcast it live on the internet.

Well, anyway, after the craziness ensued, it was suggested on Facebook that “I think this data should be on someone’s blog!” Rising to the bait, I immediately looked at the mission page, but the data were no longer there. Thank goodness for Wikipedia [Red Bull Stratos Mission Data]. The data can be copied and pasted into an Excel sheet, or read into R using the readHTMLTable() function from the XML package.

library(XML)
mission <- readHTMLTable(
  doc = "",      # URL of the Wikipedia page with the mission data
  header = TRUE
)[[1]]           # readHTMLTable() returns a list of tables; assuming the first is the one we want

We can then write it to an external file (I called it Mission.csv and put it on my desktop) using the write.csv() function.

write.csv(mission,
  file = "/Users/andrewz/Desktop/Mission.csv",
  row.names = FALSE,
  quote = FALSE
)

Opening the new file in a text editor, we see some issues to deal with (these are also apparent from looking at the data on the Wikipedia page); a sketch of scripting these fixes in R follows the list.

  • The first line is the first table header, Elevation Data, which spanned three columns in the Wikipedia page. Delete it.
  • The last row repeats the variable names. Delete it.
  • Change the variable names in the current first row to be statistical software compliant (e.g., remove the commas and spaces from each variable name).
  • Remove the commas from the values in the last column. With a comma separated value (CSV) file, they are trouble.
  • There are nine rows which have parentheses around their value in the last column. I don’t know what this means. For now, I will remove those values.
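For anyone who would rather script these edits than make them by hand, here is a rough sketch in R. The regular expressions and the cleaned file name are my own illustration; the original edits were done in a text editor:

# Read the raw export as plain text lines
raw <- readLines("/Users/andrewz/Desktop/Mission.csv")

raw <- raw[-1]                                             # drop the spanned "Elevation Data" header line
raw <- raw[-length(raw)]                                   # drop the re-printed variable names at the bottom
raw <- gsub("(?<=[0-9]),(?=[0-9])", "", raw, perl = TRUE)  # strip commas inside the numbers
raw <- gsub("\\([^)]*\\)", "", raw)                        # blank out the parenthesized values
# (the variable names in raw[1] would still need to be made software-friendly)

writeLines(raw, "/Users/andrewz/Desktop/Mission-clean.csv")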

The file can be downloaded here.

Then you can plot (or analyze) away to your heart’s content.

# Read the cleaned data into R
mission <- read.csv(file = "/Users/andrewz/Desktop/Mission.csv")

# Load the ggplot2 library
library(ggplot2)

# Plot speed vs. time (the geom was not shown in the original post; geom_line() is assumed)
ggplot(data = mission, aes(x = Time, y = Speed)) +
  geom_line()

# Plot elevation vs. time
ggplot(data = mission, aes(x = Time, y = Elevation)) +
  geom_line()

Since I have no idea what these really represent other than what the variable names tell me, I cannot interpret these very well. Perhaps someone else can.

TV Show hosts

A little bit ago [July 19, 2012 — so I’m a little behind], the L.A. Times ran an article about whether TV hosts are pulling their own weight, salary-wise. (What is the real value of TV stars and personalities?) I took their data table and put it in CSV format, and added a column called “epynomious”, which indicates whether the show is named after the host. (This apparently doesn’t explain the salary variation.) A later letter to the editor pointed out that the analysis doesn’t take into account how frequently the show must be recorded, and hence how often the host must come to work. Your students might enjoy adding this variable and analyzing the data to see if it explains anything. Maybe this is a good candidate for ‘enrichment’ via Google Refine? TV salaries from LA Times

More on FitBit data

First the good news:


Your data belongs to you!

And now the bad: it costs you $50/year for your data to truly belong to you. With a ‘premium’ membership, you can visit your data as often as you choose. If only Andy had posted sooner, I would have saved $50. But, dear readers, in order to explore all avenues, I spent the bucks. And here’s some data (a screenshot–I don’t want you analyzing *my* data!)

It’s pretty easy and painless.  Next I’ll try Andy’s advice, and see if I can save $50 next year.