ggplot2 Pinterest

I don’t understand the website Pinterest, but it looks pretty (especially on the iPad), and an undergraduate student said it was the greatest thing since Facebook, so I thought I would give it a shot. The idea is that Pinterest “lets you organize and share all the beautiful things you find on the web.” You organize beautiful things by creating a “board” (a page), and then adding “pins” (links to websites).

My thought was…plots in ggplot2 are beautiful…I will create a board with useful links/tutorials for creating ggplot2 plots!

I already have three followers. Now you can follow too.

http://pinterest.com/zief0002/ggplot2/

Contributions to the 2012 Presidential Election Campaigns

With fewer than two weeks left till the US presidential elections, motivating class discussion with data related to the candidates, elections, or politics in general is quite easy. So for yesterday’s lab we used data released by The Federal Election Commission on contributions made to 2012 presidential campaigns. I came across the data last week, via a post on The Guardian Datablog. The post has a nice interactive feature for analyzing data from all contributions. The students started the lab by exploring the data using this applet, and then moved on to analyzing the data in R.

The original dataset can be found here. You can download data for all contributions (~620 MB csv file), or contributions by state (~16 MB for North Carolina, for example). The complete dataset has information on over 3.3 million contributions. The students worked with a random sample of 10,000 observations from this dataset. I chose not to use the entire population data because (1) it’s too large to work with efficiently in an introductory stats course, and (2) we’re currently covering inference, so a setting where you start with a random sample and infer something about the population felt more natural.

While the data come in csv format, loading them into R is slightly problematic. For some reason, all rows except the header row end with a comma. Naively loading the data into R using the read.csv function therefore results in an error: R sees the extra comma as indicating an additional column and complains that the header row does not have the same number of fields as the rest of the dataset. Below are a couple of ways to resolve this problem:

  • One solution is to simply open the csv file in Excel and re-save it. This eliminates the spurious commas at the end of each line, making it possible to load the data using the read.csv function. However, this solution is not practical for the large file of all contributions.
  • Another solution for loading the population data (somewhat quickly) and taking a random sample is presented below:
# Read the raw file as lines of text
x = readLines("P00000001-ALL.csv")
n = 10000 # desired sample size
# Sample n line numbers, skipping the header on line 1
s = sample(2:length(x), n)
# Grab the variable names from the header row
header = strsplit(x[1], ",")[[1]]
# Parse the sampled lines; the trailing comma yields an extra empty column
d = read.csv(textConnection(x[s]), header = FALSE)
# Drop the extra last column and attach the variable names
d = d[, -ncol(d)]
colnames(d) = header

Our lab focused on comparing average contribution amounts among elections and candidates. But these data could also be used to compare contributions from different geographies (city, state, zip code), or to explore characteristics of contributions from individuals of various occupations, individuals vs. PACs, etc.
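
For instance, here is a minimal sketch of the candidate comparison using the sample d created above. The column names cand_nm and contb_receipt_amt are assumptions based on the FEC file’s documented layout, so check them against your download:

# Average contribution amount by candidate in the sample of 10,000.
# cand_nm and contb_receipt_amt are assumed FEC column names.
sort(tapply(d$contb_receipt_amt, d$cand_nm, mean), decreasing = TRUE)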

If you’re interested, there should still be enough time for you to squeeze an analysis/discussion of these data in your class before the elections. But even if not, the data should still be interesting after November 6.

Citizen Scientists in Space…

The L.A. Times had an interesting article about how a pair of ‘citizen scientists’ discovered a planet with four suns.  I would say that a more accurate term for the pair would be ‘citizen data miners’, because essentially the astronomy community crowd-sources data mining by providing reams of data for anyone to examine.

It seemed timely to me, coming on the heels of a seminar at the UCLA Center for Applied Statistics by Kiri Wagstaff on automated procedures for discovering interesting features in large data sets. Kiri’s working on algorithms that use the pixels of photos (such as those produced by the Mars Rover) as data and flag unusual features so that scientists can examine them.  Basically, the algorithms weed out things scientists already know and try to present them with the unknown.

The data behind the “four suns” discovery come from Planet Hunters, which is definitely worth a visit. Its parent website, Zooniverse, offers opportunities to collect data, for example by transcribing historical ship logs for use in climate modeling. Who knows, you might make a discovery!

What crowd-sourcing projects would you like to see that would benefit statistics?

Recoding Variables in R: Pedagogic Considerations

I was creating a dataset this last week in which I had to partition the observed responses to show how the ANOVA model partitions the variability. I had the observed Y (in this case prices for 113 bottles of wine), and a categorical predictor X (the region of France that each bottle of wine came from). I was going to add three columns to this data, the first showing the marginal mean, the second showing the effect, and the third showing the residual. To create the variable indicating the effect, I essentially wanted to recode a particular region to a particular effect:

  • Bordeaux ==> 9.11
  • Burgundy ==> 4.20
  • Languedoc ==> –9.30
  • Rhone ==> –0.75
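
(For reference, if the effects are defined as each region’s mean price minus the overall mean price, here is a quick sketch of where those four numbers come from; the column name Price is an assumption about the wine data.)

# Region effects as group mean price minus grand mean price.
# Price is an assumed column name for the wine prices.
grand_mean <- mean(wine$Price)
round(tapply(wine$Price, wine$Region, mean) - grand_mean, 2)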

As I was considering how to do this recoding, it struck me that several options were available. Here are two solutions that come up when Googling how to do this.

Use the recode() function from the car package.

library(car)
wine$Effect <- recode(wine$Region,
  " 'Bordeaux' = 9.11;
    'Burgundy' = 4.20;
    'Languedoc' = -9.30;
    'Rhone' = -0.75 " )
This is a commonly suggested solution. The strings inside the quotation marks, however, make it likely that students (and teachers) will commit a syntax error. This is especially true when recoding a categorical variable into another categorical variable. R-wise (it’s a technical term) it also produces a factor, even though it is clear that the intent was to produce numerical values. This is, of course, easily fixable using as.numeric(as.character()), but the extra conversion can lead to confusion.
Another solution is to use indexing.
# Start by assigning the Bordeaux effect to every row...
wine$Effect <- 9.11
# ...then overwrite the rows for the other three regions
wine$Effect[wine$Region == "Burgundy"] <- 4.20
wine$Effect[wine$Region == "Languedoc"] <- -9.30
wine$Effect[wine$Region == "Rhone"] <- -0.75
This solution is canonical in that it is clean and the R code is concise. (Note: This is what I ended up using to create this re-coded variable.) In my experience, however, this also means that students without a programming background don’t initially understand it. This alone makes it unattractive pedagogically.

A better solution pedagogically seems to be to create a new data frame of key-value pairs (a lookup table, akin to what computer scientists call a hash table) and then use the join() function from the plyr package to ‘join’ the original data frame and the new data frame.

library(plyr)

key <- data.frame(
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
  Effect = c(9.11, 4.20, -9.30, -0.75)
  )
wine <- join(wine, key, by = "Region")

For me this is a useful way to teach how to recode variables. It has a direct link to the Excel VLOOKUP function, and also to ideas of relational databases. It also allows more generalizability in terms of being able to merge data sets using a common variable.

R-wise, it is not difficult syntax, since almost every student has successfully used the data.frame() function to create a data frame. The join() function is also easily explained.
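
For what it’s worth, base R’s merge() performs the same key-based lookup without loading an extra package, though unlike join() it may reorder the rows:

# Base-R equivalent of the join above. merge() sorts the result by the key
# column by default, whereas plyr::join() preserves the original row order.
wine <- merge(wine, key, by = "Region")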

Nate Silver’s New Book

I’ve been reading and greatly enjoying Nate Silver’s book, The Signal and the Noise: Why So Many Predictions Fail—and Some Don’t.  I’d recommend the book based on the introduction and first chapter alone. (And, no, that’s not because that’s all I’ve read so far.  It’s because they’re that good.)  If you’re the sort who skips introductions, I strongly suggest you become a new sort and read this one. It’s a wonderful essay about the dangers of too much information, and the need to make sense of it.  Silver makes the point that, historically, when we’ve been faced with more information than we can handle, we tend to pick and choose which ‘facts’ we wish to believe.  Sounds like a presidential debate, no?

Another thing to like about the book is the argument it provides against the Wired Magazine view that Big Data means the end of scientific theory.  Chapter by chapter, Silver describes the very important role that theory and modeling play in making (successful) predictions.  In fact, a theme of the book is that prediction is a human endeavor, despite the attention data scientists pay to automated algorithmic procedures.  “Before we can demand more of our data, we need to demand more of ourselves.”  In other words, the Data Deluge requires us to find useful information, not just any old information. (Which is where we educators come in!)

The first chapter makes a strong argument that the financial crisis was, to a great extent, a failure to understand fundamentals of statistical modeling, in particular to realize that the models are not the thing they model.  Models are shaped by data but run on assumptions, and when the assumptions are wrong, the predictions fail.  Chillingly, Silver points out that recoveries from financial crises tend to be much, much slower than recoveries from economic crises and, in fact, some economies never recover.

Other chapters talk about baseball, weather, earthquakes, poker and more.  I particularly enjoyed the weather chapter because, well, who doesn’t enjoy talking about the weather? For me, perhaps because we are in the midst of elections, it also raised questions about the role of the U.S. federal government in supporting the economy.  Weather prediction plays a big role in our economic infrastructure, even though many people tend to be dismissive of our ability to predict the weather.  So it was interesting to see that, in fact, the government agencies do predict weather better than the private prediction firms (such as The Weather Channel), and are much better than local news channels’ predictions.  In fact, as Silver explains, the marketplace rewards poor predictions (at least when it comes to predicting rain).  For me, this underlines the importance of a ‘neutral’ party.

As I think about preparing students for the Deluge, I think that teaching prediction should take priority over teaching inference.  Inference is important, but it is a specialized skill, and so is not needed by all.  Prediction, on the other hand, is inherently important, and has been for millennia. Yes, prediction is a type of inference, but prediction and inference are not the same thing.  As Silver points out, estimating a candidate’s support for president is different from predicting whether or not the candidate will win. (Which leads me to propose a new slogan: “Prediction: Inference for Tomorrow!”  Or “Prediction: Inference for Procrastinators!”)

Much of this may be beyond the realm of introductory statistics, since some of the predictive models are complex.  But the basics are important for intro stats students.  All students should understand what a statistical model is and what it is not.  Equally importantly, they should understand how to evaluate a model.  And I don’t mean that they should learn about r-squared (or only about r-squared).  They should learn about the philosophy of measuring model performance.  In other words, intro stats students should understand why many predictions fail, but some don’t, and how to tell the difference.

So let’s talk specifics.  Post your comments on how you teach your students about prediction and modeling.

Red Bull Stratos Mission Data

Yesterday (October 14, 2012), Felix Baumgartner made history by becoming the first person to break the speed of sound during a free fall. He also set some other records (e.g., longest free fall) during the Red Bull Stratos Mission, which was broadcast live on the internet. Kind of cool, but imagine the conversation that took place when someone dreamed this one up…

Red Bull Creative Person: What if we got some idiot to float up into the stratosphere in a space capsule and then had him step out of it and free fall for four minutes, breaking the sound barrier?

Another Red Bull Creative Person: Great idea! Let’s also broadcast it live on the internet.

Well anyway, after the craziness ensued, it was suggested on Facebook that, “I think this data should be on someone’s blog!” Rising to the bait, I immediately looked at the mission page, but the data were no longer there. Thank goodness for Wikipedia [Red Bull Stratos Mission Data]. The data can be copied and pasted into an Excel sheet, or read into R using the readHTMLTable() function from the XML package.

# readHTMLTable() is in the XML package; if the page has more than one table,
# it returns a list of data frames (use the which argument to pick one).
library(XML)
mission <- readHTMLTable(
  doc = "http://en.wikipedia.org/wiki/Red_Bull_Stratos/Mission_data",
  header = TRUE
  )

We can then write it to an external file (I called it Mission.csv and put it on my desktop) using the write.csv() function.

write.csv(mission,
  file = "/Users/andrewz/Desktop/Mission.csv",
  row.names = FALSE,
  quote = FALSE
  )

Opening the new file in a text editor, we see some issues to deal with (these are also apparent from looking at the data on the Wikipedia page); a sketch of handling them in R follows the list.

  • The first line is the first table header, Elevation Data, which spanned three columns in the Wikipedia page. Delete it.
  • The last row repeats the variable names. Delete it.
  • Change the variable names in the current first row to be statistical software compliant (e.g., remove the commas and spaces from each variable). My first row looks like the following:
Time,Elevation,DeltaTime,Speed
  • Remove the commas from the values in the last column. In a comma-separated values (CSV) file, they are trouble.
  • There are nine rows which have parentheses around their value in the last column. I don’t know what this means. For now, I will remove those values.
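
If you prefer to script these fixes rather than edit the file by hand, here is a rough sketch that cleans the data frame before writing the CSV. It assumes mission already holds the single table of interest as a data frame with the four columns described above (if readHTMLTable() returned a list, extract the table first):

# Rough cleanup sketch; the column names and positions below are assumptions
# based on the issues listed above.
names(mission) <- c("Time", "Elevation", "DeltaTime", "Speed")  # software-friendly names
mission <- mission[-nrow(mission), ]        # drop the repeated variable-names row
speed <- gsub(",", "", mission$Speed)       # strip thousands-separator commas
speed[grepl("[()]", speed)] <- NA           # drop the parenthesized values
mission$Speed <- as.numeric(speed)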

The file can be downloaded here.

Then you can plot (or analyze) away to your heart’s content.

# read in data to R
mission <- read.csv(file = "/Users/andrewz/Desktop/Mission.csv")

# Load ggplot2 library
library(ggplot2)

# Plot speed vs. time
ggplot(data = mission, aes(x = Time, y = Speed)) +
  geom_line()

# Plot elevation vs. time
ggplot(data = mission, aes(x = Time, y = Elevation)) +
  geom_line()

Since I have no idea what these really represent other than what the variable names tell me, I cannot interpret these very well. Perhaps someone else can.

Current Population Survey Data using R

The Current Population Survey (CPS) is a statistical survey conducted by the United States Census Bureau for the Bureau of Labor Statistics. The data collected are used to produce a monthly report on employment in the United States.

Although the CPS data are available, to this point they have really only been easy to work with for SPSS, Stata, or SAS users. A new blog is now making it easy for R users to obtain and analyze these data. From their About/FAQ page:

This blog announces obsessively-detailed instructions to analyze us government survey data with free tools – the r language and the survey package.

They provide commented R scripts describing how to load, clean, configure, and analyze many currently available data sets. Each script contains information on how to automatically download every microdata file from every survey year as an R data file onto your local disk. In addition, they detail how to use R to match the published results from other statistical packages. They also provide videos showing how to do much of what they cover in the scripts.
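
To give a flavor of the design-based analyses the survey package supports, here is a minimal sketch using the package’s own bundled api example data (not CPS microdata):

# Minimal survey-package example with its built-in api data (not CPS).
library(survey)
data(api)  # California Academic Performance Index samples shipped with the package
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
svymean(~api00, dclus1)  # design-based mean and standard error of 2000 API scores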

Begin your foray into this and other government data sources here.

TV Show Hosts

A little bit ago [July 19, 2012, so I’m a little behind], the L.A. Times ran an article about whether TV hosts are pulling their own weight, salary-wise. (What is the real value of TV stars and personalities?)  I took their data table and put it in a CSV format, and added a column called “epynomious”, which indicates whether the show is named after the host.  (This apparently doesn’t explain the salary variation.)  A later letter to the editor pointed out that the analysis doesn’t take into account how frequently the show must be recorded, and hence how often the host must come to work.  Your students might enjoy adding this variable and analyzing the data to see if it explains anything. Maybe this is a good candidate for ‘enrichment’ via Google Refine?  TV salaries from LA Times