A Course in Data and Computing Fundamentals


Daniel Kaplan and Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which was offered this semester at Macalester College. This course is part of a larger research and teaching effort funded by the Howard Hughes Medical Institute (HHMI) to help students understand the fundamentals and structures of data, especially big data. [Read more about the project in Macalester Magazine.]

The course introduces students to R and covers topics such as merging data sources, data formatting and cleaning, clustering and text mining. Within the course, the more specific goals are:

  • Introducing students to the basic ideas of data presentation
    • Graphics modalities
    • Transforming and combining data
    • Summarizing patterns with models
    • Classification and dimension reduction
  • Developing the skills students need to make effective data presentations
    • Access to tabular data
    • Re-organization of tabular data for combining different sources
    • Proficiency with basic techniques for modeling, classification, and dimension reduction
    • Experience with choices in data presentation
  • Developing the confidence students need to work with modern tools
    • Computer commands
    • Documentation and work-flow

Kaplan and Shoop have put their entire course online using RPubs (the web publishing system hosted by RStudio).

Datasets handpicked by students

I’m often on the hunt for datasets that will not only work well with the material we’re covering in class, but will (hopefully) pique students’ interest. One sure choice is to use data collected from the students, as it is easy to engage them with data about themselves. However, I think it is also important to open their eyes to the vast amount of data collected and made available to the public. It’s always a guessing game whether a particular dataset will actually be interesting to students, so learning from the datasets they choose to work with seems like a good idea.

Below are a few datasets that I haven’t seen in previous project assignments. I’ve included the research question the students chose to pursue, but most of these datasets have multiple variables, so you might come up with different questions.

1. Religious service attendance and moral beliefs about contraceptive use: The data are from a February 2012 Pew Research poll. To download the dataset, go to http://www.people-press.org/category/datasets/?download=20039620. You will be prompted to fill out some information and will receive a zipped folder including the questionnaire, methodology, the “topline” (distributions of some of the responses), as well as the raw data in SPSS format (.sav file). Below I’ve provided some code to load this dataset in R, and then to clean it up a bit. Most of the code should apply to any dataset released by Pew Research.

# read data
library(foreign)
d_raw = as.data.frame(read.spss("Feb12 political public.sav"))

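# clean up: strip interviewer instructions from the response labels
# (note: str_replace() returns character vectors, so every column becomes character)
library(stringr)
d = d_raw
# assigning into d[] keeps d a data frame instead of turning it into a plain list
d[] = lapply(d, function(x) str_replace(x, " \\[OR\\]", ""))
d[] = lapply(d, function(x) str_replace(x, "\\[VOL\\. DO NOT READ\\] ", ""))
d[] = lapply(d, function(x) str_replace(x, "\222", "'"))
d[] = lapply(d, function(x) str_replace(x, " \\(VOL\\.\\)", ""))
d$partysum = factor(d$partysum)
levels(d$partysum) = c("Refused","Democrat","Independent","Republican","No preference","Other party")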

The student who found this dataset was interested in examining the relationship between religious service attendance and views on contraceptive use. The code provided below can be used to organize the levels of these variables in a meaningful way, and to take a quick peek at a contingency table.

# variables of interest
d$attend = factor(d$attend, levels = c("More than once a week","Once a week", "Once or twice a month", "A few times a year", "Seldom", "Never", "Don't know/Refused"))
d$q40a = factor(d$q40a, levels = c("Morally acceptable","Morally wrong", "Not a moral issue", "Depends on situation", "Don't know/Refused"))
table(d$attend, d$q40a)
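
As a small illustration of why the level order matters, here is a sketch on made-up responses (not the Pew data; the names attend_toy and view_toy are hypothetical). Setting the levels explicitly puts the table rows in a substantive order rather than alphabetical order, and prop.table() turns the counts into row proportions.

```r
# Made-up responses for illustration only (not the Pew data)
attend_toy <- c("Never", "Once a week", "Seldom", "Once a week", "Never", "Seldom")
view_toy   <- c("Morally acceptable", "Morally wrong", "Morally acceptable",
                "Morally acceptable", "Not a moral issue", "Morally wrong")

# Explicit levels put the categories in a meaningful order
attend_toy <- factor(attend_toy, levels = c("Once a week", "Seldom", "Never"))
view_toy   <- factor(view_toy, levels = c("Morally acceptable", "Morally wrong",
                                          "Not a moral issue"))

tab <- table(attend_toy, view_toy)
tab

# Row proportions: distribution of views within each attendance group
round(prop.table(tab, margin = 1), 2)
```

Row proportions are usually the more useful view here, since the attendance groups can differ a lot in size.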

2. Social network use and reading: Another student was interested in the relationship between number of books read in the last year and social network use. This dataset is provided by the Pew Internet and American Life Project. You can download a .csv version of the data file at http://www.pewinternet.org/Shared-Content/Data-Sets/2012/February-2012–Search-Social-Networking-Sites-and-Politics.aspx. The questionnaire can also be found at this website. One of the variables of interest, number of books read in the past 12 months (q2), is  recorded using the following scheme:

  • 0: none
  • 1-96: exact number
  • 97: 97 or more
  • 98: don’t know
  • 99: refused

This could be used to motivate a discussion about the importance of doing exploratory data analysis prior to jumping into running inferential tests (like asking “Why are there no people who read more than 99 books?”) and also pointing out the importance of checking the codebook.
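
One way to handle such a coding scheme before any analysis might look like the sketch below. The vector q2 here is a made-up stand-in for the real column, and the names books and top_coded are hypothetical.

```r
# Hypothetical stand-in for the q2 column (made-up values, not the Pew data)
q2 <- c(0, 3, 12, 97, 98, 99, 25)

# 98 ("don't know") and 99 ("refused") are codes, not counts: treat them as missing
books <- ifelse(q2 %in% c(98, 99), NA, q2)

# 97 means "97 or more", so flag top-coded answers rather than trusting the value
top_coded <- !is.na(books) & books == 97

summary(books)
table(top_coded)
```

Leaving 98 and 99 in as if they were counts would inflate the mean and distort any plot of the distribution.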

3. Parental involvement and disciplinary actions at schools: The 2007-2008 School Survey on Crime and Safety, conducted by the National Center for Education Statistics, contains school level data on crime and safety. The dataset can be downloaded at http://nces.ed.gov/surveys/ssocs/data_products.asp. The SPSS formatted version of the data file (.sav) can be loaded in R using the read.spss() function in the foreign package (used above in the first data example). The variables of interest for the particular research question the student proposed are parent involvement in school programs (C0204) and number of disciplinary actions (DISTOT08), but the dataset can be used to explore other interesting characteristics of schools, like type of security guards, whether guards are armed with firearms, etc.

4. Dieting in school-aged children: The Health Behavior in School-Aged Children is an international survey on health-risk behaviors of children in grades 6 through 10. The 2005-2006 US dataset can be found at http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/28241. You will need to log in to download the dataset, but you can do so using a Google or a Facebook account. There are multiple versions of the dataset posted, and the Delimited version (.tsv) can be easily loaded in R using the read.delim() function. The student who found this dataset was interested in exploring the relationship between race of the student (Q6_COMP) and whether or not the student is on a diet to lose weight (Q30). The survey also asks questions on body image, substance use, bullying, etc. that may be interesting to explore.

One common feature of the above datasets is that they are all observational/survey based as it’s more challenging to find experimental (raw) datasets online. Any suggestions?

NCAA Basketball Visualization

It is time for the NCAA Basketball Tournament. Sixty-four teams dream big (er…I mean 68…well actually by now, 64) and schools like Iona and Florida Gulf Coast University (go Eagles!) are hoping that Robert Morris’s astounding victory in the N.I.T. isn’t just a flash in the pan.

My favorite part is filling out the bracket–see it below. (Imagine that…a statistician’s favorite part of the whole thing is making predictions.) Even President Obama filled out a bracket [see it here].

Andy's Bracket

To make my predictions, I used a complicated formula that involves “coolness” factors of team mascots, alphabetical order (but only conditional on particular seedings), waving of hands, and guesswork. But that was only because I didn’t have access to my student Rodrigo Zamith’s latest blog post until today.

Rodrigo has put together side-by-side visualizations of many of the pertinent basketball statistics (e.g., points scored, rebounds, etc.) using the R package ggplot2. This would have been very helpful in my decisions where the mascot measure failed me and I was left with a toss-up (e.g., Oklahoma vs. San Diego State).

Preview of the March 22 Game between Minnesota and UCLA

Rodrigo has also made the data available from his blog, not only for the 2012-2013 season but also for the previous two seasons. Check it out at Rodrigo’s blog!

Now, all I have to do is hang tight until the 8:57pm (CST) game on March 22. Judging from the comparisons, it will be tight.

Happy Birthday Florence Henderson

As a celebration of Florence Henderson’s 79th birthday (on February 14), I have created this scatterplot to use in my regression course.

ValDay

The plot depicts the relationship between time spent on mathematics homework outside of school (expressed as z-scores) and mathematics achievement scores (expressed as T-scores, M=50, SD=10) for 200 8th-graders taken from the 1988 National Education Longitudinal Study. The color–in a display of very poor data science–is just randomly applied to the observations rather than meaning anything substantial. (Blogger’s Note: I think it fits with the spirit of Valentine’s Day…a gratuitous, yet meaningless, gesture intended to make the receiver feel all gooshy.)

I created the plot using the Valentine package (available here) which applies a Valentine’s Day theme to ggplot. I also applied a picture of Cupid into the background of the plot and used hearts instead of points to plot the observations. Lastly, I changed the default color and fill on the regression smoother to more aptly fit the color scheme.

Below I will explain the how-to of creating this plot.

Reading in the NELS Data

First, I read the NELS data into R. These data and their codebook are available via my regression course website.

nels <- read.csv("http://www.tc.umn.edu/~zief0002/Data/NELS.csv")

Using Hearts Instead of Points

I first needed to find an image of a heart that I liked. For icons of all sorts, I generally use The Noun Project. (This particular heart can be found here.) All of the images at The Noun Project are SVG files. This makes them very useful for display in browsers. To use them in ggplot, I converted the SVG image to a PNG file using the free image manipulation program GIMP. (Perhaps you can use the SVG format directly without converting it, but I have never done that, so I don’t know.)

Using GIMP, I also replaced the black color with the hexadecimal color “ad97c6” by selecting Color > Map > Color Exchange. (Double-clicking the color swatch under “To Color” allows for color entry in hexadecimal.) After this I saved the heart as “heart1.png”, and repeated the process four more times, but using the colors “e58cbc” (heart2.png), “f2935b” (heart3.png), “9fc8b6” (heart4.png), and “eddc74” (heart5.png). (Note: These colors are the same colors I chose for the Valentine’s Day theme color and fill palettes and are based on the colors of the candy hearts you get in elementary school.)

I then used the png package to read the five PNG files into R. (Note: For anyone who doesn’t want to go through the hassle of coloring the hearts and reformatting both them and Cupid, I have made those files available for download here.)

library(png)
h1 <- readPNG("/Users/andrewz/Desktop/heart1.png", TRUE)
h2 <- readPNG("/Users/andrewz/Desktop/heart2.png", TRUE)
h3 <- readPNG("/Users/andrewz/Desktop/heart3.png", TRUE)
h4 <- readPNG("/Users/andrewz/Desktop/heart4.png", TRUE)
h5 <- readPNG("/Users/andrewz/Desktop/heart5.png", TRUE)

To randomly assign each observation to one of the five hearts (h1–h5), I used the sample() function inside the paste() function to concatenate the letter “h” and a random value from 1–5. I then used the do() function from the mosaic package to “do” this 200 times. Lastly, I appended this vector to the nels data frame, in the process, coercing it to characters (to be sure it isn’t appended as a factor–which will be needed later for use in the get() function).

library(mosaic)
heart <- do(200) * paste("h", sample(1:5, size = 1), sep="")
nels$Heart <- as.character(heart[, 1])
head(nels)
#  ID   Homework   Math Heart
#1  1 -0.3329931 42.432    h4
#2  2 -0.2136822 53.698    h2
#3  3 -1.0077991 49.205    h2
#4  4  0.2059000 53.698    h1
#5  5 -0.1177185 55.980    h5
#6  6  0.1413540 65.331    h3
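
For readers who do not have the mosaic package installed, the same kind of random assignment can be sketched in base R. This is an alternative to the approach above, not the code used for the plot; the name heart_alt and the seed are arbitrary choices.

```r
set.seed(214)  # arbitrary seed, only so the sketch is reproducible

# Draw 200 values from 1-5 with replacement and prefix each with "h"
heart_alt <- paste0("h", sample(1:5, size = 200, replace = TRUE))

head(heart_alt)
table(heart_alt)
```

Because paste0() already returns a character vector, the result could be assigned directly with nels$Heart <- heart_alt, with no coercion step.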

This is all that is needed until we get to actually creating the plot.

Adding a Cupid into the Plot’s Background

I obtained the image of Cupid by doing a Google search on “Cupid, Public Domain”. (The actual image I used is available here.) Since the image was a JPEG, I again converted the image to a PNG (this time using Preview) and changed the image size to have a width of 800 pixels. I also used the Instant Alpha tool in Preview’s toolbar to make the white background transparent. As a sidebar, this could all be done in GIMP as well.

I again used the readPNG() function to read the image file into R, but this time setting the native= argument to FALSE. This will represent the image as an array rather than rasterizing it. I chose to do this so that I could make the image more transparent before rasterizing it.

cupid <- readPNG("/Users/andrewz/Desktop/cupid.png", FALSE)
w <- matrix(rgb(cupid[ , , 1], cupid[ , , 2], cupid[ , , 3], cupid[ , , 4] * 0.2), nrow = dim(cupid)[1])
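
To see what the alpha scaling does without needing the Cupid file, here is a minimal sketch on a synthetic array. The object img is a made-up 2 x 3 RGBA image standing in for the array that readPNG() returns, and the names img and faded are hypothetical.

```r
# Synthetic 2 x 3 RGBA image (values in [0, 1]); stands in for readPNG() output
img <- array(0.5, dim = c(2, 3, 4))
img[ , , 4] <- 1  # fully opaque alpha channel

# Scale the alpha channel to 20% and collapse the array to a matrix of colors
faded <- matrix(
  rgb(img[ , , 1], img[ , , 2], img[ , , 3], img[ , , 4] * 0.2),
  nrow = dim(img)[1]
)

dim(faded)   # same rows and columns as the image
faded[1, 1]  # a "#RRGGBBAA" color string
```

The result is a character matrix with one color string per pixel, which is a form rasterGrob() can draw.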

Creating the Plot

Finally, we are ready to create the plot. In the initial calls to ggplot, I use the rasterGrob() function from the grid package to rasterize Cupid, which is placed in the plot using the annotation_custom() layer. The color= and fill= arguments to the geom_smooth() layer set the beautiful magenta/pink color for the regression smoother. The theme_valentine() layer sets ggplot’s theme to use some specialized fonts (see the Valentine package on GitHub). The size of the points in the geom_point() layer is set to 0, to leave a blank canvas for us to add our hearts. This is assigned into the object p.

library(ggplot2)
library(grid)
library(Valentine)
p <- ggplot(data = nels, aes(x = Homework, y = Math)) +
	geom_point(size = 0) +
	annotation_custom(rasterGrob(w)) +
	theme_valentine() +
	geom_smooth(method = "lm", color = "#ec008c", fill = "#ec008c", lwd = 1.5) +
	ggtitle("HAPPY BIRTHDAY FLORENCE HENDERSON")

The hearts (points) are now added to the plot by cycling through a for loop. Each time we are going to rasterize the heart and add it to the plot. The optional arguments in the annotation_custom() layer set the horizontal and vertical position of the hearts in the plot.

for(i in 1:nrow(nels)){
    p <- p + annotation_custom(
      rasterGrob(get(nels$Heart[i])), 
      xmin = nels$Homework[i] - 0.5, xmax = nels$Homework[i] + 0.5, ymin = nels$Math[i] - 0.5, ymax = nels$Math[i] + 0.5
      ) 
    }

iNZight

We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible:  accessible statistical software.  “Accessible” in (at least) two senses:  affordable and ready-to-use.  This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland, and is intended for kids to use along with the Census@School data.  Alas, the software is greatly hampered on a Mac, but even there has many features which kids and teachers will appreciate.  Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy-entry.  Kids can quickly upload data and see basic boxplots and summary statistics, without much effort. (There are some movies on the homepage to help you get started, but it’s pretty much an intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods.  Below are summaries of FitBit data collected this Fall quarter, and separated into days I taught in a classroom (Lecture==1) and days I did not.  It’s depressingly clear that teaching is good for me.  (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions.  The green horizontal lines are 95% bootstrap confidence intervals for the medians.
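
For readers curious about what a bootstrap interval for a median involves, here is a minimal percentile-bootstrap sketch in R. It is not iNZight’s own code, the step counts are made up, and the number of resamples (2000) is an arbitrary choice.

```r
set.seed(2012)  # arbitrary seed for reproducibility

# Made-up daily step counts, for illustration only (not the FitBit data)
steps <- c(4200, 5100, 6100, 6900, 7300, 7800, 8800, 9500, 10400, 12000)

# Resample with replacement many times, recording the median each time
boot_medians <- replicate(2000, median(sample(steps, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap medians
ci <- quantile(boot_medians, probs = c(0.025, 0.975))
ci
```

Doing this separately for teaching and non-teaching days would give one interval per group, comparable to the pair of green lines in the display.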

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)
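
The idea of subsetting against a numerical variable can be sketched in R with cut(). The data below are made up, the column names are hypothetical, and the choice of three equal-width bins is an assumption; iNZight’s own binning rule is not described here.

```r
# Made-up data for illustration: steps, stairs climbed, and a teaching-day flag
toy <- data.frame(
  steps   = c(5000, 9000, 6500, 11000, 4000, 8000, 7000, 12000),
  stairs  = c(2, 15, 5, 22, 1, 12, 8, 30),
  lecture = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Bin the numeric subsetting variable into three equal-width intervals
toy$stairs_bin <- cut(toy$stairs, breaks = 3)

# Median steps for each lecture-by-bin combination (NA where a cell is empty)
with(toy, tapply(steps, list(lecture, stairs_bin), median))
```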

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, rotating 3-D plots, and scatterplot matrices.  PC users can view some cool visualizations that emphasize the variability in re-sampling.

Turning Tables into Graphs

We have just finished another semester, and before my mind completely turns to rubble, I want to share what I believe to be a fairly good assignment. What I present below was originally parts of two separate assignments that I gave this semester, but upon reflection I think it would work better as one.

—–

Read the article Let’s Practice What We Preach: Turning Tables into Graphs (full reference given below). In this article, Gelman, Pasarica, & Dodhia suggest that presentations of results using graphs are more effective than results presented in tables.

Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach: Turning tables into graphs. The American Statistician, 56(2), 121–130.

Find an article in a journal that presents results (or data) in a table. Re-create the data in a tabular format using R (or Excel).

  1. Use the functions in ggplot2 to produce a plot that conveys the same message as the original table.
  2.  Include the original table (this can be a screenshot or web-link) and citation, along with your plot.
  3. Write a few sentences describing why the plot you produced provides a better presentation of the results or data (be sure to use recommendations from the article in making your case).

In the second part of this assignment, you will write a tutorial for the process you followed for turning a table into a plot using R Markdown and will publish that tutorial on RPubs.

There are several resources for learning R Markdown.

Your tutorial should be written so that a student who was just learning ggplot could follow your directions easily. Include instructions for obtaining the data, getting it into a useable tabular format, manipulating the data so it can be used with ggplot, and well-commented instructions for creating your final plot. (Think of the level of detail you would want in a tutorial when you were first learning ggplot!)

It should also include:

  • a citation or link to the website/journal that published the original table
  • a view of your final data (full or a subset depending on size)
  • all commands necessary to create your final plot (with appropriate explanation), and
  • the final plot

When you knit the .Rmd document it should compile without errors.

—–

Students commented that they learned a lot about the use of ggplot during the initial assignment (this was the second assignment in the course). The Markdown part of the assignment I gave as an extra credit assignment at the end of the class, but in retrospect, I should have made it required and done it very early on.

Here are a couple of the tutorials that I have received so far:

  • These students took a table of characteristics of survey participants published in the Journal of Ethnic and Cultural Diversity in Social Work and turned it into a bar graph.  http://rpubs.com/TSK_2012/3184
  • These students took data about trends and topics discussed in Seventeen Magazine‘s Traumarama articles from 1994-2007 and turned it into a line plot. http://rpubs.com/opalc123/3155
  • These students took a table of data related to approval ratings and turned them into a box-and-whiskers plot. http://www.rpubs.com/GeorgeBrisse/3217
  • These students’ work is a great example of how data initially presented in a table is much easier to process in a graph. The data, from a table published in the Journal of Deaf Studies and Deaf Education, show the academic status and progress of deaf and hard-of-hearing students in general education classrooms.  http://rpubs.com/mens0055/3211
  • These students used a stacked bar chart to show data about the sample sizes for different stages for 12 problem behaviors published in Health Psychology. http://rpubs.com/nikedenise/3256
  • These students created a line graph representing pre- and post-training scores for consonant, vowel, sentence, and gender perception scores in cochlear implant users to examine whether an auditory training program improves performance. http://rpubs.com/koern030/3255

Computing Skills, Nunchaku Skills, Bow Skills…

I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (not necessarily all of my colleagues) that students need computing skills. First, a little background…

I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take academic jobs, jobs at testing companies (e.g., Pearson, College Board), or consulting gigs.

I have attended conferences and read blogs and papers in which it has been suggested that students learn computing skills. I am convinced of this need at 100.4%, 95%CI = [100.3%, 100.5%].

The more practical issue is what computing skills should these students learn and how deeply? And, how should they learn them (e.g., in a class, on their own, as part of an independent study)?

The latter question is important enough to merit its own post (later), so I will not address that here. Below I will begin a list of the computing skills that I believe the Quantitative Methods students should learn, and I hope readers will add to it. I use the word computing rather broadly as a matter of intention. I also do not list these in any particular order at this point, other than how they come to mind.

  • At least one programming language (probably R)
    • In my mind two or three would be better depending on the content focus of the student (Python, C++, Perl)
  • LaTeX
  • Knitr/Sweave (I used to say Sweave, but Knitr is easier initially)
  • HTML/HTML5
  • CSS
  • KML
    • I think students should also know about PHP and JavaScript. Perhaps they don’t have to be fluent in them, but they are important to know about. For example, to learn D3 (a visualization toolkit) it would behoove a student to learn JavaScript.
  • Markdown/R Markdown. These are again, easy to learn and could help students transition to easily learning Knitr. It could also lead to learning and using Slidify.
  • Regular Expressions
  • SQL
  • XML
  • JSON
  • XPATH
  • BibTeX (or some program to work with references….Mendeley, EndNote, something…)
  • Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (MPLUS, LISREL, OpenMX, AMOS, ConQuest, WinSteps, BUGS,  etc.)
  • Unix/Linux and Shell Scripting

I think students could learn many of these at a lesser level: the basics, and how to use them to solve simpler problems. In this way there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about specific computing skills that are important for their own research.

What have I missed?

ggplot2 Pinterest

I don’t understand the website Pinterest, but it looks pretty (especially on the iPad), and an undergraduate student said it was the greatest thing since Facebook, so I thought I would give it a shot. The idea is that Pinterest “lets you organize and share all the beautiful things you find on the web.” You organize beautiful things by creating a “board” (a page), and then adding “pins” (links to websites).

My thought was…plots in ggplot2 are beautiful…I will create a board with useful links/tutorials for creating ggplot2 plots!

I already have three followers. Now you can follow too.

http://pinterest.com/zief0002/ggplot2/

Recoding Variables in R: Pedagogic Considerations

I was creating a dataset this last week in which I had to partition the observed responses to show how the ANOVA model partitions the variability. I had the observed Y (in this case prices for 113 bottles of wine), and a categorical predictor X (the region of France that each bottle of wine came from). I was going to add three columns to this data, the first showing the marginal mean, the second showing the effect, and the third showing the residual. To create the variable indicating the effect, I essentially wanted to recode a particular region to a particular effect:

  • Bordeaux ==> 9.11
  • Burgundy ==> 4.20
  • Languedoc ==> –9.30
  • Rhone ==> –0.75

As I was considering how to do this, it struck me that several options were available to me. Here are two solutions that come up when Googling how to do this.

Use the recode() function from the car package.

library(car)
wine$Effect <- recode(wine$Region,
  " 'Bordeaux' = 9.11;
    'Burgundy' = 4.20;
    'Languedoc' = -9.30;
    'Rhone' = -0.75 " )

This is a commonly suggested solution. The strings inside quotation marks, however, make it likely students (and teachers) will commit a syntax error. This is especially true when recoding a categorical variable into another categorical variable. R-wise (it’s a technical term) it also produces a factor, even though it is clear that the intent was to produce numerical values. This is, of course, easily fixable by converting with as.numeric(as.character()) (calling as.numeric() directly on a factor returns the internal level codes rather than the values), but it can lead to confusion.

Another solution is to use indexing.

# start with the Bordeaux effect in every row, then overwrite the other regions
wine$Effect <- 9.11
wine$Effect[wine$Region == "Burgundy"] <- 4.20
wine$Effect[wine$Region == "Languedoc"] <- -9.30
wine$Effect[wine$Region == "Rhone"] <- -0.75

This solution is canonical in that it is clean and the R code is concise. (Note: This is what I ended up using to create this re-coded variable.) In my experience, however, this also means that students without a programming background don’t initially understand it. This alone makes it unattractive pedagogically.

A better solution pedagogically seems to be to create a new data frame of key-value pairs (in computer science this is called a hash table) and then use the join() function from the plyr package to ‘join’ the original data frame and the new data frame.

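library(plyr)
key <- data.frame(
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
  Effect = c(9.11, 4.20, -9.30, -0.75)
  )
# by= takes the column name as a string; assign the result to keep the new column
wine <- join(wine, key, by = "Region")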

For me this is a useful way to teach how to recode variables. It has a direct link to the Excel VLOOKUP function, and also to ideas of relational databases. It also allows more generalizability in terms of being able to merge data sets using a common variable.

R-wise, it is not difficult syntax, since almost every student has successfully used the data.frame() function to create a data frame. The join() function is also easily explained.
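
The same key-value idea can also be sketched in base R with merge(), for readers without plyr. The data frame wine_toy below is a made-up miniature of the wine data (its prices are invented for illustration), and merge() is a swapped-in alternative to join(), not the function discussed above.

```r
# Made-up miniature of the wine data (the real data have 113 bottles)
wine_toy <- data.frame(
  Price  = c(40, 35, 12, 22, 45),
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone", "Bordeaux")
)

# Key-value table mapping each region to its effect
key <- data.frame(
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
  Effect = c(9.11, 4.20, -9.30, -0.75)
)

# Left join: keep every bottle and look up its region's effect
wine_toy <- merge(wine_toy, key, by = "Region", all.x = TRUE)
wine_toy
```

One difference worth pointing out to students: merge() sorts the result by the key column by default, so the rows may come back in a different order than they went in.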