NCTM Essential Understandings

NCTM has finally published books on statistics in its EU series. This is a rather traditional approach to statistics, given the context of this blog. But, since I’m a co-author (along with Roxy Peck and Stephen Miller), why not point you to it?

And while the book is not computational in theme, it does address a central issue of this blog: universal statistical knowledge.

A grades 6-9 version is due out any moment. Stay tuned.

Extreme Fitbit

I have been mourning the loss this week of my FitBit.  No idea where it went.  That’s the problem with small, portable data collection devices.  The very feature that makes them useable makes them lose-able.  Then I came across this possible solution

which raises entirely new questions about edible lines of data collection devices.


Your Flowing Data Defended

I had the privilege last week of listening to the dissertation defense of UCLA Stat’s newest PhD: Nathan Yau.  Congratulations, Nathan!

Nathan runs the very popular and fantastic blog Flowing Data, and his dissertation is about, in part, the creation of his app Your Flowing Data.  Essentially, this is a tool for collecting and analyzing personal data–data about you and your life.

One aspect of the thesis I really liked is a description of types of insight he found from a paper by Pousman, Stasko and Mateas (2007): Casual Information Visualization: Depictions of Data in Everyday Life (IEEE Transactions on Visualization and Computer Graphics, 13(6): 1145-1152).  Nathan quotes four types of insights:

  • Analytic Insight.  Nathan describes these as ‘traditional’ statistical insights obtained from statistical models.
  • Awareness Insight. “…remaining aware of data streams such as the weather, news…” People are simply aware that these everyday streams exist and so know to seek them for information when needed.
  • Social Insight. Involvement in social networks helps people define a place for themselves in relation to particular social contexts.
  • Reflective Insight.  Viewers take a step back from data and can reflect on something they were perhaps unaware of, or have an emotional reaction.

With respect to my Walk to Venice Beach, I think it would be interesting to see how experiences such as that can be leveraged into insights in these categories.  Although these insights are not hierarchical, it would also be interesting to see how they fit into understandings of statistical thinking and reasoning.  For example, some stats ed researchers are grappling with the role of ‘informal’ vs. ‘formal’ statistical inference, and I see the last three insights as supporting informal inference (when inference is called for at all).

Nathan has lots to say about the role that developers can play in assisting people in gaining insight from data.  Our job, I believe, is to think carefully about the role that educators can play in strengthening these insights.  We spend too much time on the first insight, I think, and not enough time on the others.  But the others are what students will remember and use from their stats class.

Happy Birthday Florence Henderson

As a celebration of Florence Henderson’s 79th birthday (on February 14), I have created this scatterplot to use in my regression course.


The plot depicts the relationship between time spent on mathematics homework outside of school (expressed as z-scores) and mathematics achievement scores (expressed as T-scores, M=50, SD=10) for 200 8th-graders taken from the 1988 National Education Longitudinal Study. The color–in a display of very poor data science–is just randomly applied to the observations rather than meaning anything substantial. (Blogger’s Note: I think it fits with the spirit of Valentine’s Day…a gratuitous, yet meaningless, gesture intended to make the receiver feel all gooshy.)

I created the plot using the Valentine package (available here) which applies a Valentine’s Day theme to ggplot. I also applied a picture of Cupid into the background of the plot and used hearts instead of points to plot the observations. Lastly, I changed the default color and fill on the regression smoother to more aptly fit the color scheme.

Below I will explain the how-to of creating this plot.

Reading in the NELS Data

First, I read the NELS data into R. These data and their codebook are available via my regression course website.

nels <- read.csv("")

Using Hearts Instead of Points

I first needed to find an image of a heart that I liked. For icons of all sorts, I generally use The Noun Project. (This particular heart can be found here.) All of the images at The Noun Project are SVG files. This makes them very useful for display in browsers. To use them in ggplot, I converted the SVG image to a PNG file using the free image manipulation program GIMP. (Perhaps you can use the SVG format directly without converting it, but I have never done that, so I don’t know.)

Using GIMP, I also replaced the black color with the hexadecimal color “ad97c6” by selecting Color > Map > Color Exchange. (Double-clicking the color swatch under “To Color” allows for color entry in hexadecimal.) After this I saved the heart as “heart1.png”, and repeated the process four more times, but using the colors “e58cbc” (heart2.png), “f2935b” (heart3.png), “9fc8b6” (heart4.png), and “eddc74” (heart5.png). (Note: These colors are the same colors I chose for the Valentine’s Day theme color and fill palettes and are based on the colors of the candy hearts you get in elementary school.)

I then used the png package to read the five PNG files into R. (Note: For anyone who doesn’t want to go through the hassle of coloring the hearts and reformatting both them and Cupid, I have made those files available for download here.)

library(png)  # provides readPNG()

h1 <- readPNG("/Users/andrewz/Desktop/heart1.png", native = TRUE)
h2 <- readPNG("/Users/andrewz/Desktop/heart2.png", native = TRUE)
h3 <- readPNG("/Users/andrewz/Desktop/heart3.png", native = TRUE)
h4 <- readPNG("/Users/andrewz/Desktop/heart4.png", native = TRUE)
h5 <- readPNG("/Users/andrewz/Desktop/heart5.png", native = TRUE)

To randomly assign each observation to one of the five hearts (h1–h5), I used the sample() function inside the paste() function to concatenate the letter “h” and a random value from 1–5. I then used the do() function from the mosaic package to “do” this 200 times. Lastly, I appended this vector to the nels data frame, in the process, coercing it to characters (to be sure it isn’t appended as a factor–which will be needed later for use in the get() function).

library(mosaic)  # provides do()

heart <- do(200) * paste("h", sample(1:5, size = 1), sep="")
nels$Heart <- as.character(heart[, 1])
head(nels)
#  ID   Homework   Math Heart
#1  1 -0.3329931 42.432    h4
#2  2 -0.2136822 53.698    h2
#3  3 -1.0077991 49.205    h2
#4  4  0.2059000 53.698    h1
#5  5 -0.1177185 55.980    h5
#6  6  0.1413540 65.331    h3

This is all that is needed until we get to actually creating the plot.
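
As an aside, the same random assignment can be sketched in base R without the mosaic package. This is only an illustrative alternative, not the code used for the plot: the seed, the object names (n, heart_labels) and the stand-in value for h1 are hypothetical. It draws all 200 labels in one call to sample() and shows why character labels matter, since get() looks an object up by its name.

```r
set.seed(214)  # arbitrary seed, used only to make the sketch reproducible
n <- 200       # number of observations, matching the NELS sample above

# Draw n values from 1-5 with replacement and prefix each with "h"
heart_labels <- paste0("h", sample(1:5, size = n, replace = TRUE))
table(heart_labels)

# get() retrieves an object by its name, so the labels must be character strings
h1 <- "stand-in for the heart1.png raster"  # hypothetical placeholder object
identical(get("h1"), h1)
```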

Adding a Cupid into the Plot’s Background

I obtained the image of Cupid by doing a Google search on “Cupid, Public Domain”. (The actual image I used is available here.) Since the image was a JPEG, I again converted the image to a PNG (this time using Preview) and changed the image size to have a width of 800 pixels. I also used the Instant Alpha tool in Preview’s toolbar to make the white background transparent. As a sidebar, this could all be done in GIMP as well.

I again used the readPNG() function to read the image file into R, but this time setting the native= argument to FALSE. This will represent the image as an array rather than rasterizing it. I chose to do this so that I could make the image more transparent before rasterizing it.

cupid <- readPNG("/Users/andrewz/Desktop/cupid.png", FALSE)
w <- matrix(rgb(cupid[ , , 1], cupid[ , , 2], cupid[ , , 3], cupid[ , , 4] * 0.2), nrow = dim(cupid)[1])
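
To see what that second line does, here is a minimal sketch on a made-up image rather than the Cupid file. The 2 x 2 array img and the object faded are hypothetical stand-ins; the idea is the same: multiply the alpha channel by 0.2, convert each pixel to a hex color string with rgb(), and reshape the result into a matrix with one row per row of pixels.

```r
# Hypothetical 2 x 2 RGBA image in which every pixel is opaque red
img <- array(0, dim = c(2, 2, 4))
img[ , , 1] <- 1  # red channel
img[ , , 4] <- 1  # alpha channel (1 = fully opaque)

# Scale alpha to 20% and build a matrix of "#RRGGBBAA" color strings
faded <- matrix(
  rgb(img[ , , 1], img[ , , 2], img[ , , 3], img[ , , 4] * 0.2),
  nrow = dim(img)[1]
)

dim(faded)
faded[1, 1]
```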

Creating the Plot

Finally, we are ready to create the plot. In the initial calls to ggplot, I use the rasterGrob() function from the grid package to rasterize Cupid, which is placed in the plot using the annotation_custom() layer. The color= and fill= arguments to the geom_smooth() layer set the beautiful magenta/pink color for the regression smoother. The theme_valentine() layer sets ggplot’s theme to use some specialized fonts (see the Valentine package on GitHub). The size of the points in the geom_point() layer is set to 0, to leave a blank canvas for us to add our hearts. This is assigned into the object p.

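library(ggplot2)
library(grid)  # provides rasterGrob()

# theme_valentine() comes from the Valentine package mentioned above
p <- ggplot(data = nels, aes(x = Homework, y = Math)) +
  geom_point(size = 0) +
  annotation_custom(rasterGrob(w)) +
  theme_valentine() +
  geom_smooth(method = "lm", color = "#ec008c", fill = "#ec008c", lwd = 1.5)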

The hearts (points) are now added to the plot by cycling through a for loop. Each time we are going to rasterize the heart and add it to the plot. The optional arguments in the annotation_custom() layer set the horizontal and vertical position of the hearts in the plot.

for(i in 1:nrow(nels)){
    p <- p + annotation_custom(
      rasterGrob(get(nels$Heart[i])),  # look up this row's heart by name and rasterize it
      xmin = nels$Homework[i] - 0.5, xmax = nels$Homework[i] + 0.5, ymin = nels$Math[i] - 0.5, ymax = nels$Math[i] + 0.5
    )
}


We spend too much time musing about the Data Deluge, I fear, at the expense of talking about another component that has made citizen-statisticianship possible:  accessible statistical software.  “Accessible” in (at least) two senses:  affordable and ready-to-use.  This summer, Chris Wild demonstrated his group’s software iNZight at the Census@School workshop in San Diego. iNZight is produced out of the University of Auckland, and is intended for kids to use along with the Census@School data.  Alas, the software is greatly hampered on a Mac, but even there has many features which kids and teachers will appreciate.  Their homepage says it all: “A simple data analysis system which encourages exploring what data is saying without the distractions of driving complex software.”

First, it’s designed for easy entry.  Kids can quickly upload data and see basic boxplots and summary statistics, without much effort. (There are some movies on the homepage to help you get started, but it’s pretty much an intuitive interface.) Students can even easily calculate confidence intervals using bootstrapping or traditional methods.  Below are summaries of FitBit data collected this Fall quarter, and separated into days I taught in a classroom (Lecture==1) and days I did not.  It’s depressingly clear that teaching is good for me.  (It didn’t hurt that my classroom was almost a half mile from my office.)

Note that not only does the graphic look elegant, but it combines the dotplot with the boxplot, which helps cement the use of boxplots as summaries of distributions.  The green horizontal lines are 95% bootstrap confidence intervals for the medians.
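
For readers curious about what a bootstrap interval for a median involves, here is a minimal sketch in R. It is not iNZight’s implementation, whose exact method I am not describing here. It assumes the simple percentile approach, and the step counts are invented for illustration; they are not my FitBit data.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical daily step counts (invented values)
steps <- c(4200, 5100, 6300, 7000, 7400, 8100,
           8800, 9500, 10200, 11900, 12500, 13100)

# Resample the days with replacement many times, recording each resample's median
boot_medians <- replicate(2000, median(sample(steps, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap medians
ci <- quantile(boot_medians, probs = c(0.025, 0.975))

median(steps)
ci
```

Repeating this separately for teaching and non-teaching days would give one interval per group, comparable to the pair of green lines in the graphic.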

iNZight also lets students easily subset data, even against numerical variables.  For example, if I wanted to see how this relationship between teaching and non-teaching days held up depending on the number of stairs I climbed, I could subset, and the software automatically bins the subsetting variable, displaying separate boxplot pairs for each bin category.  There’s a slider that lets me move smoothly from bin to bin, although it’s not always easy to compare one pair of boxplots to another.  (This sort of thing is easier if, instead of examining a numerical-categorical relationship as I’ve chosen here, you do a numerical-numerical relationship.)
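
The idea of subsetting against a numerical variable can be sketched in R as well. This is an assumption-laden illustration, not a description of how iNZight chooses its bins: the data frame fitbit and its columns are invented, and cut() with three equal-width intervals is just one reasonable binning choice.

```r
set.seed(2)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical data, invented for illustration only
fitbit <- data.frame(
  steps   = round(runif(40, 3000, 14000)),
  floors  = round(runif(40, 0, 30)),
  lecture = rep(c(0, 1), each = 20)
)

# Bin the numeric subsetting variable into three equal-width intervals
fitbit$floor_bin <- cut(fitbit$floors, breaks = 3)

# Median steps for lecture vs. non-lecture days within each bin
aggregate(steps ~ lecture + floor_bin, data = fitbit, FUN = median)
```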

Advanced students can click on the “Advanced” tab and gain access to modeling features, time series, three-d rotating plots, and scatterplot matrices.  PC users can view some cool visualizations that emphasize the variability in re-sampling.