Project TIER

Last year I was awarded a Project TIER (Teaching Integrity in Empirical Research) fellowship, and last week my work on the fellowship wrapped up with a meeting with the project leads, other fellows from last year, and the new fellows for the coming year. In a nutshell, Project TIER focuses on reproducibility. Here is a brief summary of the project’s focus from their website:

For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

As part of the fellowship, beyond continuing to work on integrating reproducible data analysis practices into my courses with the use of literate programming via R Markdown and version control via git/GitHub, I have also created two template GitHub repositories that follow the Project TIER guidelines: one for use with R and the other with Stata. They both live under the Project TIER organization on GitHub. The idea is that anyone wishing to follow the folder structure and workflow suggested by Project TIER can make a copy of these repositories and easily organize their work following the TIER guidelines.

There is more work to be done on these, of course, the first of which is evolving the TIER guidelines themselves to line up better with git and R, as well as with tricky data (large data, private data, etc.). Some of these are issues the new fellows might tackle in the next year.

As part of the fellowship I also taught a workshop titled “Making your research reproducible with Project TIER, R, and GitHub” to Economics graduate students at Duke. These are students who primarily use Stata so the workshop was a first introduction to this workflow, using the RStudio interface for git and GitHub. Materials for this workshop can be found here. At the end of the workshop I got the sense that very few of these students were interested in making the switch over to R (can’t blame them honestly — if you’ve been working on your dissertation for years and you just want to wrap it up, the last thing you want to do is to have to rewrite all your code and redo your analysis in a different platform) but quite a few of them were interested in using GitHub for both version control and for showcasing their work publicly.

Also as part of the fellowship, Ben Baumer (a fellow fellow?) and I have organized a session on reproducibility at JSM 2016 that I am very much looking forward to. See here for the lineup.

In summary, being involved with this project was a great eye-opener to the fact that there are researchers and educators out there who truly care about issues surrounding reproducibility of data analysis but who are very unlikely to switch over to R because it is not customary for their discipline (although at least one fellow did after watching my demo of R Markdown at the 2015 meeting, which was nice to see 😁). Discussions around working with Stata made me once again very thankful for R Markdown and RStudio, which make literate programming a breeze in R. And what I mean by “a breeze” is “easy to teach to, and be adopted by, anyone from a novice to an expert R user”. It seems to me that it would be in the interest of companies like Stata to implement such a workflow/interface to support the reproducibility efforts of researchers and educators using their software. I can’t see a single reason why they wouldn’t invest time (and yes, money) in developing this.

During these discussions a package called RStata also came up. This package is “[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…
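For the curious, here is a minimal sketch of what using it might look like (the Stata path and version number below are placeholders you would need to adjust for your own installation, and analysis.do is a made-up file name):

library(RStata)

# Point RStata at your local Stata installation (example values, not defaults)
options("RStata.StataPath" = "/usr/local/bin/stata-se")
options("RStata.StataVersion" = 14)

# Run an inline Stata command on an R data frame
stata("summarize mpg weight", data.in = mtcars)

# Or run an existing .do file
stata("analysis.do")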

Overall my involvement with the fellowship was a great experience for meeting and brainstorming with faculty from non-stats disciplines (mostly from the social sciences) who regularly teach in platforms like Stata and SPSS and who are also dedicated to teaching reproducible data analysis practices. I’m often the person who tries to encourage people to switch over to R, and I don’t think I’ll be stopping anytime soon, but I do believe that if we want everyone who does data analysis to do it reproducibly, efforts must be made to (1) come up with workflows that ensure reproducibility in statistical software other than R, and (2) create tools that make reproducible data analysis easier in such software (e.g. tools similar to R Markdown designed specifically for that software).

 

PS: It’s been a while since I last posted here, let’s blame it on a hectic academic year. I started and never got around to finishing two posts in the past few months that I hope to finish and publish soon. One is about using R Markdown for generating course/TA evaluation reports and the other is on using Slack for managing TAs for a large course. Stay tuned.

PPS: Super excited for #useR2016 starting on Monday. The lack of axe-throwing will be disappointing (those who attended useR 2015 in Denmark know what I’m talking about) but otherwise the schedule promises a great lineup!

How do Readers Perceive the Results of a Data Analysis?

As a statistician who often needs to explain methods and results of analyses to non-statisticians, I have been receptive to the influx of literature related to the use of storytelling or a data narrative. (I am also aware of the backlash related to use of the word “storytelling” in regard to scientific analysis, although I am less concerned about this than, say, these scholars.) As a teacher of data analysis, the use of narrative is especially pertinent in that it ties the analyses performed intrinsically to the data context, or at the very least to a logical flow of the methods used.

I recently read an article posted on Brain Pickings about the psychology behind great stories. In the article, the author, Maria Popova, cites Jerome Bruner (a pioneer of cognitive psychology) and his essay “Two Modes of Thought”:

There are two modes of cognitive functioning, two modes of thought, each providing distinctive ways of ordering experience, of constructing reality. The two (though complementary) are irreducible to one another. Efforts to reduce one mode to the other or to ignore one at the expense of the other inevitably fail to capture the rich diversity of thought.

Each of the ways of knowing, moreover, has operating principles of its own and its own criteria of well-formedness. They differ radically in their procedures for verification. A good story and a well-formed argument are different natural kinds. Both can be used as means for convincing another. Yet what they convince of is fundamentally different: arguments convince one of their truth, stories of their lifelikeness. The one verifies by eventual appeal to procedures for establishing formal and empirical proof. The other establishes not truth but verisimilitude.

The essence of his essay is that, as Popova states, “a story (allegedly true or allegedly fictional) is judged for its goodness as a story by criteria that are of a different kind from those used to judge a logical argument as adequate or correct.” [highlighting is mine]

What type of implications does this have on a data narrative, where, in principle, both criteria are being judged? Does one outweigh the other in a reader’s judgment? How does this affect reviewers when they are making decisions about publication?

My sense is that, psychologically, the judgment of two differing sets of criteria will lead most humans to judge one as being more salient than the other. Presumably, most scientists would want the logical argument (or as Bruner calls it, the logico-scientific argument) to prevail in this case. However, I think it is the story that most readers, even those with scientific backgrounds, will tend to remember.

As for reviewers, again, the presumption is that they will evaluate a paper’s merit on its scientific evidence. But, as any reviewer can tell you, the writing and narrative presenting that evidence are in some ways just as important if that evidence is to be believable. This is why great courtroom attorneys spend just as much time developing the story around a case as they do marshaling the logical argument they will use to persuade a jury.

There is little guidance for statisticians, especially nascent statisticians, about what the “right” balance of narrative and logico-scientific argument should be when writing up a data analysis. In fact, many of us (myself included) do not often consider the impact of a reader’s weighing of these different types of evidence. My training in graduate school focused on the latter type of writing, and it was only in undergraduate writing courses and through reading other authors’ thoughts about writing that the former became relevant to me at all. Ultimately there may not be much we can do about how readers perceive our work. Perhaps it is as Sylvia Plath wrote about poetry: once a poem is made available to the public, “the right of interpretation belongs to the reader.”

Tools for Managing Your Inbox

First of all, happy new year to all of our readers.

As my first contribution in 2016, I thought I would share with you a couple of tools that have helped tame my email inbox. In my continued resolution to finally achieve Inbox Zero, I have made a major dent in the last month. This is thanks to two tools: Unroll Me and Google Mail’s “Send & Archive” button.

Unroll Me

The first tool I would like to share with you is an app called Unroll Me. This app makes unsubscribing from email services, or adding them to a once-a-day digest, super easy…no more clicking “unsubscribe” in each individual email. After you sign up, Unroll Me examines your inbox for different email subscriptions. You then have three choices for each subscription it finds:

  1. Unsubscribe
  2. Add to Rollup
  3. Keep in Inbox

The first and third are self-explanatory. The second option adds all selected subscriptions to a digest-like email that comes once-per-day. Here is an example of my rollup:

Unroll Me keeps a history of your rollups for easy reference, and adds new subscriptions that it finds for you to manage. You can also change the preference for any of your subscriptions at any time. Also, since Unroll Me keeps a list of emails that you have unsubscribed from, it is easy to re-subscribe at any point.

This tool has changed my inbox. Digest-like emails are great for reducing inbox clutter (they are the email equivalent of putting household odds-and-ends into a nice wicker basket). Unfortunately, many places that should offer this option simply do not. For example, the University of Minnesota (my place of employment) has somehow auto-subscribed me to a million email lists. In general, I am not interested in about 90% of what they send, and another 9% are related to things I don’t need to see immediately. Unroll Me has allowed me to keep the email subscriptions I want to see immediately in my inbox, put those that are less important in a rollup, and eliminate the emails I couldn’t care less about from my sight completely.

Google Mail’s “Send & Archive” Button

This is a Google Mail option I learned about on Lifehacker. Why it is not a default button, I do not know. Go to Google Mail’s Settings and, under the General settings, select the option labelled “Show ‘Send & Archive’ button in reply”.

This will add a button to any email you reply to that allows you to send your reply and archive the email message you just replied to, essentially combining two steps into one.

You still have the send button as well, if you want to keep the original email in your inbox.

I hope these tools might also help you. If you have further suggestions, put them in the comments to share.

 

Teaching computation as an argument for simulation-based inference

Check out my guest post on the Simulation-based statistical inference blog:

Teaching computation as an argument for simulation-based inference

If you are interested in teaching simulation-based methods, or if you just want to find out more about why others are, I highly recommend the posts on this blog. The page also hosts many other useful resources, as well as information on upcoming workshops.

A two-hour introduction to data analysis in R

A few weeks ago I gave a two-hour Introduction to R workshop for the Master of Engineering Management students at Duke. The session was organized by the student-led Career Development and Alumni Relations committee within this program. The slides for the workshop can be found here and the source code is available on GitHub.

Why might this be of interest to you?

  • The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
  • The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against these datasets; in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
    • Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
    • Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not large enough to slow things down (which can especially be an issue for visualizing much larger data) but it’s large enough that manual wrangling of the data would be too much trouble.
    • Con: The visualizations do not really reveal very useful insights into the data. While this is not absolutely necessary for teaching syntax, it would have been a welcome cherry on top…
  • The raw dataset has a feature I love — it’s been damaged due (most likely) to being opened in Excel! One of the variables in the dataset is age group of the biker (BikeAge_gr). Here is the age distribution of bikers as they appear in the original data:
 
##    BikeAge_Gr crash_count
##    (chr)      (int)
## 1  0-5        60
## 2  10-Jun     421
## 3  15-Nov     747
## 4  16-19      605
## 5  20-24      680
## 6  25-29      430
## 7  30-39      658
## 8  40-49      920
## 9  50-59      739
## 10 60-69      274
## 11 70         12
## 12 70+        58

Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing something more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.
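For instance, here is a minimal sketch of how that cleanup might look with dplyr (this assumes the data have been read into a data frame called bike and that the grouping variable is named BikeAge_Gr, as in the output above; the actual workshop code may differ):

library(dplyr)

# Recode the Excel-mangled age groups back to what they should be
bike = bike %>%
  mutate(BikeAge_Gr = ifelse(BikeAge_Gr == "10-Jun", "6-10", BikeAge_Gr),
         BikeAge_Gr = ifelse(BikeAge_Gr == "15-Nov", "11-15", BikeAge_Gr))

# Confirm that only sensible age groups remain
bike %>%
  count(BikeAge_Gr)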

I should admit that I don’t have evidence of Excel causing this issue. However, this is my best guess, since “helping” the user by formatting date fields is standard Excel behaviour. There may be other software out there that does the same thing that I’m unaware of…

If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.

Fruit Plot: Plotting Using Multiple PNGs

In one of our previous posts (Halloween: An Excuse for Plotting with Icons), we gave a quick tutorial on how to plot with icons using ggplot. A reader, Dr. D. K. Samuel, asked in a comment how to use multiple icons. His comment read,

…can you make a blog post on using multiple icons for such data
year, crop,yield
1995,Tomato,250
1995,Apple,300
1995,Orange,500
2000, Tomato,600
2000,Apple, 800
2000,Orange,900
it will be nice to use icons for each data point. It will also be nice if the (icon) data could be colored by year.

This blog post will address this request. First, the result…

fruit-plot

The process I used to create this plot is as follows:

  1. Find the icons that you want to use in place of the points on your scatterplot (or dot plot).

I used an apple icon (created by Creative Stall), an orange icon (created by Gui Zamarioli), and a tomato icon (created by Andrey Vasiliev), all obtained from The Noun Project.

  2. Color the icons.

After downloading the icons, I used Gimp, a free image manipulation program, to color each of the icons. I created a green version, and a blue version of each icon. (The request asked for the two different years to have different colors.) I also cropped the icons.

Given that there were only three icons, doing this manually was not much of a time burden (10 minutes after I selected the color palette using colorbrewer.org). Could this be done programmatically? I am not sure. A person who is not me might be able to write some commands to do this with ImageMagick or some other program. You might also be able to do this in R, but I sure don’t know how…I imagine it involves re-writing the values for the pixels you want to change the color of, but how you determine which of those you want is beyond me.

If you are interested in only changing the color of the icon outline, an alternative would be to download the SVGs rather than the PNGs. Opening the SVG file in a text editor gives the underlying syntax for the SVG. For example, the apple icon looks like this:

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
  <g>
    <path d="M19.749,48c-1.662... />
    <path d="M24.001,14.866c-0.048, ... />
    <path d="M29.512, ... />
  </g>
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>
</svg>

The three path commands draw the actual image: the first draws the apple, the second draws the leaf on top of the apple, and the third draws the stem. Adding fill="blue" to the end of each path command will change the color of the path from black to blue (see below).

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
  <g>
    <path d="M19.749,48c-1.662 ... fill="blue" />
    <path d="M24.001,14.866c-0.048, ... fill="blue" />
    <path d="M29.512, ... fill="blue" />
  </g>
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>
</svg>

This could easily be programmatically changed. Then the SVG images could also programmatically be exported to PNGs.
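As a rough sketch of how that might be done in R (assuming the rsvg package is available for the SVG-to-PNG conversion; the file names below are made up for illustration):

library(rsvg)

# Add a fill color to every path element of an SVG, then export it as a PNG
recolor_svg = function(infile, outfile, color) {
  svg = readLines(infile)
  svg = gsub("<path ", paste0("<path fill=\"", color, "\" "), svg)
  tmp = tempfile(fileext = ".svg")
  writeLines(svg, tmp)
  rsvg_png(tmp, outfile)
}

recolor_svg("apple.svg", "blue_apple.png", "blue")
recolor_svg("apple.svg", "green_apple.png", "green")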

  3. Read in the icons (which are PNG files).

Here we use the readPNG() function from the png library to bring the icon into R.

library(png)
blue_apple = readPNG("~/Desktop/fruit-plot/blue_apple.png", TRUE)
green_apple = readPNG("~/Desktop/fruit-plot/green_apple.png", TRUE)
blue_orange = readPNG("~/Desktop/fruit-plot/blue_orange.png", TRUE)
green_orange = readPNG("~/Desktop/fruit-plot/green_orange.png", TRUE)
blue_tomato = readPNG("~/Desktop/fruit-plot/blue_tomato.png", TRUE)
green_tomato = readPNG("~/Desktop/fruit-plot/green_tomato.png", TRUE)
  4. Create the data.

Use the data.frame() function to create the data.

plotData = data.frame(
  year = c(1995, 1995, 1995, 2000, 2000, 2000),
  crop = c("tomato", "apple", "orange", "tomato", "apple", "orange"),
  yield = c(250, 300, 500, 600, 800, 900)
)

plotData
  year   crop yield
1 1995 tomato   250
2 1995  apple   300
3 1995 orange   500
4 2000 tomato   600
5 2000  apple   800
6 2000 orange   900

Next we will add a column to our data frame that maps the year to color. This uses the ifelse() function. In this example, if the logical statement plotData$year == 1995 evaluates as TRUE, then the value will be “blue”. If it evaluates as FALSE, then the value will be “green”.

plotData$color = ifelse(plotData$year == 1995, "blue", "green")

plotData
  year   crop yield color
1 1995 tomato   250  blue
2 1995  apple   300  blue
3 1995 orange   500  blue
4 2000 tomato   600 green
5 2000  apple   800 green
6 2000 orange   900 green

Now we will use this new “color” column in conjunction with the “crop” column to identify the icon that will be plotted for each row. The paste0() function concatenates each of its arguments together with no spaces between them. Here we are concatenating the color value, an underscore, and the crop value.

plotData$icon = paste0(plotData$color, "_", plotData$crop)

plotData
  year   crop yield color         icon
1 1995 tomato   250  blue  blue_tomato
2 1995  apple   300  blue   blue_apple
3 1995 orange   500  blue  blue_orange
4 2000 tomato   600 green green_tomato
5 2000  apple   800 green  green_apple
6 2000 orange   900 green green_orange
  5. Use ggplot to create a scatterplot of the data, making the size of the points 0.
library(ggplot2)

p = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  theme_bw() +
  xlab("Year") +
  ylab("Yield")
  6. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

Similar to the previous post, we add new layers (in our case each layer will be an additional point) by iteratively adding the layer and then writing the result back into p. The key is that the image name is now in the “icon” column of the data frame. The values in the “icon” column are character data. To make R treat these as objects, we first parse the character data using the parse() function, and then we use eval() to have R evaluate the parsed expression. A description of this appears in this Stack Overflow question.

library(grid)

for(i in 1:nrow(plotData)){
  p = p + annotation_custom(
    rasterGrob(eval(parse(text = plotData$icon[i]))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20, 
    ymin = plotData$yield[i] - 20, ymax = plotData$yield[i] + 20
  )
} 

# Show plot
print(p)
  7. Some issues to consider and my alternative plot.

I think that plot is what was requested, but since I cannot help myself, I would propose a few changes that I think would make this plot better. First, I would add lines to connect each fruit (apple in 1995 to apple in 2000). This would help the reader to better track the change in yield over time.

Second, I would leave the fruit color constant across years and vary the color between fruits (probably coloring them according to their real-world colors). This again helps readers in that they can more easily identify the fruits, and it also helps them track the change in yield. (It also avoids a Stroop-like effect of coloring an orange some color other than orange!)

Here is the code:

# Read in PNG files
apple = readPNG("~/Desktop/fruit-plot/red_apple.png", TRUE)
orange = readPNG("~/Desktop/fruit-plot/orange_orange.png", TRUE)
tomato = readPNG("~/Desktop/fruit-plot/red_tomato.png", TRUE)

# Plot
p2 = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  geom_line(aes(group = crop), lty = "dashed") +
  theme_bw()  +
  xlab("Year") +
  ylab("Yield") +
  annotate("text", x = 1997, y = 350, label = "Tomato created by Andrey Vasiliev from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 330, label = "Apple created by Creative Stall from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 310, label = "Orange created by Gui Zamarioli from the Noun Project", size = 2, hjust = 0)

for(i in 1:nrow(plotData)){
  p2 = p2 + annotation_custom(
    rasterGrob(eval(parse(text = as.character(plotData$crop[i])))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20, 
    ymin = plotData$yield[i] - 20, ymax = plotData$yield[i] + 20
  )
}

# Show plot
print(p2)

And the result…

fruit-plot2

Halloween: An Excuse for Plotting with Icons

In my course on the GLM, we are discussing residual plots this week. Given that it is also Halloween this Saturday, it seems like a perfect time to code up a residual plot made of ghosts.

The process I used to create this plot is as follows:

  1. Find an icon that you want to use in place of the points on your scatterplot (or dot plot).

I used a ghost icon (created by Andrea Mazzini) obtained from The Noun Project. After downloading the icon, I used Preview to create a new PNG file with the citation text cut out of the downloaded image. (I will add the citation text back at a later stage, in the plot itself.) This new icon was 450×450 pixels.

  2. Use ggplot to create a scatterplot of a set of data, making the size of the points 0.

Here is the code that will create the data and make the plot that I used.

plotData = data.frame(
  .fitted = c(76.5, 81.3, 75.5, 79.5, 80.1, 78.5, 79.5, 77.5, 81.2, 80.4, 78.1, 79.5, 76.6, 79.4, 75.9, 86.6, 84.2, 83.1, 82.4, 78.4, 81.6, 79.6, 80.4, 82.3, 78.6, 82.1, 76.6, 82.1, 87, 82.2, 82.1, 87.2, 80.5, 84.9, 78.5, 79, 78.5, 81.5, 77.4, 76.8, 79.4, 75.5, 80.2, 80.4, 81.5, 81.5, 80.5, 79.2, 82.2, 83, 78.5, 79.2, 80.6, 78.6, 85.9, 76.5, 77.5, 84.1, 77.6, 81.2, 74.8, 83.4, 80.4, 77.6, 78.6, 83.3, 80.4, 80.5, 80.4, 83.8, 85.1, 82.2, 84.1, 80.2, 75.7, 83, 81.5, 83.1, 78.3, 76.9, 82, 82.3, 85.8, 78.5, 75.9, 80.4, 82.3, 75.7, 73.9, 80.4, 83.2, 85.2, 84.9, 80.4, 85.9, 76.8, 83.3, 80.2, 83.1, 77.6),
  .stdresid = c(0.2, -0.3, 0.5, 1.4, 0.3, -0.2, 1.2, -1.1, 0.7, -0.1, -0.3, -1.1, -1.5, -0.1, 0, -1, 1, 0.3, -0.5, 0.5, 1.8, 1.6, -0.1, -1.3, -0.2, -0.9, 1.1, -0.2, 1.5, -0.3, -1.2, -0.6, -0.4, -3, 0.5, 0.3, -0.8, 0.8, 0.5, 1.3, 1.8, 0.5, -1.6, -2, -2.1, -0.8, 0.4, -0.9, 0.4, -0.4, 0.6, 0.4, 1.4, -1.4, 1.3, 0.4, -0.8, -0.2, 0.5, 0.7, 0.5, 0.1, 0.1, -0.8, -2.1, 0, 1.9, -0.5, -0.1, -1.4, 0.6, 0.7, -0.3, 1, -0.7, 0.7, -0.2, 0.8, 1.3, -0.7, -0.4, 1.5, 2.1, 1.6, -1, 0.7, -1, 0.9, -0.3, 0.9, -0.3, -0.7, -0.9, -0.2, 1.2, -0.8, -0.9, -1.7, 0.6, -0.5)
  )

library(ggplot2)

p = ggplot(data = plotData, aes(x = .fitted, y = .stdresid)) +
    theme_bw() +
    geom_hline(yintercept = 0) +
    geom_point(size = 0) +
    xlab("Fitted values") +
    ylab("Standardized Residuals") +
    annotate("text", x = 76, y = -3, label = "Ghost created by Andrea Mazzini from Noun Project")
  3. Read in the icon (which is a PNG file).

Here we use the readPNG() function from the png library to bring the icon into R.

library(png)
ghost = readPNG("/Users/andrewz/Desktop/ghost.png", TRUE)
  4. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

The idea is that since we have saved our plot in the object p, we can add new layers (in our case each layer will be an additional point) by iteratively adding the layer and then writing the result back into p. The pseudo-code for this is:

for(i in 1:nrow(plotData)){
    p = p + 
      annotation_custom(
        our_image,
        xmin = minimum_x_value_for_the_image, 
        xmax = maximum_x_value_for_the_image, 
        ymin = minimum_y_value_for_the_image, 
        ymax = maximum_y_value_for_the_image
        ) 
    }

In order for the image to be plotted, we first have to make it plot-able by making it a graphical object, or GROB.

The rasterGrob() function (found in the grid package) renders a bitmap (raster) image into a graphical object, or GROB, which can then be displayed at a specified location, orientation, etc. Read more about using raster images in R here.

The arguments xmin, xmax, ymin, and ymax give the horizontal and vertical locations (in data coordinates) of the raster image. In our residual plot, we want the center of the image to be located at the coordinates (.fitted, .stdresid). In the syntax below, we add a small bit to the maximum values and subtract a small bit from the minimum values to force the icon into a box that will plot the icons a bit smaller than their actual size. (#protip: play around with this value until you get a plot that looks good.)

library(grid)

for(i in 1:nrow(plotData)){
    p = p + annotation_custom(
      rasterGrob(ghost),
      xmin = plotData$.fitted[i]-0.2, xmax = plotData$.fitted[i]+0.2, 
      ymin = plotData$.stdresid[i]-0.2, ymax = plotData$.stdresid[i]+0.2
      ) 
    }

Finally we print the plot to our graphics device using

print(p)

And the result is eerily pleasant!

The African Data Initiative

Are you looking for a way to celebrate World Statistics Day? I know you are. And I can’t think of a better way than supporting the African Data Initiative (ADI).

I’m proud to have met some of the statisticians, statistics educators, and researchers who are leading this initiative at an International Association for Statistical Education Roundtable workshop in Cebu, the Philippines, in 2012. You can read about Roger and David Stern’s projects in Kenya here in the journal Technology Innovations in Statistics Education. This group (represented at the workshop by father-and-son Roger and David, and by then-grad students Zacharaiah Mbasu and James Musyoka) impressed me with their determination to improve international statistical literacy and with their successful and creative pragmatic implementations, adjusted to the needs of the local situations in Kenya.

The ADI is seeking funds within the next 18 days to adapt two existing software packages, R and Instat+, so that there is a free, open-source, easy-to-learn statistical software package available and accessible throughout the world. While R is free and open-source, it is not easy to learn (particularly in areas where English literacy is low). Instat+ is, they claim, easy to learn but not open-source (and it also does not run on Linux or Mac).

One of the exciting things about this project is that these solutions to statistical literacy are being developed by Africans working and researching in Africa, and are not ‘imported’ by groups or corporations with little experience implementing in the local schools. One lesson I’ve learned from my experience working with the Los Angeles Unified School District is that you must work closely with the schools for which you are developing curricula; outsider efforts have a lower chance of success. I hope you’ll take a moment, in the next 18 days, to become acquainted with this worthy project!

World Statistics Day is October 20. The theme is “Better Data. Better Lives.”

Research Hack: Obtaining an RSS Feed for Websites without RSS Feeds

In one of our older posts, I wrote about using feedreaders and aggregators to keep up-to-date on blogs, journals, etc. These work great when the site you want to read has an RSS feed. RSS (Rich Site Summary) is simply a format that retrieves updated content from a webpage. A feedreader (or aggregator) grabs the “feed” and displays the updated content for you to read. No more having to visit the website daily to see if something changed or was updated.

Many sites have an RSS feed and make the process of adding the content to your feedreader fairly painless. In general the process works something like this…look for the RSS icon, click it, choose the reader you want to send the content to, and click subscribe.

Subscribing to an RSS Feed

But what about sites that do not have an RSS feed explicitly in place? It turns out that there are third-party applications that can piece together an RSS feed for these sites as well. There are several of these applications available online. I have used Page2RSS and had pretty good luck.

Just enter the URL of the site you want to keep tabs on into the “Page URL” box and click the “to RSS” button. This will produce another URL that can be copied and pasted into your feedreader or subscribed to directly.

I have successfully used this to add RSS feeds from many sites and journals to my feedreader. Below is a shot from Feedly (my current reader) on my iPad Mini of updated content (as of this morning) on the Journal of Statistics Education website.

JSE on Feedly

TIL what happens if you use %>% instead of + in ggplot2

This post is about the ggplot2 and dplyr packages, so let’s start by loading them:

 
library(ggplot2) 
library(dplyr) 

I can’t be the first person to make the following mistake:

 
ggplot(mtcars, aes(x = wt, y = mpg)) %>%
    geom_point() 

Can you spot the mistake in the code above? Look closely at the end of the first line.

The operator should be the + used in ggplot2 for layering, not the %>% operator used in dplyr for piping, like this:

 
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() 

So what happens if you accidentally use the pipe operator instead of the +? You get the following error:

Error in get(x, envir = this, inherits = inh)(this, ...) : 
 Mapping should be a list of unevaluated mappings created by aes or aes_string

My Google search for this error did not yield my careless mistake as a potential cause. (The error makes sense in hindsight: the pipe passes the ggplot object as the first argument of geom_point(), which is supposed to be an aesthetic mapping, hence the complaint.) Since many people use these two packages together, I’m guessing such a mix-up of operators can’t be too uncommon (right? I can’t be the only one…). So I’m leaving this post here for the next person who makes the same mistake.