I was creating a dataset this past week in which I had to partition the observed responses to show how the ANOVA model partitions the variability. I had the observed Y (in this case, prices for 113 bottles of wine) and a categorical predictor X (the region of France that each bottle of wine came from). I was going to add three columns to these data: the first showing the marginal mean, the second showing the effect, and the third showing the residual. To create the variable indicating the effect, I essentially wanted to recode each region to its effect:
- Bordeaux ==> 9.11
- Burgundy ==> 4.20
- Languedoc ==> –9.30
- Rhone ==> –0.75
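To make the three columns concrete, here is a minimal sketch of the partition on a tiny made-up data set. The prices below are hypothetical (not the 113 bottles), and the sketch assumes the effect is defined as the region mean minus the marginal mean.

```r
# Toy data: prices and regions are made up for illustration only.
wine <- data.frame(
  Region = c("Bordeaux", "Bordeaux", "Burgundy", "Burgundy", "Rhone", "Rhone"),
  Price  = c(60, 50, 45, 35, 30, 20)
)

# Marginal (grand) mean, repeated for every row
wine$Mean <- mean(wine$Price)

# Effect: each region's mean minus the marginal mean
# (ave() returns the group mean for every row by default)
wine$Effect <- ave(wine$Price, wine$Region) - wine$Mean

# Residual: what is left after removing the marginal mean and the effect
wine$Residual <- wine$Price - wine$Mean - wine$Effect

# The three pieces add back up to the observed price
all.equal(wine$Price, wine$Mean + wine$Effect + wine$Residual)
```

In my case the four effects were already known (the list above), so what remained was attaching the right one to each row.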
As I was considering how to do this, it struck me that several options were available to me. Here are two solutions that come up when Googling how to do this.
Use the recode() function from the car package.
library(car)
wine$Effect <- recode(wine$Region,
" 'Bordeaux' = 9.11;
'Burgundy' = 4.20;
'Languedoc' = -9.30;
'Rhone' = -0.75 " )
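Use indexing to assign the effect for each region in turn.
# Start every row at the Bordeaux effect, then overwrite the other regions
wine$Effect <- 9.11
wine$Effect[wine$Region == "Burgundy"] <- 4.20
wine$Effect[wine$Region == "Languedoc"] <- -9.30
wine$Effect[wine$Region == "Rhone"] <- -0.75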
A better solution pedagogically seems to be to create a new data frame of key-value pairs (a lookup table; the closely related data structure in computer science is the hash table) and then use the join() function from the plyr package to 'join' the original data frame and the new data frame.
key <- data.frame(
Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
Effect = c(9.11, 4.20, -9.30, -0.75)
)
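library(plyr)
# by takes the column name as a character string; assign the result to keep the new column
wine <- join(wine, key, by = "Region")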
For me this is a useful way to teach how to recode variables. It has a direct link to the Excel VLOOKUP function, and also to ideas of relational databases. It also allows more generalizability in terms of being able to merge data sets using a common variable.
R-wise, it is not difficult syntax, since almost every student has successfully used the data.frame() function to create a data frame. The join() function is also easily explained.
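The same lookup-table idea can be sketched without plyr, using base R's merge(). This is a swapped-in alternative to join(), not the approach above, and the wine rows and the Bottle column are made up for illustration. One difference worth showing students: merge() sorts the result by the key column, so the original row order has to be restored by hand.

```r
# Toy data: a few made-up rows standing in for the real wine data.
wine <- data.frame(
  Bottle = 1:5,
  Region = c("Rhone", "Bordeaux", "Languedoc", "Burgundy", "Bordeaux")
)

# Lookup table of key-value pairs (effects taken from the list at the top of the post)
key <- data.frame(
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
  Effect = c(9.11, 4.20, -9.30, -0.75)
)

# Left join on the shared column; all.x = TRUE keeps every bottle
merged <- merge(wine, key, by = "Region", all.x = TRUE)

# merge() sorts by the key column, so restore the original row order
merged <- merged[order(merged$Bottle), ]
merged$Effect
```

Because the key is just another data frame, it can carry more than one value column (say, an effect and a region label), and the same call attaches all of them at once.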
I’m with you on join for more complicated cases, but here I think regular character subsetting is all you need:
effects <- c(Bordeaux = 9.11, Burgundy = 4.20, Languedoc = -9.30, Rhone = -0.75)
wine$Effect <- effects[wine$Region]
That is the easiest and most concise solution I have seen for recoding. Yet, it wasn’t even on my radar. It is amazing, in some ways, that even after using and teaching R for this long I still don’t really have a handle on it. Thank you for the comment.
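One caveat with the named-vector approach is worth a quick sketch: if Region is stored as a factor rather than a character vector, subsetting uses the factor's integer codes instead of its labels. The example below uses made-up regions and deliberately orders the factor levels differently from the names in effects to make the difference visible. Wrapping the index in as.character() is one way to guard against it.

```r
effects <- c(Bordeaux = 9.11, Burgundy = 4.20, Languedoc = -9.30, Rhone = -0.75)

# Made-up regions; the factor levels are deliberately NOT in the same
# order as the names of `effects`.
region <- factor(c("Rhone", "Bordeaux", "Languedoc"),
                 levels = c("Rhone", "Languedoc", "Burgundy", "Bordeaux"))

# Indexing with a factor uses its integer codes (here 1, 4, 2), not its labels
by_code <- unname(effects[region])

# Converting to character looks each value up by name
by_name <- unname(effects[as.character(region)])

by_code  # values picked by position
by_name  # values picked by region name
```

When the factor levels happen to be in the same order as the named vector (alphabetical, in the wine example), both forms give the same answer, which is why the short version works there.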