Dear Gmail…

I recently added a free application/service that analyzes my email called Gmail Meter. This service sends me a comprehensive weekly report full of summaries and plots that indicate how I use Gmail.

The first thing I learned is that Wednesdays are for emailing, and I seem to respond in a timely manner, on average, to emails sent to me…when I actually respond. (I have a 24.58% response rate. Yikes!) On Wednesdays I teach only one class (at 4:40pm) this semester, but I have a morning meeting, so I am on campus and generally have time to respond to emails that I may not have gotten to.

Summary of my Gmail

The plot of my daily email traffic shows that most email is sent to me during the day (typical work hours), while my email times tend to be prior to classes in the morning and after my evening courses. Also, it is clear I am sending far less than I receive. It appears I am doing my part to lower my email footprint!

I seem to be more prompt in my email responses (for the most part) than others who respond to me. What is interesting is that people who respond to me are primarily either very quick (<4 hrs) or take more than a day to get back to me. This fits with the behavior I expect from most academics.

In the emails I send, I tend to be terse. Generally, I try to avoid long emails to people since when long emails are sent to me I tend to get cranky. (I recognize that sometimes it can’t be avoided.) I actually am quite pleased that the mode here is less than 10 words. (Again, yay for my footprint!)

I am not quite as happy to see that the mode for emails sent to me is the category indicating more than 200 words. Some of this is because of the university committees I sit on. For example, the University of Minnesota Senate sends many emails. These emails often are lengthy because of the inclusion of bylaws and articles to the University Constitution that we will be voting on. That being said, I agree with this email charter which begs us all to keep it short.

What kind of media attachments are taking up space in my Gmail box? It seems that most are Microsoft Word documents. Again, given my collaboration with other academics and feedback to students, this makes sense to me. Since I have a Mac and most of my colleagues still work on PCs, I send many documents as PDF files. My guess is that if this report had been sent to me a few years ago, the number of attachments would have been even higher. Our research group has slowly worked toward using sites like Dropbox to share documents. (Next stop…some versioning system.)

Now for the plot that made me stop and write this post. Almost 90% of the email I received this week hit the trash can. Also, a small percentage is still in my inbox. I am trying to achieve Inbox Zero, but just haven’t made it yet. I am currently down to xxx emails in my inbox. I signed up for the Mailbox app, which should help with this goal when I check email on my phone, but like the Tempo app that Rob signed up for, there is a reservation system in place. Unlike Rob, my spot in the Mailbox line is nowhere near the bottom (last I looked, 632,889 people were in front of me) despite having reserved my place in line several weeks ago.

I also receive information on the week’s top emailers to me (Joan) and the top recipients of my mail (one of my students); the top conversation threads; and a scatterplot of the number of words per email in a thread versus the rank of the email in the thread (was it the 1st email sent, 2nd, etc.). As one might expect, there is a strong, negative relationship here. It also produces a word cloud based on the subjects and bodies of all messages sent or received directly. Lastly, it conditions emails received with attachments on whether they came from inside or outside the organization (University of Minnesota).

It is not clear that you can obtain the raw data, although it is not clear that you can’t either. There are, of course, ways to obtain the metadata that Gmail Meter is using by scraping it with a language such as Python (see here). My guess is that you could also do this with R (perhaps using the curl and XML packages). They have several feature requests for making Gmail Meter more customizable, which would make it even cooler.

Happy Birthday Florence Henderson

As a celebration of Florence Henderson’s 79th birthday (on February 14), I have created this scatterplot to use in my regression course.

The plot depicts the relationship between time spent on mathematics homework outside of school (expressed as z-scores) and mathematics achievement scores (expressed as T-scores, M=50, SD=10) for 200 8th-graders taken from the 1988 National Education Longitudinal Study. The color, in a display of very poor data science, is just randomly applied to the observations rather than meaning anything substantial. (Blogger’s Note: I think it fits with the spirit of Valentine’s Day…a gratuitous, yet meaningless, gesture intended to make the receiver feel all gooshy.)

I created the plot using the Valentine package (available here) which applies a Valentine’s Day theme to ggplot. I also applied a picture of Cupid into the background of the plot and used hearts instead of points to plot the observations. Lastly, I changed the default color and fill on the regression smoother to more aptly fit the color scheme.

Below I will explain the how-to of creating this plot.

Reading in the NELS Data

First, I read the NELS data into R. These data and their codebook are available via my regression course website.

nels <- read.csv("http://www.tc.umn.edu/~zief0002/Data/NELS.csv")

Using Hearts Instead of Points

I first needed to find an image of a heart that I liked. For icons of all sorts, I generally use The Noun Project. (This particular heart can be found here.) All of the images at The Noun Project are SVG files. This makes them very useful for display in browsers. To use them in ggplot, I converted the SVG image to a PNG file using the free image manipulation program GIMP. (Perhaps you can use the SVG format directly without converting it, but I have never done that, so I don’t know.)

Using GIMP, I also changed the black color to the hexadecimal color “ad97c6” by selecting Color > Map > Color Exchange. (Double-clicking the color swatch under “To Color” allows for color entry in hexadecimal.) After this I saved the heart as “heart1.png”, and repeated the process four more times, but using the colors “e58cbc” (heart2.png), “f2935b” (heart3.png), “9fc8b6” (heart4.png), and “eddc74” (heart5.png). (Note: These colors are the same colors I chose for the Valentine’s Day theme color and fill palettes and are based on the colors of the candy hearts you get in elementary school.)

I then used the png package to read the five PNG files into R. (Note: For anyone who doesn’t want to go through the hassle of coloring the hearts and reformatting both them and Cupid, I have made those files available for download here.)

library(png)
# native = TRUE returns each image as a nativeRaster, ready for rasterGrob()
h1 <- readPNG("/Users/andrewz/Desktop/heart1.png", native = TRUE)
h2 <- readPNG("/Users/andrewz/Desktop/heart2.png", native = TRUE)
h3 <- readPNG("/Users/andrewz/Desktop/heart3.png", native = TRUE)
h4 <- readPNG("/Users/andrewz/Desktop/heart4.png", native = TRUE)
h5 <- readPNG("/Users/andrewz/Desktop/heart5.png", native = TRUE)

To randomly assign each observation to one of the five hearts (h1–h5), I used the sample() function inside the paste() function to concatenate the letter “h” and a random value from 1–5. I then used the do() function from the mosaic package to “do” this 200 times. Lastly, I appended this vector to the nels data frame, in the process coercing it to character (to be sure it isn’t appended as a factor; the get() function used later needs character strings).

library(mosaic)
heart <- do(200) * paste("h", sample(1:5, size = 1), sep="")
nels$Heart <- as.character(heart[, 1])
head(nels)
#  ID   Homework   Math Heart
#1  1 -0.3329931 42.432    h4
#2  2 -0.2136822 53.698    h2
#3  3 -1.0077991 49.205    h2
#4  4  0.2059000 53.698    h1
#5  5 -0.1177185 55.980    h5
#6  6  0.1413540 65.331    h3
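
For anyone without the mosaic package, the same random assignment can be sketched in base R. This is only an illustration: the toy data frame and the seed are stand-ins made up for the example (not the NELS data), and sample() with replace = TRUE draws all 200 labels in one call rather than repeating a single draw 200 times.

```r
# Hypothetical stand-in for the 200-row nels data frame
toy <- data.frame(ID = 1:200)

set.seed(214)  # arbitrary seed so the assignment is reproducible

# Draw 200 values from 1-5 with replacement and prefix each with "h"
toy$Heart <- paste0("h", sample(1:5, size = nrow(toy), replace = TRUE))

table(toy$Heart)  # how many observations landed on each heart
```

Because paste0() returns a character vector, no as.character() coercion is needed before the labels are handed to get().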

This is all that is needed until we get to actually creating the plot.

Adding a Cupid into the Plot’s Background

I obtained the image of Cupid by doing a Google search on “Cupid, Public Domain”. (The actual image I used is available here.) Since the image was a JPEG, I again converted the image to a PNG (this time using Preview) and changed the image size to have a width of 800 pixels. I also used the Instant Alpha tool in Preview’s toolbar to make the white background transparent. As a sidebar, this could all be done in GIMP as well.

I again used the readPNG() function to read the image file into R, but this time setting the native= argument to FALSE. This will represent the image as an array rather than rasterizing it. I chose to do this so that I could make the image more transparent before rasterizing it.

cupid <- readPNG("/Users/andrewz/Desktop/cupid.png", native = FALSE)
# Scale the alpha channel to 20% and rebuild a matrix of hex color strings
w <- matrix(rgb(cupid[ , , 1], cupid[ , , 2], cupid[ , , 3], cupid[ , , 4] * 0.2), nrow = dim(cupid)[1])
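
To see what that last line does without needing the Cupid file, here is a small sketch on a synthetic image. The 2 x 3 array img is made up for the example (solid red, fully opaque); it has the same height x width x 4 layout that readPNG() returns for an RGBA image when native = FALSE.

```r
# Synthetic 2 x 3 RGBA image: values are in [0, 1], channels are R, G, B, alpha
img <- array(0, dim = c(2, 3, 4))
img[, , 1] <- 1  # red channel at full intensity
img[, , 4] <- 1  # fully opaque

# Multiply the alpha channel by 0.2, convert each pixel to a hex color string,
# and restore the row/column layout that rgb() drops
faded <- matrix(
  rgb(img[, , 1], img[, , 2], img[, , 3], img[, , 4] * 0.2),
  nrow = dim(img)[1]
)

dim(faded)   # same height and width as the input
faded[1, 1]  # hex string with a reduced alpha component
```

The matrix() call works because both the array slices and matrix() use column-major order, so each color string ends up back in the pixel position it came from.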

Creating the Plot

Finally, we are ready to create the plot. In the initial calls to ggplot, I use the rasterGrob() function from the grid package to rasterize Cupid which is placed in the plot using the annotation_custom() layer. The color= and fill= arguments to the geom_smooth() layer set the beautiful magenta/pink color for the regression smoother. The theme_valentine() layer sets ggplot’s theme to use some specialized fonts (see the Valentine package on GitHub). The size of the points in the geom_point() layer is set to 0, to leave a blank canvas for us to add our hearts. This is assigned into the object p.

library(ggplot2)
library(grid)
library(Valentine)
p <- ggplot(data = nels, aes(x = Homework, y = Math)) +
	geom_point(size = 0) +
	annotation_custom(rasterGrob(w)) +
	theme_valentine() +
	geom_smooth(method = "lm", color = "#ec008c", fill = "#ec008c", lwd = 1.5) +
	ggtitle("HAPPY BIRTHDAY FLORENCE HENDERSON")

The hearts (points) are now added to the plot by cycling through a for loop. Each time we are going to rasterize the heart and add it to the plot. The optional arguments in the annotation_custom() layer set the horizontal and vertical position of the hearts in the plot.

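# Add one heart per observation, centered on its (Homework, Math) value
for (i in 1:nrow(nels)) {
  p <- p + annotation_custom(
    rasterGrob(get(nels$Heart[i])),
    xmin = nels$Homework[i] - 0.5, xmax = nels$Homework[i] + 0.5,
    ymin = nels$Math[i] - 0.5, ymax = nels$Math[i] + 0.5
  )
}
p  # print the finished plot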
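
The same layering idea can be tried without the PNG files or the Valentine package. The sketch below rests on several assumptions: the toy data are simulated, a solid-color 1 x 1 raster stands in for a heart image, and geom_blank() is used only to set up the axes.

```r
library(ggplot2)
library(grid)

set.seed(1)  # arbitrary seed for the simulated data
toy <- data.frame(x = rnorm(10), y = rnorm(10, mean = 50, sd = 10))

# A 1 x 1 raster in one of the candy-heart colors stands in for a heart PNG
marker <- as.raster(matrix("#e58cbc", nrow = 1))

# Start from an empty panel, then add one image layer per row of the data
p <- ggplot(toy, aes(x = x, y = y)) + geom_blank()
for (i in seq_len(nrow(toy))) {
  p <- p + annotation_custom(
    rasterGrob(marker),
    xmin = toy$x[i] - 0.1, xmax = toy$x[i] + 0.1,
    ymin = toy$y[i] - 1,   ymax = toy$y[i] + 1
  )
}

length(p$layers)  # the blank layer plus one annotation layer per observation
```

The xmin/xmax and ymin/ymax values are in data units, so the half-widths may need to differ when the two axes are on very different scales. By default rasterGrob() preserves the image’s aspect ratio and fits it inside that box.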

Miscellany that I have Read and been Thinking about this Last Week

I read a piece last night called 5 Ways Big Data Will Change Lives In 2013. I really wasn’t expecting much from it, just scrolling through accumulated articles on Zite. However, as with so many things, there were some gems to be had. I learned of Aadhar.

Aadhar is an ambitious government Big Data project aimed at becoming the world’s largest biometric database by 2014, with a goal of capturing about 600 million Indian identities…[which] could help India’s government and businesses deliver more efficient public services and facilitate direct cash transfers to some of the world’s poorest people — while saving billions of dollars each year.

The part that made me sit up and take notice was this line, “India’s Aadhar collects sensitive information, such as fingerprints and retinal scans. Yet people volunteer because the potential incentives can make the data privacy and security pitfalls look miniscule — especially if you’re impoverished.”

I have been reading and hearing about concerns of data privacy for quite a while, yet nobody that I have been reading (or listening to) has once suggested what the circumstances are that would have citizens forgo all sense of privacy. Poverty, especially extreme poverty, is one of those circumstances. As a humanist, I am all for facilitating resources in the most efficient ways possible, which inevitably involve technology. But, as a Citizen Statistician, I am all too aware of how a huge database of biometric data could be used (or mis-used as it were). It especially concerns me that our impoverished citizens, who are more likely to be in the database, will be more at risk for being taken advantage of.

A second headline that caught my eye was France Looks At Possibility Of Taxing Internet Companies For Data Mining. France is pointing out that companies such as Google and Facebook are making enormous sums of money by mining and using citizens’ personal information, so why shouldn’t that be seen as a taxable asset? While this is a reasonable question, the article also points out that one potential consequence of such taxation is that the “free” model (at least monetarily) that these companies currently use might cease to exist.

Related to both of these articles, I also read a blog post about a seminar being offered in the Computer Science department at the University of Utah entitled Accountability in Data Mining. The professor of the course wrote in the post,

I’m a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be “in scope”. I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it’s a 1-credit seminar that meets once a week, I obviously can’t cover everything I’d like, but I’d like to flesh out the readings with related work that people can peruse later.

It is about time some university offered such a course. I think this will be ultimately useful (and probably should be required) content to include in every statistics course taught. In making decisions using data, who is accountable for those decisions, and the consequences thereof?

Lastly, I would be remiss not to include a link to what might be the article that resonated with me most: It’s not 1989. The author points out that the excuse “I’m not good with computers” is not acceptable any longer, especially for educators. He makes a case for a minimum level of technological competency that teachers should have in this day and age. I especially agree with the last point,

Every teachers must have a willingness to continue to learn! Technology is ever evolving, and excellent teachers must be life-long learners. (Particularly in the realm of technology!)

The lack of ability with computers that I see on a day-to-day basis in several students and faculty (even the base-level literacy that the author wants) is frightening and saddening at the same time. I would love to see colleges and universities give all incoming students a computer literacy test at the same time as they take their math placement test. If you can’t copy-and-paste you should be sent to a remedial course to obtain the skills you need to acquire before taking any courses at the institution.

My Year of Reading in Review

Two years ago, I made a New Year’s Resolution to read more books. At that point I joined GoodReads to hold myself accountable. I read 47 books that year (at least that I recorded). In 2012, I didn’t re-make that resolution, and my reading productivity dropped to 29 (really 26 since I quit reading 3 books). While the number of books is lower, I did some minor analyses on these books based on data I scraped from GoodReads and Amazon.

One summary I created was to examine the number of books I read per month. I also wanted to account for the fact that some books are a lot shorter than others, so in addition I looked at the average number of pages I read per month as well.

Number of books and average pages read per month in 2012.

It is clear that May and December are prolific reading months for me. My interpretation is that these are the months that semesters end, and very often I retreat into the pages of a book or two to escape for a bit.
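
A monthly summary like this one can be sketched in a few lines of base R. The reading log below is made up for illustration, and its column names (title, finished, pages) are assumptions rather than the columns of the actual spreadsheet; “average pages” is taken here to mean the mean page count of the books finished in each month.

```r
# Hypothetical reading log (titles, dates, and page counts are invented)
log_2012 <- data.frame(
  title    = c("A", "B", "C", "D", "E"),
  finished = as.Date(c("2012-05-03", "2012-05-20", "2012-07-11",
                       "2012-12-02", "2012-12-28")),
  pages    = c(320, 180, 410, 250, 150)
)

# Month as a 12-level factor so months with no books still appear
log_2012$month <- factor(format(log_2012$finished, "%m"),
                         levels = sprintf("%02d", 1:12))

n_books   <- table(log_2012$month)                        # books per month
avg_pages <- tapply(log_2012$pages, log_2012$month, mean)  # mean pages per month

n_books
avg_pages  # NA marks months in which no book was finished
```

Keeping all twelve levels in the factor means empty months show up as zero counts (and NA averages) instead of silently disappearing from the plot.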

How do I rate these books I read? Are Amazon and GoodReads raters giving the books I read the same rating?

Average rating of the books I read in 2012 for Amazon and GoodReads raters. Size and color of the points indicate my GoodReads rating.

I gave mostly 3/5 and 4/5 stars to the books I read. It is clear from the plot that there is an overall positive relationship: books that are rated highly by GoodReads raters are, on average, the same books being rated highly by Amazon raters–and vice-versa.

What book did I give a rating of 5/5 to? The synopsis of my reading year, including the answer to this question, is available here [2012-Annual-Reading-Synopsis].

I have also made the spreadsheet data (from both 2011 and 2012) available publicly on GoogleDocs [data].

Here is the R code to access that data.

library(RCurl)
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?key=0AvanLJO1M39wdENZajR0RHJMSmZTWWtLNzhHMi1ySUE&single=true&gid=0&output=csv")
books <- read.csv(textConnection(myCsv))
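
Once the CSV text is in a character string, read.csv() can also parse it through its text= argument, which is an alternative to wrapping the string in textConnection(). The sketch below uses an inline string with invented rows and column names so that it runs without a network connection; it does not reflect the layout of the real spreadsheet.

```r
# Made-up CSV text standing in for what getURL() would return
csv_text <- "Title,Pages,Rating\nBook A,320,4\nBook B,180,3\n"

# text= tells read.csv() to parse the string itself rather than a file path
books_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(books_demo)
```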

Turning Tables into Graphs

We have just finished another semester, and before my mind completely turns to rubble, I want to share what I believe to be a fairly good assignment. What I present below was part of two separate assignments that I gave this semester, but upon reflection I think it would be better as one.

—–

Read the article Let’s Practice What We Preach: Turning Tables into Graphs (full reference given below). In this article, Gelman, Pasarica, & Dodhia suggest that presentations of results using graphs are more effective than results presented in tables.

Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach: Turning tables into graphs. The American Statistician, 56(2), 121–130.

Find an article in a journal that presents results (or data) in a table. Re-create the data in a tabular format using R (or Excel).

  1. Use the functions in ggplot2 to produce a plot that conveys the same message as the original table.
  2. Include the original table (this can be a screenshot or web-link) and citation, along with your plot.
  3. Write a few sentences describing why the plot you produced provides a better presentation of the results or data (be sure to use recommendations from the article in making your case).

In the second part of this assignment, you will write a tutorial for the process you followed for turning a table into a plot using R Markdown and will publish that tutorial on RPubs.

There are several resources for learning R Markdown.

Your tutorial should be written so that a student who was just learning ggplot could follow your directions easily. Include instructions for obtaining the data, getting it into a useable tabular format, manipulating the data so it can be used with ggplot, and well-commented instructions for creating your final plot. (Think of the level of detail you would want in a tutorial when you were first learning ggplot!)

It should also include:

  • a citation or link to the website/journal that published the original table
  • a view of your final data (full or a subset depending on size)
  • all commands necessary to create your final plot (with appropriate explanation), and
  • the final plot

When you knit the .Rmd document it should compile without errors.

—–

Students commented that they learned a lot about the use of ggplot during the initial assignment (this was the second assignment in the course). The Markdown part of the assignment I gave as an extra credit assignment at the end of the class, but in retrospect, I should have made it required and done it very early on.
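
To give a flavor of what the first part of the assignment asks for, here is a minimal sketch. The numbers are entirely made up (they do not come from any published table), and a dot plot with interval bars is just one reasonable choice of display, not the only one.

```r
library(ggplot2)

# Made-up "table" of estimates and standard errors, re-created as a data frame
results <- data.frame(
  group    = c("Control", "Treatment A", "Treatment B"),
  estimate = c(2.1, 3.4, 2.9),
  se       = c(0.4, 0.5, 0.3)
)

# Approximate 95% intervals computed from the tabled standard errors
results$lower <- results$estimate - 1.96 * results$se
results$upper <- results$estimate + 1.96 * results$se

# One point per row of the table, with a line segment showing its interval
p <- ggplot(results, aes(x = group, y = estimate, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = NULL, y = "Estimate (approximate 95% interval)")

p
```

Flipping the coordinates puts the group labels on the vertical axis, which keeps them readable and makes it easy to compare estimates along a common horizontal scale.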

Here are a few of the tutorials that I have received so far:

  • These students took a table of characteristics of survey participants published in the Journal of Ethnic and Cultural Diversity in Social Work and turned it into a bar graph. http://rpubs.com/TSK_2012/3184
  • These students took data about trends and topics discussed in Seventeen Magazine‘s Traumarama articles from 1994-2007 and turned it into a line plot. http://rpubs.com/opalc123/3155
  • These students took a table of data related to approval ratings and turned them into a box-and-whiskers plot. http://www.rpubs.com/GeorgeBrisse/3217
  • These students’ work is a great example of how data initially presented in a table is much easier to process in a graph. The data, from a table published in the Journal of Deaf Studies and Deaf Education, show the academic status and progress of deaf and hard-of-hearing students in general education classrooms. http://rpubs.com/mens0055/3211
  • These students used a stacked bar chart to show data about the sample sizes for different stages for 12 problem behaviors published in Health Psychology. http://rpubs.com/nikedenise/3256
  • These students created a line graph representing pre- and post-training scores for consonant, vowel, sentence, and gender perception scores in cochlear implant users to examine whether an auditory training program improves performance. http://rpubs.com/koern030/3255

Computing Skills, Nunchaku Skills, Bow Skills…

I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (not necessarily all of my colleagues) that students need computing skills. First, a little background…

I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take academic jobs, jobs at testing companies (e.g., Pearson, College Board), or consulting gigs.

I have attended conferences and read blogs and papers in which it has been suggested that students learn computing skills. I am convinced of this need at 100.4%, 95%CI = [100.3%, 100.5%].

The more practical issue is what computing skills should these students learn and how deeply? And, how should they learn them (e.g., in a class, on their own, as part of an independent study)?

The latter question is important enough to merit its own post (later), so I will not address that here. Below I will begin a list of the computing skills that I believe the Quantitative Methods students should learn, and I hope readers will add to it. I use the word computing rather broadly as a matter of intention. I also do not list these in any particular order at this point, other than how they come to mind.

  • At least one programming language (probably R)
    • In my mind two or three would be better depending on the content focus of the student (Python, C++, Perl)
  • LaTeX
  • Knitr/Sweave (I used to say Sweave, but Knitr is easier initially)
  • HTML/HTML5
  • CSS
  • KML
    • I think students should also know about PHP and JavaScript. Perhaps they don’t have to be fluent in them, but they are important to know about. For example, to learn D3 (a visualization toolkit) it would behoove a student to learn JavaScript.
  • Markdown/R Markdown. These are, again, easy to learn and could ease students’ transition to Knitr. They could also lead to learning and using Slidify.
  • Regular Expressions
  • SQL
  • XML
  • JSON
  • XPATH
  • BibTeX (or some program to work with references….Mendeley, EndNote, something…)
  • Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (MPLUS, LISREL, OpenMX, AMOS, ConQuest, WinSteps, BUGS,  etc.)
  • Unix/Linux and Shell Scripting

I think students could learn many of these at a lesser level: the basics, and how to use them to solve simpler problems. In this way there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about specific computing skills that are important for their own research.

What have I missed?

Data Sets: A List in Flux

After my Pinterest post, I got a little bit hooked, mostly because I realized that it was a visual way for me to see my bookmarks. This makes it easier for me to find the information I am looking for quickly. One problem is that it requires an image, so I quickly realized that the links for data sets wouldn’t work so well on Pinterest.

Then I remembered that I have used my personal blog as an organized reminder list (see this post where I remind myself how to re-set features on my computer after disaster), and thought I could do the same here, but with data sets that others could also use. So, inspired by the post Rob linked to some time ago (Finding Data), I thought I would start putting together a more comprehensive list as I have time. This way, when I come across new data sets, I can just add them in.

Below, I have taken the data sets listed by RevoJoe from his post for Inside-R, and reorganized them a little. Over time, they could be re-organized again, and again, and again. It would be nice to add a short description for each as well (maybe someday). I have added some others to the list as well.

COLLECTIONS

ECONOMICS

EDUCATION

ENTERTAINMENT

FINANCE

GOVERNMENT, WORLD-LEVEL

GOVERNMENT, COUNTRY-LEVEL

GOVERNMENT, CITY-LEVEL

MACHINE LEARNING

SCIENCE

SOCIAL SCIENCES

TIME SERIES

UNIVERSITIES

USING R TO PULL DATA

ggplot2 Pinterest

I don’t understand the website Pinterest, but it looks pretty (especially on the iPad), and an undergraduate student said it was the greatest thing since Facebook, so I thought I would give it a shot. The idea is that Pinterest “lets you organize and share all the beautiful things you find on the web.” You organize beautiful things by creating a “board” (a page), and then adding “pins” (links to websites).

My thought was…plots in ggplot2 are beautiful…I will create a board with useful links/tutorials for creating ggplot2 plots!

I already have three followers. Now you can follow too.

http://pinterest.com/zief0002/ggplot2/