JSM 2016 session on Reproducibility in Statistics and Data Science

Will reproducibility always be this hard?Ten years after Ioannidis alleged that most scientific findings are false, reproducibility — or lack thereof — has become a full-blown crisis in science. Flagship journals like Nature and Science have published hand-wringing editorials and revised their policies in the hopes of heightening standards of reproducibility. In the statistical and data sciences, the barriers towards reproducibility are far lower, given that our analysis can usually be digitally encoded (e.g., scripts, algorithms, data files, etc.). Failure to ensure the credibility of our contributions will erode “the extraordinary power of statistics,” both among our colleagues and in our collaborations with scientists of all fields. This morning’s JSM session on Reproducibility in Statistics and Data Science featured talks on recent efforts in pursuit of reproducibility. The slides of talks by the speakers and the discussant are posted below.

Note that some links point to a GitHub repo including slides as well as other useful resources for the talk and for adopting reproducible frameworks for your research and teaching. I’m also including Twitter handles for the speakers which is likely the most efficient way for getting in touch with them if you have any questions for them.

This session was organized by Ben Baumer and myself as part of our Project TIER fellowship. Many thanks to Amelia McNamara, who is also a Project TIER fellow, for chairing the session (and correctly pronouncing my name)!

  • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets – Jenny Bryan – repo, including slides – @JennyBryan
  • Steps Toward Reproducible Research – Karl Broman – slides – @kwbroman
  • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! – Karthik Ram – slides – @_inundata
  • Integrating Reproducibility into the Undergraduate Statistics Curriculum – Mine Çetinkaya-Rundel – repo, including slides – @minebocek
  • Discussant: Yihui Xie – slides – @xieyihui

PS: Don’t miss this gem of a repo for links to many many more JSM 2016 slides. Thanks Karl for putting it together!

My JSM 2016 itinerary

JSM2016

JSM 2016 is almost here. I just spent an hour going through the (very) lengthy program. I think that was time well spent, though some might argue I should have been working on my talk instead…

Here is what my itinerary looks like as of today. If you know of a session that you think I might be interested in that I missed, please let me know! And if you go to any one of these sessions and not see me there, it means I got distracted by something else (or something close by).

Sunday, July 31

Unfortunately it looks like I’ll be in meetings all Sunday, but if there is an opportunity to sneak out I would love to see the following sessions:

4PM – 5:50pm

  • Making the Most of R Tools
    • Thinking with Data Using R and RStudio: Powerful Idioms for Analysts — Nicholas Jon Horton, Amherst College ; Randall Pruim, Calvin College ; Daniel Kaplan, Macalester College
    • Transform Your Workflow and Deliverables with Shiny and R Markdown — Garrett Grolemund, RStudio
    • Discussant: Hadley Wickham, Rice University
  • Media and Statistics
    • Causal Inferences from Observational Studies: Fracking, Earthquakes, and Oklahoma — Howard Wainer, NBME
    • It’s Not What We Say, It’s Not What They Hear, It’s What They Say They Heard — Barry Nussbaum, EPA
    • Bad Statistics, Bad Reporting, Bad Impact on Patients: The Story of the PACE Trial — Julie Rehmeyer, Discover Magazine
    • Can Statisticians Enlist the Media to Successfully Change Policy? — Donald A. Berry, MD Anderson Cancer Center
    • Discussant: Jessica Utts, University of California at Irvine

I’ll also be attending the ASA Awards Celebration (6:30 – 7:30pm) this evening.

Monday, August 1

On Monday there are a couple ASA DataFest related meetings. If you organized a DataFest in 2016, or would like to organize one in 2017 (especially if you will be doing so for the first time), please join us. Both meetings will be held at Hilton Chicago Hotel, Room H-PDR3.

  • 10:30am – 2016 ASA DataFest Debrief Meeting
  • 1pm – 2017ASA DataFest Planning Meeting

8:30AM – 10:20AM

  • Applied Data Visualization in Industry and Journalism
    • Linked Brushing in R — Hadley Wickham, Rice University
    • Creating Data Visualization Tools at Facebook — Andreas Gros, Facebook
    • Cocktail Party Horror Stories About Data Vis for Clients — Lynn Cherny, Ghostweather R&D
    • Visualizing the News at FiveThirtyEight — Andrei Scheinkman, FiveThirtyEight.com
    • Teaching Data Visualization to 100k Data Scientists: Lessons from Evidence-Based Data Analysis — Jeffrey Leek, Johns Hopkins Bloomberg School of Public Health

If I could be in two places at once, I’d also love to see:

2PM – 3:50pm

I am planning on splitting my time between

4:45pm – 6:15pm

ASA President’s Invited Address – Science and News: A Marriage of Convenience — Joe Palca, NPR

Evening

I’ll be splitting my time between the Statistical Computing and Graphics Mixer (6 – 8pm) and the Duke StatSci Dinner.

Tuesday, August 2

8:30AM – 10:20am

  • Introductory Overview Lecture: Data Science
    • On Mining Big Data and Social Network Analysis — Philip S. Yu, University of Illinois at Chicago
    • On Computational Thinking and Inferential Thinking — Michael I. Jordan, University of California at Berkeley

10:30AM – 12:20pm

I’m organizing and chairing the following invited session. I think we have a fantastic line up. Hoping to see many of you in the audience!

  • Doing More with Data in and Outside the Undergraduate Classroom
    • Computational Thinking and Statistical Thinking: Foundations of Data Science — Ani Adhikari, University of California at Berkeley ; Michael I. Jordan, University of California at Berkeley
    • Learning Communities: An Emerging Platform for Research in Statistics — Mark Daniel Ward, Purdue University
    • The ASA DataFest: Learning by Doing — Robert Gould, University of California at Los Angeles
    • Statistical Computing as an Introduction to Data Science — Colin Rundel, Duke University

If I could be in two places at once, I’d also love to see:

2PM – 3:50pm

  • Interactive Visualizations and Web Applications for Analytics
    • Radiant: A Platform-Independent Browser-Based Interface for Business Analytics in R — Vincent Nijs, Rady School of Management
    • Rbokeh: An R Interface to the Bokeh Plotting Library — Ryan Hafen, Hafen Consulting
    • Composable Linked Interactive Visualizations in R with Htmlwidgets and Shiny — Joseph Cheng, RStudio
    • Papayar: A Better Interactive Neuroimage Plotter in R — John Muschelli, The Johns Hopkins University
    • Interactive and Dynamic Web-Based Graphics for Data Analysis — Carson Sievert, Iowa State University
    • HTML Widgets: Interactive Visualizations from R Made Easy! — Yihui Xie, RStudio ; Ramnath Vaidyanathan, Alteryx

If I could be in two places at once, I’d also love to see:

Evening

I’ll be splitting my time between the UCLA Statistics/Biostatistics Mixer (5-7pm), Google Cruise, and maybe a peek at the Dance Party.

Sad to be missing the ASA President’s Address – Appreciating Statistics.

Wednesday, August 3

8:30AM – 10:20am

I’m speaking at the following session co-organized by Ben Baumer and myself. If you’re interested in reproducible data analysis, don’t miss it!

  • Reproducibility in Statistics and Data Science
    • Reproducibility for All and Our Love/Hate Relationship with Spreadsheets — Jennifer Bryan, University of British Columbia
    • Steps Toward Reproducible Research — Karl W. Broman, University of Wisconsin – Madison
    • Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! — Karthik Ram, University of California at Berkeley
    • Integrating Reproducibility into the Undergraduate Statistics Curriculum — Mine Cetinkaya-Rundel, Duke University
    • Discussant: Yihui Xie, RStudio

If I could be in two places at once, I’d also love to see:

10:30AM – 12:20pm

  • The 2016 Statistical Computing and Graphics Award Honors William S. Cleveland
    • Bill Cleveland: Il Maestro of Statistical Graphics — Nicholas Fisher, University of Sydney
    • Modern Crowd-Sourcing Validates Cleveland’s 1984 Hierarchy of Graphical Elements — Dianne Cook, Monash University
    • Some Reflections on Dynamic Graphics for Data Exploration — Luke-Jon Tierney, University of Iowa
    • Carpe Datum! Bill Cleveland’s Contributions to Data Science and Big Data Analysis — Steve Scott, Google Analytics
    • Scaling Up Statistical Models to Hadoop Using Tessera — Jim Harner, West Virginia University

If I could be in two places at once, I’d also love to see:

2PM – 3:50pm

If I could be in two places at once, I’d also see:

4:45PM – 6:15pm

Evening

I’m planning on attending the Section on Statistical Education Meeting / Mixer (6-7:30pm).

Thursday, August 4

8:30AM – 10:20am

I think I have to attend a meeting at this time, but if I get a chance I’d love to see:

  • Big Data and Data Science Education
    • Teaching Students to Work with Big Data Through Visualizations — Shonda Kuiper, Grinnell College
    • A Data Visualization Course for Undergraduate Data Science Students — Silas Bergen, Winona State University
    • Intro Stats for Future Data Scientists — Brianna Heggeseth, Williams College ; Richard De Veaux, Williams College
    • An Undergraduate Data Science Program — James Albert, Bowling Green State University ; Maria Rizzo, Bowling Green State University
    • Modernizing an Undergraduate Multivariate Statistics Class — David Hitchcock, University of South Carolina ; Xiaoyan Lin, University of South Carolina ; Brian Habing, University of South Carolina
    • Business Analytics and Implications for Applied Statistics Education — Samuel Woolford, Bentley University
    • DataSurfing on the World Wide Web: Part 2 — Robin Lock, St. Lawrence University

10:30AM – 12:20pm

  • Showcasing Statistics and Public Policy
    • The Twentieth-Century Reversal: How Did the Republican States Switch to the Democrats and Vice Versa? — Andrew Gelman, Columbia University
    • A Commentary on Statistical Assessment of Violence Recidivism Risk — Peter B. Imrey, Cleveland Clinic ; Philip Dawid, University of Cambridge
    • Using Student Test Scores for Teacher Evaluations: The Pros and Cons of Student Growth Percentiles — J.R. Lockwood, Educational Testing Service ; Katherine E. Castellano, Educational Testing Service ; Daniel F. McCaffrey, Educational Testing Service
    • Discussant: David Banks, Duke University

If I could be in two places, I’d also love to see:

That’s it folks! It’s an ambitious itinerary, let’s hope I get through it all.

I probably won’t get a chance to write daily digests like I’ve tried to do in previous years at JSM, but I’ll tweet about interesting things I hear from @minebocek. I’m sure there will be lots of JSM chatter at #JSM2016 as well.

Now, somebody give me something else to look forward to, and tell me Chicago is cooler than Durham!

Statistics with R on Coursera

18332552I held off on posting about this until we had all the courses ready, and we still have a bit more work to do on the last component, but I’m proud to announce that the specialization called Statistics with R is now on Coursera!

Some of you might know that I’ve had a course on Coursera for a while now (whatever “a while” means on MOOC-land), but it was time to refresh things a bit to align the course with other Coursera offerings — shorter, modular, etc. So I chopped up the old course into bite size chunks and made some enhancements in each component such as

  • integrating dplyr and ggplot2 syntax into the R labs,
  • restructuring the labs to be completed in R Markdown to provide better scaffolding for a data analysis project for each course,
  • adding Shiny apps to some of the labs to better demonstrate statistical concepts without burdening the learners with coding beyond the level of the course,
  • creating an R package that contains all the data, custom functions, etc. used in the course, and
  • cleaning things up a bit to make the weekly workload consistent across weeks.

The underlying code for the labs and the package can be found at https://github.com/StatsWithR. Here you can also find the R code for reproducing some of the figures and analyses shown on the course slides (and we’ll keep adding to that repo in the next few weeks).

The biggest change between the old course and the new specialization though is a completely new course: Bayesian Statistics. I touched on Bayesian inference a bit in my old course, and this generated lots of discussion on the course forums from learners wanting more on this content. Being at Duke, I figured who better to offer this course but us! (If you know anything about the Statistical Science department at Duke, you probably know it’s pretty Bayesian.) Note, I didn’t say “me”, I said “us”. I was able to convince a few colleagues (David Banks, Merlise Clyde, and Colin Rundel) to join me in developing this course, and I’m glad I did! Figuring out exactly how to teach this content in an effective way without assuming too much mathematical background took lots of thinking (and re-thinking, and re-thinking). We have also managed to feature a few interviews with researchers in academia and industry, such as Jim Berger (Duke), David Dunson (Duke), Amy Herring (UNC), and Steve Scott (Google) to provide a bit more context for learners on where and why Bayesian statistics is relevant. This course launched today, and I’m looking forward to seeing the feedback from the learners.

If you’re interested in the specialization, you can find out more about it here. The courses in the specialization are:

  1. Introduction to Probability and Data
  2. Inferential Statistics
  3. Linear Regression and Modeling
  4. Bayesian Statistics
  5. Statistics Capstone Project

You can take the courses individually or sign up for the whole specialization, but to do the capstone you need to have completed the 4 courses in the specialization. The landing page for the specialization outlines in further detail how to navigate everything, and relevant dates and deadlines.

Also note that while the graded components of the course which will allow you to pursue a certificate require payment, one can audit the courses for free and watch videos, complete practices quizzes, and work on the labs.

Project TIER

Last year I was awarded a Project TIER (Teaching Integrity in Empirical Research) fellowship, and last week my work on the fellowship wrapped up with a meeting with the project leads, other fellows from last year, as well as new fellows for the next year. In a nutshell Project TIER focuses on reproducibility. Here is a brief summary of the project’s focus from their website:

For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

As part of the fellowship, beyond continuing working on integrating reproducible data analysis practices into my courses with the use of literate programming via R Markdown and version control via git/GitHub, I have also created templates two GitHub repositories that follow the Project TIER guidelines: one for use with R and the other with Stata. They both live under the Project TIER organization on GitHub. The idea is that one wishing to follow the folder structure and workflow suggested by Project TIER can make a copy of these repositories and easily organize their work following the TIER guidelines.

There is more work to be done on these of course, first of which is evolving the TIER guidelines themselves to line up better with working with git and R as well as working with tricky data (like large data, or private data, etc.). Some of these are issues the new fellows might tackle in the next year.

As part of the fellowship I also taught a workshop titled “Making your research reproducible with Project TIER, R, and GitHub” to Economics graduate students at Duke. These are students who primarily use Stata so the workshop was a first introduction to this workflow, using the RStudio interface for git and GitHub. Materials for this workshop can be found here. At the end of the workshop I got the sense that very few of these students were interested in making the switch over to R (can’t blame them honestly — if you’ve been working on your dissertation for years and you just want to wrap it up, the last thing you want to do is to have to rewrite all your code and redo your analysis in a different platform) but quite a few of them were interested in using GitHub for both version control and for showcasing their work publicly.

Also as part of the fellowship Ben Baumer (a fellow fellow?) and I have organized a session on reproducibility at JSM 2016 that I am very much looking forward to. See here for the line up.

In summary, being involved with this project was a great eye opener to the fact that there are researchers and educators out there who truly care about issues surrounding reproducibility of data analysis but who are very unlikely to switch over to R because that is not as customary for their discipline (although at least one fellow did after watching my demo on R Markdown in the 2015 meeting, that was nice to see 😁). Discussions around working with Stata made me once again very thankful for R Markdown and RStudio which make literate programming a breeze in R. And what my mean by “a breeze” is “easy to teach to and be adopted by anyone from a novice to expert R user”. It seems to me like it would be in the interest of companies like Stata to implement such a workflow/interface to support reproducibility efforts of researchers and educators using their software. I can’t see a single reason why they wouldn’t invest time (and yes, money) in developing this.

During these discussions a package called RStata also came up. This package is “[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

Overall my involvement with the fellowship was a great experience for meeting and brainstorming with faculty from non-stats disciplines (mostly from the social sciences) who regularly teach in platforms like Stata and SPSS who are also dedicated to teaching reproducible data analysis practices. I’m often the person who tries to encourage people to switch over to R, and I don’t think I’ll be stopping doing that anytime soon, but I do believe that if we want all who do data analysis to do it reproducibly, efforts must be made to (1) come up with workflows that ensure reproducibility in statistical software other than R, and (2) create tools that make reproducible data analysis easier in such software (e.g. tools similar to R Markdown designed specifically for these software).

 

PS: It’s been a while since I last posted here, let’s blame it on a hectic academic year. I started and never got around to finishing two posts in the past few months that I hope to finish and publish soon. One is about using R Markdown for generating course/TA evaluation reports and the other is on using Slack for managing TAs for a large course. Stay tuned.

PPS: Super excited for #useR2016 starting on Monday. The lack of axe-throwing will be disappointing (those who attended useR 2015 in Denmark know what I’m talking about) but otherwise the schedule promises a great line up!

How do Readers Perceive the Results of a Data Analysis?

As a statistician who often needs to explain methods and results of analyses to non-statisticians, I have been receptive to the influx of literature related to the use of storytelling or a data narrative. (I am also aware of the backlash related to use of the word “storytelling” in regards to scientific analysis, although I am less concerned about this than, say, these scholars.) As a teacher of data analysis, the use of narrative is especially poignant in that it ties the analyses performed intrinsically to the data context—or at the very least, to a logical flow of methods used.

I recently read an article posted on Brain Pickings about the psychology behind great stories. In the article, the author, Maria Popova,  cites Jerome Bruner’s (a pioneer of cognitive psychology) essay “Two Modes of Thought”:

There are two modes of cognitive functioning, two modes of thought, each providing distinctive ways of ordering experience, of constructing reality. The two (though complementary) are irreducible to one another. Efforts to reduce one mode to the other or to ignore one at the expense of the other inevitably fail to capture the rich diversity of thought.

Each of the ways of knowing, moreover, has operating principles of its own and its own criteria of well-formedness. They differ radically in their procedures for verification. A good story and a well-formed argument are different natural kinds. Both can be used as means for convincing another. Yet what they convince of is fundamentally different: arguments convince one of their truth, stories of their lifelikeness. The one verifies by eventual appeal to procedures for establishing formal and empirical proof. The other establishes not truth but verisimilitude.

The essence of his essay is that, as Popova states, “a story (allegedly true or allegedly fictional) is judged for its goodness as a story by criteria that are of a different kind from those used to judge a logical argument as adequate or correct.” [highlighting is mine]

What type of implications does this have on a data narrative, where, in principle, both criteria are being judged? Does one outweigh the other in a reader’s judgment? How does this affect reviewers when they are making decisions about publication?

My sense is that psychologically, the judgment of two differing sets of criteria will lead most humans to judge one as being more salient than the other. Presumably, most scientists would want the logical argument (or as Brunner calls it, the logico-scientific argument) to prevail in this case. However, I think it is the story that most readers, even those with scientific backgrounds, will tend to remember.

As for reviewers, again, the presumption is that they will evaluate a paper’s merit on its scientific evidence. But, as any reviewer can tell you, the writing and narrative presenting that evidence is in some ways as important for that evidence to be believable. This is why great courtroom attorneys  spend just as much time on developing the story around a case as marshaling a logical argument they will use to entice a jury.

There is little guidance for statisticians, especially nascent statisticians, about what the “right” degree of paradigmatic and logico-scientific argument should be when writing up data analysis. In fact, many of us (including myself) do not often consider the impact of a reader’s weighing of these different types of evidence. My training in graduate school was more focused on the latter type of writing, and it was only in undergraduate writing courses and through reading other authors’ thoughts about writing that the former is even relevant to me. Ultimately there may be not much to do about how reader’s perceive our work. Perhaps it is as Sylvia Plath wrote about poetry, once a poem is made available to the public, the right of interpretation belongs to the reader.”

Tools for Managing Your Inbox

First of all, happy new year to all of our readers.

As my first contribution in 2016, I thought I would share a couple tools that have helped tame my email inbox with you. In my continued resolution to finally achieve Inbox Zero, I have made a major dent in the last month. This is thanks to two tools: Unroll Me and Google Mail’s “Save & Archive” button.

Unroll Me

The first tool I would like to share with you is an app called Unroll Me. This app makes unsubscribing to email services or adding them to a once-a-day-digest super easy…no more clicking “unsubscribe” in each individual email. After you sign up, Unroll Me examines your inbox for different email subscriptions. You then have three choices for each subscription you find:

  1. Unsubscribe
  2. Add to Rollup
  3. Keep in Inbox

The first and third are self-explanatory. The second option adds all selected subscriptions to a digest-like email that comes once-per-day. Here is an example of my rollup:

unrollmeUnroll Me keeps a history of your rollups for easy reference, and adds new subscriptions that it finds for you to manage. You can also change the preference for any of your subscriptions at any time. Also, since Unroll Me keeps a list of emails that you have unsubscribed from, it is easy to re-subscribe at any point.

This tool has changed my inbox. Digest-like emails are great for reducing inbox clutter (it is the email equivalent of putting household odds-and-ends into a nice wicker basket). Unfortunately, many places that should have an option for this, simply do not. For example, the University of Minnesota (my place of employment) has somehow auto subscribed me to a million email lists. In general, I am not interested in about 90% of what they send, and another 9% are related to things I don’t need to see immediately. Unroll Me has allowed me to keep the email subscription I want to see immediately in my inbox, put those that are less important in a rollup, and eliminates those emails I could care less about completely from my sight.

Google Mail’s “Send & Archive” Button

This is a Google Mail option I learned about on Lifehacker. Why it is not a default button, I do not know. Go to Google Mail’s Settings, and under the General Setting, click the option button labelled “Show ‘Send & Archive’ button in reply”.

senandarchiveThis will add a button to any email you reply to that allows you to send the email and archive the email message you just replied to, essentially combining two steps in one.

emailreplyYou also still have the send button if you want to keep the original email in your inbox.

I hope these tools might also help you. If you have further suggestions, put them in the comments to share.

 

Teaching computation as an argument for simulation based inference

Check out my guest post on the Simulation-based statistical inference blog:

Teaching computation as an argument for simulation-based inference

If you are interested in teaching simulation-based methods, or if you just want to find out more why others are, I highly recommend the posts on this blog. The page also hosts many other useful resources as well as information on upcoming workshops as well.

A two-hour introduction to data analysis in R

A few weeks ago I gave a two-hour Introduction to R workshop for the Master of Engineering Management students at Duke. The session was organized by the student-led Career Development and Alumni Relations committee within this program. The slides for the workshop can be found here and the source code is available on GitHub.

Why might this be of interest to you?

  • The materials can give you a sense of what’s feasible to teach in two hours to an audience that is not scared of programming but is new to R.
  • The workshop introduces the ggplot2 and dplyr packages without the diamonds or nycflights13 datasets. I have nothing against the these datasets, in fact, I think they’re great for introducing these packages, but frankly I’m a bit tired of them. So I was looking for something different when preparing this workshop and decided to use the North Carolina Bicycle Crash Data from Durham OpenData. This choice had some pros and some cons:
    • Pro – open data: Most people new to data analysis are unaware of open data resources. I think it’s useful to showcase such data sources whenever possible.
    • Pro – medium data: The dataset has 5716 observations and 54 variables. It’s not large enough to slow things down (which can especially be an issue for visualizing much larger data) but it’s large enough that manual wrangling of the data would be too much trouble.
    • Con: The visualizations do not really reveal very useful insights into the data. While this is not absolutely necessary for teaching syntax, it would have been a welcome cherry on top…
  • The raw dataset has a feature I love — it’s been damaged due (most likely) to being opened in Excel! One of the variables in the dataset is age group of the biker (BikeAge_gr). Here is the age distribution of bikers as they appear in the original data:
 
##    BikeAge_Gr crash_count
##    (chr)      (int)
## 1  0-5        60
## 2  10-Jun     421
## 3  15-Nov     747
## 4  16-19      605
## 5  20-24      680
## 6  25-29      430
## 7  30-39      658
## 8  40-49      920
## 9  50-59      739
## 10 60-69      274
## 11 70         12
## 12 70+        58

Obviously the age groups 10-Jun and 15-Nov don’t make sense. This is a great opportunity to highlight the importance of exploring the data before modeling or doing something more advanced with it. It is also an opportunity to demonstrate how merely opening a file in Excel can result in unexpected issues. These age groups should instead be 6-10 (not June 10th) and 11-15 (not November 15th). Making these corrections also provides an opportunity to talk about text processing in R.

I should admit that I don’t have evidence of Excel causing this issue. However this is my best guess since “helping” the user by formatting date fields is standard Excel behaviour. There may be other software out there that also do this that I’m unaware of…

If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr feel free to use materials from this workshop.

Fruit Plot: Plotting Using Multiple PNGs

In one of our previous posts (Halloween: An Excuse for Plotting with Icons), we gave a quick tutorial on how to plot using icons using ggplot. A reader, Dr. D. K. Samuel asked in a comment how to use multiple icons. His comment read,

…can you make a blog post on using multiple icons for such data
year, crop,yield
1995,Tomato,250
1995,Apple,300
1995,Orange,500
2000, Tomato,600
2000,Apple, 800
2000,Orange,900
it will be nice to use icons for each data point. It will also be nice if the (icon) data could be colored by year.

This blog post will address this request. First, the result…

fruit-plot

The process I used to create this plot is as follows:

  1. Find the icons that you want to use in place of the points on your scatterplot (or dot plot).

I used an apple icon (created by Creative Stall), an orange icon (created by Gui Zamarioli), and a tomato icon (created by Andrey Vasiliev); all obtained from The Noun Project.

  1. Color the icons.

After downloading the icons, I used Gimp, a free image manipulation program, to color each of the icons. I created a green version, and a blue version of each icon. (The request asked for the two different years to have different colors.) I also cropped the icons.

Given that there were only three icons, doing this manually was not much of a time burden (10 minutes after I selected the color palette—using colorbrewer.org). Could this be done programatically? I am not sure. A person, who is not me, might be able to write some commands to do this with ImageMagick or some other program. You might also be able to do this in R, but I sure don’t know how…I imagine it involves re-writing the values for the pixels you want to change the color of, but how you determine which of those you want is beyond me.

If you are interested in only changing the color of the icon outline, an alternative would be to download the SVGs rather than the PNGs. Opening the SVG file in a text editor gives the underlying syntax for the SVG. For example, the apple icon looks like this:

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
  <g>
    <path d="M19.749,48c-1.662... />
    <path d="M24.001,14.866c-0.048, ... />
    <path d="M29.512, ... />
  </g>
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>
</svg>

The three path commands draw the actual apple. The first draws the apple, the second path command draws the leaf on top of the apple, and the third draws the stem. Adding the text, fill=”blue” to the end of each path command will change the color of the path from black to blue (see below).

<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
  <g>
    <path d="M19.749,48c-1.662 ... fill="blue" />
    <path d="M24.001,14.866c-0.048, ... fill="blue" />
    <path d="M29.512, ... fill="blue" />
  </g>
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>
</svg>

This could easily be programmatically changed. Then the SVG images could also programmatically be exported to PNGs.

  1. Read in the icons (which are PNG files).

Here we use the readPNG() function from the png library to bring the icon into R.

library(png)
blue_apple = readPNG("~/Desktop/fruit-plot/blue_apple.png", TRUE)
green_apple = readPNG("~/Desktop/fruit-plot/green_apple.png", TRUE)
blue_orange = readPNG("~/Desktop/fruit-plot/blue_orange.png", TRUE)
green_orange = readPNG("~/Desktop/fruit-plot/green_orange.png", TRUE)
blue_tomato = readPNG("~/Desktop/fruit-plot/blue_tomato.png", TRUE)
green_tomato = readPNG("~/Desktop/fruit-plot/green_tomato.png", TRUE)
  1. Create the data.

Use the data.frame() function to create the data.

plotData = data.frame(
&nbsp; year = c(1995, 1995, 1995, 2000, 2000, 2000),
&nbsp; crop = c("tomato", "apple", "orange", "tomato", "apple", "orange"),
&nbsp; yield = c(250, 300, 500, 600, 800, 900)
)

plotData
  year   crop yield
1 1995 tomato   250
2 1995  apple   300
3 1995 orange   500
4 2000 tomato   600
5 2000  apple   800
6 2000 orange   900

Next we will add a column to our data frame that maps the year to color. This uses the ifelse() function. In this example, if the logical statement plotData$year == 1995 evaluates as TRUE, then the value will be “blue”. If it evaluates as FALSE, then the value will be “green”.

plotData$color = ifelse(plotData$year == 1995, "blue", "green")

plotData
  year   crop yield color
1 1995 tomato   250  blue
2 1995  apple   300  blue
3 1995 orange   500  blue
4 2000 tomato   600 green
5 2000  apple   800 green
6 2000 orange   900 green

Now we will use this new “color” column in conjunction with the “crop” column to identify the icon that will be plotted for each row. the paste0() function concatenates each argument together with no spaces between them. Here we are concatenating the color value, an underscore, and the crop value.

plotData$icon = paste0(plotData$color, "_", plotData$crop)

plotData
  year   crop yield color         icon
1 1995 tomato   250  blue  blue_tomato
2 1995  apple   300  blue   blue_apple
3 1995 orange   500  blue  blue_orange
4 2000 tomato   600 green green_tomato
5 2000  apple   800 green  green_apple
6 2000 orange   900 green green_orange
  1. Use ggplot to create a scatterplot of the data, making the size of the points 0.
library(ggplot2)

p = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  theme_bw() +
  xlab("Year") +
  ylab("Yield")
  1. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

Similar to the previous post, we add new layers (in our case each layer will be an additional point) by recursively adding the layer and then writing this into p. The key is that the image name is now in the “icon” column of the data frame. The values in the “icon” column are character data. To make R treat these as objects we first parse the character data using the parse() function, and then we use eval() to have R evaluate the parsed expression. A description of this appears in this Stack Overflow question.

library(grid)

for(i in 1:nrow(plotData)){
  p = p + annotation_custom(
    rasterGrob(eval(parse(text = plotData$icon[i]))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20, 
    ymin = plotData$yield[i] - 20, ymax = plotData$yield[i] + 20
  )
} 

# Show plot
print(p)
  1. Some issues to consider and my alternative plot.

I think that plot is what was requested, but since I cannot help myself, I would propose a few changes that I think would make this plot better. First, I would add lines to connect each fruit (apple in 1995 to apple in 2000). This would help the reader to better track the change in yield over time.

Secondly, I would actually leave the fruit color constant across years and vary the color between fruits (probably coloring them according to their real-world colors). This again helps the reader in that they can more easily identify the fruits and also helps them track the change in yield. (It also avoids a Stroop-like effect of coloring an orange some other color than orange!)

Here is the code:

# Read in PNG files
apple = readPNG("~/Desktop/fruit-plot/red_apple.png", TRUE)
orange = readPNG("~/Desktop/fruit-plot/orange_orange.png", TRUE)
tomato = readPNG("~/Desktop/fruit-plot/red_tomato.png", TRUE)

# Plot
p2 = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  geom_line(aes(group = crop), lty = "dashed") +
  theme_bw()  +
  xlab("Year") +
  ylab("Yield") +
  annotate("text", x = 1997, y = 350, label = "Tomato created by Andrey Vasiliev from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 330, label = "Apple created by Creative Stall from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 310, label = "Orange created by Gui Zamarioli from the Noun Project", size = 2, hjust = 0)

for(i in 1:nrow(plotData)){
  p2 = p2 + annotation_custom(
    rasterGrob(eval(parse(text = as.character(plotData$crop[i])))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20, 
    ymin = plotData$yield[i] -20, ymax = plotData$yield[i]+20
  )
}

# Show plot
print(p2)

And the result…

fruit-plot2

Halloween: An Excuse for Plotting with Icons

In my course on the GLM, we are discussing residual plots this week. Given that it is also Halloween this Saturday, it seems like a perfect time to code up a residual plot made of ghosts.

Ghost plotThe process I used to create this plot is as follows:

  1. Find an icon that you want to use in place of the points on your scatterplot (or dot plot).

I used a ghost icon (created by Andrea Mazzini) obtained from The Noun Project. After downloading the icon, I used Preview to create a new PNG file that had cut out the citation text in the downloaded image. I will add the citation text at a later stage in the plot itself. This new icon was 450×450 pixels.

  1. Use ggplot to create a scatterplot of a set of data, making the size of the points 0.

Here is the code that will create the data and make the plot that I used.

plotData = data.frame(
  .fitted = c(76.5, 81.3, 75.5, 79.5, 80.1, 78.5, 79.5, 77.5, 81.2, 80.4, 78.1, 79.5, 76.6, 79.4, 75.9, 86.6, 84.2, 83.1, 82.4, 78.4, 81.6, 79.6, 80.4, 82.3, 78.6, 82.1, 76.6, 82.1, 87, 82.2, 82.1, 87.2, 80.5, 84.9, 78.5, 79, 78.5, 81.5, 77.4, 76.8, 79.4, 75.5, 80.2, 80.4, 81.5, 81.5, 80.5, 79.2, 82.2, 83, 78.5, 79.2, 80.6, 78.6, 85.9, 76.5, 77.5, 84.1, 77.6, 81.2, 74.8, 83.4, 80.4, 77.6, 78.6, 83.3, 80.4, 80.5, 80.4, 83.8, 85.1, 82.2, 84.1, 80.2, 75.7, 83, 81.5, 83.1, 78.3, 76.9, 82, 82.3, 85.8, 78.5, 75.9, 80.4, 82.3, 75.7, 73.9, 80.4, 83.2, 85.2, 84.9, 80.4, 85.9, 76.8, 83.3, 80.2, 83.1, 77.6),
  .stdresid = c(0.2, -0.3, 0.5, 1.4, 0.3, -0.2, 1.2, -1.1, 0.7, -0.1, -0.3, -1.1, -1.5, -0.1, 0, -1, 1, 0.3, -0.5, 0.5, 1.8, 1.6, -0.1, -1.3, -0.2, -0.9, 1.1, -0.2, 1.5, -0.3, -1.2, -0.6, -0.4, -3, 0.5, 0.3, -0.8, 0.8, 0.5, 1.3, 1.8, 0.5, -1.6, -2, -2.1, -0.8, 0.4, -0.9, 0.4, -0.4, 0.6, 0.4, 1.4, -1.4, 1.3, 0.4, -0.8, -0.2, 0.5, 0.7, 0.5, 0.1, 0.1, -0.8, -2.1, 0, 1.9, -0.5, -0.1, -1.4, 0.6, 0.7, -0.3, 1, -0.7, 0.7, -0.2, 0.8, 1.3, -0.7, -0.4, 1.5, 2.1, 1.6, -1, 0.7, -1, 0.9, -0.3, 0.9, -0.3, -0.7, -0.9, -0.2, 1.2, -0.8, -0.9, -1.7, 0.6, -0.5)
  )

library(ggplot2)

p = ggplot(data = plotData, aes(x = .fitted, y = .stdresid)) +
    theme_bw() + 
    geom_hline(yintercept = 0) +
    geom_point(size = 0) +
    theme_bw() +
    xlab("Fitted values") +
    ylab("Standarized Residuals") +
    annotate("text", x = 76, y = -3, label = "Ghost created by Andrea Mazzini from Noun Project")
  1. Read in the icon (which is a PNG file).

Here we use the readPNG() function from the png library to bring the icon into R.

library(png)
ghost = readPNG("/Users/andrewz/Desktop/ghost.png", TRUE)
  1. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

The idea is that since we have saved our plot in the object p, we can add new layers (in our case each layer will be an additional point) by recursively adding the layer and then writing this into p. The pseudo-like code for this is:

for(i in 1:nrow(plotData)){
    p = p + 
      annotation_custom(
        our_image,
        xmin = minimum_x_value_for_the_image, 
        xmax = maximum_x_value_for_the_image, 
        ymin = minimum_y_value_for_the_image, 
        ymax = maximum_y_value_for_the_image
        ) 
    }

In order for the image to be plotted, we first have to make it plot-able by making it a graphical object, or GROB.

The rasterGrob() function (found in the grid,/b> package) renders a bitmap image (raster image) into a graphical object or GROB which can then be displayed at a specified location, orientation, etc. Read more about using Raster images in R here.

The arguments xmin, xmax, ymin, and ymax give the horizontal and vertical locations (in data coordinates) of the raster image. In our residual plot, we want the center of the image to be located at the coordinates (.fitted, .stdresid). In the syntax below, we add a small bit to the maximum values and subtract a small bit from the minimum values to force the icon into a box that will plot the icons a bit smaller than their actual size. (#protip: play around with this value until you get a plot that looks good.)

library(grid)

for(i in 1:nrow(plotData)){
    p = p + annotation_custom(
      rasterGrob(ghost),
      xmin = plotData$.fitted[i]-0.2, xmax = plotData$.fitted[i]+0.2, 
      ymin = plotData$.stdresid[i]-0.2, ymax = plotData$.stdresid[i]+0.2
      ) 
    }

Finally we print the plot to our graphics device using

print(p)

And the result is eerily pleasant!