New Issue of JSE

Michelle Everson just announced that the March 2013 issue of the Journal of Statistics Education (JSE) is now available online. You can get to that issue from the homepage of JSE. This month, JSE also introduces some new features:

  • Department on Research in K-12 Statistics Education
  • JSE webinar series (beginning June, 2013)
  • New Facebook group
  • New Twitter account

Visit JSE online and enjoy the new issue!

Big Data Is Not the New Oil

Our colleague and dear friend John Holcomb sent an email to Rob and me in which he asked if we had heard the phrase “Big data is the new oil”. Neither of us had, but according to Jer Thorp, ad executives are uttering this phrase upwards of 100 times a day.

Jer’s article is worth a read. While he points out in the title that big data is not the new oil, he astutely suggests that the oil/data metaphor does work to an extent. After describing data as a human resource (a thesis of his TED talk), Jer makes, and expounds on, three points that resonated with me:

  1. People need to understand and experience data ownership.
  2. We need to have a more open conversation about data and ethics.
  3. We need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely.

I am not sure which “we” he is referring to, but I might argue that society at large needs to have this conversation, and more importantly, the data users/statisticians/executives who make decisions to collect the data need to be having these conversations. Read the article at the Harvard Business Review Blog Network.

Prediction Accuracy of the NCAA Bracket: Results


In Emanuel Derman’s book Models. Behaving. Badly, the author lays out a Modeler’s Hippocratic Oath.

  • I will remember that I didn’t make the world, and it doesn’t satisfy my equations.
  • Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
  • I will never sacrifice reality for elegance without explaining why I have done so.
  • Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
  • I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.

Since I have no desire to instill false comfort with the non-replicable, fuzzy mascot/alphabetical order model that I used to predict the NCAA tournament, I report my results after two days in.

  • Overall, the model has correctly predicted 19 of the 32 games. This is not any better than chance (p = .108, one-sided).
    • Conditionally, the model performed best in the East Regional (6/8, p = .035, one-sided). It was worst in the West and South Regionals (4/8 in both, p-value not reported due to complete stupidity of the model). The performance in the Midwest Regional, like so many things Midwest, was so-so (5/8, p = .145, one-sided).
  • The model has not, as yet, “busted” my bracket.
    • I still have 11 of my predicted Sweet Sixteen teams alive in my bracket.
    • I still have 7 of my predicted Elite Eight teams alive in my bracket.
    • I still have 4 of my predicted Final Four teams alive in my bracket.
  • The president also has 19 out of 32 correct predictions in his bracket. Thirteen of his Sweet Sixteen, 7 of his Elite Eight, and 4 of his Final Four predictions are still alive.
  • According to my prediction about the Minnesota/UCLA game being a good matchup…it pretty much was. Both teams played terribly. It was so close that neither team scored a field goal in the first five minutes of the game.
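The one-sided p-values in the results above can be checked with an exact binomial tail. This is only a sketch under an assumption the post does not state: that each game is treated as a fair coin flip (success probability 0.5). The function name `upper_tail` and the `results` dictionary are made up for illustration.

```python
from math import comb


def upper_tail(successes: int, games: int, p: float = 0.5, strict: bool = False) -> float:
    """Upper-tail probability for X ~ Binomial(games, p).

    Returns P(X >= successes), or P(X > successes) when strict=True.
    """
    start = successes + 1 if strict else successes
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(start, games + 1)
    )


# Correct picks and games played, taken from the results listed above.
results = {"Overall": (19, 32), "East": (6, 8), "Midwest": (5, 8)}

for label, (wins, games) in results.items():
    inclusive = upper_tail(wins, games)
    exclusive = upper_tail(wins, games, strict=True)
    print(f"{label:8s} {wins}/{games}  P(X >= k) = {inclusive:.3f}  P(X > k) = {exclusive:.3f}")
```

Under the fair-coin assumption, the reported values (.108, .035, .145) match the strict tail P(X > k). The inclusive tail P(X >= k), which is the more usual definition of a one-sided p-value, comes out larger (about .189 overall), so the "no better than chance" conclusion holds either way.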

To recover from not only watching this game, but also from the mundanity of this blog post, I offer you comfort in the visualization of Nate Silver’s bracket of predictions.

NCAA Basketball Visualization

It is time for the NCAA Basketball Tournament. Sixty-four teams dream big (er…I mean 68…well actually by now, 64) and schools like Iona and Florida Gulf Coast University (go Eagles!) are hoping that Robert Morris’s astounding victory in the N.I.T. isn’t just a flash in the pan.

My favorite part is filling out the bracket–see it below. (Imagine that…a statistician’s favorite part of the whole thing is making predictions.) Even President Obama filled out a bracket [see it here].

Andy's Bracket

To make my predictions, I use a complicated formula that involves “coolness” factors of team mascots, alphabetical order (but only conditional on particular seedings), waving of hands, and guesswork. But that was because I didn’t have access to my student Rodrigo Zamith’s latest blog post until today.

Rodrigo has put together side-by-side visualizations of many of the pertinent basketball statistics (e.g., points scored, rebounds) using the R package ggplot2. This would have been very helpful in my decisions where the mascot measure failed me and I was left with a toss-up (e.g., Oklahoma vs. San Diego State).

Preview of the March 22 Game between Minnesota and UCLA

Rodrigo has also made the data available from his blog, not only for the 2012-2013 season but also for the previous two seasons. Check it out at Rodrigo’s blog!

Now, all I have to do is hang tight until the 8:57pm (CST) game on March 22. Judging from the comparisons, it will be tight.


This Day in Statistics

I was looking to find an add-on Google Calendar that included important days in the history of statistics. They have one for seemingly everything under the sun, except this. So I created one and made it public in honor of the International Year of Statistics. I will continually add to it as I find time.

Feel free to add it. As always, it is available in the following formats

Want me to add an important birthday? Add the info into the comments section. Want to be an author on the calendar so you can add all 100 statisticians that you know I forgot? Send me an email and I will add you on.

Wall Street Journal

Carl Bialik of the Wall Street Journal has a nice article about the growth of statistics.  The print version differs substantially from the online version in content, though not in message.

Missing from this message is the urgent need to have more teachers, at all levels, trained in statistics.  I’m currently at a meeting of the joint committee on education of the American Statistical Association and the National Council of Teachers of Mathematics.  A recurring theme is that K-12 stats education has arrived, but professional development lags far behind.  Agreed that our future needs statisticians.  But we also need teachers who know statistics, 3rd graders who know statistics, unskilled workers who know statistics, and, well, everyone needs some statistics.  And by “statistics”, we do not mean the ability to plug numbers into memorized formulas.  Instead, we mean “reasoning with data.”

Dear Gmail…

I recently added a free application/service that analyzes my email called Gmail Meter. This service sends me a comprehensive weekly report full of summaries and plots that indicate how I use Gmail.

The first thing I learned is that Wednesdays are for emailing and I seem to respond in a timely manner, on average, to emails sent to me…when I actually respond (I have a 24.58% response rate. Yikes!) Wednesdays I only teach one class (at 4:40pm) this semester, but I have a morning meeting so I am on campus and generally have time to respond to emails that I may not have gotten to.

Summary of my Gmail

The plot of my daily email traffic shows that most email is sent to me during the day (typical work hours), while my email times tend to be prior to classes in the morning and after my evening courses. Also, it is clear I am sending far less than I receive. It appears I am doing my part to lower my email footprint!

I seem to be more prompt on my email responses (for the most part) than others who respond to me. What is interesting is that people who respond to me are primarily either very quick (<4hrs) or take more than a day to get back to me. This fits with the behavior I expect from most academics.

In the emails I send, I tend to be terse. Generally, I try to avoid long emails to people since when long emails are sent to me I tend to get cranky. (I recognize that sometimes it can’t be avoided.) I actually am quite pleased that the mode here is less than 10 words. (Again, yay for my footprint!)

I am not quite as happy to see that the mode for emails sent to me is the category indicating more than 200 words. Some of this is because of the university committees I sit on. For example, the University of Minnesota Senate sends many emails. These emails often are lengthy because of the inclusion of bylaws and articles to the University Constitution that we will be voting on. That being said, I agree with this email charter which begs us all to keep it short.

What kind of media attachments are taking up space in my Gmail box? It seems that most are Microsoft Word documents. Again, given my collaboration with other academics and feedback to students this makes sense to me. Since I have a Mac and most of my colleagues still work on PC, I send many documents as PDF files. My guess is that if this were sent to me a few years ago, the number of attachments would have been even higher. Our research group has slowly worked toward using sites like Dropbox to share documents. (Next stop…some versioning system.)

Now for the plot that made me stop and write this post. Almost 90% of the email I received this week hit the trash can. Also a small percentage is still in my inbox. I am trying to achieve Inbox Zero, but just haven’t made it yet. I am currently down to xxx emails in my inbox. I signed up for the Mailbox app which should help with this goal when I check email on my phone, but like the Tempo app that Rob signed up for, there is a reservation system in place. Unlike Rob, my spot in the Mailbox line is nowhere near the bottom (last I looked 632,889 people in front of me) despite having reserved my place in line several weeks ago.

I also receive information on the week’s top emailers to me (Joan) and the top recipients of my mail (one of my students); top conversation threads, a scatterplot of the number of words per email in a thread versus the rank of the email in the thread (was it the 1st email sent, 2nd, etc.). As one might expect there is a strong, negative relationship here. It also produces a word cloud based on the subjects and bodies of all messages sent or received directly. Lastly, it conditions emails received with attachments on whether they came from inside or outside the organization (University of Minnesota).

It is not clear that you can obtain the raw data, although it is not clear that you can’t either. There are of course ways to obtain the meta-data that Gmail Meter is using by scraping it with a language such as Python (see here). My guess is that you could also do this with R (perhaps using the curl and XML packages). They have several feature requests for making Gmail Meter more customizable, which would make it even cooler.
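For the curious, here is a rough sketch of what pulling that meta-data with Python might look like. It is not what Gmail Meter does, and not the approach from the linked article. It assumes IMAP access is enabled on the account and that Google accepts the credentials you pass (these days that likely means an app password). The function names `fetch_headers` and `messages_by_weekday` are made up for illustration, and only standard library modules are used.

```python
import imaplib
from collections import Counter
from email import message_from_bytes
from email.utils import parsedate_to_datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def fetch_headers(user, password, mailbox="INBOX", limit=200):
    """Return raw Date/From/Subject headers (bytes) for the newest messages.

    Assumes Gmail's IMAP host and that IMAP access is enabled for the account.
    """
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    try:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)   # read-only: nothing is marked as seen
        _, data = conn.search(None, "ALL")
        headers = []
        for msg_id in data[0].split()[-limit:]:
            _, parts = conn.fetch(
                msg_id, "(BODY.PEEK[HEADER.FIELDS (DATE FROM SUBJECT)])")
            headers.append(parts[0][1])
        return headers
    finally:
        conn.logout()


def messages_by_weekday(raw_headers):
    """Count messages per weekday, skipping any without a parseable Date."""
    counts = Counter()
    for raw in raw_headers:
        date = message_from_bytes(raw)["Date"]
        if date is None:
            continue
        try:
            when = parsedate_to_datetime(date)
        except (TypeError, ValueError):
            continue
        counts[WEEKDAYS[when.weekday()]] += 1
    return counts


# Usage (not run here, since it needs real credentials):
#   counts = messages_by_weekday(fetch_headers("me@example.com", "app-password"))
```

Only headers are fetched, so this would support day-of-week or time-of-day summaries like the ones above, but not the word counts, which would require the message bodies.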

Waiting for Tempo

I got pretty excited about a new calendar app, in part because I love these productivity tools and in part because I really hate the calendar that comes with the iPhone.  Tempo, as it is called, seemed nifty because it integrates data on your phone into the calendar so, for instance, you can get directions to your next meeting easily, alert people that you’re going to be late, have documents related to your appointment automatically opened, and other features that will either save lots of time and hassles or themselves become time-sinks and hassles.

But I need not have worried.  When I downloaded the app, I was informed that there is a bit of a queue for activating accounts.  I was currently 32,310th in line, with 2 people behind me.  So how much longer would I wait?

I logged on from time to time to assess my place in line.  The data are here: tempowaitlist.

I’m in Alexandria, VA, in town for a meeting of an ASA/NCTM education committee. Although it’s late here, it still feels early to us PST-ers, and so I thought I’d take a look at the data. I’m almost to the point where half of the line is ahead of me, although mostly this is due to so many people lining up behind me. (17,505 have queued up behind me. Surely a sign that I’ve joined the right app?) This graph shows my progress so far, in terms of how many are ahead of me. You can, if boredom strikes, estimate how many more seconds before I arrive at the front of the line. I’m somewhat encouraged by a hint of a sudden acceleration yesterday, but my excitement is no doubt reading too much into noise.
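If boredom does strike, one simple way to make that estimate is to fit a straight line to the number of people ahead over time and see where it crosses zero. The sketch below assumes the line shrinks at a roughly constant rate, which the hint of acceleration already calls into question. The function name `seconds_until_front` is made up, and the numbers in the example are invented placeholders, not the values in tempowaitlist.

```python
def seconds_until_front(times, ahead):
    """Least-squares line through (time, people ahead); returns the number of
    seconds after the last observation at which the fitted line reaches zero.

    times: observation times in seconds (any common origin)
    ahead: number of people ahead in line at each time
    """
    n = len(times)
    if n < 2 or n != len(ahead):
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_a = sum(ahead) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("observation times must not all be equal")
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(times, ahead))
    slope = sxy / sxx                     # people per second (negative if shrinking)
    intercept = mean_a - slope * mean_t
    if slope >= 0:
        raise ValueError("the line is not getting shorter; no finite estimate")
    return -intercept / slope - times[-1]


# Invented placeholder observations: (seconds since first check, people ahead).
observations = [(0, 1000), (3600, 900), (7200, 800)]
times, ahead = zip(*observations)
print(f"about {seconds_until_front(times, ahead):.0f} more seconds")
```

A straight line is the crudest possible model here. If the queue is released in batches, the most recent slope may say more than a fit over the whole history.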

When I get there, I’ll let you know if it was worth the wait.

Number of People Ahead of me in line, as of Feb 28.