Some Reading for the Winter Break

It has been a long while since I wrote anything for Citizen Statistician, so I thought I would scribe a post about three books that I will be reading over break.







The first book is Cathy O’Neil’s book, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy [link to Amazon]. I am currently in the midst of Chapter 3. I heard about this book on an episode of 538’s podcast, What’s the Point?, on which O’Neil was featured [Who’s Accountable When An Algorithm Makes A Bad Decision?]. The premise of this book has been something that has been on the mind of many people thinking about data science and algorithms in recent years (and probably not-so-recent years); that many algorithms, and thus the predictions stemming from them, are not transparent. This leads to many ethical and, potentially, legal issues when algorithms are then used to make decisions about recidivism, loan applications, college admissions, etc. I think this book could be the basis for a very interesting seminar. Let me know if anyone is working on something like this.

The second book I will be reading is Michael Lewis’ The Undoing Project: A Friendship That Changed Our Minds [link to Amazon]. This book is bout the friendship, collaboration, and, ultimately, disentanglement between the renowned psychologists Daniel Kahnemann and Amos Tversky. I learned about Kahnemann and Tversky’s work early in my graduate career when Joan Garfield taught a doctoral research seminar on the seminal psychological work related to probabilistic thinking and statistics education. We read not only Kahnemann and Tversky, but also Gird Gigerenzer, Ruma Falk, Maya Bar Hillel, Richard Nisbett, Efraim Fischbein, and others. Interestingly, What’s the Point? recently did two episodes on Lewis’ book as well; Michael Lewis’s New Book Examines How We Think About Thinking and Nate Silver Interviews Michael Lewis About His New Book, ‘The Undoing Project’.

The third book is Who’s #1?: The Science of Rating and Ranking [link to Amazon] by Amy Langville and Carl Meyer. I had read their earlier book, Google’s PageRank and Beyond: The Science of Search Engine Rankings, several years ago, and was quite impressed with the readability of the complex matrix algebra they presented. In Who’s #1, the authors present the mathematics underlying several ratings systems including the Massey system, Elo, Colley, Keener, etc. I am actually treating this book like a self-taught class,  working out several of their examples using R, and really trying to understand the ideas. My interest here is related to the work that I am doing with Brandon LeBeau (University of Iowa) and a current graduate student, Kyle Nickodem on estimating the coaching ability for NCAA football coaches using a hierarchical IRT model [see slides from a talk here].

How do Readers Perceive the Results of a Data Analysis?

As a statistician who often needs to explain methods and results of analyses to non-statisticians, I have been receptive to the influx of literature related to the use of storytelling or a data narrative. (I am also aware of the backlash related to use of the word “storytelling” in regards to scientific analysis, although I am less concerned about this than, say, these scholars.) As a teacher of data analysis, the use of narrative is especially poignant in that it ties the analyses performed intrinsically to the data context—or at the very least, to a logical flow of methods used.

I recently read an article posted on Brain Pickings about the psychology behind great stories. In the article, the author, Maria Popova,  cites Jerome Bruner’s (a pioneer of cognitive psychology) essay “Two Modes of Thought”:

There are two modes of cognitive functioning, two modes of thought, each providing distinctive ways of ordering experience, of constructing reality. The two (though complementary) are irreducible to one another. Efforts to reduce one mode to the other or to ignore one at the expense of the other inevitably fail to capture the rich diversity of thought.

Each of the ways of knowing, moreover, has operating principles of its own and its own criteria of well-formedness. They differ radically in their procedures for verification. A good story and a well-formed argument are different natural kinds. Both can be used as means for convincing another. Yet what they convince of is fundamentally different: arguments convince one of their truth, stories of their lifelikeness. The one verifies by eventual appeal to procedures for establishing formal and empirical proof. The other establishes not truth but verisimilitude.

The essence of his essay is that, as Popova states, “a story (allegedly true or allegedly fictional) is judged for its goodness as a story by criteria that are of a different kind from those used to judge a logical argument as adequate or correct.” [highlighting is mine]

What type of implications does this have on a data narrative, where, in principle, both criteria are being judged? Does one outweigh the other in a reader’s judgment? How does this affect reviewers when they are making decisions about publication?

My sense is that psychologically, the judgment of two differing sets of criteria will lead most humans to judge one as being more salient than the other. Presumably, most scientists would want the logical argument (or as Brunner calls it, the logico-scientific argument) to prevail in this case. However, I think it is the story that most readers, even those with scientific backgrounds, will tend to remember.

As for reviewers, again, the presumption is that they will evaluate a paper’s merit on its scientific evidence. But, as any reviewer can tell you, the writing and narrative presenting that evidence is in some ways as important for that evidence to be believable. This is why great courtroom attorneys  spend just as much time on developing the story around a case as marshaling a logical argument they will use to entice a jury.

There is little guidance for statisticians, especially nascent statisticians, about what the “right” degree of paradigmatic and logico-scientific argument should be when writing up data analysis. In fact, many of us (including myself) do not often consider the impact of a reader’s weighing of these different types of evidence. My training in graduate school was more focused on the latter type of writing, and it was only in undergraduate writing courses and through reading other authors’ thoughts about writing that the former is even relevant to me. Ultimately there may be not much to do about how reader’s perceive our work. Perhaps it is as Sylvia Plath wrote about poetry, once a poem is made available to the public, the right of interpretation belongs to the reader.”

Tools for Managing Your Inbox

First of all, happy new year to all of our readers.

As my first contribution in 2016, I thought I would share a couple tools that have helped tame my email inbox with you. In my continued resolution to finally achieve Inbox Zero, I have made a major dent in the last month. This is thanks to two tools: Unroll Me and Google Mail’s “Save & Archive” button.

Unroll Me

The first tool I would like to share with you is an app called Unroll Me. This app makes unsubscribing to email services or adding them to a once-a-day-digest super easy…no more clicking “unsubscribe” in each individual email. After you sign up, Unroll Me examines your inbox for different email subscriptions. You then have three choices for each subscription you find:

  1. Unsubscribe
  2. Add to Rollup
  3. Keep in Inbox

The first and third are self-explanatory. The second option adds all selected subscriptions to a digest-like email that comes once-per-day. Here is an example of my rollup:

unrollmeUnroll Me keeps a history of your rollups for easy reference, and adds new subscriptions that it finds for you to manage. You can also change the preference for any of your subscriptions at any time. Also, since Unroll Me keeps a list of emails that you have unsubscribed from, it is easy to re-subscribe at any point.

This tool has changed my inbox. Digest-like emails are great for reducing inbox clutter (it is the email equivalent of putting household odds-and-ends into a nice wicker basket). Unfortunately, many places that should have an option for this, simply do not. For example, the University of Minnesota (my place of employment) has somehow auto subscribed me to a million email lists. In general, I am not interested in about 90% of what they send, and another 9% are related to things I don’t need to see immediately. Unroll Me has allowed me to keep the email subscription I want to see immediately in my inbox, put those that are less important in a rollup, and eliminates those emails I could care less about completely from my sight.

Google Mail’s “Send & Archive” Button

This is a Google Mail option I learned about on Lifehacker. Why it is not a default button, I do not know. Go to Google Mail’s Settings, and under the General Setting, click the option button labelled “Show ‘Send & Archive’ button in reply”.

senandarchiveThis will add a button to any email you reply to that allows you to send the email and archive the email message you just replied to, essentially combining two steps in one.

emailreplyYou also still have the send button if you want to keep the original email in your inbox.

I hope these tools might also help you. If you have further suggestions, put them in the comments to share.


Fruit Plot: Plotting Using Multiple PNGs

In one of our previous posts (Halloween: An Excuse for Plotting with Icons), we gave a quick tutorial on how to plot using icons using ggplot. A reader, Dr. D. K. Samuel asked in a comment how to use multiple icons. His comment read,

…can you make a blog post on using multiple icons for such data
year, crop,yield
2000, Tomato,600
2000,Apple, 800
it will be nice to use icons for each data point. It will also be nice if the (icon) data could be colored by year.

This blog post will address this request. First, the result…


The process I used to create this plot is as follows:

  1. Find the icons that you want to use in place of the points on your scatterplot (or dot plot).

I used an apple icon (created by Creative Stall), an orange icon (created by Gui Zamarioli), and a tomato icon (created by Andrey Vasiliev); all obtained from The Noun Project.

  1. Color the icons.

After downloading the icons, I used Gimp, a free image manipulation program, to color each of the icons. I created a green version, and a blue version of each icon. (The request asked for the two different years to have different colors.) I also cropped the icons.

Given that there were only three icons, doing this manually was not much of a time burden (10 minutes after I selected the color palette—using Could this be done programatically? I am not sure. A person, who is not me, might be able to write some commands to do this with ImageMagick or some other program. You might also be able to do this in R, but I sure don’t know how…I imagine it involves re-writing the values for the pixels you want to change the color of, but how you determine which of those you want is beyond me.

If you are interested in only changing the color of the icon outline, an alternative would be to download the SVGs rather than the PNGs. Opening the SVG file in a text editor gives the underlying syntax for the SVG. For example, the apple icon looks like this:

<svg xmlns="" xmlns:xlink="" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
    <path d="M19.749,48c-1.662... />
    <path d="M24.001,14.866c-0.048, ... />
    <path d="M29.512, ... />
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>

The three path commands draw the actual apple. The first draws the apple, the second path command draws the leaf on top of the apple, and the third draws the stem. Adding the text, fill=”blue” to the end of each path command will change the color of the path from black to blue (see below).

<svg xmlns="" xmlns:xlink="" version="1.1" x="0px" y="0px" viewBox="0 0 48 60" enable-background="new 0 0 48 48" xml:space="preserve">
    <path d="M19.749,48c-1.662 ... fill="blue" />
    <path d="M24.001,14.866c-0.048, ... fill="blue" />
    <path d="M29.512, ... fill="blue" />
<text x="0" y="63" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">Created by Creative Stall</text><text x="0" y="68" fill="#000000" font-size="5px" font-weight="bold" font-family="'Helvetica Neue', Helvetica, Arial-Unicode, Arial, Sans-serif">from the Noun Project</text>

This could easily be programmatically changed. Then the SVG images could also programmatically be exported to PNGs.

  1. Read in the icons (which are PNG files).

Here we use the readPNG() function from the png library to bring the icon into R.

blue_apple = readPNG("~/Desktop/fruit-plot/blue_apple.png", TRUE)
green_apple = readPNG("~/Desktop/fruit-plot/green_apple.png", TRUE)
blue_orange = readPNG("~/Desktop/fruit-plot/blue_orange.png", TRUE)
green_orange = readPNG("~/Desktop/fruit-plot/green_orange.png", TRUE)
blue_tomato = readPNG("~/Desktop/fruit-plot/blue_tomato.png", TRUE)
green_tomato = readPNG("~/Desktop/fruit-plot/green_tomato.png", TRUE)
  1. Create the data.

Use the data.frame() function to create the data.

plotData = data.frame(
&nbsp; year = c(1995, 1995, 1995, 2000, 2000, 2000),
&nbsp; crop = c("tomato", "apple", "orange", "tomato", "apple", "orange"),
&nbsp; yield = c(250, 300, 500, 600, 800, 900)

  year   crop yield
1 1995 tomato   250
2 1995  apple   300
3 1995 orange   500
4 2000 tomato   600
5 2000  apple   800
6 2000 orange   900

Next we will add a column to our data frame that maps the year to color. This uses the ifelse() function. In this example, if the logical statement plotData$year == 1995 evaluates as TRUE, then the value will be “blue”. If it evaluates as FALSE, then the value will be “green”.

plotData$color = ifelse(plotData$year == 1995, "blue", "green")

  year   crop yield color
1 1995 tomato   250  blue
2 1995  apple   300  blue
3 1995 orange   500  blue
4 2000 tomato   600 green
5 2000  apple   800 green
6 2000 orange   900 green

Now we will use this new “color” column in conjunction with the “crop” column to identify the icon that will be plotted for each row. the paste0() function concatenates each argument together with no spaces between them. Here we are concatenating the color value, an underscore, and the crop value.

plotData$icon = paste0(plotData$color, "_", plotData$crop)

  year   crop yield color         icon
1 1995 tomato   250  blue  blue_tomato
2 1995  apple   300  blue   blue_apple
3 1995 orange   500  blue  blue_orange
4 2000 tomato   600 green green_tomato
5 2000  apple   800 green  green_apple
6 2000 orange   900 green green_orange
  1. Use ggplot to create a scatterplot of the data, making the size of the points 0.

p = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  theme_bw() +
  xlab("Year") +
  1. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

Similar to the previous post, we add new layers (in our case each layer will be an additional point) by recursively adding the layer and then writing this into p. The key is that the image name is now in the “icon” column of the data frame. The values in the “icon” column are character data. To make R treat these as objects we first parse the character data using the parse() function, and then we use eval() to have R evaluate the parsed expression. A description of this appears in this Stack Overflow question.


for(i in 1:nrow(plotData)){
  p = p + annotation_custom(
    rasterGrob(eval(parse(text = plotData$icon[i]))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20, 
    ymin = plotData$yield[i] - 20, ymax = plotData$yield[i] + 20

# Show plot
  1. Some issues to consider and my alternative plot.

I think that plot is what was requested, but since I cannot help myself, I would propose a few changes that I think would make this plot better. First, I would add lines to connect each fruit (apple in 1995 to apple in 2000). This would help the reader to better track the change in yield over time.

Secondly, I would actually leave the fruit color constant across years and vary the color between fruits (probably coloring them according to their real-world colors). This again helps the reader in that they can more easily identify the fruits and also helps them track the change in yield. (It also avoids a Stroop-like effect of coloring an orange some other color than orange!)

Here is the code:

# Read in PNG files
apple = readPNG("~/Desktop/fruit-plot/red_apple.png", TRUE)
orange = readPNG("~/Desktop/fruit-plot/orange_orange.png", TRUE)
tomato = readPNG("~/Desktop/fruit-plot/red_tomato.png", TRUE)

# Plot
p2 = ggplot(data = plotData, aes(x = year, y = yield)) +
  geom_point(size = 0) +
  geom_line(aes(group = crop), lty = "dashed") +
  theme_bw()  +
  xlab("Year") +
  ylab("Yield") +
  annotate("text", x = 1997, y = 350, label = "Tomato created by Andrey Vasiliev from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 330, label = "Apple created by Creative Stall from the Noun Project", size = 2, hjust = 0) +
  annotate("text", x = 1997, y = 310, label = "Orange created by Gui Zamarioli from the Noun Project", size = 2, hjust = 0)

for(i in 1:nrow(plotData)){
  p2 = p2 + annotation_custom(
    rasterGrob(eval(parse(text = as.character(plotData$crop[i])))),
    xmin = plotData$year[i] - 20, xmax = plotData$year[i] + 20, 
    ymin = plotData$yield[i] -20, ymax = plotData$yield[i]+20

# Show plot

And the result…


Halloween: An Excuse for Plotting with Icons

In my course on the GLM, we are discussing residual plots this week. Given that it is also Halloween this Saturday, it seems like a perfect time to code up a residual plot made of ghosts.

Ghost plotThe process I used to create this plot is as follows:

  1. Find an icon that you want to use in place of the points on your scatterplot (or dot plot).

I used a ghost icon (created by Andrea Mazzini) obtained from The Noun Project. After downloading the icon, I used Preview to create a new PNG file that had cut out the citation text in the downloaded image. I will add the citation text at a later stage in the plot itself. This new icon was 450×450 pixels.

  1. Use ggplot to create a scatterplot of a set of data, making the size of the points 0.

Here is the code that will create the data and make the plot that I used.

plotData = data.frame(
  .fitted = c(76.5, 81.3, 75.5, 79.5, 80.1, 78.5, 79.5, 77.5, 81.2, 80.4, 78.1, 79.5, 76.6, 79.4, 75.9, 86.6, 84.2, 83.1, 82.4, 78.4, 81.6, 79.6, 80.4, 82.3, 78.6, 82.1, 76.6, 82.1, 87, 82.2, 82.1, 87.2, 80.5, 84.9, 78.5, 79, 78.5, 81.5, 77.4, 76.8, 79.4, 75.5, 80.2, 80.4, 81.5, 81.5, 80.5, 79.2, 82.2, 83, 78.5, 79.2, 80.6, 78.6, 85.9, 76.5, 77.5, 84.1, 77.6, 81.2, 74.8, 83.4, 80.4, 77.6, 78.6, 83.3, 80.4, 80.5, 80.4, 83.8, 85.1, 82.2, 84.1, 80.2, 75.7, 83, 81.5, 83.1, 78.3, 76.9, 82, 82.3, 85.8, 78.5, 75.9, 80.4, 82.3, 75.7, 73.9, 80.4, 83.2, 85.2, 84.9, 80.4, 85.9, 76.8, 83.3, 80.2, 83.1, 77.6),
  .stdresid = c(0.2, -0.3, 0.5, 1.4, 0.3, -0.2, 1.2, -1.1, 0.7, -0.1, -0.3, -1.1, -1.5, -0.1, 0, -1, 1, 0.3, -0.5, 0.5, 1.8, 1.6, -0.1, -1.3, -0.2, -0.9, 1.1, -0.2, 1.5, -0.3, -1.2, -0.6, -0.4, -3, 0.5, 0.3, -0.8, 0.8, 0.5, 1.3, 1.8, 0.5, -1.6, -2, -2.1, -0.8, 0.4, -0.9, 0.4, -0.4, 0.6, 0.4, 1.4, -1.4, 1.3, 0.4, -0.8, -0.2, 0.5, 0.7, 0.5, 0.1, 0.1, -0.8, -2.1, 0, 1.9, -0.5, -0.1, -1.4, 0.6, 0.7, -0.3, 1, -0.7, 0.7, -0.2, 0.8, 1.3, -0.7, -0.4, 1.5, 2.1, 1.6, -1, 0.7, -1, 0.9, -0.3, 0.9, -0.3, -0.7, -0.9, -0.2, 1.2, -0.8, -0.9, -1.7, 0.6, -0.5)


p = ggplot(data = plotData, aes(x = .fitted, y = .stdresid)) +
    theme_bw() + 
    geom_hline(yintercept = 0) +
    geom_point(size = 0) +
    theme_bw() +
    xlab("Fitted values") +
    ylab("Standarized Residuals") +
    annotate("text", x = 76, y = -3, label = "Ghost created by Andrea Mazzini from Noun Project")
  1. Read in the icon (which is a PNG file).

Here we use the readPNG() function from the png library to bring the icon into R.

ghost = readPNG("/Users/andrewz/Desktop/ghost.png", TRUE)
  1. Use a for() loop to add annotation_custom() layers (one for each point) that contain the image.

The idea is that since we have saved our plot in the object p, we can add new layers (in our case each layer will be an additional point) by recursively adding the layer and then writing this into p. The pseudo-like code for this is:

for(i in 1:nrow(plotData)){
    p = p + 
        xmin = minimum_x_value_for_the_image, 
        xmax = maximum_x_value_for_the_image, 
        ymin = minimum_y_value_for_the_image, 
        ymax = maximum_y_value_for_the_image

In order for the image to be plotted, we first have to make it plot-able by making it a graphical object, or GROB.

The rasterGrob() function (found in the grid,/b> package) renders a bitmap image (raster image) into a graphical object or GROB which can then be displayed at a specified location, orientation, etc. Read more about using Raster images in R here.

The arguments xmin, xmax, ymin, and ymax give the horizontal and vertical locations (in data coordinates) of the raster image. In our residual plot, we want the center of the image to be located at the coordinates (.fitted, .stdresid). In the syntax below, we add a small bit to the maximum values and subtract a small bit from the minimum values to force the icon into a box that will plot the icons a bit smaller than their actual size. (#protip: play around with this value until you get a plot that looks good.)


for(i in 1:nrow(plotData)){
    p = p + annotation_custom(
      xmin = plotData$.fitted[i]-0.2, xmax = plotData$.fitted[i]+0.2, 
      ymin = plotData$.stdresid[i]-0.2, ymax = plotData$.stdresid[i]+0.2

Finally we print the plot to our graphics device using


And the result is eerily pleasant!

Research Hack: Obtaining an RSS Feed for Websites without RSS Feeds

In one of our older posts, I wrote about using feedreaders and aggregators to keep up-to-date on blogs, journals, etc. These work great when the site you want to read have an RSS feed. RSS (Rich Site Summary) is simply a format that retrieves updated content from a webpage. A feedreader (or aggregator) grabs the “feed” and displays the updated content for you to read. No more having to visit the website daily to see if something changed or was updated.

Many sites have an RSS feed and make the process of adding the content to your feedreader fairly painless. In general the process works something like this…look for the RSS icon, click it, and choose the reader you want to send the content to, and click subscribe.

Subscribing to an RSS Feed

Subscribing to an RSS Feed

But what about sites that do not have an RSS feed explicitly in place? It turns out that there are third-party applications that can piece together an RSS feed for these sites as well. There are several of these applications available online. I have used Page2RSS and had pretty good luck.

Page2RSSJust enter the URL of the site you want to keep tabs on into the “Page URL” box and click the “to RSS” button. This will produce another URL that can be copied and pasted into your feedreader or subscribed to directly.

I have successfully used this to add RSS feeds from many sites and journals to my feedreader. Below is a shot from Feedly (my current reader) on my iPad Mini of updated content (as of this morning) on the Journal of Statistics Education website.

JSE on Feedly

PDF and Citation Management

A new academic year looms. This means a new crop of graduate students will begin their academic training. PDF management is a critical tool that all graduate students need to use and the sooner the better. Often these tools go hand-in-hand with a citation management system, which is also critical for graduate students.

Using a citation management software makes scholarly work easier and more effective. First and foremost, these tools allow you to automatically cite references for a paper in a wide range of bibliographic styles. They also allow you to organize, evaluate, annotate, and search within your citation collection and share your references with others. Often they also sync across machines and devices allowing you to access your database wherever you are.

There are several tools available for PDF/citation management, including:

Some of these are citation managers only (BibDesk). Many allow you to also manage your PDF files as well; naming, organizing, and moving your files to a central repository on your computer. Some allow for annotation within the software as well. There are several online comparisons of some of the different systems ( e.g., Penn LibrariesUW Madison Library, etc.) From my experience, students tend to choose either Mendelay or Zotero—my guess is because they are free.

There is a lot to be said for free software, and both Zotero and Mendelay seem pretty solid. However, as a graduate student you should understand that you are investing in your future. This type of tool, I think it is fair to say, you will be using daily. Spending money on a tool that has the features and UI that you will want to use is perfectly ok and should even be encouraged.

Another consideration for students who are beginning the process is to find out what your advisor(s), and research groups use. Although many are cross-compatible, using and learning the tool is easier with a group helping you.

What Do I Use?

I use Papers. It is not free (a student license is ~$50). When I started using Papers, Mendelay and Zotero were not available. I actually have since used both Mendelay and Zotero for a while, but then ultimately made the decision this summer to switch back to Papers. It is faster and more importantly to me, has better search functionality, both across and within a paper.

I would like to use Sente (free for up to 100 references), but the search function is very limited. In my opinion, Sente has the best is sleek and minimalist and reading a paper is a nice experience.

My Recommendation…

Ultimately, use what you are comfortable with and then, actually use it. Take the time to enter ALL the meta-data for PDFs as you accumulate them. Don’t imagine you will have time to do it later…you won’t. Being organized with your references from the start will keep you more productive later.


Willful Ignorance [Book Review]

I just finished reading Willful Ignorance: The Mismeasure of Uncertainty by Herbert Weisberg. I gave this book five stars (out of five) on Goodreads.

According to Weisberg, the text can be

“regarded as two books in one. On one hand it is a history of a big idea: how we have come to think about uncertainty. On the other, it is a prescription for change, especially with regard to how we perform research in the biomedical and social sciences” (p. xi).

Willful ignorance is the idea that to deal with uncertainty, statisticians simplify the situation by filtering out or ignoring much of what we know…we willfully ignore some information in order to quantify the amount of uncertainty.

The book gives a cogent history and evolution of the ideas and history of probability, tackling head-on the questions: what is probability, how did we come to our current understanding of probability, and how did mathematical probability come to represent uncertainty and ambiguity.

Although Weisberg presents a nice historical perspective, the book is equally philosophical. In some ways it is a more leisurely read of the material found in Hacking, and in many ways more compelling.

I learned a great deal from this book. In many places I found myself re-reading sections and spiraling back to previously read sections to read them with some new understanding. I may even try to assign parts of it to the undergraduates I am teaching this summer.

This book would make a wonderful beach read for anyone interested in randomness, or uncertainty, or any academic hipster.

Fitbit Revisited

Many moons ago we wrote about a bit of a kludge to get data from a Fitbit (see here). Now it looks as though there is a much better way. Cory Nissen has written an R package to scrape Fitbit data and posted it on GitHub. He also wrote a blog post on his blog Stats and Things announcing the package and demonstrating its use. While I haven’t tried it yet, it looks pretty straight-forward and much easier than anything else i have seen to date.

Annual Review of Reading

It is that time of year…time to review the previous year; make top 10 lists; and resolve to be a better person in 2015. I will tackle the first, but only of my reading habits. In 2014 I read 46 books for a grand total of 17,480 pages. (Note: I do not count academic books for work in this list, only books I read for recreation.) This is a yearly high, at least since I have been tracking this data on GoodReads (since late 2010). You can read an older annual report of reading here.

Year Books Pages
2011 45 15,332
2012 29 9,203
2013 45 15,887
2014 46 17,480

Since I have accumulated four years worth of data, I thought I might do some comparative analysis of my reading over this time period.

When am I reading?

plot2The trend displayed here was somewhat surprising when I looked at it—at least related to the decline in reading over the summer months. Although, reflecting on it, it maybe should not have been as surprising. There is a slight uptick around the month of May (when spring semester ends) and the decline begins in June/July. Not only do summer classes begin, but I also try to do a few house and garden projects over the summer months. This uptick and decline are still visible when a plot of the number of pages (rather than the number of books) is examined, albeit much smaller (1,700 pages in May and 1,200 pages in the summer months). This might indicate I read longer books in the summer. For example, one of the books I read this last summer was Neal “I don’t know the meaning of the word ‘brevity'” Stephenson’s Reamde, which clocked in at a mere 1,044 pages.

Was I reading books that I ultimately enjoyed?

plot3I also plotted my monthly average rating (on a five-point scale) for the four years of data. This plot shows that 2014 is an anomaly. I apparently read trash in the summer (which is what you are supposed to do). The previous three years I read the most un-noteworthy books in the fall. Or, I just rated them lower because school had started again.

Am I more critical than other readers? Is this consistent throughout the year?

I also looked at how other GoodReads readers had rated those same books. The months represent when I read the book. (I didn’t look at when the book was read by other readers, although that would be interesting to see if time of year has an effect on rating.) The scale on the y-axis is the residual between my rating and the average GoodReads rating. My ratings are generally close to the average, sometimes higher, sometimes lower. There are, however, many books that I rated much lower than average. The loess smooth suggests that July–November is when I am most critical relative to other readers.