“Mail merge” with RMarkdown

The term “mail merge” might not be familiar to those who have not worked in an office setting, but here is the Wikipedia definition:

Mail merge is a software operation describing the production of multiple (and potentially large numbers of) documents from a single template form and a structured data source. The letter may be sent out to many “recipients” with small changes, such as a change of address or a change in the greeting line.

Source: http://en.wikipedia.org/wiki/Mail_merge

The other day I was working on creating personalized handouts for a workshop. That is, each handout contained some standard text (including some R code) and some fields that were personalized for each participant (login information for our RStudio server). I wanted to do this in RMarkdown so that the R code on the handout could be formatted nicely. Googling “rmarkdown mail merge” didn’t yield much (that’s why I’m posting this), but I finally came across this tutorial which called the process “iterative reporting”.

Turns out this is a pretty straightforward task. Below is a very simple minimum working example. You can obviously make your markdown document a lot more complicated. I’m thinking holiday cards made in R…

All relevant files for this example can also be found here.

Input data: meeting_times.csv

This is a 20 x 2 CSV file; an excerpt is shown below. I got the names from here.

name meeting_time
Peggy Kallas 9:00 AM
Ezra Zanders 9:15 AM
Hope Mogan 9:30 AM
Nathanael Scully 9:45 AM
Mayra Cowley 10:00 AM
Ethelene Oglesbee 10:15 AM

R script: mail_merge_script.R


## Packages
library(knitr)
library(rmarkdown)

## Data
personalized_info <- read.csv(file = "meeting_times.csv")

## Loop
for (i in 1:nrow(personalized_info)) {
  rmarkdown::render(input = "mail_merge_handout.Rmd",
                    output_format = "pdf_document",
                    output_file = paste("handout_", i, ".pdf", sep = ""),
                    output_dir = "handouts/")
}

RMarkdown: mail_merge_handout.Rmd

---
output: pdf_document
---

```{r echo=FALSE}
personalized_info <- read.csv("meeting_times.csv", stringsAsFactors = FALSE)
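# i is set by the for loop in mail_merge_script.R: render() evaluates this
# chunk in the environment it was called from, so i is visible here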
name <- personalized_info$name[i]
time <- personalized_info$meeting_time[i]
```

Dear `r name`,

Your meeting time is `r time`.

See you then!

Save the Rmd file and the R script in the same folder (or specify the path to the Rmd file accordingly in the R script), and then run the R script. This will call the Rmd file within the loop and output 20 PDF files to the handouts directory. Each of these files looks something like this:

[Sample handout PDF]

with the name and meeting time fields differing in each one.

If you prefer HTML or Word output, you can specify this in the output_format argument in the R script.
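For example, the loop below is a minimal sketch of the same mail merge rendered to Word instead of PDF; only the output_format and the file extension change.

## Same loop as above, but rendering Word documents instead of PDFs
for (i in 1:nrow(personalized_info)) {
  rmarkdown::render(input = "mail_merge_handout.Rmd",
                    output_format = "word_document",
                    output_file = paste("handout_", i, ".docx", sep = ""),
                    output_dir = "handouts/")
}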

Quantitatively Thinking

John Oliver said it best: April 15 combines Americans’ two most-hated things: taxes and math. I’ve been thinking about the latter recently after hearing a fascinating talk last weekend about quantitative literacy.

QL is meant to describe our ability to think with, and about, numbers. QL doesn’t include high-level math skills; it usually refers to our ability to understand percentages, proportions, and basic mathematical operations. This is a really important type of literacy, of course, but I fear that the QL movement could benefit from merging QL with SL: Statistical Literacy.

No surprise, that, coming from this blog.  But let me tell you why.  The speaker began by saying that many Americans can’t figure out, given the amount of gas in their tank, how many miles they have to drive before they run out of gas.

This dumbfounded me. If it were literally true, you’d see stalled cars every few blocks in Los Angeles. (Now we see them only every 3 or 4 miles.) But I also thought, wait, do I know how far I can drive before I run out of gas? My gas gauge says I have half a tank left, and I think (but am not certain) that my tank holds 16 gallons. That means I probably have 8 gallons left. I can see I’ve driven about 200 miles since I last filled up because I remembered to hit that little mileage reset button that keeps track of such things. And so I’m averaging 25 mpg. But I’m also planning a trip to San Diego in the next couple of days, and then I’ll be driving on the highway, and so my mileage will improve. And that 25 mpg is just an average, and averages have variability, but I don’t really have a sense of the variability of that mean. And this problem requires that I know my mpg in the future, and, well, of all the things you can predict, the future is the hardest. And so I’m left to conclude that I don’t really know when my car will run out of gas.

Now while I don’t know the exact number of miles I can drive, I can estimate the value.  With a little more data I can measure the uncertainty in this estimate, too, and use that to decide, when the tank gets low, if I should push my luck (or push my car).

And that example, I think, illustrates a problem with the QL movement. The issue is not that Americans don’t know how to calculate how far they can drive before their car runs out of gas, but that they don’t know how to estimate how far they can drive. This is not just mincing words. The actual problem from which the initial startling claim was made was something like this: “Your car gets 25 mpg and you have 8 gallons left in your tank. How far can you drive before you run out of gas?” In real life, the answer is “It depends.” This is a situation that every first-year stats student should recognize contains variability. (For those of you whose car tries to tell you how many miles you have left in your tank, you’ve probably experienced that pleasing event when you begin your trip with, say, 87 miles left in your tank and end your trip 10 miles later with 88 miles left in your tank. And so you know firsthand the variability in this system.) The correct response to this question is to try to estimate the miles you can drive, and to recognize the assumptions you must make to do this estimation. Instead, we are meant to go into “math mode” and recognize this not as a life-skills problem but as a Dreaded Word Problem. One sign that you are dealing with a DWP is that there are implicit assumptions that you’re just supposed to know, and you’re supposed to ignore your own experience and plow ahead so that you can get the “right” answer, as opposed to the true answer. (Which is: “it depends.”)

A better problem would provide us with data. Perhaps we would see the distances travelled on 8 gallons over the last 10 trips. Or perhaps on just 5 gallons, and then we would have to estimate how far we could go, on average, with 8 gallons. And we should be asked to state our assumptions and to consider the consequences if those assumptions are wrong. In short, we should be performing a modeling activity, and not a DWP. Here’s an example: On my last 5 trips, on 10 gallons of gas I drove 252, 184, 300, 355, and 205 miles. I have 10 gallons left, and I must drive 200 miles. Do I need to fill up? Explain.**
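To show the sort of reasoning I’m after, here is a rough sketch in R of one possible approach (an illustration, not an answer key), using only the numbers in the example above:

# Miles driven on 10 gallons of gas over the last 5 trips (from the example above)
miles <- c(252, 184, 300, 355, 205)

mean(miles)  # about 259 miles per 10 gallons, on average
sd(miles)    # about 70 miles, so there is plenty of trip-to-trip variability

# A crude check, under the (questionable) assumptions that these trips are
# typical and roughly normal: how often would 10 gallons fail to cover the
# 200 miles I need to drive?
pnorm(200, mean = mean(miles), sd = sd(miles))  # roughly a 20% chance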

The point is that one reason QL seems to be such a problem is not because we can’t think about numbers, but that the questions that have been used to conclude that we can’t think about numbers are not reflective of real-life problems. Instead, these questions are reflective of the DWP culture. I should emphasize that this is just one reason. I’ve seen first hand that many students wrestle with proportions and basic number-sense. The sort of question that comes up often in intro stats (“I am 5 inches taller than average. One standard deviation is 3 inches. How many standard deviations above average am I?”) is a real stumper for many students, and this is sad because by the time they get to college this sort of thing should be answerable through habit, and not require thinking through for the very first time. (Interestingly, if you change the 5 to a 6 it becomes much easier for some, but not for all.)

And so, while trying to ponder the perplexities of finding your tax bracket, be consoled that a great number of others —who really knows how many others? — are feeling the same QL anxiety as you.  But for a good reason:  tax problems are perhaps the rare examples of  DWPs that actually matter.

**suggestions for improving this problem are welcome!

PD follow-up

Last Saturday the Mobilize project hosted a day-long professional development meeting for about 10 high school math teachers and 10 high school science teachers. As always, it was very impressive how dedicated the teachers were, but I was particularly impressed by their creativity as, again and again, they demonstrated that they were able to take our lessons and add dimensions to them that I, at least, didn’t initially see.

One important component of Mobilize is to teach the teachers statistical reasoning.  This is important because (a) the Mobilize content is mostly involved with using data analysis as a pathway for teaching math and science and (b) the Common Core (math) and the Next Generation (science) standards include much more statistics than previous curricula.  And yet, at least for math teachers, data analysis is not part of their education.

And so I was looking forward to seeing how the teachers performed on the “rank the airlines” Model Eliciting Activity, which was designed by the CATALYST project, led by Joan Garfield at the U of Minnesota. (Unit 2, Lesson 9 from the CATALYST web site.) Model Eliciting Activities (MEAs) are a lesson design which I’m getting really excited about, and trying to integrate into more of my own lessons. Essentially, groups of students are given realistic and complex questions to answer. The key is to provide some means for the student groups to evaluate their own work, so that they can iterate and achieve increasingly improved solutions. MEAs began in the engineering-education world, and have been used increasingly in mathematics at the college, high school, and middle school levels. (A good starting point is “Model-eliciting activities (MEAs) as a bridge between engineering education research and mathematics education research”, Hamilton, Lesh, Lester, Brilleslyper, 2008, Advances in Engineering Education.) I was first introduced to MEAs when I was an evaluator for the CATALYST project, but didn’t really begin to see their potential until Joan Garfield pointed it out to me while I was trying to find ways of enhancing our Mobilize curriculum.

In the MEA we presented to the teachers on Saturday, they were shown data on arrival time delays from 5 airlines. Each airline had 10 randomly sampled flights into Chicago O’Hare from a particular year. The primary purpose of the MEA is to help participants develop informal ways for comparing groups when variability is present. In this case, the variability is present in an obvious way (different flights have different arrival delays) as well as in less obvious ways (the data set is just one possible sample from a very large population, and there is sample-to-sample variability which is invisible; that is, you cannot see it in the data set, but you might still use the data to conjecture about it).

Before the PD I had wondered if the math and science teachers would approach the MEA differently.  Interestingly, during our debrief, one of the math teachers wondered the same thing.  I’m not sure if we saw truly meaningful differences, but here are some things we did see.

Most of the teams immediately hit on the need to merge both the airline accuracy and the airline precision into their ranking, and struggled with how to do so. However, only two teams presented rules that used both. Interestingly, one used precision (variability) as the primary ranking and used accuracy (mean arrival delay) to break ties; another group did the opposite.

At least one team ranked only on precision, but developed a different measure of precision that was more relevant to the problem at hand:  the mean absolute deviations from 0 (rather than deviations from the mean).
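To make the two measures concrete, here is a small sketch with made-up arrival delays (the actual MEA data are not reproduced here):

# Hypothetical arrival delays in minutes (negative = arrived early)
delays <- c(-5, 0, 3, 12, -8, 45, 2, -1, 7, 20)

# The usual measure of spread: average distance from the airline's own mean
mean(abs(delays - mean(delays)))

# The teams' measure: average distance from 0, that is, from "on time"
mean(abs(delays))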

One of the more interesting things that came to my attention, as a designer of curriculum, was that almost every team wrestled with what to do with outliers. This made me realize that we do a lousy job of teaching people what to do with outliers, particularly since outliers are not very rare. (One could argue whether, in fact, any of the observations in this MEA are outliers or not, but in order to engage in that argument you need a more sophisticated understanding of outliers than we develop in our students. I, myself, would not have considered any of the observations to be outliers.) For instance, I heard teams expressing concern that it wasn’t “fair” to penalize an airline that had a fairly good mean arrival time just because of one bad outlier. Other groups wondered if the bad outliers were caused by weather delays and, if so, whether it was fair to include those data at all. I was very pleased that no one proposed an outright elimination of outliers. (At least within my hearing.) But my concern was that they didn’t seem to have constructive ways of thinking about outliers.

The fact that teachers don’t have a way of thinking about outliers is our fault.  I think this MEA did a great job of exposing the participants to a situation in which we really had to think about the effect of outliers in a context where they were not obvious data-entry errors.  But I wonder how we can develop more such experiences, so that teachers and students don’t fall into procedural-based, automated thinking.  (e.g. “If it is more than 1.5 times the IQR away from the median, it is an outlier and should be deleted.”  I have heard/read/seen this far too often.)

Do you have a lesson that engages students in wrestling with outliers? If so, please share!

Model Eliciting Activity: Prologue

I’m very excited/curious about tomorrow: I’m going to lead about 40 math and science teachers in a data-analysis activity, using one of the Model Eliciting Activities from the University of Minnesota Catalysts for Change Project. (One of our bloggers, Andy, was part of this project.) Specifically, we’re giving them the arrival-delay times for five different airlines into Chicago O’Hare (a random sample of 10 flights from each airline) and asking them to come up with rules for ranking the airlines from best to worst.

I’m curious to see what they come up with, particularly whether  the math teachers differ terribly from the science teachers. The math teachers are further along in our weekend professional development program than are the science teachers, and so I’m hoping they’ll identify the key characteristics of a distribution (all together: center, spread, shape; well, shape doesn’t play much of a role here) and use these to formulate their rankings. We’ve worked hard on helping them see distributions as a unit, and not a collection of individual points, and have seen big improvements in the teachers, most of whom have not taught statistics before.

The science teachers, I suspect, will be a little bit more deterministic in their reasoning, and, if true to my naive stereotype of science teachers, will try to find explanations for individual points. Since I haven’t worked as much with the science teachers, I’m curious to see if they’ll see the distribution as a whole, or instead try to do point-by-point comparisons.

When we initially started this project, we had some informal ideas that the science teachers would take more naturally to data analysis than would the math teachers. This hasn’t turned out to be entirely true. Many of the math teachers had taught statistics before, and so had some experience. Those who hadn’t, though, tended to be rather procedurally oriented. For example, they often just automatically dropped outliers from their analysis without any thought at all, just because they thought that that was the rule. (This has been a very hard habit to break.)

The math teachers also had a very rigid view of what was and was not data. The science teachers, on the other hand, had a much more flexible view of data. In a discussion about whether photos from a smart phone were data, a majority of math teachers said no and a majority of science teachers said yes. On the other hand, the science teachers tend to use data to confirm what they already know to be true, rather than use it to discover something. This isn’t such a problem with the math teachers, in part because they don’t have preconceptions of the data and so have nothing to confirm. In fact, we’ve worked hard with the math teachers, and with the science teachers, to help them approach a data set with questions in mind. But it’s been a challenge teaching them to phrase questions for their students in which the answers aren’t pre-determined or obvious, and which are empirically oriented. (For example: We would like them to ask something like “What activities most often led to our throwing recycling into the trash bin?” rather than “Is it wrong to throw trash into the recycling bin?” or “Do people throw trash into the recycling bin?”)

So I’ll report back soon on what happened and how it went.

Yikes…It’s Been Awhile

Apparently our last blog post was in August. Dang. Where did five months go? Blog guilt would be killing me, but I swear it was just yesterday that Mine posted.

I will give a bit of a review of some of the books that I read this semester related to statistics. Most recently, I finished Hands-On Matrix Algebra Using R: Active and Motivated Learning with Applications. This was a fairly readable book for those looking to understand a bit of matrix algebra. The emphasis is definitely on economics, but there are some statistics examples as well. I am not so sure where the “motivated learning” part comes in, but the examples are practical and the writing is pretty coherent.

The two books that I read that I am most excited about are Model Based Inference in the Life Sciences: A Primer on Evidence and The Psychology of Computer Programming. The latter, written in the 1970s, explored psychological aspects of computer programming, especially in industry, with an eye toward increasing productivity. Weinberg (the author) stated that his purpose in the book was to study “computer programming as a human activity.” This was compelling on many levels to me, not the least of which is to better understand how students learn statistics when using software such as R.

Reading this book, along with participating in a student-led computing club in our department, has sparked some interest in reading the literature related to these ideas this spring semester (feel free to join us…maybe we will document our conversations as we go). I am very interested in how instructors choose software to teach with (see concerns raised about using R in Harwell (2014), “Not so fast my friend: The rush to R and the need for rigorous evaluation of data analysis and software in education,” Education Research Quarterly). I have also thought long and hard about not only what influences the choice of software to use in teaching (I do use R), but also about subsequent choices related to that decision (e.g., if R is adopted, which R packages will be introduced to students). All of these choices probably have some impact on student learning and also on students’ future practice (what you learn in graduate school is what you ultimately end up doing).

The Model Based Inference book was a shorter, readable version of Burnham and Anderson’s (2003) Springer volume on multimodel inference and information theory. I was introduced to these ideas when I taught out of Jeff Long’s Longitudinal Data Analysis for the Behavioral Sciences Using R. They remained with me for several years, and after reading Anderson’s book, I am going to teach some of these ideas in our advanced methods course this spring.

Anyway…just some short thoughts to leave you with. Happy Holidays.

Pie Charts. Are they worth the Fight?

Like Rob, I recently got back from ICOTS. What a great conference. Kudos to everyone who worked hard to organize and pull it off. In one of the sessions I was at, Amelia McNamara (@AmeliaMN) gave a nice presentation about how they were using data and computer science in high schools as a part of the Mobilize Project. At one point in the presentation she had a slide that showed a screenshot of the dashboard used in one of their apps. It looked something like this.

[Screenshot of the Mobilize dashboard]

During the Q&A, one of the critiques of the project was that they had displayed the data as a donut plot. “Pie charts (or any kin thereof) = bad” was the message. I don’t really want to fight about whether they are good or bad; the reality is probably in between. (Tufte, the most cited source for the ‘pie charts are bad’ rhetoric, never really said pie charts were bad, only that, given the space they took up, they were perhaps less informative than other graphical choices.) Do people have trouble reading radians? Sure. Is the message in the data obscured because of this? Most of the time, no.

[Bar chart and donut plot of the ad data]

Here are the bar chart (often offered as the better alternative to the pie chart) and the donut plot for the data shown in the Mobilize dashboard screenshot. The message is that most of the advertisements were from posters and billboards. If people are interested in the n’s, that can be easily remedied by including them explicitly on the plot, which neither the bar plot nor the donut plot does currently. (The dashboard displays the actual numbers when you hover over the donut slice.)

It seems we are wasting our breath constantly criticizing people for choosing pie charts. Whether we like it or not, the public has adopted pie charts. (As is pointed out in this blog post, Leland Wilkinson even devotes a whole chapter to pie charts in his Grammar of Graphics book.) Maybe people are reasonably good at pulling out the often-not-so-subtle differences that are generally shown in a pie chart. After all, it isn’t hard to understand (even when using a 3-D exploding pie chart) that the message in this pie chart is that the “big 3” browsers have a strong hold on the market.

The bigger issue to me is that these types of graphs are only reasonable choices when examining simple group differences (the marginals). Isn’t life, and data, more complex than that? Is the distribution of browser type the same for Mac and PC users? For males and females? For different age groups? These are the more interesting questions.

The dashboard addresses this through interactivity between the multiple donut charts. Clicking a slice in the first plot shows the distribution of product types (the second plot) for those ads that fit the selected slice: the conditional distributions.
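That distinction between marginal and conditional distributions is easy to demonstrate in R. In the sketch below, the row totals match the ad counts shown in the dashboard, but the split of each ad type into product categories is made up purely for illustration:

# Hypothetical two-way table: ad type by product category
# (row totals match the dashboard counts; the column split is invented)
ads <- matrix(c(200, 329, 150, 206, 30, 29, 40, 41), nrow = 4, byrow = TRUE,
              dimnames = list(type = c("Poster", "Billboard", "Bus", "Digital"),
                              product = c("Food", "Other")))

margin.table(ads, 1)         # marginal counts by ad type (what the donut shows)
prop.table(ads, margin = 1)  # conditional distribution of product within each ad type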

So my argument is that, rather than referring to a graph choice as good or bad, we instead focus on the underlying question prompting the graph in the first place. Mobilize acknowledges that complexity by addressing the need for conditional distributions. Interactivity and computing make pie charts a reasonable choice to display this.

*If those didn’t persuade you, perhaps you will be swayed by the food argument. Donuts and pies are two of my favorite food groups. Although bars are nice too. For a more tasty version of the donut plot, perhaps somebody should come up with a cronut plot.

**The ggplot2 syntax for the bar and donut plots is provided below. The syntax for the donut plot was adapted from this blog post.

# Input the ad data
ad = data.frame(
	type = c("Poster", "Billboard", "Bus", "Digital"),
	n = c(529, 356, 59, 81)
	)

# Bar plot
library(ggplot2)
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
     geom_bar(stat = "identity", show.legend = FALSE) +
     theme_bw()

# Add addition columns to data, needed for donut plot.
ad$fraction = ad$n / sum(ad$n)
ad$ymax = cumsum(ad$fraction)
ad$ymin = c(0, head(ad$ymax, n = -1))

# Donut plot
ggplot(data = ad, aes(fill = type, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
     geom_rect(colour = "grey30", show.legend = FALSE) +
     coord_polar(theta = "y") +
     xlim(c(0, 4)) +
     theme_bw() +
     theme(panel.grid=element_blank()) +
     theme(axis.text=element_blank()) +
     theme(axis.ticks=element_blank()) +
     geom_text(aes(x = 3.5, y = ((ymin+ymax)/2), label = type)) +
     xlab("") +
     ylab("")
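And if the n’s are wanted on the plot itself, a geom_text() layer is one possible fix for the bar chart (my addition, not part of the original dashboard):

# Bar plot with the counts printed above each bar
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
     geom_bar(stat = "identity", show.legend = FALSE) +
     geom_text(aes(label = n), vjust = -0.5) +
     theme_bw()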


Increasing the Numbers of Females in STEM

I just read a wonderful piece about how Harvey Mudd College increased the proportion of females declaring a major in Computer Science from 10% to 40% since 2006. That is awesome!

One of the things that they attribute this success to is changing the name of their introductory course. They renamed the course from Introduction to programming in Java to Creative Approaches to Problem Solving in Science and Engineering using Python.

Now, clearly, they changed the language they were using (literally) as well, from Java to Python, but it does raise the question, “What’s in a name?” According to Jim Croce and Harvey Mudd, a lot. If you don’t believe that, just ask anyone who has been in a class with the moniker Data Science, or any publisher who has published a book recently entitled [Insert anything here] Using R.

It would be interesting to study the effect of changing a course name. Are there words or phrases that attract more students to the course (e.g., creative, problem solving)?  Are there gender differences? How long does the effect last? Is it a flash-in-the-pan? Or does it continue to attract students after a short time period? (My guess is that the teacher plays a large role in the continued attraction of students to the course.)

Looking at the effects of a name is not new. Stephen Dubner and Steve Levitt of Freakonomics fame have illuminated folks about research about whether a child’s name has an effect on a variety of outcomes such as educational achievement and future income [podcast], and suggest that it isn’t as predictive as some people believe. Perhaps someone could use some of their ideas and methods to examine the effect of course names.

Has anyone tried this with statistics (aside from Data Science)? I know Harvard put in place a course called Real Life Statistics: Your Chance for Happiness (or Misery) which got good numbers of students (and a lot of press). My sense is that this happens much more in liberal arts schools (David Moore’s Concepts and Controversies book springs to mind). What would good course words or phrases for statistics include? Evidence. Uncertainty. Data. Variation. Visualization. Understanding. Although these are words that statisticians use constantly, I have to admit they all sound better than An Introduction to Statistics.


Conditional probabilities and kitties

I was at the vet yesterday, and just like with any doctor’s visit experience, there was a bit of waiting around — time for re-reading all the posters in the room.


And this is what caught my eye on the information sheet about feline heartworm (I’ll spare you the images):

[Photo of the Q&A from the information sheet]

The question asks: “My cat is indoor only. Is it still at risk?”

The way I read it, this question is asking about the risk of an indoor only cat being heartworm positive. To answer this question we would want to know P(heartworm positive | indoor only).

However the answer says: “A recent study found that 27% of heartworm positive cats were identified as exclusively indoor by their owners”, which is P(indoor only | heartworm positive) = 0.27.

Sure, this gives us some information, but it doesn’t actually answer the original question. The original question is asking about the reverse of this conditional probability.
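To actually answer the owner’s question we would need Bayes’ theorem plus two more pieces of information: the overall prevalence of heartworm and the proportion of cats that are indoor only. A quick sketch with made-up numbers (only the 27% comes from the sheet):

# From the information sheet: P(indoor only | heartworm positive)
p_indoor_given_hw <- 0.27

# These two are hypothetical; real estimates would be needed to answer the question
p_hw     <- 0.01   # overall prevalence of feline heartworm (made up)
p_indoor <- 0.60   # proportion of cats that are indoor only (made up)

# Bayes' theorem: P(HW+ | indoor only) = P(indoor only | HW+) * P(HW+) / P(indoor only)
p_indoor_given_hw * p_hw / p_indoor  # about 0.0045, a very different number from 0.27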

When we talk about Bayes’ theorem in my class and work through examples about sensitivity and specificity of medical tests, I always tell my students that doctors are actually pretty bad at these. Looks like I’ll need to add vets to my list too!

The Future of Inference

We had an interesting departmental seminar last week, thanks to our post-doc Joakim Ekstrom, that I thought would be fun to share.  The topic was The Future of Statistics discussed by a panel of three statisticians.  From left to right in the room: Songchun Zhu (UCLA Statistics), Susan Paddock (RAND), and Jan DeLeeuw (UCLA Statistics).  The panel was asked about the future of inference: waxing or waning.

The answers spanned the spectrum from “More” to “Less” and did so, interestingly enough, as one moved left to right in order of seating.  Songchun staked a claim for waxing, in part because  he knows of groups that are hiring statisticians instead of computer scientists because statisticians’ inclination to cast problems in an inferential context makes them more capable of finding conclusions in data, and not simply presenting summaries and visualizations.  Susan felt that it was neither waxing nor waning, and pointed out that she and many of the statisticians she knows spend much of their time doing inference.  Jan said that inference as an activity belongs in the substantive field that raised the problem.  Statisticians should not do inference.  Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.

I think one reason that many stats educators might object to this is that it’s hard to think of how else to fill the curriculum. That might have been an issue when most students took a single introductory course in their early twenties and then never saw statistics again. But now we must think of the long game, and realize that students begin learning statistics early. The Common Core stakes out one learning pathway, but we should be looking ahead, and thinking of future curricula, since the importance of statistics will grow.

If statistics is the science of data, I suggest we spend more time thinking about how to teach students to behave more like scientists. And this means thinking seriously about how we can develop their sense of curiosity. The Common Core introduces the notion of a ‘statistical question’: a question that recognizes variability. To the statisticians reading this, this needs no more explanation. But I’ve found it surprisingly difficult to teach this practice to math teachers teaching statistics. I’m not sure, yet, why this is. Part of the reason might be that in order to answer a statistical question such as “What is the most popular favorite color in this class?” we must ask the non-statistical question “What is your favorite color?” But there’s more to it than that. A good statistical question isn’t as simple as the one I mentioned, and leads to discovery beyond the mere satisfaction of curiosity. I’m reminded of the Census at Schools program that encouraged students to become Data Detectives.

In short, it’s time to think seriously about teaching students why they should want to do data analysis. And if we’re successful, they’ll want to learn how to do inference.

So what role does inference play in your Ideal Statistics Curriculum?

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct 30) encouraging City Hall to make its data available to the public. As you know, fellow Citizens, we’re all in favor of making data public, particularly if the public has already picked up the bill and if no individual’s dignity will be compromised. For me this editorial comes at a time when I’ve been feeling particularly down about the quality of public data. As I’ve been looking around for data to update my book and for the Mobilize project, I’m convinced that data are getting harder, not easier, to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples.  (We compare the participatory data in gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up-to-date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheapest gas. )  Used to be you could type in a zip code and have access to a nice data set that showed current prices, names and locations of gas stations, dates of the last reported price, and the username of the person who reported the price.  Now, you can scroll through an unsorted list of cities and states and get the same information only for the 15 cheapest and most expensive stations.

About 2 years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US. I’ve looked and looked, but the data.gov AirData site now requires that I enter the name of each city one at a time, and download very raw data for each city separately. Now raw data are a good thing, and I’m glad to see them offered. But is it really so difficult to provide some sensibly aggregated data sets as well?

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.