Data and the Freedom of Information Act

In one of the many blogs that I read, there was a suggestion to use Baltimore’s parking citation data to see if some makes/models of cars get citations more than others. Now parking citations are very near and dear to me, since I get at least one (n ≥ 1) parking citation a year parking near the University of Minnesota, which most often also leads to my car being towed, since you only have so many hours to move your car after they ticket it.

After seeing the data from Baltimore I thought, wonderful…I can do some analyses on the Minneapolis parking citation data and empirically examine some hypotheses. For example, are cars parked near the university campus ticketed and/or towed at higher rates than cars in other Minneapolis neighborhoods? Is the number of parking citations in Minneapolis really down from previous years (especially in locations where they have installed “smart” meters)? Are particular types of cars cited more often than others?

One problem. Minneapolis does not make its data as openly accessible as Baltimore does. My initial thought was “no worries.” The Freedom of Information Act and the Minnesota Government Data Practices Act (Minnesota Statutes, Chapter 13) make it clear that all government data is public unless a state or federal law says otherwise. Furthermore, according to the Hennepin County Requests for Data by Members of the Public webpage, these laws “stipulate that Hennepin County must keep all government data in a way that makes it easy for you, as a member of the public, to access.”

My quest to obtain the parking citation data for Hennepin County (the county Minneapolis resides in) began at roughly 7:00 a.m., when I went to the Minneapolis Police Department webpage to find out how to obtain these data. After following the link to “traffic violations” I was redirected to the Hennepin County Traffic Violations Bureau, which is actually part of the Fourth District of the Minnesota Judicial Branch’s website. (This will matter later in the story.) After making sure that I hadn’t been accidentally transported to the internet of 1997 (the page had echoes of the old Yahoo), I emailed the contact person listed under data requests.

At this point it was 7:30 a.m. and I thought, great, I will be answering all my questions with ggplot2 by noon! After some back and forth with the contact person (I had to state more specifically what I wanted, make sure that they understood I wanted the raw data and not summary reports, and confirm that Excel was an OK format for me), I received the following email:

The fee for running the data request is $60.  If you approve, the next step is to prepay by check.  District Court is unable to process cash or credit card payments for data requests.

 

Sixty dollars!!?? Are you kidding me? After my initial shock, I sent a bevy of emails back and forth with the contact person quoting both the Minnesota and Federal laws that they had printed on their website. The response was:

District Court is subject to the Rules of Public Access to the Records of the Judicial Branch http://www.mncourts.gov/?page=511#publicAccess. The Rules do not require raw data to be accessible through the internet. The data you are interested in must be extracted from the database to put in a readable Excel file, thus the prorated fee of $60 (rate is $80/hour).

To which I asked why it would cost me $60 for someone to query a database and output the results into an Excel spreadsheet (I am in the wrong profession!). When I formally withdrew my request at 2:03 p.m., I also politely mentioned that I would be blogging about this adventure. I was asked to call a phone number and talk with them before I blogged about it.

I called and we actually had a pretty good talk. Here is where things get a bit fuzzy and the difference between Hennepin County and the Fourth District Court comes into play. I was told that these two entities have completely separate rules regarding the accessibility of data. (Note to the reader: This is true, but as far as I can tell from the document linked in the second email, those rules would not apply to parking citation data.) Also, because they are not on the same budget, different requests for data cost the public money. (Another note to the reader: This is where I was told that since there were multiple queries to run, one for each of the at least 21 different rules and regulations governing parking in Minneapolis, it would cost a lot of money because they had to go through quite a menu structure for each query.)

I have the name of a contact who may or may not be able to help me obtain these data, and I will keep you posted. But all of this is to point out that “publicly accessible” data varies not only in its accessibility, but also in how “public” it really is. In Hennepin County, you still need to be somewhat affluent to obtain these data.

I applaud the cities that have begun and carried through with open data initiatives. Unfortunately, the United Kingdom is way ahead of the United States when it comes to open data. Here are some examples of cities that have embraced open data.

I am sure there are many more cities that have opened up their data in manners that are truly “open”. If people have good suggestions about how we as a statistics community can be more instrumental in helping city and county governments embrace these initiatives, post in the comments. I would love to hear from you.

 

Planting seeds of reproducibility with knitr and markdown

I attended useR! 2012 this past summer and one of the highlights of the conference was a presentation by Yihui Xie and JJ Allaire on knitr. As an often frustrated user of Sweave, I was very impressed with how they streamlined the process of integrating R with LaTeX and other document types, and I was excited to take advantage of the tools. It also occurred to me that these tools, especially the simpler markdown language, could be useful to the students in my introductory statistics course.

For context, I teach a large introductory statistics class with mostly first and second year social science majors. The course has a weekly lab component where students start using R from day one. The emphasis is on concepts and interpretation, as a way of reinforcing lecture material, but students are also expected to be comfortable enough with R to analyze novel data.

So why should students use knitr? Almost all students in this class have no programming experience and are unfamiliar with command-line interfaces. As such, they tend toward a trial-and-error approach, repeating the same command over and over with minor modifications until it yields a reasonable result. This exploratory phase is important, as it helps them become comfortable with R. However, it comes with frustrating side effects, such as cluttering the console and workspace, which leads to errors that are difficult to debug (e.g., inadvertently overwriting needed variables) and makes it difficult for the students to reproduce their own results.

As of this semester, my students are using knitr and markdown to complete their lab reports. In an effort to make the transition from standard word processors as painless as possible, we provide templates with precise formatting that shows the students where to embed code versus written responses. Over the course of the semester, the amount of instruction decreases as the students become more comfortable with the language and the overall formatting of the lab write-ups.

This is still a work in progress, but after five labs my impressions are very positive. Students are impressed that their plots show up “magically” in the reports, and they enjoy being able to complete their analysis and write-up entirely in RStudio. This eliminates the need to copy and paste plots and output into a word processor, and it makes revisions far less error prone. Another benefit is that this approach forces them to keep their analysis organized, which helps keep the frustration level down.

And the cherry on top: lab reports created using markdown are much easier for me and the teaching assistants to grade, since code, output, and write-up are automatically organized in a logical order and all reports follow the same structure.

There were, of course, some initial issues:

  • Not immediately realizing that it is essential to embed the code inside chunks that open with ```{r} and close with ``` in order for it to be processed (a minimal example appears after this list).
  • Running the code in the console first and then copying and pasting it into the markdown document leaves stray > and + signs, which result in cryptic errors.
  • The resulting document can be quite lengthy with all of the code, R output, plots, and written responses, making it less likely that students will thoroughly review it and catch errors.
  • Certain mistakes in R code (such as an extraneous space in a variable name) prevent the document from compiling, while other errors result in a compiled document that includes the error output. This is perhaps the most frustrating problem, since it makes it difficult for the students to identify the source of the error.
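
For reference, here is a minimal, hypothetical snippet of the kind of markdown students write (the `cars` data set ships with R, so the chunk runs as-is); prose goes outside the chunk and R code goes inside:

````markdown
The answer to the exercise, written in plain sentences, goes outside the chunk.

```{r}
# Code inside the ```{r} ... ``` chunk is run by knitr,
# and its output and plots are inserted into the report.
summary(cars)
plot(cars)
```
````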

With guidance from their peers, the teaching assistants, and me, the students quickly develop the skills necessary to troubleshoot these issues, and after five weeks such errors have all but vanished.

It’s not all sunshine and lollipops, though; there are some missing features that would make knitr/RStudio more user friendly in this context:

  • Write to PDF: The markdown document creates an HTML file when compiled, which is useful for creating webpages, but a PDF output would be much more useful for students turning in printed reports. (Suggestion: Pandoc, ref: stackoverflow.)
  • Smart page breaks: Since the resulting HTML document is not meant to be printed on letter sized pages, plots and R output can be split between two pages when printed, which is quite messy.
  • Word count: A word count feature that only counts words in the content, and not in the code, would be immensely useful for setting and sticking to length requirements. (Suggestion: Marked, ref: this post.)

Tools for resolving some of these issues are out there, but none of them can currently be easily integrated into the RStudio workflow.
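
That said, a rough workaround for the PDF issue is to knit to markdown and then call pandoc from R. This is only a sketch: the file name lab1.Rmd is hypothetical, and it assumes pandoc (and a LaTeX installation, for PDF output) is available on the system.

```r
library(knitr)

# Run the chunks and write plain markdown (lab1.Rmd -> lab1.md)
knit("lab1.Rmd")

# Hand the markdown file to pandoc to typeset a PDF
system("pandoc lab1.md -o lab1.pdf")
```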

All of the above amount to mostly logistical practicalities for using knitr in an introductory course, but there is also a larger pedagogical argument for it: introducing reproducible research in a practical and painless way. Reproducible research is not something that many first- or second-year undergraduate students are aware of; after all, very few of them are actually engaged in research activities at that point in their academic careers. At best, students are usually aware that reproducibility is one of the central tenets of the scientific method, but have given very little thought to what that involves, either as a researcher producing work that others will want to replicate or as someone attempting to reproduce another author’s work. In the context of completing simple lab assignments and projects with knitr, students experience firsthand the benefits and the frustrations of reproducible research, which is hopefully a lesson they’ll take away from the class regardless of how much R or statistics they remember.

PS: If you’re interested in the nuts and bolts, you can review the labs as well as knitr templates here.

Mathapalooza and Citizen Statisticians

This Friday, I (Rob) had the honor of giving the same talk three times in a row at Mathapalooza, held at one of the Austin Community College campuses. The audience was mostly central-Texas-area community college faculty. Giving the same talk three times in a row can be tiring, but the professors were very engaged and involved, so I had fun. The topic was ‘Educating Citizen Statisticians’, and I mentioned the need to do what it takes so that intro stats is the most important class students take in college. Intro stats should be most important because today’s students have access to data and to data analysis tools, and so have access to opportunities as never before. And it should be most important because data privacy issues are so important, and have the potential to do real harm to those who aren’t aware of them.

Some sites mentioned in the talk:

An Apropos Talk for this Blog

Jeffrey Breen just gave a talk entitled “Tapping the Data Deluge with R” to the Boston Predictive Analytics Meetup. He suggests there are two types of data in this world:

  1. Data you have, and
  2. Data you don’t have…yet.

In the talk Jeffrey provided a nice overview of several methods for importing data into R, including:

  • Reading CSV files
  • Reading XLS files
  • Reading data formats from other statistics packages (e.g., SPSS, Stata)
  • Reading email data
  • Reading online data files
  • Web scraping data
  • Using APIs to access data

He also touches on some of the R packages that are useful for adding supplementary data to enrich an analysis (e.g., zipcode).
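
To make a few of these routes concrete, here is a hedged sketch in R. The file names and URL are placeholders, and the gdata, foreign, and zipcode packages are assumed to be installed.

```r
# CSV files, local or read directly from a URL
citations <- read.csv("citations.csv")
online    <- read.csv("http://example.com/citations.csv")

# Excel files via the gdata package
library(gdata)
xls <- read.xls("citations.xls", sheet = 1)

# SPSS files via the foreign package
library(foreign)
survey <- read.spss("survey.sav", to.data.frame = TRUE)

# Supplementary data: the zipcode package maps ZIP codes to cities and coordinates
library(zipcode)
data(zipcode)
head(zipcode)
```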

More on FitBit data

First the good news:

 

Your data belongs to you!

And now the bad: It costs you $50/year for your data to truly belong to you. For a ‘premium’ membership, you can visit your data as often as you choose. If only Andy had posted sooner, I would have saved $50. But, dear readers, in order to explore all avenues, I spent the bucks. And here’s some data (a screenshot, because I don’t want you analyzing *my* data!).

It’s pretty easy and painless. Next I’ll try Andy’s advice, and see if I can save $50 next year.

Yelp Data

Yelp is a website on which people review local businesses. In its own words, Yelp describes itself as an “online urban city guide that helps people find cool places to eat, shop, drink, relax and play, based on the informed opinions of a vibrant and active community of locals in the know.”

Earlier this year, Yelp released a subset of their data for academic use. The dataset includes business data (e.g., address, longitude, latitude), review data (e.g., number of ‘cool’ votes), and user data (e.g., name of user, number of reviews). Yelp provided these data for the following 30 colleges and universities (the 250 businesses nearest each campus):

  • Brown University
  • California Institute of Technology
  • California Polytechnic State University
  • Carnegie Mellon University
  • Columbia University
  • Cornell University
  • Georgia Institute of Technology
  • Harvard University
  • Harvey Mudd College
  • Massachusetts Institute of Technology
  • Princeton University
  • Purdue University
  • Rensselaer Polytechnic Institute
  • Rice University
  • Stanford University
  • University of California – Los Angeles
  • University of California – San Diego
  • University of California – Berkeley
  • University of Illinois – Urbana-Champaign
  • University of Maryland – College Park
  • University of Massachusetts – Amherst
  • University of Michigan – Ann Arbor
  • University of North Carolina – Chapel Hill
  • University of Pennsylvania
  • University of Southern California
  • University of Texas – Austin
  • University of Washington
  • University of Waterloo
  • University of Wisconsin – Madison
  • Virginia Tech

*Yelp claims that you can email them to add other campuses, but when I did so to request the University of Minnesota, they responded that they were not adding other campuses at this time. Perhaps if more requests come in they will update the dataset.

The data is in the JSON format, which makes d3 a good candidate for visualization work. R can also read JSON formatted data using the rjson package. [See the StackOverflow post here.] You need to install Yelp’s Python framework to obtain the data, but instructions are on the webpage. They also provide a GitHub page to inspire you.
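
As a minimal sketch of getting one of the files into R with rjson: the file name below follows Yelp’s naming but should be treated as a placeholder, and I am assuming each line of the file holds a single JSON object and that business records carry name, stars, latitude, and longitude fields.

```r
library(rjson)

# Read the business file; each line is assumed to be one JSON object
lines      <- readLines("yelp_academic_dataset_business.json")
businesses <- lapply(lines, fromJSON)

# Flatten a few (assumed) fields into a data frame for analysis
biz <- data.frame(
  name      = sapply(businesses, function(b) b$name),
  stars     = sapply(businesses, function(b) b$stars),
  latitude  = sapply(businesses, function(b) b$latitude),
  longitude = sapply(businesses, function(b) b$longitude),
  stringsAsFactors = FALSE
)
head(biz)
```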

What questions can you answer with this dataset? Here are some examples:

  • Which review or reviews have the greatest number of unique words (i.e., words not seen in any other review)? [see here]
  • Can you predict which businesses will close based on reviews?
  • Are there businesses that users tend to rate highly when they rate other businesses highly (i.e., are there associations between businesses that owners should/could take advantage of)?
  • Are there differences in the reviews of ‘new’ and ‘old’ businesses? How do these differences play out over time?

I am sure people can come up with many more questions, probably more interesting than these. If anyone has worked with this dataset and produced something, link to it in the comments.