Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how, while IRB approval is not required for data collected by students for class projects, discussing the ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and educate our students about that process. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved from thinking that the IRB process is scary to thinking that it's an important part of being a stats educator. I like this idea of discussing issues surrounding data ethics and IRBs in the introductory statistics course (in a little more depth than I do now), though I'm not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment from him next year to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by her involvement with the honor council of her university, where she realized that while the council keeps impeccable records of reported cases, it has no information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon's students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example, thinking about how to code "any" academic misconduct based on various questions that ask whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. It's my experience that students love working with data relevant to them (or, even better, about them), as well as data on personal or confidential matters, so this dataset seems to hit both of those notes.

Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct, and an outcome of their own choosing, and to present their findings in an optional (some form of extra credit) research paper. One example that Shannon gave for the latter task was defining a "serious offender": is it a student who commits a single serious offense, or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this, where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment, Shannon mentioned that the administration at her school was concerned that students finding out about the high percentage of academic offenses (the survey showed that about 60% of students had committed a "major" academic offense) might make them think that it's OK, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University), on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They're looking to expand this project to states beyond Virginia, so if you're interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They're especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it's not), projects like these can remind students that data and statistics can be powerful tools for activism.

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct 30) encouraging City Hall to make its data available to the public.  As you know, fellow Citizens, we're all in favor of making data public, particularly if the public has already picked up the bill and if no individual's dignity will be compromised.  For me this editorial comes at a time when I've been feeling particularly down about the quality of public data.  As I've been looking around for data to update my book and for the Mobilize project, I've become convinced that data are getting harder, not easier, to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples.  (We compare the participatory data on gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up to date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheap gas.)  It used to be that you could type in a zip code and have access to a nice data set that showed current prices, names and locations of gas stations, dates of the last reported price, and the username of the person who reported each price.  Now, you can scroll through an unsorted list of cities and states and get the same information for only the 15 cheapest and most expensive stations.

About 2 years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US.  I've looked and looked, but the data.gov AirData site now requires that I enter the name of each city one at a time, and download very raw data for each city separately.  Now, raw data are a good thing, and I'm glad to see them offered. But is it really so difficult to also provide some sensibly aggregated data sets?

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.

A walk in Venice Beach

For various reasons, I decided to walk this weekend from my house to Venice Beach, a distance of about four and a half miles.  The weather was beautiful, and I thought a walk would help clear my mind.  I had recently heard a story on NPR reporting that Thoreau kept data on when certain flowers opened, a record now used to help understand the effects of global warming.  Some of these flowers were as far as 5 miles from Thoreau's home.  Which made me think that if he could walk 5 miles to collect data, so could I.  Inspired also, perhaps, by the UCLA Mobilize project, I decided to take a photo every 5 minutes.  The rule was simple: I would set my phone's timer for 5 minutes. When it rang, no matter where I was, I would snap a picture.

I decided I would take just one picture, so that I would be forced to exercise some editorial decision making. That way, the data would reflect my own state of mind, in some sense.  Later in the walk, I cheated, because it’s easier to take many pictures than to decide on one.  I also sometimes cheated by taking pictures of things when it wasn’t the right time.  Here’s the last picture I decided to take, at the end of my walk (I took a cab home. I am that lazy) on Abbot Kinney.

Brick mural, on Abbot Kinney

This exercise brought up a dilemma I often encounter when touristing: do you take intimate, close-up pictures of interesting features, like the one above, or do you take pictures of the environment, to give people an idea of the surroundings?  The latter is almost always a bad idea, particularly if all you've got is an iPhone 4; it really is difficult to improve on Google Street View.  It is, however, extremely tempting, despite the fact that it leads to pictures like this:

Lincoln Blvd (Pacific Coast Hwy) and Venice Blvd, looking North

But my subject-matter choices were also limited in other ways.  For one, it was fairly hot, as this temperature plot (http://www.friendlyforecast.com/usa/archive) shows.

Temperature plot for the day of the walk

The heat kept me on the shady side of the street, and the sun meant that I usually had to shoot across the street, although there were some exceptions:

(The object on the left is what we once called a "pay phone". It was the only public phone I encountered that day, in fact, which added to the mystery of this storefront, which had a colorful mural but no name or address marker.)

During the walk I stopped at a farmer's market and at a used book sale at the Mar Vista Library (bought an Everyman's Library book about Beethoven and the score to Bach's Cantata #4). I watched toddler-aged girls fight and cry and dance outside a ballet studio, drank a too-expensive cup of coffee at Intelligentsia Coffee (but it was good), and bought my sister, for her birthday, a terrarium at a make-your-own terrarium shop.

Books.

What to do with these data?  One challenge is to see what can be gleaned from the photos.  The only trend that jumped out at me while reviewing them was the fact that I was in line at that coffee shop for a very long time, as this series of photos (taken every 5 minutes, remember) attests:

Closer

waiting for the hand-pour-brewed coffee to actually be poured

So at the risk of overthinking this post, I’ll just come right to the point (finally):  how do we provide tools to make it easier for people to make sense of these data?

Rather than organize my partial answer in a thoughtful way, and thus spend weeks writing it down, let me just make a list.  I will organize the list, however, by sub-category.

Gathering the Data

  • The iPhone, of course, stores date and time stamps, as well as location stamps, whenever I snapped a photo.  And lots of other data, called EXIF data.  I can look at some of these using Preview or iPhoto, but trying to extract the data for my own use is hard.  Does anyone know a way of getting a data file that has the time, date, and GPS coordinates for my pictures?  (And any other photo metadata, for that matter.)  I browsed through a discussion on stackoverflow, and for me the take-home message was "no." I did find a way to view the data: first, load the iPhone photos into iPhoto. Then export to the hard drive, being sure to check the 'include location information' box. Then open a photo with Preview, open the Inspector (Command-I, or choose it from the drop-down menu), and click on the GPS tab.  From there it is a simple matter of typing everything in, photo by photo, into another file.  (A scripted alternative is sketched just after this list.)
  • Weather data are easily found to supplement the story, as the graph above shows.
  • OpenPaths provides free location data, and even stores it for you.  It allows you to export nice CSV files, such as this file.
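
For anyone who would rather not type everything in photo by photo, here is a rough sketch of a scripted route in R. It leans on the exifr package, which wraps the command-line ExifTool (both would need to be installed, and I haven't tested this on an iPhoto export); the folder path is hypothetical.

```r
# Sketch: pull date/time and GPS stamps out of a folder of photos.
# Assumes the exifr package (a wrapper around ExifTool) is installed.
library(exifr)

photos <- list.files("~/Pictures/venice-walk", pattern = "\\.jpg$",
                     full.names = TRUE, ignore.case = TRUE)

# read_exif() returns a data frame with one row per photo
exif <- read_exif(photos,
                  tags = c("DateTimeOriginal", "GPSLatitude", "GPSLongitude"))

write.csv(exif, "venice_walk_photos.csv", row.names = FALSE)
```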

Displaying the Data

  •  Well, you can always paste photos and graphs into a long, rambling narrative.
  • iPhoto is apparently one of the few pieces of software that does have access to your EXIF data, and the "Places" feature will, with some playing around, let you show where you've been. It's tedious, and you can't easily share the results (maybe not at all).  But it does let you click on a location pin and see the picture taken there, which is fun.
  • StatCrunch has a new feature that lets you easily communicate with Google Maps. You provide latitude, longitude, and optionally other data, and it makes a map.  There are some funny formatting requirements: data must be in the form lat lon|color|other_variable.
    Hopefully, StatCrunch will add a feature that lets you easily move from the usual flat-file format to this one. In the meantime, I had to export my OpenPaths data to Excel (I could have used R, but I'm rusty with the string commands), and then re-import it as a new data set. (A sketch of what the R version might look like follows this list.)
  • Venice Walk OpenPaths map on StatCrunch
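
For the record, the R string manipulation is only a couple of lines. Here is a minimal sketch, assuming an OpenPaths export with lat, lon, and date columns (the column and file names are assumptions):

```r
# Sketch: convert a flat lat/lon file into StatCrunch's
# "lat lon|color|other_variable" format.
walk <- read.csv("openpaths_export.csv")   # assumed columns: lat, lon, date

statcrunch <- sprintf("%f %f|blue|%s", walk$lat, walk$lon, walk$date)
writeLines(statcrunch, "venice_walk_statcrunch.txt")
```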

Making Sense of It All

But the true challenge is: how do we make sense of it all?  How do we merge these data in such a way that unexpected patterns, ones that reveal deeper truths, emerge? At the very least, is there a single, comprehensive data display that would allow you to more fully appreciate my experience?  If (and when) I do this walk again, how can I compare the data from the two different walks?

Some other themes: our data should be ours to do with as we please. OpenPaths has it right; iPhone has it wrong wrong wrong.  Another theme: maps are now a natural and familiar way of storing and displaying data.  StatCrunch has taken some steps in the right direction in attempting to provide a smooth pathway between data and map, but more is needed.  Perhaps there's a friendly, flexible, open-source mapping tool out there somewhere that would encourage our data-conscious citizens to share their lives through maps?

If you’re still reading, you can view all of the pictures on Flickr.

Policy By the Numbers: Data in the News

Participating in the “hangout” hosted by Jess Hemerly’s Policy By the Numbers blog was fun, but even better was learning about this cool blog.  It’s very exciting to meet people from so many different backgrounds and with so many varied interests who share an interest in data accessibility.  One feature of PBtN that I think many of our readers will find particularly useful is the weekly roundup of data in the news. Check it out!

Inference for the population by the population — what does that even mean?

In an effort to integrate more hands-on data analysis in my introductory statistics class, I’ve been assigning students a project early on in the class where they answer a research question of interest to them using a hypothesis test and/or confidence interval. One goal of this project is getting the students to decide which methods to use in which situations, and how to properly apply them. But there’s more to it: students define their own research question and find an appropriate dataset to answer that question with. The analysis and findings are then presented in a cohesive research paper.

Settling on a research question that can be answered using limited methods (one- or two-sample mean or proportion tests, ANOVA, or chi-square) is the first half of the battle. Some of the research questions students come up with require methods much more involved than simple hypothesis testing or parameter estimation. These students end up having to dial back and narrow the focus of their research topic to meet the assignment guidelines. I think that this is a useful exercise, as it helps them evaluate what they have and have not learned.

The next step is finding data, and this can be quite time consuming. Some students choose research questions about the student body and collect data via in-person surveys at the student center or Facebook polls. A few students even go so far as to conduct experiments on their friends. A huge majority look for data online, which initially appears to be the path of least resistance. However, finding raw data that are suitable for statistical inference, i.e., data from a random sample, is not a trivial task.

I (purposefully) do not give much guidance on where to look for data. In the past, even casually mentioning one source has resulted in more than half the class using that source, so I find it best to give them free rein during this exploration stage (unless someone is really struggling).

Some students use data from national surveys like the BRFSS or the GSS. The data come from a (reasonably) representative sample, and are a perfect candidate for applying statistical inference methods. One problem with such data is that they rarely come in plain text format (instead, they are distributed as SAS, SPSS, etc. files), and importing such data into R can be a challenge for novice R users, even with step-by-step instructions.
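
For what it’s worth, the foreign package that ships with R handles most of these formats. Here is a minimal sketch; the file name is hypothetical.

```r
# Sketch: read an SPSS-format survey file into an R data frame.
# The foreign package ships with R; the file name is hypothetical.
library(foreign)

gss <- read.spss("GSS2012.sav", to.data.frame = TRUE)

# Stata files work similarly:
# gss <- read.dta("GSS2012.dta")
```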

On the other hand, many students stumble upon resources like the World Bank database, the OECD, the US Census, etc., where data are presented in much more user-friendly formats. The drawback is that these are essentially population data, e.g., country indicators like the human development index for all countries, and there is really no need for hypothesis testing or parameter estimation when the parameter is already known. To complicate matters further, some of the tables presented are not really “raw data” but instead summary tables, e.g., median household income for all states, calculated based on random sample data from each state.

One obvious way to avoid this problem is to make the assignment stricter by requiring that the chosen data come from a (reasonably) random sample. However, this stricter rule would give students much less freedom in the research questions they can investigate, and the projects tend to be much more engaging and informative when students write about something they genuinely care about.

Limiting data sources would also have the effect of increasing the time spent finding data, and hence decreasing the time students spend actually analyzing the data and writing up results. Providing a list of resources for curated datasets (e.g., DASL) would certainly diminish the time spent looking for data, but I would argue that the learning that happens during the data exploration process is just as valuable (if not more so) than being able to conduct a hypothesis test.

Another approach (one that I have been taking) is allowing the use of population data but requiring a discussion of why it is actually not necessary to do statistical inference in these circumstances. This approach lets the students pursue their interests, but interpretations of p-values and confidence intervals calculated from data on the entire population can get quite confusing. In addition, it has the side effect of sending the message “it’s OK if you don’t meet the conditions, just say so, and carry on.” I don’t think this is the message we want students to walk away with from an introductory statistics course. Instead, we should be insisting that they not blindly carry on with the analysis when conditions aren’t met. The “blindly” part is (somewhat) addressed by the required discussion, but the “carry on with the analysis” part is still there.

So is this assignment a disservice to students because it might leave some with the wrong impression? Or is it still a valuable experience regardless of the caveats?

An Apropos Talk for this Blog

Jeffrey Breen just gave a talk entitled “Tapping the Data Deluge with R” to the Boston Predictive Analytics Meetup. He suggests there are two types of data in this world:

  1. Data you have, and
  2. Data you don’t have…yet.

In the talk Jeffrey provided a nice overview of several methods for importing data into R, including:

  • Reading CSV files
  • Reading XLS files
  • Reading data formats from other statistics packages (e.g., SPSS, Stata, etc.)
  • Reading email data
  • Reading online data files
  • Web scraping data
  • Using APIs to access data

He also touches on some of the R packages that are useful for adding supplementary data to enrich an analysis (e.g., zipcode).
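
As a small illustration of that last point, the zipcode package ships a lookup table mapping ZIP codes to cities, states, and coordinates, so enrichment is essentially a one-line merge. A minimal sketch (the survey data frame here is made up):

```r
# Sketch: enrich a data set containing ZIP codes with city/state/coordinates,
# using the zipcode package's built-in lookup table.
library(zipcode)
data(zipcode)   # data frame with zip, city, state, latitude, longitude

# A made-up data set with a 'zip' column
survey <- data.frame(zip = c("90024", "02134"), rating = c(4, 5))

merged <- merge(survey, zipcode, by = "zip")
head(merged)
```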

Yelp Data

Yelp is a website on which people review local businesses. In its own words, Yelp describes itself as an “online urban city guide that helps people find cool places to eat, shop, drink, relax and play, based on the informed opinions of a vibrant and active community of locals in the know.”

Earlier this year, Yelp released a subset of their data for academic use. The dataset includes business data (e.g., address, longitude, latitude), review data (e.g., number of ‘cool’ votes), and user data (e.g., name of user, number of reviews). Yelp provided these data for the following 30 colleges and universities (the 250 businesses nearest each campus):

  • Brown University
  • California Institute of Technology
  • California Polytechnic State University
  • Carnegie Mellon University
  • Columbia University
  • Cornell University
  • Georgia Institute of Technology
  • Harvard University
  • Harvey Mudd College
  • Massachusetts Institute of Technology
  • Princeton University
  • Purdue University
  • Rensselaer Polytechnic Institute
  • Rice University
  • Stanford University
  • University of California – Los Angeles
  • University of California – San Diego
  • University of California – Berkeley
  • University of Illinois – Urbana-Champaign
  • University of Maryland – College Park
  • University of Massachusetts – Amherst
  • University of Michigan – Ann Arbor
  • University of North Carolina – Chapel Hill
  • University of Pennsylvania
  • University of Southern California
  • University of Texas – Austin
  • University of Washington
  • University of Waterloo
  • University of Wisconsin – Madison
  • Virginia Tech

*Yelp claims that you can email them to add other campuses, but when I did that to request the University of Minnesota, they responded that they were not adding other campuses at the time. Perhaps if more requests come in they will update the dataset.

The data are in JSON format, which makes d3 a good candidate for visualization work. R can also read JSON-formatted data using the rjson package. [See the StackOverflow post here.] You need to install Yelp’s Python framework to obtain the data, but instructions are on the webpage. They also provide a GitHub page to inspire you.
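
If you go the R route, the files are one JSON object per line, which makes them easy to read with rjson. A minimal sketch (the file name and the name/stars fields are assumptions based on Yelp’s documentation):

```r
# Sketch: read a file of newline-delimited JSON business records into R.
# The file name is hypothetical; rjson provides fromJSON().
library(rjson)

lines <- readLines("yelp_academic_dataset_business.json")
businesses <- lapply(lines, fromJSON)

# Pull a couple of fields into a data frame for analysis
biz <- data.frame(
  name  = sapply(businesses, function(b) b$name),
  stars = sapply(businesses, function(b) b$stars),
  stringsAsFactors = FALSE
)
```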

What questions can you answer with this dataset? Here are some examples:

  • Which review or reviews have the greatest number of unique words (i.e., words not seen in any other review)? [see here] (A rough sketch of one way to count these in R follows this list.)
  • Can you predict which businesses will close based on reviews?
  • Are there businesses that users tend to rate highly when they rate other businesses highly (i.e., are there associations between businesses that owners should/could take advantage of)?
  • Are there differences in the reviews of “new” and “old” businesses? How do these differences play out over time?
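
Here is that sketch for the unique-words question. It assumes the review texts have already been loaded into a character vector called reviews, and the tokenization is deliberately crude:

```r
# Sketch: for each review, count the words that appear in no other review.
# 'reviews' is assumed to be a character vector of review texts.
tokens <- lapply(tolower(reviews), function(txt) {
  w <- unique(unlist(strsplit(txt, "[^a-z']+")))
  w[nzchar(w)]                       # drop empty strings from the split
})

df <- table(unlist(tokens))          # number of reviews each word appears in
unique_counts <- sapply(tokens, function(w) sum(df[w] == 1))

which.max(unique_counts)             # the review with the most unique words
```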

There are many other questions that I am sure people can come up with, most probably more interesting than these. If anyone has worked with this dataset and produced something, link to it in the comments.