Policy By the Numbers: Data in the News

Participating in the “hangout” hosted by Jess Hemerly’s Policy By the Numbers blog was fun, but even better was learning about this cool blog.  It’s very exciting to meet people from so many different backgrounds, with so many varied interests, who share an interest in data accessibility.  One feature of PBtN that I think many of our readers will find particularly useful is the weekly roundup of data in the news. Check it out!

Upcoming events: Rob Gould and Chris Franklin on Google+ Hangouts on Air

Update: the event has ended, but it can be watched on YouTube

Policy by the Numbers is airing a discussion of K-12 statistics education on Nov. 28 at 4 p.m. EST via a Google+ Hangout on Air. With the ever-increasing number of students taking AP Statistics each year and the inclusion of statistics in the Common Core State Standards for Mathematics, Franklin and Gould will address the value of statistical literacy, the increasing interest in the field, and the challenges that come with it. Please tune in Nov. 28 at 4 p.m. EST at the Policy By the Numbers Google+ page.  I thank the American Statistical Association for working with Google+ to arrange the event.

Computing Skills, Nunchaku Skills, Bow Skills…

I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (though not necessarily to all of my colleagues) that students need computing skills. First, a little background…

I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take academic jobs, jobs at testing companies (e.g., Pearson, College Board), or consulting gigs.

I have attended conferences and read blogs and papers in which the suggestion that students learn computing skills has been posited. I am convinced of this need at 100.4%, 95% CI = [100.3%, 100.5%].

The more practical issue is which computing skills these students should learn, and how deeply. And how should they learn them (e.g., in a class, on their own, as part of an independent study)?

The latter question is important enough to merit its own post (later), so I will not address it here. Below I begin a list of the computing skills that I believe Quantitative Methods students should learn, and I hope readers will add to it. I intentionally use the word computing rather broadly. I also do not list these in any particular order at this point, other than how they came to mind.

  • At least one programming language (probably R)
    • In my mind, two or three would be better, depending on the student’s content focus (Python, C++, Perl)
  • LaTeX
  • Knitr/Sweave (I used to say Sweave, but Knitr is easier initially)
  • HTML/HTML5
  • CSS
  • KML
    • I think students should also know about PHP and JavaScript. Perhaps they don’t have to be fluent in them, but they are important to know about. For example, to learn D3 (a visualization toolkit) it would behoove a student to learn JavaScript.
  • Markdown/R Markdown. These are, again, easy to learn and could help students transition to learning Knitr. They could also lead to learning and using Slidify.
  • Regular Expressions (see the short sketch after this list)
  • SQL
  • XML
  • JSON
  • XPath
  • BibTeX (or some program to work with references….Mendeley, EndNote, something…)
  • Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (Mplus, LISREL, OpenMx, AMOS, ConQuest, WinSteps, BUGS, etc.)
  • Unix/Linux and Shell Scripting
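
As a taste of what I have in mind for regular expressions, here is a small sketch (my own made-up example; the responses vector is hypothetical) of the kind of data cleaning students inevitably face:

```r
# A small, hypothetical example of cleaning messy survey responses with
# regular expressions in R: strip every non-digit character, then convert
# the result to numbers.
responses <- c("$1,200", "1200 dollars", "about $1,150")
as.numeric(gsub("[^0-9]", "", responses))
# [1] 1200 1200 1150
```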

I think students could learn many of these at a lesser level: the basics, and using them to solve simpler problems. In this way there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about the specific computing skills that are important for their own research.

What have I missed?

Inference for the population by the population — what does that even mean?

In an effort to integrate more hands-on data analysis in my introductory statistics class, I’ve been assigning students a project early on in the class where they answer a research question of interest to them using a hypothesis test and/or confidence interval. One goal of this project is getting the students to decide which methods to use in which situations, and how to properly apply them. But there’s more to it: students define their own research question and find an appropriate dataset with which to answer it. The analysis and findings are then presented in a cohesive research paper.

Settling on a research question that can be answered using limited methods (one- or two-sample tests of means or proportions, ANOVA, or chi-square) is the first half of the battle. Some of the research questions students come up with require methods much more involved than simple hypothesis testing or parameter estimation. These students end up having to dial back and narrow the focus of the research topic to meet the assignment guidelines. I think that this is a useful exercise, as it helps them evaluate what they have and have not learned.

The next step is finding data, and this can be quite time consuming. Some students choose research questions about the student body and collect data via in-person surveys at the student center or Facebook polls. A few students even go so far as to conduct experiments on their friends. A huge majority look for data online, which initially appears to be the path of least resistance. However, finding raw data that is suitable for statistical inference, i.e., data from a random sample, is not a trivial task.

I (purposefully) do not give much guidance on where to look for data. In the past, even casually mentioning one source has resulted in more than half the class using that source, so I find it best to give them free rein during this exploration stage (unless someone is really struggling).

Some students use data from national surveys like the BRFSS or the GSS. The data come from a (reasonably) representative sample, and are a perfect candidate for applying statistical inference methods. One problem with such data is that they rarely come in plain text format (instead they come as SAS or SPSS files), and importing them into R can be a challenge for novice R users, even with step-by-step instructions.
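
For what it’s worth, here is a minimal sketch of one route through this, using the foreign package that ships with R; the file name below is a placeholder, not a real file.

```r
# Minimal sketch: importing an SPSS file into R with the 'foreign' package.
# The file name is hypothetical; substitute whatever file was downloaded.
library(foreign)
brfss <- read.spss("brfss_subset.sav", to.data.frame = TRUE)
str(brfss)  # check that variables and value labels came through
```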

On the other hand, many students stumble upon resources like the World Bank Database, the OECD, the US Census, etc., where data are presented in much more user-friendly formats. The drawback is that these are essentially population data, e.g., country indicators like the Human Development Index for all countries, and there is really no need for hypothesis testing or parameter estimation when the parameter is already known. To complicate matters further, some of the tables presented are not really “raw data” but instead summary tables, e.g., median household income for all states calculated based on random sample data from each state.

One obvious way to avoid this problem is to make the assignment stricter by requiring that the chosen data come from a (reasonably) random sample. However, this stricter rule would give students much less freedom in the research questions they can investigate, and the projects tend to be much more engaging and informative when students write about something they genuinely care about.

Limiting data sources also has the effect of increasing the time spent finding data, and hence decreasing the time students spend actually analyzing the data and writing up results. Providing a list of resources for curated datasets (e.g., DASL) would certainly diminish time spent looking for data, but I would argue that the learning that happens during the data exploration process is just as valuable (if not more valuable) than being able to conduct a hypothesis test.

Another approach (one that I have been taking) is allowing the use of population data but requiring a discussion of why it is actually not necessary to do statistical inference in these circumstances. This approach lets the students pursue their interests, but interpretations of p-values and confidence intervals calculated based on data from the entire population can get quite confusing. In addition, it has the side effect of sending the message “it’s ok if you don’t meet the conditions, just say so, and carry on.” I don’t think this is the message we want students to walk away with from an introductory statistics course. Instead, we should be insisting that they don’t just blindly carry on with the analysis if conditions aren’t met. The “blindly” part is (somewhat) addressed by the required discussion, but the “carry on with the analysis” part is still there.
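
To make the point concrete, here is a toy illustration (with simulated stand-in values, not real HDI data) of why inference adds nothing when the data are the population:

```r
# Toy illustration with simulated stand-in values: if an indicator is
# observed for every country, the parameter is computed, not estimated.
set.seed(42)
hdi <- runif(187, 0.3, 0.95)  # pretend: an index observed for all 187 countries
mean(hdi)                     # this IS the population mean; nothing to estimate
t.test(hdi)$conf.int          # computable, but it answers no inferential question
```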

So is this assignment a disservice to students because it might leave some with the wrong impression? Or is it still a valuable experience regardless of the caveats?

Big Data and Privacy

The L.A. Times today (Monday, November 19) ran an editorial about the benefits and costs of Big Data.  I truly believe that statisticians should teach introductory students (and all students, really) about data privacy. But who feels they have a realistic handle on the nature of these threats and the size of the risk? I know I don’t.  Does anyone teach this in their class?  Let’s hear about it!  In the meantime, you might enjoy reading (or re-reading) a classic on the topic by Latanya Sweeney: k-Anonymity: a model for protecting privacy.
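
For anyone looking for a concrete hook for class, here is a toy sketch (mine, not from Sweeney’s paper) of checking k-anonymity in R, where k is the size of the smallest group of records sharing the same quasi-identifiers:

```r
# Toy data: zip, age, and sex act as quasi-identifiers.
df <- data.frame(
  zip = c("90210", "90210", "90210", "55455", "55455"),
  age = c(34, 34, 34, 61, 61),
  sex = c("F", "F", "F", "M", "M")
)
# k-anonymity: the smallest count over all quasi-identifier combinations.
k <- min(table(interaction(df, drop = TRUE)))
k  # here k = 2, so this table is 2-anonymous
```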

Winner: Statistics

It is a happy day to be a statistician, as bloggers and columnists are bragging about many correctly predicted victories in an age in which traditional survey methodologies have become outdated.  Mark Blumenthal at the Huffington Post reminds us that one role of statistics is to temper personal bias.  He gives a shout-out to several pollsters, but I think Nate Silver at 538 is due special mention.

More than just a pollster, Silver is a great statistics educator.  And readers of 538 have learned several important lessons that all Citizen Statisticians should know.  For instance, that data-based decisions and predictions are based on models, and all models rest on assumptions.  Silver examines and challenges these assumptions, and the success of his blog and the success of his predictions result from his willingness to probe the robustness of the polls, and to examine the resulting scenarios.  I’m not sure other pollsters do this.  (Blumenthal suggests that the Huffington Post has software that automatically merges polls, but does not suggest that they check the assumptions underlying the various polling models.)
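
As a toy illustration of what “merging polls” could look like at its very simplest (this is my sketch, not Silver’s or the Huffington Post’s actual method), one can pool poll proportions by inverse-variance weighting:

```r
# Hypothetical polls: each reports a proportion p from a sample of size n.
p <- c(0.51, 0.48, 0.52)
n <- c(800, 1200, 600)
v <- p * (1 - p) / n        # approximate sampling variance of each poll
w <- (1 / v) / sum(1 / v)   # inverse-variance weights: precise polls count more
sum(w * p)                  # pooled estimate of support
```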

Silver’s willingness to talk about the assumptions behind the models and to explicitly address his methodology is a type of transparency that we should all practice.  We talk much about data transparency–which is important–but transparency in analysis is equally important, and vital to scientific reproducibility.  Really, it is a type of sharing, and isn’t that one of those lessons we learned in kindergarten?

This testing of assumptions is something we should all teach, at all levels.  I fear I don’t do it well enough.  Certainly, I teach the conditions/assumptions underlying statistical models, but I’m not sure I do enough to show students what goes wrong if assumptions fail.  Sometimes I’ll say a few words, but students need the experience of making failed predictions to understand the importance of assumptions. And they need the tools to tune their models to adjust to inappropriate assumptions.
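
One exercise along these lines (a sketch of my own, with made-up numbers) is a short simulation that lets students watch a nominal 95% interval fail when independence is violated:

```r
# Each simulated "sample" shares a cluster effect, so the 20 observations
# are not independent; the t-interval's coverage of the true mean (0)
# collapses well below the nominal 95%.
set.seed(1)
covered <- replicate(5000, {
  cluster <- rnorm(1, sd = 2)      # shared shift: the independence violation
  x <- rnorm(20, mean = cluster)   # 20 correlated draws around true mean 0
  ci <- t.test(x)$conf.int
  ci[1] < 0 && 0 < ci[2]
})
mean(covered)  # far below 0.95
```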

Incidentally, Silver’s new book (which should be flying off the shelves after last night) has a chapter about political pundits.  Pundits are not paid to be correct; they are paid to be provocative.  And last night shows that they’ll keep their paychecks.

Data Sets: A List in Flux

After my Pinterest post, I got a little bit hooked, mostly because I realized that it was a visual way for me to see my bookmarks. This makes it easier for me to find the information I am looking for quickly. One problem is that Pinterest requires an image, so I quickly realized that links for data sets wouldn’t work so well there.

Then I remembered that I have used my personal blog as an organized reminder list (see this post where I remind myself how to reset features on my computer after a disaster), and thought I could do the same here, but with data sets that others could also use. So, inspired by the post Rob linked to some time ago (Finding Data), I thought I would start putting together a more comprehensive list as I have time. This way, when I come across new data sets, I can just add them in.

Below, I have taken the data sets listed by RevoJoe in his post for Inside-R and reorganized them a little. Over time, they could be reorganized again, and again, and again. It would be nice to add a short description for each as well (maybe someday). I have also added some others to the list.

COLLECTIONS

ECONOMICS

EDUCATION

ENTERTAINMENT

FINANCE

GOVERNMENT, WORLD-LEVEL

GOVERNMENT, COUNTRY-LEVEL

GOVERNMENT, CITY-LEVEL

MACHINE LEARNING

SCIENCE

SOCIAL SCIENCES

TIME SERIES

UNIVERSITIES

USING R TO PULL DATA
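
As a starting point for this last category, here is a minimal sketch of pulling a dataset straight from the web into R; the URL is hypothetical and stands in for any direct link to a raw CSV file.

```r
# Minimal sketch: read a CSV straight from a URL (placeholder address).
url <- "http://example.com/some_dataset.csv"
dat <- read.csv(url)
head(dat)  # quick sanity check of the imported data
```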