Participating in the “hangout” hosted by Jess Hemerly’s Policy By the Numbers blog was fun, but even better was learning about this cool blog. It’s very exciting to meet people from so many different backgrounds and with such varied interests who share an interest in data accessibility. One feature of PBtN that I think many of our readers will find particularly useful is the weekly roundup of data in the news. Check it out!
Update: the event has ended, but can be watched via YouTube
Policy by the Numbers is airing a K-12 statistics education discussion on Nov. 28 at 4 p.m. EST via a Google+ Hangout on Air. With the ever-increasing number of students taking AP Statistics each year and the inclusion of statistics in the Common Core State Standards for Mathematics, Franklin and Gould will address the value of statistical literacy, the increasing interest, and the challenges. Please tune in Nov. 28 at 4 p.m. EST at the Policy By the Numbers Google+ page. Thanks to the American Statistical Association for working with Google to arrange the event.
I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (though not necessarily to all of my colleagues) that students need computing skills. First, a little background…
I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take academic jobs, positions at testing companies (e.g., Pearson, the College Board), or consulting gigs.
I have attended conferences and read blogs and papers in which the suggestion that students learn computing skills has been put forward. I am convinced of this need at 100.4%, 95% CI = [100.3%, 100.5%].
The more practical issue is which computing skills these students should learn, and how deeply. And how should they learn them (e.g., in a class, on their own, as part of an independent study)?
The latter question is important enough to merit its own post (later), so I will not address it here. Below I begin a list of the computing skills that I believe Quantitative Methods students should learn, and I hope readers will add to it. I use the word computing broadly, and intentionally so. I also do not list these in any particular order at this point, other than how they come to mind.
- At least one programming language (probably R)
- In my mind, two or three would be better, depending on the content focus of the student (Python, C++, Perl)
- knitr/Sweave (I used to say Sweave, but knitr is easier initially)
- Markdown/R Markdown. These are, again, easy to learn and could ease the transition to knitr. They could also lead to learning and using Slidify.
- Regular Expressions
- BibTeX (or some program to manage references: Mendeley, EndNote, something)
- Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (e.g., Mplus, LISREL, OpenMx, AMOS, ConQuest, Winsteps, BUGS)
- Unix/Linux and Shell Scripting
I think students could learn many of these at a basic level: enough to use them to solve simpler problems, so that there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about the specific computing skills that are important for their own research.
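One way to provide that exposure is through small, self-contained exercises. For the regular-expressions item, for example, a first exercise might be pulling numbers out of messy text. A minimal sketch in Python (the same ideas carry over to R's grepl(), gsub(), and regmatches()); the data and pattern here are invented for illustration:

```python
import re

# Hypothetical messy survey responses: extract the numeric test scores.
responses = [
    "ID 1042: score=87 (retest)",
    "ID 1043: score=91",
    "no score recorded",
    "ID 1044: score=78",
]

# One capturing group grabs the digits that follow "score="
score_pattern = re.compile(r"score=(\d+)")

# Keep only the responses where the pattern matched
scores = [int(m.group(1)) for r in responses
          if (m := score_pattern.search(r))]
print(scores)  # [87, 91, 78]
```

Even an exercise this small forces students to confront quoting, escaping, and the match-versus-no-match logic that real data cleaning requires.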
What have I missed?
In an effort to integrate more hands-on data analysis in my introductory statistics class, I’ve been assigning a project early in the course in which students answer a research question of interest to them using a hypothesis test and/or confidence interval. One goal of this project is getting the students to decide which methods to use in which situations, and how to properly apply them. But there’s more to it: students define their own research question and find an appropriate dataset with which to answer it. The analysis and findings are then presented in a cohesive research paper.
Settling on a research question that can be answered using limited methods (one- or two-sample tests of means or proportions, ANOVA, or chi-square) is the first half of the battle. Some of the research questions students come up with require methods much more involved than simple hypothesis testing or parameter estimation. These students end up having to dial back and narrow the focus of their research topic to meet the assignment guidelines. I think this is a useful exercise, as it helps them evaluate what they have and have not learned.
The next step is finding data, and this can be quite time consuming. Some students choose research questions about the student body and collect data via in-person surveys at the student center or Facebook polls. A few students even go so far as to conduct experiments on their friends. A huge majority look for data online, which initially appears to be the path of least resistance. However, finding raw data that are suitable for statistical inference, i.e., data from a random sample, is not a trivial task.
I (purposefully) do not give much guidance on where to look for data. In the past, even casually mentioning one source has resulted in more than half the class using that source, so I find it best to give them free rein during this exploration stage (unless someone is really struggling).
Some students use data from national surveys like the BRFSS or the GSS. The data come from a (reasonably) representative sample and are a perfect candidate for applying statistical inference methods. One problem with such data is that they rarely come in plain-text format (instead, they ship as SAS or SPSS files, etc.), and importing them into R can be a challenge for novice R users, even with step-by-step instructions.
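To give a sense of the challenge involved: large surveys like the BRFSS often distribute data as fixed-width text files accompanied by a codebook that maps each variable to column positions. The underlying idea can be sketched in a few lines of Python (in R one would reach for read.fwf() or the foreign package); the codebook and records below are made up and are not the actual BRFSS layout:

```python
# Hypothetical codebook: variable name -> (start, end) columns, 1-indexed,
# as a survey's documentation would typically specify them.
codebook = {
    "age":    (1, 3),
    "sex":    (4, 4),
    "weight": (5, 10),
}

def parse_record(line, codebook):
    """Slice one fixed-width record into a dict of raw string fields."""
    return {var: line[start - 1:end].strip()
            for var, (start, end) in codebook.items()}

# Two made-up fixed-width records
raw_lines = [
    " 47M 182.5",
    " 33F 144.0",
]

records = [parse_record(line, codebook) for line in raw_lines]
print(records[0])  # {'age': '47', 'sex': 'M', 'weight': '182.5'}
```

The mechanics are simple, but a novice must still translate a PDF codebook into column positions, handle missing-value codes, and convert raw strings to numbers, which is exactly where projects stall.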
On the other hand, many students stumble upon resources like the World Bank database, the OECD, the US Census, etc., where data are presented in much more user-friendly formats. The drawback is that these are essentially population data, e.g., country indicators like the human development index for all countries, and there is really no need for hypothesis testing or parameter estimation when the parameter is already known. To complicate matters further, some of the tables presented are not really “raw data” but instead summary tables, e.g., median household income for all states, calculated from random sample data within each state.
One obvious way to avoid this problem is to make the assignment stricter by requiring that chosen data must come from a (reasonably) random sample. However, this stricter rule would give students much less freedom in the research question they can investigate, and the projects tend to be much more engaging and informative when students write about something they genuinely care about.
Limiting data sources also has the effect of increasing the time spent finding data, and hence decreasing the time students spend actually analyzing the data and writing up results. Providing a list of curated dataset resources (e.g., DASL) would certainly cut down the time spent looking for data, but I would argue that the learning that happens during the data exploration process is just as valuable as (if not more valuable than) being able to conduct a hypothesis test.
Another approach (one that I have been taking) is allowing the use of population data but requiring a discussion of why it is actually not necessary to do statistical inference in these circumstances. This approach lets students pursue their interests, but interpretations of p-values and confidence intervals calculated from data on the entire population can get quite confusing. In addition, it has the side effect of sending the message “it’s ok if you don’t meet the conditions, just say so, and carry on.” I don’t think this is the message we want students to walk away with from an introductory statistics course. Instead, we should be insisting that they don’t just blindly carry on with the analysis if conditions aren’t met. The “blindly” part is (somewhat) addressed by the required discussion, but the “carry on with the analysis” part is still there.
So is this assignment a disservice to students because it might leave some with the wrong impression? Or is it still a valuable experience regardless of the caveats?
The L.A. Times today (Monday, November 19) ran an editorial about the benefits and costs of Big Data. I truly believe that statisticians should teach introductory students (and all students, really) about data privacy. But who feels they have a realistic handle on the nature of these threats and the size of the risk? I know I don’t. Does anyone teach this in their class? Let’s hear about it! In the meantime, you might enjoy reading (or re-reading) a classic on the topic by Latanya Sweeney: k-Anonymity: a model for protecting privacy.
It is a happy day to be a statistician, as bloggers and columnists are bragging about many correctly predicted victories in an age in which traditional survey methodologies have been rendered out of date. Mark Blumenthal at the Huffington Post reminds us that one role of statistics is to temper personal bias. He gives a shout-out to several pollsters, but I think Nate Silver at 538 is due special mention.
More than just a pollster, Silver is a great statistics educator. And readers of 538 have learned several important lessons that all Citizen Statisticians should know: for instance, that data-based decisions and predictions are based on models, and all models rest on assumptions. Silver examines and challenges these assumptions, and the success of his blog and of his predictions results from his willingness to probe the robustness of the polls and to examine the resulting scenarios. I’m not sure if other pollsters do this. (Blumenthal suggests that the Huffington Post has software that automatically merges polls, but does not suggest that they check the assumptions underlying the various polling models.)
Silver’s willingness to talk about the assumptions behind the models and to explicitly address his methodology is a type of transparency that we should all practice. We talk much about data transparency, which is important, but transparency in analysis is equally important, and vital to scientific reproducibility. Really, it is a type of sharing, and isn’t that one of those lessons we learned in kindergarten?
This testing of assumptions is something we should all teach, at all levels. I fear I don’t do it well enough. Certainly, I teach the conditions/assumptions underlying statistical models, but I’m not sure I do enough to show students what goes wrong if assumptions fail. Sometimes I’ll say a few words, but students need the experience of making failed predictions to understand the importance of assumptions. And they need the tools to tune their models to adjust to inappropriate assumptions.
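Simulation is one way to give students that experience: estimate the actual coverage of a nominal 95% interval for a mean when the normality assumption roughly holds versus when it badly fails. A rough sketch in Python (a classroom version would more naturally be written in R; the distributions, sample size, and repetition count here are arbitrary choices):

```python
import math
import random
import statistics
from statistics import NormalDist

def coverage(draw, n, reps, true_mean, conf=0.95):
    """Fraction of normal-theory z intervals for a mean that cover true_mean."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # about 1.96 for 95%
    hits = 0
    for _ in range(reps):
        x = [draw() for _ in range(n)]
        m = statistics.mean(x)
        se = statistics.stdev(x) / math.sqrt(n)
        hits += (m - z * se <= true_mean <= m + z * se)
    return hits / reps

random.seed(1)

# Assumptions roughly hold: normal data, n = 50
ok = coverage(lambda: random.gauss(0, 1), n=50, reps=2000, true_mean=0)

# Assumptions fail: strongly right-skewed lognormal data, same n
mu_ln = math.exp(1.5 ** 2 / 2)  # true mean of a lognormal(0, 1.5)
bad = coverage(lambda: random.lognormvariate(0, 1.5), n=50, reps=2000,
               true_mean=mu_ln)

print(ok, bad)  # coverage drops well below the nominal 95% in the skewed case
```

Seeing a “95%” interval that actually covers far less often than advertised makes the cost of a failed assumption concrete in a way a lecture rarely does.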
Incidentally, Silver’s new book (which should be flying off the shelves after last night) has a chapter about political pundits. Pundits are not paid to be correct; they are paid to be provocative. And last night shows that they’ll keep their paychecks.
After my Pinterest post, I got a little bit hooked, mostly because I realized that it was a visual way for me to see my bookmarks. This makes it easier for me to find the information I am looking for quickly. One problem is that it requires an image, so I quickly realized that the links for data sets wouldn’t work so well on Pinterest.
Then I remembered that I have used my personal blog as an organized reminder list (see this post where I remind myself how to reset features on my computer after a disaster), and thought I could do the same here, but with data sets that others could also use. So, inspired by the post Rob linked to some time ago (Finding Data), I thought I would start putting together a more comprehensive list as I have time. This way, when I come across new data sets, I can just add them in.
Below, I have taken the data sets listed by RevoJoe from his post for Inside-R, and reorganized them a little. Over time, they could be re-organized again, and again, and again. It would be nice to add a short description for each as well (maybe someday). I have added some others to the list as well.
- Data360: http://www.data360.org/index.aspx
- Datamob.org: http://datamob.org/datasets
- Death Master File: Social Security Death Master File. http://ssdmf.info/download.html
- Enron emails: Carnegie Mellon University. http://www.cs.cmu.edu/~enron/
- Factual: http://www.factual.com/topics/browse
- Freebase: http://www.freebase.com/
- GeoDa Center: Collection of spatial data. https://geodacenter.asu.edu/datalist/
- Google: http://www.google.com/publicdata/directory
- Google Books Corpora: 1.55 billion words. http://googlebooks.byu.edu/#
- Infochimps: http://www.infochimps.com/
- Numbrary: http://numbrary.com/
- Rdatasets: Collection of 597 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. http://vincentarelbundock.github.com/Rdatasets/
- Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
- SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
- UFO Reports: http://www.nuforc.org/webreports.html
- Wikileaks: 911 pager intercepts. http://911.wikileaks.org/files/index.html
- Stats4Stem.org: R data sets. http://www.stats4stem.org/data-sets.html
- Washington Post: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
- EconData: A Source of Economic Time Series Data from Inforum, at the University of Maryland. http://inforumweb.umd.edu/econdata/econdata.html
- World Bank: http://data.worldbank.org/indicator
- America’s Top Colleges: Forbes. http://www.forbes.com/top-colleges/
- Minnesota Department of Education: http://education.state.mn.us/ibi_apps/WFServlet?IBIF_ex=mdea_ddl_driver&TOPICID=1&DDL_VARS=5&NoCache=11.16.04
- Beverage Tasting Institute: Text-based and quantitative reviews for beer and wine http://www.tastings.com
- Hip-Hop Word Count: The Hip-Hop Word Count is a searchable ethnographic database built from the lyrics of over 40,000 Hip-Hop songs from 1979 to present day. http://staplecrops.com/index.php/hiphop_wordcount/
- Magnatagatune: Ready to use research dataset for MIR tasks such as automatic tagging. It contains human annotations, corresponding sound clips, and a detailed analysis of the track’s structure and musical content, including rhythm, pitch and timbre. http://tagatune.org/Magnatagatune.html
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- MusicBrainz Database: MusicBrainz is a community-maintained open source encyclopedia of music information. http://musicbrainz.org/doc/MusicBrainz_Database
- Sean Lahman’s Baseball Database: complete batting and pitching statistics back to 1871, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. http://www.seanlahman.com/baseball-archive/statistics/
FINANCE
- CBOE Futures Exchange: http://cfe.cboe.com/Data/
- Google Finance: http://www.google.com/finance (R)
- Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
- St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
- NASDAQ: https://data.nasdaq.com/
- OANDA: http://www.oanda.com/ (R)
- Yahoo Finance: http://finance.yahoo.com/ (R)
GOVERNMENT
- Archived national government statistics: http://www.archive-it.org/
- Guardian: Data on world governments. http://www.guardian.co.uk/world-government-data
- OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
- United Nations: http://data.un.org/
- World Bank: http://wdronline.worldbank.org/
- Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
- Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
- DataMarket: http://datamarket.com/
- Fed Stats: Celebrating over 10 years of making statistics from more than 100 agencies available to citizens everywhere. http://www.fedstats.gov/cgi-bin/A2Z.cgi
- New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by-subject.aspx
- United Kingdom: Government data. http://data.gov.uk/data
- United States: Federal Government agencies. http://www.data.gov/metric
- United States: Center for Disease Control (CDC) public health data. http://www.cdc.gov/nchs/data_access/ftp_data.htm
- London, U.K. data: http://data.london.gov.uk/catalogue
- New York City Data: http://nycplatform.socrata.com/
- San Francisco Data sets: http://datasf.org/
MACHINE LEARNING
- ACM KDD Cup: Annual Data Mining and Knowledge Discovery competition organized by ACM SIGKDD, targeting real-world problems. http://www.sigkdd.org/kddcup/
- Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
- Frequent Itemset Mining Dataset Repository: Click-stream data, retail market basket data, traffic accident data and web html document data. http://fimi.ua.ac.be/data/
- Image Spam Dataset: A collection of ham and spam images taken from real user email. http://www.cs.jhu.edu/~mdredze/datasets/image_spam/ [Mark Dredze]
- Kaggle: Competition data. http://www.kaggle.com/
- KDNuggets: Competition site. http://www.kdnuggets.com/datasets/
- Koblenz Network Collection: http://konect.uni-koblenz.de/
- Machine Learning Data Set Repository: http://mldata.org/
- Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
- UCI Machine Learning Repository: Collection of databases, domain theories, and data generators. http://archive.ics.uci.edu/ml/
SCIENCE
- Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
- Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
- Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
- Geo Spatial Data: http://geodacenter.asu.edu/datalist/
- Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
- MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
- NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
- NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
- Protein structure: http://www.infobiotic.net/PSPbenchmarks/
- Public Gene Data: http://www.pubgene.org/
- Stanford Microarray Data: http://smd.stanford.edu/
SOCIAL SCIENCES
- General Social Survey: http://www3.norc.org/GSS+Website/
- Inter-university Consortium for Political and Social Research (ICPSR): http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
- Reality Mining: Collection of machine-sensed environmental data pertaining to human social behavior from the MIT Media Lab. http://reality.media.mit.edu/
- UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
- Upjohn Institute for Employment Research: With the cooperation and assistance of the U.S. Department of Labor (USDOL), the Upjohn Institute is serving as the data repository for many research and evaluation projects sponsored by the USDOL. http://www.upjohn.org/erdc/erdc.html
TIME SERIES
- Time Series Data Library: Collection of about 800 time series drawn from many different fields. http://robjhyndman.com/TSDL/
- University of California, Riverside: Data for time series classification and clustering. http://www.cs.ucr.edu/~eamonn/time_series_data/
UNIVERSITIES
- Carnegie Mellon University: StatLib. http://lib.stat.cmu.edu/datasets/
- Carnegie Mellon University: JASA data archive. http://lib.stat.cmu.edu/jasadata/
- Ohio State University: Financial data. http://fisher.osu.edu/fin/osudata.htm
- University of California, Berkeley: http://ucdata.berkeley.edu/
- University of California, Los Angeles (UCLA): http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
- University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html
USING R TO PULL DATA
- Congressional Ideology: DW-NOMINATE scores from voteview.com. http://is-r.tumblr.com/post/33765462561/the-distribution-of-ideology-in-the-u-s-house-with
- Current Population Survey (CPS): Statistics for the census bureau’s report on income, poverty, and health insurance coverage since 1948. http://usgsd.blogspot.com/2012/10/analyzing-current-population-survey-cps.html
- Huffington Post: Download poll data from the Huffington Post API. http://alandgraf.blogspot.com/2012/11/quick-post-about-getting-and-plotting.html
- 2012 Olympics: 100m Sprint (running) Men Finals. http://lamages.blogspot.com/2012/08/london-olympics-100m-mens-sprint-results.html
- 2012 Olympics: 100m Butterfly (swimming) Men Finals. http://www.actuarially.co.uk/post/28625598393/2012-olympics-swimming-100m-butterfly-men-finals
- Infochimps Geo API: Getting data from the Infochimps Geo API in R. The example used pulls data from the American Community Survey. http://dataexcursions.wordpress.com/2011/09/10/getting-data-from-the-infochimps-geo-api-in-r/