After my Pinterest post, I got a little bit hooked, mostly because I realized that it gave me a visual way to see my bookmarks, which makes it easier to find the information I am looking for quickly. One problem is that every pin requires an image, so I quickly realized that links to data sets wouldn’t work so well on Pinterest.
Then I remembered that I have used my personal blog as an organized reminder list (see this post where I remind myself how to reset features on my computer after a disaster), and thought I could do the same here, but with data sets that others could also use. So, inspired by the post Rob linked to some time ago (Finding Data), I thought I would start putting together a more comprehensive list as I have time. This way, when I come across new data sets, I can just add them in.
Below, I have taken the data sets RevoJoe listed in his post for Inside-R and reorganized them a little. Over time, they could be reorganized again, and again, and again. It would be nice to add a short description for each as well (maybe someday). I have also added some others to the list.
COLLECTIONS
- Data360: http://www.data360.org/index.aspx
- Datamob.org: http://datamob.org/datasets
- Death Master File: Social Security Death Master File. http://ssdmf.info/download.html
- Enron emails: Carnegie Mellon University. http://www.cs.cmu.edu/~enron/
- Factual: http://www.factual.com/topics/browse
- Freebase: http://www.freebase.com/
- GeoDa Center: Collection of spatial data. https://geodacenter.asu.edu/datalist/
- Google: http://www.google.com/publicdata/directory
- Google Books Corpora: 1.55 billion words. http://googlebooks.byu.edu/#
- Infochimps: http://www.infochimps.com/
- Numbrary: http://numbrary.com/
- Rdatasets: Collection of 597 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. http://vincentarelbundock.github.com/Rdatasets/
- Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
- SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
- UFO Reports: http://www.nuforc.org/webreports.html
- WikiLeaks: 9/11 pager intercepts. http://911.wikileaks.org/files/index.html
- Stats4Stem.org: R data sets. http://www.stats4stem.org/data-sets.html
- Washington Post: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
ECONOMICS
- EconData: A Source of Economic Time Series Data from Inforum, at the University of Maryland. http://inforumweb.umd.edu/econdata/econdata.html
- World Bank: http://data.worldbank.org/indicator
EDUCATION
- America’s Top Colleges: Forbes. http://www.forbes.com/top-colleges/
- Minnesota Department of Education: http://education.state.mn.us/ibi_apps/WFServlet?IBIF_ex=mdea_ddl_driver&TOPICID=1&DDL_VARS=5&NoCache=11.16.04
ENTERTAINMENT
- Beverage Testing Institute: Text-based and quantitative reviews for beer and wine. http://www.tastings.com
- Hip-Hop Word Count: The Hip-Hop Word Count is a searchable ethnographic database built from the lyrics of over 40,000 Hip-Hop songs from 1979 to present day. http://staplecrops.com/index.php/hiphop_wordcount/
- Magnatagatune: Ready-to-use research dataset for MIR tasks such as automatic tagging. It contains human annotations, corresponding sound clips, and a detailed analysis of each track’s structure and musical content, including rhythm, pitch and timbre. http://tagatune.org/Magnatagatune.html
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- MusicBrainz Database: MusicBrainz is a community-maintained open source encyclopedia of music information. http://musicbrainz.org/doc/MusicBrainz_Database
- Sean Lahman’s Baseball Database: complete batting and pitching statistics back to 1871, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. http://www.seanlahman.com/baseball-archive/statistics/
FINANCE
- CBOE Futures Exchange: http://cfe.cboe.com/Data/
- Google Finance: http://www.google.com/finance (R)
- Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
- St. Louis Fed: http://research.stlouisfed.org/fred2/ (R)
- NASDAQ: https://data.nasdaq.com/
- OANDA: http://www.oanda.com/ (R)
- Yahoo Finance: http://finance.yahoo.com/ (R)
GOVERNMENT, WORLD-LEVEL
- Archived national government statistics: http://www.archive-it.org/
- Guardian: Data on world governments. http://www.guardian.co.uk/world-government-data
- OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
- United Nations: http://data.un.org/
- World Bank: http://wdronline.worldbank.org/
GOVERNMENT, COUNTRY-LEVEL
- Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
- Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
- DataMarket: http://datamarket.com/
- Fed Stats: Celebrating over 10 years of making statistics from more than 100 agencies available to citizens everywhere. http://www.fedstats.gov/cgi-bin/A2Z.cgi
- New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by-subject.aspx
- United Kingdom: Government data. http://data.gov.uk/data
- United States: Federal Government agencies. http://www.data.gov/metric
- United States: Centers for Disease Control and Prevention (CDC) public health data. http://www.cdc.gov/nchs/data_access/ftp_data.htm
GOVERNMENT, CITY-LEVEL
- London, U.K. data: http://data.london.gov.uk/catalogue
- New York City Data: http://nycplatform.socrata.com/
- San Francisco Data sets: http://datasf.org/
MACHINE LEARNING
- ACM KDD Cup: Annual Data Mining and Knowledge Discovery competition organized by ACM SIGKDD, targeting real-world problems. http://www.sigkdd.org/kddcup/
- Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
- Frequent Itemset Mining Dataset Repository: Click-stream data, retail market basket data, traffic accident data and web html document data. http://fimi.ua.ac.be/data/
- Image Spam Dataset: A collection of ham and spam images taken from real user email. http://www.cs.jhu.edu/~mdredze/datasets/image_spam/ [Mark Dredze]
- Kaggle: Competition data. http://www.kaggle.com/
- KDNuggets: Competition site. http://www.kdnuggets.com/datasets/
- Koblenz Network Collection: http://konect.uni-koblenz.de/
- Machine Learning Data Set Repository: http://mldata.org/
- Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
- UCI Machine Learning Repository: Collection of databases, domain theories, and data generators. http://archive.ics.uci.edu/ml/
SCIENCE
- Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
- Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
- Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
- Geo Spatial Data: http://geodacenter.asu.edu/datalist/
- Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
- MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
- NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
- NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
- Protein structure: http://www.infobiotic.net/PSPbenchmarks/
- Public Gene Data: http://www.pubgene.org/
- Stanford Microarray Data: http://smd.stanford.edu//
SOCIAL SCIENCES
- General Social Survey: http://www3.norc.org/GSS+Website/
- Inter-university Consortium for Political and Social Research (ICPSR): http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
- Reality Mining: Collection of machine-sensed environmental data pertaining to human social behavior from the MIT Media Lab. http://reality.media.mit.edu/
- UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
- Upjohn Institute for Employment Research: With the cooperation and assistance of the U.S. Department of Labor (USDOL), the Upjohn Institute is serving as the data repository for many research and evaluation projects sponsored by the USDOL. http://www.upjohn.org/erdc/erdc.html
TIME SERIES
- Time Series Data Library: Collection of about 800 time series drawn from many different fields. http://robjhyndman.com/TSDL/
- University of California, Riverside: Data for time series classification and clustering. http://www.cs.ucr.edu/~eamonn/time_series_data/
UNIVERSITIES
- Carnegie Mellon University: StatLab. http://lib.stat.cmu.edu/datasets/
- Carnegie Mellon University: JASA data archive. http://lib.stat.cmu.edu/jasadata/
- Ohio State University: Financial data. http://fisher.osu.edu/fin/osudata.htm
- University of California, Berkeley: http://ucdata.berkeley.edu/
- University of California, Los Angeles (UCLA): http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
- University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html
USING R TO PULL DATA
- Congressional Ideology: DW-NOMINATE scores from voteview.com. http://is-r.tumblr.com/post/33765462561/the-distribution-of-ideology-in-the-u-s-house-with
- Current Population Survey (CPS): Statistics for the Census Bureau’s report on income, poverty, and health insurance coverage since 1948. http://usgsd.blogspot.com/2012/10/analyzing-current-population-survey-cps.html
- Huffington Post: Download poll data from the Huffington Post API. http://alandgraf.blogspot.com/2012/11/quick-post-about-getting-and-plotting.html
- 2012 Olympics: 100m Sprint (running) Men Finals. http://lamages.blogspot.com/2012/08/london-olympics-100m-mens-sprint-results.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FrKuKM+%28mages%27+blog%29
- 2012 Olympics: 100m Butterfly (swimming) Men Finals. http://www.actuarially.co.uk/post/28625598393/2012-olympics-swimming-100m-butterfly-men-finals
- Infochimps Geo API: Getting data from the Infochimps Geo API in R. The example used pulls data from the American Community Survey. http://dataexcursions.wordpress.com/2011/09/10/getting-data-from-the-infochimps-geo-api-in-r/