After my Pinterest post, I got a little bit hooked, mostly because I realized that it gives me a visual way to see my bookmarks, which makes it easier to find the information I am looking for. One problem is that every pin requires an image, so I quickly realized that links to data sets wouldn’t work so well on Pinterest.
Then I remembered that I have used my personal blog as an organized reminder list (see this post where I remind myself how to reset features on my computer after a disaster), and thought I could do the same here, but with data sets that others could also use. So, inspired by the post Rob linked to some time ago (Finding Data), I thought I would start putting together a more comprehensive list as I have time. This way, when I come across new data sets, I can just add them in.
Below, I have taken the data sets listed by RevoJoe in his post for Inside-R, reorganized them a little, and added some others. Over time, they could be reorganized again, and again, and again. It would be nice to add a short description for each as well (maybe someday).
COLLECTIONS
- Data360: http://www.data360.org/index.aspx
- Datamob.org: http://datamob.org/datasets
- Death Master File: Social Security Death Master File. http://ssdmf.info/download.html
- Enron emails: Carnegie Mellon University. http://www.cs.cmu.edu/~enron/
- Factual: http://www.factual.com/topics/browse
- Freebase: http://www.freebase.com/
- GeoDa Center: Collection of spatial data. https://geodacenter.asu.edu/datalist/
- Google Books Corpora: 155 billion words. http://googlebooks.byu.edu/#
- Infochimps: http://www.infochimps.com/
- Numbrary: http://numbrary.com/
- Rdatasets: Collection of 597 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. http://vincentarelbundock.github.com/Rdatasets/
- Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
- SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
- UFO Reports: http://www.nuforc.org/webreports.html
- Wikileaks: 9/11 pager intercepts. http://911.wikileaks.org/files/index.html
- Stats4Stem.org: R data sets. http://www.stats4stem.org/data-sets.html
- Washington Post: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
ECONOMICS
- EconData: A source of economic time series data from Inforum, at the University of Maryland. http://inforumweb.umd.edu/econdata/econdata.html
- World Bank: http://data.worldbank.org/indicator
EDUCATION
- America’s Top Colleges: Forbes. http://www.forbes.com/top-colleges/
- Minnesota Department of Education: http://education.state.mn.us/ibi_apps/WFServlet?IBIF_ex=mdea_ddl_driver&TOPICID=1&DDL_VARS=5&NoCache=11.16.04
ENTERTAINMENT
- Beverage Testing Institute: Text-based and quantitative reviews for beer and wine. http://www.tastings.com
- Hip-Hop Word Count: The Hip-Hop Word Count is a searchable ethnographic database built from the lyrics of over 40,000 Hip-Hop songs from 1979 to present day. http://staplecrops.com/index.php/hiphop_wordcount/
- Magnatagatune: Ready-to-use research dataset for MIR tasks such as automatic tagging. It contains human annotations, corresponding sound clips, and a detailed analysis of each track’s structure and musical content, including rhythm, pitch and timbre. http://tagatune.org/Magnatagatune.html
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- MusicBrainz Database: MusicBrainz is a community-maintained open source encyclopedia of music information. http://musicbrainz.org/doc/MusicBrainz_Database
- Sean Lahman’s Baseball Database: Complete batting and pitching statistics back to 1871, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. http://www.seanlahman.com/baseball-archive/statistics/
FINANCE
- CBOE Futures Exchange: http://cfe.cboe.com/Data/
- Google Finance: http://www.google.com/finance (R)
- Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
- St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
- NASDAQ: https://data.nasdaq.com/
- OANDA: http://www.oanda.com/ (R)
- Yahoo Finance: http://finance.yahoo.com/ (R)
GOVERNMENT, WORLD-LEVEL
- Archived national government statistics: http://www.archive-it.org/
- Guardian: Data on world governments. http://www.guardian.co.uk/world-government-data
- OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
- United Nations: http://data.un.org/
- World Bank: http://wdronline.worldbank.org/
GOVERNMENT, COUNTRY-LEVEL
- Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
- Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
- DataMarket: http://datamarket.com/
- Fed Stats: Celebrating over 10 years of making statistics from more than 100 agencies available to citizens everywhere. http://www.fedstats.gov/cgi-bin/A2Z.cgi
- New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by-subject.aspx
- United Kingdom: Government data. http://data.gov.uk/data
- United States: Federal Government agencies. http://www.data.gov/metric
- United States: Centers for Disease Control and Prevention (CDC) public health data. http://www.cdc.gov/nchs/data_access/ftp_data.htm
GOVERNMENT, CITY-LEVEL
- London, U.K. data: http://data.london.gov.uk/catalogue
- New York City Data: http://nycplatform.socrata.com/
- San Francisco Data sets: http://datasf.org/
MACHINE LEARNING
- ACM KDD Cup: Annual Data Mining and Knowledge Discovery competition organized by ACM SIGKDD, targeting real-world problems. http://www.sigkdd.org/kddcup/
- Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
- Frequent Itemset Mining Dataset Repository: Click-stream data, retail market basket data, traffic accident data and web HTML document data. http://fimi.ua.ac.be/data/
- Image Spam Dataset: A collection of ham and spam images taken from real user email. http://www.cs.jhu.edu/~mdredze/datasets/image_spam/ [Mark Dredze]
- Kaggle: Competition data. http://www.kaggle.com/
- KDNuggets: Competition site. http://www.kdnuggets.com/datasets/
- Koblenz Network Collection: http://konect.uni-koblenz.de/
- Machine Learning Data Set Repository: http://mldata.org/
- Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
- UCI Machine Learning Repository: Collection of databases, domain theories, and data generators. http://archive.ics.uci.edu/ml/
SCIENCE
- Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
- Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
- Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
- Geo Spatial Data: http://geodacenter.asu.edu/datalist/
- Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
- MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
- NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
- NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
- Protein structure: http://www.infobiotic.net/PSPbenchmarks/
- Public Gene Data: http://www.pubgene.org/
- Stanford Microarray Data: http://smd.stanford.edu/
SOCIAL SCIENCES
- General Social Survey: http://www3.norc.org/GSS+Website/
- Inter-university Consortium for Political and Social Research (ICPSR): http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
- Reality Mining: Collection of machine-sensed environmental data pertaining to human social behavior from the MIT Media Lab. http://reality.media.mit.edu/
- UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
- Upjohn Institute for Employment Research: With the cooperation and assistance of the U.S. Department of Labor (USDOL), the Upjohn Institute is serving as the data repository for many research and evaluation projects sponsored by the USDOL. http://www.upjohn.org/erdc/erdc.html
TIME SERIES
- Time Series Data Library: Collection of about 800 time series drawn from many different fields. http://robjhyndman.com/TSDL/
- University of California, Riverside: Data for time series classification and clustering. http://www.cs.ucr.edu/~eamonn/time_series_data/
UNIVERSITIES
- Carnegie Mellon University: StatLib. http://lib.stat.cmu.edu/datasets/
- Carnegie Mellon University: JASA data archive. http://lib.stat.cmu.edu/jasadata/
- Ohio State University: Financial data. http://fisher.osu.edu/fin/osudata.htm
- University of California, Berkeley: http://ucdata.berkeley.edu/
- University of California, Los Angeles (UCLA): http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
- University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html
USING R TO PULL DATA
- Congressional Ideology: DW-NOMINATE scores from voteview.com. http://is-r.tumblr.com/post/33765462561/the-distribution-of-ideology-in-the-u-s-house-with
- Current Population Survey (CPS): Statistics for the Census Bureau’s report on income, poverty, and health insurance coverage since 1948. http://usgsd.blogspot.com/2012/10/analyzing-current-population-survey-cps.html
- Huffington Post: Download poll data from the Huffington Post API. http://alandgraf.blogspot.com/2012/11/quick-post-about-getting-and-plotting.html
- 2012 Olympics: 100m Sprint (running) Men Finals. http://lamages.blogspot.com/2012/08/london-olympics-100m-mens-sprint-results.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FrKuKM+%28mages%27+blog%29
- 2012 Olympics: 100m Butterfly (swimming) Men Finals. http://www.actuarially.co.uk/post/28625598393/2012-olympics-swimming-100m-butterfly-men-finals
- Infochimps Geo API: Getting data from the Infochimps Geo API in R. The example used pulls data from the American Community Survey. http://dataexcursions.wordpress.com/2011/09/10/getting-data-from-the-infochimps-geo-api-in-r/