Data Analysis and Statistical Inference starts tomorrow on Coursera

It has been (and still is) lots of work putting this course together, but I’m incredibly excited about the opportunity to teach (and learn from) the masses! Course starts tomorrow (Feb 17, 2014) at noon EST.

A huge thanks also goes out to my student collaborators who helped develop, review, and revise much of the course materials (and who will be taking the role of Community TAs on the course discussion forums) and to Duke’s Center for Instructional Technology who pretty much runs the show.

This course is also part of the Reasoning, Data Analysis and Writing Specialization, along with Think Again: How to Reason and Argue and English Composition 1: Achieving Expertise. This interdisciplinary specialization is designed to strengthen students’ ability to engage with others’ ideas and communicate productively with them by analyzing their arguments, identifying the inferences they are drawing, and understanding the reasons that inform their beliefs. After taking all three courses, students complete an in-depth capstone project where they choose a controversial topic and write an article-length essay in which they use their analysis of the data to argue for their own position about that topic.

Let’s get this party started!

Conditional probabilities and kitties

I was at the vet yesterday, and just like with any doctor’s visit experience, there was a bit of waiting around — time for re-reading all the posters in the room.

And this is what caught my eye on the information sheet about feline heartworm (I’ll spare you the images):

[Image: the relevant Q&A from the feline heartworm information sheet]

The question asks: “My cat is indoor only. Is it still at risk?”

The way I read it, this question is asking about the risk of an indoor-only cat being heartworm positive. To answer this question we would want to know P(heartworm positive | indoor only).

However the answer says: “A recent study found that 27% of heartworm positive cats were identified as exclusively indoor by their owners”, which is P(indoor only | heartworm positive) = 0.27.

Sure, this gives us some information, but it doesn’t actually answer the original question. The original question is asking about the reverse of this conditional probability.
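
To see how different the two conditional probabilities can be, here is a minimal sketch in R with made-up numbers (only the 27% comes from the flyer; the prevalence and indoor-only rates below are hypothetical):

# hypothetical inputs: suppose 10% of cats are heartworm positive
# and 60% of cats are kept exclusively indoors
p_hw        = 0.10  # P(heartworm positive), made up
p_indoor    = 0.60  # P(indoor only), made up
p_indoor_hw = 0.27  # P(indoor only | heartworm positive), from the flyer

# Bayes' theorem gives the probability the question actually asks about
p_hw_indoor = p_indoor_hw * p_hw / p_indoor
p_hw_indoor # 0.045 -- nowhere near 0.27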

When we talk about Bayes’ theorem in my class and work through examples about sensitivity and specificity of medical tests, I always tell my students that doctors are actually pretty bad at these. Looks like I’ll need to add vets to my list too!

My first Shiny experience – CLT applet

When introducing the Central Limit Theorem for the first time in class, I used to use applets like the SOCR Sampling Distribution Applet or the OnlineStatBook Sampling Distribution Applet. If you are reading this post on Google Chrome, chances are those previous links did not work for you. If on another browser, they may have, but you may have also seen warnings like this one:

[Image: Java security warning]

Last year when I tried using one of these applets in class and had students pull it up on their own computers as well, it was chaos. Between warnings like this and no simple way for everyone on their various computers and operating systems to update Java, most students got frustrated. As a class we had to give up playing with the applet, and the students just watched me go through the demonstrations on the screen.

In an effort to make things a little easier this year, I searched to see if I could find something similar created using Shiny. This one, created by Tarik Gouhier, looked pretty promising. However it wasn’t exactly what I was looking for. For example, it’s pretty safe to assume that my students have never heard of the Cauchy distribution, and I didn’t want to present something that might confuse them further.

Thanks to the code being available on GitHub, I was able to re-write the applet to match the functionality of the previous CLT applets: http://rundel.dyndns.org:3838/CLT.

[Image: screenshot of the CLT Shiny applet]
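
For those curious what a bare-bones version of such an applet involves, here is a minimal sketch (my own simplified illustration, not the actual applet code): the user picks a population shape and a sample size, and the app plots the distribution of many sample means.

library(shiny)

ui = fluidPage(
  titlePanel("CLT demo (simplified sketch)"),
  sidebarLayout(
    sidebarPanel(
      selectInput("dist", "Population distribution",
                  choices = c("normal", "uniform", "right skewed")),
      sliderInput("n", "Sample size", min = 2, max = 500, value = 30)
    ),
    mainPanel(plotOutput("samp_dist"))
  )
)

server = function(input, output) {
  output$samp_dist = renderPlot({
    # random number generator matching the chosen population
    rpop = switch(input$dist,
                  "normal"       = rnorm,
                  "uniform"      = runif,
                  "right skewed" = rexp)
    # means of 1000 samples of size n drawn from that population
    xbar = replicate(1000, mean(rpop(input$n)))
    hist(xbar, main = "Sampling distribution of the sample mean",
         xlab = "sample means")
  })
}

shinyApp(ui, server)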

I’m sure I’ll make some edits to the applet after I class-test it today. Among planned improvements are:

  • an intermediary step between the top (population distribution) and the bottom (sampling distribution) plots: the sample distribution.
  • sliders for input parameters (like mean and standard deviation) for the population distribution.

None of this is revolutionary, but it’s great to be able to build on someone else’s work so quickly. Plus, since all of the code is in R, which the students are learning anyway, those who are particularly motivated can dive deeper and can see the connection between the demonstration and what they’re doing in lab.

If you use such demonstrations in your class and have suggestions for improvements, leave a comment below. If you’d like to customize the applet for your use, the code is linked on the applet page, and I’ll be transitioning it to GitHub as I work on creating a few more such applets.

(I should also thank Colin Rundel who helped with the implementation and is temporarily hosting the applet on his server until I get my Shiny Server set up — I filled out the registration form last night but I’m not yet sure what the next step is supposed to be.)

JSM 2013 – Days 4 and 4.5

I started off my Wednesday with the “The New Face of Statistics Education (#480)” session. Erin Blankenship from UNL talked about their second course in statistics, a math/stat course where students don’t just learn how to calculate sufficient statistics and unbiased estimators but also learn what the values they’re calculating mean in the context of the data. The goal of the course is to bring together the kind of reasoning emphasized in intro stat courses with the mathematical rigor of a traditional math/stat course. Blankenship mentioned that almost 90% of the students taking the class are actuarial science students who need to pass the P exam (the first actuarial exam), so probability theory must be a major component of the course. However, UNL has been bridging the gap between these demands and the GAISE guidelines by introducing technology to the course (simulating empirical sampling distributions, checking distributional assumptions, numerical approximation) as well as using writing assessments to improve and evaluate student learning. For example, students are asked to explain in their own words the difference between a sufficient statistic and a minimal sufficient statistic, and answers that put things in context instead of regurgitating differences are graded highly. This approach not only allows students who struggle with math to demonstrate understanding, but it also reveals the shallow understanding of students who may test well on the math simply by going through the mechanics.

In my intro stat class I used to ask similar questions on exams, but have been doing so less and less lately in the interest of time spent on grading (they can be tedious to grade). However, lately I’ve been trying to incorporate more activities into the class, and I’m thinking such exercises might be quite appropriate as class activities where students work in teams to perfect their answers and perhaps even grade each other’s answers.

Anyway, back to the session… Another talk in the session, given by Chris Malone from Winona State, was about modernizing the undergraduate curriculum. Chris made the point that we need much more than just cosmetic changes, as he believes the current undergraduate curriculum is disconnected from what graduates are doing when they get their first job. His claim was that the current curriculum is designed for the student who is going on to graduate school in statistics, but that’s only about a fifth of undergraduate statistics majors. (As an aside, I would have guessed the ratio to be even lower.) He advocated for more computing in the undergrad curriculum, a common thread among many of the education talks at JSM this year, and described a few new data science programs at Winona and other universities. Another common thread was the discussion of “data science” vs. “statistics”, but I’m not going to go there – at least not in this post. (If you’re interested in this discussion, this Simply Statistics post initiated a good conversation on the topic in the comments section.) I started making a list of data science programs I found while searching online, but this post seems to have a pretty exhaustive list (the original post dates back to 2012, but it seems to be updated regularly).

Other notes from the day:
- The R visreg package looks pretty cool, though perhaps not very useful for an intro stat course where we don’t cover interactions, non-linear regression, etc.
- There is another DataFest-like competition going on in the Midwest: MUDAC – maybe we should do a contributed session at JSM next year where organizers share experiences with each other and the audience, to solicit more interest in their events or inspire others.

On Thursday I only attended one session: “Teaching the Fundamentals (#699)” (the very last session of the conference, and my own). You can find the slides for my talk, on using R Markdown to teach data analysis in R as well as to instill the importance of reproducible research early on, here.

One of the other speakers in my session was Robert Jernigan, who I recognize from this video. He talked about how students confuse “diversity” and “variability” and hence have a difficult time understanding why a dataset like [60,60,60,10,10,10] has a higher standard deviation than a dataset like [10,20,30,40,50,60]. He also mentioned his blog statpics.com, which seems to have some interesting examples of images like the ones in his video on distributions.
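
A quick check in R confirms the counterintuitive part: the first dataset has only two distinct values, yet the larger standard deviation.

x = c(60, 60, 60, 10, 10, 10)
y = c(10, 20, 30, 40, 50, 60)
sd(x) # 27.4 -- less "diverse", but more variable
sd(y) # 18.7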

John Walker from Cal Poly San Luis Obispo discussed his experiment on how well students can recognize normal and non-normal distributions using normal probability plots — a standard approach for checking conditions for many statistical methods. He showed that faculty do significantly better than students, which I suppose means that you do get better at this with more exposure. However, the results aren’t final, and he is considering some changes to his design. I’m eager to see the final results of his experiment, especially if they come with some evidence for (or suggestions on) the best way to teach this skill.
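
If you want to generate practice plots like the ones presumably used in such an experiment (a guess at the setup, not Walker’s actual materials), a few lines of R suffice:

set.seed(42)
x = rnorm(50) # a genuinely normal sample
qqnorm(x); qqline(x) # points fall roughly on a line

y = rexp(50) # a right-skewed sample
qqnorm(y); qqline(y) # systematic curvature flags non-normality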

JSM 2013 – Day 3

Tuesday was a slightly shorter day for me in terms of talks, as I had a couple of meetings to attend. The first talk I attended was my colleague Kari Lock Morgan’s talk titled “Teaching PhD Students How to Teach” (in the “Teaching Outside the Box, Ever So Slightly (#358)” session). The talk was about a class on teaching that she took as a grad student and now teaches at Duke. She actually started off by saying that she thought the title of her talk was misleading, as the talk wasn’t about teaching PhD students a particular way to teach, but instead about getting these students to think about teaching, which, especially at research universities, can take a backseat to research. The course features role-playing office hours, videotaped teaching sessions which students then watch and critique themselves and each other, as well as writing and revising teaching statements. If you’re interested in creating a similar course, you can find her materials on her course webpage.

In the afternoon I attended part of the “The ‘Third’ Course in Applied Statistics for Undergraduates (#414)” session. The first talk, titled “Statistics Without the Normal Distribution” by Monnie McGee, started off by listing three “lies” and corresponding “truths”:

  • Lie: T-intervals are appropriate for n>30.
  • Truth: It’s time to retire the n>30 rule. (She referenced this paper by Tim Hesterberg.)
  • Lie: Use the t-distribution for small data sets.
  • Truth: Permutation distributions give exact p-values for small data sets.
  • Lie: If a linear regression doesn’t work, try a transformation.
  • Truth: The world is nonlinear and multivariate and dynamic. (I don’t think “try a transformation” should be considered a lie, perhaps a “lie” would be “If a linear regression doesn’t work, a transformation will always work.”)

McGee talked about how they’ve reorganized the curriculum at Southern Methodist University so that statistics students take a class on non-parametrics before their sampling course. This class covers rank- and EDF-based procedures, such as the Wilcoxon signed-rank and Mann-Whitney tests, as well as resampling methods, which are especially useful for estimating numerous features of a distribution, like the median, independently of the population distribution. The course uses the text by Higgins (Introduction to Modern Nonparametric Statistics) as well as a series of supplements (which I didn’t take notes on, but I’m sure she’d be happy to share the list with you if you’re interested). However, she also mentioned that she is looking for an alternative textbook for the course. Pedagogically, the class uses just-in-time teaching methods — students read the material and complete warm-up exercises before class each week, and class time is tailored to concepts that students appear to be struggling with based on their performance on the warm-up exercises.
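
Since the “truths” above lean on permutation distributions, here is a minimal sketch of a two-sample permutation test in R, with made-up data:

set.seed(1)
x = c(12, 15, 9, 14, 11)  # hypothetical group 1
y = c(18, 20, 16, 22, 17) # hypothetical group 2

obs_diff = mean(x) - mean(y)
pooled = c(x, y)

# shuffle the group labels many times and recompute the difference
perm_diffs = replicate(10000, {
  idx = sample(length(pooled), length(x))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# two-sided p-value: how often is a shuffled difference as extreme as the observed one?
mean(abs(perm_diffs) >= abs(obs_diff))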

The second talk in the session titled “Nonlinear, Non-Normal, Non-Independent?” was given by Alison Gibbs. Gibbs also described a course that focuses on models for situations when classical regression assumptions aren’t met. She gave examples from a case study on HPV vaccinations that she uses in this class (I believe the data come from this paper). She emphasized the importance of introducing datasets that are interesting, controversial, authentic, and that lend themselves to asking compelling questions. She also mentioned that she doesn’t use a textbook for this class, and finds this liberating. While I can see how not being tied to a textbook would be liberating, I can’t help but think some students might find it difficult to not have a reference — especially those who are struggling in the class. However I presume this issue can be addressed by providing the students with lecture notes and other resources in a very organized fashion. I have to admit that I was hoping that I would hear Gibbs talk about her MOOC at this conference as I am gearing up to teach a similar MOOC next year. Perhaps I should track her down and pick her brain a bit…

At this point I ducked out of this session to see my husband Colin Rundel’s talk in the “Statistical Computing: Software and Graphics (#430)” session. His talk was on a new R package that he is working on (RcppGP) to improve the performance of Gaussian process models using GPU computing. He started with a quote: “If the computing complexity is linear, you’re OK; if quadratic, pray; if cubic, give up.” Looks like he and other people working in this area are not willing to give up quite yet. If you’re interested in his code and slides, you can find them at his GitHub page.

The sessions on my agenda for tomorrow are:

JSM 2013 – Day 2

My Monday at JSM started with the “The Profession of Statistics and Its Impact on the Media (#102)” session. The first speaker in the session, Mark Hansen, was a professor of mine at UCLA, so it was nice to see a familiar face (or more like hear a familiar voice – the room was so jam-packed that I couldn’t really “see” him) and catch up on what he has been working on in his new position at Columbia University as a Professor of Journalism and the Director of the David and Helen Gurley Brown Institute for Media Innovation. The main theme of the talk was the interaction between journalists and statisticians — he discussed how journalism can provide much-needed perspective, language, and practices for describing the forces that data exert in our worlds, and help even statisticians gain fresh perspective on their practice. He pointed out a difference between how journalists and statisticians work with data: journalists work with data to tell a story in the context of a dataset, while statisticians tend to tell the story of the dataset. Hansen also discussed Columbia’s new two-year dual-degree Master’s in journalism and computer science. The Brown Institute also awards seed funding to students for developing media technologies that could transform how news content is produced, delivered, and consumed. I’ve listed a few of the projects that Hansen discussed below, and a detailed post on these grants can be found here.

  • Dispatch: a mobile application that provides secure, authenticated, anonymous instant publishing.
  • Personalized Television News: a project that seeks to develop and demonstrate a platform for personalized television news to replace the traditional one-broadcast-fits-all model.
  • CityBeat: a project that looks for newsworthy events in the patterns of real-time, geotagged social media feeds.
  • The Declassification Engine: An engine that uses machine learning to declassify documents.
  • Bushwig: Telling the story of a drag renaissance taking place in Bushwick, Brooklyn, that is enlisting and extending social media platforms for the “identity curation” that happens in the drag community. I had no idea that Facebook does not allow two profiles for the same person (or at least takes them down when they’re discovered), which, as you can imagine, can be an issue for people who live their lives in two identities.

Hansen also discussed his recent collaboration with the NYTimes R&D Lab on projects such as Project Cascade, a tool that constructs “a detailed picture of how information propagates through the social media space”, like Twitter.

The next talk in the session, by Don Berry, discussed fundamental issues in statistics that are difficult to convey to journalists, and hence the rest of the public, such as Simpson’s paradox, results that are “too good to be true” (e.g. dogs sniffing cancer), regression to the mean, multiple comparisons, etc. He also discussed at length the prosecutor’s fallacy, within the context of the case of nurse Lucia de Berk, who was convicted in 2004 of a number of murders and attempted murders of patients in her care, but then was freed in 2010. I don’t discuss the prosecutor’s fallacy in my introductory statistics class, but I’m thinking that I should… Berry recommended this NYTimes article on the topic, as well as this TED talk on the prosecutor’s fallacy in general. Berry, who is often quoted in newspaper articles as an expert, also discussed what statisticians (and other scientists) should and should not do when interacting with journalists. Some of the key points were:

  • Simplify, short of lying
  • Be pithy
  • Avoid questions that you don’t want to answer – he mentioned that he avoids questions like “What are the economic implications?”
  • Use going off the record sparingly
  • Prefer email over telephone – so that you can edit your own words
  • Don’t diss anyone

The last one seems obvious, but see this Washington Post article on the 2009 breast cancer screening frequency controversy. In the article, a radiology professor from Harvard is quoted saying “Tens of thousands of lives are being saved by mammography screening, and these idiots want to do away with it”. Wow!

The next speaker was Howard Wainer (whose article titled “The Most Dangerous Equation: Ignorance of how sample size affects statistical variation has created havoc for nearly a millennium” is a good read, by the way). I am excited to take a peek at his recently published book Medical Illuminations at the Expo later today.

The last speaker in the session was Alan Schwarz, the Pulitzer Prize-nominated reporter at the NYTimes who wrote an exposé on current and retired football players suffering from post-concussion syndrome and early-onset dementia, more specifically Chronic Traumatic Encephalopathy. A journalist talking about using data and statistics to uncover a story was a nice complement to the earlier talks by statisticians about working with journalists.

In the afternoon I attended the “Toward Big Data in Teaching Statistics (#210)” session. Nicholas Chamandy from Google talked about how big data requires novel solutions to old problems and gave examples from some of the algorithms Google uses to solve problems in predictive modeling.

Randall Pruim’s talk focused on the efforts of the Computation and Visualization Consortium, which has started working on identifying key skills that students need to work with big data and ways to teach them. It was quite eye-opening to hear about a survey he conducted asking faculty members from science departments such as physics and chemistry what kind/size of data their students work with – it turns out for many the answer is no data at all! He also gave an overview of efforts at Macalester, Smith, and Calvin Colleges to introduce big data skills into their curricula. I will be looking into the syllabus for the class being taught at Macalester by Danny Kaplan, as I’m also currently brainstorming how best to teach core computational skills to our students.

Nick Horton also discussed his vision for accomplishing this, which is to start in the first course, build on it in the second course, provide more opportunities for students to apply their knowledge in practice (internships, collaborative research, teaching assistantships), and introduce new courses focused on data science into the curriculum. He also discussed exposing students to reproducible research using RStudio and R Markdown. I’ve previously written a blog post about this, and will be talking about it at my own talk on Thursday as well. It was nice to see a similar approach being used by others in the statistics education field. What especially resonated with me was Nick’s comment on how using R Markdown facilitates an appropriate and correct statistical workflow for students.

The last talk of the day I attended was Nate Silver’s President’s Invited Address, along with just about everyone else attending JSM. The turnout was great, and his talk was highly enjoyable, as expected. Gregory Matthews (Stats in the Wild) already posted a list of his talking points, so instead of listing them here again, I’ll just link to that post. The Q&A was just as interesting as the talk itself, below are a few notes I jotted down:

  • Q: What can statisticians learn from journalists?
  • A: Clarity of expression – results are only useful when you can explain them.
  • Q: How can ASA and statisticians do more on advocacy?
  • A: Blog! Researchers should do their own communication.
  • Q: Any career advice for young statisticians?
  • A: Do something practical and applied first, theory is easier to learn as needed.
  • Q: Favorite journalist/writer?
  • A: Bill James
  • Q: Data scientist vs. statistician?
  • A: Call yourself whatever you want, just do good work. (One of the better answers I’ve heard on this topic. Though his earlier answer “data science is just a sexed up term for statistics” seemed to resonate well with some in the room and not so much with others.)
  • Q: What is the future of sports statistics?
  • A: More data being collected on soccer, so there is more to be done there. (This means that finally there may be sports statistics that I actually care about and can get excited by!)

After the talks I stopped by the UCLA mixer; it was nice to see some old faces. And I finished up the evening at the Duke dinner, with great company and lots of wine…

Now on to Day 3…

JSM 2013 – Day 1

Bonjour de Montréal!

I’m at JSM 2013, and thought it might be nice to give a brief summary of highlights of each day. Given the size of the event, any session that I attend means I’m missing at least ten others. So this is in no way an exhaustive overview of the day at the conference, more tidbits from my day here. I’ll make a public commitment to post daily throughout the conference, hoping that the guilt of not living up to my promise helps me not lose steam after a couple days.

The first session I attended today had a not-so-exciting title — “Various Topics in Statistics Education (#43)” — but turned out to be quite the opposite. The first three talks of the session were about the case of Diederik Stapel, a former professor of social psychology in the Netherlands who was suspended from Tilburg University for research fraud. Stapel published widely publicized studies, some of which purported to show that a trash-filled environment brings out racist tendencies in individuals or that eating meat makes people selfish and less social. As the speakers at the session (Ruud Koning, Marijtje van Duijn, and Wendy Post from the University of Groningen, and Don van Ravenzwaaij from the University of New South Wales) put it today, the data and the results were “too good to be true”.

First, Koning gave an overview of the case – unfortunately I walked in a little late. If you’re not familiar with it, I would recommend this NYTimes article as well as this paper by Pieter Drenth.

Next, van Duijn discussed best practices for reviewers so that fraud can be caught early on. For example, some indicators of mistakes in Stapel’s papers were impossible means and effect sizes (compared to previous literature), impossible combinations of sample size and degrees of freedom, and incorrect p-values. These could, and should, have been caught by reviewers, but this is easier said than done. van Duijn and van Ravenzwaaij suggest that journals should encourage sharing data and reproducibility (a view shared by many in the statistics community). However, the responsibility of ensuring thorough reviews should also be shared by universities, science foundations, and policy makers. For example, an interesting suggestion was universities rewarding good peer reviews, as well as good data collection, archiving, and sharing.

The last talk in the series, given by Post, focused on what to do in education to prevent fraud. Two points that resonated with me were the need for teaching data management as early as possible in the curriculum, and focusing on descriptive statistics before p-values. Post also advocates for putting emphasis on teaching the philosophy of science.

Not only was this discussion very informative and interesting to listen to, it also provided me with a good case study to incorporate into my Statistical Consulting course which has a research ethics component. In the past year we’ve discussed the Potti case, so this will be a nice addition from a different field (and a different university!).

The other session that I attended today was the “Introductory Overview Lecture: Celebrating the History of Statistics (#47)” by Xiao-Li Meng, Alan Agresti, and Stephen Stigler. If you are interested in the history of statistics departments, Agresti and Meng’s book Strength in Numbers: The Rising of Academic Statistics Departments in the US sounds like a promising read. The session wrapped up with Stigler’s history of statistics review, titled “How Statistics Saved the Human Race”. He was a delight to listen to, as usual. I hope that in the future such sessions are recorded and posted online for all to see, as they should be of interest to a wide audience of statisticians and non-statisticians alike. I don’t think the ASA does this yet, but correct me if I’m wrong.

Three other sessions that I would like to have attended today were

  • “Teaching Ethics in Statistics and Biostatistics: What Works, What Doesn’t Work, and Lessons Learned (#55)”,
  • “The Interplay Between Consulting and Teaching (#68)”, and
  • “Teaching Online on a Budget (#75)”.

If you’ve been to any of these, and have notes to share, please comment below!

On a separate note, unrelated to JSM –

  • If you’re here in Montréal, and especially if you live in a city without good bagels (Durham, I love you, but you don’t deliver on this account), I strongly recommend a trip up to Fairmount Bagel. They’re open 24 hours, and the bagels are great, but the matzoh bread is to die for. Also, apparently in Quebec “everything” bagels are called “all dressed”.
  • It turns out that not everything is good with maple syrup. I strongly advise against trying the Lay’s Maple Moose chips. Trust me on this one.

Datasets handpicked by students

I’m often on the hunt for datasets that will not only work well with the material we’re covering in class, but will (hopefully) pique students’ interest. One sure choice is to use data collected from the students, as it is easy to engage them with data about themselves. However I think it is also important to open their eyes to the vast amount of data collected and made available to the public. It’s always a guessing game whether a particular dataset will actually be interesting to students, so learning from the datasets they choose to work with seems like a good idea.

Below are a few datasets that I haven’t seen in previous project assignments. I’ve included the research question the students chose to pursue, but most of these datasets have multiple variables, so you might come up with different questions.

1. Religious service attendance and moral beliefs about contraceptive use: The data are from a February 2012 Pew Research poll. To download the dataset, go to http://www.people-press.org/category/datasets/?download=20039620. You will be prompted to fill out some information and will receive a zipped folder including the questionnaire, methodology, the “topline” (distributions of some of the responses), as well as the raw data in SPSS format (.sav file). Below I’ve provided some code to load this dataset in R, and then to clean it up a bit. Most of the code should apply to any dataset released by Pew Research.

# read data (SPSS format)
library(foreign)
d_raw = as.data.frame(read.spss("Feb12 political public.sav"))

# clean up: strip interviewer instructions from the response labels
library(stringr)
d = lapply(d_raw, function(x) str_replace(x, " \\[OR\\]", ""))
d = lapply(d, function(x) str_replace(x, "\\[VOL. DO NOT READ\\] ", ""))
d = lapply(d, function(x) str_replace(x, "\222", "'")) # \222 is the Windows-1252 apostrophe
d = lapply(d, function(x) str_replace(x, " \\(VOL.\\)", ""))

# recode party affiliation with readable level names
d$partysum = factor(d$partysum)
levels(d$partysum) = c("Refused","Democrat","Independent","Republican","No preference","Other party")

The student who found this dataset was interested in examining the relationship between religious service attendance and views on contraceptive use. The code provided below can be used to organize the levels of these variables in a meaningful way, and to take a quick peek at a contingency table.

# variables of interest: order the levels meaningfully
d$attend = factor(d$attend, levels = c("More than once a week","Once a week", "Once or twice a month", "A few times a year", "Seldom", "Never", "Don't know/Refused"))
d$q40a = factor(d$q40a, levels = c("Morally acceptable","Morally wrong", "Not a moral issue", "Depends on situation", "Don't know/Refused"))
# quick peek at the contingency table
table(d$attend, d$q40a)
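
As a possible next step (my addition, not part of the student’s original analysis), one could drop the non-substantive response levels and run a chi-square test of independence on this table:

# keep only substantive responses before testing
keep = d$attend != "Don't know/Refused" & d$q40a != "Don't know/Refused"
tab = table(droplevels(d$attend[keep]), droplevels(d$q40a[keep]))
chisq.test(tab)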

2. Social network use and reading: Another student was interested in the relationship between number of books read in the last year and social network use. This dataset is provided by the Pew Internet and American Life Project. You can download a .csv version of the data file at http://www.pewinternet.org/Shared-Content/Data-Sets/2012/February-2012–Search-Social-Networking-Sites-and-Politics.aspx. The questionnaire can also be found at this website. One of the variables of interest, number of books read in the past 12 months (q2), is recorded using the following scheme:

  • 0: none
  • 1-96: exact number
  • 97: 97 or more
  • 98: don’t know
  • 99: refused

This could be used to motivate a discussion about the importance of doing exploratory data analysis prior to jumping into inferential tests (like asking “Why does no one read more than 99 books?”), and also to point out the importance of checking the codebook.
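
In practice that means recoding the special values before any analysis. A minimal sketch, assuming the data have been read into a data frame d with the variable named q2 as in the codebook:

# treat "don't know" (98) and "refused" (99) as missing,
# and handle the top-coded 97s explicitly in any write-up
d$books = ifelse(d$q2 %in% c(98, 99), NA, d$q2)
summary(d$books)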

3. Parental involvement and disciplinary actions at schools: The 2007-2008 School Survey on Crime and Safety, conducted by the National Center for Education Statistics, contains school-level data on crime and safety. The dataset can be downloaded at http://nces.ed.gov/surveys/ssocs/data_products.asp. The SPSS-formatted version of the data file (.sav) can be loaded in R using the read.spss() function in the foreign library (used above in the first data example). The variables of interest for the particular research question the student proposed are parent involvement in school programs (C0204) and number of disciplinary actions (DISTOT08), but the dataset can be used to explore other interesting characteristics of schools, like the type of security guards, whether guards are armed with firearms, etc.

4. Dieting in school-aged children: The Health Behavior in School-Aged Children is an international survey on health-risk behaviors of children in grades 6 through 10. The 2005-2006 US dataset can be found at http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/28241. You will need to log in to download the dataset, but you can do so using a Google or a Facebook account. There are multiple versions of the dataset posted, and the Delimited version (.tsv) can be easily loaded in R using the read.delim() function. The student who found this dataset was interested in exploring the relationship between race of the student (Q6_COMP) and whether or not the student is on a diet to lose weight (Q30). The survey also asks questions on body image, substance use, bullying, etc. that may be interesting to explore.
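
For example (the file name here is hypothetical; the actual name depends on the download):

# load the tab-separated version of the data
hbsc = read.delim("HBSC_2005_2006_data.tsv")
# the two variables the student focused on
table(hbsc$Q6_COMP, hbsc$Q30)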

One common feature of the above datasets is that they are all observational/survey based as it’s more challenging to find experimental (raw) datasets online. Any suggestions?

Introducing Statistics: A Graphic Guide

[Image: book cover. Source: introducingbooks.com]

Over the winter break I was travelling in the UK and came across this little book called “Introducing Statistics: A Graphic Guide” by Eileen Magnello and Borin Van Loon at the gift shop in the Tate Modern museum in London. The book was published in 2009, and Significance magazine has already reviewed it here, so I won’t repeat their comments. I hadn’t heard about the book before, so I picked it up, along with a copy of Introducing Post-Modernism (they were 2 for £10, I had to get two, obviously).

I think the book would be more appropriately named “an illustrated guide”, since the images are mostly illustrations of statisticians with speech bubbles rather than graphics that help visualize the concepts being discussed. The most unexpected images are those of the author herself. The first time I came across one of them I thought, “who is this lady in the pant-suit standing next to Karl Pearson?” Needless to say, the illustrations sometimes distract from the text, but they’re fun and nicely drawn.

The book does a very good job of describing the differences between vital statistics and mathematical statistics, and what the terms “statistic” and “variability” mean. Therefore, while the audience of the book is not clear, it could be a perfect gift for parents of statisticians who still don’t quite understand what their offspring do. Or really anyone who is interested in statistics, but has no real formal experience with it.

While the book tells the early history of statistics well, the introduction of statistical concepts follows a strange order. It is useful for gaining familiarity with some terminology and simple statistical distributions and tests, but it would be quite difficult to acquire a thorough understanding of these concepts from the book’s introduction. However, I’m guessing this is not the intent of the book anyway.

The book is part of a series called Introducing Books, which contains about 80 graphic guides, from Introducing Aesthetics to Marxism to Wittgenstein. The museum shop where I got the book carried only about 10 of these titles, and I was happy to see that Introducing Statistics was one of them.

Inference for the population by the population — what does that even mean?

In an effort to integrate more hands-on data analysis in my introductory statistics class, I’ve been assigning students a project early on in the class where they answer a research question of interest to them using a hypothesis test and/or confidence interval. One goal of this project is getting the students to decide which methods to use in which situations, and how to properly apply them. But there’s more to it — students define their own research question and find an appropriate dataset to answer that question with. The analysis and findings are then presented in a cohesive research paper.

Settling on a research question that can be answered using limited methods (one- or two-sample mean or proportion tests, ANOVA, or chi-square) is the first half of the battle. Some of the research questions students come up with require methods much more involved than simple hypothesis testing or parameter estimation. These students end up having to dial back and narrow the focus of their research topic to meet the assignment guidelines. I think this is a useful exercise, as it helps them evaluate what they have and have not learned.

The next step is finding data, and this can be quite time consuming. Some students choose research questions about the student body and collect data via in-person surveys at the student center or Facebook polls. A few students even go so far as to conduct experiments on their friends. A huge majority look for data online, which initially appears to be the path of least resistance. However finding raw data that is suitable for statistical inference, i.e. data from a random sample, is not a trivial task.

I (purposefully) do not give much guidance on where to look for data. In the past, even casually mentioning one source has resulted in more than half the class using that source, therefore I find it best to give them free rein during this exploration stage (unless someone is really struggling).

Some students use data from national surveys like the BRFSS or the GSS. The data come from a (reasonably) representative sample, and are a perfect candidate for applying statistical inference methods. One problem with such data is that they rarely come in plain-text format (they’re distributed as SAS, SPSS, etc. files), and importing such data into R can be a challenge for novice R users, even with step-by-step instructions.

On the other hand, many students stumble upon resources like the World Bank database, the OECD, the US Census, etc., where data are presented in much more user-friendly formats. The drawback is that these are essentially population data, e.g. country indicators like the human development index for all countries, and there is really no need for hypothesis testing or parameter estimation when the parameter is already known. To complicate matters further, some of the tables presented are not really “raw data” but instead summary tables, e.g. median household income for all states, calculated based on random sample data from each state.

One obvious way to avoid this problem is to make the assignment stricter by requiring that chosen data must come from a (reasonably) random sample. However, this stricter rule would give students much less freedom in the research question they can investigate, and the projects tend to be much more engaging and informative when students write about something they genuinely care about.

Limiting data sources also has the effect of increasing the time spent finding data, and hence decreasing the time students spend actually analyzing the data and writing up results. Providing a list of resources for curated datasets (e.g. DASL) would certainly diminish time spent looking for data, but I would argue that the learning that happens during the data exploration process is just as valuable (if not more so) as being able to conduct a hypothesis test.

Another approach (one that I have been taking) is allowing the use of population data but requiring a discussion of why it is actually not necessary to do statistical inference in these circumstances. This approach lets the students pursue their interests, but interpretations of p-values and confidence intervals calculated based on data from the entire population can get quite confusing. In addition, it has the side effect of sending the message “it’s ok if you don’t meet the conditions, just say so, and carry on.” I don’t think this is the message we want students to walk away with from an introductory statistics course. Instead, we should be insisting that they don’t just blindly carry on with the analysis if conditions aren’t met. The “blindly” part is (somewhat) addressed by the required discussion, but the “carry on with the analysis” part is still there.

So is this assignment a disservice to students because it might leave some with the wrong impression? Or is it still a valuable experience regardless of the caveats?