Yup. You read it here first. The National Climatic Data Center has a nice overview that compares climate data to Punxsutawney Phil’s predictions. The data aren’t quite (yet) upload-ready, but some links on the page to more raw data might be entertaining.
Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).
Samuel Wilcock (Messiah College) talked about how, while IRBs are not required for data collected by students for class projects, discussing the ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and educate our students about that process. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved from thinking that the IRB process is scary to thinking that it's an important part of being a stats educator. I like this idea of discussing issues surrounding data ethics and IRBs in the introductory statistics course (in a little more depth than I do now), though I'm not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment from him next year to see how it went.
Next, Shannon McClintock (Emory University) talked about a project inspired by her involvement with the honor council of her university, when she realized that while the council keeps impeccable records of reported cases, it has no information on cases that go unreported. So the idea of collecting student data on academic misconduct was born. A survey was designed with input from the honor council, and Shannon's students in her large (n > 200) introductory statistics course took the survey early in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example, thinking about how to code "any" academic misconduct based on various questions that ask whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. It's my experience that students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seems to hit both of those notes.
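Creating that "any misconduct" variable is a nice little wrangling exercise in its own right. Here is a minimal sketch in R of one way to do it; the item names (q1_cheat, q2_plagiarize, q3_copy) are hypothetical stand-ins, since the talk didn't list the actual survey questions:

# Flag "any" academic misconduct: TRUE if a student answered "Yes"
# to any of the individual misconduct items (item names are made up)
survey$any_misconduct = with(survey,
  q1_cheat == "Yes" | q2_plagiarize == "Yes" | q3_copy == "Yes")
table(survey$any_misconduct)

Even this simple version forces a decision worth discussing with students: should a "Don't know/Refused" response count as a no, or propagate as NA?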
Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct, and an outcome of their own choosing, and to present their findings in an optional (some form of extra credit) research paper. One example that Shannon gave for the latter task was defining a "serious offender": is it a student who commits one serious offense, or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.
As a parting comment Shannon mentioned that the administration at her school was concerned that students finding out about the high rates of academic offenses (the survey showed that about 60% of students had committed a "major" academic offense) might lead students to think that it's OK, or maybe even necessary, to commit academic misconduct to be more successful.
For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.
The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They're looking to expand this project to states beyond Virginia, so if you're interested in running a similar project at your school you can contact Emmanuel at firstname.lastname@example.org. They're especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it's not), projects like these can remind students that data and statistics can be powerful activism tools.
Like Rob, I recently got back from ICOTS. What a great conference. Kudos to everyone who worked hard to organize and pull it off. In one of the sessions I was at, Amelia McNamara (@AmeliaMN) gave a nice presentation about how they were using data and computer science in high schools as a part of the Mobilize Project. At one point in the presentation she had a slide that showed a screenshot of the dashboard used in one of their apps. It looked something like this.
During the Q&A, one of the critiques of the project was that they had displayed the data as a donut plot. "Pie charts (or any kin thereof) = bad" was the message. I don't really want to fight about whether they are good or bad—the reality is probably somewhere in between. (Tufte, the most cited source for the 'pie charts are bad' rhetoric, never really said pie charts were bad, only that, given the space they take up, they are perhaps less informative than other graphical choices.) Do people have trouble reading radians? Sure. Is the message in the data obscured because of this? Most of the time, no.
Here are the bar chart (often offered as the better alternative to the pie chart) and the donut plot for the data shown in the Mobilize dashboard screenshot. The message is that most of the advertisements were posters and billboards. If people are interested in the n's, that is easily remedied by including them explicitly on the plot—which neither the bar plot nor the donut plot currently does. (The dashboard displays the actual numbers when you hover over a donut slice.)
It seems we are wasting our breath constantly criticizing people for choosing pie charts. Whether we like it or not, the public has adopted pie charts. (As is pointed out in this blog post, Leland Wilkinson even devotes a whole chapter to pie charts in his Grammar of Graphics book.) Maybe people are reasonably good at pulling out the often-not-so-subtle differences that are generally shown in a pie chart. After all, it isn't hard to understand (even when using a 3-D exploding pie chart) that the message in this pie chart is that the "big 3" browsers have a strong hold on the market.
The bigger issue to me is that these types of graphs are only reasonable choices when examining simple group differences—the marginals. Isn't life, and data, more complex than that? Is the distribution of browser type the same for Mac and PC users? For males and females? For different age groups? These are the more interesting questions.
The dashboard addresses this through interactivity between the multiple donut charts. Clicking a slice in the first plot shows the distribution of product types (the second plot) for those ads that fit the selected slice—the conditional distributions.
So my argument is that, rather than referring to a graph choice as good or bad, we should instead focus on the underlying question prompting the graph in the first place. Mobilize acknowledges that complexity by addressing the need for conditional distributions. Interactivity and computing make pie charts a reasonable choice to display this.
*If those didn’t persuade you, perhaps you will be swayed by the food argument. Donuts and pies are two of my favorite food groups. Although bars are nice too. For a more tasty version of the donut plot, perhaps somebody should come up with a cronut plot.
**The ggplot2 syntax for the bar and donut plots is provided below. The syntax for the donut plot was adapted from this blog post.
# Input the ad data
ad = data.frame(
  type = c("Poster", "Billboard", "Bus", "Digital"),
  n = c(529, 356, 59, 81)
)

# Bar plot
library(ggplot2)
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
  geom_bar(stat = "identity", show_guide = FALSE) +  # in ggplot2 >= 2.0, use show.legend = FALSE
  theme_bw()

# Add additional columns to the data, needed for the donut plot
ad$fraction = ad$n / sum(ad$n)
ad$ymax = cumsum(ad$fraction)
ad$ymin = c(0, head(ad$ymax, n = -1))

# Donut plot
ggplot(data = ad, aes(fill = type, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
  geom_rect(colour = "grey30", show_guide = FALSE) +
  coord_polar(theta = "y") +
  xlim(c(0, 4)) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  geom_text(aes(x = 3.5, y = (ymin + ymax) / 2, label = type)) +
  xlab("") + ylab("")
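And to echo the earlier point about conditional distributions: once a second variable is in play, a faceted bar chart answers the conditional questions directly. A minimal sketch with made-up numbers (the dashboard's actual ad-by-product breakdown isn't reproduced here; only the marginal totals above are real):

# Hypothetical: the same ad types broken down by a second variable (product)
ad2 = data.frame(
  type = rep(c("Poster", "Billboard", "Bus", "Digital"), times = 2),
  product = rep(c("Food", "Drink"), each = 4),
  n = c(300, 200, 30, 50, 229, 156, 29, 31)
)
ggplot(data = ad2, aes(x = type, y = n, fill = type)) +
  geom_bar(stat = "identity", show_guide = FALSE) +
  facet_wrap(~ product) +  # one panel per product: the conditional distributions
  theme_bw()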
You can read about DataFest, which is quickly going national, at the fivethirtyeight blog:
I was at the vet yesterday, and just like with any doctor’s visit experience, there was a bit of waiting around — time for re-reading all the posters in the room.
And this is what caught my eye on the information sheet about feline heartworm (I’ll spare you the images):
The question asks: “My cat is indoor only. Is it still at risk?”
The way I read it, this question is asking about the risk of an indoor-only cat being heartworm positive. To answer this question we would want to know P(heartworm positive | indoor only).
However, the answer says: "A recent study found that 27% of heartworm positive cats were identified as exclusively indoor by their owners", which is P(indoor only | heartworm positive) = 0.27.
Sure, this gives us some information, but it doesn’t actually answer the original question. The original question is asking about the reverse of this conditional probability.
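To make the distinction concrete, here is a minimal sketch of the Bayes' theorem calculation in R. The 0.27 is from the flyer; the other two inputs (overall heartworm prevalence and the proportion of indoor-only cats) are invented for illustration:

# Given: P(indoor only | heartworm positive) = 0.27
# Hypothetical: P(heartworm positive) = 0.10, P(indoor only) = 0.40
p_indoor_given_pos = 0.27
p_pos = 0.10
p_indoor = 0.40
# Bayes' theorem: P(HW+ | indoor) = P(indoor | HW+) * P(HW+) / P(indoor)
p_pos_given_indoor = p_indoor_given_pos * p_pos / p_indoor
p_pos_given_indoor  # 0.0675, quite different from 0.27

Under these (made-up) inputs the two conditional probabilities differ by a factor of four, which is exactly the point.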
When we talk about Bayes' theorem in my class and work through examples about the sensitivity and specificity of medical tests, I always tell my students that doctors are actually pretty bad at these. Looks like I'll need to add vets to my list too!
Just finished a stimulating, thought-provoking week at SRTL (the Statistical Reasoning, Thinking, and Literacy research forum), this year held in Two Harbors, Minnesota, right on Lake Superior. SRTL gathers statistics education researchers, most of whom come with cognitive or educational psychology credentials, every two years. It's more of a forum for thinking and collaborating than a platform for presenting findings, and this means there's much lively, constructive discussion about works in progress.
I had meant to post my thoughts daily, but (a) the internet connection was unreliable and (b) there was just too much to digest. One recurring theme that really resonated with me was the ways students interact with technology when thinking about statistics.
Much of the discussion centered on young learners, and most of the researchers — but not all — were in classrooms in which the students used TinkerPlots 2. TinkerPlots is a dynamic software system that lets kids build their own chance models. (It also lets them build their own graphics more or less from scratch.) They do this either by dropping "balls" into "urns" and labeling the balls with characteristics, or through spinners whose areas they can shade in different colors. They can connect series of spinners and urns to create sequences of independent or dependent events, and can collect the outcomes of their trials. Most importantly, they can carry out a large number of trials very quickly and graph the results.
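TinkerPlots itself is point-and-click, but the idea of its chance models translates directly into a few lines of code. A rough R analogue of an urn model, just as a sketch (the ball labels and the number of trials are arbitrary):

# An "urn" of labeled balls; each trial draws two balls with replacement
urn = c("red", "red", "blue")
draws = replicate(1000, sample(urn, size = 2, replace = TRUE))
# Estimate the chance that both draws show the same color
mean(draws[1, ] == draws[2, ])
# Graph the outcomes of the first draw across all trials
barplot(table(draws[1, ]))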
What I found fascinating was the way in which students would come to judgements about situations and then build a model that they thought would "prove" their point. After running some trials, when things didn't go as expected, they would go back and assess their model. Sometimes they'd realize that they had made a mistake, and they'd fix it. Other times, they'd see there was no mistake, and then realize that they had been thinking about it wrong. Sometimes, they'd come up with explanations for why they had been thinking about it incorrectly.
Janet Ainley put it very succinctly. (More succinctly and precisely than my re-telling.) This technology imposes a sort of discipline on students’ thinking. Using the technology is easy enough that they can be creative, but the technology is rigid enough that their mistakes are made apparent. This means that mistakes are cheap, and attempts to repair mistakes are easily made. And so the technology itself becomes a form of communication that forces students into a level of greater precision than they can put in words.
I suppose that mathematics plays the same role, in that speaking with mathematics imposes great precision on the speaker. But that language takes time to learn, and few students reach a level of proficiency that allows them to use the language to construct new ideas. But TinkerPlots, and software like it, gives students the ability to use a language to express new ideas with very little expertise. It was impressive to see 15-year-olds build models that incorporated both deterministic trends and fairly sophisticated random variability. More impressive still, the students were able to use these models to solve problems. In fact, I'm not sure they really knew they were building models at all, since their focus was on the problem solving.
TinkerPlots is aimed at a younger audience than the one I teach. But for me, the take-home message is to remember that statistical software isn't simply a tool for calculation, but a tool for thinking.
DataFest is growing larger and larger. This year, we hosted an event at Duke (Mine organized this) with teams from NCSU and UNC, and at UCLA (Rob organized) with teams from Pomona College, Cal State Long Beach, University of Southern California, and UC Riverside. We are very grateful to Vaclav Petricek at eHarmony for providing us with the data, which consisted of roughly one million “user-candidate” pairs, and a couple of hundred variables including “words friends would use to describe you”, ideal characteristics in a partner, the importance of those characteristics, and the all-important ‘did she email him’ and ‘did he email her’ variables.
The students had a great time, and worked hard for 48 hours to prepare short presentations for the judges. This is the third year we've done this, and I'm increasingly impressed with the growing technical skills of the students. (Which makes our life a lot easier, as far as providing help goes.) Or maybe it's just that I've been lucky enough to get more and more "VIP Consultants" (statisticians from off campus) and talented, dedicated grad students to help out, so that I can be comfortably oblivious to the technical struggles. Or all of the above.
One thing I noticed that will definitely require some adjustment to our curriculum: our students had a hard time generating interesting questions from these data. Part of the challenge is to look at a large, rich dataset and think "What can I show the world that the world would like to know?" Too many students went directly to model-fitting, without making visuals, engaging with the content of the materials (a surprise, since we thought they would find this material much easier to engage with than last year's micro-lending transaction data), or strategizing around some Big Questions. Most of them managed to pull it off in the end, but they would have done better to brainstorm some good questions to follow, and much better to start with the visuals.
One of the fun parts of DataFest is the presentations. Students have only 5 minutes and 2 slides to convince the judges of their worthiness. At UCLA, because we were concerned about having too many teams for the judges to endure, we had two rounds. First came a "speed dating" round in which participants had only 60 seconds and one slide. We surprised them by announcing, at the start, that to move on to the next round they would have to merge their team with one other team, and so these 60-second presentations should be viewed as pitches to potential partners. We had hoped that teams would match on similar themes or something, and this did happen; but many matches were between teams of friends. The "super teams" were then allowed to make a 5-minute presentation, and awards were given to these large teams. The judges gave two awards for Best Insight (one to a super-team from Pomona College and another to a super-team from UCLA) and one for Best Visualization (to the super-team from USC). We did have two inter-collegiate super-teams (UCLA/Cal State Long Beach and UCLA/UCR) make it to the final round.
If you want to host your own DataFest, drop a line to Mine or me and we can give you lots of advice. And if you sit on a large, interesting data set we can use for next year, definitely drop us a line!
I’m often on the hunt for datasets that will not only work well with the material we’re covering in class, but will (hopefully) pique students’ interest. One sure choice is to use data collected from the students, as it is easy to engage them with data about themselves. However I think it is also important to open their eyes to the vast amount of data collected and made available to the public. It’s always a guessing game whether a particular dataset will actually be interesting to students, so learning from the datasets they choose to work with seems like a good idea.
Below are a few datasets that I haven’t seen in previous project assignments. I’ve included the research question the students chose to pursue, but most of these datasets have multiple variables, so you might come up with different questions.
1. Religious service attendance and moral beliefs about contraceptive use: The data are from a February 2012 Pew Research poll. To download the dataset, go to http://www.people-press.org/category/datasets/?download=20039620. You will be prompted to fill out some information and will receive a zipped folder including the questionnaire, methodology, the “topline” (distributions of some of the responses), as well as the raw data in SPSS format (.sav file). Below I’ve provided some code to load this dataset in R, and then to clean it up a bit. Most of the code should apply to any dataset released by Pew Research.
# read data
library(foreign)
d_raw = as.data.frame(read.spss("Feb12 political public.sav"))

# clean up
library(stringr)
d = lapply(d_raw, function(x) str_replace(x, " \\[OR\\]", ""))
d = lapply(d, function(x) str_replace(x, "\\[VOL. DO NOT READ\\] ", ""))
d = lapply(d, function(x) str_replace(x, "\222", "'"))
d = lapply(d, function(x) str_replace(x, " \\(VOL.\\)", ""))
d$partysum = factor(d$partysum)
levels(d$partysum) = c("Refused", "Democrat", "Independent", "Republican",
                       "No preference", "Other party")
The student who found this dataset was interested in examining the relationship between religious service attendance and views on contraceptive use. The code provided below can be used to organize the levels of these variables in a meaningful way, and to take a quick peek at a contingency table.
# variables of interest
d$attend = factor(d$attend, levels = c("More than once a week", "Once a week",
                                       "Once or twice a month", "A few times a year",
                                       "Seldom", "Never", "Don't know/Refused"))
d$q40a = factor(d$q40a, levels = c("Morally acceptable", "Morally wrong",
                                   "Not a moral issue", "Depends on situation",
                                   "Don't know/Refused"))
table(d$attend, d$q40a)
2. Social network use and reading: Another student was interested in the relationship between number of books read in the last year and social network use. This dataset is provided by the Pew Internet and American Life Project. You can download a .csv version of the data file at http://www.pewinternet.org/Shared-Content/Data-Sets/2012/February-2012–Search-Social-Networking-Sites-and-Politics.aspx. The questionnaire can also be found at this website. One of the variables of interest, number of books read in the past 12 months (q2), is recorded using the following scheme:
- 0: none
- 1-96: exact number
- 97: 97 or more
- 98: don’t know
- 99: refused
This could be used to motivate a discussion about the importance of doing exploratory data analysis before jumping into inferential tests (like asking "Why are there no people who read more than 99 books?"), and also to point out the importance of checking the codebook. A quick sketch of the recoding is below.
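A minimal sketch of the cleanup (assuming the .csv has been read into a data frame named d, and that q2 arrives as a numeric column, per the coding scheme above):

# Recode the special codes in q2 (books read in the past 12 months):
# 98 ("don't know") and 99 ("refused") become NA;
# 97 ("97 or more") is left at 97 -- itself a censoring decision worth discussing
books = d$q2
books[books %in% c(98, 99)] = NA
summary(books)
hist(books)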
3. Parental involvement and disciplinary actions at schools: The 2007-2008 School Survey on Crime and Safety, conducted by the National Center for Education Statistics, contains school level data on crime and safety. The dataset can be downloaded at http://nces.ed.gov/surveys/ssocs/data_products.asp. The SPSS formatted version of the data file (.sav) can be loaded in R using the read.spss() function in the foreign library (used above in the first data example). The variables of interest for the particular research question the student proposed are parent involvement in school programs (C0204) and number of disciplinary actions (DISTOT08), but the dataset can be used to explore other interesting characteristics of schools, like type of security guards, whether guards are armed with firearms, etc.
4. Dieting in school-aged children: The Health Behavior in School-Aged Children is an international survey on health-risk behaviors of children in grades 6 through 10. The 2005-2006 US dataset can be found at http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/28241. You will need to log in to download the dataset, but you can do so using a Google or a Facebook account. There are multiple versions of the dataset posted, and the Delimited version (.tsv) can be easily loaded in R using the read.delim() function. The student who found this dataset was interested in exploring the relationship between race of the student (Q6_COMP) and whether or not the student is on a diet to lose weight (Q30). The survey also asks questions on body image, substance use, bullying, etc. that may be interesting to explore.
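For example, loading the data and peeking at the two variables of interest might look like the following (the file name is a placeholder; use whatever name ICPSR gives the delimited file in your download):

# Load the tab-delimited HBSC file and cross-tabulate race by dieting status
hbsc = read.delim("28241-0001-Data.tsv")
table(hbsc$Q6_COMP, hbsc$Q30)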
One common feature of the above datasets is that they are all observational/survey-based, as it's more challenging to find experimental (raw) datasets online. Any suggestions?
The Mobilize project, which I recently joined, centers a high school data-science curriculum around participatory sensing data. What is participatory sensing, you ask?
I've recently been trying to answer this question, with mixed success. As the name suggests, PS data has to do with data collected from sensors, and so it has a streaming aspect to it. I like to think of it as observations on a living object. Like all living objects, whatever this thing is that's being observed, it changes, sometimes slowly, sometimes rapidly. The 'participatory' means that it takes more than one person to measure it. (But I'm wondering if you would allow 'participatory' to mean that the student participates in her own measurements/collection?) Initially, in Mobilize, PS meant using specially equipped smartphones as sensors. Students could snap pictures of snack wrappers, or record their mood at a given moment, or record the mood of their snack food. A problem with relying on phones is that, as it turns out, teenagers aren't always that good with expensive equipment. And there's an equity issue, because what some people consider a common household item, others consider rare and precious. And smartphones, although growing in prevalence, are still not universally adopted by high school students, or even college students.
If we ditch the gadgetry, any human being can serve as a sensor. Asking a student to pause at a certain time of day to record, say, the noise level, or the temperature, or their frame of mind, or their level of hunger, is asking that student to be a sensor. If we can teach the student how to find something in the accumulated data about her life that she didn’t know, and something that she finds useful, then she’s more likely to develop what I heard Bill Finzer call a “data habit of mind”. She’ll turn to data next time she has a question or problem, too.
Nothing in this process is trivial. Recording data on paper is one thing; recording it in a data file requires teaching students about flat files (which, as I've again learned from Bill, are not necessarily intuitive), teaching them about delimiters between variables, and teaching them, basically, how to share so that someone else can upload and use their data. Many of my intro-stats college students don't know how to upload a data file into the computer, so I now teach it explicitly, with high, but not perfect, rates of success. And that's the easy part. How do we help them learn something of value about themselves or their world?
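For concreteness, a flat file for the kind of self-sensing described above could be as simple as this (the column names and values are invented for the example, and commas are the delimiters):

student,date,noise_level,hunger
amy,2014-05-01,loud,3
ben,2014-05-01,quiet,1
cara,2014-05-01,moderate,4

Once the data are in that shape, getting them into R is a one-liner: sensing = read.csv("sensing.csv"). It's everything leading up to that shape that takes the teaching.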
I'm open to suggestions here. Please. One step seems to be to point them towards a larger context in which to make sense of their data. This larger context could be a social network, or a community, or larger datasets collected on large populations. And so students might need to learn how to compare their (rather paltry by comparison) data stream to a large national database (which will be more of a snapshot/panel approach, rather than a data stream). Or they will need to learn to merge their data with their classmates', and learn about looking for signals among variation, and comparing groups.
This is scary stuff. Traditionally, we teach students how to make sense of *our* data. And this is less scary because we've already made sense of the data and we know how to point the student towards making the "right" conclusions. But these PS data have not been analyzed before. Even if we teachers have seen similar data, we have not seen these data. The student is really and truly functioning as a researcher, and the teacher doesn't know the conclusion. What's more disorienting, the teacher doesn't have control of the method. Traditionally, when we talk about the 'shape' of a distribution, we trot out data sets that show the shapes we want the students to see. But if the students are gathering their own data, is the shape of a distribution necessarily useful? (It gets scarier at a meta-level: many teachers are novice statisticians, so how do we teach the teachers to be prepared to react to novel data?)
So I'll sign off with some questions. Suppose my classroom collects data on how many hours they sleep a night for, say, one month. We create a data file to include each student's data. Students do not know any statistics; this is their first data experience. What is the first thing we should show them? A distribution? Of what? What concepts do students bring to the table that will help them make sense of longitudinal data? If we don't start with distributions, should we start with an average curve? With an overlay of multiple time-series plots ("spaghetti plots")? And what is, or should be, the lesson in examining such plots?
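To make that last option concrete, here is a minimal sketch of a spaghetti plot with a class-average curve overlaid, using simulated sleep data (30 students, 30 nights; all numbers invented):

# Simulate a month of nightly sleep for 30 students
library(ggplot2)
sleep = expand.grid(student = factor(1:30), night = 1:30)
sleep$hours = 7 + rnorm(nrow(sleep), mean = 0, sd = 1)

# One faint line per student, with the average curve on top
avg = aggregate(hours ~ night, data = sleep, FUN = mean)
ggplot() +
  geom_line(data = sleep, aes(x = night, y = hours, group = student), alpha = 0.3) +
  geom_line(data = avg, aes(x = night, y = hours), colour = "red", size = 1.2)

Of course, with simulated data the plot carries no lesson; the open question is what students would see in their own version of it.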
My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened. I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’ (We had talked a bit in class about what that meant, and about what devices were storing data.) They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.
The results were interesting. The vast majority “got” it. The very few who didn’t either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled”, etc) or simply wrote down their day’s activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”
But those were very few (maybe 2 or 3). The rest were quite thoughtful. The sorts of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (Facebook postings, Google searches), and classroom activities (clickers, enrollments). Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted. They pointed out that with just one day's data, someone could have a pretty clear idea of their social structure. And by pooling the class's data or the campus's data, a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future. They noted that gas purchase records could be used to infer whether they lived on campus or off campus and even, roughly, how far off.
Here’s my question for you: what’s the next step? Where do we go from here to build on this lesson? And to what purpose?