Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how, while IRBs are not required for data collected by students for class projects, discussing the ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistics teachers we ought to be aware of the process of real research and to educate our students about it. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing issues surrounding data ethics and IRB in the introductory statistics course (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by being involved with the honor council of her university, when she realized that while the council keeps impeccable records of reported cases, they don’t have any information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example, thinking about how to code “any” academic misconduct based on various questions that ask whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance to practice. It’s my experience that students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seems to hit both of those notes.
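The variable-creation task Shannon described (coding “any” misconduct from several type-specific questions) can be sketched in a few lines of R. The variable names below are hypothetical, not the actual survey’s:

```r
# Hypothetical 0/1 indicators for three types of misconduct;
# the real survey's variable names and coding will differ.
survey <- data.frame(
  cheat_exam = c(0, 1, 0),
  cheat_hw   = c(0, 0, 1),
  plagiarism = c(0, 0, 0)
)

# "Any misconduct" is the union of the indicators across each row.
survey$any_misconduct <- as.integer(rowSums(survey) > 0)
```

Deciding which of the 46 questions feed into this union (and how to treat discrepant or missing responses) is exactly the kind of judgment call that makes the exercise worthwhile.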

Using data from the survey, students were asked to analyze two academic outcomes: whether or not a student has committed any form of academic misconduct, and an outcome of their own choosing, and presented their findings in an optional (some form of extra credit) research paper. One example Shannon gave for the latter task was defining a “serious offender”: is it a student who commits a one-time serious offense, or a student who habitually commits (maybe not so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that the hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment, Shannon mentioned that the administration at her school was concerned that finding out about the high rate of academic offenses (the survey showed that about 60% of students had committed a “major” academic offense) might make students think that it’s ok, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking to expand this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it’s not), projects like these can remind students that data and statistics can be powerful activism tools.

Is Data Science Real?

Just came back from the International Conference on Teaching Statistics (ICOTS) in Flagstaff, AZ filled with ideas.  There were many thought-provoking talks, but what was even better were the thought-provoking conversations.  One theme, at least for me, is just what is this thing called Data Science?  One esteemed colleague suggested it was simply a re-branding.  Other speakers used it somewhat pejoratively, in reference to outsiders (i.e., computer scientists).  Here are some answers from panelists at a discussion on the future of technology in statistics education.  All paraphrases are my own, and I take responsibility for any sloppiness, poor grammar, etc.

Webster West took the High Statistician point of view, one shared by many, including, on a good day, myself: Data Science consists of those things that are involved in analyzing data.  I think most statisticians when reading this will feel like Moliere’s Bourgeois Gentleman, who was pleasantly surprised to learn he’d been speaking prose all his life.  But I think there’s more to it than that, because probably many statisticians don’t consider data scraping, data cleaning, and data management as part of data analysis.

Nick Horton offered that data mining was an activity that could be considered part of data science.  And he sees data mining as part of statistics.  Not sure all statisticians would agree, since for many of us, data mining is a swear word, used to refer to people who are lucky enough to discover something but have no idea why it was discovered.  But he also offered a broader definition: using data to answer a statistical question.  Which I quite like.  It leaves open the door to many ways of answering the question; it doesn’t require any particular background or religion; it simply means engaging in those activities needed to bring data to bear in answering a statistical question.

Bill Finzer relied on set theory:  data science is a partial union of math and statistics, subject matter knowledge, and computational thinking and programming in the service of making discoveries from data.  I’ve seen similar definitions and have found such a definition to be very useful in thinking about curriculum for a high school data science course.  It doesn’t contradict Nick’s definition, but is a little more precise.  As always, Bill has a knack for phrasing things just right without any practice.

Deb Nolan answered last, and I think I liked her answer the best.  Data science encompasses the entire data analysis cycle, the issues you face in working with data within that cycle, and the skills needed to complete that cycle.  (I like to use this simplified version of the cycle:  ask questions –> collect/consider/prepare data –> analyze data –> interpret data –> ask questions, etc.)

One reason I like Deb’s answer is that it’s the answer we arrived at in our Mobilize group that’s developing the Introduction to Data Science curriculum for Los Angeles Unified School District.  (With a new and improved webpage appearing soon! I promise!)  Lots of computational skills appear explicitly in the collect/prepare data bit of the cycle, but in fact, algorithmic thinking — thinking about processes of reproducibility and real-time analyses — can appear in all phases.

During this talk I had an epiphany about my own feelings towards a definition. The epiphany was sparked by an earlier talk by Daniel Frischemeier on the previous day, but brought into focus by this panel’s discussion.   (Is it possible to have a slow epiphany?)

Statistics educators have been big proponents of teaching “statistical thinking”, which is basically an approach to solving problems that involve uncertainty/variation and data.  But for many of us, the bit of problem solving in which a computer is involved is ignored in our conceptualization of statistical thinking.  To some extent, statistical thinking is considered to be independent of computation.  We’d like to think that we’d reach the same conclusions regardless of which software we were using.  While that’s true, I think it’s also true that our approach to solving the problem may be software dependent.  We think differently with different softwares because different softwares enable different thought processes, in the same way that a pen and paper enables different processes than a word processor.

And so I think that we statisticians become data scientists the moment we reconceptualize statistical thinking to include using the computer.

What does this have to do with Daniel’s talk?   Daniel has done a very interesting study in which he examined the problem solving approach of students in a statistics class.  In this talk, he offered a model for the expert statistician problem solving process.  Another version of the data analysis cycle, if you will.  His cycle (built solidly on foundations of others) is Real Problem –> Statistical activity –> Software use–> Reading off/Documentation (interpreting) –> conclusions –> reasons (validation of conclusions)–> back to beginning.

I think data scientists are those who would think that the “software use” part of the cycle was subsumed by the statistical activity part of the cycle. In other words, when you approach data cleaning, data organizing, programming, etc. as if they were a fundamental component of statistical thinking, and not just something that stands in the way of your getting to real data analysis, then you are doing data science.  Or, as my colleague Mark Hansen once told me, “Teaching R *is* teaching statistics.”  Of course it’s possible to teach R so that it seems like something that gets in the way of (or delays) understanding statistics.  But it’s also possible to teach it as a complement to developing statistical understanding.

I don’t mean this as a criticism of Daniel’s work, because certainly it’s useful to break complex activities into smaller parts.  But I think that there is a figure-and-ground issue, in which statisticians have seen modeling and data analysis as the figure, and the computer as the ground.  But when our thinking unites these views, we begin to think like data scientists.  And so I do not think that “data science” is just a rebranding of statistics. It is a re-consideration of statistics that places greater emphasis on parts of the data cycle than statistics traditionally has.

I’m not done with this issue.  The term still bothers me.  Just what is the science in data science?  I feel a refresher course in Popper and Kuhn is in order.  Are we really thinking scientifically about data?  Comments and thoughts welcome.

Lively R

Next week, the UseR conference comes to UCLA.  And in anticipation, I thought a little foreshadowing would be nice.  Amelia McNamara, UCLA Stats grad student and rising stats ed star, shared with me a new tool that has the potential to do some wonderful things.  LivelyR is a work-in-progress that is, in the words of its creators, a “mashup of R with packages of Rstudio.” The result is a highly interactive environment.  I was particularly struck and intrigued by the ‘sweeping’ function, which visually smears graphics across several parameter values.  The demonstration shows how this can help understand the effects of bin-width and off-set changes on a histogram so that a more robust sense of the sample distribution shines through.

R is beginning to become a formidable educational tool, and I’m looking forward to learning more at UseR next week. For those of you in L.A. who can attend, Aron Lunzer will be talking about LivelyR at 4pm on Tuesday, July 1.

Data Privacy (L.A. Times)

The L.A. Times ran an article on data privacy today, which, I think it’s fair to say, puts “Big Data” in approximately the same category as fire. In the right hands, it can do good. But…

http://www.latimes.com/nation/politics/politicsnow/la-pn-white-house-big-data-privacy-report-20140501,0,5624003.story

An Open Letter to the TinkerPlots Community

I received the following from Cliff Konold:

We have just released the following to answer questions many have asked us about when TinkerPlots will be available for sale again. Unfortunately, we do not have a list of current users to send this to, so please distribute this to others you think would be interested.


March 21, 2014

As you may have discovered by now, you can no longer purchase TinkerPlots. Many of you who have been using TinkerPlots in courses and workshops have found your way to us asking if and when it will be available for purchase again. We expect it to be available again soon, by this June.  But to allow you to make informed decisions about future instructional uses of TinkerPlots, we need to provide a little background.

On December 10, 2013, we received a letter from McGraw-Hill Education giving us notice that in 90 days they would be terminating their agreement with us to publish TinkerPlots. For those of you who remember Key Curriculum as our publisher, McGraw-Hill Education acquired Key in August 2012, and as part of that acquisition became the new publisher of The Geometer’s Sketchpad, Fathom, and TinkerPlots.

Though McGraw-Hill Education had informally told us of their plans to terminate sales of both TinkerPlots and Fathom as of December 31, 2013, we were nevertheless surprised when they actually did this. We were assuming this wouldn’t happen until mid March (i.e., 90 days). In any case, since January 1 of this year, no new licenses for TinkerPlots have been sold.

Fortunately, TinkerPlots is actually owned by our University, so we are now free to find another publisher. We are in ongoing discussions with four different organizations who have expressed interest in publishing TinkerPlots. But there are many components of TinkerPlots in addition to the application (data sets, activities, help manual, instructional movies, tutorials, on-line course materials, artwork, the license server/installer, the list of existing users), which McGraw-Hill Education does own that would be hard to do without; to replace them would require a significant undertaking. Fortunately, McGraw-Hill Education has indicated their willingness to transfer most all of these assets to us, and we are very grateful for this because they are not legally bound to do so.  However, we have not yet received any of these resources or written permission that we can use them. Until we do, we cannot realistically build and release another version of the application. We are in regular communication with people at McGraw-Hill Education who have assured us that they will begin very shortly to deliver to us these materials and official permissions for their use.

We have been telling folks that a new version of TinkerPlots will be available by June 2014, and we still think this a reasonable timeframe.  We’d give it about an 85% probability. By August, 98.2%.

In the meantime, if you have unused licenses for TinkerPlots, you will still be able to register new computers on that license number. To see how many licenses you have, go to License Information… under the Help menu. If you have one license, our memory is that you can actually register 3 computers on it — they built in a little leeway. From that same dialog box you can also deregister a computer and in this way free up a currently used license. (We just checked, and when the deregister dialog comes up, it now has the name of Sketchpad where TinkerPlots should be.  But ignore that. It’s just an indication of the publisher slowly phasing the name TinkerPlots out of its system.)

Also, the resource links under the TinkerPlots Help menu still take you to resources such as movies on the publisher’s site. They have told us, however, that after March 2015, they will discontinue hosting these materials on their web site. But by that time, all these should be available on the site of the new publisher.

We are so sorry for the inconvenience this interruption and the lack of communication has caused many of you. McGraw-Hill Education has not notified its existing users, and we don’t know who most of you are.  We have heard of several instances where teachers planning to start a course or workshop in a few days have suddenly learned that their students will not be able to purchase TinkerPlots, and they have had to quickly redesign their course. We understand that because of this ordeal, some of you will decide to jump ship on TinkerPlots. But we certainly hope that most of you will stick with us through this bumpy transition. We have put nearly 15 years of ourselves into the creation of TinkerPlots and the development of its community, and we are committed to keeping both going.

Cliff Konold and Craig Miller
The TinkerPlots Development Team
Scientific Reasoning Research Institute
University of Massachusetts Amherst
Amherst, Massachusetts

Email: konold@srri.umass.edu
Web:   www.umass.edu/srri/serg/

JMM 2014

Two weeks ago I traveled to Baltimore to the Joint Mathematics Meetings. These meetings are very much like the Joint Statistics Meetings except for mathematicians. “Now, um, usually I don’t do this but uh….Go head’ on and break em off wit a lil’ preview of the remix….” (Kelly, 2003).

The JMM are a great place to educate and work with mathematics teachers at the collegiate level who are teaching introductory statistics courses. One group that is quite active in this community is the Statistics Education Special Interest Group of the Mathematical Association of America (SIGMAA). If you are a member of the MAA, let me put in a plug to join this SIGMAA. Each year they sponsor at least one contributed paper session and often several minicourses.

This year, aside from the perennial Teaching introductory statistics minicourse (for instructors new to teaching intro stats), the SIGMAA also endorsed two minicourses aimed at using randomization/bootstrapping in the introductory course, CATALST: Introductory statistics using randomization and bootstrap methods and Using randomization methods to build conceptual understanding of statistical inference. Both minicourses were well attended and will likely be offered again next January.


Nicola during the CATALST minicourse.

The SIGMAA also sponsored a Contributed Paper Session entitled, Data, Modeling, and Computing in the Introductory Statistics Course. The marathon session, running from 1:00pm–6:00pm, was very well attended and included 15 presentations.


Nick Horton presents the paper Big Data in the Intro Stats Class: Use of the Airline Delays Dataset to Expose Students to a Real-World, Complex Dataset, coauthored with Ben Baumer and Hadley Wickham.

One of my favorite things at JMM is attending the SIGMAA Stat-Ed Business Meeting. This took place immediately following the CPS, so we were able to capitalize on inviting many of the attendees to join us. After eating what might have been the best spread of food I have encountered at one of these meetings, we had our meeting.

The SIGMAA presents two awards during these meetings.

The Dex Whittinghill Award is presented to the first author of the paper that receives the highest evaluations during the CPS session from the previous JMM. This year, it was presented to Kari Lock-Morgan of Duke University (who was unable to be there, but sent her heartfelt thanks via her parents).

The Robert V. Hogg Award for excellence in teaching introductory statistics was presented to Johanna Hardin of Pomona College. Johanna’s colleague, Gizem Karaali, gave a heartwarming talk when presenting Johanna the award.


Scott Albers, SIGMAA chair, congratulates Johanna Hardin on winning the Robert V. Hogg Award


Gizem Karaali reads a heartwarming note from Johanna’s colleagues.


References

Kelly, R. (2003). Ignition (remix). On Chocolate factory. Chicago: Jive, Sony.

The Future of Inference

We had an interesting departmental seminar last week, thanks to our post-doc Joakim Ekstrom, that I thought would be fun to share.  The topic was The Future of Statistics discussed by a panel of three statisticians.  From left to right in the room: Songchun Zhu (UCLA Statistics), Susan Paddock (RAND), and Jan DeLeeuw (UCLA Statistics).  The panel was asked about the future of inference: waxing or waning.

The answers spanned the spectrum from “More” to “Less” and did so, interestingly enough, as one moved left to right in order of seating.  Songchun staked a claim for waxing, in part because  he knows of groups that are hiring statisticians instead of computer scientists because statisticians’ inclination to cast problems in an inferential context makes them more capable of finding conclusions in data, and not simply presenting summaries and visualizations.  Susan felt that it was neither waxing nor waning, and pointed out that she and many of the statisticians she knows spend much of their time doing inference.  Jan said that inference as an activity belongs in the substantive field that raised the problem.  Statisticians should not do inference.  Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.

I think one reason that many stats educators might object to this is that it’s hard to think of how else to fill the curriculum.  That might have been an issue when most students took a single Introductory course in their early twenties and then never saw statistics again.  But now we must think of the long game, and realize that students begin learning statistics early.  The Common Core stakes out one learning pathway, but we should be looking ahead, and thinking of future curricula, since the importance of statistics will grow.

If statistics is the science of data, I suggest we spend more time thinking about how to teach students to behave more like scientists.  And this means thinking seriously about how we can develop their sense of curiosity.  The Common Core introduces the notion of a ‘statistical question’: a question that recognizes variability.  To the statisticians reading this, this needs no more explanation.  But I’ve found it surprisingly difficult to teach this practice to math teachers teaching statistics.  I’m not sure, yet, why this is.  Part of the reason might be that in order to answer a statistical question such as “What is the most popular favorite color in this class?” we must ask the non-statistical question “What is your favorite color?”  But there’s more to it than that.  A good statistical question isn’t as simple as the one I mentioned, and leads to discovery beyond the mere satisfaction of curiosity.  I’m reminded of the Census at Schools program that encouraged students to become Data Detectives.

In short, it’s time to think seriously about teaching students why they should want to do data analysis.  And if we’re successful, they’ll want to learn how to do inference.

So what role does inference play in your Ideal Statistics Curriculum?

Should Programming Count as a “Foreign Language”?

I re-hashed this blog post title from the Edutopia article, Should Coding be the “New Foreign Language” Requirement? Texas legislators just answered this question with “Yes”. I hope Minnesota doesn’t follow suit.

Now, in all fairness, I need to disclose that when I taught high school, the Math department played a practical joke on the Languages department by faking a document that claimed that mathematics would be accepted as a foreign language requirement and then conveniently dropping the document outside the classroom door of the Spanish teacher. The result had the faculty laughing for weeks.

But, I would have no more stood up for mathematics fulfilling a foreign language requirement than computer science fulfilling the same requirement. I think a better substitution, however, would be for computer science to count toward a mathematics requirement!

The authors of the Edutopia blog write,

In terms of cognitive advantages, learning a system of signs, symbols and rules used to communicate — that is, language study — improves thinking by challenging the brain to recognize, negotiate meaning and master different language patterns. Coding does the same thing.

Substitute the word “mathematics” for “language study” in the previous paragraph and in my mind, it is an even better sell.

While I hope coding does not replace foreign language, I am glad that it is receiving its time in the spotlight. And, I hope the statistics community can use this to its advantage. This is perhaps the perfect route for building on the success of AP statistics…statistical computing. The combined sexiness (sorry Mr. Varian!) of statistics and coding would be amazing (p < .000001) and would be beneficial to both disciplines.

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct 30) encouraging City Hall to make its data available to the public.  As you know, fellow Citizens, we’re all in favor of making data public, particularly if the public has already picked up the bill and if no individual’s dignity will be compromised.  For me this editorial comes at a time when I’ve been feeling particularly down about the quality of public data.  As I’ve been looking around for data to update my book and for the Mobilize project, I’m convinced that data are getting harder, not easier, to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples.  (We compare the participatory data in gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up-to-date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheapest gas. )  Used to be you could type in a zip code and have access to a nice data set that showed current prices, names and locations of gas stations, dates of the last reported price, and the username of the person who reported the price.  Now, you can scroll through an unsorted list of cities and states and get the same information only for the 15 cheapest and most expensive stations.

About 2 years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US.  I’ve looked and looked, but the data.gov AirData site now requires that I enter each city’s name one at a time, and download very raw data for each city separately.  Now raw data are a good thing, and I’m glad to see them offered. But is it really so difficult to provide some sensibly aggregated data sets?

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.

Warning: Mac OS 10.9 Mavericks and R Don’t Play Nicely

For some reason I was compelled to update my Mac’s OS and R on the same day. (I know…) It didn’t go well on several accounts and I mostly blame Apple. Here are the details.

  • I updated R to version 3.0.2 “Frisbee Sailing”
  • I updated my OS to 10.9 “Mavericks”

When I went to use R things were going fine until I mistyped a command. Rather than giving some sort of syntax error, R responded with,

*** caught segfault ***
address 0x7c0, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Unlike most of my experiences with computing, this I was able to replicate many times. After a day of panic and no luck on Google, I was finally able to find a post on one of the Google Groups from Simon Urbanek responding to someone with a similar problem. He points out that there are a couple of solutions, one of which is to wait until Apple gets things stabilized. (This is an issue since if you have ever tried to go back to a previous OS on a Mac, you will know that this might take several days of pain and swearing.)

The second solution he suggests is to install the nightly build or rebuild the GUI. To install the nightly build, visit the R for Mac OS X Developer’s page. Or, in Terminal, issue the following commands,

svn co https://svn.r-project.org/R-packages/trunk/Mac-GUI 
cd Mac-GUI 
xcodebuild -configuration Debug 
open build/Debug/R.app

I tried both and this worked fine…until I needed to load a package. Then I was given an error that the package couldn’t be found. Now I realize that you can download the packages you need from source and compile them yourself, but I was trying to figure out how to deal with students who were in a similar situation. (This is not an option for most social science students.)
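(For completeness: installing a CRAN package from source, rather than the pre-built binary, is a one-liner in R, assuming a working compiler toolchain such as Xcode’s command line tools; the package name here is only a placeholder.)

```r
# Compile a CRAN package from source instead of installing the binary;
# requires a working compiler toolchain. "ggplot2" is only an example.
install.packages("ggplot2", type = "source")
```

This is exactly the step that is unrealistic to ask of most social science students, which is what pushed me toward the RStudio workaround below.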

The best solution it turned out is to use RStudio, which my students pretty much all use anyway. (My problem is that I am a Sublime Text 2 user.) This allowed the newest version of R to run on the new Mac OS. But, as is pointed out on the RStudio blog,

As a result of a problem between Mavericks and the user interface toolkit underlying RStudio (Qt) the RStudio IDE is very slow in painting and user interactions when running under Mavericks.

I re-downloaded the latest stable release of the R GUI about an hour ago, and so far it seems to be working fine with Mavericks (no abort message yet), so this whole post may be moot.