Is Data Science Real?

Just came back from the International Conference on Teaching Statistics (ICOTS) in Flagstaff, AZ filled with ideas. There were many thought-provoking talks, but what was even better were the thought-provoking conversations. One theme, at least for me, was just what is this thing called Data Science? One esteemed colleague suggested it was simply a re-branding. Other speakers used it somewhat pejoratively, in reference to outsiders (i.e., computer scientists). Here are some answers from panelists at a discussion on the future of technology in statistics education. All paraphrases are my own, and I take responsibility for any sloppiness, poor grammar, etc.

Webster West took the High Statistician point of view—one shared by many, including, on a good day, myself: Data Science consists of those things that are involved in analyzing data. I think most statisticians, reading this, will feel like Moliere's Bourgeois Gentleman, who was pleasantly surprised to learn he'd been speaking prose all his life. But I think there's more to it than that, because many statisticians probably don't consider data scraping, data cleaning, and data management to be part of data analysis.

Nick Horton offered that data mining was an activity that could be considered part of data science. And he sees data mining as part of statistics. I'm not sure all statisticians would agree, since for many of us, data mining is a swear word used to refer to people who are lucky enough to discover something but have no idea why it was discovered. But he also offered a broader definition: using data to answer a statistical question. Which I quite like. It leaves the door open to many ways of answering the question; it doesn't require any particular background or religion; it simply encompasses those activities used to bring data to bear in answering a statistical question.

Bill Finzer relied on set theory: data science is a partial union of math and statistics, subject matter knowledge, and computational thinking and programming, in the service of making discoveries from data. I've seen similar definitions and have found them very useful in thinking about curriculum for a high school data science course. It doesn't contradict Nick's definition, but is a little more precise. As always, Bill has a knack for phrasing things just right without any practice.

Deb Nolan answered last, and I think I liked her answer the best. Data science encompasses the entire data analysis cycle, the issues you face in working with data within that cycle, and the skills needed to complete that cycle. (I like to use this simplified version of the cycle: ask questions –> collect/consider/prepare data –> analyze data –> interpret data –> ask questions, etc.)

One reason I like Deb's answer is that it's the answer we arrived at in our Mobilize group, which is developing the Introduction to Data Science curriculum for the Los Angeles Unified School District. (With a new and improved webpage appearing soon! I promise!) Lots of computational skills appear explicitly in the collect/prepare data part of the cycle, but in fact, algorithmic thinking — thinking about processes of reproducibility and real-time analyses — can appear in all phases.

During this talk I had an epiphany about my own feelings towards a definition. The epiphany was sparked by an earlier talk by Daniel Frischemeier on the previous day, but brought into focus by this panel’s discussion.   (Is it possible to have a slow epiphany?)

Statistics educators have been big proponents of teaching "statistical thinking", which is basically an approach to solving problems that involve uncertainty/variation and data. But for many of us, the bit of problem solving in which a computer is involved is ignored in our conceptualization of statistical thinking. To some extent, statistical thinking is considered to be independent of computation. We'd like to think that we'd reach the same conclusions regardless of which software we were using. While that's true, I think it's also true that our approach to solving the problem may be software dependent. We think differently with different software because different tools enable different thought processes, in the same way that pen and paper enable different processes than a word processor does.

And so I think that we statisticians become data scientists the moment we reconceptualize statistical thinking to include using the computer.

What does this have to do with Daniel's talk? Daniel has done a very interesting study in which he examined the problem-solving approaches of students in a statistics class. In this talk, he offered a model for the expert statistician's problem-solving process. Another version of the data analysis cycle, if you will. His cycle (built solidly on the foundations of others) is Real Problem –> Statistical activity –> Software use –> Reading off/Documentation (interpreting) –> Conclusions –> Reasons (validation of conclusions) –> back to the beginning.

I think data scientists are those who would consider the "software use" part of the cycle to be subsumed by the statistical activity part of the cycle. In other words, when you approach data cleaning, data organizing, programming, etc. as if they were a fundamental component of statistical thinking, and not just something that stands in the way of your getting to real data analysis, then you are doing data science. Or, as my colleague Mark Hansen once told me, "Teaching R *is* teaching statistics." Of course it's possible to teach R so that it seems like something that gets in the way of (or delays) understanding statistics. But it's also possible to teach it as a complement to developing statistical understanding.

I don't mean this as a criticism of Daniel's work, because certainly it's useful to break complex activities into smaller parts. But I think that there is a figure-and-ground issue, in which statisticians have seen modeling and data analysis as the figure, and the computer as the ground. When our thinking unites these views, we begin to think like data scientists. And so I do not think that "data science" is just a rebranding of statistics. It is a re-consideration of statistics that places greater emphasis on parts of the data cycle than statistics traditionally has.

I’m not done with this issue.  The term still bothers me.  Just what is the science in data science?  I feel a refresher course in Popper and Kuhn is in order.  Are we really thinking scientifically about data?  Comments and thoughts welcome.

Lively R

Next week, the UseR conference comes to UCLA. And in anticipation, I thought a little foreshadowing would be nice. Amelia McNamara, UCLA Stats grad student and rising stats ed star, shared with me a new tool that has the potential to do some wonderful things. LivelyR is a work-in-progress that is, in the words of its creators, a "mashup of R with packages of Rstudio." The result is highly interactive. I was particularly struck and intrigued by the 'sweeping' function, which visually smears graphics across several parameter values. The demonstration shows how this can help in understanding the effects of bin-width and offset changes on a histogram, so that a more robust sense of the sample distribution shines through.
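I haven't seen LivelyR's code, but the core idea is easy to approximate in plain R: draw the same sample's histogram at several bin widths and see which features persist. Here is a rough sketch (the sample and the bin widths are made up, and this is static, not LivelyR's interactive sweep):

set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)     # a made-up sample

binwidths <- c(0.25, 0.5, 1, 2)
par(mfrow = c(2, 2))                   # one panel per bin width
for (bw in binwidths) {
  breaks <- seq(floor(min(x)), ceiling(max(x)) + bw, by = bw)
  hist(x, breaks = breaks, col = "grey80",
       main = paste("bin width =", bw), xlab = "x")
}
par(mfrow = c(1, 1))

Features that survive every choice of bin width are probably real; features that appear at only one setting are probably artifacts. What LivelyR adds, as I understand it, is doing this sweep continuously and interactively rather than one static panel at a time.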

R is beginning to become a formidable educational tool, and I’m looking forward to learning more at UseR next week. For those of you in L.A. who can attend, Aron Lunzer will be talking about LivelyR at 4pm on Tuesday, July 1.

Data Privacy (L.A. Times)

The L.A. Times ran an article on data privacy today, which, I think it’s fair to say, puts “Big Data” in approximately the same category as fire. In the right hands, it can do good. But…

http://www.latimes.com/nation/politics/politicsnow/la-pn-white-house-big-data-privacy-report-20140501,0,5624003.story

An Open Letter to the TinkerPlots Community

I received the following from Cliff Konold:

We have just released the following to answer questions many have asked us about when TinkerPlots will be available for sale again. Unfortunately, we do not have a list of current users to send this to, so please distribute it to others you think would be interested.


March 21, 2014

As you may have discovered by now, you can no longer purchase TinkerPlots. Many of you who have been using TinkerPlots in courses and workshops have found your way to us asking if and when it will be available for purchase again. We expect it will be available again soon, by this June. But to allow you to make informed decisions about future instructional uses of TinkerPlots, we need to provide a little background.

On December 10, 2013, we received a letter from McGraw-Hill Education giving us notice that in 90 days they would be terminating their agreement with us to publish TinkerPlots. For those of you who remember Key Curriculum as our publisher, McGraw-Hill Education acquired Key in August 2012, and as part of that acquisition became the new publisher of The Geometer’s Sketchpad, Fathom, and TinkerPlots.

Though McGraw-Hill Education had informally told us of their plans to terminate sales of both TinkerPlots and Fathom as of December 31, 2013, we were nevertheless surprised when they actually did this. We were assuming this wouldn’t happen until mid March (i.e., 90 days). In any case, since January 1 of this year, no new licenses for TinkerPlots have been sold.

Fortunately, TinkerPlots is actually owned by our University, so we are now free to find another publisher. We are in ongoing discussions with four different organizations who have expressed interest in publishing TinkerPlots. But there are many components of TinkerPlots in addition to the application itself (data sets, activities, the help manual, instructional movies, tutorials, online course materials, artwork, the license server/installer, the list of existing users) that McGraw-Hill Education does own and that would be hard to do without; replacing them would require a significant undertaking. Fortunately, McGraw-Hill Education has indicated their willingness to transfer almost all of these assets to us, and we are very grateful for this because they are not legally bound to do so. However, we have not yet received any of these resources or written permission to use them. Until we do, we cannot realistically build and release another version of the application. We are in regular communication with people at McGraw-Hill Education who have assured us that they will begin very shortly to deliver these materials and official permissions for their use.

We have been telling folks that a new version of TinkerPlots will be available by June 2014, and we still think this is a reasonable timeframe. We'd give it about an 85% probability. By August, 98.2%.

In the meantime, if you have unused licenses for TinkerPlots, you will still be able to register new computers on that license number. To see how many licenses you have, go to License Information… under the Help menu. If you have one license, our memory is that you can actually register 3 computers on it — they built in a little leeway. From that same dialog box you can also deregister a computer and in this way free up a currently used license. (We just checked, and when the deregister dialog comes up, it now has the name of Sketchpad where TinkerPlots should be.  But ignore that. It’s just an indication of the publisher slowly phasing the name TinkerPlots out of its system.)

Also, the resource links under the TinkerPlots Help menu still take you to resources such as movies on the publisher’s site. They have told us, however, that after March 2015, they will discontinue hosting these materials on their web site. But by that time, all these should be available on the site of the new publisher.

We are so sorry for the inconvenience this interruption and the lack of communication have caused many of you. McGraw-Hill Education has not notified its existing users, and we don't know who most of you are. We have heard of several instances where teachers planning to start a course or workshop in a few days have suddenly learned that their students will not be able to purchase TinkerPlots, and they have had to quickly redesign their course. We understand that because of this ordeal, some of you will decide to jump ship on TinkerPlots. But we certainly hope that most of you will stick with us through this bumpy transition. We have put nearly 15 years of ourselves into the creation of TinkerPlots and the development of its community, and we are committed to keeping both going.

Cliff Konold and Craig Miller
The TinkerPlots Development Team
Scientific Reasoning Research Institute
University of Massachusetts Amherst
Amherst, Massachusetts

Email: konold@srri.umass.edu
Web:   www.umass.edu/srri/serg/

JMM 2014

Two weeks ago I traveled to Baltimore for the Joint Mathematics Meetings. These meetings are very much like the Joint Statistical Meetings, except for mathematicians. "Now, um, usually I don't do this but uh….Go head' on and break em off wit a lil' preview of the remix…." (Kelly, 2003).

The JMM are a great place to educate and work with mathematics teachers at the collegiate level who are teaching introductory statistics courses. One group that is quite active in this community is the Statistics Education Special Interest Group of the Mathematical Association of America (SIGMAA). If you are a member of the MAA, let me put in a plug to join this SIGMAA. Each year they sponsor at least one contributed paper session and often several minicourses.

This year, aside from the perennial Teaching Introductory Statistics minicourse (for instructors new to teaching intro stats), the SIGMAA also endorsed two minicourses aimed at using randomization/bootstrapping in the introductory course: CATALST: Introductory statistics using randomization and bootstrap methods, and Using randomization methods to build conceptual understanding of statistical inference. Both minicourses were well attended and will likely be offered again next January.

[Photo] Nicola during the CATALST minicourse.

The SIGMAA also sponsored a Contributed Paper Session entitled, Data, Modeling, and Computing in the Introductory Statistics Course. The marathon session, running from 1:00pm–6:00pm, was very well attended and included 15 presentations.

[Photo] Nick Horton gives the paper Big Data in the Intro Stats Class: Use of the Airline Delays Dataset to Expose Students to a Real-World, Complex Dataset, by himself, Ben Baumer, and Hadley Wickham.

One of my favorite things at JMM is attending the SIGMAA Stat-Ed Business Meeting. This took place immediately following the CPS, so we were able to capitalize on the timing and invite many of the attendees to join us. After eating what might have been the best spread of food I have encountered at one of these meetings, we had our meeting.

The SIGMAA presents two awards during these meetings.

The Dex Whittinghill Award is presented to the first author of the paper that receives the highest evaluations during the CPS session from the previous JMM. This year, it was presented to Kari Lock-Morgan of Duke University (who was unable to be there, but sent her heartfelt thanks via her parents).

The Robert V. Hogg Award for excellence in teaching introductory statistics was presented to Johanna Hardin of Pomona College. Johanna’s colleague, Gizem Karaali, gave a heartwarming talk when presenting Johanna the award.

[Photo] Scott Albers, SIGMAA chair, congratulates Johanna Hardin on winning the Robert V. Hogg Award.

[Photo] Gizem Karaali reads a heartwarming note from Johanna's colleagues.

References

Kelly, R. (2003). Ignition (remix). On Chocolate factory. Chicago: Jive, Sony.

The Future of Inference

We had an interesting departmental seminar last week, thanks to our post-doc Joakim Ekstrom, that I thought would be fun to share. The topic was The Future of Statistics, discussed by a panel of three statisticians. From left to right in the room: Songchun Zhu (UCLA Statistics), Susan Paddock (RAND), and Jan de Leeuw (UCLA Statistics). The panel was asked about the future of inference: waxing or waning?

The answers spanned the spectrum from "More" to "Less," and did so, interestingly enough, as one moved left to right in order of seating. Songchun staked a claim for waxing, in part because he knows of groups that are hiring statisticians instead of computer scientists: statisticians' inclination to cast problems in an inferential context makes them more capable of finding conclusions in data, rather than simply presenting summaries and visualizations. Susan felt that it was neither waxing nor waning, and pointed out that she and many of the statisticians she knows spend much of their time doing inference. Jan said that inference as an activity belongs in the substantive field that raised the problem. Statisticians should not do inference. Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.

I think one reason that many stats educators might object to this is that it's hard to think of how else to fill the curriculum. That might have been an issue when most students took a single introductory course in their early twenties and then never saw statistics again. But now we must think of the long game, and realize that students begin learning statistics early. The Common Core stakes out one learning pathway, but we should be looking ahead and thinking of future curricula, since the importance of statistics will grow.

If statistics is the science of data, I suggest we spend more time thinking about how to teach students to behave more like scientists. And this means thinking seriously about how we can develop their sense of curiosity. The Common Core introduces the notion of a 'statistical question' – a question that recognizes variability. To the statisticians reading this, that needs no more explanation. But I've found it surprisingly difficult to teach this practice to math teachers teaching statistics. I'm not sure, yet, why this is. Part of the reason might be that in order to answer a statistical question such as "What is the most popular favorite color in this class?" we must ask the non-statistical question "What is your favorite color?" But there's more to it than that. A good statistical question isn't as simple as the one I mentioned, and leads to discovery beyond the mere satisfaction of curiosity. I'm reminded of the Census at Schools program that encouraged students to become Data Detectives.

In short, it's time to think seriously about teaching students why they should want to do data analysis. And if we're successful, they'll want to learn how to do inference.

So what role does inference play in your Ideal Statistics Curriculum?

Should Programming Count as a “Foreign Language”?

I re-hashed this blog post title from the Edutopia article, Should Coding be the “New Foreign Language” Requirement? Texas legislators just answered this question with “Yes”. I hope Minnesota doesn’t follow suit.

Now, in all fairness, I need to disclose that when I taught high school, the Math department played a practical joke on the Languages department by faking a document that claimed that mathematics would be accepted as a foreign language requirement and then conveniently dropping the document outside the classroom door of the Spanish teacher. The ensuing result had the faculty laughing for weeks.

But, I would no more have stood up for mathematics fulfilling a foreign language requirement than for computer science fulfilling the same requirement. I think a better substitution, however, is that computer science should count as fulfilling a mathematics requirement!

The authors of the Edutopia blog write,

In terms of cognitive advantages, learning a system of signs, symbols and rules used to communicate — that is, language study — improves thinking by challenging the brain to recognize, negotiate meaning and master different language patterns. Coding does the same thing.

Substitute the word “mathematics” for “language study” in the previous paragraph and in my mind, it is an even better sell.

While I hope coding does not replace foreign language, I am glad that it is receiving its time in the spotlight. And, I hope the statistics community can use this to its advantage. This is perhaps the perfect route for building on the success of AP statistics…statistical computing. The combined sexiness (sorry Mr. Varian!) of statistics and coding would be amazing (p < .000001) and would be beneficial to both disciplines.

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct. 30) encouraging City Hall to make its data available to the public. As you know, fellow Citizens, we're all in favor of making data public, particularly if the public has already picked up the bill and if no individual's dignity will be compromised. For me this editorial comes at a time when I've been feeling particularly down about the quality of public data. As I've been looking around for data to update my book and for the Mobilize project, I'm convinced that data are getting harder, not easier, to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples. (We compare the participatory data in gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up-to-date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheapest gas.) It used to be that you could type in a zip code and have access to a nice data set that showed current prices, names and locations of gas stations, dates of the last reported price, and the username of the person who reported the price. Now, you can scroll through an unsorted list of cities and states and get the same information only for the 15 cheapest and most expensive stations.

About two years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US. I've looked and looked, but the data.gov AirData site now requires that I enter the name of each city one at a time, and download very raw data for each city separately. Now, raw data are a good thing, and I'm glad to see them offered. But is it really so difficult to provide some common-sensically aggregated data sets?
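(If you do end up with one raw file per city, stacking and aggregating them in R is only a few lines; the folder, file, and column names below are hypothetical, not the actual AirData layout:

files <- list.files("airdata", pattern = "\\.csv$", full.names = TRUE)   # one downloaded CSV per city (hypothetical)
raw   <- do.call(rbind, lapply(files, read.csv))                         # stack them into one data frame
annual <- aggregate(PM25 ~ City, data = raw, FUN = mean)                 # one row per city: mean particulate reading

Still, that is work the data provider could do once, rather than every user repeating it badly.)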

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.

Warning: Mac OS 10.9 Mavericks and R Don’t Play Nicely

For some reason I was compelled to update my Mac's OS and R on the same day. (I know…) It didn't go well on several counts, and I mostly blame Apple. Here are the details.

  • I updated R to version 3.0.2 “Frisbee Sailing”
  • I updated my OS to 10.9 “Mavericks”

When I went to use R, things were going fine until I mistyped a command. Rather than giving some sort of syntax error, R responded with:

*** caught segfault ***
address 0x7c0, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Unlike most of my experiences with computing, this I was able to replicate many times. After a day of panic and no luck on Google, I was finally able to find a post on one of the Google Groups from Simon Urbanek responding to someone with a similar problem. He points out that there are a couple of solutions, one of which is to wait until Apple gets things stabilized. (This is an issue since if you have ever tried to go back to a previous OS on a Mac, you will know that this might take several days of pain and swearing.)

The second solution he suggests is to install the nightly build or rebuild the GUI. To install the nightly build, visit the R for Mac OS X Developer's page. Or, in Terminal, issue the following commands:

# check out the source for the R.app GUI from the R project's Subversion repository
svn co https://svn.r-project.org/R-packages/trunk/Mac-GUI 
cd Mac-GUI 
# build the GUI with Xcode's command-line tools, then launch the freshly built R.app
xcodebuild -configuration Debug 
open build/Debug/R.app

I tried both and this worked fine…until I needed to load a package. Then I was given an error that the package couldn’t be found. Now I realize that you can download the packages you need from source and compile them yourself, but I was trying to figure out how to deal with students who were in a similar situation. (This is not an option for most social science students.)
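(For the record, building a CRAN package from source is a one-liner, assuming the Xcode command line tools are installed; the package name here is just an example:

install.packages("ggplot2", type = "source")   # compile from source instead of using the pre-built Mac binary

but, again, that is not something I would ask a room full of social science students to do.)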

The best solution, it turned out, is to use RStudio, which my students pretty much all use anyway. (My problem is that I am a Sublime Text 2 user.) This allowed the newest version of R to run on the new Mac OS. But, as is pointed out on the RStudio blog,

As a result of a problem between Mavericks and the user interface toolkit underlying RStudio (Qt) the RStudio IDE is very slow in painting and user interactions when running under Mavericks.

I re-downloaded the latest stable release of the R GUI about an hour ago, and so far it seems to be working fine with Mavericks (no abort message yet), so this whole post may be moot.

Statistics, the government shutdown, and causality.

There's a statistical meme making its way into pundits' discussions (as we might politely call them) that is of interest to statistics educators. There are several variations, but the basic theme is this: because of the government shutdown, people are unable to benefit from the new drugs they receive by participating in clinical trials. The L.A. Times went so far as to publish an editorial from a gentleman who claimed that he was cured by his participation in a clinical trial.

Now if they had said that future patients are prevented from benefiting from what is learned from a clinical trial, then they'd have nailed it. Instead, they seem to be overlooking the fact that some patients will be randomized to the control group, and will probably get the same treatment as if there were no trial at all. And in many trials (a majority?), the result will be that the experimental treatment had little or no effect beyond the traditional treatment. And in a very small number of cases, the experimental treatment will be found to have serious side effects. And so the pundits should really be telling us that the government shutdown deprives patients of a small probability of benefiting from an experimental treatment.
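To make that "small probability" concrete, here is a back-of-the-envelope version with invented numbers (not estimates from any real trial):

p_treatment_arm <- 0.5              # typical 1:1 randomization to treatment vs. control
p_drug_superior <- 0.2              # hypothetical chance the experimental arm truly outperforms standard care
p_treatment_arm * p_drug_superior   # 0.1 -- the modest benefit the shutdown actually takes away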

All snarkiness aside, I think the prevalence of this meme points to the subtleties of interpreting probabilistic experiments, in which outcomes contain much variability, and so conclusions must be stated in terms of group characteristics. This came out in the SRTL discussion in Minnesota this summer, when Maxine Pfannkuch, Pip Arnold, and Stephanie Budgett of the University of Auckland presented their work leading towards a framework for describing students' understanding of causality. I don't remember the example they used very well, but it was similar to this (and was a real-life study): patients were randomized to receive either fish oil or vegetable oil in their diet. The goal of the study was to determine if fish oil lowered cholesterol. At the end of the study, the fish oil group had slightly lower average cholesterol levels. A typical interpretation was, "If I take fish oil, my cholesterol will go down."

One problem with this interpretation is that it ignores the within-group variation. Some of the patients in the fish oil group saw their cholesterol go up; some saw little or no change. The study's conclusion is about group means, not about individuals. (There were other problems, too. This interpretation ignores the existence of the control group: we don't really know if fish oil improves cholesterol compared to your current diet; we know only that it tends to go down in comparison to a vegetable-oil diet. Also, we know the effects only for those who participated in the study. We assume they were not special people, but possibly the results won't hold for other groups.)
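A quick simulation makes the distinction between group means and individual outcomes easy to see (the numbers are invented, not the Auckland study's data):

set.seed(42)
fish_oil  <- rnorm(100, mean = -5, sd = 15)   # simulated change in cholesterol, fish-oil group
vegetable <- rnorm(100, mean =  0, sd = 15)   # simulated change in cholesterol, vegetable-oil group

mean(fish_oil) - mean(vegetable)   # the group-level conclusion: lower on average with fish oil
mean(fish_oil > 0)                 # yet a sizable share of fish-oil patients saw cholesterol rise

The group-level difference is real, but "my cholesterol will go down" is a claim about an individual, and the simulation shows how often that claim would be wrong.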

Understanding causality in probabilistic settings (or any setting) is a challenge for young students and even adults. I'm very excited to see such a distinguished group of researchers begin to help us understand it. Judea Pearl, at UCLA, has done much to encourage statisticians to think about the importance of teaching causal inference. Recently, he helped the American Statistical Association establish the Causality in Statistics Education prize, won this year by Felix Elwert, a sociologist at the University of Wisconsin-Madison. We still have a ways to go before we understand how best to teach this topic at the undergraduate level, and even further before we understand how to teach it at earlier levels. But, as the government shutdown has shown, understanding probabilistic causality is an important component of statistical literacy.