Is Data Science Real?

Just came back from the International Conference on Teaching Statistics (ICOTS) in Flagstaff, AZ filled with ideas.  There were many thought-provoking talks, but what was even better were the thought-provoking conversations.  One theme, at least for me, is just what is this thing called Data Science?  One esteemed colleague suggested it was simply a re-branding.  Other speakers used it somewhat perjoratively, in reference  to outsiders (i.e. computer scientists).   Here are some answers from panelists at a discussion on the future of technology in statistics education.  All paraphrases are my own, and I take responsibility for any sloppiness, poor grammar, etc.

Webster West took the High Statistician point of view—one shared by many, including, on a good day, myself: Data Science consists of those things that are involved in analyzing data.  I think most statisticians when reading this will feel like Moliere’s Bourgeois Gentleman, who was pleasantly surprised to learn he’d been speaking prose all his life.  But I think there’s more to it then that, because probably many statisticians don’t consider data scraping, data cleaning, data management as part of data analysis.

Nick Horton offered that data mining was an activity that could be considered part of data science.  And he sees data mining as part of statistics.  Not sure all statisticians would agree, since for many of us, data mining is a swear word used to refer to people who are lucky enough to discover something but have no idea why it was discovered.  But he also offered a broader definition:  using data to answer a statistical question.   Which I quite like.  It leaves open the door to many ways of answering the question; it doesn’t require any particular background or religion, it simply means that those activities used to bring data to bear in answering a statistical question.

Bill Finzer relied on set theory:  data science is a partial union of math and statistics, subject matter knowledge, and computational thinking and programming in the service of making discoveries from data.  I’ve seen similar definitions and have found such a definition to be very useful in thinking about curriculum for a high school data science course.  It doesn’t contradict Nick’s definition, but is a little more precise.  As always, Bill has a knack for phrasing things just right without any practice.

Deb Nolan answered last, and I think I liked her answer the best.  Data science encompasses the entire data analysis cycle, and addresses the issue you face in terms of working with data within that cycle, and the skills needed to complete that cycle.  (I like to use this simplified version of the cycle:  ask questions–>collect/consider/prepare data –>analyze data–> interpret data–>ask questions, etc.)

One reason I like Deb’s answer is that its the answer we arrived at in our Mobilize group that’s developing the Introduction to Data Science curriculum for Los Angeles Unified School District.  (With a new and improved webpage appearing soon! I promise!)  Lots of computational skills appear explicitly in the collect/prepare data bit of the cycle, but in fact, algorithmic thinking — thinking about processes of reproducibility and real-time analyses–can appear in all phases.

During this talk I had an epiphany about my own feelings towards a definition. The epiphany was sparked by an earlier talk by Daniel Frischemeier on the previous day, but brought into focus by this panel’s discussion.   (Is it possible to have a slow epiphany?)

Statistics educators have been big proponents of teaching “statistical thinking”, which is basically an approach to solving problems that involve uncertainty/variation and data.  But for many of us, the bit of problem solving in which a computer is involved is ignored in our conceptualization of statistical thinking.  To some extent, statistical thinking is considered to be independent of computation.  We’d like to think that we’d reach the same conclusions regardless of which software we were using.  While that’s true, I think it’s also true that our approach to solving the problem may be software dependent.  We think differently with different softwares because different softwares enable different thought processes, in the same way that a pen and paper enables different processes then a word processor.

And so I think that we statisticians become data scientists the moment we reconceptualize statistical thinking to include using the computer.

What does this have to do with Daniel’s talk?   Daniel has done a very interesting study in which he examined the problem solving approach of students in a statistics class.  In this talk, he offered a model for the expert statistician problem solving process.  Another version of the data analysis cycle, if you will.  His cycle (built solidly on foundations of others) is Real Problem –> Statistical activity –> Software use–> Reading off/Documentation (interpreting) –> conclusions –> reasons (validation of conclusions)–> back to beginning.

I think data scientists are those who would think that the “software use” part of the cycle was subsumed by the statistical activity part of the cycle. In other words, when you approach data cleaning, data organizing, programming, etc. as if they were a fundamental component of statistical thinking, and not just something that stands in the way of your getting to real data analysis, then you are doing data science.  Or, as my colleague Mark Hansen once told me, “Teaching R  *is* teaching statistics.”  Of course its possible to teach R so that it seems like something that gets in the way of (or delays) understanding statistics.  But it’s also possible to teach it as a complement to developing statistical understanding.

I don’t mean this as a criticism of Daniel’s work, because certainly it’s useful to break complex activities into smaller parts.  But I think that there is a figure-and-ground issue, in which statisticians have seen modeling and data analysis as the figure, and the computer as the ground.  But when our thinking unites these views, we begin to think like data scientists.  And so I do not think that “data science” is just a rebranding of statistics. It is a re-consideration of statistics that places greater emphasis on parts of the data cycle than traditionally statistics has placed.

I’m not done with this issue.  The term still bothers me.  Just what is the science in data science?  I feel a refresher course in Popper and Kuhn is in order.  Are we really thinking scientifically about data?  Comments and thoughts welcome.

Fathom Returns

The other shoe has fallen.  Last week (or so) Tinkerplots returned to the market, and now Fathom Version 2.2 (which is the foundation on which Tinkerplots is built) is  available for a free download.  Details are available on Bill Finzer‘s website.

Fathom is one of my favorite softwares…the first commercially available package to be based on learning theory, Fathom’s primary goal is to teach statistics.  After a one-minute introduction, beginning students can quickly discuss ‘findings’ across several variables.  So many classroom exercises involve only one or two variables, and Fathom taught me  that this is unfair to students and artificially holds them back.

Welcome back, Fathom!

Tinkerplots Available Again

Very exciting news for Tinkerplots users (and for those who should be Tinkerplots users).  Tinkerplots is highly visual dynamic software that lets students design and implement simulation machines, and includes many very cool data analysis tools.

To quote from TP developer Cliff Konold:

Today we are releasing Version 2.2 of TinkerPlots.  This is a special, free version, which will expire in a year  — August 31, 2015.

To start the downloading process

Go to the TinkerPlots home page and click on the Download TinkerPlots link in the right hand panel. You’ll fill out a form. Shortly after submitting it, you’ll get an email with a link for downloading.

Help others find the TinkerPlots Download page

If you have a website, blog, or use a social media site, please help us get the word out so others can find the new TinkerPlots Download page. You could mention that you are using TinkerPlots 2.2 and link to www.srri.umass.edu/tinkerplots.

Why is this an expiring version?

As we explained in this correspondence, until January of 2014, TinkerPlots was published and sold by Key Curriculum, a division of McGraw Hill Education. Their decision to cease publication caught us off guard, and we have yet to come up with an alternative publishing plan. We created this special expiring version to meet the needs of users until we can get a new publishing plan in place.

What will happen after version 2.2 expires?

By August 2015, we will either have a new publisher lined up, or we will create another free version.  What is holding us up right now is our negotiations with the University of Massachusetts Amherst, who currently owns TinkerPlots.  Once they have decided about their future involvement with TinkerPlots, we can complete our discussions with various publishing partners.

If I have versions 2.0 or 2.1 should I delete them?

No, you should keep them. You already paid for these, and they are not substantively different from version 2.2. If and when a new version of TinkerPlots is ready for sale, you may not want to pay for it.  So keep your early version that you’ve already paid for.
Cliff and Craig

Lively R

Next week, the UseR conference comes to UCLA.  And in anticipation, I thought a little foreshadowing would be nice.  Amelia McNamara, UCLA Stats grad student and rising stats ed star, shared with me a new tool that has the potential to do some wonderful things.  LivelyR is a work-in-progress that is, in the words of its creators, a “mashup of R with packages of Rstudio.” The result is a highly interactive.  I was particularly struck by and intrigued by the ‘sweeping’ function, which visually smears graphics across several parameter values.  The demonstration shows how this can help understand the effects of bin-width and off-set changes on a histogram so that a more robust sense of the sample distribution shines through.

R is beginning to become a formidable educational tool, and I’m looking forward to learning more at UseR next week. For those of you in L.A. who can attend, Aron Lunzer will be talking about LivelyR at 4pm on Tuesday, July 1.

Data Privacy (L.A. Times)

The L.A. Times ran an article on data privacy today, which, I think it’s fair to say, puts “Big Data” in approximately the same category as fire. In the right hands, it can do good. But…

http://www.latimes.com/nation/politics/politicsnow/la-pn-white-house-big-data-privacy-report-20140501,0,5624003.story

Personal Data Apps

Fitbit, you know I love you and you’ll always have a special place in my pocket.  But now I have to make room for the Moves app to play a special role in my capture-the-moment-with-data existence.

Moves is an ios7 app that is free.  It eats up some extra battery power and in exchange records your location and merges this with various databases and syncs it up to other databases and produces some very nice “story lines” that remind you about the day you had and, as a bonus, can motivate you to improved your activity levels.  I’ve attached two example storylines that do not make it too embarrassingly clear how little exercise I have been getting. (I have what I can consider legitimate excuses, and once I get the dataset downloaded, maybe I’ll add them as covariates.)  One of the timelines is from a day that included an evening trip to Disneyland. The other is a Saturday spent running errands and capped with dinner at a friend’s.  Its pretty easy to tell which day is which.

movings1movings2

But there’s more.  Moves has an API, thus allowing developers to tap into their datastream to create apps.  There’s an app that exports the data for you (although I haven’t really had success with it yet) and several that create journals based on your Moves data.  You can also merge Foursquare, Twitter, and all the usual suspects.

I think it might be fun to have students discuss how one could go from the data Moves collects to creating the storylines it makes.  For instance, how does it know I’m in a car, and not just a very fast runner?  Actually, given LA traffic, a better question is how it knows I’m stuck in traffic and not just strolling down the freeway at a leisurely pace? (Answering these questions requires another type of inference than what we normally teach in statistics. )  Besides journals, what apps might they create with these data and what additional data would they need?

The Future of Inference

We had an interesting departmental seminar last week, thanks to our post-doc Joakim Ekstrom, that I thought would be fun to share.  The topic was The Future of Statistics discussed by a panel of three statisticians.  From left to right in the room: Songchun Zhu (UCLA Statistics), Susan Paddock (RAND), and Jan DeLeeuw (UCLA Statistics).  The panel was asked about the future of inference: waxing or waning.

The answers spanned the spectrum from “More” to “Less” and did so, interestingly enough, as one moved left to right in order of seating.  Songchun staked a claim for waxing, in part because  he knows of groups that are hiring statisticians instead of computer scientists because statisticians’ inclination to cast problems in an inferential context makes them more capable of finding conclusions in data, and not simply presenting summaries and visualizations.  Susan felt that it was neither waxing nor waning, and pointed out that she and many of the statisticians she knows spend much of their time doing inference.  Jan said that inference as an activity belongs in the substantive field that raised the problem.  Statisticians should not do inference.  Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.

I think one reason that many stats educators might object to this because its hard to think of how else to fill the curriculum.  That might have been an issue when most students took a single Introductory course in their early twenties and then never saw statistics again.  But now we must think of the long game, and realize that students begin learning statistics early.  The Common Core stakes out one learning pathway, but we should be looking ahead, and thinking of future curricula, since the importance of statistics will grow.

If statistics is the science of data, I suggest we spend more time thinking about how to teach students to behave more like scientists.  And this means thinking seriously about how we can  develop their sense of curiosity.  The Common Core introduces the notion of a ‘statistical question’– a question that recognizes variability.  To the statisticians reading this, this needs no more explanation.  But I’ve found it surprisingly difficult to teach this practice to math teachers teaching statistics.  I’m not sure, yet, why this is.  Part of the reason might be that in order to answer a statistical question such as “What is the most popular favorite color in this class” we must ask the non-statistical question “What is your favorite color.”  But there’s more to it than that.  A good statistical question isn’t as simple as the one I mentioned, and leads to discovery beyond the mere satisfaction of curiosity.  I’m reminded of the Census at Schools program that encouraged students to become Data Detectives.

In short, its time to think seriously about teaching students why they should want to do data analysis.  And if we’re successful, they’ll want to learn how to do inference.

So what role does inference play in your Ideal Statistics Curriculum?

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct 30) encouraging City Hall to make its data available to the public.  As you know, fellow Citizens, we’re all in favor of making data public, particularly if the public has already picked up the bill and if no individual’s dignity will be compromised.  For me this editorial comes at a time when I’ve been feeling particularly down about the quality of public data.  As I’ve been looking around for data to update my book and for the Mobilize project, I’m convinced that data are getting harder, and not easier. to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples.  (We compare the participatory data in gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up-to-date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheapest gas. )  Used to be you could type in a zip code and have access to a nice data set that showed current prices, names and locations of gas stations, dates of the last reported price, and the username of the person who reported the price.  Now, you can scroll through an unsorted list of cities and states and get the same information only for the 15 cheapest and most expensive stations.

About 2 years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US.  I’ve looked and looked, but the data.gov AirData site now requires that I enter the name of each city in one at a time, and download very raw data for each city separately.  Now raw data are good things, and I’m glad to see it offered. But is it really so difficult to provide some common sensically aggregated data sets?

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.