Just came back from the International Conference on Teaching Statistics (ICOTS) in Flagstaff, AZ filled with ideas. There were many thought-provoking talks, but what was even better were the thought-provoking conversations. One theme, at least for me, is just what is this thing called Data Science? One esteemed colleague suggested it was simply a re-branding. Other speakers used it somewhat pejoratively, in reference to outsiders (i.e., computer scientists). Here are some answers from panelists at a discussion on the future of technology in statistics education. All paraphrases are my own, and I take responsibility for any sloppiness, poor grammar, etc.
Webster West took the High Statistician point of view, one shared by many (including, on a good day, myself): Data Science consists of those things that are involved in analyzing data. I think most statisticians when reading this will feel like Moliere’s Bourgeois Gentleman, who was pleasantly surprised to learn he’d been speaking prose all his life. But I think there’s more to it than that, because probably many statisticians don’t consider data scraping, data cleaning, and data management to be part of data analysis.
Nick Horton offered that data mining was an activity that could be considered part of data science. And he sees data mining as part of statistics. Not sure all statisticians would agree, since for many of us, data mining is a swear word used to refer to people who are lucky enough to discover something but have no idea why it was discovered. But he also offered a broader definition: using data to answer a statistical question. Which I quite like. It leaves open the door to many ways of answering the question; it doesn’t require any particular background or religion. It simply covers those activities used to bring data to bear in answering a statistical question.
Bill Finzer relied on set theory: data science is a partial union of math and statistics, subject matter knowledge, and computational thinking and programming in the service of making discoveries from data. I’ve seen similar definitions and have found such a definition to be very useful in thinking about curriculum for a high school data science course. It doesn’t contradict Nick’s definition, but is a little more precise. As always, Bill has a knack for phrasing things just right without any practice.
Deb Nolan answered last, and I think I liked her answer the best. Data science encompasses the entire data analysis cycle, and addresses the issues you face in working with data within that cycle, and the skills needed to complete that cycle. (I like to use this simplified version of the cycle: ask questions –> collect/consider/prepare data –> analyze data –> interpret data –> ask questions, etc.)
One reason I like Deb’s answer is that it’s the answer we arrived at in our Mobilize group that’s developing the Introduction to Data Science curriculum for Los Angeles Unified School District. (With a new and improved webpage appearing soon! I promise!) Lots of computational skills appear explicitly in the collect/prepare data bit of the cycle, but in fact, algorithmic thinking (thinking about processes of reproducibility and real-time analyses) can appear in all phases.
During this talk I had an epiphany about my own feelings towards a definition. The epiphany was sparked by an earlier talk by Daniel Frischemeier on the previous day, but brought into focus by this panel’s discussion. (Is it possible to have a slow epiphany?)
Statistics educators have been big proponents of teaching “statistical thinking”, which is basically an approach to solving problems that involve uncertainty/variation and data. But for many of us, the bit of problem solving in which a computer is involved is ignored in our conceptualization of statistical thinking. To some extent, statistical thinking is considered to be independent of computation. We’d like to think that we’d reach the same conclusions regardless of which software we were using. While that’s true, I think it’s also true that our approach to solving the problem may be software dependent. We think differently with different software because different software enables different thought processes, in the same way that pen and paper enable different processes than a word processor.
And so I think that we statisticians become data scientists the moment we reconceptualize statistical thinking to include using the computer.
What does this have to do with Daniel’s talk? Daniel has done a very interesting study in which he examined the problem-solving approaches of students in a statistics class. In this talk, he offered a model for the expert statistician’s problem-solving process. Another version of the data analysis cycle, if you will. His cycle (built solidly on foundations of others) is Real Problem –> Statistical activity –> Software use –> Reading off/Documentation (interpreting) –> Conclusions –> Reasons (validation of conclusions) –> back to beginning.
I think data scientists are those who would think that the “software use” part of the cycle was subsumed by the statistical activity part of the cycle. In other words, when you approach data cleaning, data organizing, programming, etc. as if they were a fundamental component of statistical thinking, and not just something that stands in the way of your getting to real data analysis, then you are doing data science. Or, as my colleague Mark Hansen once told me, “Teaching R is teaching statistics.” Of course it’s possible to teach R so that it seems like something that gets in the way of (or delays) understanding statistics. But it’s also possible to teach it as a complement to developing statistical understanding.
I don’t mean this as a criticism of Daniel’s work, because certainly it’s useful to break complex activities into smaller parts. But I think that there is a figure-and-ground issue, in which statisticians have seen modeling and data analysis as the figure, and the computer as the ground. When our thinking unites these views, we begin to think like data scientists. And so I do not think that “data science” is just a rebranding of statistics. It is a reconsideration of statistics that places greater emphasis on parts of the data cycle than statistics traditionally has.
I’m not done with this issue. The term still bothers me. Just what is the science in data science? I feel a refresher course in Popper and Kuhn is in order. Are we really thinking scientifically about data? Comments and thoughts welcome.