Pie Charts. Are they worth the Fight?

Like Rob, I recently got back from ICOTS. What a great conference. Kudos to everyone who worked hard to organize and pull it off. In one of the sessions I was at, Amelia McNamara (@AmeliaMN) gave a nice presentation about how they were using data and computer science in high schools as a part of the Mobilize Project. At one point in the presentation she had a slide that showed a screenshot of the dashboard used in one of their apps. It looked something like this.

[Figure: screenshot of the Mobilize dashboard app]

During the Q&A, one of the critiques of the project was that they had displayed the data as a donut plot. “Pie charts (or any kin thereof) = bad” was the message. I don’t really want to fight about whether they are good or bad; the reality is probably in between. (Tufte, the most cited source for the ‘pie charts are bad’ rhetoric, never really said pie charts were bad, only that, given the space they took up, they were perhaps less informative than other graphical choices.) Do people have trouble reading radians? Sure. Is the message in the data obscured because of this? Most of the time, no.

[Figure: bar chart and donut plot of the ad data]

Here are the bar chart (often offered as the better alternative to the pie chart) and the donut plot for the data shown in the Mobilize dashboard screenshot. The message is that most of the advertisements were from posters and billboards. If people are interested in the n’s, that can be easily remedied by including them explicitly on the plot, which neither the bar plot nor the donut plot currently does. (The dashboard displays the actual numbers when you hover over the donut slice.)

It seems we are wasting our breath constantly criticizing people for choosing pie charts. Whether we like it or not, the public has adopted pie charts. (As is pointed out in this blog post, Leland Wilkinson even devotes a whole chapter to pie charts in his Grammar of Graphics book.) Maybe people are reasonably good at pulling out the often-not-so-subtle differences that are generally shown in a pie chart. After all, it isn’t hard to understand (even when using a 3-D exploding pie chart) that the message in this pie chart is that the “big 3” browsers have a strong hold on the market.

The bigger issue to me is that these types of graphs are only reasonable choices when examining simple group differences (the marginals). Isn’t life, and data, more complex than that? Is the distribution of browser type the same for Mac and PC users? For males and females? For different age groups? These are the more interesting questions.

The dashboard addresses this through interactivity between the multiple donut charts. Clicking a slice in the first plot shows the distribution of product types (the second plot) for those ads that fit the selected slice, that is, the conditional distributions.
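To make the marginal versus conditional distinction concrete, here is a minimal sketch in base R. The row totals match the ad counts used in this post, but the product categories and the split of counts across them are made up purely for illustration; they are not the Mobilize data.

```r
# Hypothetical cross-tabulation of ads. Row totals (529, 356, 59, 81) match
# the ad data in this post; the product categories and cell counts are
# invented for illustration only.
ads = matrix(
	c(200, 179, 150,
	  120, 136, 100,
	   20,  19,  20,
	   30,  21,  30),
	nrow = 4, byrow = TRUE,
	dimnames = list(
		type = c("Poster", "Billboard", "Bus", "Digital"),
		product = c("Food", "Entertainment", "Other")
	)
)

# Marginal distribution of ad type (what a single donut shows)
marginal = prop.table(rowSums(ads))

# Conditional distribution of product given ad type (each row sums to 1).
# Selecting a slice in the first donut corresponds to picking one row here.
conditional = prop.table(ads, margin = 1)

round(marginal, 3)
round(conditional, 3)
```

Comparing the rows of `conditional` is the tabular analogue of clicking through the slices of the first donut and watching the second one change.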

So it is my argument that, rather than referring to a graph choice as good or bad, we should instead focus on the underlying question prompting the graph in the first place. Mobilize acknowledges that complexity by addressing the need for conditional distributions. Interactivity and computing make pie charts a reasonable choice to display this.

*If those didn’t persuade you, perhaps you will be swayed by the food argument. Donuts and pies are two of my favorite food groups. Although bars are nice too. For a more tasty version of the donut plot, perhaps somebody should come up with a cronut plot.

**The ggplot2 syntax for the bar and donut plots is provided below. The syntax for the donut plot was adapted from this blog post.

# Input the ad data
ad = data.frame(
	type = c("Poster", "Billboard", "Bus", "Digital"),
	n = c(529, 356, 59, 81)
	)

# Bar plot
library(ggplot2)
ggplot(data = ad, aes(x = type, y = n, fill = type)) +
     geom_bar(stat = "identity", show.legend = FALSE) +
     theme_bw()

# Add additional columns to the data, needed for the donut plot.
ad$fraction = ad$n / sum(ad$n)
ad$ymax = cumsum(ad$fraction)
ad$ymin = c(0, head(ad$ymax, n = -1))

# Donut plot
ggplot(data = ad, aes(fill = type, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
     geom_rect(colour = "grey30", show.legend = FALSE) +
     coord_polar(theta = "y") +
     xlim(c(0, 4)) +
     theme_bw() +
     theme(panel.grid=element_blank()) +
     theme(axis.text=element_blank()) +
     theme(axis.ticks=element_blank()) +
     geom_text(aes(x = 3.5, y = ((ymin+ymax)/2), label = type)) +
     xlab("") +
     ylab("")
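As noted above, the n’s can be included explicitly on the plot. One possible way to do that for the bar plot is sketched below, using `geom_text()` to print each count above its bar. The `vjust` value and the object name `p` are just choices made for this sketch.

```r
library(ggplot2)

# Same ad data as above
ad = data.frame(
	type = c("Poster", "Billboard", "Bus", "Digital"),
	n = c(529, 356, 59, 81)
)

# Bar plot with the counts printed just above each bar
p = ggplot(data = ad, aes(x = type, y = n, fill = type)) +
	geom_col(show.legend = FALSE) +
	geom_text(aes(label = n), vjust = -0.5) +
	theme_bw()

print(p)
```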

 

 

Is Data Science Real?

Just came back from the International Conference on Teaching Statistics (ICOTS) in Flagstaff, AZ, filled with ideas. There were many thought-provoking talks, but what was even better were the thought-provoking conversations. One theme, at least for me, is just what is this thing called Data Science? One esteemed colleague suggested it was simply a re-branding. Other speakers used it somewhat pejoratively, in reference to outsiders (i.e., computer scientists). Here are some answers from panelists at a discussion on the future of technology in statistics education. All paraphrases are my own, and I take responsibility for any sloppiness, poor grammar, etc.

Webster West took the High Statistician point of view, one shared by many, including, on a good day, myself: Data Science consists of those things that are involved in analyzing data. I think most statisticians when reading this will feel like Molière’s Bourgeois Gentleman, who was pleasantly surprised to learn he’d been speaking prose all his life. But I think there’s more to it than that, because probably many statisticians don’t consider data scraping, data cleaning, and data management to be part of data analysis.

Nick Horton offered that data mining was an activity that could be considered part of data science. And he sees data mining as part of statistics. Not sure all statisticians would agree, since for many of us, data mining is a swear word used to refer to people who are lucky enough to discover something but have no idea why it was discovered. But he also offered a broader definition: using data to answer a statistical question. Which I quite like. It leaves open the door to many ways of answering the question; it doesn’t require any particular background or religion. It simply refers to those activities used to bring data to bear in answering a statistical question.

Bill Finzer relied on set theory:  data science is a partial union of math and statistics, subject matter knowledge, and computational thinking and programming in the service of making discoveries from data.  I’ve seen similar definitions and have found such a definition to be very useful in thinking about curriculum for a high school data science course.  It doesn’t contradict Nick’s definition, but is a little more precise.  As always, Bill has a knack for phrasing things just right without any practice.

Deb Nolan answered last, and I think I liked her answer the best. Data science encompasses the entire data analysis cycle, and addresses the issues you face in working with data within that cycle and the skills needed to complete that cycle. (I like to use this simplified version of the cycle: ask questions -> collect/consider/prepare data -> analyze data -> interpret data -> ask questions, etc.)

One reason I like Deb’s answer is that it’s the answer we arrived at in our Mobilize group that’s developing the Introduction to Data Science curriculum for Los Angeles Unified School District. (With a new and improved webpage appearing soon! I promise!) Lots of computational skills appear explicitly in the collect/prepare data bit of the cycle, but in fact, algorithmic thinking (thinking about processes of reproducibility and real-time analyses) can appear in all phases.

During this talk I had an epiphany about my own feelings towards a definition. The epiphany was sparked by an earlier talk by Daniel Frischemeier on the previous day, but brought into focus by this panel’s discussion.   (Is it possible to have a slow epiphany?)

Statistics educators have been big proponents of teaching “statistical thinking”, which is basically an approach to solving problems that involve uncertainty/variation and data. But for many of us, the bit of problem solving in which a computer is involved is ignored in our conceptualization of statistical thinking. To some extent, statistical thinking is considered to be independent of computation. We’d like to think that we’d reach the same conclusions regardless of which software we were using. While that’s true, I think it’s also true that our approach to solving the problem may be software dependent. We think differently with different software because different software enables different thought processes, in the same way that a pen and paper enables different processes than a word processor.

And so I think that we statisticians become data scientists the moment we reconceptualize statistical thinking to include using the computer.

What does this have to do with Daniel’s talk? Daniel has done a very interesting study in which he examined the problem solving approach of students in a statistics class. In this talk, he offered a model for the expert statistician problem solving process. Another version of the data analysis cycle, if you will. His cycle (built solidly on foundations of others) is Real Problem -> Statistical activity -> Software use -> Reading off/Documentation (interpreting) -> Conclusions -> Reasons (validation of conclusions) -> back to beginning.

I think data scientists are those who would think that the “software use” part of the cycle was subsumed by the statistical activity part of the cycle. In other words, when you approach data cleaning, data organizing, programming, etc. as if they were a fundamental component of statistical thinking, and not just something that stands in the way of your getting to real data analysis, then you are doing data science. Or, as my colleague Mark Hansen once told me, “Teaching R *is* teaching statistics.” Of course it’s possible to teach R so that it seems like something that gets in the way of (or delays) understanding statistics. But it’s also possible to teach it as a complement to developing statistical understanding.

I don’t mean this as a criticism of Daniel’s work, because certainly it’s useful to break complex activities into smaller parts. But I think that there is a figure-and-ground issue, in which statisticians have seen modeling and data analysis as the figure, and the computer as the ground. But when our thinking unites these views, we begin to think like data scientists. And so I do not think that “data science” is just a rebranding of statistics. It is a re-consideration of statistics that places greater emphasis on parts of the data cycle than statistics has traditionally placed.

I’m not done with this issue.  The term still bothers me.  Just what is the science in data science?  I feel a refresher course in Popper and Kuhn is in order.  Are we really thinking scientifically about data?  Comments and thoughts welcome.

Fathom Returns

The other shoe has fallen. Last week (or so) Tinkerplots returned to the market, and now Fathom Version 2.2 (which is the foundation on which Tinkerplots is built) is available for a free download. Details are available on Bill Finzer’s website.

Fathom is one of my favorite software packages. It was the first commercially available package to be based on learning theory, and its primary goal is to teach statistics. After a one-minute introduction, beginning students can quickly discuss ‘findings’ across several variables. So many classroom exercises involve only one or two variables, and Fathom taught me that this is unfair to students and artificially holds them back.

Welcome back, Fathom!

Tinkerplots Available Again

Very exciting news for Tinkerplots users (and for those who should be Tinkerplots users).  Tinkerplots is highly visual dynamic software that lets students design and implement simulation machines, and includes many very cool data analysis tools.

To quote from TP developer Cliff Konold:

Today we are releasing Version 2.2 of TinkerPlots.  This is a special, free version, which will expire in a year  — August 31, 2015.

To start the downloading process

Go to the TinkerPlots home page and click on the Download TinkerPlots link in the right hand panel. You’ll fill out a form. Shortly after submitting it, you’ll get an email with a link for downloading.

Help others find the TinkerPlots Download page

If you have a website, blog, or use a social media site, please help us get the word out so others can find the new TinkerPlots Download page. You could mention that you are using TinkerPlots 2.2 and link to www.srri.umass.edu/tinkerplots.

Why is this an expiring version?

As we explained in this correspondence, until January of 2014, TinkerPlots was published and sold by Key Curriculum, a division of McGraw Hill Education. Their decision to cease publication caught us off guard, and we have yet to come up with an alternative publishing plan. We created this special expiring version to meet the needs of users until we can get a new publishing plan in place.

What will happen after version 2.2 expires?

By August 2015, we will either have a new publisher lined up, or we will create another free version.  What is holding us up right now is our negotiations with the University of Massachusetts Amherst, who currently owns TinkerPlots.  Once they have decided about their future involvement with TinkerPlots, we can complete our discussions with various publishing partners.

If I have versions 2.0 or 2.1 should I delete them?

No, you should keep them. You already paid for these, and they are not substantively different from version 2.2. If and when a new version of TinkerPlots is ready for sale, you may not want to pay for it.  So keep your early version that you’ve already paid for.
Cliff and Craig

Lively R

Next week, the UseR conference comes to UCLA. And in anticipation, I thought a little foreshadowing would be nice. Amelia McNamara, UCLA Stats grad student and rising stats ed star, shared with me a new tool that has the potential to do some wonderful things. LivelyR is a work-in-progress that is, in the words of its creators, a “mashup of R with packages of Rstudio.” The result is a highly interactive environment. I was particularly struck and intrigued by the ‘sweeping’ function, which visually smears graphics across several parameter values. The demonstration shows how this can help in understanding the effects of bin-width and offset changes on a histogram, so that a more robust sense of the sample distribution shines through.
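For readers who want a feel for why bin width and offset matter, here is a rough static approximation in base R. It is not LivelyR and does not use its sweeping function; it simply draws the same simulated sample with several bin widths and offsets side by side. The helper name `make_breaks`, the simulated data, and the particular widths and offsets are all assumptions made for this sketch.

```r
# Simulated sample (made up for illustration)
set.seed(1)
x = rnorm(200)

# Hypothetical helper: bin edges of a given width, shifted left by
# `offset` (expressed as a fraction of the bin width), covering range(x)
make_breaks = function(x, width, offset = 0) {
	start = floor(min(x) / width) * width - offset * width
	seq(start, max(x) + width, by = width)
}

widths = c(0.25, 0.5, 1)
offsets = c(0, 0.5)

# One panel per (offset, width) combination
old_par = par(mfrow = c(length(offsets), length(widths)))
for (off in offsets) {
	for (w in widths) {
		hist(x, breaks = make_breaks(x, w, off),
			main = paste0("width = ", w, ", offset = ", off),
			xlab = "x")
	}
}
par(old_par)
```

Features that persist across all six panels are more likely to reflect the sample itself than an accident of where the bin edges fell.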

R is becoming a formidable educational tool, and I’m looking forward to learning more at UseR next week. For those of you in L.A. who can attend, Aran Lunzer will be talking about LivelyR at 4pm on Tuesday, July 1.

Data Privacy (L.A. Times)

The L.A. Times ran an article on data privacy today, which, I think it’s fair to say, puts “Big Data” in approximately the same category as fire. In the right hands, it can do good. But…

http://www.latimes.com/nation/politics/politicsnow/la-pn-white-house-big-data-privacy-report-20140501,0,5624003.story

Increasing the Numbers of Females in STEM

I just read a wonderful piece about how Harvey Mudd increased the proportion of females declaring a major in Computer Science from 10% to 40% since 2006. That is awesome!

One of the things that they attribute this success to is changing the name of their introductory course. They renamed the course from “Introduction to programming in Java” to “Creative Approaches to Problem Solving in Science and Engineering using Python”.

Now, clearly, they changed the language they were using (literally) as well, from Java to Python, but it does raise the question, “what’s in a name?” According to Jim Croce and Harvey Mudd, a lot. If you don’t believe that, just ask anyone who has been in a class with the moniker Data Science, or any publisher who has published a book recently entitled [Insert anything here] Using R.

It would be interesting to study the effect of changing a course name. Are there words or phrases that attract more students to the course (e.g., creative, problem solving)?  Are there gender differences? How long does the effect last? Is it a flash-in-the-pan? Or does it continue to attract students after a short time period? (My guess is that the teacher plays a large role in the continued attraction of students to the course.)

Looking at the effects of a name is not new. Stephen Dubner and Steve Levitt of Freakonomics fame have discussed research on whether a child’s name has an effect on a variety of outcomes such as educational achievement and future income [podcast], and suggest that it isn’t as predictive as some people believe. Perhaps someone could use some of their ideas and methods to examine the effect of course names.

Has anyone tried this with statistics (aside from Data Science)? I know Harvard put in place a course called Real Life Statistics: Your Chance for Happiness (or Misery) which got good numbers of students (and a lot of press). My sense is that this happens much more in liberal arts schools (David Moore’s Concepts and Controversies book springs to mind). What would good course words or phrases for statistics include? Evidence. Uncertainty. Data. Variation. Visualization. Understanding. Although these are words that statisticians use constantly, I have to admit they all sound better than An Introduction to Statistics.

 

An Open Letter to the TinkerPlots Community

I received the following from Cliff Konold:

We have just released the following to answer questions many have asked us about when TinkerPlots will be available for sale again. Unfortunately, we do not have a list of current users to send this to, so please distribute this to others you think would be interested.


March 21, 2014

As you may have discovered by now, you can no longer purchase TinkerPlots. Many of you who have been using TinkerPlots in courses and workshops have found your way to us asking if and when it will be available for purchase again. We expect soon, by this June.  But to allow you to make informed decisions about future instructional uses of TinkerPlots, we need to provide a little background.

On December 10, 2013, we received a letter from McGraw-Hill Education giving us notice that in 90 days they would be terminating their agreement with us to publish TinkerPlots. For those of you who remember Key Curriculum as our publisher, McGraw-Hill Education acquired Key in August 2012, and as part of that acquisition became the new publisher of The Geometer’s Sketchpad, Fathom, and TinkerPlots.

Though McGraw-Hill Education had informally told us of their plans to terminate sales of both TinkerPlots and Fathom as of December 31, 2013, we were nevertheless surprised when they actually did this. We were assuming this wouldn’t happen until mid March (i.e., 90 days). In any case, since January 1 of this year, no new licenses for TinkerPlots have been sold.

Fortunately, TinkerPlots is actually owned by our University, so we are now free to find another publisher. We are in ongoing discussions with four different organizations who have expressed interest in publishing TinkerPlots. But there are many components of TinkerPlots in addition to the application (data sets, activities, help manual, instructional movies, tutorials, on-line course materials, artwork, the license server/installer, the list of existing users), which McGraw-Hill Education does own that would be hard to do without; to replace them would require a significant undertaking. Fortunately, McGraw-Hill Education has indicated their willingness to transfer most all of these assets to us, and we are very grateful for this because they are not legally bound to do so.  However, we have not yet received any of these resources or written permission that we can use them. Until we do, we cannot realistically build and release another version of the application. We are in regular communication with people at McGraw-Hill Education who have assured us that they will begin very shortly to deliver to us these materials and official permissions for their use.

We have been telling folks that a new version of TinkerPlots will be available by June 2014, and we still think this a reasonable timeframe.  We’d give it about an 85% probability. By August, 98.2%.

In the meantime, if you have unused licenses for TinkerPlots, you will still be able to register new computers on that license number. To see how many licenses you have, go to License Information… under the Help menu. If you have one license, our memory is that you can actually register 3 computers on it — they built in a little leeway. From that same dialog box you can also deregister a computer and in this way free up a currently used license. (We just checked, and when the deregister dialog comes up, it now has the name of Sketchpad where TinkerPlots should be.  But ignore that. It’s just an indication of the publisher slowly phasing the name TinkerPlots out of its system.)

Also, the resource links under the TinkerPlots Help menu still take you to resources such as movies on the publisher’s site. They have told us, however, that after March 2015, they will discontinue hosting these materials on their web site. But by that time, all these should be available on the site of the new publisher.

We are so sorry for the inconvenience this interruption and the lack of communication has caused many of you. McGraw-Hill Education has not notified its existing users, and we don’t know who most of you are.  We have heard of several instances where teachers planning to start a course or workshop in a few days have suddenly learned that their students will not be able to purchase TinkerPlots, and they have had to quickly redesign their course. We understand that because of this ordeal, some of you will decide to jump ship on TinkerPlots. But we certainly hope that most of you will stick with us through this bumpy transition. We have put nearly 15 years of ourselves into the creation of TinkerPlots and the development of its community, and we are committed to keeping both going.

Cliff Konold and Craig Miller
The TinkerPlots Development Team
Scientific Reasoning Research Institute
University of Massachusetts Amherst
Amherst, Massachusetts

Email: konold@srri.umass.edu
Web:   www.umass.edu/srri/serg/