Data Visualization Course for First-Year Students

A little over a year ago, we decided to propose a data visualization course at the first-year level. We had been thinking about this for a while, but never had the time to teach it given our scheduling constraints. When one of the other departments on campus was shut down and its faculty merged into other departments, we felt that the time was ripe to make this proposal.

Course description of the EPsy 1261 data visualization course

In putting together the proposal, we knew that:

  • The course would be primarily composed of social science students. My department, Educational Psychology, attracts students from the College of Education and Human Development (e.g., Child Psychology, Social Work, Family Social Science).
  • To attract students, it would be helpful if the course would fulfill the University’s Liberal Education (LE) requirement for Mathematical Thinking.

This led to several challenges and long discussions about the curriculum for this course. For example:

  • Should the class focus on producing data visualizations (very exciting for the students) or on understanding/interpreting existing visualizations (useful for most social science students)?
  • If we were going to produce data visualizations, which software tool would we use? Could this level of student handle R?
  • In order to meet the LE requirement, the curriculum for the course would need to show a rigorous treatment of students actually “doing” mathematics. How could we do this?
  • Which types of visualizations would we include in the course?
  • Would we use a textbook? How might this inform the content of the course?

Software and Content

After several conversations among the teaching team, with stakeholder departments, and with colleagues teaching data visualization courses at other universities, we eventually proposed that the course:

  • Focus both on students’ being able to read and understand existing visualizations and produce a subset of these visualizations, and
  • Use R (primary tool) and RAWGraphs for the production of these plots.

Software: Use ggplot2 in R

The choice to use R was not an immediate one. We initially looked at using Tableau, but the default choices made by the software (e.g., to immediately plot summaries rather than raw data) and the cost for students after completing the course eventually sealed its fate (we don’t use it). We contemplated using Excel for a minute (gasp!), but we vetoed that even more quickly than Tableau. The RAWGraphs website, we felt, held a lot of promise as a software tool for the course. It had an intuitive drag-and-drop interface, and could be used to create many of the plots we wanted students to produce. Unfortunately, we were not able to get the bar graph widget to produce side-by-side bar plots easily (or, actually, at all). The other drawback was that, had we used it as the primary tool, the drag-and-drop interactions would have made it a harder sell to the LE committee as a method of building students’ computational and mathematical thinking.

Once we settled on using R, we had to decide between using the suite of base plots and ggplot2 (lattice was not in the running). We decided that ggplot2 made the most sense in terms of extensibility. Its syntax is based on a theoretical foundation for creating and thinking about plots, which also made it a natural choice for a data visualization course. The idea of mapping variables to aesthetics was also consistent with the language used in RAWGraphs, so it helped reinforce core ideas across the tools. Lastly, we felt that using the ggplot2 syntax would also help students transition to other tools (such as ggvis or plotly) more easily.
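To make "mapping variables to aesthetics" concrete, here is a minimal sketch. It is not taken from the course materials: the `survey` data frame and its counts are made up for illustration. Each variable is mapped to one aesthetic (x position, bar height, fill color), and `position = "dodge"` gives the kind of side-by-side bar plot described above. The plot is only drawn if ggplot2 is installed.

```r
# Hypothetical toy data: counts of responses by group and answer.
survey <- data.frame(
  group  = rep(c("A", "B", "C"), each = 2),
  answer = rep(c("Yes", "No"), times = 3),
  n      = c(12, 5, 8, 9, 3, 14)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Map variables to aesthetics: group -> x, n -> y, answer -> fill.
  # position = "dodge" places the bars for each answer side by side.
  p_bars <- ggplot(survey, aes(x = group, y = n, fill = answer)) +
    geom_col(position = "dodge")

  print(p_bars)
}
```

Changing which variable is mapped to which aesthetic (for example, swapping `group` and `answer`) changes the plot without changing the data, which is the core idea shared with RAWGraphs.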

One thing that the teaching team completely agreed on (and that was mentioned by almost everyone we talked to who taught data visualization) was that we wanted students to be producing graphs very early in the course, giving them a sense of power and the reinforcement that they could be successful. We felt this might be difficult for students with the ggplot2 syntax. To ameliorate this, we wrote a course-specific R package (epsy1261; available on GitHub) that allows students to create a few simple plots interactively by employing functionality from the manipulate package. (We could have also done this via Shiny, but I am not as well-versed in Shiny and only had a few hours to devote to this over the summer given other responsibilities.)

Interactive creation of the bar chart using the epsy1261 package. This allows students to input minimal syntax, barchart(data), and then use interaction to create plots.

Course Content

We decided on a three-pronged approach to the course content. The first prong would be based on the production of common statistical plots: bar charts, scatterplots, and maps, and some variations of these (e.g., donut plots, treemaps, bubble charts). The second prong was focused on reading more complex plots (e.g., networks, alluvial plots), but not producing them, except maybe by hand. The third prong was a group project. This would give students a chance to use what they had learned, and also, perhaps, access other plots we had not covered. In addition, we wanted students to consider narrative in the presentation of these plots—to tell a data-driven story.

Along with this, we had hoped to introduce students to computational skills such as data summarization, tidying, and joining data sets. We also wanted to introduce concepts such as smoothing (especially for helping describe trend in scatterplots), color choice, and projection and coordinate systems (in maps). Other things we thought about were using R Markdown and data scraping.

Reality

The reality, as we are finding now that we are over a third of the way through the course, is that this amount of content was over-ambitious. We grossly under-estimated the amount of practice time these students would need, especially working with R. Two things play a role in this:

  1. The course attracted far more students than we expected for the first offering (our class size is 44), and students’ experiences and academic backgrounds are quite heterogeneous. For example, we have graduate students from the School of Design and some first-years, but mostly sophomores and juniors. We also have a variety of majors, including design, the social sciences, and computer science.
  2. We hypothesize that students are not practicing much outside of class. This means they are essentially only using R twice a week for 75 minutes when they are in class. This amount of practice is too infrequent for students to really learn the syntax.

Most of the students’ computational experiences are minimal prior to taking this course. They are very competent at using point-and-click software (e.g., Google Docs), but have an abundance of trouble when forced to use syntax. The precision of case-sensitivity, commas, and parentheses is outside their wheelhouse.

I would go so far as to say that several of these students are intimidated by the computation, and completely panic on facing an error message. This has led to us having to really think through and spend time discussing computational workflows and dealing with how to “de-bug” syntax to find errors. All of this has added more time than we anticipated on the actual computing. (While this may add time, it is still educationally useful for these students.)

The teaching team meets weekly for 90 minutes to discuss and reflect on what happened in the course. We also plan what will happen in the upcoming week based on what we observed and what we see in students’ homework. As of now, we clearly see that students need more practice, and we have begun giving students the end result of a plot and asking them to re-create these.

I am still hoping to get to scatterplots and maps in the course. However, some of the other computational ideas (scraping, joining) may have to be relegated to conceptual ideas in a reading. We are also considering scrapping the project, at least for this semester. At the very least, we will change it to a more structured set of plots they need to produce rather than letting them choose the data sets, etc. Live and learn. Next time we offer the course it will be better.

*Technology note: RAWGraphs can be adapted by designing additional chart types, so in theory, if one had time, we could write our own version to be more compatible with the course. We are also considering using the ggplotgui package, which is a Shiny dashboard for creating ggplot plots.

 

 

Mapping Irma, but not really…

We’re discussing data visualization nowadays in my course, and today’s topic was supposed to be mapping. However, late last night I realized I was going to run out of time and decided to table hands-on mapping exercises till a bit later in the course (after we do some data manipulation as well, which I think will work better).

That being said, talking about maps seemed timely, especially with Hurricane Irma developing. Here is how we went about it:

In addition to what’s on the slide, I told the students that they can assume the map is given, and that they should only think about how the forecast lines would be drawn.

Everyone came up with “we need latitude and longitude and time”. However, some teams suggested each column would represent one of the trajectories (wide data), while others came up with the idea of having an indicator column for the trajectory (long data). We sketched out on the board what these two data frames would look like, and evaluated which would be easier to plot directly using the tools we’ve learned so far (plotting in R with ggplot2).
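The two layouts the teams proposed might look something like the sketch below. The model names, times, and coordinates are invented for illustration; they are not actual Irma forecasts. The long layout is the one that plugs directly into ggplot2, because the indicator column can be mapped to an aesthetic. The plot is only drawn if ggplot2 is installed.

```r
# Wide layout (hypothetical values): one pair of columns per trajectory.
wide <- data.frame(
  hour  = c(0, 12, 24),
  lat_A = c(18.0, 19.0, 20.5),
  lon_A = c(-62.0, -64.5, -67.0),
  lat_B = c(18.0, 19.4, 21.2),
  lon_B = c(-62.0, -64.0, -66.1)
)

# Long layout: stack the trajectories and add an indicator column.
long <- rbind(
  data.frame(model = "A", hour = wide$hour, lat = wide$lat_A, lon = wide$lon_A),
  data.frame(model = "B", hour = wide$hour, lat = wide$lat_B, lon = wide$lon_B)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # geom_path() connects points in row order, which is time order here.
  p_tracks <- ggplot(long, aes(x = lon, y = lat, group = model, color = model)) +
    geom_path()

  print(p_tracks)
}
```

With the wide layout, each trajectory would need its own layer with its own pair of columns, which is why the long layout is easier to plot with what students already know.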

While this was a somewhat superficial activity compared to a hands-on mapping exercise, I thought it worked well for a variety of reasons:

  1. It was a timely example that grabbed students’ attention.
  2. It generated lively discussion around various ways of organizing data into data frames (which hopefully will serve as a good primer for the data manipulation unit where we’ll discuss how data don’t always come in the format you need and you might need to get it in shape first before you can visualize/analyze it).
  3. Working backwards from a visualization to source data (as opposed to from data to visualization) provided a different challenge/perspective, and a welcome break from “how do I get R to plot this?”.
  4. We got to talk about the fact that predictions based on the same source data can vary depending on the forecasting model (foreshadowing of concepts we will discuss in the modeling unit coming up later in the course).
  5. It was quick to prepare! And quick to work through in class (~5 mins of team discussion + ~10 mins of class discussion).

I also suggested to students that they read the underlying NYTimes article as well as this Upshot article if they’re interested in finding out more about modeling the path of a hurricane (or modeling anything, really) and uncertainty.

Data Science Webinar Announcement

I’m pleased to announce that on Monday, September 11, 9-11 am Pacific, I’ll be leading a Concord Consortium Data Science Education Webinar. Oddly, I forgot to give it a title, but it would be something like “Towards a Learning Trajectory for K-12 Data Science”. This webinar, like all Concord webinars, is intended to be highly interactive. Participants should have their favorite statistical software at the ready. A detailed abstract as well as registration information is here:
https://www.eventbrite.com/e/data-science-education-webinar-rob-gould-tickets-35216886656

At the same site you can view recent wonderful webinars by Cliff Konold, Hollylynne Lee and Tim Erickson.

Envisioning Data Science Webinar Series and Call for Input

Webinar Series: Data Science Undergraduate Education

Join the National Academies of Sciences, Engineering, and Medicine for a webinar series on undergraduate data science education. Webinars will take place on Tuesdays from 3-4pm ET starting on September 12 and ending on November 14. See below for the list of dates and themes for each webinar.

This webinar series is part of an input-gathering initiative for a National Academies study on Envisioning the Data Science Discipline: The Undergraduate Perspective. Learn more about the study, read the interim report, and share your thoughts with the committee on the study webpage at nas.edu/EnvisioningDS.

Webinar speakers will be posted as they are confirmed on the webinar series website.

Webinar Dates and Topics

  • 9/12/17 – Building Data Acumen
  • 9/19/17 – Incorporating Real-World Applications
  • 9/26/17 – Faculty Training and Curriculum Development
  • 10/3/17 – Communication Skills and Teamwork
  • 10/10/17 – Inter-Departmental Collaboration and Institutional Organization
  • 10/17/17 – Ethics
  • 10/24/17 – Assessment and Evaluation for Data Science Programs
  • 11/7/17 – Diversity, Inclusion, and Increasing Participation
  • 11/14/17 – Two-Year Colleges and Institutional Partnerships

All webinars take place from 3-4pm ET.  If you plan to join us online, please register to attend.  You will have the option to register for the entire webinar series or for individual webinars.

Share Your Input

The study committee is seeking public input for consideration in their upcoming report which will set forth a vision for the emerging discipline of data science at the undergraduate level.  To share your input with the committee, please fill out this form.

Revisiting that first day of class example

About a year ago I wrote this post: 

I wasn’t teaching that semester, so couldn’t take my own advice then, but thankfully (or the opposite of thankfully) Trump’s tweets still make timely discussion.

I had two goals for presenting this example on the first day of my data science course (to an audience of all first-year undergraduates, with little to no background in computing and statistics):

  1. Give a data analysis example with a familiar context
  2. Show that if they take the time to read the code, they can probably understand what it’s doing, at least at a high level

First, I provided them some context: “The author wanted to analyze Trump’s tweets: both the text, and some other information on the tweets like when and from what device they were posted.” And I asked the students “If you wanted to do this analysis, how would you go about collecting the data?”. Some suggested manual data collection, which we all agreed is too tedious. A few suggested there should be a way to get the data from Twitter. So then we went back to the blog post, and worked our way through some of the code. (My narrative is roughly outlined in handwriting below.)

The moral of the story: You don’t need to figure out how to write a program that gets tweets from Twitter. Someone else has already done it, and packaged it up (in a package called twitteR), and made it available for you to use. Here, the important message I tried to convey was that “No, I don’t expect you to know that this package exists, or to figure out how to use it. But I hope you agree that once you know the package exists, it’s worth the effort to figure out how to use its functionality to get the tweets, instead of collecting the data manually.”

Then, we discussed the following plot in detail:

First, I asked the students to come up with a list of variables we need in our dataset so that we can make this plot: we need to know what time each tweet was posted and what device it came from, and we need to know what percentage of tweets were posted in a given hour.
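One way to get from one-row-per-tweet data to those percentages is sketched below in base R. This is not the code from the blog post we read in class: the `tweets` data frame is a made-up toy, and computing the percentage within each device is an assumption about what the plot shows. The plot is only drawn if ggplot2 is installed.

```r
# Hypothetical toy data: one row per tweet, with device and hour posted.
tweets <- data.frame(
  source = c("Android", "Android", "Android", "Android", "iPhone", "iPhone"),
  hour   = c(7, 7, 8, 21, 12, 12)
)

# Count tweets for every device-by-hour combination.
counts <- table(source = tweets$source, hour = tweets$hour)

# Convert counts to proportions within each device (each row sums to 1).
pct <- prop.table(counts, margin = 1)

# Back to a data frame with one row per device-hour pair.
pct_df <- as.data.frame(pct, responseName = "percent")
pct_df$hour <- as.numeric(as.character(pct_df$hour))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # One line per device: hour on x, share of that device's tweets on y.
  p_hours <- ggplot(pct_df, aes(x = hour, y = percent, color = source)) +
    geom_line()

  print(p_hours)
}
```

The resulting data frame has exactly the three variables the students listed: device, hour, and percentage.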

Here is the breakdown of the code (again, my narrative is in the handwritten comments):

Once again, I wanted to show the students that if they take some time, they can probably figure out roughly what each line (ok, maybe not each, but most lines) of code is doing. We didn’t get into discussing what a geom is, what the difference between %>% and + is, what an aesthetic is, etc. We’ll get into those, but the semester is young…

My hope is that next time I present how to do something new in R, they’ll remember this experience of being able to mostly figure out what’s happening by taking some time staring at the code and thinking about “if I had to do this by hand, how would I go about it?”.

Modernizing the Undergraduate Statistics Curriculum at #JSM2017

I’m a bit late in posting this, but travel delays post-JSM left me weary, so I’m just getting around to it. Better late than never?

Wednesday at JSM featured an invited statistics education session on Modernizing the Undergraduate Statistics Curriculum. This session featured two types of speakers: those who are currently involved in undergraduate education and those who are on the receiving end of graduating majors. The speakers involved in undergraduate education presented on their recent efforts for modernizing the undergraduate statistics curriculum to provide the essential computational and problem solving skills expected from today’s modern statistician while also providing a firm grounding in theory and methods. The speakers representing industry discussed their expectations (or hopes and dreams) for new graduates and where they find gaps in the knowledge of new hires.

The speakers were  Nick Horton (Amherst College), Hilary Parker (Stitch Fix), Jo Hardin (Pomona College), and Colin Rundel (Duke University). The discussant was Rob Gould (UCLA). Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments.

Modernizing the undergraduate statistics curriculum: what are the theoretical underpinnings? – Nick Horton

Hopes and dreams for statistics graduates – Hilary Parker

Expectations and Skills for Undergraduate Students Doing Research in Statistics and Data Science – Jo Hardin

Moving Away from Ad Hoc Statistical Computing Education – Colin Rundel

Discussion – Rob Gould

Novel Approaches to First Statistics / Data Science Course at #JSM2017

Tuesday morning, bright and early at 8:30am, was our session titled “Novel Approaches to First Statistics / Data Science Course”. For some students the first course in statistics may be the only quantitative reasoning course they take in college. For others, it is the first of many in a statistics major curriculum. The content of this course depends on which audience the course is aimed at as well as its place in the curriculum. However, a data-centric approach with an emphasis on computation and algorithmic thinking is essential for all modern first statistics courses. The speakers in our session presented their approaches for the various first courses in statistics and data science that they have developed and taught. The discussion also highlighted pedagogical and curricular choices they have made in deciding what to keep, what to eliminate, and what to modify from the traditional introductory statistics curriculum. The speakers in the session were Ben Baumer from Smith College, Rebecca Nugent from CMU, myself, and Daniel Kaplan from Macalester College. Our esteemed discussant was Dick DeVeaux, and our chair, the person who managed to keep this rambunctious bunch on time, was Andrew Bray from Reed College. Here are the slides for each of the speakers. If you have any comments or questions, let us know in the comments, or find us on social media!

Ben Baumer – Three Methods Approach to Statistical Inference

Rebecca Nugent – Lessons Learned in Transitioning from “Intro to Statistics” to “Reasoning with Data”

Mine Cetinkaya-Rundel – A First-Year Undergraduate Data Science Course

Daniel Kaplan – Teaching Stats for Data Science

Dick DeVeaux – Discussion

 

My JSM 2017 itinerary

JSM 2017 is almost here. I just landed in Maryland, and I finally managed to finish combing through the entire program. What a packed schedule! I like writing an itinerary post each year, mainly so I can come back to it during and after the event. I obviously won’t make it to all sessions listed for each time slot below, but my decision for which one(s) to attend during any time period will likely depend on proximity to previous session, and potentially also proximity to childcare area.

The sessions I selected focus on education, data science, computing, visualization, and social responsibility. In addition to talks on topics I actively work in, I also enjoy listening to talks in application areas I’m interested in, hence the last topic on this list.

If you have suggestions for other sessions (in these topics or others) that you think would be interesting, let me know in the comments!

Sun, 7/30/2017

Sunday will be mostly meetings for me, and I’m skipping any evening stuff to see Andrew Bird & Belle and Sebastian!

Mon, 7/31/2017

  • DataFest meeting: 10am – 12pm at H-Key Ballroom 9. Stop by if you’re already an ASA DataFest organizer, or if you’d like to be one in the future!
    • First hour will be discussing what worked and what didn’t, any concerns, kudos, advice for new sites, etc.
    • Second hour will be drop-in for addressing any questions regarding organizing an ASA DataFest at your institution.
  • Computing and Graphics mixer: 6 – 8pm at H-Key Ballroom 1.
  • Caucus for Women in Statistics Reception and Business Meeting: 6:30 – 8:30pm at H-Holiday Ballroom 1&2.

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

ASA President’s Invited Speaker: It’s Not What You Said. It’s What They Heard – Jo Craven McGinty, The Wall Street Journal

Tue, 8/1/2017

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

Deming Lecture: A Rake’s Progress Revisited – Fritz Scheuren, NORC-University of Chicago

Wed, 8/2/2017

  • Statistical Education Business Meeting – 6-7:30pm

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

2:00 PM – 3:50 PM

4:00 PM – 5:50 PM

COPSS Awards and Fisher Lecture: The Importance of Statistics: Lessons from the Brain Sciences – Robert E. Kass, Carnegie Mellon University

Thur, 8/3/2017

8:30 AM – 10:20 AM

10:30 AM – 12:20 PM

Structuring Data in Middle School

Of the many provocative and exciting discussions at this year’s Statistical Reasoning, Thinking and Literacy (SRTL) conference in Rotorua, NZ, one that has stuck in my mind is from Lucia Zapata-Cardona, from the Universidad de Antioquia in Colombia. Lucia discussed data from her classroom observations of a teacher at a middle school (ages 12-13) in a “Northwest Colombian city”. The class was exciting for many reasons, but the reason I want to write about it here is that the teacher had the students structure and store their own data.

The classroom was remarkable – to my American eyes – for the large number of students (45) and for the noise (walls were thin, the playground was immediately outside, and windows were kept open because of the heat.) Despite this, the teacher led an inquiry-based discussion, skillfully prompting the students with questions from the back of the classroom. The discussion lasted over several days.

The students had collected data about the nutritional content of the foods they eat. Challenging students with real-world, meaningful problems is an important part of Prof. Zapata-Cardona’s research, since an important goal of education is to tie the world of the classroom to the real world. Lucia was interested in examining how (and whether) the students constructed and employed statistical models to reason with the data. (Modeling was the theme of this SRTL.) What fascinated me wasn’t the modeling, but the role that the structure of the data played in the students’ reasoning.

Students were asked to collect data on the food contained in their lunchboxes so that they could answer the statistical question “How nutritious is the food we bring to school in our lunchbox?” It’s important to note that in Colombia, as Lucia explained to us, the “lunch box” doesn’t contain actual lunch (which the students eat at home), but instead includes snacks for during the day. What interested me was that the teacher let the class, after discussion, decide how they would enter and organize the data. Now I’m not sure what parameters/options the students were given. I do know that the classroom had one computer, and students took turns entering the data into this computer. And I know that the students discussed which variables they wanted to store, and how they wanted to store them.

The pivotal decision here was that the students decided that each row would represent a food, for example, Chicle. They decided to record information about serving size, calories, fats, carbs, protein, sodium, sugars, and whether it was “processed” (5 g, 18, 0, 5, 0, 0, 0 and si, in case you were curious). They decided not to store information about how many students brought this food, or how many servings any individual student brought.

At this point, you may have realized that their statistical question is rather difficult, if not impossible, to answer given the format in which they stored the data. Had each case of the data been an individual lunchbox or an individual person, then the students might have made headway. Instead, they stumbled over issues about how to compare the total calories of the dataset with the total calories eaten by individuals. (After much discussion, most of the class “discovered” that the average amount was a good way of summarizing the data, but some of the more perceptive students pointed out that it wasn’t clear what the average really meant.)

Lucia’s forthcoming paper will go into the details about the good and the bad in the students’ statistical reasoning, and the ways in which they used (or failed to use) statistical models. But what was fascinating to me was the opportunity this provided for helping students understand how the structure of data affects the questions that we can ask, and how the questions we ask should first consider the structure of the data.

Too often, particularly in textbooks, there is no opportunity to reason about the structure of data. When a question is asked, the students are given appropriate data, and rarely allowed even to decide which variables to consider (since the provided data usually includes only the necessary variables), much less whether or not the data should be restructured or re-collected.

Another reason classrooms have avoided letting students structure their own data is that many real-life datasets have complicated structures. The data these students collected is really (or should have been) hierarchical. If the case is the lunchbox, a lunchbox is associated with a student and possibly with more than 1 item. If data are collected on multiple days, then there is nesting within days as well as the potential for missing variables or unequal record lengths.
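To make the structural point concrete, here is a small sketch in R. Only the 18 calories for Chicle comes from the class data described above; the other foods, the students, and the serving counts are invented for illustration. It contrasts the food-level table the class built with a second, lunchbox-level table, and shows that the per-lunchbox question needs both.

```r
# Food-level table, one row per food (the layout the class chose).
# Chicle's calories are from the post; the other rows are hypothetical.
foods <- data.frame(
  food     = c("Chicle", "Jugo", "Galletas"),
  calories = c(18, 110, 140)
)

# Lunchbox-level table (hypothetical): one row per item in a student's lunchbox.
lunchbox <- data.frame(
  student  = c("s1", "s1", "s2", "s3", "s3"),
  food     = c("Chicle", "Jugo", "Galletas", "Chicle", "Galletas"),
  servings = c(2, 1, 1, 1, 2)
)

# Join the two levels, then total the calories within each lunchbox.
merged <- merge(lunchbox, foods, by = "food")
merged$total <- merged$servings * merged$calories
per_student <- aggregate(total ~ student, data = merged, FUN = sum)

# Totals per lunchbox: s1 = 146, s2 = 140, s3 = 298
print(per_student)

# The average over foods and the average over lunchboxes answer different questions.
mean(foods$calories)
mean(per_student$total)
```

The first average describes the list of foods; the second describes what students actually bring, which is what the class's question was about.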

Data with such a complicated structure are simply not taught in middle schools, even though, as Lucia’s case study demonstrates, they arise easily from familiar contexts. These data are messy and complicated. Should we even open this Pandora’s box for middle school students, or should it wait until they are older? Is it enough to work with the simplified “flat” format such as the one these students came up with, and just modify the statistical question? Should students be taught how to manipulate such data into different formats to answer the questions they are interested in?

You might think hierarchical formats are beyond the middle school level, but work done by Cliff Konold and Bill Finzer, in the context of using the CODAP tool, suggests that it is possible. [I can’t find an online paper to link to for this result, but there are some leads here, and I’m told it has been approved for publication so should appear soon.]

So the question is: when do we teach students to reason with hierarchical data? When do we teach students to recognize that data can be stored in different formats? When do we teach students to convert data from one format to another?

We are back to the question I asked in my last blog: what’s the learning trajectory that takes statistical beginners and teaches them the computational and statistical tools to allow them to address fundamental questions that rely on data that, on the one hand, are complex but on the other hand are found in our day-to-day lives?

Are computers needed to teach Data Science?

One of the many nice things about summer is the time and space it allows for blogging. And, after a very stimulating SRTL conference (Statistical Reasoning, Thinking and Literacy) in Rotorua, New Zealand, there’s lots to blog about.

Let’s begin with a provocative posting by fellow SRTL-er Tim Erickson at his excellent blog A Best Case Scenario. I’ve known Tim for quite a while, and have enjoyed many interesting and challenging discussions. Tim is a creator of curricula par excellence, and has first-hand experience in what inspires and motivates students to think deeply about statistics.

The central question here is: Is computation (on a computer) necessary for learning data science? The learners here are beginners in K-12. Tim answers no, and I answer, tentatively, yes. Tim portrays me in his blog as being a bit more steadfast on this position than I really am. In truth the answer is, some; maybe; a little; I don’t know.

My own experience in the topic comes from the Mobilize project, in which we developed the course Introduction to Data Science for students in the Los Angeles Unified School District. (I’m pleased to say that the course is expanding. This summer, five new L.A.-area school districts will begin training teachers to teach this course.)

The course relies heavily on R via RStudio. Students begin by studying the structure of data, learning to identify cases and variables and to organize unstructured data into a “tidy” format. Next, they learn to “read” tidy datafiles into RStudio. The course ends with students learning some predictive modeling using Classification and Regression Trees. In between, they study some inference using randomization-based methods.

To be precise, the students don’t learn straight-up R. They work within a package developed by the Mobilize team (primarily James Molyneux, Amelia McNamara, Steve Nolen, Jeroen Ooms, and Hongsuda Tangmunarunkit) called mobilizR, which is based pretty heavily on the mosaic package developed by Randall Pruim, Danny Kaplan and Nick Horton.  The idea with these packages is to provide beginners to R with a unified syntax and a set of verbs that relate more directly to the analysts’ goals. The basic structure for (almost) all commands is

WhatIWantToDo(yvariable~xvariables, dataset)

For example, to see the average walking distance recorded by a fitbit by day of the week:

 > mean(Distance~DOW,data=fitbitdec)
    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
  1.900000  3.690000  2.020909  2.419091  1.432727  3.378182  3.644545

The idea is to provide students with a simplified syntax that “bridges the gap” between beginners of R and more advanced users. Hopefully, this frees up some of the cognitive load required to remember and employ R commands so that students can think strategically and statistically about problems they are trying to solve.
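For comparison, here is a rough sketch of how the same "goal(y ~ x, data)" idea looks in base R, without the mosaic-style wrapper. The `fitbit_toy` data frame is made up for illustration (it is not the fitbitdec data above), and `aggregate()` and `tapply()` are standard base R functions.

```r
# Hypothetical toy data: daily walking distance with day of week.
fitbit_toy <- data.frame(
  DOW      = c("Monday", "Monday", "Tuesday", "Tuesday", "Friday"),
  Distance = c(3.0, 4.0, 2.5, 3.5, 2.0)
)

# Base R already accepts a formula here, but the verb (mean) is passed as FUN.
by_day <- aggregate(Distance ~ DOW, data = fitbit_toy, FUN = mean)
print(by_day)

# Another base R route: no formula, and a different argument order to remember.
by_day_vec <- tapply(fitbit_toy$Distance, fitbit_toy$DOW, mean)
print(by_day_vec)

# With the formula-aware version of mean() shown in the post, the same request is:
#   mean(Distance ~ DOW, data = fitbit_toy)
```

Both base R versions give the same group means, but each asks the beginner to learn a different calling pattern, which is the cognitive load the unified syntax is meant to reduce. Note that the groups come out in alphabetical order (Friday, Monday, Tuesday), as they do in the output above.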

The “bridge the gap” terminology comes from Amelia McNamara, who used the term in her PhD dissertation. One of the many really useful ideas Amelia has given us is the notion that the gap needs to be bridged. Much of “traditional” statistics education holds to the idea that statistical concepts are primarily mathematical, and, for most people, it is sufficient to learn enough of the mathematical concepts so that they can react skeptically and critically to others’ analyses. What is exciting about data science in education is that students can do their own analyses. And if students are analyzing data and discovering on their own (instead of just trying to understand others’ findings), then we need to teach them to use software in such a way that they can transition to more professional practices.

And now, dear readers, we get to the heart of the matter. That gap is really hard to bridge. One reason is that we know little to nothing about the terrain. How do students learn coding when applied to data analysis? How does the technology they use mediate that experience? How can it enhance, rather than inhibit, understanding of statistical concepts and the ability to do data analysis intelligently?

In other words, what’s the learning trajectory?

Tim rightly points to CODAP, the Common Online Data Analysis Platform, as one tool that might help bridge the gap by providing students with some powerful data manipulation techniques. And I recently learned about data.world, which seems to be another attempt to help bridge the gap. But Amelia’s point is that it is not enough to give students the ability to do something; you have to give it to them so that they are prepared to learn the next step. And if the end-point of a statistics education involves coding, then those intermediate steps need to be developing students’ coding skills, as well as their statistical thinking. It’s not sufficient to help students learn statistics. They must simultaneously learn computation.

So how do we get there? One important initial step, I believe, is to really examine what the term “computational thinking” means when we apply it to data analysis. And that will be the subject of an upcoming summer blog.