Thinking with technology

Just finished a stimulating, thought-provoking week at SRTL (the Statistical Reasoning, Thinking, and Literacy research forum), held this year in Two Harbors, Minnesota, right on Lake Superior. SRTL gathers statistics education researchers, most of whom come with cognitive or educational psychology credentials, every two years. It's more of a forum for thinking and collaborating than a platform for presenting findings, which means there's much lively, constructive discussion about works in progress.

I had meant to post my thoughts daily, but (a) the internet connection was unreliable and (b) there was just too much to digest. One recurring theme that really resonated with me was the way students interact with technology when thinking about statistics.
Much of the discussion centered on young learners, and most of the researchers (but not all) were in classrooms in which the students used TinkerPlots 2. TinkerPlots is a dynamic software system that lets kids build their own chance models. (It also lets them build their own graphics more-or-less from scratch.) They do this either by dropping “balls” into “urns” and labeling the balls with characteristics, or through spinners whose areas they can shade different colors. They can connect a series of spinners and urns to create sequences of independent or dependent events, and can collect the outcomes of their trials. Most importantly, they can carry out a large number of trials very quickly and graph the results.
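TinkerPlots itself is point-and-click, but for readers who want to play with the idea, here is a rough R analogue of an urn model. This is my own sketch, not anything the software produces, and the urn contents and trial counts are invented for illustration.

# a rough R analogue of a TinkerPlots urn model (illustrative only):
# an urn with 3 red balls and 2 blue balls, sampled 10 times per trial
urn = c(rep("red", 3), rep("blue", 2))
trials = replicate(1000, sum(sample(urn, 10, replace = TRUE) == "red"))
hist(trials, main = "Reds per 10 draws, over 1000 trials")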

What I found fascinating was the way in which students would come to judgments about situations, and then build a model that they thought would “prove” their point. After running some trials, when things didn't go as expected, they would go back and assess their model. Sometimes they'd realize that they had made a mistake, and they'd fix it. Other times, they'd see there was no mistake, and then realize that they had been thinking about the situation wrong. Sometimes, they'd come up with explanations for why they had been thinking about it incorrectly.

Janet Ainley put it very succinctly. (More succinctly and precisely than my re-telling.) This technology imposes a sort of discipline on students' thinking. Using the technology is easy enough that they can be creative, but the technology is rigid enough that their mistakes are made apparent. This means that mistakes are cheap, and attempts to repair mistakes are easily made. And so the technology itself becomes a form of communication that forces students into a greater level of precision than they can put into words.

I suppose that mathematics plays the same role, in that speaking with mathematics imposes great precision on the speaker. But that language takes time to learn, and few students reach a level of proficiency that allows them to use the language to construct new ideas. TinkerPlots, and software like it, gives students the ability to use a language to express new ideas with very little expertise. It was impressive to see 15-year-olds build models that incorporated both deterministic trends and fairly sophisticated random variability. More impressive still, the students were able to use these models to solve problems. In fact, I'm not sure they really knew they were building models at all, since their focus was on the problem solving.

TinkerPlots is aimed at a younger audience than the one I teach. But for me, the take-home message is to remember that statistical software isn't simply a tool for calculation, but a tool for thinking.

Here’s Looking At You!

What do we fear more?  Losing data privacy to our government, or to corporate entities?  On the one hand, we (still) have oversight over our government.  On the other hand, the government is (still) more powerful than most corporate entities, and so perhaps better situated to frighten.

In these times of Snowden and the NSA, the L.A. Times ran an interesting story about just how much tracking various internet companies perform, and it's alarming. (“They're watching your every move,” July 10, 2013. Interestingly, the story does not seem to appear on their website as of this posting.) Like the government, most of these companies claim that (a) their ‘snooping' is algorithmic, so no human sees the data, and (b) their data are anonymized. And yet…

To my knowledge, businesses aren't required to adhere to, or even acknowledge, any standards or practices for dealing with private data. Thus, a human could snoop on particular data, and we are left to ponder what that human would do with the information. In the best-case scenario, the human would be fired, as, according to the L.A. Times, Google did when it fired an engineer for snooping on the emails of some teenage girls.

But the data are anonymous, you say? Well, there's anonymous and then there's anonymous. As Latanya Sweeney taught us in the '90s, knowing a person's ZIP code, gender, and date of birth is enough to uniquely identify roughly 87% of Americans. And the L.A. Times reports a similar study in which just four hours of anonymized tracking data were sufficient to identify 95% of all individuals examined. So while your name might not be recorded, by merging enough data files, they will know it is you.
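The merging is the easy part. Here is a minimal sketch of the idea, with invented data frames standing in for an “anonymized” release and a public roster (say, a voter file) that share the three identifying fields:

# invented data for illustration: an "anonymized" record released without
# names, and a public roster that includes them
anon = data.frame(zip = "90210", gender = "F", dob = "1975-03-02",
                  diagnosis = "hypertension")
roster = data.frame(zip = "90210", gender = "F", dob = "1975-03-02",
                    name = "Jane Doe")
merge(anon, roster, by = c("zip", "gender", "dob"))  # re-identified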

This article fits in really nicely with a fascinating, revelatory book I'm currently midway through: Jaron Lanier's Who Owns The Future? A basic theme of this book is that internet technology devalues products and goods (files) and values services (software). One process through which this happens is that we humans accept the marvelous free stuff that the internet provides (free Google searches, free Amazon shipping, easily pirated music files) in exchange for allowing companies to snoop. The companies then turn our aggregated data into dollars by selling it to advertisers.

A side effect of this, Lanier explains, is a loss of social freedom. At some point, a service such as Facebook gets to be so large that failing to join means you are losing out on possibly rich social interactions. (Yes, I know there are those who walk among us who refuse to join Facebook. But these people are probably not reading this blog, particularly since our tracking ‘bots tell us that most of our readers come from Facebook referrals. Oops. Was I allowed to reveal that?) So perhaps you shouldn't complain about being snooped on, since you signed away your privacy rights. (You did read the entire user agreement, right? Raise your hand if you did. Thought so.) On the other hand, if you don't sign, you become a social pariah. (Well, an exaggeration. For now.)

Recently, I installed Ghostery, which tracks the automated snoopers that follow me during my browsing.  Not only “tracks”, but also blocks.  Go ahead and try it.  It’s surprising how many different sources are following your every on-line move.

I have mixed feelings about blocking this data flow. The data-snooping industry is big business, and is responsible, in part, for the boom in stats majors and, more importantly, the boom in stats employment. And so, indirectly, data-snooping is paying my salary. Lanier has an interesting solution: individuals should be paid for their data, particularly when it leads to value. This would mean the era of ‘free' is over; we might end up paying for searches and for reading Wikipedia. But he makes a persuasive case that the benefits exceed the costs. (Well, I'm only halfway through the book. But so far, the case is persuasive.)

DataFest 2013

DataFest is growing larger and larger. This year, we hosted an event at Duke (Mine organized this) with teams from NCSU and UNC, and one at UCLA (Rob organized) with teams from Pomona College, Cal State Long Beach, the University of Southern California, and UC Riverside. We are very grateful to Vaclav Petricek at eHarmony for providing us with the data, which consisted of roughly one million “user-candidate” pairs and a couple of hundred variables, including “words friends would use to describe you,” ideal characteristics in a partner, the importance of those characteristics, and the all-important ‘did she email him' and ‘did he email her' variables.

The students had a great time, and worked hard for 48 hours to prepare short presentations for the judges. This is the third year we've done this, and I'm increasingly impressed with the students' technical skills. (Which makes our life a lot easier, as far as providing help goes.) Or maybe it's just that I've been lucky enough to get more and more “VIP Consultants” (statisticians from off-campus) and talented, dedicated grad students to help out, so that I can be comfortably oblivious to the technical struggles. Or all of the above.

One thing I noticed that will definitely require some adjustment to our curriculum: our students had a hard time generating interesting questions from these data. Part of the challenge is to look at a large, rich dataset and think, “What can I show the world that the world would like to know?” Too many students went directly to model-fitting without making visuals, without engaging with the content of the material (a surprise, since we thought they would find it much easier to engage with than last year's micro-lending transaction data), and without strategizing around some Big Questions. Most of them managed to pull it off in the end, but they would have done better to brainstorm some good questions to pursue, and much better to start with the visuals.

One of the fun parts of DataFest is the presentations. Students have only 5 minutes and 2 slides to convince the judges of their worthiness. At UCLA, because we were concerned about having too many teams for the judges to endure, we had two rounds. First came a “speed dating” round in which participants had only 60 seconds and one slide. We surprised them by announcing, at the start, that to move on to the next round they would have to merge their team with one other team, and so these 60-second presentations should be viewed as pitches to potential partners. We had hoped that teams would match on similar themes or something, and this did happen; but many matches were between teams of friends. The “super teams” were then allowed to make a 5-minute presentation, and awards were given to these large teams. The judges gave two awards for Best Insight (one to a super-team from Pomona College and another to a super-team from UCLA) and one for Best Visualization (to the super-team from USC). We did have two inter-collegiate super-teams (UCLA/Cal State Long Beach and UCLA/UCR) make it to the final round.

If you want to host your own DataFest, drop a line to Mine or me and we can give you lots of advice.  And if you sit on a large, interesting data set we can use for next year, definitely drop us a line!

Datasets handpicked by students

I'm often on the hunt for datasets that will not only work well with the material we're covering in class, but will (hopefully) pique students' interest. One sure choice is to use data collected from the students, as it is easy to engage them with data about themselves. However, I think it is also important to open their eyes to the vast amount of data collected and made available to the public. It's always a guessing game whether a particular dataset will actually be interesting to students, so learning from the datasets they choose to work with seems like a good idea.

Below are a few datasets that I haven’t seen in previous project assignments. I’ve included the research question the students chose to pursue, but most of these datasets have multiple variables, so you might come up with different questions.

1. Religious service attendance and moral beliefs about contraceptive use: The data are from a February 2012 Pew Research poll. To download the dataset, go to http://www.people-press.org/category/datasets/?download=20039620. You will be prompted to fill out some information and will receive a zipped folder including the questionnaire, methodology, the “topline” (distributions of some of the responses), as well as the raw data in SPSS format (.sav file). Below I’ve provided some code to load this dataset in R, and then to clean it up a bit. Most of the code should apply to any dataset released by Pew Research.

# read the data (SPSS format) into a data frame
library(foreign)
d_raw = as.data.frame(read.spss("Feb12 political public.sav"))

# clean up: strip interviewer instructions from the response labels
library(stringr)
d = lapply(d_raw, function(x) str_replace(x, " \\[OR\\]", ""))
d = lapply(d, function(x) str_replace(x, "\\[VOL. DO NOT READ\\] ", ""))
d = lapply(d, function(x) str_replace(x, "\222", "'"))  # \222 is a stray curly apostrophe
d = lapply(d, function(x) str_replace(x, " \\(VOL.\\)", ""))
d = as.data.frame(d, stringsAsFactors = FALSE)  # lapply() returned a list; back to a data frame
d$partysum = factor(d$partysum)
levels(d$partysum) = c("Refused","Democrat","Independent","Republican","No preference","Other party")

The student who found this dataset was interested in examining the relationship between religious service attendance and views on contraceptive use. The code provided below can be used to organize the levels of these variables in a meaningful way, and to take a quick peek at a contingency table.

# variables of interest: reorder the levels so they read from most
# frequent attendance to least, and across the moral spectrum
d$attend = factor(d$attend, levels = c("More than once a week","Once a week", "Once or twice a month", "A few times a year", "Seldom", "Never", "Don't know/Refused"))
d$q40a = factor(d$q40a, levels = c("Morally acceptable","Morally wrong", "Not a moral issue", "Depends on situation", "Don't know/Refused"))
table(d$attend, d$q40a)  # rows: service attendance; columns: views on contraceptive use

2. Social network use and reading: Another student was interested in the relationship between the number of books read in the last year and social network use. This dataset is provided by the Pew Internet and American Life Project. You can download a .csv version of the data file at http://www.pewinternet.org/Shared-Content/Data-Sets/2012/February-2012–Search-Social-Networking-Sites-and-Politics.aspx. The questionnaire can also be found at this website. One of the variables of interest, number of books read in the past 12 months (q2), is recorded using the following scheme:

  • 0: none
  • 1-96: exact number
  • 97: 97 or more
  • 98: don’t know
  • 99: refused

This could be used to motivate a discussion about the importance of doing exploratory data analysis before jumping into inferential tests (like asking “Why are there no people who read more than 99 books?”), and also to point out the importance of checking the codebook.
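For instance, here is a minimal cleanup sketch; the file name below is a placeholder for whatever the Pew download is actually called:

# placeholder file name; use whatever the Pew download is actually called
books = read.csv("Feb-2012-SNS-and-politics.csv")
books$q2[books$q2 %in% c(98, 99)] = NA  # 98 = don't know, 99 = refused
summary(books$q2)  # the pile-up at 97 ("97 or more") should still stand out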

3. Parental involvement and disciplinary actions at schools: The 2007-2008 School Survey on Crime and Safety, conducted by the National Center for Education Statistics, contains school-level data on crime and safety. The dataset can be downloaded at http://nces.ed.gov/surveys/ssocs/data_products.asp. The SPSS-formatted version of the data file (.sav) can be loaded in R using the read.spss() function in the foreign package (used above in the first data example). The variables of interest for the particular research question the student proposed are parent involvement in school programs (C0204) and number of disciplinary actions (DISTOT08), but the dataset can be used to explore other interesting characteristics of schools, like the type of security guards, whether guards are armed with firearms, etc.
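The loading step mirrors the first example; the .sav file name below is a placeholder for whatever NCES calls the file you download:

library(foreign)
ssocs = as.data.frame(read.spss("ssocs08.sav"))  # placeholder file name
table(ssocs$C0204)  # parent involvement in school programs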

4. Dieting in school-aged children: The Health Behavior in School-Aged Children study is an international survey on the health-risk behaviors of children in grades 6 through 10. The 2005-2006 US dataset can be found at http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/28241. You will need to log in to download the dataset, but you can do so using a Google or a Facebook account. There are multiple versions of the dataset posted; the delimited version (.tsv) can be easily loaded in R using the read.delim() function. The student who found this dataset was interested in exploring the relationship between the race of the student (Q6_COMP) and whether or not the student is on a diet to lose weight (Q30). The survey also asks questions on body image, substance use, bullying, etc., that may be interesting to explore.
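Again, loading is a one-liner; the .tsv file name below is a placeholder for the actual ICPSR download:

hbsc = read.delim("28241-Data.tsv")  # placeholder file name
table(hbsc$Q6_COMP, hbsc$Q30)  # race by dieting-to-lose-weight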

One common feature of the above datasets is that they are all observational/survey-based, as it's more challenging to find experimental (raw) datasets online. Any suggestions?

participatory sensing

The Mobilize project, which I recently joined, centers a high school data-science curriculum around participatory sensing data.  What is participatory sensing, you ask?

I've recently been trying to answer this question, with mixed success. As the name suggests, PS data has to do with data collected from sensors, and so it has a streaming aspect to it. I like to think of it as observations on a living object. Like all living objects, whatever this thing is that's being observed, it changes, sometimes slowly, sometimes rapidly. The ‘participatory' means that it takes more than one person to measure it. (But I'm wondering if you would allow ‘participatory' to mean that the student participates in her own measurements/collection?) Initially, in Mobilize, PS meant specially equipped smartphones serving as sensors. Students could snap pictures of snack wrappers, or record their mood at a given moment, or record the mood of their snack food. A problem with relying on phones is that, as it turns out, teenagers aren't always that good with expensive equipment. And there's an equity issue, because what some people consider a common household item, others consider rare and precious. And smartphones, although growing in prevalence, are still not universally adopted by high school students, or even college students.

If we ditch the gadgetry, any human being can serve as a sensor. Asking a student to pause at a certain time of day to record, say, the noise level, or the temperature, or their frame of mind, or their level of hunger, is asking that student to be a sensor. If we can teach the student how to find something in the accumulated data about her life that she didn't know, and something that she finds useful, then she's more likely to develop what I heard Bill Finzer call a “data habit of mind.” She'll turn to data next time she has a question or problem, too.

Nothing in this process is trivial. Recording data on paper is one thing; recording it in a data file requires teaching students about flat files (which, again, something I've learned from Bill, are not necessarily intuitive), teaching them about delimiters between variables, and teaching them, basically, how to share so that someone else can upload and use their data. Many of my intro-stats college students don't know how to upload a data file into the computer, so I now teach it explicitly, with high, but not perfect, rates of success. And that's the easy part. How do we help them learn something of value about themselves or their world?
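To make the flat-file idea concrete, here is the sort of minimal example I have in mind; the file name and columns are invented:

# a tiny comma-delimited flat file, say "sleep_log.csv" (invented):
#   student,date,hours_sleep
#   ana,2013-04-01,7.5
#   ben,2013-04-01,6.0
sleep = read.csv("sleep_log.csv")  # one row per observation, one column per variable
str(sleep)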

I'm open to suggestions here. Please. One step seems to be to point them towards a larger context in which to make sense of their data. This larger context could be a social network, or a community, or larger datasets collected on large populations. And so students might need to learn how to compare their (rather paltry by comparison) data stream to a large national database (which will be more of a snapshot/panel approach, rather than a data stream). Or they will need to learn to merge their data with their classmates', and learn about looking for signals among variation, and comparing groups.
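That merging step is itself teachable. A sketch, assuming each student has uploaded a .csv with the same columns into a (hypothetical) class_uploads folder:

# stack every student's file into one class dataset (folder name invented)
files = list.files("class_uploads", pattern = "\\.csv$", full.names = TRUE)
class_data = do.call(rbind, lapply(files, read.csv))  # one row per observation, all students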

This is scary stuff. Traditionally, we teach students how to make sense of *our* data. And this is less scary because we've already made sense of the data and we know how to point the student towards making the “right” conclusions. But these PS data have not been analyzed before. Even if we the teachers have seen similar data, we have not seen these data. The student is really and truly functioning as a researcher, and the teacher doesn't know the conclusion. What's more disorienting, the teacher doesn't have control of the method. Traditionally, when we talk about the ‘shape' of a distribution, we trot out data sets that show the shapes we want the students to see. But if the students are gathering their own data, is the shape of a distribution necessarily useful? (It gets scarier at a meta-level: many teachers are novice statisticians, so how do we teach the teachers to be prepared to react to novel data?)

So I'll sign off with some questions. Suppose my classroom collects data on how many hours each student sleeps a night for, say, one month. We create a data file to include each student's data. Students do not know any statistics; this is their first data experience. What is the first thing we should show them? A distribution? Of what? What concepts do students bring to the table that will help them make sense of longitudinal data? If we don't start with distributions, should we start with an average curve? With an overlay of multiple time-series plots (“spaghetti plots”)? And what's the lesson, or what should be the lesson, in examining such plots?
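For concreteness, here is a sketch of that spaghetti-plot option with an invented class sleep log; the red strand is the class-average curve:

# invented data: 10 students' nightly sleep over 30 days
set.seed(1)
sleep = data.frame(student = rep(paste0("s", 1:10), each = 30),
                   day = rep(1:30, times = 10),
                   hours = round(rnorm(300, mean = 7, sd = 1), 1))
library(ggplot2)
ggplot(sleep, aes(x = day, y = hours, group = student)) +
  geom_line(alpha = 0.4) +                        # one strand per student
  stat_summary(aes(group = 1), fun = mean,
               geom = "line", colour = "red",
               linewidth = 1)                     # class-average curve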

Facebook Analytics

WolframAlpha has a tool that will analyze your Facebook network. I saw this a while ago, but HollyLynne reminded me of it recently, and I tried it out. You need to give the app(?) permission to access your account (which I am sure means access to your data for Wolfram), after which you are given all sorts of interesting, pretty info. Note that you can also opt to have Wolfram track your data over time in order to determine how your network is changing.

Some of the analyses are kind of informative, but others are not. Consider this scatterplot(???)-type plot, entitled “Weekly Distribution.” Tufte could include it in his next book of worthless graphs.

[Image: Wolfram|Alpha “Weekly Distribution” plot]

There are other analyses that are more useful. For example, I learned that my post announcing the Citizen Statistician blog was the most liked post I have, while the post showing photographic evidence that I held a baby as far back as 1976 was the most commented.

This plot was also interesting…too bad it is a pie chart (sigh).

[Image: pie chart from the Wolfram|Alpha Facebook report]

There is also a ton of other information, such as which friend has the most friends (Jayne at 1819), your youngest and oldest friends based on the reported birthdays, photos that are tagged the most, word clouds of your posts, etc.

This network was my favorite of them all. It shows the social insiders and outsiders in my network of friends, and identifies social connectors, neighbors, and gateways.

[Image: friend-network graph showing insiders, outsiders, connectors, neighbors, and gateways]

Once again, kind of a cool tool that works with the existing data, but there does not seem to be a way to obtain the data in a workable format.
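If Wolfram ever does expose the network, say as an edge list, identifying the “connectors” yourself would take only a few lines of R. A sketch with invented friendships, using betweenness as one reasonable stand-in for “connector”:

# invented edge list: one row per friendship
library(igraph)
edges = data.frame(from = c("ana", "ana", "ben", "cat"),
                   to   = c("ben", "cat", "cat", "dan"))
g = graph_from_data_frame(edges, directed = FALSE)
sort(betweenness(g), decreasing = TRUE)  # high betweenness ~ "social connector"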

Big Data Is Not the New Oil

Our colleague and dear friend John Holcomb sent an email to Rob and me asking if we had heard the phrase “Big data is the new oil.” Neither of us had, but according to Jer Thorp, ad executives are uttering this phrase upwards of 100 times a day.

Jer's article is worth a read. While he points out in the title that big data is not the new oil, he astutely suggests that the oil/data metaphor does work to an extent. After describing data as a human resource (a thesis of his TED talk), Jer makes, and expounds on, three points that resonated with me:

  1. People need to understand and experience data ownership.
  2. We need to have a more open conversation about data and ethics.
  3. We need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely.

I am not sure which “we” he is referring to, but I might argue that society at large needs to have this conversation, and, more importantly, that the data users/statisticians/executives who make the decisions to collect the data need to be having these conversations. Read the article at the Harvard Business Review Blog Network.

NCAA Basketball Visualization

It is time for the NCAA Basketball Tournament. Sixty-four teams dream big (er…I mean 68…well, actually, by now, 64), and schools like Iona and Florida Gulf Coast University (go Eagles!) are hoping that Robert Morris's astounding victory in the N.I.T. isn't just a flash in the pan.

My favorite part is filling out the bracket (see it below). (Imagine that…a statistician's favorite part of the whole thing is making predictions.) Even President Obama filled out a bracket [see it here].

[Image: Andy's bracket]

My method for making predictions? A complicated formula that involves “coolness” factors of team mascots, alphabetical order (but only conditional on particular seedings), waving of hands, and guesswork. But that was because I didn't have access to my student Rodrigo Zamith's latest blog post until today.

Rodrigo has put together side-by-side visualizations of many of the pertinent basketball statistics (e.g., points scored, rebounds, etc.) using the R package ggplot2. This would have been very helpful in the decisions where the mascot measure failed me and I was left with a toss-up (e.g., Oklahoma vs. San Diego State).
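Rodrigo's actual code and data live on his blog; purely to illustrate the kind of side-by-side view ggplot2 makes easy, here is a sketch with invented scoring data:

# invented per-game scoring data for two teams
set.seed(22)
pts = data.frame(team = rep(c("Minnesota", "UCLA"), each = 30),
                 points = round(c(rnorm(30, 70, 8), rnorm(30, 72, 8))))
library(ggplot2)
ggplot(pts, aes(x = points)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ team)  # one panel per team, side by side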

[Image: preview of the March 22 game between Minnesota and UCLA, from Rodrigo's blog]

Rodrigo has also made the data available on his blog, not only for the 2012-2013 season but for the previous two seasons as well. Check it out at Rodrigo's blog!

Now, all I have to do is hang tight until the 8:57pm (CST) game on March 22. Judging from the comparisons, it will be tight.

 

Your Flowing Data Defended

I had the privilege last week of listening to the dissertation defense of UCLA Stat’s newest PhD: Nathan Yau.  Congratulations, Nathan!

Nathan runs the very popular and fantastic blog Flowing Data, and his dissertation is, in part, about the creation of his app Your Flowing Data. Essentially, this is a tool for collecting and analyzing personal data: data about you and your life.

One aspect of the thesis I really liked is a description of types of insight he found in a paper by Pousman, Stasko, and Mateas (2007), “Casual information visualization: Depictions of data in everyday life” (IEEE Transactions on Visualization and Computer Graphics, 13(6): 1145-1152). Nathan quotes four types of insights:

  • Analytic insight. Nathan describes these as ‘traditional' statistical insights obtained from statistical models.
  • Awareness insight. “…remaining aware of data streams such as the weather, news…” People are simply aware that these everyday streams exist and so know to seek them out for information when needed.
  • Social insight. Involvement in social networks helps people define a place for themselves in relation to particular social contexts.
  • Reflective insight. Viewers take a step back from the data and can reflect on something they were perhaps unaware of, or have an emotional reaction.

With respect to my Walk to Venice Beach, I think it would be interesting to see how experiences such as that can be leveraged into insights in these categories. Although these insights are not hierarchical, it would also be interesting to see how they fit into understandings of statistical thinking and reasoning. For example, some stats ed researchers are grappling with the role of ‘informal' vs. ‘formal' statistical inference, and I see the last three insights as supporting informal inference (when inference is called for at all).

Nathan has lots to say about the role that developers can play in assisting people in gaining insight from data.  Our job, I believe, is to think carefully about the role that educators can play in strengthening these insights.  We spend too much time on the first insight, I think, and not enough time on the others.  But the others are what students will remember and use from their stats class.

Data Diary Assignment

My colleague Mark Hansen used to assign his class to keep a data diary. I decided to try it, to see what happened.  I asked my Intro Stats class (about 180 students) to choose a day in the upcoming week, and during the day, keep track of every event that left a ‘data trail.’  (We had talked a bit in class about what that meant, and about what devices were storing data.)  They were asked to write a paragraph summarizing the data trail, and to imagine what could be gleaned should someone have access to all of their data.

The results were interesting. The vast majority “got” it. The very few who didn't either kept too detailed a log (example: “11:01: text message, 11:02: text, 11:03: googled,” etc.) or simply wrote down their day's activities and said something vague like, “had someone been there with a camera, they would have seen me do these things.”

But those were very few (maybe 2 or 3). The rest were quite thoughtful. The sorts of events included purchases (gas, concert tickets, books), meal-card swipes, notes of CCTV locations, social events (texts, phone calls), virtual life (Facebook postings, Google searches), and classroom activities (clickers, enrollments). Many of the students were, to my reckoning, sophisticated about the sort of portrait that could be painted. They pointed out that with just one day's data, someone could have a pretty clear idea of their social structure. And by pooling the class's data, or the campus's, someone could form a very clear idea of where students were moving and, based on entertainment purchases, where they planned to be in the future. They noted that gas purchase records could be used to infer whether they lived on campus or off campus, and even, roughly, how far off.

Here’s my question for you:  what’s the next step?  Where do we go from here to build on this lesson?  And to what purpose?