Personal Data Apps

Fitbit, you know I love you and you’ll always have a special place in my pocket.  But now I have to make room for the Moves app to play a special role in my capture-the-moment-with-data existence.

Moves is a free iOS 7 app.  It eats up some extra battery power and, in exchange, records your location, merges this with various databases, syncs it up to other databases, and produces some very nice “story lines” that remind you about the day you had and, as a bonus, can motivate you to improve your activity levels.  I’ve attached two example storylines that do not make it too embarrassingly clear how little exercise I have been getting. (I have what I consider legitimate excuses, and once I get the dataset downloaded, maybe I’ll add them as covariates.)  One of the timelines is from a day that included an evening trip to Disneyland. The other is a Saturday spent running errands and capped with dinner at a friend’s.  It’s pretty easy to tell which day is which.

[Moves storyline screenshots: movings1, movings2]

But there’s more.  Moves has an API, allowing developers to tap into its data stream to create apps.  There’s an app that exports the data for you (although I haven’t really had success with it yet) and several that create journals based on your Moves data.  You can also merge Foursquare, Twitter, and all the usual suspects.
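For the curious, pulling your own data out of the API is not much work. Here is a minimal sketch of what a request for one day’s storyline might look like; the endpoint path, date format, and response structure are my assumptions about a typical OAuth-protected REST API rather than a transcription of the official Moves documentation:

```python
# Sketch: pulling one day's storyline from a Moves-style REST API.
# The endpoint path, date format, and response keys are assumptions for
# illustration; consult the actual API documentation before relying on them.
import requests

ACCESS_TOKEN = "your-oauth-token-here"          # obtained via the app's OAuth flow
BASE_URL = "https://api.moves-app.com/api/1.1"  # assumed base URL

def get_storyline(date_str):
    """Fetch the storyline (segments of places and movement) for one day."""
    resp = requests.get(
        f"{BASE_URL}/user/storyline/daily",
        params={"date": date_str},               # e.g. "20131102"
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    day = get_storyline("20131102")
    for segment in day[0].get("segments", []):   # assumed response structure
        print(segment.get("type"), segment.get("startTime"), segment.get("endTime"))
```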

I think it might be fun to have students discuss how one could go from the data Moves collects to creating the storylines it makes.  For instance, how does it know I’m in a car, and not just a very fast runner?  Actually, given LA traffic, a better question is how it knows I’m stuck in traffic and not just strolling down the freeway at a leisurely pace. (Answering these questions requires a different type of inference from the kind we normally teach in statistics.)  Besides journals, what apps might they create with these data, and what additional data would they need?
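To give a flavor of the reasoning involved, here is a deliberately naive sketch: compute the speed between consecutive GPS fixes and bin it with hand-picked thresholds. This is certainly not Moves’ actual algorithm (which isn’t public), the cutoffs are my own rough guesses, and the point of the classroom discussion is precisely to find where such a rule breaks down.

```python
# A deliberately naive transport-mode classifier: compute the speed between
# consecutive GPS fixes and bin it with rough, hand-picked thresholds.
# (Not how Moves actually does it; a discussion starter only.)
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def classify(speed_kmh):
    """Guess a transport mode from speed alone -- which is exactly why a car
    crawling through LA traffic looks like a pedestrian to this rule."""
    if speed_kmh < 7:
        return "walking"
    if speed_kmh < 25:
        return "running or cycling"
    return "vehicle"

# Hypothetical GPS fixes: (latitude, longitude, seconds since start)
fixes = [(34.0522, -118.2437, 0), (34.0600, -118.2437, 600), (34.1500, -118.2437, 1200)]
for (la1, lo1, t1), (la2, lo2, t2) in zip(fixes, fixes[1:]):
    speed = haversine_km(la1, lo1, la2, lo2) / ((t2 - t1) / 3600)
    print(f"{speed:5.1f} km/h -> {classify(speed)}")
```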

The Future of Inference

We had an interesting departmental seminar last week, thanks to our post-doc Joakim Ekstrom, that I thought would be fun to share.  The topic was The Future of Statistics discussed by a panel of three statisticians.  From left to right in the room: Songchun Zhu (UCLA Statistics), Susan Paddock (RAND), and Jan DeLeeuw (UCLA Statistics).  The panel was asked about the future of inference: waxing or waning.

The answers spanned the spectrum from “More” to “Less” and did so, interestingly enough, as one moved left to right in order of seating.  Songchun staked a claim for waxing, in part because  he knows of groups that are hiring statisticians instead of computer scientists because statisticians’ inclination to cast problems in an inferential context makes them more capable of finding conclusions in data, and not simply presenting summaries and visualizations.  Susan felt that it was neither waxing nor waning, and pointed out that she and many of the statisticians she knows spend much of their time doing inference.  Jan said that inference as an activity belongs in the substantive field that raised the problem.  Statisticians should not do inference.  Statisticians might, he said, design tools to help specialists have an easier time doing inference. But the inferential act itself requires intimate substantive knowledge, and so the statistician can assist, but not do.

I think one reason many stats educators might object to this is that it’s hard to think of how else to fill the curriculum.  That might have been an issue when most students took a single introductory course in their early twenties and then never saw statistics again.  But now we must think of the long game, and realize that students begin learning statistics early.  The Common Core stakes out one learning pathway, but we should be looking ahead and thinking of future curricula, since the importance of statistics will grow.

If statistics is the science of data, I suggest we spend more time thinking about how to teach students to behave more like scientists.  And this means thinking seriously about how we can develop their sense of curiosity.  The Common Core introduces the notion of a ‘statistical question’: a question that recognizes variability.  To the statisticians reading this, this needs no more explanation.  But I’ve found it surprisingly difficult to teach this practice to math teachers teaching statistics.  I’m not sure, yet, why this is.  Part of the reason might be that in order to answer a statistical question such as “What is the most popular favorite color in this class?” we must ask the non-statistical question “What is your favorite color?”  But there’s more to it than that.  A good statistical question isn’t as simple as the one I mentioned, and leads to discovery beyond the mere satisfaction of curiosity.  I’m reminded of the Census at Schools program that encouraged students to become Data Detectives.

In short, it’s time to think seriously about teaching students why they should want to do data analysis.  And if we’re successful, they’ll want to learn how to do inference.

So what role does inference play in your Ideal Statistics Curriculum?

City Hall and Data Hunting

The L.A. Times had a nice editorial on Thursday (Oct 30) encouraging City Hall to make its data available to the public.  As you know, fellow Citizens, we’re all in favor of making data public, particularly if the public has already picked up the bill and if no individual’s dignity will be compromised.  For me this editorial comes at a time when I’ve been feeling particularly down about the quality of public data.  As I’ve been looking around for data to update my book and for the Mobilize project, I’m convinced that data are getting harder, and not easier, to find.

More data sources are drying up, or selling their data, or using incredibly awkward means for displaying their public data.  A basic example is to consider how much more difficult it is to get, say, a sample of household incomes from various states for 2010 compared to the 2000 census.

Another example is gasbuddy.com, which has been one of my favorite classroom examples.  (We compare the participatory data in gasbuddy.com, which lists prices for individual stations across the U.S., with the randomly sampled data the federal government provides, which gives mean values for urban districts. One data set gives you detailed data, but data that might not always be trustworthy or up-to-date. The other is highly trustworthy, but only useful for general trends and not for, say, finding the nearest cheapest gas.)  It used to be that you could type in a zip code and have access to a nice data set that showed current prices, names and locations of gas stations, dates of the last reported price, and the username of the person who reported the price.  Now, you can scroll through an unsorted list of cities and states and get the same information only for the 15 cheapest and most expensive stations.

About 2 years ago I downloaded a very nice, albeit large, data set that included annual particulate matter ratings for 333 major cities in the US.  I’ve looked and looked, but the data.gov AirData site now requires that I enter the name of each city one at a time, and download very raw data for each city separately.  Now, raw data are a good thing, and I’m glad to see them offered. But is it really so difficult to also provide some common-sense aggregated data sets?
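To be fair, once the raw per-city files are downloaded, stitching them back into one aggregated table is only a few lines of work. The sketch below assumes hypothetical file names and column names (“city”, “date”, “pm25”) standing in for whatever the AirData export actually contains:

```python
# Sketch: combine per-city raw CSV downloads into one annual summary table.
# The file pattern and column names ("city", "date", "pm25") are hypothetical
# placeholders for whatever the AirData export actually provides.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("airdata_raw/*.csv")]
raw = pd.concat(frames, ignore_index=True)
raw["year"] = pd.to_datetime(raw["date"]).dt.year

annual = (raw.groupby(["city", "year"])["pm25"]
             .mean()
             .reset_index()
             .rename(columns={"pm25": "mean_pm25"}))
annual.to_csv("annual_pm_by_city.csv", index=False)
print(annual.head())
```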

One last example:  I stumbled across this lovely website, wildlife crossing, which uses participatory sensing to maintain a database of animals killed at road crossings.  Alas, this apparently very clean data set is spread across 479 separate screens.  All it needs is a “download data” button to drop the entire file onto your hard disk, and they could benefit from many eager statisticians and wildlife fans examining their data.  (I contacted them and suggested this, and they do seem interested in sharing the data in its entirety. But it is taking some time.)
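Absent that button, the workaround is the usual one: loop over the paginated screens and scrape each table. A sketch, with an entirely made-up URL pattern and column layout, and the obvious caveat that you should check the site’s terms (or, better, ask the maintainers, as I did) before doing this:

```python
# Sketch: collect a data set that is spread across many paginated screens.
# The URL pattern and column names are invented for illustration; check the
# site's terms of use (or just ask the maintainers) before scraping.
import requests
import pandas as pd
from bs4 import BeautifulSoup

records = []
for page in range(1, 480):                                    # 479 separate screens
    url = f"https://example.org/observations?page={page}"     # hypothetical URL
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    table = soup.find("table")
    if table is None:
        continue
    for row in table.find_all("tr")[1:]:                      # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        records.append(cells)

df = pd.DataFrame(records, columns=["species", "date", "location", "observer"])
df.to_csv("wildlife_crossings.csv", index=False)
```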

I hope Los Angeles, and all governments, make their public data public. But I hope they have the budget and the motivation to take some time to think about making it accessible and meaningful, too.

Community Colleges and the ASA

Rob will be participating in this event, organized by Nicholas Horton:

CONNECTION WITH COMMUNITY COLLEGES: second in the guidelines for undergraduate statistics programs webinar series

The American Statistical Association endorses the value of undergraduate programs in statistical science, both for statistical science majors and for students in other majors seeking a minor or concentration. Guidelines for such programs were promulgated in 2000, and a new workgroup is working to update them.

To help gather input and identify issues and areas for discussion, the workgroup has organized a series of webinars to focus on different issues.

Connection with Community Colleges
Monday, October 21st, 6:00-6:45pm Eastern Time

Description: Community colleges serve a key role in the US higher education system, accounting for approximately 40% of all enrollments. In this webinar, representatives from community colleges and universities with many community college transfers will discuss the interface between the systems and ways to prepare students for undergraduate degrees and minors in statistics.

The webinar is free to attend, and a recording will be made available after the event.  To sign up, please email Rebecca Nichols (rebecca@amstat.org).

More information about the existing curriculum guidelines as well as a survey can be found at:

http://www.amstat.org/education/curriculumguidelines.cfm

Crime data and bad graphics

I’m working on the 2nd edition of our textbook, Gould & Ryan, and was looking for some examples of bad statistical graphics.  Last time, I used FBI data and created a good and bad graphic from the data. This time, I was pleased to see that the FBI provided its own bad graphic.

[fbi crime bad graph]

This shows a dramatic decrease in crime over the last 5 years.  (Not sure why 2012 data aren’t yet available.) Of course, this graph is only a bad graph if the purpose is to show the rate of decrease.  If you look at it simply as a table of numbers, it is not so bad.

Here’s the graph on the appropriate scale.

[fbi crimes improved]

Still, a decrease worth bragging about.  But, alas, somewhat less dramatic.
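For anyone who wants to recreate the contrast in class, the only difference between the two pictures is whether the y-axis is allowed to start at zero. A quick sketch with invented numbers (these are not the FBI’s figures):

```python
# Illustration of how a truncated y-axis exaggerates a modest decline.
# The numbers are invented for the demo; they are not the FBI's figures.
import matplotlib.pyplot as plt

years = [2007, 2008, 2009, 2010, 2011]
crimes_millions = [11.3, 11.2, 10.7, 10.4, 10.3]   # hypothetical counts

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(8, 3))

ax_bad.plot(years, crimes_millions, marker="o")
ax_bad.set_title("Zoomed-in axis: looks dramatic")

ax_good.plot(years, crimes_millions, marker="o")
ax_good.set_ylim(0, 12)                            # anchor the axis at zero
ax_good.set_title("Axis from zero: modest decline")

for ax in (ax_bad, ax_good):
    ax.set_xlabel("Year")
    ax.set_ylabel("Offenses (millions)")

plt.tight_layout()
plt.show()
```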

Statistics, the government shutdown, and causality.

There’s a statistical meme making its way into pundits’ discussions (as we might politely call them) that is of interest to statistics educators.  There are several variations, but the basic theme is this:  because of the government shutdown, people are unable to benefit from the new drugs they receive by participating in clinical trials.  The L.A. Times went so far as to publish an editorial from a gentleman who claimed that he was cured by his participation in a clinical trial.

Now if they had said that future patients are prevented from benefiting from what is learned from a clinical trial, then they’d nail it.  Instead, they seem to be overlooking the fact that some patients will be randomized to the control group, and will probably get the same treatment as if there were no trial at all.  And in many trials (a majority?), the result will be that the experimental treatment had little or no effect beyond the traditional treatment.  And in a very small number of cases, the experimental treatment will be found to have serious side effects.  And so the pundits should really be telling us that the government shutdown prevents patients from a small probability of benefiting from an experimental treatment.
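To put a rough number on that “small probability”: the chance that any individual participant personally benefits is a product of several probabilities, each well below one. The figures below are invented purely for illustration, not taken from any actual trial:

```python
# Back-of-envelope: probability an individual participant personally benefits.
# Every number here is invented purely for illustration.
p_treatment_arm = 0.5        # randomized to the experimental arm, not control
p_trial_positive = 0.3       # experimental treatment actually outperforms standard care
p_individual_responds = 0.4  # given a real effect, this particular patient responds

p_personal_benefit = p_treatment_arm * p_trial_positive * p_individual_responds
print(f"Rough chance of personal benefit: {p_personal_benefit:.0%}")  # about 6%
```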

All snarkiness aside, I think the prevalence of this meme points to the subtleties of interpreting probabilistic experiments, in which outcomes contain much variability, and so conclusions must be stated in terms of group characteristics.  This came out in the SRTL discussion in Minnesota this summer, when Maxine Pfannkuch, Pip Arnold, and Stephanie Budgett at the University of Auckland presented their work leading towards a framework for describing students’ understanding of causality.  I don’t remember the example they used very well, but it was similar to this (and was a real-life study):  patients were randomized to receive either fish oil or vegetable oil in their diet.  The goal of the study was to determine if fish oil lowered cholesterol.  At the end of the study, the fish oil group had slightly lower average cholesterol levels.  A typical interpretation was, “If I take fish oil, my cholesterol will go down.”

One problem with this interpretation is that it ignores the within-group variation.  Some of the patients in the fish oil group saw their cholesterol go up; some saw little or no change.  The study’s conclusion is about group means, not about individuals.  (There were other problems, too.  This interpretation ignores the existence of the control group: we don’t really know if fish oil improves cholesterol compared to your current diet; we know only that it tends to go down in comparison to a vegetable-oil diet.  Also, we know the effects only for those who participated in the study. We assume they were not special people, but possibly the results won’t hold for other groups.)
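A quick simulation makes the group-mean-versus-individual point vivid: even when the fish-oil group really does have a lower mean change, a large share of individuals in that group still see their cholesterol rise. (The effect size and spread below are invented, not taken from the Auckland study.)

```python
# Simulated trial: a real difference in group means, yet many individuals
# in the "better" group still move in the "wrong" direction.
# Effect size and spread are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100
change_fish = rng.normal(loc=-5, scale=20, size=n)  # mean drop of 5 units
change_veg = rng.normal(loc=0, scale=20, size=n)    # no mean change

print(f"Mean change, fish oil:      {change_fish.mean():6.1f}")
print(f"Mean change, vegetable oil: {change_veg.mean():6.1f}")
print(f"Fish-oil patients whose cholesterol went UP: {(change_fish > 0).mean():.0%}")
```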

Understanding causality in probabilistic settings (or any setting) is a challenge for young students and even adults.  I’m very excited to see such a distinguished group of researchers begin to help us understand it.  Judea Pearl, at UCLA, has done much to encourage statisticians to think about the importance of teaching causal inference.  Recently, he helped the American Statistical Association establish the Causality in Statistics Education prize, won this year by Felix Elwert, a sociologist at the University of Wisconsin-Madison.  We still have a ways to go before we understand how best to teach this topic at the undergraduate level, and even further before we understand how to teach it at earlier levels.  But, as the government shutdown has shown, understanding probabilistic causality is an important component of statistical literacy.

Thinking with technology

Just finished a stimulating, thought-provoking week at SRTL (the Statistics Research, Teaching and Learning conference), this year held in Two Harbors, Minnesota, right on Lake Superior. SRTL gathers statistics education researchers, most of whom come with cognitive or educational psychology credentials, every two years. It’s more of a forum for thinking and collaborating than it is a platform for presenting findings, and this means there’s much lively, constructive discussion about works in progress.

I had meant to post my thoughts daily, but (a) the internet connection was unreliable and (b) there was just too much to digest. One recurring theme that really resonated with me was the ways students interact with technology when thinking about statistics.

Much of the discussion centered on young learners, and most of the researchers — but not all — were in classrooms in which the students used TinkerPlots 2.  TinkerPlots is a dynamic software system that lets kids build their own chance models. (It also lets them build their own graphics more-or-less from scratch.) They do this either by dropping “balls” into “urns” and labeling the balls with characteristics, or through spinners that let them shade different areas different colors. They can connect series of spinners and urns in order to create sequences of independent or dependent events, and can collect the outcomes of their trials. Most importantly, they can carry out a large number of trials very quickly and graph the results.
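For readers who have not seen TinkerPlots, the spinner-and-urn idea translates directly into a few lines of ordinary code. The sketch below is only an analogy to the kind of two-stage chance model students were building, not the software itself:

```python
# A two-stage chance model in the spirit of TinkerPlots' spinners and urns
# (an analogy, not the software itself): a weighted spinner sets the weather,
# an urn whose contents depend on that weather gives the outcome, and the
# experiment is repeated many times to see the distribution of results.
import random
from collections import Counter

def spinner():
    """A spinner with 70% of its area shaded 'sunny' and 30% 'rainy'."""
    return random.choices(["sunny", "rainy"], weights=[0.7, 0.3])[0]

def urn(weather):
    """An urn whose mix of balls depends on the spinner: a dependent second stage."""
    if weather == "sunny":
        balls = ["walk"] * 8 + ["bus"] * 2
    else:
        balls = ["walk"] * 2 + ["bus"] * 8
    return random.choice(balls)

outcomes = Counter()
for _ in range(10_000):
    weather = spinner()
    outcomes[(weather, urn(weather))] += 1

for outcome, count in sorted(outcomes.items()):
    print(outcome, round(count / 10_000, 3))
```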

What I found fascinating was the way in which students would come to judgements about situations, and then build a model that they thought would “prove” their point. After running some trials, when things didn’t go as expected, they would go back and assess their model. Sometimes they’d realize that they had made a mistake, and they’d fix it. Other times, they’d see there was no mistake, and then realize that they had been thinking about it wrong. Sometimes, they’d come up with explanations for why they had been thinking about it incorrectly.

Janet Ainley put it very succinctly. (More succinctly and precisely than my re-telling.)  This technology imposes a sort of discipline on students’ thinking. Using the  technology is easy enough  that they can be creative, but the technology is rigid enough that their mistakes are made apparent.  This means that mistakes are cheap, and attempts to repair mistakes are easily made.  And so the technology itself becomes a form of communication that forces students into a level of greater precision than they can put in words.

I suppose that mathematics plays the same role, in that speaking with mathematics imposes great precision on the speaker.  But that language takes time to learn, and few students reach a level of proficiency that allows them to use the language to construct new ideas.  TinkerPlots, and software like it, gives students the ability to use a language to express new ideas with very little expertise.  It was impressive to see 15-year-olds build models that incorporated both deterministic trends and fairly sophisticated random variability.  More impressive still, the students were able to use these models to solve problems.  In fact, I’m not sure they really knew they were building models at all, since their focus was on the problem solving.

Tinkerplots is aimed at a younger audience than the one I teach.  But for me, the take-home message is to remember that statistical software isn’t simply a tool for calculation, but a tool for thinking.

TISE Special Edition: 2012 IASE Roundtable

Every couple of years, the International Association for Statistical Education hosts a Roundtable discussion, wherein researchers, statisticians, and curriculum developers gather from around the world to share ideas. The 2012 Roundtable, held in Cebu City, the Philippines, focused on the role of Technology in Statistics Education, and so, after a very long time editing (for me and Jennifer Kaplan) and re-writing (for our authors), we are now ready to present the Roundtable Special Edition.  The articles cover the spectrum: K-12, introductory statistics, and beyond.  Versions of these articles appeared in the Proceedings, but the versions published here are peer-reviewed, re-written, and re-written again.  Topics include: designing computer games to teach data science, measuring teachers’ attitudes towards technology in their classrooms, how to decide which features make a successful online course, how best to teach students to use statistical packages, some exciting innovations for teaching inference and experimental design, as well as descriptions of exciting developments in statistics education in Kenya, Malaysia, and more!

Here’s Looking At You!

What do we fear more?  Losing data privacy to our government, or to corporate entities?  On the one hand, we (still) have oversight over our government.  On the other hand, the government is (still) more powerful than most corporate entities, and so perhaps better situated to frighten.

In these times of Snowden and the NSA, the L.A. Times ran an interesting story about just what tracking various internet companies perform.  And it’s alarming (“They’re watching your every move,” July 10, 2013; interestingly, the story does not seem to appear on their website as of this posting).  Like the government, most of these companies claim that (a) their ‘snooping’ is algorithmic, so no human sees the data, and (b) their data are anonymized.  And yet…

To my knowledge, businesses aren’t required to adhere to, or even acknowledge, any standards or practices for dealing with private data.  Thus, a human could snoop on particular data.  We are left to ponder what that human will do with the information.  In the best case scenario, the human would be fired, as, according to the L.A. Times, Google did when it fired an engineer for snooping on emails of some teenage girls.

But the data are anonymous, you say?  Well, there’s anonymous and then there’s anonymous.  As Latanya Sweeney taught us in the ’90s, knowing a person’s ZIP code, gender, and date of birth is sufficient to uniquely identify 85% of Americans.  And the L.A. Times reports a similar study in which just four hours of anonymized tracking data was sufficient to identify 95% of all individuals examined.  So while your name might not be recorded, by merging enough data files, they will know it is you.
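The arithmetic behind Sweeney’s claim is easy to check on any table of people you happen to have: count how many records share each (ZIP code, gender, date of birth) combination and see what fraction sit in a group of one. The file and column names below are hypothetical:

```python
# Sketch: what fraction of records is uniquely identified by the
# quasi-identifiers ZIP code, gender, and date of birth?
# The file name and column names are hypothetical.
import pandas as pd

people = pd.read_csv("people.csv")              # assumed columns: zip, gender, dob, ...
combo_sizes = people.groupby(["zip", "gender", "dob"]).size()
unique_share = (combo_sizes == 1).sum() / len(people)
print(f"Records pinned down by (zip, gender, dob) alone: {unique_share:.0%}")
```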

This article fits in really nicely with a fascinating, revelatory book I’m currently midway through: Jaron Lanier’s Who Owns the Future? A basic theme of the book is that internet technology devalues products and goods (files) and values services (software).  One process through which this happens is that we humans accept the marvelous free stuff that the internet provides (free Google searches, free Amazon shipping, easily pirated music files) in exchange for allowing companies to snoop. The companies turn our aggregated data into dollars by selling it to advertisers.

A side effect of this, Lanier explains, is that there is a loss of social freedom.  At some point, a service such as Facebook gets to be so large that failing to join means you are losing out on possibly rich social interactions.  (Yes, I know there are those who walk among us who refuse to join Facebook.  But these people are probably not reading this blog, particularly since our tracking ‘bots tell us that most of our readers come from Facebook referrals.  Oops.  Was I allowed to reveal that?)  So perhaps you shouldn’t complain about being snooped on, since you signed away your privacy rights. (You did read the entire user agreement, right?  Raise your hand if you did.  Thought so.)  On the other hand, if you don’t sign, you become a social pariah.  (Well, an exaggeration.  For now.)

Recently, I installed Ghostery, which tracks the automated snoopers that follow me during my browsing.  Not only “tracks”, but also blocks.  Go ahead and try it.  It’s surprising how many different sources are following your every on-line move.

I have mixed feelings about blocking this data flow. The data-snooping industry is big business, and is responsible, in part, for the boom in stats majors and, more importantly, the boom in stats employment.  And so, indirectly, data-snooping is paying my income.  Lanier has an interesting solution: individuals should be paid for their data, particularly when it leads to value.  This means the era of ‘free’ is over: we might end up paying for searches and for reading Wikipedia.  But he makes a persuasive case that the benefits exceed the costs.  (Well, I’m only halfway through the book.  But so far, the case is persuasive.)