On the first day of an intro stats or intro data science course, I enjoy giving some accessible real data examples instead of spending the whole time going over the syllabus (which is necessary in my opinion, but somewhat boring nonetheless).
One of my favorite examples is How to Tell Someone’s Age When All You Know Is Her Name from FiveThirtyEight. As an added bonus, you can use this example to get to know some students' names. I usually go through a few of the visualizations in this article, asking students to raise their hands if their name appears in the visualization. Sometimes I also supplement this with the Baby Name Voyager. It’s fun to have students offer up their names so we can take a look at how their popularity has changed over the years.
Another example I like is the Locals and Tourists Flickr Photos. If I remember correctly, I first saw this example in Mark Hanson’s class in grad school. These maps use data from geotags on Flickr: blue pictures are taken by locals, red pictures are by tourists, and yellow pictures might be by either. The map of Manhattan is one most students will recognize, since many people know where Times Square and Central Park are, both of which have an abundance of red (tourist) pictures. And if your students watch enough Law & Order, they might also know where Rikers Island is and recognize that, unsurprisingly, no pictures are posted from that location.
However, if I were teaching a class this coming fall, I would add the following analysis of Donald Trump’s tweets to my list of examples. If you have not yet seen this analysis by David Robinson, I recommend you stop what you’re doing now and go read it. It’s linked below:
Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half
I’m not going to reiterate the post here, but the gist of it is that the @realDonaldTrump account tweets from two different phones, and that
the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.
I think this post would be a fantastic and timely first-day-of-class example for a stats / data analysis / data science course. It shows an easy-to-follow analysis, complete with the R code to reproduce it. It uses some sentiment analysis techniques that may not be the focus of an intro course, but since the context will be familiar to students, it shouldn’t be too confusing for them. It also features techniques one will likely cover in an intro course, like confidence intervals.
As a bonus, many popular media outlets have covered the analysis in the last few days (e.g. see here, here, and here), and some of those articles might be even easier on the students to begin with before delving into the analysis in the blog post. Personally, I would start by playing this clip from the CTV News Channel featuring an interview with David to provide the context first (a video always helps wake students up), and then move on to discussing some of the visualizations from the blog post.