PDF and Citation Management

A new academic year looms. This means a new crop of graduate students will begin their academic training. PDF management is a critical tool that all graduate students need to use and the sooner the better. Often these tools go hand-in-hand with a citation management system, which is also critical for graduate students.

Using a citation management software makes scholarly work easier and more effective. First and foremost, these tools allow you to automatically cite references for a paper in a wide range of bibliographic styles. They also allow you to organize, evaluate, annotate, and search within your citation collection and share your references with others. Often they also sync across machines and devices allowing you to access your database wherever you are.

There are several tools available for PDF/citation management, including:

Some of these are citation managers only (BibDesk). Many allow you to also manage your PDF files as well; naming, organizing, and moving your files to a central repository on your computer. Some allow for annotation within the software as well. There are several online comparisons of some of the different systems ( e.g., Penn LibrariesUW Madison Library, etc.) From my experience, students tend to choose either Mendelay or Zotero—my guess is because they are free.

There is a lot to be said for free software, and both Zotero and Mendelay seem pretty solid. However, as a graduate student you should understand that you are investing in your future. This type of tool, I think it is fair to say, you will be using daily. Spending money on a tool that has the features and UI that you will want to use is perfectly ok and should even be encouraged.

Another consideration for students who are beginning the process is to find out what your advisor(s), and research groups use. Although many are cross-compatible, using and learning the tool is easier with a group helping you.

What Do I Use?

I use Papers. It is not free (a student license is ~$50). When I started using Papers, Mendelay and Zotero were not available. I actually have since used both Mendelay and Zotero for a while, but then ultimately made the decision this summer to switch back to Papers. It is faster and more importantly to me, has better search functionality, both across and within a paper.

I would like to use Sente (free for up to 100 references), but the search function is very limited. In my opinion, Sente has the best UI..it is sleek and minimalist and reading a paper is a nice experience.

My Recommendation…

Ultimately, use what you are comfortable with and then, actually use it. Take the time to enter ALL the meta-data for PDFs as you accumulate them. Don’t imagine you will have time to do it later…you won’t. Being organized with your references from the start will keep you more productive later.


Very brief first day of class activity in R

New academic year has started for most of us. I try to do a range of activities on the first day of my introductory statistics course, and one of them is an incredibly brief activity to just show students what R is and what the RStudio window looks like. Here it is:

Generate a random number between 1 and 5, and introduce yourself to that many people sitting around you:

sample(1:5, size = 1)

It’s a good opportunity to have students access RStudio once, talk about random sampling, and break up the class session and have them introduce themselves to their fellow classmates. I usually do the activity too, and use it as an opportunity to personally introduce myself to a few students and to meet them.

If you’re interested in everything else I’m doing in my introductory statistics course you can find the course materials for this semester at http://bit.ly/sta101_f15 and find the source code for all publicly available materials like slides, labs, etc. at https://github.com/mine-cetinkaya-rundel/sta101_f15. Both of these will be updated throughout the semester. Feel free to grab whatever you find useful.

Interpreting Cause and Effect

One big challenge we all face is understanding what’s good and what’s bad for us.  And it’s harder when published research studies conflict. And so thanks to Roger Peng for posting on his Facebook page an article that led me to this article by Emily Oster:  Cellphones Do Not Give You Brain Cancer, from the good folks at the 538 blog. I think this article would make a great classroom discussion, particularly if, before showing your students the article, they themselves brainstormed several possible experimental designs and discussed strengths and weaknesses of the designs. I think it is also interesting to ask why no study similar to the Danish Cohort study was done in the US.  Thinking about this might lead students to think about cultural attitudes towards wide-spread data collection.

Fitbit Revisited

Many moons ago we wrote about a bit of a kludge to get data from a Fitbit (see here). Now it looks as though there is a much better way. Cory Nissen has written an R package to scrape Fitbit data and posted it on GitHub. He also wrote a blog post on his blog Stats and Things announcing the package and demonstrating its use. While I haven’t tried it yet, it looks pretty straight-forward and much easier than anything else i have seen to date.

Model Eliciting Activity: Prologue

I’m very excited/curious about tomorrow: I’m going to lead about 40 math and science teachers in a data-analysis activities, using one of the Model Eliciting Activities from the University of Minnesota Catalysts for Change Project. (One of our bloggers, Andy, was part of this project.) Specifically, we’re giving them the arrival-delay times for five different airlines into Chicago O’Hare. A random sample of 10 from each airline, and asking them to come up with rules for ranking the airlines from best to worst.

I’m curious to see what they come up with, particularly whether  the math teachers differ terribly from the science teachers. The math teachers are further along in our weekend professional development program than are the science teachers, and so I’m hoping they’ll identify the key characteristics of a distribution (all together: center, spread, shape; well, shape doesn’t play much of a role here) and use these to formulate their rankings. We’ve worked hard on helping them see distributions as a unit, and not a collection of individual points, and have seen big improvements in the teachers, most of whom have not taught statistics before.

The science teachers, I suspect, will be a little bit more deterministic in their reasoning, and, if true to my naive stereotype of science teachers, will try to find explanations for individual points. Since I haven’t worked as much with the science teachers, I’m curious to see if they’ll see the distribution as a whole, or instead try to do point-by-point comparisons.

When we initially started this project, we had some informal ideas that the science teachers would take more naturally to data analysis than would the math teachers. This hasn’t turned out to be entirely true. Many of the math teachers had taught statistics before, and so had some experience. Those who hadn’t, though, tended to be rather procedurally oriented. For example, they often just automatically dropped outliers from their analysis without any thought at all, just because they thought that that was the rule. (This has been a very hard habit to break.)

The math teachers also had a very rigid view of what was and was not data. The science teachers, on the other hand, had a much more flexible view of data. In a discussion about whether photos from a smart phone were data, a majority of math teachers said no and a majority of science teachers said yes. On the other hand, the science teachers tend to use data to confirm what they already know to be true, rather than use it to discover something. This isn’t such a problem with the math teachers, in part because they don’t have preconceptions of the data and so have nothing to confirm. In fact, we’ve worked hard with the math teachers, and with the science teachers, to help them approach a data set with questions in mind. But it’s been a challenge teaching them to phrase questions for their students in which the answers aren’t pre-determined or obvious, and which are empirically oriented. (For example: We would like them to ask something like “what activities most often led to our throwing away redcycling into the trash bin?” rather than “Is it wrong to throw trash into the recycling bin?” or “Do people throw trash into the recycling bin?”)

So I’ll report back soon on what happened and how it went.

Annual Review of Reading

It is that time of year…time to review the previous year; make top 10 lists; and resolve to be a better person in 2015. I will tackle the first, but only of my reading habits. In 2014 I read 46 books for a grand total of 17,480 pages. (Note: I do not count academic books for work in this list, only books I read for recreation.) This is a yearly high, at least since I have been tracking this data on GoodReads (since late 2010). You can read an older annual report of reading here.

Year Books Pages
2011 45 15,332
2012 29 9,203
2013 45 15,887
2014 46 17,480

Since I have accumulated four years worth of data, I thought I might do some comparative analysis of my reading over this time period.

When am I reading?

plot2The trend displayed here was somewhat surprising when I looked at it—at least related to the decline in reading over the summer months. Although, reflecting on it, it maybe should not have been as surprising. There is a slight uptick around the month of May (when spring semester ends) and the decline begins in June/July. Not only do summer classes begin, but I also try to do a few house and garden projects over the summer months. This uptick and decline are still visible when a plot of the number of pages (rather than the number of books) is examined, albeit much smaller (1,700 pages in May and 1,200 pages in the summer months). This might indicate I read longer books in the summer. For example, one of the books I read this last summer was Neal “I don’t know the meaning of the word ‘brevity'” Stephenson’s Reamde, which clocked in at a mere 1,044 pages.

Was I reading books that I ultimately enjoyed?

plot3I also plotted my monthly average rating (on a five-point scale) for the four years of data. This plot shows that 2014 is an anomaly. I apparently read trash in the summer (which is what you are supposed to do). The previous three years I read the most un-noteworthy books in the fall. Or, I just rated them lower because school had started again.

Am I more critical than other readers? Is this consistent throughout the year?

I also looked at how other GoodReads readers had rated those same books. The months represent when I read the book. (I didn’t look at when the book was read by other readers, although that would be interesting to see if time of year has an effect on rating.) The scale on the y-axis is the residual between my rating and the average GoodReads rating. My ratings are generally close to the average, sometimes higher, sometimes lower. There are, however, many books that I rated much lower than average. The loess smooth suggests that July–November is when I am most critical relative to other readers.


Notes and thoughts from JSM 2014: Student projects utilizing student-generated data

Another August, another JSM… This time we’re in Boston, in yet another huge and cold conference center. Even on the first (half) day the conference schedule was packed, and I found myself running between sessions to make the most of it all. This post is on the first session I caught, The statistical classroom: student projects utilizing student-generated data, where I listened to the first three talks before heading off to catch the tail end of another session (I’ll talk about that in another post).

Samuel Wilcock (Messiah College) talked about how while IRBs are not required for data collected by students for class projects, the discussion of ethics of data collection is still necessary. While IRBs are cumbersome, Wilcock suggests that as statistic teachers we ought to be aware of the process of real research and educating our students about the process. Next year he plans to have all of his students go through the IRB process and training, regardless of whether they choose to collect their own data or use existing data (mostly off the web). Wilcock mentioned that, over the years, he moved on from thinking that the IRB process is scary to thinking that it’s an important part of being a stats educator. I like this idea of discussing in the introductory statistics course issues surrounding data ethics and IRB (in a little more depth than I do now), though I’m not sure about requiring all 120 students in my intro course to go through the IRB process just yet. I hope to hear an update on this experiment next year from to see how it went.

Next, Shannon McClintock (Emory University) talked about a project inspired by being involved with the honor council of her university, when she realized that while the council keeps impeccable records of reported cases, they don’t have any information on cases that are not reported. So the idea of collecting student data on academic misconduct was born. A survey was designed, with input from the honor council, and Shannon’s students in her large (n > 200) introductory statistics course took the survey early on in the semester. The survey contains 46 questions which are used to generate 132 variables, providing ample opportunity for data cleaning, new variable creation (for example thinking about how to code “any” academic misconduct based on various questions that ask about whether a student has committed one type of misconduct or another), as well as thinking about discrepant responses. These are all important aspects of working with real data that students who are only exposed to clean textbook data may not get a chance practice. It’s my experience that students love working with data relevant to them (or, even better, about them), and data on personal or confidential information, so this dataset seem to hit both of those notes.

Using data from the survey, students were asked to analyze two academic outcomes: whether or not student has committed any form of academic misconduct and an outcome of own choosing, and presented their findings in n optional (some form of extra credit) research paper. One example that Shannon gave for the latter task was defining a “serious offender”: is it a student who commits a one time bad offense or a student who habitually commits (maybe nor so serious) misconduct? I especially like tasks like this where students first need to come up with their own question (informed by the data) and then use the same data to analyze it. As part of traditional hypothesis testing we always tell students that the hypotheses should not be driven by the data, but reminding them that research questions can indeed be driven by data is important.

As a parting comment Shannon mentioned that the administration at her school was concerned that students finding out about high percentages of academic offense (survey showed that about 60% of students committed a “major” academic offense) might make students think that it’s ok, or maybe even necessary, to commit academic misconduct to be more successful.

For those considering the feasibility of implementing a project like this, students reported spending on average 20 hours on the project over the course of a semester. This reminded me that I should really start collecting data on how much time my students spend on the two projects they work on in my course — it’s pretty useful information to share with future students as well as with colleagues.

The last talk I caught in this session was by Mary Gray and Emmanuel Addo (American University) on a project where students conducted an exit poll asking voters whether they encountered difficulty in voting, due to voter ID restrictions or for other reasons. They’re looking for expanding this project to states beyond Virginia, so if you’re interested in running a similar project at your school you can contact Emmanuel at addo@american.edu. They’re especially looking for participation from states with particularly strict voter ID laws, like Ohio. While it looks like lots of work (though the presenters assured us that it’s not), projects like these that can remind students that data and statistics can be powerful activism tools.

Is Data Science Real?

Just came back from the International Conference on Teaching Statistics (ICOTS) in Flagstaff, AZ filled with ideas.  There were many thought-provoking talks, but what was even better were the thought-provoking conversations.  One theme, at least for me, is just what is this thing called Data Science?  One esteemed colleague suggested it was simply a re-branding.  Other speakers used it somewhat perjoratively, in reference  to outsiders (i.e. computer scientists).   Here are some answers from panelists at a discussion on the future of technology in statistics education.  All paraphrases are my own, and I take responsibility for any sloppiness, poor grammar, etc.

Webster West took the High Statistician point of view—one shared by many, including, on a good day, myself: Data Science consists of those things that are involved in analyzing data.  I think most statisticians when reading this will feel like Moliere’s Bourgeois Gentleman, who was pleasantly surprised to learn he’d been speaking prose all his life.  But I think there’s more to it then that, because probably many statisticians don’t consider data scraping, data cleaning, data management as part of data analysis.

Nick Horton offered that data mining was an activity that could be considered part of data science.  And he sees data mining as part of statistics.  Not sure all statisticians would agree, since for many of us, data mining is a swear word used to refer to people who are lucky enough to discover something but have no idea why it was discovered.  But he also offered a broader definition:  using data to answer a statistical question.   Which I quite like.  It leaves open the door to many ways of answering the question; it doesn’t require any particular background or religion, it simply means that those activities used to bring data to bear in answering a statistical question.

Bill Finzer relied on set theory:  data science is a partial union of math and statistics, subject matter knowledge, and computational thinking and programming in the service of making discoveries from data.  I’ve seen similar definitions and have found such a definition to be very useful in thinking about curriculum for a high school data science course.  It doesn’t contradict Nick’s definition, but is a little more precise.  As always, Bill has a knack for phrasing things just right without any practice.

Deb Nolan answered last, and I think I liked her answer the best.  Data science encompasses the entire data analysis cycle, and addresses the issue you face in terms of working with data within that cycle, and the skills needed to complete that cycle.  (I like to use this simplified version of the cycle:  ask questions–>collect/consider/prepare data –>analyze data–> interpret data–>ask questions, etc.)

One reason I like Deb’s answer is that its the answer we arrived at in our Mobilize group that’s developing the Introduction to Data Science curriculum for Los Angeles Unified School District.  (With a new and improved webpage appearing soon! I promise!)  Lots of computational skills appear explicitly in the collect/prepare data bit of the cycle, but in fact, algorithmic thinking — thinking about processes of reproducibility and real-time analyses–can appear in all phases.

During this talk I had an epiphany about my own feelings towards a definition. The epiphany was sparked by an earlier talk by Daniel Frischemeier on the previous day, but brought into focus by this panel’s discussion.   (Is it possible to have a slow epiphany?)

Statistics educators have been big proponents of teaching “statistical thinking”, which is basically an approach to solving problems that involve uncertainty/variation and data.  But for many of us, the bit of problem solving in which a computer is involved is ignored in our conceptualization of statistical thinking.  To some extent, statistical thinking is considered to be independent of computation.  We’d like to think that we’d reach the same conclusions regardless of which software we were using.  While that’s true, I think it’s also true that our approach to solving the problem may be software dependent.  We think differently with different softwares because different softwares enable different thought processes, in the same way that a pen and paper enables different processes then a word processor.

And so I think that we statisticians become data scientists the moment we reconceptualize statistical thinking to include using the computer.

What does this have to do with Daniel’s talk?   Daniel has done a very interesting study in which he examined the problem solving approach of students in a statistics class.  In this talk, he offered a model for the expert statistician problem solving process.  Another version of the data analysis cycle, if you will.  His cycle (built solidly on foundations of others) is Real Problem –> Statistical activity –> Software use–> Reading off/Documentation (interpreting) –> conclusions –> reasons (validation of conclusions)–> back to beginning.

I think data scientists are those who would think that the “software use” part of the cycle was subsumed by the statistical activity part of the cycle. In other words, when you approach data cleaning, data organizing, programming, etc. as if they were a fundamental component of statistical thinking, and not just something that stands in the way of your getting to real data analysis, then you are doing data science.  Or, as my colleague Mark Hansen once told me, “Teaching R  *is* teaching statistics.”  Of course its possible to teach R so that it seems like something that gets in the way of (or delays) understanding statistics.  But it’s also possible to teach it as a complement to developing statistical understanding.

I don’t mean this as a criticism of Daniel’s work, because certainly it’s useful to break complex activities into smaller parts.  But I think that there is a figure-and-ground issue, in which statisticians have seen modeling and data analysis as the figure, and the computer as the ground.  But when our thinking unites these views, we begin to think like data scientists.  And so I do not think that “data science” is just a rebranding of statistics. It is a re-consideration of statistics that places greater emphasis on parts of the data cycle than traditionally statistics has placed.

I’m not done with this issue.  The term still bothers me.  Just what is the science in data science?  I feel a refresher course in Popper and Kuhn is in order.  Are we really thinking scientifically about data?  Comments and thoughts welcome.

Lively R

Next week, the UseR conference comes to UCLA.  And in anticipation, I thought a little foreshadowing would be nice.  Amelia McNamara, UCLA Stats grad student and rising stats ed star, shared with me a new tool that has the potential to do some wonderful things.  LivelyR is a work-in-progress that is, in the words of its creators, a “mashup of R with packages of Rstudio.” The result is a highly interactive.  I was particularly struck by and intrigued by the ‘sweeping’ function, which visually smears graphics across several parameter values.  The demonstration shows how this can help understand the effects of bin-width and off-set changes on a histogram so that a more robust sense of the sample distribution shines through.

R is beginning to become a formidable educational tool, and I’m looking forward to learning more at UseR next week. For those of you in L.A. who can attend, Aron Lunzer will be talking about LivelyR at 4pm on Tuesday, July 1.