PDF and Citation Management

A new academic year looms. This means a new crop of graduate students will begin their academic training. A PDF manager is a critical tool that all graduate students need to use, and the sooner the better. Often these tools go hand-in-hand with a citation management system, which is also critical for graduate students.

Using citation management software makes scholarly work easier and more effective. First and foremost, these tools allow you to automatically cite references for a paper in a wide range of bibliographic styles. They also allow you to organize, evaluate, annotate, and search within your citation collection and share your references with others. Often they also sync across machines and devices, allowing you to access your database wherever you are.

There are several tools available for PDF/citation management, including:

Some of these are citation managers only (BibDesk). Many also let you manage your PDF files: naming, organizing, and moving them to a central repository on your computer. Some allow for annotation within the software as well. There are several online comparisons of the different systems (e.g., Penn Libraries, UW Madison Library, etc.). From my experience, students tend to choose either Mendeley or Zotero; my guess is because they are free.

There is a lot to be said for free software, and both Zotero and Mendeley seem pretty solid. However, as a graduate student you should understand that you are investing in your future. I think it is fair to say that you will be using this type of tool daily. Spending money on a tool that has the features and UI that you will want to use is perfectly ok and should even be encouraged.

Another consideration for students who are beginning the process is to find out what your advisor(s) and research groups use. Although many tools are cross-compatible, learning and using a tool is easier with a group helping you.

What Do I Use?

I use Papers. It is not free (a student license is ~$50). When I started using Papers, Mendeley and Zotero were not available. I have since used both Mendeley and Zotero for a while, but ultimately made the decision this summer to switch back to Papers. It is faster and, more importantly to me, has better search functionality, both across and within a paper.

I would like to use Sente (free for up to 100 references), but the search function is very limited. In my opinion, Sente has the best UI: it is sleek and minimalist, and reading a paper is a nice experience.

My Recommendation…

Ultimately, use what you are comfortable with and then actually use it. Take the time to enter ALL the meta-data for PDFs as you accumulate them. Don’t imagine you will have time to do it later…you won’t. Being organized with your references from the start will keep you more productive later.


Very brief first day of class activity in R

A new academic year has started for most of us. I try to do a range of activities on the first day of my introductory statistics course, and one of them is an incredibly brief activity to just show students what R is and what the RStudio window looks like. Here it is:

Generate a random number between 1 and 5, and introduce yourself to that many people sitting around you:

sample(1:5, size = 1)

It’s a good opportunity to have students access RStudio once, talk about random sampling, and break up the class session and have them introduce themselves to their fellow classmates. I usually do the activity too, and use it as an opportunity to personally introduce myself to a few students and to meet them.
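
If you want to carry the random sampling conversation one step further, here is a minimal sketch of what the whole class's draws might look like. The class size of 30 and the seed value are assumptions made up for illustration; they are not part of the original activity.

```r
# Hypothetical class of 30 students, each drawing one number from 1 to 5
class_size <- 30

# Arbitrary seed, set only so the same draws can be reproduced on the projector
set.seed(101)

# replace = TRUE because different students can draw the same number
draws <- sample(1:5, size = class_size, replace = TRUE)

# Tabulate how many students drew each value, keeping all five levels
draw_counts <- table(factor(draws, levels = 1:5))
draw_counts
```

Students can compare their own draw to the table and see that the counts are roughly, but not exactly, equal across the five values.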

If you’re interested in everything else I’m doing in my introductory statistics course you can find the course materials for this semester at http://bit.ly/sta101_f15 and find the source code for all publicly available materials like slides, labs, etc. at https://github.com/mine-cetinkaya-rundel/sta101_f15. Both of these will be updated throughout the semester. Feel free to grab whatever you find useful.

R packages for undergraduate stat ed

The other day on the isostat mailing list Doug Andrews asked the following question:

Which R packages do you consider the most helpful and essential for undergrad stat ed? I ask in great part because it would help my local IT guru set up the way our network makes software available in our computer classrooms, but also just from curiosity.

Doug asked for a top 10 list, and a few people have already chimed in with great suggestions. I thought those not on the list might also have good ideas, so, with Doug’s permission, I’m reposting the question here.

Here is my top 10 (ok, 12) list:
(Links go to vignettes or pages I find to be quickest / most useful references for those packages, but if you know of better resources, let me know and I’ll update.)

  1. knitr / rmarkdown – for reproducible data analysis with literate programming, great set of tools that students can use from day 1 in intro stats all the way through to writing their undergrad theses
  2. dplyr – for most data manipulation tasks, with the added benefit of piping (via magrittr)
  3. ggplot2 – easy faceting allows for graphing multivariate relationships more easily than with base R (lattice is also good for that, but IMO ggplot2 graphics look more modern and lattice has a much steeper learning curve)
  4. openintro – or packages that come with the textbooks you use, great for pulling up any dataset from the text and building on it in class (a new version coming soon to fully complement 3rd edition of OpenIntro Statistics)
  5. mosaic – for consistent syntax for functions used in intro stat
  6. googlesheets – for loading data directly from Google spreadsheets
  7. lubridate – if you ever need to work with any date fields
  8. stringr – for text parsing and manipulation
  9. rvest – for scraping data off the web
  10. readr / data.table – for loading large datasets & default stringsAsFactors = FALSE
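
To make the dplyr and ggplot2 items above a bit more concrete, here is a small sketch of piping and faceting. It uses the built-in mtcars data purely as a stand-in; the choice of dataset and variables is my assumption, not something from the mailing list discussion.

```r
library(dplyr)
library(ggplot2)

# Piping with dplyr: average fuel economy by number of cylinders
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# Faceting with ggplot2: weight vs. mpg, one panel per cylinder count
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

cyl_summary
```

Printing `p` draws the three-panel scatterplot, which is the kind of multivariate view that takes noticeably more code in base R.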

And the following suggestions from Randall Pruim complement this list nicely:

  • readxl – for reading Excel data
  • tidyr – for converting between wide and long formats and for the very useful extract_numeric()
  • ggvis – ggplot2 “done right” and tuned for interactive graphics
  • htmlwidgets – this is actually a collection of packages for plots: see leaflet for maps and dygraphs for time series, for example
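
As an illustration of the wide/long conversion mentioned for tidyr, here is a short sketch. The exam scores are invented for the example, and it assumes a recent tidyr (1.0 or later) that provides pivot_longer() and pivot_wider(); older versions did the same job with gather() and spread().

```r
library(tidyr)

# Hypothetical wide data: one row per student, one column per exam
wide <- data.frame(
  student = c("A", "B", "C"),
  exam1 = c(82, 90, 75),
  exam2 = c(88, 85, 79)
)

# Wide to long: one row per student-exam combination
long <- pivot_longer(
  wide,
  cols = c(exam1, exam2),
  names_to = "exam",
  values_to = "score"
)

# Long back to wide: recovers the original layout
back <- pivot_wider(long, names_from = exam, values_from = score)

long
```

The long form is usually what ggplot2 and dplyr expect, so this round trip tends to come up early in a course.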

Note that most of these packages are for data manipulation and visualization. Method-specific packages that are useful / essential for a particular undergraduate program might depend on the focus of that program. Some packages that have come up in the discussion so far are:

  • lme4 – for mixed models
  • pwr – for showing sample size and power calculations
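
For a sense of what a sample size calculation with pwr looks like in class, here is a minimal sketch. The effect size, power, and significance level are conventional textbook values that I picked for illustration; they do not come from the discussion.

```r
library(pwr)

# Sample size per group for a two-sample t-test,
# assuming a medium effect (Cohen's d = 0.5), 80% power, and alpha = 0.05
res <- pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05, type = "two.sample")

# Round up, since you cannot recruit a fraction of a person
n_per_group <- ceiling(res$n)
n_per_group
```

Changing `d` or `power` and rerunning is a quick way to show students how the required sample size responds to each input.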

This blog post is meant to provide a space for continuing this discussion, so I’ll ask the question one more time: Which R packages do you consider the most helpful and essential for undergrad stat ed? Please add your responses to the comments.


PS: Thanks to Michael Lopez for suggesting that I post this list somewhere.
PPS: I should really be working on my fast-approaching JSM talk.