Research Hack: Paper and Reference Management

Research Hacks are a series of blog posts about some of the tools, applications, and computer programs that I use in my workflow. Some of these I began using when I was a graduate student, and others I have picked up more recently. This is the second post in the series (see the first post, Feedreaders and Aggregators).

Electronically managing the absurdly large volume of articles, reports, book chapters and other writings that academics procure is a huge way to save time and increase productivity. My initial way to manage these files (often PDFs) was to include them in a folder that corresponded to a particular project or paper. Because I could never find the article again (Spotlight was a long way from working well at this point), I often had multiple copies of the same paper residing on my computer. This also meant that my annotations were scattered across those copies.

One summer, when I realized that I had 11 copies of a paper on covariational reasoning (the topic of my dissertation) on my computer, I laughed at the absurdity of this system and vowed to fix it. This is when I found Papers.

Papers (now in its second version—Papers2) is a management system for a person’s “research library” (as they refer to it). It is sort of like iTunes for PDF files. You have a “library” of files (only one place on your computer) and these are displayed in the Papers application (just like iTunes). You can then have “playlists” in which you put these files, but without creating multiple copies! For example, you could have “playlists” containing the references for each paper you are currently writing.

Screenshot of the Papers2 application

The search feature is great. If you are an organization nut like myself, you can also input all sorts of meta-data (publication type, tag words, photos, links to supplementary material, etc.). Papers can also output references for BibTeX or EndNote and has integration with Scrivener and Word. There are limited annotation tools within Papers at this point (although more in v2.1), but rumor has it that annotation is a big part of the future. There are also several workarounds using Dropbox, Skim, etc. Lastly, there are iPhone and iPad apps for Papers that I think are beautiful. Reading articles on the iPad is one of the coolest things ever.

Papers on the iPad

Unfortunately Papers is not free (but there is a substantial discount for students). Also, as far as I know, it is only available for Mac users. There are several other management and reference systems available as well. Two of those are Zotero and Mendeley.

Each system has features that are really cool and some that aren’t as well developed. Why did I choose Papers? At the time, it was pretty much the only choice that existed in a state that actually worked. (I seem to recall Mendeley had just been released as a beta version.) Would I make the same choice now? I am not sure, but I think so. My second choice would be Zotero. (I am a little concerned about what will happen to Mendeley now that it has been purchased by Elsevier.)

No matter what choice you make, let me make several suggestions.

  1. Begin using it immediately.
  2. Begin entering meta-data for every paper you have right away. Don’t be chintzy here. Yes, I know it is time-consuming, but that just gets worse as you accumulate more and more articles. Some of this can be automated depending on the recency of the paper, etc.
  3. Learn how to use it to input references into a paper.
  4. Figure out a workflow for paper annotation (taking notes, highlighting, etc.)

Summer is a wonderful time to learn a new software program or computing language. Happy computing!

Research Hack: Feedreaders and Aggregators

I have been thinking for several years that I should put together a series of blog posts about some of the tools, applications, and computer programs that I use in my workflow. Some of these I began using when I was a graduate student, and others I have picked up more recently.

I initially wanted to do this to share these tools with our graduate students at the University of Minnesota. It seems crazy if you say it to a graduate student, but a person in academia never has any more time than she does as a graduate student, which makes it a perfect time to learn new skills and develop excellent habits. Sharing these ideas on this blog is even better. Perhaps others can weigh in and offer alternatives or (frankly) better ideas than what I posit here.

According to Wikipedia, a “life hack is any productivity trick, shortcut, skill, or novelty method to increase productivity and efficiency.” The tools I will write about have made my work and research life easier and more productive and thus I have dubbed them “research hacks”. I hope they do the same for you.

The first type of application that I wanted to write about is a feedreader. This may seem an odd choice for the initial research hack, especially given my love of R and RStudio, but I think it is apropos. Reading research and staying abreast of current work is the lifeline of academics and researchers. Feedreaders make this easier.

In a nutshell, feedreaders scan websites for new information and then present that new information in digestible chunks, often aggregating the feeds from many websites into one application. Imagine a single “journal” that showed you all of the abstracts from the journals that you read! Not only that, but it also could show you “abstracts” of any blog entries for the blogs that you read. And “abstracts” of the major news stories from the newspapers you read.

There are several options available for a feedreader depending on the device/computer system that you use. While I don’t endorse one over any other, I will tell you what I use. (If I don’t there will be questions, and it may give you a place to start if this kind of tool is new to you.) On my Mac, I use the program Vienna. On my iPad, which is where I do most of my browsing and blog reading, I use Flipboard.

A screenshot of Vienna running on my Mac.

Once you have a feedreader that you like, you can enter in new subscriptions. These are just website or blog URLs. If the website has a feed, the reader will detect it and add the website to your subscriptions. Then, anytime you open the reader, it will alert you as to whether the website or blog has a new post, which you can then read.
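
To make the idea of a feed a little more concrete, here is a minimal sketch in R of what a reader does once it has found one: it parses the feed and pulls out each post's title, link, and summary. The feed text, the blog name, and the example.com URLs are all made up for illustration, and I am assuming a simple RSS 2.0 layout; real feeds (and real feedreaders) handle many more cases.

```r
# Sketch only: the feed below is invented, not taken from a real website
library(XML)

feed_text <- '<rss version="2.0">
  <channel>
    <title>Example Statistics Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <description>A short summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second-post</link>
      <description>A short summary of the second post.</description>
    </item>
  </channel>
</rss>'

# Parse the feed from a string rather than fetching it from a URL
feed <- xmlParse(feed_text, asText = TRUE)

# Each <item> is one post; collect its title, link, and summary
posts <- data.frame(
  title   = xpathSApply(feed, "//item/title", xmlValue),
  link    = xpathSApply(feed, "//item/link", xmlValue),
  summary = xpathSApply(feed, "//item/description", xmlValue),
  stringsAsFactors = FALSE
)

print(posts)
```

A feedreader essentially repeats this for every subscription and remembers which items it has already shown you.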

Which websites or blogs should you subscribe to? This is a matter of personal taste and also, for researchers, coverage. Onlinemathdegrees.com published a list of 100 statistics sites that might be a good place to start. Below, I list a few subscriptions that I have:

I also subscribe to blogs written by students and friends, blogs that I find interesting (e.g., The Long and Short of it All: A Dachshund Dog News Magazine), blogs that I find funny or creative (McSweeney’s), aggregators about things I like (e.g., books, design, gardening), and pretty much anything else I want to keep tabs on. I wish more journals employed feeds so that I could keep up with them that way.

New Issue of JSE

Michelle Everson just announced that the March 2013 issue of the Journal of Statistics Education (JSE) is now available online. You can get to that issue from the homepage of JSE (http://www.amstat.org/publications/jse/). This month, JSE also introduces some new features:

  • Department on Research in K-12 Statistics Education
  • JSE webinar series (beginning June, 2013)
  • New Facebook group
  • New Twitter account

Visit JSE online and enjoy the new issue!

Dear Gmail…

I recently added a free application/service that analyzes my email called Gmail Meter. This service sends me a comprehensive weekly report full of summaries and plots that indicate how I use Gmail.

The first thing I learned is that Wednesdays are for emailing and I seem to respond in a timely manner, on average, to emails sent to me…when I actually respond (I have a 24.58% response rate. Yikes!) Wednesdays I only teach one class (at 4:40pm) this semester, but I have a morning meeting so I am on campus and generally have time to respond to emails that I may not have gotten to.

Summary of my Gmail

The plot of my daily email traffic shows that most email is sent to me during the day (typical work hours), while I tend to send email prior to classes in the morning and after my evening courses. Also, it is clear I am sending far less than I receive. It appears I am doing my part to lower my email footprint!

I seem to be more prompt on my email responses (for the most part) than others who respond to me. What is interesting is that people who respond to me are primarily either very quick (<4 hrs) or take more than a day to get back to me. This fits with the behavior I expect from most academics.

In the emails I send, I tend to be terse. Generally, I try to avoid long emails to people since when long emails are sent to me I tend to get cranky. (I recognize that sometimes it can’t be avoided.) I actually am quite pleased that the mode here is less than 10 words. (Again, yay for my footprint!)

I am not quite as happy to see that the mode for emails sent to me is the category indicating more than 200 words. Some of this is because of the university committees I sit on. For example, the University of Minnesota Senate sends many emails. These emails often are lengthy because of the inclusion of bylaws and articles to the University Constitution that we will be voting on. That being said, I agree with this email charter which begs us all to keep it short.

What kind of media attachments are taking up space in my Gmail box? It seems that most are Microsoft Word documents. Again, given my collaboration with other academics and feedback to students, this makes sense to me. Since I have a Mac and most of my colleagues still work on PC, I send many documents as PDF files. My guess is that if this report had been sent to me a few years ago, the number of attachments would have been even higher. Our research group has slowly worked toward using sites like Dropbox to share documents. (Next stop…some versioning system.)

Now for the plot that made me stop and write this post. Almost 90% of the email I received this week hit the trash can. Also, a small percentage is still in my inbox. I am trying to achieve Inbox Zero, but just haven’t made it yet. I am currently down to xxx emails in my inbox. I signed up for the Mailbox app, which should help with this goal when I check email on my phone, but like the Tempo app that Rob signed up for, there is a reservation system in place. Unlike Rob, my spot in the Mailbox line is nowhere near the front (last I looked, 632,889 people were in front of me) despite having reserved my place in line several weeks ago.

I also receive information on the week’s top emailers to me (Joan) and the top recipients of my mail (one of my students), the top conversation threads, and a scatterplot of the number of words per email in a thread versus the rank of the email in the thread (was it the 1st email sent, 2nd, etc.). As one might expect, there is a strong, negative relationship here. It also produces a word cloud based on the subjects and bodies of all messages sent or received directly. Lastly, it conditions emails received with attachments on whether they came from inside or outside the organization (University of Minnesota).

It is not clear that you can obtain the raw data, although it is not clear that you can’t either. There are of course ways to obtain the meta-data that Gmail Meter is using by scraping it using a language such as Python (see here). My guess is that you could also do this with R (perhaps using the curl and XML packages). They have several feature requests for making Gmail Meter more customizable, which would make it even cooler.

Upcoming events: Rob Gould and Chris Franklin on Google+ Hangouts on Air

Update: the event has ended, but can be watched via YouTube

Google+ Policy by the Numbers is airing a K-12 statistics education discussion on Nov. 28 at 4 pm EST via Hangout on Air. With the ever-increasing number of students taking AP Statistics each year and the inclusion of statistics in the Common Core State Standards for Mathematics, Franklin and Gould will address the value of statistical literacy, the increasing interest, and the challenges. Please tune in Nov. 28 at 4 p.m. EST at the Policy By the Numbers Google+ page.  I thank the American Statistical Association for working with Google+ to arrange the event.

Computing Skills, Nunchaku Skills, Bow Skills…

I have been thinking for quite some time about the computing skills that graduate students will need as they exit our program. It is absolutely clear to me (not necessarily all of my colleagues) that students need computing skills. First, a little background…

I teach in the Quantitative Methods in Education program within the Educational Psychology Department at the University of Minnesota. After graduating, many of our students take either academic jobs, a job working in testing companies (e.g., Pearson, College Board, etc.), or consulting gigs.

I have attended conferences and read blogs and papers in which it has been suggested that students learn computing skills. I am convinced of this need at 100.4%, 95% CI = [100.3%, 100.5%].

The more practical issue is what computing skills should these students learn and how deeply? And, how should they learn them (e.g., in a class, on their own, as part of an independent study)?

The latter question is important enough to merit its own post (later), so I will not address that here. Below I will begin a list of the computing skills that I believe the Quantitative Methods students should learn, and I hope readers will add to it. I use the word computing rather broadly as a matter of intention. I also do not list these in any particular order at this point, other than how they come to mind.

  • At least one programming language (probably R)
    • In my mind two or three would be better depending on the content focus of the student (Python, C++, Perl)
  • LaTeX
  • Knitr/Sweave (I used to say Sweave, but Knitr is easier initially)
  • HTML/HTML5
  • CSS
  • KML
    • I think students should also know about PHP and JavaScript. Perhaps they don’t have to be fluent in them, but they are important to know about. For example, to learn D3 (a visualization toolkit) it would behoove a student to learn JavaScript.
  • Markdown/R Markdown. These are, again, easy to learn and could ease students’ transition to Knitr. They could also lead to learning and using Slidify.
  • Regular Expressions
  • SQL
  • XML
  • JSON
  • XPATH
  • BibTeX (or some program to work with references….Mendeley, EndNote, something…)
  • Some other statistical programs. Some general (e.g., SAS, SPSS); some specific (MPLUS, LISREL, OpenMX, AMOS, ConQuest, WinSteps, BUGS,  etc.)
  • Unix/Linux and Shell Scripting

I think students could learn many of these at a lesser level: the basics, and how to use them to solve simpler problems. In this way there is at least exposure. Interested students could then take it upon themselves (with faculty encouragement) to learn more about specific computing skills that are important for their own research.

What have I missed?

Red Bull Stratos Mission Data

Yesterday (October 14, 2012), Felix Baumgartner made history by becoming the first person to break the speed of sound during a free fall. He also set some other records (e.g., longest free fall, etc.) during the Red Bull Stratos Mission–which was broadcast live on the internet. Kind of cool, but imagine the conversation that took place daydreaming this one…

Red Bull Creative Person: What if we got some idiot to float up into the stratosphere in a space capsule and then had him step out of it and free fall four minutes breaking the sound barrier?

Another Red Bull Creative Person: Great idea! Let’s also broadcast it live on the internet.

Well anyway, after the craziness ensued, it was suggested on Facebook that, “I think this data should be on someone’s blog!” Rising to the bait, I immediately looked at the mission page, but the data was no longer there. Thank goodness for Wikipedia [Red Bull Stratos Mission Data]. The data can be copied and pasted into an Excel sheet, or read into R using the readHTMLTable() function from the XML package.

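# Load the XML package, which provides readHTMLTable()
library(XML)

# Read the HTML table(s) on the Wikipedia page into R
mission <- readHTMLTable(
  doc = "http://en.wikipedia.org/wiki/Red_Bull_Stratos/Mission_data",
  header = TRUE
  )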

We can then write it to an external file using the write.csv() function. I called it Mission.csv and put it on my desktop.

write.csv(mission,
  file = "/Users/andrewz/Desktop/Mission.csv",
  row.names = FALSE,
  quote = FALSE
  )

Opening the new file in a text editor, we see some issues to deal with (these are also apparent from looking at the data on the Wikipedia page).

  • The first line is the first table header, Elevation Data, which spanned three columns in the Wikipedia page. Delete it.
  • The last row is the reprinted variable names. Delete it.
  • Change the variable names in the current first row to be statistical software compliant (e.g., remove the commas and spaces from each variable). My first row looks like the following:
Time,Elevation,DeltaTime,Speed
  • Remove the commas from the values in the last column. With a comma separated value (CSV) file, they are trouble.
  • There are nine rows which have parentheses around their value in the last column. I don’t know what this means. For now, I will remove those values.
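
The same cleanup could also be done in R rather than in a text editor. The sketch below is not what I actually did; it uses a tiny made-up data frame (the column names V1 to V4 and every value in it are invented) that mimics the problems listed above, and it assumes the table has already been reduced to a header row, the data rows, and the reprinted names at the bottom.

```r
# Made-up rows that mimic the problems described above; not the real data
raw <- data.frame(
  V1 = c("Time, s", "0", "10", "20", "Time, s"),
  V2 = c("Elevation, m", "39000", "38500", "37000", "Elevation, m"),
  V3 = c("Delta time, s", "0", "10", "10", "Delta time, s"),
  V4 = c("Speed, km/h", "0", "(350)", "1,100", "Speed, km/h"),
  stringsAsFactors = FALSE
)

# Drop the header row and the reprinted variable names in the last row
mission <- raw[-c(1, nrow(raw)), ]

# Use software-friendly variable names (no commas or spaces)
names(mission) <- c("Time", "Elevation", "DeltaTime", "Speed")

# Treat parenthesized speeds as missing, since their meaning is unclear
mission$Speed[grepl("^\\(.*\\)$", mission$Speed)] <- NA

# Remove the commas inside values and convert every column to numeric
mission[] <- lapply(mission, function(x) as.numeric(gsub(",", "", x)))
rownames(mission) <- NULL

print(mission)
```

Doing the cleanup in code has the advantage that the steps are recorded and can be rerun if the Wikipedia table changes.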

The file can be downloaded here.

Then you can plot (or analyze) away to your heart’s content.

# read in data to R
mission <- read.csv(file = "/Users/andrewz/Desktop/Mission.csv")

# Load ggplot2 library
library(ggplot2)

# Plot speed vs. time
ggplot(data = mission, aes(x = Time, y = Speed)) +
  geom_line()

# Plot elevation vs. time
ggplot(data = mission, aes(x = Time, y = Elevation)) +
  geom_line()

Since I have no idea what these really represent other than what the variable names tell me, I cannot interpret these very well. Perhaps someone else can.

An Apropos Talk for this Blog

Jeffrey Breen just gave a talk entitled “Tapping the Data Deluge with R” to the Boston Predictive Analytics Meetup. He suggests there are two types of data in this world:

  1. Data you have, and
  2. Data you don’t have…yet.

In the talk Jeffrey provided a nice overview of several methods for importing data into R, including:

  • Reading CSV files
  • Reading XLS files
  • Reading data formats from other statistics packages (e.g., SPSS, Stata, etc.)
  • Reading email data
  • Reading online data files
  • Web scraping data
  • Using APIs to access data

He also touches on some of the R packages that are useful for adding supplementary data to enrich an analysis (e.g., zipcode).
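
As a small illustration of the two kinds of data (this is my own sketch, not code from the talk), the example below reads a CSV and then enriches it with a lookup table. The survey values are invented, and the hand-built zip_lookup data frame is a hypothetical stand-in for the kind of table a package such as zipcode would supply.

```r
# Data you have: a small CSV, read here from a string instead of a file
survey <- read.csv(
  text = "id,zip,score\n1,55455,82\n2,90095,91\n3,55455,77",
  colClasses = c("integer", "character", "integer")
)

# Data you don't have yet: a hypothetical ZIP code lookup table
zip_lookup <- data.frame(
  zip   = c("55455", "90095"),
  city  = c("Minneapolis", "Los Angeles"),
  state = c("MN", "CA"),
  stringsAsFactors = FALSE
)

# Enrich the survey rows with city and state by matching on ZIP code
enriched <- merge(survey, zip_lookup, by = "zip", all.x = TRUE)

print(enriched)
```

Reading the ZIP codes as character rather than numeric keeps any leading zeros intact, which matters for codes in the northeastern United States.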