GitHub workflow for data science project proposals
Over the past few years I’ve been working on moving from a mindset of end-of-semester project to semester-long project. Inevitably students end up doing lots of work as the deadline approaches at the end of the semester (and I can’t blame them, that’s how I work around deadlines too, and how just about anyone I know works), but creating opportunities for them to get started on their projects earlier in the semester is very important.
A natural way of doing this is with a project proposal about mid-way through the semester. A challenging aspect of this is that students have to propose a project before they’ve learned all methods and techniques, but there is simply no way to do it any other way, and I find that providing feedback along the lines of “and we’ll also learn about X which you can apply here” can help mitigate this issue.
This blog post is about how I implement project proposals in my Introduction to Data Science course. This is the second year I’m using this approach fully, and I feel like it’s working quite well. Two important details about the course that relate to projects are:
- students learn R and turn in all assignment write-ups as R Markdown documents in GitHub repositories, and
- students work in teams on weekly computing labs as well as the semester-long project.
You can read about the curriculum here, see the course webpage for this semester here, and specifically the project instructions here.
Project proposal assignment
I create a GitHub repository for each team with the following structure.
.github/
|-- ISSUE_TEMPLATE/
|-- proposal-feedback.md
|-- workflows/
|-- allowed-files.yaml
|-- knit.yaml
data/
|-- README.md
presentation/
|-- presentation.Rmd
proposal/
|-- proposal.Rmd
README.md
README.Rmd
For this phase students are only expected to work in the data/
and proposal/
folders.
- Any data used for the project gets placed in the
data/
folder and the codebook for the dataset(s) are added to theREADME.md
file in that folder. - The proposal write-up goes in
proposal.Rmd
and ultimately students knit this document and push it to their repos along with the output,proposal.md
.
Project proposal review
The proposals are marked out of 10 points, since they make up 10% of the overall project score. A breakdown of the 10 points is outlined below. There is a lot more that goes into the evaluation than just these bullet points, especially because these are open-ended projects and we’re encouraging students to be creative with their projects, but these were points I kept in mind to ensure consistency in grading.
(3 pts)
data/
- Contains dataset(s) to be used in the project
- Instructions removed from
README
and codebook added - Metadata on the data clearly stated, e.g. “The dataset has/contains/is >comprised of/etc. R observations, each representing […], and C columns.”
(5 pts)
proposal/
- (1 pt) Structure:
.Rmd
file is updated and is in the repo.md
file is updated and is in the repo- Figures are visible in the
.md
file - Uses section headings to organise each part
- Contents:
- (1 pt) Part 1: Introduction
- Research question(s) clear
- Cases are stated
- Variables to be used are explained
- (1 pt) Part 2: Data
glimpse()
orskim()
output available
- (2 pts) Part 3: Data analysis plan
- Response and explanatory variable(s) clear
- Comparison groups clear
- Preliminary EDA results and interpretation
- Stats methods to be used outlined
- Hypothesized results stated and inline with / can reasonably be supported by the preliminary EDA
- (1 pt) Part 1: Introduction
- (1 pt) Structure:
(1 pt) Workflow:
- Data read in from
data/
folder - Reasonable number of commits
- Meaningful commit messages (or at least not an abundance of not meaningful ones)
- No unexpected/disallowed files
- Data read in from
(1 pt) Teamwork: All members committed to the repo
After the proposal deadline I evaluated each of the proposals – there are 65 teams in the course, so this was no small endeavour!
After reading through a few projects I took the time to write issue templates for common feedback and pushed the templates to teams’ repos using ghclass::repo_add_file()
, which improved my process for giving feedback and greatly reduced the time each review takes. In many cases I didn’t use these issue templates verbatim, but it was nice to have starter text for things I wanted to point out like how to structure the codebook, reading in the data from the data folder (as opposed to replicating it in the proposal folder), structuring the research question, etc. Note that I also used a template for the issue containing marks and information on the next steps. This included instructions for how to close issues with commits, which students hadn’t previously done in class.
I use issue templates for providing feedback for all assignments in the course – it’s a convenient place to add a rubric that all markers can follow and that can be transparently shared with the students along with their final marks. It also makes the markers’ jobs easier and helps keep feedback consistent across multiple markers. See here if you want to learn more about GitHub issue templates.
Project re-proposal
When reviewing proposals, I like being as constructive as possible with my comments as students will be working on their projects for the remainder of the course and the project makes up 20% of their overall marks (so it’s no small deal for them). The last thing I want is for students to assume they were on the right track, only to find out at the end that they weren’t.
I also want to make sure students actually review the feedback and feel motivated to improve their projects based on the feedback. To address both of these goals, we also have a re-proposal stage. In this stage students are asked to address each of the issues on their proposal, close them with specific commits, and finally tag me on their repos to say they’re ready for another look at their work.
This stage is optional, so teams who feel happy with the score they got can just take the feedback and not put in the work right away to improve their projects. But for teams who want to improve their score, it gives them an opportunity to do so. Their final proposal score is the average of the scores they receive in the two phases. Based on past experience, most teams get close to a 10/10 at the re-proposal stage.
From the instructor’s perspective it’s very fulfilling to see students deliberately work on specific improvements to their projects. And while I don’t have student feedback on this (yet, I’ll ask students about it at the end of the semester), students’ commit messages suggest that they’re, at least, not frustrated by the experience.
Students closing these issues one-by-one (roughly 5 issues per repo) generates a lot of emails for me, but that was honestly a nice distraction from election doomscrolling!
I also used this week’s code-along session to go over some of the common issues I observed in the proposals as well as to demo how to close issues with commits. Even though I wrote out instructions for this, it seemed worthwhile to demo the process since they hadn’t done this before. I also demoed how to use here::here()
to define the path to data. So far in the class we didn’t really need to use this approach because student assignments were always structured with the data/
folder at the same level as the R Markdown file they write in. The folder structure I outlined above created challenges when knitting the R Markdown document vs. running the code in the console. You can find a recording of the code-along session here.1
What’s next?
For me, I need to re-read a bunch of project proposals, which sounds draining, but I know students’ projects will be better for it at the end. Knowing that it will help them (and, let’s be frank, knowing that it will help generate better project presentations to listen to and mark at the end of the semester) is what motivates me to do it.
For the students, once they get their re-proposal feedback, they’ll move on to making progress on their projects. After the next project milestone in a couple weeks, students will do peer review on each others’ work. The final deliverables for the project are a 5 minute presentation (and a slide deck made with xaringan) and an “executive summary” which goes in the README of their project repo. Before they finish their projects they will enable ghpages for their repos, giving them a landing page for their project with their summary and a link to their slides. And at the end of the semester teams will have the option to make their project repos public.2
This has been a nice distraction from marking those re-proposals. I better get back to it…
I tweeted before that my fake student account is named Florence Nightingale. The fake student team is called Florence and the Machine. I’m pretty proud of this, and it gets a chuckle from the students 😃.↩︎
Think of it as a souvenir from the course!↩︎