track versions of a project with git

overview

you take snapshots of your project once in a while. one “commit” = one snapshot
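conceptually, each commit records the whole project; git stores these snapshots compactly (unchanged files are not duplicated, and content is compressed)
git stores its data in a special .git directory at the top of the project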
you can easily restore the whole project to a previous snapshot, then get back to the latest snapshot
each collaborator has the project on their local machine, and a remote copy of the project is on GitHub. Collaborators can “pull” changes from GitHub and “push” their own changes to GitHub.
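To make "restore a previous snapshot, then get back to the latest" concrete, here is a minimal sketch in a scratch repository. The directory, the file name notes.txt and the identity "Demo User" are placeholders for illustration, not part of the course project. The commands it uses (add, commit) are explained step by step further down.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder, not the course project)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity, local to this repo
  git config user.email "demo@example.com"

  echo "version 1" > notes.txt
  git add notes.txt
  git commit -q -m "first snapshot"
  echo "version 2" > notes.txt
  git commit -q -a -m "second snapshot"

  ls -a                     # the .git directory holds all of git's data
  git checkout -q HEAD~1    # whole project back to the previous snapshot
  cat notes.txt             # version 1
  git checkout -q -         # back to where we were: the latest snapshot
  cat notes.txt             # version 2
)
```

The parentheses run everything in a subshell, so your own terminal stays in its current directory.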

jump to:

examples
commits, staging area, working directory
commit messages
looking at history
moving / deleting tracked files
what (not) to track
using older commits to fix mistakes

first examples

history of repository for the course website:

git log --abbrev-commit --graph --pretty=oneline --all --decorate

example when it was useful to get back to an old version:

(screenshot: email about recovering an old version)

we can get the old version of this file in a few clicks:

go to the project on github
click on “863 commits” (or whatever the number is) near top left
scroll down to a commit that seems to have affected our file of interest

(screenshot: scrolling through the git history on GitHub)

click on “Browse the repository at this point in the history” to see all the files as they were just before the change that affected our file (June 8th). Tada!
notice the commit SHA near the top: “Tree: 019f0ff78d”. Click on it to get back to the current version of the files, typically “Branch: master”
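The same recovery can be done from the command line. The sketch below is one possible way, shown in a scratch repository with a hypothetical file page.md: list the commits that touched the file, then print the file as it was just before the most recent of them.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity
  git config user.email "demo@example.com"

  echo "original text" > page.md
  git add page.md
  git commit -q -m "add page"
  echo "text after an unwanted change" > page.md
  git commit -q -a -m "rewrite page"

  git log --oneline -- page.md               # commits that affected page.md, newest first
  bad=$(git log -1 --format=%h -- page.md)   # most recent commit that touched the file
  git show "${bad}^:page.md"                 # the file just before that commit: original text
)
```

In `${bad}^:page.md`, the `^` means "the parent of this commit" and the part after the colon is the path of the file inside that snapshot.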

let’s create a repository from our corn SNPs project

cd ~/Documents/private/st679/zmays-snps
ls -lR
git init # initialize repo: creates .git/
ls -a
git status
git add readme.md data/readme # git now tracks these files
git status
echo "Zea Mays SNP Calling Project" >> readme.md
cat readme.md
git status # readme.md is tracked, previous version in staging area, new version not

commits, staging area, working directory

commit – … – commit – staging area – working directory: tracked files, untracked files

git diff # differences between working directory and staging area (if the file is staged) or last commit
git add readme.md # adding new edits to staging area
git status        # latest edits to readme.md are now in the staging area
git diff          # no differences btw working dir and staging area
git diff --staged # diff between staging area and last commit
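A file can be in several of these states at once. The following sketch (scratch repository, hypothetical file names) puts one edit in the staging area and a later edit only in the working directory, then uses the compact `git status --short` view, where the first column describes the staging area and the second column the working directory.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity
  git config user.email "demo@example.com"

  echo "line 1" > readme.md
  git add readme.md
  git commit -q -m "add readme"

  echo "line 2" >> readme.md
  git add readme.md             # "line 2" is now staged
  echo "line 3" >> readme.md    # "line 3" is only in the working directory
  touch scratch.txt             # never added: untracked

  git status --short            # MM readme.md, then ?? scratch.txt
  git diff                      # working directory vs staging area: +line 3
  git diff --staged             # staging area vs last commit: +line 2
)
```

Committing at this point would record "line 2" only, because "line 3" has not been staged.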

now take the snapshot

git commit -m "initial commit, main readme only"

With git commit alone (no -m option), an editor opens so that you can write your commit message. If you land in an editor that behaves strangely (vim), type :q! (to quit without saving), then change your git configuration to use nano instead of vim:

git config --global core.editor nano

commit messages

first line: title
- informative. forbidden: “update”, “continued”, “new code”, “misc”, “edits”
- 50 or fewer characters is strongly recommended
if more explanations are needed: add one blank line
then your explanation paragraph

informativeness: helps to recover old versions (example above)
separation of title vs paragraph: good example here, suboptimal example here
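One way to get the title / blank line / paragraph layout without opening an editor is to pass -m twice: git turns each -m into its own paragraph. This is a sketch in a scratch repository, and the message text is made up for illustration.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity
  git config user.email "demo@example.com"

  echo "Zea Mays SNP Calling Project" > readme.md
  git add readme.md
  # first -m: title (34 characters); second -m: explanation paragraph
  git commit -q -m "add main readme with project title" \
                -m "The readme is the entry point for collaborators."

  git log -1                       # full message: title, blank line, paragraph
  git log -1 --pretty=format:%s    # title only
  echo
  git log -1 --pretty=format:%b    # explanation paragraph only
)
```

Tools such as git log --pretty=oneline show the title only, which is why it has to be informative on its own.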

looking at history

let’s check:

git show   # shows last commit: title, paragraph, diffs: change "hunks"
git status # nothing in staging area, but some files not tracked
git log
git log --pretty=oneline

in a commit message: the first line has a special role, must be kept short
by the way: git log uses less to view your git history

now add more edits:

echo "Project started 2020-09-24" >> readme.md
git diff
git commit -a -m "added project info to main readme"
git log
git log --pretty=oneline --abbrev-commit

option -a in git commit: stages all changes in tracked files and adds them to the commit. untracked files are not included.
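The limit of -a is easy to check in a scratch repository (hypothetical file names below): the edit to the tracked file is committed, but the new file stays untracked until it is added explicitly.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity
  git config user.email "demo@example.com"

  echo "v1" > tracked.txt
  git add tracked.txt
  git commit -q -m "add tracked.txt"

  echo "v2" >> tracked.txt      # edit to a tracked file
  echo "new" > untracked.txt    # brand new file, never added
  git commit -q -a -m "update tracked.txt"

  git status --short            # ?? untracked.txt : -a did not pick it up
  git show --name-only --pretty=format: HEAD   # files in the last commit: tracked.txt
)
```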

use git to move or delete tracked files

git mv data/readme data/readme.md
git status
git commit -m "added markdown extension to data readme"
git log

much more complicated alternative to git mv: (do not run!! just to convince you that git mv is much better)

mv data/readme data/readme.md # not good: new file name not tracked
git add data/readme    # track the deletion of data/readme
git add data/readme.md # track the addition of data/readme.md
git status # git's best explanation of these changes is a file rename
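To see what git records for a rename, here is a sketch in a scratch repository that mirrors the data/readme example (the file content is made up). After git mv, the short status shows a single rename entry, and git log --follow keeps listing the history from before the rename.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity
  git config user.email "demo@example.com"

  mkdir data
  echo "notes about the data files" > data/readme
  git add data/readme
  git commit -q -m "add data readme"

  git mv data/readme data/readme.md
  git status --short            # R  data/readme -> data/readme.md
  git commit -q -m "add markdown extension to data readme"

  git log --oneline --follow -- data/readme.md   # both commits, across the rename
)
```

Without --follow, git log -- data/readme.md would stop at the commit that introduced the new name.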

what files (not) to track / commit

track:

scripts
text documentation with metadata: explain where the data are archived, how to reproduce result files
notebooks (code + explanations + interpretations) in text format (md or html), preferably only the final ‘compiled’/knitted version.

do not track:

large files that can be reproduced by the pipeline
large data files: if they can be obtained from an outside archive
binary files: document where they were obtained or how to recompile
pdf and figures: document how they can be reproduced
MS Word documents: they are not plain text files

We can tell git to ignore files that we do not want to track.

touch .gitignore
echo "data/seqs/*.fastq" >> .gitignore
cat .gitignore
git status # fastq files not listed anymore. but need to track .gitignore
git add .gitignore
git commit -m "added .gitignore, to ignore large fastq data files"
git status # all good
git log
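When it is unclear whether a pattern in .gitignore does what you expect, git check-ignore can tell you. A sketch in a scratch repository, with hypothetical file names sample.fastq and notes.txt:

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q

  mkdir -p data/seqs
  touch data/seqs/sample.fastq data/seqs/notes.txt   # hypothetical files
  echo "data/seqs/*.fastq" > .gitignore

  # -v prints the .gitignore file, line number and pattern that match
  git check-ignore -v data/seqs/sample.fastq
  # exit status is non-zero when the file is not ignored
  git check-ignore -v data/seqs/notes.txt || echo "notes.txt is not ignored"

  git status --short            # lists .gitignore and data/ (for notes.txt), no fastq file
)
```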

using older commits to fix mistakes

echo "todo: ask sequencing center about adapters" > readme.md
cat readme.md # oops
git status    # git tells us how to undo our change
git checkout -- readme.md # to checkout 'readme.md' from the last commit
git restore readme.md     # same, with git 2.23 or later
cat readme.md # yes!
git status

What if the mistake has been staged?

echo "todo: ask sequencing center about adapters" > readme.md
git add readme.md
git status  # again, follow git's instructions
git reset HEAD readme.md # unstage; with git 2.23 or later: git restore --staged readme.md
git status
cat readme.md # mistake still there, but unstaged
git checkout -- readme.md
cat readme.md # yes!
git status
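The examples above undo a mistake that has not been committed yet. If the mistake was already committed, one option is to copy the file back from an older commit and commit that as a fix. The sketch below assumes a scratch repository where the good version is one commit before the last (HEAD~1). In a real project, pick the commit from git log.

```shell
demo=$(mktemp -d)               # scratch directory (placeholder)
(
  cd "$demo" || exit 1
  git init -q
  git config user.name "Demo User"          # placeholder identity
  git config user.email "demo@example.com"

  echo "Zea Mays SNP Calling Project" > readme.md
  git add readme.md
  git commit -q -m "add readme"
  echo "todo: ask sequencing center about adapters" > readme.md
  git commit -q -a -m "overwrite readme by mistake"

  git log --oneline -- readme.md     # find the last good commit
  git checkout HEAD~1 -- readme.md   # copy readme.md from that commit, and stage it
  cat readme.md                      # Zea Mays SNP Calling Project
  git commit -q -m "restore readme from before the accidental overwrite"
  git log --oneline                  # 3 commits: the mistake stays in history
)
```

Nothing is erased from history this way: the fix is a new snapshot on top of the mistaken one.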
