project management & markdown format
Set up and manage an analysis project
- well organized
- well documented
This is key to:
- collaborate with yourself in 6 months
- re-run some analyses, or run some new analyses, in 6 months because the reviewers ask for them
- collaborate with others
- communicate with your advisor
- detect errors
directory structure
- at the onset: create a few directories (mkdir) and empty readme.md files (touch)
- all files in a project: live in a single directory, with a clear name (don’t scatter your files)
- separate directories for:
- data
- scripts
- binaries (of other people’s programs that you used)
- results, or analysis
- figures
- manuscript
- or even: subprojects, each with its own script & analysis directories
data: all in the same directory, or a separate directory data_clean for intermediate data, or subdirectories data/original and data/clean. Never edit raw data. Use a script to clean the original data file, then save the resulting clean data.
results or analysis: if the pipeline is complex, use different directories for different kinds of results (intermediate, or different analyses), or subdirectories.
scripts: if there are many, organize them in subdirectories.
- use relative paths in scripts: you can move your entire directory somewhere else (e.g. on your collaborator’s laptop) and things will still work.
- short scripts to double-check quality, make a quick figure, etc: may be in results.
figures: not everybody will agree. Can make it easier to modify a figure for a publication (if asked by reviewers) or for a presentation 6 months later.
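The advice above (never edit raw data, use relative paths in scripts) can be sketched as a small cleaning script. This is only an illustration: the demo project, the file name samples.csv and the cleaning rule (drop comment lines and blank lines) are all made up for the example.

```shell
# hypothetical demo project in a temporary directory, so nothing real is touched
proj="$(mktemp -d)/demo-project"
mkdir -p "$proj/data/original" "$proj/data/clean" "$proj/scripts"

# made-up raw data file: a comment line, a header, a blank line and one record
printf '# sent by collaborator\nsample,reads\n\nA,100\n' > "$proj/data/original/samples.csv"

# the cleaning script finds the project root from its own location,
# then only uses paths relative to that root
cat > "$proj/scripts/clean-data.sh" <<'EOF'
#!/bin/sh
cd "$(dirname "$0")/.." || exit 1    # project root = parent of scripts/
# drop comment lines and blank lines; the original file is read, never modified
grep -v -e '^#' -e '^$' data/original/samples.csv > data/clean/samples.csv
EOF

sh "$proj/scripts/clean-data.sh"
cat "$proj/data/clean/samples.csv"
```

Because the script never mentions an absolute path, the whole project directory could be moved or copied elsewhere and the script would still find its input and output.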
recall advice on file names
example with different structure, because of many different subsets of analyses: readme file(s) are extremely important
example: SNP calling in corn (Zea mays)
cd ~/Documents/private/st679
mkdir zmays-snps
cd zmays-snps
mkdir data
mkdir data/seqs scripts analysis
ls -l
project documentation
document:
- data provenance (metadata): who sent which file, when, how? downloaded from where, when, how (e.g. MySQL vs UCSC Genome Browser)?
- binaries (programs, R packages): url, version, date installed (version for R, for R studio, for R packages)
- methods and workflows: everything that would be needed if you had to re-run the whole thing. Copy-paste full command lines to re-generate clean data files, intermediate files & results.
how:
- use plain text readme files. They are
  - easy to read, search and edit from the command line (good if working on a remote server)
  - portable, light
  - long-lived: text files written in the 1960s are still readable, whereas files from a 15-year-old word processor might be difficult to open or edit.
- Microsoft Word is not good for analysis project documentation
- typically one readme.md file per directory. Explain what the directory contains, where from & when, how it got there.
touch readme.md data/readme
touch updates the modification time of a file, or creates the file if it doesn’t already exist.
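As a sketch of how such documentation might be recorded from the command line, the lines below append a provenance note to a readme. The data file, the sender and the wording of the note are hypothetical; cksum is a standard tool that prints a checksum and a byte count, which helps detect later changes to a data file.

```shell
# hypothetical demo directory and data file, for illustration only
dir="$(mktemp -d)"
printf 'ACGT\n' > "$dir/example.fastq"

# append (>>) a dated note to the readme, without overwriting earlier notes
{
  echo "## example.fastq"
  echo "- received: $(date +%Y-%m-%d), by email from a collaborator (made-up example)"
  echo "- checksum and size (cksum): $(cksum < "$dir/example.fastq")"
  echo "- shell used: $(bash --version | head -n 1)"
} >> "$dir/readme.md"

cat "$dir/readme.md"
```

Running the same lines again for another file adds a new entry below the first one, so the readme grows into a log of what arrived and when.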
use brace expansion to create a directory structure in one step:
echo dog-{gone,bowl,bark}
mkdir -p zmays-snps/{data/seqs,scripts,analysis}
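Brace expansion combines well with a loop, for instance to give every directory its own readme. A minimal sketch, run in a temporary location so that it is harmless (the variable name proj is just for illustration, and brace expansion assumes a shell like bash or zsh):

```shell
proj="$(mktemp -d)/zmays-snps"      # temporary location for the demo
mkdir -p "$proj"/{data/seqs,scripts,analysis}

# one empty readme.md per directory, including the project root
for d in "$proj" "$proj"/{data,data/seqs,scripts,analysis}; do
  touch "$d/readme.md"
done

find "$proj" -name readme.md | sort
```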
let’s create some empty files to illustrate more uses of wildcards. pretend that they come from 3 corn samples (A, B and C), with 2 files for each sample because the sequencing data are paired-end: read pair R1 or R2.
cd data
touch seqs/zmays{A,B,C}_R{1,2}.fastq
ls seqs/
ls seqs/zmaysB*
ls seqs/zmays[AB]_R1.fastq
ls seqs/zmays[A-C]_R1.fastq
ls seqs/zmaysA_R{1..2}.fastq
ls -lR
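Wildcards are also handy inside loops. The sketch below recreates the same empty files in a temporary directory, then pairs each R1 file with its R2 partner. The echo line stands in for whatever (hypothetical) program would process a pair of files.

```shell
seqs="$(mktemp -d)"                         # temporary stand-in for data/seqs
touch "$seqs"/zmays{A,B,C}_R{1,2}.fastq

npairs=0
for r1 in "$seqs"/zmays[A-C]_R1.fastq; do   # the wildcard matches the 3 R1 files
  r2="${r1%_R1.fastq}_R2.fastq"             # strip the suffix _R1.fastq, append _R2.fastq
  if [ -e "$r2" ]; then
    echo "pair: $(basename "$r1") $(basename "$r2")"
    npairs=$((npairs + 1))
  fi
done
echo "$npairs pairs found"
```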
markdown for project notebooks
light-weight markup format. plain text, extension .md. Many “dialects”, e.g. R markdown (.Rmd).
Original markdown reference.
interlude: advantage of plain text files for reports and data: Excel errors
- easy to read the plain text format
- easy to track changes
- easy to render as pdf, html, etc. GitHub and Dropbox do it automatically.
example readme: click ‘Raw’ to see what the file truly contains (text only, no beautification)
basic syntax:
markdown syntax | result |
---|---|
*emphasis* | emphasis |
**bold** | bold |
`inline code` | inline code |
<http://website.com/link> | http://website.com/link |
[link text](http://website.com/link) | link text |
![text](path/to/image.png) | image with alternative text “text” |
# chapter 1 | level-1 header |
## section 1.1 | level-2 header |
### paragraph 1.1.1 | level-3 header |
level-1 and level-2 headers can also be obtained like this:
chapter 1
=========
section 1.1
----------
numbered or bulleted lists:
* first point, itemized
- second point
    2. indentation is necessary
    2. nested list
    1. numbers can be messed up, see how it's rendered below
which gives this:
- first point, itemized
- second point
    2. indentation is necessary
    3. nested list
    4. only the first number is used, see how it’s rendered below
to get code blocks, indent with 4 spaces (or 8 spaces if within a list):
    this will be a block.
    can be used for quotes as well.
or use 3 backticks, possibly followed by the language name:
```r
foo <- function(x){x+1} # R function "foo", just adds 1
foo(2)
```
which gives this:
foo <- function(x){x+1} # R function "foo", just adds 1
foo(2)
This other code block:
```julia
function foo(x) # Julia function "foo", just adds 1
    x+1
end
foo(2)
```
gives this:
function foo(x) # Julia function "foo", just adds 1
    x+1
end
foo(2)
With ``` instead of ```julia at the beginning, the code block would be rendered without color highlights.
to force a newline, end your line with 2 spaces
example:
try to select this text
with your mouse, to see
where markdown would go
to a newline.
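Trailing spaces are invisible in most editors. One way to see where the forced newlines are is to search for lines that end with 2 spaces. A small sketch using grep, on a made-up file:

```shell
f="$(mktemp)"                                # made-up markdown file
printf 'try to select this text  \nwith your mouse\nto a newline.  \n' > "$f"

# -n prints line numbers; the pattern is 2 spaces followed by the end of the line
grep -n '  $' "$f"

# -c counts the matching lines instead of printing them
nbreaks="$(grep -c '  $' "$f")"
echo "$nbreaks forced newlines"
```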
rendering a markdown file to other formats
Many online viewers will render markdown automatically, like
GitHub, Box, Dropbox, OSF. In VS Code, click on the “preview” icon.
We can also create new files: pdf, html, etc.
cd ~/Documents/private/st679/bds-files/chapter-02-bioinformatics-projects
less notebook.md
pandoc notebook.md
pandoc notebook.md > notebook.html
open notebook.html # xdg-open on Linux
pandoc -o notebook.html notebook.md
pandoc -o notebook.pdf notebook.md
open notebook.pdf
pandoc -o notebook.tex notebook.md
less notebook.tex
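To render several markdown files at once, the same pandoc command can be wrapped in a loop. A sketch, assuming pandoc is installed (otherwise the loop only prints what it would have done); the demo directory and file are made up:

```shell
dir="$(mktemp -d)"                       # made-up demo directory
printf '# notebook\n\nsome *notes*\n' > "$dir/notebook.md"

for md in "$dir"/*.md; do
  html="${md%.md}.html"                  # notebook.md -> notebook.html
  if command -v pandoc > /dev/null; then
    pandoc -o "$html" "$md"
  else
    echo "pandoc not found: would have created $html"
  fi
done
```

Changing the extension in the html variable to .tex or .pdf would produce those formats instead, since pandoc chooses the output format from the extension given to -o.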