introduction to the Unix shell
GUI (graphical user interface): easy but not reproducible.
CLI (command line interface) or REPL (read-evaluate-print loop):
steep learning curve but reproducible and powerful.
The shell is an incredibly powerful tool:
- Gary Bernhardt: Unix is like a chainsaw. Can kill trees, and people. The Unix shell can do great things, but power comes with danger: it’s unsafe!
- example:
$ rm -rf tmp-data/aligned-reads*
deletes all large alignment files aligned-reads_1
to aligned-reads_1000
in old temporary directory tmp-data
.
rm
is to remove files & directories
-r
will do it recursively (enter each directory within each directory)
-f
will “force” removal without asking you confirmation for each individual file
*
in the shell will match anything
with unintended extra space:
$ rm -rf tmp-data/aligned-reads *
rm: tmp-data/aligned-reads: No such file or directory
deletes your entire current directory (ouch!)
Later in the semester, watch the shell script example from Gary Bernhardt’s talk at 16:21-18:40 : you should understand exactly how it does what it does.
modularity
This is the Unix philosophy: Write programs that do one thing and do it well.
Write programs to work together. Write programs to handle text streams,
because that is a universal interface.
—- Doug McIlory
- modularity
- pipes: STDIN –> myprogram –> STDOUT
Advantages to modularity:
- easier to spot errors, and fix them
- experiments with alternative choices at one step in the pipeline
- choose appropriate tool for each step, e.g. C++ -> Python -> R
- modules possible reusable for other tasks later on
text streams
to process a stream of data rather than holding it all in memory.
Example: concatenate two data files. Open both in editor, copy one and paste into the other?
- may not have enough memory
- manual operation: error-prone and not reproducible.
Instead: print the files’s content to standard output stream and redirect this stream from our terminal to the file we wish to save the combined results to.
cd bds-files/chapter-03-remedial-unix/
cat tb1-protein.fasta
cat tga1-protein.fasta
cat tb1-protein.fasta tga1-protein.fasta
cat tb1-protein.fasta tga1-protein.fasta > zea-proteins.fasta
ls -lrt
-l
: in list format
-r
: in reverse order
-t
: ordered by time
Streams process data without storing huge amounts in our computers’ memory: very efficient