more shell tools, sed
- earlier summary of useful commands
- commands we don’t tend to use everyday
(but perhaps we should):
cut
,sort
,column
,basename
,dirname
,tree
- stream editor: sed
useful shell commands
cut
if not done already, clone these data from the “Bioinformatics Data Skills” book by Vince Buffulo (outside any existing git repo!).
cd bds-files/chapter-07-unix-data-tools
ls -lh # 1.6M = size of Mus_musculus.GRCm38.75_chr1.bed
head Mus_musculus.GRCm38.75_chr1.bed # chromosome number, start & end position
cut -f 1 Mus_musculus.GRCm38.75_chr1.bed | sort | uniq # check chromosome 1 only
cut -f 2 Mus_musculus.GRCm38.75_chr1.bed | head -n 3
Each row is for a genomic “feature”,
same as in the associate .gtf
file.
We can check the .bed
and .gtf
files have the same number of lines
except for lines of metadata (how?)
quick look at .gtf
file:
less -S Mus_musculus.GRCm38.75_chr1.gtf
other ways to use options for cut
:
-f2
,-f 1,3
,-f1-3
-c2
to cut (extract) the second character, not the second field (column)-d
to change the delimiter between columns fields instead of tab:-d ' '
for a space,-d,
-d ,
or-d ','
for a comma (csv files).
example from research: data used in
Baum et al. 2016
originally from Perelman et al. 2011.
54 genes in 178 primate species, from 186 individuals
(different subspecies).
to get the example file: clone lecture-examples (and ‘git pull’ to get updates later in the semester)
cd ../../classroom-repos/lecture-examples/
less -S combined.nex
wc combined.nex
data cleaning: needed to get the list of individuals and to identify species with multiple subspecies; we needed to remove ‘duplicates’ to keep a single representative per species. step I used to get the current list of “taxa” in my cleaned file:
grep '_' combined.nex | cut -f 1 -d ' ' > taxa
wc taxa
head taxa
sort
-k
to sort by specific columns (keys). example below:
-k1,1
to sort by keys in columns 1 to 1,
then -k2,2
to resolve ties by to sorting columns 2 to 2, and
n
to sort that 2nd column numerically,
r
to sort that 2nd column it in reverse order.
-c
to check if the file is sorted already (fast)
-t,
or -t ","
to change the separator to a comma instead of tab (default)
cd ../../bds-files/chapter-07-unix-data-tools
sort -k1,1 -k2,2n example.bed
sort -k1,1 -k2,2nr example.bed
sort -k1,1 -k2,2n -r example.bed
sort -k1,1 -k2,2n -c example.bed
sort -t, -nc Mus_musculus.GRCm38.75_chr1_bed.csv # assumes GNU sort
sort -t, -n Mus_musculus.GRCm38.75_chr1_bed.csv | head # GNU sort
sort -t, -nr Mus_musculus.GRCm38.75_chr1_bed.csv | head
exercise: extract & count all the features (e.g. “exon”) for gene ENSMUSG00000033793
from file Mus_musculus.GRCm38.75_chr1.gtf
.
[1 line = 1 feature, the feature name is in the 3rd column]
exercise: extract all the combinations of distinct feature names (like “gene” or “exon”) and strands (+ or -), with their counts (sorted). [the strand in the 7th column]
column
column
formats tabular data to visualize in the terminal
cd ../../bds-files/chapter-07-unix-data-tools
head Mus_musculus.GRCm38.75_chr1.gtf # format for genomic features. 1 line = 1 feature, e.g. 1 gene. start & end positions are 1-based
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f1-8 | head
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | cut -f 1-8 | column -t | head
column -s"," -t Mus_musculus.GRCm38.75_chr1_bed.csv | head
-s","
sets the separator to a comma instead of a tab (default)
basename & dirname
basename
and dirname
extract the file/folder name and its path
from a string (the file/folder need not exist).
-s suffix
: to removed known suffix (like an extension)
pwd
basename $(pwd)
dirname $(pwd)
basename "relative/path/to/myfile.txt"
dirname "relative/path/to/myfile.txt"
basename "/absolute/path/to/myfolder"
dirname "/absolute/path/to/myfolder"
dirname "myfile.txt" # current directory: .
basename -s "txt" "relative/path/to/myfile.txt"
basename -s "txt" "relative/path/to/myfile.txt"
basename -s ".txt" "relative/path/to/myfile.txt"
basename -s "le.txt" "relative/path/to/myfile.txt"
tree
freebie: tree directory_name
shows the content of a folder
in a tree-like format (I do love trees).
to get tree
on macOS: brew install tree
stream editor: sed
edits the file without having to load it all in memory! most important use: to substitute things
sed s/pattern/replacement/ filename > newfile # do NOT redirect to input file!
sed -i s/pattern/replacement/ filename # for in-place replacement
s///
to replace first occurrence of a match,
s///g
to replace globally (all instances),
s///i
and s///gi
for case-insensitive search
option -E
for Extended (not basic) regular expressions
-n
option to not print every line
p flag to print: s///p
print if there is a match
warning: unlike grep
, sed
does not recognize “enhanced” (Perl-like)
expressions like \d
(digit), \s
(space) or \w
(word character).
use classes instead: like [0-9]
or [a-zA-Z_]
.
cat chroms.txt # "chrom1" in first column
sed 's/chrom/chr/' chroms.txt
we can capture and re-use a match: with ()
to capture a pattern
and \i
to print the ith match.
example: transform a file with lines of the form “chromosomename:startposition-endposition” to a tabular table “chromosomename startposition endposition” (3 columns separated with tabs)
echo "chr12:74-431" | gsed -E 's/^(chr[^:]+):([0-9]+)-([0-9]+)/\1\t\2\t\3/'
(We have to use GNU sed, gsed
on my machine, to have \t
be understood as a tab,
see grep section to install it.)
Other ways to do the same thing below:
echo "chr12:74-431" | gsed 's/[:-]/\t/g' # why g flag?
echo "chr12:74-431" | gsed 's/:/\t/' | gsed 's/-/\t/'
echo "chr12:74-431" | gsed -e 's/:/\t/' -e 's/-/\t/'
echo "chr12:74-431" | tr ':-' '\t' # tr = translate
The first way is more specific, so probably more robust. Always check for weird errors.
exercise: extract the unique transcript names in the data file
Mus_musculus.GRCm38.75_chr1.gtf
: string after “transcript_id”.
less Mus_musculus.GRCm38.75_chr1.gtf # type /transcript_id to search and highlight instances
option 1: search for (and output) “transcript_id” and the following string,
extract this string, remove the quotes, remove duplicates.
recall uniq
removes consecutive duplicates.
option 2: search for lines not starting by #, then replace ‘transcript_id “anything but double quote, captured”’ by what was captured (inside the double quotes).
grep -E -o 'transcript_id "\w+"' Mus_musculus.GRCm38.75_chr1.gtf |
cut -f2 -d" " | sed 's/"//g' | sort | uniq | head # or redirect: > newFileName.txt
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf |
sed -E 's/.*transcript_id "([^"]+)".*/\1/' | head # wait: doesn't work: extract lines!
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf |
sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p' | sort | uniq | head
sed
can do much more. just one example, to print lines 20 to 23:
sed -n '20,23p' Mus_musculus.GRCm38.75_chr1.gtf | cut -f1-5 | column -t
greedy matching: problem if we try to capture between quotes like this
"(.+)"
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf |
sed -E -n 's/.*transcript_id "(.+)".*/\1/p' | head -n 1
grep "^[^#].*transcript_id" Mus_musculus.GRCm38.75_chr1.gtf | head -n 1
echo 'before transcript_id "E0160944"; gene_name "Gm16088" after' > greedy_example.txt # simpler example
cat greedy_example.txt
sed -E 's/.*transcript_id "(.+)".*/\1/' greedy_example.txt
sed -E 's/.*transcript_id "([^"]+)".*/\1/' greedy_example.txt
sed -E 's/transcript_id ".*([^"]+)".*/\1/' greedy_example.txt # what happened here? explain