python dictionaries & list comprehension; running external programs
- dictionaries
- more on list comprehension
- run external program from within python
- about characters
dictionaries (hashes)
dictionaries are similar to lists, except that items not unordered.
items: key-value pairs, where keys must be strings or numbers
(or of immutable types)
like word-definition pairs in a dictionary.
examples:
d = {"blue":10, "green":20}
d # may be printed in different order
d = {}
d["blue"]=10 # it's weird that we use [] here and not {}
d["green"]=20
d
# access the value of a key, or simply check if a key exists
"blue" in d
d.get("blue")
d["blue"]
"red" in d
d["red"] # oops
d.get("red")
d.get("red") == None
d.get("red", 5)
# remove things
del d["green"] # deletes element (binding)
d
a = d.pop("blue") # deletes, but returns value
d
example:
- count the number of occurrences of k-mers (words of length k) in a string
- do not create entries for a k-mer if it does not occur
- here: use k=2
dna = "TCAATAGGTGGTCGTTGTTT"
k2mer = {} # variable names cannot start with a digit
bases = ["A","C","G","T"]
for nuc1 in bases:
for nuc2 in bases:
mycount = dna.count(nuc1+nuc2)
if mycount:
k2mer[nuc1+nuc2] = mycount
trimer_count = {} # variable names cannot start with a digit
bases = ["A","C","G","T"]
for i in range(0,(len(dna)-2)):
trimer = dna[i:(i+3)]
if trimer in trimer_count:
trimer_count[trimer] += 1
else:
trimer_count[trimer] = 1
trimer_count
trimer_count["TTT"]
trimer_count["GCC"] # key error
trimer_count.get("GCC", 0) # default to 0 if key not found
for trinucleotide in trimer_count:
print("count for " + trinucleotide + ":", trimer_count[trinucleotide])
for k,v in trimer_count.items(): # k,v for key,value. (k,v) is a tuple
print("count for " + k + ":", v)
for tn in sorted(trimer_count.keys()): # k,v for key,value
print("count for " + tn + ":", trimer_count[tn])
for tn in trimer_count:
if trimer_count[tn] == 2:
print(tn, "appears twice")
dictionary sometimes called “hash”: why?
efficient implementations map keys to unique integers using a hash function,
then can use an array with these integers.
consequence: key advantage of dictionaries over lists:
“immediate” lookup of a value, independent of the size of the dictionary.
sets
like dictionaries, but keys only, no values
s1 = {"blue","green"}
for color in s1:
print(color)
s1.add("green") # was already there: no change
s1
s1.add("orange")
s1
"red" in s1
"blue" in s1
s1.remove("green")
a = s1.pop()
s
s2 = s1.union({"green","blue"})
s3 = {"green","purple"}
s2.intersection(s3)
s2.difference(s3) # s2 \ s3
set(trimer_count.keys())
more on list comprehension
concise notation, usually easy to read (for a human)
[xxx for y in z]
[xxx for y in z if uuu]
where z
is a list, dictionary, “range” or other iterable;
xxx
and uuu
typically depend on y
.
[k for k in trimer_count] # list of keys, here list of trimers present in the sequence
[k for k,v in trimer_count.items() if v>1] # trimers appearing 2+ times
paramvalues = [10 ** i for i in range(-3,2)] # from "range" object
[v**2 for v in paramvalues]
[v**2 for v in paramvalues if v >= 0.1] # from list, with condition
d = {'xtolrel':0.01, 'xtolabs':0.001, 'Nfail':50} # d for dictionary
d # note possibly different order
[d[k]*2 for k in d] # k for key
[d[k]*2 for k in d if k.startswith("xtol")] # with condition
[[k, v*2] for k,v in d.items()] # k for key, v for value. returns a list
# extract a subset of values
x = [18, 1, 54, 0, 2, 72]
wanted = [True, True, False, False, True, False]
[x[i] for i in range(len(x)) if wanted[i]]
works to creates dictionaries or sets too:
dict( [[k, v*2] for k,v in d.items()] ) # wasteful
{ k:v*2 for k,v in d.items() } # same result
{ k for k,v in d.items() } # set: like a dict but keys only, no values
{ k for k in d } # same as above
set(d.keys()) # same result
{ k for k in d if re.search(r'tol',k)}
nested for loops in list comprehension:
paramvalues = [a * (10 ** i) for i in range(-3,2) for a in [1,2]]
running external programs
module subprocess
: more portable than os.system()
.
import subprocess
subprocess.call("date -u", shell=True) # return exist status: 0 if good
to capture the output within python as a string:
subprocess.check_output("date")
res = subprocess.check_output(["date", "+%B"])
res
type(res)
res.decode("utf-8")
res = subprocess.check_output("ps -u ane | grep python", shell=True)
res
print(res.decode("utf-8"))
better, e.g. to capture standard output and standard error separately:
print(subprocess.run("/bin/date")) # does not capture output
res = subprocess.run("/bin/date +%B", shell=True, stdout=subprocess.PIPE)
res = subprocess.run(["/bin/date", "+%B"], stdout=subprocess.PIPE)
print("return code: ", res.returncode,
"\nstdout: ", res.stdout,
"\nstderr:", res.stderr)
res = subprocess.run(["ls", "wrong/path"], stdout=subprocess.PIPE) # stderr to screen
res = subprocess.run(["ls", "wrong/path"], capture_output=True)) # stderr captured too
about characters
Unicode: code that maps more than 120,000 characters to integers, which then need to be coded with some number of 0/1 bits.
unicode | character | binary | hexadecimal | utf-8 | ||
---|---|---|---|---|---|---|
0 | null ~ string terminator | 0 | 0 | 00000000 | ||
4 | end of file | 100 | 4 | 00000100 | ||
9 | tab \t | 1001 | 9 | 00001001 | ||
10 | newline \n | 1010 | a | 00001010 | ||
48 | 0 | 110000 | 30 | 00110000 | ||
65 | A | 1000001 | 41 | 01000001 |
The first 2⁷ = 128 unicode characters are the ASCII characters: need at least 7 bits
UTF-8 encoding uses
1 byte (8 bits) for ASCII characters,
but up to 4 bytes (32 bits) in general.
byte strings: each character coded with 1 byte only (8 bits), i.e.
as integer in 0-255. ord("0")
gives integer 48, chr(48)
gives
character ‘0’, ord("\t")
gives 9.
Try bstr = bytes(b"01239abc\n")
then bstr[8]
.
ord("\0") # null character
ord("\n")
ord("\r")
ord("é") # unicode character number 233
ord("🙂") # 128578
len(bytes("🙂", encoding='utf-8')) # 4 bytes
a = bytes("é", encoding='utf-8')
len(a) # coded with 2 bytes using UTF8 encoding
a # b'\xc3\xa9'
a[0] # 195 = 12*16 + 3 > 128: not ASCII
a[1] # 169 = 10*16 + 9
bin(a[0]) # 0b11000011
bin(a[1]) # 0b10101001
# so UFT-8 code for é is: 1100001110101001
hexadecimal/hex: “numbers” in base 16=24: to represent a “nibble”
(4 bits, half a byte). 0-9, a-f
(recall git commit SHAs, e.g “bec2817eb21e17c49a355878de577a91b9c6c5b6”).
Try: 0x0
, 0x9
, 0xb
, 0xf
, 0x0f
, 0x2b
(2*16+11), 0xff
(15*16+15).
To write in base 2, use 0b
instead of 0x
, like 0b101011
(32+8+2+1).
To get the binary or hexadecimal representation: bin(43)
or hex(43)
.