

simple patterns and tools

useful string methods for simple things: .strip, .split, .join, .replace, .index, .find, .count, .startswith, .endswith, .upper, .lower

a = "hello world"
print(a.startswith("h"))
print(a.startswith("he"))
print("h" in a)
print("low" in a)
print("lo w" in a)
print("aha".find("a"))
print("hohoho".find("oh"))
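the other methods in the list work in the same spirit. A small sketch on a made-up line of text (the string and the variable names are illustrative, not part of the course data):

```python
# illustrative sketch: the string below is made up for this example
line = "  Homo sapiens,Drosophila melanogaster,Mus musculus \n"

clean = line.strip()                 # drop leading/trailing whitespace, newline included
names = clean.split(",")             # split on commas: a list of 3 strings
joined = ";".join(names)             # glue the pieces back together with another separator
under = names[0].replace(" ", "_")   # replace every space by an underscore

print(names)
print(joined)
print(under)
print("hohoho".count("ho"))          # number of non-overlapping occurrences
print(names[0].upper(), names[0].lower())
print("aha".find("z"))               # -1 when the substring is absent
# "aha".index("z") would raise a ValueError instead of returning -1
```

.find and .index return the same position when the substring is present; they differ only in how they report a missing substring.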

see dir("") for the full list of string methods (dir lists them because "" is a string).

print(type(""))
print(dir(""))

regular expressions

use the re library and its functions re.search, re.findall, re.sub, re.split etc.
recall regular expression syntax

  • r'' to write the regular expression pattern, for “raw” strings: to read a \n as a backslash and an n, not as a newline character.
  • multipliers *, + and ? are greedy by default. Add ? after a multiplier to make it non-greedy: *?, +?, ??
  • info from match objects: .group, .start, .end
    when the pattern is not found, re.search returns None instead of a match object: False when converted to a boolean
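why raw strings matter can be seen in a short sketch. The text "a cat sat" is made up; \b is chosen here because Python and re read it differently (backspace character versus word boundary):

```python
import re

print(len("\n"))    # 1: a single newline character
print(len(r"\n"))   # 2: a backslash followed by an n

text = "a cat sat"  # made-up example text
plain = re.search("\bcat\b", text)   # without r, "\b" is a backspace character
raw = re.search(r"\bcat\b", text)    # with r, re receives \b and reads it as a word boundary

print(plain)        # no backspace in the text, so there is no match
print(raw.group())
```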
import glob
filenames = glob.glob('*.csv')
print(filenames)

import re
mo = re.search(r'i.*n',filenames[0]) # multiplier * is greedy
print(mo)  # match object, stores much info. search: first instance only.
print(mo.group()) # what matched
print(mo.start()) # where match starts: indexing starts at 0
print(mo.end())   # where it ends: index *after*!

mo = re.search(r'i.*?n',filenames[0])
print(mo)
print(mo.group())
print(mo.start())
print(mo.end())
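the output above depends on which csv files are in the working directory. For a version that runs anywhere, here is the same comparison on a fixed string; the file name "inflammation-01.csv" is an assumption made for illustration:

```python
import re

name = "inflammation-01.csv"        # assumed file name, for illustration only

greedy = re.search(r'i.*n', name)   # * takes as much as possible: up to the last n
lazy = re.search(r'i.*?n', name)    # *? takes as little as possible: up to the first n

print(greedy.group(), greedy.start(), greedy.end())  # inflammation 0 12
print(lazy.group(), lazy.start(), lazy.end())        # in 0 2
```

both searches start at the same place, the first i; only the end of the match changes.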

When there is no match, re.search returns None instead of a match object, and None is interpreted as False in a boolean context:

sequences = ["ATCGGGGATCGAAGTTGAG", "ACGGCCAGUGUACN"]
for dna in sequences:
    mo = re.search(r'[^ATCG]', dna)
    if mo:
        print("non-ACGT found in sequence",dna,": found", mo.group())

by the way, compare with the less efficient code below, which runs the search a second time whenever the first one matches:

for dna in sequences:
    if re.search(r'[^ATCG]', dna):
        mo = re.search(r'[^ATCG]', dna)
        print("non-ACGT found in sequence",dna,": found", mo.group())

finding all instances:

print(re.findall(r'i.*n',filenames[0])) # greedy. non-overlapping matches
matches = re.findall(r'i.*?n',filenames[0])  # non-greedy. findall returns a list of strings, not a match object
print(matches)

for f in filenames:
    if not re.search(r'^i', f): # no match: re.search returns None, interpreted as False
        print("file name",f,"does not start with i")
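a self-contained sketch of re.findall, using an assumed file name and the two DNA strings from earlier, so that the results do not depend on local files:

```python
import re

name = "inflammation-01.csv"   # assumed file name, for illustration only

greedy_all = re.findall(r'i.*n', name)    # one long match
lazy_all = re.findall(r'i.*?n', name)     # several short, non-overlapping matches
print(greedy_all)   # ['inflammation']
print(lazy_all)     # ['in', 'ion']

# findall returns a list of strings; the list is empty when nothing matches
odd = re.findall(r'[^ATCG]', "ACGGCCAGUGUACN")
clean = re.findall(r'[^ATCG]', "ATCGGGGATCGAAGTTGAG")
print(odd)          # ['U', 'U', 'N']
print(clean)        # []
```

an empty list is also False in a boolean context, so "if re.findall(...):" works as a test, just like with re.search.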

search and replace: re.sub

  • capture with parentheses in the regular expression
  • captured elements in .group(1), .group(2) etc. in the match object
  • recall captured elements with \1, \2 etc. in a regular expression, to use them in a replacement for example
print(re.sub(r'^(\w)\w+-(\d+)\.csv', r'\1\2.csv', filenames[0])) # preview on the first file name
# apply the same substitution to every file name
filenames = [re.sub(r'^(\w)\w+-(\d+)\.csv', r'\1\2.csv', f) for f in filenames]
print(filenames)
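to see what the substitution does without any csv files at hand, here is a sketch on a made-up list of names (all four names are assumptions for illustration):

```python
import re

# made-up file names, for illustration only
names = ["inflammation-01.csv", "inflammation-12.csv", "small-03.csv", "readme.txt"]

pattern = r'^(\w)\w+-(\d+)\.csv'   # group 1: first letter, group 2: the digits
renamed = [re.sub(pattern, r'\1\2.csv', n) for n in names]
print(renamed)   # ['i01.csv', 'i12.csv', 's03.csv', 'readme.txt']

# the same captured pieces are available from a match object
mo = re.search(pattern, names[0])
print(mo.group(1), mo.group(2))    # i 01
```

a string that does not match the pattern, like "readme.txt" here, is returned unchanged by re.sub.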
taxa = ["Drosophila melanogaster", "Homo sapiens"]
for taxon in taxa:
    mo = re.search(r'^([^\s]+) ([^\s]+)$', taxon)
    if mo:
        genus = mo.group(1)
        species = mo.group(2)
        print("genus=" + genus + ", species=" + species)

print(taxon)
print(mo) # variables defined inside "for" are available outside
print(mo.start(1))
print(mo.start(2))

next: abbreviate genus name to its first letter, and replace space by underscore:

taxa_abbrev = []
for taxon in taxa:
    taxa_abbrev.append(
        re.sub(r'^(\S).* ([^\s]+)$', r'\1_\2', taxon)
    )
print(taxa_abbrev)

split according to a regular expression

  • removes the matched substrings
  • returns a list of strings
coolstring = "Homo sapiens is pretty super"
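print(re.split(r's.p', coolstring))   # ['Homo ', 'iens i', 'retty ', 'er']
print(re.split(r's.*p', coolstring))  # greedy: ['Homo ', 'er']
print(re.split(r's.*?p', coolstring)) # non-greedy: ['Homo ', 'ien', 'retty ', 'er']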
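a more typical use of re.split is to split on a separator that varies, which the plain string method .split cannot do. A sketch on a made-up string where the separator is a comma or a semicolon with optional spaces around it:

```python
import re

# made-up string with inconsistent separators, for illustration only
messy = "Homo sapiens; Drosophila melanogaster ,Mus musculus"

taxa_list = re.split(r'\s*[,;]\s*', messy)   # separator: , or ; plus any surrounding whitespace
print(taxa_list)   # ['Homo sapiens', 'Drosophila melanogaster', 'Mus musculus']

# for comparison: .split takes one fixed separator only
print(messy.split(";"))   # ['Homo sapiens', ' Drosophila melanogaster ,Mus musculus']
```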
