previous & next


  • dealing with “big data” in Julia
  • exactly what “big data” is evolves over time
  • current constraints can be
    • many observations on a relatively simple structure
    • complex models fit to moderately large data sets
    • iterative methods with vague stopping rules
      • MCMC (Markov Chain Monte Carlo)
      • many machine-learning approaches
  • GWAS (Genome-Wide Association Studies) data
    • two allele types at n SNP (single-nucleotide polymorphism) sites
    • m individuals
    • Recent arrays allow for n > 106
    • Some studies also have m ≈ 106 or > 1012 obs.
  • 3 possiblities (mm, mM, MM) or missing at each position

  • Often stored as a PLINK binary biallelic genetype table
    • each obs as 2 bits, i.e. 4 obs per byte
    • column-major order - obs. on same SNP are adjacent
    • columns are padded to a full byte
    • two “magic numbers” at the start of the file.
    • even this compacted format can be terabytes in size
  • Initial analysis can be summary - mean, variance, minor-allele frequency

  • Later may want to construct (empirical) “genetic relationship matrix”

  • For small studies can read data as some type of integer and work with those. Won’t work for large studies.

  • May be able to stream the data but don’t want to do many passes in that case.

  • Memory-mapped files provide an alternative
    • Can also allow for parallel processing
    • Substantial advantage in using a read-only data file
using Pkg
Pkg.add(PackageSpec(url="https://github.com/dmbates/BEDFiles.jl", rev="staticslices"))

previous & next