Julia for GWAS and mixed models

Dealing with “big data” in Julia
Exactly what counts as “big data” evolves over time
Current constraints can be:
- many observations on a relatively simple structure
- complex models fit to moderately large data sets
- iterative methods with vague stopping rules
  - MCMC (Markov Chain Monte Carlo)
  - many machine-learning approaches
GWAS (Genome-Wide Association Studies) data
- two allele types at n SNP (single-nucleotide polymorphism) sites
- m individuals
- Recent arrays allow for n > 10⁶
- Some studies also have m ≈ 10⁶, giving > 10¹² obs.
3 possibilities (mm, mM, MM) or missing at each position
Often stored as a PLINK binary biallelic genotype table (.bed file)
- each obs as 2 bits, i.e. 4 obs per byte
- column-major order - obs. on same SNP are adjacent
- columns are padded to a full byte
- two “magic number” bytes, followed by a mode byte, at the start of the file
- even this compacted format can be terabytes in size
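A minimal sketch of the packed layout described above, assuming the PLINK .bed conventions: a 3-byte header, the first individual in the two low-order bits of each byte, and the codes 0x00 and 0x03 for the two homozygotes, 0x02 for the heterozygote and 0x01 for missing. The function names `bytespercol`, `bedsize` and `code` are hypothetical, chosen for illustration, and are not taken from any package.

```julia
# Hypothetical helpers; the names are illustrative only.

# Bytes needed for one SNP column of m individuals at 2 bits each,
# rounded up so that every column starts on a byte boundary.
bytespercol(m::Integer) = (m + 3) >> 2

# Total file size in bytes: 3 header bytes plus n padded columns.
bedsize(m::Integer, n::Integer) = 3 + bytespercol(m) * n

# Extract the 2-bit code for individual i (1-based) from one column's bytes.
function code(col::AbstractVector{UInt8}, i::Integer)
    byte = col[(i + 3) >> 2]          # the byte that holds individual i
    shift = ((i - 1) & 0x03) << 1     # bit offset within that byte: 0, 2, 4 or 6
    (byte >> shift) & 0x03
end

# A 5-individual column: individuals 1 to 4 in the first byte, 5 in the second.
col = UInt8[0b11_10_01_00, 0b00000011]
codes = [code(col, i) for i in 1:5]
```

With m = n = 10⁶, `bedsize` gives about 250 GB; larger arrays or cohorts push this into the terabyte range.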
Initial analysis can be summary statistics: mean, variance, minor-allele frequency
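These summaries can be accumulated in a single pass over one SNP column, which matters when the data are streamed. The sketch below assumes the raw PLINK 2-bit codes (0x01 is missing) and counts copies of the allele coded by 0x03, an arbitrary choice that changes the mean but not the minor-allele frequency. `snpsummary` is a hypothetical name.

```julia
# Hypothetical single-pass summary of one SNP column of raw 2-bit codes.
# Assumed coding: 0x00 -> 0 copies, 0x02 -> 1 copy, 0x03 -> 2 copies, 0x01 -> missing.
function snpsummary(codes::AbstractVector{UInt8})
    n = 0      # number of non-missing observations
    s = 0      # sum of allele counts
    ss = 0     # sum of squared allele counts
    for c in codes
        c == 0x01 && continue              # skip missing values
        a = c == 0x00 ? 0 : Int(c) - 1     # allele count: 0, 1 or 2
        n += 1
        s += a
        ss += a * a
    end
    m = s / n                              # mean allele count
    v = (ss - n * m^2) / (n - 1)           # sample variance
    p = m / 2                              # frequency of the counted allele
    (n = n, mean = m, var = v, maf = min(p, 1 - p))
end

stats = snpsummary(UInt8[0x00, 0x00, 0x02, 0x03, 0x01])
```

A column with fewer than two non-missing values yields `NaN` or `Inf` for the variance; a real implementation would need to handle that case explicitly.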
Later may want to construct (empirical) “genetic relationship matrix”
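One commonly used form of the empirical genetic relationship matrix centers each SNP column at 2p and scales it by √(2p(1 − p)), where p is the allele frequency, then averages the cross-products over SNPs. The sketch below assumes that definition, a small dense matrix of allele counts (individuals in rows, SNPs in columns), no missing values and no monomorphic SNPs. It is for illustration only and would not scale to the study sizes discussed here; `grm` is a hypothetical name.

```julia
# Hypothetical dense GRM for a small m × n matrix A of allele counts (0, 1 or 2).
# Assumes no missing values and 0 < p < 1 for every SNP.
function grm(A::AbstractMatrix{<:Real})
    m, n = size(A)
    p = vec(sum(A, dims=1)) ./ (2m)                        # allele frequency per SNP
    Z = (A .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))      # center and scale each column
    (Z * Z') ./ n                                          # m × m relationship matrix
end

A = [0 2; 1 1; 2 0]    # 3 individuals, 2 SNPs
G = grm(A)
```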
For small studies, can read the data into an integer array and work with that. This won’t work for large studies: at one byte per obs., 10¹² obs. would need about a terabyte of memory.
May be able to stream the data, but then don’t want to make many passes.
Memory-mapped files provide an alternative
- Can also allow for parallel processing
- Substantial advantage in using a read-only data file
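A minimal sketch of memory-mapping with Julia's `Mmap` standard library. The file written here is a tiny stand-in (3 header bytes plus two 2-byte columns of arbitrary content), not real genotype data, and the dimensions are assumed to be known in advance.

```julia
using Mmap

# Write a small stand-in file: 3 header bytes followed by 2 columns of 2 bytes each.
path, io = mktemp()
write(io, UInt8[0x6c, 0x1b, 0x01, 0xe4, 0x03, 0x1b, 0x02])
close(io)

# Map the body read-only as a (bytes per column) × (number of SNPs) matrix,
# skipping the 3-byte header. Pages are read from disk only when touched.
data = open(path, "r") do f
    Mmap.mmap(f, Matrix{UInt8}, (2, 2), 3)
end
```

Because the file is opened read-only, the mapping can be shared safely; columns are independent, so separate threads or processes could each work on a block of columns.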

using Pkg
# install the "staticslices" revision of BEDFiles.jl directly from GitHub
Pkg.add(PackageSpec(url="https://github.com/dmbates/BEDFiles.jl", rev="staticslices"))