Julia for GWAS and mixed models
- dealing with “big data” in Julia
- exactly what “big data” is evolves over time
- current constraints can be
  - many observations on a relatively simple structure
  - complex models fit to moderately large data sets
  - iterative methods with vague stopping rules
    - MCMC (Markov Chain Monte Carlo)
    - many machine-learning approaches
- GWAS (Genome-Wide Association Studies) data
  - two allele types at SNP (single-nucleotide polymorphism) sites on m individuals
  - Recent arrays allow for > 10⁶ SNP sites
  - Some studies also have m ≈ 10⁶, i.e. > 10¹² obs.
  - 3 possibilities (mm, mM, MM) or missing at each position
  - Often stored as a PLINK binary biallelic genotype table
    - each obs as 2 bits, i.e. 4 obs per byte
    - column-major order - obs. on same SNP are adjacent
    - columns are padded to a full byte
    - two “magic numbers” at the start of the file.
    - even this compacted format can be terabytes in size
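The storage layout above can be made concrete with a small decoding sketch. It assumes the byte vector holds only the genotype data that follows the file header, that the lowest-order bit pair of each byte holds the first of its four observations (the PLINK convention), and that indices are 1-based. The names `bytespercol` and `genotypecode` are hypothetical helpers for illustration, not part of BEDFiles.jl.

```julia
# Number of bytes in one padded column: 4 observations per byte, rounded up.
bytespercol(m::Integer) = (m + 3) >> 2

# Return the 2-bit code (0x00 to 0x03) for individual i at SNP j.
# `data` is assumed to contain only the bytes after the file header.
function genotypecode(data::AbstractVector{UInt8}, m::Integer, i::Integer, j::Integer)
    k = i - 1                                    # zero-based position within the column
    byte = data[(j - 1) * bytespercol(m) + (k >> 2) + 1]
    return (byte >> (2 * (k & 3))) & 0x03        # select the bit pair for this observation
end

# Toy data: m = 5 individuals and 2 SNPs, so each column occupies 2 bytes.
toy = UInt8[0b11100100, 0b00000001, 0b00011011, 0b00000010]
firstsnp = [genotypecode(toy, 5, i, 1) for i in 1:5]
```

With 5 individuals the second byte of each column carries only one observation; the remaining 6 bits are the padding mentioned above.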
Initial analysis can be summary statistics: mean, variance, minor-allele frequency.
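As an illustration of such a summary, here is a minimal sketch of a minor-allele frequency computed from the 2-bit codes of one SNP. It assumes the PLINK convention that codes 0x00, 0x02 and 0x03 stand for 0, 1 and 2 copies of the second allele and that 0x01 marks a missing value. The function names `codecounts` and `minorallelefreq` are hypothetical.

```julia
# Tabulate how often each of the four 2-bit codes occurs in one SNP column.
function codecounts(codes)
    counts = zeros(Int, 4)
    for c in codes
        counts[c + 1] += 1          # codes 0x00..0x03 map to slots 1..4
    end
    return counts
end

# Minor-allele frequency over the non-missing observations.
# Returns NaN when every observation is missing.
function minorallelefreq(codes)
    n00, nmiss, n10, n11 = codecounts(codes)
    nobs = n00 + n10 + n11                    # missing values (code 0x01) are excluded
    p = (n10 + 2 * n11) / (2 * nobs)          # frequency of the second allele
    return min(p, 1 - p)
end

maf = minorallelefreq(UInt8[0x00, 0x00, 0x02, 0x03, 0x01])
```

Because only four counts per column are needed, the data never have to be expanded to a wider integer or floating-point type for this step.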
Later we may want to construct an (empirical) “genetic relationship matrix”.
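Which estimator is intended is not stated here. One common choice, shown purely as a sketch, centers each SNP column at twice its allele frequency, scales it by the binomial standard deviation, and averages the cross-products over SNPs. The code assumes a dense matrix of allele counts (0, 1 or 2) with individuals as rows, no missing values, and no monomorphic SNPs; the name `grm` is hypothetical.

```julia
using LinearAlgebra

# Empirical genetic relationship matrix from an m × n matrix of allele counts.
# Assumes no missing values and that every SNP has 0 < p < 1.
function grm(X::AbstractMatrix{<:Real})
    m, n = size(X)
    p = vec(sum(X, dims=1)) ./ (2 * m)                      # allele frequency per SNP
    Z = (X .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))       # center and scale each column
    return Symmetric(Z * Z' ./ n)                           # m × m relationship matrix
end

G = grm([0 2; 1 1; 2 0])
```

For large studies the dense matrix `Z` would not fit in memory, which is part of the motivation for working on the compact representation directly.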
For small studies we can read the data as some type of integer and work with those values. That won’t work for large studies.
It may be possible to stream the data, but in that case we don’t want to make many passes.
- Memory-mapped files provide an alternative
- Can also allow for parallel processing
- Substantial advantage in using a read-only data file
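A minimal sketch of the memory-mapping idea, using only the `Mmap` standard library. It assumes a PLINK `.bed` file in SNP-major orientation with a 3-byte header (the two magic-number bytes 0x6c and 0x1b followed by a mode byte 0x01) and a known number of individuals `m`. The function name `mmapbed` is hypothetical and is not the BEDFiles.jl API.

```julia
using Mmap

# Map the genotype bytes of a .bed file as a read-only matrix with one column per SNP.
function mmapbed(path::AbstractString, m::Integer)
    bytespercol = (m + 3) >> 2                         # 4 observations per byte, padded
    n, r = divrem(filesize(path) - 3, bytespercol)     # 3 header bytes precede the data
    iszero(r) || throw(ArgumentError("file size is inconsistent with m = $m"))
    return open(path, "r") do io
        (read(io, UInt8) == 0x6c && read(io, UInt8) == 0x1b) ||
            throw(ArgumentError("wrong magic number in $path"))
        read(io, UInt8) == 0x01 ||
            throw(ArgumentError("$path is not in SNP-major orientation"))
        # The mapping starts at the current stream position, i.e. just after the header.
        Mmap.mmap(io, Matrix{UInt8}, (bytespercol, n))
    end
end

# Toy file: header plus 2 SNPs on 5 individuals (2 bytes per column).
path, io = mktemp()
write(io, UInt8[0x6c, 0x1b, 0x01, 0xe4, 0x01, 0x1b, 0x02])
close(io)
A = mmapbed(path, 5)
```

Since the file is opened read-only, nothing is ever written back, so separate columns can be processed by separate threads or processes without coordination, and the operating system pages data in only as it is touched.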
using Pkg
Pkg.add(PackageSpec(url="https://github.com/dmbates/BEDFiles.jl", rev="staticslices"))