theory.toys – Genome-Wide Association Study

About GWAS

Genome-Wide Association Studies (GWAS) test hundreds of thousands to millions of genetic variants, typically single-nucleotide polymorphisms (SNPs), for association with a phenotype. GWAS have identified thousands of loci influencing complex traits such as height, disease risk, and metabolic measures, reshaping our understanding of genetic architecture.

Manhattan Plot

The Manhattan plot displays association test results across the genome:

X-axis: Genomic position (organized by chromosome)
Y-axis: −log₁₀(p-value), where higher values = stronger evidence of association
Genome-wide significance: p < 5×10⁻⁸ (α = 0.05 Bonferroni-corrected for ~1 million independent tests), drawn as a horizontal line at −log₁₀(p) ≈ 7.3
Peaks: Regions harboring trait-associated variants
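
The axis mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the function name `manhattan_coordinates`, the toy chromosome lengths, and the toy p-values are all made up for the example.

```python
import math

def manhattan_coordinates(results, chrom_lengths):
    """Map (chromosome, position, p-value) records to plot coordinates.

    results: iterable of (chrom, pos_bp, p_value) tuples.
    chrom_lengths: dict of chrom -> length in bp, in plotting order.
    Returns a list of (x, y) pairs with x = cumulative genomic position
    and y = -log10(p).
    """
    # The offset of each chromosome is the total length of those before it.
    offsets, running = {}, 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = running
        running += length
    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in results]


# Height of the genome-wide significance line, -log10(5e-8), about 7.3.
GENOME_WIDE = -math.log10(5e-8)

# Toy input (hypothetical): two short chromosomes and three results.
lengths = {"1": 1000, "2": 800}
hits = [("1", 250, 0.01), ("2", 100, 1e-9), ("2", 700, 0.5)]
points = manhattan_coordinates(hits, lengths)
significant = [(x, y) for x, y in points if y > GENOME_WIDE]
```

Passing `points` to any scatter-plot routine, colored by chromosome, gives the familiar skyline; `significant` holds the points that clear the threshold line.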

Statistical Power

The probability of detecting an association depends on:

Effect size (β): Change in phenotype per allele copy (larger = easier to detect)
Minor allele frequency (MAF): Rarer variants need larger samples; for a fixed β, power is highest at MAF = 0.5, where genotype variance is largest
Sample size (N): More individuals = greater power
Heritability (h²): Proportion of phenotypic variance explained by genetic variation
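
One way to see how these quantities combine is the approximate non-centrality parameter of the single-SNP test. The expression below is a sketch under assumptions that are not stated on this page: Hardy-Weinberg genotype frequencies, an additive model, and a phenotype with variance σ_y². The symbols p (the MAF), q² (the share of phenotypic variance explained by the SNP), and λ (the non-centrality parameter) are introduced here only for illustration.

```latex
q^2 = \frac{2p(1-p)\,\beta^2}{\sigma_y^2},
\qquad
\lambda = \frac{N\,q^2}{1 - q^2} \;\approx\; N\,q^2 \quad (q^2 \ll 1)
```

Power rises with λ, so it grows with N, with β², and with the genotype variance 2p(1−p), which peaks at p = 0.5. Under an additive model with independent causal variants, the per-variant q² values add up to roughly h², which is why high heritability spread over many variants can still leave each individual variant hard to detect.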

Test Statistic

For each SNP, we test association using linear regression:

Model: y = μ + βx + ε, where x is genotype (0, 1, or 2 copies of minor allele)
Null hypothesis: β = 0 (no association)
Test statistic: t = β̂ / SE(β̂), which follows a t-distribution with N − 2 degrees of freedom under H₀
P-value: Probability, under H₀, of a test statistic at least as extreme as the one observed
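
The per-SNP test can be written out directly from the model. The sketch below uses only the Python standard library and makes one simplifying assumption: the two-sided p-value comes from a normal approximation to the t-distribution, which is reasonable only when N is large. The function name `snp_association` and the toy data are hypothetical.

```python
import math

def snp_association(genotypes, phenotypes):
    """Simple linear regression of phenotype on genotype dosage (0/1/2).

    Returns (beta_hat, se, t, p). The p-value uses a normal approximation
    to the t-distribution with N - 2 degrees of freedom (an assumption
    made here to avoid third-party dependencies).
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotypes, phenotypes))

    beta_hat = sxy / sxx                     # least-squares slope
    mu_hat = mean_y - beta_hat * mean_x      # intercept
    rss = sum((y - mu_hat - beta_hat * x) ** 2
              for x, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)      # standard error of the slope
    t = beta_hat / se
    p = math.erfc(abs(t) / math.sqrt(2))     # two-sided normal tail area
    return beta_hat, se, t, p


# Toy data (made up): six individuals, phenotype rising with allele count.
x = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 1.4, 1.8, 2.1, 2.3]
beta_hat, se, t, p = snp_association(x, y)
```

The function divides by the genotype sum of squares, so a monomorphic SNP (all genotypes equal) would need to be filtered out before calling it. For small samples, an exact t-distribution tail from a statistics library would be the better choice.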

Multiple Testing Correction

Testing millions of SNPs requires strict significance thresholds:

Bonferroni correction: α / M, where M = number of tests
Genome-wide threshold: 0.05 / 1,000,000 = 5×10⁻⁸
Suggestive threshold: 1×10⁻⁵ (a looser conventional cutoff; about one expected false positive per 100,000 independent tests)
Why so strict?: With millions of tests, a lenient threshold would let through many chance associations
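
The threshold arithmetic is simple enough to check by hand, but a short sketch makes the relationship between the cutoff and the expected number of chance hits explicit. The helper names are hypothetical, and the calculation assumes every test is independent and null.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(p_threshold, n_tests):
    """Expected number of null tests passing the cutoff (independent tests)."""
    return p_threshold * n_tests


M = 1_000_000                                   # independent tests, as above
genome_wide = bonferroni_threshold(0.05, M)     # 0.05 / 1,000,000
uncorrected = expected_false_positives(0.05, M) # chance hits at p < 0.05
corrected = expected_false_positives(genome_wide, M)
```

Without correction, tens of thousands of SNPs would pass p < 0.05 by chance alone; at the genome-wide threshold the expected number of chance hits across the whole scan drops back to 0.05.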

Effect Size vs. Sample Size

GWAS power calculations suggest rough sample sizes (exact values depend on MAF, the units of β, and the target power):

Large effects (β ≥ 0.5): Detectable with N ~ 1,000–5,000
Moderate effects (β ~ 0.2): Need N ~ 10,000–50,000
Small effects (β < 0.1): Require N > 100,000 (mega-GWAS)
Common vs. rare: Rare variants (MAF < 0.05) need much larger samples
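
A power calculation of this kind can be approximated with the standard library alone. The sketch below is not the calculation used by this page: it assumes Hardy-Weinberg genotype frequencies, an additive model, a phenotype scaled to unit variance, and a normal approximation to the test statistic. The function names and default arguments are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gwas_power(beta, maf, n, alpha=5e-8, pheno_var=1.0):
    """Approximate power of a two-sided single-SNP test.

    Assumes Hardy-Weinberg proportions, additive effects, and a normal
    approximation; these are assumptions of this sketch.
    """
    q2 = 2 * maf * (1 - maf) * beta ** 2 / pheno_var  # variance explained
    ncp = n * q2 / (1 - q2)                           # non-centrality
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2)          # two-sided critical value
    shift = math.sqrt(ncp)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def min_sample_size(beta, maf, target_power=0.8, alpha=5e-8):
    """Smallest N reaching the target power (doubling, then bisection)."""
    hi = 2
    while gwas_power(beta, maf, hi, alpha) < target_power:
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if gwas_power(beta, maf, mid, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi


# Compare a common and a rare variant with the same per-allele effect.
n_common = min_sample_size(beta=0.2, maf=0.30)
n_rare = min_sample_size(beta=0.2, maf=0.01)
```

Because the result depends on MAF, phenotype scale, and the chosen target power, the output need not match the rounded ranges listed above. The qualitative pattern is the same: halving β roughly quadruples the required N, and rare variants need far more samples than common ones.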

Linkage Disequilibrium

In this simulation, SNPs are independent. In real GWAS:

LD structure: Nearby SNPs are correlated due to shared ancestry
Tag SNPs: Genotyped SNPs capture variation at untyped causal variants
Peak width: Association signal spreads across LD blocks (~100 kb–1 Mb)
Fine-mapping: Requires additional methods to pinpoint causal variant
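
The strength of LD between two SNPs is commonly summarized as r². As a rough sketch, it can be estimated as the squared Pearson correlation of genotype dosages, which is an approximation chosen here for simplicity; haplotype-based estimators also exist. The function name `ld_r2` and the genotype vectors are made up for illustration.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    return cov * cov / (var_a * var_b)


# Toy genotypes (hypothetical) for eight individuals.
causal = [0, 1, 2, 1, 0, 2, 1, 0]
tag = [0, 1, 2, 1, 0, 2, 0, 0]   # differs from causal in one individual
far = [1, 0, 1, 2, 0, 1, 0, 2]   # an unlinked SNP elsewhere in the genome

r2_tag = ld_r2(causal, tag)
r2_far = ld_r2(causal, far)
```

A tag SNP in high r² with the causal variant carries most of its association signal, which is why a single causal variant produces a peak of correlated SNPs rather than one isolated point. The signal at a tag SNP is attenuated roughly in proportion to r².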

Historical Context

GWAS emerged in the mid-2000s with the advent of SNP microarrays enabling affordable genome-wide genotyping. The Wellcome Trust Case Control Consortium (2007) demonstrated GWAS feasibility with the first large-scale studies of common diseases. Since then, GWAS meta-analyses have grown to hundreds of thousands of participants, identifying thousands of trait-associated loci and enabling polygenic risk scores for disease prediction.

Related Methods

QTL mapping: Linkage-based approach in experimental crosses (see QTL Scan toy)
Fine-mapping: Bayesian methods to identify causal variants within LD regions
PRS (Polygenic Risk Scores): Combine many small effects for prediction
Mendelian Randomization: Use genetic variants as causal instruments
