July 10, 2025

DNA Methylation (5mC)

  • chemical modification to an individual DNA base (Cytosine)
  • doesn’t change coding sequence

image source: https://www.epigentek.com

DNA Methylation at CpGs

The 5th base?

Preview: role in gene regulation

Evolution of methylation assays

Analysis of methylation arrays

  • Methylation microarray (e.g. Infinum 450K, Epic 850K) analysis has similarities to expression microarray analysis
    • Signal is continuous
    • 1-16 samples per array
  • Beta values: \[\beta = \frac{\text{methylated intensity}}{(\text{meth} + \text{unmeth intensity})}\]
  • Analysis: regression on \(logit(\beta)\)
  • Packages: minfi, bumphunter

image: https://en.wikipedia.org/wiki/Illumina_Methylation_Assay

Analysis of MeDIPseq

Analysis of bisulfite sequencing

  • Bisulfite sequencing (e.g. WGBS, RRBS)
    • basepair resolution
    • binomial count of methylated/unmethylated reads
  • Packages:
    • bsseq: smoothed t-test
    • dss: approximate beta-binomial regression
    • dmrseq: approximate beta-binomial regression over regions

drawingdrawing

Bisulfite sequencing

Bisulfite sequencing

Preprocessing

Differential methylation

  • differentially methylated cytosine (DMC) or locus (DML):
    • array: test each probe
    • bisulfite sequencing: test each cytosine
  • differentially methylated region (DMR):
    • array: test groups of neighboring probes
    • bisulfite sequencing: test groups of cytosines

DMC test of bisulfite read counts

  • For cytosine \(i\) and sample \(j\) in condition \(s\), we have:
  • \(M_{ij}\) reads corresponding to methylation
  • \(N_{ij}\) total reads covering \(i\)
  • \(p_s\) be the methylation probability in condition \(s\)
  • Let \(M_{ij} \sim Binom(N_{ij}, p_{s})\)
  • Or equivalently, \(M_{ij}\) is the sum of \(N_{ij}\) draws from \(Bernoulli(p_s)\)
  • How to test whether \(p_1 = p_2\)?

Binomial (Logisitic) regression

  • Generalized linear model for probability of success \(p\): \[logit(p)=log\Big(\frac{p}{1-p}\Big) = \boldsymbol{X\beta}\]
  • Link function \(g(p) = log\big(\frac{p}{1-p}\big)\) describes relationship between mean parameter of response and linear predictor
  • No closed form; fit with iterative ML estimation
  • Interpret coefficients on original scale with inverse link function \(g^{-1}\): \[ p = \frac{e^{\boldsymbol{X\beta}}}{1+e^{\boldsymbol{X\beta}}} \]
  • In R: glm(cbind(successes, failures) ~ x, family="binomial")

Link functions for GLM

Why logit transformation?

GLM vs Linear Regression

  • Linear regression on transformed proportions \(z\) (lm()):
    • \(log\Big(\frac{p}{1-p}\Big) = \boldsymbol{z} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}\)
    • \(\boldsymbol{z} \sim N(\boldsymbol{X\beta},\sigma^2)\)
    • \(Var(\boldsymbol{z}) = \sigma^2\)
  • Generlized Linear regression on binary variables \(y\) (glm()):
    • \(\boldsymbol{y} \sim Bern(p)\), \(Var(\boldsymbol{y}) = p(1-p)\)
    • \(log\Big(\frac{p}{1-p}\Big) = g(p) = \boldsymbol{X\beta}\)
    • \(Var(y) = g^{-1}(\boldsymbol{X\beta})(1-g^{-1}(\boldsymbol{X\beta}))\)

Binomial regression with overdispersion

  • In ordinary logistic regression, \(p\) is assumed to be constant for all samples with in group
  • To model overdispersion, we might want to allow this quantity to vary
  • For example, let \[p \sim Beta(\alpha, \beta)\] \[M \sim Binom(N, p)\]
  • This is the Beta-Binomial distribution
  • Overdispersion parameter \(\phi = \frac{1}{\alpha + \beta + 1}\) contributes to increased variance
  • Variance \(\sigma^2 = p(1-p)(\alpha + \beta + N)\phi\)
  • In R: aod::betabin(cbind(successes,failures) ~ x, ~ x)

Pitfalls of binomial regression

  • What happens to logit link function \(log\big(\frac{p}{1-p}\big)\) if \(p=0\) or \(1\)?
    • binomial regression unstable for fully methylated or unmethylated cytosines
  • Computationally intensive to fit model at every cytosine
  • DSS: Park & Wu 2016 (https://doi.org/10.1093/bioinformatics/btw026)
    • Differential methylation under general experimental design
    • Alternate link function: \(arcsine(2p-1)\)
    • Approximate fitting with Generalized Least Squares (GLS)

Generalized Least Squares (GLS) in a nutshell

  • Hybrid of linear regression and generalized linear regression
  • Pro: stable & closed form estimates (fast)
  • Con: approximate
  • Key idea: flexible covariance structure allows for specification of approximate beta-binomial error

Individual cytosine differences

CpG 1

CpG 2

Previous approaches: grouping significant CpGs

FDR at the region level

\[ \text{False Discovery Rate (FDR)} = E\Big[\frac{FP}{FP + TP}\Big]\]

  • \(FDR_{CpG} = 2/10 = 0.20\)
  • \(FDR_{DMR} = 1/2 = 0.50\)

Accurate inference of DMRs

dmrseq: 2-stage approach

Smoothing

dmrseq output

  • de novo regions with FDR estimate
  • ranked by effect size of methylation difference
  • can adjust for covariates

Review: role in gene regulation

Correlation or causation?

First genome-wide study of causality

  • “Promoter DNA methylation is generally not sufficient for transcriptional inactivation”

Design of Ford et al. Study

Conclusion: methylation not generally sufficient

Reanalysis with dmrseq

  • original study used DSS
    • approach groups together individual CpGs
    • doesn’t account for correlation of nearby loci
    • doesn’t control error rate at the region level
  • main question : does accurate inference of methylation increase make a difference in the conclusion?

Results of the reanalysis

Results of the reanalysis

Significance: statistical \(\iff\) biological

Packages you’ll need

# binomial regression exercises:
BiocManager::install(c("ggplot2", "dplyr", "broom", "purrr", "aod"))
# regulatory capacity vignette:
BiocManager::install(c("R.utils", "data.table", "annotatr", "dmrseq",
                       "bsseq", "DelayedMatrixStats", "dendextend", 
                       "tidyr", "BiocParallel"))