Introduction to DNA methylation analysis

July 10, 2025

DNA Methylation (5mC)

chemical modification to an individual DNA base (Cytosine)
doesn’t change coding sequence

image source: https://www.epigentek.com

DNA Methylation at CpGs

The 5th base?

heritable upon cell division
dynamic

image source: https://slideplayer.com/slide/10719924/

Preview: role in gene regulation

Evolution of methylation assays

Laird 2010 (https://doi.org/10.1038/nrg2732)

Analysis of methylation arrays

Methylation microarray (e.g. Infinum 450K, Epic 850K) analysis has similarities to expression microarray analysis
- Signal is continuous
- 1-16 samples per array
Beta values: \[\beta = \frac{\text{methylated intensity}}{(\text{meth} + \text{unmeth intensity})}\]
Analysis: regression on \(logit(\beta)\)
Packages: minfi, bumphunter

image: https://en.wikipedia.org/wiki/Illumina_Methylation_Assay

Analysis of MeDIPseq

MeDIP-sequencing aims to sequence only methylated DNA
Analysis: differential abundance (counts) on genomic bins
Package: MEDIPS

image source: https://en.wikipedia.org/wiki/Methylated_DNA_immunoprecipitation

Analysis of bisulfite sequencing

Bisulfite sequencing (e.g. WGBS, RRBS)
- basepair resolution
- binomial count of methylated/unmethylated reads
Packages:
- bsseq: smoothed t-test
- dss: approximate beta-binomial regression
- dmrseq: approximate beta-binomial regression over regions

drawing drawing

Bisulfite sequencing

Preprocessing

QC (fastQC/multiQC)
adapter/quality trimming Trim Galore!
alignment with Bismark
read into R with bsseq package

Differential methylation

differentially methylated cytosine (DMC) or locus (DML):
- array: test each probe
- bisulfite sequencing: test each cytosine
differentially methylated region (DMR):
- array: test groups of neighboring probes
- bisulfite sequencing: test groups of cytosines

DMC test of bisulfite read counts

For cytosine \(i\) and sample \(j\) in condition \(s\), we have:
\(M_{ij}\) reads corresponding to methylation
\(N_{ij}\) total reads covering \(i\)
\(p_s\) be the methylation probability in condition \(s\)
Let \(M_{ij} \sim Binom(N_{ij}, p_{s})\)
Or equivalently, \(M_{ij}\) is the sum of \(N_{ij}\) draws from \(Bernoulli(p_s)\)
How to test whether \(p_1 = p_2\)?

Binomial (Logisitic) regression

Generalized linear model for probability of success \(p\): \[logit(p)=log\Big(\frac{p}{1-p}\Big) = \boldsymbol{X\beta}\]
Link function \(g(p) = log\big(\frac{p}{1-p}\big)\) describes relationship between mean parameter of response and linear predictor
No closed form; fit with iterative ML estimation
Interpret coefficients on original scale with inverse link function \(g^{-1}\): \[ p = \frac{e^{\boldsymbol{X\beta}}}{1+e^{\boldsymbol{X\beta}}} \]
In R: glm(cbind(successes, failures) ~ x, family="binomial")

Link functions for GLM

Thiele & Markussen 2012 (https://doi.org/10.1079/PAVSNNR20127016)

Why logit transformation?

GLM vs Linear Regression

Linear regression on transformed proportions \(z\) (lm()):
- \(log\Big(\frac{p}{1-p}\Big) = \boldsymbol{z} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}\)
- \(\boldsymbol{z} \sim N(\boldsymbol{X\beta},\sigma^2)\)
- \(Var(\boldsymbol{z}) = \sigma^2\)

Generlized Linear regression on binary variables \(y\) (glm()):
- \(\boldsymbol{y} \sim Bern(p)\), \(Var(\boldsymbol{y}) = p(1-p)\)
- \(log\Big(\frac{p}{1-p}\Big) = g(p) = \boldsymbol{X\beta}\)
- \(Var(y) = g^{-1}(\boldsymbol{X\beta})(1-g^{-1}(\boldsymbol{X\beta}))\)

Binomial regression with overdispersion

In ordinary logistic regression, \(p\) is assumed to be constant for all samples with in group
To model overdispersion, we might want to allow this quantity to vary
For example, let \[p \sim Beta(\alpha, \beta)\] \[M \sim Binom(N, p)\]
This is the Beta-Binomial distribution
Overdispersion parameter \(\phi = \frac{1}{\alpha + \beta + 1}\) contributes to increased variance
Variance \(\sigma^2 = p(1-p)(\alpha + \beta + N)\phi\)
In R: aod::betabin(cbind(successes,failures) ~ x, ~ x)

Pitfalls of binomial regression

What happens to logit link function \(log\big(\frac{p}{1-p}\big)\) if \(p=0\) or \(1\)?
- binomial regression unstable for fully methylated or unmethylated cytosines
Computationally intensive to fit model at every cytosine
DSS: Park & Wu 2016 (https://doi.org/10.1093/bioinformatics/btw026)
- Differential methylation under general experimental design
- Alternate link function: \(arcsine(2p-1)\)
- Approximate fitting with Generalized Least Squares (GLS)

Generalized Least Squares (GLS) in a nutshell

Hybrid of linear regression and generalized linear regression
Pro: stable & closed form estimates (fast)
Con: approximate
Key idea: flexible covariance structure allows for specification of approximate beta-binomial error

Individual cytosine differences

CpG 1

CpG 2

Previous approaches: grouping significant CpGs

Examples: BSmooth (Hansen et al. 2012), DSS (Park et al. 2016)

FDR at the region level

\[ \text{False Discovery Rate (FDR)} = E\Big[\frac{FP}{FP + TP}\Big]\]

\(FDR_{CpG} = 2/10 = 0.20\)
\(FDR_{DMR} = 1/2 = 0.50\)

Accurate inference of DMRs

Korthauer et al. 2018 (https://doi.org/10.1093/biostatistics/kxy007)
Key ideas:
- model methylation signal over region to account for correlated signal
- permutation to acheive accurate FDR control
Implemented in the dmrseq Bioconductor package

dmrseq: 2-stage approach

Smoothing

Red line: 35x coverage
Black line: subsampled to 5x coverage

Hansen et al. 2012 (https://doi.org/10.1186/gb-2012-13-10-r83)

dmrseq output

de novo regions with FDR estimate
ranked by effect size of methylation difference
can adjust for covariates

Review: role in gene regulation

Correlation or causation?

Korthauer et al. 2018 (https://doi.org/10.1093/biostatistics/kxy007)

First genome-wide study of causality

“Promoter DNA methylation is generally not sufficient for transcriptional inactivation”

Design of Ford et al. Study

Conclusion: methylation not generally sufficient

Reanalysis with dmrseq

original study used DSS
- approach groups together individual CpGs
- doesn’t account for correlation of nearby loci
- doesn’t control error rate at the region level
main question : does accurate inference of methylation increase make a difference in the conclusion?

Results of the reanalysis

Korthauer & Irizarry 2018 (https://doi.org/10.1101/381145)

Results of the reanalysis

Korthauer & Irizarry 2018 (https://doi.org/10.1101/381145)

Significance: statistical \(\iff\) biological

Korthauer & Irizarry 2018 (https://doi.org/10.1101/381145)

Packages you’ll need

# binomial regression exercises:
BiocManager::install(c("ggplot2", "dplyr", "broom", "purrr", "aod"))
# regulatory capacity vignette:
BiocManager::install(c("R.utils", "data.table", "annotatr", "dmrseq",
                       "bsseq", "DelayedMatrixStats", "dendextend", 
                       "tidyr", "BiocParallel"))