Let’s start with a very simple example, where we have two groups (governed by \(x\)), each with a different probability of success. Let the probability of success equal \(p=(1-x)p_0 + xp_1\), so that \(p_0 = 0.4\) in the first group and \(p_1 = 0.6\) in the second.
We’ll sample 50 draws from a binomial distribution, each with \(n=10\). In terms of DNA methylation at a particular locus, this would be 50 samples (25 in each group), each with coverage 10, where there’s a 20% methylation difference between the two groups.
library(ggplot2)
library(dplyr)
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
ggplot(data.frame(x=factor(x), y=as.numeric(y)),
       aes(x=x, y=y/cov)) +
  geom_point(position=position_jitter(height=0.02, width=0.07)) +
  xlab("x") +
  ylab("y/cov")
Now we fit a logistic regression model with \(x\) as a covariate, using the logit link \[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x. \]
# Fit a logistic regression model
model0 <- glm(cbind(y, cov-y) ~ x, family="binomial")
summary(model0)
##
## Call:
## glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.50013 -0.58688 -0.05123 0.48348 2.43452
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3064 0.1280 -2.394 0.016668 *
## x 0.6786 0.1815 3.739 0.000185 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 53.720 on 49 degrees of freedom
## Residual deviance: 39.537 on 48 degrees of freedom
## AIC: 177.99
##
## Number of Fisher Scoring iterations: 4
We see that \(x\) is very predictive of \(y\), as we expect.
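On the odds scale, the coefficient for \(x\) is directly interpretable. As a quick sketch (using the rounded estimate from the summary above; `exp(coef(model0)["x"])` on the fitted object gives the same value more precisely):

```r
# odds ratio implied by the fitted coefficient for x
b1 <- 0.6786  # rounded estimate for x from summary(model0)
exp(b1)       # ~1.97: being in group x=1 roughly doubles the odds of success
```

For comparison, the true odds ratio under \(p_0=0.4, p_1=0.6\) is \((0.6/0.4)/(0.4/0.6) = 2.25\).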
Recall that the estimated probability of success for the logistic regression model is the inverse logit function \[ \frac{e^{\beta_0 + x\beta_1}}{1 + e^{\beta_0 + x\beta_1}}.\] For model0, find the estimated probability of success when \(x=0\) and when \(x=1\).
# when x = 0
exp(coef(model0)[1]) / (1+exp(coef(model0)[1]))
## (Intercept)
## 0.424
# when x = 1
exp(sum(coef(model0))) / (1+exp(sum(coef(model0))))
## [1] 0.592
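As an aside, base R’s plogis() is exactly this inverse logit, so the same probabilities can be computed without writing out the formula (shown here with the rounded coefficients from the summary; plogis(coef(model0)[1]) and plogis(sum(coef(model0))) work equally well):

```r
b <- c(-0.3064, 0.6786)  # rounded (Intercept) and x estimates from model0
plogis(b[1])             # ~0.424, estimated probability of success when x = 0
plogis(sum(b))           # ~0.592, estimated probability of success when x = 1
```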
Try fitting an ordinary least squares (linear regression) model with lm on logit-transformed proportions. Recall that this model assumes normally distributed errors and does not explicitly model the count nature of the data. (Note that the logit transform is undefined when \(y=0\) or \(y=\) cov; that does not occur in this particular sample.) How does this model compare to the logistic model?
logit <- function(p){
  log(p/(1-p))
}
linearfit <- lm(logit(y/cov) ~ x)
summary(linearfit)
##
## Call:
## lm(formula = logit(y/cov) ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05971 -0.42876 -0.05109 0.32659 1.76846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3266 0.1298 -2.516 0.015268 *
## x 0.7553 0.1836 4.115 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.649 on 48 degrees of freedom
## Multiple R-squared: 0.2608, Adjusted R-squared: 0.2454
## F-statistic: 16.93 on 1 and 48 DF, p-value: 0.0001513
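Note that the hand-rolled logit() helper used above is equivalent to base R’s qlogis() (the logistic quantile function), just as plogis() is the inverse logit:

```r
p <- c(0.2, 0.5, 0.9)
all.equal(log(p/(1-p)), qlogis(p))  # TRUE: qlogis() is the logit transform
```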
Examine the distribution of the residuals of the previous model. Do they appear normal?
plot(density(residuals(linearfit)))
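Beyond the density estimate, base R’s normal Q-Q plot and the Shapiro–Wilk test offer complementary checks on the same residuals:

```r
qqnorm(residuals(linearfit))        # points near the line indicate normality
qqline(residuals(linearfit))
shapiro.test(residuals(linearfit))  # a small p-value would suggest non-normality
```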
The previous example did not allow for any biological variability (only sampling variability). More realistically, we’ll draw each sample’s methylation probability as a random quantity, where the distributions for the two groups have different means.
To do so, we’ll use the beta distribution, since it is a natural fit for modeling proportions. Here, we let the probabilities used in the binomial sampling be \[p=(1-x)p_0 + xp_1,\] where \[p_i \sim Beta(\alpha_i,\beta_i).\] We set the hyperparameters to \(\alpha_0=4, \beta_0=6\) for the first group and \(\alpha_1=6, \beta_1=4\) for the second (matching the rbeta calls in the code below).
Since the mean of the beta distribution is \[E(p_i) = \frac{\alpha_i}{\alpha_i+\beta_i},\] the average probability of success for the first group (\(x=0\)) is 0.4, and the average probability of success for the second group (\(x=1\)) is 0.6. This is because \[E[p] = (1-x)E[p_0] + xE[p_1].\]
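As a quick sanity check on these means, large Monte Carlo samples from the two beta distributions (with the hyperparameters assumed above) should sit close to the analytic values 0.4 and 0.6:

```r
set.seed(1)
mean(rbeta(1e5, 4, 6))  # ~0.4 = 4/(4+6)
mean(rbeta(1e5, 6, 4))  # ~0.6 = 6/(6+4)
```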
Then, we plot the outcomes \(y\) against the known value \(x\).
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- pmin((1-x)*rbeta(n,4,6) + x*rbeta(n,6,4), 1)
y <- rbinom(n, cov, p)
ggplot(data.frame(x=factor(x), y=as.numeric(y), p),
       aes(x=x, y=y/cov)) +
  geom_point(position=position_jitter(height=0.02, width=0.07)) +
  xlab("x") +
  ylab("y/cov")
Now we fit a logistic regression model with \(x\) as a covariate. Note that the additional beta noise is not modeled.
# Fit a logistic regression model
model1 <- glm(cbind(y, cov-y) ~ x, family="binomial")
summary(model1)
##
## Call:
## glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5619 -0.9339 -0.1277 0.9894 2.3089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2249 0.1273 -1.767 0.077203 .
## x 0.6138 0.1812 3.388 0.000704 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 104.717 on 49 degrees of freedom
## Residual deviance: 93.101 on 48 degrees of freedom
## AIC: 225.76
##
## Number of Fisher Scoring iterations: 3
Now we see that \(x\) is still predictive of \(y\); however, the coefficient has a somewhat larger p-value, likely due to the extra beta noise in the binomial probabilities.
For model1, find the estimated probability of success when \(x=0\) and when \(x=1\).
# when x = 0
exp(coef(model1)[1]) / (1+exp(coef(model1)[1]))
## (Intercept)
## 0.444
# when x = 1
exp(sum(coef(model1))) / (1+exp(sum(coef(model1))))
## [1] 0.596
This result is based on just one random sample. To convince ourselves that this is a consistent effect, we can repeat the two model fits on many random samples (e.g. using the dplyr, broom, and purrr packages). Here we show results for 1,000 replicates.
library(dplyr)
library(broom)
library(purrr)
set.seed(10)
n <- 50
cov <- 10
B <- 1000
x <- c(rep(0,n/2), rep(1, n/2))
# functions to run one replicate for each model
m0_rep <- function(){
  data.frame(x = c(rep(0,n/2), rep(1, n/2))) %>%
    mutate(p = 0.4 + 0.2*x,
           cov = cov) %>%
    mutate(y = rbinom(n, cov, p)) %>%
    do(tidy(glm(cbind(y, cov-y) ~ x, family="binomial", data=.),
            conf.int = TRUE))
}
m1_rep <- function(){
  data.frame(x = c(rep(0,n/2), rep(1, n/2))) %>%
    mutate(p = pmin((1-x)*rbeta(n,4,6) + x*rbeta(n,6,4), 1),
           cov = cov) %>%
    mutate(y = rbinom(n, cov, p)) %>%
    do(tidy(glm(cbind(y, cov-y) ~ x, family="binomial", data=.),
            conf.int = TRUE))
}
# replicate B times
m0_all <- replicate(B, m0_rep(), simplify=FALSE) %>%
  do.call("rbind", .) %>%
  mutate(model = "m0") %>%
  mutate(n = sort(rep(1:B, 2)))
m1_all <- replicate(B, m1_rep(), simplify=FALSE) %>%
  do.call("rbind", .) %>%
  mutate(model = "m1") %>%
  mutate(n = sort(rep(1:B, 2)))
# combine and pull out relevant info
all <- rbind(m0_all, m1_all)
x <- filter(all, term == "x")
prob1 <- all %>% group_by(model, n) %>%
  summarize(p = sum(estimate))
prob0 <- all %>% filter(term == "(Intercept)") %>%
  mutate(p = estimate)
prob0 %>% ggplot(aes(x = model, y = exp(p)/(1+exp(p)))) +
  geom_boxplot() +
  ylab("p estimate for x=0") +
  geom_hline(yintercept=0.4, linetype="dashed", size=0.8, color="blue")
prob1 %>% ggplot(aes(x = model, y = exp(p)/(1+exp(p)))) +
  geom_boxplot() +
  ylab("p estimate for x=1") +
  geom_hline(yintercept=0.6, linetype="dashed", size=0.8, color="blue")
# count fits with p-value > 0.05 (note: `all` contains a row for both the
# intercept and the x coefficient in each replicate)
sum(all$p.value[all$model=="m0"] > 0.05)
## [1] 107
sum(all$p.value[all$model=="m1"] > 0.05)
## [1] 199
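The larger count of non-significant replicates under the beta-binomial simulation reflects the extra variance the beta draws add on top of binomial sampling. A small, self-contained sketch of that variance inflation for a single group, using the same Beta(4, 6) assumption as above:

```r
set.seed(1)
N <- 1e5; size <- 10
y_fixed <- rbinom(N, size, 0.4)             # binomial: p fixed at 0.4
y_beta  <- rbinom(N, size, rbeta(N, 4, 6))  # beta-binomial: p ~ Beta(4, 6)
var(y_fixed)  # ~ size*p*(1-p) = 2.4
var(y_beta)   # ~ 2.4 * (1 + (size-1)*phi) with phi = 1/11, about 4.4
```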
In order to account for the overdispersion in the binomial probabilities, let’s try fitting a beta-binomial regression model to the data instead.
library(aod)
dat <- data.frame(s = y, f = cov-y, x = factor(c(rep(0,n/2), rep(1, n/2))))
model2 <- betabin(cbind(s,f) ~ x, ~ x, data=dat)
summary(model2)
## Beta-binomial model
## -------------------
## betabin(formula = cbind(s, f) ~ x, random = ~x, data = dat)
##
## Convergence was obtained after 95 iterations.
##
## Fixed-effect coefficients:
## Estimate Std. Error z value Pr(> |z|)
## (Intercept) -2.230e-01 1.627e-01 -1.371e+00 1.705e-01
## x1 6.086e-01 2.373e-01 2.565e+00 1.032e-02
##
## Overdispersion coefficients:
## Estimate Std. Error z value Pr(> z)
## phi.x0 7.055e-02 4.421e-02 1.596e+00 5.526e-02
## phi.x1 8.874e-02 4.716e-02 1.882e+00 2.995e-02
##
## Log-likelihood statistics
## Log-lik nbpar df res. Deviance AIC AICc
## -1.055e+02 4 46 8.225e+01 2.189e+02 2.198e+02
Notice that the estimated coefficients are similar to those of model1. Also notice that the standard errors are larger, and therefore the p-value for the \(x\) covariate is larger. We can also see that overdispersion parameters (\(\phi_{x=0}, \phi_{x=1}\)) are now estimated. Recall that in the beta-binomial regression model, \[ \phi_i = \frac{1}{\alpha_i+\beta_i+1}. \]
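Inverting that relationship gives \(\alpha_i + \beta_i = 1/\phi_i - 1\), so the fitted \(\phi\) values imply an estimate of each group’s beta precision. A sketch using the rounded estimates printed above (the true value of \(\alpha_i+\beta_i\) is 10 in both groups):

```r
phi_hat <- c(x0 = 0.0706, x1 = 0.0887)  # rounded phi estimates from summary(model2)
1/phi_hat - 1                           # implied alpha_i + beta_i: ~13.2 and ~10.3
```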
For model2, find the estimated probability of success when \(x=0\) and when \(x=1\).
# when x = 0
exp(coef(model2)[1]) / (1+exp(coef(model2)[1]))
## (Intercept)
## 0.4444839
# when x = 1
exp(sum(coef(model2))) / (1+exp(sum(coef(model2))))
## [1] 0.595
What are the true values of the overdispersion parameters in this model? How close are the estimated overdispersion coefficients in model2?
# phi1 = 1/ (alpha1 + beta1 + 1)
phi1 <- 1/(4 + 6 + 1)
phi1
## [1] 0.09090909
# phi2 is the same as phi1, since alpha + beta is equal for both groups
summary(model2)@Phi$Estimate - phi1
## [1] -0.020356922 -0.002169733
We’ll explore how the beta-binomial regression model differs from logistic regression on the same dataset. Here, we’ll use a null comparison, where the \(x\) variable actually does not have any influence on the binomial probabilities. In terms of methylation, this would be a case where there’s no differential methylation. Ideally, the model will estimate the effect of \(x\) (\(\beta_1\)) close to zero.
set.seed(4)
n <- 20
cov <- 5
B <- 1000
x <- c(rep(0,n/2), rep(1, n/2))
# functions to run one replicate for each model
data_rep <- function(){
  data.frame(x = c(rep(0,n/2), rep(1, n/2))) %>%
    mutate(p = rbeta(n,4,4),
           cov = cov) %>%
    mutate(y = rbinom(n, cov, p)) %>%
    mutate(f = cov-y) %>%
    mutate(x = as.factor(x))
}
dat <- replicate(B, data_rep(), simplify=FALSE)
m1_all <- lapply(dat, function(l)
    tidy(glm(cbind(y, cov-y) ~ x, family="binomial", data=l),
         conf.int = TRUE)) %>%
  do.call("rbind", .) %>%
  mutate(model = "m1") %>%
  mutate(n = sort(rep(1:B, 2))) %>%
  select(estimate, p.value, n, term, model, std.error)
m2_all <- lapply(dat, function(l)
    summary(betabin(cbind(y,f) ~ x, ~ x, data=l))@Coef) %>%
  do.call("rbind", .) %>%
  mutate(model = "m2") %>%
  mutate(n = sort(rep(1:B, 2))) %>%
  mutate(term = rep(c("int", "x1"), B)) %>%
  mutate(estimate = Estimate,
         p.value = `Pr(> |z|)`,
         std.error = `Std. Error`) %>%
  select(estimate, p.value, n, term, model, std.error)
# combine and pull out relevant info
all <- rbind(m1_all, m2_all)
prob1 <- all %>% filter(term == "x1") %>%
  mutate(p = estimate/std.error)
prob1 %>% ggplot(aes(x = model, y = p)) +
  geom_boxplot() +
  ylab("test statistic for x") +
  geom_hline(yintercept=0, linetype="dashed", size=0.8, color="blue")
# count fits with p-value < 0.1 (again counting both coefficient rows; na.rm
# guards against fits that returned NA)
sum(all$p.value[all$model=="m1"] < 0.1, na.rm = TRUE)
## [1] 339
sum(all$p.value[all$model=="m2"] < 0.1, na.rm = TRUE)
## [1] 195
We see that the beta-binomial regression model performs better here: its test statistics for the \(x\) coefficient are more tightly centered on zero, and it makes fewer rejections at the 0.1 level for a coefficient that is truly null.
Now let’s fit a model to some data which are (nearly) perfectly separated, e.g. where essentially all successes are in one group, with group as the predictor.
set.seed(1)
n <- 50
cov <- 4
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.01 + 0.99*x
y <- rbinom(n, cov, p)
ggplot(data.frame(x=factor(x), y=as.numeric(y)),
       aes(x=x, y=y/cov)) +
  geom_point(position=position_jitter(height=0, width=0.07)) +
  xlab("x") +
  ylab("y/cov")
# Fit a logistic regression model
model.sep <- glm(cbind(y, cov-y) ~ x, family="binomial")
summary(model.sep)
##
## Call:
## glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.28355 -0.28355 0.00004 0.00004 2.18448
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.595 1.005 -4.572 4.83e-06 ***
## x 26.960 4359.342 0.006 0.995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 272.7402 on 49 degrees of freedom
## Residual deviance: 6.7016 on 48 degrees of freedom
## AIC: 12.428
##
## Number of Fisher Scoring iterations: 20
Notice how the estimate of the coefficient for \(x\) and its standard error are extremely large, which yields a \(p\)-value close to 1. This happens because the true probability of success when \(x=1\) is close to 1, so the fitted model must satisfy \[\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}} \approx 1.\] This is a problem: the solution for \(\beta_1\) diverges to \(\infty\), and the MLE does not exist.
This instability can be avoided by using an alternative link function, such as the arcsine link \[ \arcsin(2p-1). \] Setting the inverse link \(\frac{\sin(\beta)+1}{2}\) equal to 1 and solving gives the finite value \(\beta = \pi/2\).
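We can check numerically that the arcsine link reaches the boundary at a finite coefficient, while the inverse logit only approaches 1 asymptotically:

```r
inv_arcsine <- function(b) (sin(b) + 1)/2  # inverse of the arcsine link
inv_arcsine(pi/2)  # exactly 1 at the finite value beta = pi/2
plogis(pi/2)       # inverse logit at the same beta is only ~0.83
```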
This link function is similar to the logit in that it transforms a (0,1) quantity in order to stabilize variance, but the transformation is less drastic in the extremes.
p <- sort(c(seq(0,0.01, by=1e-5), seq(0,1,by=0.01), seq(0.99,1, by=1e-5)))
plot(p, log(p/(1-p)), type = "l", ylab = "transformed", ylim=c(-4,4))
lines(p, asin(2*p-1), col="red")
legend("topleft", legend=c("logit", "arcsine"), col=c("black", "red"),
       lty=c(1,1))
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] aod_1.3.1 purrr_0.3.2 broom_0.5.2 dplyr_0.8.3
## [5] ggplot2_3.2.0 BiocStyle_2.13.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.2 compiler_3.6.0
## [4] BiocManager_1.30.4 tools_3.6.0 digest_0.6.19
## [7] lattice_0.20-38 evaluate_0.14 tibble_2.1.3
## [10] gtable_0.3.0 nlme_3.1-139 pkgconfig_2.0.2
## [13] rlang_0.4.0 yaml_2.2.0 xfun_0.8
## [16] withr_2.1.2 stringr_1.4.0 knitr_1.23
## [19] generics_0.0.2 grid_3.6.0 tidyselect_0.2.5
## [22] glue_1.3.1 R6_2.4.0 rmarkdown_1.13
## [25] bookdown_0.11 tidyr_0.8.3 magrittr_1.5
## [28] MASS_7.3-51.4 backports_1.1.4 scales_1.0.0
## [31] htmltools_0.3.6 assertthat_0.2.1 colorspace_1.4-1
## [34] labeling_0.3 stringi_1.4.3 lazyeval_0.2.2
## [37] munsell_0.5.0 crayon_1.3.4