1 Logistic (Binomial) regression

Let’s start with a very simple example, where we have two groups (governed by \(x\)), each with a different probability of success. Let the probability of success be \(p=(1-x)p_0 + xp_1\), so that

  • If \(x=0\), then \(p=0.4\)
  • If \(x=1\), then \(p=0.6\)

We’ll sample 50 draws from a binomial distribution, each with \(n=10\). In terms of DNA methylation at a particular locus, this would correspond to 50 samples (25 in each group), each with coverage 10, where there’s a 20% methylation difference between the two groups.

library(ggplot2)
library(dplyr)
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
ggplot(data.frame(x=factor(x),y=as.numeric(y)), 
            aes(x=x, y=y/cov)) +
  geom_point(position=position_jitter(height=0.02, width=0.07)) +
  xlab("x") + 
  ylab("y/cov") 

Now we fit a logistic regression model with \(x\) as a covariate, using the logit link \[ \log\left(\frac{p}{1-p}\right) = \beta_0 + x\beta_1.\]

# Fit a logistic regression model
model0 <- glm(cbind(y, cov-y) ~ x, family="binomial")
summary(model0)
## 
## Call:
## glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.50013  -0.58688  -0.05123   0.48348   2.43452  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3064     0.1280  -2.394 0.016668 *  
## x             0.6786     0.1815   3.739 0.000185 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 53.720  on 49  degrees of freedom
## Residual deviance: 39.537  on 48  degrees of freedom
## AIC: 177.99
## 
## Number of Fisher Scoring iterations: 4

We see that \(x\) is very predictive of \(y\), as we expect.


1.1 Exercise

Recall that the estimated probability of success for the logistic regression model is the inverse logit function \[ \frac{e^{\beta_0 + x\beta_1}}{1 + e^{\beta_0 + x\beta_1}}.\] For model0, find the estimated probability of success when \(x=0\) and when \(x=1\).

# when x = 0
exp(coef(model0)[1]) /  (1+exp(coef(model0)[1]))
## (Intercept) 
##       0.424
# when x = 1
exp(sum(coef(model0))) /  (1+exp(sum(coef(model0))))
## [1] 0.592
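As a shortcut, base R’s plogis() is the logistic distribution function, i.e. exactly this inverse logit, so the same two probabilities can be computed more compactly (using model0 from above):

```r
# plogis(q) = exp(q) / (1 + exp(q)), the inverse logit
plogis(coef(model0)[1])    # estimated p when x = 0
plogis(sum(coef(model0)))  # estimated p when x = 1
```

These match the values computed manually above.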

Try fitting an ordinary least squares (linear regression) model with lm on the logit-transformed proportions. Recall that this model assumes normally distributed errors and does not explicitly model the count nature of the data. (Note that the logit is undefined when \(y=0\) or \(y=\mathrm{cov}\); neither occurs in this sample.) How does this model compare to the logistic model?

logit <- function(p){
  log(p/(1-p))
}
linearfit <- lm(logit(y/cov) ~ x)
summary(linearfit)
## 
## Call:
## lm(formula = logit(y/cov) ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05971 -0.42876 -0.05109  0.32659  1.76846 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3266     0.1298  -2.516 0.015268 *  
## x             0.7553     0.1836   4.115 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.649 on 48 degrees of freedom
## Multiple R-squared:  0.2608, Adjusted R-squared:  0.2454 
## F-statistic: 16.93 on 1 and 48 DF,  p-value: 0.0001513

Examine the distribution of the residuals of the previous model. Do they appear normal?

plot(density(residuals(linearfit)))


2 Logistic regression with overdispersion

The previous example did not allow for any biological variability (only sampling variability). More realistically, we’ll draw each sample’s methylation probability as a random quantity, where the two groups’ distributions have different means.

To do so, we’ll use the beta distribution, since it is a natural fit for modeling proportions. Here, we let the probability used in the binomial sampling be \[p=(1-x)p_0 + xp_1,\] where \[p_i \sim \mathrm{Beta}(\alpha_i,\beta_i).\] We set the hyperparameters as follows:

  • when \(x=0\): \(\alpha_0=4\) and \(\beta_0=6\)
  • when \(x=1\): \(\alpha_1=6\) and \(\beta_1=4\)

Since the mean of the beta distribution is \[E(p_i) = \frac{\alpha_i}{\alpha_i+\beta_i},\] the average probability of success for the first group (\(x=0\)) is 0.4, and the average probability of success for the second group (\(x=1\)) is 0.6. This is because \[E[p] = (1-x)E[p_0] + xE[p_1].\]
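As a quick sanity check, we can compute these group means directly from the hyperparameters:

```r
# mean of Beta(alpha, beta) is alpha / (alpha + beta)
alpha0 <- 4; beta0 <- 6   # group x = 0
alpha1 <- 6; beta1 <- 4   # group x = 1
c(mean0 = alpha0 / (alpha0 + beta0),
  mean1 = alpha1 / (alpha1 + beta1))
## mean0 mean1 
##   0.4   0.6
```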

Then, we plot the outcomes \(y\) against the known value \(x\).

set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- pmin((1-x)*rbeta(n,4,6) + x*rbeta(n,6,4), 1)
y <- rbinom(n, cov, p)
ggplot(data.frame(x=factor(x),y=as.numeric(y), p), 
            aes(x=x, y=y/cov)) +
  geom_point(position=position_jitter(height=0.02, width=0.07)) +
  xlab("x") + 
  ylab("y") 

Now we fit a logistic regression model with \(x\) as a covariate. Note that the additional beta noise is not modeled.

# Fit a logistic regression model
model1 <- glm(cbind(y, cov-y) ~ x, family="binomial")
summary(model1)
## 
## Call:
## glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5619  -0.9339  -0.1277   0.9894   2.3089  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.2249     0.1273  -1.767 0.077203 .  
## x             0.6138     0.1812   3.388 0.000704 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 104.717  on 49  degrees of freedom
## Residual deviance:  93.101  on 48  degrees of freedom
## AIC: 225.76
## 
## Number of Fisher Scoring iterations: 3

Now we see that \(x\) is still predictive of \(y\); however, the coefficient’s p-value is somewhat larger, likely because of the extra beta noise in the binomial probabilities.
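The extra variability is visible in the data itself: under a pure binomial model the within-group variance of \(y\) would be \(\mathrm{cov}\cdot p(1-p)\), but the beta noise inflates it. A quick check, re-simulating the data from this section:

```r
# re-simulate this section's data, then compare the empirical
# within-group variance of y to the pure-binomial variance
set.seed(1)
n <- 50; cov <- 10
x <- c(rep(0, n/2), rep(1, n/2))
p <- pmin((1 - x)*rbeta(n, 4, 6) + x*rbeta(n, 6, 4), 1)
y <- rbinom(n, cov, p)
pbar <- tapply(y, x, mean) / cov   # group mean proportions
tapply(y, x, var)                  # observed variance of y per group
cov * pbar * (1 - pbar)            # binomial variance (no overdispersion)
```

If the observed variances noticeably exceed the binomial ones, that excess is the overdispersion the plain logistic model ignores.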


2.1 Exercise

For model1, find the estimated probability of success when \(x=0\) and when \(x=1\).

# when x = 0
exp(coef(model1)[1]) /  (1+exp(coef(model1)[1]))
## (Intercept) 
##       0.444
# when x = 1
exp(sum(coef(model1))) /  (1+exp(sum(coef(model1))))
## [1] 0.596

2.2 Bootstrap comparison

This result is based on just one random sample. To convince ourselves that this is a consistent effect, we can repeat the two model fits on many random samples (e.g. using the dplyr, broom, and purrr packages). Here we show results for 1,000 replicates.

library(dplyr)
library(broom)
library(purrr)
set.seed(10)
n <- 50
cov <- 10
B <- 1000
x <- c(rep(0,n/2), rep(1, n/2))
# functions to run one replicate for each model
m0_rep <- function(){
  data.frame(x = c(rep(0,n/2), rep(1, n/2))) %>%
    mutate(p = 0.4 + 0.2*x,
           cov = cov) %>%
    mutate(y=rbinom(n, cov, p)) %>%  
    do(tidy(glm(cbind(y, cov-y) ~ x, family="binomial", data=.), 
            conf.int = TRUE)) 
}
m1_rep <- function(){
  data.frame(x = c(rep(0,n/2), rep(1, n/2))) %>%
    mutate(p = pmin((1-x)*rbeta(n,4,6) + x*rbeta(n,6,4), 1),
           cov = cov) %>%
    mutate(y=rbinom(n, cov, p)) %>%  
    do(tidy(glm(cbind(y, cov-y) ~ x, family="binomial", data=.), 
            conf.int = TRUE))
}
# replicate B times
m0_all <- replicate(B, m0_rep(), simplify=FALSE) %>% 
  do.call("rbind", .) %>%
  mutate(model="m0") %>%
  mutate(n = sort(rep(1:B, 2)))
m1_all <- replicate(B, m1_rep(), simplify=FALSE) %>% 
  do.call("rbind", .) %>%
  mutate(model="m1") %>%
  mutate(n = sort(rep(1:B, 2)))
# combine and pull out relevant info
all <- rbind(m0_all, m1_all)
x <- filter(all, term == "x")
prob1 <- all %>% group_by(model,n) %>%
  summarize(p = sum(estimate))
prob0 <- all %>% filter(term == "(Intercept)") %>%
  mutate(p = estimate)
prob0 %>% ggplot(aes(x = model, y = exp(p) / (1+exp(p)))) +
  geom_boxplot() +
  ylab("p estimate for x=0") + 
  geom_hline(yintercept=0.4, linetype="dashed", size=0.8, color = "blue")

prob1 %>% ggplot(aes(x = model, y = exp(p) / (1+exp(p)))) +
  geom_boxplot() +
  ylab("p estimate for x=1") + 
  geom_hline(yintercept=0.6, linetype="dashed", size=0.8, color = "blue")

sum(all$p.value[all$model=="m0"] > 0.05)
## [1] 107
sum(all$p.value[all$model=="m1"] > 0.05)
## [1] 199

The overdispersed setting (m1) fails to reject at the 0.05 level almost twice as often as m0, illustrating the power lost to the unmodeled beta noise.

3 Beta-binomial regression

In order to account for the overdispersion in the binomial probabilities, let’s try fitting a beta-binomial regression model to the data instead.

library(aod)
dat <- data.frame(s = y, f = cov-y, x = factor(c(rep(0,n/2), rep(1, n/2))))
model2 <- betabin(cbind(s,f) ~ x, ~ x, data=dat)
summary(model2)
## Beta-binomial model
## -------------------
## betabin(formula = cbind(s, f) ~ x, random = ~x, data = dat)
## 
## Convergence was obtained after 95 iterations.
## 
## Fixed-effect coefficients:
##               Estimate Std. Error    z value Pr(> |z|)
## (Intercept) -2.230e-01  1.627e-01 -1.371e+00 1.705e-01
## x1           6.086e-01  2.373e-01  2.565e+00 1.032e-02
## 
## Overdispersion coefficients:
##         Estimate Std. Error   z value   Pr(> z)
## phi.x0 7.055e-02  4.421e-02 1.596e+00 5.526e-02
## phi.x1 8.874e-02  4.716e-02 1.882e+00 2.995e-02
## 
## Log-likelihood statistics
##    Log-lik      nbpar    df res.   Deviance        AIC       AICc 
## -1.055e+02          4         46  8.225e+01  2.189e+02  2.198e+02

Notice that the estimated coefficients are similar to those from model1. Also notice that the standard errors are larger, and therefore the p-value for the \(x\) covariate is larger. We can also see that the new overdispersion parameters (\(\phi_{x=0}, \phi_{x=1}\)) are estimated. Recall that in the beta-binomial regression model, \[ \phi_i = \frac{1}{\alpha_i+\beta_i+1}. \]


3.1 Exercise

For model2, find the estimated probability of success when \(x=0\) and when \(x=1\).

# when x = 0
exp(coef(model2)[1]) /  (1+exp(coef(model2)[1]))
## (Intercept) 
##   0.4444839
# when x = 1
exp(sum(coef(model2))) /  (1+exp(sum(coef(model2))))
## [1] 0.595

What are the true values of the overdispersion parameters in this model? How close are the estimated overdispersion coefficients in model2?

# phi1 = 1/ (alpha1 + beta1 + 1)
phi1 <- 1/(4 + 6 + 1) 
phi1
## [1] 0.09090909
# phi for the second group is the same, since alpha + beta is equal in both groups
summary(model2)@Phi$Estimate - phi1
## [1] -0.020356922 -0.002169733
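Since \(\phi_i = 1/(\alpha_i+\beta_i+1)\), we can also invert the fitted \(\phi\) values to see the implied \(\alpha_i+\beta_i\) for each group (using model2 and the same @Phi slot as above; the true value is \(4+6=10\)):

```r
# invert phi = 1 / (alpha + beta + 1) to recover alpha + beta
phi_hat <- summary(model2)@Phi$Estimate
1 / phi_hat - 1
```

With the estimates shown above this gives roughly 13.2 and 10.3, reasonably close to the true 10.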

3.2 Bootstrap comparison

We’ll explore how the beta-binomial regression model differs from logistic regression on the same dataset. Here, we’ll use a null comparison, where the \(x\) variable has no influence on the binomial probabilities. In terms of methylation, this would be a case where there’s no differential methylation. Ideally, each model will estimate the effect of \(x\) (\(\beta_1\)) to be close to zero.

set.seed(4)
n <- 20
cov <- 5
B <- 1000
x <- c(rep(0,n/2), rep(1, n/2))
# functions to run one replicate for each model
data_rep <- function(){
  data.frame(x = c(rep(0,n/2), rep(1, n/2))) %>%
    mutate(p = rbeta(n,4,4),
           cov = cov) %>%
    mutate(y=rbinom(n, cov, p)) %>%
    mutate(f=cov-y) %>%
    mutate(x = as.factor(x))
}
dat <- replicate(B, data_rep(), simplify=FALSE) 
m1_all <- lapply(dat, function(l)
  tidy(glm(cbind(y, cov-y) ~ x, family="binomial", data=l), 
            conf.int = TRUE)) %>%
  do.call("rbind", .) %>%
  mutate(model="m1") %>%
  mutate(n = sort(rep(1:B, 2))) %>%
  select(estimate, p.value, n, term, model, std.error)
m2_all <- lapply(dat, function(l) 
  summary(betabin(cbind(y,f) ~ x, ~ x, data=l))@Coef) %>%
  do.call("rbind", .) %>%
  mutate(model="m2") %>%
  mutate(n = sort(rep(1:B, 2))) %>%
  mutate(term = rep(c("int", "x1"), B)) %>%
  mutate(estimate=Estimate,
         p.value = `Pr(> |z|)`,
         std.error = `Std. Error`) %>%
  select(estimate, p.value, n, term, model, std.error)
# combine and pull out relevant info
all <- rbind(m1_all, m2_all)
prob0 <- all %>% filter(term != "x1") %>%
  mutate(p = estimate/std.error)
prob1 <- all %>% filter(term == "x1") %>%
  mutate(p = estimate/std.error)
prob1 %>% ggplot(aes(x = model, y = p)) +
  geom_boxplot() +
  ylab("test statistic for x") + 
  geom_hline(yintercept=0, linetype="dashed", size=0.8, color = "blue")

sum(all$p.value[all$model=="m1"] < 0.1, na.rm = TRUE)
## [1] 339
sum(all$p.value[all$model=="m2"] < 0.1, na.rm = TRUE)
## [1] 195

We see that the beta-binomial regression model performs better: its test statistics for the \(x\) coefficient are more tightly centered on zero, and it makes fewer rejections at the 0.1 level for a significant \(x\) coefficient under this null.

4 Pitfalls of GLM

Let’s try to fit a model to some data which are perfectly separated (e.g. all successes are in one group and all failures in another, with group as the predictor).

set.seed(1)
n <- 50
cov <- 4
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.01 + 0.99*x
y <- rbinom(n, cov, p)
ggplot(data.frame(x=factor(x),y=as.numeric(y)), 
            aes(x=x, y=y/cov)) +
  geom_point(position=position_jitter(height=0, width=0.07)) +
  xlab("x") + 
  ylab("y/cov") 

# Fit a logistic regression model
model.sep <- glm(cbind(y, cov-y) ~ x, family="binomial")
summary(model.sep)
## 
## Call:
## glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.28355  -0.28355   0.00004   0.00004   2.18448  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.595      1.005  -4.572 4.83e-06 ***
## x             26.960   4359.342   0.006    0.995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 272.7402  on 49  degrees of freedom
## Residual deviance:   6.7016  on 48  degrees of freedom
## AIC: 12.428
## 
## Number of Fisher Scoring iterations: 20

Notice how the estimate of the coefficient for \(x\) and its standard error are both extremely large, which yields a \(p\)-value close to 1. This happens because the true probability of success when \(x=1\) is essentially 1, so the fitted value must satisfy \[\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}} \approx 1.\] This is a problem, because the solution for \(\beta_1\) diverges to \(\infty\), and the MLE does not exist; the enormous standard error (and the large number of Fisher scoring iterations) is a telltale sign.
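We can confirm numerically that the fitted success probability for \(x=1\) has been pushed to the boundary (using model.sep from the chunk above):

```r
# inverse logit of the fitted linear predictor at x = 1;
# with a separated predictor this is numerically indistinguishable from 1
b <- coef(model.sep)
exp(sum(b)) / (1 + exp(sum(b)))
```

In practice, the glm warning "fitted probabilities numerically 0 or 1 occurred" is the usual symptom of separation.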

5 Session Information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] aod_1.3.1        purrr_0.3.2      broom_0.5.2      dplyr_0.8.3     
## [5] ggplot2_3.2.0    BiocStyle_2.13.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1         pillar_1.4.2       compiler_3.6.0    
##  [4] BiocManager_1.30.4 tools_3.6.0        digest_0.6.19     
##  [7] lattice_0.20-38    evaluate_0.14      tibble_2.1.3      
## [10] gtable_0.3.0       nlme_3.1-139       pkgconfig_2.0.2   
## [13] rlang_0.4.0        yaml_2.2.0         xfun_0.8          
## [16] withr_2.1.2        stringr_1.4.0      knitr_1.23        
## [19] generics_0.0.2     grid_3.6.0         tidyselect_0.2.5  
## [22] glue_1.3.1         R6_2.4.0           rmarkdown_1.13    
## [25] bookdown_0.11      tidyr_0.8.3        magrittr_1.5      
## [28] MASS_7.3-51.4      backports_1.1.4    scales_1.0.0      
## [31] htmltools_0.3.6    assertthat_0.2.1   colorspace_1.4-1  
## [34] labeling_0.3       stringi_1.4.3      lazyeval_0.2.2    
## [37] munsell_0.5.0      crayon_1.3.4