```mermaid
flowchart LR
A[Pick binary response y] --> B[Pick predictors x1..xk]
B --> C[Fit glm family binomial]
C --> D[Read summary: coef, deviance]
D --> E[Interpret as log-odds or odds ratios]
E --> F[Classification metrics and ROC]
F --> G{Fit acceptable?}
G -->|No| H[Refine predictors or threshold]
H --> C
G -->|Yes| I[Predict probabilities and report]
```
16 Logistic Regression
16.1 Logistic Regression in Context
Logistic regression models a binary outcome: churn or stay, default or repay, click or skip. The linear regression of Chapter 15 is not directly applicable because its predicted values are not constrained to the zero-to-one interval that a probability needs. Logistic regression keeps the linear-in-coefficients structure but passes the linear predictor through a logit link so that the output is always a probability (Berkson 1944; Nelder and Wedderburn 1972).
Binary outcomes are everywhere in business: whether a customer churns, whether a loan is repaid, whether a ticket is resolved on first contact, whether a prospect converts. Each is a zero-or-one question, and each calls for a model whose output is a probability.
Logistic regression is one member of the generalised linear model family. The family specifies a distribution for the response (binomial here) and a link function that connects the linear predictor to the mean of that distribution (here, the logit).
16.2 The Logistic Model
The model writes the log-odds of the positive outcome as a linear combination of predictors: log(p / (1 minus p)) equals beta0 plus beta1 x1 plus up to betak xk. Inverting the logit returns the probability p. Coefficients are estimated by maximum likelihood rather than ordinary least squares.
A coefficient of 0.5 in logistic regression does not say the probability rises by 0.5 per unit of x. It says the log-odds rise by 0.5, which means the odds are multiplied by exp(0.5), roughly 1.65. The probability change depends on where on the curve the predictor sits.
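A short base R sketch makes the point concrete. `plogis()` is base R's inverse logit; the coefficient value 0.5 and the starting log-odds are illustrative numbers, not from any fitted model:

```r
beta <- 0.5

# Multiplicative change in odds per unit of x:
odds_multiplier <- exp(beta)                 # about 1.65

# The probability change depends on where on the curve x sits.
p_change_tail <- plogis(-2 + beta) - plogis(-2)  # small change in the tail
p_change_mid  <- plogis( 0 + beta) - plogis( 0)  # largest change near p = 0.5

round(c(odds_multiplier, p_change_tail, p_change_mid), 3)
```

The same 0.5 shift in log-odds moves the probability by much more near p = 0.5 than in the tails, which is exactly why a coefficient cannot be read as a constant probability change.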
16.3 Fitting with glm
R fits the model with glm and family = binomial. The summary prints a coefficient table with z-statistics (not t), the null and residual deviance, and the AIC.
The summary reports each coefficient with its standard error and z-statistic, the null deviance (how well an intercept-only model fits), the residual deviance (how well the fitted model fits), the degrees of freedom, and the AIC. A large drop from null to residual deviance is the logistic analogue of a high R-squared.
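A minimal end-to-end sketch, on simulated data (the variable names `tenure`, `spend`, and `churn` echo the running example but the numbers here are made up):

```r
set.seed(42)
n <- 500
tenure <- rpois(n, 24)                 # months with the company (simulated)
spend  <- rnorm(n, 50, 15)             # monthly spend (simulated)

# True process: longer tenure lowers churn odds, higher spend raises them slightly.
p     <- plogis(1 - 0.08 * tenure + 0.01 * spend)
churn <- rbinom(n, 1, p)

fit <- glm(churn ~ tenure + spend, family = binomial)
summary(fit)                           # coefficients with z-statistics, deviances, AIC

deviance_drop <- fit$null.deviance - fit$deviance
deviance_drop                          # the logistic analogue of explained variation
</imports> -->
```

The deviance drop is always non-negative when the null model is nested in the fitted one, so the question is whether the drop is large relative to the degrees of freedom spent.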
16.4 Log-Odds and Odds Ratios
Raw coefficients are on the log-odds scale. Exponentiating a coefficient gives the odds ratio: the multiplicative change in odds per unit of the predictor. Odds ratios are easier to report to a business audience than log-odds.
An odds ratio of 1 means no effect. Above 1 means higher odds of the positive outcome per unit of the predictor; below 1 means lower odds. For tenure the odds ratio is below 1, which matches the intuition that longer-tenured customers are less likely to churn.
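Exponentiating is a one-liner. The sketch below refits a simulated model and uses `confint.default()`, which gives Wald intervals from base R (profile-likelihood intervals via `confint()` are also common but route through the MASS package):

```r
set.seed(42)
n <- 500
tenure <- rpois(n, 24)                          # simulated predictor
churn  <- rbinom(n, 1, plogis(1 - 0.08 * tenure))
fit <- glm(churn ~ tenure, family = binomial)

# Exponentiate coefficients and Wald confidence limits to get odds ratios.
exp(cbind(OR = coef(fit), confint.default(fit)))
# The tenure row should come out below 1 here: each extra month
# multiplies the churn odds down.
```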
16.5 Deviance, AIC, and Pseudo R-Squared
Logistic regression has no direct R-squared. The closest analogues are the drop from null to residual deviance and pseudo R-squared measures such as McFadden’s (McFadden 1974).
Deviance drop is the absolute improvement over the null. AIC is the comparable criterion when competing models have different predictors. McFadden’s R-squared puts the deviance drop on a zero-to-one scale for interpretation; McFadden’s own guideline is that 0.2 to 0.4 already indicates an excellent fit, well below the thresholds that a linear-regression R-squared would suggest.
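McFadden's measure needs no extra package: it is one minus the ratio of the fitted to the null deviance (equivalently, of the log-likelihoods). A sketch on simulated data:

```r
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

# McFadden's pseudo R-squared: 1 - D_model / D_null.
mcfadden <- 1 - fit$deviance / fit$null.deviance
round(mcfadden, 3)
```

By McFadden's own guideline a value in the 0.2 to 0.4 range already indicates an excellent fit, so resist reading this number against linear-regression R-squared habits.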
16.6 Categorical Predictors and Interactions
Factors expand into dummy variables exactly as in Chapter 15. Interactions use the same x1 * x2 syntax. The interpretation changes only in that everything sits on the log-odds scale.
An interaction coefficient says how the slope of tenure differs between tier levels on the log-odds scale. Exponentiating gives the ratio of odds ratios across groups. Chapter 18 treats this idea as moderation in depth.
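A sketch of the interaction syntax on simulated data (the `tier` factor and its effect sizes are invented for illustration):

```r
set.seed(7)
n <- 600
tenure <- rpois(n, 24)
tier   <- factor(sample(c("basic", "premium"), n, replace = TRUE))
eta    <- 1 - 0.10 * tenure + 0.5 * (tier == "premium") +
          0.05 * tenure * (tier == "premium")
churn  <- rbinom(n, 1, plogis(eta))

fit <- glm(churn ~ tenure * tier, family = binomial)
coef(fit)                                 # includes the tenure:tierpremium term

# Exponentiated interaction: the ratio of tenure odds ratios across tiers.
exp(coef(fit)["tenure:tierpremium"])
```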
16.7 Classification and the Confusion Matrix
Once probabilities are in hand, a classification rule turns them into a decision by cutting at a threshold, often 0.5. The resulting confusion matrix tabulates true positives, false positives, true negatives, and false negatives, which feed accuracy, precision, recall, and F1.
The 0.5 default only makes sense when false positives and false negatives are equally costly. Raise the threshold to reduce false positives, lower it to catch more positives. The right value is set by the cost of each kind of error, not by the model.
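The whole pipeline from probabilities to metrics fits in a few lines of base R. A sketch on simulated data, with the threshold written as an explicit, changeable choice:

```r
set.seed(3)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))
fit   <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

threshold <- 0.5                       # a business-cost decision, not a model fact
pred <- as.integer(p_hat >= threshold)

cm <- table(predicted = pred, actual = y)
tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

Re-running with a higher `threshold` trades recall for precision; watching the four metrics move is the quickest way to internalise the trade-off.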
16.8 ROC Curve and AUC
The ROC curve sweeps the threshold from 0 to 1 and plots true-positive rate against false-positive rate. The area under the curve (AUC) summarises classifier quality across all thresholds: 0.5 is chance, 1 is perfect.
AUC is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. Values around 0.7 are acceptable, 0.8 good, above 0.9 excellent in most business settings.
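That pairwise interpretation gives a direct way to compute AUC in base R, via the Mann-Whitney rank statistic, with no ROC package needed. A sketch on simulated data:

```r
set.seed(9)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
p_hat <- predict(glm(y ~ x, family = binomial), type = "response")

# AUC as the proportion of (positive, negative) pairs in which the
# positive case receives the higher predicted probability.
r     <- rank(p_hat)
n_pos <- sum(y == 1)
n_neg <- sum(y == 0)
auc   <- (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
round(auc, 3)      # 0.5 would be chance, 1 perfect
```

In practice a dedicated package (pROC is a common choice) also draws the curve, but the rank formula is the definition itself.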
16.9 Residual Diagnostics
Residuals in logistic regression are less informative row by row than in linear regression because each observation is a zero or one. The standard practice is to work with binned residuals (average residual in each prediction bin) and to check leverage for extreme rows.
A healthy binned residual plot shows residuals scattered around zero across the probability range. Systematic drift at either end suggests a missing predictor or a non-linear effect of an existing one.
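Binned residuals are easy to compute by hand in base R: bin observations by predicted probability and average the raw residuals within each bin. A sketch on simulated data (ten equal-count bins is an arbitrary but common choice):

```r
set.seed(5)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + x))
fit <- glm(y ~ x, family = binomial)

p_hat     <- fitted(fit)
resid_raw <- y - p_hat

# Ten equal-count bins over the predicted probabilities.
bins   <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE)
binned <- tapply(resid_raw, bins, mean)
round(binned, 3)   # should hover near zero with no drift at either end
```

Plotting `binned` against bin midpoints gives the usual diagnostic picture; here the correctly specified model keeps every bin mean close to zero.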
16.10 Variable Selection and Multicollinearity
Stepwise selection and VIF both work on glm the same way they worked on lm in Chapter 15. AIC is the default comparison criterion; BIC is available with k = log(n).
A stepwise routine that drops x2 when x1 is in the model is doing roughly what VIF diagnostics would suggest. Reviewing both keeps the selected model defensible.
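A sketch of the stepwise side on simulated data, where `x2` is built to be nearly collinear with `x1` (variable names are hypothetical; VIF itself would come from the car package, as in Chapter 15):

```r
set.seed(11)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # nearly collinear with x1
x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 + 0.5 * x3))

full     <- glm(y ~ x1 + x2 + x3, family = binomial)
step_fit <- step(full, trace = 0)      # backward stepwise by AIC
formula(step_fit)                      # the redundant predictor should drop out

# For BIC instead of AIC: step(full, trace = 0, k = log(n))
```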
16.11 Predicting New Cases
predict.glm returns log-odds by default; use type = "response" for probabilities. A class label is produced by thresholding the probability.
Report the probability and the decision separately. A stakeholder can then see the model’s confidence and the rule that translated it into an action.
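A sketch of scoring new cases, on simulated training data. The default `type = "link"` returns log-odds; `type = "response"` returns probabilities, and `plogis()` maps one to the other:

```r
set.seed(21)
train   <- data.frame(x = rnorm(200))
train$y <- rbinom(200, 1, plogis(train$x))
fit <- glm(y ~ x, family = binomial, data = train)

new_cases <- data.frame(x = c(-2, 0, 2))
log_odds  <- predict(fit, newdata = new_cases)                   # link scale (default)
probs     <- predict(fit, newdata = new_cases, type = "response")
labels    <- as.integer(probs >= 0.5)   # illustrative threshold

data.frame(new_cases, log_odds, probs, labels)   # probability and decision, side by side
```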
16.12 Reporting a Logistic Regression
A logistic-regression report reuses the six-section skeleton from Chapters 11 to 15 and swaps the coefficient table for one that includes both raw coefficients and odds ratios.
1. Question and binary response
2. Predictor set and sample
3. Diagnostic view and binned residuals
4. Coefficient table with odds ratios and confidence intervals
5. Classification metrics at the chosen threshold plus AUC
6. Business decision with threshold rationale

The skeleton lines up directly with Chapter 15's regression report, which makes predictive and probabilistic studies comparable.
16.13 Summary
| Concept | Description |
|---|---|
| Model and Fit | |
| Logit model form | log(p/(1-p)) = linear combination of predictors |
| glm family binomial | Fit with glm(y ~ ..., family = binomial) |
| Deviance, AIC, pseudo R-squared | Deviance drop, Akaike criterion, McFadden's R-squared on a zero-to-one scale |
| Coefficient Interpretation | |
| Log-odds coefficient | Raw beta is the log-odds change per unit of the predictor |
| Odds ratio with CI | exp(beta) is the multiplicative change in odds; always report a CI |
| Categorical and interaction | Factor dummies and x1 * x2 work exactly as in lm, on the log-odds scale |
| Classification Performance | |
| Threshold choice | Business-cost decision, not a model parameter |
| Confusion matrix | Two-by-two tabulation of predicted versus actual class labels |
| ROC curve | Sweeps threshold and plots true-positive rate against false-positive rate |
| AUC | Area under the ROC curve; threshold-free summary of separability |
| Diagnostics and Selection | |
| Pearson and deviance residuals | Diagnostic residuals for GLMs; individually noisy because y is binary |
| Binned residual plot and leverage | Average residual in each prediction bin, plus leverage checks for extreme rows; the readable diagnostic |
| Stepwise AIC with VIF | Same tools as lm: stepwise selection plus variance inflation factors |
| Prediction and Reporting | |
| predict type response | Returns the predicted probability of the positive outcome |
| Six-section logistic report | Question, response, diagnostic, coefficients, classification, decision |