```mermaid
flowchart LR
  A[Frame problem] --> B[Train and test split]
  B --> C[Prepare numeric]
  C --> D[Prepare categorical]
  D --> E[Fit candidates]
  E --> F[k-fold CV]
  F --> G[Diagnostics]
  G --> H[Holdout evaluation]
  H --> I[Package and hand over]
```
19 Implementation of Methods with R
19.1 The Implementation Workflow
Chapters 15 to 18 each introduced one method. This chapter folds them into a single end-to-end pipeline in R: frame the problem, split the data, prepare the predictors, fit and compare candidate models, validate with resampling, check diagnostics, evaluate on a holdout, and package the analysis as a reusable function. The focus here is on the glue between steps, not the methods themselves.
The same eight boxes apply to a regression, a logistic model, or a moderated model. Making the pipeline explicit (rather than improvising from one project to the next) lets results be reviewed, reproduced, and audited without rereading code from scratch.
19.2 Problem Framing
The first decision is not a coding decision: it is whether the response is continuous (Chapter 15), binary (Chapter 16), measured through a mechanism (Chapter 17), or contingent on a boundary variable (Chapter 18). The data-type of Y, the theory behind the predictors, and the question the business wants answered together fix the method before a single line of R runs.
Continuous Y and no mechanism claim: linear regression. Binary Y: logistic regression. Continuous Y with a hypothesised intermediate variable on the causal path: mediation. Continuous or binary Y with a hypothesised condition under which the effect holds: moderation. Chapters 15-18 give the tools; the business question picks the chapter.
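The decision rule above can be sketched as a small dispatcher. This is a minimal illustration, not from the chapter: `choose_fit` and its arguments are hypothetical names, and mediation and moderation (Chapters 17-18) need their own model setup, so only the first two branches are shown.

```r
# Hypothetical dispatcher: the response type picks the fitting call.
choose_fit <- function(data, formula, type = c("continuous", "binary")) {
  type <- match.arg(type)
  switch(type,
         continuous = lm(formula, data = data),                     # Chapter 15
         binary     = glm(formula, data = data, family = binomial)) # Chapter 16
}

# Usage on placeholder data
set.seed(42)
d <- data.frame(x = runif(50))
d$y <- 1 + 2 * d$x + rnorm(50)   # continuous response
d$z <- rbinom(50, 1, 0.5)        # binary response
fit_cont <- choose_fit(d, y ~ x, "continuous")
fit_bin  <- choose_fit(d, z ~ x, "binary")
```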
19.3 Train and Test Split
A holdout is non-negotiable in predictive modelling (Hastie, Tibshirani and Friedman 2009). The standard split reserves 70 to 80 percent for training and the remainder for an honest evaluation. Random indexing in base R is enough; no extra package is required.
The same seed plus the same data reproduces the split exactly. Record the seed in the report so a reviewer can reconstruct the training and test sets without guessing.
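A minimal sketch of the split in base R, on a placeholder data frame (`df` and the 70 percent fraction are illustrative choices):

```r
set.seed(42)                                  # record this seed in the report
df <- data.frame(x = runif(100))              # placeholder data
df$y <- 2 + 3 * df$x + rnorm(100)

train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))  # base R indexing
train <- df[train_idx, ]                      # 70 percent for training
test  <- df[-train_idx, ]                     # remainder for honest evaluation
```

Rerunning this with the same seed and the same data reproduces the split exactly, which is the property the report relies on.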
19.4 Preparing Numeric Predictors
Scaling is almost always worth the small cost: it makes coefficients comparable in magnitude, stabilises numerical optimisation, and removes scale-induced artefacts in regularised or distance-based methods. Critically, the scaling parameters must be learned on the training data and applied to the test set unchanged.
Fitting the scaler on the combined data before the split leaks information from the test set into training. The means and standard deviations must come from the training split only.
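One way to keep the scaler on the right side of the split, sketched on placeholder data (variable names are illustrative):

```r
set.seed(42)
df <- data.frame(x = runif(100, 0, 1000))     # placeholder, deliberately large scale
train <- df[1:70, , drop = FALSE]
test  <- df[71:100, , drop = FALSE]

mu   <- mean(train$x)                         # parameters from training rows only
sdev <- sd(train$x)
train$x_scaled <- (train$x - mu) / sdev
test$x_scaled  <- (test$x  - mu) / sdev       # reuse training parameters unchanged
```

The test column is not re-centred on its own mean; it is transformed with the training parameters, so no test-set information reaches the fit.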
19.5 Preparing Categorical Predictors
Factor handling has two recurring pitfalls: a reference level chosen alphabetically that does not match the business default, and rare levels with too few rows to estimate reliably. Both are fixed with relevel() and explicit pooling before any model is fit.
All subsequent dummy coefficients are differences from the reference. Choosing a business-meaningful reference (the default tier, the control arm, the baseline channel) makes the coefficient table far easier to read.
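Both fixes in one sketch, on a hypothetical `tier` variable (the level names and the pooling threshold of 5 rows are illustrative):

```r
tier <- factor(c(rep("basic", 50), rep("premium", 30),
                 rep("trial", 3),  rep("beta", 2)))

# Pool levels with too few rows into "Other" before any model is fit.
counts <- table(tier)
rare   <- names(counts)[counts < 5]
levels(tier)[levels(tier) %in% rare] <- "Other"   # merges rare levels

# Make the business default the reference level.
tier <- relevel(tier, ref = "basic")
```

After this, every dummy coefficient reads as a difference from `basic`, and no coefficient rests on a handful of rows.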
19.6 Candidate Models and AIC Comparison
Rather than committing to a single functional form at the start, fit two or three defensible candidates and compare them on AIC (Akaike 1974). AIC balances fit against complexity and is directly comparable across nested and non-nested candidates with the same response.
A reduction in AIC smaller than roughly 2 is weak evidence. A reduction of 10 or more is substantial. Pair AIC with a test-set metric before choosing a winner.
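A comparison of two candidates on simulated data where the quadratic term genuinely belongs in the model (the data-generating line is an assumption of this sketch):

```r
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + 4 * x^2 + rnorm(100, sd = 0.3)   # true curve is quadratic
train <- data.frame(x = x, y = y)

m1 <- lm(y ~ x, data = train)             # linear candidate
m2 <- lm(y ~ x + I(x^2), data = train)    # quadratic candidate
AIC(m1, m2)                               # lower is better, subject to the gap rule
```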
19.7 k-Fold Cross-Validation
AIC and adjusted R-squared are training-set criteria. To estimate out-of-sample error before touching the holdout, partition the training data into k folds, fit on k minus one folds, and predict on the held-out fold (Stone 1974). Averaging the k out-of-fold errors gives a stable CV estimate.
Five-fold CV is fast enough to rerun during exploration and stable enough to compare nearby candidates. Ten-fold CV reduces variance at roughly double the cost; leave-one-out is rarely worth the compute for typical business datasets.
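The loop is short enough to write by hand in base R. A minimal 5-fold sketch on placeholder data, using RMSE as the fold-level error:

```r
set.seed(42)
train <- data.frame(x = runif(100))
train$y <- 2 + 3 * train$x + rnorm(100, sd = 0.5)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))   # random fold labels

cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = train[folds != i, ])       # fit on k-1 folds
  pred <- predict(fit, newdata = train[folds == i, ]) # predict held-out fold
  sqrt(mean((train$y[folds == i] - pred)^2))          # out-of-fold RMSE
})
mean(cv_rmse)                                         # averaged CV estimate
```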
19.8 Diagnostics of the Chosen Model
Once a candidate is picked, the four-panel diagnostic plot from Chapter 15 is the first check: residuals versus fitted for linearity, Q-Q for normality of residuals, scale-location for constant variance, and residuals versus leverage for influential points. For a glm, swap in binned residuals (Chapter 16).
A clean residual-versus-fitted panel does not excuse a heavy-tailed Q-Q. Pattern in any one panel is a reason to reconsider the functional form, add a predictor, or switch to a more suitable GLM.
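For an `lm` fit, all four panels come from a single call (the toy fit below is a placeholder):

```r
set.seed(42)
d   <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100, sd = 0.4)
fit <- lm(y ~ x, data = d)

op <- par(mfrow = c(2, 2))   # 2x2 grid for the four panels
plot(fit)                    # residuals vs fitted, Q-Q, scale-location, leverage
par(op)                      # restore previous plotting settings
```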
19.9 Holdout Evaluation
The holdout is the final, honest measure of how the model will perform on data it has not seen. Predict on the test split once, compute the appropriate metric (RMSE for continuous Y, accuracy or AUC for binary Y), and report the number alongside the training and CV estimates for context.
A train RMSE far below the test RMSE suggests overfitting: the model has memorised training noise. The fix is fewer predictors, regularisation, or more data, not a nicer diagnostic plot.
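A sketch of the single holdout evaluation for a continuous response, reporting the train and test RMSE side by side (data and split sizes are illustrative):

```r
set.seed(42)
df <- data.frame(x = runif(150))
df$y  <- 2 + 3 * df$x + rnorm(150, sd = 0.5)
train <- df[1:105, ]                         # 70/30 split in this sketch
test  <- df[106:150, ]

fit  <- lm(y ~ x, data = train)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

train_rmse <- rmse(train$y, fitted(fit))
test_rmse  <- rmse(test$y, predict(fit, newdata = test))  # predict once, report
c(train = train_rmse, test = test_rmse)      # a large gap flags overfitting
```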
19.10 Packaging as a Reusable Function
Once the pipeline is working, wrap it in a function. The function takes raw data and a formula, performs the split, prepares the predictors, fits the model, and returns the fit along with its metrics. It turns a one-off notebook into an audited routine that can be rerun on next month's extract.
Once the workflow is a function, it can be called with a different dataset, a different formula, or a different seed without copy-and-paste. Bugs get fixed in one place. Argument defaults document the choices the team standardised on.
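A minimal sketch of such a wrapper. The name `run_pipeline`, the arguments, and the plain `lm` fit (standing in for the full preparation steps) are all assumptions of this example:

```r
run_pipeline <- function(data, formula, train_frac = 0.7, seed = 42) {
  set.seed(seed)                                   # recorded, reproducible seed
  idx   <- sample(nrow(data), floor(train_frac * nrow(data)))
  train <- data[idx, ]
  test  <- data[-idx, ]

  fit  <- lm(formula, data = train)
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  yvar <- all.vars(formula)[1]                     # response name from formula

  list(fit        = fit,
       train_rmse = rmse(train[[yvar]], fitted(fit)),
       test_rmse  = rmse(test[[yvar]],  predict(fit, newdata = test)),
       seed       = seed)
}

# Rerun on next month's extract by changing only the arguments.
df  <- data.frame(x = runif(120))
df$y <- 1 + 2 * df$x + rnorm(120, sd = 0.3)
res <- run_pipeline(df, y ~ x)
```

The defaults (`train_frac = 0.7`, `seed = 42`) document the choices the team standardised on, and the returned seed lets a reviewer reconstruct the split.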
19.11 Reporting and Handover
The deliverable is not just the model object; it is the model plus everything a reviewer or a downstream team needs to run it again.
(1) Data source and extraction date, (2) response variable and predictor definitions, (3) preparation steps including scaler parameters and pooled levels, (4) final formula and coefficient table, (5) CV and holdout metrics, (6) diagnostic plots, (7) the fitting function and its seed, (8) known caveats and recommended refresh cadence. A short report that hits these eight points makes the model auditable and re-runnable by someone who was not in the room.
19.12 Summary
| Concept | Description |
|---|---|
| Framing | |
| Response-type decision | Continuous, binary, mediated, or moderated; the response type fixes the method |
| Method selector | lm for continuous, glm for binary, mediation for mechanism, moderation for boundary |
| Preparation | |
| Train and test split | 70/30 or 80/20 random split with a recorded seed |
| Numeric scaling on train only | Fit centre and scale on the training split; apply to test unchanged |
| Factor releveling | Choose a business-meaningful reference level with relevel() |
| Rare-level pooling | Collapse low-share levels into Other before fitting |
| Modelling | |
| AIC comparison | Lower AIC is better; a gap below 2 is weak, above 10 is strong |
| k-fold cross-validation | Fit on k-1 folds, predict on the held-out fold, average the errors |
| Evaluation | |
| Four-panel diagnostics | Residual vs fitted, Q-Q, scale-location, leverage for lm |
| Holdout metric | RMSE for continuous Y; AUC or accuracy for binary Y |
| Train-to-test gap | Large gap signals overfitting; reduce predictors or regularise |
| Delivery | |
| Wrapping the pipeline as a function | Reusable function turns a notebook into an audited routine |
| Handover checklist | Data source, definitions, prep, formula, metrics, diagnostics, function, caveats |
| Reproducibility seed | Record the seed so the split and the fit can be reproduced exactly |