```mermaid
flowchart LR
A[Frame question] --> B[Prepare data]
B --> C[Suitability checks]
C --> D[EFA fit]
D --> E[Scores and reliability]
E --> F[Clustering]
F --> G[Profile segments]
G --> H[Report and hand over]
```
22 Implementation of Advanced Methods with R
22.1 The Unsupervised Workflow
Chapters 20 and 21 introduced factor analysis and cluster analysis. This chapter folds both into a single end-to-end pipeline in R: frame the segmentation question, prepare the data, check suitability, fit the factor solution, compute scores and reliability, cluster on either the items or the factor scores, profile the clusters, and package the analysis as a reusable function. As in Chapter 19, the focus is on the glue between steps rather than the methods themselves.
The same eight boxes apply whether the deliverable is a factor solution (labels on constructs), a segment scheme (labels on observations), or the combination (clusters built on factor scores). Making the pipeline explicit lets a segmentation project be reviewed, reproduced, and refreshed without rereading code from scratch.
22.2 Problem Framing
The first decision is which kind of label the business wants. If the deliverable is a smaller set of constructs that summarise a battery of items (for example, distilling 20 survey questions into four drivers), the method is exploratory factor analysis (Chapter 20). If the deliverable is a set of groups of observations that behave similarly (customers, stores, products), the method is clustering (Chapter 21). When both are needed, the usual order is EFA first to get clean dimensions, then clustering on the factor scores to get clean segments on those dimensions.
The selector is simple. Items need a smaller set of labels: EFA. Rows need a smaller set of labels: clustering. Both need labels and the items are noisy: EFA followed by clustering on the factor scores (see §22.9).
22.3 Data Preparation
Before any unsupervised fit, drop rows with missing values on the variables that will go into the model, z-scale numeric predictors so no variable dominates the distance or the correlation matrix, pool rare categories into Other, and guard against zero-variance columns that break both cor() and dist().
After preparation the numeric columns are z-scaled, the near-constant column has been dropped from num, and the rare levels of segment have been pooled into a single Other bucket that is safe to use downstream.
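The four guards above can be sketched as one preparation function. This is a minimal base-R sketch; the function name `prep_data`, the argument names, and the tolerance for "near-constant" are illustrative, not the chapter's exact code.

```r
# Sketch of the preparation step. df: raw data frame; num_vars: numeric
# modelling columns; cat_var: a categorical column; min_share: pooling cutoff.
prep_data <- function(df, num_vars, cat_var, min_share = 0.05) {
  # 1. Completeness filter on the modelling variables only
  df <- df[stats::complete.cases(df[, c(num_vars, cat_var)]), ]
  # 2. Zero-variance guard: near-constant columns break cor() and dist()
  dead <- num_vars[vapply(df[num_vars], function(x) stats::sd(x) < 1e-8,
                          logical(1))]
  df <- df[, setdiff(names(df), dead), drop = FALSE]
  num_vars <- setdiff(num_vars, dead)
  # 3. Z-scaling so no variable dominates the distance or correlation matrix
  df[num_vars] <- scale(df[num_vars])
  # 4. Rare-level pooling into a single "Other" bucket
  share <- prop.table(table(df[[cat_var]]))
  pooled <- ifelse(df[[cat_var]] %in% names(share)[share < min_share],
                   "Other", as.character(df[[cat_var]]))
  df[[cat_var]] <- factor(pooled)
  df
}
```

The zero-variance guard runs before scaling on purpose: `scale()` on a constant column produces `NaN`, which would then poison every distance.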
22.4 Suitability Checks
Chapter 20 introduced KMO and Bartlett’s test to confirm that a correlation matrix has enough shared variance for factor analysis. Clustering has no single equivalent, but the spread of pairwise distances is a fast informal check: if most distances are similar, there is no geometric structure to recover.
KMO above 0.70 and a significant Bartlett test clear the EFA side; a wide, non-degenerate distance distribution with visible spread clears the clustering side. A unimodal, narrow distance distribution would warn that no clusters are likely to stand out.
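Both checks fit in one helper. The sketch below is a dependency-free base-R version; in practice `psych::KMO` and `psych::cortest.bartlett` (introduced in Chapter 20) return the same quantities, and the function name `suitability` is illustrative.

```r
# Suitability checks for EFA and clustering on a numeric matrix X.
suitability <- function(X) {
  R <- cor(X); p <- ncol(X); n <- nrow(X)
  # Bartlett's test of sphericity: H0 is that R is the identity matrix
  chi2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  p_bartlett <- pchisq(chi2, df = p * (p - 1) / 2, lower.tail = FALSE)
  # KMO: observed correlations vs partial (anti-image) correlations
  Rinv <- solve(R)
  P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))
  diag(P) <- 0; diag(R) <- 0
  kmo <- sum(R^2) / (sum(R^2) + sum(P^2))
  # Informal clustering check: spread of pairwise distances
  list(KMO = kmo, bartlett_p = p_bartlett,
       dist_spread = summary(as.numeric(dist(scale(X)))))
}
```

A KMO above 0.70 with a tiny Bartlett p-value clears the EFA side; a wide gap between the distance quartiles in `dist_spread` is the informal green light for clustering.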
22.5 EFA Pipeline
The standard EFA pipeline is: eigenvalues-plus-scree to decide the number of factors, principal-axis extraction with a rotation, and inspection of the loading matrix against a salience cutoff. For this chapter the rotation is Varimax because the factors were built to be uncorrelated; in real projects Oblimin is used whenever a factor correlation of 0.30 or more is plausible.
Two eigenvalues exceed 1, the two-factor solution is fit, and setting sub-0.40 loadings to blank produces a clean simple-structure table that is ready for labelling. Items 1-3 define factor one; items 4-6 define factor two.
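On simulated two-factor items the pipeline looks as follows. The data here are a placeholder, and base R's `factanal` (maximum-likelihood extraction) stands in for `psych::fa(fm = "pa")` so the sketch runs without the psych dependency; the Varimax rotation and the 0.40 cutoff match the text.

```r
# Simulated placeholder data: six items driven by two uncorrelated factors
set.seed(42)
f1 <- rnorm(300); f2 <- rnorm(300)
items <- cbind(f1, f1, f1, f2, f2, f2) + matrix(rnorm(1800, sd = 0.6), 300, 6)
colnames(items) <- paste0("item", 1:6)

ev <- eigen(cor(items))$values      # eigenvalues for the Kaiser/scree decision
n_factors <- sum(ev > 1)            # two eigenvalues exceed 1 here
# Base-R stand-in for psych::fa(fm = "pa"): ML extraction, Varimax rotation
efa <- factanal(items, factors = n_factors, rotation = "varimax")
print(efa$loadings, cutoff = 0.40)  # blank sub-salient loadings
```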
22.6 Scoring and Reliability
Each retained factor needs a score per observation and a reliability estimate so the construct can be used downstream. The regression method computes factor scores directly from the solution; unit-weighted composite scores (row means of items in the scale) are a more transparent alternative that usually correlates above 0.95 with the regression scores.
Regression and unit-weighted scores agree above 0.95 on each factor, and Cronbach alpha comfortably exceeds Nunnally’s 0.70 threshold, so either score version is defensible for the next step.
22.7 Clustering Pipeline
With prepared numeric features the clustering pipeline is: pick \(k\) by an explicit rule (elbow in within-cluster SS and maximum average silhouette, Chapter 21), fit k-means with several random starts, and carry forward the cluster assignments.
The silhouette rule selects \(k\) at its maximum, the WSS curve confirms the elbow at the same \(k\), and the final kmeans fit uses nstart = 25 to protect against a poor random start.
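The k rule can be wrapped into one loop. This sketch uses `stats::kmeans` and `cluster::silhouette` (the cluster package ships with R as a recommended package); the function name `choose_k` and its defaults are illustrative.

```r
library(cluster)  # for silhouette()

# k selection: WSS elbow plus average silhouette on prepared features X.
choose_k <- function(X, k_max = 8, seed = 42) {
  set.seed(seed)
  d <- dist(X)
  wss <- rep(NA_real_, k_max)
  sil <- rep(NA_real_, k_max)
  for (k in 2:k_max) {
    km <- kmeans(X, centers = k, nstart = 25)   # guard against bad starts
    wss[k] <- km$tot.withinss                   # for the elbow plot
    sil[k] <- mean(silhouette(km$cluster, d)[, "sil_width"])
  }
  list(best_k = which.max(sil), wss = wss, sil = sil)
}
```

Plotting `wss` and `sil` against k gives the elbow and silhouette views side by side; when the two rules disagree, the silhouette maximum is the tiebreaker used here.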
22.8 Cluster Profiling
Cluster IDs are not yet a deliverable. Profiling means computing the cluster means on the variables that went into the distance, the cluster size as a share of the total, and a short business label per cluster. A one-table profile is the minimum stakeholders need to accept or push back on the segmentation.
The profile table is the single artefact the business sees first. Cluster means on the prepared variables, counts, and shares belong on one row; a label and a short narrative then turn “cluster 2” into “north-east high-x/low-y”.
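Building that one table is a few lines of base R. A sketch, with the helper name `profile_clusters` and the column names `n`, `share`, and `label` as illustrative choices:

```r
# One-table cluster profile: means, counts, shares, and optional labels.
# X: prepared features (data frame); cl: cluster assignments per row.
profile_clusters <- function(X, cl, labels = NULL) {
  prof <- aggregate(as.data.frame(X), by = list(cluster = cl), FUN = mean)
  n <- as.vector(table(cl))
  prof$n <- n
  prof$share <- round(n / length(cl), 3)   # size as share of the total
  if (!is.null(labels)) prof$label <- labels
  prof
}
```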
22.9 Combining Factors and Clusters
When the raw items are noisy or partly redundant, clustering on factor scores is usually cleaner than clustering on the items directly. The factor solution absorbs the correlated structure into a few orthogonal axes; clustering on those axes produces segments that are easier to label and less sensitive to a single item.
The cluster means on the factor scores read as a business story (for example, “high on factor one, low on factor two”) without the analyst having to eyeball six item-level numbers per cluster.
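The combined route in miniature, on the same kind of simulated placeholder items as in §22.5 (`factanal` again standing in for `psych::fa`; the choice of three clusters is purely illustrative):

```r
# EFA first, then k-means on the regression factor scores, not the raw items
set.seed(42)
f1 <- rnorm(300); f2 <- rnorm(300)
items <- cbind(f1, f1, f1, f2, f2, f2) + matrix(rnorm(1800, sd = 0.6), 300, 6)
colnames(items) <- paste0("item", 1:6)

efa <- factanal(items, factors = 2, rotation = "varimax",
                scores = "regression")
seg <- kmeans(efa$scores, centers = 3, nstart = 25)
# Cluster means on the factor scores: two numbers per cluster, not six
prof <- aggregate(as.data.frame(efa$scores),
                  by = list(cluster = seg$cluster), FUN = mean)
prof
```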
22.10 Packaging as a Reusable Function
The prep, EFA, and clustering code from the previous sections can be folded into a single function that takes a data frame and the number of factors and clusters and returns everything needed for the report: loadings, reliabilities, cluster sizes, and the profile table on factor scores.
Wrapping the pipeline this way turns a scripted notebook into an audited routine: the seed is fixed, the steps run in the same order every time, and the output is a single list that the report pulls from.
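A sketch of such a wrapper is below. Only the name fit_segments() comes from the chapter; the signature, the internals (again using base R's `factanal` in place of `psych::fa`), and the names of the returned list elements are illustrative.

```r
# One callable pipeline: prep -> EFA -> scores -> k-means -> profile.
fit_segments <- function(df, n_factors, n_clusters, seed = 42) {
  set.seed(seed)                                  # fixed seed for reproducibility
  # Prep: z-scale the numeric columns
  X <- scale(df[vapply(df, is.numeric, logical(1))])
  # EFA with Varimax rotation and regression factor scores
  efa <- factanal(X, factors = n_factors, rotation = "varimax",
                  scores = "regression")
  # Cluster on the factor scores with several random starts
  km <- kmeans(efa$scores, centers = n_clusters, nstart = 25)
  list(loadings      = efa$loadings,
       cluster_sizes = table(km$cluster),
       assignments   = km$cluster,
       profile       = aggregate(as.data.frame(efa$scores),
                                 by = list(cluster = km$cluster), FUN = mean))
}
```

The single returned list is the point: the report pulls loadings, sizes, and the profile table from one object, and rerunning the function with the same seed reproduces all of them.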
22.11 Reporting and Handover
A complete unsupervised report states the population and sample size, the variables used (and how they were scaled, with rare levels pooled), the suitability checks (KMO, Bartlett, distance spread), the EFA extraction and rotation, the loading table with a salience cutoff, per-scale reliability (alpha and, where loadings differ, omega), the rule used to pick \(k\), the clustering algorithm with its settings, and the cluster profile with business labels. Stakeholders expect at least one sensitivity check: does the solution survive a different seed, a different rotation, or \(k \pm 1\)? Fabrigar et al. (1999) and Kaufman and Rousseeuw (1990) are the textbook references reviewers expect; Hastie, Tibshirani and Friedman (2009) covers the broader validation landscape.
1. Data source and extraction date
2. Variables used and preparation steps, including pooled levels
3. Suitability evidence (KMO, Bartlett, distance spread)
4. EFA extraction, rotation, loadings, communalities, reliability
5. Clustering algorithm, k rule, and assignments
6. Profile table with labels
7. The fit_segments() function and seed
8. Known caveats and recommended refresh cadence
22.12 Summary
| Concept | Description |
|---|---|
| Framing | |
| Unsupervised deliverable | Labels on items (EFA) or on rows (clusters) or both |
| Method selector: items vs rows | EFA distils items; clustering groups rows; combine when both noisy |
| Preparation | |
| Completeness filter | Drop rows with missing values on the modelling variables |
| Z-scaling numeric features | Scale numerics to unit spread before distance or correlation |
| Rare-level pooling | Collapse categories below a share threshold into Other |
| Zero-variance guard | Drop columns with sd near zero before cor() or dist() |
| EFA | |
| Suitability checks (KMO, Bartlett, distance spread) | Confirm enough shared variance and distance spread to segment |
| Factor extraction and rotation | Principal axis factoring with Varimax or Oblimin rotation |
| Reliability (alpha and omega) | Report alpha per scale; use omega when loadings are unequal |
| Clustering | |
| k selection via elbow and silhouette | Pick k where WSS bends and average silhouette peaks |
| Cluster profile table | Cluster means, sizes, shares, and business labels |
| Delivery | |
| Clustering on factor scores | Run k-means on factor scores, not raw items, for cleaner segments |
| Reusable fit function | Wrap prep, EFA, scoring, and clustering into one function with a seed |
| Handover checklist | Source, variables, suitability, fit, scores, cluster, profile, caveats |