22  Implementation of Advanced Methods with R

22.1 The Unsupervised Workflow

Chapters 20 and 21 introduced factor analysis and cluster analysis. This chapter folds both into a single end-to-end pipeline in R: frame the segmentation question, prepare the data, check suitability, fit the factor solution, compute scores and reliability, cluster on either the items or the factor scores, profile the clusters, and package the analysis as a reusable function. As in Chapter 19, the focus is on the glue between steps rather than the methods themselves.

```mermaid
flowchart LR
    A[Frame question] --> B[Prepare data]
    B --> C[Suitability checks]
    C --> D[EFA fit]
    D --> E[Scores and reliability]
    E --> F[Clustering]
    F --> G[Profile segments]
    G --> H[Report and hand over]
```

Note: Same pipeline, different outputs

The same eight boxes apply whether the deliverable is a factor solution (labels on constructs), a segment scheme (labels on observations), or the combination (clusters built on factor scores). Making the pipeline explicit lets a segmentation project be reviewed, reproduced, and refreshed without rereading code from scratch.

22.2 Problem Framing

The first decision is which kind of label the business wants. If the deliverable is a smaller set of constructs that summarise a battery of items (for example, distilling 20 survey questions into four drivers), the method is exploratory factor analysis (Chapter 20). If the deliverable is a set of groups of observations that behave similarly (customers, stores, products), the method is clustering (Chapter 21). When both are needed, the usual order is EFA first to get clean dimensions, then clustering on the factor scores to get clean segments on those dimensions.

Tip: A simple selector

Items need a smaller set of labels: EFA. Rows need a smaller set of labels: clustering. Items and rows both need labels and the items are noisy: EFA followed by clustering on the factor scores (see §22.9).

22.3 Data Preparation

Before any unsupervised fit, drop rows with missing values on the variables that will go into the model, z-scale numeric predictors so no variable dominates the distance or the correlation matrix, pool rare categories into Other, and guard against zero-variance columns that break both cor() and dist().
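A minimal sketch of these four steps in base R. The data frame `df`, its column names, and the 5% pooling threshold are all illustrative choices, not fixed conventions:

```r
set.seed(42)
# Illustrative input: numeric items, a near-constant column, a categorical column
df <- data.frame(
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200),
  nearconst = rep(1, 200),
  segment = sample(c("A", "B", "rare1", "rare2"), 200,
                   replace = TRUE, prob = c(0.55, 0.40, 0.03, 0.02))
)

df <- df[complete.cases(df), ]                        # completeness filter

num <- df[sapply(df, is.numeric)]                     # numeric block
num <- num[, sapply(num, sd) > 1e-8, drop = FALSE]    # zero-variance guard
num <- as.data.frame(scale(num))                      # z-scale to mean 0, sd 1

shares <- prop.table(table(df$segment))               # rare-level pooling
rare   <- names(shares)[shares < 0.05]
df$segment <- factor(ifelse(df$segment %in% rare, "Other",
                            as.character(df$segment)))
```

The zero-variance guard runs before `scale()` on purpose: scaling a constant column produces `NaN`s that would silently poison every later correlation and distance.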

After preparation the numeric columns are z-scaled, the near-constant column has been dropped from num, and the rare levels of segment have been pooled into a single Other bucket that is safe to use downstream.

22.4 Suitability Checks

Chapter 20 introduced KMO and Bartlett’s test to confirm that a correlation matrix has enough shared variance for factor analysis. Clustering has no single equivalent, but the spread of pairwise distances is a fast informal check: if most distances are similar, there is no geometric structure to recover.
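A sketch of both checks, assuming the psych package; the six-item two-driver simulation stands in for the prepared numeric data so the snippet runs on its own:

```r
library(psych)                           # KMO() and cortest.bartlett()

set.seed(1)
# Simulated stand-in for the prepared data: six items, two latent drivers
lat <- matrix(rnorm(300 * 2), ncol = 2)
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))
names(num) <- paste0("item", 1:6)

R <- cor(num)
KMO(R)$MSA                                   # overall MSA; above 0.70 clears EFA
cortest.bartlett(R, n = nrow(num))$p.value   # small p: enough shared variance

d <- as.numeric(dist(num))                   # informal clustering check
hist(d, main = "Pairwise distance spread")
sd(d) / mean(d)                              # wider relative spread suggests structure
```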

A KMO above 0.70 and a significant Bartlett test clear the EFA side; a wide, non-degenerate distance distribution clears the clustering side. A narrow, unimodal distance distribution warns that no clusters are likely to stand out.

22.5 EFA Pipeline

The standard EFA pipeline is: eigenvalues-plus-scree to decide the number of factors, principal-axis extraction with a rotation, and inspection of the loading matrix against a salience cutoff. For this chapter the rotation is Varimax because the factors were built to be uncorrelated; in real projects Oblimin is used whenever a factor correlation of 0.30 or more is plausible.
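A sketch of that pipeline with the psych package, again on simulated two-factor items (all names illustrative):

```r
library(psych)

set.seed(1)
lat <- matrix(rnorm(300 * 2), ncol = 2)    # two orthogonal latent factors
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))
names(num) <- paste0("item", 1:6)

eigen(cor(num))$values                     # count eigenvalues above 1
scree(num)                                 # scree plot for the same decision

efa <- fa(num, nfactors = 2, fm = "pa", rotate = "varimax")  # principal axis + Varimax
print(efa$loadings, cutoff = 0.40)         # blank sub-0.40 loadings
```

Swapping `rotate = "varimax"` for `rotate = "oblimin"` is the only change needed when correlated factors are plausible.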

Two eigenvalues exceed 1, the two-factor solution is fit, and setting sub-0.40 loadings to blank produces a clean simple-structure table that is ready for labelling. Items 1-3 define factor one; items 4-6 define factor two.

22.6 Scoring and Reliability

Each retained factor needs a score per observation and a reliability estimate so the construct can be used downstream. The regression method computes factor scores directly from the solution; unit-weighted composite scores (row means of items in the scale) are a more transparent alternative that usually correlates above 0.95 with the regression scores.
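A sketch of both score versions and the reliability check, assuming the psych package and the same simulated items as above (scale membership `item1`-`item3` and `item4`-`item6` is part of the simulation, not a rule):

```r
library(psych)

set.seed(1)
lat <- matrix(rnorm(300 * 2), ncol = 2)
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))
names(num) <- paste0("item", 1:6)

efa <- fa(num, nfactors = 2, fm = "pa", rotate = "varimax")

reg  <- efa$scores                          # regression-method factor scores
comp <- cbind(rowMeans(num[, 1:3]),         # unit-weighted composites per scale
              rowMeans(num[, 4:6]))
abs(cor(reg, comp))    # one entry per factor is typically > 0.95
                       # (factor order and sign can flip between runs)

alpha(num[, 1:3])$total$raw_alpha           # Cronbach's alpha, scale 1
alpha(num[, 4:6])$total$raw_alpha           # Cronbach's alpha, scale 2
```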

Regression and unit-weighted scores agree above 0.95 on each factor, and Cronbach's alpha comfortably exceeds Nunnally's 0.70 threshold, so either score version is defensible for the next step.

22.7 Clustering Pipeline

With prepared numeric features the clustering pipeline is: pick \(k\) by an explicit rule (elbow in within-cluster SS and maximum average silhouette, Chapter 21), fit k-means with several random starts, and carry forward the cluster assignments.
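A sketch of the k-selection loop and the final fit, using base `kmeans()` and the cluster package; the three planted groups are simulated so the snippet runs on its own:

```r
library(cluster)                       # silhouette()

set.seed(42)
# Simulated features with three planted groups (illustrative)
num <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 4), ncol = 2),
                   cbind(rnorm(50, 0), rnorm(50, 8))))

ks  <- 2:6
wss <- sapply(ks, function(k) kmeans(num, k, nstart = 25)$tot.withinss)
sil <- sapply(ks, function(k) {
  km <- kmeans(num, k, nstart = 25)
  mean(silhouette(km$cluster, dist(num))[, "sil_width"])
})

k_best <- ks[which.max(sil)]           # silhouette maximum; check WSS elbow agrees
fit    <- kmeans(num, centers = k_best, nstart = 25)
table(fit$cluster)
```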

The silhouette rule selects \(k\) at its maximum, the WSS curve confirms the elbow at the same \(k\), and the final kmeans fit uses nstart = 25 to protect against a poor random start.

22.8 Cluster Profiling

Cluster IDs are not yet a deliverable. Profiling means computing the cluster means on the variables that went into the distance, the cluster size as a share of the total, and a short business label per cluster. A one-table profile is the minimum stakeholders need to accept or push back on the segmentation.
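One way to build that table with base R aggregation; the two-cluster toy data and the empty `label` column are placeholders for the real features and the analyst's naming:

```r
set.seed(42)
# Illustrative features and a k-means fit to profile
num <- data.frame(x = c(rnorm(60, 0), rnorm(40, 4)),
                  y = c(rnorm(60, 0), rnorm(40, -3)))
fit <- kmeans(scale(num), centers = 2, nstart = 25)

profile <- aggregate(num, by = list(cluster = fit$cluster), FUN = mean)
profile$n     <- as.numeric(table(fit$cluster))
profile$share <- round(profile$n / nrow(num), 2)
profile$label <- c("", "")             # business labels, filled in by the analyst
profile
```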

The profile table is the single artefact the business sees first. Cluster means on the prepared variables, counts, and shares belong on one row; a label and a short narrative then turn “cluster 2” into “north-east high-x/low-y”.

22.9 Combining Factors and Clusters

When the raw items are noisy or partly redundant, clustering on factor scores is usually cleaner than clustering on the items directly. The factor solution absorbs the correlated structure into a few orthogonal axes; clustering on those axes produces segments that are easier to label and less sensitive to a single item.
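The combination is a short pipe from one fit to the other; this sketch assumes the psych package and reuses the simulated two-factor items (the choice of three clusters is illustrative):

```r
library(psych)

set.seed(1)
lat <- matrix(rnorm(300 * 2), ncol = 2)
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))

scores <- as.data.frame(fa(num, nfactors = 2, fm = "pa",
                           rotate = "varimax")$scores)
km <- kmeans(scores, centers = 3, nstart = 25)

aggregate(scores, by = list(cluster = km$cluster), FUN = mean)  # e.g. high-F1/low-F2
```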

The cluster means on the factor scores read as a business story (for example, “high on factor one, low on factor two”) without the analyst having to eyeball six item-level numbers per cluster.

22.10 Packaging as a Reusable Function

The prep, EFA, and clustering code from the previous sections can be folded into a single function that takes a data frame and the number of factors and clusters and returns everything needed for the report: loadings, reliabilities, cluster sizes, and the profile table on factor scores.
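One possible shape for that wrapper, offered as a sketch rather than a fixed design: it assumes the psych and cluster packages, and the 0.40 salience cutoff, `nstart = 25`, and default seed are choices carried over from the earlier sections:

```r
library(psych)     # fa(), alpha()

fit_segments <- function(df, n_factors, k, seed = 42) {
  set.seed(seed)
  # Preparation: complete cases, numeric block, zero-variance guard, z-scale
  df  <- df[complete.cases(df), ]
  num <- df[sapply(df, is.numeric)]
  num <- num[, sapply(num, sd) > 1e-8, drop = FALSE]
  num <- as.data.frame(scale(num))

  # EFA, factor scores, and per-scale reliability over salient items
  efa    <- fa(num, nfactors = n_factors, fm = "pa", rotate = "varimax")
  scores <- as.data.frame(unclass(efa$scores))
  load   <- unclass(efa$loadings)
  alphas <- sapply(seq_len(n_factors), function(j) {
    items <- rownames(load)[abs(load[, j]) >= 0.40]
    if (length(items) < 2) return(NA_real_)
    alpha(num[, items])$total$raw_alpha
  })

  # Clustering on the factor scores, then the profile table
  km <- kmeans(scores, centers = k, nstart = 25)
  profile <- aggregate(scores, by = list(cluster = km$cluster), FUN = mean)
  profile$n     <- as.numeric(table(km$cluster))
  profile$share <- profile$n / nrow(scores)

  list(loadings = efa$loadings, alphas = alphas,
       sizes = table(km$cluster), profile = profile,
       cluster = km$cluster, scores = scores)
}
```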

Wrapping the pipeline this way turns a scripted notebook into an audited routine: the seed is fixed, the steps run in the same order every time, and the output is a single list that the report pulls from.

22.11 Reporting and Handover

A complete unsupervised report states the population and sample size, the variables used (and how they were scaled, with rare levels pooled), the suitability checks (KMO, Bartlett, distance spread), the EFA extraction and rotation, the loading table with a salience cutoff, per-scale reliability (alpha and, where loadings differ, omega), the rule used to pick \(k\), the clustering algorithm with its settings, and the cluster profile with business labels. Stakeholders expect at least one sensitivity check: does the solution survive a different seed, a different rotation, or \(k \pm 1\)? Fabrigar et al. (1999) and Kaufman and Rousseeuw (1990) are the textbook references reviewers expect; Hastie, Tibshirani and Friedman (2009) covers the broader validation landscape.

Tip: Handover checklist
  1. Data source and extraction date
  2. Variables used and preparation steps, including pooled levels
  3. Suitability evidence (KMO, Bartlett, distance spread)
  4. EFA extraction, rotation, loadings, communalities, reliability
  5. Clustering algorithm, k rule, and assignments
  6. Profile table with labels
  7. The fit_segments() function and seed
  8. Known caveats and recommended refresh cadence

22.12 Summary

Summary of the unsupervised implementation steps introduced in this chapter

| Concept | Description |
|---|---|
| **Framing** | |
| Unsupervised deliverable | Labels on items (EFA), on rows (clusters), or both |
| Method selector: items vs rows | EFA distils items; clustering groups rows; combine when both are noisy |
| **Preparation** | |
| Completeness filter | Drop rows with missing values on the modelling variables |
| Z-scaling numeric features | Scale numerics to unit spread before distance or correlation |
| Rare-level pooling | Collapse categories below a share threshold into Other |
| Zero-variance guard | Drop columns with sd near zero before cor() or dist() |
| **EFA** | |
| Suitability checks (KMO, Bartlett, distance spread) | Confirm enough shared variance and distance spread to segment |
| Factor extraction and rotation | Principal axis factoring with Varimax or Oblimin rotation |
| Reliability (alpha and omega) | Report alpha per scale; use omega when loadings are unequal |
| **Clustering** | |
| k selection via elbow and silhouette | Pick k where WSS bends and average silhouette peaks |
| Cluster profile table | Cluster means, sizes, shares, and business labels |
| **Delivery** | |
| Clustering on factor scores | Run k-means on factor scores, not raw items, for cleaner segments |
| Reusable fit function | Wrap prep, EFA, scoring, and clustering into one function with a seed |
| Handover checklist | Source, variables, suitability, fit, scores, cluster, profile, caveats |