22  Implementation of Advanced Methods with R

22.1 The Unsupervised Workflow

Chapters 20 and 21 introduced factor analysis and cluster analysis. This chapter folds both into a single end-to-end pipeline in R: frame the segmentation question, prepare the data, check suitability, fit the factor solution, compute scores and reliability, cluster on either the items or the factor scores, profile the clusters, and package the analysis as a reusable function. As in Chapter 19, the focus is on the glue between steps rather than the methods themselves.

```mermaid
flowchart LR
    A[Frame question] --> B[Prepare data]
    B --> C[Suitability checks]
    C --> D[EFA fit]
    D --> E[Scores and reliability]
    E --> F[Clustering]
    F --> G[Profile segments]
    G --> H[Report and hand over]
```

Note: Same pipeline, different outputs

The same eight boxes apply whether the deliverable is a factor solution (labels on constructs), a segment scheme (labels on observations), or the combination (clusters built on factor scores). Making the pipeline explicit lets a segmentation project be reviewed, reproduced, and refreshed without rereading code from scratch.

22.2 Problem Framing

The first decision is which kind of label the business wants. If the deliverable is a smaller set of constructs that summarise a battery of items (for example, distilling 20 survey questions into four drivers), the method is exploratory factor analysis (Chapter 20). If the deliverable is a set of groups of observations that behave similarly (customers, stores, products), the method is clustering (Chapter 21). When both are needed, the usual order is EFA first to get clean dimensions, then clustering on the factor scores to get clean segments on those dimensions.

Tip: A simple selector

Items need a smaller set of labels: EFA. Rows need a smaller set of labels: clustering. Items and rows both need labels and the items are noisy: EFA followed by clustering on the factor scores (see §22.9).

22.3 Data Preparation

Before any unsupervised fit, drop rows with missing values on the variables that will go into the model, z-scale numeric predictors so no variable dominates the distance or the correlation matrix, pool rare categories into Other, and guard against zero-variance columns that break both cor() and dist().
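A minimal sketch of these four steps in base R. The data frame `df`, its column names, and the 5% pooling threshold are all illustrative choices, not fixed conventions:

```r
set.seed(42)
# Illustrative input: numeric items, a near-constant column, a categorical column
df <- data.frame(
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200),
  nearconst = rep(1, 200),
  segment = sample(c("A", "B", "rare1", "rare2"), 200,
                   replace = TRUE, prob = c(0.55, 0.40, 0.03, 0.02))
)

df <- df[complete.cases(df), ]                        # completeness filter

num <- df[sapply(df, is.numeric)]                     # numeric block
num <- num[, sapply(num, sd) > 1e-8, drop = FALSE]    # zero-variance guard
num <- as.data.frame(scale(num))                      # z-scale to mean 0, sd 1

shares <- prop.table(table(df$segment))               # rare-level pooling
rare   <- names(shares)[shares < 0.05]
df$segment <- factor(ifelse(df$segment %in% rare, "Other",
                            as.character(df$segment)))
```

The zero-variance guard runs before `scale()` on purpose: scaling a constant column produces `NaN`s that would silently poison every later correlation and distance.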

After preparation the numeric columns are z-scaled, the near-constant column has been dropped from num, and the rare levels of segment have been pooled into a single Other bucket that is safe to use downstream.

22.4 Suitability Checks

Chapter 20 introduced KMO and Bartlett’s test to confirm that a correlation matrix has enough shared variance for factor analysis. Clustering has no single equivalent, but the spread of pairwise distances is a fast informal check: if most distances are similar, there is no geometric structure to recover.
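A sketch of both checks, assuming the psych package; the six-item two-driver simulation stands in for the prepared numeric data so the snippet runs on its own:

```r
library(psych)                           # KMO() and cortest.bartlett()

set.seed(1)
# Simulated stand-in for the prepared data: six items, two latent drivers
lat <- matrix(rnorm(300 * 2), ncol = 2)
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))
names(num) <- paste0("item", 1:6)

R <- cor(num)
KMO(R)$MSA                                   # overall MSA; above 0.70 clears EFA
cortest.bartlett(R, n = nrow(num))$p.value   # small p: enough shared variance

d <- as.numeric(dist(num))                   # informal clustering check
hist(d, main = "Pairwise distance spread")
sd(d) / mean(d)                              # wider relative spread suggests structure
```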

A KMO above 0.70 and a significant Bartlett test clear the EFA side; a wide, non-degenerate distance distribution clears the clustering side. A narrow, unimodal distance distribution warns that no clusters are likely to stand out.

22.5 EFA Pipeline

The standard EFA pipeline is: eigenvalues-plus-scree to decide the number of factors, principal-axis extraction with a rotation, and inspection of the loading matrix against a salience cutoff. For this chapter the rotation is Varimax because the factors were built to be uncorrelated; in real projects Oblimin is used whenever a factor correlation of 0.30 or more is plausible.
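A sketch of that pipeline with the psych package, again on simulated two-factor items (all names illustrative):

```r
library(psych)

set.seed(1)
lat <- matrix(rnorm(300 * 2), ncol = 2)    # two orthogonal latent factors
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))
names(num) <- paste0("item", 1:6)

eigen(cor(num))$values                     # count eigenvalues above 1
scree(num)                                 # scree plot for the same decision

efa <- fa(num, nfactors = 2, fm = "pa", rotate = "varimax")  # principal axis + Varimax
print(efa$loadings, cutoff = 0.40)         # blank sub-0.40 loadings
```

Swapping `rotate = "varimax"` for `rotate = "oblimin"` is the only change needed when correlated factors are plausible.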

Two eigenvalues exceed 1, the two-factor solution is fit, and setting sub-0.40 loadings to blank produces a clean simple-structure table that is ready for labelling. Items 1-3 define factor one; items 4-6 define factor two.

22.6 Scoring and Reliability

Each retained factor needs a score per observation and a reliability estimate so the construct can be used downstream. The regression method computes factor scores directly from the solution; unit-weighted composite scores (row means of items in the scale) are a more transparent alternative that usually correlates above 0.95 with the regression scores.
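A sketch of both score versions and the reliability check, assuming the psych package and the same simulated items as above (scale membership `item1`-`item3` and `item4`-`item6` is part of the simulation, not a rule):

```r
library(psych)

set.seed(1)
lat <- matrix(rnorm(300 * 2), ncol = 2)
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))
names(num) <- paste0("item", 1:6)

efa <- fa(num, nfactors = 2, fm = "pa", rotate = "varimax")

reg  <- efa$scores                          # regression-method factor scores
comp <- cbind(rowMeans(num[, 1:3]),         # unit-weighted composites per scale
              rowMeans(num[, 4:6]))
abs(cor(reg, comp))    # one entry per factor is typically > 0.95
                       # (factor order and sign can flip between runs)

alpha(num[, 1:3])$total$raw_alpha           # Cronbach's alpha, scale 1
alpha(num[, 4:6])$total$raw_alpha           # Cronbach's alpha, scale 2
```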

Regression and unit-weighted scores agree above 0.95 on each factor, and Cronbach's alpha comfortably exceeds Nunnally's 0.70 threshold, so either score version is defensible for the next step.

22.7 Clustering Pipeline

With prepared numeric features the clustering pipeline is: pick \(k\) by an explicit rule (elbow in within-cluster SS and maximum average silhouette, Chapter 21), fit k-means with several random starts, and carry forward the cluster assignments.
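A sketch of the k-selection loop and the final fit, using base `kmeans()` and the cluster package; the three planted groups are simulated so the snippet runs on its own:

```r
library(cluster)                       # silhouette()

set.seed(42)
# Simulated features with three planted groups (illustrative)
num <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 4), ncol = 2),
                   cbind(rnorm(50, 0), rnorm(50, 8))))

ks  <- 2:6
wss <- sapply(ks, function(k) kmeans(num, k, nstart = 25)$tot.withinss)
sil <- sapply(ks, function(k) {
  km <- kmeans(num, k, nstart = 25)
  mean(silhouette(km$cluster, dist(num))[, "sil_width"])
})

k_best <- ks[which.max(sil)]           # silhouette maximum; check WSS elbow agrees
fit    <- kmeans(num, centers = k_best, nstart = 25)
table(fit$cluster)
```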

The silhouette rule selects \(k\) at its maximum, the WSS curve confirms the elbow at the same \(k\), and the final kmeans fit uses nstart = 25 to protect against a poor random start.

22.8 Cluster Profiling

Cluster IDs are not yet a deliverable. Profiling means computing the cluster means on the variables that went into the distance, the cluster size as a share of the total, and a short business label per cluster. A one-table profile is the minimum stakeholders need to accept or push back on the segmentation.
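One way to build that table with base R aggregation; the two-cluster toy data and the empty `label` column are placeholders for the real features and the analyst's naming:

```r
set.seed(42)
# Illustrative features and a k-means fit to profile
num <- data.frame(x = c(rnorm(60, 0), rnorm(40, 4)),
                  y = c(rnorm(60, 0), rnorm(40, -3)))
fit <- kmeans(scale(num), centers = 2, nstart = 25)

profile <- aggregate(num, by = list(cluster = fit$cluster), FUN = mean)
profile$n     <- as.numeric(table(fit$cluster))
profile$share <- round(profile$n / nrow(num), 2)
profile$label <- c("", "")             # business labels, filled in by the analyst
profile
```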

The profile table is the single artefact the business sees first. Cluster means on the prepared variables, counts, and shares belong on one row; a label and a short narrative then turn “cluster 2” into “north-east high-x/low-y”.

22.9 Combining Factors and Clusters

When the raw items are noisy or partly redundant, clustering on factor scores is usually cleaner than clustering on the items directly. The factor solution absorbs the correlated structure into a few orthogonal axes; clustering on those axes produces segments that are easier to label and less sensitive to a single item.
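The combination is a short pipe from one fit to the other; this sketch assumes the psych package and reuses the simulated two-factor items (the choice of three clusters is illustrative):

```r
library(psych)

set.seed(1)
lat <- matrix(rnorm(300 * 2), ncol = 2)
lam <- cbind(c(0.8, 0.7, 0.75, 0, 0, 0), c(0, 0, 0, 0.8, 0.7, 0.75))
num <- as.data.frame(scale(lat %*% t(lam) +
                           matrix(rnorm(300 * 6, sd = 0.5), ncol = 6)))

scores <- as.data.frame(fa(num, nfactors = 2, fm = "pa",
                           rotate = "varimax")$scores)
km <- kmeans(scores, centers = 3, nstart = 25)

aggregate(scores, by = list(cluster = km$cluster), FUN = mean)  # e.g. high-F1/low-F2
```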

The cluster means on the factor scores read as a business story (for example, “high on factor one, low on factor two”) without the analyst having to eyeball six item-level numbers per cluster.

22.10 Packaging as a Reusable Function

The prep, EFA, and clustering code from the previous sections can be folded into a single function that takes a data frame and the number of factors and clusters and returns everything needed for the report: loadings, reliabilities, cluster sizes, and the profile table on factor scores.
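One possible shape for that wrapper, offered as a sketch rather than a fixed design: it assumes the psych and cluster packages, and the 0.40 salience cutoff, `nstart = 25`, and default seed are choices carried over from the earlier sections:

```r
library(psych)     # fa(), alpha()

fit_segments <- function(df, n_factors, k, seed = 42) {
  set.seed(seed)
  # Preparation: complete cases, numeric block, zero-variance guard, z-scale
  df  <- df[complete.cases(df), ]
  num <- df[sapply(df, is.numeric)]
  num <- num[, sapply(num, sd) > 1e-8, drop = FALSE]
  num <- as.data.frame(scale(num))

  # EFA, factor scores, and per-scale reliability over salient items
  efa    <- fa(num, nfactors = n_factors, fm = "pa", rotate = "varimax")
  scores <- as.data.frame(unclass(efa$scores))
  load   <- unclass(efa$loadings)
  alphas <- sapply(seq_len(n_factors), function(j) {
    items <- rownames(load)[abs(load[, j]) >= 0.40]
    if (length(items) < 2) return(NA_real_)
    alpha(num[, items])$total$raw_alpha
  })

  # Clustering on the factor scores, then the profile table
  km <- kmeans(scores, centers = k, nstart = 25)
  profile <- aggregate(scores, by = list(cluster = km$cluster), FUN = mean)
  profile$n     <- as.numeric(table(km$cluster))
  profile$share <- profile$n / nrow(scores)

  list(loadings = efa$loadings, alphas = alphas,
       sizes = table(km$cluster), profile = profile,
       cluster = km$cluster, scores = scores)
}
```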

Wrapping the pipeline this way turns a scripted notebook into an audited routine: the seed is fixed, the steps run in the same order every time, and the output is a single list that the report pulls from.

22.11 Reporting and Handover

A complete unsupervised report states the population and sample size, the variables used (and how they were scaled, with rare levels pooled), the suitability checks (KMO, Bartlett, distance spread), the EFA extraction and rotation, the loading table with a salience cutoff, per-scale reliability (alpha and, where loadings differ, omega), the rule used to pick \(k\), the clustering algorithm with its settings, and the cluster profile with business labels. Stakeholders expect at least one sensitivity check: does the solution survive a different seed, a different rotation, or \(k \pm 1\)? Fabrigar et al. (1999) and Kaufman and Rousseeuw (1990) are the textbook references reviewers expect; Hastie, Tibshirani and Friedman (2009) covers the broader validation landscape.

Tip: Handover checklist
  1. Data source and extraction date
  2. Variables used and preparation steps, including pooled levels
  3. Suitability evidence (KMO, Bartlett, distance spread)
  4. EFA extraction, rotation, loadings, communalities, reliability
  5. Clustering algorithm, k rule, and assignments
  6. Profile table with labels
  7. The fit_segments() function and seed
  8. Known caveats and recommended refresh cadence

22.12 Summary

Summary of the unsupervised implementation steps introduced in this chapter

| Concept | Description |
|---|---|
| **Framing** | |
| Unsupervised deliverable | Labels on items (EFA), on rows (clusters), or both |
| Method selector: items vs rows | EFA distils items; clustering groups rows; combine when both are noisy |
| **Preparation** | |
| Completeness filter | Drop rows with missing values on the modelling variables |
| Z-scaling numeric features | Scale numerics to unit spread before distance or correlation |
| Rare-level pooling | Collapse categories below a share threshold into Other |
| Zero-variance guard | Drop columns with sd near zero before cor() or dist() |
| **EFA** | |
| Suitability checks (KMO, Bartlett, distance spread) | Confirm enough shared variance and distance spread to segment |
| Factor extraction and rotation | Principal axis factoring with Varimax or Oblimin rotation |
| Reliability (alpha and omega) | Report alpha per scale; use omega when loadings are unequal |
| **Clustering** | |
| k selection via elbow and silhouette | Pick k where WSS bends and average silhouette peaks |
| Cluster profile table | Cluster means, sizes, shares, and business labels |
| **Delivery** | |
| Clustering on factor scores | Run k-means on factor scores, not raw items, for cleaner segments |
| Reusable fit function | Wrap prep, EFA, scoring, and clustering into one function with a seed |
| Handover checklist | Source, variables, suitability, fit, scores, cluster, profile, caveats |