21  Cluster Analysis

21.1 What Clustering Is

Cluster analysis groups observations so that members of the same group are more similar to each other than to members of other groups. There is no response variable, so the task is unsupervised: the algorithm proposes the grouping, and the analyst decides whether it is substantively useful. Typical business uses include customer segmentation, store archetyping, product bundling, and anomaly isolation.

```mermaid
flowchart LR
    A[Raw observations] --> B[Distance or dissimilarity]
    B --> C[Clustering algorithm]
    C --> D[Cluster assignments]
    D --> E[Profile and label]
    E --> F[Business use]
```

Note: Clustering does not validate itself

Any algorithm will return clusters, even on random data. The analyst is responsible for choosing a sensible distance, picking the number of clusters with a principled rule (§4), validating separation (§8), and deciding whether the resulting groups are interpretable in the business context (§10).

21.2 Distance and Scaling

Most clustering methods need a distance or dissimilarity between every pair of observations. Euclidean distance is the default for continuous variables, Manhattan distance is a common outlier-robust alternative, and Gower dissimilarity (Gower 1971) handles mixed numeric and categorical data. Any variable with a larger numeric range dominates Euclidean distance, so scaling to a common spread (via scale() for z-scores) is almost always required before clustering on numeric features.

Before scaling, spend drives all of the distance and visits are ignored. After scaling, the two near-origin rows and the three high-spend rows form visibly separated blocks in the distance matrix.
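The effect can be sketched with a hypothetical five-customer data frame (the spend and visit figures below are illustrative, not the chapter's original data):

```r
# Hypothetical mini-dataset: two near-origin customers and three
# high spenders. Spend spans thousands; visits span single digits.
customers <- data.frame(
  spend  = c(100, 150, 4800, 5100, 5300),
  visits = c(2, 3, 28, 30, 31)
)

# Unscaled: spend's range dominates the Euclidean distance.
d_raw <- dist(customers)

# scale() centres each column and divides by its standard deviation,
# so both variables contribute comparably after scaling.
d_scaled <- dist(scale(customers))

round(as.matrix(d_raw), 1)
round(as.matrix(d_scaled), 2)
```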

21.3 k-Means Algorithm

k-means (MacQueen 1967; Lloyd 1982) partitions \(n\) observations into \(k\) clusters by minimising the within-cluster sum of squares. Starting from \(k\) initial centres, it assigns each point to the nearest centre, recomputes each centre as the mean of its members, and repeats until assignments stop changing. The Hartigan and Wong (1979) variant used by stats::kmeans is the default in R. Because the objective is non-convex, always rerun from several random initial configurations (nstart) and keep the best solution.

The three groups are recovered and the centre marks (x symbols) sit near the middles of the simulated clouds. nstart = 25 protects against a poor random initial configuration.
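A minimal sketch of this fit, using simulated clouds (the seed and cloud parameters are illustrative):

```r
set.seed(42)  # arbitrary seed, for reproducibility

# Three simulated Gaussian clouds of 50 points each, centred apart.
clouds <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2)
)

# nstart = 25 reruns the algorithm from 25 random initial centre
# configurations and keeps the solution with the lowest total
# within-cluster sum of squares.
km <- kmeans(clouds, centers = 3, nstart = 25)

km$centers         # fitted centres, near (0,0), (3,3), (6,6)
table(km$cluster)  # cluster sizes
```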

21.4 Choosing k

The number of clusters is a decision, not an estimate. Two complementary rules are common. The elbow method plots the total within-cluster sum of squares against \(k\) and looks for the bend where extra clusters buy little reduction. The silhouette method (Rousseeuw 1987) averages the silhouette width across observations at each \(k\) and prefers the \(k\) with the highest average. The gap statistic (Tibshirani, Walther and Hastie 2001) compares observed within-cluster dispersion to that of reference null data and is more formal but slower.

WSS drops sharply from \(k = 2\) to \(k = 3\) and flattens afterwards; average silhouette peaks at \(k = 3\). Both rules agree on the data-generating \(k\).
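Both rules can be computed in one loop; this sketch uses simulated three-cloud data and the silhouette() function from the cluster package (the data and the range of \(k\) are illustrative):

```r
library(cluster)  # silhouette(); 'cluster' ships with most R installs

set.seed(42)
x <- rbind(matrix(rnorm(100, 0, 0.5), ncol = 2),
           matrix(rnorm(100, 3, 0.5), ncol = 2),
           matrix(rnorm(100, 6, 0.5), ncol = 2))
d <- dist(x)

ks      <- 2:6
wss     <- numeric(length(ks))
avg_sil <- numeric(length(ks))
for (i in seq_along(ks)) {
  km <- kmeans(x, centers = ks[i], nstart = 25)
  wss[i]     <- km$tot.withinss                           # elbow input
  avg_sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

# Elbow: plot(ks, wss, type = "b"); silhouette rule picks the peak.
ks[which.max(avg_sil)]
```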

21.5 Hierarchical Clustering

Agglomerative hierarchical clustering starts with each observation as its own cluster and merges the two nearest clusters at each step, producing a full binary tree (dendrogram). The user reads the tree and cuts it at a chosen height or chosen number of clusters. Unlike k-means, the number of clusters does not need to be specified before fitting.

The three coloured rectangles show the natural three-cluster cut. The height at which a rectangle is drawn corresponds to the between-cluster distance at that level.
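A sketch of the fit-plot-cut workflow on simulated clouds (the data are illustrative; rect.hclust draws the coloured rectangles):

```r
set.seed(42)
x <- rbind(matrix(rnorm(40, 0, 0.4), ncol = 2),
           matrix(rnorm(40, 3, 0.4), ncol = 2),
           matrix(rnorm(40, 6, 0.4), ncol = 2))

# Agglomerative fit on Euclidean distances with Ward's method.
hc <- hclust(dist(x), method = "ward.D2")

plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 3, border = 2:4)  # draw the three-cluster cut

table(cutree(hc, k = 3))  # 20 observations per cluster
```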

21.6 Linkage Methods Compared

The linkage rule decides how inter-cluster distance is measured during merging. Single linkage uses the nearest pair (prone to chaining), complete linkage uses the farthest pair (produces compact clusters), average linkage averages all pairs, and Ward’s method (Ward 1963) minimises the increase in within-cluster variance at each merge. The ward.D2 implementation of Ward’s method is the modern choice for compact, roughly equal-size groups when the distances are Euclidean.

Single linkage stretches a single long chain; complete and Ward produce visibly three-armed trees that separate the three clouds; average sits between them.
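The comparison can be sketched by fitting the same distance matrix under each linkage (simulated data; on clouds this well separated all four agree, and the differences emerge on noisier or elongated data):

```r
set.seed(42)
x <- rbind(matrix(rnorm(40, 0, 0.4), ncol = 2),
           matrix(rnorm(40, 3, 0.4), ncol = 2),
           matrix(rnorm(40, 6, 0.4), ncol = 2))
d <- dist(x)

linkages <- c("single", "complete", "average", "ward.D2")
fits <- lapply(linkages, function(m) hclust(d, method = m))
names(fits) <- linkages

# Side-by-side dendrograms for a visual comparison of tree shape.
op <- par(mfrow = c(2, 2))
for (m in linkages) plot(fits[[m]], labels = FALSE, main = m)
par(op)

# Three-cluster cut under each linkage, one column per method.
cuts <- sapply(fits, function(h) cutree(h, k = 3))
```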

21.7 Cutting the Dendrogram

Once the tree is built, cutree(hc, k = 3) returns the cluster membership vector for a given number of clusters, and cutree(hc, h = 2.5) returns membership at a given height. Cross-tabulating the hierarchical assignment against a k-means assignment on the same data is a quick informal agreement check.

The cross-tabulation is block-diagonal up to label permutation: the two methods assign almost every observation to the same cluster. When the agreement is poor, the data either have no strong cluster structure or are shaped in a way that one of the two methods handles better than the other.
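The agreement check is a one-liner with table(); this sketch runs both methods on the same simulated data:

```r
set.seed(42)
x <- rbind(matrix(rnorm(60, 0, 0.5), ncol = 2),
           matrix(rnorm(60, 3, 0.5), ncol = 2),
           matrix(rnorm(60, 6, 0.5), ncol = 2))

hc_groups <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)
km_groups <- kmeans(x, centers = 3, nstart = 25)$cluster

# Block-diagonal up to label permutation when the methods agree.
table(hierarchical = hc_groups, kmeans = km_groups)
```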

21.8 Cluster Validation

The silhouette width for an observation is \((b - a) / \max(a, b)\), where \(a\) is the mean distance to its own cluster and \(b\) is the mean distance to the nearest other cluster. It ranges from minus one (wrong cluster) through zero (on the boundary) to one (clearly in its cluster). The average silhouette width across a sample is a single-number diagnostic; a silhouette plot shows the distribution per cluster.

An average silhouette width above 0.50 is usually strong evidence of real cluster structure; 0.25 to 0.50 is weak; below 0.25 suggests that no clustering is preferable to the one found. The silhouette plot also exposes individual observations that are poorly placed.
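Both the single-number diagnostic and the plot come from the cluster package; this sketch uses two well-separated simulated clusters, so the average width lands in the strong-evidence range:

```r
library(cluster)  # silhouette() and its plot method

set.seed(42)
x <- rbind(matrix(rnorm(60, 0, 0.5), ncol = 2),
           matrix(rnorm(60, 4, 0.5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)

# For each observation: own cluster, nearest other cluster, width.
sil <- silhouette(km$cluster, dist(x))

mean(sil[, "sil_width"])  # average width; above 0.50 for these data
plot(sil)                 # per-cluster silhouette plot
```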

21.9 Density-Based Clustering

k-means and Ward assume roughly spherical, similar-size clusters. DBSCAN (Ester et al. 1996) instead groups points that lie in dense neighbourhoods and labels sparse points as noise. Two tuning parameters are needed: eps, the neighbourhood radius, and minPts, the minimum number of neighbours for a point to be a core point. DBSCAN finds clusters of arbitrary shape and does not need \(k\) to be supplied.

k-means would force a straight-line split through both rings; DBSCAN recovers the two concentric rings and flags a small number of sparse points as noise (cluster label 0). The price is that the result depends on eps and minPts, which must be tuned for the data.
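The two-rings case can be sketched as follows, assuming the CRAN dbscan package is installed (the ring sizes, eps, and minPts values are illustrative and would need tuning on real data):

```r
library(dbscan)  # CRAN package; assumed installed

set.seed(42)
# Two concentric rings of 500 points each with slight radial noise.
theta  <- runif(1000, 0, 2 * pi)
radius <- c(rep(1, 500), rep(3, 500)) + rnorm(1000, sd = 0.05)
rings  <- cbind(radius * cos(theta), radius * sin(theta))

# eps is the neighbourhood radius; minPts the core-point threshold.
db <- dbscan(rings, eps = 0.6, minPts = 5)
table(db$cluster)  # label 0, if present, marks noise points

# For contrast: k-means cuts a straight line through both rings.
km <- kmeans(rings, centers = 2, nstart = 25)
```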

21.10 Profiling and Labelling Clusters

Cluster IDs are only useful once translated into business terms. Profiling means computing the cluster means (or medians, for skewed variables) of the input features and any auxiliary variables that were not used in the distance. A compact profile table with one row per cluster is the basis for assigning labels such as “high-spend loyalists” or “lapsed browsers”. Segment sizes, relative share, and a short narrative belong next to the profile; otherwise the clusters are just numbered buckets.

Tip: Profiling checklist

  1. Cluster size and share of total.
  2. Mean or median on each clustering variable.
  3. Mean or share on external variables not used in the fit.
  4. A short business label.
  5. At least one sanity check against a known segment or benchmark.
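A minimal base-R sketch of the size, share, and mean columns of a profile table, using a hypothetical two-segment customer dataset (the segment labels and figures are illustrative):

```r
set.seed(42)
# Hypothetical customers: 40 low-spend browsers, 40 high-spend loyalists.
customers <- data.frame(
  spend  = c(rnorm(40, 100, 20), rnorm(40, 900, 100)),
  visits = c(rnorm(40, 3, 1),    rnorm(40, 20, 4))
)
cl <- kmeans(scale(customers), centers = 2, nstart = 25)$cluster

# One row per cluster: size, share of total, and per-variable means.
profile <- cbind(
  data.frame(
    cluster = sort(unique(cl)),
    size    = as.vector(table(cl)),
    share   = as.vector(prop.table(table(cl)))
  ),
  aggregate(customers, by = list(cl), FUN = mean)[, -1]
)
profile  # basis for labels such as "high-spend loyalists"
```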

21.11 Reporting Clustering Work

A complete clustering report states the population and sample size, the variables used and how they were scaled, the distance or dissimilarity, the algorithm and its settings (including nstart for k-means, linkage for hierarchical, eps and minPts for DBSCAN), the rule used to choose \(k\), a validation summary (average silhouette or gap), and the profile table with labels. Reviewers want to see at least one sensitivity check: does the solution survive a different seed, a different linkage, or \(k \pm 1\)? Kaufman and Rousseeuw (1990) remains the standard textbook reference for the end-to-end workflow.

21.12 Summary

Summary of clustering concepts introduced in this chapter

| Concept | Description |
|---|---|
| **Foundations** | |
| Unsupervised grouping | Group observations by similarity with no response variable |
| Distance and dissimilarity | Euclidean and Manhattan for numeric; Gower for mixed types |
| Scaling before clustering | z-score numeric features so no variable dominates distance |
| **Partitional** | |
| k-means algorithm | Partition into k clusters by minimising within-cluster SS |
| Choosing k: elbow and silhouette | Plot WSS and average silhouette across a range of k |
| Within-cluster sum of squares | Sum of squared distances of points to their cluster centre |
| **Hierarchical** | |
| Agglomerative hierarchical clustering | Merge closest pairs into a binary tree (dendrogram) |
| Linkage methods | Single, complete, average, and ward.D2 change cluster shape |
| Dendrogram cut | cutree() at a chosen k or height to get memberships |
| **Validation** | |
| Silhouette validation | Silhouette width summarises within-vs-between separation |
| Dunn index | Minimum between-cluster distance divided by maximum within-cluster distance |
| **Density and delivery** | |
| DBSCAN for irregular shapes | Finds dense regions and flags sparse points as noise |
| Profiling and labelling | Cluster means and shares translated into business labels |
| Reporting checklist | Variables, scaling, algorithm, k rule, validation, profile |