14 Multivariate Data Analysis Tools

14.1 Multivariate Analysis in Context

Multivariate analysis examines three or more variables at once. It completes the “tools with R” triplet: Chapter 12 worked with a single variable, Chapter 13 with pairs, and this chapter with the full column set. The goals shift accordingly: instead of a single test statistic, the tools in this chapter look for joint structure, reduce dimension, or compare many means at once.

Note: Four recurring multivariate questions

How are all the variables related to each other? Which rows are multivariate outliers? Can the set of variables be summarised by a smaller number of components? Do groups of rows cluster together? Each question has a standard R tool.

Tip: Scale matters for most multivariate tools

Correlation, PCA, and distance-based methods are sensitive to the units of measurement. Standardise (scale) numeric columns before applying any of them unless the variables are already on comparable scales.
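As a minimal sketch of the scaling step, base R's scale() standardises each numeric column; the built-in mtcars data set stands in as an example here:

```r
# Standardise numeric columns: each ends up with mean 0 and sd 1
num <- mtcars[, c("mpg", "disp", "hp", "wt")]  # built-in example data
z   <- scale(num)                              # centre and scale each column

round(colMeans(z), 10)  # all approximately 0
apply(z, 2, sd)         # all 1
```

scale() returns a matrix and records the centring and scaling values as attributes, which is handy when new rows must later be put on the same footing.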

14.2 Visualising Many Variables

Before any multivariate model, the column set is scanned with pictures. A pairs plot places every pairwise scatter on one grid; a correlation heat map shows the full matrix at a glance; a parallel-coordinate plot traces each row as a line across standardised axes.

Tip: Three pictures, three views

A pairs plot shows shape and outliers across pairs; a heat map shows strength and sign of association at a glance; parallel coordinates reveal rows that move together or against the crowd.
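All three views can be produced with base R alone; the sketch below uses mtcars as an illustrative data set (MASS::parcoord offers a ready-made parallel-coordinate plot, but matplot() on the transposed matrix does the same job):

```r
num <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

pairs(num, main = "Pairwise scatter plots")              # pairs plot

cm <- cor(num)
heatmap(cm, symm = TRUE, main = "Correlation heat map")  # quick heat map

# Parallel coordinates: one line per row across standardised axes
matplot(t(num), type = "l", lty = 1, col = "grey40",
        xaxt = "n", ylab = "z-score", main = "Parallel coordinates")
axis(1, at = seq_len(ncol(num)), labels = colnames(num))
```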

14.3 Correlation Matrix and Partial Correlation

The correlation matrix generalises Chapter 13’s bivariate correlation to many variables at once. A partial correlation isolates the association between two variables after removing the linear effect of a third, which is the multivariate answer to Simpson-like reversals.

Note: What the two numbers say

The raw correlation between spend and value is what you see at a glance. The partial correlation is what remains after tenure is accounted for. A large drop warns that the pairwise picture was being inflated by a shared third variable.
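A partial correlation can be computed without extra packages by correlating the residuals of two regressions (the ppcor package wraps the same idea in pcor()). The sketch below uses mtcars variables in place of spend, value, and tenure:

```r
# Partial correlation of x and y given z, via the residual trick:
# remove the linear effect of z from each variable, then correlate what is left
partial_cor <- function(x, y, z) {
  rx <- resid(lm(x ~ z))  # x with the linear effect of z removed
  ry <- resid(lm(y ~ z))  # y with the linear effect of z removed
  cor(rx, ry)
}

# Illustration: mpg vs disp, controlling for wt
raw  <- cor(mtcars$mpg, mtcars$disp)
part <- partial_cor(mtcars$mpg, mtcars$disp, mtcars$wt)
round(c(raw = raw, partial = part), 2)
```

Here the partial correlation is much closer to zero than the raw one, which is exactly the "large drop" pattern the callout warns about.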

14.4 Multicollinearity and VIF

When several predictors move together, their individual coefficients in a regression become unstable. The variance inflation factor (VIF) quantifies how much a predictor’s variance is inflated by its linear dependence on the others; rule-of-thumb values above 5 or 10 signal trouble.

Warning: VIF is a diagnostic, not a decision

A high VIF says the predictor is linearly explained by the others in the model, not that it should be dropped. The remedy depends on purpose: drop one of the pair, combine them into an index, or switch to a regularised model.
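The definition VIF = 1 / (1 - R²) can be computed directly in base R by regressing each predictor on the others; car::vif() returns the same numbers for a fitted model. A sketch on three mtcars predictors:

```r
# VIF for each column: regress it on the remaining columns,
# then apply VIF = 1 / (1 - R^2)
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

preds <- mtcars[, c("disp", "hp", "wt")]
round(vif_by_hand(preds), 2)  # values above 5 or 10 signal trouble
```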

14.5 Mahalanobis Distance

A univariate outlier lives far from the mean on one axis; a multivariate outlier lives far from the mean cloud in the full space, after accounting for correlation between variables. Mahalanobis distance (Mahalanobis 1936) measures that multivariate distance and is compared to a chi-square cutoff.

Tip: Always inspect the flagged rows

A row flagged by Mahalanobis might be a data-entry error, a legitimate extreme, or a signal of a sub-population. Only after inspection should the analyst decide to drop, keep, or model the row separately.
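Base R's mahalanobis() returns squared distances, which are compared to a chi-square quantile with degrees of freedom equal to the number of variables. A sketch, again on mtcars, using the conventional 97.5th-percentile cutoff:

```r
num <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Squared Mahalanobis distance of each row from the mean vector
d2 <- mahalanobis(num, center = colMeans(num), cov = cov(num))

# Chi-square cutoff at the 97.5th percentile, df = number of variables
cutoff  <- qchisq(0.975, df = ncol(num))
flagged <- rownames(num)[d2 > cutoff]
flagged  # inspect these rows before deciding to drop, keep, or model separately
```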

14.6 Principal Component Analysis

PCA (Pearson 1901, Hotelling 1933) replaces a set of correlated numeric variables with a smaller set of uncorrelated components that preserve most of the variance. It is used for compact visualisation, for collinearity reduction before modelling, and for spotting latent directions in the data.

Note: Reading the two plots

The scree plot shows how variance is distributed across components; the elbow is where extra components start adding little. The biplot shows both rows (points) and variables (arrows) on the same axes, so direction and strength of each variable in PC space are visible at once.
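Both plots fall out of a single prcomp() fit; note the argument is spelled scale. (with a trailing dot). A sketch on five mtcars columns:

```r
num <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

pc <- prcomp(num, center = TRUE, scale. = TRUE)  # PCA on standardised columns

summary(pc)  # proportion of variance explained per component
screeplot(pc, type = "lines", main = "Scree plot")
biplot(pc, main = "Biplot: rows as points, variables as arrows")
```

With scaled input the component variances sum to the number of variables, so the proportion-of-variance line in summary(pc) is easy to check by hand.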

14.7 k-Means Clustering

k-means partitions rows into k groups so that each row is closest to its group centroid. The number of clusters k is chosen by the analyst, often guided by an elbow in the within-cluster sum-of-squares curve.

Warning: Re-run with multiple starts

k-means is sensitive to the initial centroid placement. Always set the nstart argument to 10 or more so R keeps the best of several random starts; the extra computation is cheap and stabilises the result.
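A sketch of the full workflow, including the elbow check, on scaled mtcars columns (k = 3 is an illustrative choice, not a recommendation):

```r
set.seed(42)  # make the random starts reproducible
num <- scale(mtcars[, c("mpg", "hp", "wt")])

km <- kmeans(num, centers = 3, nstart = 25)  # keep the best of 25 random starts
table(km$cluster)                            # cluster sizes

# Elbow check: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(num, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```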

14.8 Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) that merges rows step by step based on a distance matrix. It does not require k in advance; instead, the analyst cuts the tree at the height that produces a sensible number of clusters.

Tip: Choosing a linkage

Ward’s method tends to produce compact, similarly sized clusters and is a safe default. Complete linkage emphasises separation; single linkage is prone to chaining. The choice is part of the multivariate plan and should be reported.
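A sketch of the dist/hclust/cutree pipeline with Ward linkage (hclust spells it "ward.D2"); cutting at k = 3 is illustrative:

```r
num <- scale(mtcars[, c("mpg", "hp", "wt")])

d  <- dist(num)                      # Euclidean distance matrix between rows
hc <- hclust(d, method = "ward.D2")  # Ward linkage

plot(hc, cex = 0.6, main = "Dendrogram (Ward linkage)")
groups <- cutree(hc, k = 3)          # cut the tree into 3 clusters
table(groups)
```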

14.9 MANOVA

MANOVA extends ANOVA to several numeric response variables measured on the same units. It asks whether the joint vector of means differs across groups, using Wilks’ lambda (Wilks 1932) or related statistics. It is the multivariate counterpart of Chapter 13’s one-way ANOVA.

Note: A rejection is the start, not the end

A significant MANOVA says the mean vector differs across groups. Follow-up univariate ANOVAs (one per response) and pre-planned contrasts show which responses drive the effect. Apply the multiple-testing discipline introduced in Chapter 11.
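A sketch with the built-in iris data, which has three groups and several numeric responses; summary.aov() provides the follow-up univariate ANOVAs:

```r
# Two numeric responses, one grouping factor
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

summary(fit, test = "Wilks")  # Wilks' lambda for the joint test
summary.aov(fit)              # follow-up univariate ANOVAs, one per response
```

Remember that the per-response follow-ups multiply the number of tests, so the correction discipline from Chapter 11 applies to them.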

14.10 Choosing a Multivariate Tool

The right tool depends on the question: describe joint structure, flag multivariate outliers, reduce dimension, discover groups of rows, or compare mean vectors across known groups.

```mermaid
flowchart TD
    A[Many variables] --> B{Question}
    B -->|How are they related?| C[Correlation matrix or partial correlation]
    B -->|Which rows are unusual?| D[Mahalanobis distance]
    B -->|Can I summarise them?| E[PCA]
    B -->|Do rows group?| F{Known groups?}
    F -->|No| G[k-means or hierarchical clustering]
    F -->|Yes, compare means| H[MANOVA]
    B -->|Predictors too collinear?| I[VIF diagnostic]
```

Tip: One question, one tool

Multivariate studies go wrong when several tools are applied without a pre-set question. Pick the question first; the diagram then picks the tool.

14.11 Reporting Multivariate Findings

A multivariate report uses the same six-section skeleton from Chapters 11 to 13 and lists the full variable set instead of one or two names.

Tip: Six-section multivariate report

(1) Question and variable set; (2) sample and scaling decision; (3) diagnostic views (pairs, heat map, Mahalanobis); (4) tool output (matrix, PCA loadings, cluster labels, MANOVA table); (5) effect or structure summary (variance explained, silhouette, Wilks’ lambda); (6) business decision with caveats. Keeping this skeleton stable across Chapters 11 to 14 makes univariate, bivariate, and multivariate studies directly comparable.

14.12 Summary

Summary of multivariate tools introduced in this chapter
| Concept | Description |
| --- | --- |
| *Setup and Landscape* | |
| Multivariate visualisation | Pairs plot, correlation heat map, parallel coordinates at scale |
| Scaling decision | Standardise numeric columns before correlation, PCA, or distance tools |
| *Relationships at Scale* | |
| Correlation matrix | All pairwise correlations in a single cor() call |
| Partial correlation | Correlation between two variables after removing the effect of a third |
| VIF diagnostic | Flags predictors that are linearly explained by the others |
| *Outliers and Dimension Reduction* | |
| Mahalanobis distance | Multivariate distance from the mean cloud, compared to a chi-square cutoff |
| prcomp PCA | Principal components via prcomp with centre and scale |
| Scree and variance explained | How variance is distributed across components; elbow guides retention |
| Biplot | Row points and variable arrows on the same PC axes |
| *Grouping Structure* | |
| k-means clustering | Partition rows around centroids with a chosen k |
| Hierarchical with hclust | Dendrogram built from a distance matrix; cut the tree to get clusters |
| Cluster-count guidance | Elbow plot for k-means, dendrogram cut for hierarchical |
| *Multivariate Tests and Reporting* | |
| MANOVA | Compares mean vectors across groups using Wilks’ lambda |
| Six-section multivariate report | Question, variable set, diagnostic, tool, structure, decision |