14 Multivariate Data Analysis Tools
14.1 Multivariate Analysis in Context
Multivariate analysis examines three or more variables at once. It completes the “tools with R” triplet: Chapter 12 worked with a single variable, Chapter 13 with pairs, and this chapter with the full column set. The goals shift accordingly: instead of a single test statistic, the tools in this chapter look for joint structure, reduce dimension, or compare many means at once.
How are all the variables related to each other? Which rows are multivariate outliers? Can the set of variables be summarised by a smaller number of components? Do groups of rows cluster together? Each question has a standard R tool.
Correlation, PCA, and distance-based methods are sensitive to the units of measurement. Standardise (scale) numeric columns before applying any of them unless the variables are already on comparable scales.
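A minimal sketch of the scaling step, assuming a data frame of numeric columns named `dat` (the name is illustrative):

```r
# scale() centres each column to mean 0 and rescales to sd 1
dat_scaled <- scale(dat)
round(colMeans(dat_scaled), 10)   # effectively 0 for every column
apply(dat_scaled, 2, sd)          # exactly 1 for every column
```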
14.2 Visualising Many Variables
Before any multivariate model, the column set is scanned with pictures. A pairs plot places every pairwise scatter on one grid; a correlation heat map shows the full matrix at a glance; a parallel-coordinate plot traces each row as a line across standardised axes.
The pairs plot is for shape and outliers across pairs; the heat map is for strength and sign of association at a glance; parallel coordinates are for spotting rows that move together or against the crowd.
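A base-R sketch of the three views, again assuming a numeric data frame `dat`; `parcoord()` lives in the MASS package, which ships with R:

```r
pairs(dat)                       # pairwise scatter grid: shape and outliers
heatmap(cor(dat), symm = TRUE)   # correlation heat map: strength and sign
MASS::parcoord(scale(dat))       # parallel coordinates on standardised axes
```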
14.3 Correlation Matrix and Partial Correlation
The correlation matrix generalises Chapter 13’s bivariate correlation to many variables at once. A partial correlation isolates the association between two variables after removing the linear effect of a third, which is the multivariate answer to Simpson-like reversals.
The raw correlation between spend and value is what you see at a glance. The partial correlation is what remains after tenure is accounted for. A large drop warns that the pairwise picture was being inflated by a shared third variable.
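A sketch using the spend/value/tenure example above, still assuming the data sit in `dat`. The residual trick computes the partial correlation in base R; dedicated packages such as ppcor wrap the same idea:

```r
round(cor(dat), 2)   # full correlation matrix in one call

# Partial correlation of spend and value, controlling for tenure:
# correlate the parts of each variable that tenure cannot explain
r_spend <- resid(lm(spend ~ tenure, data = dat))
r_value <- resid(lm(value ~ tenure, data = dat))
cor(r_spend, r_value)   # compare with cor(dat$spend, dat$value)
```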
14.4 Multicollinearity and VIF
When several predictors move together, their individual coefficients in a regression become unstable. The variance inflation factor (VIF) quantifies how much a predictor’s variance is inflated by its linear dependence on the others; rule-of-thumb values above 5 or 10 signal trouble.
A high VIF says the predictor is linearly explained by the others in the model, not that it should be dropped. The remedy depends on purpose: drop one of the pair, combine them into an index, or switch to a regularised model.
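A sketch of the VIF computation from first principles, assuming a fitted model such as `fit <- lm(y ~ x1 + x2 + x3, data = dat)` (names illustrative); `car::vif()` returns the same numbers if the car package is available:

```r
# VIF for each predictor: regress it on the other predictors,
# then VIF = 1 / (1 - R^2) of that auxiliary regression
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # predictors, intercept dropped
  sapply(colnames(X), function(v) {
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
    1 / (1 - r2)
  })
}
vif_by_hand(fit)   # values above 5 or 10 flag unstable coefficients
```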
14.5 Mahalanobis Distance
A univariate outlier lives far from the mean on one axis; a multivariate outlier lives far from the mean cloud in the full space, after accounting for correlation between variables. Mahalanobis distance (Mahalanobis 1936) measures that multivariate distance and is compared to a chi-square cutoff.
A row flagged by Mahalanobis might be a data-entry error, a legitimate extreme, or a signal of a sub-population. Only after inspection should the analyst decide to drop, keep, or model the row separately.
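A sketch of the distance and its chi-square cutoff, assuming the numeric data frame `dat` from earlier:

```r
d2  <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))
cut <- qchisq(0.999, df = ncol(dat))   # conservative cutoff on squared distance
which(d2 > cut)                        # rows to inspect, not to drop automatically
```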
14.6 Principal Component Analysis
PCA (Pearson 1901, Hotelling 1933) replaces a set of correlated numeric variables with a smaller set of uncorrelated components that preserve most of the variance. It is used for compact visualisation, for collinearity reduction before modelling, and for spotting latent directions in the data.
The scree plot shows how variance is distributed across components; the elbow is where extra components start adding little. The biplot shows both rows (points) and variables (arrows) on the same axes, so direction and strength of each variable in PC space are visible at once.
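A `prcomp()` sketch on the same hypothetical `dat`; centring and scaling are requested explicitly so the units decision is visible in the code:

```r
pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)                     # proportion of variance per component
screeplot(pca, type = "lines")   # scree plot: look for the elbow
biplot(pca)                      # rows as points, variables as arrows
```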
14.7 k-Means Clustering
k-means partitions rows into k groups so that each row is closest to its group centroid. The number of clusters k is chosen by the analyst, often guided by an elbow in the within-cluster sum-of-squares curve.
k-means is sensitive to the initial centroid placement. Always set nstart to 10 or more so R picks the best of several random starts; the option is cheap and stabilises the result.
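A sketch of the elbow curve and the final fit, assuming standardised data; k = 3 is illustrative, not a recommendation:

```r
set.seed(1)          # k-means starts are random
num <- scale(dat)

# Elbow: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k)
  kmeans(num, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "k", ylab = "Total within-cluster SS")

km <- kmeans(num, centers = 3, nstart = 25)   # best of 25 random starts
table(km$cluster)                             # cluster sizes
```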
14.8 Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) that merges rows step by step based on a distance matrix. It does not require k in advance; instead, the analyst cuts the tree at the height that produces a sensible number of clusters.
Ward’s method tends to produce compact, similarly sized clusters and is a safe default. Complete linkage emphasises separation; single linkage is prone to chaining. The choice is part of the multivariate plan and should be reported.
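A sketch of the standard pipeline on the hypothetical `dat`; the cut at k = 3 is illustrative:

```r
d  <- dist(scale(dat))                # Euclidean distances on standardised columns
hc <- hclust(d, method = "ward.D2")   # Ward's method: compact, similar-sized clusters
plot(hc)                              # dendrogram: inspect before cutting
groups <- cutree(hc, k = 3)           # cut the tree into three clusters
table(groups)
```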
14.9 MANOVA
MANOVA extends ANOVA to several numeric response variables measured on the same units. It asks whether the joint vector of means differs across groups, using Wilks’ lambda (Wilks 1932) or related statistics. It is the multivariate counterpart of Chapter 13’s one-way ANOVA.
A significant MANOVA says the mean vector differs across groups. Follow-up univariate ANOVAs (one per response) and pre-planned contrasts show which responses drive the effect. Apply the multiple-testing discipline introduced in Chapter 11.
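A sketch with two hypothetical responses `y1` and `y2` and a grouping factor `g` in `dat`:

```r
fit <- manova(cbind(y1, y2) ~ g, data = dat)
summary(fit, test = "Wilks")   # joint test: do the mean vectors differ?
summary.aov(fit)               # follow-up univariate ANOVAs, one per response
```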
14.10 Choosing a Multivariate Tool
The right tool depends on the question: describe joint structure, flag multivariate outliers, reduce dimension, discover groups of rows, or compare mean vectors across known groups.
```mermaid
flowchart TD
    A[Many variables] --> B{Question}
    B -->|How are they related?| C[Correlation matrix or partial correlation]
    B -->|Which rows are unusual?| D[Mahalanobis distance]
    B -->|Can I summarise them?| E[PCA]
    B -->|Do rows group?| F{Known groups?}
    F -->|No| G[k-means or hierarchical clustering]
    F -->|Yes, compare means| H[MANOVA]
    B -->|Predictors too collinear?| I[VIF diagnostic]
```
Multivariate studies go wrong when several tools are applied without a pre-set question. Pick the question first; the diagram then picks the tool.
14.11 Reporting Multivariate Findings
A multivariate report uses the same six-section skeleton from Chapters 11 to 13 and lists the full variable set instead of one or two names.
1. Question and variable set
2. Sample and scaling decision
3. Diagnostic views (pairs, heat map, Mahalanobis)
4. Tool output (matrix, PCA loadings, cluster labels, MANOVA table)
5. Effect or structure summary (variance explained, silhouette, Wilks’ lambda)
6. Business decision with caveats

Keeping this skeleton stable across Chapters 11 to 14 makes univariate, bivariate, and multivariate studies directly comparable.
14.12 Summary
| Concept | Description |
|---|---|
| Setup and Landscape | |
| Multivariate visualisation | Pairs plot, correlation heat map, parallel coordinates at scale |
| Scaling decision | Standardise numeric columns before correlation, PCA, or distance tools |
| Relationships at Scale | |
| Correlation matrix | All pairwise correlations in a single cor() call |
| Partial correlation | Correlation between two variables after removing the effect of a third |
| VIF diagnostic | Flags predictors that are linearly explained by the others |
| Outliers and Dimension Reduction | |
| Mahalanobis distance | Multivariate distance from the mean cloud, compared to chi-square cutoff |
| prcomp PCA | Principal components via prcomp with centring and scaling |
| Scree and variance explained | How variance is distributed across components; elbow guides retention |
| Biplot | Row points and variable arrows on the same PC axes |
| Grouping Structure | |
| k-means clustering | Partition rows around centroids with a chosen k |
| Hierarchical with hclust | Dendrogram built from a distance matrix; cut the tree to get clusters |
| Cluster-count guidance | Elbow plot for k-means, dendrogram cut for hierarchical |
| Multivariate Tests and Reporting | |
| MANOVA | Compares mean vectors across groups using Wilks’ lambda |
| Six-section multivariate report | Question, variable set, diagnostic, tool, structure, decision |