14 Multivariate Data Analysis Tools
14.1 Multivariate Analysis in Context
Multivariate analysis examines three or more variables at once. It completes the “tools with R” triplet: Chapter 12 worked with a single variable, Chapter 13 with pairs, and this chapter with the full column set. The goals shift accordingly: instead of a single test statistic, the tools in this chapter look for joint structure, reduce dimension, or compare many means at once.
How are all the variables related to each other? Which rows are multivariate outliers? Can the set of variables be summarised by a smaller number of components? Do groups of rows cluster together? Each question has a standard R tool.
Correlation, PCA, and distance-based methods are sensitive to the units of measurement. Standardise (scale) numeric columns before applying any of them unless the variables are already on comparable scales.
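A minimal sketch of the scaling step, assuming a data frame of numeric columns named `dat` (the name is illustrative):

```r
# scale() centres each column to mean 0 and rescales to sd 1
dat_scaled <- scale(dat)
round(colMeans(dat_scaled), 10)   # effectively 0 for every column
apply(dat_scaled, 2, sd)          # exactly 1 for every column
```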
14.2 Visualising Many Variables
Before any multivariate model, the column set is scanned with pictures. A pairs plot places every pairwise scatter on one grid; a correlation heat map shows the full matrix at a glance; a parallel-coordinate plot traces each row as a line across standardised axes.
The pairs plot is for shape and outliers across pairs; the heat map is for strength and sign of association at a glance; parallel coordinates are for spotting rows that move together or against the crowd.
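A base-R sketch of the three views, again assuming a numeric data frame `dat`; `parcoord()` lives in the MASS package, which ships with R:

```r
pairs(dat)                       # pairwise scatter grid: shape and outliers
heatmap(cor(dat), symm = TRUE)   # correlation heat map: strength and sign
MASS::parcoord(scale(dat))       # parallel coordinates on standardised axes
```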
14.3 Correlation Matrix and Partial Correlation
The correlation matrix generalises Chapter 13’s bivariate correlation to many variables at once. A partial correlation isolates the association between two variables after removing the linear effect of a third, which is the multivariate answer to Simpson-like reversals.
The raw correlation between spend and value is what you see at a glance. The partial correlation is what remains after tenure is accounted for. A large drop warns that the pairwise picture was being inflated by a shared third variable.
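A sketch using the spend/value/tenure example above, still assuming the data sit in `dat`. The residual trick computes the partial correlation in base R; dedicated packages such as ppcor wrap the same idea:

```r
round(cor(dat), 2)   # full correlation matrix in one call

# Partial correlation of spend and value, controlling for tenure:
# correlate the parts of each variable that tenure cannot explain
r_spend <- resid(lm(spend ~ tenure, data = dat))
r_value <- resid(lm(value ~ tenure, data = dat))
cor(r_spend, r_value)   # compare with cor(dat$spend, dat$value)
```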
14.4 Multicollinearity and VIF
When several predictors move together, their individual coefficients in a regression become unstable. The variance inflation factor (VIF) quantifies how much a predictor’s variance is inflated by its linear dependence on the others; rule-of-thumb values above 5 or 10 signal trouble.
A high VIF says the predictor is linearly explained by the others in the model, not that it should be dropped. The remedy depends on purpose: drop one of the pair, combine them into an index, or switch to a regularised model.
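A sketch of the VIF computation from first principles, assuming a fitted model such as `fit <- lm(y ~ x1 + x2 + x3, data = dat)` (names illustrative); `car::vif()` returns the same numbers if the car package is available:

```r
# VIF for each predictor: regress it on the other predictors,
# then VIF = 1 / (1 - R^2) of that auxiliary regression
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # predictors, intercept dropped
  sapply(colnames(X), function(v) {
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
    1 / (1 - r2)
  })
}
vif_by_hand(fit)   # values above 5 or 10 flag unstable coefficients
```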
14.5 Mahalanobis Distance
A univariate outlier lives far from the mean on one axis; a multivariate outlier lives far from the mean cloud in the full space, after accounting for correlation between variables. Mahalanobis distance (Mahalanobis 1936) measures that multivariate distance and is compared to a chi-square cutoff.
A row flagged by Mahalanobis might be a data-entry error, a legitimate extreme, or a signal of a sub-population. Only after inspection should the analyst decide to drop, keep, or model the row separately.
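A sketch of the distance and its chi-square cutoff, assuming the numeric data frame `dat` from earlier:

```r
d2  <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))
cut <- qchisq(0.999, df = ncol(dat))   # conservative cutoff on squared distance
which(d2 > cut)                        # rows to inspect, not to drop automatically
```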
14.6 Principal Component Analysis
PCA (Pearson 1901, Hotelling 1933) replaces a set of correlated numeric variables with a smaller set of uncorrelated components that preserve most of the variance. It is used for compact visualisation, for collinearity reduction before modelling, and for spotting latent directions in the data.
The scree plot shows how variance is distributed across components; the elbow is where extra components start adding little. The biplot shows both rows (points) and variables (arrows) on the same axes, so direction and strength of each variable in PC space are visible at once.
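A `prcomp()` sketch on the same hypothetical `dat`; centring and scaling are requested explicitly so the units decision is visible in the code:

```r
pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)                     # proportion of variance per component
screeplot(pca, type = "lines")   # scree plot: look for the elbow
biplot(pca)                      # rows as points, variables as arrows
```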
14.7 k-Means Clustering
k-means partitions rows into k groups so that each row is closest to its group centroid. The number of clusters k is chosen by the analyst, often guided by an elbow in the within-cluster sum-of-squares curve.
k-means is sensitive to the initial centroid placement. Always set nstart to 10 or more so R picks the best of several random starts; the option is cheap and stabilises the result.
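A sketch of the elbow curve and the final fit, assuming standardised data; k = 3 is illustrative, not a recommendation:

```r
set.seed(1)          # k-means starts are random
num <- scale(dat)

# Elbow: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k)
  kmeans(num, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "k", ylab = "Total within-cluster SS")

km <- kmeans(num, centers = 3, nstart = 25)   # best of 25 random starts
table(km$cluster)                             # cluster sizes
```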
14.8 Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) that merges rows step by step based on a distance matrix. It does not require k in advance; instead, the analyst cuts the tree at the height that produces a sensible number of clusters.
Ward’s method tends to produce compact, similarly sized clusters and is a safe default. Complete linkage emphasises separation; single linkage is prone to chaining. The choice is part of the multivariate plan and should be reported.
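A sketch of the standard pipeline on the hypothetical `dat`; the cut at k = 3 is illustrative:

```r
d  <- dist(scale(dat))                # Euclidean distances on standardised columns
hc <- hclust(d, method = "ward.D2")   # Ward's method: compact, similar-sized clusters
plot(hc)                              # dendrogram: inspect before cutting
groups <- cutree(hc, k = 3)           # cut the tree into three clusters
table(groups)
```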
14.9 MANOVA
MANOVA extends ANOVA to several numeric response variables measured on the same units. It asks whether the joint vector of means differs across groups, using Wilks’ lambda (Wilks 1932) or related statistics. It is the multivariate counterpart of Chapter 13’s one-way ANOVA.
A significant MANOVA says the mean vector differs across groups. Follow-up univariate ANOVAs (one per response) and pre-planned contrasts show which responses drive the effect. Apply the multiple-testing discipline introduced in Chapter 11.
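A sketch with two hypothetical responses `y1` and `y2` and a grouping factor `g` in `dat`:

```r
fit <- manova(cbind(y1, y2) ~ g, data = dat)
summary(fit, test = "Wilks")   # joint test: do the mean vectors differ?
summary.aov(fit)               # follow-up univariate ANOVAs, one per response
```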
14.10 Choosing a Multivariate Tool
The right tool depends on the question: describe joint structure, flag multivariate outliers, reduce dimension, discover groups of rows, or compare mean vectors across known groups.
```mermaid
flowchart TD
    A[Many variables] --> B{Question}
    B -->|How are they related?| C[Correlation matrix or partial correlation]
    B -->|Which rows are unusual?| D[Mahalanobis distance]
    B -->|Can I summarise them?| E[PCA]
    B -->|Do rows group?| F{Known groups?}
    F -->|No| G[k-means or hierarchical clustering]
    F -->|Yes, compare means| H[MANOVA]
    B -->|Predictors too collinear?| I[VIF diagnostic]
```
Multivariate studies go wrong when several tools are applied without a pre-set question. Pick the question first; the diagram then picks the tool.
14.11 Reporting Multivariate Findings
A multivariate report uses the same six-section skeleton from Chapters 11 to 13 and lists the full variable set instead of one or two names.
1. Question and variable set
2. Sample and scaling decision
3. Diagnostic views (pairs, heat map, Mahalanobis)
4. Tool output (matrix, PCA loadings, cluster labels, MANOVA table)
5. Effect or structure summary (variance explained, silhouette, Wilks’ lambda)
6. Business decision with caveats

Keeping this skeleton stable across Chapters 11 to 14 makes univariate, bivariate, and multivariate studies directly comparable.
14.12 Summary
| Concept | Description |
|---|---|
| Setup and Landscape | |
| Multivariate visualisation | Pairs plot, correlation heat map, parallel coordinates at scale |
| Scaling decision | Standardise numeric columns before correlation, PCA, or distance tools |
| Relationships at Scale | |
| Correlation matrix | All pairwise correlations in a single cor() call |
| Partial correlation | Correlation between two variables after removing the effect of a third |
| VIF diagnostic | Flags predictors that are linearly explained by the others |
| Outliers and Dimension Reduction | |
| Mahalanobis distance | Multivariate distance from the mean cloud, compared to chi-square cutoff |
| prcomp PCA | Principal components via prcomp with centring and scaling |
| Scree and variance explained | How variance is distributed across components; elbow guides retention |
| Biplot | Row points and variable arrows on the same PC axes |
| Grouping Structure | |
| k-means clustering | Partition rows around centroids with a chosen k |
| Hierarchical with hclust | Dendrogram built from a distance matrix; cut the tree to get clusters |
| Cluster-count guidance | Elbow plot for k-means, dendrogram cut for hierarchical |
| Multivariate Tests and Reporting | |
| MANOVA | Compares mean vectors across groups using Wilks’ lambda |
| Six-section multivariate report | Question, variable set, diagnostic, tool, structure, decision |