13 Bivariate Data Analysis with R
13.1 Bivariate Analysis in Context
Bivariate analysis examines two variables together. The test that is appropriate depends on the measurement-level pair: two numeric variables, a numeric and a categorical variable, or two categorical variables. Chapter 12 applied the confirmatory framework of Chapter 11 to one variable at a time; this chapter extends the same framework to pairs.
Numeric with numeric is the domain of correlation and simple regression. Numeric with categorical is the domain of t-tests, ANOVA, and their rank-based counterparts. Categorical with categorical is the domain of chi-square and Fisher’s exact tests.
The same pair of variables can be tested in several ways, but the measurement-level pair is the first filter. Everything in this chapter attaches to one of the three pair types above.
13.2 Visualising Two Variables
Each pair type has a standard first picture. Two numeric variables use a scatterplot. A numeric variable with a categorical one uses a grouped boxplot. Two categorical variables use a mosaic plot.
The picture usually tells you which test is sensible before any p-value is computed. A curved scatter rules out Pearson; a boxplot with unequal spreads warns that Welch, not pooled, is the right t-test.
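As a sketch with simulated data (the variable names spend, visits, region, and churn are hypothetical), base R draws each first picture in a single call:

```r
# Simulated example data; the variable names are hypothetical
set.seed(42)
spend  <- rnorm(120, mean = 50, sd = 10)           # numeric
visits <- 0.4 * spend + rnorm(120, sd = 5)         # numeric, related to spend
region <- factor(sample(c("North", "South"), 120, replace = TRUE))
churn  <- factor(sample(c("yes", "no"), 120, replace = TRUE))

plot(spend, visits)               # numeric vs numeric: scatterplot
boxplot(visits ~ region)          # numeric vs categorical: grouped boxplot
mosaicplot(table(region, churn))  # categorical vs categorical: mosaic plot
```

In a non-interactive script the plots go to the default graphics device; wrap the calls in png() and dev.off() to save them to files.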
13.3 Correlation: Pearson, Spearman, Kendall
Correlation quantifies how two numeric variables move together. Pearson’s r measures linear association and assumes approximate bivariate normality. Spearman’s rho (Spearman 1904) and Kendall’s tau (Kendall 1938) work on ranks and measure monotone association, which is safer when the relationship is not linear or the variables are ordinal.
Use Pearson for linear relationships between approximately normal variables, Spearman when either variable is ordinal or the relationship is monotone but not linear, and Kendall when the sample is small or ties are frequent. All three report a coefficient and a p-value, with a 95 percent confidence interval where the method provides one.
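A minimal sketch with simulated data: cor.test runs all three methods through a single argument, and for Pearson it also reports a confidence interval.

```r
set.seed(1)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)   # roughly linear relationship

cor.test(x, y, method = "pearson")   # r with t statistic, p-value, 95% CI
cor.test(x, y, method = "spearman")  # rho; rank-based, monotone association
cor.test(x, y, method = "kendall")   # tau; robust with ties and small samples
```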
13.4 Simple Linear Regression
Simple linear regression fits a line y = a + bx that minimises the sum of squared residuals. It is the parametric extension of Pearson correlation: it adds a point estimate of the slope, an intercept, and an R-squared that indicates the share of variance in y explained by x.
A slope coefficient says how y changes on average with x in this sample, not that x causes y. Causal claims require a design that controls or randomises confounders.
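A sketch on simulated data where the true line is known, so the fitted coefficients can be checked against it:

```r
set.seed(2)
x <- runif(80, min = 0, max = 10)
y <- 3 + 1.5 * x + rnorm(80, sd = 2)   # true intercept 3, true slope 1.5

fit <- lm(y ~ x)
summary(fit)   # slope and intercept with standard errors, plus R-squared
confint(fit)   # 95% confidence intervals for both coefficients
```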
13.5 Independent Two-Sample t-Test
The independent two-sample t-test asks whether the means of a numeric variable differ between two groups. R’s t.test defaults to Welch’s form, which does not assume equal variances and is the right default in most business settings.
The pooled two-sample t assumes equal variances. Welch relaxes that assumption with a degrees-of-freedom correction and loses almost nothing when variances are equal. Prefer Welch unless you have a strong reason to pool.
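A sketch with simulated groups of unequal spread: the first call is the Welch default, the second forces the pooled form for comparison.

```r
set.seed(3)
a <- rnorm(40, mean = 100, sd = 15)   # group A
b <- rnorm(35, mean = 108, sd = 25)   # group B, larger spread

t.test(a, b)                     # Welch form: the default, var.equal = FALSE
t.test(a, b, var.equal = TRUE)   # pooled form: assumes equal variances
```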
13.6 Paired t-Test
The paired t-test applies when each observation in one group is matched with an observation in the other: the same customer measured before and after, the same store on weekdays and weekends. It tests whether the mean of the within-pair differences is zero.
When the matching is informative, the paired test is substantially more powerful than the independent-groups test. When it is not, the paired test is simply the wrong model for the data.
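A sketch with simulated before-and-after measurements on the same units; the paired call is identical to a one-sample test on the differences.

```r
set.seed(4)
before <- rnorm(30, mean = 60, sd = 8)
after  <- before + rnorm(30, mean = 3, sd = 4)   # same units, measured twice

t.test(after, before, paired = TRUE)   # mean of within-pair differences vs zero
t.test(after - before)                 # equivalent one-sample formulation
```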
13.7 Non-Parametric Group Comparisons
When distributions are skewed or sample sizes are small, rank-based alternatives replace the t-test and ANOVA. The Wilcoxon rank-sum test (Mann and Whitney 1947) is the two-group counterpart; Kruskal-Wallis (Kruskal and Wallis 1952) is the multi-group counterpart.
These tests compare distributions via ranks, not means. A significant Wilcoxon or Kruskal-Wallis result says the distributions are shifted, not specifically that the means differ.
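A sketch on simulated skewed (exponential) data, where the rank-based tests are the safer choice:

```r
set.seed(5)
g1 <- rexp(25, rate = 1)     # skewed group
g2 <- rexp(25, rate = 0.5)   # skewed group with a larger scale
wilcox.test(g1, g2)          # two-group rank-sum (Mann-Whitney) test

score <- c(rexp(20, 1), rexp(20, 0.7), rexp(20, 0.4))
group <- factor(rep(c("A", "B", "C"), each = 20))
kruskal.test(score ~ group)  # rank-based comparison across three groups
```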
13.8 One-Way ANOVA
One-way ANOVA tests whether the means of a numeric variable differ across three or more groups. It is the parametric extension of the two-sample t-test and assumes approximately normal residuals with roughly equal variance across groups.
A significant F says at least one group mean differs from the others. It does not say which groups differ. For a disciplined next step, plan the specific contrasts in advance and apply an adjustment for the number of comparisons, as covered in Chapter 11.
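A sketch with three simulated groups. TukeyHSD shown here is the all-pairwise adjusted follow-up; pre-planned contrasts with a correction, as Chapter 11 describes, remain the more disciplined route when specific comparisons were chosen in advance.

```r
set.seed(6)
sales <- c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 11))
store <- factor(rep(c("A", "B", "C"), each = 20))

fit <- aov(sales ~ store)
summary(fit)    # F statistic and p-value for the overall comparison
TukeyHSD(fit)   # all-pairwise follow-up with a multiplicity adjustment
```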
13.9 Tests of Independence for Two Categorical Variables
For two categorical variables, the chi-square test of independence (Pearson 1900) compares the observed two-way table with the table expected under independence. When expected counts are small, Fisher’s exact test (Fisher 1934) is preferred.
The usual rule of thumb is to switch to Fisher’s exact test when any expected count falls below about five. For larger tables and larger samples, chi-square is both adequate and faster.
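A sketch on a hypothetical 2 x 2 table: inspect the expected counts first, then run whichever test they justify.

```r
# Hypothetical 2 x 2 table of plan type by churn
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(plan  = c("basic", "premium"),
                              churn = c("yes", "no")))
chisq.test(tab)$expected   # check the approximation: all counts >= 5 here
chisq.test(tab)            # chi-square test of independence

small <- matrix(c(3, 1, 2, 6), nrow = 2)
fisher.test(small)         # exact test when expected counts are small
```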
13.10 Effect Size for Bivariate Tests
Every bivariate test has a companion effect size that turns the result into a magnitude. Pearson r and R squared quantify linear relationships; Cohen’s d summarises a two-group mean difference; eta-squared summarises an ANOVA; Cramér’s V summarises a chi-square table.
Report r and R-squared with correlation and regression; Cohen’s d with a two-sample t-test; eta-squared with ANOVA; Cramér’s V with chi-square. Each is on a scale that readers can compare across studies.
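Base R has no built-in Cohen’s d or Cramér’s V (contributed packages such as effectsize provide them), but both are short enough to compute by hand. A sketch with illustrative helper functions:

```r
# Cohen's d for two independent groups (pooled-SD form)
cohens_d <- function(a, b) {
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                    (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled_sd
}

# Cramér's V from a two-way table
cramers_v <- function(tab) {
  chi2 <- chisq.test(tab, correct = FALSE)$statistic
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

set.seed(7)
cohens_d(rnorm(40, 10, 2), rnorm(40, 11, 2))    # standardised mean difference
cramers_v(matrix(c(30, 10, 20, 40), nrow = 2))  # association strength, 0 to 1
```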
13.11 Choosing a Bivariate Test
The measurement-level pair is the main switch. The diagram below names the tool for each cell of that switch.
```mermaid
flowchart TD
    A[Two variables] --> B{Pair type}
    B -->|Numeric and Numeric| C{Relationship shape}
    C -->|Linear, approx. normal| D[Pearson, lm]
    C -->|Monotone or ordinal| E[Spearman or Kendall]
    B -->|Numeric and Categorical| F{Number of groups}
    F -->|Two independent| G[Welch t-test or Wilcoxon rank-sum]
    F -->|Two paired| H[Paired t-test or Wilcoxon signed-rank]
    F -->|Three or more| I[One-way ANOVA or Kruskal-Wallis]
    B -->|Categorical and Categorical| J{Expected counts}
    J -->|All at least five| K[chisq.test]
    J -->|Any below five| L[fisher.test]
```
Each test brings assumptions that should still be checked once the variable pair is identified. The diagram narrows the choice; the diagnostic confirms it.
13.12 Reporting Bivariate Findings
A bivariate report reuses the six-section skeleton from Chapters 11 and 12 and names two variables instead of one.
(1) Question and variable pair, (2) sample and design (independent, paired, observational), (3) diagnostic view and assumption check, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision with caveats. Keeping the structure stable across chapters lets a reader compare a univariate and a bivariate study at a glance.
13.13 Summary
| Concept | Description |
|---|---|
| Setup and Landscape | |
| Bivariate visualisation | Scatter, grouped boxplot, mosaic for the three pair types |
| Variable-type pairing | Measurement-level pair drives the test family |
| Two Numeric Variables | |
| Pearson correlation | Linear association under approximate bivariate normality |
| Spearman and Kendall | Monotone, rank-based association; safer for ordinal or skewed data |
| Simple linear regression | Fitted line plus R-squared; extends correlation with slope and intercept |
| Numeric by Categorical | |
| Independent t-test (Welch) | Mean difference between two independent groups; no equal-variance assumption |
| Paired t-test | Mean of within-pair differences; use when each observation is matched |
| Wilcoxon rank-sum | Non-parametric two-group comparison on ranks |
| One-way ANOVA | Three-or-more-group parametric comparison of means |
| Kruskal-Wallis | Non-parametric counterpart of one-way ANOVA |
| Two Categorical Variables | |
| Chi-square independence | Observed vs expected counts in a two-way table |
| Fisher's exact test | Exact counterpart for small expected counts in 2 × 2 tables |
| Effect Size and Reporting | |
| Pearson r and R-squared | Effect size for correlation and simple regression |
| Cohen's d two-sample | Standardised mean difference between two groups |
| Eta-squared for ANOVA | Share of variance explained by the grouping factor |
| Cramér's V for chi-square | Strength of association between two categorical variables |
| Six-section bivariate report | Question, pair, design, diagnostic, test, effect, decision |