```mermaid
flowchart LR
    A[Research question] --> B[Hypotheses H0 and H1]
    B --> C[Test and alpha fixed]
    C --> D[Data collection or holdout]
    D --> E[Test statistic]
    E --> F[p-value and CI]
    F --> G[Decision with effect size]
    G --> H[Report]
```
11 Confirmatory Data Analysis
11.1 What Confirmatory Data Analysis Is
Confirmatory Data Analysis (CDA) is the phase of an analytics workflow in which pre-specified hypotheses are tested against data that has not been used to generate them. The goal is to decide whether a pattern is compatible with chance variation or whether the evidence is strong enough to treat the pattern as a finding. Unlike exploration, CDA is rule-bound: the test, the threshold, and the data split are chosen before the result is seen.
CDA is a planned, rule-bound examination of whether data contradict a specific hypothesis about the population, using a chosen test statistic, a chosen significance level, and a protocol fixed before the test is run.
A confirmatory result is read as “the data are (or are not) consistent with the null at this level,” not as “the effect has been proven.” The statement is about evidence, not certainty.
11.2 CDA Among the Types of Analysis
Descriptive analysis summarises what has been observed. Exploratory analysis scans the data for unexpected patterns and generates candidate hypotheses. Confirmatory analysis takes one of those hypotheses, sharpens it into a testable claim, and evaluates it against evidence in a disciplined way.
Descriptive yields summaries. Exploratory yields leads. Confirmatory yields decisions. Each stance is legitimate on its own, and each protects the others from overreach.
Running a test on the exact dataset that suggested the pattern inflates the apparent evidence. Confirmation should use held-out data, a new sample, or a pre-registered plan.
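The holdout itself can be committed in code before any exploration begins. A minimal Python sketch (the function name and split fraction are illustrative):

```python
import random

def split_holdout(records, confirm_frac=0.5, seed=42):
    """Split data before exploration: one part for generating hypotheses,
    the held-out part reserved for the confirmatory test."""
    rng = random.Random(seed)        # fixed seed: the split is committed once
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - confirm_frac))
    return shuffled[:cut], shuffled[cut:]    # (explore, confirm)

explore, confirm = split_holdout(list(range(100)))
```

Because the seed is fixed and the split happens before any pattern is seen, the confirmation set cannot be contaminated by the exploration that suggested the hypothesis.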
11.3 The Confirmatory Workflow
A disciplined confirmatory study moves through a fixed sequence: state the question, translate it into hypotheses, choose a test and a significance level, collect or hold out the data, compute the test statistic, read off the p-value and the interval, and report the decision with its effect size.
The order is the discipline. Each step is committed to before the next is evaluated, which removes the most common routes to false-positive inflation.
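One way to make that commitment concrete is to freeze the plan in code before the data are touched. A Python sketch with illustrative field names and an invented example question:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfirmatoryPlan:
    """Everything fixed before the result is seen (field names illustrative)."""
    question: str
    h0: str
    h1: str
    test: str
    alpha: float
    sided: str

plan = ConfirmatoryPlan(
    question="Does the new checkout raise conversion?",
    h0="conversion_new == conversion_old",
    h1="conversion_new > conversion_old",
    test="two-sample z-test for proportions",
    alpha=0.05,
    sided="one-sided",
)
# frozen=True: attempting to edit the plan after seeing the data raises an error
```

The frozen dataclass is only a metaphor made executable, but it captures the point: once the protocol object exists, alpha and sidedness cannot be quietly revised.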
11.4 Framing Hypotheses
Every confirmatory test pairs a null hypothesis (H0) with an alternative (H1). The null is the no-effect baseline that the test is set up to reject; the alternative is the statement the analyst expects the data to support. A one-sided alternative is used when direction is fixed in advance; a two-sided alternative is the default when either direction is of interest.
If H0 cannot be falsified by any plausible result, the test cannot carry information. A sharp, directional null is easier to interpret than a vague one.
Flipping from two-sided to one-sided after inspecting the direction of the effect doubles the effective significance level without disclosure.
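The arithmetic behind that warning is direct: for a symmetric test statistic, the two-sided p-value is twice the one-sided tail. A Python sketch using the standard normal (the observed z is an illustrative value):

```python
import math

def z_tail(z):
    # P(Z >= z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8                       # observed statistic, in the expected direction
p_one = z_tail(z)             # one-sided, direction fixed in advance
p_two = 2 * z_tail(abs(z))    # two-sided default
# Switching to one-sided after seeing the sign halves p,
# which doubles the effective significance level.
```

With z = 1.8 the one-sided p is about 0.036 and the two-sided p about 0.072, so an undisclosed switch turns a non-significant result into a significant one at alpha 0.05.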
11.5 Test Statistics and Sampling Distributions
A test statistic is a single number computed from the sample that summarises how far the sample deviates from the null. Its sampling distribution describes the range of values that statistic would take if H0 were true and the study were repeated many times. The p-value is read from the tail of that distribution (Fisher 1925).
The standard error is the standard deviation of the sampling distribution. It falls as the square root of n, which is why larger samples produce sharper tests.
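The square-root law can be checked by simulation. A Python sketch, with sigma and the sample sizes chosen for illustration:

```python
import math, random, statistics

random.seed(0)
sigma = 2.0
for n in (25, 100, 400):
    # empirical spread of the sample mean over 2000 repeated samples
    means = [statistics.fmean(random.gauss(0, sigma) for _ in range(n))
             for _ in range(2000)]
    se = sigma / math.sqrt(n)          # analytic standard error
    print(n, round(statistics.stdev(means), 3), round(se, 3))
```

Quadrupling n halves the standard error: the analytic column is 0.4, 0.2, 0.1, and the empirical column tracks each value.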
11.6 The p-Value
The p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed. A small p-value indicates that the observed result would be unusual if H0 were true; it does not say how large the effect is, and it does not say H0 is false with probability one minus p.
A p-value of 0.03 does not mean the probability that H0 is true is 3 percent. It is a tail probability computed assuming H0, not a probability statement about the hypothesis itself.
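What the p-value does guarantee can be seen by simulating a true null: p-values are then uniformly distributed, so about alpha of them fall below alpha. A Python sketch using a known-variance z-test (sample size and repetition count are illustrative):

```python
import math, random, statistics

random.seed(1)

def p_two_sided(z):
    # 2 * P(Z >= |z|) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

n, reps, alpha = 50, 4000, 0.05
rejections = 0
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(0, 1) for _ in range(n))
    z = xbar * math.sqrt(n)           # H0 is true: mean 0, sigma 1 known
    if p_two_sided(z) < alpha:
        rejections += 1
print(rejections / reps)              # close to alpha = 0.05
```

The rejection rate hovers near 5 percent, which is exactly the false-positive rate that alpha promises, and nothing more.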
11.7 Significance, Type I, Type II Errors, and Power
Before the test is run, the analyst fixes a significance level alpha, which is the tolerated rate of false rejections. A Type I error rejects a true null; a Type II error fails to reject a false null. Power is the probability of rejecting a false null at a given effect size and sample size, and equals one minus the Type II error rate (Neyman and Pearson 1933).
Cutting alpha reduces false positives but also reduces the chance of catching a real effect. Sample size is the lever that restores power once alpha is set.
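The trade-off can be made quantitative. For a two-sided z-test at alpha 0.05, power is approximately Phi(d sqrt(n) - 1.96), ignoring the negligible far tail. A Python sketch (effect and sample sizes are illustrative):

```python
import math

def phi(x):
    # standard normal CDF
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_z(d, n, z_crit=1.96):
    # approximate power of a two-sided z-test at alpha = 0.05 for
    # standardised effect size d; the far-tail term is ignored
    return phi(d * math.sqrt(n) - z_crit)

for n in (20, 50, 100, 200):
    print(n, round(power_z(0.5, n), 3))
```

At a medium effect of d = 0.5, power climbs from roughly 0.61 at n = 20 to above 0.99 at n = 200; the conventional 0.80 target is crossed between n = 20 and n = 50.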
11.8 Effect Size
A p-value answers whether an effect is distinguishable from noise; an effect size answers how large it is. Cohen’s d standardises a mean difference by its pooled standard deviation; Pearson’s r summarises a linear relationship. A significant result with a negligible effect size is statistically real but practically uninteresting.
Cohen (1988) offered 0.2, 0.5, 0.8 as small, medium, and large d values. They are rules of thumb for orientation, not thresholds for decisions.
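Both measures take only a few lines to compute. A Python sketch with made-up numbers:

```python
import math, statistics

def cohens_d(a, b):
    # mean difference standardised by the pooled standard deviation
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def pearson_r(x, y):
    # linear association between paired observations
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

d = cohens_d([5.1, 4.8, 5.6, 5.0, 5.3, 4.9],
             [4.6, 4.4, 5.0, 4.5, 4.8, 4.3])   # large by Cohen's rule of thumb
r = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.9])
```

On these invented samples d comes out around 1.9, well past Cohen's "large" threshold, and r is close to 1 because the paired values rise almost linearly.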
11.9 Confidence Intervals
A 95 percent confidence interval is a range constructed so that, across many repetitions of the study, 95 percent of such intervals would contain the true parameter. A CI communicates both the point estimate and the precision around it, and is often more informative than a p-value on its own.
If a 95 percent CI for a mean difference excludes zero, the corresponding two-sided test at alpha 0.05 rejects H0. The interval adds information: magnitude and direction.
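The repeated-sampling reading can be checked directly: build many intervals from fresh samples and count how often they cover the true mean. A Python sketch, using a known sigma for simplicity:

```python
import math, random, statistics

random.seed(2)
mu, sigma, n, reps = 10.0, 3.0, 40, 2000
covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    half = 1.96 * sigma / math.sqrt(n)    # known-sigma 95 percent half-width
    m = statistics.fmean(sample)
    if m - half <= mu <= m + half:
        covered += 1
print(covered / reps)                     # close to 0.95
```

Any single interval either contains mu or it does not; the 95 percent is a property of the procedure across repetitions, which is what the simulation shows.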
11.10 Parametric vs Non-Parametric Framing
Parametric tests assume the data come from a family of distributions described by a small number of parameters, typically normal with a given mean and variance. Non-parametric tests relax that assumption and work on ranks or on resamples. The choice is part of the confirmatory plan and should be justified from what is known about the data, not picked after inspecting the sample.
Parametric tests are sharper when the distributional assumption is met. Non-parametric tests are safer when the distribution is unknown, the sample is small, or the variable is ordinal. The specific tests are covered in chapters 12 to 14.
Running a parametric test, seeing a borderline result, and switching to a non-parametric test (or the reverse) is a form of selective reporting that biases the evidence.
11.11 Multiple Testing
When many hypotheses are tested on the same data, the chance of at least one false positive rises quickly. Adjustments control either the family-wise error rate (Bonferroni) or the false discovery rate (Benjamini and Hochberg 1995). The adjustment should be pre-specified together with the number of tests.
Bonferroni controls the probability of any false positive across the family. Benjamini-Hochberg controls the expected proportion of false positives among the rejections. The first is stricter; the second tolerates some false positives in exchange for power.
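Both adjustments are a few lines each. A Python sketch on six made-up p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i if p_i <= alpha / m: family-wise error rate control."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha: false discovery rate control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.010, 0.016, 0.041, 0.20, 0.74]   # illustrative values
```

On these six p-values Bonferroni rejects one hypothesis (only 0.001 clears 0.05/6) while Benjamini-Hochberg rejects three, which is the strictness-versus-power trade described above in miniature.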
11.12 Reporting Confirmatory Findings
A confirmatory report states the hypothesis, the planned test and level, the sample and any pre-registration, the test statistic with its degrees of freedom, the p-value, the effect size, and the confidence interval. The narrative explains the business decision that the evidence supports or fails to support.
(1) Question and hypotheses, (2) design and pre-registration notes, (3) sample and assumptions, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision and caveats. This mirrors the six-section EDA report from the previous chapter and keeps the two stages comparable.
The same Quarto workflow that carries an EDA report also carries a CDA report: the tests, the sample, and the adjustments live in code, so the decision can be rerun by anyone with access to the document.
11.13 Common CDA Mistakes
Most confirmatory failures are protocol failures, not test failures. Running many tests and reporting only the significant ones (p-hacking), changing the hypothesis after seeing the data (HARKing), and reporting a significant p-value without its effect size are the three that appear most often in business settings.
(1) Testing on the data used to find the pattern, (2) reading the p-value as the probability that H0 is true, (3) ignoring effect size, (4) switching sidedness after seeing the direction, (5) silent multiple testing, (6) failing to report non-significant results. Each one weakens the confirmatory claim.
11.14 Summary
| Concept | Description |
|---|---|
| Stance and Landscape | |
| What CDA is | Rule-bound test of a pre-specified hypothesis on held-out data |
| CDA vs EDA | EDA generates hypotheses; CDA tests them |
| CDA vs descriptive | Descriptive summarises; CDA decides whether a claim holds |
| Hypotheses and Test Machinery | |
| Null hypothesis H0 | The no-effect baseline that the test is set up to reject |
| Alternative hypothesis H1 | The claim the analyst expects the data to support |
| One-sided vs two-sided | Two-sided by default; one-sided only when direction is fixed in advance |
| Test statistic | Single number summarising sample deviation from the null |
| Sampling distribution | Distribution of the test statistic under H0 across repeated sampling |
| Errors, Power, Effect, Intervals | |
| Significance level alpha | Tolerated false-positive rate, fixed before the test |
| Type I error | Rejecting a null that is actually true |
| Type II error | Failing to reject a null that is actually false |
| Power | Probability of rejecting a false null at a given effect size and n |
| Effect size | Standardised magnitude of effect; Cohen's d, Pearson's r |
| Confidence interval | Range capturing the true parameter across repeated sampling |
| Parametric Choice and Multiple Testing | |
| Parametric framing | Assumes a named distribution family; sharper when met |
| Non-parametric framing | Works on ranks or resamples; safer under weak assumptions |
| Bonferroni correction | Family-wise error rate control; strict |
| Benjamini-Hochberg FDR | False discovery rate control; tolerates some false positives for power |
| Reporting and Cautions | |
| Six-section CDA report | Question, design, sample, test, effect, decision |
| P-hacking caution | Selective reporting inflates false positives |