11  Confirmatory Data Analysis

11.1 What Confirmatory Data Analysis Is

Confirmatory Data Analysis (CDA) is the phase of an analytics workflow in which pre-specified hypotheses are tested against data that has not been used to generate them. The goal is to decide whether a pattern is compatible with chance variation or whether the evidence is strong enough to treat the pattern as a finding. Unlike exploration, CDA is rule-bound: the test, the threshold, and the data split are chosen before the result is seen.

Note: A working definition

CDA is a planned, rule-bound examination of whether data contradict a specific hypothesis about the population, using a chosen test statistic, a chosen significance level, and a protocol fixed before the test is run.

Tip: CDA is evidential, not decorative

A confirmatory result is read as “the data are (or are not) consistent with the null at this level,” not as “the effect has been proven.” The statement is about evidence, not certainty.

11.2 CDA Among the Types of Analysis

Descriptive analysis summarises what has been observed. Exploratory analysis scans the data for unexpected patterns and generates candidate hypotheses. Confirmatory analysis takes one of those hypotheses, sharpens it into a testable claim, and evaluates it against evidence in a disciplined way.

Important: Three stances, three outputs

Descriptive yields summaries. Exploratory yields leads. Confirmatory yields decisions. Each stance is legitimate on its own, and each protects the others from overreach.

Warning: Do not confirm on the same data used to explore

Running a test on the exact dataset that suggested the pattern inflates the apparent evidence. Confirmation should use held-out data, a new sample, or a pre-registered plan.

11.3 The Confirmatory Workflow

A disciplined confirmatory study moves through a fixed sequence: state the question, translate it into hypotheses, choose a test and a significance level, collect or hold out the data, compute the test statistic, read off the p-value and the interval, and report the decision with its effect size.

flowchart LR
    A[Research question] --> B[Hypotheses H0 and H1]
    B --> C[Test and alpha fixed]
    C --> D[Data collection or holdout]
    D --> E[Test statistic]
    E --> F[p-value and CI]
    F --> G[Decision with effect size]
    G --> H[Report]

Note: What makes the workflow confirmatory

The order is the discipline. Each step is committed to before the next is evaluated, which removes the most common routes to false-positive inflation.

11.4 Framing Hypotheses

Every confirmatory test pairs a null hypothesis (H0) with an alternative (H1). The null is the no-effect baseline that the test is set up to reject; the alternative is the statement the analyst expects the data to support. A one-sided alternative is used when direction is fixed in advance; a two-sided alternative is the default when either direction is of interest.
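As a sketch of the sidedness choice in Python (the sample here is simulated purely for illustration), scipy.stats exposes it through the `alternative` argument of its tests:

```python
# Illustrative sketch: the same sample tested under a two-sided and a
# one-sided alternative. popmean=0.0 encodes H0; the data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.3, scale=1.0, size=50)  # true mean is 0.3

# H0: mu = 0 vs H1: mu != 0 (two-sided, the default)
two_sided = stats.ttest_1samp(sample, popmean=0.0)

# H0: mu = 0 vs H1: mu > 0 (direction committed to in advance)
one_sided = stats.ttest_1samp(sample, popmean=0.0, alternative="greater")

print(f"two-sided p = {two_sided.pvalue:.4f}, one-sided p = {one_sided.pvalue:.4f}")
```

When the observed statistic falls in the pre-specified direction, the one-sided p-value is exactly half the two-sided one, which is why the choice must be fixed before the data are seen.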

Tip: Phrase H0 so that a clear rejection is meaningful

If H0 cannot be falsified by any plausible result, the test cannot carry information. A sharp, directional null is easier to interpret than a vague one.

Warning: Choose sidedness before seeing the data

Flipping from two-sided to one-sided after inspecting the direction of the effect doubles the effective significance level without disclosure.

11.5 Test Statistics and Sampling Distributions

A test statistic is a single number computed from the sample that summarises how far the sample deviates from the null. Its sampling distribution describes the range of values that statistic would take if H0 were true and the study were repeated many times. The p-value is read from the tails of that distribution (Fisher 1925).

Note: The standard error is the width of this distribution

The standard error is the standard deviation of the sampling distribution. It falls as the square root of n, which is why larger samples produce sharper tests.
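The square-root law is easy to check by simulation. This sketch (simulated normal data with an assumed sigma of 2) compares the empirical spread of sample means against sigma divided by the square root of n:

```python
# Sketch: the standard error of the mean shrinks as 1/sqrt(n).
# sigma = 2.0 is an assumed population standard deviation.
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
for n in (25, 100, 400):
    # Spread of 5000 simulated sample means at this n
    means = rng.normal(0.0, sigma, size=(5000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical SE={means.std(ddof=1):.4f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```

Quadrupling n halves the standard error, which is the sense in which larger samples buy sharper tests.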

11.6 The p-Value

The p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed. A small p-value indicates that the observed result would be unusual if H0 were true; it does not say how large the effect is, and it does not say H0 is false with probability one minus p.

Important: What the p-value is not

A p-value of 0.03 does not mean the probability that H0 is true is 3 percent. It is a tail probability computed assuming H0, not a probability statement about the hypothesis itself.
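The tail-probability definition can be made concrete by simulating the null: draw many samples with H0 true and count how often the statistic comes out at least as extreme as the one observed. This sketch (simulated data, one-sample t statistic) checks the simulated tail count against the analytic p-value:

```python
# Sketch: a p-value as a tail probability computed under H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
sample = rng.normal(0.0, 1.0, size=n)               # H0 happens to be true
t_obs, p_analytic = stats.ttest_1samp(sample, popmean=0.0)

# Sampling distribution of t under H0, by brute force
null = rng.normal(0.0, 1.0, size=(10000, n))
t_null = null.mean(axis=1) / (null.std(axis=1, ddof=1) / np.sqrt(n))
p_sim = np.mean(np.abs(t_null) >= abs(t_obs))       # "at least as extreme"

print(f"analytic p = {p_analytic:.3f}, simulated p = {p_sim:.3f}")
```

The two numbers agree because both answer the same conditional question: how unusual is this statistic if H0 holds.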

11.7 Significance, Type I, Type II Errors, and Power

Before the test is run, the analyst fixes a significance level alpha, which is the tolerated rate of false rejections. A Type I error rejects a true null; a Type II error fails to reject a false null. Power is the probability of rejecting a false null at a given effect size and sample size, and equals one minus the Type II error rate (Neyman and Pearson 1933).

Tip: Lower alpha, lower power for a fixed n

Cutting alpha reduces false positives but also reduces the chance of catching a real effect. Sample size is the lever that restores power once alpha is set.
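Power is straightforward to estimate by simulation: generate data with the alternative true and count how often the test rejects. All numbers in this sketch (effect size, n, alpha) are illustrative assumptions:

```python
# Sketch: estimating power by simulation under an assumed effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, d, n, reps = 0.05, 0.5, 30, 4000

samples = rng.normal(loc=d, scale=1.0, size=(reps, n))  # H1 true: effect d
pvals = np.array([stats.ttest_1samp(x, popmean=0.0).pvalue for x in samples])
power = np.mean(pvals < alpha)

print(f"estimated power at d={d}, n={n}, alpha={alpha}: {power:.2f}")
```

With these settings the estimate lands around three-quarters; as the tip notes, the lever that pushes it higher once alpha is fixed is sample size.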

11.8 Effect Size

A p-value answers whether an effect is distinguishable from noise; an effect size answers how large it is. Cohen’s d standardises a mean difference by its pooled standard deviation; Pearson’s r summarises a linear relationship. A significant result with a negligible effect size is statistically real but practically uninteresting.

Note: Rough Cohen benchmarks

Cohen (1988) offered 0.2, 0.5, 0.8 as small, medium, and large d values. They are rules of thumb for orientation, not thresholds for decisions.
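A pooled-SD Cohen's d takes only a few lines to compute. The two groups below are small illustrative arrays, not data from the text:

```python
# Sketch: Cohen's d with a pooled standard deviation.
import numpy as np

def cohens_d(a, b):
    """Standardised mean difference (a minus b) using the pooled SD."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

group_a = np.array([5.1, 5.6, 6.0, 6.3, 6.8])
group_b = np.array([4.4, 4.9, 5.2, 5.7, 6.1])
print(f"d = {cohens_d(group_a, group_b):.2f}")
```

Pearson's r for a linear relationship is available directly as `np.corrcoef(x, y)[0, 1]`; both quantities belong in the report alongside the p-value.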

11.9 Confidence Intervals

A 95 percent confidence interval is a range constructed so that, across many repetitions of the study, 95 percent of such intervals would contain the true parameter. A CI communicates both the point estimate and the precision around it, and is often more informative than a p-value on its own.

Important: CI and test are two sides of one coin

If a 95 percent CI for a mean difference excludes zero, the corresponding two-sided test at alpha 0.05 rejects H0. The interval adds information: magnitude and direction.
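The repeated-sampling reading of a 95 percent CI can be verified directly: simulate many studies, build the t-based interval each time, and count how often the true mean is covered. All parameters in this sketch are illustrative assumptions:

```python
# Sketch: checking the coverage of a 95% t-based confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 40, 2000  # assumed true mean, SD, sample size

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=x.mean(), scale=stats.sem(x))
    covered += lo <= mu <= hi

coverage = covered / reps
print(f"coverage over {reps} repetitions: {coverage:.3f}")
```

The count settles near 0.95, which is exactly the claim the interval makes: a statement about the procedure across repetitions, not about any single interval.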

11.10 Parametric vs Non-Parametric Framing

Parametric tests assume the data come from a family of distributions described by a small number of parameters, typically normal with a given mean and variance. Non-parametric tests relax that assumption and work on ranks or on resamples. The choice is part of the confirmatory plan and should be justified from what is known about the data, not picked after inspecting the sample.

Tip: When each is the natural choice

Parametric tests are sharper when the distributional assumption is met. Non-parametric tests are safer when the distribution is unknown, the sample is small, or the variable is ordinal. The specific tests are covered in chapters 12 to 14.
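The two framings answer the same confirmatory question under different assumptions. This sketch runs one of each on simulated groups with an assumed one-SD shift; the individual tests are treated properly in later chapters, so the point here is only that the choice between them is made once, in the plan:

```python
# Sketch: a parametric and a rank-based two-group comparison side by side.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(1.0, 1.0, size=50)   # assumed shift of one SD

t_res = stats.ttest_ind(a, b)                              # parametric
u_res = stats.mannwhitneyu(a, b, alternative="two-sided")  # non-parametric

print(f"t-test p = {t_res.pvalue:.4f}, Mann-Whitney p = {u_res.pvalue:.4f}")
```

With a clear effect both agree; the cases where they diverge are exactly the cases where the distributional assumption, and therefore the pre-specified choice, matters.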

Warning: Do not switch framing after seeing the p-value

Running a parametric test, seeing a borderline result, and switching to a non-parametric test (or the reverse) is a form of selective reporting that biases the evidence.

11.11 Multiple Testing

When many hypotheses are tested on the same data, the chance of at least one false positive rises quickly. Adjustments control either the family-wise error rate (Bonferroni) or the false discovery rate (Benjamini and Hochberg 1995). The adjustment should be pre-specified together with the number of tests.

Note: Two different error targets

Bonferroni controls the probability of any false positive across the family. Benjamini-Hochberg controls the expected proportion of false positives among the rejections. The first is stricter; the second tolerates some false positives in exchange for power.
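Both adjustments are short enough to write out from their definitions, which makes the difference in their targets concrete. The p-values in this sketch are illustrative:

```python
# Sketch: Bonferroni (FWER) vs Benjamini-Hochberg (FDR) from their definitions.
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0_i when p_i < alpha / m: controls the family-wise error rate."""
    p = np.asarray(pvals)
    return p < alpha / len(p)

def bh_reject(pvals, alpha=0.05):
    """Step-up BH: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= (k/m) * alpha: controls the false discovery rate."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.max(np.nonzero(passes)[0])   # largest rank meeting the bound
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.010, 0.020, 0.030, 0.200]
print("Bonferroni rejects:", bonferroni_reject(pvals).sum())
print("BH rejects:        ", bh_reject(pvals).sum())
```

Bonferroni applies one threshold, alpha over m, to every test; BH's threshold rises with rank, which is where its extra power comes from.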

11.12 Reporting Confirmatory Findings

A confirmatory report states the hypothesis, the planned test and level, the sample and any pre-registration, the test statistic with its degrees of freedom, the p-value, the effect size, and the confidence interval. The narrative explains the business decision that the evidence supports or fails to support.

Tip: Six-section CDA report

(1) Question and hypotheses, (2) design and pre-registration notes, (3) sample and assumptions, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision and caveats. This mirrors the six-section EDA report from the previous chapter and keeps the two stages comparable.

Note: Reproducibility via Quarto

The same Quarto workflow that carries an EDA report also carries a CDA report: the tests, the sample, and the adjustments live in code, so the decision can be rerun by anyone with access to the document.

11.13 Common CDA Mistakes

Most confirmatory failures are protocol failures, not test failures. Running many tests and reporting only the significant ones (p-hacking), changing the hypothesis after seeing the data (HARKing), and reporting a significant p-value without its effect size are the three that appear most often in business settings.

Warning: Six recurring mistakes

(1) Testing on the data used to find the pattern, (2) reading the p-value as the probability that H0 is true, (3) ignoring effect size, (4) switching sidedness after seeing the direction, (5) silent multiple testing, (6) failing to report non-significant results. Each one weakens the confirmatory claim.

11.14 Summary

Summary of confirmatory-analysis concepts introduced in this chapter

Stance and Landscape
- What CDA is: rule-bound test of a pre-specified hypothesis on held-out data
- CDA vs EDA: EDA generates hypotheses; CDA tests them
- CDA vs descriptive: descriptive summarises; CDA decides whether a claim holds

Hypotheses and Test Machinery
- Null hypothesis H0: the no-effect baseline that the test is set up to reject
- Alternative hypothesis H1: the claim the analyst expects the data to support
- One-sided vs two-sided: two-sided by default; one-sided only when direction is fixed in advance
- Test statistic: single number summarising sample deviation from the null
- Sampling distribution: distribution of the test statistic under H0 across repeated sampling

Errors, Power, Effect, Intervals
- Significance level alpha: tolerated false-positive rate, fixed before the test
- Type I error: rejecting a null that is actually true
- Type II error: failing to reject a null that is actually false
- Power: probability of rejecting a false null at a given effect size and n
- Effect size: standardised magnitude of effect; Cohen's d, Pearson's r
- Confidence interval: range capturing the true parameter across repeated sampling

Parametric Choice and Multiple Testing
- Parametric framing: assumes a named distribution family; sharper when met
- Non-parametric framing: works on ranks or resamples; safer under weak assumptions
- Bonferroni correction: family-wise error rate control; strict
- Benjamini-Hochberg FDR: false discovery rate control; tolerates some false positives for power

Reporting and Cautions
- Six-section CDA report: question, design, sample, test, effect, decision
- P-hacking caution: selective reporting inflates false positives