```mermaid
flowchart LR
    A[Research question] --> B[Hypotheses H0 and H1]
    B --> C[Test and alpha fixed]
    C --> D[Data collection or holdout]
    D --> E[Test statistic]
    E --> F[p-value and CI]
    F --> G[Decision with effect size]
    G --> H[Report]
```
11 Confirmatory Data Analysis
11.1 What Confirmatory Data Analysis Is
Confirmatory Data Analysis (CDA) is the phase of an analytics workflow in which pre-specified hypotheses are tested against data that has not been used to generate them. The goal is to decide whether a pattern is compatible with chance variation or whether the evidence is strong enough to treat the pattern as a finding. Unlike exploration, CDA is rule-bound: the test, the threshold, and the data split are chosen before the result is seen.
CDA is a planned, rule-bound examination of whether data contradict a specific hypothesis about the population, using a chosen test statistic, a chosen significance level, and a protocol fixed before the test is run.
A confirmatory result is read as “the data are (or are not) consistent with the null at this level,” not as “the effect has been proven.” The statement is about evidence, not certainty.
11.2 CDA Among the Types of Analysis
Descriptive analysis summarises what has been observed. Exploratory analysis scans the data for unexpected patterns and generates candidate hypotheses. Confirmatory analysis takes one of those hypotheses, sharpens it into a testable claim, and evaluates it against evidence in a disciplined way.
Descriptive yields summaries. Exploratory yields leads. Confirmatory yields decisions. Each stance is legitimate on its own, and each protects the others from overreach.
Running a test on the exact dataset that suggested the pattern inflates the apparent evidence. Confirmation should use held-out data, a new sample, or a pre-registered plan.
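The holdout itself can be committed in code before any exploration begins. A minimal Python sketch (the function name and split fraction are illustrative):

```python
import random

def split_holdout(records, confirm_frac=0.5, seed=42):
    """Split data before exploration: one part for generating hypotheses,
    the held-out part reserved for the confirmatory test."""
    rng = random.Random(seed)        # fixed seed: the split is committed once
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - confirm_frac))
    return shuffled[:cut], shuffled[cut:]    # (explore, confirm)

explore, confirm = split_holdout(list(range(100)))
```

Because the seed is fixed and the split happens before any pattern is seen, the confirmation set cannot be contaminated by the exploration that suggested the hypothesis.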
11.3 The Confirmatory Workflow
A disciplined confirmatory study moves through a fixed sequence: state the question, translate it into hypotheses, choose a test and a significance level, collect or hold out the data, compute the test statistic, read off the p-value and the interval, and report the decision with its effect size.
The order is the discipline. Each step is committed to before the next is evaluated, which removes the most common routes to false-positive inflation.
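One way to make that commitment concrete is to freeze the plan in code before the data are touched. A Python sketch with illustrative field names and an invented example question:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfirmatoryPlan:
    """Everything fixed before the result is seen (field names illustrative)."""
    question: str
    h0: str
    h1: str
    test: str
    alpha: float
    sided: str

plan = ConfirmatoryPlan(
    question="Does the new checkout raise conversion?",
    h0="conversion_new == conversion_old",
    h1="conversion_new > conversion_old",
    test="two-sample z-test for proportions",
    alpha=0.05,
    sided="one-sided",
)
# frozen=True: attempting to edit the plan after seeing the data raises an error
```

The frozen dataclass is only a metaphor made executable, but it captures the point: once the protocol object exists, alpha and sidedness cannot be quietly revised.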
11.4 Framing Hypotheses
Every confirmatory test pairs a null hypothesis (H0) with an alternative (H1). The null is the no-effect baseline that the test is set up to reject; the alternative is the statement the analyst expects the data to support. A one-sided alternative is used when direction is fixed in advance; a two-sided alternative is the default when either direction is of interest.
If H0 cannot be falsified by any plausible result, the test cannot carry information. A sharp, directional null is easier to interpret than a vague one.
Flipping from two-sided to one-sided after inspecting the direction of the effect doubles the effective significance level without disclosure.
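The arithmetic behind that warning is direct: for a symmetric test statistic, the two-sided p-value is twice the one-sided tail. A Python sketch using the standard normal (the observed z is an illustrative value):

```python
import math

def z_tail(z):
    # P(Z >= z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8                       # observed statistic, in the expected direction
p_one = z_tail(z)             # one-sided, direction fixed in advance
p_two = 2 * z_tail(abs(z))    # two-sided default
# Switching to one-sided after seeing the sign halves p,
# which doubles the effective significance level.
```

With z = 1.8 the one-sided p is about 0.036 and the two-sided p about 0.072, so an undisclosed switch turns a non-significant result into a significant one at alpha 0.05.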
11.5 Test Statistics and Sampling Distributions
A test statistic is a single number computed from the sample that summarises how far the sample deviates from the null. Its sampling distribution describes the range of values that statistic would take if H0 were true and the study were repeated many times. The p-value is read from the tail of that distribution (Fisher 1925).
The standard error is the standard deviation of the sampling distribution. It falls as the square root of n, which is why larger samples produce sharper tests.
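The square-root law can be checked by simulation. A Python sketch, with sigma and the sample sizes chosen for illustration:

```python
import math, random, statistics

random.seed(0)
sigma = 2.0
for n in (25, 100, 400):
    # empirical spread of the sample mean over 2000 repeated samples
    means = [statistics.fmean(random.gauss(0, sigma) for _ in range(n))
             for _ in range(2000)]
    se = sigma / math.sqrt(n)          # analytic standard error
    print(n, round(statistics.stdev(means), 3), round(se, 3))
```

Quadrupling n halves the standard error: the analytic column is 0.4, 0.2, 0.1, and the empirical column tracks each value.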
11.6 The p-Value
The p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed. A small p-value indicates that the observed result would be unusual if H0 were true; it does not say how large the effect is, and it does not say H0 is false with probability one minus p.
A p-value of 0.03 does not mean the probability that H0 is true is 3 percent. It is a tail probability computed assuming H0, not a probability statement about the hypothesis itself.
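What the p-value does guarantee can be seen by simulating a true null: p-values are then uniformly distributed, so about alpha of them fall below alpha. A Python sketch using a known-variance z-test (sample size and repetition count are illustrative):

```python
import math, random, statistics

random.seed(1)

def p_two_sided(z):
    # 2 * P(Z >= |z|) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

n, reps, alpha = 50, 4000, 0.05
rejections = 0
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(0, 1) for _ in range(n))
    z = xbar * math.sqrt(n)           # H0 is true: mean 0, sigma 1 known
    if p_two_sided(z) < alpha:
        rejections += 1
print(rejections / reps)              # close to alpha = 0.05
```

The rejection rate hovers near 5 percent, which is exactly the false-positive rate that alpha promises, and nothing more.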
11.7 Significance, Type I, Type II Errors, and Power
Before the test is run, the analyst fixes a significance level alpha, which is the tolerated rate of false rejections. A Type I error rejects a true null; a Type II error fails to reject a false null. Power is the probability of rejecting a false null at a given effect size and sample size, and equals one minus the Type II error rate (Neyman and Pearson 1933).
Cutting alpha reduces false positives but also reduces the chance of catching a real effect. Sample size is the lever that restores power once alpha is set.
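The trade-off can be made quantitative. For a two-sided z-test at alpha 0.05, power is approximately Phi(d sqrt(n) - 1.96), ignoring the negligible far tail. A Python sketch (effect and sample sizes are illustrative):

```python
import math

def phi(x):
    # standard normal CDF
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_z(d, n, z_crit=1.96):
    # approximate power of a two-sided z-test at alpha = 0.05 for
    # standardised effect size d; the far-tail term is ignored
    return phi(d * math.sqrt(n) - z_crit)

for n in (20, 50, 100, 200):
    print(n, round(power_z(0.5, n), 3))
```

At a medium effect of d = 0.5, power climbs from roughly 0.61 at n = 20 to above 0.99 at n = 200; the conventional 0.80 target is crossed between n = 20 and n = 50.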
11.8 Effect Size
A p-value answers whether an effect is distinguishable from noise; an effect size answers how large it is. Cohen’s d standardises a mean difference by its pooled standard deviation; Pearson’s r summarises a linear relationship. A significant result with a negligible effect size is statistically real but practically uninteresting.
Cohen (1988) offered 0.2, 0.5, 0.8 as small, medium, and large d values. They are rules of thumb for orientation, not thresholds for decisions.
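Both measures take only a few lines to compute. A Python sketch with made-up numbers:

```python
import math, statistics

def cohens_d(a, b):
    # mean difference standardised by the pooled standard deviation
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def pearson_r(x, y):
    # linear association between paired observations
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

d = cohens_d([5.1, 4.8, 5.6, 5.0, 5.3, 4.9],
             [4.6, 4.4, 5.0, 4.5, 4.8, 4.3])   # large by Cohen's rule of thumb
r = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.9])
```

On these invented samples d comes out around 1.9, well past Cohen's "large" threshold, and r is close to 1 because the paired values rise almost linearly.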
11.9 Confidence Intervals
A 95 percent confidence interval is a range constructed so that, across many repetitions of the study, 95 percent of such intervals would contain the true parameter. A CI communicates both the point estimate and the precision around it, and is often more informative than a p-value on its own.
If a 95 percent CI for a mean difference excludes zero, the corresponding two-sided test at alpha 0.05 rejects H0. The interval adds information: magnitude and direction.
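The repeated-sampling reading can be checked directly: build many intervals from fresh samples and count how often they cover the true mean. A Python sketch, using a known sigma for simplicity:

```python
import math, random, statistics

random.seed(2)
mu, sigma, n, reps = 10.0, 3.0, 40, 2000
covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    half = 1.96 * sigma / math.sqrt(n)    # known-sigma 95 percent half-width
    m = statistics.fmean(sample)
    if m - half <= mu <= m + half:
        covered += 1
print(covered / reps)                     # close to 0.95
```

Any single interval either contains mu or it does not; the 95 percent is a property of the procedure across repetitions, which is what the simulation shows.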
11.10 Parametric vs Non-Parametric Framing
Parametric tests assume the data come from a family of distributions described by a small number of parameters, typically normal with a given mean and variance. Non-parametric tests relax that assumption and work on ranks or on resamples. The choice is part of the confirmatory plan and should be justified from what is known about the data, not picked after inspecting the sample.
Parametric tests are sharper when the distributional assumption is met. Non-parametric tests are safer when the distribution is unknown, the sample is small, or the variable is ordinal. The specific tests are covered in chapters 12 to 14.
Running a parametric test, seeing a borderline result, and switching to a non-parametric test (or the reverse) is a form of selective reporting that biases the evidence.
11.11 Multiple Testing
When many hypotheses are tested on the same data, the chance of at least one false positive rises quickly. Adjustments control either the family-wise error rate (Bonferroni) or the false discovery rate (Benjamini and Hochberg 1995). The adjustment should be pre-specified together with the number of tests.
Bonferroni controls the probability of any false positive across the family. Benjamini-Hochberg controls the expected proportion of false positives among the rejections. The first is stricter; the second tolerates some false positives in exchange for power.
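Both adjustments are a few lines each. A Python sketch on six made-up p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i if p_i <= alpha / m: family-wise error rate control."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha: false discovery rate control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.010, 0.016, 0.041, 0.20, 0.74]   # illustrative values
```

On these six p-values Bonferroni rejects one hypothesis (only 0.001 clears 0.05/6) while Benjamini-Hochberg rejects three, which is the strictness-versus-power trade described above in miniature.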
11.12 Reporting Confirmatory Findings
A confirmatory report states the hypothesis, the planned test and level, the sample and any pre-registration, the test statistic with its degrees of freedom, the p-value, the effect size, and the confidence interval. The narrative explains the business decision that the evidence supports or fails to support.
(1) Question and hypotheses, (2) design and pre-registration notes, (3) sample and assumptions, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision and caveats. This mirrors the six-section EDA report from the previous chapter and keeps the two stages comparable.
The same Quarto workflow that carries an EDA report also carries a CDA report: the tests, the sample, and the adjustments live in code, so the decision can be rerun by anyone with access to the document.
11.13 Common CDA Mistakes
Most confirmatory failures are protocol failures, not test failures. Running many tests and reporting only the significant ones (p-hacking), changing the hypothesis after seeing the data (HARKing), and reporting a significant p-value without its effect size are the three that appear most often in business settings.
(1) Testing on the data used to find the pattern, (2) reading the p-value as the probability that H0 is true, (3) ignoring effect size, (4) switching sidedness after seeing the direction, (5) silent multiple testing, (6) failing to report non-significant results. Each one weakens the confirmatory claim.
11.14 Summary
| Concept | Description |
|---|---|
| Stance and Landscape | |
| What CDA is | Rule-bound test of a pre-specified hypothesis on held-out data |
| CDA vs EDA | EDA generates hypotheses; CDA tests them |
| CDA vs descriptive | Descriptive summarises; CDA decides whether a claim holds |
| Hypotheses and Test Machinery | |
| Null hypothesis H0 | The no-effect baseline that the test is set up to reject |
| Alternative hypothesis H1 | The claim the analyst expects the data to support |
| One-sided vs two-sided | Two-sided by default; one-sided only when direction is fixed in advance |
| Test statistic | Single number summarising sample deviation from the null |
| Sampling distribution | Distribution of the test statistic under H0 across repeated sampling |
| Errors, Power, Effect, Intervals | |
| Significance level alpha | Tolerated false-positive rate, fixed before the test |
| Type I error | Rejecting a null that is actually true |
| Type II error | Failing to reject a null that is actually false |
| Power | Probability of rejecting a false null at a given effect size and n |
| Effect size | Standardised magnitude of effect; Cohen's d, Pearson's r |
| Confidence interval | Range capturing the true parameter across repeated sampling |
| Parametric Choice and Multiple Testing | |
| Parametric framing | Assumes a named distribution family; sharper when met |
| Non-parametric framing | Works on ranks or resamples; safer under weak assumptions |
| Bonferroni correction | Family-wise error rate control; strict |
| Benjamini-Hochberg FDR | False discovery rate control; tolerates some false positives for power |
| Reporting and Cautions | |
| Six-section CDA report | Question, design, sample, test, effect, decision |
| P-hacking caution | Selective reporting inflates false positives |