10  Exploratory Data Analysis

10.1 What Exploratory Data Analysis Is

Exploratory data analysis (EDA) is the stage at which an analyst inspects a cleaned dataset to uncover its structure, anomalies, and candidate patterns before any formal model is fitted or any hypothesis is tested. Its purpose is discovery rather than description or inference. The output of an EDA pass is a set of leads, diagnostics, and provisional findings that inform the method choice and the substantive questions for the remainder of the analysis.

Note: A working definition

Exploratory data analysis applies visual and numerical techniques to a cleaned dataset in order to surface its shape, its quality problems, and the relationships among its variables, without committing to any specific model or hypothesis.

Important: EDA findings are provisional

A pattern spotted in EDA is a lead, not a conclusion. Treating it as confirmed, and reporting it without an independent test, is the most common error at this stage and is the reason confirmatory analysis exists as a separate step.

10.2 EDA Among the Types of Analysis

EDA sits between descriptive analysis (Chapter 9) and confirmatory analysis (Chapter 11). The three differ in purpose, stance, and the kind of evidence they produce.

Note: Three stances, three outputs

Descriptive analysis summarises what is in the data: central tendency, dispersion, shape, frequency. Its output is a set of numbers and tables that characterise the sample. Exploratory analysis looks for patterns the analyst did not expect: segments, skew, outliers, non-linear relationships, confounded comparisons. Its output is a set of leads. Confirmatory analysis tests a pre-specified hypothesis on data the exploration did not touch. Its output is a decision: the effect is, or is not, supported.

Tip: The three are sequential, not alternatives

A careful study uses all three. Description establishes what is in the sample. Exploration nominates the effects worth testing. Confirmation decides whether each effect is real. Skipping exploration risks testing irrelevant hypotheses; skipping confirmation risks reporting leads as findings.

10.3 The Exploratory Workflow

EDA is iterative. A single exploratory session is a sequence of short cycles: ask a question, produce a view that bears on it, read what the view implies, formulate the next question. Most of the work happens in this loop.

flowchart LR
  A[Question] --> B[View]
  B --> C[Reading]
  C --> D[Refine]
  D --> A
  C --> E[Finding]

Note: Four habits of effective exploration

Assume nothing. Treat every column as unfamiliar on the first pass; do not rely on documentation alone. Look at everything. Plot each variable at least once and most variables twice. Iterate quickly. A rough chart that reveals a surprise beats a polished chart that reveals nothing. Record what you see. Annotate findings inline so the exploratory notebook becomes an audit trail.

10.4 Data Profiling

The first executable step in EDA is a data profile: how many rows, how many columns, of what types, with what missingness, and whether the primary key is unique. A short profile routinely catches pipeline problems that would otherwise survive into modelling.

Note: The minimum profile

Row and column counts (dim), column types (sapply(..., class)), per-column missingness (mean(is.na(.))), cardinality (length(unique(.))), primary-key uniqueness (duplicated(.)). Five calls; one table; one minute.
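The five calls can be collected into one short helper. The sketch below is a minimal version run on the built-in airquality dataset; the Day column stands in for a candidate primary key, and anyDuplicated is used as a compact substitute for the duplicated check.

```r
# A minimal base-R profiling helper: one row per column.
profile <- function(df) {
  data.frame(
    column      = names(df),
    type        = vapply(df, function(x) class(x)[1], character(1)),
    na_rate     = vapply(df, function(x) mean(is.na(x)), numeric(1)),
    cardinality = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names   = NULL
  )
}

dim(airquality)                # row and column counts
profile(airquality)            # types, missingness, cardinality in one table
anyDuplicated(airquality$Day)  # non-zero means the candidate key repeats
```

On airquality the profile immediately shows the missingness concentrated in Ozone and Solar.R, and the duplicate check shows that Day alone is not a key.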

10.5 Visualising Distribution Shape

Once the profile is clean, the next step is to look at the shape of every numeric column. Four compact views cover most cases: the histogram for the coarse shape, the density for a smoother version, the boxplot for centre, spread, and outliers, and the Q-Q plot for departure from normality.

Note: Chart-to-question alignment

Histogram: is the distribution roughly symmetric, skewed, or multimodal? Density: where does the mass sit, and are there overlapping populations? Boxplot: how far do extreme values sit from the bulk? Q-Q plot: how far does the distribution depart from normal?
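In base R each of the four views is one call; the sketch below puts them on a single panel, using the Ozone column of airquality as the example variable.

```r
# Four shape views of one numeric variable, side by side.
x <- na.omit(airquality$Ozone)

op <- par(mfrow = c(2, 2))
hist(x, main = "Histogram")                      # coarse shape
plot(density(x), main = "Density")               # smoothed shape
boxplot(x, horizontal = TRUE, main = "Boxplot")  # centre, spread, outliers
qqnorm(x); qqline(x)                             # departure from normality
par(op)
```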

Tip: When a log transform helps

A long right tail on a strictly positive variable (revenue, income, waiting time) usually looks approximately normal on a logarithmic scale. A quick hist(log(x)) often reveals structure that the linear scale hides.
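A minimal demonstration on simulated strictly positive data (the lognormal choice is illustrative, not a claim about any particular business variable):

```r
# Right-skewed positive data looks roughly symmetric on the log scale.
set.seed(1)
x <- rlnorm(1000)  # simulated revenue-like values with a long right tail

hist(x,      main = "Linear scale")  # mass piled near zero, sparse tail
hist(log(x), main = "Log scale")     # approximately bell-shaped
```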

10.6 Detecting Outliers

Outliers are of two kinds: data errors, which should be repaired or removed, and real extremes, which should be kept. EDA’s job is to count and locate them; the treatment decision belongs to preparation or to the model. Two detection rules are standard.

Note: IQR fence

A value is flagged if it falls more than 1.5 times the interquartile range below Q1 or above Q3. The rule is robust to skew because it uses quartiles rather than the mean, and is the default rule in boxplots.
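The fence translates directly into base R. The sketch below uses airquality$Ozone again; note that quantile's default quartiles differ slightly from the hinges boxplot uses, so counts can differ by a boundary value.

```r
# IQR fence: flag values beyond 1.5 * IQR outside the quartiles.
iqr_flags <- function(x) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

x <- na.omit(airquality$Ozone)
sum(iqr_flags(x))  # how many values fall outside the fence
x[iqr_flags(x)]    # which values they are
```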

Note: Z-score rule

A value is flagged if its absolute standardised score exceeds 3, where \(z = (x - \bar{x})/s\). Sensitive, simple, but inflates on long-tailed data because the standard deviation itself is pulled up by the extremes.
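The same pattern for the z-score rule, demonstrated on synthetic data with one planted extreme (the spike is illustrative):

```r
# Z-score rule: flag |z| > 3; note that the sd itself is inflated by the tails.
z_flags <- function(x, cut = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  abs(z) > cut
}

set.seed(2)
x <- c(rnorm(100), 50)  # one planted extreme among standard-normal noise
which(z_flags(x))       # only the planted point clears the threshold
```

On a genuinely long-tailed variable the inflation matters: the extremes pull the standard deviation up, so the rule flags fewer points than the IQR fence does on the same data.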

10.7 Exploring Pairwise Relationships

The natural EDA view of two continuous variables is a scatterplot; of a continuous variable split by a category, a set of side-by-side boxplots. Both answer a single question: does one variable shift with the other? A smoother (a straight or loess line) adds a visual trend without committing to a model.

Warning: Overplotting blurs patterns

A scatter of a few thousand points usually degenerates into a solid cloud. Use transparency (alpha), hexbin, or two-dimensional density estimates to see the structure inside the cloud.
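Two base-R remedies, shown on a simulated cloud of five thousand correlated points (ggplot2's alpha aesthetic or a hexbin layer achieve the same effect in those toolkits):

```r
# Overplotting: transparency and a 2-D density view of the same cloud.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

plot(x, y, pch = 16, col = rgb(0, 0, 0, 0.1))  # 10% opacity: density reads as darkness
lines(lowess(x, y), col = "red", lwd = 2)      # smoother over the cloud

smoothScatter(x, y)                            # kernel-density colour map
```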

10.8 Multivariate Exploration

When more than two variables are in play, two views scale well: the scatterplot matrix (every pair as a mini scatter) and the correlation heat map (every pair’s linear association as a coloured cell). Both are scanning tools, not tests; they tell the analyst where to look next.

Tip: Correlations are leads, not conclusions

A cell in the heat map says two variables move together; it does not say one causes the other, nor that the relationship is linear outside the sample. The confirmatory test lives in Chapter 13.
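Both views are available in base R; the image call below is a minimal stand-in for a polished heat map, run on the numeric columns of airquality:

```r
# Scanning tools: scatterplot matrix plus a minimal correlation heat map.
num <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

pairs(num)                                     # every pair as a mini scatter

cm <- cor(num, use = "pairwise.complete.obs")  # pairwise correlations despite NAs
round(cm, 2)

image(1:4, 1:4, cm, zlim = c(-1, 1),           # one coloured cell per pair
      axes = FALSE, xlab = "", ylab = "")
axis(1, 1:4, colnames(cm)); axis(2, 1:4, colnames(cm))
```

The matrix already nominates the leads: Ozone moves with Temp and against Wind, which is exactly the kind of pair worth carrying into confirmatory work.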

10.9 Exploring Time Patterns

Business data almost always carries a time signature: weekday effects, monthly cycles, trend, and occasional level shifts from promotions or policy changes. A short time scan at the EDA stage catches these before they appear as unexplained variance in a later model.
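A sketch of such a scan on a simulated 180-day series with a planted trend and weekend lift (the magnitudes are illustrative):

```r
# Time scan: trend line plus weekday means on a simulated daily series.
set.seed(7)
dates <- seq(as.Date("2024-01-01"), by = "day", length.out = 180)
wd    <- as.integer(format(dates, "%u"))  # ISO weekday: 1 = Mon ... 7 = Sun
sales <- 100 + 0.1 * seq_along(dates) +   # gentle upward trend
         ifelse(wd >= 6, 25, 0) +         # weekend lift
         rnorm(180, sd = 5)

plot(dates, sales, type = "l")                      # trend plus weekly ripple
abline(lm(sales ~ as.numeric(dates)), col = "red")  # the trend component
round(tapply(sales, wd, mean), 1)                   # weekday means expose the lift
```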

10.10 Reporting Exploratory Findings

An EDA pass produces many views and a smaller number of findings. The report that carries them forward follows a stable six-section structure so that reviewers and downstream analysts can read it quickly.

Note: Six-section EDA report

Overview (rows, columns, period, source). Quality profile (NA rates, duplicates, type drift). Distributional profile (shape of each numeric variable, frequency of each category). Relationship scan (correlations, cross-tabs). Subgroup patterns (segment, region, channel, period). Findings and actions (what preparation and modelling must handle).

Tip: Reproducibility via Quarto

A Quarto (.qmd) file keeps the code, the figures, and the commentary in one document that renders to HTML, PDF, or Word. Parameterising the YAML header produces one report per region, segment, or period from the same source file.
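A minimal parameterised header might look like the following (the region parameter and its default are hypothetical):

```yaml
---
title: "EDA report"
format: html
params:
  region: "EMEA"  # hypothetical default; R chunks read it as params$region
---
```

Rendering with quarto render report.qmd -P region:APAC then produces the APAC variant from the same source file.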

10.11 Common EDA Mistakes

Exploratory work has predictable failure modes. Most share a common root: treating a lead as if it were a conclusion. A standing demonstration of why summary statistics without plots can mislead is Anscombe’s quartet (Anscombe 1973): four datasets with nearly identical means, standard deviations, and correlations, yet very different shapes.
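The quartet ships with base R as the anscombe data frame, so the demonstration takes a few lines:

```r
# Anscombe's quartet: near-identical summaries, very different shapes.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), r = cor(x, y))
})
round(stats, 2)  # the four columns barely differ

op <- par(mfrow = c(2, 2))
for (i in 1:4)
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
par(op)
```

The summary table shows four essentially identical columns; the four scatterplots show a linear cloud, a curve, an outlier-dominated line, and a vertical stack.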

Warning: Six recurring mistakes

Trusting summary statistics alone. Identical means and SDs can hide very different shapes; always plot. Cherry-picking views. Running twenty charts and reporting the striking one is selection bias. Ignoring small cells. A segment with five rows is not a segment. Chart junk. Three-dimensional effects and heavy gridlines reduce clarity. Over-plotting. Thousands of points merge into a blob; use transparency, binning, or density. Presenting exploratory findings as confirmatory. An EDA lead is a candidate for testing, not a conclusion on its own.

10.12 Summary

Summary of exploratory-analysis concepts introduced in this chapter

Stance and Landscape
EDA deliverables: leads, diagnostics, provisional findings for later analysis
EDA vs descriptive: descriptive summarises what is there; EDA discovers what is unexpected
EDA vs confirmatory: EDA generates hypotheses; confirmatory tests them on untouched data

Profiling
Data profile: rows, columns, types, NA%, cardinality, duplicate-key check
Missingness scan: column-wise NA percentage; drives imputation decisions
Cardinality: number of unique values; flags rare levels and ID-like columns

Distribution and Outliers
Histogram, density, boxplot: shape views for a numeric variable at a glance
Q-Q plot: diagnostic for departure from normality
IQR fence: flags values beyond Q1 minus 1.5 IQR or Q3 plus 1.5 IQR
Z-score flag: flags absolute standardised scores above three

Relationships and Subgroups
Scatter with smoother: bivariate view with a visible trend, not yet a model
Scatterplot matrix: every pair of numeric variables as a mini scatter
Correlation heat map: coloured matrix of pairwise linear associations; a scanning tool
Time-pattern scan: weekday, month, and trend view for a dated series

Reporting and Cautions
Six-section report: overview, quality, distributions, relationships, subgroups, findings
Anscombe's lesson: identical summary statistics can sit behind very different shapes

Exploratory data analysis is the stage at which a cleaned dataset becomes an understood dataset. It produces the questions worth asking of the data, the quality problems worth fixing before modelling, and the candidate patterns worth testing in confirmatory work. Its results are never final on their own; they are the bridge from description to inference.