```mermaid
flowchart LR
    A[Question] --> B[View]
    B --> C[Reading]
    C --> D[Refine]
    D --> A
    C --> E[Finding]
```
# 10 Exploratory Data Analysis
## 10.1 What Exploratory Data Analysis Is
Exploratory data analysis (EDA) is the stage at which an analyst inspects a cleaned dataset to uncover its structure, anomalies, and candidate patterns before any formal model is fitted or any hypothesis is tested. Its purpose is discovery rather than description or inference. The output of an EDA pass is a set of leads, diagnostics, and provisional findings that inform the method choice and the substantive questions for the remainder of the analysis.
Exploratory data analysis applies visual and numerical techniques to a cleaned dataset in order to surface its shape, its quality problems, and the relationships among its variables, without committing to any specific model or hypothesis.
A pattern spotted in EDA is a lead, not a conclusion. Treating it as confirmed, and reporting it without an independent test, is the most common error at this stage and is the reason confirmatory analysis exists as a separate step.
## 10.2 EDA Among the Types of Analysis
EDA sits between descriptive analysis (Chapter 9) and confirmatory analysis (Chapter 11). The three differ in purpose, stance, and the kind of evidence they produce.
Descriptive analysis summarises what is in the data: central tendency, dispersion, shape, frequency. Its output is a set of numbers and tables that characterise the sample. Exploratory analysis looks for patterns the analyst did not expect: segments, skew, outliers, non-linear relationships, confounded comparisons. Its output is a set of leads. Confirmatory analysis tests a pre-specified hypothesis on data the exploration did not touch. Its output is a decision: the effect is, or is not, supported.
A careful study uses all three. Description establishes what is in the sample. Exploration nominates the effects worth testing. Confirmation decides whether each effect is real. Skipping exploration risks testing irrelevant hypotheses; skipping confirmation risks reporting leads as findings.
## 10.3 The Exploratory Workflow
EDA is iterative. A single exploratory session is a sequence of short cycles: ask a question, produce a view that bears on it, read what the view implies, formulate the next question. Most of the work happens in this loop.
- **Assume nothing.** Treat every column as unfamiliar on the first pass; do not rely on documentation alone.
- **Look at everything.** Plot each variable at least once and most variables twice.
- **Iterate quickly.** A rough chart that reveals a surprise beats a polished chart that reveals nothing.
- **Record what you see.** Annotate findings inline so the exploratory notebook becomes an audit trail.
## 10.4 Data Profiling
The first executable step in EDA is a data profile: how many rows, how many columns, of what types, with what missingness, and whether the primary key is unique. A short profile routinely catches pipeline problems that would otherwise survive into modelling.
Row and column counts (`dim()`), column types (`sapply(..., class)`), per-column missingness (`mean(is.na(.))`), cardinality (`length(unique(.))`), primary-key uniqueness (`duplicated(.)`). Five calls; one table; one minute.
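Put together, the profile is a handful of one-liners. The sketch below runs them on a toy data frame; `df` and the key column `id` are placeholders for your own table. Note that `unique()` counts `NA` as a value, so cardinality includes missingness.

```r
# Minimal data profile on a toy table; `df` and `id` stand in for real data.
df <- data.frame(
  id     = c(1, 2, 3, 4, 4),
  amount = c(10, NA, 30, 40, 50),
  region = c("A", "B", "A", NA, "B")
)

dim(df)                                    # rows and columns
sapply(df, class)                          # column types
sapply(df, function(x) mean(is.na(x)))     # per-column NA rate
sapply(df, function(x) length(unique(x)))  # cardinality (NA counts as a value)
any(duplicated(df$id))                     # TRUE here: the key is not unique
```

On this toy table the duplicate-key check fires immediately, which is exactly the kind of pipeline problem a one-minute profile is meant to catch.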
## 10.5 Visualising Distribution Shape
Once the profile is clean, the next step is to look at the shape of every numeric column. Four compact views cover most cases: the histogram for the coarse shape, the density for a smoother version, the boxplot for centre, spread, and outliers, and the Q-Q plot for departure from normality.
- **Histogram:** is the distribution roughly symmetric, skewed, or multimodal?
- **Density:** where does the mass sit, and are there overlapping populations?
- **Boxplot:** how far do extreme values sit from the bulk?
- **Q-Q plot:** how far does the distribution depart from normal?
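The four views fit on one screen with base graphics. The sketch below uses simulated right-skewed data (`rlnorm` standing in for something like revenue); swap in your own column.

```r
# Four shape views of one numeric variable, on simulated right-skewed data.
set.seed(42)
x <- rlnorm(1000, meanlog = 3, sdlog = 0.8)

op <- par(mfrow = c(2, 2))
hist(x, main = "Histogram")                       # coarse shape
plot(density(x), main = "Density")                # smoothed shape
boxplot(x, horizontal = TRUE, main = "Boxplot")   # centre, spread, outliers
qqnorm(x); qqline(x)                              # departure from normality
par(op)
```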
A long right tail on a strictly positive variable (revenue, income, waiting time) usually looks approximately normal on a logarithmic scale. A quick `hist(log(x))` often reveals structure that the linear scale hides.
## 10.6 Detecting Outliers
Outliers are of two kinds: data errors, which should be repaired or removed, and real extremes, which should be kept. EDA’s job is to count and locate them; the treatment decision belongs to preparation or to the model. Two detection rules are standard.
**IQR fence.** A value is flagged if it falls more than 1.5 times the interquartile range below Q1 or above Q3. The rule is robust to skew because it uses quartiles rather than the mean, and it is the default whisker rule in boxplots.
**Z-score flag.** A value is flagged if its absolute standardised score exceeds 3, where \(z = (x - \bar{x})/s\). The rule is simple and sensitive, but it inflates on long-tailed data because the standard deviation itself is pulled up by the extremes.
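The contrast between the two rules shows up clearly on skewed data. The sketch below applies both to a simulated long-tailed sample with one injected extreme; the data are illustrative, the rules are the two defined above.

```r
# Both outlier rules on the same skewed sample with one injected extreme.
set.seed(7)
x <- c(rlnorm(200), 50)                 # long right tail plus one big value

# IQR fence: beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * IQR(x)
iqr_flag <- x < q[1] - fence | x > q[2] + fence

# Z-score rule: |z| > 3
z_flag <- abs((x - mean(x)) / sd(x)) > 3

sum(iqr_flag)   # the robust rule flags the long tail
sum(z_flag)     # the z rule flags fewer: sd is inflated by the extreme
```

Both rules catch the injected value, but the z-score rule misses most of the tail because that same value has inflated the standard deviation.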
## 10.7 Exploring Pairwise Relationships
The natural EDA view of two continuous variables is a scatterplot; of a continuous variable split by a category, a set of side-by-side boxplots. Both answer a single question: does one variable shift with the other? A smoother (a straight or loess line) adds a visual trend without committing to a model.
A scatter of a few thousand points usually degenerates into a solid cloud. Use transparency (alpha), hexbin, or two-dimensional density estimates to see the structure inside the cloud.
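A base-R sketch of both points, on a deliberately non-linear simulated pair: a transparent scatter so the cloud stays readable, and `lowess()` for a visible trend that commits to no model.

```r
# Scatter with transparency plus a smoother, on simulated non-linear data.
set.seed(3)
n <- 5000
x <- rnorm(n)
y <- 0.5 * x^2 + rnorm(n)               # curved relationship, noisy

plot(x, y, pch = 16, col = rgb(0, 0, 0, alpha = 0.05))  # alpha thins the cloud
lines(lowess(x, y), col = "red", lwd = 2)               # trend without a model
```

The smoother traces the curve even though the linear correlation of `x` and `y` is close to zero, which is precisely the kind of lead a correlation table alone would miss.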
## 10.8 Multivariate Exploration
When more than two variables are in play, two views scale well: the scatterplot matrix (every pair as a mini scatter) and the correlation heat map (every pair’s linear association as a coloured cell). Both are scanning tools, not tests; they tell the analyst where to look next.
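Both views need no extra packages. The sketch below scans four variables of the built-in `mtcars` dataset: `pairs()` for the scatterplot matrix, and `cor()` with `image()` as a bare-bones heat map.

```r
# Scanning pass over four mtcars variables: scatterplot matrix + heat map.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

pairs(vars)                      # every pair as a mini scatter
cm <- cor(vars)                  # pairwise linear associations
round(cm, 2)

k <- ncol(cm)
image(1:k, 1:k, cm, axes = FALSE, xlab = "", ylab = "")  # coloured cells
axis(1, at = 1:k, labels = colnames(cm))
axis(2, at = 1:k, labels = colnames(cm))
```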
A cell in the heat map says two variables move together; it does not say one causes the other, nor that the relationship is linear outside the sample. The confirmatory test lives in Chapter 13.
## 10.9 Exploring Time Patterns
Business data almost always carries a time signature: weekday effects, monthly cycles, trend, and occasional level shifts from promotions or policy changes. A short time scan at the EDA stage catches these before they appear as unexplained variance in a later model.
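A short scan needs only `tapply()` over calendar groupings and a line plot. The sketch below simulates a daily series with a built-in weekend dip and upward trend; the names and effect sizes are placeholders.

```r
# Weekday and month scan on a simulated daily series (all names illustrative).
set.seed(11)
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day")
wkend <- format(dates, "%u") %in% c("6", "7")   # ISO weekday: locale-independent
sales <- 100 + 0.05 * seq_along(dates) -        # slow upward trend
  20 * wkend +                                  # weekend dip
  rnorm(length(dates), sd = 5)                  # noise

tapply(sales, format(dates, "%u"), mean)   # mean by weekday number (1 = Monday)
tapply(sales, months(dates), mean)         # mean by month
plot(dates, sales, type = "l")             # trend and level shifts at a glance
```

The weekday table exposes the weekend dip, the month table the trend; both would otherwise surface only as unexplained variance in a later model.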
## 10.10 Reporting Exploratory Findings
An EDA pass produces many views and a smaller number of findings. The report that carries them forward follows a stable six-section structure so that reviewers and downstream analysts can read it quickly.
1. **Overview** (rows, columns, period, source).
2. **Quality profile** (NA rates, duplicates, type drift).
3. **Distributional profile** (shape of each numeric variable, frequency of each category).
4. **Relationship scan** (correlations, cross-tabs).
5. **Subgroup patterns** (segment, region, channel, period).
6. **Findings and actions** (what preparation and modelling must handle).
A Quarto (.qmd) file keeps the code, the figures, and the commentary in one document that renders to HTML, PDF, or Word. Parameterising the YAML header produces one report per region, segment, or period from the same source file.
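A parameterised header might look like the following sketch; the parameter name `region` and its default value are illustrative.

```yaml
---
title: "EDA report"
format: html
params:
  region: "North"
---
```

Inside R chunks the value is read as `params$region`, and `quarto render report.qmd -P region:South` produces the South report from the same source file.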
## 10.11 Common EDA Mistakes
Exploratory work has predictable failure modes. Most share a common root: treating a lead as if it were a conclusion. A standing demonstration of why summary statistics without plots can mislead is Anscombe’s quartet (Anscombe 1973): four datasets with nearly identical means, standard deviations, and correlations, yet very different shapes.
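The quartet ships with base R as the `anscombe` dataset, so the demonstration is a few lines: the summaries nearly coincide, while a plot of each pair shows four different shapes.

```r
# Anscombe's quartet: near-identical summaries, very different shapes.
means <- sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]]))
cors  <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(means, 2)   # all about 7.50
round(cors, 3)    # all about 0.816

op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```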
- **Trusting summary statistics alone.** Identical means and SDs can hide very different shapes; always plot.
- **Cherry-picking views.** Running twenty charts and reporting the striking one is selection bias.
- **Ignoring small cells.** A segment with five rows is not a segment.
- **Chart junk.** Three-dimensional effects and heavy gridlines reduce clarity.
- **Over-plotting.** Thousands of points merge into a blob; use transparency, binning, or density.
- **Presenting exploratory findings as confirmatory.** An EDA lead is a candidate for testing, not a conclusion on its own.
## 10.12 Summary
| Concept | Description |
|---|---|
| Stance and Landscape | |
| EDA deliverables | Leads, diagnostics, provisional findings for later analysis |
| EDA vs descriptive | Descriptive summarises what is there; EDA discovers what is unexpected |
| EDA vs confirmatory | EDA generates hypotheses; confirmatory tests them on untouched data |
| Profiling | |
| Data profile | Rows, columns, types, NA%, cardinality, duplicate-key check |
| Missingness scan | Column-wise NA percentage; drives imputation decisions |
| Cardinality | Number of unique values; flags rare levels and ID-like columns |
| Distribution and Outliers | |
| Histogram, density, boxplot | Shape views for a numeric variable at a glance |
| Q-Q plot | Diagnostic for departure from normality |
| IQR fence | Flags values beyond Q1 minus 1.5 IQR or Q3 plus 1.5 IQR |
| Z-score flag | Flags absolute standardised scores above three |
| Relationships and Subgroups | |
| Scatter with smoother | Bivariate view with a visible trend, not yet a model |
| Scatterplot matrix | Every pair of numeric variables as a mini scatter |
| Correlation heat map | Coloured matrix of pairwise linear associations; a scanning tool |
| Time-pattern scan | Weekday, month, and trend view for a dated series |
| Reporting and Cautions | |
| Six-section report | Overview, quality, distributions, relationships, subgroups, findings |
| Anscombe's lesson | Identical summary statistics can sit behind very different shapes |
Exploratory data analysis is the stage at which a cleaned dataset becomes an understood dataset. It produces the questions worth asking of the data, the quality problems worth fixing before modelling, and the candidate patterns worth testing in confirmatory work. Its results are never final on their own; they are the bridge from description to inference.