```mermaid
flowchart LR
    A[Question] --> B[View]
    B --> C[Reading]
    C --> D[Refine]
    D --> A
    C --> E[Finding]
```
# 10 Exploratory Data Analysis
## 10.1 What Exploratory Data Analysis Is
Exploratory data analysis (EDA) is the stage at which an analyst inspects a cleaned dataset to uncover its structure, anomalies, and candidate patterns before any formal model is fitted or any hypothesis is tested. Its purpose is discovery rather than description or inference. The output of an EDA pass is a set of leads, diagnostics, and provisional findings that inform the method choice and the substantive questions for the remainder of the analysis.
Exploratory data analysis applies visual and numerical techniques to a cleaned dataset in order to surface its shape, its quality problems, and the relationships among its variables, without committing to any specific model or hypothesis.
A pattern spotted in EDA is a lead, not a conclusion. Treating it as confirmed, and reporting it without an independent test, is the most common error at this stage and is the reason confirmatory analysis exists as a separate step.
## 10.2 EDA Among the Types of Analysis
EDA sits between descriptive analysis (Chapter 9) and confirmatory analysis (Chapter 11). The three differ in purpose, stance, and the kind of evidence they produce.
Descriptive analysis summarises what is in the data: central tendency, dispersion, shape, frequency. Its output is a set of numbers and tables that characterise the sample. Exploratory analysis looks for patterns the analyst did not expect: segments, skew, outliers, non-linear relationships, confounded comparisons. Its output is a set of leads. Confirmatory analysis tests a pre-specified hypothesis on data the exploration did not touch. Its output is a decision: the effect is, or is not, supported.
A careful study uses all three. Description establishes what is in the sample. Exploration nominates the effects worth testing. Confirmation decides whether each effect is real. Skipping exploration risks testing irrelevant hypotheses; skipping confirmation risks reporting leads as findings.
## 10.3 The Exploratory Workflow
EDA is iterative. A single exploratory session is a sequence of short cycles: ask a question, produce a view that bears on it, read what the view implies, formulate the next question. Most of the work happens in this loop.
- **Assume nothing.** Treat every column as unfamiliar on the first pass; do not rely on documentation alone.
- **Look at everything.** Plot each variable at least once and most variables twice.
- **Iterate quickly.** A rough chart that reveals a surprise beats a polished chart that reveals nothing.
- **Record what you see.** Annotate findings inline so the exploratory notebook becomes an audit trail.
## 10.4 Data Profiling
The first executable step in EDA is a data profile: how many rows, how many columns, of what types, with what missingness, and whether the primary key is unique. A short profile routinely catches pipeline problems that would otherwise survive into modelling.
Row and column counts (`dim()`), column types (`sapply(..., class)`), per-column missingness (`mean(is.na(.))`), cardinality (`length(unique(.))`), primary-key uniqueness (`duplicated(.)`). Five calls; one table; one minute.
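Put together, the profile is a handful of one-liners. The sketch below runs them on a toy data frame; `df` and the key column `id` are placeholders for your own table. Note that `unique()` counts `NA` as a value, so cardinality includes missingness.

```r
# Minimal data profile on a toy table; `df` and `id` stand in for real data.
df <- data.frame(
  id     = c(1, 2, 3, 4, 4),
  amount = c(10, NA, 30, 40, 50),
  region = c("A", "B", "A", NA, "B")
)

dim(df)                                    # rows and columns
sapply(df, class)                          # column types
sapply(df, function(x) mean(is.na(x)))     # per-column NA rate
sapply(df, function(x) length(unique(x)))  # cardinality (NA counts as a value)
any(duplicated(df$id))                     # TRUE here: the key is not unique
```

On this toy table the duplicate-key check fires immediately, which is exactly the kind of pipeline problem a one-minute profile is meant to catch.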
## 10.5 Visualising Distribution Shape
Once the profile is clean, the next step is to look at the shape of every numeric column. Four compact views cover most cases: the histogram for the coarse shape, the density for a smoother version, the boxplot for centre, spread, and outliers, and the Q-Q plot for departure from normality.
- **Histogram:** is the distribution roughly symmetric, skewed, or multimodal?
- **Density:** where does the mass sit, and are there overlapping populations?
- **Boxplot:** how far do extreme values sit from the bulk?
- **Q-Q plot:** how far does the distribution depart from normal?
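The four views fit on one screen with base graphics. The sketch below uses simulated right-skewed data (`rlnorm` standing in for something like revenue); swap in your own column.

```r
# Four shape views of one numeric variable, on simulated right-skewed data.
set.seed(42)
x <- rlnorm(1000, meanlog = 3, sdlog = 0.8)

op <- par(mfrow = c(2, 2))
hist(x, main = "Histogram")                       # coarse shape
plot(density(x), main = "Density")                # smoothed shape
boxplot(x, horizontal = TRUE, main = "Boxplot")   # centre, spread, outliers
qqnorm(x); qqline(x)                              # departure from normality
par(op)
```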
A long right tail on a strictly positive variable (revenue, income, waiting time) usually looks approximately normal on a logarithmic scale. A quick `hist(log(x))` often reveals structure that the linear scale hides.
## 10.6 Detecting Outliers
Outliers are of two kinds: data errors, which should be repaired or removed, and real extremes, which should be kept. EDA’s job is to count and locate them; the treatment decision belongs to preparation or to the model. Two detection rules are standard.
**IQR fence.** A value is flagged if it falls more than 1.5 times the interquartile range below Q1 or above Q3. The rule is robust to skew because it uses quartiles rather than the mean, and it is the default whisker rule in boxplots.
**Z-score flag.** A value is flagged if its absolute standardised score exceeds 3, where \(z = (x - \bar{x})/s\). The rule is simple and sensitive, but it inflates on long-tailed data because the standard deviation itself is pulled up by the extremes.
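The contrast between the two rules shows up clearly on skewed data. The sketch below applies both to a simulated long-tailed sample with one injected extreme; the data are illustrative, the rules are the two defined above.

```r
# Both outlier rules on the same skewed sample with one injected extreme.
set.seed(7)
x <- c(rlnorm(200), 50)                 # long right tail plus one big value

# IQR fence: beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * IQR(x)
iqr_flag <- x < q[1] - fence | x > q[2] + fence

# Z-score rule: |z| > 3
z_flag <- abs((x - mean(x)) / sd(x)) > 3

sum(iqr_flag)   # the robust rule flags the long tail
sum(z_flag)     # the z rule flags fewer: sd is inflated by the extreme
```

Both rules catch the injected value, but the z-score rule misses most of the tail because that same value has inflated the standard deviation.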
## 10.7 Exploring Pairwise Relationships
The natural EDA view of two continuous variables is a scatterplot; of a continuous variable split by a category, a set of side-by-side boxplots. Both answer a single question: does one variable shift with the other? A smoother (a straight or loess line) adds a visual trend without committing to a model.
A scatter of a few thousand points usually degenerates into a solid cloud. Use transparency (alpha), hexbin, or two-dimensional density estimates to see the structure inside the cloud.
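A base-R sketch of both points, on a deliberately non-linear simulated pair: a transparent scatter so the cloud stays readable, and `lowess()` for a visible trend that commits to no model.

```r
# Scatter with transparency plus a smoother, on simulated non-linear data.
set.seed(3)
n <- 5000
x <- rnorm(n)
y <- 0.5 * x^2 + rnorm(n)               # curved relationship, noisy

plot(x, y, pch = 16, col = rgb(0, 0, 0, alpha = 0.05))  # alpha thins the cloud
lines(lowess(x, y), col = "red", lwd = 2)               # trend without a model
```

The smoother traces the curve even though the linear correlation of `x` and `y` is close to zero, which is precisely the kind of lead a correlation table alone would miss.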
## 10.8 Multivariate Exploration
When more than two variables are in play, two views scale well: the scatterplot matrix (every pair as a mini scatter) and the correlation heat map (every pair’s linear association as a coloured cell). Both are scanning tools, not tests; they tell the analyst where to look next.
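Both views need no extra packages. The sketch below scans four variables of the built-in `mtcars` dataset: `pairs()` for the scatterplot matrix, and `cor()` with `image()` as a bare-bones heat map.

```r
# Scanning pass over four mtcars variables: scatterplot matrix + heat map.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

pairs(vars)                      # every pair as a mini scatter
cm <- cor(vars)                  # pairwise linear associations
round(cm, 2)

k <- ncol(cm)
image(1:k, 1:k, cm, axes = FALSE, xlab = "", ylab = "")  # coloured cells
axis(1, at = 1:k, labels = colnames(cm))
axis(2, at = 1:k, labels = colnames(cm))
```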
A cell in the heat map says two variables move together; it does not say one causes the other, nor that the relationship is linear outside the sample. The confirmatory test lives in Chapter 13.
## 10.9 Exploring Time Patterns
Business data almost always carries a time signature: weekday effects, monthly cycles, trend, and occasional level shifts from promotions or policy changes. A short time scan at the EDA stage catches these before they appear as unexplained variance in a later model.
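A short scan needs only `tapply()` over calendar groupings and a line plot. The sketch below simulates a daily series with a built-in weekend dip and upward trend; the names and effect sizes are placeholders.

```r
# Weekday and month scan on a simulated daily series (all names illustrative).
set.seed(11)
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day")
wkend <- format(dates, "%u") %in% c("6", "7")   # ISO weekday: locale-independent
sales <- 100 + 0.05 * seq_along(dates) -        # slow upward trend
  20 * wkend +                                  # weekend dip
  rnorm(length(dates), sd = 5)                  # noise

tapply(sales, format(dates, "%u"), mean)   # mean by weekday number (1 = Monday)
tapply(sales, months(dates), mean)         # mean by month
plot(dates, sales, type = "l")             # trend and level shifts at a glance
```

The weekday table exposes the weekend dip, the month table the trend; both would otherwise surface only as unexplained variance in a later model.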
## 10.10 Reporting Exploratory Findings
An EDA pass produces many views and a smaller number of findings. The report that carries them forward follows a stable six-section structure so that reviewers and downstream analysts can read it quickly.
1. **Overview** (rows, columns, period, source).
2. **Quality profile** (NA rates, duplicates, type drift).
3. **Distributional profile** (shape of each numeric variable, frequency of each category).
4. **Relationship scan** (correlations, cross-tabs).
5. **Subgroup patterns** (segment, region, channel, period).
6. **Findings and actions** (what preparation and modelling must handle).
A Quarto (.qmd) file keeps the code, the figures, and the commentary in one document that renders to HTML, PDF, or Word. Parameterising the YAML header produces one report per region, segment, or period from the same source file.
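A parameterised header might look like the following sketch; the parameter name `region` and its default value are illustrative.

```yaml
---
title: "EDA report"
format: html
params:
  region: "North"
---
```

Inside R chunks the value is read as `params$region`, and `quarto render report.qmd -P region:South` produces the South report from the same source file.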
## 10.11 Common EDA Mistakes
Exploratory work has predictable failure modes. Most share a common root: treating a lead as if it were a conclusion. A standing demonstration of why summary statistics without plots can mislead is Anscombe’s quartet (Anscombe 1973): four datasets with nearly identical means, standard deviations, and correlations, yet very different shapes.
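The quartet ships with base R as the `anscombe` dataset, so the demonstration is a few lines: the summaries nearly coincide, while a plot of each pair shows four different shapes.

```r
# Anscombe's quartet: near-identical summaries, very different shapes.
means <- sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]]))
cors  <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(means, 2)   # all about 7.50
round(cors, 3)    # all about 0.816

op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```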
- **Trusting summary statistics alone.** Identical means and SDs can hide very different shapes; always plot.
- **Cherry-picking views.** Running twenty charts and reporting the striking one is selection bias.
- **Ignoring small cells.** A segment with five rows is not a segment.
- **Chart junk.** Three-dimensional effects and heavy gridlines reduce clarity.
- **Over-plotting.** Thousands of points merge into a blob; use transparency, binning, or density.
- **Presenting exploratory findings as confirmatory.** An EDA lead is a candidate for testing, not a conclusion on its own.
## 10.12 Summary
| Concept | Description |
|---|---|
| Stance and Landscape | |
| EDA deliverables | Leads, diagnostics, provisional findings for later analysis |
| EDA vs descriptive | Descriptive summarises what is there; EDA discovers what is unexpected |
| EDA vs confirmatory | EDA generates hypotheses; confirmatory tests them on untouched data |
| Profiling | |
| Data profile | Rows, columns, types, NA%, cardinality, duplicate-key check |
| Missingness scan | Column-wise NA percentage; drives imputation decisions |
| Cardinality | Number of unique values; flags rare levels and ID-like columns |
| Distribution and Outliers | |
| Histogram, density, boxplot | Shape views for a numeric variable at a glance |
| Q-Q plot | Diagnostic for departure from normality |
| IQR fence | Flags values beyond Q1 minus 1.5 IQR or Q3 plus 1.5 IQR |
| Z-score flag | Flags absolute standardised scores above three |
| Relationships and Subgroups | |
| Scatter with smoother | Bivariate view with a visible trend, not yet a model |
| Scatterplot matrix | Every pair of numeric variables as a mini scatter |
| Correlation heat map | Coloured matrix of pairwise linear associations; a scanning tool |
| Time-pattern scan | Weekday, month, and trend view for a dated series |
| Reporting and Cautions | |
| Six-section report | Overview, quality, distributions, relationships, subgroups, findings |
| Anscombe's lesson | Identical summary statistics can sit behind very different shapes |
Exploratory data analysis is the stage at which a cleaned dataset becomes an understood dataset. It produces the questions worth asking of the data, the quality problems worth fixing before modelling, and the candidate patterns worth testing in confirmatory work. Its results are never final on their own; they are the bridge from description to inference.