8  Data Preparation: Cleaning, Munging, Normalization and Transformation

8.1 Why Data Preparation Matters

Data preparation is the set of steps that turn raw data into a form suitable for analysis. Surveys of working analysts have consistently placed the effort share at between 60 and 80 percent of total project time. The ratio is not a complaint about tooling; it reflects the fact that analytical outputs are bounded above by the quality of their inputs, and raw data is almost never in the shape that a model or a dashboard requires.

Note: The three reasons raw data is not analysis-ready

First, it is dirty (missing values, duplicates, invalid entries, inconsistent encodings). Second, it is in the wrong shape (one row per transaction when the model needs one row per customer, or spread across six joined tables). Third, it is on the wrong scale or in the wrong form for the intended analysis (heavily skewed revenue, unstandardised predictors, categorical variables that need to be numeric).

Important: Preparation is an analytic activity, not clerical work

Decisions made during preparation (how to fill missing values, which outliers to treat, how to encode a category, which log to apply) are themselves inferences. They are first-order decisions that shape every downstream conclusion, and they should be documented, reviewed, and reproducible.

8.2 Dimensions of Data Quality

The ISO 8000 family and most practitioner references agree on six dimensions of data quality. A preparation workflow is essentially a sequence of checks and fixes along these dimensions.

Note: Six dimensions

Accuracy: values correctly represent the real-world entity they describe (a recorded price actually matches the price charged).
Completeness: required fields are populated.
Consistency: the same fact is represented the same way across rows, tables, and systems (one customer is not “HDFC Bank”, “Hdfc”, and “HDFC” in three rows).
Timeliness: the data is current enough for its intended use.
Uniqueness: each real-world entity appears once (no duplicate customer records).
Validity: values conform to defined rules (a GSTIN has the right format, a PIN code is six digits, a date is a real date).
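Validity checks of this kind are easy to express as small predicate functions. A minimal sketch (in Python for concreteness; the six-digit PIN rule and real-date rule come from the note above, and the function names are made up for illustration):

```python
import re
from datetime import datetime

def is_valid_pin(pin: str) -> bool:
    """Validity rule from the note: a PIN code is exactly six digits."""
    return bool(re.fullmatch(r"[0-9]{6}", pin))

def is_valid_date(text: str, fmt: str = "%Y-%m-%d") -> bool:
    """Validity rule from the note: the string must parse as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(is_valid_pin("560001"))       # True
print(is_valid_pin("5600"))         # False: too short
print(is_valid_date("2024-02-29"))  # True: 2024 is a leap year
print(is_valid_date("2024-02-30"))  # False: February has no 30th
```

Each rule is a one-line test, which makes the validity dimension the easiest of the six to automate.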

8.3 The Preparation Pipeline

A typical preparation workflow moves through a fixed sequence of stages. The order matters: cleaning before wrangling, wrangling before scaling, scaling before modelling.

flowchart LR
  R[Raw data] --> C[Cleaning<br/>fix errors]
  C --> W[Wrangling<br/>reshape and join]
  W --> M[Missing &amp; outliers<br/>impute and treat]
  M --> N[Normalization<br/>bring to common scale]
  N --> T[Transformation<br/>reshape distributions]
  T --> E[Encoding<br/>categorical to numeric]
  E --> A[Analysis-ready]

8.4 Data Cleaning

Cleaning fixes errors that are present in the data as it arrives. The most common classes are structural (whitespace, casing, mixed date formats), duplicative (the same entity in multiple rows), and invalid (values that violate domain rules).

Warning: Cleaning is irreversible

Every cleaning step either discards information or replaces it with a corrected value. Keep the raw file read-only and write the cleaned dataset to a new file. A preparation script that can be re-run end-to-end is the minimum professional standard.
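The three error classes above can be illustrated in a few lines. A minimal Python sketch that fixes structural errors (whitespace, casing) and then drops duplicates, leaving the raw input untouched (the bank names are illustrative):

```python
def clean_names(rows):
    """Structural cleaning (trim and collapse whitespace, unify case),
    then duplicative cleaning (drop repeats, keeping first-seen order).
    Returns a new list; the raw input is left unmodified."""
    seen, cleaned = set(), []
    for name in rows:
        canonical = " ".join(name.split()).upper()  # trim, collapse spaces, unify case
        if canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)
    return cleaned

raw = ["HDFC Bank", " hdfc  bank ", "HDFC BANK", "ICICI Bank"]
print(clean_names(raw))  # ['HDFC BANK', 'ICICI BANK']
```

Note that the cleaned output loses the original spellings: exactly the kind of information loss the warning above says must happen in a new file, not in place.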

8.5 Handling Missing Values

Missing values are almost always present in real data. How they should be handled depends on why they are missing.

Note: Three mechanisms (Rubin 1976)

MCAR (Missing Completely At Random): the probability of being missing does not depend on any variable.
MAR (Missing At Random): missingness depends on observed variables but not on the missing value itself (income missing more often for unemployed respondents).
MNAR (Missing Not At Random): missingness depends on the unobserved value itself (high earners decline to report income).
MCAR tolerates the simplest methods; MNAR is the hardest and often needs domain modelling.

Important: Single imputation understates uncertainty

Mean, median, and regression imputation replace each missing value with a single best guess and then treat the completed data as if it were fully observed. Standard errors computed on the completed data are therefore optimistic. Multiple imputation (the mice package in R is the standard tool) generates several plausible completed datasets, runs the analysis on each, and pools the results, propagating the uncertainty from the missing values into the final inference.
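The understated uncertainty can be demonstrated directly. A minimal Python sketch (standard library only, with illustrative numbers): mean imputation leaves the sample mean unchanged but shrinks the standard deviation, because every imputed point sits exactly at the centre.

```python
from statistics import mean, stdev

def impute_mean(values):
    """Single imputation: replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

income = [30, 50, None, 70, None, 90]
completed = impute_mean(income)

print(stdev([v for v in income if v is not None]))  # ≈ 25.8 (observed values only)
print(stdev(completed))                             # 20.0: spread is understated
```

Any standard error computed from the completed sample inherits this shrunken spread, which is exactly why multiple imputation re-injects the missing-data uncertainty before pooling.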

8.6 Handling Outliers

An outlier is a value far from the bulk of the distribution. Outliers can be genuine (a very large corporate order, a ninety-year-old customer) or erroneous (a typo, a unit-conversion error, a test record that leaked into production). Preparation should identify outliers, diagnose them, and treat them only when justified.

Note: The 1.5 x IQR rule (Tukey)

Let Q1 and Q3 be the first and third quartiles and IQR = Q3 - Q1. Values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are flagged as potential outliers. The rule is a starting screen, not a verdict: a flagged value is examined, not deleted.
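The rule is a few lines of code. A minimal Python sketch using the standard library's quantile function (the order counts are illustrative, and quartile conventions differ slightly across implementations, so the fences can shift a little between tools):

```python
from statistics import quantiles

def tukey_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    Flagged values are returned for inspection, not deleted."""
    q1, _, q3 = quantiles(values, n=4)  # Q1, median, Q3
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

daily_orders = [10, 12, 11, 13, 12, 14, 11, 95]
print(tukey_flags(daily_orders))  # [95] — examined next, not automatically removed
```

The function deliberately returns the flagged values rather than a filtered dataset, to keep the screen-then-diagnose workflow explicit.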

Warning: Do not delete silently

Every outlier treatment (removal, capping, transformation) is a decision that affects the result. Record the rule used, the count of affected rows, and the rationale. An outlier that was silently removed will eventually resurface as a contradiction between the analysis and reality.

8.7 Data Munging and Wrangling

Munging (or wrangling) is the restructuring step: joining, reshaping, pivoting, deduplicating, and generally rearranging the data into the form the analysis needs. Most analytical methods expect a “tidy” layout: one row per observation, one column per variable.

Note: Long and wide formats

In long format, each row is an observation and a repeated variable (month) lives in its own column. In wide format, the repeated variable is spread across columns (one column per month). Both forms are useful; the choice depends on whether the analysis expects observations stacked (modelling) or aligned (reporting).
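A wide-to-long reshape can be sketched in a few lines of plain Python (the branch and month values are made up for illustration; in practice tidyr's pivot_longer or pandas' melt does this job):

```python
def wide_to_long(rows, id_col, value_cols):
    """Reshape wide -> long: one output row per (id, repeated-variable) pair."""
    return [
        {id_col: row[id_col], "month": col, "revenue": row[col]}
        for row in rows
        for col in value_cols
    ]

wide = [{"branch": "Pune",  "Jan": 120, "Feb": 135},
        {"branch": "Surat", "Jan": 90,  "Feb": 88}]

long_rows = wide_to_long(wide, "branch", ["Jan", "Feb"])
print(long_rows[0])  # {'branch': 'Pune', 'month': 'Jan', 'revenue': 120}
```

Two wide rows become four long rows: the repeated variable (month) now lives in its own column, which is the layout most modelling functions expect.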

Tip: Tidyverse tools

Production analytics in R uses the tidyverse (dplyr for filter, select, join, summarise; tidyr for pivot_longer and pivot_wider; stringr for text cleaning; lubridate for dates). The base-R code here is deliberate because it runs in a browser without package installation, but the same operations are shorter and more readable in tidyverse syntax.

8.8 Normalization

Normalization brings variables onto comparable scales so that no single variable dominates a distance-based method or an algorithm sensitive to scale (k-means, PCA, neural networks, regularised regression).

Note: Three common schemes

Min-max scaling: linearly rescales to [0, 1] using (x - min) / (max - min). Preserves the shape of the distribution; sensitive to extreme values.
Z-score standardisation: centres to mean 0 and scales to standard deviation 1 using (x - mean) / sd. The default choice in statistical work.
Robust scaling: uses the median and IQR instead of mean and SD. Preferred when outliers are present and not being treated separately.
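The three schemes, sketched in Python with the standard library (illustrative values; as with the outlier screen, the quartile convention used for robust scaling varies slightly across implementations):

```python
from statistics import mean, stdev, median, quantiles

def min_max(xs):
    """Rescale linearly to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Centre to mean 0, scale to standard deviation 1."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def robust(xs):
    """Centre on the median, scale by the IQR: resistant to outliers."""
    q1, _, q3 = quantiles(xs, n=4)
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

xs = [10.0, 20.0, 30.0, 40.0]
print(min_max(xs))  # [0.0, 0.33..., 0.67..., 1.0]
print(z_score(xs))  # mean 0, sd 1 by construction
```

All three are linear maps, so they change the units of a variable but never the shape of its distribution; reshaping is the job of the transformations in the next section.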

8.9 Transformation

Transformation changes the shape of a variable’s distribution, typically to stabilise variance, reduce skew, or make a relationship more linear. Preparation transformations are distinct from the arithmetic inside a model: they are applied before modelling and are part of the dataset used for analysis.

Note: Common transformations

Log: compresses right-skew; useful for revenue, income, count data (after adding a small constant to handle zeros).
Square root: milder than log; often used for count data.
Inverse (1/x): strongest compression; occasionally useful for rates.
Box-Cox (Box and Cox 1964): a family parameterised by lambda that includes log (lambda = 0), square root (lambda = 0.5), and identity (lambda = 1). Requires strictly positive values.
Yeo-Johnson (2000): a Box-Cox variant that allows zero and negative values.
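A minimal Python sketch of the Box-Cox family, which contains the log as its lambda = 0 limit (illustrative values; the usual scaled power form is used here):

```python
import math

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x:
    (x**lam - 1) / lam for lam != 0, and log(x) at lam = 0 (the limit)."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

revenue = [1, 10, 100, 1000]              # heavily right-skewed
logged = [box_cox(x, 0) for x in revenue]
print(logged)  # roughly [0.0, 2.30, 4.61, 6.91]: evenly spaced after the log
```

The multiplicative gaps in the raw values (each ten times the last) become equal additive gaps after the transform, which is exactly the compression of right-skew the note describes.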

Important: Transformation changes the question

A regression of log(revenue) on a predictor estimates a percentage effect; a regression of revenue estimates a rupee effect. A log transform is never neutral; the transformed model answers a different business question than the untransformed model, and the reporting should reflect that.

8.10 Encoding Categorical Variables

Most statistical and machine-learning methods require numeric inputs, so categorical variables must be encoded. The choice of encoding is a trade-off between dimensionality and information.

Note: Common schemes

One-hot (dummy) encoding: one binary column per category; the default for nominal variables in regression and tree models.
Label encoding: assigns integer codes (1, 2, 3…); appropriate only for ordinal variables or as a compact storage format.
Ordinal encoding: assigns codes that respect the rank order (Bronze = 1, Silver = 2, Gold = 3, Platinum = 4).
Target (mean) encoding: replaces each category with the mean of the outcome within that category; compact but leaks information from the target and requires care to prevent overfitting.
Binary encoding: assigns each category a binary code, producing about log2(k) columns for k categories; useful for high-cardinality variables.
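One-hot and ordinal encoding sketched in plain Python (the tier names are illustrative; in practice a library encoder, or model.matrix in R, does this and handles unseen categories):

```python
def one_hot(values, categories):
    """One-hot: one binary column per category, in a fixed category order."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal(values, order):
    """Ordinal: integer codes that respect the rank order of the categories."""
    codes = {cat: i + 1 for i, cat in enumerate(order)}
    return [codes[v] for v in values]

tiers = ["Silver", "Gold", "Bronze"]
print(one_hot(tiers, ["Bronze", "Silver", "Gold"]))
# [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(ordinal(tiers, ["Bronze", "Silver", "Gold"]))  # [2, 3, 1]
```

The trade-off from the note is visible in the output: one-hot spends one column per category but assumes nothing about order, while ordinal spends a single column but asserts that Gold is "more than" Silver.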

8.11 Reproducibility in Preparation

Preparation decisions are first-order inputs to every subsequent analysis. Treating them as code, not as ad-hoc spreadsheet edits, is the difference between an analysis that can be defended and one that cannot.

Note: Four practical standards

Keep raw data read-only. Write cleaned outputs to separate files.
Script every step. A preparation notebook or .R file should produce the analysis-ready dataset deterministically from the raw inputs.
Version the code and, when feasible, the data. Git for code; DVC or LakeFS for datasets too large for Git.
Record assumptions. A short “data dictionary and decisions” file noting which columns were dropped, how missings were filled, and how outliers were treated is the minimum audit trail for any serious analysis.

Warning: Spreadsheets are not preparation tools

Interactive edits in Excel cannot be reproduced, reviewed, or version-controlled. Preparation done in a spreadsheet is effectively anonymous after two weeks: no one can reconstruct which cells were changed, by whom, or why. Preparation in R, Python, or SQL scripts retains the history automatically.

8.12 Summary

Summary of data-preparation concepts introduced in this chapter

Quality Dimensions
  Accuracy: values correctly represent reality
  Completeness: required fields are populated
  Consistency: same fact is represented the same way
  Timeliness: data is current for its intended use
  Uniqueness: each entity appears exactly once
  Validity: values conform to domain rules

Cleaning Targets
  Missing values: detect, diagnose, impute (MCAR, MAR, MNAR)
  Outliers: flag with IQR or z-score; winsorise or remove
  Duplicates: identify and drop exact or near-duplicate rows

Wrangling Operations
  Reshape: change between long and wide layouts
  Join: combine tables on a shared key
  Pivot: rearrange rows into columns or vice versa

Normalization and Transformation
  Min-max: rescale to the [0, 1] range
  Z-score: centre to mean 0, standard deviation 1
  Log: compress right-skew (revenue, counts)
  Box-Cox: family of power transforms, lambda-parameterised

Encoding Schemes
  One-hot: one binary column per nominal category
  Ordinal: integer codes that respect rank order
  Target: replace category with mean of the outcome

Preparation is where analysis is won or lost. A clean, correctly shaped, well-encoded dataset usually makes the subsequent modelling straightforward; a hasty or undocumented preparation pipeline will eventually produce a result that no one can reproduce or defend.