8  Data Preparation: Cleaning, Munging, Normalization and Transformation

8.1 Why Data Preparation Matters

Data preparation is the set of steps that turn raw data into a form suitable for analysis. Surveys of working analysts have consistently placed the effort share at between 60 and 80 percent of total project time. The ratio is not a complaint about tooling; it reflects the fact that analytical outputs are bounded above by the quality of their inputs, and raw data is almost never in the shape that a model or a dashboard requires.

Note: The three reasons raw data is not analysis-ready

First, it is dirty (missing values, duplicates, invalid entries, inconsistent encodings). Second, it is in the wrong shape (one row per transaction when the model needs one row per customer, or spread across six joined tables). Third, it is on the wrong scale or in the wrong form for the intended analysis (heavily skewed revenue, unstandardised predictors, categorical variables that need to be numeric).

Important: Preparation is an analytic activity, not clerical work

Decisions made during preparation (how to fill missing values, which outliers to treat, how to encode a category, which log to apply) are themselves inferences. They are first-order decisions that shape every downstream conclusion, and they should be documented, reviewed, and reproducible.

8.2 Dimensions of Data Quality

The ISO 8000 family and most practitioner references agree on six dimensions of data quality. A preparation workflow is essentially a sequence of checks and fixes along these dimensions.

Note: Six dimensions

Accuracy: values correctly represent the real-world entity they describe (a recorded price actually matches the price charged).
Completeness: required fields are populated.
Consistency: the same fact is represented the same way across rows, tables, and systems (one customer is not “HDFC Bank”, “Hdfc”, and “HDFC” in three rows).
Timeliness: the data is current enough for its intended use.
Uniqueness: each real-world entity appears once (no duplicate customer records).
Validity: values conform to defined rules (a GSTIN has the right format, a PIN code is six digits, a date is a real date).
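Validity checks of this kind are easy to express as small predicate functions. A minimal sketch (in Python for concreteness; the six-digit PIN rule and real-date rule come from the note above, and the function names are made up for illustration):

```python
import re
from datetime import datetime

def is_valid_pin(pin: str) -> bool:
    """Validity rule from the note: a PIN code is exactly six digits."""
    return bool(re.fullmatch(r"[0-9]{6}", pin))

def is_valid_date(text: str, fmt: str = "%Y-%m-%d") -> bool:
    """Validity rule from the note: the string must parse as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(is_valid_pin("560001"))       # True
print(is_valid_pin("5600"))         # False: too short
print(is_valid_date("2024-02-29"))  # True: 2024 is a leap year
print(is_valid_date("2024-02-30"))  # False: February has no 30th
```

Each rule is a one-line test, which makes the validity dimension the easiest of the six to automate.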

8.3 The Preparation Pipeline

A typical preparation workflow moves through a fixed sequence of stages. The order matters: cleaning before wrangling, wrangling before scaling, scaling before modelling.

flowchart LR
  R[Raw data] --> C[Cleaning<br/>fix errors]
  C --> W[Wrangling<br/>reshape and join]
  W --> M[Missing &amp; outliers<br/>impute and treat]
  M --> N[Normalization<br/>bring to common scale]
  N --> T[Transformation<br/>reshape distributions]
  T --> E[Encoding<br/>categorical to numeric]
  E --> A[Analysis-ready]

8.4 Data Cleaning

Cleaning fixes errors that are present in the data as it arrives. The most common classes are structural (whitespace, casing, mixed date formats), duplicative (the same entity in multiple rows), and invalid (values that violate domain rules).

Warning: Cleaning is irreversible

Every cleaning step either discards information or replaces it with a corrected value. Keep the raw file read-only and write the cleaned dataset to a new file. A preparation script that can be re-run end-to-end is the minimum professional standard.
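The three error classes above can be illustrated in a few lines. A minimal Python sketch that fixes structural errors (whitespace, casing) and then drops duplicates, leaving the raw input untouched (the bank names are illustrative):

```python
def clean_names(rows):
    """Structural cleaning (trim and collapse whitespace, unify case),
    then duplicative cleaning (drop repeats, keeping first-seen order).
    Returns a new list; the raw input is left unmodified."""
    seen, cleaned = set(), []
    for name in rows:
        canonical = " ".join(name.split()).upper()  # trim, collapse spaces, unify case
        if canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)
    return cleaned

raw = ["HDFC Bank", " hdfc  bank ", "HDFC BANK", "ICICI Bank"]
print(clean_names(raw))  # ['HDFC BANK', 'ICICI BANK']
```

Note that the cleaned output loses the original spellings: exactly the kind of information loss the warning above says must happen in a new file, not in place.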

8.5 Handling Missing Values

Missing values are almost always present in real data. How they should be handled depends on why they are missing.

Note: Three mechanisms (Rubin 1976)

MCAR (Missing Completely At Random): the probability of being missing does not depend on any variable.
MAR (Missing At Random): missingness depends on observed variables but not on the missing value itself (income missing more often for unemployed respondents).
MNAR (Missing Not At Random): missingness depends on the unobserved value itself (high earners decline to report income).
MCAR tolerates the simplest methods; MNAR is the hardest and often needs domain modelling.

Important: Single imputation understates uncertainty

Mean, median, and regression imputation replace each missing value with a single best guess and then treat the completed data as if it were fully observed. Standard errors computed on the completed data are therefore optimistic. Multiple imputation (the mice package in R is the standard tool) generates several plausible completed datasets, runs the analysis on each, and pools the results, propagating the uncertainty from the missing values into the final inference.
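The understated uncertainty can be demonstrated directly. A minimal Python sketch (standard library only, with illustrative numbers): mean imputation leaves the sample mean unchanged but shrinks the standard deviation, because every imputed point sits exactly at the centre.

```python
from statistics import mean, stdev

def impute_mean(values):
    """Single imputation: replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

income = [30, 50, None, 70, None, 90]
completed = impute_mean(income)

print(stdev([v for v in income if v is not None]))  # ≈ 25.8 (observed values only)
print(stdev(completed))                             # 20.0: spread is understated
```

Any standard error computed from the completed sample inherits this shrunken spread, which is exactly why multiple imputation re-injects the missing-data uncertainty before pooling.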

8.6 Handling Outliers

An outlier is a value far from the bulk of the distribution. Outliers can be genuine (a very large corporate order, a ninety-year-old customer) or erroneous (a typo, a unit-conversion error, a test record that leaked into production). Preparation should identify outliers, diagnose them, and treat them only when justified.

Note: The 1.5 x IQR rule (Tukey)

Let Q1 and Q3 be the first and third quartiles and IQR = Q3 - Q1. Values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are flagged as potential outliers. The rule is a starting screen, not a verdict: a flagged value is examined, not deleted.
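The rule is a few lines of code. A minimal Python sketch using the standard library's quantile function (the order counts are illustrative, and quartile conventions differ slightly across implementations, so the fences can shift a little between tools):

```python
from statistics import quantiles

def tukey_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    Flagged values are returned for inspection, not deleted."""
    q1, _, q3 = quantiles(values, n=4)  # Q1, median, Q3
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

daily_orders = [10, 12, 11, 13, 12, 14, 11, 95]
print(tukey_flags(daily_orders))  # [95] — examined next, not automatically removed
```

The function deliberately returns the flagged values rather than a filtered dataset, to keep the screen-then-diagnose workflow explicit.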

Warning: Do not delete silently

Every outlier treatment (removal, capping, transformation) is a decision that affects the result. Record the rule used, the count of affected rows, and the rationale. An outlier that was silently removed will eventually resurface as a contradiction between the analysis and reality.

8.7 Data Munging and Wrangling

Munging (or wrangling) is the restructuring step: joining, reshaping, pivoting, deduplicating, and generally rearranging the data into the form the analysis needs. Most analytical methods expect a “tidy” layout: one row per observation, one column per variable.

Note: Long and wide formats

In long format, each row is an observation and a repeated variable (month) lives in its own column. In wide format, the repeated variable is spread across columns (one column per month). Both forms are useful; the choice depends on whether the analysis expects observations stacked (modelling) or aligned (reporting).
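A wide-to-long reshape can be sketched in a few lines of plain Python (the branch and month values are made up for illustration; in practice tidyr's pivot_longer or pandas' melt does this job):

```python
def wide_to_long(rows, id_col, value_cols):
    """Reshape wide -> long: one output row per (id, repeated-variable) pair."""
    return [
        {id_col: row[id_col], "month": col, "revenue": row[col]}
        for row in rows
        for col in value_cols
    ]

wide = [{"branch": "Pune",  "Jan": 120, "Feb": 135},
        {"branch": "Surat", "Jan": 90,  "Feb": 88}]

long_rows = wide_to_long(wide, "branch", ["Jan", "Feb"])
print(long_rows[0])  # {'branch': 'Pune', 'month': 'Jan', 'revenue': 120}
```

Two wide rows become four long rows: the repeated variable (month) now lives in its own column, which is the layout most modelling functions expect.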

Tip: Tidyverse tools

Production analytics in R uses the tidyverse (dplyr for filter, select, join, summarise; tidyr for pivot_longer and pivot_wider; stringr for text cleaning; lubridate for dates). The base-R code here is deliberate because it runs in a browser without package installation, but the same operations are shorter and more readable in tidyverse syntax.

8.8 Normalization

Normalization brings variables onto comparable scales so that no single variable dominates a distance-based method or an algorithm sensitive to scale (k-means, PCA, neural networks, regularised regression).

Note: Three common schemes

Min-max scaling: linearly rescales to [0, 1] using (x - min) / (max - min). Preserves the shape of the distribution; sensitive to extreme values.
Z-score standardisation: centres to mean 0 and scales to standard deviation 1 using (x - mean) / sd. The default choice in statistical work.
Robust scaling: uses the median and IQR instead of mean and SD. Preferred when outliers are present and not being treated separately.
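The three schemes, sketched in Python with the standard library (illustrative values; as with the outlier screen, the quartile convention used for robust scaling varies slightly across implementations):

```python
from statistics import mean, stdev, median, quantiles

def min_max(xs):
    """Rescale linearly to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Centre to mean 0, scale to standard deviation 1."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def robust(xs):
    """Centre on the median, scale by the IQR: resistant to outliers."""
    q1, _, q3 = quantiles(xs, n=4)
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

xs = [10.0, 20.0, 30.0, 40.0]
print(min_max(xs))  # [0.0, 0.33..., 0.67..., 1.0]
print(z_score(xs))  # mean 0, sd 1 by construction
```

All three are linear maps, so they change the units of a variable but never the shape of its distribution; reshaping is the job of the transformations in the next section.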

8.9 Transformation

Transformation changes the shape of a variable’s distribution, typically to stabilise variance, reduce skew, or make a relationship more linear. Preparation transformations are distinct from the arithmetic inside a model: they are applied before modelling and are part of the dataset used for analysis.

Note: Common transformations

Log: compresses right-skew; useful for revenue, income, count data (after adding a small constant to handle zeros).
Square root: milder than log; often used for count data.
Inverse (1/x): strongest compression; occasionally useful for rates.
Box-Cox (Box and Cox 1964): a family parameterised by lambda that includes log (lambda = 0), square root (lambda = 0.5), and identity (lambda = 1). Requires strictly positive values.
Yeo-Johnson (2000): a Box-Cox variant that allows zero and negative values.
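A minimal Python sketch of the Box-Cox family, which contains the log as its lambda = 0 limit (illustrative values; the usual scaled power form is used here):

```python
import math

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x:
    (x**lam - 1) / lam for lam != 0, and log(x) at lam = 0 (the limit)."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

revenue = [1, 10, 100, 1000]              # heavily right-skewed
logged = [box_cox(x, 0) for x in revenue]
print(logged)  # roughly [0.0, 2.30, 4.61, 6.91]: evenly spaced after the log
```

The multiplicative gaps in the raw values (each ten times the last) become equal additive gaps after the transform, which is exactly the compression of right-skew the note describes.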

Important: Transformation changes the question

A regression of log(revenue) on a predictor estimates a percentage effect; a regression of revenue estimates a rupee effect. A log transform is never neutral; the transformed model answers a different business question than the untransformed model, and the reporting should reflect that.

8.10 Encoding Categorical Variables

Most statistical and machine-learning methods require numeric inputs, so categorical variables must be encoded. The choice of encoding is a trade-off between dimensionality and information.

Note: Common schemes

One-hot (dummy) encoding: one binary column per category; the default for nominal variables in regression and tree models.
Label encoding: assigns integer codes (1, 2, 3…); appropriate only for ordinal variables or as a compact storage format.
Ordinal encoding: assigns codes that respect the rank order (Bronze = 1, Silver = 2, Gold = 3, Platinum = 4).
Target (mean) encoding: replaces each category with the mean of the outcome within that category; compact but leaks information from the target and requires care to prevent overfitting.
Binary encoding: assigns each category a binary code, producing about log2(k) columns for k categories; useful for high-cardinality variables.
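One-hot and ordinal encoding sketched in plain Python (the tier names are illustrative; in practice a library encoder, or model.matrix in R, does this and handles unseen categories):

```python
def one_hot(values, categories):
    """One-hot: one binary column per category, in a fixed category order."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal(values, order):
    """Ordinal: integer codes that respect the rank order of the categories."""
    codes = {cat: i + 1 for i, cat in enumerate(order)}
    return [codes[v] for v in values]

tiers = ["Silver", "Gold", "Bronze"]
print(one_hot(tiers, ["Bronze", "Silver", "Gold"]))
# [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(ordinal(tiers, ["Bronze", "Silver", "Gold"]))  # [2, 3, 1]
```

The trade-off from the note is visible in the output: one-hot spends one column per category but assumes nothing about order, while ordinal spends a single column but asserts that Gold is "more than" Silver.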

8.11 Reproducibility in Preparation

Preparation decisions are first-order inputs to every subsequent analysis. Treating them as code, not as ad-hoc spreadsheet edits, is the difference between an analysis that can be defended and one that cannot.

Note: Four practical standards

Keep raw data read-only. Write cleaned outputs to separate files.
Script every step. A preparation notebook or .R file should produce the analysis-ready dataset deterministically from the raw inputs.
Version the code and, when feasible, the data. Git for code; DVC or LakeFS for datasets too large for Git.
Record assumptions. A short “data dictionary and decisions” file noting which columns were dropped, how missings were filled, and how outliers were treated is the minimum audit trail for any serious analysis.

Warning: Spreadsheets are not preparation tools

Interactive edits in Excel cannot be reproduced, reviewed, or version-controlled. Preparation done in a spreadsheet is effectively anonymous after two weeks: no one can reconstruct which cells were changed, by whom, or why. Preparation in R, Python, or SQL scripts retains the history automatically.

8.12 Summary

Summary of data-preparation concepts introduced in this chapter

Quality Dimensions
  Accuracy: values correctly represent reality
  Completeness: required fields are populated
  Consistency: same fact is represented the same way
  Timeliness: data is current for its intended use
  Uniqueness: each entity appears exactly once
  Validity: values conform to domain rules

Cleaning Targets
  Missing values: detect, diagnose, impute (MCAR, MAR, MNAR)
  Outliers: flag with IQR or z-score; winsorise or remove
  Duplicates: identify and drop exact or near-duplicate rows

Wrangling Operations
  Reshape: change between long and wide layouts
  Join: combine tables on a shared key
  Pivot: rearrange rows into columns or vice versa

Normalization and Transformation
  Min-max: rescale to the [0, 1] range
  Z-score: centre to mean 0, standard deviation 1
  Log: compress right-skew (revenue, counts)
  Box-Cox: family of power transforms, lambda-parameterised

Encoding Schemes
  One-hot: one binary column per nominal category
  Ordinal: integer codes that respect rank order
  Target: replace category with mean of the outcome

Preparation is where analysis is won or lost. A clean, correctly shaped, well-encoded dataset usually makes the subsequent modelling straightforward; a hasty or undocumented preparation pipeline will eventually produce a result that no one can reproduce or defend.