```mermaid
flowchart LR
    R[Raw data] --> C[Cleaning<br/>fix errors]
    C --> W[Wrangling<br/>reshape and join]
    W --> M[Missing & outliers<br/>impute and treat]
    M --> N[Normalization<br/>bring to common scale]
    N --> T[Transformation<br/>reshape distributions]
    T --> E[Encoding<br/>categorical to numeric]
    E --> A[Analysis-ready]
```
8 Data Preparation: Cleaning, Munging, Normalization and Transformation
8.1 Why Data Preparation Matters
Data preparation is the set of steps that turn raw data into a form suitable for analysis. Surveys of working analysts have consistently placed the effort share at between 60 and 80 percent of total project time. The ratio is not a complaint about tooling; it reflects the fact that analytical outputs are bounded above by the quality of their inputs, and raw data is almost never in the shape that a model or a dashboard requires.
Raw data typically falls short in three ways. First, it is dirty (missing values, duplicates, invalid entries, inconsistent encodings). Second, it is in the wrong shape (one row per transaction when the model needs one row per customer, or spread across six joined tables). Third, it is on the wrong scale or in the wrong form for the intended analysis (heavily skewed revenue, unstandardised predictors, categorical variables that need to be numeric).
Decisions made during preparation (how to fill missing values, which outliers to treat, how to encode a category, which log to apply) are themselves inferences. They are first-order decisions that shape every downstream conclusion, and they should be documented, reviewed, and reproducible.
8.2 Dimensions of Data Quality
The ISO 8000 family and most practitioner references agree on six dimensions of data quality. A preparation workflow is essentially a sequence of checks and fixes along these dimensions.
- Accuracy: values correctly represent the real-world entity they describe (a recorded price actually matches the price charged).
- Completeness: required fields are populated.
- Consistency: the same fact is represented the same way across rows, tables, and systems (one customer is not “HDFC Bank”, “Hdfc”, and “HDFC” in three rows).
- Timeliness: the data is current enough for its intended use.
- Uniqueness: each real-world entity appears once (no duplicate customer records).
- Validity: values conform to defined rules (a GSTIN has the right format, a PIN code is six digits, a date is a real date).
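Validity checks like the two mentioned above are straightforward to express as code. A minimal Python sketch (the chapter's own examples use base R; the function names here are invented for illustration, and the PIN rule is simply "six digits" as stated in the text):

```python
import re
from datetime import datetime

def is_valid_pin(pin: str) -> bool:
    """Validity check: a PIN code is exactly six digits."""
    return bool(re.fullmatch(r"\d{6}", pin))

def is_valid_date(s: str, fmt: str = "%Y-%m-%d") -> bool:
    """Validity check: the string parses as a real calendar date."""
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

print(is_valid_pin("560001"))       # True
print(is_valid_date("2023-02-29"))  # False: 2023 is not a leap year
```

Running such checks on every load, rather than once, is what turns a one-off inspection into a quality gate.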
8.3 The Preparation Pipeline
A typical preparation workflow moves through a fixed sequence of stages. The order matters: cleaning before wrangling, wrangling before scaling, scaling before modelling.
8.4 Data Cleaning
Cleaning fixes errors that are present in the data as it arrives. The most common classes are structural (whitespace, casing, mixed date formats), duplicative (the same entity in multiple rows), and invalid (values that violate domain rules).
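The structural and duplicative classes can be illustrated in a few lines. A Python sketch with invented example rows (the "HDFC" variants echo the consistency example from the quality dimensions above):

```python
def clean_name(raw: str) -> str:
    # Structural fixes: trim outer whitespace, collapse internal runs,
    # and normalise casing so variants compare equal
    return " ".join(raw.split()).upper()

rows = [" HDFC Bank ", "Hdfc Bank", "HDFC  BANK", "ICICI Bank"]
cleaned = [clean_name(r) for r in rows]

# Duplicative fixes: after structural cleaning, exact duplicates
# can be dropped while preserving first-seen order
deduped = list(dict.fromkeys(cleaned))
print(deduped)  # ['HDFC BANK', 'ICICI BANK']
```

Note the order of operations: deduplication before structural cleaning would have kept all three HDFC variants.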
Every cleaning step either discards information or replaces it with a corrected value. Keep the raw file read-only and write the cleaned dataset to a new file. A preparation script that can be re-run end-to-end is the minimum professional standard.
8.5 Handling Missing Values
Missing values are almost always present in real data. How they should be handled depends on why they are missing.
- MCAR (Missing Completely At Random): the probability of being missing does not depend on any variable.
- MAR (Missing At Random): missingness depends on observed variables but not on the missing value itself (income missing more often for unemployed respondents).
- MNAR (Missing Not At Random): missingness depends on the unobserved value itself (high earners decline to report income).

MCAR tolerates the simplest methods; MNAR is the hardest and often needs domain modelling.
Mean, median, and regression imputation replace each missing value with a single best guess and then treat the completed data as if it were fully observed. Standard errors computed on the completed data are therefore optimistic. Multiple imputation (the mice package in R is the standard tool) generates several plausible completed datasets, runs the analysis on each, and pools the results, propagating the uncertainty from the missing values into the final inference.
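Single imputation is mechanically trivial, which is part of why its statistical cost is easy to overlook. A Python sketch of median imputation (the data is invented; proper multiple imputation, as with mice in R, would repeat this with draws rather than a single fill value):

```python
from statistics import median

def impute_median(values):
    """Single imputation: replace each None with the median of the
    observed values. The completed data then looks fully observed,
    so downstream standard errors are optimistic."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

print(impute_median([10, None, 30, 50, None]))  # [10, 30, 30, 50, 30]
```

Every missing value receives the same fill, so the imputed variable's variance shrinks; that shrinkage is exactly the information loss that multiple imputation is designed to propagate.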
8.6 Handling Outliers
An outlier is a value far from the bulk of the distribution. Outliers can be genuine (a very large corporate order, a ninety-year-old customer) or erroneous (a typo, a unit-conversion error, a test record that leaked into production). Preparation should identify outliers, diagnose them, and treat them only when justified.
Let Q1 and Q3 be the first and third quartiles and IQR = Q3 - Q1. Values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are flagged as potential outliers. The rule is a starting screen, not a verdict: a flagged value is examined, not deleted.
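The rule translates directly into code. A Python sketch (note that several quartile conventions exist; this one uses the median of the lower and upper halves, so exact flags may differ slightly from other implementations):

```python
def iqr_flags(values, k=1.5):
    """Tukey's rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(values)
    n = len(xs)

    def med(a):
        m = len(a)
        return a[m // 2] if m % 2 else (a[m // 2 - 1] + a[m // 2]) / 2

    # Quartiles as medians of the lower and upper halves
    q1, q3 = med(xs[: n // 2]), med(xs[(n + 1) // 2:])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_flags([12, 14, 15, 15, 16, 17, 18, 19, 120]))  # [120]
```

The function returns flags, not a cleaned dataset: consistent with the text, the decision about what to do with 120 belongs to a human, not to the screen.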
Every outlier treatment (removal, capping, transformation) is a decision that affects the result. Record the rule used, the count of affected rows, and the rationale. An outlier that was silently removed will eventually resurface as a contradiction between the analysis and reality.
8.7 Data Munging and Wrangling
Munging (or wrangling) is the restructuring step: joining, reshaping, pivoting, deduplicating, and generally rearranging the data into the form the analysis needs. Most analytical methods expect a “tidy” layout: one row per observation, one column per variable.
In long format, each row is an observation and a repeated variable (month) lives in its own column. In wide format, the repeated variable is spread across columns (one column per month). Both forms are useful; the choice depends on whether the analysis expects observations stacked (modelling) or aligned (reporting).
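The long-to-wide move can be shown without any library. A Python sketch with invented customer-by-month data (in the tidyverse this is pivot_wider; the reverse direction is pivot_longer):

```python
# Long format: one row per (customer, month) observation
long_rows = [
    {"customer": "A", "month": "Jan", "revenue": 100},
    {"customer": "A", "month": "Feb", "revenue": 120},
    {"customer": "B", "month": "Jan", "revenue": 80},
    {"customer": "B", "month": "Feb", "revenue": 90},
]

# Wide format: one row per customer, one column per month
wide = {}
for r in long_rows:
    wide.setdefault(r["customer"], {})[r["month"]] = r["revenue"]

print(wide)  # {'A': {'Jan': 100, 'Feb': 120}, 'B': {'Jan': 80, 'Feb': 90}}
```

The information content is identical in both layouts; only the addressing changes, which is why reshaping is lossless and reversible.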
Production analytics in R uses the tidyverse (dplyr for filter, select, join, summarise; tidyr for pivot_longer and pivot_wider; stringr for text cleaning; lubridate for dates). The base-R code here is deliberate because it runs in a browser without package installation, but the same operations are shorter and more readable in tidyverse syntax.
8.8 Normalization
Normalization brings variables onto comparable scales so that no single variable dominates a distance-based method or an algorithm sensitive to scale (k-means, PCA, neural networks, regularised regression).
- Min-max scaling: linearly rescales to [0, 1] using (x - min) / (max - min). Preserves the shape of the distribution; sensitive to extreme values.
- Z-score standardisation: centres to mean 0 and scales to standard deviation 1 using (x - mean) / sd. The default choice in statistical work.
- Robust scaling: uses the median and IQR instead of mean and SD. Preferred when outliers are present and not being treated separately.
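All three scalers are one-liners over the formulas above. A Python sketch (statistics.quantiles is used for the robust variant; its default quartile convention may differ slightly from other tools):

```python
from statistics import mean, stdev, median, quantiles

def min_max(xs):
    """Rescale to [0, 1]; preserves shape, sensitive to extremes."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Centre to mean 0, scale to (sample) standard deviation 1."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def robust_scale(xs):
    """Median/IQR analogue of the z-score; resistant to outliers."""
    q1, _, q3 = quantiles(xs, n=4)
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(z_score([10, 20, 30]))  # [-1.0, 0.0, 1.0]
```

Whichever scaler is chosen, its parameters (min/max, mean/sd, median/IQR) must be computed on the training data only and reused unchanged on new data.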
8.9 Transformation
Transformation changes the shape of a variable’s distribution, typically to stabilise variance, reduce skew, or make a relationship more linear. Preparation transformations are distinct from the arithmetic inside a model: they are applied before modelling and are part of the dataset used for analysis.
- Log: compresses right-skew; useful for revenue, income, count data (after adding a small constant to handle zeros).
- Square root: milder than log; often used for count data.
- Inverse (1/x): strongest compression; occasionally useful for rates.
- Box-Cox (Box and Cox 1964): a family parameterised by lambda that includes log (lambda = 0), square root (lambda = 0.5), and identity (lambda = 1). Requires strictly positive values.
- Yeo-Johnson (2000): a Box-Cox variant that allows zero and negative values.
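The Box-Cox family can be written in a few lines. A Python sketch (the lambda values follow the definitions above; choosing lambda by maximum likelihood, as production tools do, is omitted here):

```python
import math

def log1p_transform(xs):
    """log(x + 1): compresses right skew while tolerating zeros."""
    return [math.log1p(x) for x in xs]

def box_cox(x, lam):
    """Box-Cox power transform: (x^lam - 1) / lam, with the log as
    the lam = 0 limit. Requires strictly positive input."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

print(box_cox(9.0, 0.5))   # 4.0 (the square-root branch)
print(box_cox(100.0, 0))   # log(100), approximately 4.6052
```

The (x^lam - 1) / lam form, rather than a bare power, is what makes the family continuous in lambda at zero.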
A regression of log(revenue) on a predictor estimates a percentage effect; a regression of revenue estimates a rupee effect. A log transform is never neutral; the transformed model answers a different business question than the untransformed model, and the reporting should reflect that.
8.10 Encoding Categorical Variables
Most statistical and machine-learning methods require numeric inputs, so categorical variables must be encoded. The choice of encoding is a trade-off between dimensionality and information.
- One-hot (dummy) encoding: one binary column per category; the default for nominal variables in regression and tree models.
- Label encoding: assigns integer codes (1, 2, 3…); appropriate only for ordinal variables or as a compact storage format.
- Ordinal encoding: assigns codes that respect the rank order (Bronze = 1, Silver = 2, Gold = 3, Platinum = 4).
- Target (mean) encoding: replaces each category with the mean of the outcome within that category; compact but leaks information from the target and requires care to prevent overfitting.
- Binary encoding: assigns each category a binary code, producing log2(k) columns; useful for high-cardinality variables.
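The two most common schemes are short enough to write by hand. A Python sketch of one-hot and ordinal encoding, using the Bronze-to-Platinum tiers from the list above (production code would use model.matrix in R or a library encoder instead):

```python
def one_hot(values):
    """One binary column per category, for nominal variables."""
    cats = sorted(set(values))
    return [{c: int(v == c) for c in cats} for v in values]

TIER_ORDER = {"Bronze": 1, "Silver": 2, "Gold": 3, "Platinum": 4}

def ordinal_encode(values, order=TIER_ORDER):
    """Integer codes that respect the rank order of the categories."""
    return [order[v] for v in values]

print(ordinal_encode(["Gold", "Bronze", "Platinum"]))  # [3, 1, 4]
print(one_hot(["a", "b"]))  # [{'a': 1, 'b': 0}, {'a': 0, 'b': 1}]
```

The trade-off in the text is visible here: one-hot grows one column per category, while ordinal stays at a single column but asserts an ordering the data must actually have.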
8.11 Reproducibility in Preparation
Preparation decisions are first-order inputs to every subsequent analysis. Treating them as code, not as ad-hoc spreadsheet edits, is the difference between an analysis that can be defended and one that cannot.
- Keep raw data read-only; write cleaned outputs to separate files.
- Script every step: a preparation notebook or .R file should produce the analysis-ready dataset deterministically from the raw inputs.
- Version the code and, when feasible, the data: Git for code; DVC or LakeFS for datasets too large for Git.
- Record assumptions: a short “data dictionary and decisions” file noting which columns were dropped, how missings were filled, and how outliers were treated is the minimum audit trail for any serious analysis.
Interactive edits in Excel cannot be reproduced, reviewed, or version-controlled. Preparation done in a spreadsheet is effectively anonymous after two weeks: no one can reconstruct which cells were changed, by whom, or why. Preparation in R, Python, or SQL scripts retains the history automatically.
8.12 Summary
| Concept | Description |
|---|---|
| Quality Dimensions | |
| Accuracy | Values correctly represent reality |
| Completeness | Required fields are populated |
| Consistency | Same fact is represented the same way |
| Timeliness | Data is current for its intended use |
| Uniqueness | Each entity appears exactly once |
| Validity | Values conform to domain rules |
| Cleaning Targets | |
| Missing values | Detect, diagnose, impute (MCAR, MAR, MNAR) |
| Outliers | Flag with IQR or z-score; winsorise or remove |
| Duplicates | Identify and drop exact or near-duplicate rows |
| Wrangling Operations | |
| Reshape | Change between long and wide layouts |
| Join | Combine tables on a shared key |
| Pivot | Rearrange rows into columns or vice versa |
| Normalization and Transformation | |
| Min-max | Rescale to the [0, 1] range |
| Z-score | Centre to mean 0, standard deviation 1 |
| Log | Compress right-skew (revenue, counts) |
| Box-Cox | Family of power transforms, lambda-parameterised |
| Encoding Schemes | |
| One-hot | One binary column per nominal category |
| Ordinal | Integer codes that respect rank order |
| Target | Replace category with mean of the outcome |
Preparation is where analysis is won or lost. A clean, correctly shaped, well-encoded dataset usually makes the subsequent modelling straightforward; a hasty or undocumented preparation pipeline will eventually produce a result that no one can reproduce or defend.