9 Descriptive Data Analysis
9.1 What Descriptive Data Analysis Is
Descriptive data analysis summarises the essential features of a dataset so that a human can reason about it. Its purpose is to describe what the data shows, not to infer anything beyond the data at hand or to predict what will happen next. The output is a compact set of numbers, tables, and charts that together answer the question: what does this dataset look like?
Descriptive analysis applies arithmetic, tabular, and visual summaries to a dataset in order to communicate its central features: where the values cluster, how they spread, what shape their distribution takes, and how they relate to other variables.
Descriptive statistics describe the data that was actually measured. They do not, by themselves, support claims about a wider population, a future period, or a causal mechanism. Such claims require the additional machinery of sampling theory, hypothesis testing, or causal inference.
9.2 Why Descriptive Analysis Comes First
A short descriptive pass is the cheapest, fastest, and most informative use of an analyst’s first hour with a new dataset. It delivers four things.
- Familiarity: orders of magnitude, units, and typical values are learned quickly.
- Error detection: implausible extremes, impossible dates, and negative revenues surface immediately.
- Method selection: the shape and scale of a variable determine whether a mean, a median, or a log-transformed mean is the right summary and whether a parametric or non-parametric test is justified later.
- Baseline: the descriptive snapshot becomes the reference against which any modelling or experimental result is sanity-checked.
9.3 Measures of Central Tendency
Central-tendency statistics summarise the “typical” value of a variable. Three classical measures plus two specialised means cover almost all business use cases.
Mean: the sum of the values divided by the count. The default summary for interval and ratio data. It is sensitive to extreme values, so the arithmetic mean of revenue in a long-tailed distribution can exceed the value that most customers actually produce.
Median: the middle value when the data is ordered. Robust to outliers. The median is the recommended summary for skewed distributions (revenue, wait times, income) and, unlike the mean, remains meaningful for ordinal data.
Mode: the most frequent value. The only central measure defined for nominal data (most-common payment method, most-common city). A distribution can be unimodal, bimodal, or multimodal; a bimodal distribution usually signals that the sample mixes two underlying groups.
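The contrast between the three classical measures can be sketched with Python's standard `statistics` module. The spend figures and payment methods below are invented for illustration; the point is only that a single large value moves the mean but not the median.

```python
from statistics import mean, median, multimode

# Hypothetical monthly spend (in rupees) for nine customers; one value is very large.
spend = [1200, 1500, 1500, 1800, 2100, 2400, 2600, 3000, 45000]

spend_mean = mean(spend)        # pulled upward by the single 45,000 value
spend_median = median(spend)    # middle of the ordered values
spend_modes = multimode(spend)  # list of the most frequent value(s)

# The mode is the only one of the three that works on nominal data.
payments = ["UPI", "card", "UPI", "cash"]  # hypothetical categories
payment_mode = multimode(payments)

print(f"mean={spend_mean:.0f}, median={spend_median}, modes={spend_modes}")
print(f"most common payment method: {payment_mode}")
```

Here the mean (about 6,789) is more than three times the median (2,100), even though eight of the nine customers spend 3,000 or less.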
The geometric mean is the n-th root of the product of n values; it is the correct average for multiplicative processes such as compound growth rates and financial returns. The harmonic mean is n divided by the sum of reciprocals; it is the correct average for rates measured over a fixed quantity, such as average speed over a fixed distance.
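A minimal sketch of the two specialised means, using `statistics.geometric_mean` and `statistics.harmonic_mean` from the standard library. The growth factors and speeds are made-up inputs chosen to keep the arithmetic easy to check.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical annual growth factors: +10%, -5%, +20%.
growth_factors = [1.10, 0.95, 1.20]
avg_factor = geometric_mean(growth_factors)

# Compounding the average factor three times reproduces the total growth.
compounded = 1.10 * 0.95 * 1.20
rebuilt = avg_factor ** 3

# Hypothetical trip: 40 km/h out and 60 km/h back over the same distance.
speeds = [40, 60]
avg_speed = harmonic_mean(speeds)  # about 48 km/h
naive_speed = mean(speeds)         # 50 km/h, which overstates the true average

print(f"average growth factor: {avg_factor:.4f}")
print(f"average speed: {avg_speed:.1f} km/h (arithmetic mean would say {naive_speed})")
```

The harmonic mean is lower than the arithmetic mean because more time is spent on the slower leg.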
The average of three regional means is not the overall mean unless every region has the same sample size. Always compute the mean on the pooled data, or take a weighted average using the region sizes as weights.
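The warning about averaging regional means can be checked on a small example. The three regions and their values are hypothetical, with deliberately unequal sample sizes.

```python
from statistics import mean

# Hypothetical regional observations with unequal sample sizes.
regions = {
    "North": [10, 12, 14, 16],  # n = 4, mean 13
    "South": [20, 22],          # n = 2, mean 21
    "East": [30],               # n = 1, mean 30
}

region_means = {name: mean(vals) for name, vals in regions.items()}
region_sizes = {name: len(vals) for name, vals in regions.items()}

# Unweighted average of the three regional means (the mistake).
mean_of_means = mean(list(region_means.values()))

# Correct option 1: pool all observations and take one mean.
pooled = [v for vals in regions.values() for v in vals]
pooled_mean = mean(pooled)

# Correct option 2: weight each regional mean by its sample size.
total_n = sum(region_sizes.values())
weighted_mean = sum(region_means[r] * region_sizes[r] for r in regions) / total_n

print(f"mean of means: {mean_of_means:.2f}")
print(f"pooled mean:   {pooled_mean:.2f}")
print(f"weighted mean: {weighted_mean:.2f}")
```

The pooled and weighted results agree (about 17.71), while the unweighted mean of means (about 21.33) gives the single-observation region as much influence as the four-observation region.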
9.4 Measures of Dispersion
Dispersion describes how spread out the values are. Two datasets can share the same mean yet differ sharply in behaviour; dispersion is what tells them apart.
- Range: maximum minus minimum; simple but dominated by extremes.
- Interquartile range (IQR): Q3 minus Q1; robust and the dispersion companion to the median.
- Variance: the mean of squared deviations from the mean; has squared units, which is inconvenient.
- Standard deviation (SD): square root of variance; same units as the data; the default dispersion for interval and ratio variables.
- Coefficient of variation (CV): SD divided by mean, reported as a percentage; a unit-free measure that allows comparison across variables on different scales.
- Median absolute deviation (MAD): the median of absolute deviations from the median; a robust dispersion paired with the median.
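The dispersion measures above can be computed with the standard library as a rough sketch. Two choices here are assumptions rather than part of the definitions: quartiles use the `"inclusive"` interpolation method of `statistics.quantiles` (other conventions give slightly different values), and the CV is based on the population SD.

```python
from statistics import mean, median, pstdev, pvariance, quantiles, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # small illustrative dataset, mean 5

value_range = max(values) - min(values)

# Quartiles; the interpolation method is a convention, not a law.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

pop_var = pvariance(values)  # divides by n
pop_sd = pstdev(values)      # square root of the population variance
sample_sd = stdev(values)    # divides by n - 1, so slightly larger

cv_percent = 100 * pop_sd / mean(values)

# MAD: median of absolute deviations from the median.
med = median(values)
mad = median([abs(x - med) for x in values])

print(f"range={value_range}, IQR={iqr}, variance={pop_var}, SD={pop_sd}")
print(f"sample SD={sample_sd:.3f}, CV={cv_percent:.0f}%, MAD={mad}")
```

Whether to divide by n or n - 1 depends on whether the data is treated as the whole population or as a sample; the difference shrinks as n grows.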
A central-tendency statistic without a dispersion statistic is almost never useful. “Mean monthly spend is ₹7,693” conveys less than “Mean is ₹7,693 with SD ₹10,520”, and “Median is ₹4,200 with IQR ₹3,900” is often the most honest pair on skewed data.
9.5 Measures of Shape
Shape statistics describe how the distribution departs from the bell-shaped normal.
Skewness: a measure of asymmetry. Positive (right) skew: a long right tail; the mean usually sits to the right of the median. Typical of revenue, claim sizes, session durations. Negative (left) skew: a long left tail; the mean usually sits to the left of the median. Less common in business data. Zero skew: no net asymmetry; in a symmetric distribution the mean equals the median.
Kurtosis: a measure of tail heaviness relative to a normal distribution. Mesokurtic (kurtosis = 3, excess = 0): normal-like tails. Leptokurtic (excess > 0): heavier tails, more extreme values than a normal would predict. Typical of financial returns. Platykurtic (excess < 0): lighter tails than normal, fewer extremes.
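Both shape statistics can be written in terms of central moments. The sketch below uses the simple moment-based (population) definitions; the helper names are hypothetical, and statistical packages often apply small-sample bias corrections, so their output may differ slightly on short series.

```python
from statistics import fmean

def central_moment(xs, k):
    """k-th central moment: the mean of (x - mean)^k."""
    m = fmean(xs)
    return fmean([(x - m) ** k for x in xs])

def skewness(xs):
    """Moment-based skewness: third moment over SD cubed."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]  # one large value creates a long right tail

print(f"symmetric:    skew={skewness(symmetric):.2f}, "
      f"excess kurtosis={excess_kurtosis(symmetric):.2f}")
print(f"right-skewed: skew={skewness(right_skewed):.2f}")
```

The evenly spaced series has zero skew and negative excess kurtosis (platykurtic), while the series with one large value has clearly positive skew.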
9.6 Measures of Position
Position statistics locate a value relative to the rest of the distribution.
Percentiles: the p-th percentile is the value below which p percent of the data falls. Quartiles are the 25th, 50th, and 75th percentiles; deciles split the data into tenths. Percentiles are the standard way to report distributions of revenue, wait times, and response times (an SLA of “p95 under 200 ms” is a percentile claim).
Z-scores: a z-score expresses a value in units of standard deviations from the mean: z = (x - mean) / SD. It allows comparison across variables measured on different scales and is the basis for many outlier rules.
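A percentile claim such as "p95 under 200 ms" can be checked with `statistics.quantiles`. The latency values are invented, and the `"inclusive"` interpolation method is one of several percentile conventions, so other tools may report slightly different cut points on the same data.

```python
from statistics import quantiles

# Hypothetical response times in milliseconds (21 observations).
latencies_ms = [
    80, 85, 90, 92, 95, 98, 100, 102, 105, 108, 110,
    112, 115, 118, 120, 125, 130, 140, 150, 180, 450,
]

# n=100 returns 99 cut points; index i - 1 holds the i-th percentile.
cuts = quantiles(latencies_ms, n=100, method="inclusive")
p50 = cuts[49]
p95 = cuts[94]

sla_limit_ms = 200  # threshold taken from the SLA example in the text
meets_sla = p95 < sla_limit_ms

print(f"p50={p50} ms, p95={p95} ms, max={max(latencies_ms)} ms")
print(f"p95 under {sla_limit_ms} ms: {meets_sla}")
```

The single 450 ms request does not break the p95 target, which is exactly why percentile targets are usually reported alongside the maximum.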
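The z-score formula translates directly into code. This sketch uses the population SD (`statistics.pstdev`); using the sample SD instead is an equally common choice and would give slightly smaller absolute z-scores.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data: mean 5, population SD 2

mu = mean(values)
sd = pstdev(values)

# z = (x - mean) / SD for every observation.
z_scores = [(x - mu) / sd for x in values]

for x, z in zip(values, z_scores):
    print(f"x={x}  z={z:+.2f}")
```

The largest value, 9, sits two standard deviations above the mean, and the smallest, 2, sits one and a half below it.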
9.7 Frequency Distributions
For categorical and discrete data, the natural summary is a frequency distribution: how often each category occurs. Three derived quantities are reported alongside raw counts.
Frequency is the count of observations in each category. Relative frequency is the count divided by the total, expressed as a proportion or percentage. Cumulative frequency is the running total from the first category onward (only meaningful for ordered categories).
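The three quantities can be sketched with `collections.Counter` and `itertools.accumulate`. The satisfaction ratings and their ordering are hypothetical; the cumulative column is only meaningful because the categories have a natural order.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ordinal ratings from ten respondents.
ratings = ["Low", "Medium", "High", "Medium", "Low",
           "Medium", "High", "Medium", "Medium", "Low"]
order = ["Low", "Medium", "High"]  # explicit category order

counts = Counter(ratings)
total = sum(counts.values())

frequency = [counts[c] for c in order]
relative = [f / total for f in frequency]
cumulative = list(accumulate(frequency))

print(f"{'Category':<10}{'n':>4}{'%':>8}{'cum n':>8}")
for cat, f, r, c in zip(order, frequency, relative, cumulative):
    print(f"{cat:<10}{f:>4}{r:>8.0%}{c:>8}")
```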
9.8 Cross-Tabulation and Contingency Tables
Cross-tabulation shows the joint distribution of two (or more) categorical variables. It is the starting point for describing association before any formal test is run.
A contingency table with three percentage layers (joint, row, column) is hard to read. Pick the orientation that matches the question: row percentages for “within each region, how is the plan mix?”; column percentages for “within each plan, how is the region mix?”.
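The difference between row and column percentages is easiest to see on a tiny table. The sketch below builds a contingency table from hypothetical (region, plan) records using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical customer records: (region, plan).
records = [
    ("North", "Basic"), ("North", "Basic"), ("North", "Premium"),
    ("South", "Basic"), ("South", "Premium"),
    ("South", "Premium"), ("South", "Premium"),
]

joint = Counter(records)  # joint counts per (region, plan) cell
regions = sorted({r for r, _ in records})
plans = sorted({p for _, p in records})

row_totals = {r: sum(joint[(r, p)] for p in plans) for r in regions}
col_totals = {p: sum(joint[(r, p)] for r in regions) for p in plans}

# Row percentages: within each region, how is the plan mix?
row_pct = {r: {p: 100 * joint[(r, p)] / row_totals[r] for p in plans}
           for r in regions}

# Column percentages: within each plan, how is the region mix?
col_pct = {p: {r: 100 * joint[(r, p)] / col_totals[p] for r in regions}
           for p in plans}

for r in regions:
    cells = ", ".join(f"{p} {row_pct[r][p]:.0f}%" for p in plans)
    print(f"{r}: {cells}")
```

The same cell (South, Premium) is 75 percent of South customers and also 75 percent of Premium customers here only by coincidence; in general the two orientations give different numbers and answer different questions.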
9.9 Grouped Descriptives
Most real questions are conditional: mean spend by region, median response time by hour of day, CSAT by tier. The split-apply-combine pattern computes the same statistic on each group and returns a compact table.
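A minimal sketch of split-apply-combine using only the standard library. The rows are hypothetical; with a dataframe library the same pattern is typically a single group-by call, but the three steps are the same.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical order rows: (region, spend).
rows = [
    ("North", 100), ("South", 250), ("North", 300),
    ("South", 50), ("East", 400), ("South", 150),
]

# Split: collect the values for each group.
groups = defaultdict(list)
for region, spend in rows:
    groups[region].append(spend)

# Apply and combine: the same statistics per group, gathered into one table.
summary = {
    region: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
    for region, vals in sorted(groups.items())
}

for region, stats in summary.items():
    print(f"{region:<6} n={stats['n']}  mean={stats['mean']}  median={stats['median']}")
```

Reporting n alongside each group statistic makes it obvious when a group mean rests on only one or two observations.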
9.10 Descriptive Visualisation
Visualisation is an inseparable part of descriptive analysis. A correctly chosen chart communicates shape and scale at a glance; a poorly chosen one obscures both.
- Histogram: shape and modality of a continuous variable.
- Density plot: a smoothed histogram, useful for comparing two distributions.
- Boxplot: median, IQR, and outliers at a glance; excellent for grouped comparisons.
- Bar chart: frequencies of a categorical variable.
- Pareto chart: bars ordered by frequency with a cumulative line, for prioritising categories.
- Scatterplot: pairwise relationship between two continuous variables.
A pie chart encodes frequency through angle, which the eye is poor at comparing. A bar chart conveys the same information more accurately and in less space. Reserve pies for the rare cases where “share of a whole” is the only message and there are no more than three or four categories.
9.11 Reporting Descriptive Statistics
The conventions for reporting descriptive statistics are tight and widely followed. Adhering to them makes the work easier to read, review, and reproduce.
Research articles and board-grade reports typically open with a table that summarises every variable: for continuous variables, mean (SD) or median (IQR); for categorical variables, count (percentage). The sample size (n) is always stated. Missing counts are reported rather than hidden.
Two to three significant figures are usually enough. Reporting “mean CSAT 4.127348” suggests false precision and is harder to read than “mean CSAT 4.13 (SD 0.72)”. Always include units: ₹, percentage points, minutes, days.
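One way to apply these conventions consistently is to generate the summary line in code. The helper below is a hypothetical sketch: it reports mean (SD), n, and the missing count, and rounds with the `g` format specifier, which suits small-scale values such as CSAT but switches to scientific notation for large numbers, so currency amounts would need a different format.

```python
from statistics import mean, stdev

def summarise(name, raw_values):
    """Format 'mean (SD), n, missing' for one continuous variable."""
    observed = [v for v in raw_values if v is not None]
    missing = len(raw_values) - len(observed)
    m = mean(observed)
    sd = stdev(observed)  # sample SD
    return f"{name}: mean {m:.3g} (SD {sd:.2g}), n = {len(observed)}, missing = {missing}"

# Hypothetical CSAT scores; None marks a missing response.
csat_raw = [4, 5, None, 3, 4, 5, 4, 4, 3, 5, 4, None]

line = summarise("CSAT", csat_raw)
print(line)
```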
9.12 Common Mistakes
1. Reporting a mean without a dispersion statistic. A central value alone cannot communicate the distribution.
2. Using the arithmetic mean on heavily skewed data. Report the median, or the geometric mean for multiplicative data.
3. Treating a Likert score of 3.4 as “a number”. Report the median and the distribution; if a mean is used, accompany it with a disclaimer.
4. Averaging percentages without weighting by base. A 50 percent response rate from a sample of 10 and a 20 percent response rate from a sample of 1,000 do not average to 35 percent.
5. Reporting statistics without the sample size. Every summary is conditional on n; without it, the reader cannot judge the reliability of the estimate.
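The response-rate example in the fourth mistake can be worked through directly. The sketch below uses the sample sizes from the text; the response counts (5 and 200) follow from the stated rates.

```python
# (responses, sample size) for the two segments described above.
segments = [(5, 10), (200, 1000)]

rates = [responded / n for responded, n in segments]  # 50% and 20%

# The mistake: a plain average of the two percentages.
naive_average = sum(rates) / len(rates)

# The fix: total responses over total sample, i.e. weighting each rate by its base.
total_responses = sum(responded for responded, _ in segments)
total_sample = sum(n for _, n in segments)
pooled_rate = total_responses / total_sample

print(f"naive average: {naive_average:.1%}")
print(f"pooled rate:   {pooled_rate:.1%}")
```

The pooled rate is about 20.3 percent, barely above the large segment's rate, because the ten-person segment contributes less than one percent of the combined sample.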
9.13 Summary
| Concept | Description |
|---|---|
| Central Tendency | |
| Mean | Arithmetic average; the default for interval and ratio data |
| Median | Middle value; robust to outliers; use on skewed or ordinal data |
| Mode | Most frequent value; the only central measure for nominal data |
| Dispersion | |
| Range | Maximum minus minimum; simple but dominated by extremes |
| IQR | Q3 minus Q1; robust; paired with the median |
| Standard deviation | Square root of variance; default dispersion with the mean |
| Coefficient of variation | SD divided by mean; unit-free; compare across scales |
| Shape | |
| Skewness | Asymmetry of the distribution; positive, negative, or zero |
| Kurtosis | Tail heaviness relative to a normal distribution |
| Position and Frequency | |
| Percentile | Value below which a stated share of data falls |
| Z-score | Value expressed in standard-deviation units from the mean |
| Frequency table | Counts and proportions for a categorical variable |
| Contingency table | Joint distribution of two categorical variables |
| Reporting Conventions | |
| Sample size | Always state n; every summary is conditional on it |
| Significant figures | Two to three significant figures is usually enough |
| Table 1 | Open the report with a descriptive summary of every variable |
A careful descriptive summary rarely settles a question on its own, but a rushed or careless one almost always misleads the analysis that follows. Description is the foundation on which every later inference stands.