9 Descriptive Data Analysis

9.1 What Descriptive Data Analysis Is

Descriptive data analysis summarises the essential features of a dataset so that a human can reason about it. Its purpose is to describe what the data shows, not to infer anything beyond the data at hand or to predict what will happen next. The output is a compact set of numbers, tables, and charts that together answer the question: what does this dataset look like?

A working definition

Descriptive analysis applies arithmetic, tabular, and visual summaries to a dataset in order to communicate its central features: where the values cluster, how they spread, what shape their distribution takes, and how they relate to other variables.

The ceiling of descriptive analysis

Descriptive statistics describe the data that was actually measured. They do not, by themselves, support claims about a wider population, a future period, or a causal mechanism. Such claims require the additional machinery of sampling theory, hypothesis testing, or causal inference.

9.2 Why Descriptive Analysis Comes First

A short descriptive pass is the cheapest, fastest, and most informative use of an analyst’s first hour with a new dataset. It delivers four things.

Four reasons to begin with description

Familiarity: orders of magnitude, units, and typical values are learned quickly.
Error detection: implausible extremes, impossible dates, and negative revenues surface immediately.
Method selection: the shape and scale of a variable determine whether a mean, a median, or a log-transformed mean is the right summary and whether a parametric or non-parametric test is justified later.
Baseline: the descriptive snapshot becomes the reference against which any modelling or experimental result is sanity-checked.

9.3 Measures of Central Tendency

Central-tendency statistics summarise the “typical” value of a variable. Three classical measures plus two specialised means cover almost all business use cases.

Arithmetic mean

The sum of the values divided by the count. The default summary for interval and ratio data. Sensitive to extreme values, so the arithmetic mean of revenue in a long-tailed distribution can exceed the value that most customers actually produce.

Median

The middle value when the data is ordered. Robust to outliers. The median is the recommended summary for skewed distributions (revenue, wait times, income) and, unlike the mean, is defined for ordinal data.

Mode

The most frequent value. The only central measure defined for nominal data (most-common payment method, most-common city). A distribution can be unimodal, bimodal, or multimodal; a bimodal distribution usually signals that the sample mixes two underlying groups.

Geometric and harmonic means

The geometric mean is the n-th root of the product of n positive values; it is the appropriate average for multiplicative processes such as compound growth and financial returns, where it is applied to growth factors (1 + rate) rather than to the rates themselves. The harmonic mean is n divided by the sum of reciprocals; it is the appropriate average for rates measured over a fixed quantity, such as average speed over a fixed distance.
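
The five measures can be compared side by side on a small example. The sketch below uses Python's standard statistics module; the spend figures, growth factors, and speeds are made-up illustrative values, not data from this chapter.

```python
import statistics

# Hypothetical monthly spend for seven customers, with one large outlier.
spend = [120, 150, 150, 180, 200, 220, 2400]

mean_spend = statistics.mean(spend)      # pulled upward by the outlier
median_spend = statistics.median(spend)  # middle value, robust to the outlier
mode_spend = statistics.mode(spend)      # most frequent value

# Geometric mean of growth factors (1 + rate) gives the average compound factor.
growth_factors = [1.10, 0.95, 1.20]
avg_factor = statistics.geometric_mean(growth_factors)

# Harmonic mean of speeds over two equal distances gives the overall average speed.
speeds_kmh = [40, 60]
avg_speed = statistics.harmonic_mean(speeds_kmh)

print(mean_spend, median_spend, mode_spend, avg_factor, avg_speed)
```

In this example the single outlier pulls the mean far above the median of 180, and the harmonic mean of 40 and 60 km/h over equal distances is 48 km/h rather than 50.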

Do not average averages

The average of three regional means is not the overall mean unless every region has the same sample size. Always compute the mean on the pooled data, or take a weighted average using the region sizes as weights.
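
A small numerical sketch makes the gap visible. The region names, means, and sample sizes below are hypothetical and chosen only to illustrate the arithmetic.

```python
# Hypothetical regional summaries: region -> (mean spend, number of customers).
regions = {
    "North": (100.0, 10),
    "South": (200.0, 30),
    "West": (400.0, 60),
}

# Naive approach: average the three regional means, ignoring region size.
naive_mean = sum(mean for mean, _ in regions.values()) / len(regions)

# Weighted approach: weight each regional mean by its sample size.
total_n = sum(n for _, n in regions.values())
weighted_mean = sum(mean * n for mean, n in regions.values()) / total_n

print(round(naive_mean, 2), weighted_mean)
```

The naive figure is about 233.33, while the weighted mean, which equals the mean of the pooled data, is 310.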

9.4 Measures of Dispersion

Dispersion describes how spread out the values are. Two datasets can share the same mean yet differ sharply in behaviour; dispersion is what tells them apart.

Six common measures

Range: maximum minus minimum; simple but dominated by extremes.
Interquartile range (IQR): Q3 minus Q1; robust and the dispersion companion to the median.
Variance: the mean of squared deviations from the mean (the sample version divides by n - 1 rather than n); has squared units, which is inconvenient.
Standard deviation (SD): square root of variance; same units as the data; the default dispersion for interval and ratio variables.
Coefficient of variation (CV): SD divided by mean, reported as a percentage; a unit-free measure that allows comparison across variables on different scales.
Median absolute deviation (MAD): the median of absolute deviations from the median; a robust dispersion paired with the median.
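
The six measures can be computed with Python's standard library, as in the following minimal sketch. The data values are hypothetical. Quartile conventions differ between tools, so the IQR shown here reflects the default "exclusive" method of statistics.quantiles and may not match other software exactly.

```python
import statistics

# Hypothetical observations of a single variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)

# Quartiles; the default method of statistics.quantiles is "exclusive".
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

pop_variance = statistics.pvariance(values)    # divides by n
sample_variance = statistics.variance(values)  # divides by n - 1
pop_sd = statistics.pstdev(values)             # square root of pop_variance

# Coefficient of variation as a percentage of the mean.
cv_percent = pop_sd / statistics.mean(values) * 100

# Median absolute deviation from the median.
med = statistics.median(values)
mad = statistics.median([abs(v - med) for v in values])

print(value_range, iqr, pop_variance, sample_variance, pop_sd, cv_percent, mad)
```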

Report a pair

A central-tendency statistic without a dispersion statistic is almost never useful. “Mean monthly spend is ₹7,693” conveys less than “Mean is ₹7,693 with SD ₹10,520”, and “Median is ₹4,200 with IQR ₹3,900” is often the most honest pair on skewed data.

9.5 Measures of Shape

Shape statistics describe how the distribution departs from the bell-shaped normal.

Skewness

A measure of asymmetry.
Positive (right) skew: a long right tail; the mean typically sits to the right of the median. Typical of revenue, claim sizes, session durations.
Negative (left) skew: a long left tail; the mean typically sits to the left of the median. Less common in business data.
Zero skew: characteristic of a symmetric distribution, in which the mean equals the median.

Kurtosis

A measure of tail heaviness relative to a normal distribution.
Mesokurtic (kurtosis = 3, excess = 0): normal-like tails.
Leptokurtic (excess > 0): heavier tails, more extreme values than a normal would predict. Typical of financial returns.
Platykurtic (excess < 0): lighter tails than normal, fewer extremes.
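
Both shape statistics can be computed from standardised moments. The sketch below uses the simple population (moment-based) definitions. Statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and the data are illustrative assumptions.

```python
import statistics

def skewness(xs):
    """Population skewness: mean of cubed standardised deviations."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)  # assumes the data are not all identical
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Population excess kurtosis: fourth standardised moment minus 3."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric sample
right_skewed = [1, 1, 1, 2, 10]  # hypothetical sample with a long right tail

print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(symmetric))
```

The symmetric sample has skewness of zero and negative excess kurtosis (platykurtic), while the second sample has positive skewness.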

9.6 Measures of Position

Position statistics locate a value relative to the rest of the distribution.

Percentiles and quartiles

The p-th percentile is the value below which p percent of the data falls. Quartiles are the 25th, 50th, and 75th percentiles; deciles split the data into tenths. Percentiles are the standard way to report distributions of revenue, wait times, and response times (an SLA of “p95 under 200 ms” is a percentile claim).

Z-scores

A z-score expresses a value in units of standard deviations from the mean: z = (x - mean) / SD. It allows comparison across variables measured on different scales and is the basis for many outlier rules.
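
A percentile check and a z-score calculation can be sketched as follows. The latency values and the 200 ms threshold are hypothetical. Percentile interpolation rules vary between tools; this sketch uses the "inclusive" method of statistics.quantiles, and the sample standard deviation for the z-scores.

```python
import statistics

# Hypothetical response times in milliseconds, including one slow request.
latencies_ms = [110, 120, 125, 130, 135, 140, 145, 150, 155, 160,
                165, 170, 175, 180, 185, 190, 195, 210, 240, 480]

# n=100 yields 99 cut points; index 94 is the 95th percentile.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p95 = cuts[94]
meets_sla = p95 < 200  # the "p95 under 200 ms" claim

# z = (x - mean) / SD for every observation.
mean = statistics.fmean(latencies_ms)
sd = statistics.stdev(latencies_ms)
z_scores = [(x - mean) / sd for x in latencies_ms]

print(p95, meets_sla, round(max(z_scores), 2))
```

With these values the p95 exceeds 200 ms, and the slowest request sits more than three standard deviations above the mean.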

9.7 Frequency Distributions

For categorical and discrete data, the natural summary is a frequency distribution: how often each category occurs. Three derived quantities are reported alongside raw counts.

Frequency, relative frequency, cumulative frequency

Frequency is the count of observations in each category. Relative frequency is the count divided by the total, expressed as a proportion or percentage. Cumulative frequency is the running total from the first category onward (only meaningful for ordered categories).
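
The three quantities can be derived from raw observations in a few lines. The ratings below are hypothetical survey responses on an ordered 1 to 5 scale, so a cumulative frequency is meaningful.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical satisfaction ratings on an ordered 1-5 scale.
ratings = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]

counts = Counter(ratings)
levels = sorted(counts)                    # ordered categories
frequency = [counts[level] for level in levels]
total = sum(frequency)
relative = [f / total for f in frequency]  # proportions that sum to 1
cumulative = list(accumulate(frequency))   # running total

for level, f, r, c in zip(levels, frequency, relative, cumulative):
    print(level, f, f"{r:.0%}", c)
```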

9.8 Cross-Tabulation and Contingency Tables

Cross-tabulation shows the joint distribution of two (or more) categorical variables. It is the starting point for describing association before any formal test is run.

Choose row or column percentages, not both

A contingency table with three percentage layers (joint, row, column) is hard to read. Pick the orientation that matches the question: row percentages for “within each region, what is the plan mix?”; column percentages for “within each plan, what is the region mix?”.
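
A contingency table with row percentages can be built from paired observations, as in this minimal sketch. The region and plan labels and their counts are hypothetical.

```python
from collections import Counter

# Hypothetical (region, plan) pairs, one per customer.
records = (
    [("North", "Basic")] * 3 + [("North", "Pro")] * 1
    + [("South", "Basic")] * 1 + [("South", "Pro")] * 3
)

joint = Counter(records)  # joint counts per (region, plan) cell
regions = sorted({region for region, _ in records})
plans = sorted({plan for _, plan in records})

# Row percentages: within each region, what is the plan mix?
row_totals = {r: sum(joint[(r, p)] for p in plans) for r in regions}
row_pct = {
    r: {p: 100 * joint[(r, p)] / row_totals[r] for p in plans}
    for r in regions
}

for r in regions:
    print(r, row_pct[r])
```

Column percentages follow the same pattern with the roles of region and plan swapped.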

9.9 Grouped Descriptives

Most real questions are conditional: mean spend by region, median response time by hour of day, CSAT by tier. The split-apply-combine pattern computes the same statistic on each group and returns a compact table.
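
The split-apply-combine pattern can be sketched with the standard library alone. The rows below are hypothetical (region, spend) records; a dataframe library would express the same idea more compactly.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, spend) records.
rows = [
    ("North", 120), ("North", 150), ("North", 180),
    ("South", 90), ("South", 400), ("South", 110),
]

# Split: collect the values belonging to each group.
groups = defaultdict(list)
for region, spend in rows:
    groups[region].append(spend)

# Apply and combine: compute the same statistics per group into one table.
summary = {
    region: {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }
    for region, values in groups.items()
}

for region, stats in summary.items():
    print(region, stats)
```

In this example the South group has a mean of 200 but a median of 110, which shows why reporting both per group can be informative.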

9.10 Descriptive Visualisation

Visualisation is an inseparable part of descriptive analysis. A correctly chosen chart communicates shape and scale at a glance; a poorly chosen one obscures both.

Chart-to-measure alignment

Histogram: shape and modality of a continuous variable.
Density plot: a smoothed histogram, useful for comparing two distributions.
Boxplot: median, IQR, and outliers at a glance; excellent for grouped comparisons.
Bar chart: frequencies of a categorical variable.
Pareto chart: bars ordered by frequency with a cumulative line, for prioritising categories.
Scatterplot: pairwise relationship between two continuous variables.
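
The numbers behind a Pareto chart are simply a frequency table sorted in descending order with a cumulative share. The sketch below prepares that table without drawing anything; the complaint categories are hypothetical, and any plotting library could render the result as bars plus a cumulative line.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical complaint categories, one entry per complaint.
complaints = ["billing"] * 5 + ["delivery"] * 3 + ["app"] * 1 + ["other"] * 1

# most_common() returns (category, count) pairs in descending order of count.
ordered = Counter(complaints).most_common()
total = sum(count for _, count in ordered)

# Cumulative share in percent, which the line on a Pareto chart traces.
running = accumulate(count for _, count in ordered)
cumulative_share = [100 * r / total for r in running]

pareto_table = [
    (category, count, share)
    for (category, count), share in zip(ordered, cumulative_share)
]

for row in pareto_table:
    print(row)
```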

Pie charts rarely help

A pie chart encodes frequency through angle, which the eye is poor at comparing. A bar chart conveys the same information more accurately and in less space. Reserve pies for the rare cases where “share of a whole” is the only message and there are no more than three or four categories.

9.11 Reporting Descriptive Statistics

The conventions for reporting descriptive statistics are tight and widely followed. Adhering to them makes the work easier to read, review, and reproduce.

The “Table 1” convention

Research articles and board-grade reports typically open with a table that summarises every variable: for continuous variables, mean (SD) or median (IQR); for categorical variables, count (percentage). The sample size (n) is always stated. Missing counts are reported rather than hidden.

Significant figures and units

Two to three significant figures are usually enough. Reporting “mean CSAT 4.127348” suggests a false precision and is harder to read than “mean CSAT 4.13 (SD 0.72)”. Always include units: ₹, percentage points, minutes, days.

9.12 Common Mistakes

Five frequent errors

1. Reporting a mean without a dispersion statistic. A central value alone cannot communicate the distribution.
2. Using the arithmetic mean on heavily skewed data. Report the median, or the geometric mean for multiplicative data.
3. Treating a Likert score of 3.4 as “a number”. Report the median and the distribution; if a mean is used, state that it treats the ordinal scale as if it were interval.
4. Averaging percentages without weighting by base. A 50 percent response rate from a sample of 10 and a 20 percent response rate from a sample of 1,000 do not average to 35 percent; the pooled rate is 205 responses out of 1,010, or about 20.3 percent.
5. Reporting statistics without the sample size. Every summary is conditional on n; without it, the reader cannot judge the reliability of the estimate.

Summary

Concept	Description
Foundations
What Descriptive Analysis Is	Working definition: arithmetic, tabular, and visual summaries of a dataset's central features
Why Begin Here	Description grounds every later inference and model
Central Tendency
Mean	Arithmetic average for symmetric, well-behaved data
Median	Middle value, robust to outliers and skewed distributions
Mode	Most frequent value; works for any measurement level
Geometric and Harmonic Means	Geometric for growth factors and returns; harmonic for rates over a fixed quantity
Dispersion
Six Dispersion Measures	Range, IQR, variance, SD, MAD, and coefficient of variation
Reporting	Always report a centre alongside a spread