12  Univariate Data Analysis with R

12.1 Univariate Analysis in Context

Univariate analysis examines one variable at a time. It asks focused questions: what is the centre and spread of this variable, what is its shape, and is its parameter consistent with a pre-specified value? Chapter 11 introduced the confirmatory framework; this chapter applies that framework to the univariate case using R.

Note: What belongs in a univariate study

Centre, spread, shape, and a single test against a fixed benchmark. Anything that compares two or more variables or groups is bivariate or multivariate territory and belongs to Chapters 13 and 14.

Tip: Match the test to the measurement level

Numeric variables call for t, Wilcoxon, or sign-based tests. Categorical variables call for proportion or chi-square goodness-of-fit tests. The measurement level is the first fork in the decision tree.

12.2 Inspecting a Single Variable in R

Before any test is chosen, the variable itself is inspected: its class, length, missingness, and basic summary. In R, the functions str(), summary(), and table() cover the majority of cases.
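A minimal sketch of this inspection step follows. The data are simulated, and the names order_value and segment are hypothetical stand-ins for whatever variable is under study; two missing values are injected on purpose so that the missingness check has something to find.

```r
# Hypothetical simulated sample: 60 order values and a customer segment.
set.seed(42)
order_value <- round(rnorm(60, mean = 52, sd = 8), 1)
order_value[c(7, 31)] <- NA                  # inject two missing values
segment <- factor(sample(c("Basic", "Plus", "Premium"), 60,
                         replace = TRUE, prob = c(0.5, 0.3, 0.2)))

str(order_value)                   # class and length
summary(order_value)               # quartiles, mean, and the NA count
sum(is.na(order_value))            # explicit missingness count
table(segment, useNA = "ifany")    # category balance for the factor
```

The useNA = "ifany" argument makes table() show a missing category when one exists, instead of silently dropping it.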

Note: Three questions this block answers

Is the variable the class you expected, does the summary look plausible, and what is the category balance for a factor? A surprise at this stage usually means a data-quality issue, not a modelling issue.

12.3 Univariate Visualisation

A single variable is visualised with a histogram or density for continuous data and a bar chart for categorical data. A boxplot adds a compact view of spread and tails.
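The sketch below draws these charts with base graphics on simulated data. The variable names are hypothetical, and the number of breaks is an arbitrary starting value, not a recommendation.

```r
# Hypothetical simulated sample, as a stand-in for real data.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
segment <- factor(sample(c("Basic", "Plus", "Premium"), 60, replace = TRUE))

op <- par(mfrow = c(1, 3))                    # three panels side by side

# Histogram on the density scale so the density curve can be overlaid.
h <- hist(order_value, breaks = 12, freq = FALSE,
          main = "Shape", xlab = "Order value")
lines(density(order_value), lwd = 2)

# Boxplot: compact view of spread, tails, and outliers.
boxplot(order_value, horizontal = TRUE, main = "Spread and tails")

# Bar chart of category counts.
barplot(table(segment), main = "Category balance", ylab = "Count")

par(op)                                       # restore the plotting layout
```

Setting freq = FALSE matters here: a density curve drawn over a count-scale histogram would sit on the wrong vertical scale.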

Tip: Three charts, three questions

A histogram answers “what is the shape,” a boxplot answers “where are the tails and outliers,” and a bar chart answers “how are categories distributed.” Together they describe a single variable in under a minute.

12.4 One-Sample t-Test

The one-sample t-test asks whether the mean of a variable is consistent with a fixed benchmark. It assumes that the observations are independent and that the sampling distribution of the mean is approximately normal. By the central limit theorem, the second assumption usually holds for moderate samples even when the variable itself is not normal.
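As an illustration, the sketch below runs the test on simulated data against a benchmark of 50. Both the data and the benchmark are hypothetical; in a real study the benchmark comes from the business question.

```r
# Hypothetical simulated sample and benchmark.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
benchmark <- 50

# Two-sided one-sample t-test of H0: mean = benchmark.
tt <- t.test(order_value, mu = benchmark)
tt

# Individual components of the result object.
tt$statistic   # t-statistic
tt$parameter   # degrees of freedom (n - 1)
tt$p.value     # two-sided p-value
tt$conf.int    # 95 percent confidence interval for the mean
tt$estimate    # sample mean
```

A one-sided version is available through the alternative argument ("less" or "greater"), and conf.level changes the interval width.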

Note: Reading the output

The t.test output reports the t-statistic, degrees of freedom, p-value, the 95 percent confidence interval for the mean, and the sample estimate. Interpretation follows Chapter 11: the p-value is a tail probability under H0, not a statement about H0 itself.

12.5 Checking Normality

The t-test is robust to mild departures from normality at moderate sample sizes, but a pre-check still matters for small samples. Shapiro and Wilk’s W (Shapiro and Wilk 1965) is the standard numerical check; the Q-Q plot is the standard visual one.
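Both checks can be sketched in a few lines of base R. The sample is simulated and the variable name is hypothetical.

```r
# Hypothetical simulated sample.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)

# Numerical check: Shapiro-Wilk W and its p-value.
sw <- shapiro.test(order_value)
sw

# Visual check: points close to the reference line suggest approximate normality.
qqnorm(order_value, main = "Normal Q-Q plot")
qqline(order_value)
```

W lies between 0 and 1, and values close to 1 indicate agreement with normality.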

Warning: Do not over-interpret a normality test

Shapiro-Wilk has high power in large samples, meaning it flags trivial departures that do not threaten the t-test. For n above about 100, treat the Q-Q plot as the primary diagnostic.

12.6 One-Sample Wilcoxon Signed-Rank Test

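When the distribution is skewed or the sample is small, the one-sample Wilcoxon signed-rank test (Wilcoxon 1945) replaces the t-test. It asks whether the median is consistent with a benchmark, using ranks rather than raw values. Strictly, the test targets the pseudomedian, which coincides with the median when the distribution is symmetric about its centre; under strong skew, the sign test in Section 12.7 is the safer test of the median.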
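A minimal sketch follows, using a small simulated sample in which two extreme values pull the mean upward. The variable wait_time and the benchmark of 9 are hypothetical.

```r
# Hypothetical small sample: 18 typical values plus two extreme ones.
set.seed(42)
wait_time <- c(rnorm(18, mean = 10, sd = 2), 25, 31)
benchmark <- 9

# One-sample Wilcoxon signed-rank test of H0: centre = benchmark.
# conf.int = TRUE also returns a location estimate and its interval.
wt <- wilcox.test(wait_time, mu = benchmark, conf.int = TRUE)
wt

wt$statistic   # V, the sum of ranks of the positive deviations
wt$p.value     # two-sided p-value
wt$estimate    # location estimate reported by wilcox.test
wt$conf.int    # confidence interval for that location
```

Comparing mean(wait_time) with wt$estimate shows how much the two extreme values move the mean relative to a rank-based estimate.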

Note: When to prefer Wilcoxon over t

Choose Wilcoxon when the variable is ordinal, when the sample is small and visibly skewed, or when a few extreme values dominate the mean. For symmetric heavy-tailed data, t is still usually fine.

12.7 Sign Test

The sign test is the simplest non-parametric univariate test. It asks whether the proportion of observations above a hypothesised median equals one half, using only the signs of the deviations. It has low power but no distributional assumption beyond independence.
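One way to carry out the sign test in base R is through binom.test, as sketched below. The data and the hypothesised median of 9 are hypothetical.

```r
# Hypothetical small sample with two extreme values.
set.seed(42)
wait_time <- round(c(rnorm(18, mean = 10, sd = 2), 25, 31), 1)
median0 <- 9                       # hypothesised median

dev <- wait_time - median0
dev <- dev[dev != 0]               # drop ties with the hypothesised median
n_above <- sum(dev > 0)            # observations above the hypothesised median

# Sign test: is the proportion above median0 consistent with one half?
st <- binom.test(n_above, length(dev), p = 0.5)
st
```

Only the signs of the deviations enter the test, so the two extreme values count no more than any other observation above the benchmark.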

Tip: Sign test in one line via binomial

R has no sign.test in base, but the sign test is a binom.test of successes above the hypothesised median against p = 0.5 after dropping ties. This keeps the dependency surface small.

12.8 Chi-Square Goodness-of-Fit Test

For a categorical variable, the chi-square goodness-of-fit test (Pearson 1900) compares observed category counts to a vector of expected proportions.
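The sketch below applies the test to made-up counts for three categories. The category names, counts, and expected proportions are all hypothetical.

```r
# Hypothetical observed counts and expected proportions (must sum to 1).
observed   <- c(Basic = 48, Plus = 33, Premium = 19)
expected_p <- c(Basic = 0.5, Plus = 0.3, Premium = 0.2)

# Goodness-of-fit test: do the observed counts match the expected proportions?
gof <- chisq.test(x = observed, p = expected_p)
gof

gof$expected    # expected counts under H0: 50, 30, 20
gof$statistic   # (48-50)^2/50 + (33-30)^2/30 + (19-20)^2/20 = 0.43
gof$parameter   # degrees of freedom = number of categories - 1 = 2
```

Inspecting gof$expected before reading the p-value is a quick way to confirm that the approximation is reasonable for the sample at hand.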

Warning: Expected counts rule of thumb

Chi-square approximates well when every expected count is at least five. When that fails, collapse adjacent categories or switch to an exact multinomial test.

12.9 Confidence Intervals for Univariate Summaries

A point estimate gains interpretive weight when paired with an interval. For a mean, t.test returns a t-based CI directly. For non-standard summaries such as the median, a bootstrap CI is a general fallback.
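Both intervals can be sketched in base R. The bootstrap shown here is the simple percentile method, chosen for brevity; the data, the seed, and the number of resamples (2000) are illustrative assumptions.

```r
# Hypothetical simulated sample.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)

# t-based 95 percent CI for the mean, taken from t.test.
ci_mean <- t.test(order_value)$conf.int
ci_mean

# Percentile bootstrap CI for the median:
# resample with replacement, recompute the median, take the middle 95 percent.
B <- 2000
boot_medians <- replicate(B, median(sample(order_value, replace = TRUE)))
ci_median <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_median
```

The same resampling loop works for other summaries, such as a trimmed mean, by swapping the function applied to each resample.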

Note: CIs carry more information than p-values alone

Two different studies can both reject H0 with the same p-value yet imply very different effect magnitudes. The CI shows the magnitude directly.

12.10 Effect Size for Univariate Tests

Effect sizes translate a test result into a magnitude. For a one-sample mean, Cohen’s d is the distance from the benchmark in units of the standard deviation. For a Wilcoxon test, a common effect size is r = Z divided by the square root of n.
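Both effect sizes can be computed by hand, as in the sketch below. The data and benchmark are hypothetical. Because wilcox.test does not return Z, the code derives it from the signed-rank statistic V using the usual normal approximation, without a continuity correction and assuming no tied values; that choice is an assumption of this sketch.

```r
# Hypothetical simulated sample and benchmark.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
benchmark <- 50

# Cohen's d: distance from the benchmark in standard-deviation units.
cohens_d <- (mean(order_value) - benchmark) / sd(order_value)

# Wilcoxon r = Z / sqrt(n), with Z from the normal approximation to V.
dev <- order_value - benchmark
dev <- dev[dev != 0]                          # drop zero deviations
n <- length(dev)
V <- sum(rank(abs(dev))[dev > 0])             # sum of ranks of positive deviations
z <- (V - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
wilcoxon_r <- z / sqrt(n)

c(cohens_d = cohens_d, wilcoxon_r = wilcoxon_r)
```

The sign of each effect size shows the direction of the departure from the benchmark, and the absolute value shows its magnitude.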

Tip: Report effect size next to every significant p-value

A p-value answers “is the effect distinguishable from noise”; the effect size answers “is it worth acting on.” Business decisions need both.

12.11 Choosing a Univariate Test

The choice of test is driven by measurement level, sample size, and the assumption profile of the variable. The diagram below summarises the usual path.

flowchart TD
    A[Single variable] --> B{Measurement level}
    B -->|Numeric| C{Shape and n}
    B -->|Categorical| E[chisq.test against expected]
    C -->|Approximately normal or large n| F[One-sample t-test]
    C -->|Skewed or small n| G[Wilcoxon signed-rank]
    C -->|Only signs of deviation trusted| H[Sign test via binom.test]

Note: The diagram is a starting point, not a rule

Always pair the chosen test with a diagnostic check: a Q-Q plot for t, a histogram for Wilcoxon, an expected-counts check for chi-square.

12.12 Reporting Univariate Findings

A univariate report re-uses the six-section skeleton introduced for CDA in Chapter 11 and fills it in for a single variable.

Tip: Six-section univariate report

(1) Question and benchmark, (2) variable and sample, (3) diagnostic view and assumption check, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision with caveats. Keeping the skeleton stable across Chapters 11 to 14 lets readers compare studies quickly.

12.13 Summary

Summary of univariate tools introduced in this chapter

| Concept | Description |
|---|---|
| Setting Up the Variable | |
| str / summary / table | Class, summary, and category counts for a single variable |
| Univariate visualisation | Histogram, boxplot, density, bar chart for shape and balance |
| Mini-dataset setup | Self-contained simulated sample that runs without external data |
| Parametric Univariate Tests | |
| One-sample t-test | Tests whether the mean equals a benchmark under approximate normality |
| Shapiro-Wilk normality | Numerical check for normality; power grows fast with n |
| Q-Q plot | Visual diagnostic for departure from normality |
| Non-Parametric Univariate Tests | |
| Wilcoxon signed-rank | Non-parametric one-sample test on the median |
| Sign test via binom.test | Minimal non-parametric test on the sign of deviations |
| When to switch to non-parametric | Skew, small n, ordinal scale, or extreme values dominate the mean |
| Categorical Univariate | |
| Chi-square goodness-of-fit | Compares observed category counts to expected proportions |
| Precision, Effect, Reporting | |
| t-based CI | Interval for a mean returned directly by t.test |
| Bootstrap CI | Resampling interval for any univariate summary (median, trimmed mean, etc.) |
| Cohen's d one-sample | Mean minus benchmark, divided by sample standard deviation |
| Wilcoxon r | Z divided by square root of n; effect size for signed-rank test |
| Six-section univariate report | Question, variable, diagnostic, test, effect, decision |