12 Univariate Data Analysis with R
12.1 Univariate Analysis in Context
Univariate analysis examines one variable at a time. It asks focused questions: what is the centre and spread of this variable, what is its shape, and is its parameter consistent with a pre-specified value? Chapter 11 introduced the confirmatory framework; this chapter applies that framework to the univariate case using R.
The scope of this chapter is centre, spread, shape, and a single test against a fixed benchmark. Anything that compares two or more variables or groups is bivariate or multivariate territory and belongs to chapters 13 and 14.
Numeric variables call for t, Wilcoxon, or sign-based tests. Categorical variables call for proportion or chi-square goodness-of-fit tests. The measurement level is the first fork in the decision tree.
12.2 Inspecting a Single Variable in R
Before any test is chosen, the variable itself is inspected: its class, length, missingness, and basic summary. In R, str, summary, and table cover the majority of cases.
Three questions guide the inspection: is the variable the class you expected, does the summary look plausible, and, for a factor, what is the category balance? A surprise at this stage usually points to a data-quality issue rather than a modelling issue.
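The inspection step can be sketched as follows. The data frame `orders` and its columns `order_value` and `channel` are hypothetical, simulated only so that the example runs without external data.

```r
# Hypothetical mini-dataset: every name and value here is simulated for illustration.
set.seed(42)
orders <- data.frame(
  order_value = c(round(rnorm(58, mean = 52, sd = 8), 2), NA, NA),  # two missing values on purpose
  channel     = factor(sample(c("web", "store", "phone"), 60, replace = TRUE))
)

str(orders)                                   # class and length of each column
summary(orders$order_value)                   # quartiles, mean, and the NA count
sum(is.na(orders$order_value))                # explicit missingness check
table(orders$channel, useNA = "ifany")        # category counts for the factor
round(prop.table(table(orders$channel)), 2)   # category balance as proportions
```

If `str` reports a character column where a number was expected, or `summary` shows an implausible minimum or maximum, that is worth resolving before any test is run.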
12.3 Univariate Visualisation
A single variable is visualised with a histogram or density for continuous data and a bar chart for categorical data. A boxplot adds a compact view of spread and tails.
A histogram answers “what is the shape,” a boxplot answers “where are the tails and outliers,” and a bar chart answers “how are categories distributed.” Together they describe a single variable in under a minute.
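A minimal base-graphics sketch of these views is shown here. The variables `order_value` and `channel` are simulated and hypothetical; the panel layout is one reasonable choice, not a requirement.

```r
# Simulated, hypothetical data so the plots run without external files.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
channel     <- factor(sample(c("web", "store", "phone"), 60, replace = TRUE))

op <- par(mfrow = c(2, 2))                       # 2 x 2 grid of panels
h <- hist(order_value, main = "Shape (histogram)", xlab = "Order value")
plot(density(order_value), main = "Shape (density)", xlab = "Order value")
boxplot(order_value, horizontal = TRUE,
        main = "Tails and outliers", xlab = "Order value")
barplot(table(channel), main = "Category balance", ylab = "Count")
par(op)                                          # restore the previous layout
```

`hist` also returns its bin counts invisibly (stored in `h` above), which is useful when the binning itself needs to be checked.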
12.4 One-Sample t-Test
The one-sample t-test asks whether the mean of a variable is consistent with a fixed benchmark. It assumes that the observations are independent and that the sampling distribution of the mean is approximately normal. By the central limit theorem, the second condition usually holds for moderate samples even when the variable itself is not normal, provided the skew is not severe.
The t.test output reports the t-statistic, degrees of freedom, p-value, the 95 percent confidence interval for the mean, and the sample estimate. Interpretation follows Chapter 11: the p-value is a tail probability under H0, not a statement about H0 itself.
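As an illustration, the call below runs a one-sample t-test on simulated data. The variable `order_value` and the benchmark of 50 are assumptions made for the example, not values from a real study.

```r
# Simulated, hypothetical sample and benchmark.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
benchmark   <- 50

tt <- t.test(order_value, mu = benchmark,
             alternative = "two.sided", conf.level = 0.95)
tt                 # full printed summary

tt$statistic       # t-statistic
tt$parameter       # degrees of freedom (n - 1)
tt$p.value         # tail probability under H0
tt$conf.int        # 95 percent CI for the mean
tt$estimate        # sample mean

# The statistic can be reproduced by hand as a sanity check.
t_manual <- (mean(order_value) - benchmark) / (sd(order_value) / sqrt(length(order_value)))
```

Extracting the components by name, rather than reading them off the printout, makes it easier to carry the numbers into a report.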
12.5 Checking Normality
The t-test is robust to mild departures from normality at moderate sample sizes, but a pre-check still matters for small samples. Shapiro and Wilk’s W (Shapiro and Wilk 1965) is the standard numerical check; the Q-Q plot is the standard visual one.
Shapiro-Wilk has high power in large samples, meaning it flags trivial departures that do not threaten the t-test. For n above about 100, treat the Q-Q plot as the primary diagnostic.
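The two checks can be sketched side by side. Both samples below are simulated and hypothetical: one drawn from a normal distribution, one from a right-skewed exponential distribution.

```r
# Simulated, hypothetical samples for comparison.
set.seed(42)
x_normal <- rnorm(40, mean = 52, sd = 8)
x_skewed <- rexp(40, rate = 1 / 52)      # right-skewed by construction

# Numerical check: shapiro.test accepts between 3 and 5000 non-missing values.
sw_normal <- shapiro.test(x_normal)
sw_skewed <- shapiro.test(x_skewed)
sw_normal$statistic; sw_normal$p.value
sw_skewed$statistic; sw_skewed$p.value

# Visual check: points close to the reference line suggest approximate normality.
op <- par(mfrow = c(1, 2))
qqnorm(x_normal, main = "Q-Q plot: normal sample");  qqline(x_normal)
qqnorm(x_skewed, main = "Q-Q plot: skewed sample");  qqline(x_skewed)
par(op)
```

On the skewed sample, the Q-Q plot typically bends away from the line in the upper tail, which shows where the departure lies in a way the single W statistic cannot.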
12.6 One-Sample Wilcoxon Signed-Rank Test
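When the distribution is skewed or the sample is small, the one-sample Wilcoxon signed-rank test (Wilcoxon 1945) replaces the t-test. It asks whether the median is consistent with a benchmark, using the signed ranks of the deviations rather than the raw values. Strictly, the test targets the pseudomedian, which coincides with the median when the distribution is symmetric about its centre.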
Choose Wilcoxon when the variable is ordinal, when the sample is small and visibly skewed, or when a few extreme values dominate the mean. For symmetric heavy-tailed data, t is still usually fine.
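A minimal sketch of the signed-rank test in R follows. The variable `delivery_days` and the benchmark of 3 days are hypothetical; the sample is simulated from a lognormal distribution to mimic a small, skewed variable.

```r
# Simulated, hypothetical small skewed sample.
set.seed(42)
delivery_days <- rlnorm(25, meanlog = 1.2, sdlog = 0.6)
benchmark     <- 3

wt <- wilcox.test(delivery_days, mu = benchmark,
                  conf.int = TRUE, conf.level = 0.95)
wt$statistic    # V: sum of the ranks of the positive deviations
wt$p.value
wt$estimate     # labelled "(pseudo)median" by R
wt$conf.int     # interval for the pseudomedian

# V can be reproduced by hand from the deviations.
d <- delivery_days - benchmark
v_manual <- sum(rank(abs(d))[d > 0])
```

Setting `conf.int = TRUE` is optional, but it returns an estimate and interval on the original scale, which is usually easier to communicate than V alone.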
12.7 Sign Test
The sign test is the simplest non-parametric univariate test. It asks whether the proportion of observations above a hypothesised median equals one half, using only the signs of the deviations. It has low power but no distributional assumption beyond independence.
Base R has no sign.test function, but the sign test reduces to binom.test applied to the number of observations above the hypothesised median, tested against p = 0.5 after dropping ties. This keeps the dependency surface small.
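The construction can be sketched in a few lines. The variable `delivery_days` and the hypothesised median `m0` are hypothetical and simulated; values are rounded so that ties with the benchmark are possible.

```r
# Simulated, hypothetical sample; rounding makes exact ties with m0 possible.
set.seed(42)
delivery_days <- round(rlnorm(25, meanlog = 1.2, sdlog = 0.6), 1)
m0 <- 3                                  # hypothesised median

d       <- delivery_days - m0
d       <- d[d != 0]                     # drop ties with the hypothesised median
n_above <- sum(d > 0)                    # "successes": observations above m0
n_used  <- length(d)                     # trials remaining after dropping ties

st <- binom.test(n_above, n_used, p = 0.5, alternative = "two.sided")
st$p.value
st$estimate                              # observed proportion above m0
st$conf.int                              # exact CI for that proportion
```

Because only the signs enter the test, a single extreme value has no more influence than any other observation on the same side of the benchmark.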
12.8 Chi-Square Goodness-of-Fit Test
For a categorical variable, the chi-square goodness-of-fit test (Pearson 1900) compares observed category counts to a vector of expected proportions.
The chi-square approximation is reliable when every expected count is at least five. When that condition fails, collapse adjacent categories or switch to an exact multinomial test.
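A short sketch of the goodness-of-fit test, including the expected-counts check, is given here. The observed counts and the expected proportions are hypothetical numbers chosen for illustration.

```r
# Hypothetical observed counts and benchmark proportions (must sum to 1).
observed   <- c(web = 48, store = 37, phone = 15)
expected_p <- c(web = 0.5, store = 0.3, phone = 0.2)

gof <- chisq.test(x = observed, p = expected_p)

gof$expected               # expected counts under H0: 50, 30, 20
all(gof$expected >= 5)     # rule-of-thumb check for the approximation
gof$statistic              # X-squared, about 2.96 for these counts
gof$parameter              # degrees of freedom: categories minus one = 2
gof$p.value                # about 0.227 for these counts

# Monte Carlo p-value, an option built into chisq.test for sparse tables.
# This is a simulation-based alternative, not the exact multinomial test.
set.seed(42)
gof_mc <- chisq.test(x = observed, p = expected_p,
                     simulate.p.value = TRUE, B = 2000)
gof_mc$p.value
```

The statistic is the sum over categories of (observed minus expected) squared, divided by expected: here 4/50 + 49/30 + 25/20.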
12.9 Confidence Intervals for Univariate Summaries
A point estimate gains interpretive weight when paired with an interval. For a mean, t.test returns a t-based CI directly. For non-standard summaries such as the median, a bootstrap CI is a general fallback.
Two different studies can both reject H0 with the same p-value yet imply very different effect magnitudes. The CI shows the magnitude directly.
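Both kinds of interval can be sketched with base R alone. The variable `order_value` is simulated and hypothetical, and the bootstrap shown is the simple percentile method with an assumed 2000 resamples; other bootstrap intervals (for example BCa) exist and may behave better in small samples.

```r
# Simulated, hypothetical right-skewed sample.
set.seed(42)
order_value <- rlnorm(60, meanlog = 3.9, sdlog = 0.4)

# t-based CI for the mean, returned directly by t.test.
ci_mean <- t.test(order_value, conf.level = 0.95)$conf.int
ci_mean

# Percentile bootstrap CI for the median (B = 2000 is an assumption).
B <- 2000
boot_medians <- replicate(B, median(sample(order_value, replace = TRUE)))
ci_median <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_median
```

Each bootstrap replicate resamples the observed values with replacement and recomputes the median; the 2.5th and 97.5th percentiles of those replicates form the interval.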
12.10 Effect Size for Univariate Tests
Effect sizes translate a test result into a magnitude. For a one-sample mean, Cohen’s d is the distance from the benchmark in units of the standard deviation. For a Wilcoxon test, a common effect size is r = Z divided by the square root of n.
A p-value answers “is the effect distinguishable from noise”; the effect size answers “is it worth acting on.” Business decisions need both.
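Both effect sizes can be computed by hand, as sketched below on simulated, hypothetical data. For the Wilcoxon r, Z is taken from the normal approximation to the signed-rank statistic without continuity correction, and n is the number of non-zero deviations; conventions differ between texts, so this is one reasonable choice rather than the only one.

```r
# Simulated, hypothetical sample and benchmark.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
benchmark   <- 50
n           <- length(order_value)

# Cohen's d for a one-sample mean: distance from the benchmark in SD units.
cohens_d <- (mean(order_value) - benchmark) / sd(order_value)

# Wilcoxon r = Z / sqrt(n), with Z from the normal approximation to V.
wt  <- wilcox.test(order_value, mu = benchmark, exact = FALSE, correct = FALSE)
v   <- unname(wt$statistic)
z   <- (v - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)  # valid when there are no ties
r   <- z / sqrt(n)

c(cohens_d = cohens_d, wilcoxon_r = r)
```

For the one-sample case, Cohen's d is also equal to the t-statistic divided by the square root of n, which gives a quick cross-check against the t.test output.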
12.11 Choosing a Univariate Test
The choice of test is driven by measurement level, sample size, and the assumption profile of the variable. The diagram below summarises the usual path.
```mermaid
flowchart TD
    A[Single variable] --> B{Measurement level}
    B -->|Numeric| C{Shape and n}
    B -->|Categorical| E[chisq.test against expected]
    C -->|Approximately normal or large n| F[One-sample t-test]
    C -->|Skewed or small n| G[Wilcoxon signed-rank]
    C -->|Only signs of deviation trusted| H[Sign test via binom.test]
```
Always pair the chosen test with a diagnostic check: a Q-Q plot for t, a histogram for Wilcoxon, an expected-counts check for chi-square.
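The decision path can also be written as a small helper, sketched below. The function `suggest_univariate_test` is hypothetical, and its cut-offs (a sample of 30 counted as large, absolute skewness below 1 counted as approximately normal) are assumptions made for illustration; they are rough screens, not rules from this chapter, and they do not replace the diagnostic plots.

```r
# Hypothetical helper: a rough encoding of the decision path, with assumed cut-offs.
suggest_univariate_test <- function(x, large_n = 30, skew_cutoff = 1,
                                    signs_only = FALSE) {
  if (is.factor(x) || is.character(x)) {
    return("chi-square goodness-of-fit")
  }
  x <- x[!is.na(x)]
  if (signs_only) {
    return("sign test")              # analyst trusts only the signs of deviations
  }
  # Moment-based sample skewness, used only as a coarse shape screen.
  m    <- mean(x)
  skew <- mean((x - m)^3) / mean((x - m)^2)^1.5
  if (abs(skew) < skew_cutoff || length(x) >= large_n) {
    "one-sample t-test"
  } else {
    "Wilcoxon signed-rank"
  }
}

set.seed(42)
suggest_univariate_test(rnorm(200))                     # large n
suggest_univariate_test(c(1, 1, 1, 1, 50))              # small and skewed
suggest_univariate_test(factor(c("a", "b", "a")))       # categorical
suggest_univariate_test(c(1, 2, 9), signs_only = TRUE)  # signs only
```

The `signs_only` argument stands in for a judgement the analyst makes about the data, since no statistic computed from the sample can decide that on its own.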
12.12 Reporting Univariate Findings
A univariate report re-uses the six-section skeleton introduced for CDA in Chapter 11 and fills it in for a single variable.
(1) Question and benchmark, (2) variable and sample, (3) diagnostic view and assumption check, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision with caveats. Keeping the skeleton stable across chapters 11 to 14 lets readers compare studies quickly.
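One way to keep the skeleton stable is to fill a named list with one entry per section, as in the sketch below. The data, the benchmark, and the wording of each entry are hypothetical; the decision line is deliberately left as a placeholder because it depends on context the numbers do not supply.

```r
# Simulated, hypothetical sample and benchmark.
set.seed(42)
order_value <- rnorm(60, mean = 52, sd = 8)
benchmark   <- 50

tt <- t.test(order_value, mu = benchmark)
sw <- shapiro.test(order_value)
d  <- (mean(order_value) - benchmark) / sd(order_value)

report <- list(
  question   = sprintf("Is the mean order value consistent with a benchmark of %.0f?", benchmark),
  variable   = sprintf("order_value, n = %d, missing = %d",
                       length(order_value), sum(is.na(order_value))),
  diagnostic = sprintf("Shapiro-Wilk W = %.3f (p = %.3f); Q-Q plot to be inspected",
                       sw$statistic, sw$p.value),
  test       = sprintf("t(%.0f) = %.2f, p = %.3f",
                       tt$parameter, tt$statistic, tt$p.value),
  effect     = sprintf("Cohen's d = %.2f; 95%% CI for the mean [%.2f, %.2f]",
                       d, tt$conf.int[1], tt$conf.int[2]),
  decision   = "To be written by the analyst, with caveats"
)

# Print the six sections in order.
for (s in names(report)) cat(sprintf("%-11s %s\n", paste0(s, ":"), report[[s]]))
```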
12.13 Summary
| Concept | Description |
|---|---|
| Setting Up the Variable | |
| str / summary / table | Class, summary, and category counts for a single variable |
| Univariate visualisation | Histogram, boxplot, density, bar chart for shape and balance |
| Mini-dataset setup | Self-contained simulated sample that runs without external data |
| Parametric Univariate Tests | |
| One-sample t-test | Tests whether the mean equals a benchmark under approximate normality |
| Shapiro-Wilk normality | Numerical check for normality; power grows fast with n |
| Q-Q plot | Visual diagnostic for departure from normality |
| Non-Parametric Univariate Tests | |
| Wilcoxon signed-rank | Non-parametric one-sample test on the median |
| Sign test via binom.test | Minimal non-parametric test on the sign of deviations |
| When to switch to non-parametric | Skew, small n, ordinal scale, or extreme values dominate the mean |
| Categorical Univariate | |
| Chi-square goodness-of-fit | Compares observed category counts to expected proportions |
| Precision, Effect, Reporting | |
| t-based CI | Interval for a mean returned directly by t.test |
| Bootstrap CI | Resampling interval for any univariate summary (median, trimmed mean, etc.) |
| Cohen's d one-sample | Mean minus benchmark, divided by sample standard deviation |
| Wilcoxon r | Z divided by square root of n; effect size for signed-rank test |
| Six-section univariate report | Question, variable, diagnostic, test, effect, decision |