12 Univariate Data Analysis with R

12.1 Univariate Analysis in Context

Univariate analysis examines one variable at a time. It asks focused questions: what are the centre and spread of this variable, what is its shape, and is a parameter of interest (a mean, median, or proportion) consistent with a pre-specified value? Chapter 11 introduced the confirmatory framework; this chapter applies that framework to the univariate case using R.

What belongs in a univariate study

Centre, spread, shape, and a single test against a fixed benchmark. Anything that compares two or more variables or groups is bivariate or multivariate territory and belongs to chapters 13 and 14.

Match the test to the measurement level

Numeric variables call for t, Wilcoxon, or sign-based tests. Categorical variables call for proportion or chi-square goodness-of-fit tests. The measurement level is the first fork in the decision tree.

12.2 Inspecting a Single Variable in R

Before any test is chosen, the variable itself is inspected: its class, length, missingness, and basic summary. In R, str, summary, and table cover the majority of cases.
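
A minimal sketch of this inspection step follows. The built-in mtcars data set is only a stand-in for the reader's own data, and the names mpg and cyl are illustrative assumptions, not part of the original text.

```r
# Illustrative data: mtcars ships with R and stands in for real data.
mpg <- mtcars$mpg            # a numeric variable
cyl <- factor(mtcars$cyl)    # a categorical variable stored as a factor

str(mpg)                     # class and length of the numeric variable
summary(mpg)                 # min, quartiles, median, mean, max
n_missing <- sum(is.na(mpg)) # number of missing values
n_missing

str(cyl)                     # confirms the factor levels
table(cyl, useNA = "ifany")  # category balance, with NAs shown if present
```

If str reports a character or factor where a number was expected, that is worth fixing before any test is run.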

Three questions this block answers

Is the variable the class you expected, does the summary look plausible, and what is the category balance for a factor? A surprise at this stage usually means a data-quality issue, not a modelling issue.

12.3 Univariate Visualisation

A single variable is visualised with a histogram or density for continuous data and a bar chart for categorical data. A boxplot adds a compact view of spread and tails.
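
The charts named above can be sketched with base R graphics. This is one possible layout under the assumption that mtcars stands in for the reader's data; the titles and axis labels are illustrative.

```r
# Illustrative data: mtcars stands in for real data.
mpg <- mtcars$mpg
cyl <- factor(mtcars$cyl)

# Histogram on the density scale so a density curve can be overlaid.
h <- hist(mpg, freq = FALSE, main = "Histogram of mpg", xlab = "Miles per gallon")
lines(density(mpg))

# Boxplot: compact view of spread, tails, and potential outliers.
boxplot(mpg, horizontal = TRUE, main = "Boxplot of mpg")

# Bar chart of category counts for the factor.
barplot(table(cyl), main = "Cars by cylinder count", xlab = "Cylinders")
```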

Three charts, three questions

The histogram answers “what is the shape,” the boxplot answers “where are the tails and outliers,” and the bar chart answers “how are categories distributed.” Together they describe a single variable in under a minute.

12.4 One-Sample t-Test

The one-sample t-test asks whether the mean of a variable is consistent with a fixed benchmark. It assumes the observations are independent and the sampling distribution of the mean is approximately normal. By the central limit theorem, the second assumption usually holds for moderate samples even when the variable itself is not normal, provided the variable is not extremely skewed or heavy-tailed.
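
A minimal sketch of the test, assuming mtcars$mpg as stand-in data and a hypothetical benchmark of 20 (neither comes from the original text):

```r
# Illustrative data and a hypothetical benchmark value.
mpg <- mtcars$mpg
benchmark <- 20

# Two-sided one-sample t-test of H0: mean = benchmark.
res <- t.test(mpg, mu = benchmark)
res                 # full printed output

# The individual pieces can also be extracted by name.
res$statistic       # t-statistic
res$parameter       # degrees of freedom (n - 1)
res$p.value         # two-sided p-value
res$conf.int        # 95 percent confidence interval for the mean
res$estimate        # sample mean
```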

Reading the output

The t.test output reports the t-statistic, degrees of freedom, p-value, the 95 percent confidence interval for the mean, and the sample estimate. Interpretation follows Chapter 11: the p-value is the probability, computed assuming H0 is true, of a result at least as extreme as the one observed. It is not the probability that H0 itself is true.

12.5 Checking Normality

The t-test is robust to mild departures from normality at moderate sample sizes, but a pre-check still matters for small samples. Shapiro and Wilk’s W (Shapiro and Wilk 1965) is the standard numerical check; the Q-Q plot is the standard visual one.
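
Both checks can be sketched in a few lines. As before, mtcars$mpg is an assumed stand-in for the variable under study.

```r
# Illustrative data.
mpg <- mtcars$mpg

# Numerical check: Shapiro-Wilk test of H0 "the data are normal".
sw <- shapiro.test(mpg)
sw$statistic   # W, close to 1 when the data look normal
sw$p.value     # a small p-value signals a departure from normality

# Visual check: points should fall near the reference line.
qqnorm(mpg)
qqline(mpg)
```

Note that shapiro.test in base R accepts between 3 and 5000 non-missing values.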

Do not over-interpret a normality test

Shapiro-Wilk has high power in large samples, so it flags trivial departures that do not threaten the t-test; in small samples it has low power and can miss departures that do matter. For n above about 100, treat the Q-Q plot as the primary diagnostic.

12.6 One-Sample Wilcoxon Signed-Rank Test

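When the sample is small or the distribution is clearly non-normal, the one-sample Wilcoxon signed-rank test (Wilcoxon 1945) replaces the t-test. It asks whether the centre of the distribution is consistent with a benchmark, using the ranks of the deviations from that benchmark rather than their raw values. Under the assumption that the distribution is roughly symmetric, this is a test of the median.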
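
A minimal sketch, assuming mtcars$mpg as stand-in data and a hypothetical benchmark of 20. The argument exact = FALSE is a choice made here because the example data contain ties, which rule out an exact p-value.

```r
# Illustrative data and a hypothetical benchmark value.
mpg <- mtcars$mpg
benchmark <- 20

# One-sample signed-rank test; normal approximation because of ties.
wres <- wilcox.test(mpg, mu = benchmark, exact = FALSE, conf.int = TRUE)
wres

wres$statistic   # V, the sum of ranks of the positive deviations
wres$p.value     # two-sided p-value
wres$estimate    # (pseudo)median estimate
wres$conf.int    # confidence interval for the (pseudo)median
```

R labels the estimate a (pseudo)median; it coincides with the median when the distribution is symmetric.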

When to prefer Wilcoxon over t

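Choose Wilcoxon when the sample is small and the data look non-normal, or when a few extreme values dominate the mean. It also suits ordinal-style scores, provided the differences from the benchmark can be meaningfully ranked. The signed-rank test assumes rough symmetry about the median, so for strongly skewed data the sign test (section 12.7) is the safer choice. For symmetric data with only mildly heavy tails, t is still usually fine.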

12.7 Sign Test

The sign test is the simplest non-parametric univariate test. It asks whether the proportion of observations above a hypothesised median equals one half, using only the signs of the deviations. It has low power but no distributional assumption beyond independence.
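
The procedure can be sketched with binom.test from base R. The data (mtcars$mpg) and the hypothesised median m0 = 20 are illustrative assumptions.

```r
# Illustrative data and a hypothetical hypothesised median.
x <- mtcars$mpg
m0 <- 20

# Keep only the signs of the deviations, dropping exact ties with m0.
dev <- x - m0
dev <- dev[dev != 0]
n_above <- sum(dev > 0)   # "successes": observations above m0
n_used <- length(dev)     # observations that contribute a sign

# Under H0 the number above m0 is Binomial(n_used, 0.5).
sres <- binom.test(n_above, n_used, p = 0.5)
sres$p.value
sres$conf.int             # CI for the proportion above m0
```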

Sign test in one line via binomial

Base R has no sign.test function, but the sign test is simply a binom.test of the number of observations above the hypothesised median against p = 0.5, after dropping observations that tie with it. Using binom.test avoids an extra package dependency.

12.8 Chi-Square Goodness-of-Fit Test

For a categorical variable, the chi-square goodness-of-fit test (Pearson 1900) compares observed category counts to a vector of expected proportions.
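
A minimal sketch with chisq.test, assuming the cylinder counts in mtcars as stand-in data and equal expected shares as a hypothetical benchmark:

```r
# Illustrative data: cylinder counts from mtcars.
cyl <- factor(mtcars$cyl)
observed <- table(cyl)

# Hypothetical benchmark: the three categories are equally likely.
expected_p <- c(1, 1, 1) / 3

# Goodness-of-fit test of the observed counts against expected_p.
gof <- chisq.test(observed, p = expected_p)
gof

gof$expected                       # expected counts under H0
counts_ok <- all(gof$expected >= 5) # rule-of-thumb check
counts_ok
```

The vector passed to p must sum to one and follow the same order as the levels of the factor.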

Expected counts rule of thumb

The chi-square approximation is adequate when every expected count is at least five. When that fails, collapse adjacent categories or switch to an exact multinomial test.
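
The sketch below illustrates the check and the collapsing remedy on made-up counts; the category labels, counts, and proportions are all hypothetical. The final call uses the Monte Carlo option of chisq.test, which is a simulation-based stand-in and not an exact multinomial test.

```r
# Hypothetical observed counts and expected proportions.
observed <- c(A = 12, B = 9, C = 3, D = 2)
expected_p <- c(0.4, 0.3, 0.2, 0.1)

# Expected counts under H0; category D falls below five.
expected <- sum(observed) * expected_p
expected

# Remedy 1: collapse the adjacent categories C and D.
observed_c <- c(A = 12, B = 9, CD = 3 + 2)
expected_p_c <- c(0.4, 0.3, 0.2 + 0.1)
expected_c <- sum(observed_c) * expected_p_c
gof_c <- chisq.test(observed_c, p = expected_p_c)
gof_c$p.value

# Remedy 2 (swapped-in technique): Monte Carlo p-value on the original table.
set.seed(1)
gof_mc <- chisq.test(observed, p = expected_p, simulate.p.value = TRUE, B = 2000)
gof_mc$p.value
```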

12.9 Confidence Intervals for Univariate Summaries

A point estimate gains interpretive weight when paired with an interval. For a mean, t.test returns a t-based CI directly. For non-standard summaries such as the median, a bootstrap CI is a general fallback.
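
Both kinds of interval can be sketched as follows. The data are mtcars$mpg as a stand-in, the bootstrap shown is the simple percentile method, and the number of resamples B = 2000 is an arbitrary illustrative choice.

```r
# Illustrative data; the seed makes the resampling reproducible.
set.seed(123)
mpg <- mtcars$mpg

# t-based 95 percent CI for the mean, taken directly from t.test.
mean_ci <- t.test(mpg)$conf.int
mean_ci

# Percentile bootstrap CI for the median:
# resample with replacement, recompute the median, take the quantiles.
B <- 2000
boot_medians <- replicate(B, median(sample(mpg, replace = TRUE)))
boot_ci <- quantile(boot_medians, probs = c(0.025, 0.975))
boot_ci
```

The percentile method is the simplest bootstrap interval; the boot package offers refinements such as BCa intervals when more accuracy is needed.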

CIs carry more information than p-values alone

Two different studies can both reject H0 with the same p-value yet imply very different effect magnitudes. The CI shows the magnitude directly.

12.10 Effect Size for Univariate Tests

Effect sizes translate a test result into a magnitude. For a one-sample mean, Cohen’s d is the distance of the sample mean from the benchmark in units of the sample standard deviation. For a Wilcoxon test, a common effect size is r = Z divided by the square root of n, where Z is the standardised test statistic and n is the number of observations.
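
A minimal sketch of both effect sizes, assuming mtcars$mpg as stand-in data and a hypothetical benchmark of 20. Because wilcox.test does not return Z, the sketch recovers its absolute value from the two-sided p-value of the normal approximation without continuity correction; that recovery step is an assumption of this example, and it yields the magnitude of r only.

```r
# Illustrative data and a hypothetical benchmark value.
mpg <- mtcars$mpg
benchmark <- 20

# Cohen's d: distance of the sample mean from the benchmark in SD units.
cohen_d <- (mean(mpg) - benchmark) / sd(mpg)
cohen_d

# Wilcoxon r = |Z| / sqrt(n), with |Z| recovered from the p-value.
wres <- wilcox.test(mpg, mu = benchmark, exact = FALSE, correct = FALSE)
n_used <- sum(mpg != benchmark)          # zero deviations are dropped by the test
z_abs <- qnorm(1 - wres$p.value / 2)     # |Z| from the two-sided p-value
r_abs <- z_abs / sqrt(n_used)
r_abs
```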

Report effect size next to every significant p-value

A p-value answers “is the effect distinguishable from noise”; the effect size answers “is it worth acting on.” Business decisions need both.

12.11 Choosing a Univariate Test

The choice of test is driven by measurement level, sample size, and the assumption profile of the variable. The diagram below summarises the usual path.

flowchart TD
    A[Single variable] --> B{Measurement level}
    B -->|Numeric| C{Shape and n}
    B -->|Categorical| E[chisq.test against expected]
    C -->|Approximately normal or large n| F[One-sample t-test]
    C -->|Skewed or small n| G[Wilcoxon signed-rank]
    C -->|Only signs of deviation trusted| H[Sign test via binom.test]
    classDef default fill:#2e4057,color:#ffffff,stroke:#ff9933,stroke-width:3px,rx:10px,ry:10px;

The diagram is a starting point, not a rule

Always pair the chosen test with a diagnostic check: a Q-Q plot for t, a histogram for Wilcoxon (to judge symmetry), and an expected-counts check for chi-square.

12.12 Reporting Univariate Findings

A univariate report re-uses the six-section skeleton introduced for CDA in Chapter 11 and fills it in for a single variable.

Six-section univariate report

(1) Question and benchmark, (2) variable and sample, (3) diagnostic view and assumption check, (4) test statistic and p-value, (5) effect size and confidence interval, (6) business decision with caveats. Keeping the skeleton stable across chapters 11 to 14 lets readers compare studies quickly.

Summary

Concept	Description
Inspect and Visualise
Inspecting in R	str(), summary(), table() answer most one-variable questions
Visualisation	Histogram, density, and boxplot for distribution shape; bar chart for categories
Inferential Tools
One-Sample t-Test	Tests whether the mean differs from a hypothesised value
Normality Check	Shapiro-Wilk and visual checks before parametric tests
Wilcoxon Signed-Rank	Robust median-based alternative to the one-sample t-test
Sign Test	Simple median-difference test based on signs of deviations
Reporting
Match Test to Level	Pick the test by measurement level and shape, not by habit
Reading Outputs	Five things to report from any test output