Chapter 19 – Non-Parametric Statistics

Learning goals

By the end of this chapter you will be able to:

Explain when non-parametric tests are preferred over traditional (parametric) procedures such as the t-test or ANOVA.
Describe the logic of the chi-square family of tests.
Distinguish between chi-square tests of goodness of fit and independence.
Recognize rank-based alternatives to t-tests and one-way ANOVA (Mann–Whitney U, Wilcoxon signed-rank, Kruskal–Wallis, Friedman).
Use PyStatsV1 and pingouin to analyze survey-style data with chi-square tests on categorical variables.

19.1 When parametric assumptions break down

In earlier chapters, we focused on parametric procedures:

t-tests (Chapters 9–11)
One-way and factorial ANOVA (Chapters 12–14)
Regression and ANCOVA (Chapters 16–18)

These procedures make several assumptions about the data:

Quantitative scale – variables are interval or ratio, not purely nominal.
Normality – scores within each group are (approximately) normally distributed.
Homogeneity of variance – population variances are equal across groups.
Linearity – for correlation and regression, the relationship between variables is approximately linear.

When these assumptions are badly violated, parametric tests can give misleading p-values and confidence intervals. In those cases, we often turn to non-parametric methods.

Non-parametric tests typically:

Work with ranks or counts rather than raw numeric values.
Make fewer distributional assumptions.
Are often slightly less powerful when parametric assumptions are met, but more robust when those assumptions fail.

In psychological research, non-parametric tests are especially useful when:

The outcome is ordinal (e.g., Likert scales: “Strongly disagree” to “Strongly agree”).
The data are severely skewed or have heavy outliers that cannot be reasonably transformed.
The variable is categorical (e.g., therapy preference, diagnostic category, treatment response yes/no).

19.2 Chi-square tests for categorical data

The most common non-parametric tests in introductory psychology involve frequency counts in categories. The basic question is:

Do the observed counts differ from what we would expect by chance?

The chi-square family addresses this question in two main situations.

19.2.1 Goodness of fit

A chi-square goodness-of-fit test compares observed category counts to a theoretical or expected distribution. For example:

A survey asks which coping strategy students use most often: Exercise, Therapy, Mindfulness, or Social support.
If there were no preference, we would expect roughly equal counts in each category (25% each).
The chi-square goodness-of-fit test asks whether the observed distribution differs significantly from this uniform expectation.

Statistically, we compute:

χ² = Σ (Observed - Expected)² / Expected

with degrees of freedom df = k - 1, where k is the number of categories.

If χ² is large relative to its degrees of freedom, the p-value will be small and we reject the null hypothesis that the observed frequencies match the expected distribution.
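As a minimal sketch of this calculation, the snippet below applies the formula to the coping-strategy example using only the Python standard library. The counts are hypothetical values chosen for illustration, not output from the lab, and the decision uses the tabled critical value for df = 3 at α = .05 rather than an exact p-value.

```python
# Hypothetical counts for illustration only (not taken from the lab data).
observed = {"Exercise": 40, "Therapy": 20, "Mindfulness": 25, "Social support": 15}

n = sum(observed.values())
k = len(observed)
expected = n / k  # uniform (no-preference) null: equal counts per category

# Chi-square statistic: sum of (Observed - Expected)^2 / Expected
chi_square = sum((count - expected) ** 2 / expected for count in observed.values())
df = k - 1

# Tabled critical value of the chi-square distribution for df = 3, alpha = .05.
CRITICAL_VALUE_DF3 = 7.815

print(f"chi-square({df}) = {chi_square:.2f}")
print("Reject H0" if chi_square > CRITICAL_VALUE_DF3 else "Fail to reject H0")
```

In practice a library routine such as scipy.stats.chisquare() would typically be used to obtain the exact p-value; the hand calculation is shown only to make the formula concrete.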

19.2.2 Test of independence

A chi-square test of independence asks whether two categorical variables are related. For example:

Variable 1: Type of therapy received (Control, CBT, Mindfulness).
Variable 2: Treatment outcome (Improved vs. Did not improve).

We arrange the counts in a contingency table and again compute a chi-square statistic. Here, the null hypothesis states that the variables are independent – knowing a person’s therapy type tells you nothing about their likelihood of improvement.
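The following sketch shows one way to run this test on a 3 × 2 contingency table with scipy.stats.chi2_contingency(). The counts are hypothetical and chosen only to illustrate the call; the expected counts returned by the function are what the null hypothesis of independence predicts for each cell.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts for illustration only.
# Rows: Control, CBT, Mindfulness. Columns: Improved, Did not improve.
table = np.array([
    [12, 28],
    [26, 14],
    [22, 18],
])

# Returns the statistic, p-value, degrees of freedom, and expected counts.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi-square({dof}) = {chi2:.2f}, p = {p_value:.4f}")
print("Expected counts under independence:")
print(expected)
```

The degrees of freedom are (rows - 1) × (columns - 1), which is 2 for this table.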

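We also report an effect size such as Cramér’s V, which rescales the chi-square value by the sample size and the table dimensions so that it lies between 0 and 1. Rough benchmarks (these follow Cohen’s conventions and apply most directly when the smaller table dimension is 2):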

~0.10: small association
~0.30: medium
~0.50 or higher: large
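To make the scaling concrete, here is a small sketch that computes Cramér’s V from a contingency table as the square root of χ² / (N × (min(rows, columns) - 1)). The function name cramers_v and the counts are hypothetical, introduced only for this illustration, and the chi-square value is computed without a continuity correction.

```python
import math


def cramers_v(table):
    """Cramér's V for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Uncorrected Pearson chi-square from observed and expected counts.
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi_square += (obs - exp) ** 2 / exp

    # Scale by N and by the smaller table dimension minus one.
    min_dim = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square / (n * min_dim))


# Hypothetical counts. Rows: Control, CBT, Mindfulness.
# Columns: Improved, Did not improve.
v = cramers_v([[12, 28], [26, 14], [22, 18]])
print(f"Cramér's V = {v:.2f}")
```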

19.3 Rank-based tests

Not all non-parametric tests are based on counts. Many are based on ranks of the outcome variable. Instead of analyzing raw scores, we:

Combine all scores across groups.
Rank them from lowest to highest.
Analyze the ranks using an appropriate test statistic.
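The first two steps can be sketched in plain Python as follows. The helper average_ranks is a hypothetical name used for this illustration, and the scores are made up; tied scores receive the average of the ranks they would otherwise occupy, which is the usual convention.

```python
def average_ranks(values):
    """Rank values from lowest to highest, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


# Hypothetical scores for two independent groups.
group_a = [3, 5, 8]
group_b = [5, 9, 12]

# Step 1: combine all scores. Step 2: rank the combined scores.
ranks = average_ranks(group_a + group_b)
ranks_a = ranks[: len(group_a)]
ranks_b = ranks[len(group_a):]

print("Ranks for group A:", ranks_a, "sum =", sum(ranks_a))
print("Ranks for group B:", ranks_b, "sum =", sum(ranks_b))
```

The rank sums for each group are the raw material for statistics such as Mann–Whitney U. scipy.stats.rankdata() offers the same average-rank behavior by default.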

Some common rank-based tests and their parametric counterparts:

Mann–Whitney U: alternative to an independent-samples t-test.
Wilcoxon signed-rank: alternative to a paired-samples t-test.
Kruskal–Wallis H: alternative to a one-way ANOVA with independent groups.
Friedman test: alternative to a repeated-measures one-way ANOVA.
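For orientation, the sketch below calls the SciPy function that corresponds to each test in the list. All scores are hypothetical; the first three lists are treated as independent groups, and the last three as repeated measurements on the same eight people.

```python
from scipy.stats import friedmanchisquare, kruskal, mannwhitneyu, wilcoxon

# Hypothetical ratings for three independent groups.
control = [3, 4, 2, 5, 4, 3, 2, 4]
cbt = [5, 6, 4, 7, 6, 5, 6, 7]
mindfulness = [4, 5, 5, 6, 4, 6, 5, 5]

# Hypothetical scores for the same eight people at three time points.
pre = [10, 12, 9, 14, 11, 13, 8, 15]
post = [12, 15, 10, 19, 17, 20, 16, 24]
follow_up = [11, 14, 11, 17, 15, 18, 13, 21]

results = {
    # Two independent groups (alternative to the independent-samples t-test).
    "Mann-Whitney U": mannwhitneyu(control, cbt, alternative="two-sided"),
    # Two paired measurements (alternative to the paired-samples t-test).
    "Wilcoxon signed-rank": wilcoxon(pre, post),
    # Three or more independent groups (alternative to one-way ANOVA).
    "Kruskal-Wallis H": kruskal(control, cbt, mindfulness),
    # Three or more repeated measurements (alternative to repeated-measures ANOVA).
    "Friedman": friedmanchisquare(pre, post, follow_up),
}

for name, res in results.items():
    print(f"{name}: statistic = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```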

These tests are especially helpful when:

The outcome is ordinal (e.g., 1–7 rating scales).
The data are heavily skewed or contain extreme outliers.
Sample sizes are small, so the normality assumption is hard to check and harder to justify.

19.4 When to choose non-parametric methods

There is no single “magic rule,” but some practical guidelines:

Use chi-square tests when both your predictor and outcome are categorical (nominal) and you are working with counts, not percentages.
Use rank-based tests when:
- The outcome variable is ordinal, or
- You have strong violations of normality or homogeneity that cannot be fixed by transformations, and
- You are more concerned about validity than about squeezing out every bit of statistical power.

When in doubt, you can often:

Run the parametric test (e.g., t-test or ANOVA).
Run the non-parametric alternative.
Compare conclusions – if they agree, your result is probably robust.
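One way to carry out this comparison is sketched below for a two-group design, pairing Welch's t-test with the Mann–Whitney U test. The scores are hypothetical (group B includes one deliberate outlier), and the .05 threshold is simply the conventional choice.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Hypothetical scores; group_b contains one extreme outlier (58).
group_a = [12, 15, 14, 10, 13, 16, 11, 14]
group_b = [18, 17, 21, 16, 19, 58, 20, 17]

ALPHA = 0.05

# Step 1: parametric test (Welch's t-test, which does not assume equal variances).
t_result = ttest_ind(group_a, group_b, equal_var=False)

# Step 2: non-parametric alternative.
u_result = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Step 3: compare the conclusions at the chosen alpha level.
t_significant = bool(t_result.pvalue < ALPHA)
u_significant = bool(u_result.pvalue < ALPHA)
conclusions_agree = t_significant == u_significant

print(f"t-test:       p = {t_result.pvalue:.4f}, significant = {t_significant}")
print(f"Mann-Whitney: p = {u_result.pvalue:.4f}, significant = {u_significant}")
print("Conclusions agree" if conclusions_agree else "Conclusions differ: inspect the data")
```

If the two tests disagree, that is a cue to look closely at the distribution (outliers, skew) rather than to pick whichever result is more convenient.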

19.5 PyStatsV1 Lab: Chi-square analysis of survey data

In this chapter’s lab you will use PyStatsV1 to analyze simulated survey data using chi-square tests. The code lives in:

scripts.psych_ch19_nonparametrics
tests.test_psych_ch19_nonparametrics

Running the lab

From the project root (with your virtual environment activated), run:

make psych-ch19
make test-psych-ch19

The first command will:

Simulate a coping strategies survey with four categories (e.g., Exercise, Therapy, Mindfulness, Social support).
Run a chi-square goodness-of-fit test to check whether the observed distribution differs from a uniform (no-preference) null.
Save the raw data and a summary table to:
- data/synthetic/psych_ch19_survey_gof.csv
- outputs/track_b/ch19_gof_table.csv
Generate a bar chart comparing observed versus expected counts:
- outputs/track_b/ch19_gof_barplot.png

The second dataset in the script simulates a therapy × improvement contingency table:

Students are randomly assigned to Control, CBT, or Mindfulness conditions.
Each person is classified as Improved or No change.
The script uses:
- scipy.stats.chi2_contingency() for a traditional chi-square test.
- pingouin.chi2_independence() to obtain effect sizes (e.g., Cramér’s V) and power estimates.
The script saves:
- data/synthetic/psych_ch19_survey_independence.csv – individual-level data.
- outputs/track_b/ch19_independence_table.csv – full chi-square summary.
- outputs/track_b/ch19_stacked_bar.png – a stacked bar plot showing the proportion improved within each therapy type.
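To see how individual-level records become a contingency table and the proportions shown in a stacked bar plot, consider the sketch below. It does not reproduce the lab script: the data are built inline from hypothetical counts rather than read from the CSV, and the column names therapy and outcome are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical cell counts used to build one row per participant.
cell_counts = {
    ("Control", "Improved"): 12,
    ("Control", "No change"): 28,
    ("CBT", "Improved"): 26,
    ("CBT", "No change"): 14,
    ("Mindfulness", "Improved"): 22,
    ("Mindfulness", "No change"): 18,
}
rows = [
    {"therapy": therapy, "outcome": outcome}
    for (therapy, outcome), count in cell_counts.items()
    for _ in range(count)
]
survey = pd.DataFrame(rows)  # column names are assumed, not the lab's

# Observed counts: therapy type (rows) by outcome (columns).
observed = pd.crosstab(survey["therapy"], survey["outcome"])

# Proportion of each outcome within each therapy type (each row sums to 1).
proportions = pd.crosstab(survey["therapy"], survey["outcome"], normalize="index")

print(observed)
print(proportions.round(2))
```

The table of counts is what a chi-square routine takes as input, while the row proportions are what a stacked bar plot displays.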

Interpreting the output

After running make psych-ch19, inspect the console output and figures:

For the goodness-of-fit example, ask:
- Does the chi-square test detect that some coping strategies are preferred over others?
- Which categories contribute most to the chi-square statistic (largest values of (Observed - Expected)² / Expected)?
For the independence example, ask:
- Is there evidence that treatment type and improvement are associated?
- How large is the association (Cramér’s V)?
- Do the stacked bar plots reveal a pattern that matches the numerical results?
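For the goodness-of-fit question about which categories matter most, each category's contribution can be computed separately, as in this sketch. The counts are hypothetical and will not match the lab's simulated data.

```python
# Hypothetical observed counts for illustration only.
observed = {"Exercise": 34, "Therapy": 18, "Mindfulness": 27, "Social support": 21}

# Uniform (no-preference) null: the same expected count in every category.
expected = sum(observed.values()) / len(observed)

# Each category's contribution to the chi-square statistic.
contributions = {
    category: (count - expected) ** 2 / expected
    for category, count in observed.items()
}
chi_square = sum(contributions.values())
largest = max(contributions, key=contributions.get)

for category, value in contributions.items():
    print(f"{category}: {value:.2f}")
print(f"Total chi-square = {chi_square:.2f}; largest contribution: {largest}")
```

A category can contribute heavily because it is chosen more often than expected or less often than expected, so check the sign of Observed - Expected before interpreting it as a preference.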

Connection to earlier chapters

This chapter ties together several themes from earlier in the book:

Just as in Chapter 7, we rely on sampling distributions to interpret chi-square statistics.
As in Chapters 9–12, we balance Type I error (false positives) against power (true positives).
In Chapters 16–18, we extended ANOVA to regression and ANCOVA. Here, we extend the logic of hypothesis testing to categorical outcomes and ordinal data.

Non-parametric methods are not a separate universe – they are another set of tools in your scientific toolbox. When used thoughtfully, they allow you to test important psychological questions even when real-world data refuse to behave “nicely.”