Chapter 15 – Correlation

In the previous chapters you learned how to compare groups using t-tests and ANOVA. Those designs are built around experimental questions:

Does changing X cause a difference in Y?

In this chapter we turn to association questions:

Do X and Y move together? If so, how strongly and in what direction?

Correlation is the workhorse of non-experimental psychology. It is used to study relationships between naturally occurring variables such as stress, sleep, depression, and exam performance. You will see correlation again in the next chapter as the foundation of linear regression.

This chapter focuses on three big ideas:

how to quantify the direction and strength of a linear relationship,
how to read and interpret scatterplots,
and why correlation does not imply causation.

At the end, the PyStatsV1 lab shows how to compute and visualise correlations in Python using both NumPy/pandas and the pingouin statistics library.

15.1 What Is a Correlation?

A correlation is a number that describes how two variables are related. In this chapter we focus on the most common measure: the Pearson product–moment correlation, usually written as \(r\).

Pearson’s \(r\) tells you two things:

Direction – whether high scores on one variable tend to go with high scores on the other (a positive correlation), or with low scores on the other (a negative correlation).
Strength – how tightly the points cluster around a straight line.

The value of \(r\) always lies between -1 and +1:

\(r = +1.00\) – a perfect positive linear relationship
\(r = -1.00\) – a perfect negative linear relationship
\(r = 0\) – no linear relationship

In real data, \(r\) is almost never exactly -1, 0, or +1. Instead we see values like \(r = .10\) (a weak relationship) or \(r = .60\) (a moderately strong relationship).
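The endpoints of the scale are easy to check with a few made-up numbers. The sketch below is illustrative only: the arrays are invented for this example and are not data from the chapter.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Any increasing straight line gives r = +1; any decreasing one gives r = -1.
r_up = np.corrcoef(x, 2.0 * x + 1.0)[0, 1]
r_down = np.corrcoef(x, -3.0 * x + 10.0)[0, 1]

# Scores that are symmetric around the middle of x show no linear trend.
y_flat = np.array([2.0, 1.0, 0.0, 1.0, 2.0])
r_flat = np.corrcoef(x, y_flat)[0, 1]

print(r_up, r_down, r_flat)
```

Note that the slope of the line does not matter: \(r\) measures how tightly the points follow a line, not how steep that line is.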

15.2 Computing Pearson’s r

Conceptually, Pearson’s \(r\) is the standardized covariance between two variables:

\[r = \frac{\text{cov}(X, Y)}{s_X s_Y}\]

Here \(\text{cov}(X, Y)\) is the covariance between \(X\) and \(Y\), and \(s_X\) and \(s_Y\) are their sample standard deviations. Covariance captures whether high values of \(X\) tend to go with high (or low) values of \(Y\). Dividing by the standard deviations rescales the covariance to the familiar -1 to +1 range.
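The formula can be checked numerically. The following is a minimal sketch, assuming six hypothetical participants whose scores are invented for illustration; it compares the hand-built ratio with NumPy's built-in result.

```python
import numpy as np

# Hypothetical scores for six participants (illustrative values only).
hours_sleep = np.array([5.0, 6.0, 6.5, 7.0, 8.0, 8.5])
exam_score = np.array([58.0, 64.0, 61.0, 70.0, 75.0, 74.0])

# Sample covariance and sample standard deviations (both divide by n - 1).
cov_xy = np.cov(hours_sleep, exam_score, ddof=1)[0, 1]
s_x = np.std(hours_sleep, ddof=1)
s_y = np.std(exam_score, ddof=1)

# Pearson's r as the standardized covariance.
r_from_cov = cov_xy / (s_x * s_y)

# NumPy's built-in computation, for comparison.
r_numpy = np.corrcoef(hours_sleep, exam_score)[0, 1]

print(r_from_cov, r_numpy)
```

The two values should agree up to floating-point rounding, as long as the covariance and the standard deviations use the same divisor.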

In practice, you will almost always compute \(r\) using software. However, it is important to understand the basic ingredients:

Convert \(X\) and \(Y\) to z-scores.
Multiply the paired z-scores \(z_X z_Y\) for each participant.
Average these cross-products (divide their sum by \(n - 1\) when the z-scores are based on sample standard deviations).

The result is exactly Pearson’s \(r\). When most participants have the same sign of \(z_X\) and \(z_Y\), the cross-products are positive and \(r\) is positive. When participants tend to have opposite signs (high on one variable, low on the other), \(r\) is negative.
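The z-score recipe can be sketched in a few lines. The scores below are hypothetical and exist only for illustration. One detail is easier to see in code than in words: the divisor used for the "average" must match the divisor used for the standard deviations.

```python
import numpy as np

# Hypothetical scores for six participants (illustrative values only).
x = np.array([5.0, 6.0, 6.5, 7.0, 8.0, 8.5])
y = np.array([58.0, 64.0, 61.0, 70.0, 75.0, 74.0])
n = len(x)

# Version 1: sample standard deviations, so the cross-products
# are summed and divided by n - 1.
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
r_sample_z = np.sum(z_x * z_y) / (n - 1)

# Version 2: population standard deviations, so a plain mean is used.
z_x_pop = (x - x.mean()) / x.std(ddof=0)
z_y_pop = (y - y.mean()) / y.std(ddof=0)
r_pop_z = np.mean(z_x_pop * z_y_pop)

print(r_sample_z, r_pop_z, np.corrcoef(x, y)[0, 1])
```

Both versions give the same \(r\); mixing the two conventions (sample z-scores with a plain mean, for example) gives a slightly shrunken value.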

15.3 Scatterplots and Visual Intuition

Before computing any correlation, you should plot the data.

A scatterplot places one variable on the x-axis and the other on the y-axis. Each participant is one point.

Scatterplots help you answer questions that the single number \(r\) cannot:

Is the relationship linear or curved?
Are there outliers that might distort the correlation?
Does the variability change across the range of \(X\)?

For example, a strong curved relationship can produce \(r \approx 0\) even though \(X\) clearly predicts \(Y\). Likewise, a single extreme outlier can produce a large \(r\) that does not represent the pattern for most participants.
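Both problems are easy to reproduce with simulated data. The sketch below is illustrative: the seed, sample sizes, and outlier position are arbitrary choices made for this example.

```python
import numpy as np

# Case 1: a perfectly predictable but U-shaped relationship.
x = np.linspace(-3.0, 3.0, 61)
y_curved = x ** 2
r_curved = np.corrcoef(x, y_curved)[0, 1]  # essentially 0

# Case 2: twenty unrelated points plus one extreme outlier.
rng = np.random.default_rng(seed=15)
x_noise = rng.normal(size=20)
y_noise = rng.normal(size=20)
r_without = np.corrcoef(x_noise, y_noise)[0, 1]

x_out = np.append(x_noise, 10.0)
y_out = np.append(y_noise, 10.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print("curved:", r_curved)
print("without outlier:", r_without)
print("with outlier:", r_with)
```

In the first case \(Y\) is completely determined by \(X\), yet \(r\) is essentially zero because the relationship is not linear. In the second case a single point is enough to manufacture a large correlation among otherwise unrelated scores.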

Important

Always inspect a scatterplot before interpreting a correlation. The number \(r\) is helpful, but it is not a substitute for looking at the data.

15.4 Correlation Does Not Imply Causation

Psychology students often hear the slogan: “Correlation does not imply causation.” It is worth unpacking why this is true.

Suppose you find a strong positive correlation between time spent on social media and self-reported anxiety. At least three causal stories are possible:

Social media causes anxiety.
Anxiety causes social media use (perhaps anxious people are more likely to scroll in bed).
A third variable (e.g., loneliness, insomnia) causes both heavy social media use and anxiety.

A correlation alone cannot distinguish between these possibilities. To make a causal claim, you need an appropriate research design, such as an experiment with random assignment or a carefully controlled longitudinal study.
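The third-variable story can be made concrete with a simulation. In the minimal sketch below, the variable names and the 0.7 coefficients are assumptions chosen for illustration. By construction, social media use has no effect on anxiety and anxiety has no effect on social media use.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)
n = 5000

# Hypothetical third variable that drives both outcomes.
loneliness = rng.normal(size=n)

# Neither variable appears in the other's equation:
# their only link is the shared dependence on loneliness.
social_media = 0.7 * loneliness + rng.normal(size=n)
anxiety = 0.7 * loneliness + rng.normal(size=n)

r_observed = np.corrcoef(social_media, anxiety)[0, 1]
print(r_observed)
```

The observed correlation is clearly positive even though there is no causal path between the two variables. Nothing in the value of \(r\) reveals which of the three stories generated the data.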

In this book we encourage the following mindset:

Use correlation to describe and explore relationships; use experimental design to test causal claims.

15.5 Partial Correlation: Controlling for a Third Variable

Sometimes you want to know whether two variables are related after controlling for another variable. For example:

Does study time predict exam score after controlling for prior GPA?
Does therapy attendance predict symptom improvement after controlling for baseline severity?

A partial correlation answers questions like these. It measures the relationship between \(X\) and \(Y\) after removing the linear effect of a third variable (or set of variables).

One way to think about this:

Regress \(X\) on the control variable(s) and keep the residuals.
Regress \(Y\) on the control variable(s) and keep the residuals.
Correlate the two sets of residuals.

The resulting partial correlation tells you whether \(X\) and \(Y\) still move together once the shared influence of the control variable has been removed.
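The three residual steps can be sketched with NumPy alone. This is an illustration under stated assumptions, not the chapter's lab code: the helper name `residuals`, the simulated variables, and their coefficients are all hypothetical.

```python
import numpy as np


def residuals(target, control):
    """Residuals from a least-squares regression of target on one control."""
    design = np.column_stack([np.ones_like(control), control])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coefs


rng = np.random.default_rng(seed=7)
n = 2000

# Hypothetical data: prior GPA influences both study time and exam score.
prior_gpa = rng.normal(size=n)
study_time = 0.6 * prior_gpa + rng.normal(size=n)
exam_score = 0.6 * prior_gpa + 0.3 * study_time + rng.normal(size=n)

# Ordinary (zero-order) correlation.
r_raw = np.corrcoef(study_time, exam_score)[0, 1]

# Steps 1-3: residualize both variables on the control, then correlate.
res_x = residuals(study_time, prior_gpa)
res_y = residuals(exam_score, prior_gpa)
r_partial = np.corrcoef(res_x, res_y)[0, 1]

print(r_raw, r_partial)
```

In this simulation the partial correlation is smaller than the raw correlation, because part of the raw association reflects the shared influence of prior GPA rather than study time itself.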

You do not need to implement these regression steps by hand. Libraries such as pingouin and statsmodels can compute partial correlations directly from a tidy data frame.

15.6 PyStatsV1 Lab – Correlation in Python

The PyStatsV1 lab for this chapter is implemented in the script scripts.psych_ch15_correlation. It demonstrates three key skills:

Simulating data with a known population correlation

We use NumPy to simulate pairs of scores from a bivariate normal distribution with a specified population correlation (for example, \(\rho = .50\)). This allows us to check whether our estimation procedures recover the true value.
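A minimal sketch of this kind of simulation is shown here. It is not the lab script itself: the seed, sample size, and variable names are assumptions made for illustration.

```python
import numpy as np

rho = 0.50   # population correlation we want to build in
n = 10_000   # large sample, so r should land close to rho

# With unit variances, the covariance matrix equals the correlation matrix.
cov_matrix = np.array([[1.0, rho],
                       [rho, 1.0]])

rng = np.random.default_rng(seed=15)
scores = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_matrix, size=n)
x, y = scores[:, 0], scores[:, 1]

r_sample = np.corrcoef(x, y)[0, 1]
print(r_sample)
```

The sample value will not equal \(\rho\) exactly. It varies from sample to sample, and the variation shrinks as \(n\) grows.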

Computing correlations with NumPy and pingouin

The script shows how to compute Pearson’s \(r\) in two ways:
- using NumPy / pandas:
```
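import numpy as np

# df is assumed to be a pandas DataFrame with numeric columns "x" and "y".
# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is r.
r_np = np.corrcoef(df["x"], df["y"])[0, 1]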
```
- using pingouin, which also returns p-values, confidence intervals, Bayes factors, and power:
```
import pingouin as pg

# pg.corr returns a one-row table of statistics; the "r" column holds the estimate.
corr_table = pg.corr(df["x"], df["y"], method="pearson")
r_pg = corr_table["r"].iloc[0]
```
In our automated tests we verify that r_np and r_pg are essentially identical, and that pingouin recovers, within sampling error, the population correlation used to generate the data.

Correlation matrices, heatmaps, and partial correlations

The script also simulates a small set of psychology variables (for example, stress, sleep, anxiety, and exam scores) and then:
- computes a full correlation matrix,
- visualises the matrix as a color-coded heatmap,
- and calculates a partial correlation, such as the association between study time and exam performance after controlling for motivation.
These analyses use pingouin helper functions such as pingouin.pairwise_corr() and pingouin.partial_corr().
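To show what a full correlation matrix looks like, here is a minimal sketch that uses pandas' `DataFrame.corr()` rather than the pingouin helpers named above. The variable names follow the examples in this section, but the coefficients, seed, and sample size are invented for illustration and are not taken from the lab script.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=15)
n = 500

# Hypothetical psychology variables with made-up dependencies.
stress = rng.normal(size=n)
sleep = -0.5 * stress + rng.normal(size=n)
anxiety = 0.6 * stress + rng.normal(size=n)
exam = 0.4 * sleep - 0.3 * anxiety + rng.normal(size=n)

df_demo = pd.DataFrame(
    {"stress": stress, "sleep": sleep, "anxiety": anxiety, "exam": exam}
)

# Every pairwise Pearson correlation, arranged as a symmetric table.
corr_matrix = df_demo.corr(method="pearson")
print(corr_matrix.round(2))
```

The diagonal is always 1 (each variable correlates perfectly with itself), and the matrix is symmetric, so only the entries below or above the diagonal carry unique information. A heatmap is simply this table with each cell coloured by the size and sign of its correlation.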

The synthetic data are saved in the data/synthetic/psych_ch15_correlation.csv file, and the heatmap is written to outputs/track_b/ch15_corr_heatmap.png for easy inclusion in slides or lecture notes.

Note

In Chapters 15–19 we rely increasingly on pingouin and statsmodels for the actual statistical computations. PyStatsV1 focuses on simulation, data management, and workflow, while these libraries provide well-tested implementations of advanced techniques (correlation, regression, mixed ANOVA, ANCOVA, and more). Our unit tests use simulated data with known answers to check that these tools behave as expected in the scenarios we teach.