Chapter 15 – Correlation
==========================

In the previous chapters you learned how to *compare groups* using
t-tests and ANOVA.  Those designs are built around **experimental**
questions:

*Does changing X cause a difference in Y?*

In this chapter we turn to **association** questions:

*Do X and Y move together?  If so, how strongly and in what direction?*

Correlation is the workhorse of non-experimental psychology.
It is used to study relationships between naturally occurring variables
such as stress, sleep, depression, and exam performance.  You will see
correlation again in the next chapter as the foundation of
**linear regression**.

This chapter focuses on three big ideas:

* how to quantify the direction and strength of a linear relationship,
* how to read and interpret scatterplots,
* and why **correlation does not imply causation**.

At the end, the PyStatsV1 lab shows how to compute and visualise
correlations in Python using both NumPy/pandas and the
:mod:`pingouin` statistics library.


15.1 What Is a Correlation?
---------------------------

A **correlation** is a number that describes how two variables are
related.  In this chapter we focus on the most common measure:
the **Pearson product–moment correlation**, usually written as :math:`r`.

Pearson's :math:`r` tells you two things:

* **Direction** – whether high scores on one variable tend to go with
  high scores on the other (a *positive* correlation), or with low
  scores on the other (a *negative* correlation).
* **Strength** – how tightly the points cluster around a straight line.

The value of :math:`r` always lies between -1 and +1:

* :math:`r = +1.00` – a perfect positive linear relationship
* :math:`r = -1.00` – a perfect negative linear relationship
* :math:`r = 0` – no linear relationship

In real data, :math:`r` is almost never exactly -1, 0, or +1.  Instead
we see values like :math:`r = .10` (a weak relationship) or
:math:`r = .60` (a moderately strong relationship).


15.2 Computing Pearson's r
--------------------------

Conceptually, Pearson's :math:`r` is the **standardized covariance**
between two variables:

.. math::

   r = \frac{\text{cov}(X, Y)}{s_X s_Y}

Here :math:`\text{cov}(X, Y)` is the covariance between :math:`X` and
:math:`Y`, and :math:`s_X` and :math:`s_Y` are their sample standard
deviations.  Covariance captures whether high values of :math:`X`
tend to go with high (or low) values of :math:`Y`.  Dividing by the
standard deviations rescales the covariance to the familiar
-1 to +1 range.

In practice, you will almost always compute :math:`r` using software.
However, it is important to understand the basic ingredients:

1. Convert :math:`X` and :math:`Y` to **z-scores**.
2. Multiply the paired z-scores :math:`z_X z_Y` for each participant.
3. Average these cross-products.

The resulting average is exactly Pearson's :math:`r`.  When most
participants have the **same sign** of :math:`z_X` and :math:`z_Y`,
the cross-products are positive and :math:`r` is positive.  When
participants tend to have opposite signs (high on one variable,
low on the other), :math:`r` is negative.


15.3 Scatterplots and Visual Intuition
--------------------------------------

Before computing any correlation, you should **plot the data**.

A **scatterplot** places one variable on the x-axis and the other on
the y-axis.  Each participant is one point.

Scatterplots help you answer questions that the single number :math:`r`
cannot:

* Is the relationship **linear** or curved?
* Are there **outliers** that might distort the correlation?
* Does the variability change across the range of :math:`X`?

For example, a strong curved relationship can produce
:math:`r \approx 0` even though :math:`X` clearly predicts :math:`Y`.
Likewise, a single extreme outlier can produce a large :math:`r` that
does not represent the pattern for most participants.

.. important::

   **Always inspect a scatterplot before interpreting a correlation.**
   The number :math:`r` is helpful, but it is not a substitute for
   looking at the data.


15.4 Correlation Does Not Imply Causation
-----------------------------------------

Psychology students often hear the slogan:
**"Correlation does not imply causation."**
It is worth unpacking why this is true.

Suppose you find a strong positive correlation between time spent
on social media and self-reported anxiety.  At least three causal
stories are possible:

1. **Social media causes anxiety.**
2. **Anxiety causes social media use** (perhaps anxious people are
   more likely to scroll in bed).
3. A **third variable** (e.g., loneliness, insomnia) causes both
   heavy social media use and anxiety.

A correlation alone cannot distinguish between these possibilities.
To make a causal claim, you need an appropriate **research design**,
such as an experiment with random assignment or a carefully controlled
longitudinal study.

In this book we encourage the following mindset:

*Use correlation to describe and explore relationships,  
use experimental design to test causal claims.*


15.5 Partial Correlation: Controlling for a Third Variable
----------------------------------------------------------

Sometimes you want to know whether two variables are related
**after controlling for** another variable.  For example:

* Does study time predict exam score **after controlling for**
  prior GPA?
* Does therapy attendance predict symptom improvement **after controlling
  for** baseline severity?

A **partial correlation** answers questions like these.  It measures
the relationship between :math:`X` and :math:`Y` *after removing the
linear effect* of a third variable (or set of variables).

One way to think about this:

1. Regress :math:`X` on the control variable(s) and keep the residuals.
2. Regress :math:`Y` on the control variable(s) and keep the residuals.
3. Correlate the two sets of residuals.

The resulting partial correlation tells you whether :math:`X` and
:math:`Y` still move together once the shared influence of the control
variable has been removed.

You do not need to implement these regression steps by hand.
Libraries such as :mod:`pingouin` and :mod:`statsmodels` can compute
partial correlations directly from a tidy data frame.


15.6 PyStatsV1 Lab – Correlation in Python
------------------------------------------

The PyStatsV1 lab for this chapter is implemented in the script
:mod:`scripts.psych_ch15_correlation`.  It demonstrates three key
skills:

1. **Simulating data with a known population correlation**

   We use NumPy to simulate pairs of scores from a bivariate normal
   distribution with a specified population correlation (for example,
   :math:`\rho = .50`).  This allows us to check whether our estimation
   procedures recover the true value.

2. **Computing correlations with NumPy and Pingouin**

   The script shows how to compute Pearson's :math:`r` in two ways:

   * using NumPy / pandas:

     .. code-block:: python

        import numpy as np

        r_np = np.corrcoef(df["x"], df["y"])[0, 1]

   * using :mod:`pingouin`, which also returns p-values,
     confidence intervals, Bayes factors, and power:

     .. code-block:: python

        import pingouin as pg

        corr_table = pg.corr(df["x"], df["y"], method="pearson")
        r_pg = corr_table["r"].iloc[0]

   In our automated tests we verify that ``r_np`` and ``r_pg`` are
   essentially identical, and that :mod:`pingouin` recovers the
   population correlation used to generate the data.

3. **Correlation matrices, heatmaps, and partial correlations**

   The script also simulates a small set of psychology variables
   (for example, stress, sleep, anxiety, and exam scores) and then:

   * computes a full correlation matrix,
   * visualises the matrix as a color-coded heatmap,
   * and calculates a partial correlation, such as the association
     between study time and exam performance after controlling
     for motivation.

   These analyses use :mod:`pingouin` helper functions such as
   :func:`pingouin.pairwise_corr` and :func:`pingouin.partial_corr`.

The synthetic data are saved in the
``data/synthetic/psych_ch15_correlation.csv`` file, and the heatmap
is written to
``outputs/track_b/ch15_corr_heatmap.png`` for easy inclusion in slides
or lecture notes.

.. note::

   In Chapters 15–19 we rely increasingly on :mod:`pingouin` and
   :mod:`statsmodels` for the actual statistical computations.
   PyStatsV1 focuses on **simulation, data management, and workflow**,
   while these libraries provide well-tested implementations of
   advanced techniques (correlation, regression, mixed ANOVA, ANCOVA,
   and more).  Our unit tests use simulated data with known answers
   to check that these tools behave as expected in the scenarios we
   teach.