Chapter 16 – Linear Regression
==============================

In Chapter 15 you learned how to quantify relationships between two
variables using correlation. Correlation answers the question:

.. epigraph::

   *"How strongly are two variables associated?"*

Linear regression goes one step further. It answers a different
question:

.. epigraph::

   *"How well can we **predict** one variable from one or more
   other variables?"*

In this chapter you will:

* build the idea of a **line of best fit** for predicting an outcome,
* understand how the **least-squares** method chooses that line,
* interpret the **standard error of the estimate** as typical
  prediction error,
* extend to **multiple regression**, where several predictors work
  together,
* and use :mod:`pystatsv1` and :mod:`pingouin` to fit and interpret
  regression models on a synthetic psychology dataset.


16.1 Prediction: The Line of Best Fit
-------------------------------------

Imagine we want to predict a student's exam score from their number of
study hours. Each student gives us one data point:
``(study_hours, exam_score)``. If we scatter those points, a clear
trend often appears: students who study more tend to score higher.

**Linear regression** summarizes that trend with a straight line:

.. math::

   \widehat{Y} = bX + a

where

* :math:`\widehat{Y}` is the *predicted* value of the outcome,
* :math:`X` is the predictor,
* :math:`b` is the **slope** (how much :math:`Y` changes for a
  one-unit change in :math:`X`), and
* :math:`a` is the **intercept** (the predicted value of :math:`Y`
  when :math:`X = 0`).

In psychology, we often use regression to predict:

* exam performance from study time,
* depressive symptoms from life stress,
* therapy outcomes from baseline severity and treatment type,
* or attention scores from sleep quality and caffeine intake.

The important idea is that regression is a **model of prediction**.
We care both about *how strong* the relationship is and *how well we
can forecast new data*.

Interpretation of the slope
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose our fitted line is

.. math::

   \widehat{\text{Exam Score}} = 5.0 \times \text{Study Hours} + 60.

The slope :math:`b = 5.0` means:

*For every extra hour of study, exam score is predicted to increase by
about 5 points, on average.*

The intercept :math:`a = 60` means:

*A student who studied 0 hours is predicted to score 60 (although we
should be cautious about interpreting predictions far outside the
observed range).*


16.2 Least Squares: Choosing the Best Line
------------------------------------------

There are infinitely many lines we could draw through a cloud of
points. Linear regression chooses the one that minimizes the **sum of
squared residuals**.

A **residual** is the difference between the observed outcome and the
predicted outcome from the line:

.. math::

   e_i = Y_i - \widehat{Y}_i.

The **least-squares** solution chooses :math:`a` and :math:`b` to
minimize

.. math::

   \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2.

This has several nice properties:

* it gives more weight to large errors (because of squaring),
* it has a closed-form solution (no iterative search is required),
* and it links naturally to the Pearson correlation :math:`r` and
  the ANOVA framework you saw in Chapters 12–14.

In the Chapter 16 lab script we compute the least-squares solution
using both basic NumPy functions and the higher-level
:func:`pingouin.linear_regression` helper for cross-checking.


16.3 Standard Error of the Estimate
-----------------------------------

No regression line predicts perfectly. Different students with the
same study hours will still have different exam scores. The **standard
error of the estimate** summarizes typical prediction error in the
original units of the outcome variable.

After we fit the line, we compute residuals :math:`e_i` for each case
and then:

.. math::

   S_\text{est} = \sqrt{\frac{\sum e_i^2}{n - 2}}.

This is similar to a standard deviation of the residuals. A smaller
:math:`S_\text{est}` means:

* predictions are typically closer to the observed scores,
* the regression line fits the data more tightly,
* and we have more precise forecasts for new cases.

In the lab we will:

* print :math:`S_\text{est}` for our simple regression model,
* compare it across different models,
* and relate it back to the residual plots.


16.4 Multiple Regression and R²
-------------------------------

Real psychological outcomes usually depend on **many** factors at
once. Multiple regression extends the linear model to several
predictors:

.. math::

   \widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k.

For example, our synthetic dataset in Chapter 16 includes:

* ``study_hours`` – weekly study time,
* ``sleep_hours`` – average nightly sleep,
* ``stress`` – perceived stress score,
* and ``exam_score`` – exam performance.

A multiple regression model might be:

.. math::

   \widehat{\text{Exam Score}} =
   b_0 + b_1 \times \text{Study Hours}
       + b_2 \times \text{Sleep Hours}
       - b_3 \times \text{Stress}.

Key ideas:

* Each slope :math:`b_j` is a **partial effect**: it tells us how
  much :math:`Y` is expected to change when :math:`X_j` increases by
  one unit, holding the other predictors constant.
* We can assess overall model fit with :math:`R^2`, the proportion of
  variance in :math:`Y` explained by the predictors.
* Adjusted :math:`R^2` penalizes adding predictors that do not really
  help.

In the lab script we use :func:`pingouin.linear_regression` to fit a
multiple regression model with ``exam_score`` as the outcome and
multiple predictors. We then interpret:

* the coefficient signs (which predictors help or hurt),
* their statistical significance (p-values),
* and the overall :math:`R^2` / adjusted :math:`R^2`.


16.5 PyStatsV1 Lab: Building a Predictive Model
-----------------------------------------------

The Chapter 16 lab script is
:mod:`scripts.psych_ch16_regression`. It demonstrates:

* simulating a psychology dataset with exam performance,
* fitting and interpreting a **simple linear regression**,
* fitting a **multiple regression** with several predictors,
* saving results for replication,
* and visualizing the line of best fit.

Overview of the lab script
~~~~~~~~~~~~~~~~~~~~~~~~~~

The script is structured into a few main helper functions:

* :func:`simulate_psych_regression_dataset`

  Creates a synthetic dataset with columns such as
  ``stress``, ``sleep_hours``, ``study_hours``, and ``exam_score``,
  using a known underlying regression model. Because we know the
  "true" slopes, we can check that the estimated values behave as
  expected.

* :func:`fit_simple_regression`

  Fits a simple regression predicting ``exam_score`` from
  ``study_hours``. Returns the slope, intercept, correlation,
  :math:`R^2`, and standard error of the estimate.

* :func:`fit_multiple_regression`

  Uses :func:`pingouin.linear_regression` to fit a multiple regression
  model with several predictors. Returns the regression table along
  with :math:`R^2` and adjusted :math:`R^2` for quick inspection.

* :func:`plot_regression_line`

  Generates a scatterplot of ``study_hours`` versus ``exam_score``
  along with the fitted line. The figure is saved to the
  ``outputs/track_b`` folder for use in slides or assignments.

When you run the script via:

.. code-block:: bash

   make psych-ch16

you will see printed output that includes:

* a preview of the simulated dataset,
* the simple regression slope, intercept, :math:`R^2`, and
  standard error of the estimate,
* the multiple regression summary from :mod:`pingouin`,
* and file paths where the data, table, and figure were saved.


Files written by the lab
~~~~~~~~~~~~~~~~~~~~~~~~~

The script saves three main artifacts:

* ``data/synthetic/psych_ch16_regression.csv``

  The simulated psychology dataset used for all analyses.

* ``outputs/track_b/ch16_regression_summary.csv``

  A CSV file containing the multiple regression summary table
  produced by :func:`pingouin.linear_regression`.

* ``outputs/track_b/ch16_regression_fit.png``

  A scatterplot of study hours and exam scores with the regression
  line superimposed.

These files make it easy to reproduce the main figures and tables
for homework, lecture slides, or exam preparation.


PyStatsV1 and Pingouin together
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As in Chapters 14 and 15, we treat :mod:`pingouin` as a
**trusted reference implementation**. Our own helper functions
use NumPy and pandas to compute regression quantities "from scratch",
and then we cross-check key numbers against
:func:`pingouin.linear_regression`.

This dual approach reinforces the core philosophy of PyStatsV1:

.. epigraph::

   *Don't just calculate your results — engineer them.*

By writing small, well-tested functions and validating them against
trusted libraries, students learn both the statistical ideas and the
software engineering mindset needed for reproducible science.


Checklist: What You Should Be Able to Do
----------------------------------------

By the end of Chapter 16, you should be able to:

* explain what the regression line :math:`\widehat{Y} = bX + a` means
  in words,
* interpret the slope and intercept in a psychology example,
* describe how the least-squares method chooses the "best" line,
* compute and interpret the standard error of the estimate,
* explain the difference between simple and multiple regression,
* interpret :math:`R^2` and adjusted :math:`R^2`,
* and run the Chapter 16 PyStatsV1 lab to build and evaluate a
  predictive model.