.. _psych_ch5: Psychological Science & Statistics – Chapter 5 ============================================== Central Tendency and Variability: Summarizing What We See --------------------------------------------------------- Chapter 4 was about *looking* at data—frequency tables and graphs that show the overall shape of a distribution. In this chapter we take the next step: *Can we describe that distribution with a few meaningful numbers?* Psychology relies heavily on this kind of summary. We say things like: * "On average, the treatment group slept longer than the control group." * "There was a lot of variability in stress scores." * "Most participants were near the mean, but a few scored far out in the tails." This chapter introduces: * measures of **central tendency** (where the distribution is centered), and * measures of **variability** (how spread out the scores are). We will keep connecting these ideas back to the sleep-study dataset and other realistic psychological examples. 5.1 Why central tendency and variability both matter ---------------------------------------------------- If you only know the *average* of a set of scores, you know almost nothing about how individuals actually behaved. Imagine two classes that took the same exam: * Class A: Most students scored between 78 and 82. * Class B: Half the students scored around 50 and the other half around 100. Both classes could have the **same mean**, but the stories these data tell are very different. To understand a distribution, we need **both**: * a number that summarizes "typical" or "central" performance, and * a number that summarizes how much scores vary around that center. 5.2 Measures of central tendency: mean, median, mode ----------------------------------------------------- There are three classic measures of central tendency. The mean (arithmetic average) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The **mean** is what most people casually call the "average". It is defined as .. math:: \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i, where :math:`x_i` are the individual scores and :math:`N` is the number of participants. *Psychology use case:* We often report the mean reaction time, mean depression score, mean hours of sleep, etc. **Strengths** * Uses *all* the data. * Works well with many statistical models (especially those based on the Normal distribution). **Weaknesses** * Extremely sensitive to **outliers** (one person who barely slept can drag down the mean sleep hours). * Can be misleading for heavily skewed distributions. The median (the middle score) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The **median** is the value that splits the distribution in half: * 50% of scores are at or below the median. * 50% are at or above. To find the median, sort the scores from smallest to largest, then pick the middle one (or average the two middle scores if there are an even number of observations). **Strengths** * Robust against outliers and skewed data. * Often a better description of "typical" behavior when the distribution is highly skewed (e.g., income, number of social media followers). The mode (most frequent score) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The **mode** is simply the most common value in a distribution. * For continuous variables (like reaction time), the mode is often less useful. * For categorical variables (e.g., study method, therapy type), the mode tells you which category is most popular. **In practice** In applied psychology, we usually report **mean** and **standard deviation** for roughly symmetric, continuous variables, and we may report **median** and a measure of spread (e.g., interquartile range) when the distribution is skewed. 5.3 The problem with averages: when the mean misleads ----------------------------------------------------- The mean can give a false sense of what is "typical". Example: Sleep and outliers ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppose we measured hours of sleep last night for 10 participants: .. math:: 6, 6.5, 7, 7, 7.5, 8, 8, 8.5, 9, 2 Most participants slept between 6 and 9 hours, but one person only slept 2. The mean is .. math:: \bar{x} = \frac{6 + 6.5 + \dots + 9 + 2}{10} = 7.1 \text{ hours (approx).} The median, however, is 7.5 hours. *If you were designing a sleep intervention, which number better captures what is typical in this group?* This demonstrates: * **Outliers** can pull the mean away from where most data lie. * Reporting the median alongside the mean can help detect this problem. * Graphs (like histograms) are essential companions to numerical summaries. 5.4 Measures of variability: range, IQR, variance, SD ----------------------------------------------------- Central tendency tells us *where* scores cluster. Variability tells us *how tightly* they cluster. The range ~~~~~~~~~ The **range** is the simplest measure: .. math:: \text{Range} = \text{Maximum} - \text{Minimum}. It shows the width of the distribution but is extremely sensitive to outliers. The interquartile range (IQR) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The **interquartile range (IQR)** focuses on the middle 50% of the data: * :math:`Q_1` (first quartile): 25th percentile. * :math:`Q_3` (third quartile): 75th percentile. .. math:: \text{IQR} = Q_3 - Q_1. A large IQR means scores are spread out; a small IQR means participants are relatively similar. The variance and standard deviation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The **variance** and **standard deviation (SD)** go beyond extremes and quantiles by using all the data. For a **sample** of scores :math:`x_1, \dots, x_N`, the sample variance is .. math:: s^2 = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \bar{x})^2. The standard deviation is the square root: .. math:: s = \sqrt{s^2}. Interpretation: * If scores are tightly clustered around the mean, :math:`s` is small. * If scores are widely spread out, :math:`s` is large. * Under many models, most scores fall within about 1–2 standard deviations of the mean. In psychological research reports we almost always see something like: *“Participants slept an average of 7.2 hours (SD = 1.1).”* 5.5 Degrees of freedom: why divide by N − 1? -------------------------------------------- You may have noticed that the variance formula uses :math:`N - 1` rather than :math:`N` in the denominator. This is related to the idea of **degrees of freedom (df)**. Informally, degrees of freedom are the number of independent pieces of information available for estimating a parameter. For the sample variance: * Once you know the sample mean :math:`\bar{x}`, the deviations :math:`(x_i - \bar{x})` must sum to zero. * That means if you know :math:`N - 1` of the deviations, the last one is already determined. So there are only :math:`N - 1` independent deviations, and we divide by :math:`N - 1` to obtain an **unbiased** estimate of the population variance. This idea of degrees of freedom will appear again in t-tests and ANOVAs later in the mini-book. 5.6 PyStatsV1 Lab: Summarizing the sleep-study data --------------------------------------------------- In this lab we return to the **sleep study** dataset. We will: * compute mean, median, and mode for hours of sleep, * compute range, IQR, and standard deviation, * compare summaries across study methods. If you ever need to regenerate the underlying CSV file for this dataset, see the instructor note in :ref:`psych_ch4` about running ``scripts/sim_psych_sleep_study.py``. Loading the dataset ~~~~~~~~~~~~~~~~~~~ If you have cloned the PyStatsV1 repository, the CSV file will be located in the ``data`` folder. You can load it with pandas: .. code-block:: python import pandas as pd data = pd.read_csv("data/psych_sleep_study.csv") print(data.head()) print(data.dtypes) You should see variables such as: * ``participant_id`` – unique ID per participant, * ``sleep_hours`` – hours of sleep last night (continuous), * ``study_method`` – preferred study method (categorical), * ``chronotype`` – morning/evening type (categorical), * possibly additional variables (e.g., stress score) depending on the simulation. Overall summaries ~~~~~~~~~~~~~~~~~ First, let us compute basic summaries for the entire sample: .. code-block:: python sleep = data["sleep_hours"] mean_sleep = sleep.mean() median_sleep = sleep.median() mode_sleep = sleep.mode() # may return more than one value print(f"Mean sleep: {mean_sleep:.2f} hours") print(f"Median sleep: {median_sleep:.2f} hours") print("Mode(s):") print(mode_sleep.values) # Measures of spread sleep_range = sleep.max() - sleep.min() iqr_sleep = sleep.quantile(0.75) - sleep.quantile(0.25) sd_sleep = sleep.std(ddof=1) print(f"Range: {sleep_range:.2f} hours") print(f"IQR: {iqr_sleep:.2f} hours") print(f"SD: {sd_sleep:.2f} hours") As you run this code, ask: * Is the mean close to the median, or are there signs of skewness? * Does the SD seem small (participants similar) or large (participants differ widely)? * Do the numbers match what you saw in the histogram from Chapter 4? Group summaries by study method ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now let us see whether preferred study method is associated with how much students slept. We compute group-wise means and SDs: .. code-block:: python grouped = ( data .groupby("study_method")["sleep_hours"] .agg(["count", "mean", "median", "std"]) .rename(columns={"std": "sd"}) .sort_values("mean", ascending=False) ) print(grouped) This table tells us, for each study method: * how many students chose it (``count``), * their average hours of sleep (``mean``), * the median hours of sleep (``median``), * how much their sleep varies (``sd``). You might find, for example, that students who use practice tests have slightly different sleep patterns than those who rely on re-reading. Connecting back to research design ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In a real study, we might ask: * Is study method causing differences in sleep? * Or are both sleep and study method being influenced by some third variable, like stress or personality? For now, our goal is more modest: using central tendency and variability to summarize what is happening in this sample. 5.7 What you should take away ----------------------------- By the end of this chapter and lab you should be able to: * explain the difference between **mean**, **median**, and **mode**, and when each is most appropriate, * describe why averages can be misleading in the presence of **outliers** or **skewed** distributions, * compute and interpret common measures of variability (range, IQR, variance, standard deviation), * understand, at an intuitive level, why the sample variance uses :math:`N - 1` in the denominator (degrees of freedom), * use Python and pandas to compute these summaries for a realistic psychological dataset, * compare central tendency and spread across groups (e.g., different study methods) to generate new research questions. In the next chapter we connect these ideas to the **Normal distribution** and **z-scores**, which provide a bridge from descriptive summaries to probabilities and, eventually, to hypothesis testing.