.. _psych_ch4: Psychological Science & Statistics – Chapter 4 ============================================== Frequency Distributions and Visualization: Finding Patterns in Data ------------------------------------------------------------------- Before we run statistics, we need to *see* the data. Psychology students often jump straight to t-tests or correlations, but rigorous researchers begin with a simpler, essential question: **“What does the data *look* like?”** This chapter introduces **Exploratory Data Analysis (EDA)**. These tools allow us to detect patterns, spot data entry errors (e.g., a participant aged 150), and determine whether our data meets the assumptions required for complex testing later. Why this chapter matters ------------------------ Exploratory Data Analysis (EDA) is the first real step of psychological science. Before testing hypotheses, we must ensure our data are clean, plausible, and interpretable. Visualizing and tabulating data helps researchers: * detect errors (e.g., impossible ages or reaction times), * understand participant behavior patterns, * diagnose assumptions required for later inferential tests, * and communicate results clearly using APA-style figures. This chapter builds the foundation for every analysis that follows. 4.1 Frequency tables: organizing data ------------------------------------- A **frequency table** counts how many times each value occurs. However, raw counts aren't always enough. In psychology, we often need two additional columns: 1. **Relative Frequency (%)** – the percentage of the total sample. 2. **Cumulative Frequency** – the running total (helps compute percentiles). Relative Frequency Formula ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. math:: \text{relative frequency} = \frac{f_i}{N} \text{percent} = \frac{f_i}{N} \times 100 Cumulative Frequency Formula ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. math:: \text{cum } f_i = f_1 + f_2 + \ldots + f_i Categorical Example ~~~~~~~~~~~~~~~~~~~ Imagine we asked 50 students about their preferred study method: .. list-table:: :widths: 40 20 20 :header-rows: 1 * - Study Method - Frequency (f) - Relative Freq (%) * - Flashcards - 18 - 36% * - Re-reading - 12 - 24% * - Practice tests - 15 - 30% * - Other - 5 - 10% * - **Total** - **50** - **100%** *Psychological Insight:* While cognitive science shows retrieval practice ("Practice tests") is highly effective, only 30% of this sample uses it. Continuous Example (Grouped) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Continuous variables (like reaction time or sleep duration) have too many unique values to list individually. We group them into **bins** (intervals). .. list-table:: Sleep Duration (N=100) :widths: 30 30 40 :header-rows: 1 * - Hours Slept - Frequency - Cumulative Frequency * - 4.0 – 5.9 - 6 - 6 * - 6.0 – 6.9 - 21 - 27 * - 7.0 – 7.9 - 44 - 71 * - 8.0 – 8.9 - 23 - 94 * - 9.0 – 10.0 - 6 - 100 .. note:: **The "Real Limits" Rule:** In PyStatsV1 and most statistical software, bins are treated as inclusive of the lower bound and exclusive of the upper bound (e.g., ``6.0 <= x < 7.0``). 4.2 Visualizing continuous data: histograms ------------------------------------------- Numbers are helpful, but humans are visual creatures. A histogram looks like a bar chart, but the bars **touch**. This signifies that the variable is continuous—there is no gap between 6.99 hours and 7.00 hours. Key interpretation checks: * **Peaks:** Where is the data clustered? * **Spread:** Tight vs. wide distributions. * **Outliers:** Lone bars far from the others. The Frequency Polygon ~~~~~~~~~~~~~~~~~~~~~ If we place a dot at the top-center of every histogram bar and connect the dots, we get a **Frequency Polygon**. Why polygons? * Comparing groups becomes cleaner. * Two overlaid histograms are messy; two polygons are readable. .. note:: **Accessibility Reminder:** Use colorblind-safe palettes (e.g., blue/orange) and never encode group differences by color alone. Add labels or line styles. 4.3 Visualizing categorical data: bar charts -------------------------------------------- For nominal variables, we use **bar charts**. Rules for APA-Style Bar Charts ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. Bars **should not** touch (these are discrete categories). 2. Order bars by **frequency**, not alphabetically, unless the categories have a natural order. 3. Keep labeling clean and descriptive. .. warning:: **The Prohibition of Pie Charts** Pie charts are rarely used in scientific psychology. * Humans struggle to compare angles. * Small differences are nearly invisible. * More than 4 categories = unreadable. **Use bar charts instead.** 4.4 The shape of data: skewness and kurtosis -------------------------------------------- Understanding the *shape* helps diagnose phenomena and potential design issues. Skewness: the tails ~~~~~~~~~~~~~~~~~~~ **Positive Skew (Right Skew)** Tail extends to the right. *Psychology Context:* **Floor Effects** Example: A very difficult memory test — most people score low, but a few score high. **Negative Skew (Left Skew)** Tail extends to the left. *Psychology Context:* **Ceiling Effects** Example: An exam that was too easy — most people score high, with a small tail of low performers. Kurtosis: the peak ~~~~~~~~~~~~~~~~~~ Kurtosis describes thickness of tails vs. center. * **Leptokurtic:** Tall, thin — very similar scores (e.g., elite athletes). * **Platykurtic:** Flat — large variability (e.g., general population samples). * **Mesokurtic:** Normal distribution — moderate shape. 4.5 PyStatsV1 Lab: Exploring the sleep study dataset ---------------------------------------------------- In this chapter’s lab we will use the synthetic *sleep study* dataset you saw earlier in this mini-book plan. It lives in the PyStatsV1 repository and is generated by the helper module ``scripts/sim_psych_sleep_study.py``. .. note:: For instructors and maintainers The ``sleep_study`` dataset used in this and later chapters is generated from a small simulation script in the PyStatsV1 repository. If you ever need to recreate the CSV from scratch (for example, after editing the data-generating assumptions), run the following command from the project root:: python scripts/sim_psych_sleep_study.py By default this will (re)write the file ``data/psych_sleep_study.csv`` with the same structure and random seed that the textbook examples and tests expect. See ``scripts/sim_psych_sleep_study.py`` for more details and optional arguments. The goal is to give you a repeatable pattern: * load a realistic psychology dataset in Python, * inspect its variables, * and create basic plots that match the ideas in this chapter. Loading the data ~~~~~~~~~~~~~~~~ From the root of the PyStatsV1 repository, open a Python session or Jupyter notebook and run: .. code-block:: python from scripts.sim_psych_sleep_study import load_sleep_study df = load_sleep_study() df.head() You should see columns like: * ``id`` – participant ID, * ``class_year`` – first_year, second_year, third_year, or fourth_year, * ``sleep_hours`` – average weeknight sleep (hours), * ``study_method`` – flashcards, rereading, practice_tests, or mixed, * ``exam_score`` – exam percentage (0–100). The first time you call :func:`load_sleep_study`, it will create the CSV file ``data/synthetic/psych_sleep_study.csv``. Later calls simply read that file so you get the *same* dataset each time. Frequency tables for study methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let’s build a frequency table for the categorical variable ``study_method``: .. code-block:: python # Frequency (counts) freq = df["study_method"].value_counts().rename("f") # Relative frequency (percentages) rel_freq = (freq / len(df) * 100).round(1).rename("percent") # Combine into one DataFrame table = ( pd.concat([freq, rel_freq], axis=1) .reset_index() .rename(columns={"index": "study_method"}) .sort_values("f", ascending=False) ) print(table) This prints a table like: .. code-block:: text study_method f percent practice_tests 36 30.0 flashcards 34 28.3 rereading 30 25.0 mixed 20 16.7 This mirrors the frequency tables earlier in the chapter, but now the numbers come from data you could plausibly collect in a real study skills experiment. Histogram of sleep hours ~~~~~~~~~~~~~~~~~~~~~~~~ Now make a histogram of the continuous variable ``sleep_hours``: .. code-block:: python import matplotlib.pyplot as plt plt.hist(df["sleep_hours"], bins=10, edgecolor="black") plt.xlabel("Sleep hours (weeknight average)") plt.ylabel("Number of students") plt.title("Distribution of sleep duration") plt.show() When you look at the plot, ask: * Where is the data clustered (what is the *mode*)? * Are there students sleeping very little or a lot (potential outliers)? * Does the shape look roughly symmetric, or skewed? Bar chart for study methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~ For the categorical ``study_method`` variable, use a bar chart: .. code-block:: python freq = df["study_method"].value_counts().sort_values(ascending=False) plt.bar(freq.index, freq.values) plt.ylabel("Number of students") plt.title("Preferred study method") plt.xticks(rotation=20) plt.show() Notice that the bars are separated (this is categorical, not continuous) and we have ordered them from most common to least common. 4.6 What you should take away ----------------------------- By the end of this chapter and lab you should be able to: * construct frequency tables for categorical and grouped continuous data, * choose appropriate visualizations (histograms for continuous data, bar charts for categorical data), * interpret the *shape* of a distribution (center, spread, skewness), * and run a small, reproducible analysis by loading ``load_sleep_study()`` from the PyStatsV1 repository. In later chapters we will reuse this same dataset when we talk about measures of central tendency, variability, and eventually correlation and regression. That way the plots and statistics you see in the text are always tied to a concrete psychological story.