Applied Statistics with Python – Chapter 4
==========================================

Summarizing data
----------------

This chapter parallels the “Summarizing Data” chapter from the R notes.
The statistical ideas are the same:

* For **numeric variables**, we summarize the distribution using measures
  of **center** and **spread**.
* For **categorical variables**, we summarize using **counts** and
  **proportions**.
* We then use **plots** to visualize those summaries.

In the R version you see functions like ``mean()``, ``median()``, ``sd()``,
``IQR()``, ``hist()``, ``boxplot()``, and ``plot()`` for scatterplots.  Here
we’ll use Python, NumPy, pandas, and Matplotlib to achieve the same goals.

Throughout, imagine we have a DataFrame ``mpg`` that mirrors the R
``ggplot2::mpg`` dataset, with columns like:

* ``cty`` – city miles per gallon
* ``hwy`` – highway miles per gallon
* ``drv`` – drivetrain (``"f"``, ``"r"``, ``"4"``)
* ``displ`` – engine displacement in liters

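You could obtain this DataFrame in several ways. Be aware that seaborn’s
built-in ``mpg`` dataset (``sns.load_dataset("mpg")``) is a *different*
dataset, with columns such as ``mpg``, ``cylinders``, and ``horsepower``; it
has no ``cty``, ``hwy``, or ``drv`` columns, so it will not work with the code
in this chapter. Two options that do provide the ``ggplot2`` data:

.. code-block:: python

   import pandas as pd

   # Option 1: read from a CSV bundled with your project
   mpg = pd.read_csv("data/mpg.csv")

   # Option 2: plotnine ships a copy of ggplot2's mpg (requires plotnine)
   # from plotnine.data import mpg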


4.1 Summary statistics
----------------------

We start with **summary statistics**: numbers that describe center, spread,
and distribution shape for a variable.

4.1.1 Numeric variables: center and spread
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In R you saw a table of summaries like:

* mean
* median
* variance
* standard deviation
* interquartile range (IQR)
* minimum, maximum, range

We can compute the same quantities with pandas:

.. code-block:: python

   import numpy as np
   import pandas as pd

   # City miles per gallon
   cty = mpg["cty"]

   # Center
   cty_mean = cty.mean()      # average
   cty_median = cty.median()  # median

   # Spread
   cty_var = cty.var(ddof=1)  # sample variance (n-1)
   cty_sd = cty.std(ddof=1)   # sample standard deviation (n-1)
   cty_iqr = cty.quantile(0.75) - cty.quantile(0.25)

   cty_min = cty.min()
   cty_max = cty.max()
   cty_range = cty_max - cty_min

   summary = {
       "mean": cty_mean,
       "median": cty_median,
       "variance": cty_var,
       "sd": cty_sd,
       "IQR": cty_iqr,
       "min": cty_min,
       "max": cty_max,
       "range": cty_range,
   }

   summary

A quick shortcut for many of these is ``describe``:

.. code-block:: python

   mpg["cty"].describe()

which returns count, mean, standard deviation, quartiles, min, and max.

**Conceptual recap**

* Mean: arithmetic average; sensitive to outliers.
* Median: middle value; robust to outliers.
* Variance: average squared distance from the mean (dividing by ``n - 1``
  for a sample). SD: the square root of the variance, in the same units as
  the data.
* IQR: distance between the 25th and 75th percentiles (middle 50% of data).
* Min/Max/Range: show the extremes of the distribution.
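
To make the contrast between the mean and the median concrete, here is a
small sketch with made-up numbers (the values are hypothetical, not taken
from ``mpg``). Replacing one value with an extreme one shifts the mean
noticeably and leaves the median where it was:

.. code-block:: python

   import pandas as pd

   # Hypothetical city-MPG values, invented for illustration
   clean = pd.Series([18, 19, 20, 21, 22])
   # Same data with the largest value replaced by an extreme one
   with_outlier = pd.Series([18, 19, 20, 21, 72])

   print(clean.mean(), clean.median())                # 20.0 20.0
   print(with_outlier.mean(), with_outlier.median())  # 30.0 20.0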

Python vs R differences:

* R’s ``var()`` and ``sd()`` use ``n-1`` by default (unbiased estimators).
* pandas also uses ``ddof=1`` by default, for ``Series.var``/``Series.std``
  as well as ``DataFrame.var``/``DataFrame.std``.
* NumPy’s ``np.var`` and ``np.std`` default to ``ddof=0`` (divide by ``n``).
  Use ``ddof=1`` to match the R textbook.
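
The difference in defaults is easy to check on a tiny example. The sketch
below uses hypothetical values chosen so the arithmetic can be verified by
hand: the squared deviations from the mean sum to 20, so dividing by
``n - 1 = 3`` gives about 6.67 and dividing by ``n = 4`` gives 5.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Hypothetical values, chosen so the arithmetic is easy to check by hand
   x = pd.Series([2.0, 4.0, 6.0, 8.0])

   var_pandas = x.var()                          # pandas default: ddof=1 (n - 1)
   var_numpy = np.var(x.to_numpy())              # NumPy default: ddof=0 (n)
   var_numpy_r = np.var(x.to_numpy(), ddof=1)    # matches R's var()

   print(var_pandas, var_numpy, var_numpy_r)

If a result differs from the R notes only in the later decimal places, a
mismatched ``ddof`` is a likely cause.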


4.1.2 Categorical variables: counts and proportions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For categorical variables, we care about **how often** each level appears.

In R, you saw ``table(mpg$drv)`` and relative frequencies with
``table(mpg$drv) / nrow(mpg)``.

In pandas:

.. code-block:: python

   drv_counts = mpg["drv"].value_counts()
   drv_props = mpg["drv"].value_counts(normalize=True)

   # In a notebook only the last bare expression is displayed, so print both
   print(drv_counts)
   print(drv_props)

This gives frequency and proportion for each drivetrain category.

Key ideas:

* ``value_counts()`` is the pandas analogue of ``table()`` in R.
* ``normalize=True`` turns counts into proportions.
* These summaries are the numerical counterpart of a bar chart.
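
As a quick check of the ``table()`` analogy, the sketch below uses a
handful of hypothetical drivetrain labels (not the real ``mpg`` data) and
confirms that ``normalize=True`` gives the same result as dividing the
counts by the number of observations:

.. code-block:: python

   import pandas as pd

   # Hypothetical drivetrain labels, invented for illustration
   drv = pd.Series(["f", "f", "4", "r", "f", "4"], name="drv")

   counts = drv.value_counts()                # like table(drv) in R
   props = drv.value_counts(normalize=True)   # like table(drv) / length(drv)

   print(counts)
   print(props)

   # Dividing the counts by the number of observations gives the same result
   manual_props = counts / len(drv)

One caveat: ``value_counts`` skips missing values by default
(``dropna=True``), so when a column contains missing data the proportions
are relative to the non-missing observations rather than to ``len(drv)``.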


4.2 Plotting
------------

Numeric tables are useful, but most of the time we learn more from a good
visualization.

We’ll mirror the same four plot types as the R chapter:

* Histograms
* Bar charts
* Boxplots
* Scatterplots

We will use Matplotlib and pandas plotting helpers. These examples assume:

.. code-block:: python

   import matplotlib.pyplot as plt


4.2.1 Histograms
~~~~~~~~~~~~~~~~

When you have **one numeric variable**, a histogram is the workhorse plot.

In R you saw ``hist(mpg$cty)``, followed by a more polished version with
axis labels, a title, custom breaks, and colors.

In Python/pandas:

.. code-block:: python

   fig, ax = plt.subplots()

   mpg["cty"].hist(
       bins=12,                # similar idea to breaks =
       color="dodgerblue",
       edgecolor="darkorange",
       ax=ax,
   )

   ax.set_xlabel("Miles per gallon (city)")
   ax.set_ylabel("Frequency")
   ax.set_title("Histogram of MPG (city)")

   plt.tight_layout()

Notes:

* ``bins`` is analogous to R’s ``breaks`` argument.
* Always label axes and add a clear title.
* ``hist`` gives the familiar histogram shape: with equal-width bins, each
  bar’s height is the count in that bin (pass ``density=True`` to plot
  densities instead, so that the bar areas sum to 1).
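
To see the numbers behind a histogram, the following sketch computes the
bin counts directly with ``np.histogram``. The data values and bin edges
are hypothetical, chosen only so the counts are easy to verify by eye:

.. code-block:: python

   import numpy as np

   # Hypothetical city-MPG values, invented for illustration
   values = np.array([11, 14, 15, 15, 17, 18, 18, 19, 21, 24])

   # Explicit bin edges play the role of R's breaks = c(10, 15, 20, 25)
   edges = [10, 15, 20, 25]

   counts, _ = np.histogram(values, bins=edges)
   density, _ = np.histogram(values, bins=edges, density=True)

   print(counts)                            # bar heights in a frequency histogram
   print((density * np.diff(edges)).sum())  # total bar area in a density histogram

The same list of edges can be passed as ``bins=edges`` to ``Series.hist``.
Note that NumPy and Matplotlib bins are closed on the left (a value of 15
falls in the 15 to 20 bin), whereas R’s ``hist()`` uses right-closed bins by
default, so values sitting exactly on an edge can be counted in different
bars.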


4.2.2 Bar charts
~~~~~~~~~~~~~~~~

Bar charts summarize **categorical** variables (or numeric variables
with a small number of distinct values).

R example: ``barplot(table(mpg$drv))``.

Python:

.. code-block:: python

   drv_counts = mpg["drv"].value_counts().sort_index()

   fig, ax = plt.subplots()

   drv_counts.plot(
       kind="bar",
       color="dodgerblue",
       edgecolor="darkorange",
       ax=ax,
   )

   ax.set_xlabel("Drivetrain (f = FWD, r = RWD, 4 = 4WD)")
   ax.set_ylabel("Frequency")
   ax.set_title("Drivetrains")

   plt.tight_layout()

If you want **proportions** instead of counts, apply ``value_counts(normalize=True)``:

.. code-block:: python

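   drv_props = mpg["drv"].value_counts(normalize=True).sort_index()

   fig, ax = plt.subplots()            # new figure, so bars are not drawn over the counts chart
   drv_props.plot(kind="bar", ax=ax)   # same idea; y-axis now in [0, 1]
   ax.set_ylabel("Proportion")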


4.2.3 Boxplots
~~~~~~~~~~~~~~

Boxplots are ideal when you want to summarize the distribution of a numeric
variable, especially **across groups** defined by a categorical variable.

Single boxplot
^^^^^^^^^^^^^^

R: ``boxplot(mpg$hwy)``

Python/pandas:

.. code-block:: python

   fig, ax = plt.subplots()

   mpg["hwy"].plot(kind="box", ax=ax)  # vertical orientation is the default

   ax.set_ylabel("Miles per gallon (highway)")
   ax.set_title("Highway MPG – overall distribution")

   plt.tight_layout()

Grouped boxplots
^^^^^^^^^^^^^^^^

R syntax: ``boxplot(hwy ~ drv, data = mpg)`` – highway MPG by drivetrain.

In pandas, ``DataFrame.boxplot`` does the grouping for us: ``column`` names
the numeric variable and ``by`` names the grouping variable:

.. code-block:: python

   fig, ax = plt.subplots()

   mpg.boxplot(
       column="hwy",
       by="drv",
       ax=ax,
       grid=False,
   )

   ax.set_xlabel("Drivetrain (f = FWD, r = RWD, 4 = 4WD)")
   ax.set_ylabel("Miles per gallon (highway)")
   ax.set_title("MPG (highway) vs drivetrain")
   # pandas adds its own super-title; remove if you like:
   fig.suptitle("")

   plt.tight_layout()

Interpretation reminders:

* The box shows the interquartile range (IQR): 25th to 75th percentile.
* The line inside the box is the median.
* Whiskers extend to the most extreme observations that lie within
  1.5 × IQR of the box edges (the Matplotlib default).
* Points beyond the whiskers are potential outliers.
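
The pieces of a boxplot can also be computed by hand. The sketch below uses
hypothetical highway-MPG values and assumes the common 1.5 × IQR rule for
the whiskers (Matplotlib’s default ``whis=1.5``):

.. code-block:: python

   import pandas as pd

   # Hypothetical highway-MPG values, invented for illustration
   hwy = pd.Series([20, 22, 24, 25, 26, 27, 28, 30, 44])

   q1, med, q3 = hwy.quantile([0.25, 0.5, 0.75])
   iqr = q3 - q1

   # Fences sit 1.5 * IQR beyond the edges of the box
   lower_fence = q1 - 1.5 * iqr
   upper_fence = q3 + 1.5 * iqr

   # Whiskers end at the most extreme observations inside the fences
   inside = hwy[(hwy >= lower_fence) & (hwy <= upper_fence)]
   whisker_low, whisker_high = inside.min(), inside.max()
   outliers = hwy[(hwy < lower_fence) | (hwy > upper_fence)]

   print(q1, med, q3)                # 24.0 26.0 28.0
   print(whisker_low, whisker_high)  # 20 30
   print(outliers.tolist())          # [44]

Here the box runs from 24 to 28, the fences are at 18 and 34, and the single
value of 44 would be drawn as a separate point.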

In Chapter 4 of the R notes, there is also emphasis on the formula syntax
``y ~ x``. The conceptual equivalent here is:

* “Take numeric column ``hwy``”
* “Group by drivetrain ``drv``”
* “Draw separate boxplots for each group”
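
The same "take a column, group by another" idea gives the numeric
summaries behind each box. A minimal sketch, using a hypothetical
miniature stand-in for ``mpg`` (the rows are invented for illustration):

.. code-block:: python

   import pandas as pd

   # Hypothetical miniature version of mpg, invented for illustration
   toy = pd.DataFrame({
       "drv": ["f", "f", "f", "4", "4", "r"],
       "hwy": [29, 31, 27, 19, 21, 25],
   })

   # "Take hwy, group by drv": one row of summaries per drivetrain
   by_drv = toy.groupby("drv")["hwy"].agg(["count", "median", "min", "max"])
   print(by_drv)

On the real data, ``mpg.groupby("drv")["hwy"].describe()`` would give a
fuller table along the same lines.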


4.2.4 Scatterplots
~~~~~~~~~~~~~~~~~~

Scatterplots show the relationship between **two numeric variables**.

The R chapter uses

.. code-block:: r

   plot(hwy ~ displ, data = mpg)

We can mirror this with Matplotlib’s ``ax.scatter``, passing the pandas
columns directly:

.. code-block:: python

   fig, ax = plt.subplots()

   ax.scatter(
       mpg["displ"],
       mpg["hwy"],
       s=30,
       color="dodgerblue",
   )

   ax.set_xlabel("Engine displacement (liters)")
   ax.set_ylabel("Miles per gallon (highway)")
   ax.set_title("MPG (highway) vs engine displacement")

   plt.tight_layout()

Typical interpretation for the ``mpg`` data:

* As engine displacement increases, highway MPG tends to decrease.
* The scatterplot shows not only the trend but also variability and
  potential clusters (e.g., different vehicle types).

A tiny bit of code to add a fitted line (optional, for later chapters):

.. code-block:: python

   import numpy as np

   X = mpg["displ"].to_numpy()
   y = mpg["hwy"].to_numpy()

   # simple least-squares line via NumPy polyfit
   m, b = np.polyfit(X, y, deg=1)

   x_grid = np.linspace(X.min(), X.max(), 100)
   y_hat = m * x_grid + b

   fig, ax = plt.subplots()
   ax.scatter(X, y, s=30, color="dodgerblue", alpha=0.7)
   ax.plot(x_grid, y_hat, color="darkorange", linewidth=2)

   ax.set_xlabel("Engine displacement (liters)")
   ax.set_ylabel("Miles per gallon (highway)")
   ax.set_title("MPG (highway) vs engine displacement with fitted line")

   plt.tight_layout()

You do *not* need to understand regression yet; here the line is just a visual
summary of the overall trend. Later chapters will unpack the model behind it.


4.3 What you should take away
-----------------------------

By the end of this chapter (R + Python versions), you should be comfortable with:

* Computing basic **summary statistics** for numeric data:

  - ``mean``, ``median``, ``variance``, ``sd``, ``IQR``, ``min``, ``max``, ``range``.

* Computing **frequency tables** and **proportions** for categorical variables
  using ``value_counts`` (and ``normalize=True`` for proportions).

* Matching each summary to an appropriate **plot type**:

  - histogram for one numeric variable,
  - bar chart for one categorical variable,
  - boxplot for numeric vs categorical,
  - scatterplot for two numeric variables.

* Translating R’s functions and syntax to Python/pandas/NumPy:

  - ``mean`` ↔ ``Series.mean()``,
  - ``sd`` ↔ ``Series.std(ddof=1)``,
  - ``IQR`` ↔ quantiles / ``Series.quantile``,
  - ``table`` ↔ ``value_counts``,
  - ``hist`` / ``barplot`` / ``boxplot`` / ``plot`` ↔ Matplotlib / pandas plotting.

Most importantly:

*You can now look at a variable, decide whether it is numeric or categorical,
and quickly choose a summary and a plot that make sense.*

These skills will be used constantly in later PyStatsV1 chapters. Before
we fit any models, we will always:

1. **Summarize the data numerically** (center, spread, and counts), and
2. **Visualize** the data with one or more of the plots from this chapter.
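
The first of these two steps can be sketched as a small helper that picks a
summary based on the column’s type. ``quick_summary`` is a hypothetical
function written for this illustration (it is not part of pandas or
PyStatsV1), and the toy DataFrame is invented:

.. code-block:: python

   import pandas as pd
   from pandas.api.types import is_numeric_dtype


   def quick_summary(series: pd.Series) -> pd.Series:
       """Return center/spread for numeric data, proportions otherwise.

       Hypothetical helper, not part of pandas or PyStatsV1.
       """
       if is_numeric_dtype(series):
           return series.agg(["mean", "median", "std", "min", "max"])
       return series.value_counts(normalize=True)


   # Hypothetical miniature version of mpg, invented for illustration
   toy = pd.DataFrame({
       "drv": ["f", "f", "4", "r"],
       "hwy": [29, 31, 19, 25],
   })

   print(quick_summary(toy["hwy"]))  # numeric: center and spread
   print(quick_summary(toy["drv"]))  # categorical: proportions

The choice of plot follows the same branch: a histogram or boxplot for the
numeric case, a bar chart for the categorical one.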