Applied Statistics with Python – Chapter 4
Summarizing data
This chapter parallels the “Summarizing Data” chapter from the R notes. The statistical ideas are the same:
For numeric variables, we summarize the distribution using measures of center and spread.
For categorical variables, we summarize using counts and proportions.
We then use plots to visualize those summaries.
In the R version you see functions like mean(), median(), sd(),
IQR(), hist(), boxplot(), and plot() for scatterplots. Here
we’ll use Python, NumPy, pandas, and Matplotlib to achieve the same goals.
Throughout, imagine we have a DataFrame mpg that mirrors the R
ggplot2::mpg dataset, with columns like:
cty – city miles per gallon
hwy – highway miles per gallon
drv – drivetrain ("f", "r", "4")
displ – engine displacement in liters
You could obtain this DataFrame in several ways, for example:
import pandas as pd
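import statsmodels.api as sm # only needed for Option 1
# Option 1: fetch ggplot2's mpg from the Rdatasets collection (needs internet access)
mpg = sm.datasets.get_rdataset("mpg", "ggplot2").data
# Option 2: read from a CSV bundled with your project
# mpg = pd.read_csv("data/mpg.csv")
# Caution: seaborn's sns.load_dataset("mpg") is a different dataset (the UCI Auto MPG data).
# It has columns such as mpg, horsepower, and displacement, but no cty, hwy, drv, or displ.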
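Whichever way you load the data, it is worth confirming that the columns this chapter relies on are actually present. The sketch below does that check on a tiny stand-in DataFrame; toy_mpg and its values are made up for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real mpg data; the values are invented.
toy_mpg = pd.DataFrame(
    {
        "cty": [18, 21, 20, 16],
        "hwy": [29, 29, 31, 25],
        "drv": ["f", "f", "4", "r"],
        "displ": [1.8, 1.8, 2.0, 5.7],
    }
)

# Columns used throughout this chapter.
expected = {"cty", "hwy", "drv", "displ"}

# Any name listed here means the DataFrame is not the dataset we expect.
missing = expected - set(toy_mpg.columns)
print("missing columns:", sorted(missing))
```

Running the same check on your own mpg object (replace toy_mpg with mpg) should report no missing columns before you continue.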
4.1 Summary statistics
We start with summary statistics: numbers that describe center, spread, and distribution shape for a variable.
4.1.1 Numeric variables: center and spread
In R you saw a table of summaries like:
mean
median
variance
standard deviation
interquartile range (IQR)
minimum, maximum, range
We can compute the same quantities with pandas:
import numpy as np
import pandas as pd
# City miles per gallon
cty = mpg["cty"]
# Center
cty_mean = cty.mean() # average
cty_median = cty.median() # median
# Spread
cty_var = cty.var(ddof=1) # sample variance (n-1)
cty_sd = cty.std(ddof=1) # sample standard deviation (n-1)
cty_iqr = cty.quantile(0.75) - cty.quantile(0.25)
cty_min = cty.min()
cty_max = cty.max()
cty_range = cty_max - cty_min
summary = {
    "mean": cty_mean,
    "median": cty_median,
    "variance": cty_var,
    "sd": cty_sd,
    "IQR": cty_iqr,
    "min": cty_min,
    "max": cty_max,
    "range": cty_range,
}
summary
A quick shortcut for many of these is describe:
mpg["cty"].describe()
which returns count, mean, standard deviation, quartiles, min, and max.
Conceptual recap
Mean: arithmetic average; sensitive to outliers.
Median: middle value; robust to outliers.
Variance/SD: variance is the average squared deviation from the mean (dividing by n-1 for a sample); SD is its square root, so it is in the same units as the data.
IQR: distance between the 25th and 75th percentiles (middle 50% of data).
Min/Max/Range: show the extremes of the distribution.
Python vs R differences:
R’s var() and sd() use n-1 by default (the sample versions; the n-1 variance is unbiased).
pandas also defaults to ddof=1 for var() and std(), on both Series and DataFrame.
NumPy’s np.var and np.std default to ddof=0 (divide by n). Use ddof=1 to match the R textbook.
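To see the divisor difference concretely, here is a minimal sketch on a small made-up list of numbers (the values are illustrative only). The squared deviations from the mean sum to 32, so dividing by n = 8 gives 4.0 and dividing by n-1 = 7 gives about 4.571.

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data, mean = 5
s = pd.Series(values)
arr = np.array(values)

pandas_var = s.var()                # pandas default: ddof=1 (divide by n-1)
numpy_var_default = arr.var()       # NumPy default: ddof=0 (divide by n)
numpy_var_sample = arr.var(ddof=1)  # NumPy asked to match pandas and R

print(pandas_var, numpy_var_default, numpy_var_sample)
```

The gap shrinks as n grows, but for small samples it is large enough to make results disagree with the R notes if ddof is left at the NumPy default.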
4.1.2 Categorical variables: counts and proportions
For categorical variables, we care about how often each level appears.
In R, you saw table(mpg$drv) and relative frequencies with
table(mpg$drv) / nrow(mpg).
In pandas:
drv_counts = mpg["drv"].value_counts()
drv_props = mpg["drv"].value_counts(normalize=True)
drv_counts
drv_props
This gives frequency and proportion for each drivetrain category.
Key ideas:
value_counts() is the pandas analogue of table() in R.
normalize=True turns counts into proportions.
These summaries are the numerical counterpart of a bar chart.
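As a quick check on what normalize=True does, the sketch below computes proportions two ways on a made-up drivetrain column (drv_toy is illustrative, not the real data): once by dividing counts by the number of rows, as in the R expression table(mpg$drv) / nrow(mpg), and once with normalize=True.

```python
import pandas as pd

drv_toy = pd.Series(["f", "f", "4", "r", "f", "4"], name="drv")  # made-up values

counts = drv_toy.value_counts()                       # like table(drv)
props_manual = counts / len(drv_toy)                  # like table(drv) / nrow(...)
props_builtin = drv_toy.value_counts(normalize=True)  # pandas shortcut

print(counts)
print(props_builtin)
```

Both routes give the same proportions, and the proportions sum to 1.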
4.2 Plotting
Numeric tables are useful, but most of the time we learn more from good visualization.
We’ll mirror the same four plot types as the R chapter:
Histograms
Bar charts
Boxplots
Scatterplots
We will use Matplotlib and pandas plotting helpers. These examples assume:
import matplotlib.pyplot as plt
4.2.1 Histograms
When you have one numeric variable, a histogram is the workhorse plot.
In R: hist(mpg$cty) and a more polished version with axis labels,
title, breaks, colors.
In Python/pandas:
fig, ax = plt.subplots()
mpg["cty"].hist(
    bins=12, # similar idea to breaks =
    color="dodgerblue",
    edgecolor="darkorange",
    ax=ax,
)
ax.set_xlabel("Miles per gallon (city)")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of MPG (city)")
plt.tight_layout()
Notes:
bins is analogous to R’s breaks argument.
Always label axes and add a clear title.
hist gives the familiar histogram shape: by default each bar’s height is the count in that bin; pass density=True to rescale the bars so their areas sum to 1.
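The numbers behind a histogram can be inspected without drawing anything. This sketch uses np.histogram on a small made-up array (not the real cty column) to show the bin edges, the counts per bin, and how the density version rescales the bars so that height times width sums to 1.

```python
import numpy as np

x = np.array([9, 11, 14, 15, 15, 17, 18, 18, 21, 24, 26, 35])  # made-up values

# Four equal-width bins between min(x) and max(x).
counts, edges = np.histogram(x, bins=4)

# Same bins, but heights rescaled so the total bar area equals 1.
density, _ = np.histogram(x, bins=4, density=True)
widths = np.diff(edges)
total_area = (density * widths).sum()

print(edges)
print(counts)
print(total_area)
```

With equal-width bins the two versions have the same shape; only the scale of the vertical axis changes.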
4.2.2 Bar charts
Bar charts summarize categorical variables (or numeric variables with a small number of distinct values).
R example: barplot(table(mpg$drv)).
Python:
drv_counts = mpg["drv"].value_counts().sort_index()
fig, ax = plt.subplots()
drv_counts.plot(
    kind="bar",
    color="dodgerblue",
    edgecolor="darkorange",
    ax=ax,
)
ax.set_xlabel("Drivetrain (f = FWD, r = RWD, 4 = 4WD)")
ax.set_ylabel("Frequency")
ax.set_title("Drivetrains")
plt.tight_layout()
If you want proportions instead of counts, apply value_counts(normalize=True):
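drv_props = mpg["drv"].value_counts(normalize=True).sort_index()
fig, ax = plt.subplots() # new figure, so the bars are not drawn over the counts chart
drv_props.plot(kind="bar", ax=ax) # same idea; y-axis now in [0, 1]
ax.set_ylabel("Proportion")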
4.2.3 Boxplots
Boxplots are ideal when you want to summarize the distribution of a numeric variable, especially across groups defined by a categorical variable.
Single boxplot
R: boxplot(mpg$hwy)
Python/pandas:
fig, ax = plt.subplots()
mpg["hwy"].plot(kind="box", ax=ax) # vertical orientation is the default
ax.set_ylabel("Miles per gallon (highway)")
ax.set_title("Highway MPG – overall distribution")
plt.tight_layout()
Grouped boxplots
R syntax: boxplot(hwy ~ drv, data = mpg) – highway MPG by drivetrain.
In pandas, DataFrame.boxplot does the grouping for us through its by argument:
fig, ax = plt.subplots()
mpg.boxplot(
    column="hwy",
    by="drv",
    ax=ax,
    grid=False,
)
ax.set_xlabel("Drivetrain (f = FWD, r = RWD, 4 = 4WD)")
ax.set_ylabel("Miles per gallon (highway)")
ax.set_title("MPG (highway) vs drivetrain")
# pandas adds its own super-title; remove if you like:
fig.suptitle("")
plt.tight_layout()
Interpretation reminders:
The box shows the interquartile range (IQR): 25th to 75th percentile.
The line inside the box is the median.
Whiskers extend to the most extreme observations that lie within 1.5 × IQR of the box edges (the Matplotlib default).
Points beyond the whiskers are potential outliers.
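The pieces of a boxplot can be computed by hand. The sketch below assumes the 1.5 × IQR rule that Matplotlib applies by default (its whis parameter) and uses a small made-up Series rather than the real hwy column.

```python
import pandas as pd

hwy_toy = pd.Series([12, 20, 22, 24, 25, 26, 27, 29, 31, 44])  # made-up values

q1 = hwy_toy.quantile(0.25)  # bottom of the box
q3 = hwy_toy.quantile(0.75)  # top of the box
iqr = q3 - q1

# Fences: observations outside these limits are drawn as individual points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences.
inside = hwy_toy[(hwy_toy >= lower_fence) & (hwy_toy <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = hwy_toy[(hwy_toy < lower_fence) | (hwy_toy > upper_fence)]
print(q1, q3, whisker_low, whisker_high, outliers.tolist())
```

Here the box runs from 22.5 to 28.5, the whiskers reach 20 and 31, and the values 12 and 44 would be plotted as separate points.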
In Chapter 4 of the R notes, there is also emphasis on the formula syntax
y ~ x. The conceptual equivalent here is:
“Take numeric column hwy”
“Group by drivetrain drv”
“Draw separate boxplots for each group”
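The same "numeric column, split by group" idea also works for numeric summaries. As a sketch, groupby gives per-group statistics that match what each box in a grouped boxplot displays; the toy DataFrame here is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "hwy": [29, 31, 27, 26, 24, 19, 17, 20],  # made-up values
        "drv": ["f", "f", "f", "r", "r", "4", "4", "4"],
    }
)

# Take hwy, group by drv, then summarize each group separately.
hwy_by_drv = toy.groupby("drv")["hwy"].agg(["count", "median", "min", "max"])
print(hwy_by_drv)
```

Each row of the result corresponds to one box: the median is the line inside it, and count tells you how many observations stand behind it.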
4.2.4 Scatterplots
Scatterplots show the relationship between two numeric variables.
The R chapter uses
plot(hwy ~ displ, data = mpg)
We can mirror this with Matplotlib, passing the pandas columns directly:
fig, ax = plt.subplots()
ax.scatter(
    mpg["displ"],
    mpg["hwy"],
    s=30,
    color="dodgerblue",
)
ax.set_xlabel("Engine displacement (liters)")
ax.set_ylabel("Miles per gallon (highway)")
ax.set_title("MPG (highway) vs engine displacement")
plt.tight_layout()
Typical interpretation for the mpg data:
As engine displacement increases, highway MPG tends to decrease.
The scatterplot shows not only the trend but also variability and potential clusters (e.g., different vehicle types).
A tiny bit of code to add a fitted line (optional, for later chapters):
import numpy as np
X = mpg["displ"].to_numpy()
y = mpg["hwy"].to_numpy()
# simple least-squares line via NumPy polyfit
m, b = np.polyfit(X, y, deg=1)
x_grid = np.linspace(X.min(), X.max(), 100)
y_hat = m * x_grid + b
fig, ax = plt.subplots()
ax.scatter(X, y, s=30, color="dodgerblue", alpha=0.7)
ax.plot(x_grid, y_hat, color="darkorange", linewidth=2)
ax.set_xlabel("Engine displacement (liters)")
ax.set_ylabel("Miles per gallon (highway)")
ax.set_title("MPG (highway) vs engine displacement with fitted line")
plt.tight_layout()
You do not need to understand regression yet; here the line is just a visual summary of the overall trend. Later chapters will unpack the model behind it.
4.3 What you should take away
By the end of this chapter (R + Python versions), you should be comfortable with:
Computing basic summary statistics for numeric data:
mean, median, variance, sd, IQR, min, max, range.
Computing frequency tables and proportions for categorical variables using value_counts (and normalize=True for proportions).
Matching each summary to an appropriate plot type:
histogram for one numeric variable,
bar chart for one categorical variable,
boxplot for numeric vs categorical,
scatterplot for two numeric variables.
Translating R’s functions and syntax to Python/pandas/NumPy:
mean ↔ Series.mean()
sd ↔ Series.std(ddof=1)
IQR ↔ quantiles / Series.quantile
table ↔ value_counts
hist / barplot / boxplot / plot ↔ Matplotlib / pandas plotting
Most importantly:
You can now look at a variable, decide whether it is numeric or categorical, and quickly choose a summary and a plot that make sense.
These skills will be used constantly in later PyStatsV1 chapters. Before we fit any models, we will always:
Summarize the data numerically (center, spread, and counts), and
Visualize the data with one or more of the plots from this chapter.
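The "numeric or categorical?" decision can be sketched in a few lines. The helper below, quick_summary, is a hypothetical name introduced here for illustration; it is not part of pandas. It returns center and spread for numeric columns and proportions for everything else, using a made-up toy DataFrame.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def quick_summary(series: pd.Series) -> pd.Series:
    """Hypothetical helper: pick a summary based on the column's dtype."""
    if is_numeric_dtype(series):
        # Numeric: center and spread (std uses ddof=1, matching R's sd()).
        return series.agg(["mean", "median", "std", "min", "max"])
    # Otherwise treat the column as categorical: proportion per level.
    return series.value_counts(normalize=True)


toy = pd.DataFrame(
    {"cty": [18, 21, 20, 16], "drv": ["f", "f", "4", "r"]}  # made-up values
)

num_summary = quick_summary(toy["cty"])
cat_summary = quick_summary(toy["drv"])
print(num_summary)
print(cat_summary)
```

A dtype check is only a first guess: a numeric column with a handful of distinct codes may still be better treated as categorical, so the final call is yours.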