Applied Statistics with Python – Chapter 3

Data and Programming (Python-first view)

This chapter is the Python companion to the “Data and Programming” chapter from the R notes. The statistical ideas are the same:

You need a small set of data types (numbers, text, booleans).
You store them in data structures (vectors/arrays, matrices, tables).
You use programming tools (control flow + functions) to glue analyses together.

The R book uses R’s vocabulary (vectors, matrices, lists, data frames). Here we’ll use the Python stack that maps to the same concepts:

Core Python (built-in types and control flow)
NumPy (for arrays, vectorization, and linear algebra)
pandas (for tabular data like R data frames / tibbles)

The goal is not to turn you into a software engineer. The goal is:

Think “what is the data?” and “what operation am I doing?” and then choose the Python object that matches that mental model.

3.1 Data Types

R has numeric, integer, complex, logical, character. Python has very similar building blocks:

int – integers: 1, 42, -3
float – real numbers (double precision): 1.0, 3.14, -0.001
complex – complex numbers: 4+2j
bool – logical values: True or False
str – text: "a", "Statistics", "1 plus 2"

A few quick parallels:

R’s TRUE / FALSE ↔ Python’s True / False
R’s NA ↔ Python’s None (missing in general) or numpy.nan (missing numeric)
R’s automatic coercion (e.g., mixing numbers and strings in a vector) ↔ in Python, lists can hold mixed types, but numerical containers like NumPy arrays and pandas columns are usually homogeneous.
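
A quick way to get a feel for these types is to inspect a few values directly. The snippet below is a minimal sketch; the variable names are made up for illustration.

```python
import numpy as np

# Built-in scalar types
print(type(42), type(3.14), type(4 + 2j), type(True), type("Statistics"))

# bool is a subclass of int, so True behaves like 1 in arithmetic
total = True + True                        # 2

# np.nan is a float; it compares unequal to everything, including itself
missing = np.nan
is_missing = np.isnan(missing)             # True
nan_equals_itself = (missing == missing)   # False

# A list keeps mixed types as they are ...
mixed_list = [1, "a", True]

# ... but a NumPy array coerces its elements to one common dtype
mixed_arr = np.array([1, 2.5, True])       # everything becomes float64
```

Because `np.nan == np.nan` is False, test for missing values with `np.isnan` (or `pd.isna` in pandas) rather than with `==`.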

3.2 Data Structures: R vs Python mental map

R distinguishes between “homogeneous” (everything the same type) and “heterogeneous” (mixed types). Same idea in Python, just with different names.

Dimension	Homogeneous (R)	Homogeneous (Python)
1D	vector	NumPy `ndarray` (1D), pandas `Series`
2D	matrix	NumPy 2D `ndarray` (a pandas `DataFrame` only when every column shares one dtype)
3D+	array	higher-dim NumPy `ndarray`

Dimension	Heterogeneous (R)	Heterogeneous (Python)
1D	list	Python `list`, `dict`, `dataclass`
2D	data frame	pandas `DataFrame`

We’ll mostly use:

Python lists for small, generic sequences.
NumPy arrays when we mean “numeric vector/matrix.”
pandas DataFrames when we mean “rectangular data with named columns.”

3.2.1 One-dimensional containers: lists, ranges, and NumPy arrays

Python list: flexible sequence

A list is Python’s everyday sequence type. Unlike an R atomic vector, it can hold mixed types and does not support vectorized arithmetic:

x = [1, 3, 5, 7, 8, 9]
x[0]      # 1 (0-based indexing in Python)
x[2]      # 5
x[-1]     # 9 (last element)

Remember: Python indexes from 0, not 1. That’s one of the biggest mental differences from R.

Creating sequences

R uses c(), : and seq(). Python equivalents:

# Explicit list
x = [1, 3, 5, 7, 8, 9]

# A sequence of integers (like 1:100 in R)
y = list(range(1, 101))  # 1, 2, ..., 100

# A sequence with a step (like seq(1.5, 4.2, by = 0.1))
import numpy as np

seq = np.arange(1.5, 4.3, 0.1)  # stop is exclusive; with a float step, rounding can add or drop the last element
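
Because np.arange with a non-integer step is sensitive to floating-point rounding, np.linspace is often a safer way to mirror seq() when both endpoints matter. This is a minimal sketch; the names `start`, `stop`, `step`, and `seq_safe` are chosen here for illustration.

```python
import numpy as np

start, stop, step = 1.5, 4.2, 0.1

# Number of points implied by the step; round() guards against float error
n_points = int(round((stop - start) / step)) + 1   # 28

# linspace takes the number of points and includes both endpoints
seq_safe = np.linspace(start, stop, n_points)
```

You say how many points you want instead of how big the step is, so the last value is always the requested endpoint.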

Repetition

R has rep(). In Python:

["A"] * 10             # ['A', 'A', ..., 'A']
x * 3                  # repeats the list x three times

# with NumPy for numeric work:
x_arr = np.array(x)
rep_arr = np.tile(x_arr, 3)  # repeat the vector x three times
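
R’s rep() has two modes, `times` and `each`. In NumPy these correspond to two different functions. A small sketch with a made-up three-element vector:

```python
import numpy as np

x_small = np.array([1, 3, 5])

# Repeat the whole vector, like rep(x, times = 2)
whole = np.tile(x_small, 2)      # array([1, 3, 5, 1, 3, 5])

# Repeat each element in place, like rep(x, each = 2)
each = np.repeat(x_small, 2)     # array([1, 1, 3, 3, 5, 5])
```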

Vector length

R: length(x)

Python:

len(x)         # length of a list
len(x_arr)     # length of a NumPy array

3.2.1.1 Subsetting and slicing

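R uses x[1], x[1:3], negative indices to drop elements, and logical vectors. Python has similar ideas with different syntax, and one trap: a negative index in Python counts from the end of the sequence; it does not drop elements.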

Indexing by position

x = [1, 3, 5, 7, 8, 9]

x[0]       # 1  (first element)
x[2]       # 5  (third element)
x[1:4]     # [3, 5, 7]  (slice: start inclusive, stop exclusive)
x[:3]      # [1, 3, 5]
x[3:]      # [7, 8, 9]
x[-1]      # 9 (last)
x[-2:]     # [8, 9] (last two)

NumPy arrays support exactly the same slice notation:

x_arr = np.array(x)
x_arr[0]      # 1
x_arr[1:4]    # array([3, 5, 7])
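
Since negative indices mean “count from the end” in Python, dropping elements the way R’s x[-1] does needs a different tool. One possible translation, sketched with slicing, np.delete, and a list comprehension:

```python
import numpy as np

x = [1, 3, 5, 7, 8, 9]
x_arr = np.array(x)

# R: x[-1] drops the first element
without_first = x_arr[1:]                 # array([3, 5, 7, 8, 9])

# R: x[-c(1, 3)] drops the first and third elements
without_some = np.delete(x_arr, [0, 2])   # array([3, 7, 8, 9])

# Plain lists: filter by position with a list comprehension
drop = {0, 2}
list_without = [v for i, v in enumerate(x) if i not in drop]   # [3, 7, 8, 9]
```

np.delete returns a new array and leaves `x_arr` unchanged.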

Boolean indexing (logical subsetting)

This is where NumPy and pandas line up very nicely with R.

R:

x[x > 3]
x[x != 3]

NumPy:

mask = x_arr > 3          # array([False, False, True, True, True, True])
x_arr[mask]               # array([5, 7, 8, 9])

x_arr[x_arr != 3]         # array([1, 5, 7, 8, 9])

3.2.2 Vectorization in Python

The R chapter emphasises that R is “vectorized”: operations apply to whole vectors at once. Same idea in the scientific Python stack:

Pure Python lists: arithmetic is not vectorized.
NumPy arrays and pandas objects: arithmetic is vectorized.

Compare:

x_list = [1, 2, 3, 4, 5]

# NOT vectorized – this concatenates lists
x_list + [1]           # [1, 2, 3, 4, 5, 1]

# Vectorized: use NumPy arrays
x = np.array([1, 2, 3, 4, 5])

x + 1                  # array([2, 3, 4, 5, 6])
2 * x                  # array([ 2,  4,  6,  8, 10])
2 ** x                 # powers, elementwise
np.sqrt(x)
np.log(x)

Same mental model as in R:

“If I apply a numeric function to a whole vector, I get a vector back.”

Length recycling vs broadcasting

In R, x + y silently recycles the shorter vector and warns only when the longer length is not a multiple of the shorter one.

In NumPy:

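Shapes must be compatible for broadcasting: comparing dimensions from the last axis backwards, two dimensions are compatible when they are equal or one of them is 1.
Shape mismatch gives an error instead of a warning (which is usually safer).

Example:

x = np.array([1, 3, 5, 7, 8, 9])
y = np.arange(1, 61)

# x + y    # ValueError: shapes (6,) and (60,) do not broadcast, even though 6 divides 60

# NumPy never recycles. To reproduce R's behaviour, repeat x explicitly:
np.tile(x, 10) + y     # length-60 result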
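
To see what broadcasting does allow, here is a small sketch with a column and a row. The arrays are made up for illustration.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

# (3, 1) and (3,) broadcast to (3, 3): each column value is paired with each row value
grid = col + row
# array([[ 1,  2,  3],
#        [11, 12, 13],
#        [21, 22, 23]])

# Lengths 3 and 2 are neither equal nor 1, so NumPy raises instead of recycling
try:
    np.array([1, 2, 3]) + np.array([1, 2])
    raised = False
except ValueError:
    raised = True
```

Dimensions of size 1 get stretched and nothing else does, which is why a scalar or a single row or column combines cleanly with a larger array.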

3.2.3 Logical operators

R operators: <, >, <=, >=, ==, !=, !, &, |.

Python has very similar operators:

x = np.array([1, 3, 5, 7, 8, 9])

x > 3        # array([False, False,  True,  True,  True,  True])
x < 3        # array([ True, False, False, False, False, False])
x == 3       # array([False,  True, False, False, False, False])
x != 3       # array([ True, False,  True,  True,  True,  True])

A few important notes:

For NumPy arrays, use & and | for elementwise AND/OR and ~ for NOT. Wrap each comparison in parentheses, because & and | bind more tightly than the comparison operators:

(x > 3) & (x < 8)    # both conditions
(x == 3) | (x == 9)  # either condition

For pure Python booleans (not arrays), use and / or / not:
```
(3 < 4) and (42 > 13)
```
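
The following sketch shows elementwise NOT with `~`, and what happens if you use `and` on arrays by mistake. The names `not_big`, `between`, and `ambiguous` are illustrative.

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9])

not_big = ~(x > 3)                 # elementwise NOT, like !(x > 3) in R
between = (x > 3) & ~(x == 8)      # combine NOT with AND
kept = x[between]                  # array([5, 7, 9])

# `and` needs a single True/False, which an array with several values cannot supply
try:
    (x > 3) and (x < 8)
    ambiguous = False
except ValueError:
    ambiguous = True
```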

Counting and coercion

R shows that logical values act like 0/1 in numeric calculations (sum(x > 3)). Same in Python/NumPy:

mask = x > 3
mask           # array([False, False,  True,  True,  True,  True])

mask.sum()     # 4 (True acts like 1, False like 0)
np.sum(mask)   # also 4

mask.astype(int)   # array([0, 0, 1, 1, 1, 1])
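
The same 0/1 behaviour gives proportions and quick checks. A short sketch using the same `x` as above:

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9])
mask = x > 3

n_true = np.count_nonzero(mask)   # 4, same as mask.sum()
share = mask.mean()               # 4/6, the proportion of values above 3
any_big = mask.any()              # True, like any(x > 3) in R
all_big = mask.all()              # False, like all(x > 3) in R
```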

3.2.4 Matrices and linear algebra (NumPy)

R uses matrix(), %*%, t(), solve(), diag() and friends. In Python, these live in NumPy:

Creating matrices

x = np.arange(1, 10)          # 1..9
X = x.reshape(3, 3, order="F")  # like R’s column-major matrix()
X

# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])

Y = x.reshape(3, 3, order="C")  # row-wise (byrow = TRUE in R)
Y

Z = np.zeros((2, 4))           # 2x4 matrix of zeros

Subsetting

X[0, 1]       # first row, second column: 4
X[0, :]       # first row: array([1, 4, 7])
X[:, 1]       # second column: array([4, 5, 6])
X[1, [0, 2]]  # second row, first and third columns: array([2, 8])
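
Two details often trip up R users: selecting a single row returns a 1D array (like R’s default drop = TRUE), and boolean masks work on rows as well. A minimal sketch, rebuilding the same `X` as above:

```python
import numpy as np

X = np.arange(1, 10).reshape(3, 3, order="F")

row_1d = X[0, :]        # shape (3,), like X[1, ] in R
row_2d = X[0:1, :]      # shape (1, 3), like X[1, , drop = FALSE]

# Boolean mask on rows: keep rows whose first column exceeds 1
big_rows = X[X[:, 0] > 1, :]
# array([[2, 5, 8],
#        [3, 6, 9]])
```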

Matrix operations

Elementwise operations:

X + Y
X - Y
X * Y    # elementwise product
X / Y    # elementwise division

Matrix multiplication and linear algebra:

# matrix multiplication (like R's %*%)
X @ Y
np.matmul(X, Y)

# transpose
X_T = X.T

# identity and diagonal matrices
np.eye(3)          # 3x3 identity
np.diag([1, 2, 3]) # diagonal with 1,2,3 on the diagonal

# inverse (if invertible)
Z = np.array([[9, 2, -3],
              [2, 4, -2],
              [-3, -2, 16]])

Z_inv = np.linalg.inv(Z)

Z_inv @ Z
# approximately the identity matrix
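
R’s solve() is also used as solve(A, b) to solve a linear system without forming the inverse. The NumPy counterpart is np.linalg.solve. In this sketch the right-hand side `b` is a made-up vector:

```python
import numpy as np

Z = np.array([[9, 2, -3],
              [2, 4, -2],
              [-3, -2, 16]])
b = np.array([1, 2, 3])               # hypothetical right-hand side

# Solve Z @ beta = b directly, like solve(Z, b) in R
beta = np.linalg.solve(Z, b)

# Same answer via the explicit inverse, usually slower and less accurate
via_inverse = np.linalg.inv(Z) @ b

residual_ok = np.allclose(Z @ beta, b)
```

When all you need is the solution of a system, prefer np.linalg.solve to computing the inverse.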

Floating point equality

R uses all.equal to compare floating-point matrices. NumPy equivalent:

np.allclose(Z_inv @ Z, np.eye(3))   # True
(Z_inv @ Z == np.eye(3)).all()      # often False due to tiny round-off

Dot product and outer product

R uses a_vec %*% b_vec and a_vec %o% b_vec; also crossprod.

Python:

a_vec = np.array([1, 2, 3])
b_vec = np.array([2, 2, 2])

# Inner (dot) product
a_vec @ b_vec           # 12
np.dot(a_vec, b_vec)    # 12

# Outer product
np.outer(a_vec, b_vec)

# “crossprod(X, Y)” (X^T Y) in NumPy:
C_mat = np.array([[1, 2, 3],
                  [4, 5, 6]])
D_mat = np.array([[2, 2, 2],
                  [2, 2, 2]])

C_mat.T @ D_mat    # like crossprod(C_mat, D_mat)
np.allclose(C_mat.T @ D_mat, C_mat.T.dot(D_mat))

3.2.5 Heterogeneous containers: lists and dicts

The R chapter introduces lists as “one-dimensional containers that can hold anything”: vectors, matrices, functions, etc.

In Python we have:

list – ordered sequence (can be mixed types)
dict – mapping from names to values (key–value store)

An R list like:

ex_list = list(
  a = c(1, 2, 3, 4),
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) { print("Hello World!") },
  e = diag(5)
)

could be represented roughly as:

def say_hello(arg=42):
    print("Hello World!")

ex_dict = {
    "a": np.array([1, 2, 3, 4]),
    "b": True,
    "c": "Hello!",
    "d": say_hello,
    "e": np.eye(5)          # diag(5) in R is the 5x5 identity matrix
}

Accessing elements:

ex_dict["e"]      # matrix
ex_dict["a"]      # array
ex_dict["d"](arg=1)
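
When the set of fields is fixed and known in advance, a dataclass (listed in the table in 3.2) gives you attribute access that feels a little like R’s `$`. The class below is purely hypothetical and is not taken from the R notes:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FitSummary:            # hypothetical container for illustration
    name: str
    coef: np.ndarray
    converged: bool = True   # default value, like a default argument


fit = FitSummary(name="demo", coef=np.array([0.5, 1.5]))

fit.coef          # attribute access, similar in spirit to ex_list$a in R
fit.converged     # True (the default)
```

A dict stays the more flexible choice when keys are only known at run time.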

3.2.6 Tabular data: pandas DataFrames

R’s data frame / tibble ↔ Python’s pandas DataFrame.

Minimal example:

import pandas as pd

example_data = pd.DataFrame({
    "x": [1, 3, 5, 7, 9, 1, 3, 5, 7, 9],
    "y": ["Hello"] * 9 + ["Goodbye"],
    "z": [True, False] * 5
})

example_data
example_data.head()      # first rows
example_data.info()      # structure, types
example_data.shape       # (n_rows, n_cols)
example_data.columns     # column names

Reading from CSV (similar to read_csv in R):

cars = pd.read_csv("data/example-data.csv")

# glimpse the data
cars.head(10)
cars.info()

Subsetting rows and columns

Like R:

# single column as a Series
example_data["x"]

# multiple columns as a DataFrame
example_data[["x", "y"]]

# Boolean filter: keep rows where x > 5
mask = example_data["x"] > 5
example_data[mask]

# Equivalent to subset(mpg, subset = hwy > 35, select = c("manufacturer", "model", "year")):
mpg = cars  # assumes the CSV loaded above holds the mpg data (hwy, manufacturer, model, year)
mpg[mpg["hwy"] > 35][["manufacturer", "model", "year"]]

You can also use query for more R-like syntax:

mpg.query("hwy > 35")[["manufacturer", "model", "year"]]
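
Since the mpg file itself is not included here, the sketch below uses a tiny made-up stand-in (the values are illustrative only) to show that boolean masks, `.loc`, and `query` select the same rows:

```python
import pandas as pd

# Tiny hypothetical stand-in for the mpg data; values are invented
mpg_demo = pd.DataFrame({
    "manufacturer": ["honda", "toyota", "ford", "volkswagen"],
    "model": ["civic", "corolla", "f150", "jetta"],
    "year": [2008, 2008, 1999, 1999],
    "hwy": [36, 37, 17, 44],
})

cols = ["manufacturer", "model", "year"]

by_mask = mpg_demo[mpg_demo["hwy"] > 35][cols]          # filter rows, then pick columns
by_loc = mpg_demo.loc[mpg_demo["hwy"] > 35, cols]       # rows and columns in one step
by_query = mpg_demo.query("hwy > 35")[cols]             # condition written as a string
```

`.loc` is the closest match to R’s `df[rows, cols]`, and it is the form to use when you later assign into a subset.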

3.3 Programming Basics in Python

Now we connect data structures with basic programming tools: control flow and functions.

3.3.1 Control flow

If / elif / else

R:

if (x > y) {
  # ...
} else {
  # ...
}

Python:

x = 1
y = 3

if x > y:
    z = x * y
    print("x is larger than y")
else:
    z = x + 5 * y
    print("x is less than or equal to y")

There is also a short expression form (similar spirit to ifelse for scalars):

result = 1 if 4 > 3 else 0     # 1

Vectorized “if” with NumPy/pandas

R’s ifelse(condition, value_if_true, value_if_false) is used for vectors.

In Python we use np.where or pandas methods:

fib = np.array([1, 1, 2, 3, 5, 8, 13, 21])
np.where(fib > 6, "Foo", "Bar")
# array(['Bar', 'Bar', 'Bar', 'Bar', 'Bar', 'Foo', 'Foo', 'Foo'], dtype='<U3')

For pandas Series:

mpg["label"] = np.where(mpg["hwy"] > 35, "Efficient", "Regular")
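
np.where handles two outcomes. With more than two categories, nested ifelse calls in R get hard to read; in NumPy, np.select is one option. A sketch with made-up mileage values and labels:

```python
import numpy as np

hwy = np.array([36, 28, 17, 44])     # made-up highway mileage values

labels = np.select(
    [hwy > 35, hwy > 25],            # conditions are checked in order
    ["Efficient", "Regular"],        # label for the first condition that is True
    default="Thirsty",               # used when no condition matches
)
```

The first matching condition wins, so list the conditions from most to least specific.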

For loops vs vectorization

The R chapter shows that explicit loops are often replaced by vectorized code.

Same in Python:

# Loop version
x = [11, 12, 13, 14, 15]
for i in range(len(x)):
    x[i] = x[i] * 2

# Vectorized version with NumPy
x_arr = np.array([11, 12, 13, 14, 15])
x_arr = x_arr * 2

3.3.2 Defining functions

Basic structure

R:

standardize = function(x) {
  (x - mean(x)) / sd(x)
}

Python:

import numpy as np

def standardize(x: np.ndarray) -> np.ndarray:
    """
    Standardize a numeric vector/array:
    subtract the mean and divide by the sample standard deviation.
    """
    m = x.mean()
    s = x.std(ddof=1)   # ddof=1 for sample SD (like R's sd)
    return (x - m) / s

Test it:

sample = np.random.normal(loc=2, scale=5, size=10)
z = standardize(sample)

z.mean()      # 0, up to floating-point rounding
z.std(ddof=1) # 1, up to floating-point rounding

Default arguments

R:

power_of_num = function(num, power = 2) {
  num ^ power
}

Python:

def power_of_num(num, power=2):
    return num ** power

power_of_num(10)                # 100
power_of_num(num=10, power=2)   # 100
power_of_num(power=3, num=2)    # 8

Variance example (biased vs unbiased)

The R notes define two forms of variance: unbiased (divide by n−1) and biased (divide by n).

We can mirror this:

def sample_variance(x: np.ndarray, biased: bool = False) -> float:
    """
    Compute the variance of x.

    biased = False  -> divide by (n-1)  (unbiased, like R's var)
    biased = True   -> divide by n      (ML / population variance)
    """
    x = np.asarray(x)
    ddof = 0 if biased else 1   # NumPy divides by n - ddof
    return float(x.var(ddof=ddof))

sample = np.random.normal(size=10)
sample_variance(sample)            # unbiased (n-1)
sample_variance(sample, True)      # biased (n)
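
The two estimators use the same sum of squares and differ only in the divisor, so the biased version equals the unbiased one times (n − 1)/n. A small check of that identity, using a seeded generator (the seed value is arbitrary) so the run is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded generator; the seed is arbitrary
sample = rng.normal(size=10)
n = sample.size

unbiased = sample.var(ddof=1)        # divide by n - 1
biased = sample.var(ddof=0)          # divide by n

# The same quantities written out from the definitions
ss = np.sum((sample - sample.mean()) ** 2)
manual_unbiased = ss / (n - 1)
manual_biased = ss / n

# Identity linking the two: biased = unbiased * (n - 1) / n
identity_holds = np.isclose(biased, unbiased * (n - 1) / n)
```

NumPy’s `var` and `std` default to `ddof=0`, while R’s var() and sd() divide by n − 1. Set `ddof=1` whenever you want results that match R.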

3.4 What you should take away

By the end of this chapter (R + Python versions), you should be comfortable with:

Distinguishing data types (int, float, bool, str, complex).
Choosing an appropriate data structure:
- list vs NumPy array vs pandas DataFrame
- when you want homogeneity (numeric computation) vs heterogeneity.
Using vectorized operations instead of unnecessary loops:
- arithmetic on whole arrays
- logical masks and boolean indexing
- basic linear algebra with NumPy.
Writing small helper functions with clear arguments and defaults to standardize repeated analysis steps.

In later PyStatsV1 chapters, you’ll see these ideas used to:

build reusable simulation functions,
manipulate data for case studies,
and express models in a compact, vectorized way.

If any of the Python code in this chapter feels new, it’s worth experimenting interactively in a notebook or Python shell:

create a small vector or DataFrame,
try out indexing and filtering,
write a tiny function and call it on real data.

That practice will pay off quickly in the applied chapters.