Applied Statistics with Python – Chapter 3

Data and Programming (Python-first view)

This chapter is the Python companion to the “Data and Programming” chapter from the R notes. The statistical ideas are the same:

  • You need a small set of data types (numbers, text, booleans).

  • You store them in data structures (vectors/arrays, matrices, tables).

  • You use programming tools (control flow + functions) to glue analyses together.

The R book uses R’s vocabulary (vectors, matrices, lists, data frames). Here we’ll use the Python stack that maps to the same concepts:

  • Core Python (built-in types and control flow)

  • NumPy (for arrays, vectorization, and linear algebra)

  • pandas (for tabular data like R data frames / tibbles)

The goal is not to turn you into a software engineer. The goal is:

Think “what is the data?” and “what operation am I doing?” and then choose the Python object that matches that mental model.

3.1 Data Types

R has numeric, integer, complex, logical, character. Python has very similar building blocks:

  • int – integers: 1, 42, -3

  • float – real numbers (double precision): 1.0, 3.14, -0.001

  • complex – complex numbers: 4+2j

  • bool – logical values: True or False

  • str – text: "a", "Statistics", "1 plus 2"

A few quick parallels:

  • R’s TRUE / FALSE ↔ Python’s True / False

  • R’s NA ↔ Python’s None (missing in general) or numpy.nan (missing numeric)

  • R’s automatic coercion (e.g., mixing numbers and strings in a vector) ↔ Python lists hold mixed types without any coercion, while numerical containers like NumPy arrays and pandas columns are usually homogeneous and convert their elements to one common type.
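The parallels above can be checked interactively. The following is a minimal sketch; the variable names are illustrative only and do not come from the R notes.

```python
import math
import numpy as np

# Python lists keep each element's own type (no coercion)
mixed = [1, "a", True]
mixed_types = [type(item).__name__ for item in mixed]

# NumPy picks one common dtype for the whole array
numeric = np.array([1, 2.5])     # int and float -> float64
as_text = np.array([1, "a"])     # number and text -> a string dtype

# Missing values: None is a generic placeholder, np.nan is a float
nothing = None
missing = np.nan
nan_is_float = isinstance(missing, float)
nan_equals_itself = (missing == missing)   # nan never equals itself
nan_detected = math.isnan(missing)         # so test with isnan instead

print(mixed_types, numeric.dtype, as_text.dtype.kind)
print(nan_is_float, nan_equals_itself, nan_detected)
```

Because nan compares unequal to everything, including itself, missing numeric values should be detected with math.isnan, np.isnan, or pandas' isna rather than with ==.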

3.2 Data Structures: R vs Python mental map

R distinguishes between “homogeneous” (everything the same type) and “heterogeneous” (mixed types). Same idea in Python, just with different names.

Dimension   Homogeneous (R)   Homogeneous (Python)
1D          vector            NumPy ndarray (1D), pandas Series
2D          matrix            NumPy 2D ndarray, pandas DataFrame
3D+         array             higher-dim NumPy ndarray

Dimension   Heterogeneous (R)   Heterogeneous (Python)
1D          list                Python list, dict, dataclass
2D          data frame          pandas DataFrame

We’ll mostly use:

  • Python lists for small, generic sequences.

  • NumPy arrays when we mean “numeric vector/matrix.”

  • pandas DataFrames when we mean “rectangular data with named columns.”

3.2.1 One-dimensional containers: lists, ranges, and NumPy arrays

Python list: flexible sequence

This is Python's everyday sequence type. Unlike an R atomic vector it can hold mixed types, so in that respect it behaves more like an R list:

x = [1, 3, 5, 7, 8, 9]
x[0]      # 1 (0-based indexing in Python)
x[2]      # 5
x[-1]     # 9 (last element)

Remember: Python indexes from 0, not 1. That’s one of the biggest mental differences from R.

Creating sequences

R uses c(), : and seq(). Python equivalents:

# Explicit list
x = [1, 3, 5, 7, 8, 9]

# A sequence of integers (like 1:100 in R)
y = list(range(1, 101))  # 1, 2, ..., 100

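# A sequence with a step (like seq(1.5, 4.2, by = 0.1))
import numpy as np

# np.arange excludes the stop value and, with a float step, can gain or lose
# an element to round-off, so give the endpoints and the number of points instead
seq = np.linspace(1.5, 4.2, num=28)  # 1.5, 1.6, ..., 4.2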

Repetition

R has rep(). In Python:

["A"] * 10             # ['A', 'A', ..., 'A']
x * 3                  # repeats the list x three times

# with NumPy for numeric work:
x_arr = np.array(x)
rep_arr = np.tile(x_arr, 3)  # repeat the vector x three times
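R's rep() has two modes: repeating the whole vector (times) and repeating each element (each). A small sketch of the NumPy counterparts, np.tile and np.repeat, on a short illustrative vector:

```python
import numpy as np

x_arr = np.array([1, 3, 5])

# Like rep(x, times = 2): repeat the whole vector
tiled = np.tile(x_arr, 2)

# Like rep(x, each = 2): repeat each element in place
repeated = np.repeat(x_arr, 2)

print(tiled)      # [1 3 5 1 3 5]
print(repeated)   # [1 1 3 3 5 5]
```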

Vector length

R: length(x)

Python:

len(x)         # length of a list
len(x_arr)     # length of a NumPy array

3.2.1.1 Subsetting and slicing

R uses x[1], x[1:3], negative indices to drop elements, and logical vectors. Python has similar ideas with different syntax. One trap: a negative index in Python counts from the end of the sequence; it does not drop an element.

Indexing by position

x = [1, 3, 5, 7, 8, 9]

x[0]       # 1  (first element)
x[2]       # 5  (third element)
x[1:4]     # [3, 5, 7]  (slice: start inclusive, stop exclusive)
x[:3]      # [1, 3, 5]
x[3:]      # [7, 8, 9]
x[-1]      # 9 (last)
x[-2:]     # [8, 9] (last two)

NumPy arrays support the same slice notation (although a NumPy slice is a view of the original array, not a copy):

x_arr = np.array(x)
x_arr[0]      # 1
x_arr[1:4]    # array([3, 5, 7])

Boolean indexing (logical subsetting)

This is where NumPy and pandas line up very nicely with R.

R:

x[x > 3]
x[x != 3]

NumPy:

mask = x_arr > 3          # array([False, False, True, True, True, True])
x_arr[mask]               # array([5, 7, 8, 9])

x_arr[x_arr != 3]         # array([1, 5, 7, 8, 9])
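R's negative indices (x[-1] drops the first element) have no direct counterpart, because a negative index in Python counts from the end. One way to drop elements by position, sketched here, is np.delete or an explicit boolean mask:

```python
import numpy as np

x_arr = np.array([1, 3, 5, 7, 8, 9])

# Like x[-1] in R: everything except the first element
without_first = np.delete(x_arr, 0)

# Like x[-c(1, 3)] in R: drop the first and third elements
without_some = np.delete(x_arr, [0, 2])

# The same result with a boolean mask
keep = np.ones(x_arr.size, dtype=bool)
keep[[0, 2]] = False
masked = x_arr[keep]

print(without_first)   # [3 5 7 8 9]
print(without_some)    # [3 7 8 9]
```

np.delete returns a new array and leaves x_arr untouched.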

3.2.2 Vectorization in Python

The R chapter emphasises that R is “vectorized”: operations apply to whole vectors at once. Same idea in the scientific Python stack:

  • Pure Python lists: arithmetic is not vectorized.

  • NumPy arrays and pandas objects: arithmetic is vectorized.

Compare:

x_list = [1, 2, 3, 4, 5]

# NOT vectorized – this concatenates lists
x_list + [1]           # [1, 2, 3, 4, 5, 1]

# Vectorized: use NumPy arrays
x = np.array([1, 2, 3, 4, 5])

x + 1                  # array([2, 3, 4, 5, 6])
2 * x                  # array([ 2,  4,  6,  8, 10])
2 ** x                 # powers, elementwise
np.sqrt(x)
np.log(x)

Same mental model as in R:

“If I apply a numeric function to a whole vector, I get a vector back.”

Length recycling vs broadcasting

In R, x + y silently recycles the shorter vector and warns only when the longer length is not a multiple of the shorter one.

In NumPy:

  • There is no recycling. Shapes must be compatible for broadcasting: comparing dimensions from the last one backwards, each pair must be equal or one of them must be 1.

  • A shape mismatch raises an error instead of a warning (which is usually safer).

Example:

x = np.array([1, 3, 5, 7, 8, 9])
y = np.arange(1, 61)

# x + y    # ValueError: shapes (6,) and (60,) cannot be broadcast together

# To get R-style recycling, repeat x explicitly (6 divides 60):
np.tile(x, 10) + y
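To see what broadcasting does and does not allow, here is a small sketch. Shapes (6,) and (60,) are not broadcast-compatible even though 6 divides 60, so the addition raises a ValueError, while a (3, 1) column and a (3,) row do combine. The arrays col and row are illustrative.

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9])
y = np.arange(1, 61)

# Lengths 6 and 60: no recycling, NumPy raises an error
try:
    x + y
    raised = False
except ValueError:
    raised = True

# Explicit recycling, if that is really what you want
recycled = np.tile(x, y.size // x.size) + y

# Genuine broadcasting: a (3, 1) column plus a (3,) row gives a (3, 3) grid
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
grid = col + row

print(raised)           # True
print(recycled.shape)   # (60,)
print(grid)
```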

3.2.3 Logical operators

R operators: <, >, <=, >=, ==, !=, !, &, |.

Python has very similar operators:

x = np.array([1, 3, 5, 7, 8, 9])

x > 3        # array([False, False,  True,  True,  True,  True])
x < 3        # array([ True, False, False, False, False, False])
x == 3       # array([False,  True, False, False, False, False])
x != 3       # array([ True, False,  True,  True,  True,  True])

A few important notes:

  • For NumPy arrays, use & and | for elementwise AND/OR, with parentheses:

    (x > 3) & (x < 8)    # both conditions
    (x == 3) | (x == 9)  # either condition
    
  • For pure Python booleans (not arrays), use and / or:

    (3 < 4) and (42 > 13)
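R's ! (negation) also has an array counterpart, ~, and the parentheses matter because & and | bind more tightly than comparisons. A brief sketch that also shows why and / or cannot be applied to whole arrays:

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9])

# Elementwise NOT (like !(x > 3) in R)
not_big = ~(x > 3)

# Elementwise AND, with each comparison in parentheses
between = (x > 3) & (x < 8)

# `and` needs a single True/False, so it fails on arrays
try:
    (x > 3) and (x < 8)
    ambiguous = False
except ValueError:
    ambiguous = True

# Collapse a mask to one boolean with any / all
any_big = bool(np.any(x > 8))
all_positive = bool(np.all(x > 0))

print(not_big, between, ambiguous, any_big, all_positive)
```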
    

Counting and coercion

R shows that logical values act like 0/1 in numeric calculations (sum(x > 3)). Same in Python/NumPy:

mask = x > 3
mask           # array([False, False,  True,  True,  True,  True])

mask.sum()     # 4 (True acts like 1, False like 0)
np.sum(mask)   # also 4

mask.astype(int)   # array([0, 0, 1, 1, 1, 1])
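Because True counts as 1, the mean of a mask is the proportion of elements that satisfy the condition, which is handy for empirical probabilities. A minimal sketch:

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9])
mask = x > 3

count = int(mask.sum())                   # how many values exceed 3
proportion = float(mask.mean())           # what fraction exceed 3
count_again = int(np.count_nonzero(mask)) # same count, different function

print(count, proportion)
```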

3.2.4 Matrices and linear algebra (NumPy)

R uses matrix(), %*%, t(), solve(), diag() and friends. In Python, the equivalents live in NumPy and its np.linalg submodule.

Creating matrices

x = np.arange(1, 10)          # 1..9
X = x.reshape(3, 3, order="F")  # like R’s column-major matrix()
X

# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])

Y = x.reshape(3, 3, order="C")  # row-wise (byrow = TRUE in R)
Y

Z = np.zeros((2, 4))           # 2x4 matrix of zeros

Subsetting

X[0, 1]     # element in first row, second column  (4)
X[0, :]     # first row
X[:, 1]     # second column
X[1, [0, 2]]  # second row, first and third columns

Matrix operations

Elementwise operations:

X + Y
X - Y
X * Y    # elementwise product
X / Y    # elementwise division

Matrix multiplication and linear algebra:

# matrix multiplication (like R's %*%)
X @ Y
np.matmul(X, Y)

# transpose
X_T = X.T

# identity and diagonal matrices
np.eye(3)          # 3x3 identity
np.diag([1, 2, 3]) # diagonal with 1,2,3 on the diagonal

# inverse (if invertible)
Z = np.array([[9, 2, -3],
              [2, 4, -2],
              [-3, -2, 16]])

Z_inv = np.linalg.inv(Z)

Z_inv @ Z
# approximately the identity matrix

Floating point equality

R uses all.equal to compare floating-point matrices. NumPy equivalent:

np.allclose(Z_inv @ Z, np.eye(3))   # True
(Z_inv @ Z == np.eye(3)).all()      # often False due to tiny round-off
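R's solve(A, b) solves a linear system directly, without forming the inverse. The NumPy counterpart is np.linalg.solve, sketched here with the matrix Z from above and a hypothetical right-hand side b chosen only for illustration:

```python
import numpy as np

Z = np.array([[9, 2, -3],
              [2, 4, -2],
              [-3, -2, 16]], dtype=float)
b = np.array([1.0, 2.0, 3.0])   # hypothetical right-hand side

# Solve Z @ beta = b directly (like solve(Z, b) in R)
beta = np.linalg.solve(Z, b)

# Check the solution, and compare with the explicit-inverse route
solves_system = np.allclose(Z @ beta, b)
matches_inverse = np.allclose(beta, np.linalg.inv(Z) @ b)

print(solves_system, matches_inverse)   # True True
```

Solving the system directly is generally preferred to multiplying by an explicit inverse, since it does less work and accumulates less round-off.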

Dot product and outer product

R uses a_vec %*% b_vec and a_vec %o% b_vec; also crossprod.

Python:

a_vec = np.array([1, 2, 3])
b_vec = np.array([2, 2, 2])

# Inner (dot) product
a_vec @ b_vec           # 12
np.dot(a_vec, b_vec)    # 12

# Outer product
np.outer(a_vec, b_vec)

# “crossprod(X, Y)” (X^T Y) in NumPy:
C_mat = np.array([[1, 2, 3],
                  [4, 5, 6]])
D_mat = np.array([[2, 2, 2],
                  [2, 2, 2]])

C_mat.T @ D_mat    # like crossprod(C_mat, D_mat)
np.allclose(C_mat.T @ D_mat, C_mat.T.dot(D_mat))

3.2.5 Heterogeneous containers: lists and dicts

The R chapter introduces lists as “one-dimensional containers that can hold anything”: vectors, matrices, functions, etc.

In Python we have:

  • list – ordered sequence (can be mixed types)

  • dict – mapping from names to values (key–value store)

An R list like:

ex_list = list(
  a = c(1, 2, 3, 4),
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) { print("Hello World!") },
  e = diag(5)
)

could be represented roughly as:

def say_hello(arg=42):
    print("Hello World!")

ex_dict = {
    "a": np.array([1, 2, 3, 4]),
    "b": True,
    "c": "Hello!",
    "d": say_hello,
    "e": np.eye(5),   # diag(5) in R is the 5x5 identity matrix
}

Accessing elements:

ex_dict["e"]      # matrix
ex_dict["a"]      # array
ex_dict["d"](arg=1)
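A dict can also be inspected and extended much like an R list with names() and $<-. The sketch below uses a cut-down version of the example dict; the keys "z" and "f" are illustrative.

```python
import numpy as np

ex_dict = {
    "a": np.array([1, 2, 3, 4]),
    "b": True,
    "c": "Hello!",
}

# Like names(ex_list) in R
names = list(ex_dict.keys())

# Loop over name/value pairs and record each value's type
type_names = {name: type(value).__name__ for name, value in ex_dict.items()}

# .get returns None (or a default) instead of raising KeyError
absent = ex_dict.get("z")

# Add a new element by assignment
ex_dict["f"] = 3.14

print(names)
print(type_names)
```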

3.2.6 Tabular data: pandas DataFrames

R’s data frame / tibble ↔ Python’s pandas DataFrame.

Minimal example:

import pandas as pd

example_data = pd.DataFrame({
    "x": [1, 3, 5, 7, 9, 1, 3, 5, 7, 9],
    "y": ["Hello"] * 9 + ["Goodbye"],
    "z": [True, False] * 5
})

example_data
example_data.head()      # first rows
example_data.info()      # structure, types
example_data.shape       # (n_rows, n_cols)
example_data.columns     # column names

Reading from CSV (similar to readr's read_csv() or base R's read.csv()):

cars = pd.read_csv("data/example-data.csv")

# glimpse the data
cars.head(10)
cars.info()

Subsetting rows and columns

Like R:

# single column as a Series
example_data["x"]

# multiple columns as a DataFrame
example_data[["x", "y"]]

# Boolean filter: keep the rows where x > 5
mask = example_data["x"] > 5
example_data[mask]

# Equivalent to subset(mpg, subset = hwy > 35, select = c("manufacturer", "model", "year")):
mpg = cars  # imagine we loaded the mpg data
mpg.loc[mpg["hwy"] > 35, ["manufacturer", "model", "year"]]  # rows and columns in one step

You can also use query for more R-like syntax:

mpg.query("hwy > 35")[["manufacturer", "model", "year"]]
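The row-and-column subsetting patterns can be tried without the CSV file. The sketch below builds a tiny made-up table, mpg_demo, whose values are invented purely for illustration and are not the real mpg data. It checks that .loc and query select the same rows.

```python
import pandas as pd

# Toy stand-in for the mpg data (values are made up)
mpg_demo = pd.DataFrame({
    "manufacturer": ["honda", "toyota", "ford", "volkswagen"],
    "model": ["civic", "corolla", "f150", "jetta"],
    "year": [2008, 2008, 1999, 1999],
    "hwy": [36, 37, 17, 44],
})

cols = ["manufacturer", "model", "year"]

# Rows and columns selected in one step with .loc
by_loc = mpg_demo.loc[mpg_demo["hwy"] > 35, cols]

# The same selection written with query
by_query = mpg_demo.query("hwy > 35")[cols]

same_result = by_loc.equals(by_query)
print(by_loc)
print(same_result)   # True
```

Note that the original row labels (0, 1, 3) are kept after filtering; call reset_index(drop=True) if you want them renumbered.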

3.3 Programming Basics in Python

Now we connect data structures with basic programming tools: control flow and functions.

3.3.1 Control flow

If / elif / else

R:

if (x > y) {
  # ...
} else {
  # ...
}

Python:

x = 1
y = 3

if x > y:
    z = x * y
    print("x is larger than y")
else:
    z = x + 5 * y
    print("x is less than or equal to y")

There is also a short expression form (similar spirit to ifelse for scalars):

result = 1 if 4 > 3 else 0     # 1

Vectorized “if” with NumPy/pandas

R’s ifelse(condition, value_if_true, value_if_false) is used for vectors.

In Python we use np.where or pandas methods:

fib = np.array([1, 1, 2, 3, 5, 8, 13, 21])
np.where(fib > 6, "Foo", "Bar")
# array(['Bar', 'Bar', 'Bar', 'Bar', 'Bar', 'Foo', 'Foo', 'Foo'], dtype='<U3')

For pandas Series:

mpg["label"] = np.where(mpg["hwy"] > 35, "Efficient", "Regular")
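np.where handles a two-way choice. For three or more outcomes, one option is np.select, which takes a list of conditions and a matching list of values and uses the first condition that holds. A sketch with illustrative labels:

```python
import numpy as np

fib = np.array([1, 1, 2, 3, 5, 8, 13, 21])

# Conditions are checked in order; the first True one wins
conditions = [fib > 10, fib > 2]
choices = ["Large", "Medium"]

labels = np.select(conditions, choices, default="Small")
print(labels)
```

This plays the role of nested ifelse() calls in R while keeping the conditions and labels side by side.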

For loops vs vectorization

The R chapter shows that explicit loops are often replaced by vectorized code.

Same in Python:

# Loop version
x = [11, 12, 13, 14, 15]
for i in range(len(x)):
    x[i] = x[i] * 2

# Vectorized version with NumPy
x_arr = np.array([11, 12, 13, 14, 15])
x_arr = x_arr * 2
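When you do need a loop over a plain list, a list comprehension is the usual Python idiom and avoids index bookkeeping. A small sketch comparing it with the vectorized version:

```python
import numpy as np

x = [11, 12, 13, 14, 15]

# List comprehension: builds a new list, leaves x unchanged
doubled_list = [value * 2 for value in x]

# Vectorized NumPy version
doubled_arr = np.array(x) * 2

print(doubled_list)   # [22, 24, 26, 28, 30]
print(doubled_arr)    # [22 24 26 28 30]
```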

3.3.2 Defining functions

Basic structure

R:

standardize = function(x) {
  (x - mean(x)) / sd(x)
}

Python:

import numpy as np

def standardize(x: np.ndarray) -> np.ndarray:
    """
    Standardize a numeric vector/array:
    subtract the mean and divide by the sample standard deviation.
    """
    m = x.mean()
    s = x.std(ddof=1)   # ddof=1 for sample SD (like R's sd)
    return (x - m) / s

Test it:

sample = np.random.normal(loc=2, scale=5, size=10)
z = standardize(sample)

z.mean()      # close to 0
z.std(ddof=1) # close to 1
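For reproducible checks it helps to seed the random number generator. The sketch below repeats the test with np.random.default_rng and an arbitrary seed, then applies the same idea column by column to a 2D array using the axis argument. The seed and the array sizes are assumptions for illustration.

```python
import numpy as np

def standardize(x: np.ndarray) -> np.ndarray:
    """Subtract the mean and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility
sample = rng.normal(loc=2, scale=5, size=10)
z = standardize(sample)

mean_ok = np.isclose(z.mean(), 0.0)
sd_ok = np.isclose(z.std(ddof=1), 1.0)

# Column-wise standardization of a matrix: axis=0 works down each column
X = rng.normal(size=(100, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(mean_ok, sd_ok)
print(Z.mean(axis=0).round(12), Z.std(axis=0, ddof=1))
```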

Default arguments

R:

power_of_num = function(num, power = 2) {
  num ^ power
}

Python:

def power_of_num(num, power=2):
    return num ** power

power_of_num(10)                # 100
power_of_num(num=10, power=2)   # 100
power_of_num(power=3, num=2)    # 8

Variance example (biased vs unbiased)

The R notes define two forms of variance: unbiased (divide by n−1) and biased (divide by n).

We can mirror this:

def sample_variance(x: np.ndarray, biased: bool = False) -> float:
    """
    Compute the variance of x.

    biased = False  -> divide by (n-1)  (unbiased, like R's var)
    biased = True   -> divide by n      (ML / population variance)
    """
    x = np.asarray(x)
    ddof = 0 if biased else 1   # NumPy divides by (n - ddof)
    return float(x.var(ddof=ddof))

sample = np.random.normal(size=10)
sample_variance(sample)                # unbiased (n-1)
sample_variance(sample, biased=True)   # biased (n)
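The two divisors can be verified by hand on a small vector. In this sketch the data values are chosen only because the arithmetic is easy: the mean is 5 and the squared deviations sum to 32.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Sum of squared deviations from the mean
ss = np.sum((x - x.mean()) ** 2)

unbiased = ss / (n - 1)   # 32 / 7
biased = ss / n           # 32 / 8 = 4.0

# NumPy's ddof argument reproduces both
matches_unbiased = np.isclose(unbiased, x.var(ddof=1))
matches_biased = np.isclose(biased, x.var(ddof=0))

print(unbiased, biased, matches_unbiased, matches_biased)
```

Note that NumPy's default is ddof=0 (divide by n), the opposite of R's var, so it is worth passing ddof explicitly.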

3.4 What you should take away

By the end of this chapter (R + Python versions), you should be comfortable with:

  • Distinguishing data types (int, float, bool, str, complex).

  • Choosing an appropriate data structure:

    • list vs NumPy array vs pandas DataFrame

    • when you want homogeneity (numeric computation) vs heterogeneity.

  • Using vectorized operations instead of unnecessary loops:

    • arithmetic on whole arrays

    • logical masks and boolean indexing

    • basic linear algebra with NumPy.

  • Writing small helper functions with clear arguments and defaults to standardize repeated analysis steps.

In later PyStatsV1 chapters, you’ll see these ideas used to:

  • build reusable simulation functions,

  • manipulate data for case studies,

  • and express models in a compact, vectorized way.

If any of the Python code in this chapter feels new, it’s worth experimenting interactively in a notebook or Python shell:

  • create a small vector or DataFrame,

  • try out indexing and filtering,

  • write a tiny function and call it on real data.

That practice will pay off quickly in the applied chapters.