The math you actually need

Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias–variance tradeoff legible.

How deep to go. Aim for fluency, not proofs. If you can re-derive backpropagation on a 2-layer MLP on paper and explain why PCA picks the top eigenvectors, you're past the bar.

Linear algebra

Key identity for ML: for a data matrix X whose rows are data points, center the columns first; then the covariance matrix is C = (1/n) Xᵀ X, and its top eigenvectors are the principal components.
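
A quick way to make this concrete is to check the identity numerically. The sketch below uses a small random data matrix; variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # rows are data points

Xc = X - X.mean(axis=0)                    # center each feature column
C = (Xc.T @ Xc) / len(Xc)                  # C = (1/n) Xᵀ X

# np.cov with bias=True uses the same 1/n normalization.
print(np.allclose(C, np.cov(Xc, rowvar=False, bias=True)))   # True

# Eigenvectors of C, sorted by descending eigenvalue, are the PCs.
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending order
principal_components = eigvecs[:, ::-1]
```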

Probability & statistics

Multivariable calculus

Gradient descent: x ← x − η · ∇f(x). Reduce η if the loss oscillates; increase η if it crawls.
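
A minimal sketch of this update rule on a toy 1-D convex loss; the function and the three learning rates are illustrative, not prescriptive.

```python
def f(x):
    return (x - 3.0) ** 2           # convex, minimum at x = 3

def grad_f(x):
    return 2.0 * (x - 3.0)          # analytic derivative

for eta in (0.01, 0.1, 0.9):        # too small, reasonable, oscillating
    x = 0.0
    for _ in range(50):
        x = x - eta * grad_f(x)     # x ← x − η · ∇f(x)
    print(f"eta={eta}: x={x:.4f}")
```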

Convex optimization

  1. Implement matrix multiplication, transpose, and dot product from scratch in NumPy without built-ins. Verify against np.matmul (see the sketch after this list).
  2. Compute PCA by hand on a 5-point 2-D dataset: center, covariance, eigendecomposition, project.
  3. Derive the gradient of MSE loss for linear regression. Verify with PyTorch autograd.
  4. For a 2-layer MLP with one hidden ReLU and a sigmoid output, write out the four partial derivatives needed for backprop on a single sample.
  5. Run gradient descent on a 1-D convex function (e.g. f(x) = (x − 3)²) and plot the trajectory for three learning rates.
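
For item 1, one possible from-scratch version looks like the sketch below; it is deliberately loop-based and checked against the NumPy built-ins.

```python
import numpy as np

def matmul(A, B):
    """Naive product: C[i, j] = Σ_k A[i, k] · B[k, j]."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for t in range(k):
                C[i, j] += A[i, t] * B[t, j]
    return C

def transpose(A):
    n, m = A.shape
    T = np.zeros((m, n))
    for i in range(n):
        for j in range(m):
            T[j, i] = A[i, j]
    return T

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = np.random.default_rng(1)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(3, 5))
print(np.allclose(matmul(A, B), np.matmul(A, B)))      # True
print(np.allclose(transpose(A), A.T))                  # True
print(np.isclose(dot(A[0], B[:, 0]), A[0] @ B[:, 0]))  # True
```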

Exercises with full solutions

Try each before opening the solution. The point is to derive, not just to know the answer.

E-1 · Determinant from eigenvalues

A 3×3 matrix has eigenvalues 1, 2, 4. What are det(A) and trace(A)?

Solution. Determinant = product of eigenvalues = 1 · 2 · 4 = 8. Trace = sum of eigenvalues = 1 + 2 + 4 = 7.
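
A quick check with a diagonal matrix that has these eigenvalues; any matrix similar to it has the same determinant and trace.

```python
import numpy as np

A = np.diag([1.0, 2.0, 4.0])
print(np.linalg.det(A), np.trace(A))   # ≈ 8.0, 7.0
```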

E-2 · Bayes' rule

Disease prevalence is 1%. A test has 95% sensitivity and 90% specificity. Given a positive test, find P(disease | +).

Solution. P(+ | D) P(D) = 0.95 · 0.01 = 0.0095. P(+) = P(+ | D)P(D) + P(+ | ¬D)P(¬D) = 0.0095 + 0.10 · 0.99 = 0.1085. P(D | +) = 0.0095 / 0.1085 ≈ 0.0876. About 8.8% — counter-intuitively low because the base rate is low.
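
The same arithmetic in a few lines of Python, handy for checking variants of the numbers.

```python
p_d, sens, spec = 0.01, 0.95, 0.90
p_pos = sens * p_d + (1 - spec) * (1 - p_d)   # total probability of a positive test
print(sens * p_d / p_pos)                     # ≈ 0.0876
```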

E-3 · PCA fraction of variance

Covariance eigenvalues are 12, 6, 1.5, 0.5. What fraction of variance is captured by the top 2 PCs?

Solution. (12 + 6) / (12 + 6 + 1.5 + 0.5) = 18 / 20 = 0.90, i.e. 90%.
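
The same ratio as a one-liner, useful when the eigenvalue list is longer.

```python
eigvals = [12, 6, 1.5, 0.5]
print(sum(eigvals[:2]) / sum(eigvals))   # 0.9
```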

E-4 · Gradient at a point

For f(x, y) = x² + 4y², compute ∇f at (1, 1). One gradient-descent step at η = 0.1 lands you where?

Solution. ∇f = (2x, 8y) = (2, 8). New point: (1, 1) − 0.1 · (2, 8) = (0.8, 0.2).
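
The same step numerically (NumPy only for convenience).

```python
import numpy as np

p = np.array([1.0, 1.0])
grad = np.array([2 * p[0], 8 * p[1]])   # ∇f = (2x, 8y)
print(p - 0.1 * grad)                   # [0.8 0.2]
```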

E-5 · Bernoulli MLE

n i.i.d. observations x₁, …, xₙ from Bernoulli(p). Derive the MLE of p.

Solution. Log-likelihood: ℓ(p) = (Σ xᵢ) log p + (n − Σ xᵢ) log(1 − p). Differentiate: dℓ/dp = (Σ xᵢ)/p − (n − Σ xᵢ)/(1 − p) = 0. Solving gives p̂ = (1/n) Σ xᵢ — the sample mean.
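
A simulation consistent with the result: on a grid of candidate p, the log-likelihood peaks at the sample mean. The true p and sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)     # Bernoulli(0.3) draws

def loglik(p):
    s = x.sum()
    return s * np.log(p) + (len(x) - s) * np.log(1 - p)

grid = np.linspace(0.01, 0.99, 99)
print(grid[np.argmax(loglik(grid))], x.mean())   # both ≈ 0.3
```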

E-6 · Chain rule on a tiny MLP

One-hidden-layer net: z = w x + b, h = max(0, z), ŷ = v h + c, L = (ŷ − y)². Compute ∂L/∂w.

Solution. ∂L/∂ŷ = 2(ŷ − y). ∂ŷ/∂h = v. ∂h/∂z = 1 if z > 0 else 0. ∂z/∂w = x. Chain: ∂L/∂w = 2(ŷ − y) · v · 1[z > 0] · x.
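
A quick PyTorch autograd check of this expression, in the spirit of items 3 and 4 in the list above; the input values are arbitrary.

```python
import torch

x, y = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b, v, c = torch.tensor(0.1), torch.tensor(-1.5), torch.tensor(0.3)

z = w * x + b
h = torch.relu(z)
y_hat = v * h + c
L = (y_hat - y) ** 2
L.backward()

manual = 2 * (y_hat - y) * v * float(z > 0) * x   # 2(ŷ − y) · v · 1[z > 0] · x
print(w.grad.item(), manual.item())               # should agree
```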

E-7 · Convex combinations and Jensen

Show that for convex f and weights w₁ + w₂ = 1 (both ≥ 0), f(w₁ x + w₂ y) ≤ w₁ f(x) + w₂ f(y).

Solution. By the definition of convexity, any chord lies on or above the graph. The chord from (x, f(x)) to (y, f(y)) at parameter t = w₂ has x-coordinate w₁ x + w₂ y and value w₁ f(x) + w₂ f(y); the function value at that x-coordinate, f(w₁ x + w₂ y), lies on or below the chord, which is exactly the inequality.

E-8 · SVD shape

X is a 100×5 data matrix. What are the shapes of U, Σ, Vᵀ in its (thin) SVD?

Solution. Thin SVD: U is 100×5, Σ is 5×5 diagonal, Vᵀ is 5×5. Then X = U Σ Vᵀ.
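
NumPy's thin SVD confirms the shapes; it returns the singular values as a 1-D vector, and np.diag(s) gives the 5×5 Σ.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (100, 5) (5,) (5, 5)
print(np.allclose(X, U @ np.diag(s) @ Vt))   # True
```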

E-9 · Linearity of expectation

Flip a fair coin 10 times. Let X be the number of heads. Compute E[X] and Var(X).

Solution. X = Σ Xᵢ where Xᵢ ∈ {0, 1}. E[Xᵢ] = 0.5, so E[X] = 10 · 0.5 = 5. The Xᵢ are independent, so Var(Xᵢ) = 0.5 · 0.5 = 0.25 and Var(X) = 10 · 0.25 = 2.5.
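
A quick simulation (sample size arbitrary) agrees with both values.

```python
import numpy as np

flips = np.random.default_rng(0).integers(0, 2, size=(100_000, 10))
heads = flips.sum(axis=1)
print(heads.mean(), heads.var())   # ≈ 5.0, ≈ 2.5
```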

E-10 · Lagrange multiplier

Minimize f(x, y) = x² + y² subject to x + y = 1.

Solution. Lagrangian: L = x² + y² − λ(x + y − 1). Set ∂L/∂x = 2x − λ = 0 and ∂L/∂y = 2y − λ = 0, which gives x = y. The constraint then forces 2x = 1, so x = y = 1/2 and the minimum value is 1/2.
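
A numerical cross-check, assuming SciPy is available; SLSQP handles the equality constraint.

```python
from scipy.optimize import minimize

res = minimize(lambda p: p[0] ** 2 + p[1] ** 2,
               x0=[0.0, 0.0],
               constraints={"type": "eq", "fun": lambda p: p[0] + p[1] - 1},
               method="SLSQP")
print(res.x, res.fun)   # ≈ [0.5 0.5], ≈ 0.5
```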