The math you actually need
Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.
Linear algebra
- Vectors and norms. Dot product, projection, L1 / L2 / L∞ norms. Geometric intuition matters: a dot product is "how much one vector points in the direction of another."
- Matrices as linear maps. Matrix-vector multiplication is "apply this transformation." Rotation, scaling, projection are all matrices.
- Matrix multiplication. Composing transformations. Know that (AB)x = A(Bx) and why matrix multiplication is not commutative.
- Rank, null space, column space. What information a matrix preserves vs. destroys.
- Eigenvalues & eigenvectors. The directions a matrix scales without rotating. Foundation for PCA and spectral methods (checked numerically in the sketch after this list).
- Singular value decomposition (SVD). The factorization you'll keep meeting in dimensionality reduction, recommender systems, and low-rank approximation.
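A minimal NumPy sketch (illustrative values, not part of the curriculum) that checks two of the claims above: the residual of a projection is orthogonal to the direction you project onto, and an eigenvector is only scaled by its matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dot product as "how much a points along b": project a onto b,
# then check the residual carries no component along b.
a, b = rng.standard_normal(3), rng.standard_normal(3)
proj = (a @ b) / (b @ b) * b
print(np.isclose((a - proj) @ b, 0.0))    # True

# Eigenvectors: directions a matrix scales without rotating.
S = rng.standard_normal((4, 4))
S = S @ S.T                               # symmetrize: real eigenpairs
vals, vecs = np.linalg.eigh(S)
v = vecs[:, -1]                           # eigenvector of the top eigenvalue
print(np.allclose(S @ v, vals[-1] * v))   # True: S v = λ v
```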
Probability & statistics
- Random variables, expectation, variance. The two summary statistics that drive almost every loss function.
- Common distributions. Bernoulli, binomial, Gaussian, uniform, Poisson. Recognize each from its PMF (discrete) or PDF (continuous).
- Joint, marginal, conditional probability. P(A, B) = P(A | B) P(B). Bayes' rule.
- Independence vs. correlation. Independence implies zero correlation; the converse fails in general (jointly Gaussian variables are the exception).
- Maximum likelihood estimation (MLE). The probabilistic justification behind least-squares regression and cross-entropy classification.
- Central limit theorem. Why means of many independent samples with finite variance tend toward Gaussian, and why we trust mean estimates (simulated in the sketch after this list).
- Hypothesis testing basics. p-values, confidence intervals — useful for evaluating whether model A really beats model B.
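A quick simulation (a sketch, not a proof) of the central limit theorem: the mean of n Uniform(0, 1) draws has standard deviation √(1/(12n)), shrinking like 1/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Std of the mean of n uniform draws vs. the CLT prediction sqrt(1/(12 n)).
for n in (1, 10, 100, 1000):
    means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical std = {means.std():.4f}  "
          f"theory = {np.sqrt(1 / (12 * n)):.4f}")
```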
Multivariable calculus
- Partial derivatives. ∂f/∂xᵢ — "how does f change if I wiggle xᵢ alone."
- Gradient. The vector of all partials. Points in the direction of steepest increase (checked against finite differences in the sketch after this list).
- Chain rule. The single most important calculus fact for ML — backpropagation is just the chain rule applied repeatedly.
- Jacobian & Hessian. Matrix of partials (Jacobian) and second-order partials (Hessian). The Hessian's eigenvalues tell you about local curvature and convexity.
- Taylor expansion. Approximate a function locally by a first- or second-order polynomial built from its derivatives. Underlies Newton's method.
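Whenever you derive a gradient by hand, you can sanity-check it with central finite differences. A minimal sketch, using the same f(x, y) = x² + 4y² that appears in Exercise E-4:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 4 * y**2

def grad_f(v):
    x, y = v
    return np.array([2 * x, 8 * y])   # analytic gradient

def numeric_grad(f, v, eps=1e-6):
    # Central difference: (f(v + eps e_i) - f(v - eps e_i)) / (2 eps).
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

v = np.array([1.0, 1.0])
print(grad_f(v), numeric_grad(f, v))   # both ≈ [2. 8.]
```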
Convex optimization
- Convex sets and functions. A function is convex if every chord lies on or above the graph. For a convex problem every local minimum is global; strictly convex objectives have a unique minimizer.
- Convex losses. Squared error, logistic loss, hinge loss, cross-entropy — all convex in the linear-model setting.
- Lagrange multipliers & KKT conditions. Handle constraints. Show up in SVMs explicitly.
- Gradient descent & variants. SGD, momentum, Adam. Know the update rules from memory (sketched after this list).
- Convergence intuition. Learning rate too high → divergence. Too low → slow crawl. Adam mitigates both by adapting per-parameter rates.
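The update rules, written out as plain NumPy functions. These follow the common textbook conventions; hyperparameter names and defaults are illustrative, and libraries differ in details such as where the learning rate enters the momentum update:

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                 # velocity: decaying sum of gradients
    return w - lr * v, v

def adam_step(w, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * g**2     # second-moment (uncentered) estimate
    m_hat = m / (1 - b1**t)          # bias correction, t = step count from 1
    s_hat = s / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```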
Drills (hands-on)
- Implement matrix multiplication, transpose, and dot product from scratch in NumPy without built-ins. Verify against np.matmul.
- Compute PCA by hand on a 5-point 2-D dataset: center, covariance, eigendecomposition, project.
- Derive the gradient of MSE loss for linear regression. Verify with PyTorch autograd.
- For a 2-layer MLP with one hidden ReLU and a sigmoid output, write out the four partial derivatives needed for backprop on a single sample.
- Run gradient descent on a 1-D convex function (e.g. f(x) = (x − 3)²) and plot the trajectory for three learning rates; a starter sketch follows.
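A starter sketch for the last drill. The learning rates are chosen so that one run crawls, one converges quickly, and one diverges (for f(x) = (x − 3)², any η > 1 makes the iteration diverge):

```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda x: (x - 3) ** 2
df = lambda x: 2 * (x - 3)          # gradient of f

for lr in (0.05, 0.4, 1.05):        # slow crawl, fast, divergent
    x, traj = 0.0, [0.0]
    for _ in range(20):
        x -= lr * df(x)             # gradient-descent update
        traj.append(x)
    plt.plot(traj, label=f"lr={lr}")

plt.axhline(3, ls="--", c="gray")   # the minimizer x* = 3
plt.xlabel("step"); plt.ylabel("x"); plt.legend(); plt.show()
```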
Exercises with full solutions
Try each before opening the solution. The point is to derive, not just to know the answer.
E-1 · Determinant from eigenvalues
A 3×3 matrix has eigenvalues 1, 2, 4. What are det(A) and trace(A)?
Solution. Determinant = product of eigenvalues = 1 · 2 · 4 = 8. Trace = sum of eigenvalues = 1 + 2 + 4 = 7.
E-2 · Bayes' rule
Disease prevalence is 1%. A test has 95% sensitivity and 90% specificity. Given a positive test, find P(disease | +).
Solution. P(+ | D) P(D) = 0.95 · 0.01 = 0.0095.
P(+) = P(+ | D) P(D) + P(+ | ¬D) P(¬D) = 0.0095 + 0.10 · 0.99 = 0.1085.
P(D | +) = 0.0095 / 0.1085 ≈ 0.0876. About 8.8%, counter-intuitively low because the base rate is low.
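A two-line numeric check of the arithmetic:

```python
prev, sens, spec = 0.01, 0.95, 0.90
p_pos = sens * prev + (1 - spec) * (1 - prev)   # law of total probability
print(sens * prev / p_pos)                      # 0.0875576..., about 8.8%
```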
E-3 · PCA fraction of variance
Covariance eigenvalues are 12, 6, 1.5, 0.5. What fraction of variance is captured by the top 2 PCs?
Solution. (12 + 6) / (12 + 6 + 1.5 + 0.5) = 18 / 20 = 0.90 → 90%.
E-4 · Gradient at a point
For f(x, y) = x² + 4y², compute ∇f at (1, 1). One gradient-descent step at η = 0.1 lands you where?
Solution. ∇f = (2x, 8y) = (2, 8). New point: (1, 1) − 0.1 · (2, 8) = (0.8, 0.2).
E-5 · Bernoulli MLE
n i.i.d. observations x₁, …, xₙ from Bernoulli(p). Derive the MLE of p.
Solution. Log-likelihood: ℓ(p) = (Σ xᵢ) log p + (n − Σ xᵢ) log(1 − p). Differentiate: dℓ/dp = (Σ xᵢ)/p − (n − Σ xᵢ)/(1 − p) = 0. Solving gives p̂ = (1/n) Σ xᵢ — the sample mean.
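A numeric sanity check (a sketch, with an arbitrary true p = 0.3): maximize the log-likelihood over a grid and compare against the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)   # Bernoulli(0.3) samples

def loglik(p):
    k = x.sum()
    return k * np.log(p) + (len(x) - k) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
print(grid[np.argmax(loglik(grid))], x.mean())   # both ≈ 0.3
```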
E-6 · Chain rule on a tiny MLP
One-hidden-layer net: z = w x + b, h = max(0, z), ŷ = v h + c, L = (ŷ − y)². Compute ∂L/∂w.
Solution. ∂L/∂ŷ = 2(ŷ − y). ∂ŷ/∂h = v. ∂h/∂z = 1 if z > 0 else 0. ∂z/∂w = x. Chain: ∂L/∂w = 2(ŷ − y) · v · 1[z > 0] · x.
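You can confirm the hand-derived chain against PyTorch autograd with arbitrary scalar values (the numbers below are illustrative):

```python
import torch

x, y = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b, v, c = torch.tensor(0.1), torch.tensor(-1.5), torch.tensor(0.3)

z = w * x + b
h = torch.relu(z)
y_hat = v * h + c
loss = (y_hat - y) ** 2
loss.backward()

manual = (2 * (y_hat - y) * v * float(z > 0) * x).item()  # chain-rule formula
print(w.grad.item(), manual)                              # identical
```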
E-7 · Convex combinations and Jensen
Show that for convex f and weights w₁ + w₂ = 1 (both ≥ 0), f(w₁ x + w₂ y) ≤ w₁ f(x) + w₂ f(y).
Solution. This is the chord condition made explicit. Parameterize the chord from (x, f(x)) to (y, f(y)) by t ∈ [0, 1]: its height above the point (1 − t)x + ty is (1 − t)f(x) + tf(y). Taking t = w₂ (so 1 − t = w₁), the chord height at w₁x + w₂y is w₁f(x) + w₂f(y), and convexity says the function value there lies on or below the chord: f(w₁x + w₂y) ≤ w₁f(x) + w₂f(y).
E-8 · SVD shape
X is a 100×5 data matrix. What are the shapes of U, Σ, Vᵀ in its (thin) SVD?
Solution. Thin SVD: U is 100×5, Σ is 5×5 diagonal, Vᵀ is 5×5. Then X = U Σ Vᵀ.
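Checking the shapes directly (note that np.linalg.svd returns the singular values as a vector, so rebuild Σ with np.diag):

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((100, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD
print(U.shape, s.shape, Vt.shape)                  # (100, 5) (5,) (5, 5)
print(np.allclose(X, U @ np.diag(s) @ Vt))         # True: X = U Σ Vᵀ
```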
E-9 · Linearity of expectation
Flip a fair coin 10 times. Let X be the number of heads. Compute E[X] and Var(X).
Solution. X = Σ Xᵢ where Xᵢ ∈ {0,1}. E[Xᵢ] = 0.5 → E[X] = 5. Xᵢ independent: Var(Xᵢ) = 0.5 · 0.5 = 0.25 → Var(X) = 10 · 0.25 = 2.5.
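A quick simulation agrees:

```python
import numpy as np

flips = np.random.default_rng(0).integers(0, 2, size=(100_000, 10))
heads = flips.sum(axis=1)             # X for each of 100k experiments
print(heads.mean(), heads.var())      # ≈ 5.0 and ≈ 2.5
```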
E-10 · Lagrange multiplier
Minimize f(x, y) = x² + y² subject to x + y = 1.
Solution. Lagrangian: L = x² + y² − λ(x + y − 1). Set ∂L/∂x = 2x − λ = 0, ∂L/∂y = 2y − λ = 0 → x = y. Constraint: 2x = 1 → x = y = 1/2, minimum value = 1/2.
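A numeric cross-check with SciPy's constrained minimizer (SciPy selects the SLSQP method automatically when equality constraints are supplied):

```python
from scipy.optimize import minimize

res = minimize(
    lambda v: v[0] ** 2 + v[1] ** 2,                           # objective
    x0=[0.0, 0.0],
    constraints={"type": "eq", "fun": lambda v: v[0] + v[1] - 1},
)
print(res.x, res.fun)   # ≈ [0.5 0.5] and 0.5
```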