The math you actually need
Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.
Linear algebra
- Vectors and norms. Dot product, projection, L1 / L2 / L∞ norms. Geometric intuition matters: a dot product is "how much one vector points in the direction of another."
- Matrices as linear maps. Matrix-vector multiplication is "apply this transformation." Rotation, scaling, projection are all matrices.
- Matrix multiplication. Composing transformations. Know that (AB)x = A(Bx) and why matrix multiplication is not commutative.
- Rank, null space, column space. What information a matrix preserves vs. destroys.
- Eigenvalues & eigenvectors. The directions a matrix scales without rotating. Foundation for PCA and spectral methods (checked numerically in the sketch after this list).
- Singular value decomposition (SVD). The factorization you'll keep meeting in dimensionality reduction, recommender systems, and low-rank approximation.
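A minimal NumPy sketch (illustrative values, not part of the curriculum) that checks two of the claims above: the residual of a projection is orthogonal to the direction you project onto, and an eigenvector is only scaled by its matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dot product as "how much a points along b": project a onto b,
# then check the residual carries no component along b.
a, b = rng.standard_normal(3), rng.standard_normal(3)
proj = (a @ b) / (b @ b) * b
print(np.isclose((a - proj) @ b, 0.0))    # True

# Eigenvectors: directions a matrix scales without rotating.
S = rng.standard_normal((4, 4))
S = S @ S.T                               # symmetrize: real eigenpairs
vals, vecs = np.linalg.eigh(S)
v = vecs[:, -1]                           # eigenvector of the top eigenvalue
print(np.allclose(S @ v, vals[-1] * v))   # True: S v = λ v
```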
Probability & statistics
- Random variables, expectation, variance. The two summary statistics that drive almost every loss function.
- Common distributions. Bernoulli, binomial, Gaussian, uniform, Poisson. Recognize each from its PMF (discrete) or PDF (continuous).
- Joint, marginal, conditional probability. P(A, B) = P(A | B) P(B). Bayes' rule.
- Independence vs. correlation. Independence implies zero correlation; the converse fails in general (jointly Gaussian variables are the exception).
- Maximum likelihood estimation (MLE). The probabilistic justification behind least-squares regression and cross-entropy classification.
- Central limit theorem. Why means of many independent samples with finite variance tend toward Gaussian, and why we trust mean estimates (simulated in the sketch after this list).
- Hypothesis testing basics. p-values, confidence intervals — useful for evaluating whether model A really beats model B.
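A quick simulation (a sketch, not a proof) of the central limit theorem: the mean of n Uniform(0, 1) draws has standard deviation √(1/(12n)), shrinking like 1/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Std of the mean of n uniform draws vs. the CLT prediction sqrt(1/(12 n)).
for n in (1, 10, 100, 1000):
    means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical std = {means.std():.4f}  "
          f"theory = {np.sqrt(1 / (12 * n)):.4f}")
```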
Multivariable calculus
- Partial derivatives. ∂f/∂xᵢ — "how does f change if I wiggle xᵢ alone."
- Gradient. The vector of all partials. Points in the direction of steepest increase (checked against finite differences in the sketch after this list).
- Chain rule. The single most important calculus fact for ML — backpropagation is just the chain rule applied repeatedly.
- Jacobian & Hessian. Matrix of partials (Jacobian) and second-order partials (Hessian). The Hessian's eigenvalues tell you about local curvature and convexity.
- Taylor expansion. Approximate a function locally by a first- or second-order polynomial built from its derivatives. Underlies Newton's method.
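Whenever you derive a gradient by hand, you can sanity-check it with central finite differences. A minimal sketch, using the same f(x, y) = x² + 4y² that appears in Exercise E-4:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 4 * y**2

def grad_f(v):
    x, y = v
    return np.array([2 * x, 8 * y])   # analytic gradient

def numeric_grad(f, v, eps=1e-6):
    # Central difference: (f(v + eps e_i) - f(v - eps e_i)) / (2 eps).
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

v = np.array([1.0, 1.0])
print(grad_f(v), numeric_grad(f, v))   # both ≈ [2. 8.]
```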
Convex optimization
- Convex sets and functions. A function is convex if every chord lies on or above the graph. For a convex problem every local minimum is global; strictly convex objectives have a unique minimizer.
- Convex losses. Squared error, logistic loss, hinge loss, cross-entropy — all convex in the linear-model setting.
- Lagrange multipliers & KKT conditions. Handle constraints. Show up in SVMs explicitly.
- Gradient descent & variants. SGD, momentum, Adam. Know the update rules from memory (sketched after this list).
- Convergence intuition. Learning rate too high → divergence. Too low → slow crawl. Adam mitigates both by adapting per-parameter rates.
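The update rules, written out as plain NumPy functions. These follow the common textbook conventions; hyperparameter names and defaults are illustrative, and libraries differ in details such as where the learning rate enters the momentum update:

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                 # velocity: decaying sum of gradients
    return w - lr * v, v

def adam_step(w, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * g**2     # second-moment (uncentered) estimate
    m_hat = m / (1 - b1**t)          # bias correction, t = step count from 1
    s_hat = s / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```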
Drills (hands-on)
- Implement matrix multiplication, transpose, and dot product from scratch in NumPy without built-ins. Verify against np.matmul.
- Compute PCA by hand on a 5-point 2-D dataset: center, covariance, eigendecomposition, project.
- Derive the gradient of MSE loss for linear regression. Verify with PyTorch autograd.
- For a 2-layer MLP with one hidden ReLU and a sigmoid output, write out the four partial derivatives needed for backprop on a single sample.
- Run gradient descent on a 1-D convex function (e.g. f(x) = (x − 3)²) and plot the trajectory for three learning rates; a starter sketch follows.
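A starter sketch for the last drill. The learning rates are chosen so that one run crawls, one converges quickly, and one diverges (for f(x) = (x − 3)², any η > 1 makes the iteration diverge):

```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda x: (x - 3) ** 2
df = lambda x: 2 * (x - 3)          # gradient of f

for lr in (0.05, 0.4, 1.05):        # slow crawl, fast, divergent
    x, traj = 0.0, [0.0]
    for _ in range(20):
        x -= lr * df(x)             # gradient-descent update
        traj.append(x)
    plt.plot(traj, label=f"lr={lr}")

plt.axhline(3, ls="--", c="gray")   # the minimizer x* = 3
plt.xlabel("step"); plt.ylabel("x"); plt.legend(); plt.show()
```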
Exercises with full solutions
Try each before opening the solution. The point is to derive, not just to know the answer.
E-1 · Determinant from eigenvalues
A 3×3 matrix has eigenvalues 1, 2, 4. What are det(A) and trace(A)?
Solution. Determinant = product of eigenvalues = 1 · 2 · 4 = 8. Trace = sum of eigenvalues = 1 + 2 + 4 = 7.
E-2 · Bayes' rule
Disease prevalence is 1%. A test has 95% sensitivity and 90% specificity. Given a positive test, find P(disease | +).
Solution. P(+ | D) P(D) = 0.95 · 0.01 = 0.0095.
P(+) = P(+ | D) P(D) + P(+ | ¬D) P(¬D) = 0.0095 + 0.10 · 0.99 = 0.1085.
P(D | +) = 0.0095 / 0.1085 ≈ 0.0876. About 8.8%, counter-intuitively low because the base rate is low.
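A two-line numeric check of the arithmetic:

```python
prev, sens, spec = 0.01, 0.95, 0.90
p_pos = sens * prev + (1 - spec) * (1 - prev)   # law of total probability
print(sens * prev / p_pos)                      # 0.0875576..., about 8.8%
```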
E-3 · PCA fraction of variance
Covariance eigenvalues are 12, 6, 1.5, 0.5. What fraction of variance is captured by the top 2 PCs?
Solution. (12 + 6) / (12 + 6 + 1.5 + 0.5) = 18 / 20 = 0.90 → 90%.
E-4 · Gradient at a point
For f(x, y) = x² + 4y², compute ∇f at (1, 1). One gradient-descent step at η = 0.1 lands you where?
Solution. ∇f = (2x, 8y) = (2, 8). New point: (1, 1) − 0.1 · (2, 8) = (0.8, 0.2).
E-5 · Bernoulli MLE
n i.i.d. observations x₁, …, xₙ from Bernoulli(p). Derive the MLE of p.
Solution. Log-likelihood: ℓ(p) = (Σ xᵢ) log p + (n − Σ xᵢ) log(1 − p). Differentiate: dℓ/dp = (Σ xᵢ)/p − (n − Σ xᵢ)/(1 − p) = 0. Solving gives p̂ = (1/n) Σ xᵢ — the sample mean.
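A numeric sanity check (a sketch, with an arbitrary true p = 0.3): maximize the log-likelihood over a grid and compare against the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)   # Bernoulli(0.3) samples

def loglik(p):
    k = x.sum()
    return k * np.log(p) + (len(x) - k) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
print(grid[np.argmax(loglik(grid))], x.mean())   # both ≈ 0.3
```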
E-6 · Chain rule on a tiny MLP
One-hidden-layer net: z = w x + b, h = max(0, z), ŷ = v h + c, L = (ŷ − y)². Compute ∂L/∂w.
Solution. ∂L/∂ŷ = 2(ŷ − y). ∂ŷ/∂h = v. ∂h/∂z = 1 if z > 0 else 0. ∂z/∂w = x. Chain: ∂L/∂w = 2(ŷ − y) · v · 1[z > 0] · x.
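You can confirm the hand-derived chain against PyTorch autograd with arbitrary scalar values (the numbers below are illustrative):

```python
import torch

x, y = torch.tensor(2.0), torch.tensor(1.0)
w = torch.tensor(0.5, requires_grad=True)
b, v, c = torch.tensor(0.1), torch.tensor(-1.5), torch.tensor(0.3)

z = w * x + b
h = torch.relu(z)
y_hat = v * h + c
loss = (y_hat - y) ** 2
loss.backward()

manual = (2 * (y_hat - y) * v * float(z > 0) * x).item()  # chain-rule formula
print(w.grad.item(), manual)                              # identical
```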
E-7 · Convex combinations and Jensen
Show that for convex f and weights w₁ + w₂ = 1 (both ≥ 0), f(w₁ x + w₂ y) ≤ w₁ f(x) + w₂ f(y).
Solution. This is the chord condition made explicit. Parameterize the chord from (x, f(x)) to (y, f(y)) by t ∈ [0, 1]: its height above the point (1 − t)x + ty is (1 − t)f(x) + tf(y). Taking t = w₂ (so 1 − t = w₁), the chord height at w₁x + w₂y is w₁f(x) + w₂f(y), and convexity says the function value there lies on or below the chord: f(w₁x + w₂y) ≤ w₁f(x) + w₂f(y).
E-8 · SVD shape
X is a 100×5 data matrix. What are the shapes of U, Σ, Vᵀ in its (thin) SVD?
Solution. Thin SVD: U is 100×5, Σ is 5×5 diagonal, Vᵀ is 5×5. Then X = U Σ Vᵀ.
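Checking the shapes directly (note that np.linalg.svd returns the singular values as a vector, so rebuild Σ with np.diag):

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((100, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD
print(U.shape, s.shape, Vt.shape)                  # (100, 5) (5,) (5, 5)
print(np.allclose(X, U @ np.diag(s) @ Vt))         # True: X = U Σ Vᵀ
```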
E-9 · Linearity of expectation
Flip a fair coin 10 times. Let X be the number of heads. Compute E[X] and Var(X).
Solution. X = Σ Xᵢ where Xᵢ ∈ {0,1}. E[Xᵢ] = 0.5 → E[X] = 5. Xᵢ independent: Var(Xᵢ) = 0.5 · 0.5 = 0.25 → Var(X) = 10 · 0.25 = 2.5.
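A quick simulation agrees:

```python
import numpy as np

flips = np.random.default_rng(0).integers(0, 2, size=(100_000, 10))
heads = flips.sum(axis=1)             # X for each of 100k experiments
print(heads.mean(), heads.var())      # ≈ 5.0 and ≈ 2.5
```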
E-10 · Lagrange multiplier
Minimize f(x, y) = x² + y² subject to x + y = 1.
Solution. Lagrangian: L = x² + y² − λ(x + y − 1). Set ∂L/∂x = 2x − λ = 0, ∂L/∂y = 2y − λ = 0 → x = y. Constraint: 2x = 1 → x = y = 1/2, minimum value = 1/2.
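A numeric cross-check with SciPy's constrained minimizer (SciPy selects the SLSQP method automatically when equality constraints are supplied):

```python
from scipy.optimize import minimize

res = minimize(
    lambda v: v[0] ** 2 + v[1] ** 2,                           # objective
    x0=[0.0, 0.0],
    constraints={"type": "eq", "fun": lambda v: v[0] + v[1] - 1},
)
print(res.x, res.fun)   # ≈ [0.5 0.5] and 0.5
```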