Problem sets by topic
Original USAAIO-style theory questions and short coding tasks, organized by topic. Use these between mock contests to drill a specific weakness. For the actual past problems, always cross-reference usaaio.org.
Math
M-1 · Linear algebra
Let A be a 3×3 matrix with eigenvalues 1, 2, and 4. What is det(A)? What is trace(A)?
Answer: det(A) = 1 · 2 · 4 = 8; trace(A) = 1 + 2 + 4 = 7. The determinant equals the product of eigenvalues and the trace equals the sum.
M-2 · PCA
You compute the covariance matrix of a centered dataset and find eigenvalues 12, 6, 1.5, 0.5. What fraction of the total variance is captured by the top 2 principal components?
Answer: (12 + 6) / (12 + 6 + 1.5 + 0.5) = 18 / 20 = 0.9, i.e. 90%.
M-3 · Probability
A disease occurs in 1% of a population. A test is 95% sensitive (true-positive rate) and 90% specific (true-negative rate). A randomly selected person tests positive. What is the probability that they actually have the disease?
Answer: Use Bayes' rule.
P(D | +) = P(+ | D) P(D) / P(+).
Numerator = 0.95 × 0.01 = 0.0095.
P(+) = 0.95 × 0.01 + 0.10 × 0.99 = 0.0095 + 0.099 = 0.1085.
P(D | +) ≈ 0.0876, i.e. about 8.8%.
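A quick numerical check of the calculation (variable names are illustrative):

p_disease = 0.01
sensitivity = 0.95            # P(+ | D)
false_positive_rate = 0.10    # 1 - specificity
p_positive = sensitivity * p_disease + false_positive_rate * (1 - p_disease)
print(sensitivity * p_disease / p_positive)   # 0.08755..., about 8.8%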
M-4 · Gradient descent
For f(x, y) = x² + 4y², compute the gradient at (1, 1). If you take one gradient-descent step with learning rate η = 0.1, what is the new point?
Answer: ∇f = (2x, 8y) = (2, 8) at (1, 1). New point: (1, 1) − 0.1 (2, 8) = (0.8, 0.2).
M-5 · Maximum likelihood
You observe n i.i.d. samples x₁, …, xₙ from a Bernoulli(p) distribution. What is the maximum-likelihood estimate of p?
Answer: p̂ = (1/n) Σ xᵢ — the sample mean. Derive by maximizing the log-likelihood Σ xᵢ log p + (1 − xᵢ) log(1 − p).
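A quick simulation check (the true p = 0.3 and the sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100_000)   # i.i.d. Bernoulli(0.3) draws
print(x.mean())                          # the MLE; close to 0.3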
Python / data wrangling
P-1 · NumPy: row-wise softmax
Implement a numerically stable row-wise softmax for a 2-D NumPy array X of shape (n, k).
import numpy as np

def softmax_rows(X):
    # subtract the per-row max for numerical stability
    X = X - X.max(axis=1, keepdims=True)
    expX = np.exp(X)
    return expX / expX.sum(axis=1, keepdims=True)
Why subtract the per-row max? Without it, exp(X) overflows for inputs above ~709.
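Usage check: these inputs would overflow a naive implementation, but the shifted version returns the same result as softmax of [0, 1, 2].

X = np.array([[1000.0, 1001.0, 1002.0]])
print(softmax_rows(X))   # [[0.0900 0.2447 0.6652]]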
P-2 · pandas: group-and-rank
Given a DataFrame df with columns store, product, sales, compute the top 3 products by total sales within each store.
top3 = (
    df.groupby(["store", "product"], as_index=False)["sales"]
    .sum()                                                      # total sales per (store, product)
    .sort_values(["store", "sales"], ascending=[True, False])   # best sellers first within each store
    .groupby("store")
    .head(3)                                                     # keep the top 3 rows per store
)
P-3 · pandas: missing-value strategy
A dataset has a numerical column income with 8% missing values. Describe two valid strategies and one strategy that leaks information.
Valid: (1) fill missing with the training-set median, then apply the same value to validation/test. (2) Use a model (e.g. HistGradientBoostingRegressor) that natively handles NaN.
Leak: Fill with the median computed across train + validation + test combined — the fill value contains future information.
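A minimal sketch of strategy (1) without leakage, assuming a scikit-learn workflow (the downstream Ridge model is just a placeholder):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

# the imputer learns the median only from the data it is fit on,
# so validation/test rows never influence the fill value
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", Ridge()),
])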
Classical ML
ML-1 · Overfitting diagnosis
You train a random forest and observe: training accuracy = 0.99, validation accuracy = 0.72. Name three concrete fixes.
Fixes: (1) Make individual trees shallower: reduce max_depth or increase min_samples_leaf. (2) Decorrelate the trees: lower max_features per split (note that reducing n_estimators won't help, since adding more trees does not make a random forest overfit, even though each individual tree can). (3) Add more training data, or switch to gradient boosting with early stopping for stronger regularization.
ML-2 · Cross-validation pitfall
A student does the following: scaler.fit(X), then cross_val_score(model, scaler.transform(X), y, cv=5). What's wrong, and what's the fix?
Problem: The scaler was fit on all data including each fold's validation set — leak. The reported score is optimistically biased.
Fix: Wrap the scaler and model in a Pipeline: Pipeline([("scale", StandardScaler()), ("clf", model)]), then pass the pipeline to cross_val_score. Each fold then refits the scaler on that fold's training split only.
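A minimal sketch of the corrected call (the classifier is a placeholder; X and y are the student's data):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)   # the scaler is refit inside every fold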
ML-3 · Imbalanced classification
You're classifying fraudulent vs. legitimate transactions. The positive (fraud) class is 0.5% of the data. Why is accuracy a poor metric, and what should you use instead?
Why: Predicting "not fraud" for every transaction gives 99.5% accuracy and zero useful information.
Use instead: Precision, recall, F1, ROC-AUC, precision-recall AUC. Often the relevant business metric is recall at a fixed precision (or vice versa).
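For example, with scikit-learn (y_true and y_score are placeholders for the labels and the predicted fraud probabilities):

from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

y_pred = (y_score >= 0.5).astype(int)             # hard predictions at a chosen threshold
print(f1_score(y_true, y_pred))                   # threshold-dependent
print(roc_auc_score(y_true, y_score))             # ROC-AUC
print(average_precision_score(y_true, y_score))   # precision-recall AUC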
ML-4 · K-Means initialization
What problem does k-means++ initialization solve?
Answer: Random initialization can place initial centroids close together, producing poor local minima and slow convergence. K-means++ picks each new initial centroid with probability proportional to its squared distance from the nearest already-chosen centroid, spreading centroids out and dramatically improving the chance of converging to a good clustering.
Deep learning
DL-1 · Forward pass dimensions
A batch of 32 images, each 3 channels × 64 × 64, is fed to a Conv2d layer with in_channels=3, out_channels=16, kernel_size=3, padding=1, stride=1. What is the output shape?
Answer: (32, 16, 64, 64). Same-padded 3×3 convolution preserves spatial dimensions; channels go from 3 to 16; batch size unchanged.
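A quick shape check in PyTorch:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, stride=1)
x = torch.randn(32, 3, 64, 64)
print(conv(x).shape)   # torch.Size([32, 16, 64, 64])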
DL-2 · Cross-entropy loss
For a 3-class classifier, a sample has true label 1 (one-hot [0, 1, 0]) and predicted probabilities [0.2, 0.5, 0.3]. Compute the cross-entropy loss for this sample.
Answer: −log(0.5) ≈ 0.693. Cross-entropy = −Σ yᵢ log(pᵢ) = −log(p_true_class).
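Verifying with PyTorch (nn.CrossEntropyLoss expects raw logits, so the check below feeds log-probabilities to nll_loss instead):

import torch
import torch.nn.functional as F

probs = torch.tensor([[0.2, 0.5, 0.3]])
target = torch.tensor([1])
print(F.nll_loss(torch.log(probs), target).item())   # 0.6931...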
DL-3 · NaN loss
Your training loss is finite for the first 20 steps, then jumps to NaN and stays there. Name three things to check.
Check: (1) Learning rate too high: try lowering it by 10×. (2) Numerical instability: use BCEWithLogitsLoss rather than BCELoss on sigmoid outputs, and make sure nothing takes log(0). (3) Exploding gradients in a deep net: apply gradient clipping with torch.nn.utils.clip_grad_norm_.
DL-4 · Train / eval mode
Your model evaluates well during training but produces very different numbers when you save the weights and run inference on a single sample. Why?
Answer: You probably forgot to call model.eval() before inference. BatchNorm computes per-batch statistics in training mode but uses its running averages in eval mode; Dropout zeros random activations in training but is the identity at inference. A batch size of 1 with BatchNorm in training mode is especially broken.
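The fix is a single call before inference (a sketch; model and sample are placeholders):

import torch

model.eval()                             # BatchNorm uses running stats, Dropout is disabled
with torch.no_grad():
    preds = model(sample.unsqueeze(0))   # add a batch dimension for the single sample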
Transformers
T-1 · Attention complexity
For an input sequence of length n and model dimension d, what is the time and memory complexity of standard scaled dot-product self-attention (single head)?
Answer: Time = O(n² · d); memory = O(n²) for the attention matrix plus O(n · d) for Q, K, V. The quadratic-in-n cost is why long-context models use sparse / linear / flash variants.
T-2 · Why scale by √d_k?
The attention formula is softmax(Q Kᵀ / √dₖ) V. Why divide by √dₖ?
Answer: Q · K is a sum of dₖ products of (assumed standard normal) random variables, so its variance grows as dₖ and its standard deviation as √dₖ. Without scaling, the softmax inputs grow with dₖ, pushing the softmax into a saturated regime where one entry is ~1 and all others ~0 — gradients vanish. Dividing by √dₖ keeps the variance ~1.
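A quick numerical illustration of the variance growth (d_k = 64 is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))
scores = (q * k).sum(axis=1)              # 10,000 unscaled dot products
print(scores.var())                       # ~64, i.e. ~d_k
print((scores / np.sqrt(d_k)).var())      # ~1 after scaling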
T-3 · Causal mask
For a decoder-only language model, what shape and pattern should the attention mask have, and how is it applied?
Answer: Shape (n, n). Lower-triangular (including the diagonal) of 1s; upper triangle of 0s. Applied by setting masked positions in the pre-softmax score matrix to −∞ (in practice, a large negative number) so the softmax assigns them zero weight. This prevents each position from attending to future positions.
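A sketch of building and applying the mask in PyTorch (scores stands for the pre-softmax Q Kᵀ / √dₖ matrix):

import torch

n = scores.size(-1)
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))   # 1s on and below the diagonal
masked = scores.masked_fill(~causal, float("-inf"))       # block attention to future positions
weights = torch.softmax(masked, dim=-1)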
T-4 · Positional encoding
Why is positional encoding necessary in a transformer?
Answer: Self-attention is permutation-equivariant — shuffling the input tokens shuffles the outputs in the same way. The model has no notion of token order. Positional encoding (sinusoidal, learned, or rotary) injects position information into the token embeddings so the model can distinguish "cat sat on mat" from "mat sat on cat."
More math drills
M-6 · Matrix multiplication shape
Matrix A is 4×3, matrix B is 3×5. Shape of A B? Of Bᵀ Aᵀ?
Answer: AB is 4×5; Bᵀ Aᵀ is 5×4 (= (AB)ᵀ).
M-7 · Variance of a sum
X and Y are independent with Var(X) = 4, Var(Y) = 9. Compute Var(2X − 3Y).
Answer: Var(2X − 3Y) = 4 · Var(X) + 9 · Var(Y) = 16 + 81 = 97.
M-8 · Gradient of squared norm
For f(w) = ‖Xw − y‖², compute ∇_w f.
Answer: ∇_w f = 2 Xᵀ (Xw − y). Setting to zero gives the normal equations Xᵀ X w = Xᵀ y.
M-9 · Eigenvalues of a 2×2
For A = [[4, 1], [2, 3]], find the eigenvalues.
Answer: λ = 2, 5. det(A − λI) = (4 − λ)(3 − λ) − 2 = λ² − 7λ + 10 = (λ − 2)(λ − 5).
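A quick NumPy check:

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])
print(np.linalg.eigvals(A))   # 5 and 2 (order not guaranteed)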
M-10 · Convex check
Is f(x) = x⁴ − 6x² + 5 convex on all of ℝ?
Answer: No. f''(x) = 12x² − 12 is negative for |x| < 1, so f is concave on that interval.
More Python / data drills
P-4 · NumPy: stable log-sum-exp
Write a numerically stable log-sum-exp for a 1-D NumPy array.
def lse(x):
    # shift by the max so np.exp never overflows
    m = x.max()
    return m + np.log(np.exp(x - m).sum())
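Usage check against SciPy's reference implementation (assumes SciPy is available):

import numpy as np
from scipy.special import logsumexp

x = np.array([1000.0, 1000.0])
print(lse(x), logsumexp(x))   # both 1000.693...; np.log(np.exp(x).sum()) would overflow to inf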
P-5 · Pandas: merge with indicator
Given DataFrames orders and customers joined on customer_id, count how many orders have no matching customer.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
n_orphan = (merged["_merge"] == "left_only").sum()
P-6 · NumPy: distance matrix without loops
Given X shape (n, d), produce the n×n matrix of pairwise Euclidean distances using broadcasting.
diff = X[:, None, :] - X[None, :, :] # (n, n, d)
D = np.sqrt((diff ** 2).sum(axis=-1)) # (n, n)
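Sanity check against SciPy (assumes SciPy is available):

from scipy.spatial.distance import cdist

assert np.allclose(D, cdist(X, X))   # cdist defaults to Euclidean distance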
P-7 · Reproducibility
How do you make an sklearn pipeline reproducible across runs?
Answer: Pass a fixed random_state (e.g. 42) to train_test_split, to StratifiedKFold, and to every estimator that exposes random_state (forests, boosters, MLPs). For PyTorch, also call torch.manual_seed and enable the deterministic flags.
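A typical seeding block for the PyTorch side (the cuDNN flags only matter on GPU):

import torch

torch.manual_seed(42)
torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable nondeterministic autotuning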
More classical ML drills
ML-5 · Bias-variance tradeoff
A simple model gives high train error and high test error. A complex model gives low train error and high test error. Name each regime and one fix.
Underfitting: add features, increase capacity, reduce regularization. Overfitting: add data, increase regularization, reduce capacity, use ensembling / dropout / early stopping.
ML-6 · ROC vs PR curve
Heavily imbalanced classification (1% positives). Compare models by ROC-AUC or PR-AUC?
Answer: PR-AUC. With 99% negatives, the false-positive rate's denominator is enormous, so ROC-AUC can look high even when the model raises many false alarms; the precision-recall curve evaluates performance on the rare positive class directly.
ML-7 · Target leakage
You're predicting loan default. A feature days_since_late_payment is suspiciously predictive. Concern?
Answer: The feature may only exist after default. The model learns a future-leaking signal that isn't there at prediction time. Drop features computable only after the label is known; build all features with a strict as-of timestamp.
ML-8 · Hyperparameter search
Grid vs random vs Bayesian search.
Grid: exhaustive over a discrete grid; cost is the product of the per-dimension sizes. Random: samples each hyperparameter independently from a defined distribution; often beats grid because most hyperparameters aren't equally important. Bayesian (e.g. Optuna): builds a surrogate model of the loss surface and samples promising regions; best when each trial is expensive.
More deep learning drills
DL-5 · Parameter count for an MLP
An MLP has layers 784 → 256 → 128 → 10 (with biases). Total parameters?
Answer: (784·256 + 256) + (256·128 + 128) + (128·10 + 10) = 200,960 + 32,896 + 1,290 = 235,146.
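Verifying the count in PyTorch:

import torch.nn as nn

mlp = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(),
                    nn.Linear(128, 10))
print(sum(p.numel() for p in mlp.parameters()))   # 235146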
DL-6 · BatchNorm at inference
You train with BatchNorm at batch size 128, then deploy at batch size 1. Inference is wildly off. Why?
Answer: You must call model.eval() so BatchNorm uses the running mean and variance accumulated during training. In training mode with batch size 1, BatchNorm normalizes each sample against its own statistics, which makes the outputs degenerate.
DL-7 · Receptive field
Three stacked 3×3 convolutions, stride 1, padding 1. Effective receptive field?
Answer: 7×7. Each 3×3 adds 2: 3 → 5 → 7.
DL-8 · Optimizer choice
SGD + momentum vs Adam — when to prefer which?
SGD + momentum: large-scale image training with a well-tuned LR schedule; often beats Adam on ImageNet-class benchmarks. Adam / AdamW: default for most other workloads — sparse gradients, NLP / embeddings, smaller tuning budgets. AdamW is the modern default for transformers.
More transformer drills
T-5 · Multi-head head size
d_model = 768, n_heads = 12. Per-head dimension?
Answer: d_head = 768 / 12 = 64.
T-6 · KV cache
Why cache K and V during autoregressive generation, and what does this trade?
Answer: At generation step t you only need Q for the new token; K and V for all prior tokens are reused. Recomputing attention over the full prefix at every step costs O(n² · d) per step; caching K and V brings each step down to O(n · d), at the cost of O(n · d · L) memory across L layers. The KV cache is the central memory bottleneck of long-context inference.
T-7 · LayerNorm vs BatchNorm in transformers
Why LayerNorm rather than BatchNorm in transformers?
Answer: BatchNorm depends on batch statistics: variable sequence lengths and padded positions break it; training-vs-inference behavior diverges; small batches hurt. LayerNorm normalizes across the feature dimension within a single token, so it's batch-independent.
T-8 · Pre-norm vs post-norm
Original transformer used post-norm; modern transformers use pre-norm. Why?
Answer: Pre-norm yields more stable gradient flow at depth (the residual path is clean), letting you train deeper stacks without LR-warmup tricks. Post-norm with deep stacks often needs careful warmup and can diverge.
Next
- Mock contests — Four mini-mocks: 2 theory + 2 coding.
- After every miss: log it, identify the gap, drill that topic this week.