Original mock contests
Four original USAAIO-style mocks: two theory mini-mocks (10 short-answer questions, 45 min each) and two coding mini-mocks (one end-to-end ML task each on a synthetic dataset, 90 min each). Use these between major Kaggle attempts.
Mock 1 · Theory mini-mock (10 questions, 45 min)
- Linear algebra. For matrix A = [[2, 1], [1, 2]], find both eigenvalues.
- Probability. Two fair dice are rolled. What is the probability that the sum is at least 10?
- Calculus. For f(x, y) = x²y + 3y, compute ∂f/∂y at (2, 1).
- Statistics. A sample has values 4, 8, 10, 14, 14. Compute the median and the (population) standard deviation.
- NumPy / ML. Given a 2-D array X of shape (100, 5), write one line of NumPy that standardizes each column (zero mean, unit variance).
- Classical ML. A model has 95% training accuracy and 70% validation accuracy. Is this overfitting or underfitting? Name two interventions.
- Loss functions. Why use BCEWithLogitsLoss instead of BCELoss applied to a sigmoid output?
- Deep learning. A Conv2d with in_channels=16, out_channels=32, kernel_size=5, no bias. How many parameters?
- Transformers. A self-attention layer has model dimension d_model = 512 and n_heads = 8. What is the dimension per head?
- Optimization. Name one situation where you'd prefer SGD with momentum over AdamW.
Answer key
- Eigenvalues: λ = 1 and λ = 3. (Characteristic polynomial: (2 − λ)² − 1 = 0.)
- Outcomes with sum ≥ 10: (4,6), (5,5), (5,6), (6,4), (6,5), (6,6) → 6 out of 36 → 1/6.
- ∂f/∂y = x² + 3 = 4 + 3 = 7.
- Median = 10. Mean = 10. Squared deviations: 36, 4, 0, 16, 16 → sum 72 → variance 14.4 → SD ≈ 3.79.
- (X - X.mean(axis=0)) / X.std(axis=0).
- Overfitting. Interventions: more regularization (weight decay, dropout, smaller model); more training data; data augmentation; early stopping.
- Numerical stability: BCEWithLogitsLoss uses the log-sum-exp trick to handle very large or very small logits without intermediate overflow.
- 16 × 32 × 5 × 5 = 12 800 parameters.
- 512 / 8 = 64 dimensions per head.
- Large-scale vision training (e.g. ImageNet ResNet) with a well-tuned learning-rate schedule; SGD+momentum often generalizes slightly better than Adam in this regime.
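If you want to verify the numerical answers mechanically, a short script like the following (a sketch, assuming NumPy and PyTorch are installed; variable names are illustrative) reproduces the linear-algebra, statistics, NumPy, and Conv2d answers:

import numpy as np
import torch

# Eigenvalues of the symmetric matrix [[2, 1], [1, 2]]
print(np.linalg.eigvalsh(np.array([[2.0, 1.0], [1.0, 2.0]])))  # [1. 3.]

# Median and population standard deviation (np.std defaults to ddof=0)
s = np.array([4, 8, 10, 14, 14])
print(np.median(s), np.std(s))  # 10.0, ~3.7947

# Column-wise standardization of a (100, 5) array
X = np.random.default_rng(0).normal(size=(100, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# Parameter count of a bias-free Conv2d(16, 32, kernel_size=5)
conv = torch.nn.Conv2d(16, 32, kernel_size=5, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 12800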
Mock 2 · Coding mini-mock (90 min)
Task: "Synthetic species classification"
You are given a small tabular dataset with 8 numerical features and a 3-class label. The training set has 2 000 rows; the held-out test set has 500 rows. Your task is to produce a notebook that trains a model and outputs predictions for the test set, maximizing macro-F1.
Synthetic data generator (use this to create your dataset locally)
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
n_samples=2500, n_features=8, n_informative=5, n_redundant=2,
n_classes=3, class_sep=1.2, weights=[0.5, 0.3, 0.2], random_state=42,
)
feature_cols = [f"f{i}" for i in range(8)]
df = pd.DataFrame(X, columns=feature_cols)
df["label"] = y
train_df, test_df = train_test_split(df, test_size=500, random_state=0, stratify=df["label"])
train_df.to_csv("train.csv", index=False)
test_df.drop(columns=["label"]).to_csv("test.csv", index=False)
test_df[["label"]].to_csv("test_labels.csv", index=False) # for your own scoring
Deliverables
- A single notebook solution.ipynb that runs end-to-end.
- A predictions.csv with one prediction per row of the test set (header: label).
- A 3-sentence write-up of your approach.
Scoring rubric
| Macro-F1 achieved | Score (out of 100) |
|---|---|
| < 0.55 | 0 |
| 0.55 – 0.65 | 40 |
| 0.65 – 0.75 | 70 |
| 0.75 – 0.82 | 90 |
| ≥ 0.82 | 100 |
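To score yourself, a minimal snippet like this (assuming you kept the test_labels.csv written by the generator) computes the macro-F1 the rubric keys on:

import pandas as pd
from sklearn.metrics import f1_score

y_true = pd.read_csv("test_labels.csv")["label"]
y_pred = pd.read_csv("predictions.csv")["label"]
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))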
Reference baseline (open after your attempt)
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
X = train.drop(columns=["label"])
y = train["label"]
# scale (HGB doesn't need it, but it doesn't hurt)
scaler = StandardScaler().fit(X)
X_s = scaler.transform(X)
test_s = scaler.transform(test)
model = HistGradientBoostingClassifier(
max_depth=6, learning_rate=0.05, max_iter=400, random_state=42,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_s, y, cv=cv, scoring="f1_macro")
print("CV macro-F1:", scores.mean(), "±", scores.std())
model.fit(X_s, y)
preds = model.predict(test_s)
pd.DataFrame({"label": preds}).to_csv("predictions.csv", index=False)
Reference baseline typically scores around macro-F1 ≈ 0.78–0.82 on this generator with the default seed. Beating it needs careful CV-driven hyperparameter tuning or a small ensemble (HGB + logistic regression + MLP averaged).
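If you try the ensemble route, one possible sketch (assuming scikit-learn; "soft" voting averages the three models' predicted class probabilities, and the scalers live inside pipelines so only the models that need scaling get it):

from sklearn.ensemble import HistGradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ensemble = VotingClassifier(
    estimators=[
        ("hgb", HistGradientBoostingClassifier(random_state=42)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("mlp", make_pipeline(StandardScaler(),
                              MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=42))),
    ],
    voting="soft",
)
# Fit on the same X, y as the baseline above, then predict the test set.
ensemble.fit(X, y)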
Tips during the mock
- Build a baseline first. A 5-line logistic regression takes 60 seconds and grounds your CV scores (see the sketch after these tips).
- Trust cross-validation, not the single train/val split. Use 5-fold stratified.
- Track the seed. Reproducibility is part of the grade: pin random_state on every model and splitter.
- Last 10 minutes: stop tuning, write the write-up, save the notebook, verify it runs from a clean kernel.
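The kind of 60-second baseline the first tip has in mind, as a sketch (assuming the train.csv from the generator above):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["label"]), train["label"]
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("baseline macro-F1:", cross_val_score(baseline, X, y, cv=5, scoring="f1_macro").mean())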
Mock 3 · Theory mini-mock #2 (10 questions, 45 min)
- Linear algebra. A matrix A is 4×4 with rank 2. What is the dimension of its null space?
- Probability. You flip a fair coin 5 times. What is the probability of getting exactly 3 heads?
- Statistics. A normal distribution has mean 50 and standard deviation 8. Roughly what fraction of samples lie in [42, 58]?
- Information theory. A discrete random variable takes 4 values uniformly. What is its entropy in bits?
- NumPy. Given a 1-D array x, write a one-liner that returns the indices of its top 3 largest values, in descending order of value.
- Classical ML. Why does adding more trees to a Random Forest typically stop helping after some point, while adding more trees to a boosted ensemble can keep helping (or hurt)?
- Regularization. What is the practical difference between L1 and L2 regularization on the resulting weights?
- Deep learning. A network's training loss is going down but validation loss is going up. Name three plausible interventions.
- Transformers. Why is the softmax in scaled dot-product attention divided by √dk?
- Reproducibility. List three sources of nondeterminism in a PyTorch training run and how to control each.
Answer key
- Null space dimension = 4 − rank = 4 − 2 = 2. (Rank-nullity theorem.)
- C(5,3) · (1/2)⁵ = 10/32 = 5/16 ≈ 0.3125.
- ±1σ window → ≈ 68% (empirical 68–95–99.7 rule).
- H = log₂(4) = 2 bits.
- np.argsort(x)[-3:][::-1].
- Random Forest: trees are trained independently on bootstrap samples, so averaging mainly reduces variance, and that benefit plateaus. Boosting: each tree is fit to the residual of the prior ensemble, so it can keep adding signal but also keep overfitting if you go past the validation optimum.
- L1 drives many weights to exactly zero (sparse solution / built-in feature selection). L2 shrinks all weights toward zero but rarely to exactly zero (smoother solution).
- Add dropout / weight decay; reduce model size; add data augmentation; collect more training data; early stopping; lower learning rate.
- Without the scaling, dot products grow with √dk, pushing the softmax into saturation (one entry near 1, others near 0). The division keeps gradients well-conditioned during training.
- (1) Weight initialization → set torch.manual_seed; (2) DataLoader shuffle order → set generator + worker seeds; (3) cuDNN nondeterministic kernels → set torch.backends.cudnn.deterministic = True and disable benchmark mode.
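As a concrete version of the last answer, a seed-pinning preamble might look like this (a sketch; seed_everything is an illustrative helper name, not a library function):

import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # (1) weight initialization and any other torch/NumPy/Python RNG usage
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # (3) force deterministic cuDNN kernels, disable autotuned (nondeterministic) ones
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# (2) pin the DataLoader shuffle order with an explicit generator
g = torch.Generator().manual_seed(42)
# loader = torch.utils.data.DataLoader(dataset, shuffle=True, generator=g)
# plus per-worker seeding via worker_init_fn if num_workers > 0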
Mock 4 · Coding mini-mock #2 (90 min) — tiny regression
A second end-to-end task, this time supervised regression instead of classification.
Use the generator below to materialize train.csv and test.csv, then build a model
and produce predictions.csv. Target metric: R² on the held-out test set.
Synthetic data generator
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(
n_samples=2000, n_features=10, n_informative=6, noise=4.0,
bias=10.0, random_state=11,
)
# Inject a non-linear feature interaction and one outlier batch
y = y + 0.5 * X[:, 0] * X[:, 1]
rng = np.random.default_rng(11)
outliers = rng.choice(len(y), size=30, replace=False)
y[outliers] += rng.normal(0, 40, size=30)
cols = [f"f{i}" for i in range(10)]
df = pd.DataFrame(X, columns=cols)
df["target"] = y
train_df, test_df = train_test_split(df, test_size=400, random_state=0)
train_df.to_csv("train.csv", index=False)
test_df.drop(columns=["target"]).to_csv("test.csv", index=False)
test_df[["target"]].to_csv("test_labels.csv", index=False)
Deliverables
- A single notebook regression.ipynb that runs end-to-end.
- A predictions.csv with one prediction per row of the test set (header: target).
- A 3-sentence note explaining what feature interactions or outlier handling you used.
Scoring rubric
| Test R² achieved | Score (out of 100) |
|---|---|
| < 0.50 | 0 |
| 0.50 – 0.70 | 40 |
| 0.70 – 0.85 | 70 |
| 0.85 – 0.92 | 90 |
| ≥ 0.92 | 100 |
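Self-scoring works the same way as in Mock 2, just with R² (assuming you kept test_labels.csv):

import pandas as pd
from sklearn.metrics import r2_score

y_true = pd.read_csv("test_labels.csv")["target"]
y_pred = pd.read_csv("predictions.csv")["target"]
print("test R²:", r2_score(y_true, y_pred))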
What this mock tests
- Whether you spot the f0 · f1 interaction without being told: basic EDA (pairwise scatter plots, or correlating pairwise feature products with the residuals of a linear fit; see the sketch after this list) gives it away.
- Whether you handle the injected outliers (try Huber regression or a tree model that's robust to them; don't just MSE-fit a linear model).
- Whether you produce sane CV scores rather than overfitting to one train/val split.
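One way to run that residual check, as a sketch (assuming the train.csv from the generator above): fit a plain linear model, then correlate every pairwise feature product with its residuals; the f0·f1 product should stand out.

from itertools import combinations
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["target"]), train["target"]

# Residuals of a plain linear fit contain exactly the structure the model missed.
resid = y - LinearRegression().fit(X, y).predict(X)

# Correlate each pairwise product with the residuals; f0*f1 should dominate.
corrs = {
    f"{a}*{b}": (X[a] * X[b]).corr(resid)
    for a, b in combinations(X.columns, 2)
}
print(sorted(corrs.items(), key=lambda kv: -abs(kv[1]))[:5])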
Reference baseline (open after your attempt)
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
X = train.drop(columns=["target"])
y = train["target"]
# Add the explicit interaction so a tree model doesn't need to learn it from scratch
X["f0_f1"] = X["f0"] * X["f1"]
test["f0_f1"] = test["f0"] * test["f1"]
model = HistGradientBoostingRegressor(
max_depth=6, learning_rate=0.05, max_iter=600, loss="absolute_error",
random_state=42,
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("CV R²:", scores.mean(), "±", scores.std())
model.fit(X, y)
pd.DataFrame({"target": model.predict(test)}).to_csv("predictions.csv", index=False)
With loss="absolute_error" (robust to outliers) and the explicit interaction term, this baseline typically clears R² ≈ 0.88–0.92 on the default seed. Pure squared-error baselines without interaction handling tend to land near R² ≈ 0.75.
After each mock
- Score honestly using the rubric.
- For each missed theory item: re-derive without looking, then read the explanation.
- For the coding mock: identify your single biggest score-leak (was it feature engineering? model choice? CV?), and drill that next week.
- Log everything into the same error log you use for problem sets.