Original mock contests

Four original USAAIO-style mocks: two theory mini-mocks (10 short-answer questions, 45 min each) and two coding mini-mocks (one end-to-end ML task on a synthetic dataset, 90 min each). Use these between major Kaggle attempts.

How to take a mock. Strict timer. Answer key closed. After the timer, score honestly, log every miss, then re-attempt failed problems with no time limit.

Mock 1 · Theory mini-mock (10 questions, 45 min)

  1. Linear algebra. For matrix A = [[2, 1], [1, 2]], find both eigenvalues.
  2. Probability. Two fair dice are rolled. What is the probability that the sum is at least 10?
  3. Calculus. For f(x, y) = x² y + 3y, compute ∂f/∂y at (2, 1).
  4. Statistics. A sample has values 4, 8, 10, 14, 14. Compute the median and the (population) standard deviation.
  5. Numpy / ML. Given a 2-D array X of shape (100, 5), write one line of NumPy that standardizes each column (zero mean, unit variance).
  6. Classical ML. A model has 95% training accuracy and 70% validation accuracy. Is this overfitting or underfitting? Name two interventions.
  7. Loss functions. Why use BCEWithLogitsLoss instead of BCELoss applied to a sigmoid output?
  8. Deep learning. A Conv2d with in_channels=16, out_channels=32, kernel_size=5, no bias. How many parameters?
  9. Transformers. A self-attention layer has model dimension d_model = 512 and n_heads = 8. What is the dimension per head?
  10. Optimization. Name one situation where you'd prefer SGD with momentum over AdamW.

Answer key

Reveal answers
  1. Eigenvalues: λ = 1 and λ = 3. (Characteristic polynomial: (2 − λ)² − 1 = 0.)
  2. Outcomes with sum ≥ 10: (4,6), (5,5), (5,6), (6,4), (6,5), (6,6) → 6 out of 36 → 1/6.
  3. ∂f/∂y = x² + 3 = 4 + 3 = 7.
  4. Median = 10. Mean = 10. Squared deviations: 36, 4, 0, 16, 16 → sum 72 → variance 14.4 → SD ≈ 3.79.
  5. (X - X.mean(axis=0)) / X.std(axis=0).
  6. Overfitting. Interventions: more regularization (weight decay, dropout, smaller model); more training data; data augmentation; early stopping.
  7. Numerical stability: BCEWithLogitsLoss uses the log-sum-exp trick to handle very large or very small logits without intermediate overflow.
  8. 16 × 32 × 5 × 5 = 12 800 parameters.
  9. 512 / 8 = 64 dimensions per head.
  10. Large-scale vision training (e.g. ImageNet ResNet) with a well-tuned learning-rate schedule — SGD+momentum often generalizes slightly better than Adam in this regime.
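
A few of these answers can be checked mechanically. The sketch below (plain NumPy, not part of the mock itself) verifies Q1, Q2, Q5, Q7, and Q8:

import numpy as np

# Q1: eigenvalues of [[2, 1], [1, 2]]
print(np.linalg.eigvals(np.array([[2.0, 1.0], [1.0, 2.0]])))  # [3. 1.]

# Q2: brute-force count of dice outcomes with sum >= 10
hits = sum(1 for a in range(1, 7) for b in range(1, 7) if a + b >= 10)
print(hits, "/ 36")  # 6 / 36 = 1/6

# Q5: the one-liner really gives zero mean, unit variance per column
X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0).round(10), Z.std(axis=0).round(10))

# Q7: for target 0 and a large logit z, the naive -log(1 - sigmoid(z)) overflows
# to inf, while the log-sum-exp form log(1 + e^z) stays finite
z = 100.0
print(-np.log(1 - 1 / (1 + np.exp(-z))))  # inf (divide-by-zero warning)
print(np.logaddexp(0.0, z))               # 100.0

# Q8: Conv2d parameters = in_channels * out_channels * kH * kW (no bias term)
print(16 * 32 * 5 * 5)  # 12 800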

Mock 2 · Coding mini-mock (90 min)

Task: "Synthetic species classification"

You are given a small tabular dataset with 8 numerical features and a 3-class label. The training set has 2 000 rows; the held-out test set has 500 rows. Your task is to produce a notebook that trains a model and outputs predictions for the test set, maximizing macro-F1.

Synthetic data generator (use this to create your dataset locally)

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 2 500 rows total: 2 000 for train.csv, 500 for test.csv; `weights` adds mild class imbalance
X, y = make_classification(
    n_samples=2500, n_features=8, n_informative=5, n_redundant=2,
    n_classes=3, class_sep=1.2, weights=[0.5, 0.3, 0.2], random_state=42,
)

feature_cols = [f"f{i}" for i in range(8)]
df = pd.DataFrame(X, columns=feature_cols)
df["label"] = y

# Stratified split keeps the class proportions identical in train and test
train_df, test_df = train_test_split(df, test_size=500, random_state=0, stratify=df["label"])
train_df.to_csv("train.csv", index=False)
test_df.drop(columns=["label"]).to_csv("test.csv", index=False)
test_df[["label"]].to_csv("test_labels.csv", index=False)  # for your own scoring

Deliverables

  1. Your notebook or script.
  2. predictions.csv with a single label column, one row per test-set row (the format the reference baseline writes).

Scoring rubric

Macro-F1 achieved    Score (out of 100)
< 0.55               0
0.55 – 0.65          40
0.65 – 0.75          70
0.75 – 0.82          90
≥ 0.82               100
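
To place an attempt on this rubric, score predictions.csv against the labels the generator wrote out. A minimal scoring helper, assuming your file uses a single label column like the reference baseline's output:

import pandas as pd
from sklearn.metrics import f1_score

# Compare your predictions against the held-out labels from the generator
y_true = pd.read_csv("test_labels.csv")["label"]
y_pred = pd.read_csv("predictions.csv")["label"]
print("Test macro-F1:", f1_score(y_true, y_pred, average="macro"))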

Reference baseline (open after your attempt)

Reveal reference baseline
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

X = train.drop(columns=["label"])
y = train["label"]

# scale (HGB doesn't need it, but it doesn't hurt)
scaler = StandardScaler().fit(X)
X_s = scaler.transform(X)
test_s = scaler.transform(test)

model = HistGradientBoostingClassifier(
    max_depth=6, learning_rate=0.05, max_iter=400, random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_s, y, cv=cv, scoring="f1_macro")
print("CV macro-F1:", scores.mean(), "±", scores.std())

model.fit(X_s, y)
preds = model.predict(test_s)
pd.DataFrame({"label": preds}).to_csv("predictions.csv", index=False)

Reference baseline typically scores around macro-F1 ≈ 0.78–0.82 on this generator with the default seed. Beating it needs careful CV-driven hyperparameter tuning or a small ensemble (HGB + logistic regression + MLP averaged).
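
A sketch of that averaging ensemble using scikit-learn's soft-voting wrapper; the hyperparameters here are illustrative placeholders, not tuned values:

import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
X, y = train.drop(columns=["label"]), train["label"]

# "soft" voting averages predicted class probabilities across the three models;
# the linear and MLP members get their own scaling step inside a pipeline
ensemble = VotingClassifier(
    estimators=[
        ("hgb", HistGradientBoostingClassifier(learning_rate=0.05, max_iter=400, random_state=42)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42))),
    ],
    voting="soft",
)
ensemble.fit(X, y)
pd.DataFrame({"label": ensemble.predict(test)}).to_csv("predictions.csv", index=False)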

Tips during the mock

  1. Get a minimal end-to-end pipeline (load → fit → write predictions.csv) running early, then iterate.
  2. Let cross-validation, not a single split, drive model and hyperparameter choices.
  3. Reserve the last few minutes to re-run the notebook top to bottom and sanity-check the output file.

Mock 3 · Theory mini-mock #2 (10 questions, 45 min)

  1. Linear algebra. A matrix A is 4×4 with rank 2. What is the dimension of its null space?
  2. Probability. You flip a fair coin 5 times. What is the probability of getting exactly 3 heads?
  3. Statistics. A normal distribution has mean 50 and standard deviation 8. Roughly what fraction of samples lie in [42, 58]?
  4. Information theory. A discrete random variable takes 4 values uniformly. What is its entropy in bits?
  5. NumPy. Given a 1-D array x, write a one-liner that returns the indices of its top 3 largest values, in descending order of value.
  6. Classical ML. Why does adding more trees to a Random Forest typically stop helping after some point, while adding more trees to a boosted ensemble can keep helping (or hurt)?
  7. Regularization. What is the practical difference between L1 and L2 regularization on the resulting weights?
  8. Deep learning. A network's training loss is going down but validation loss is going up. Name three plausible interventions.
  9. Transformers. Why are the query–key dot products in scaled dot-product attention divided by √d_k before the softmax?
  10. Reproducibility. List three sources of nondeterminism in a PyTorch training run and how to control each.

Answer key

Reveal answers
  1. Null space dimension = 4 − rank = 4 − 2 = 2. (Rank-nullity theorem.)
  2. C(5,3) · (1/2)⁵ = 10/32 = 5/16 ≈ 0.3125.
  3. ±1σ window → ≈ 68% (empirical 68–95–99.7 rule).
  4. H = log₂(4) = 2 bits.
  5. np.argsort(x)[-3:][::-1].
  6. Random Forest: trees are trained independently on bootstrap samples and averaged, so extra trees only reduce variance, and the benefit plateaus once the residual correlation between trees caps further averaging gains. Boosting: each tree is fit to the errors of the current ensemble, so extra trees keep adding signal, but they can also keep fitting noise and overfit if you train past the validation optimum.
  7. L1 drives many weights to exactly zero (sparse solution / built-in feature selection). L2 shrinks all weights toward zero but rarely to exactly zero (smoother solution).
  8. Add dropout / weight decay; reduce model size; add data augmentation; collect more training data; early stopping; lower learning rate.
  9. Without the scaling, the variance of the query–key dot products grows linearly with d_k (their typical magnitude grows like √d_k), pushing the softmax into saturation (one entry near 1, the rest near 0) where gradients are tiny. Dividing by √d_k keeps the logits at unit scale and the gradients well-conditioned.
  10. (1) Weight initialization → set torch.manual_seed; (2) DataLoader shuffle order → set generator + worker seeds; (3) cuDNN nondeterministic kernels → set torch.backends.cudnn.deterministic = True and disable benchmark mode.
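
As with Mock 1, several answers can be verified mechanically. The sketch below checks Q2, Q4, and Q5, shows the √d_k effect from Q9 empirically, and collects the Q10 controls in one place (the last part assumes PyTorch is installed):

import numpy as np
from math import comb

# Q2: exactly 3 heads in 5 fair flips
print(comb(5, 3) / 2**5)  # 0.3125

# Q4: entropy of a uniform 4-valued variable, in bits
p = np.full(4, 0.25)
print(-(p * np.log2(p)).sum())  # 2.0

# Q5: indices of the top 3 values, largest first
x = np.array([3, 9, 1, 7, 5])
print(np.argsort(x)[-3:][::-1])  # [1 3 4]

# Q9: dot products of random unit-variance vectors have std ~ sqrt(d_k);
# dividing by sqrt(d_k) restores unit scale regardless of dimension
rng = np.random.default_rng(0)
for d in (16, 64, 256):
    q, k = rng.normal(size=(10_000, d)), rng.normal(size=(10_000, d))
    dots = (q * k).sum(axis=1)
    print(d, round(float(dots.std()), 1), round(float((dots / np.sqrt(d)).std()), 2))

# Q10: the three determinism controls from the answer key
import torch
torch.manual_seed(0)                        # (1) weight initialization
g = torch.Generator().manual_seed(0)        # (2) pass as DataLoader(..., shuffle=True, generator=g)
torch.backends.cudnn.deterministic = True   # (3) force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False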

Mock 4 · Coding mini-mock #2 (90 min) — tiny regression

A second end-to-end task, this time supervised regression instead of classification. Use the generator below to materialize train.csv and test.csv, then build a model and produce predictions.csv. Target metric: R² on the held-out test set.

Synthetic data generator

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=2000, n_features=10, n_informative=6, noise=4.0,
    bias=10.0, random_state=11,
)
# Inject a non-linear feature interaction and one outlier batch
y = y + 0.5 * X[:, 0] * X[:, 1]
rng = np.random.default_rng(11)
outliers = rng.choice(len(y), size=30, replace=False)
y[outliers] += rng.normal(0, 40, size=30)

cols = [f"f{i}" for i in range(10)]
df = pd.DataFrame(X, columns=cols)
df["target"] = y

train_df, test_df = train_test_split(df, test_size=400, random_state=0)
train_df.to_csv("train.csv", index=False)
test_df.drop(columns=["target"]).to_csv("test.csv", index=False)
test_df[["target"]].to_csv("test_labels.csv", index=False)

Deliverables

  1. Your notebook or script.
  2. predictions.csv with a single target column, one row per test-set row.

Scoring rubric

Test R² achieved     Score (out of 100)
< 0.50               0
0.50 – 0.70          40
0.70 – 0.85          70
0.85 – 0.92          90
≥ 0.92               100
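
Self-scoring works exactly as in Mock 2; a minimal helper, assuming your predictions.csv has a single target column:

import pandas as pd
from sklearn.metrics import r2_score

# Compare your predictions against the held-out targets from the generator
y_true = pd.read_csv("test_labels.csv")["target"]
y_pred = pd.read_csv("predictions.csv")["target"]
print("Test R²:", r2_score(y_true, y_pred))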

What this mock tests

  1. Spotting the injected outliers and handling them (e.g. a robust loss or target clipping).
  2. Recovering the f0 × f1 interaction the generator adds on top of the linear signal.
  3. Keeping the CV estimate honest when a handful of rows are corrupted.

Reference baseline (open after your attempt)

Reveal reference baseline
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

X = train.drop(columns=["target"])
y = train["target"]

# Add the explicit interaction so a tree model doesn't need to learn it from scratch
X["f0_f1"] = X["f0"] * X["f1"]
test["f0_f1"] = test["f0"] * test["f1"]

model = HistGradientBoostingRegressor(
    max_depth=6, learning_rate=0.05, max_iter=600, loss="absolute_error",
    random_state=42,
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("CV R²:", scores.mean(), "±", scores.std())

model.fit(X, y)
pd.DataFrame({"target": model.predict(test)}).to_csv("predictions.csv", index=False)

With loss="absolute_error" (robust to outliers) and the explicit interaction term, this baseline typically clears R² ≈ 0.88–0.92 on the default seed. Pure squared-error baselines without interaction handling tend to land near R² ≈ 0.75.
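
To see the robustness effect directly, a quick CV comparison of the two loss choices (same features, no interaction column; exact numbers vary with the seed):

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["target"]), train["target"]

cv = KFold(n_splits=5, shuffle=True, random_state=42)
for loss in ("squared_error", "absolute_error"):
    model = HistGradientBoostingRegressor(
        loss=loss, max_depth=6, learning_rate=0.05, max_iter=600, random_state=42,
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{loss}: CV R² = {scores.mean():.3f} ± {scores.std():.3f}")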

After each mock

  1. Score honestly using the rubric.
  2. For each missed theory item: re-derive without looking, then read the explanation.
  3. For the coding mock: identify your single biggest score-leak (was it feature engineering? model choice? CV?), and drill that next week.
  4. Log everything into the same error log you use for problem sets.