Classical machine learning
Before deep learning is even on the table, classical ML solves most tabular contest problems faster, with fewer bugs, and on less compute. This page covers the scikit-learn families you must own: regression, classification, ensembles, cross-validation, clustering, dimensionality reduction.
The contest workflow
- Load & inspect. pd.read_csv, .info(), .describe(); scan for NaNs, look at value distributions.
- Train/val/test split. Stratified for classification; time-based for time series. Never mix.
- Baseline model. Logistic regression or random forest with default params. Whatever you build later must beat this; a minimal sketch of these first steps follows the list.
- Feature engineering. Numerical: standardize / log-transform. Categorical: one-hot or target encode. Date: extract year / month / day-of-week.
- Model search. Try 2–3 model families, tune each with cross-validation.
- Ensemble. Average top models. Usually +1–3% on the leaderboard.
- Reproduce. Fix random seeds, save the trained model, log the val score.
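A minimal sketch of the first three steps, assuming a hypothetical train.csv with an all-numeric feature matrix and a binary target column (all names here are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# "train.csv" and the "target" column are placeholder names.
df = pd.read_csv("train.csv")
df.info()

X = df.drop(columns=["target"])
y = df["target"]

# Stratified split keeps class proportions equal across train and val.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline: default random forest. Everything you try later must beat this.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("baseline acc:", accuracy_score(y_val, baseline.predict(X_val)))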
Regression
Linear regression
Closed-form solution; assumes a linear relationship and Gaussian errors. The baseline every other model is judged against.
sklearn.linear_model.LinearRegression
Ridge & Lasso
Linear regression + L2 (Ridge) or L1 (Lasso) regularization. Lasso also performs feature selection by zeroing weights.
Ridge(alpha=1.0) · Lasso(alpha=0.1)
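A quick way to see Lasso's selection effect, on synthetic data from make_regression (the alpha and dataset sizes here are illustrative, not tuned):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 features, only 10 of which carry signal.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero weights:", np.sum(lasso.coef_ != 0))  # typically far fewer than 100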
Elastic Net
L1 + L2 combined. Good default when you don't know which kind of regularization helps.
ElasticNet(alpha=0.1, l1_ratio=0.5)
Gradient-boosted trees
Sequential additive trees. State-of-the-art on tabular data — usually beats neural nets on small/medium structured datasets.
HistGradientBoostingRegressor() or external xgboost / lightgbm
Metrics: MAE, MSE, RMSE, R². Pick based on the problem — RMSE penalizes large errors more; MAE is robust to outliers.
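Computing these metrics with sklearn.metrics, on toy arrays (older scikit-learn versions have no separate RMSE function, hence the sqrt):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")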
Classification
Logistic regression
Linear decision boundary, probabilistic output via sigmoid. Fast, interpretable, hard to overfit.
LogisticRegression(C=1.0, max_iter=1000)
k-Nearest Neighbors
Lazy learner — no model is "fit"; prediction looks at the k closest training points. Sensitive to feature scaling.
KNeighborsClassifier(n_neighbors=5)
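Because of that scaling sensitivity, kNN usually lives inside a pipeline with a scaler; a minimal sketch:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Without scaling, large-range features dominate the distance metric.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))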
Decision trees
Recursive feature splits. Interpretable but high variance — single trees overfit. The building block for ensembles.
DecisionTreeClassifier(max_depth=8)
Random forest
Average of many decorrelated trees. Strong default; robust to outliers and irrelevant features.
RandomForestClassifier(n_estimators=200)
Gradient boosting
Boosted trees that fit residuals sequentially. Best-in-class for most tabular classification.
HistGradientBoostingClassifier()
SVM
Maximum-margin linear classifier; with kernels (RBF, polynomial) it handles non-linear data. Slow at large n.
SVC(kernel='rbf', C=1.0)
Metrics: accuracy, precision, recall, F1, ROC-AUC, log loss. For imbalanced classes, accuracy is misleading — always check precision/recall and the confusion matrix.
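A toy demonstration of why accuracy misleads on a 95/5 split, using a "model" that always predicts the majority class:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# 95 negatives, 5 positives; predictions are the majority class everywhere.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print("accuracy:", accuracy_score(y_true, y_pred))        # 0.95, looks great
print("F1 (positive class):", f1_score(y_true, y_pred))   # 0.0, the real story
print(confusion_matrix(y_true, y_pred))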
Cross-validation
from sklearn.model_selection import cross_val_score, StratifiedKFold

# model, X, y: your estimator and training data from the steps above
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
- K-fold (typically k=5 or k=10): split the data into k folds, train on k−1, validate on the held-out one, repeat.
- Stratified K-fold: preserves class proportions in each fold — use it for classification.
- Time series split: never randomize time-ordered data. Use TimeSeriesSplit (see the sketch after this list).
- Reported score is mean ± std across folds. A high mean with high variance is less trustworthy than a slightly lower stable score.
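A quick look at the folds TimeSeriesSplit produces; every validation fold lies strictly after its training rows:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # rows in time order
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "val:", val_idx)
# Each validation fold comes strictly after its training rows.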
Ensembles
- Bagging. Train many models on bootstrapped samples and average. Random forest is bagging on trees.
- Boosting. Train models sequentially, each focused on the previous model's errors. AdaBoost, gradient boosting, XGBoost, LightGBM.
- Stacking. Train several different model types, then train a meta-model on their out-of-fold predictions. Often the final +1% on leaderboards (see the sketch after this list).
- Simple averaging. Average predictions from k different models. Almost always helps if the models are decorrelated.
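A minimal stacking sketch with scikit-learn's StackingClassifier; the cv argument is what produces the out-of-fold predictions for the meta-model (the model choices here are illustrative):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gb", HistGradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # meta-model trains on 5-fold out-of-fold predictions
)
# stack.fit(X_train, y_train); stack.predict(X_val)  # names as in the workflow sketch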
Unsupervised learning
Clustering
- K-Means. Partition n points into k clusters minimizing within-cluster variance. Sensitive to initialization (use k-means++) and to the choice of k (see the sketch after this list).
- DBSCAN. Density-based. Finds clusters of arbitrary shape and identifies outliers as noise.
- Hierarchical (agglomerative). Build a dendrogram. Useful when you don't know k in advance.
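K-Means and DBSCAN side by side on synthetic blobs; make_blobs and the eps value are illustrative choices:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
km_labels = km.fit_predict(X)

db = DBSCAN(eps=0.8, min_samples=5)
db_labels = db.fit_predict(X)  # label -1 marks points DBSCAN treats as noise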
Dimensionality reduction
- PCA. Linear projection onto the top-k variance directions. The classical workhorse (sketched after this list).
- t-SNE. Non-linear, optimized for visualization. Distances in t-SNE space are not meaningful globally — use for plotting, not for downstream features.
- UMAP. Faster than t-SNE, often preserves global structure better. Also a visualization tool.
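A PCA sketch; explained_variance_ratio_ reports how much variance each kept component captures (scaling first, since PCA is scale-sensitive):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance share of each component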
Pitfalls
- Data leakage. If your validation set influences training (scaler fit on all data, target leaking into features, future info in past rows), your val score lies (see the pipeline sketch after this list).
- Overfitting to the validation set. If you tune 200 hyperparameter combos against one fixed val split, you're memorizing the val split. Use cross-validation or a held-out test set.
- Class imbalance. A 99%-accurate model on 99/1 data may be predicting "majority" every time. Stratify, use class weights, or resample.
- Forgotten random seeds. Two otherwise identical runs that produce different scores waste hours of debugging.
- Forgetting to refit on full training data after cross-validation. Cross-val gives you the score estimate; for the final submission, refit on all available training data.
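The standard fix for the scaler variant of leakage is to put preprocessing inside a pipeline, so each cross-validation fold fits the scaler on its own training portion only (a sketch on a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # any tabular dataset works
# Wrong: StandardScaler().fit(X) on all rows leaks val statistics into training.
# Right: inside a pipeline, the scaler is refit on each CV training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())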