Classical machine learning
Before deep learning is even on the table, classical ML solves most tabular contest problems faster, with fewer bugs, and on less compute. This page covers the scikit-learn families you must own: regression, classification, ensembles, cross-validation, clustering, dimensionality reduction.
The contest workflow
- Load & inspect. pd.read_csv, .info(), .describe(); scan for NaNs, look at value distributions.
- Train/val/test split. Stratified for classification; time-based for time series. Never mix.
- Baseline model. Logistic regression or random forest with default params. Whatever you build later must beat this; a minimal sketch of these first steps follows the list.
- Feature engineering. Numerical: standardize / log-transform. Categorical: one-hot or target encode. Date: extract year / month / day-of-week.
- Model search. Try 2–3 model families, tune each with cross-validation.
- Ensemble. Average top models. Usually +1–3% on the leaderboard.
- Reproduce. Fix random seeds, save the trained model, log the val score.
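A minimal sketch of the first three steps, assuming a hypothetical train.csv with an all-numeric feature matrix and a binary target column (all names here are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# "train.csv" and the "target" column are placeholder names.
df = pd.read_csv("train.csv")
df.info()

X = df.drop(columns=["target"])
y = df["target"]

# Stratified split keeps class proportions equal across train and val.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline: default random forest. Everything you try later must beat this.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("baseline acc:", accuracy_score(y_val, baseline.predict(X_val)))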
Regression
Linear regression
Closed-form solution; assumes a linear relationship and Gaussian errors. The baseline every other model is judged against.
sklearn.linear_model.LinearRegression
Ridge & Lasso
Linear regression + L2 (Ridge) or L1 (Lasso) regularization. Lasso also performs feature selection by zeroing weights.
Ridge(alpha=1.0) · Lasso(alpha=0.1)
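A quick way to see Lasso's selection effect, on synthetic data from make_regression (the alpha and dataset sizes here are illustrative, not tuned):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 features, only 10 of which carry signal.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero weights:", np.sum(lasso.coef_ != 0))  # typically far fewer than 100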
Elastic Net
L1 + L2 combined. Good default when you don't know which kind of regularization helps.
ElasticNet(alpha=0.1, l1_ratio=0.5)
Gradient-boosted trees
Sequential additive trees. State-of-the-art on tabular data — usually beats neural nets on small/medium structured datasets.
HistGradientBoostingRegressor() or external xgboost / lightgbm
Metrics: MAE, MSE, RMSE, R². Pick based on the problem — RMSE penalizes large errors more; MAE is robust to outliers.
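Computing these metrics with sklearn.metrics, on toy arrays (older scikit-learn versions have no separate RMSE function, hence the sqrt):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")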
Classification
Logistic regression
Linear decision boundary, probabilistic output via sigmoid. Fast, interpretable, hard to overfit.
LogisticRegression(C=1.0, max_iter=1000)
k-Nearest Neighbors
Lazy learner — no model is "fit"; prediction looks at the k closest training points. Sensitive to feature scaling.
KNeighborsClassifier(n_neighbors=5)
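Because of that scaling sensitivity, kNN usually lives inside a pipeline with a scaler; a minimal sketch:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Without scaling, large-range features dominate the distance metric.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))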
Decision trees
Recursive feature splits. Interpretable but high variance — single trees overfit. The building block for ensembles.
DecisionTreeClassifier(max_depth=8)
Random forest
Average of many decorrelated trees. Strong default; robust to outliers and irrelevant features.
RandomForestClassifier(n_estimators=200)
Gradient boosting
Boosted trees that fit residuals sequentially. Best-in-class for most tabular classification.
HistGradientBoostingClassifier()
SVM
Maximum-margin linear classifier; with kernels (RBF, polynomial) it handles non-linear data. Slow at large n.
SVC(kernel='rbf', C=1.0)
Metrics: accuracy, precision, recall, F1, ROC-AUC, log loss. For imbalanced classes, accuracy is misleading — always check precision/recall and the confusion matrix.
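A toy demonstration of why accuracy misleads on a 95/5 split, using a "model" that always predicts the majority class:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# 95 negatives, 5 positives; predictions are the majority class everywhere.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print("accuracy:", accuracy_score(y_true, y_pred))        # 0.95, looks great
print("F1 (positive class):", f1_score(y_true, y_pred))   # 0.0, the real story
print(confusion_matrix(y_true, y_pred))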
Cross-validation
from sklearn.model_selection import cross_val_score, StratifiedKFold

# model, X, y: your estimator and training data from the steps above
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
- K-fold (typically k=5 or k=10): split the data into k folds, train on k−1, validate on the held-out one, repeat.
- Stratified K-fold: preserves class proportions in each fold — use it for classification.
- Time series split: never randomize time-ordered data. Use TimeSeriesSplit (see the sketch after this list).
- Reported score is mean ± std across folds. A high mean with high variance is less trustworthy than a slightly lower stable score.
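A quick look at the folds TimeSeriesSplit produces; every validation fold lies strictly after its training rows:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # rows in time order
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "val:", val_idx)
# Each validation fold comes strictly after its training rows.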
Ensembles
- Bagging. Train many models on bootstrapped samples and average. Random forest is bagging on trees.
- Boosting. Train models sequentially, each focused on the previous model's errors. AdaBoost, gradient boosting, XGBoost, LightGBM.
- Stacking. Train several different model types, then train a meta-model on their out-of-fold predictions. Often the final +1% on leaderboards (see the sketch after this list).
- Simple averaging. Average predictions from k different models. Almost always helps if the models are decorrelated.
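A minimal stacking sketch with scikit-learn's StackingClassifier; the cv argument is what produces the out-of-fold predictions for the meta-model (the model choices here are illustrative):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gb", HistGradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # meta-model trains on 5-fold out-of-fold predictions
)
# stack.fit(X_train, y_train); stack.predict(X_val)  # names as in the workflow sketch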
Unsupervised learning
Clustering
- K-Means. Partition n points into k clusters minimizing within-cluster variance. Sensitive to initialization (use k-means++) and to the choice of k (see the sketch after this list).
- DBSCAN. Density-based. Finds clusters of arbitrary shape and identifies outliers as noise.
- Hierarchical (agglomerative). Build a dendrogram. Useful when you don't know k in advance.
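K-Means and DBSCAN side by side on synthetic blobs; make_blobs and the eps value are illustrative choices:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
km_labels = km.fit_predict(X)

db = DBSCAN(eps=0.8, min_samples=5)
db_labels = db.fit_predict(X)  # label -1 marks points DBSCAN treats as noise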
Dimensionality reduction
- PCA. Linear projection onto the top-k variance directions. The classical workhorse (sketched after this list).
- t-SNE. Non-linear, optimized for visualization. Distances in t-SNE space are not meaningful globally — use for plotting, not for downstream features.
- UMAP. Faster than t-SNE, often preserves global structure better. Also a visualization tool.
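A PCA sketch; explained_variance_ratio_ reports how much variance each kept component captures (scaling first, since PCA is scale-sensitive):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance share of each component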
Pitfalls
- Data leakage. If your validation set influences training (scaler fit on all data, target leaking into features, future info in past rows), your val score lies (see the pipeline sketch after this list).
- Overfitting to the validation set. If you tune 200 hyperparameter combos against one fixed val split, you're memorizing the val split. Use cross-validation or a held-out test set.
- Class imbalance. A 99%-accurate model on 99/1 data may be predicting "majority" every time. Stratify, use class weights, or resample.
- Forgotten random seeds. Two otherwise identical runs that produce different scores waste hours of debugging.
- Forgetting to refit on full training data after cross-validation. Cross-val gives you the score estimate; for the final submission, refit on all available training data.
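The standard fix for the scaler variant of leakage is to put preprocessing inside a pipeline, so each cross-validation fold fits the scaler on its own training portion only (a sketch on a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # any tabular dataset works
# Wrong: StandardScaler().fit(X) on all rows leaks val statistics into training.
# Right: inside a pipeline, the scaler is refit on each CV training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())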