Deep learning with PyTorch
Tensors, autograd, a multi-layer perceptron from scratch, the standard layer toolkit, training loops, regularization, optimizers — every piece you need to go from raw data to a trained neural network.
Tensors
A PyTorch tensor is an n-dimensional array that supports automatic differentiation and GPU acceleration.
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])  # 1-D tensor from a list
M = torch.randn(4, 3)              # 4x3 matrix of standard-normal samples
z = torch.zeros(5, 5)              # 5x5 matrix of zeros

x.shape   # torch.Size([3])
M.dtype   # torch.float32
M.device  # cpu (or 'mps' on Mac, 'cuda' on NVIDIA)

# move to device (pick whichever backend your machine has)
M = M.to("mps")   # Apple silicon
M = M.to("cuda")  # NVIDIA
```
Bridge with NumPy: `torch.from_numpy(a)` and `t.numpy()` share memory with the underlying array (no copy), so mutating one mutates the other. This only works for CPU tensors.
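A quick sketch of that shared-memory behavior:

```python
import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)  # t and a share the same buffer
a[0] = 7.0
print(t)                 # tensor([7., 0., 0.], dtype=torch.float64)

t[1] = -1.0
print(a)                 # [ 7. -1.  0.] -- the change is visible both ways
```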
Autograd
The single feature that makes deep learning practical: automatic gradient computation.
```python
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x
y.backward()
print(x.grad)  # dy/dx = 3*x^2 + 2 = 14.0 at x = 2
```
When a tensor has requires_grad=True, every operation involving it is recorded in a computation graph. Calling .backward() on a scalar walks the graph and accumulates gradients into the leaf tensors' .grad attribute.
Call `optimizer.zero_grad()` at the start of each training step, or gradients from previous batches pile up.
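A minimal sketch of why this matters: `.grad` accumulates across `backward()` calls until you zero it.

```python
x = torch.tensor(2.0, requires_grad=True)

for _ in range(3):
    y = x ** 2     # dy/dx = 2x = 4.0
    y.backward()
    print(x.grad)  # 4.0, then 8.0, then 12.0: gradients pile up

x.grad.zero_()     # what optimizer.zero_grad() does for every parameter
```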
A multi-layer perceptron from scratch
```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return self.fc3(x)

model = MLP(in_dim=20, hidden=64, out_dim=3)
print(model)
print(sum(p.numel() for p in model.parameters()), "parameters")
```
Three linear layers with ReLU activations between them; the final layer outputs raw logits (apply softmax to get probabilities, or pass the logits straight to `nn.CrossEntropyLoss`).
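For instance, a short sketch converting the model's raw logits into class probabilities:

```python
x = torch.randn(2, 20)                 # a batch of 2 examples, 20 features each
logits = model(x)                      # shape (2, 3): one raw score per class
probs = torch.softmax(logits, dim=-1)  # normalize over the class dimension
print(probs.sum(dim=-1))               # tensor([1.0000, 1.0000])
```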
The standard layer toolkit
| Layer | Purpose | Typical use |
|---|---|---|
| `nn.Linear(in, out)` | Fully-connected layer: y = Wx + b | MLPs, classifier heads |
| `nn.Conv2d(in_ch, out_ch, k)` | 2-D convolution | Image features |
| `nn.MaxPool2d(k)` / `nn.AvgPool2d(k)` | Spatial downsampling | CNN backbones |
| `nn.BatchNorm1d` / `nn.BatchNorm2d` | Normalize feature activations | Stabilize deep nets |
| `nn.LayerNorm(dim)` | Normalize over the last dim | Transformers |
| `nn.Dropout(p)` | Zero out a fraction p of activations during training | Regularization |
| `nn.ReLU` / `nn.GELU` / `nn.Tanh` / `nn.Sigmoid` | Activations | ReLU default; GELU for transformers |
| `nn.Embedding(vocab, dim)` | Lookup table for discrete tokens | NLP, categorical features |
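A short sketch exercising a few of these layers and checking shapes (the vocab size of 1000 and 32-dim embeddings are arbitrary choices):

```python
emb = nn.Embedding(1000, 32)  # vocab of 1000 tokens -> 32-dim vectors
norm = nn.LayerNorm(32)       # normalizes over the last dimension
drop = nn.Dropout(0.1)        # active only in train mode

tokens = torch.randint(0, 1000, (4, 16))  # batch of 4 sequences, length 16
h = drop(norm(emb(tokens)))
print(h.shape)                # torch.Size([4, 16, 32])
```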
The standard training loop
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = MLP(20, 64, 3).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# X_train_t, y_train_t, X_val_t, y_val_t: pre-built feature/label tensors
train_ds = TensorDataset(X_train_t, y_train_t)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)

for epoch in range(20):
    model.train()
    running = 0.0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        opt.step()
        running += loss.item() * xb.size(0)
    train_loss = running / len(train_ds)

    # validation
    model.eval()
    with torch.no_grad():
        val_logits = model(X_val_t.to(device))
        val_loss = loss_fn(val_logits, y_val_t.to(device)).item()
        val_acc = (val_logits.argmax(-1) == y_val_t.to(device)).float().mean().item()
    print(f"epoch {epoch:02d} train {train_loss:.4f} val {val_loss:.4f} acc {val_acc:.4f}")
```
Memorize this skeleton. It's the same shape for every training task you'll write.
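Inference reuses the eval half of the skeleton; a sketch, with `X_new` standing in for a hypothetical batch of unseen inputs:

```python
model.eval()
with torch.no_grad():
    X_new = torch.randn(5, 20).to(device)  # stand-in for real unseen inputs
    preds = model(X_new).argmax(dim=-1)    # predicted class index per example
print(preds)
```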
Losses you'll actually use
- Regression: `nn.MSELoss`, `nn.L1Loss`, `nn.SmoothL1Loss`.
- Binary classification: `nn.BCEWithLogitsLoss` (combines sigmoid + BCE, numerically stable; sketched after this list).
- Multi-class classification: `nn.CrossEntropyLoss` (combines softmax + NLL; takes raw logits).
- Multi-label classification: `nn.BCEWithLogitsLoss` applied per label.
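A sketch of the stable pattern for binary classification: feed raw logits to `BCEWithLogitsLoss` rather than applying the sigmoid yourself.

```python
logits = torch.randn(8)                      # raw scores, one per example
targets = torch.randint(0, 2, (8,)).float()  # 0/1 labels as floats

loss = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid applied internally

# equivalent but numerically unstable -- avoid:
# loss = nn.BCELoss()(torch.sigmoid(logits), targets)
```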
Optimizers
- SGD — the original: `torch.optim.SGD(params, lr=0.01, momentum=0.9)`. Good for vision models with proper LR schedules.
- Adam / AdamW — adaptive per-parameter learning rates. `AdamW` is the default for most modern work (it decouples weight decay correctly).
- Learning-rate schedules: `StepLR`, `CosineAnnealingLR`, `OneCycleLR`. Linear warmup + cosine decay is a strong default for transformers; see the `OneCycleLR` sketch after this list.
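A sketch wiring `OneCycleLR` into the training loop above (the hyperparameters are placeholders; note the scheduler steps once per batch, not per epoch):

```python
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3, epochs=20, steps_per_epoch=len(train_dl)
)

for epoch in range(20):
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        sched.step()  # OneCycleLR advances once per optimizer step
```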
Regularization
- Weight decay (L2 on weights, via `weight_decay=` in the optimizer).
- Dropout in MLPs and transformer blocks.
- BatchNorm or LayerNorm — partially regularizing as a side effect.
- Early stopping — monitor val loss and stop when it plateaus; see the sketch after this list.
- Data augmentation — for vision: flips, crops, color jitter; for NLP: masking, back-translation.
- A smaller model — often the most effective regularizer.
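A minimal early-stopping sketch built on the training loop above (`patience=5` and the `best.pt` checkpoint path are arbitrary choices):

```python
best_val, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(200):
    model.train()
    for xb, yb in train_dl:  # same inner loop as the skeleton above
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val_t.to(device)), y_val_t.to(device)).item()

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # checkpoint the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs: stop
```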
CNNs in one minute
```python
class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # shape comments assume 3x32x32 inputs (e.g. CIFAR-10)
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64x8x8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```
The three design principles behind CNNs:
- Local connectivity: each conv kernel only looks at a small spatial neighborhood.
- Weight sharing: the same kernel slides across the whole image — same edge detector everywhere.
- Pooling: shrinks spatial dimensions, building larger receptive fields and translation tolerance.
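You can verify the shape comments in `SmallCNN` by tracing a dummy batch through it (a quick sanity check, assuming 32x32 inputs):

```python
cnn = SmallCNN(n_classes=10)
x = torch.randn(2, 3, 32, 32)  # batch of 2 RGB images, 32x32
print(cnn.net[0](x).shape)     # torch.Size([2, 32, 32, 32]) after the first conv
print(cnn(x).shape)            # torch.Size([2, 10]): one logit per class
```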
Pitfalls
- Forgetting the `model.train()` / `model.eval()` toggle — dropout and batchnorm behave differently in each mode.
- Forgetting `optimizer.zero_grad()` — gradients accumulate across steps.
- Loss going to NaN. Usually a too-high learning rate, exploding gradients, or numerical instability (use `BCEWithLogitsLoss`, not `BCELoss` on `sigmoid(x)`).
- Loss plateaued. Try a lower LR, longer training, or a larger model.
- Validation accuracy > training accuracy. Look for a data-split bug — usually you're evaluating on training data by accident.
- GPU OOM. Reduce batch size, use gradient accumulation, or use mixed precision (`torch.cuda.amp`); see the sketch below.
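A sketch combining the last two fixes, gradient accumulation plus mixed precision (assumes a CUDA device; `accum=4` simulates a 4x larger batch):

```python
scaler = torch.cuda.amp.GradScaler()
accum = 4  # number of micro-batches per optimizer step

opt.zero_grad()
for i, (xb, yb) in enumerate(train_dl):
    xb, yb = xb.to("cuda"), yb.to("cuda")
    with torch.cuda.amp.autocast():            # run the forward pass in reduced precision
        loss = loss_fn(model(xb), yb) / accum  # scale so accumulated gradients average out
    scaler.scale(loss).backward()              # accumulate scaled gradients
    if (i + 1) % accum == 0:
        scaler.step(opt)                       # unscale gradients, then step
        scaler.update()
        opt.zero_grad()
```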