Deep learning with PyTorch
Tensors, autograd, a multi-layer perceptron from scratch, the standard layer toolkit, training loops, regularization, optimizers — every piece you need to go from raw data to a trained neural network.
Tensors
A PyTorch tensor is an n-dimensional array that supports automatic differentiation and GPU acceleration.
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])  # 1-D tensor from a list
M = torch.randn(4, 3)              # 4x3 matrix of standard-normal samples
z = torch.zeros(5, 5)              # 5x5 matrix of zeros

x.shape   # torch.Size([3])
M.dtype   # torch.float32
M.device  # cpu (or 'mps' on Mac, 'cuda' on NVIDIA)

# move to device (pick whichever backend your machine has)
M = M.to("mps")   # Apple silicon
M = M.to("cuda")  # NVIDIA
```
Bridge with NumPy: `torch.from_numpy(a)` and `t.numpy()` share memory with the underlying array (no copy), so mutating one mutates the other. This only works for CPU tensors.
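A quick sketch of that shared-memory behavior:

```python
import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)  # t and a share the same buffer
a[0] = 7.0
print(t)                 # tensor([7., 0., 0.], dtype=torch.float64)

t[1] = -1.0
print(a)                 # [ 7. -1.  0.] -- the change is visible both ways
```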
Autograd
The single feature that makes deep learning practical: automatic gradient computation.
```python
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x
y.backward()
print(x.grad)  # dy/dx = 3*x^2 + 2 = 14.0 at x = 2
```
When a tensor has requires_grad=True, every operation involving it is recorded in a computation graph. Calling .backward() on a scalar walks the graph and accumulates gradients into the leaf tensors' .grad attribute.
Call `optimizer.zero_grad()` at the start of each training step, or gradients from previous batches pile up.
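A minimal sketch of why this matters: `.grad` accumulates across `backward()` calls until you zero it.

```python
x = torch.tensor(2.0, requires_grad=True)

for _ in range(3):
    y = x ** 2     # dy/dx = 2x = 4.0
    y.backward()
    print(x.grad)  # 4.0, then 8.0, then 12.0: gradients pile up

x.grad.zero_()     # what optimizer.zero_grad() does for every parameter
```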
A multi-layer perceptron from scratch
```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return self.fc3(x)

model = MLP(in_dim=20, hidden=64, out_dim=3)
print(model)
print(sum(p.numel() for p in model.parameters()), "parameters")
```
Three linear layers with ReLU activations between them; the final layer outputs raw logits (apply softmax to get probabilities, or pass the logits straight to `nn.CrossEntropyLoss`).
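For instance, a short sketch converting the model's raw logits into class probabilities:

```python
x = torch.randn(2, 20)                 # a batch of 2 examples, 20 features each
logits = model(x)                      # shape (2, 3): one raw score per class
probs = torch.softmax(logits, dim=-1)  # normalize over the class dimension
print(probs.sum(dim=-1))               # tensor([1.0000, 1.0000])
```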
The standard layer toolkit
| Layer | Purpose | Typical use |
|---|---|---|
| `nn.Linear(in, out)` | Fully-connected layer: y = Wx + b | MLPs, classifier heads |
| `nn.Conv2d(in_ch, out_ch, k)` | 2-D convolution | Image features |
| `nn.MaxPool2d(k)` / `nn.AvgPool2d(k)` | Spatial downsampling | CNN backbones |
| `nn.BatchNorm1d` / `nn.BatchNorm2d` | Normalize feature activations | Stabilize deep nets |
| `nn.LayerNorm(dim)` | Normalize over the last dim | Transformers |
| `nn.Dropout(p)` | Zero out a fraction p of activations during training | Regularization |
| `nn.ReLU` / `nn.GELU` / `nn.Tanh` / `nn.Sigmoid` | Activations | ReLU default; GELU for transformers |
| `nn.Embedding(vocab, dim)` | Lookup table for discrete tokens | NLP, categorical features |
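A short sketch exercising a few of these layers and checking shapes (the vocab size of 1000 and 32-dim embeddings are arbitrary choices):

```python
emb = nn.Embedding(1000, 32)  # vocab of 1000 tokens -> 32-dim vectors
norm = nn.LayerNorm(32)       # normalizes over the last dimension
drop = nn.Dropout(0.1)        # active only in train mode

tokens = torch.randint(0, 1000, (4, 16))  # batch of 4 sequences, length 16
h = drop(norm(emb(tokens)))
print(h.shape)                # torch.Size([4, 16, 32])
```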
The standard training loop
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = MLP(20, 64, 3).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# X_train_t, y_train_t, X_val_t, y_val_t: pre-built feature/label tensors
train_ds = TensorDataset(X_train_t, y_train_t)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)

for epoch in range(20):
    model.train()
    running = 0.0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        opt.step()
        running += loss.item() * xb.size(0)
    train_loss = running / len(train_ds)

    # validation
    model.eval()
    with torch.no_grad():
        val_logits = model(X_val_t.to(device))
        val_loss = loss_fn(val_logits, y_val_t.to(device)).item()
        val_acc = (val_logits.argmax(-1) == y_val_t.to(device)).float().mean().item()
    print(f"epoch {epoch:02d} train {train_loss:.4f} val {val_loss:.4f} acc {val_acc:.4f}")
```
Memorize this skeleton. It's the same shape for every training task you'll write.
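Inference reuses the eval half of the skeleton; a sketch, with `X_new` standing in for a hypothetical batch of unseen inputs:

```python
model.eval()
with torch.no_grad():
    X_new = torch.randn(5, 20).to(device)  # stand-in for real unseen inputs
    preds = model(X_new).argmax(dim=-1)    # predicted class index per example
print(preds)
```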
Losses you'll actually use
- Regression: `nn.MSELoss`, `nn.L1Loss`, `nn.SmoothL1Loss`.
- Binary classification: `nn.BCEWithLogitsLoss` (combines sigmoid + BCE, numerically stable; sketched after this list).
- Multi-class classification: `nn.CrossEntropyLoss` (combines softmax + NLL; takes raw logits).
- Multi-label classification: `nn.BCEWithLogitsLoss` applied per label.
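A sketch of the stable pattern for binary classification: feed raw logits to `BCEWithLogitsLoss` rather than applying the sigmoid yourself.

```python
logits = torch.randn(8)                      # raw scores, one per example
targets = torch.randint(0, 2, (8,)).float()  # 0/1 labels as floats

loss = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid applied internally

# equivalent but numerically unstable -- avoid:
# loss = nn.BCELoss()(torch.sigmoid(logits), targets)
```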
Optimizers
- SGD — the original: `torch.optim.SGD(params, lr=0.01, momentum=0.9)`. Good for vision models with proper LR schedules.
- Adam / AdamW — adaptive per-parameter learning rates. `AdamW` is the default for most modern work (it decouples weight decay correctly).
- Learning-rate schedules: `StepLR`, `CosineAnnealingLR`, `OneCycleLR`. Linear warmup + cosine decay is a strong default for transformers; see the `OneCycleLR` sketch after this list.
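A sketch wiring `OneCycleLR` into the training loop above (the hyperparameters are placeholders; note the scheduler steps once per batch, not per epoch):

```python
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3, epochs=20, steps_per_epoch=len(train_dl)
)

for epoch in range(20):
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        sched.step()  # OneCycleLR advances once per optimizer step
```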
Regularization
- Weight decay (L2 on weights, via `weight_decay=` in the optimizer).
- Dropout in MLPs and transformer blocks.
- BatchNorm or LayerNorm — partially regularizing as a side effect.
- Early stopping — monitor val loss and stop when it plateaus; see the sketch after this list.
- Data augmentation — for vision: flips, crops, color jitter; for NLP: masking, back-translation.
- A smaller model — often the most effective regularizer.
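A minimal early-stopping sketch built on the training loop above (`patience=5` and the `best.pt` checkpoint path are arbitrary choices):

```python
best_val, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(200):
    model.train()
    for xb, yb in train_dl:  # same inner loop as the skeleton above
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val_t.to(device)), y_val_t.to(device)).item()

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # checkpoint the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs: stop
```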
CNNs in one minute
```python
class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # shape comments assume 3x32x32 inputs (e.g. CIFAR-10)
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64x8x8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```
The three design principles behind CNNs:
- Local connectivity: each conv kernel only looks at a small spatial neighborhood.
- Weight sharing: the same kernel slides across the whole image — same edge detector everywhere.
- Pooling: shrinks spatial dimensions, building larger receptive fields and translation tolerance.
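You can verify the shape comments in `SmallCNN` by tracing a dummy batch through it (a quick sanity check, assuming 32x32 inputs):

```python
cnn = SmallCNN(n_classes=10)
x = torch.randn(2, 3, 32, 32)  # batch of 2 RGB images, 32x32
print(cnn.net[0](x).shape)     # torch.Size([2, 32, 32, 32]) after the first conv
print(cnn(x).shape)            # torch.Size([2, 10]): one logit per class
```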
Pitfalls
- Forgetting the `model.train()` / `model.eval()` toggle — dropout and batchnorm behave differently in each mode.
- Forgetting `optimizer.zero_grad()` — gradients accumulate across steps.
- Loss going to NaN. Usually a too-high learning rate, exploding gradients, or numerical instability (use `BCEWithLogitsLoss`, not `BCELoss` on `sigmoid(x)`).
- Loss plateaued. Try a lower LR, longer training, or a larger model.
- Validation accuracy > training accuracy. Look for a data-split bug — usually you're evaluating on training data by accident.
- GPU OOM. Reduce batch size, use gradient accumulation, or use mixed precision (`torch.cuda.amp`); see the sketch below.
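A sketch combining the last two fixes, gradient accumulation plus mixed precision (assumes a CUDA device; `accum=4` simulates a 4x larger batch):

```python
scaler = torch.cuda.amp.GradScaler()
accum = 4  # number of micro-batches per optimizer step

opt.zero_grad()
for i, (xb, yb) in enumerate(train_dl):
    xb, yb = xb.to("cuda"), yb.to("cuda")
    with torch.cuda.amp.autocast():            # run the forward pass in reduced precision
        loss = loss_fn(model(xb), yb) / accum  # scale so accumulated gradients average out
    scaler.scale(loss).backward()              # accumulate scaled gradients
    if (i + 1) % accum == 0:
        scaler.step(opt)                       # unscale gradients, then step
        scaler.update()
        opt.zero_grad()
```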