Deep learning with PyTorch

Tensors, autograd, a multi-layer perceptron from scratch, the standard layer toolkit, training loops, regularization, optimizers — every piece you need to go from raw data to a trained neural network.

PyTorch only. The syllabus specifies PyTorch. Build muscle memory in this API and don't waste cycles on TensorFlow / JAX.

Tensors

A PyTorch tensor is an n-dimensional array that supports automatic differentiation and GPU acceleration.

import torch

x = torch.tensor([1.0, 2.0, 3.0])
M = torch.randn(4, 3)
z = torch.zeros(5, 5)

x.shape         # torch.Size([3])
M.dtype         # torch.float32
M.device        # cpu (or 'mps' on Mac, 'cuda' on NVIDIA)

# move to device
M = M.to("mps")  # Apple silicon
M = M.to("cuda") # NVIDIA

Bridge with NumPy: torch.from_numpy(a) and t.numpy() share memory (no copy).
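A quick sketch of that shared-memory behavior — mutating either side changes the other (CPU tensors only; a GPU tensor must come back via .cpu() first):

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a)   # zero-copy: t views a's buffer

a[0] = 99.0               # mutate the NumPy side...
print(t[0].item())        # ...and the tensor sees it: 99.0

b = t.numpy()             # zero-copy in the other direction
b[1] = -1.0
print(a[1])               # -1.0 — all three names share one buffer
```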

Autograd

The single feature that makes deep learning practical: automatic gradient computation.

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x

y.backward()
print(x.grad)   # 3*x^2 + 2 = 14.0

When a tensor has requires_grad=True, every operation involving it is recorded in a computation graph. Calling .backward() on a scalar walks the graph and accumulates gradients into the leaf tensors' .grad attribute.

Gradients accumulate. Always call optimizer.zero_grad() at the start of each training step, or gradients from previous batches pile up.
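A minimal illustration of the accumulation behavior — a second backward pass without zeroing doubles the stored gradient:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()   # dy/dx = 2x = 4
print(x.grad)         # tensor(4.)

(x ** 2).backward()   # gradients ADD into .grad, they don't replace it
print(x.grad)         # tensor(8.) — stale gradient piled up

x.grad.zero_()        # what optimizer.zero_grad() does for each parameter
(x ** 2).backward()
print(x.grad)         # tensor(4.) again
```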

A multi-layer perceptron from scratch

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return self.fc3(x)

model = MLP(in_dim=20, hidden=64, out_dim=3)
print(model)
print(sum(p.numel() for p in model.parameters()), "parameters")

Three linear layers with ReLU activations between them; the final layer outputs raw logits. Feed those directly to nn.CrossEntropyLoss (it applies log-softmax internally) and apply softmax only when you actually need probabilities.
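A sketch of the logits-vs-probabilities distinction, using a single nn.Linear as a stand-in for the MLP above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(20, 3)            # stand-in for the MLP above

xb = torch.randn(8, 20)             # dummy batch of 8 samples
logits = model(xb)                  # raw scores, shape (8, 3)
probs = F.softmax(logits, dim=-1)   # probabilities, for inspection only
print(probs.sum(dim=-1))            # each row sums to 1

# for training, pass raw logits to CrossEntropyLoss —
# it applies log-softmax internally, so do NOT softmax first
yb = torch.randint(0, 3, (8,))
loss = nn.CrossEntropyLoss()(logits, yb)
```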

The standard layer toolkit

Layer                           | Purpose                              | Typical use
nn.Linear(in, out)              | Fully-connected layer: y = Wx + b    | MLPs, classifier heads
nn.Conv2d(in_ch, out_ch, k)     | 2-D convolution                      | Image features
nn.MaxPool2d(k) / AvgPool2d     | Spatial downsampling                 | CNN backbones
nn.BatchNorm1d / 2d             | Normalize feature activations        | Stabilize deep nets
nn.LayerNorm(dim)               | Normalize over the last dim          | Transformers
nn.Dropout(p)                   | Zero out p fraction during training  | Regularization
nn.ReLU / GELU / Tanh / Sigmoid | Activations                          | ReLU default; GELU for transformers
nn.Embedding(vocab, dim)        | Lookup table for discrete tokens     | NLP, categorical features
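A shape-level sanity check of a few layers from the table (the sizes are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

# convolution + pooling: padding=1 with a 3x3 kernel preserves spatial size,
# MaxPool2d(2) halves it
x = torch.randn(4, 3, 32, 32)                # batch of 4 RGB 32x32 images
conv = nn.Conv2d(3, 16, 3, padding=1)
pool = nn.MaxPool2d(2)
print(pool(conv(x)).shape)                   # torch.Size([4, 16, 16, 16])

# embedding + layer norm: token ids -> dense vectors, normalized per position
tokens = torch.randint(0, 1000, (4, 12))     # 4 sequences of 12 token ids
emb = nn.Embedding(1000, 64)
ln = nn.LayerNorm(64)
print(ln(emb(tokens)).shape)                 # torch.Size([4, 12, 64])
```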

The standard training loop

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = MLP(20, 64, 3).to(device)
opt   = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# X_train_t (float32 features) and y_train_t (int64 class labels) are pre-built tensors
train_ds = TensorDataset(X_train_t, y_train_t)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)

for epoch in range(20):
    model.train()
    running = 0.0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        opt.step()
        running += loss.item() * xb.size(0)
    train_loss = running / len(train_ds)

    # validation
    model.eval()
    with torch.no_grad():
        val_logits = model(X_val_t.to(device))
        val_loss   = loss_fn(val_logits, y_val_t.to(device)).item()
        val_acc    = (val_logits.argmax(-1) == y_val_t.to(device)).float().mean().item()

    print(f"epoch {epoch:02d}  train {train_loss:.4f}  val {val_loss:.4f}  acc {val_acc:.4f}")

Memorize this skeleton. It's the same shape for every training task you'll write.

Losses you'll actually use
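The three that cover almost every supervised task, sketched minimally (dummy tensors stand in for real predictions and targets):

```python
import torch
import torch.nn as nn

# multi-class classification: raw logits + integer class labels
ce = nn.CrossEntropyLoss()
loss = ce(torch.randn(8, 3), torch.randint(0, 3, (8,)))

# binary classification: raw logits + float 0/1 targets
# (BCEWithLogitsLoss fuses sigmoid + BCE for numerical stability)
bce = nn.BCEWithLogitsLoss()
loss = bce(torch.randn(8), torch.randint(0, 2, (8,)).float())

# regression: predictions + continuous targets
mse = nn.MSELoss()
loss = mse(torch.randn(8, 1), torch.randn(8, 1))
```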

Optimizers
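The two optimizers worth knowing by heart, sketched (the hyperparameters are typical starting points, not prescriptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# SGD with momentum: simple and well-understood, but needs more LR tuning
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# AdamW: adaptive per-parameter learning rates + decoupled weight decay;
# the default choice for most modern architectures
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# optionally decay the LR over training with a scheduler,
# calling sched.step() once per epoch after opt.step()
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
```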

Regularization
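A minimal sketch of the standard tools — dropout's train/eval split is the one that bites people:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()
print(drop(x))   # ~half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity at eval time — this is why model.eval() matters

# weight decay (L2) lives in the optimizer, not in the loss:
opt = torch.optim.AdamW(nn.Linear(4, 2).parameters(), weight_decay=1e-4)

# gradient clipping guards against exploding gradients, called
# between loss.backward() and opt.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```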

CNNs in one minute

class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 3x32x32 -> 32x16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # -> 64x8x8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes)
        )
    def forward(self, x):
        return self.net(x)
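Before training, run a dummy batch through to catch shape bugs early — the same stack built inline for a standalone check, assuming 3x32x32 inputs (e.g. CIFAR-10):

```python
import torch
import torch.nn as nn

# same layer stack as SmallCNN above, built inline
net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 10),
)

out = net(torch.randn(2, 3, 32, 32))   # dummy CIFAR-sized batch of 2
print(out.shape)                        # torch.Size([2, 10])
```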

The 3 invariants behind CNNs:

- Locality: each output value depends only on a small spatial neighborhood of the input.
- Weight sharing: the same kernel slides over every position, so a feature detector learned in one place works everywhere (and uses far fewer parameters than a dense layer).
- Translation equivariance: shifting the input shifts the feature maps the same way; pooling then turns this into approximate translation invariance.

Pitfalls
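A few classic ones, sketched as annotated code (the model and data here are dummies):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# 1. forgetting zero_grad: gradients from past batches accumulate silently
opt.zero_grad()
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
opt.step()

# 2. logging the loss tensor keeps the whole graph alive in memory;
#    detach the scalar with .item()
running = loss.item()

# 3. evaluating in train mode leaves dropout/batchnorm active;
#    wrap evaluation in model.eval() + torch.no_grad()
model.eval()
with torch.no_grad():
    preds = model(x).argmax(-1)

# 4. device mismatch: model AND every batch must be on the same device
#    (model.to(device) once, xb.to(device) per batch)
```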