Adjoints, the backward pass, and backpropagation. How to differentiate a million parameters in one pass.
The Scaling Problem
Forward mode needs $n$ passes for the gradient of $f: \mathbb{R}^n \to \mathbb{R}$. For a neural network loss with $n = 10^6$ parameters, that's a million passes. We need something better.
Reverse mode computes the entire gradient in a single forward pass plus a single backward pass -- regardless of $n$.
The Two Phases
Phase 1: Forward pass
Evaluate the function normally, computing each $v_k$ in order. Record the tape: store every operation and its intermediate values. These stored values will be needed in the backward pass.
Phase 2: Backward pass
Walk the tape in reverse. At each step, compute the adjoint $\bar{v}_k = \frac{\partial f}{\partial v_k}$ -- how much the final output $f$ changes per unit change in $v_k$.
Adjoints: What and Why
The adjoint of a variable $v_i$ is defined as:
$$\bar{v}_i = \frac{\partial f}{\partial v_i}$$
This is the sensitivity of the final output to a small change in $v_i$. The backward pass computes all adjoints efficiently by applying the chain rule in reverse.
The Adjoint Propagation Rule
Suppose $v_k = g(v_i, v_j)$ is an operation on the tape. During the backward pass, when we process $v_k$, we push its adjoint back to its inputs:

$$\bar{v}_i \mathrel{+}= \bar{v}_k \cdot \frac{\partial v_k}{\partial v_i}, \qquad \bar{v}_j \mathrel{+}= \bar{v}_k \cdot \frac{\partial v_k}{\partial v_j}$$
The $\mathrel{+}=$ is crucial! If $v_i$ is used in multiple operations (e.g., $v_i$ appears in both $v_3$ and $v_4$), then $v_i$ receives adjoint contributions from each use. This is the multivariate chain rule:
$$\bar{v}_i = \sum_{k : v_i \text{ is input to } v_k} \bar{v}_k \cdot \frac{\partial v_k}{\partial v_i}$$
Notice: these are the same local derivatives as forward mode, but now they're multiplied by the adjoint $\bar{v}_k$ coming from "above" instead of the tangent $\dot{v}_i$ coming from "below". The information flows in the opposite direction.
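The propagation rule is only a few lines of code. Below is a minimal, hypothetical tape-based sketch (not any real library's API): each primitive records its local partials during the forward pass, and `backward` replays the tape in reverse, applying the `+=` accumulation rule.

```python
import math

class Tape:
    """Minimal reverse-mode AD tape (illustrative sketch, not a real library)."""

    def __init__(self):
        self.entries = []   # (output_id, [(input_id, local_partial), ...])
        self.values = {}    # id -> forward value
        self.counter = 0

    def var(self, value):
        vid = self.counter
        self.counter += 1
        self.values[vid] = value
        return vid

    def record(self, value, partials):
        vid = self.var(value)
        self.entries.append((vid, partials))
        return vid

    # Each primitive stores its local partials dv_k/dv_i on the tape.
    def add(self, i, j):
        return self.record(self.values[i] + self.values[j], [(i, 1.0), (j, 1.0)])

    def mul(self, i, j):
        vi, vj = self.values[i], self.values[j]
        return self.record(vi * vj, [(i, vj), (j, vi)])

    def sin(self, i):
        return self.record(math.sin(self.values[i]),
                           [(i, math.cos(self.values[i]))])

    def backward(self, output_id):
        bar = {vid: 0.0 for vid in self.values}
        bar[output_id] = 1.0                       # seed: df/df = 1
        for vid, partials in reversed(self.entries):
            for input_id, local in partials:
                bar[input_id] += bar[vid] * local  # the += accumulation rule
        return bar

# f(x1, x2) = x1*x2 + sin(x1), the running example
t = Tape()
x1, x2 = t.var(math.pi / 4), t.var(3.0)
f = t.add(t.mul(x1, x2), t.sin(x1))
bar = t.backward(f)
# bar[x1] = x2 + cos(x1), bar[x2] = x1
```

Note how `x1` receives two `+=` contributions (one from the multiply, one from the sine), exactly as the multivariate chain rule requires.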
Interactive: Reverse Mode Step-Through
First step through the forward pass (blue button) to record the tape. Then step through the backward pass (orange button) to propagate adjoints. Watch how each node pushes its adjoint to its inputs.
Worked Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$
At $x_1 = \pi/4 \approx 0.7854$, $x_2 = 3$. Let's trace both phases completely.

Forward pass (record the tape):
- $v_1 = x_1 = 0.7854$
- $v_2 = x_2 = 3$
- $v_3 = v_1 \cdot v_2 = 2.3562$
- $v_4 = \sin(v_1) = 0.7071$
- $v_5 = v_3 + v_4 = 3.0633$ (the output $f$)

Backward pass (propagate adjoints):
- $\bar{v}_5 = 1$ (seed)
- $\bar{v}_3 \mathrel{+}= \bar{v}_5 \cdot 1 = 1$, $\quad \bar{v}_4 \mathrel{+}= \bar{v}_5 \cdot 1 = 1$
- $\bar{v}_1 \mathrel{+}= \bar{v}_4 \cdot \cos(v_1) = 0.7071$
- $\bar{v}_1 \mathrel{+}= \bar{v}_3 \cdot v_2 = 3$, $\quad \bar{v}_2 \mathrel{+}= \bar{v}_3 \cdot v_1 = 0.7854$

Result: $\bar{v}_1 = \frac{\partial f}{\partial x_1} = 3.7071$ and $\bar{v}_2 = \frac{\partial f}{\partial x_2} = 0.7854$.
Both partial derivatives computed in one backward pass!
Key observation: $v_1$ (i.e., $x_1$) was used in two operations ($v_3$ and $v_4$). Its adjoint accumulated contributions from both: $\bar{v}_1 = 3 + 0.7071$. This is the multivariate chain rule in action.
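The hand-traced adjoints are easy to sanity-check against central finite differences. This pure-Python sketch (not part of the lesson's widget) compares the closed-form adjoints to a numerical derivative:

```python
import math

def f(x1, x2):
    return x1 * x2 + math.sin(x1)

x1, x2 = math.pi / 4, 3.0

# Adjoints from the backward pass traced above
grad_x1 = x2 + math.cos(x1)   # v̄1 = 3 + 0.7071 = 3.7071
grad_x2 = x1                  # v̄2 = 0.7854

# Central finite differences as an independent check
h = 1e-6
fd_x1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
fd_x2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

assert abs(grad_x1 - fd_x1) < 1e-6
assert abs(grad_x2 - fd_x2) < 1e-6
```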
The Tape: Memory Trade-offs
During the forward pass, we must store intermediate values (the tape) because the backward pass needs them to compute local derivatives. For example, the adjoint of $v_k = v_i \times v_j$ needs the values $v_i$ and $v_j$ from the forward pass.
Memory cost: The tape stores all intermediate values. For a deep neural network with $L$ layers and $d$ neurons per layer, the tape holds $O(L \cdot d)$ activations per training example (multiplied by the batch size). This is why large models need enormous GPU memory for training.
Techniques to reduce memory:
- Checkpointing: store only some intermediate values; recompute the rest during the backward pass
- Gradient accumulation: process mini-batches sequentially instead of all at once
- Mixed precision: store activations in float16 instead of float32
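Checkpointing is easy to sketch for a plain chain of layers. The helper below is hypothetical (a chain of `tanh` "layers" stands in for a real network), but it shows the core move: store every `stride`-th activation on the forward pass, then recompute the dropped values segment by segment during the backward pass.

```python
import math

def g(x):                      # one "layer" of the chain
    return math.tanh(x)

def dg(x):                     # its local derivative
    return 1.0 - math.tanh(x) ** 2

def grad_with_checkpoints(x0, n_layers, stride):
    """d/dx0 of g applied n_layers times, storing ~n_layers/stride values."""
    checkpoints = {0: x0}
    x = x0
    for k in range(1, n_layers + 1):           # forward pass
        x = g(x)
        if k % stride == 0:
            checkpoints[k] = x                 # keep only every stride-th value
    adj = 1.0                                  # backward pass
    for k in range(n_layers, 0, -1):
        base = (k - 1) // stride * stride      # nearest checkpoint <= k-1
        v = checkpoints[base]
        for _ in range(k - 1 - base):          # recompute the dropped values
            v = g(v)
        adj *= dg(v)                           # chain rule through layer k
    return adj

# Reference: full-storage gradient (store every activation)
xs = [0.3]
for _ in range(12):
    xs.append(g(xs[-1]))
full = 1.0
for x in xs[:-1]:
    full *= dg(x)
assert abs(grad_with_checkpoints(0.3, 12, 4) - full) < 1e-12
```

The trade is compute for memory: each backward step may redo up to `stride - 1` forward evaluations, but only $O(n/\text{stride})$ activations are ever stored at once.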
Reverse Mode as VJP
Just as forward mode computes a Jacobian-vector product (JVP: $J \cdot v$), reverse mode computes a vector-Jacobian product (VJP: $v^T \cdot J$).
Starting with $\bar{f} = 1$ and propagating backward gives $\bar{x} = \nabla f^T$ -- the gradient as a row vector. More generally, for $f: \mathbb{R}^n \to \mathbb{R}^m$, seeding with a row vector $\bar{y} \in \mathbb{R}^{1 \times m}$ gives:
$$\bar{x} = \bar{y} \cdot J_f$$
One backward pass gives one row of the Jacobian. For $m = 1$ (scalar output), one pass gives the entire gradient.
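As a concrete (made-up) example, take $f(x) = \mathbf{A}\tanh(x)$ with $f: \mathbb{R}^3 \to \mathbb{R}^2$. Seeding the backward pass with each basis row vector $e_i^T$ recovers the Jacobian one row at a time:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
x = np.array([0.1, 0.2, 0.3])

def vjp(v):
    # Backward through f(x) = A @ tanh(x): propagate v^T operation by operation.
    v = v @ A                        # adjoint through the matmul: v^T A
    return v * (1 - np.tanh(x)**2)   # adjoint through tanh: elementwise scaling

# Analytic Jacobian for comparison: J = A @ diag(tanh'(x))
J = A * (1 - np.tanh(x)**2)

for i in range(2):
    row = vjp(np.eye(2)[i])          # seed with e_i^T -> row i of J
    assert np.allclose(row, J[i])
```

For $m = 1$ there is only one row, which is why a single VJP yields the full gradient.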
Real-World Example: Graph Convolutional Network
Let's trace reverse mode through something more realistic: a 2-layer Graph Convolutional Network (GCN) for node classification. The forward computation is:

$$f = \text{CE}\!\left(\text{softmax}\!\left(\hat{\mathbf{A}}\,\text{ReLU}\!\left(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}_1\right)\mathbf{W}_2\right),\ \mathbf{y}\right)$$
where $\hat{\mathbf{A}}$ is the normalized adjacency matrix (aggregation over neighbors), $\mathbf{X}$ is the input feature matrix, $\mathbf{W}_1, \mathbf{W}_2$ are learnable weight matrices, and $\mathbf{y}$ are labels.
The Tape (Forward Pass)
```
v1  = X                      (input features, n x d)
v2  = W1                     (weights, d x h)
v3  = W2                     (weights, h x c)
v4  = A_hat                  (normalized adjacency, n x n)
v5  = v4 @ v1                (neighbor aggregation, n x d)
v6  = v5 @ v2                (linear transform, n x h)
v7  = relu(v6)               (activation, n x h)
v8  = v4 @ v7                (second aggregation, n x h)
v9  = v8 @ v3                (linear transform, n x c)
v10 = softmax(v9, axis=1)    (class probabilities, n x c)
v11 = cross_entropy(v10, y)  (scalar loss)
```
The Backward Pass
Start: $\bar{v}_{11} = 1$. Walk backward through the tape.
v11 = cross_entropy(v10, y)
$\bar{v}_{10} = \frac{\partial \text{CE}}{\partial v_{10}}$. For cross-entropy with softmax, this simplifies to: $\bar{v}_{10,ij} = v_{10,ij} - \mathbf{1}[j = y_i]$ (predicted minus one-hot target).
v10 = softmax(v9)
The softmax Jacobian is $\text{diag}(p) - pp^T$ per row. The push is, row by row: $\bar{v}_9 = v_{10} \odot \left(\bar{v}_{10} - \langle \bar{v}_{10}, v_{10} \rangle\right)$, where $\langle \bar{v}_{10}, v_{10} \rangle$ is that row's scalar dot product, broadcast across the row.
In practice, combined CE+softmax backward is just $\bar{v}_9 = v_{10} - \text{onehot}(y)$, avoiding the full Jacobian.
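That shortcut is easy to verify numerically. The sketch below (illustrative code, not from the lesson; the `loss` helper is mine) compares `probs - onehot(y)` against finite differences on the logits:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))           # n=4 nodes, c=3 classes
y = np.array([0, 2, 1, 2])

def loss(z):
    # Softmax (with the usual max-subtraction for stability) + cross-entropy sum
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).sum()

p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
grad = p - np.eye(3)[y]                    # the claimed shortcut: probs - onehot

# Finite differences on each logit as an independent check
h = 1e-6
fd = np.zeros_like(logits)
for idx in np.ndindex(logits.shape):
    zp, zm = logits.copy(), logits.copy()
    zp[idx] += h
    zm[idx] -= h
    fd[idx] = (loss(zp) - loss(zm)) / (2 * h)

assert np.allclose(grad, fd, atol=1e-5)
```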
v9 = v8 @ v3
Matmul adjoints: $\bar{v}_8 = \bar{v}_9 \cdot v_3^T$ and $\bar{v}_3 \mathrel{+}= v_8^T \cdot \bar{v}_9$ -- the gradient for $\mathbf{W}_2$.
v8 = v4 @ v7
$\bar{v}_7 \mathrel{+}= v_4^T \cdot \bar{v}_8$ (adjoint aggregates over neighbors, transposed!)
$\hat{\mathbf{A}}$ is constant (not learned), so we don't need $\bar{v}_4$.
v7 = relu(v6)
ReLU adjoint: $\bar{v}_6 = \bar{v}_7 \odot \mathbf{1}[v_6 > 0]$ (zero out where input was negative).
This is why ReLU is popular: the backward pass is just a mask!
v6 = v5 @ v2
Matmul adjoints: $\bar{v}_5 = \bar{v}_6 \cdot v_2^T$ and $\bar{v}_2 \mathrel{+}= v_5^T \cdot \bar{v}_6$ -- the gradient for $\mathbf{W}_1$.
v5 = v4 @ v1
$\bar{v}_1 \mathrel{+}= v_4^T \cdot \bar{v}_5$ (gradient w.r.t. input features -- not needed for training, but useful for feature attribution)
Done! We now have $\bar{v}_2 = \frac{\partial f}{\partial \mathbf{W}_1}$ and $\bar{v}_3 = \frac{\partial f}{\partial \mathbf{W}_2}$.
One forward + one backward pass gave us gradients for all parameters ($\mathbf{W}_1$ and $\mathbf{W}_2$) simultaneously. If $\mathbf{W}_1$ is $d \times h$ and $\mathbf{W}_2$ is $h \times c$, that's $dh + hc$ partial derivatives from a single backward pass. Forward mode would have needed $dh + hc$ passes.
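The whole walk condenses into a short NumPy sketch. Everything below (random data, shapes, the `forward` helper) is illustrative, but the backward lines mirror the tape steps one-to-one, and a finite-difference probe confirms the $\mathbf{W}_1$ gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, hdim, c = 5, 4, 6, 3
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, hdim)) * 0.5
W2 = rng.normal(size=(hdim, c)) * 0.5
A = rng.random((n, n)) < 0.4
A = A | A.T | np.eye(n, dtype=bool)          # symmetrize, add self-loops
deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))      # D^{-1/2} A D^{-1/2}
y = rng.integers(0, c, size=n)

def forward(W1, W2):
    v5 = A_hat @ X                  # neighbor aggregation
    v6 = v5 @ W1                    # linear transform
    v7 = np.maximum(v6, 0)          # ReLU
    v8 = A_hat @ v7                 # second aggregation
    v9 = v8 @ W2                    # logits
    p = np.exp(v9 - v9.max(axis=1, keepdims=True))
    v10 = p / p.sum(axis=1, keepdims=True)
    loss = -np.log(v10[np.arange(n), y]).sum()
    return loss, (v5, v6, v7, v8, v10)

loss, (v5, v6, v7, v8, v10) = forward(W1, W2)

# Backward pass: walk the tape in reverse
v9_bar = v10 - np.eye(c)[y]         # combined softmax+CE shortcut
W2_bar = v8.T @ v9_bar              # adjoint of second matmul (W2 gradient)
v8_bar = v9_bar @ W2.T
v7_bar = A_hat.T @ v8_bar           # adjoint of aggregation (transposed!)
v6_bar = v7_bar * (v6 > 0)          # ReLU mask
W1_bar = v5.T @ v6_bar              # adjoint of first matmul (W1 gradient)

# Finite-difference check on one entry of W1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
fd = (forward(Wp, W2)[0] - forward(Wm, W2)[0]) / (2 * eps)
assert abs(W1_bar[0, 0] - fd) < 1e-4
```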
Adjoint Rules for Common Matrix Operations
The scalar adjoint rules extend naturally to matrices. Here are the key ones:

| Forward operation | Adjoint pushes |
|---|---|
| $\mathbf{C} = \mathbf{A}\mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}}\,\mathbf{B}^T$, $\;\bar{\mathbf{B}} \mathrel{+}= \mathbf{A}^T\bar{\mathbf{C}}$ |
| $\mathbf{C} = \mathbf{A} + \mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}}$, $\;\bar{\mathbf{B}} \mathrel{+}= \bar{\mathbf{C}}$ |
| $\mathbf{C} = \mathbf{A}^T$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}}^T$ |
| $\mathbf{C} = \mathbf{A} \odot \mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}} \odot \mathbf{B}$, $\;\bar{\mathbf{B}} \mathrel{+}= \bar{\mathbf{C}} \odot \mathbf{A}$ |
| $c = \sum_{ij} A_{ij}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{c}\,\mathbf{1}\mathbf{1}^T$ |
Pattern: The adjoint of a matrix multiply $\mathbf{C} = \mathbf{A}\mathbf{B}$ involves transposes: $\bar{\mathbf{A}} = \bar{\mathbf{C}}\mathbf{B}^T$ and $\bar{\mathbf{B}} = \mathbf{A}^T\bar{\mathbf{C}}$. This is the matrix version of the product rule: for $c = a \cdot b$, $\bar{a} = \bar{c} \cdot b$ and $\bar{b} = a \cdot \bar{c}$.
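A quick numerical check of the matmul rule. Assume (my setup, for illustration) a scalar loss $L = \sum_{ij} G_{ij} C_{ij}$ for a fixed matrix $G$, so that $\bar{\mathbf{C}} = \mathbf{G}$; the rule then predicts $\bar{\mathbf{A}} = \mathbf{G}\mathbf{B}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))        # plays the role of C_bar

A_bar = G @ B.T                    # the matmul adjoint rule
B_bar = A.T @ G

# Finite-difference check on one entry of A, with L = sum(G * (A @ B))
h = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[1, 2] += h
Am[1, 2] -= h
fd = (np.sum((Ap @ B) * G) - np.sum((Am @ B) * G)) / (2 * h)
assert abs(A_bar[1, 2] - fd) < 1e-6
```

The transposes are forced by the shapes: $\bar{\mathbf{A}}$ must match $\mathbf{A}$ ($3 \times 4$), and $\bar{\mathbf{C}}\mathbf{B}^T$ is the only way to combine a $3 \times 2$ adjoint with $\mathbf{B}$ to get that shape.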
This IS Backpropagation
If you've studied neural networks, you've seen backpropagation. Here's the reveal: backprop is exactly reverse mode AD applied to the computation graph of a neural network.
| Backprop term | AD term |
|---|---|
| Forward pass | Evaluate tape, store activations |
| Backward pass | Reverse sweep, propagate adjoints |
| $\delta$ (error signal) | Adjoint $\bar{v}$ |
| Weight gradient | Adjoint of weight variable |
| Activation caching | Tape storage |
Backprop was invented specifically for neural networks. AD is the general-purpose version that works for any computation. Every deep learning framework (PyTorch, JAX, TensorFlow) is, at its core, a reverse-mode AD engine.
Interactive: Cost Comparison
Set the input and output dimensions to see which mode is more efficient.