Reverse Mode AD

Adjoints, the backward pass, and backpropagation. How to differentiate a million parameters in one pass.

The Scaling Problem

Forward mode needs $n$ passes for the gradient of $f: \mathbb{R}^n \to \mathbb{R}$. For a neural network loss with $n = 10^6$ parameters, that's a million passes. We need something better.

Reverse mode computes the entire gradient in a single forward pass plus a single backward pass -- regardless of $n$.

The Two Phases

Phase 1: Forward pass
Evaluate the function normally, computing each $v_k$ in order. Record the tape: store every operation and its intermediate values. These stored values will be needed in the backward pass.
Phase 2: Backward pass
Walk the tape in reverse. At each step, compute the adjoint $\bar{v}_k = \frac{\partial f}{\partial v_k}$ -- how much the final output $f$ changes per unit change in $v_k$.
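The two phases fit in a few dozen lines of plain Python. This is a minimal sketch; the `Var` class, the `_record` helper, and the tape layout are invented here for illustration and are not any framework's API:

```python
import math

class Var:
    """A value that records every operation applied to it on a shared tape."""
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0                      # adjoint, filled in by backward()
        self.tape = tape if tape is not None else []

    def _record(self, value, parents):
        # parents: list of (input Var, local derivative d(out)/d(input))
        out = Var(value, self.tape)
        self.tape.append((out, parents))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])

    def sin(self):
        return self._record(math.sin(self.value),
                            [(self, math.cos(self.value))])

def backward(output):
    """Phase 2: walk the tape in reverse, pushing adjoints to inputs."""
    output.grad = 1.0                        # seed: adjoint of f is 1
    for out, parents in reversed(output.tape):
        for parent, local in parents:
            parent.grad += out.grad * local  # the += accumulation rule

# Example: f(x1, x2) = x1*x2 + sin(x1) at (pi/4, 3)
tape = []
x1, x2 = Var(math.pi / 4, tape), Var(3.0, tape)
f = x1 * x2 + x1.sin()
backward(f)
# x1.grad ≈ 3.7071 (= x2 + cos(x1)),  x2.grad ≈ 0.7854 (= x1)
```

Calling `backward(f)` replays the tape in reverse, so each recorded operation pushes `out.grad * local` into its inputs' `.grad` fields.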

Adjoints: What and Why

The adjoint of a variable $v_i$ is defined as:

$$\bar{v}_i = \frac{\partial f}{\partial v_i}$$

This is the sensitivity of the final output to a small change in $v_i$. The backward pass computes all adjoints efficiently by applying the chain rule in reverse.

The Adjoint Propagation Rule

Suppose $v_k = g(v_i, v_j)$ is an operation on the tape. During the backward pass, when we process $v_k$, we push its adjoint back to its inputs:

$$\bar{v}_i \mathrel{+}= \bar{v}_k \cdot \frac{\partial v_k}{\partial v_i} \qquad \bar{v}_j \mathrel{+}= \bar{v}_k \cdot \frac{\partial v_k}{\partial v_j}$$
The $\mathrel{+}=$ is crucial! If $v_i$ is used in multiple operations (e.g., $v_i$ appears in both $v_3$ and $v_4$), then $v_i$ receives adjoint contributions from each use. This is the multivariate chain rule: $$\bar{v}_i = \sum_{k : v_i \text{ is input to } v_k} \bar{v}_k \cdot \frac{\partial v_k}{\partial v_i}$$

The adjoint rules for each elementary operation:

| Operation $v_k = \ldots$ | Adjoint pushed to inputs |
| --- | --- |
| $v_i + v_j$ | $\bar{v}_i \mathrel{+}= \bar{v}_k$,   $\bar{v}_j \mathrel{+}= \bar{v}_k$ |
| $v_i \times v_j$ | $\bar{v}_i \mathrel{+}= \bar{v}_k \cdot v_j$,   $\bar{v}_j \mathrel{+}= \bar{v}_k \cdot v_i$ |
| $v_i - v_j$ | $\bar{v}_i \mathrel{+}= \bar{v}_k$,   $\bar{v}_j \mathrel{+}= -\bar{v}_k$ |
| $\sin(v_i)$ | $\bar{v}_i \mathrel{+}= \bar{v}_k \cdot \cos(v_i)$ |
| $\exp(v_i)$ | $\bar{v}_i \mathrel{+}= \bar{v}_k \cdot \exp(v_i) = \bar{v}_k \cdot v_k$ |
| $\ln(v_i)$ | $\bar{v}_i \mathrel{+}= \bar{v}_k / v_i$ |
| $v_i^2$ | $\bar{v}_i \mathrel{+}= \bar{v}_k \cdot 2 v_i$ |
Notice: these are the same local derivatives as forward mode, but now they're multiplied by the adjoint $\bar{v}_k$ coming from "above" instead of the tangent $\dot{v}_i$ coming from "below". The information flows in the opposite direction.

Worked Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$

At $x_1 = \pi/4, x_2 = 3$. Let's trace both phases completely.

Forward Pass

v1 = x1 = 0.7854
v2 = x2 = 3
v3 = v1 * v2 = 2.356
v4 = sin(v1) = 0.7071
v5 = v3 + v4 = 3.063    ← f

Backward Pass

Initialize: $\bar{v}_5 = 1$ (adjoint of output). All other adjoints start at 0.

Process v5 = v3 + v4
Addition pushes $\bar{v}_5$ to both inputs unchanged:
$\bar{v}_3 \mathrel{+}= \bar{v}_5 \cdot 1 = 1$
$\bar{v}_4 \mathrel{+}= \bar{v}_5 \cdot 1 = 1$
Process v4 = sin(v1)
$\bar{v}_1 \mathrel{+}= \bar{v}_4 \cdot \cos(v_1) = 1 \cdot \cos(\pi/4) = 0.7071$
Process v3 = v1 × v2
Product rule pushes to both inputs:
$\bar{v}_1 \mathrel{+}= \bar{v}_3 \cdot v_2 = 1 \cdot 3 = 3$  (now $\bar{v}_1 = 0.7071 + 3 = 3.707$)
$\bar{v}_2 \mathrel{+}= \bar{v}_3 \cdot v_1 = 1 \cdot 0.7854 = 0.7854$
Result: read off the gradient
$\frac{\partial f}{\partial x_1} = \bar{v}_1 = 3.707 = x_2 + \cos(x_1)$ ✔
$\frac{\partial f}{\partial x_2} = \bar{v}_2 = 0.7854 = x_1$ ✔

Both partial derivatives computed in one backward pass!
Key observation: $v_1$ (i.e., $x_1$) was used in two operations ($v_3$ and $v_4$). Its adjoint accumulated contributions from both: $\bar{v}_1 = 3 + 0.7071$. This is the multivariate chain rule in action.
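The trace above can be written out as straight-line Python with explicit adjoint variables (names mirror the $v_k$ in the tape):

```python
import math

# Forward pass: the tape for f(x1, x2) = x1*x2 + sin(x1) at (pi/4, 3)
v1, v2 = math.pi / 4, 3.0
v3 = v1 * v2
v4 = math.sin(v1)
v5 = v3 + v4                      # f

# Backward pass: seed the output adjoint, walk the tape in reverse
v5_bar = 1.0
v3_bar = v5_bar * 1.0             # v5 = v3 + v4 pushes unchanged
v4_bar = v5_bar * 1.0
v1_bar = v4_bar * math.cos(v1)    # v4 = sin(v1): first contribution to v1
v1_bar += v3_bar * v2             # v3 = v1*v2: second contribution (the +=)
v2_bar = v3_bar * v1

# v1_bar ≈ 3.7071 = x2 + cos(x1),  v2_bar ≈ 0.7854 = x1
```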

The Tape: Memory Trade-offs

During the forward pass, we must store intermediate values (the tape) because the backward pass needs them to compute local derivatives. For example, the adjoint of $v_k = v_i \times v_j$ needs the values $v_i$ and $v_j$ from the forward pass.

Memory cost: The tape stores all intermediate values. For a deep neural network with $L$ layers and $d$ neurons per layer, the tape uses $O(L \cdot d)$ memory per training example (multiplied by the batch size in practice). This is why large models need enormous GPU memory for training.

Techniques to reduce memory:

- Gradient checkpointing: store only a subset of activations and recompute the rest during the backward pass, trading compute for memory.
- Reversible layers: reconstruct each layer's input from its output, so activations need not be stored at all.
- Reduced-precision or offloaded storage: keep the tape in 16-bit formats, or move it to host memory until the backward pass needs it.
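A toy sketch of gradient checkpointing: store only every $k$-th activation, and during the backward pass recompute forward from the nearest checkpoint. The setup is assumed for illustration (a chain of elementwise $\tanh$ layers with a sum-of-outputs loss; `layer`, `layer_vjp`, and `grad_checkpointed` are made-up names):

```python
import numpy as np

def layer(x):
    return np.tanh(x)

def layer_vjp(x, g):
    # pull adjoint g back through tanh, given the layer's *input* x
    return g * (1.0 - np.tanh(x) ** 2)

def grad_full_tape(x0, L):
    """Baseline: store every activation -- O(L) memory."""
    acts = [x0]
    for _ in range(L):
        acts.append(layer(acts[-1]))
    g = np.ones_like(x0)               # seed: adjoint of sum(x_L)
    for i in range(L, 0, -1):
        g = layer_vjp(acts[i - 1], g)
    return g

def grad_checkpointed(x0, L, k):
    """Store only every k-th activation -- O(L/k) memory, extra compute."""
    checkpoints = {0: x0}
    x = x0
    for i in range(1, L + 1):
        x = layer(x)
        if i % k == 0:
            checkpoints[i] = x

    g = np.ones_like(x0)
    for i in range(L, 0, -1):
        # recompute activation i-1 from the nearest checkpoint at or below it
        j = max(c for c in checkpoints if c <= i - 1)
        a = checkpoints[j]
        for _ in range(i - 1 - j):
            a = layer(a)
        g = layer_vjp(a, g)
    return g

x0 = np.linspace(-1.0, 1.0, 5)
assert np.allclose(grad_full_tape(x0, 10), grad_checkpointed(x0, 10, 3))
```

The checkpointed version keeps $O(L/k)$ activations instead of $O(L)$, at the cost of re-running parts of the forward pass during backpropagation.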

Reverse Mode as VJP

Just as forward mode computes a Jacobian-vector product (JVP: $J \cdot v$), reverse mode computes a vector-Jacobian product (VJP: $v^T \cdot J$).

Starting with $\bar{f} = 1$ and propagating backward gives $\bar{x} = \nabla f^T$ -- the gradient as a row vector. More generally, for $f: \mathbb{R}^n \to \mathbb{R}^m$, seeding with a row vector $\bar{y} \in \mathbb{R}^{1 \times m}$ gives:

$$\bar{x} = \bar{y} \cdot J_f$$

One backward pass gives one row of the Jacobian. For $m = 1$ (scalar output), one pass gives the entire gradient.
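As a concrete sketch, take the made-up function $f(x) = \mathbf{A}\tanh(\mathbf{B}x)$ with random $\mathbf{A}, \mathbf{B}$. Seeding the backward pass with the one-hot row $e_i$ yields row $i$ of the Jacobian in a single VJP:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 5, 4, 3                       # illustrative sizes
B = rng.normal(size=(h, n))
A = rng.normal(size=(m, h))
x = rng.normal(size=n)

# Forward pass, keeping intermediates on the "tape"
u = B @ x
z = np.tanh(u)
y = A @ z

# Backward pass seeded with e_i: one VJP = row i of the Jacobian
i = 1
ybar = np.zeros(m); ybar[i] = 1.0
zbar = A.T @ ybar                       # through the matmul A @ z
ubar = zbar * (1.0 - np.tanh(u) ** 2)   # through elementwise tanh
xbar = B.T @ ubar                       # through the matmul B @ x

# Check: the analytic Jacobian is A @ diag(tanh'(u)) @ B
J = A @ np.diag(1.0 - np.tanh(u) ** 2) @ B
assert np.allclose(xbar, J[i])
```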

Real-World Example: Graph Convolutional Network

Let's trace reverse mode through something more realistic: a 2-layer Graph Convolutional Network (GCN) for node classification. The forward computation is:

$$f(\mathbf{W}_1, \mathbf{W}_2) = \text{CrossEntropy}\!\left(\text{softmax}\!\left(\hat{\mathbf{A}} \cdot \text{ReLU}\!\left(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}_1\right) \mathbf{W}_2\right), \mathbf{y}\right)$$

where $\hat{\mathbf{A}}$ is the normalized adjacency matrix (aggregation over neighbors), $\mathbf{X}$ is the input feature matrix, $\mathbf{W}_1, \mathbf{W}_2$ are learnable weight matrices, and $\mathbf{y}$ are labels.

The Tape (Forward Pass)

v1  = X                           (input features, n x d)
v2  = W1                          (weights, d x h)
v3  = W2                          (weights, h x c)
v4  = A_hat                       (normalized adjacency, n x n)
v5  = v4 @ v1                     (neighbor aggregation, n x d)
v6  = v5 @ v2                     (linear transform, n x h)
v7  = relu(v6)                    (activation, n x h)
v8  = v4 @ v7                     (second aggregation, n x h)
v9  = v8 @ v3                     (linear transform, n x c)
v10 = softmax(v9, axis=1)         (class probabilities, n x c)
v11 = cross_entropy(v10, y)       (scalar loss)

The Backward Pass

Start: $\bar{v}_{11} = 1$. Walk backward through the tape.

v11 = cross_entropy(v10, y)
$\bar{v}_{10} = \frac{\partial \text{CE}}{\partial v_{10}}$. For the summed loss $\text{CE} = -\sum_i \ln v_{10, i y_i}$, this is $\bar{v}_{10,ij} = -\mathbf{1}[j = y_i] / v_{10,ij}$.
v10 = softmax(v9)
The softmax Jacobian is $\text{diag}(p) - pp^T$ per row. Applied row by row, the push is $\bar{v}_9 = v_{10} \odot \left(\bar{v}_{10} - \langle \bar{v}_{10}, v_{10} \rangle\right)$, where $\langle \cdot, \cdot \rangle$ is the row-wise inner product, broadcast across the row.
In practice, the combined CE+softmax backward is just $\bar{v}_9 = v_{10} - \text{onehot}(y)$ (predicted minus one-hot target), avoiding the full Jacobian.
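A quick numeric check of this shortcut, using the summed loss so that $\bar{v}_9 = v_{10} - \text{onehot}(y)$ holds exactly (the helper names here are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(z, y):
    # summed cross-entropy of softmax(z) against integer labels y
    p = softmax(z)
    return -np.log(p[np.arange(len(y)), y]).sum()

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 2])

# Combined CE+softmax backward: p - onehot(y)
grad = softmax(z)
grad[np.arange(4), y] -= 1.0

# Central finite differences agree entry by entry
eps = 1e-6
num = np.zeros_like(z)
for idx in np.ndindex(*z.shape):
    zp = z.copy(); zp[idx] += eps
    zm = z.copy(); zm[idx] -= eps
    num[idx] = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps)
assert np.allclose(grad, num, atol=1e-5)
```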
v9 = v8 @ v3  (matrix multiply)
Matrix multiply adjoint rule: $d(\mathbf{A}\mathbf{B}) = (d\mathbf{A})\mathbf{B} + \mathbf{A}(d\mathbf{B})$
$\bar{v}_8 \mathrel{+}= \bar{v}_9 \cdot v_3^T$  (gradient w.r.t. left factor)
$\bar{v}_3 \mathrel{+}= v_8^T \cdot \bar{v}_9$  (this is $\frac{\partial f}{\partial \mathbf{W}_2}$!)
v8 = v4 @ v7  (neighbor aggregation)
$\bar{v}_7 \mathrel{+}= v_4^T \cdot \bar{v}_8$  (adjoint aggregates over neighbors, transposed!)
$\hat{\mathbf{A}}$ is constant (not learned), so we don't need $\bar{v}_4$.
v7 = relu(v6)
ReLU adjoint: $\bar{v}_6 = \bar{v}_7 \odot \mathbf{1}[v_6 > 0]$  (zero out where input was negative).
This is why ReLU is popular: the backward pass is just a mask!
v6 = v5 @ v2
$\bar{v}_5 \mathrel{+}= \bar{v}_6 \cdot v_2^T$
$\bar{v}_2 \mathrel{+}= v_5^T \cdot \bar{v}_6$  (this is $\frac{\partial f}{\partial \mathbf{W}_1}$!)
v5 = v4 @ v1
$\bar{v}_1 \mathrel{+}= v_4^T \cdot \bar{v}_5$  (gradient w.r.t. input features -- not needed for training, but useful for feature attribution)

Done! We now have $\bar{v}_2 = \frac{\partial f}{\partial \mathbf{W}_1}$ and $\bar{v}_3 = \frac{\partial f}{\partial \mathbf{W}_2}$.
One forward + one backward pass gave us gradients for all parameters ($\mathbf{W}_1$ and $\mathbf{W}_2$) simultaneously. If $\mathbf{W}_1$ is $d \times h$ and $\mathbf{W}_2$ is $h \times c$, that's $dh + hc$ partial derivatives from a single backward pass. Forward mode would have needed $dh + hc$ passes.
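The whole tape and backward pass fit in a short NumPy script. This is a sketch with made-up sizes and a random row-normalized matrix standing in for $\hat{\mathbf{A}}$; the summed cross-entropy matches the adjoints derived above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, c = 6, 4, 5, 3                      # illustrative sizes
X = rng.normal(size=(n, d))
A_hat = rng.random((n, n))
A_hat /= A_hat.sum(axis=1, keepdims=True)    # stand-in for normalized adjacency
W1 = 0.1 * rng.normal(size=(d, h))
W2 = 0.1 * rng.normal(size=(h, c))
y = rng.integers(0, c, size=n)

def loss_fn(W1, W2, keep_tape=False):
    v5 = A_hat @ X                           # neighbor aggregation
    v6 = v5 @ W1                             # linear transform
    v7 = np.maximum(v6, 0.0)                 # ReLU
    v8 = A_hat @ v7                          # second aggregation
    v9 = v8 @ W2                             # linear transform
    e = np.exp(v9 - v9.max(axis=1, keepdims=True))
    v10 = e / e.sum(axis=1, keepdims=True)   # softmax
    loss = -np.log(v10[np.arange(n), y]).sum()
    return (loss, (v5, v6, v8, v10)) if keep_tape else loss

loss, (v5, v6, v8, v10) = loss_fn(W1, W2, keep_tape=True)

# Backward pass, walking the tape in reverse
v9_bar = v10.copy()
v9_bar[np.arange(n), y] -= 1.0               # combined CE+softmax
W2_bar = v8.T @ v9_bar                       # df/dW2
v8_bar = v9_bar @ W2.T
v7_bar = A_hat.T @ v8_bar                    # adjoint flows through A_hat^T
v6_bar = v7_bar * (v6 > 0)                   # ReLU mask
W1_bar = v5.T @ v6_bar                       # df/dW1

# Finite-difference spot check on one entry of W1
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
num = (loss_fn(W1p, W2) - loss_fn(W1m, W2)) / (2 * eps)
assert abs(W1_bar[0, 0] - num) < 1e-4
```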

Adjoint Rules for Common Matrix Operations

The scalar adjoint rules extend naturally to matrices. Here are the key ones:

| Forward: $\mathbf{V}_k = \ldots$ | Backward: adjoint propagation |
| --- | --- |
| $\mathbf{A} \mathbf{B}$  (matmul) | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{V}}_k \mathbf{B}^T$,   $\bar{\mathbf{B}} \mathrel{+}= \mathbf{A}^T \bar{\mathbf{V}}_k$ |
| $\mathbf{A} + \mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{V}}_k$,   $\bar{\mathbf{B}} \mathrel{+}= \bar{\mathbf{V}}_k$ |
| $\text{ReLU}(\mathbf{A})$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{V}}_k \odot \mathbf{1}[\mathbf{A} > 0]$ |
| $\sigma(\mathbf{A})$  (elementwise) | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{V}}_k \odot \sigma'(\mathbf{A})$ |
| $\text{sum}(\mathbf{A})$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{v}_k \cdot \mathbf{1}$ |
Pattern: The adjoint of a matrix multiply $\mathbf{C} = \mathbf{A}\mathbf{B}$ involves transposes: $\bar{\mathbf{A}} = \bar{\mathbf{C}}\mathbf{B}^T$ and $\bar{\mathbf{B}} = \mathbf{A}^T\bar{\mathbf{C}}$. This is the matrix version of the product rule: for $c = a \cdot b$, $\bar{a} = \bar{c} \cdot b$ and $\bar{b} = a \cdot \bar{c}$.
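A quick numeric check of the matmul rule, using $f = \text{sum}(\mathbf{A}\mathbf{B})$ so the sum rule seeds the adjoint of $\mathbf{C} = \mathbf{A}\mathbf{B}$ with all ones:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

C_bar = np.ones((3, 2))          # adjoint of C = A @ B under f = sum(C)
A_bar = C_bar @ B.T              # matmul rule, left factor
B_bar = A.T @ C_bar              # matmul rule, right factor

# Central finite-difference check on one entry of A
def f(A_):
    return (A_ @ B).sum()

eps = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[1, 2] += eps
Am[1, 2] -= eps
assert abs(A_bar[1, 2] - (f(Ap) - f(Am)) / (2 * eps)) < 1e-6
```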

This IS Backpropagation

If you've studied neural networks, you've seen backpropagation. Here's the reveal: backprop is exactly reverse mode AD applied to the computation graph of a neural network.

| Backprop term | AD term |
| --- | --- |
| Forward pass | Evaluate tape, store activations |
| Backward pass | Reverse sweep, propagate adjoints |
| $\delta$ (error signal) | Adjoint $\bar{v}$ |
| Weight gradient | Adjoint of weight variable |
| Activation caching | Tape storage |

Backprop was invented specifically for neural networks. AD is the general-purpose version that works for any computation. Every deep learning framework (PyTorch, JAX, TensorFlow) is, at its core, a reverse-mode AD engine.


The Complete Toolkit

- Manual calculus: understand the math
- Finite differences: check your work
- Automatic differentiation: compute at scale