Adjoints, the backward pass, and backpropagation. How to differentiate a million parameters in one pass.
The Scaling Problem
Forward mode needs $n$ passes for the gradient of $f: \mathbb{R}^n \to \mathbb{R}$. For a neural network loss with $n = 10^6$ parameters, that's a million passes. We need something better.
Reverse mode computes the entire gradient in a single forward pass plus a single backward pass -- regardless of $n$.
The Two Phases
Phase 1: Forward pass
Evaluate the function normally, computing each $v_k$ in order. Record the tape: store every operation and its intermediate values. These stored values will be needed in the backward pass.
Phase 2: Backward pass
Walk the tape in reverse. At each step, compute the adjoint $\bar{v}_k = \frac{\partial f}{\partial v_k}$ -- how much the final output $f$ changes per unit change in $v_k$.
Adjoints: What and Why
The adjoint of a variable $v_i$ is defined as:
$$\bar{v}_i = \frac{\partial f}{\partial v_i}$$
This is the sensitivity of the final output to a small change in $v_i$. The backward pass computes all adjoints efficiently by applying the chain rule in reverse.
The Adjoint Propagation Rule
Suppose $v_k = g(v_i, v_j)$ is an operation on the tape. During the backward pass, when we process $v_k$, we push its adjoint back to its inputs:

$$\bar{v}_i \mathrel{+}= \bar{v}_k \cdot \frac{\partial v_k}{\partial v_i}, \qquad \bar{v}_j \mathrel{+}= \bar{v}_k \cdot \frac{\partial v_k}{\partial v_j}$$
The $\mathrel{+}=$ is crucial! If $v_i$ is used in multiple operations (e.g., $v_i$ appears in both $v_3$ and $v_4$), then $v_i$ receives adjoint contributions from each use. This is the multivariate chain rule:
$$\bar{v}_i = \sum_{k : v_i \text{ is input to } v_k} \bar{v}_k \cdot \frac{\partial v_k}{\partial v_i}$$
Notice: these are the same local derivatives as forward mode, but now they're multiplied by the adjoint $\bar{v}_k$ coming from "above" instead of the tangent $\dot{v}_i$ coming from "below". The information flows in the opposite direction.
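The propagation rule is only a few lines of code. Below is a minimal, hypothetical tape-based sketch (not any real library's API): each primitive records its local partials during the forward pass, and `backward` replays the tape in reverse, applying the `+=` accumulation rule.

```python
import math

class Tape:
    """Minimal reverse-mode AD tape (illustrative sketch, not a real library)."""

    def __init__(self):
        self.entries = []   # (output_id, [(input_id, local_partial), ...])
        self.values = {}    # id -> forward value
        self.counter = 0

    def var(self, value):
        vid = self.counter
        self.counter += 1
        self.values[vid] = value
        return vid

    def record(self, value, partials):
        vid = self.var(value)
        self.entries.append((vid, partials))
        return vid

    # Each primitive stores its local partials dv_k/dv_i on the tape.
    def add(self, i, j):
        return self.record(self.values[i] + self.values[j], [(i, 1.0), (j, 1.0)])

    def mul(self, i, j):
        vi, vj = self.values[i], self.values[j]
        return self.record(vi * vj, [(i, vj), (j, vi)])

    def sin(self, i):
        return self.record(math.sin(self.values[i]),
                           [(i, math.cos(self.values[i]))])

    def backward(self, output_id):
        bar = {vid: 0.0 for vid in self.values}
        bar[output_id] = 1.0                       # seed: df/df = 1
        for vid, partials in reversed(self.entries):
            for input_id, local in partials:
                bar[input_id] += bar[vid] * local  # the += accumulation rule
        return bar

# f(x1, x2) = x1*x2 + sin(x1), the running example
t = Tape()
x1, x2 = t.var(math.pi / 4), t.var(3.0)
f = t.add(t.mul(x1, x2), t.sin(x1))
bar = t.backward(f)
# bar[x1] = x2 + cos(x1), bar[x2] = x1
```

Note how `x1` receives two `+=` contributions (one from the multiply, one from the sine), exactly as the multivariate chain rule requires.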
Interactive: Reverse Mode Step-Through
First step through the forward pass (blue button) to record the tape. Then step through the backward pass (orange button) to propagate adjoints. Watch how each node pushes its adjoint to its inputs.
Worked Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$
At $x_1 = \pi/4 \approx 0.7854$, $x_2 = 3$. Let's trace both phases completely.

Forward pass (record the tape):
- $v_1 = x_1 = 0.7854$
- $v_2 = x_2 = 3$
- $v_3 = v_1 \cdot v_2 = 2.3562$
- $v_4 = \sin(v_1) = 0.7071$
- $v_5 = v_3 + v_4 = 3.0633$ (the output $f$)

Backward pass (propagate adjoints):
- $\bar{v}_5 = 1$ (seed)
- $\bar{v}_3 \mathrel{+}= \bar{v}_5 \cdot 1 = 1$, $\quad \bar{v}_4 \mathrel{+}= \bar{v}_5 \cdot 1 = 1$
- $\bar{v}_1 \mathrel{+}= \bar{v}_4 \cdot \cos(v_1) = 0.7071$
- $\bar{v}_1 \mathrel{+}= \bar{v}_3 \cdot v_2 = 3$, $\quad \bar{v}_2 \mathrel{+}= \bar{v}_3 \cdot v_1 = 0.7854$

Result: $\bar{v}_1 = \frac{\partial f}{\partial x_1} = 3.7071$ and $\bar{v}_2 = \frac{\partial f}{\partial x_2} = 0.7854$.
Both partial derivatives computed in one backward pass!
Key observation: $v_1$ (i.e., $x_1$) was used in two operations ($v_3$ and $v_4$). Its adjoint accumulated contributions from both: $\bar{v}_1 = 3 + 0.7071$. This is the multivariate chain rule in action.
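The hand-traced adjoints are easy to sanity-check against central finite differences. This pure-Python sketch (not part of the lesson's widget) compares the closed-form adjoints to a numerical derivative:

```python
import math

def f(x1, x2):
    return x1 * x2 + math.sin(x1)

x1, x2 = math.pi / 4, 3.0

# Adjoints from the backward pass traced above
grad_x1 = x2 + math.cos(x1)   # v̄1 = 3 + 0.7071 = 3.7071
grad_x2 = x1                  # v̄2 = 0.7854

# Central finite differences as an independent check
h = 1e-6
fd_x1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
fd_x2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

assert abs(grad_x1 - fd_x1) < 1e-6
assert abs(grad_x2 - fd_x2) < 1e-6
```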
The Tape: Memory Trade-offs
During the forward pass, we must store intermediate values (the tape) because the backward pass needs them to compute local derivatives. For example, the adjoint of $v_k = v_i \times v_j$ needs the values $v_i$ and $v_j$ from the forward pass.
Memory cost: The tape stores all intermediate values. For a deep neural network with $L$ layers and $d$ neurons per layer, the tape holds $O(L \cdot d)$ activations per training example (multiplied by the batch size). This is why large models need enormous GPU memory for training.
Techniques to reduce memory:
- Checkpointing: store only some intermediate values; recompute the rest during the backward pass
- Gradient accumulation: process mini-batches sequentially instead of all at once
- Mixed precision: store activations in float16 instead of float32
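Checkpointing is easy to sketch for a plain chain of layers. The helper below is hypothetical (a chain of `tanh` "layers" stands in for a real network), but it shows the core move: store every `stride`-th activation on the forward pass, then recompute the dropped values segment by segment during the backward pass.

```python
import math

def g(x):                      # one "layer" of the chain
    return math.tanh(x)

def dg(x):                     # its local derivative
    return 1.0 - math.tanh(x) ** 2

def grad_with_checkpoints(x0, n_layers, stride):
    """d/dx0 of g applied n_layers times, storing ~n_layers/stride values."""
    checkpoints = {0: x0}
    x = x0
    for k in range(1, n_layers + 1):           # forward pass
        x = g(x)
        if k % stride == 0:
            checkpoints[k] = x                 # keep only every stride-th value
    adj = 1.0                                  # backward pass
    for k in range(n_layers, 0, -1):
        base = (k - 1) // stride * stride      # nearest checkpoint <= k-1
        v = checkpoints[base]
        for _ in range(k - 1 - base):          # recompute the dropped values
            v = g(v)
        adj *= dg(v)                           # chain rule through layer k
    return adj

# Reference: full-storage gradient (store every activation)
xs = [0.3]
for _ in range(12):
    xs.append(g(xs[-1]))
full = 1.0
for x in xs[:-1]:
    full *= dg(x)
assert abs(grad_with_checkpoints(0.3, 12, 4) - full) < 1e-12
```

The trade is compute for memory: each backward step may redo up to `stride - 1` forward evaluations, but only $O(n/\text{stride})$ activations are ever stored at once.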
Reverse Mode as VJP
Just as forward mode computes a Jacobian-vector product (JVP: $J \cdot v$), reverse mode computes a vector-Jacobian product (VJP: $v^T \cdot J$).
Starting with $\bar{f} = 1$ and propagating backward gives $\bar{x} = \nabla f^T$ -- the gradient as a row vector. More generally, for $f: \mathbb{R}^n \to \mathbb{R}^m$, seeding with a row vector $\bar{y} \in \mathbb{R}^{1 \times m}$ gives:
$$\bar{x} = \bar{y} \cdot J_f$$
One backward pass gives one row of the Jacobian. For $m = 1$ (scalar output), one pass gives the entire gradient.
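As a concrete (made-up) example, take $f(x) = \mathbf{A}\tanh(x)$ with $f: \mathbb{R}^3 \to \mathbb{R}^2$. Seeding the backward pass with each basis row vector $e_i^T$ recovers the Jacobian one row at a time:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
x = np.array([0.1, 0.2, 0.3])

def vjp(v):
    # Backward through f(x) = A @ tanh(x): propagate v^T operation by operation.
    v = v @ A                        # adjoint through the matmul: v^T A
    return v * (1 - np.tanh(x)**2)   # adjoint through tanh: elementwise scaling

# Analytic Jacobian for comparison: J = A @ diag(tanh'(x))
J = A * (1 - np.tanh(x)**2)

for i in range(2):
    row = vjp(np.eye(2)[i])          # seed with e_i^T -> row i of J
    assert np.allclose(row, J[i])
```

For $m = 1$ there is only one row, which is why a single VJP yields the full gradient.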
Real-World Example: Graph Convolutional Network
Let's trace reverse mode through something more realistic: a 2-layer Graph Convolutional Network (GCN) for node classification. The forward computation is:

$$f = \text{CE}\!\left(\text{softmax}\!\left(\hat{\mathbf{A}}\,\text{ReLU}\!\left(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}_1\right)\mathbf{W}_2\right),\ \mathbf{y}\right)$$
where $\hat{\mathbf{A}}$ is the normalized adjacency matrix (aggregation over neighbors), $\mathbf{X}$ is the input feature matrix, $\mathbf{W}_1, \mathbf{W}_2$ are learnable weight matrices, and $\mathbf{y}$ are labels.
The Tape (Forward Pass)
```
v1  = X                      (input features, n x d)
v2  = W1                     (weights, d x h)
v3  = W2                     (weights, h x c)
v4  = A_hat                  (normalized adjacency, n x n)
v5  = v4 @ v1                (neighbor aggregation, n x d)
v6  = v5 @ v2                (linear transform, n x h)
v7  = relu(v6)               (activation, n x h)
v8  = v4 @ v7                (second aggregation, n x h)
v9  = v8 @ v3                (linear transform, n x c)
v10 = softmax(v9, axis=1)    (class probabilities, n x c)
v11 = cross_entropy(v10, y)  (scalar loss)
```
The Backward Pass
Start: $\bar{v}_{11} = 1$. Walk backward through the tape.
v11 = cross_entropy(v10, y)
$\bar{v}_{10} = \frac{\partial \text{CE}}{\partial v_{10}}$. For cross-entropy with softmax, this simplifies to: $\bar{v}_{10,ij} = v_{10,ij} - \mathbf{1}[j = y_i]$ (predicted minus one-hot target).
v10 = softmax(v9)
The softmax Jacobian is $\text{diag}(p) - pp^T$ per row. The push is, row by row: $\bar{v}_9 = v_{10} \odot \left(\bar{v}_{10} - \langle \bar{v}_{10}, v_{10} \rangle\right)$, where $\langle \bar{v}_{10}, v_{10} \rangle$ is that row's scalar dot product, broadcast across the row.
In practice, combined CE+softmax backward is just $\bar{v}_9 = v_{10} - \text{onehot}(y)$, avoiding the full Jacobian.
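That shortcut is easy to verify numerically. The sketch below (illustrative code, not from the lesson; the `loss` helper is mine) compares `probs - onehot(y)` against finite differences on the logits:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))           # n=4 nodes, c=3 classes
y = np.array([0, 2, 1, 2])

def loss(z):
    # Softmax (with the usual max-subtraction for stability) + cross-entropy sum
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).sum()

p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
grad = p - np.eye(3)[y]                    # the claimed shortcut: probs - onehot

# Finite differences on each logit as an independent check
h = 1e-6
fd = np.zeros_like(logits)
for idx in np.ndindex(logits.shape):
    zp, zm = logits.copy(), logits.copy()
    zp[idx] += h
    zm[idx] -= h
    fd[idx] = (loss(zp) - loss(zm)) / (2 * h)

assert np.allclose(grad, fd, atol=1e-5)
```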
v9 = v8 @ v3
Matmul adjoints: $\bar{v}_8 = \bar{v}_9 \cdot v_3^T$ and $\bar{v}_3 \mathrel{+}= v_8^T \cdot \bar{v}_9$ -- the gradient for $\mathbf{W}_2$.
v8 = v4 @ v7
$\bar{v}_7 \mathrel{+}= v_4^T \cdot \bar{v}_8$ (adjoint aggregates over neighbors, transposed!)
$\hat{\mathbf{A}}$ is constant (not learned), so we don't need $\bar{v}_4$.
v7 = relu(v6)
ReLU adjoint: $\bar{v}_6 = \bar{v}_7 \odot \mathbf{1}[v_6 > 0]$ (zero out where input was negative).
This is why ReLU is popular: the backward pass is just a mask!
v6 = v5 @ v2
Matmul adjoints: $\bar{v}_5 = \bar{v}_6 \cdot v_2^T$ and $\bar{v}_2 \mathrel{+}= v_5^T \cdot \bar{v}_6$ -- the gradient for $\mathbf{W}_1$.
v5 = v4 @ v1
$\bar{v}_1 \mathrel{+}= v_4^T \cdot \bar{v}_5$ (gradient w.r.t. input features -- not needed for training, but useful for feature attribution)
Done! We now have $\bar{v}_2 = \frac{\partial f}{\partial \mathbf{W}_1}$ and $\bar{v}_3 = \frac{\partial f}{\partial \mathbf{W}_2}$.
One forward + one backward pass gave us gradients for all parameters ($\mathbf{W}_1$ and $\mathbf{W}_2$) simultaneously. If $\mathbf{W}_1$ is $d \times h$ and $\mathbf{W}_2$ is $h \times c$, that's $dh + hc$ partial derivatives from a single backward pass. Forward mode would have needed $dh + hc$ passes.
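The whole walk condenses into a short NumPy sketch. Everything below (random data, shapes, the `forward` helper) is illustrative, but the backward lines mirror the tape steps one-to-one, and a finite-difference probe confirms the $\mathbf{W}_1$ gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, hdim, c = 5, 4, 6, 3
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, hdim)) * 0.5
W2 = rng.normal(size=(hdim, c)) * 0.5
A = rng.random((n, n)) < 0.4
A = A | A.T | np.eye(n, dtype=bool)          # symmetrize, add self-loops
deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))      # D^{-1/2} A D^{-1/2}
y = rng.integers(0, c, size=n)

def forward(W1, W2):
    v5 = A_hat @ X                  # neighbor aggregation
    v6 = v5 @ W1                    # linear transform
    v7 = np.maximum(v6, 0)          # ReLU
    v8 = A_hat @ v7                 # second aggregation
    v9 = v8 @ W2                    # logits
    p = np.exp(v9 - v9.max(axis=1, keepdims=True))
    v10 = p / p.sum(axis=1, keepdims=True)
    loss = -np.log(v10[np.arange(n), y]).sum()
    return loss, (v5, v6, v7, v8, v10)

loss, (v5, v6, v7, v8, v10) = forward(W1, W2)

# Backward pass: walk the tape in reverse
v9_bar = v10 - np.eye(c)[y]         # combined softmax+CE shortcut
W2_bar = v8.T @ v9_bar              # adjoint of second matmul (W2 gradient)
v8_bar = v9_bar @ W2.T
v7_bar = A_hat.T @ v8_bar           # adjoint of aggregation (transposed!)
v6_bar = v7_bar * (v6 > 0)          # ReLU mask
W1_bar = v5.T @ v6_bar              # adjoint of first matmul (W1 gradient)

# Finite-difference check on one entry of W1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
fd = (forward(Wp, W2)[0] - forward(Wm, W2)[0]) / (2 * eps)
assert abs(W1_bar[0, 0] - fd) < 1e-4
```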
Adjoint Rules for Common Matrix Operations
The scalar adjoint rules extend naturally to matrices. Here are the key ones:

| Forward operation | Adjoint pushes |
|---|---|
| $\mathbf{C} = \mathbf{A}\mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}}\,\mathbf{B}^T$, $\;\bar{\mathbf{B}} \mathrel{+}= \mathbf{A}^T\bar{\mathbf{C}}$ |
| $\mathbf{C} = \mathbf{A} + \mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}}$, $\;\bar{\mathbf{B}} \mathrel{+}= \bar{\mathbf{C}}$ |
| $\mathbf{C} = \mathbf{A}^T$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}}^T$ |
| $\mathbf{C} = \mathbf{A} \odot \mathbf{B}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{C}} \odot \mathbf{B}$, $\;\bar{\mathbf{B}} \mathrel{+}= \bar{\mathbf{C}} \odot \mathbf{A}$ |
| $c = \sum_{ij} A_{ij}$ | $\bar{\mathbf{A}} \mathrel{+}= \bar{c}\,\mathbf{1}\mathbf{1}^T$ |
Pattern: The adjoint of a matrix multiply $\mathbf{C} = \mathbf{A}\mathbf{B}$ involves transposes: $\bar{\mathbf{A}} = \bar{\mathbf{C}}\mathbf{B}^T$ and $\bar{\mathbf{B}} = \mathbf{A}^T\bar{\mathbf{C}}$. This is the matrix version of the product rule: for $c = a \cdot b$, $\bar{a} = \bar{c} \cdot b$ and $\bar{b} = a \cdot \bar{c}$.
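A quick numerical check of the matmul rule. Assume (my setup, for illustration) a scalar loss $L = \sum_{ij} G_{ij} C_{ij}$ for a fixed matrix $G$, so that $\bar{\mathbf{C}} = \mathbf{G}$; the rule then predicts $\bar{\mathbf{A}} = \mathbf{G}\mathbf{B}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))        # plays the role of C_bar

A_bar = G @ B.T                    # the matmul adjoint rule
B_bar = A.T @ G

# Finite-difference check on one entry of A, with L = sum(G * (A @ B))
h = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[1, 2] += h
Am[1, 2] -= h
fd = (np.sum((Ap @ B) * G) - np.sum((Am @ B) * G)) / (2 * h)
assert abs(A_bar[1, 2] - fd) < 1e-6
```

The transposes are forced by the shapes: $\bar{\mathbf{A}}$ must match $\mathbf{A}$ ($3 \times 4$), and $\bar{\mathbf{C}}\mathbf{B}^T$ is the only way to combine a $3 \times 2$ adjoint with $\mathbf{B}$ to get that shape.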
This IS Backpropagation
If you've studied neural networks, you've seen backpropagation. Here's the reveal: backprop is exactly reverse mode AD applied to the computation graph of a neural network.
| Backprop term | AD term |
|---|---|
| Forward pass | Evaluate tape, store activations |
| Backward pass | Reverse sweep, propagate adjoints |
| $\delta$ (error signal) | Adjoint $\bar{v}$ |
| Weight gradient | Adjoint of weight variable |
| Activation caching | Tape storage |
Backprop was invented specifically for neural networks. AD is the general-purpose version that works for any computation. Every deep learning framework (PyTorch, JAX, TensorFlow) is, at its core, a reverse-mode AD engine.
Interactive: Cost Comparison
Set the input and output dimensions to see which mode is more efficient.