Exact derivatives, computed automatically, at near-zero cost. The engine behind deep learning.
We've seen two approaches so far: symbolic differentiation and finite differences. AD is neither. It's not symbolic differentiation (it doesn't produce formulas), and it's not finite differences (it has no approximation error). Instead, it works by systematically applying the chain rule to every elementary operation in your program.
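The "no approximation error" point is easy to see numerically. A minimal sketch (using `sin` as the function, an arbitrary choice for illustration): the chain-rule derivative AD would apply is exact, while a central finite difference carries truncation and round-off error.

```python
import math

def f(x):
    return math.sin(x)

x = 1.0
exact = math.cos(x)                    # the exact rule AD applies for sin
h = 1e-5
fd = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference

error = abs(fd - exact)
print(error)                            # small, but not zero
```

The finite-difference error shrinks with `h` only up to a point, after which floating-point cancellation dominates; AD sidesteps the trade-off entirely.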
The foundation of AD is decomposing any computation into a sequence of elementary operations. This sequence is called the tape. For example, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$:
```
v1 = x1         (input)
v2 = x2         (input)
v3 = v1 * v2    (multiply)
v4 = sin(v1)    (sin)
v5 = v3 + v4    (add)  ← output f
```
Each line is one elementary operation that we know how to differentiate. The tape records what was computed and how. AD then uses the chain rule on this tape -- either forwards or in reverse.
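The tape above can be sketched in code. This is a minimal illustration, not a real AD library: each entry records which operation produced which value from which inputs, which is exactly the information both AD modes will need.

```python
import math

def build_tape(x1, x2):
    """Evaluate f(x1, x2) = x1*x2 + sin(x1), recording each elementary op."""
    values = {"v1": x1, "v2": x2}          # inputs
    tape = []                               # entries: (output, op, inputs)

    values["v3"] = values["v1"] * values["v2"]
    tape.append(("v3", "mul", ("v1", "v2")))

    values["v4"] = math.sin(values["v1"])
    tape.append(("v4", "sin", ("v1",)))

    values["v5"] = values["v3"] + values["v4"]
    tape.append(("v5", "add", ("v3", "v4")))

    return values, tape

values, tape = build_tape(2.0, 3.0)
print(values["v5"])   # f(2, 3) = 2*3 + sin(2)
```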
Propagate derivatives forward alongside values. Uses dual numbers. One pass computes one directional derivative.
Cost for gradient: $n$ passes for $f: \mathbb{R}^n \to \mathbb{R}$
Best when: few inputs, many outputs
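The dual-number idea can be sketched directly: carry a derivative `dot` alongside each value `val`, and let each operation update both. One forward pass with the seed `x1.dot = 1` computes the directional derivative with respect to `x1`. A minimal sketch, not a production implementation:

```python
import math

class Dual:
    """Dual number val + dot*eps, with eps^2 = 0; dot carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # The product rule falls out of (a + b*eps)(c + d*eps)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def dsin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1, x2):
    return x1 * x2 + dsin(x1)

# Seed x1.dot = 1, x2.dot = 0: one pass yields df/dx1 at (2, 3),
# which is x2 + cos(x1) = 3 + cos(2).
out = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(out.dot)
```

Getting the full gradient this way requires a second pass with the seed on `x2` instead, which is exactly the "$n$ passes" cost above.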
Explore Forward Mode →

Record a tape forward, then propagate adjoints backward. One backward pass gives the entire gradient.
Cost for gradient: $1$ pass for $f: \mathbb{R}^n \to \mathbb{R}$
Best when: many inputs, few outputs (= deep learning!)
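Reverse mode can be sketched on the same tape. A forward pass records the intermediate values; a backward pass then propagates adjoints (`*_bar`, each holding df/dv) from the output back to the inputs, yielding the full gradient in one sweep. A minimal hand-unrolled sketch:

```python
import math

def f_with_grad(x1, x2):
    """f(x1, x2) = x1*x2 + sin(x1), value plus full gradient in one backward pass."""
    # Forward pass: evaluate and keep the intermediates (the tape).
    v1, v2 = x1, x2
    v3 = v1 * v2
    v4 = math.sin(v1)
    v5 = v3 + v4                     # output f

    # Backward pass: adjoint v_bar = df/dv, seeded at the output.
    v5_bar = 1.0
    v3_bar = v5_bar                  # add: adjoint passes through to both inputs
    v4_bar = v5_bar
    v1_bar = v4_bar * math.cos(v1)   # sin rule
    v1_bar += v3_bar * v2            # mul rule: v1 also feeds the product
    v2_bar = v3_bar * v1

    return v5, (v1_bar, v2_bar)

val, grad = f_with_grad(2.0, 3.0)
print(grad)   # (3 + cos(2), 2)
```

Note that `v1_bar` accumulates two contributions, one per use of `v1` in the tape; this summing over uses is what the backward pass does in general.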
Explore Reverse Mode →

| | Forward Mode | Reverse Mode |
|---|---|---|
| Computes per pass | One directional derivative $\nabla f \cdot v$ | Full gradient $\nabla f$ |
| Cost for full Jacobian | $n$ passes (one per input) | $m$ passes (one per output) |
| Best when | $n \ll m$ (few inputs, many outputs) | $m \ll n$ (many inputs, few outputs) |
| Memory | Low (no tape storage needed) | Higher (must store full tape) |
| Also known as | Tangent mode, JVP | Adjoint mode, VJP, backpropagation |