Automatic Differentiation

Exact derivatives, computed automatically, at near-zero cost. The engine behind deep learning.

Why AD?

We've seen two approaches so far:

Manual calculus
Exact but tedious and error-prone. Doesn't scale to complex algorithms.
Finite differences
Simple but approximate. Costs $O(n)$ function evaluations for the gradient of a function of $n$ variables.
Automatic differentiation gives us the best of both worlds: exact derivatives (to machine precision) computed automatically from code, with cost comparable to running the original function.

AD is not symbolic differentiation (it doesn't produce formulas). It's also not finite differences (no approximation error). It works by systematically applying the chain rule to every elementary operation in your program.

The Tape

The foundation of AD is decomposing any computation into a sequence of elementary operations. This sequence is called the tape. For example, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$:

v1 = x1          (input)
v2 = x2          (input)
v3 = v1 * v2     (multiply)
v4 = sin(v1)     (sin)
v5 = v3 + v4     (add)       ← output f

Each line is one elementary operation that we know how to differentiate. The tape records what was computed and how. AD then uses the chain rule on this tape -- either forwards or in reverse.
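The tape above can be written out directly as code. This is a minimal sketch: the function name `f_tape` is made up for illustration, and the variable names mirror the $v_i$ in the tape.

```python
import math

def f_tape(x1, x2):
    """Evaluate f(x1, x2) = x1*x2 + sin(x1) as an explicit tape:
    one elementary operation per line."""
    v1 = x1            # input
    v2 = x2            # input
    v3 = v1 * v2       # multiply
    v4 = math.sin(v1)  # sin
    v5 = v3 + v4       # add -> output f
    return v5
```

Each assignment is an operation whose local derivative we know in closed form; that is all AD needs.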

Two Modes

Forward Mode

Propagate derivatives forward alongside values. Uses dual numbers. One pass computes one directional derivative.

Cost for gradient: $n$ passes for $f: \mathbb{R}^n \to \mathbb{R}$

Best when: few inputs, many outputs

Explore Forward Mode →
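Forward mode can be sketched with a small dual-number class. This is a hypothetical minimal implementation (the class name `Dual` and field names `val`/`dot` are my own): each number carries its value and its derivative, and every elementary operation updates both.

```python
import math

class Dual:
    """Dual number: val carries f, dot carries the directional derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    # chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1, x2):
    return x1 * x2 + sin(x1)

# Seed dx1 = 1, dx2 = 0: one pass gives the partial df/dx1 at (2, 3).
out = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
```

Here `out.dot` equals $\partial f / \partial x_1 = x_2 + \cos(x_1)$ at $(2, 3)$; getting $\partial f / \partial x_2$ would require a second pass with seeds $(0, 1)$, which is why forward mode costs $n$ passes for a full gradient.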

Reverse Mode

Record a tape forward, then propagate adjoints backward. One backward pass gives the entire gradient.

Cost for gradient: $1$ pass for $f: \mathbb{R}^n \to \mathbb{R}$

Best when: many inputs, few outputs (= deep learning!)

Explore Reverse Mode →
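Reverse mode on the same tape can be sketched by hand: run the tape forward while keeping the intermediate values, then walk it backward accumulating adjoints $\bar{v}_i = \partial f / \partial v_i$. The function name `grad_f` is illustrative, not from any library.

```python
import math

def grad_f(x1, x2):
    """Reverse-mode AD for f = x1*x2 + sin(x1).
    Forward pass records values; backward pass propagates adjoints."""
    # Forward pass (the tape)
    v1, v2 = x1, x2
    v3 = v1 * v2
    v4 = math.sin(v1)
    v5 = v3 + v4

    # Backward pass, seeded with df/df = 1
    v5_bar = 1.0
    v3_bar = v5_bar                   # v5 = v3 + v4
    v4_bar = v5_bar
    v1_bar = v4_bar * math.cos(v1)    # v4 = sin(v1)
    v1_bar += v3_bar * v2             # v3 = v1 * v2 (v1 used twice: add)
    v2_bar = v3_bar * v1
    return v5, (v1_bar, v2_bar)
```

One backward pass yields both partials at once: $\nabla f = (x_2 + \cos x_1,\; x_1)$. Note that `v1_bar` accumulates contributions from both operations that consumed $v_1$, which is why adjoints are added, not overwritten.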

Forward vs Reverse at a Glance

|                        | Forward Mode                                  | Reverse Mode                          |
|------------------------|-----------------------------------------------|---------------------------------------|
| Computes per pass      | One directional derivative $\nabla f \cdot v$ | Full gradient $\nabla f$              |
| Cost for full Jacobian | $n$ passes (one per input)                    | $m$ passes (one per output)           |
| Best when              | $n \ll m$ (few inputs, many outputs)          | $m \ll n$ (many inputs, few outputs)  |
| Memory                 | Low (no tape storage needed)                  | Higher (must store full tape)         |
| Also known as          | Tangent mode, JVP                             | Adjoint mode, VJP, backpropagation    |
Deep learning: loss $f: \mathbb{R}^{n} \to \mathbb{R}$ with $n = $ millions of parameters. Reverse mode computes the full gradient in one forward + one backward pass. This is exactly backpropagation.