Dual numbers, tangents, and passes. Computing derivatives by propagating them forward alongside values.
The key idea: extend every number to carry along its derivative. Define dual numbers:

$$a + b\varepsilon, \qquad \text{where } \varepsilon^2 = 0$$
This $\varepsilon$ is not "small" -- it's an algebraic object where $\varepsilon^2$ is exactly zero. Arithmetic follows naturally:

$$(a + b\varepsilon) + (c + d\varepsilon) = (a + c) + (b + d)\varepsilon$$

$$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon + bd\varepsilon^2 = ac + (ad + bc)\varepsilon$$
Now watch what happens when we evaluate a function at a dual number. We don't need to do anything special -- we just need to define what elementary operations do on dual numbers. The arithmetic rules above already handle polynomials. For example, squaring:

$$(a + b\varepsilon)^2 = a^2 + 2ab\varepsilon + b^2\varepsilon^2 = a^2 + 2ab\varepsilon$$
The $\varepsilon^2$ term vanishes! The dual part $2ab$ is exactly $\frac{d}{da}(a^2) \cdot b$. This isn't a coincidence. Consider any smooth function $f$ and its Taylor expansion:

$$f(a + b\varepsilon) = f(a) + f'(a)\,b\varepsilon + \frac{f''(a)}{2!}(b\varepsilon)^2 + \frac{f'''(a)}{3!}(b\varepsilon)^3 + \cdots$$
But $(b\varepsilon)^2 = b^2\varepsilon^2 = 0$, and all higher powers of $\varepsilon$ are also zero. So the Taylor series truncates exactly:

$$f(a + b\varepsilon) = f(a) + f'(a)\,b\varepsilon$$
Setting $b = 1$: the real part gives $f(a)$ and the dual part gives $f'(a)$. Exact, no approximation.
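The dual-number rules above are enough to implement forward mode by operator overloading. A minimal sketch (the `Dual` class and its fields are our own names, not any particular library's API):

```python
class Dual:
    """A dual number a + b*eps, where eps**2 == 0."""
    def __init__(self, real, dual=0.0):
        self.real = real   # the value a
        self.dual = dual   # the tangent b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # the product rule falls out of eps**2 == 0
        return Dual(self.real * other.real,
                    self.dual * other.real + self.real * other.dual)

    __rmul__ = __mul__

# f(x) = x*x + 3*x, evaluated at 2 + 1*eps
x = Dual(2.0, 1.0)
f = x * x + 3 * x
print(f.real, f.dual)  # value f(2) = 10.0, derivative f'(2) = 7.0
```

Seeding the input with dual part 1 is exactly the "$b = 1$" choice: the result's real part is the value and its dual part is the exact derivative.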
For a multi-variable function, we decompose the computation into a tape of elementary operations and propagate dual numbers through each step. At each step, we compute both the value and the tangent (the dual part, written $\dot{v}$).
The tangent propagation rule at each step is just the chain rule. If $v_k = g(v_i, v_j)$, then:

$$\dot{v}_k = \frac{\partial g}{\partial v_i}\,\dot{v}_i + \frac{\partial g}{\partial v_j}\,\dot{v}_j$$
For specific operations:
| Operation | Value | Tangent |
|---|---|---|
| $v_k = v_i + v_j$ | $v_i + v_j$ | $\dot{v}_i + \dot{v}_j$ |
| $v_k = v_i \times v_j$ | $v_i \cdot v_j$ | $\dot{v}_i v_j + v_i \dot{v}_j$ |
| $v_k = \sin(v_i)$ | $\sin(v_i)$ | $\dot{v}_i \cos(v_i)$ |
| $v_k = \exp(v_i)$ | $\exp(v_i)$ | $\dot{v}_i \exp(v_i)$ |
| $v_k = \ln(v_i)$ | $\ln(v_i)$ | $\dot{v}_i / v_i$ |
| $v_k = v_i^2$ | $v_i^2$ | $2 v_i \dot{v}_i$ |
These are exactly the dual-number rules in table form. The tangent $\dot{v}_k$ tells you: "if the inputs change by $\dot{v}_i$, how does this intermediate value change?"
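The table translates almost line-for-line into code. A sketch, representing each intermediate as a `(value, tangent)` pair (the helper names below are our own):

```python
import math

# Each rule takes (value, tangent) pairs and returns the output pair,
# mirroring the table above.
def add(vi, vj):  return (vi[0] + vj[0], vi[1] + vj[1])
def mul(vi, vj):  return (vi[0] * vj[0], vi[1] * vj[0] + vi[0] * vj[1])
def sin(vi):      return (math.sin(vi[0]), vi[1] * math.cos(vi[0]))
def exp(vi):      return (math.exp(vi[0]), vi[1] * math.exp(vi[0]))
def log(vi):      return (math.log(vi[0]), vi[1] / vi[0])
def square(vi):   return (vi[0] ** 2, 2 * vi[0] * vi[1])

print(mul((2.0, 1.0), (3.0, 0.0)))  # (6.0, 3.0): d(x*3)/dx at x=2
```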
To compute $\frac{\partial f}{\partial x_i}$, we "seed" the inputs:

$$\dot{x}_i = 1, \qquad \dot{x}_j = 0 \quad \text{for all } j \neq i$$
Then we propagate the tape. The final tangent $\dot{f}$ equals $\frac{\partial f}{\partial x_i}$.
More generally, seeding with an arbitrary vector $\dot{x} = v$ computes the directional derivative (Jacobian-vector product, or JVP):

$$\dot{f} = J_f(x)\, v$$
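A quick numerical check of the JVP identity on a made-up function $f(x_1, x_2) = x_1^2\, x_2$ (the tape decomposition below is our own choice):

```python
def f_with_tangent(x1, x2, v1, v2):
    # tape for f(x1, x2) = x1**2 * x2, seeded with direction (v1, v2)
    v3, dv3 = x1 ** 2, 2 * x1 * v1            # square rule
    v4, dv4 = v3 * x2, dv3 * x2 + v3 * v2     # product rule
    return v4, dv4

x1, x2 = 2.0, 3.0

# one pass with seed v = (5, 7) gives grad(f) . v directly
_, jvp = f_with_tangent(x1, x2, 5.0, 7.0)

# same answer assembled from the two coordinate passes
_, d1 = f_with_tangent(x1, x2, 1.0, 0.0)   # df/dx1 = 2*x1*x2 = 12
_, d2 = f_with_tangent(x1, x2, 0.0, 1.0)   # df/dx2 = x1**2 = 4
print(jvp, 5.0 * d1 + 7.0 * d2)            # 88.0 88.0
```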
Let's see all passes needed for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ at $x_1 = \pi/4, x_2 = 3$.
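The same trace can be run offline as a plain-Python sketch (the decomposition into multiply, sin, and add steps below is one reasonable choice of tape):

```python
import math

def forward_pass(x1, x2, dx1, dx2):
    # tape for f(x1, x2) = x1*x2 + sin(x1), propagating (value, tangent)
    v3, dv3 = x1 * x2, dx1 * x2 + x1 * dx2        # multiply
    v4, dv4 = math.sin(x1), dx1 * math.cos(x1)    # sin
    v5, dv5 = v3 + v4, dv3 + dv4                  # add -> output f
    return v5, dv5

x1, x2 = math.pi / 4, 3.0
_, df_dx1 = forward_pass(x1, x2, 1.0, 0.0)  # pass 1: seed (1, 0)
_, df_dx2 = forward_pass(x1, x2, 0.0, 1.0)  # pass 2: seed (0, 1)
print(df_dx1)  # x2 + cos(x1) ≈ 3.7071
print(df_dx2)  # x1 ≈ 0.7854
```

Two inputs means two passes for the full gradient: one per seed vector.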
Try tracing forward mode on these functions. The widget above supports all of them -- select from the dropdown and step through.
```
v1 = x1        (input)
v2 = x2        (input)
v3 = v1^2      (square)
v4 = v3 + v2   (add)
v5 = exp(v4)   (exp)   ← output f
```
For seed $\dot{x} = (1, 0)$: the tangent at $v_3$ is $2x_1 \cdot 1 = 2x_1$, at $v_4$ is $2x_1 + 0$, and at $v_5$ is $2x_1 \exp(x_1^2 + x_2)$. This is $\frac{\partial f}{\partial x_1}$.
```
v1 = x1        (input)
v2 = x2        (input)
v3 = x3        (input)
v4 = v1 * v2   (multiply)
v5 = v4 * v3   (multiply)   ← output f
```
This needs 3 passes for the full gradient. Each pass seeds one input with 1. Try it in the widget!
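The three passes can be scripted in a few lines (a sketch; the input values are arbitrary):

```python
def f_tape(x, dx):
    # tape for f = x1 * x2 * x3 with tangent propagation
    v4, dv4 = x[0] * x[1], dx[0] * x[1] + x[0] * dx[1]
    v5, dv5 = v4 * x[2], dv4 * x[2] + v4 * dx[2]
    return v5, dv5

x = [2.0, 3.0, 5.0]
grad = []
for i in range(3):          # one forward pass per input
    seed = [0.0, 0.0, 0.0]
    seed[i] = 1.0
    _, tangent = f_tape(x, seed)
    grad.append(tangent)
print(grad)  # [x2*x3, x1*x3, x1*x2] = [15.0, 10.0, 6.0]
```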
```
v1 = x1        (input)
v2 = x2        (input)
v3 = ln(v1)    (log)
v4 = v1 * v2   (multiply)
v5 = sin(v2)   (sin)
v6 = v3 + v4   (add)
v7 = v6 - v5   (subtract)   ← output f
```
With seed $(1, 0)$: $\dot{v}_3 = 1/x_1$, $\dot{v}_4 = x_2$, $\dot{v}_5 = 0$, $\dot{v}_6 = 1/x_1 + x_2$, $\dot{v}_7 = 1/x_1 + x_2$. This is $\frac{\partial f}{\partial x_1} = \frac{1}{x_1} + x_2$. ✔
Forward mode is often overlooked in favor of reverse mode (backprop), but it has real advantages:

- **Constant memory.** Tangents propagate alongside values, so nothing needs to be stored for a later pass -- unlike reverse mode, which must record the tape.
- **Cheap directional derivatives.** One pass computes a JVP for roughly the cost of one function evaluation, regardless of input dimension.
- **Efficient for few inputs, many outputs.** The full Jacobian needs one pass per input, so "tall" Jacobians favor forward mode.
- **Simple to implement.** Operator overloading with dual numbers is all it takes -- no graph capture, no reversal.
In practice, modern AD systems (JAX, PyTorch) support both modes. jax.jvp is forward mode; jax.grad is reverse mode.
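A minimal sketch of both entry points on the running example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] * x[1] + jnp.sin(x[0])

x = jnp.array([jnp.pi / 4, 3.0])

# forward mode: one JVP gives the directional derivative along v
v = jnp.array([1.0, 0.0])
y, tangent = jax.jvp(f, (x,), (v,))   # tangent == df/dx1

# reverse mode: one call gives the whole gradient
g = jax.grad(f)(x)                    # g == [df/dx1, df/dx2]
```

With seed $v = (1, 0)$, the JVP matches the first component of the reverse-mode gradient, as expected.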