Forward Mode AD

Dual numbers, tangents, and passes: computing derivatives by propagating them forward alongside the values.

Dual Numbers

The key idea: extend every number to carry along its derivative. Define dual numbers:

$$a + b\varepsilon \qquad \text{where } \varepsilon^2 = 0$$

This $\varepsilon$ is not "small" -- it's an algebraic object where $\varepsilon^2$ is exactly zero. Arithmetic follows naturally:

$(a + b\varepsilon) + (c + d\varepsilon) = (a+c) + (b+d)\varepsilon$
$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon$

Now watch what happens when we evaluate a function at a dual number. We don't need to do anything special -- we just need to define what elementary operations do on dual numbers. The arithmetic rules above already handle polynomials. For example, squaring:

$$(a + b\varepsilon)^2 = a^2 + 2ab\varepsilon + b^2\varepsilon^2 = a^2 + 2ab\varepsilon$$

The $\varepsilon^2$ term vanishes! The dual part $2ab$ is exactly $\frac{d}{da}(a^2) \cdot b$. This isn't a coincidence. Consider any smooth function $f$ and its Taylor expansion:

$$f(a + b\varepsilon) = f(a) + f'(a)(b\varepsilon) + \frac{1}{2}f''(a)(b\varepsilon)^2 + \cdots$$

But $(b\varepsilon)^2 = b^2\varepsilon^2 = 0$, and all higher powers of $\varepsilon$ are also zero. So the Taylor series truncates exactly:

$$f(a + b\varepsilon) = f(a) + f'(a) \cdot b\varepsilon$$

Setting $b = 1$: the real part gives $f(a)$ and the dual part gives $f'(a)$. Exact, no approximation.

Key distinction from finite differences: In FD, we compute $(f(a+h) - f(a))/h$ and suffer from both truncation and cancellation errors. With dual numbers, the derivative appears directly in the $\varepsilon$ coefficient -- no subtraction, no division, no cancellation. The answer is exact to machine precision.
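To see the cancellation problem concretely, here is a small sketch (plain Python, illustrative values) comparing a forward difference for $\sin$ at $x = 1$ against the exact derivative $\cos(1)$:

```python
import math

a = 1.0
exact = math.cos(a)  # true derivative of sin at a

for h in (1e-4, 1e-8, 1e-12):
    fd = (math.sin(a + h) - math.sin(a)) / h
    print(f"h = {h:g}   |error| = {abs(fd - exact):.1e}")
```

Shrinking $h$ first reduces truncation error, but below some point cancellation in the subtraction takes over and the error grows again; no choice of $h$ reaches machine precision.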
Every elementary function gets a dual-number rule:
$\sin(a + b\varepsilon) = \sin(a) + b\cos(a)\varepsilon$
$\exp(a + b\varepsilon) = \exp(a) + b\exp(a)\varepsilon$
$\ln(a + b\varepsilon) = \ln(a) + \frac{b}{a}\varepsilon$
$\frac{1}{a + b\varepsilon} = \frac{1}{a} - \frac{b}{a^2}\varepsilon$
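These rules are enough for a toy implementation via operator overloading. A minimal Python sketch (the names `Dual`, `dsin`, `dlog` are illustrative, not from any real library):

```python
import math

class Dual:
    """Dual number a + b*eps, with eps^2 = 0."""
    def __init__(self, real, dual=0.0):
        self.real, self.dual = real, dual

    def __add__(self, other):
        # (a + b eps) + (c + d eps) = (a+c) + (b+d) eps
        return Dual(self.real + other.real, self.dual + other.dual)

    def __mul__(self, other):
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

def dsin(x):  # sin(a + b eps) = sin(a) + b cos(a) eps
    return Dual(math.sin(x.real), x.dual * math.cos(x.real))

def dlog(x):  # ln(a + b eps) = ln(a) + (b/a) eps
    return Dual(math.log(x.real), x.dual / x.real)

# f(x) = x^2: seed the dual part with 1 to get f'(3) = 6 exactly
x = Dual(3.0, 1.0)
y = x * x
print(y.real, y.dual)  # value 9.0, derivative 6.0
```

Seeding the input's dual part with 1 and reading off the output's dual part is the entire mechanism.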
Interactive: Dual Number Calculator

Enter dual numbers and pick an operation. The dual part of the result is the derivative!

The Tape and Tangent Propagation

For a multi-variable function, we decompose the computation into a tape of elementary operations and propagate dual numbers through each step. At each step, we compute both the value and the tangent (the dual part, written $\dot{v}$).

The tangent propagation rule at each step is just the chain rule. If $v_k = g(v_i, v_j)$, then:

$$\dot{v}_k = \frac{\partial g}{\partial v_i} \dot{v}_i + \frac{\partial g}{\partial v_j} \dot{v}_j$$

For specific operations:

| Operation | Value | Tangent |
|---|---|---|
| $v_k = v_i + v_j$ | $v_i + v_j$ | $\dot{v}_i + \dot{v}_j$ |
| $v_k = v_i \times v_j$ | $v_i \cdot v_j$ | $\dot{v}_i v_j + v_i \dot{v}_j$ |
| $v_k = \sin(v_i)$ | $\sin(v_i)$ | $\dot{v}_i \cos(v_i)$ |
| $v_k = \exp(v_i)$ | $\exp(v_i)$ | $\dot{v}_i \exp(v_i)$ |
| $v_k = \ln(v_i)$ | $\ln(v_i)$ | $\dot{v}_i / v_i$ |
| $v_k = v_i^2$ | $v_i^2$ | $2 v_i \dot{v}_i$ |

These are exactly the dual-number rules in table form. The tangent $\dot{v}_k$ tells you: "if the inputs change by $\dot{v}_i$, how does this intermediate value change?"
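The table translates directly into a small tape interpreter. A sketch (the tape encoding here is invented for illustration; each entry names an operation and its input indices):

```python
import math

def forward_sweep(tape, x, xdot):
    """Propagate (value, tangent) pairs through a tape of elementary ops."""
    v, vd = [], []
    for op, args in tape:
        if op == "input":                 # read value and seed
            v.append(x[args]); vd.append(xdot[args])
        elif op == "add":
            i, j = args
            v.append(v[i] + v[j]); vd.append(vd[i] + vd[j])
        elif op == "mul":                 # product rule
            i, j = args
            v.append(v[i] * v[j]); vd.append(vd[i] * v[j] + v[i] * vd[j])
        elif op == "square":
            i = args
            v.append(v[i] ** 2); vd.append(2 * v[i] * vd[i])
        elif op == "exp":
            i = args
            v.append(math.exp(v[i])); vd.append(vd[i] * math.exp(v[i]))
    return v[-1], vd[-1]

# f(x) = exp(x^2): the tangent should be 2x * exp(x^2)
tape = [("input", 0), ("square", 0), ("exp", 1)]
val, tan = forward_sweep(tape, [0.5], [1.0])
```

Values and tangents are computed in the same left-to-right sweep; nothing needs to be stored for a later pass.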

Seed Directions and Passes

To compute $\frac{\partial f}{\partial x_i}$, we "seed" the inputs:

$$\dot{x}_j = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$$

Then we propagate the tape. The final tangent $\dot{f}$ equals $\frac{\partial f}{\partial x_i}$.

One pass = one partial derivative. Each choice of seed gives us one component of the gradient. For the full gradient of $f: \mathbb{R}^n \to \mathbb{R}$, we need $n$ passes -- one with each standard basis vector as the seed.

More generally, seeding with an arbitrary vector $\dot{x} = v$ computes the directional derivative (Jacobian-vector product, or JVP):

$$\dot{f} = \nabla f(x)^T v = J_f \cdot v$$
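For a vector-valued $f$, a single forward pass with seed $v$ yields the tangent of every output at once, i.e. the full product $J_f v$. A hand-written sketch for $f(x_1, x_2) = (x_1 x_2, \sin(x_1))$ (a function chosen purely for illustration):

```python
import math

def f_jvp(x, v):
    """One forward pass: returns f(x) and the JVP J_f(x) @ v."""
    x1, x2 = x
    v1, v2 = v
    y1, t1 = x1 * x2, v1 * x2 + x1 * v2       # product rule
    y2, t2 = math.sin(x1), v1 * math.cos(x1)  # sin rule
    return (y1, y2), (t1, t2)

# Directional derivative along v = (1, 2) at x = (0.5, 4)
_, (t1, t2) = f_jvp((0.5, 4.0), (1.0, 2.0))
# t1 = 1*4 + 0.5*2 = 5.0,  t2 = cos(0.5)
```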
Interactive: Forward Mode Trace

Step through the tape. At each step, see how the value and tangent are computed from their inputs. Change the seed direction to compute different partial derivatives.

Computing the Full Gradient: Multiple Passes

Let's see all passes needed for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ at $x_1 = \pi/4, x_2 = 3$.

Interactive: All Passes for the Full Gradient
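In code, the two passes look like this (a sketch: each call re-runs the same tape with a different seed, and only the tangents change):

```python
import math

def f_with_tangent(x1, x2, d1, d2):
    # v3 = x1 * x2
    v3, t3 = x1 * x2, d1 * x2 + x1 * d2
    # v4 = sin(x1)
    v4, t4 = math.sin(x1), d1 * math.cos(x1)
    # f = v3 + v4
    return v3 + v4, t3 + t4

x1, x2 = math.pi / 4, 3.0
df_dx1 = f_with_tangent(x1, x2, 1.0, 0.0)[1]  # seed e1
df_dx2 = f_with_tangent(x1, x2, 0.0, 1.0)[1]  # seed e2
# df_dx1 = x2 + cos(x1),  df_dx2 = x1
```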

More Examples

Try tracing forward mode on these functions. The widget above supports all of them -- select from the dropdown and step through.

Example: $f(x_1, x_2) = \exp(x_1^2 + x_2)$

v1 = x1          (input)
v2 = x2          (input)
v3 = v1^2        (square)
v4 = v3 + v2     (add)
v5 = exp(v4)     (exp)       ← output f

For seed $\dot{x} = (1, 0)$: the tangent at $v_3$ is $2x_1 \cdot 1 = 2x_1$, at $v_4$ is $2x_1 + 0$, and at $v_5$ is $2x_1 \exp(x_1^2 + x_2)$. This is $\frac{\partial f}{\partial x_1}$.

Example: $f(x_1, x_2, x_3) = x_1 x_2 x_3$

v1 = x1          (input)
v2 = x2          (input)
v3 = x3          (input)
v4 = v1 * v2     (multiply)
v5 = v4 * v3     (multiply)  ← output f

This needs 3 passes for the full gradient. Each pass seeds one input with 1. Try it in the widget!
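The same pattern in code, looping over the standard-basis seeds (a sketch mirroring the tape above, at the illustrative point $x = (2, 3, 5)$):

```python
def f_with_tangent(x, d):
    # v4 = v1 * v2
    v4, t4 = x[0] * x[1], d[0] * x[1] + x[0] * d[1]
    # v5 = v4 * v3
    return v4 * x[2], t4 * x[2] + v4 * d[2]

x = [2.0, 3.0, 5.0]
grad = []
for i in range(3):  # one pass per input
    seed = [1.0 if j == i else 0.0 for j in range(3)]
    grad.append(f_with_tangent(x, seed)[1])
# grad = [x2*x3, x1*x3, x1*x2] = [15.0, 10.0, 6.0]
```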

Example: $f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)$

v1 = x1          (input)
v2 = x2          (input)
v3 = ln(v1)      (log)
v4 = v1 * v2     (multiply)
v5 = sin(v2)     (sin)
v6 = v3 + v4     (add)
v7 = v6 - v5     (subtract)  ← output f

With seed $(1, 0)$: $\dot{v}_3 = 1/x_1$, $\dot{v}_4 = x_2$, $\dot{v}_5 = 0$, $\dot{v}_6 = 1/x_1 + x_2$, $\dot{v}_7 = 1/x_1 + x_2$. This is $\frac{\partial f}{\partial x_1} = \frac{1}{x_1} + x_2$. ✔

When Forward Mode Wins

Forward mode is often overlooked in favor of reverse mode (backprop), but it has real advantages:

1. $f: \mathbb{R} \to \mathbb{R}^m$ (one input, many outputs): Forward mode computes the full Jacobian in one pass. Reverse mode would need $m$ passes.
2. No tape storage: Forward mode computes values and tangents together, without storing the computation graph. For memory-limited settings, this matters.
3. Simpler to implement: Just overload arithmetic operators on dual numbers. No graph construction, no backward pass logic.

In practice, modern AD systems (JAX, PyTorch) support both modes. jax.jvp is forward mode; jax.grad is reverse mode.
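For instance, both modes applied to the running example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (assuming JAX is installed):

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] * x[1] + jnp.sin(x[0])

x = jnp.array([jnp.pi / 4, 3.0])

# Forward mode: one JVP per seed direction
_, d1 = jax.jvp(f, (x,), (jnp.array([1.0, 0.0]),))

# Reverse mode: full gradient in one backward pass
g = jax.grad(f)(x)
```

Here `d1` matches `g[0]`: one forward pass per partial derivative versus the whole gradient from a single reverse pass.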

Summary

Forward mode AD propagates tangents (dual parts) forward through the tape.

Seed with $e_i$ to get $\frac{\partial f}{\partial x_i}$. Need $n$ passes for full gradient.

Each step computes value + tangent using the chain rule on the elementary operation.

Exact to machine precision -- no finite-difference approximation.
