Forward Mode AD

Dual numbers, tangents, and passes: computing derivatives by propagating them forward alongside the values.

Dual Numbers

The key idea: extend every number to carry along its derivative. Define dual numbers:

$$a + b\varepsilon \qquad \text{where } \varepsilon^2 = 0$$

This $\varepsilon$ is not "small" -- it's an algebraic object where $\varepsilon^2$ is exactly zero. Arithmetic follows naturally:

$(a + b\varepsilon) + (c + d\varepsilon) = (a+c) + (b+d)\varepsilon$
$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon$

Now watch what happens when we evaluate a function at a dual number. We don't need to do anything special -- we just need to define what elementary operations do on dual numbers. The arithmetic rules above already handle polynomials. For example, squaring:

$$(a + b\varepsilon)^2 = a^2 + 2ab\varepsilon + b^2\varepsilon^2 = a^2 + 2ab\varepsilon$$

The $\varepsilon^2$ term vanishes! The dual part $2ab$ is exactly $\frac{d}{da}(a^2) \cdot b$. This isn't a coincidence. Consider any smooth function $f$ and its Taylor expansion:

$$f(a + b\varepsilon) = f(a) + f'(a)(b\varepsilon) + \frac{1}{2}f''(a)(b\varepsilon)^2 + \cdots$$

But $(b\varepsilon)^2 = b^2\varepsilon^2 = 0$, and all higher powers of $\varepsilon$ are also zero. So the Taylor series truncates exactly:

$$f(a + b\varepsilon) = f(a) + f'(a) \cdot b\varepsilon$$

Setting $b = 1$: the real part gives $f(a)$ and the dual part gives $f'(a)$. Exact, no approximation.

Key distinction from finite differences: In FD, we compute $(f(a+h) - f(a))/h$ and suffer from both truncation and cancellation errors. With dual numbers, the derivative appears directly in the $\varepsilon$ coefficient -- no subtraction, no division, no cancellation. The answer is exact to machine precision.
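To see the cancellation problem concretely, here is a small sketch (plain Python, illustrative values) comparing a forward difference for $\sin$ at $x = 1$ against the exact derivative $\cos(1)$:

```python
import math

a = 1.0
exact = math.cos(a)  # true derivative of sin at a

for h in (1e-4, 1e-8, 1e-12):
    fd = (math.sin(a + h) - math.sin(a)) / h
    print(f"h = {h:g}   |error| = {abs(fd - exact):.1e}")
```

Shrinking $h$ first reduces truncation error, but below some point cancellation in the subtraction takes over and the error grows again; no choice of $h$ reaches machine precision.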
Every elementary function gets a dual-number rule:
$\sin(a + b\varepsilon) = \sin(a) + b\cos(a)\varepsilon$
$\exp(a + b\varepsilon) = \exp(a) + b\exp(a)\varepsilon$
$\ln(a + b\varepsilon) = \ln(a) + \frac{b}{a}\varepsilon$
$\frac{1}{a + b\varepsilon} = \frac{1}{a} - \frac{b}{a^2}\varepsilon$
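These rules are enough for a toy implementation via operator overloading. A minimal Python sketch (the names `Dual`, `dsin`, `dlog` are illustrative, not from any real library):

```python
import math

class Dual:
    """Dual number a + b*eps, with eps^2 = 0."""
    def __init__(self, real, dual=0.0):
        self.real, self.dual = real, dual

    def __add__(self, other):
        # (a + b eps) + (c + d eps) = (a+c) + (b+d) eps
        return Dual(self.real + other.real, self.dual + other.dual)

    def __mul__(self, other):
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

def dsin(x):  # sin(a + b eps) = sin(a) + b cos(a) eps
    return Dual(math.sin(x.real), x.dual * math.cos(x.real))

def dlog(x):  # ln(a + b eps) = ln(a) + (b/a) eps
    return Dual(math.log(x.real), x.dual / x.real)

# f(x) = x^2: seed the dual part with 1 to get f'(3) = 6 exactly
x = Dual(3.0, 1.0)
y = x * x
print(y.real, y.dual)  # value 9.0, derivative 6.0
```

Seeding the input's dual part with 1 and reading off the output's dual part is the entire mechanism.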
Interactive: Dual Number Calculator

Enter dual numbers and pick an operation. The dual part of the result is the derivative!

The Tape and Tangent Propagation

For a multi-variable function, we decompose the computation into a tape of elementary operations and propagate dual numbers through each step. At each step, we compute both the value and the tangent (the dual part, written $\dot{v}$).

The tangent propagation rule at each step is just the chain rule. If $v_k = g(v_i, v_j)$, then:

$$\dot{v}_k = \frac{\partial g}{\partial v_i} \dot{v}_i + \frac{\partial g}{\partial v_j} \dot{v}_j$$

For specific operations:

| Operation | Value | Tangent |
|---|---|---|
| $v_k = v_i + v_j$ | $v_i + v_j$ | $\dot{v}_i + \dot{v}_j$ |
| $v_k = v_i \times v_j$ | $v_i \cdot v_j$ | $\dot{v}_i v_j + v_i \dot{v}_j$ |
| $v_k = \sin(v_i)$ | $\sin(v_i)$ | $\dot{v}_i \cos(v_i)$ |
| $v_k = \exp(v_i)$ | $\exp(v_i)$ | $\dot{v}_i \exp(v_i)$ |
| $v_k = \ln(v_i)$ | $\ln(v_i)$ | $\dot{v}_i / v_i$ |
| $v_k = v_i^2$ | $v_i^2$ | $2 v_i \dot{v}_i$ |

These are exactly the dual-number rules in table form. The tangent $\dot{v}_k$ tells you: "if the inputs change by $\dot{v}_i$, how does this intermediate value change?"
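The table translates directly into a small tape interpreter. A sketch (the tape encoding here is invented for illustration; each entry names an operation and its input indices):

```python
import math

def forward_sweep(tape, x, xdot):
    """Propagate (value, tangent) pairs through a tape of elementary ops."""
    v, vd = [], []
    for op, args in tape:
        if op == "input":                 # read value and seed
            v.append(x[args]); vd.append(xdot[args])
        elif op == "add":
            i, j = args
            v.append(v[i] + v[j]); vd.append(vd[i] + vd[j])
        elif op == "mul":                 # product rule
            i, j = args
            v.append(v[i] * v[j]); vd.append(vd[i] * v[j] + v[i] * vd[j])
        elif op == "square":
            i = args
            v.append(v[i] ** 2); vd.append(2 * v[i] * vd[i])
        elif op == "exp":
            i = args
            v.append(math.exp(v[i])); vd.append(vd[i] * math.exp(v[i]))
    return v[-1], vd[-1]

# f(x) = exp(x^2): the tangent should be 2x * exp(x^2)
tape = [("input", 0), ("square", 0), ("exp", 1)]
val, tan = forward_sweep(tape, [0.5], [1.0])
```

Values and tangents are computed in the same left-to-right sweep; nothing needs to be stored for a later pass.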

Seed Directions and Passes

To compute $\frac{\partial f}{\partial x_i}$, we "seed" the inputs:

$$\dot{x}_j = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$$

Then we propagate the tape. The final tangent $\dot{f}$ equals $\frac{\partial f}{\partial x_i}$.

One pass = one partial derivative. Each choice of seed gives us one component of the gradient. For the full gradient of $f: \mathbb{R}^n \to \mathbb{R}$, we need $n$ passes -- one with each standard basis vector as the seed.

More generally, seeding with an arbitrary vector $\dot{x} = v$ computes the directional derivative (Jacobian-vector product, or JVP):

$$\dot{f} = \nabla f(x)^T v = J_f \cdot v$$
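For a vector-valued $f$, a single forward pass with seed $v$ yields the tangent of every output at once, i.e. the full product $J_f v$. A hand-written sketch for $f(x_1, x_2) = (x_1 x_2, \sin(x_1))$ (a function chosen purely for illustration):

```python
import math

def f_jvp(x, v):
    """One forward pass: returns f(x) and the JVP J_f(x) @ v."""
    x1, x2 = x
    v1, v2 = v
    y1, t1 = x1 * x2, v1 * x2 + x1 * v2       # product rule
    y2, t2 = math.sin(x1), v1 * math.cos(x1)  # sin rule
    return (y1, y2), (t1, t2)

# Directional derivative along v = (1, 2) at x = (0.5, 4)
_, (t1, t2) = f_jvp((0.5, 4.0), (1.0, 2.0))
# t1 = 1*4 + 0.5*2 = 5.0,  t2 = cos(0.5)
```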
Interactive: Forward Mode Trace

Step through the tape. At each step, see how the value and tangent are computed from their inputs. Change the seed direction to compute different partial derivatives.

Computing the Full Gradient: Multiple Passes

Let's see all passes needed for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ at $x_1 = \pi/4, x_2 = 3$.

Interactive: All Passes for the Full Gradient
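In code, the two passes look like this (a sketch: each call re-runs the same tape with a different seed, and only the tangents change):

```python
import math

def f_with_tangent(x1, x2, d1, d2):
    # v3 = x1 * x2
    v3, t3 = x1 * x2, d1 * x2 + x1 * d2
    # v4 = sin(x1)
    v4, t4 = math.sin(x1), d1 * math.cos(x1)
    # f = v3 + v4
    return v3 + v4, t3 + t4

x1, x2 = math.pi / 4, 3.0
df_dx1 = f_with_tangent(x1, x2, 1.0, 0.0)[1]  # seed e1
df_dx2 = f_with_tangent(x1, x2, 0.0, 1.0)[1]  # seed e2
# df_dx1 = x2 + cos(x1),  df_dx2 = x1
```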

More Examples

Try tracing forward mode on these functions. The widget above supports all of them -- select from the dropdown and step through.

Example: $f(x_1, x_2) = \exp(x_1^2 + x_2)$

v1 = x1          (input)
v2 = x2          (input)
v3 = v1^2        (square)
v4 = v3 + v2     (add)
v5 = exp(v4)     (exp)       ← output f

For seed $\dot{x} = (1, 0)$: the tangent at $v_3$ is $2x_1 \cdot 1 = 2x_1$, at $v_4$ is $2x_1 + 0$, and at $v_5$ is $2x_1 \exp(x_1^2 + x_2)$. This is $\frac{\partial f}{\partial x_1}$.

Example: $f(x_1, x_2, x_3) = x_1 x_2 x_3$

v1 = x1          (input)
v2 = x2          (input)
v3 = x3          (input)
v4 = v1 * v2     (multiply)
v5 = v4 * v3     (multiply)  ← output f

This needs 3 passes for the full gradient. Each pass seeds one input with 1. Try it in the widget!
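The same pattern in code, looping over the standard-basis seeds (a sketch mirroring the tape above, at the illustrative point $x = (2, 3, 5)$):

```python
def f_with_tangent(x, d):
    # v4 = v1 * v2
    v4, t4 = x[0] * x[1], d[0] * x[1] + x[0] * d[1]
    # v5 = v4 * v3
    return v4 * x[2], t4 * x[2] + v4 * d[2]

x = [2.0, 3.0, 5.0]
grad = []
for i in range(3):  # one pass per input
    seed = [1.0 if j == i else 0.0 for j in range(3)]
    grad.append(f_with_tangent(x, seed)[1])
# grad = [x2*x3, x1*x3, x1*x2] = [15.0, 10.0, 6.0]
```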

Example: $f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)$

v1 = x1          (input)
v2 = x2          (input)
v3 = ln(v1)      (log)
v4 = v1 * v2     (multiply)
v5 = sin(v2)     (sin)
v6 = v3 + v4     (add)
v7 = v6 - v5     (subtract)  ← output f

With seed $(1, 0)$: $\dot{v}_3 = 1/x_1$, $\dot{v}_4 = x_2$, $\dot{v}_5 = 0$, $\dot{v}_6 = 1/x_1 + x_2$, $\dot{v}_7 = 1/x_1 + x_2$. This is $\frac{\partial f}{\partial x_1} = \frac{1}{x_1} + x_2$. ✔

When Forward Mode Wins

Forward mode is often overlooked in favor of reverse mode (backprop), but it has real advantages:

1. $f: \mathbb{R} \to \mathbb{R}^m$ (one input, many outputs): Forward mode computes the full Jacobian in one pass. Reverse mode would need $m$ passes.
2. No tape storage: Forward mode computes values and tangents together, without storing the computation graph. For memory-limited settings, this matters.
3. Simpler to implement: Just overload arithmetic operators on dual numbers. No graph construction, no backward pass logic.

In practice, modern AD systems (JAX, PyTorch) support both modes. jax.jvp is forward mode; jax.grad is reverse mode.
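For instance, both modes applied to the running example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (assuming JAX is installed):

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] * x[1] + jnp.sin(x[0])

x = jnp.array([jnp.pi / 4, 3.0])

# Forward mode: one JVP per seed direction
_, d1 = jax.jvp(f, (x,), (jnp.array([1.0, 0.0]),))

# Reverse mode: full gradient in one backward pass
g = jax.grad(f)(x)
```

Here `d1` matches `g[0]`: one forward pass per partial derivative versus the whole gradient from a single reverse pass.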

Summary

Forward mode AD propagates tangents (dual parts) forward through the tape.

Seed with $e_i$ to get $\frac{\partial f}{\partial x_i}$. Need $n$ passes for full gradient.

Each step computes value + tangent using the chain rule on the elementary operation.

Exact to machine precision -- no finite-difference approximation.
