Checking Your Derivatives
Matrix calculus is detail-heavy and error-prone. The most important skill is knowing how to verify your answers.
The Problem
Throughout optimization, machine learning, and scientific computing, we need derivatives of functions $f: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla f(x) \in \mathbb{R}^n$ is a column vector:
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
Computing these by hand is tedious and shockingly easy to get wrong -- a dropped factor of 2, a transposition error, a misapplied chain rule. Even experts make mistakes.
The golden rule: Never trust a hand-computed derivative until you have checked it numerically.
Our conventions: $x \in \mathbb{R}^n$ is a column vector. $f(x)$ is scalar-valued. The gradient $\nabla f(x)$ is a column vector. For matrix arguments $f(X)$, the derivative $\frac{\partial f}{\partial X}$ has the same shape as $X$.
The Finite Difference Idea
From the definition of the derivative, we know:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
So for any small (but nonzero) $h$, the ratio $\frac{f(x+h) - f(x)}{h}$ should be a reasonable approximation. By Taylor expansion:
$$f(x+h) = f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 + \cdots$$
Rearranging:
$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{1}{2}f''(x) h + \cdots = f'(x) + O(h)$$
This is a first-order approximation: the error is proportional to $h$. Make $h$ smaller, get a better approximation... right?
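A quick sketch of this first-order behavior (a minimal NumPy example; the helper name `forward_diff` and the test point are just illustrative choices):

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided finite difference: (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# f(x) = exp(x) is its own derivative, so the exact answer at x = 1 is e.
x = 1.0
exact = np.exp(x)

# First-order behavior: shrinking h by 10x shrinks the error by ~10x.
for h in [1e-2, 1e-3, 1e-4]:
    err = abs(forward_diff(np.exp, x, h) - exact)
    print(f"h = {h:.0e}   error = {err:.2e}")
```

Note how the error tracks $\frac{1}{2}|f''(x)|\,h$ almost exactly at these step sizes, as the Taylor expansion predicts.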
The Error "V" Shape
If smaller $h$ always meant better accuracy, we'd just pick $h = 10^{-15}$ and be done. But that's not what happens.
There are two competing errors:
Truncation error (from Taylor series): $E_{\text{trunc}} \approx \frac{1}{2}|f''(x)| \cdot h$. Decreases as $h \to 0$.
Rounding error (from floating-point cancellation): $E_{\text{round}} \approx \frac{2\varepsilon_{\text{mach}}|f(x)|}{h}$. Increases as $h \to 0$.
When $f(x+h)$ and $f(x)$ are nearly equal, their difference loses significant digits. The total error is:
$$E_{\text{total}} \approx \frac{M h}{2} + \frac{2\varepsilon}{h}$$
where $M = |f''(x)|$ and $\varepsilon \approx 10^{-16}$ (double precision). Setting $\frac{dE}{dh} = \frac{M}{2} - \frac{2\varepsilon}{h^2} = 0$ and solving for $h$:
$$h^* = 2\sqrt{\varepsilon / M} \approx \sqrt{\varepsilon} \approx 10^{-8}$$
On a log-log plot, this creates a characteristic "V" shape. Try it below!
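You can also reproduce the V numerically with a few lines of NumPy (a sketch; the choice of $f = \sin$ at $x = 1$ is illustrative):

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided finite difference: (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# Sweep h over 15 decades for f = sin at x = 1 (exact derivative: cos(1)).
x, exact = 1.0, np.cos(1.0)
hs = 10.0 ** np.arange(-1, -16, -1)
errs = np.array([abs(forward_diff(np.sin, x, h) - exact) for h in hs])

for h, e in zip(hs, errs):
    print(f"h = {h:.0e}   error = {e:.2e}")

# The error is smallest near h ~ 1e-8 = sqrt(eps): the bottom of the "V".
# Truncation dominates to the left of it, cancellation to the right.
best = hs[np.argmin(errs)]
```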
Centered Differences: Doing Better
Instead of a one-sided difference, use a centered (or symmetric) difference:
$$\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \frac{1}{6}f'''(x) h^2 + \cdots = f'(x) + O(h^2)$$
This is second-order accurate -- the error decreases as $h^2$, because the symmetric difference cancels the $f''$ term in the Taylor expansion. Balancing truncation against rounding as before, now with $M = |f'''(x)|$, the optimal step size becomes:
$$h^* \approx \left(\frac{3\varepsilon}{M}\right)^{1/3} \approx \varepsilon^{1/3} \approx 10^{-5.3}$$
Look at the green line in the plot above -- it drops with slope $-2$ (instead of $-1$) on the left side of the V, and its minimum error is much smaller.
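A short side-by-side comparison makes the payoff concrete (illustrative NumPy sketch, same test function as before):

```python
import numpy as np

def centered_diff(f, x, h):
    """Symmetric difference: (f(x+h) - f(x-h)) / (2h), O(h^2) accurate."""
    return (f(x + h) - f(x - h)) / (2 * h)

x, exact = 1.0, np.cos(1.0)

# At the same h the centered estimate is far more accurate, and shrinking
# h by 10x cuts its error by ~100x (second order) instead of ~10x.
for h in [1e-2, 1e-3, 1e-4]:
    fwd = abs((np.sin(x + h) - np.sin(x)) / h - exact)
    ctr = abs(centered_diff(np.sin, x, h) - exact)
    print(f"h = {h:.0e}   forward = {fwd:.2e}   centered = {ctr:.2e}")
```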
Checking Gradients in Practice
For $f: \mathbb{R}^n \to \mathbb{R}$, we approximate each component of the gradient:
$$[\nabla f(x)]_i \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}$$
where $e_i$ is the $i$-th standard basis vector. This requires $2n$ function evaluations (or $n+1$ for forward differences).
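One way this might look in code (a sketch; `numerical_gradient` is an assumed helper name, not a library function):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar-valued f at x, one coordinate
    at a time, with centered differences (2n evaluations of f)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                         # perturb only the i-th coordinate
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Example: f(x) = ||x||^2 has gradient 2x.
x = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(lambda v: np.dot(v, v), x)
print(g)  # close to [2, -4, 6]
```

For a quadratic like this the centered difference is exact up to rounding, since the $f'''$ term in the truncation error vanishes.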
To compare the analytic gradient $g$ with the finite-difference approximation $\tilde{g}$, use:
$$\|g - \tilde{g}\|_\infty = \max_i |g_i - \tilde{g}_i| \qquad \text{(absolute error)}$$
$$\frac{\|g - \tilde{g}\|_\infty}{\|g\|_\infty} \qquad \text{(relative error)}$$
Rules of thumb (centered differences, $h \approx 10^{-5}$):
Relative error $< 10^{-8}$: almost certainly correct.
Relative error $\sim 10^{-5}$: suspicious but possibly OK for ill-conditioned problems.
Relative error $> 10^{-3}$: you have a bug.
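Putting the pieces together, a gradient checker along these lines (the helper name is illustrative; the thresholds follow the rules of thumb above):

```python
import numpy as np

def check_gradient(f, analytic_grad, x, h=1e-5):
    """Compare an analytic gradient against centered differences and
    return the relative error in the max norm."""
    x = np.asarray(x, dtype=float)
    g = analytic_grad(x)
    g_num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g_num[i] = (f(x + e) - f(x - e)) / (2 * h)
    return np.max(np.abs(g - g_num)) / np.max(np.abs(g))

# Example: f(x) = sum(exp(x)) has gradient exp(x). The correct gradient
# passes; a buggy one (dropped factor of 2) fails loudly.
x = np.array([0.5, -1.0, 2.0])
f = lambda v: np.sum(np.exp(v))
good = check_gradient(f, lambda v: np.exp(v), x)        # well below 1e-8
bad = check_gradient(f, lambda v: 0.5 * np.exp(v), x)   # ~1: obvious bug
print(f"correct gradient: {good:.1e}")
print(f"buggy gradient:   {bad:.1e}")
```

In practice you would call a check like this once, at a random test point, before trusting a hand-derived gradient in an optimizer.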
Key Takeaways
1. Always check derivatives with finite differences -- it costs almost nothing and catches almost every bug.
2. Use centered differences with $h \approx 10^{-5}$ for best accuracy.
3. Watch for the V-shape: too large $h$ gives truncation error, too small $h$ gives cancellation error.
4. Relative error $< 10^{-8}$ means your analytic gradient is almost certainly correct.
Next: Matrix Calculus Rules →