Checking Your Derivatives
Matrix calculus is detail-heavy and error-prone. The most important skill is knowing how to verify your answers.
The Problem
Throughout optimization, machine learning, and scientific computing, we need derivatives of functions $f: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla f(x) \in \mathbb{R}^n$ is a column vector:
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
Computing these by hand is tedious and shockingly easy to get wrong -- a dropped factor of 2, a transposition error, a misapplied chain rule. Even experts make mistakes.
The golden rule: Never trust a hand-computed derivative until you have checked it numerically.
Our conventions: $x \in \mathbb{R}^n$ is a column vector. $f(x)$ is scalar-valued. The gradient $\nabla f(x)$ is a column vector. For matrix arguments $f(X)$, the derivative $\frac{\partial f}{\partial X}$ has the same shape as $X$.
The Finite Difference Idea
From the definition of the derivative, we know:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
So for any small (but nonzero) $h$, the ratio $\frac{f(x+h) - f(x)}{h}$ should be a reasonable approximation. By Taylor expansion:
$$f(x+h) = f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 + \cdots$$
Rearranging:
$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{1}{2}f''(x) h + \cdots = f'(x) + O(h)$$
This is a first-order approximation: the error is proportional to $h$. Make $h$ smaller, get a better approximation... right?
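A quick sketch of this first-order behavior (a minimal NumPy example; the helper name `forward_diff` and the test point are just illustrative choices):

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided finite difference: (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# f(x) = exp(x) is its own derivative, so the exact answer at x = 1 is e.
x = 1.0
exact = np.exp(x)

# First-order behavior: shrinking h by 10x shrinks the error by ~10x.
for h in [1e-2, 1e-3, 1e-4]:
    err = abs(forward_diff(np.exp, x, h) - exact)
    print(f"h = {h:.0e}   error = {err:.2e}")
```

Note how the error tracks $\frac{1}{2}|f''(x)|\,h$ almost exactly at these step sizes, as the Taylor expansion predicts.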
The Error "V" Shape
If smaller $h$ always meant better accuracy, we'd just pick $h = 10^{-15}$ and be done. But that's not what happens.
There are two competing errors:
Truncation error (from Taylor series): $E_{\text{trunc}} \approx \frac{1}{2}|f''(x)| \cdot h$. Decreases as $h \to 0$.
Rounding error (from floating-point cancellation): $E_{\text{round}} \approx \frac{2\varepsilon_{\text{mach}}|f(x)|}{h}$. Increases as $h \to 0$.
When $f(x+h)$ and $f(x)$ are nearly equal, their difference loses significant digits. The total error is:
$$E_{\text{total}} \approx \frac{M h}{2} + \frac{2\varepsilon}{h}$$
where $M = |f''(x)|$ and $\varepsilon \approx 10^{-16}$ (double precision). Setting $\frac{dE}{dh} = \frac{M}{2} - \frac{2\varepsilon}{h^2} = 0$ and solving for $h$:
$$h^* = 2\sqrt{\varepsilon / M} \approx \sqrt{\varepsilon} \approx 10^{-8}$$
On a log-log plot, this creates a characteristic "V" shape. Try it below!
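You can also reproduce the V numerically with a few lines of NumPy (a sketch; the choice of $f = \sin$ at $x = 1$ is illustrative):

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided finite difference: (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# Sweep h over 15 decades for f = sin at x = 1 (exact derivative: cos(1)).
x, exact = 1.0, np.cos(1.0)
hs = 10.0 ** np.arange(-1, -16, -1)
errs = np.array([abs(forward_diff(np.sin, x, h) - exact) for h in hs])

for h, e in zip(hs, errs):
    print(f"h = {h:.0e}   error = {e:.2e}")

# The error is smallest near h ~ 1e-8 = sqrt(eps): the bottom of the "V".
# Truncation dominates to the left of it, cancellation to the right.
best = hs[np.argmin(errs)]
```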
Centered Differences: Doing Better
Instead of a one-sided difference, use a centered (or symmetric) difference:
$$\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \frac{1}{6}f'''(x) h^2 + \cdots = f'(x) + O(h^2)$$
This is second-order accurate -- the error decreases as $h^2$, because the symmetric difference cancels the $f''$ term in the Taylor expansion. Balancing truncation against rounding as before, now with $M = |f'''(x)|$, the optimal step size becomes:
$$h^* \approx \left(\frac{3\varepsilon}{M}\right)^{1/3} \approx \varepsilon^{1/3} \approx 10^{-5.3}$$
Look at the green line in the plot above -- it drops with slope $-2$ (instead of $-1$) on the left side of the V, and its minimum error is much smaller.
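A short side-by-side comparison makes the payoff concrete (illustrative NumPy sketch, same test function as before):

```python
import numpy as np

def centered_diff(f, x, h):
    """Symmetric difference: (f(x+h) - f(x-h)) / (2h), O(h^2) accurate."""
    return (f(x + h) - f(x - h)) / (2 * h)

x, exact = 1.0, np.cos(1.0)

# At the same h the centered estimate is far more accurate, and shrinking
# h by 10x cuts its error by ~100x (second order) instead of ~10x.
for h in [1e-2, 1e-3, 1e-4]:
    fwd = abs((np.sin(x + h) - np.sin(x)) / h - exact)
    ctr = abs(centered_diff(np.sin, x, h) - exact)
    print(f"h = {h:.0e}   forward = {fwd:.2e}   centered = {ctr:.2e}")
```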
Checking Gradients in Practice
For $f: \mathbb{R}^n \to \mathbb{R}$, we approximate each component of the gradient:
$$[\nabla f(x)]_i \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}$$
where $e_i$ is the $i$-th standard basis vector. This requires $2n$ function evaluations (or $n+1$ for forward differences).
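One way this might look in code (a sketch; `numerical_gradient` is an assumed helper name, not a library function):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar-valued f at x, one coordinate
    at a time, with centered differences (2n evaluations of f)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                         # perturb only the i-th coordinate
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Example: f(x) = ||x||^2 has gradient 2x.
x = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(lambda v: np.dot(v, v), x)
print(g)  # close to [2, -4, 6]
```

For a quadratic like this the centered difference is exact up to rounding, since the $f'''$ term in the truncation error vanishes.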
To compare the analytic gradient $g$ with the finite-difference approximation $\tilde{g}$, use:
$$\|g - \tilde{g}\|_\infty = \max_i |g_i - \tilde{g}_i| \qquad \text{(absolute error)}$$
$$\frac{\|g - \tilde{g}\|_\infty}{\|g\|_\infty} \qquad \text{(relative error)}$$
Rules of thumb (centered differences, $h \approx 10^{-5}$):
Relative error $< 10^{-8}$: almost certainly correct.
Relative error $\sim 10^{-5}$: suspicious but possibly OK for ill-conditioned problems.
Relative error $> 10^{-3}$: you have a bug.
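Putting the pieces together, a gradient checker along these lines (the helper name is illustrative; the thresholds follow the rules of thumb above):

```python
import numpy as np

def check_gradient(f, analytic_grad, x, h=1e-5):
    """Compare an analytic gradient against centered differences and
    return the relative error in the max norm."""
    x = np.asarray(x, dtype=float)
    g = analytic_grad(x)
    g_num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g_num[i] = (f(x + e) - f(x - e)) / (2 * h)
    return np.max(np.abs(g - g_num)) / np.max(np.abs(g))

# Example: f(x) = sum(exp(x)) has gradient exp(x). The correct gradient
# passes; a buggy one (dropped factor of 2) fails loudly.
x = np.array([0.5, -1.0, 2.0])
f = lambda v: np.sum(np.exp(v))
good = check_gradient(f, lambda v: np.exp(v), x)        # well below 1e-8
bad = check_gradient(f, lambda v: 0.5 * np.exp(v), x)   # ~1: obvious bug
print(f"correct gradient: {good:.1e}")
print(f"buggy gradient:   {bad:.1e}")
```

In practice you would call a check like this once, at a random test point, before trusting a hand-derived gradient in an optimizer.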
Key Takeaways
1. Always check derivatives with finite differences -- it costs almost nothing and catches almost every bug.
2. Use centered differences with $h \approx 10^{-5}$ for best accuracy.
3. Watch for the V-shape: too large $h$ gives truncation error, too small $h$ gives cancellation error.
4. Relative error $< 10^{-8}$ means your analytic gradient is almost certainly correct.
Next: Matrix Calculus Rules →