Solution 1: Approximate the gradient using function evaluations alone.
How would you do optimization without derivatives? The first and simplest idea: approximate them using finite differences.
Recall the definition of the derivative:
$$f'(x) = \lim_{\gamma \to 0} \frac{f(x + \gamma) - f(x)}{\gamma}$$For a small but finite $\gamma$, we get a forward difference approximation:
$$f'(x) \approx \frac{1}{\gamma}\big(f(x + \gamma) - f(x)\big)$$Centered differences are more accurate (second-order vs first-order), but require twice as many function evaluations.
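As a quick sketch, here are both one-dimensional rules in plain Python (the test function $\sin$ and the step sizes are illustrative choices, not prescribed by the analysis below):

```python
import math

def forward_diff(f, x, gamma=1e-8):
    """Forward difference: O(gamma) truncation error, one extra evaluation."""
    return (f(x + gamma) - f(x)) / gamma

def centered_diff(f, x, gamma=1e-5):
    """Centered difference: O(gamma^2) truncation error, two evaluations."""
    return (f(x + gamma) - f(x - gamma)) / (2 * gamma)

# Example: f(x) = sin(x), whose true derivative is cos(x).
x = 1.0
err_fwd = abs(forward_diff(math.sin, x) - math.cos(x))   # roughly 1e-8
err_cen = abs(centered_diff(math.sin, x) - math.cos(x))  # a few orders smaller
```

Note the centered rule uses a larger step than the forward rule; the reason falls out of the error analysis later in this section.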
For $f : \mathbb{R}^n \to \mathbb{R}$, we approximate each partial derivative by perturbing one coordinate at a time:
$$\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + \gamma \mathbf{e}_i) - f(\mathbf{x})}{\gamma}$$where $\mathbf{e}_i$ is the $i$-th standard basis vector. This requires $n + 1$ function evaluations per gradient: one for the baseline $f(\mathbf{x})$, plus one per coordinate.
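A minimal sketch of the full-gradient version, evaluating $f$ once at $\mathbf{x}$ and once per perturbed coordinate ($n+1$ evaluations total); the quadratic test function is an illustrative choice:

```python
def fd_gradient(f, x, gamma=1e-8):
    """Approximate the gradient of f: R^n -> R with forward differences.

    Costs n + 1 evaluations of f: one baseline plus one per coordinate.
    """
    fx = f(x)                  # baseline value, reused for every coordinate
    grad = [0.0] * len(x)
    for i in range(len(x)):
        xp = list(x)
        xp[i] += gamma         # perturb only the i-th coordinate
        grad[i] = (f(xp) - fx) / gamma
    return grad

# Example: f(x) = x0^2 + 3*x1 has gradient (2*x0, 3).
g = fd_gradient(lambda x: x[0]**2 + 3*x[1], [2.0, 5.0])  # close to [4.0, 3.0]
```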
More generally, the directional derivative of $f$ at $\mathbf{x}$ in direction $\mathbf{d}$ is:
$$\nabla_\mathbf{d} f(\mathbf{x}) = \lim_{\gamma \to 0} \frac{f(\mathbf{x} + \gamma \mathbf{d}) - f(\mathbf{x})}{\gamma} = \nabla f(\mathbf{x})^T \mathbf{d}$$This tells us the rate of change of $f$ as we move along the line $\mathbf{x} + \gamma\mathbf{d}$. The finite difference approximation is:
$$\nabla_\mathbf{d} f(\mathbf{x}) \approx \frac{f(\mathbf{x} + \gamma\mathbf{d}) - f(\mathbf{x})}{\gamma}$$When $\mathbf{d} = \mathbf{e}_i$ (a standard basis vector), this recovers a single partial derivative. When $\mathbf{d}$ is an arbitrary unit vector, it gives the slope of $f$ along that line — useful for line search in optimization.
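The directional version costs only two evaluations regardless of $n$, which is what makes it attractive for line search. A sketch (function and direction are illustrative):

```python
def directional_derivative(f, x, d, gamma=1e-8):
    """Slope of f along direction d at x, from just two function evaluations."""
    x_step = [xi + gamma * di for xi, di in zip(x, d)]
    return (f(x_step) - f(x)) / gamma

# f(x) = x0^2 + x1^2; gradient at (1, 2) is (2, 4),
# so the slope along d = (1, 0) should be 2.
f = lambda x: x[0]**2 + x[1]**2
slope = directional_derivative(f, [1.0, 2.0], [1.0, 0.0])  # close to 2.0
```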
If smaller $\gamma$ always meant better accuracy, we'd just pick $\gamma = 10^{-15}$ and be done. But that's not what happens. There are two competing errors:

- **Truncation error**, from cutting off the Taylor series: for forward differences it is roughly $\frac{M\gamma}{2}$, and it shrinks as $\gamma \to 0$.
- **Rounding error**, from the cancellation in $f(x + \gamma) - f(x)$: it is roughly $\frac{2\varepsilon}{\gamma}$ (scaled by $|f(x)|$), and it blows up as $\gamma \to 0$.

The total error is the sum:
$$E(\gamma) = \frac{M\gamma}{2} + \frac{2\varepsilon}{\gamma}$$where $M = |f''(x)|$ and $\varepsilon \approx 10^{-16}$ in double precision. Taking $dE/d\gamma = 0$:
$$\frac{M}{2} - \frac{2\varepsilon}{\gamma^2} = 0 \quad\Longrightarrow\quad \gamma^* = 2\sqrt{\frac{\varepsilon}{M}} \approx \sqrt{\varepsilon} \approx 10^{-8}$$
For centered differences, the truncation error is $O(\gamma^2)$ instead of $O(\gamma)$, so balancing the two errors the same way gives $\gamma^* \approx \varepsilon^{1/3} \approx 10^{-5}$, with a best achievable accuracy of $O(\varepsilon^{2/3}) \approx 10^{-11}$ rather than $O(\sqrt{\varepsilon}) \approx 10^{-8}$.
On a log-log plot, this creates the characteristic "V" shape: the left side slopes down (truncation dominates), the right side slopes up (rounding dominates), with a minimum at $\gamma^*$.
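The "V" is easy to reproduce numerically. A sketch, sweeping $\gamma$ for a forward difference on $f = e^x$ at $x = 1$ (an illustrative test case, where the true derivative is also $e^x$):

```python
import math

def fd_error(gamma, f=math.exp, df=math.exp, x=1.0):
    """Total forward-difference error at step size gamma."""
    return abs((f(x + gamma) - f(x)) / gamma - df(x))

# Error falls as gamma shrinks (truncation dominates), bottoms out near
# sqrt(eps) ~ 1e-8, then rises again (rounding dominates) -- the "V" shape.
for k in range(1, 16):
    gamma = 10.0 ** (-k)
    print(f"gamma = 1e-{k:02d}   error = {fd_error(gamma):.2e}")
```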
| Property | Value |
|---|---|
| Cost per gradient | $n+1$ (forward) or $2n$ (centered) function evaluations |
| Accuracy | $O(\gamma)$ or $O(\gamma^2)$ — limited by floating-point |
| Step size | $\gamma \approx \sqrt{\varepsilon}$, possibly scaled by $|f(\mathbf{x})|$ |
| Gives you | Gradient-descent-like progress (first-order) |
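Putting the pieces together, here is a minimal sketch of gradient descent driven purely by function evaluations; the learning rate, iteration count, and test function are illustrative choices, not tuned recommendations:

```python
def fd_descent(f, x, lr=0.1, gamma=1e-8, steps=100):
    """Gradient descent where each gradient comes from forward differences,
    costing n + 1 function evaluations per step."""
    x = list(x)
    for _ in range(steps):
        fx = f(x)
        grad = []
        for i in range(len(x)):
            xp = list(x)
            xp[i] += gamma
            grad.append((f(xp) - fx) / gamma)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# Minimize f(x) = (x0 - 1)^2 + (x1 + 2)^2, whose minimum is at (1, -2).
x_min = fd_descent(lambda x: (x[0] - 1)**2 + (x[1] + 2)**2, [0.0, 0.0])
```

On this smooth quadratic the iterate lands very close to $(1, -2)$; the finite-difference bias limits the final accuracy to roughly $\gamma$, consistent with the error analysis above.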