Finite Differences for Optimization

Solution 1: Approximate the gradient using function evaluations alone.

The core question

How would you do optimization without derivatives? The first and simplest idea: approximate them using finite differences.

Recall the definition of the derivative:

$$f'(x) = \lim_{\gamma \to 0} \frac{f(x + \gamma) - f(x)}{\gamma}$$

For a small but finite $\gamma$, we get a forward difference approximation:

$$f'(x) \approx \frac{1}{\gamma}\big(f(x + \gamma) - f(x)\big)$$

Forward vs. centered differences

Forward Difference
$$f'(x) \approx \frac{f(x+\gamma) - f(x)}{\gamma} \qquad \text{Error} = O(\gamma)$$
Centered Difference
$$f'(x) \approx \frac{f(x+\gamma) - f(x-\gamma)}{2\gamma} \qquad \text{Error} = O(\gamma^2)$$

Centered differences are more accurate (second-order vs first-order), but require twice as many function evaluations.
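To see the accuracy gap concretely, here is a minimal Python sketch (the test function $f(x) = \sin x$, whose exact derivative is $\cos x$, is an illustrative choice, not from the text):

```python
import math

def forward_diff(f, x, gamma):
    # first-order scheme: error O(gamma), one extra evaluation
    return (f(x + gamma) - f(x)) / gamma

def centered_diff(f, x, gamma):
    # second-order scheme: error O(gamma^2), two evaluations
    return (f(x + gamma) - f(x - gamma)) / (2 * gamma)

f, x, gamma = math.sin, 1.0, 1e-4
exact = math.cos(x)
err_fwd = abs(forward_diff(f, x, gamma) - exact)
err_cen = abs(centered_diff(f, x, gamma) - exact)
```

At $\gamma = 10^{-4}$ the forward error is on the order of $10^{-5}$ while the centered error is on the order of $10^{-9}$, matching the $O(\gamma)$ vs $O(\gamma^2)$ rates.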

Multivariate gradients

For $f : \mathbb{R}^n \to \mathbb{R}$, we approximate each partial derivative by perturbing one coordinate at a time:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + \gamma \mathbf{e}_i) - f(\mathbf{x})}{\gamma}$$

where $\mathbf{e}_i$ is the $i$-th standard basis vector. This requires $n+1$ function evaluations for the full gradient (one baseline plus one perturbed evaluation per coordinate), or $2n$ with centered differences.
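A sketch of the full forward-difference gradient; the function name `fd_gradient` and the quadratic test function are illustrative assumptions:

```python
import numpy as np

def fd_gradient(f, x, gamma=1e-8):
    # forward-difference gradient: n + 1 evaluations of f total
    x = np.asarray(x, dtype=float)
    f0 = f(x)                       # baseline evaluation, reused for every coordinate
    grad = np.empty_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0                # i-th standard basis vector
        grad[i] = (f(x + gamma * e_i) - f0) / gamma
    return grad

# example: f(x) = x0^2 + 3*x1^2 has gradient (2*x0, 6*x1)
f = lambda x: x[0]**2 + 3 * x[1]**2
g = fd_gradient(f, [1.0, 2.0])      # should be close to (2, 12)
```

Note that the baseline $f(\mathbf{x})$ is computed once and shared across all $n$ coordinate perturbations.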

Gradient along a direction

More generally, the directional derivative of $f$ at $\mathbf{x}$ in direction $\mathbf{d}$ is:

$$\nabla_\mathbf{d} f(\mathbf{x}) = \lim_{\gamma \to 0} \frac{f(\mathbf{x} + \gamma \mathbf{d}) - f(\mathbf{x})}{\gamma} = \nabla f(\mathbf{x})^T \mathbf{d}$$

This tells us the rate of change of $f$ as we move along the line $\mathbf{x} + \gamma\mathbf{d}$. The finite difference approximation is:

$$\nabla_\mathbf{d} f(\mathbf{x}) \approx \frac{f(\mathbf{x} + \gamma\mathbf{d}) - f(\mathbf{x})}{\gamma}$$

When $\mathbf{d} = \mathbf{e}_i$ (a standard basis vector), this recovers a single partial derivative. When $\mathbf{d}$ is an arbitrary unit vector, it gives the slope of $f$ along that line — useful for line search in optimization.

Key point: Computing the full gradient via FD requires $n$ directional derivatives (one per coordinate). But a single directional derivative costs just 1 function evaluation (plus the baseline). This is the basis of coordinate search and pattern search methods we'll see later.
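A single directional derivative can be sketched the same way; the helper name `fd_directional` and the quadratic test function are illustrative assumptions:

```python
import numpy as np

def fd_directional(f, x, d, gamma=1e-8):
    # slope of f along direction d: one evaluation beyond the baseline f(x)
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    return (f(x + gamma * d) - f(x)) / gamma

# f(x) = x0^2 + 3*x1^2 has gradient (2, 12) at (1, 2);
# with d = e_0 we recover the partial derivative df/dx0 = 2
f = lambda x: x[0]**2 + 3 * x[1]**2
slope = fd_directional(f, [1.0, 2.0], [1.0, 0.0])
```

With an arbitrary unit vector $\mathbf{d}$, the same two evaluations estimate $\nabla f(\mathbf{x})^T \mathbf{d}$, which is what a line search needs.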

The error "V" shape

If smaller $\gamma$ always meant better accuracy, we'd just pick $\gamma = 10^{-15}$ and be done. But that's not what happens. There are two competing errors:

Truncation error (from the Taylor series): Using $f(x+\gamma) = f(x) + \gamma f'(x) + \frac{\gamma^2}{2}f''(x) + \cdots$, the forward difference gives $$E_{\text{trunc}} \approx \tfrac{1}{2}|f''(x)|\cdot\gamma$$ This decreases as $\gamma \to 0$.
Rounding error (from floating-point cancellation): When $f(x+\gamma) \approx f(x)$, their difference loses significant digits. The error is approximately $$E_{\text{round}} \approx \frac{2\,\varepsilon_{\text{mach}}\,|f(x)|}{\gamma}$$ This increases as $\gamma \to 0$.

The total error is the sum:

$$E_{\text{total}} \approx \frac{M\gamma}{2} + \frac{2\varepsilon}{\gamma}$$

where $M = |f''(x)|$ and $\varepsilon \approx 10^{-16}$ in double precision. Setting $dE/d\gamma = 0$ gives:

$$\gamma^* = 2\sqrt{\varepsilon/M} \approx \sqrt{\varepsilon} \approx 10^{-8}$$

For centered differences, the truncation error is $O(\gamma^2)$ instead of $O(\gamma)$, so:

$$\gamma^*_{\text{centered}} \approx \left(\frac{3\varepsilon}{M}\right)^{1/3} \approx \varepsilon^{1/3} \approx 10^{-5.3}$$

On a log-log plot, this creates the characteristic "V" shape: the left side slopes down (truncation dominates), the right side slopes up (rounding dominates), with a minimum at $\gamma^*$.
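The V shape is easy to reproduce numerically. A sketch sweeping $\gamma$ over decades for the forward difference of $f(x) = \sin x$ at $x = 1$ (an illustrative test function):

```python
import math

def forward_error(gamma, x=1.0):
    # absolute error of the forward difference against the exact cos(x)
    approx = (math.sin(x + gamma) - math.sin(x)) / gamma
    return abs(approx - math.cos(x))

# errors[k] is the error at gamma = 10^(-k)
errors = {k: forward_error(10.0 ** -k) for k in range(1, 16)}

# error falls as gamma shrinks (truncation dominates), bottoms out
# near gamma ~ 1e-8, then rises again (rounding dominates)
best_k = min(errors, key=errors.get)
```

Printing `errors` on a log-log scale reproduces the V, with the minimum near $\gamma^* \approx 10^{-8}$ as predicted.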


Summary: FD as an optimization strategy

| Property | Value |
|---|---|
| Cost per gradient | $n+1$ (forward) or $2n$ (centered) function evaluations |
| Accuracy | $O(\gamma)$ or $O(\gamma^2)$, limited by floating-point |
| Step size | $\gamma \approx \sqrt{\varepsilon}$, possibly scaled by $|f(\mathbf{x})|$ |
| Gives you | Gradient-descent-like progress (first-order) |
Key question: Can we do better? Can we get Newton-like (second-order) progress without derivatives? That leads us to quadratic model fitting.