Quadratic Model Methods

Solution 2: Fit a local quadratic model by interpolation and minimize it with a trust-region method.

Why not just use FD for the Hessian?

Newton's method needs the Hessian $\nabla^2 f(\mathbf{x}) \in \mathbb{R}^{n \times n}$. Could we estimate it via finite differences? Each column requires a gradient evaluation, and each gradient costs $O(n)$ function evaluations, so:

$$\text{FD Hessian cost} = O(n) \times O(n) = O(n^2) \text{ function evaluations}$$

For expensive black-box functions, $O(n^2)$ evaluations per step is often unacceptable. The quadratic model approach achieves Newton-like behavior using a different strategy: rather than estimating derivatives, we interpolate the function directly.
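To see the $O(n^2)$ count concretely, here is a sketch that builds an FD Hessian by differencing forward-difference gradients and counts the evaluations. The function `f` below is a hypothetical stand-in for an expensive black box; the step size is illustrative.

```python
import numpy as np

def fd_gradient(f, x, h):
    """Forward-difference gradient: 1 + n evaluations per call."""
    n = len(x)
    f0 = f(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = h
        g[i] = (f(x + e) - f0) / h
    return g

def fd_hessian(f, x, h):
    """Hessian by differencing FD gradients: n + 1 gradient calls,
    hence (n + 1)^2 = O(n^2) function evaluations in total."""
    n = len(x)
    g0 = fd_gradient(f, x, h)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        H[:, j] = (fd_gradient(f, x + e, h) - g0) / h
    return 0.5 * (H + H.T)          # symmetrize

calls = 0
def f(x):                            # hypothetical expensive black box
    global calls
    calls += 1
    return np.sum(x**4) + np.dot(x[:-1], x[1:])

n = 10
x = np.linspace(0.1, 1.0, n)
H = fd_hessian(f, x, h=1e-4)
print(calls, "function evaluations for one Hessian")   # (n+1)^2 = 121
```

For $n = 10$ this naive scheme already burns 121 evaluations for a single Hessian estimate.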

The quadratic model

Build a full quadratic model of $f$ around the current point $\mathbf{x}_k$:

$$m_k(\mathbf{x}_k + \mathbf{p}) = c + \mathbf{q}^T\mathbf{p} + \tfrac{1}{2}\mathbf{p}^T\mathbf{G}\mathbf{p}$$

Here $c$ is a scalar, $\mathbf{q} \in \mathbb{R}^n$ plays the role of the gradient, and $\mathbf{G} \in \mathbb{R}^{n \times n}$ (symmetric) plays the role of the Hessian.

We determine $c$, $\mathbf{q}$, $\mathbf{G}$ by interpolation: choose sample points $\mathbf{y}_1, \ldots, \mathbf{y}_p$ and require

$$m_k(\mathbf{y}_\ell) = f(\mathbf{y}_\ell), \quad 1 \le \ell \le p$$

Then we optimize this model using a trust-region method.
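In 2D ($n = 2$) the model has 6 unknowns, so 6 well-placed sample points determine it. A minimal sketch of the interpolation step, with an assumed sample layout (center, axis steps, one diagonal point) chosen for simplicity rather than for good geometry:

```python
import numpy as np

def fit_quadratic_2d(f, xk, radius=0.5):
    """Fit m(xk + p) = c + q.p + 1/2 p^T G p through 6 sample points (2D).

    Sample layout is an assumption for illustration: center, +/- steps
    along each axis, and one diagonal point.
    """
    steps = radius * np.array(
        [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1], [1, 1]], float)
    # Basis row: [1, p1, p2, p1^2/2, p1*p2, p2^2/2]
    # matching unknowns [c, q1, q2, G11, G12, G22].
    A = np.array([[1, p[0], p[1], p[0]**2 / 2, p[0] * p[1], p[1]**2 / 2]
                  for p in steps])
    b = np.array([f(xk + p) for p in steps])
    c, q1, q2, G11, G12, G22 = np.linalg.solve(A, b)
    return c, np.array([q1, q2]), np.array([[G11, G12], [G12, G22]])

# On an exactly quadratic f, the fit recovers gradient and Hessian exactly.
f = lambda x: 1.0 + 2 * x[0] - x[1] + x[0]**2 + 0.5 * x[0] * x[1] + 2 * x[1]**2
c, q, G = fit_quadratic_2d(f, np.zeros(2))
```

Since this $f$ is itself quadratic, the recovered $\mathbf{q}$ and $\mathbf{G}$ match the true gradient $(2, -1)$ and Hessian at $\mathbf{x}_k = \mathbf{0}$; for a general smooth $f$ they are local approximations.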

How many parameters?

The quadratic model has three groups of unknowns:

| Component | Count |
| --- | --- |
| Constant $c$ | $1$ |
| Linear term $\mathbf{q}$ | $n$ |
| Quadratic term $\mathbf{G}$ (symmetric) | $\frac{n(n+1)}{2}$ |
| **Total** | $1 + n + \frac{n(n+1)}{2} = O(n^2)$ |
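A quick sanity check of the count in the table above:

```python
def num_params(n):
    """1 (constant) + n (linear) + n(n+1)/2 (symmetric quadratic term)."""
    return 1 + n + n * (n + 1) // 2

for n in (2, 10, 50):
    print(n, num_params(n))   # 2 -> 6, 10 -> 66, 50 -> 1326
```

Note that $n = 2$ gives exactly the 6 sample points used in the 2D demo below.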
Quadratic Model Fitting Demo (2D)

Compare two ways to build a local quadratic model at a point $\mathbf{x}_k$: interpolation (6 sample points) vs. FD gradient + FD Hessian.

[Interactive demo panels: True function · Interpolation model (6 pts) · FD Grad + FD Hessian]

The interpolation model uses 6 function evaluations (the minimum for 2D). The FD model uses 1 + 2 (gradient) + 4 (Hessian) = 7 evaluations via central differences. Both produce models of similar quality, but the interpolation approach generalizes to higher dimensions without explicitly estimating the Hessian.

Noisy Evaluations: Interpolation vs. FD

In practice, function evaluations are often noisy (simulation noise, measurement error, stochastic objectives). How does each method degrade? The interpolation model averages noise over 6 points, while FD amplifies noise by dividing small differences by a small $h$.

[Interactive demo panels: True function · Interpolation model (noisy) · FD model (noisy)]

As the noise level $\sigma$ grows while $h$ stays small, the FD Hessian blows up (noise $\sim \sigma$ divided by $h^2$), while the interpolation model degrades more gracefully; different noise realizations swing the FD estimate wildly from one draw to the next.
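A 1D sketch of this scaling, with a least-squares quadratic fit standing in for the interpolation model (the noise level, step size, and sample spread are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1e-4                                   # noise level per evaluation
f = lambda x: x**2 + rng.normal(0.0, sigma)    # noisy f; true f'' = 2

# Central-difference second derivative with small h:
# noise is amplified by ~ sigma / h^2 = 1e-4 / 1e-6 = 100.
h = 1e-3
fd_est = (f(h) - 2.0 * f(0.0) + f(-h)) / h**2

# A quadratic fit over a wider spread of points averages the noise instead.
xs = np.linspace(-0.5, 0.5, 21)
ys = np.array([f(x) for x in xs])
a2 = np.polyfit(xs, ys, 2)[0]    # leading (quadratic) coefficient of the fit
ls_est = 2.0 * a2                # second derivative implied by the fit

print("FD error:", abs(fd_est - 2.0), "fit error:", abs(ls_est - 2.0))
```

The FD estimate is typically off by $O(\sigma/h^2) \approx 100$, swamping the true value of 2, while the fitted estimate stays close.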

Computational cost

We have $p = O(n^2)$ parameters and $p$ interpolation conditions $m_k(\mathbf{y}_\ell) = f(\mathbf{y}_\ell)$. This gives a $p \times p$ linear system for the unknowns $(c, \mathbf{q}, \mathbf{G})$.

Cost to solve the interpolation system
  • From scratch: $O(p^3) = O(n^6)$ — Gaussian elimination on an $O(n^2) \times O(n^2)$ system
  • To update (replacing one sample point): $O(n^4)$

Why does the update cost only $O(n^4)$?

When we move to a new iterate $\mathbf{x}_{k+1}$, we don't throw away the whole model. We keep most of the sample points $\mathbf{y}_\ell$ and replace just one (typically the oldest or worst). This changes one row of the interpolation system.

Updating the factorization of a $p \times p$ system when one row changes costs $O(p^2) = O(n^4)$ — a rank-1 update to the factored system, analogous to the Sherman–Morrison formula.
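A sketch of this rank-1 update, using an explicit inverse for clarity (production codes update a factorization instead) and a synthetic well-conditioned matrix as a stand-in for the interpolation system:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 60                                    # p = O(n^2) interpolation conditions
A = p * np.eye(p) + rng.standard_normal((p, p))   # stand-in system matrix
Ainv = np.linalg.inv(A)                   # paid once: O(p^3)

# Replacing one sample point changes one row: A' = A + e_i (r_new - r_old)^T.
i = 7
r_new = A[i] + 0.1 * rng.standard_normal(p)   # row for the replacement point
v = r_new - A[i]                              # the rank-1 row change
u = np.zeros(p); u[i] = 1.0

# Sherman-Morrison: update the inverse in O(p^2) instead of refactorizing.
Au = Ainv @ u                             # = Ainv[:, i]
vA = v @ Ainv
Ainv_new = Ainv - np.outer(Au, vA) / (1.0 + vA @ u)

A_new = A.copy(); A_new[i] = r_new        # the updated system
```

The matrix-vector products cost $O(p^2) = O(n^4)$, matching the update cost quoted above, versus $O(p^3) = O(n^6)$ for a solve from scratch.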

The key tradeoff:
  • FD gradient + gradient descent — $O(n)$ function evaluations per step, but only first-order (gradient-descent-like) progress.
  • Quadratic model + trust region — $O(n^4)$ arithmetic per step (plus 1 new function evaluation), but second-order (Newton-like) progress per step.

For moderate $n$ (say 10–50), the $O(n^4)$ arithmetic is cheap compared to function evaluations. As the instructor noted: "This little toy in my pocket can do billions of computations per second" — $n^4$ for $n = 50$ is about $6 \times 10^6$, a fraction of a second.

Open questions: How to choose the initial sample point set $\mathbf{y}_\ell$? How to maintain good geometry as points are replaced? See Nocedal & Wright, Section 9.2.