Quadratic Model Methods

Solution 2: Fit a local quadratic model by interpolation and minimize it with a trust-region method.

Why not just use FD for the Hessian?

Newton's method needs the Hessian $\nabla^2 f(\mathbf{x}) \in \mathbb{R}^{n \times n}$. Could we estimate it via finite differences? Each column requires a gradient evaluation, and each gradient costs $O(n)$ function evaluations, so:

$$\text{FD Hessian cost} = O(n) \times O(n) = O(n^2) \text{ function evaluations}$$

For expensive black-box functions, $O(n^2)$ evaluations per step is often unacceptable. The quadratic model approach achieves Newton-like behavior using a different strategy: rather than estimating derivatives, we interpolate the function directly.
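To see the $O(n^2)$ count concretely, here is a sketch that builds an FD Hessian by differencing forward-difference gradients and counts the evaluations. The function `f` below is a hypothetical stand-in for an expensive black box; the step size is illustrative.

```python
import numpy as np

def fd_gradient(f, x, h):
    """Forward-difference gradient: 1 + n evaluations per call."""
    n = len(x)
    f0 = f(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = h
        g[i] = (f(x + e) - f0) / h
    return g

def fd_hessian(f, x, h):
    """Hessian by differencing FD gradients: n + 1 gradient calls,
    hence (n + 1)^2 = O(n^2) function evaluations in total."""
    n = len(x)
    g0 = fd_gradient(f, x, h)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        H[:, j] = (fd_gradient(f, x + e, h) - g0) / h
    return 0.5 * (H + H.T)          # symmetrize

calls = 0
def f(x):                            # hypothetical expensive black box
    global calls
    calls += 1
    return np.sum(x**4) + np.dot(x[:-1], x[1:])

n = 10
x = np.linspace(0.1, 1.0, n)
H = fd_hessian(f, x, h=1e-4)
print(calls, "function evaluations for one Hessian")   # (n+1)^2 = 121
```

For $n = 10$ this naive scheme already burns 121 evaluations for a single Hessian estimate.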

The quadratic model

Build a full quadratic model of $f$ around the current point $\mathbf{x}_k$:

$$m_k(\mathbf{x}_k + \mathbf{p}) = c + \mathbf{q}^T\mathbf{p} + \tfrac{1}{2}\mathbf{p}^T\mathbf{G}\mathbf{p}$$

Here $c$ is a scalar, $\mathbf{q} \in \mathbb{R}^n$ plays the role of the gradient, and $\mathbf{G} \in \mathbb{R}^{n \times n}$ (symmetric) plays the role of the Hessian.

We determine $c$, $\mathbf{q}$, $\mathbf{G}$ by interpolation: choose sample points $\mathbf{y}_1, \ldots, \mathbf{y}_p$ and require

$$m_k(\mathbf{y}_\ell) = f(\mathbf{y}_\ell), \quad 1 \le \ell \le p$$

Then we optimize this model using a trust-region method.
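In 2D ($n = 2$) the model has 6 unknowns, so 6 well-placed sample points determine it. A minimal sketch of the interpolation step, with an assumed sample layout (center, axis steps, one diagonal point) chosen for simplicity rather than for good geometry:

```python
import numpy as np

def fit_quadratic_2d(f, xk, radius=0.5):
    """Fit m(xk + p) = c + q.p + 1/2 p^T G p through 6 sample points (2D).

    Sample layout is an assumption for illustration: center, +/- steps
    along each axis, and one diagonal point.
    """
    steps = radius * np.array(
        [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1], [1, 1]], float)
    # Basis row: [1, p1, p2, p1^2/2, p1*p2, p2^2/2]
    # matching unknowns [c, q1, q2, G11, G12, G22].
    A = np.array([[1, p[0], p[1], p[0]**2 / 2, p[0] * p[1], p[1]**2 / 2]
                  for p in steps])
    b = np.array([f(xk + p) for p in steps])
    c, q1, q2, G11, G12, G22 = np.linalg.solve(A, b)
    return c, np.array([q1, q2]), np.array([[G11, G12], [G12, G22]])

# On an exactly quadratic f, the fit recovers gradient and Hessian exactly.
f = lambda x: 1.0 + 2 * x[0] - x[1] + x[0]**2 + 0.5 * x[0] * x[1] + 2 * x[1]**2
c, q, G = fit_quadratic_2d(f, np.zeros(2))
```

Since this $f$ is itself quadratic, the recovered $\mathbf{q}$ and $\mathbf{G}$ match the true gradient $(2, -1)$ and Hessian at $\mathbf{x}_k = \mathbf{0}$; for a general smooth $f$ they are local approximations.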

How many parameters?

The quadratic model has three groups of unknowns:

| Component | Count |
| --- | --- |
| Constant $c$ | $1$ |
| Linear term $\mathbf{q}$ | $n$ |
| Quadratic term $\mathbf{G}$ (symmetric) | $\frac{n(n+1)}{2}$ |
| **Total** | $1 + n + \frac{n(n+1)}{2} = O(n^2)$ |
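A quick sanity check of the count in the table above:

```python
def num_params(n):
    """1 (constant) + n (linear) + n(n+1)/2 (symmetric quadratic term)."""
    return 1 + n + n * (n + 1) // 2

for n in (2, 10, 50):
    print(n, num_params(n))   # 2 -> 6, 10 -> 66, 50 -> 1326
```

Note that $n = 2$ gives exactly the 6 sample points used in the 2D demo below.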
Quadratic Model Fitting Demo (2D)

Compare two ways to build a local quadratic model at a point $\mathbf{x}_k$: interpolation (6 sample points) vs. FD gradient + FD Hessian.

[Interactive demo panels: True function · Interpolation model (6 pts) · FD Grad + FD Hessian]

The interpolation model uses 6 function evaluations (the minimum for 2D). The FD model uses 1 + 2 (gradient) + 4 (Hessian) = 7 evaluations via central differences. Both produce models of similar quality, but the interpolation approach generalizes to higher dimensions without explicitly estimating the Hessian.

Noisy Evaluations: Interpolation vs. FD

In practice, function evaluations are often noisy (simulation noise, measurement error, stochastic objectives). How does each method degrade? The interpolation model averages noise over 6 points, while FD amplifies noise by dividing small differences by a small $h$.

[Interactive demo panels: True function · Interpolation model (noisy) · FD model (noisy)]

As the noise level $\sigma$ grows while $h$ stays small, the FD Hessian blows up (noise $\sim \sigma$ divided by $h^2$), while the interpolation model degrades more gracefully; different noise realizations swing the FD estimate wildly from one draw to the next.
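A 1D sketch of this scaling, with a least-squares quadratic fit standing in for the interpolation model (the noise level, step size, and sample spread are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1e-4                                   # noise level per evaluation
f = lambda x: x**2 + rng.normal(0.0, sigma)    # noisy f; true f'' = 2

# Central-difference second derivative with small h:
# noise is amplified by ~ sigma / h^2 = 1e-4 / 1e-6 = 100.
h = 1e-3
fd_est = (f(h) - 2.0 * f(0.0) + f(-h)) / h**2

# A quadratic fit over a wider spread of points averages the noise instead.
xs = np.linspace(-0.5, 0.5, 21)
ys = np.array([f(x) for x in xs])
a2 = np.polyfit(xs, ys, 2)[0]    # leading (quadratic) coefficient of the fit
ls_est = 2.0 * a2                # second derivative implied by the fit

print("FD error:", abs(fd_est - 2.0), "fit error:", abs(ls_est - 2.0))
```

The FD estimate is typically off by $O(\sigma/h^2) \approx 100$, swamping the true value of 2, while the fitted estimate stays close.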

Computational cost

We have $p = O(n^2)$ parameters and $p$ interpolation conditions $m_k(\mathbf{y}_\ell) = f(\mathbf{y}_\ell)$. This gives a $p \times p$ linear system for the unknowns $(c, \mathbf{q}, \mathbf{G})$.

Cost to solve the interpolation system
  • From scratch: $O(p^3) = O(n^6)$ — Gaussian elimination on an $O(n^2) \times O(n^2)$ system
  • To update (replacing one sample point): $O(n^4)$

Why does the update cost only $O(n^4)$?

When we move to a new iterate $\mathbf{x}_{k+1}$, we don't throw away the whole model. We keep most of the sample points $\mathbf{y}_\ell$ and replace just one (typically the oldest or worst). This changes one row of the interpolation system.

Updating the factorization of a $p \times p$ system when one row changes costs $O(p^2) = O(n^4)$ — a rank-1 update to the factored system, analogous to the Sherman–Morrison formula.
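A sketch of this rank-1 update, using an explicit inverse for clarity (production codes update a factorization instead) and a synthetic well-conditioned matrix as a stand-in for the interpolation system:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 60                                    # p = O(n^2) interpolation conditions
A = p * np.eye(p) + rng.standard_normal((p, p))   # stand-in system matrix
Ainv = np.linalg.inv(A)                   # paid once: O(p^3)

# Replacing one sample point changes one row: A' = A + e_i (r_new - r_old)^T.
i = 7
r_new = A[i] + 0.1 * rng.standard_normal(p)   # row for the replacement point
v = r_new - A[i]                              # the rank-1 row change
u = np.zeros(p); u[i] = 1.0

# Sherman-Morrison: update the inverse in O(p^2) instead of refactorizing.
Au = Ainv @ u                             # = Ainv[:, i]
vA = v @ Ainv
Ainv_new = Ainv - np.outer(Au, vA) / (1.0 + vA @ u)

A_new = A.copy(); A_new[i] = r_new        # the updated system
```

The matrix-vector products cost $O(p^2) = O(n^4)$, matching the update cost quoted above, versus $O(p^3) = O(n^6)$ for a solve from scratch.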

The key tradeoff:
  • FD gradient + gradient descent — $O(n)$ function evaluations per step, but only first-order (gradient-descent-like) progress.
  • Quadratic model + trust region — $O(n^4)$ arithmetic per step (plus 1 new function evaluation), but second-order (Newton-like) progress per step.

For moderate $n$ (say 10–50), the $O(n^4)$ arithmetic is cheap compared to function evaluations. As the instructor noted: "This little toy in my pocket can do billions of computations per second" — $n^4$ for $n = 50$ is about $6 \times 10^6$, a fraction of a second.

Open questions: How to choose the initial sample point set $\mathbf{y}_\ell$? How to maintain good geometry as points are replaced? See Nocedal & Wright, Section 9.2.