Sometimes $\mH$ is huge but most of it is zero — or it's a low-rank perturbation of something easy. Then Newton is back on the table.
The cost of a Newton step is $O(n^3)$ only if $\mH$ is dense and unstructured. For many real problems, the Hessian inherits structure from the objective: diagonal, banded, sparse, low-rank-plus-diagonal, or arrowhead. Each pattern admits a specialized solver that beats $O(n^3)$ — sometimes dramatically.
Below: a show-and-tell of patterns that come up in practice, with spy plots and where they arise.
If $f(\vx) = \sum_{i=1}^{n} \phi_i(x_i)$, the Hessian is purely diagonal: $H_{ii} = \phi_i''(x_i)$, $H_{ij} = 0$ for $i \ne j$.
Where it shows up:
Solve cost: $O(n)$. Storage: $O(n)$
A pure diagonal $\mH$ means the Newton step is just $n$ scalar divides: $p_i = -g_i / H_{ii}$. This is essentially gradient descent with a per-coordinate learning rate $1/\phi_i''(x_i)$.
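Here's what that looks like in code: a toy NumPy sketch on a made-up separable objective, nothing library-specific.

```python
import numpy as np

def diag_newton_step(x, grad, hess_diag):
    """One Newton step with a diagonal Hessian: p_i = -g_i / H_ii."""
    return x - grad(x) / hess_diag(x)        # n scalar divides, no factorization

# Toy separable objective: f(x) = sum_i (x_i^4 + x_i^2)
grad = lambda x: 4 * x**3 + 2 * x            # gradient, coordinate-wise
hess_diag = lambda x: 12 * x**2 + 2          # phi_i''(x_i) > 0 everywhere
x = np.full(5, 3.0)
for _ in range(20):
    x = diag_newton_step(x, grad, hess_diag)
print(x)                                     # all coordinates approach the minimizer at 0
```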
Consider $f(\vx) = \tfrac{1}{2}\|\mA\vx - \vb\|^2 + \sum_i x_i^p$. The Hessian is
$$\mH = \mA^T\mA + \mD$$
where $\mD$ is diagonal. The first piece may be dense, but if $\mA$ has structure (sparse, low-rank, FFT, ...), so does $\mA^T\mA$.
Trick: never form $\mA^T\mA$. Instead, solve $\mH\vp = -\vg$ with conjugate gradients, applying $\mH$ via two matvecs: $\mH\vp = \mA^T(\mA\vp) + \mD\vp$.
Solve cost: $O(\text{nnz}(\mA))$ per CG iter
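A sketch of that matrix-free CG solve, assuming SciPy; `newton_direction` and the toy problem below are illustrative, not a library API.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction(A, d, g):
    """Solve (A^T A + diag(d)) p = -g by CG, touching A only through matvecs."""
    n = A.shape[1]
    Hv = lambda v: A.T @ (A @ v) + d * v     # never forms A^T A explicitly
    H = LinearOperator((n, n), matvec=Hv)
    p, info = cg(H, -g)
    return p

# Toy usage: sparse A, positive diagonal d, so H is SPD and CG applies
rng = np.random.default_rng(0)
A = sp.random(1000, 400, density=0.01, format="csr", random_state=0)
d = np.ones(400)
g = rng.standard_normal(400)
p = newton_direction(A, d, g)
```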
$H_{ij} = 0$ when $|i-j| > b$. Bandwidth $b$ controls the number of nonzeros.
Where it shows up:
Solve cost: $O(n b^2)$. Storage: $O(nb)$
Banded Cholesky stays banded: the factor has the same bandwidth, so tridiagonal systems ($b = 1$) solve in $O(n)$, pentadiagonal ($b = 2$) in $O(n)$, and so on.
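A quick sketch with SciPy's banded Cholesky solver on a toy tridiagonal Hessian (the banded storage layout is SciPy's; everything else is made up):

```python
import numpy as np
from scipy.linalg import solveh_banded

n = 100_000
# Tridiagonal SPD Hessian in upper banded storage: row 0 = superdiagonal, row 1 = main diagonal
ab = np.zeros((2, n))
ab[0, 1:] = -1.0     # H[i, i+1] = -1
ab[1, :] = 2.5       # H[i, i] = 2.5 (diagonally dominant, hence positive definite)
g = np.random.default_rng(0).standard_normal(n)

p = solveh_banded(ab, -g)    # banded Cholesky: O(n b^2) time, O(nb) memory
```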
A diagonal core, plus a single dense row and column. Form: $\mH = \mD + \vu\ve_n^T + \ve_n\vu^T$ (or block versions).
Where it shows up:
Solve cost: $O(n)$ if you order it right
Ordering matters! Put the global variable last, and Cholesky stays sparse. Put it first, and Cholesky fills in the entire factor — back to $O(n^3)$. (Demo below.)
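A minimal sketch of that demo (NumPy only, toy numbers; counting nonzeros in the Cholesky factor stands in for a spy plot):

```python
import numpy as np

n = 400
rng = np.random.default_rng(0)
d = rng.uniform(1.0, 2.0, n)          # diagonal core, all positive
u = rng.standard_normal(n - 1)
u *= 0.5 / np.abs(u).sum()            # keep H diagonally dominant, hence positive definite

# Global variable LAST: diagonal plus a dense last row and column
H_last = np.diag(d)
H_last[:-1, -1] = u
H_last[-1, :-1] = u

# Same matrix with the global variable permuted to be FIRST
perm = np.r_[n - 1, np.arange(n - 1)]
H_first = H_last[np.ix_(perm, perm)]

nnz = lambda H: np.count_nonzero(np.abs(np.linalg.cholesky(H)) > 1e-14)
print("nnz(L), global last :", nnz(H_last))    # about 2n - 1: the factor keeps the arrow shape
print("nnz(L), global first:", nnz(H_first))   # about n(n+1)/2: the factor fills in completely
```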
Many large convex QPs have $\mH$ that's sparse with no special pattern, just a few nonzeros per row. Optimization variables tied to nodes of a graph give a Hessian with the graph's adjacency structure.
Where it shows up:
Solve cost: depends on fill-in
Sparse direct solvers (CHOLMOD, MA57) use clever permutations — AMD, METIS — to minimize fill-in. Tim Davis's *Direct Methods for Sparse Linear Systems* is the bible. Our colleague Alex Pothen at Purdue works on automatic differentiation methods that produce sparse Hessians directly.
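CHOLMOD and MA57 aren't in SciPy, but SciPy's SuperLU wrapper illustrates the same point; a toy sketch on a grid-graph Hessian, with `permc_spec` selecting the fill-reducing ordering:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Toy graph-structured Hessian: 2-D grid Laplacian plus a diagonal shift
m = 200
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], (m, m))
H = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m)) + 0.1 * sp.eye(m * m)).tocsc()
g = np.random.default_rng(0).standard_normal(m * m)

lu = splu(H, permc_spec="MMD_AT_PLUS_A")   # fill-reducing ordering; compare "NATURAL"
p = lu.solve(-g)
print("fill-in ratio:", (lu.L.nnz + lu.U.nnz) / H.nnz)
```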
$\mH = \mD + \mU\mU^T$ with $\mU \in \RR^{n \times k}$, $k \ll n$.
Where it shows up:
Solve cost: $O(nk^2 + k^3)$ via Sherman–Morrison–Woodbury
$(\mD + \mU\mU^T)^{-1} = \mD^{-1} - \mD^{-1}\mU(\mI + \mU^T\mD^{-1}\mU)^{-1}\mU^T\mD^{-1}$. The expensive piece is a $k \times k$ inversion.
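In code, the identity is a few lines; `woodbury_solve` below is just an illustrative name, with a dense solve as a sanity check.

```python
import numpy as np

def woodbury_solve(d, U, b):
    """Solve (diag(d) + U U^T) x = b via Sherman-Morrison-Woodbury in O(nk^2 + k^3)."""
    Dinv_b = b / d                            # D^{-1} b           O(n)
    Dinv_U = U / d[:, None]                   # D^{-1} U           O(nk)
    S = np.eye(U.shape[1]) + U.T @ Dinv_U     # k x k capacitance  O(nk^2)
    return Dinv_b - Dinv_U @ np.linalg.solve(S, U.T @ Dinv_b)   # only a k x k solve

# Sanity check against the dense solve
rng = np.random.default_rng(0)
n, k = 2000, 5
d = rng.uniform(1.0, 2.0, n)
U = rng.standard_normal((n, k))
b = rng.standard_normal(n)
x = woodbury_solve(d, U, b)
print(np.allclose((np.diag(d) + U @ U.T) @ x, b))   # True
```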
| Structure | Storage | Solve cost |
|---|---|---|
| Diagonal | $O(n)$ | $O(n)$ |
| Banded ($b$) | $O(nb)$ | $O(nb^2)$ |
| Arrowhead ($k$ globals) | $O(nk + k^2)$ | $O(nk^2 + k^3)$ |
| General sparse | $O(\text{nnz})$ | $O(\text{nnz} + \text{fill-in})$ |
| Low-rank + diag ($k$) | $O(nk)$ | $O(nk^2 + k^3)$ |
| Dense (no structure) | $O(n^2)$ | $O(n^3)$ |
Sometimes you can compute $\mH \vv$ cheaply — via automatic differentiation, or because $\mH = \mA^T\mA + \mD$ with $\mA$ implicit (FFT, neural net, ...) — but you can't form $\mH$ as a matrix.
Then use an iterative solver for $\mH \vp = -\vg$ (CG when $\mH \succ 0$, MINRES otherwise). Stop early when
$$\|\mH \vp + \vg\| \le \eta_k \|\vg\|$$
with $\eta_k \to 0$ chosen carefully. This is inexact Newton (Nocedal & Wright, Theorem 7.1) and it preserves superlinear convergence under mild assumptions.
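A sketch of one inexact Newton step, assuming SciPy and user-supplied `grad` and `hvp` callables (e.g. from automatic differentiation); the forcing term here is the standard $\eta_k = \min(0.5, \sqrt{\|\vg\|})$ choice.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inexact_newton_step(x, grad, hvp):
    """One inexact Newton step: CG on H p = -g, stopped once ||H p + g|| <= eta_k ||g||."""
    g = grad(x)
    eta_k = min(0.5, np.sqrt(np.linalg.norm(g)))    # forcing term -> 0 near the solution
    H = LinearOperator((x.size, x.size), matvec=lambda v: hvp(x, v))
    # 'rtol' is the relative-residual tolerance (called 'tol' in SciPy < 1.12)
    p, _ = cg(H, -g, rtol=eta_k, atol=0.0)
    return x + p
```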