Structured Hessians

Sometimes $\mH$ is huge but most of it is zero — or it's a low-rank perturbation of something easy. Then Newton is back on the table.

The setup

The cost of a Newton step is $O(n^3)$ only if $\mH$ is dense and unstructured. For many real problems, the Hessian inherits structure from the objective: diagonal, banded, sparse, low-rank-plus-diagonal, or arrowhead. Each pattern admits a specialized solver that beats $O(n^3)$ — sometimes dramatically.

Below: a show-and-tell of patterns that come up in practice, with spy plots and where they arise.

Reading a spy plot. A black dot at row $i$, column $j$ means $H_{ij} \ne 0$. The pattern is what matters — the actual numerical values don't.

1. Diagonal — separable objectives and log-barriers

If $f(\vx) = \sum_{i=1}^{n} \phi_i(x_i)$, the Hessian is purely diagonal: $H_{ii} = \phi_i''(x_i)$, $H_{ij} = 0$ for $i \ne j$.

Where it shows up:

  • Log-barrier terms in interior-point methods: $-\tau \sum \log x_i \;\Rightarrow\; H_{ii} = \tau / x_i^2$.
  • $\ell_p$ regularizers: $\sum |x_i|^p \;\Rightarrow\;$ diagonal Hessian (where defined).
  • Per-coordinate energy terms in lattice / image-denoising models.

Solve cost: $O(n)$   Storage: $O(n)$

A pure diagonal $\mH$ means Newton is just $n$ scalar divides per step. This is essentially gradient descent with a per-coordinate learning rate.
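
For concreteness, here is a minimal sketch of one diagonal Newton step in NumPy. The separable objective $\phi_i(x) = e^{x} - b_i x$ is an assumed toy choice; any smooth convex $\phi_i$ works the same way.

```python
import numpy as np

# Assumed toy separable objective: f(x) = sum_i exp(x_i) - b_i * x_i,
# so the gradient and the Hessian diagonal are both elementwise expressions.
rng = np.random.default_rng(0)
n = 1000
b = rng.uniform(0.5, 2.0, size=n)
x = np.zeros(n)

g = np.exp(x) - b      # gradient:  phi_i'(x_i)
h = np.exp(x)          # Hessian diagonal:  phi_i''(x_i) > 0
x = x - g / h          # Newton step: n scalar divides, no linear algebra
```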

2. Dense + diagonal — quadratic loss with a separable penalty

Consider $f(\vx) = \tfrac{1}{2}\|\mA\vx - \vb\|^2 + \sum_i x_i^p$. The Hessian is

$$\mH = \mA^T\mA + \mD$$

where $\mD$ is diagonal. The first piece may be dense, but if $\mA$ is cheap to apply (sparse, low-rank, FFT, ...), then $\mA^T\mA$ is cheap to apply too.

Trick: never form $\mA^T\mA$. Instead, solve $\mH\vp = -\vg$ with conjugate gradients, applying $\mH$ via two matvecs: $\mH\vp = \mA^T(\mA\vp) + \mD\vp$.

Solve cost: $O(\text{nnz}(\mA))$ per CG iter
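
A minimal sketch of that trick with SciPy's CG. Here $\mA$ is a plain dense array purely for illustration; in practice it would be whatever implicit operator you have. The diagonal of $\mD$ and the problem sizes are assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
m, n = 200, 5000
A = rng.standard_normal((m, n)) / np.sqrt(m)   # stand-in for a structured / implicit A
d = np.full(n, 0.5)                            # diagonal of D (assumed positive)
g = rng.standard_normal(n)                     # current gradient

# Apply H = A^T A + D without forming it: two matvecs plus an elementwise scaling.
H = LinearOperator((n, n), matvec=lambda p: A.T @ (A @ p) + d * p, dtype=float)

p, info = cg(H, -g)    # Newton direction; info == 0 means CG converged
```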

3. Banded — local couplings only

$H_{ij} = 0$ when $|i-j| > b$. The bandwidth $b$ controls the number of nonzeros: at most $2b + 1$ per row.

Where it shows up:

  • 1-D PDE discretizations (heat equation on a line): tridiagonal, $b = 1$.
  • Spline fitting with cubic / B-spline basis functions: pentadiagonal.
  • Time-series / trajectory optimization (each timestep couples to the next): block-banded.
  • Markov chain stationary distribution problems with a chain topology.

Solve cost: $O(n b^2)$   Storage: $O(nb)$

Banded Cholesky stays banded. So tridiagonal solves in $O(n)$, pentadiagonal in $O(n)$, etc.
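
A sketch of a tridiagonal Newton solve with `scipy.linalg.solveh_banded`. The specific matrix (a strictly diagonally dominant tridiagonal, hence SPD) and the gradient are made up for illustration.

```python
import numpy as np
from scipy.linalg import solveh_banded

n = 100_000
diag = np.full(n, 3.0)    # main diagonal
off = np.full(n, -1.0)    # first sub-diagonal; the last entry is padding and is ignored

# Lower banded storage (b = 1): row 0 = main diagonal, row 1 = first sub-diagonal.
ab = np.vstack([diag, off])

g = np.random.default_rng(0).standard_normal(n)   # stand-in gradient
p = solveh_banded(ab, -g, lower=True)             # Newton step in O(n) time and memory
```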

4. Arrowhead — a few "global" variables coupling local ones

A diagonal core, plus a single dense row and column. Form: $\mH = \mD + \vu\ve_n^T + \ve_n\vu^T$ (or block versions).

Where it shows up:

  • Multi-task learning: per-task parameters (diagonal blocks) plus a few shared parameters (the "arrow").
  • Hierarchical models: many local effects, a few global hyperparameters.
  • Lagrangian Hessians with a few coupling constraints.

Solve cost: $O(n)$ if you order it right

Ordering matters! Put the global variable last, and Cholesky stays sparse. Put it first, and Cholesky fills in the entire factor — back to $O(n^3)$. (Demo below.)

Why ordering matters: arrow → fill-in

Same matrix, two orderings. The spy plots show $\mH$ alongside its Cholesky factor $\mL$: the "global variable last" ordering keeps $\mL$ sparse, while the "global variable first" ordering fills in the entire lower triangle.
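
The effect is easy to reproduce numerically. In this small sketch the sizes and values are arbitrary (the matrix is made strictly diagonally dominant so Cholesky succeeds); counting nonzeros in $\mL$ under the two orderings shows the fill-in directly.

```python
import numpy as np

n = 8
d = np.full(n, 1.0)           # diagonal core
u = np.full(n - 1, 0.1)       # coupling between the one global variable and the rest

# Global variable LAST: the arrow points to the bottom-right corner.
H_last = np.diag(d)
H_last[:-1, -1] = u
H_last[-1, :-1] = u

# Same matrix with the global variable FIRST (a symmetric permutation).
perm = np.r_[n - 1, np.arange(n - 1)]
H_first = H_last[np.ix_(perm, perm)]

L_last = np.linalg.cholesky(H_last)
L_first = np.linalg.cholesky(H_first)

print(np.count_nonzero(L_last))    # 2n - 1: the factor keeps the arrowhead sparsity
print(np.count_nonzero(L_first))   # n(n+1)/2: the whole lower triangle fills in
```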

5. General sparse — from QPs and graphs

Many large convex QPs have $\mH$ that's sparse with no special pattern, just a few nonzeros per row. Optimization variables tied to nodes of a graph give a Hessian with the graph's adjacency structure.

Where it shows up:

  • Graph regularization: $\sum_{(i,j) \in E} (x_i - x_j)^2 \;\Rightarrow\; \mH$ is the graph Laplacian.
  • Finite-element assembly on a 2-D / 3-D mesh.
  • SLAM and bundle adjustment in robotics / vision (very sparse, structured by camera-landmark adjacency).

Solve cost: depends on fill-in

Sparse direct solvers (CHOLMOD, MA57) use clever permutations — AMD, METIS — to minimize fill-in. Tim Davis's Direct Methods for Sparse Linear Systems is the bible. Our colleague Alex Pothen at Purdue works on automatic differentiation methods that produce sparse Hessians directly.
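
Here is a small sketch using SciPy's SuperLU interface on a 2-D grid Laplacian plus identity, an assumed stand-in for a graph-structured Hessian. It prints the factor fill under a natural ordering versus COLAMD, then solves for the Newton step.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

k = 40                                   # 2-D grid of k x k nodes, so n = 1600
n = k * k
lap1d = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(k, k))
H = (sp.kronsum(lap1d, lap1d) + sp.identity(n)).tocsc()   # SPD, ~5 nonzeros per row

g = np.random.default_rng(0).standard_normal(n)

# Same matrix, two column orderings: compare nonzeros in the triangular factors.
for ordering in ("NATURAL", "COLAMD"):
    lu = splu(H, permc_spec=ordering)
    print(ordering, lu.L.nnz + lu.U.nnz)

p = splu(H, permc_spec="COLAMD").solve(-g)   # Newton step via the sparse factorization
```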

6. Low-rank + diagonal — generalized linear models

$\mH = \mD + \mU\mU^T$ with $\mU \in \RR^{n \times k}$, $k \ll n$.

Where it shows up:

  • Logistic regression with $\ell_2$ regularization: $\mH = \mathbf{X}^T \mD \mathbf{X} + \lambda \mI$, where $\mathbf{X} \in \RR^{m \times n}$ and $m \ll n$ (more parameters than data; the $\lambda\mI$ term makes $\mH$ positive definite).
  • Reduced-order models: a small number of "modes" couple to a diagonal background.
  • Quasi-Newton approximations — including L-BFGS — produce exactly this form. (Foreshadowing!)

Solve cost: $O(nk^2 + k^3)$ via Sherman–Morrison–Woodbury

$(\mD + \mU\mU^T)^{-1} = \mD^{-1} - \mD^{-1}\mU(\mI + \mU^T\mD^{-1}\mU)^{-1}\mU^T\mD^{-1}$. The expensive piece is a $k \times k$ inversion.
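
A minimal sketch of a Woodbury-based Newton solve; the function name and the test sizes are my own, and only NumPy is assumed.

```python
import numpy as np

def woodbury_solve(d, U, b):
    """Solve (diag(d) + U @ U.T) x = b in O(n k^2 + k^3) via Sherman-Morrison-Woodbury."""
    Dinv_b = b / d                               # D^{-1} b            O(n)
    Dinv_U = U / d[:, None]                      # D^{-1} U            O(nk)
    S = np.eye(U.shape[1]) + U.T @ Dinv_U        # k x k capacitance   O(nk^2)
    return Dinv_b - Dinv_U @ np.linalg.solve(S, U.T @ Dinv_b)

# Sanity check against a dense solve.
rng = np.random.default_rng(0)
n, k = 2000, 10
d = rng.uniform(1.0, 2.0, n)
U = rng.standard_normal((n, k))
g = rng.standard_normal(n)

p = woodbury_solve(d, U, -g)
assert np.allclose(p, np.linalg.solve(np.diag(d) + U @ U.T, -g))
```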

Big picture: what to do with structure

Structure                    Storage            Solve cost
Diagonal                     $O(n)$             $O(n)$
Banded ($b$)                 $O(nb)$            $O(nb^2)$
Arrowhead ($k$ globals)      $O(n + k^2)$       $O(n + k^3)$
General sparse               $O(\text{nnz})$    $O(\text{nnz} + \text{fill-in})$
Low-rank + diag ($k$)        $O(nk)$            $O(nk^2 + k^3)$
Dense (no structure)         $O(n^2)$           $O(n^3)$

The takeaway. If you can identify any structure in your Hessian, exploit it. A Newton step on a $10^6$-by-$10^6$ banded Hessian costs little more than a gradient step on the same problem, and Newton needs far fewer iterations.

What if you can only do matvecs?

Sometimes you can compute $\mH \vv$ cheaply — via automatic differentiation, or because $\mH = \mA^T\mA + \mD$ with $\mA$ implicit (FFT, neural net, ...) — but you can't form $\mH$ as a matrix.

Then use an iterative solver for $\mH \vp = -\vg$ (CG when $\mH \succ 0$, MINRES otherwise). Stop early when

$$\|\mH \vp + \vg\| \le \eta_k \|\vg\|$$

with $\eta_k \to 0$ chosen carefully. This is inexact Newton (Nocedal & Wright, Theorem 7.1) and it preserves superlinear convergence under mild assumptions.
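
A bare-bones sketch of such an inexact Newton (truncated CG) step, given only a Hessian-vector product `hvp`. The fixed forcing term `eta` and the absence of preconditioning or negative-curvature handling are simplifications.

```python
import numpy as np

def inexact_newton_step(hvp, g, eta=0.1, max_iter=None):
    """Approximately solve H p = -g by CG, stopping once ||H p + g|| <= eta * ||g||."""
    r = -np.asarray(g, dtype=float)   # residual of H p = -g at p = 0
    p = np.zeros_like(r)
    d = r.copy()
    tol = eta * np.linalg.norm(g)
    for _ in range(max_iter or r.size):
        if np.linalg.norm(r) <= tol:  # inexact-Newton forcing condition satisfied
            break
        Hd = hvp(d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p
```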

Aside: directions of negative curvature. If CG breaks down because it encounters a direction $\vd$ with $\vd^T \mH \vd < 0$, that $\vd$ is a direction of negative curvature, which is useful for escaping saddle points.

Next up. What if you don't have structure, can't compute $\mH$, but still want Newton-like progress? That's what L-BFGS is for.