Most of the speed of full BFGS, none of the $n \times n$ memory. The trick: store the last few $(\vs, \vy)$ pairs and apply $\mT_k$ via a two-loop recursion.
BFGS maintains an approximation $\mT_k$ to the inverse Hessian. Each step:
    x_0, T_0 given                              # T_0 = scaled identity
    while not done:
        p_k = -T_k g_k                          # search direction
        α_k = line search along p_k
        x_{k+1} = x_k + α_k p_k
        s_k = x_{k+1} - x_k
        y_k = g_{k+1} - g_k
        T_{k+1} = V_k^T T_k V_k + ρ_k s_k s_k^T # update
    where V_k = I - ρ_k y_k s_k^T, ρ_k = 1 / (y_k^T s_k)
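The update step can be sketched in a few lines of NumPy. This is an illustrative single step on a small quadratic test problem (the matrix `A` and the unit step length are assumptions for the demo, not part of the algorithm); it verifies that the update enforces the secant condition $\mT_{k+1} \vy_k = \vs_k$:

```python
import numpy as np

# One full-BFGS step on f(x) = 0.5 x^T A x (toy problem; A is an assumption).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
T = np.eye(2)                    # T_0 = identity
g = grad(x)

p = -T @ g                       # search direction
alpha = 1.0                      # proper line search omitted for brevity
x_new = x + alpha * p
s = x_new - x
y = grad(x_new) - g

rho = 1.0 / (y @ s)
V = np.eye(2) - rho * np.outer(y, s)
T = V.T @ T @ V + rho * np.outer(s, s)   # BFGS inverse-Hessian update

# The update enforces the secant condition T_{k+1} y_k = s_k exactly:
assert np.allclose(T @ y, s)
```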
Each BFGS update is a rank-2 change to $\mT$. After $k$ iterations, $\mT_k$ is at most a rank-$2k$ correction to $\mT_0$.
If our optimization converges in 10–20 steps (which Newton-type methods usually do!), $\mT_k$ is just a rank-20 update on $\mT_0$. We don't need the dense matrix — we need a way to compute
$$\vp_k = -\mT_k \vg_k$$
using only the small set of vectors that actually changed it.
Where does the two-loop algorithm come from? Expanding the recursive update for $\mT_k$ one rank-2 factor at a time yields the following procedure.
    # Input: gradient g, history pairs (s_i, y_i), scaling γ
    # Output: r = T_k g
    q = g
    for i = k-1, k-2, ..., k-m:
        α_i = ρ_i * (s_i · q)    # save α_i for the second loop
        q = q - α_i * y_i
    r = γ * q                    # this is T_0 * q
    for i = k-m, k-m+1, ..., k-1:
        β = ρ_i * (y_i · r)
        r = r + (α_i - β) * s_i
    return r
This is exactly the algebra you just stepped through, written compactly.
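Here is a runnable sketch of the two-loop recursion, cross-checked against the explicit BFGS matrix built from the same pairs. The function name `two_loop` and the random test data are illustrative, not from any library:

```python
import numpy as np

def two_loop(g, S, Y, gamma):
    """L-BFGS two-loop recursion: returns r = T_k g.

    S, Y hold the m stored (s_i, y_i) pairs, oldest first.
    """
    rho = [1.0 / (y @ s) for s, y in zip(S, Y)]
    alpha = [0.0] * len(S)
    q = g.copy()
    for i in reversed(range(len(S))):   # i = k-1, ..., k-m
        alpha[i] = rho[i] * (S[i] @ q)
        q -= alpha[i] * Y[i]
    r = gamma * q                       # r = T_0 q with T_0 = γ I
    for i in range(len(S)):             # i = k-m, ..., k-1
        beta = rho[i] * (Y[i] @ r)
        r += (alpha[i] - beta) * S[i]
    return r

# Cross-check: apply the m rank-2 updates explicitly, starting from γI.
rng = np.random.default_rng(1)
n, m = 5, 3
S = [rng.standard_normal(n) for _ in range(m)]
Y = [s + 0.1 * rng.standard_normal(n) for s in S]   # keeps y·s > 0 here
gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])

T = gamma * np.eye(n)
for s, y in zip(S, Y):
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(y, s)
    T = V.T @ T @ V + rho * np.outer(s, s)

g = rng.standard_normal(n)
assert np.allclose(two_loop(g, S, Y, gamma), T @ g)
```

Note that the two-loop recursion never forms $\mT_k$: it touches only the $2m$ stored vectors, so both time and memory are $O(mn)$.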
$\mT_0$ shows up exactly once in the recursion: as the multiplier $\gamma$. The standard choice is
$$\gamma_k = \frac{\vs_{k-1}^T \vy_{k-1}}{\vy_{k-1}^T \vy_{k-1}}.$$
Why this formula? It's a scalar estimate of the size of $\mH^{-1}$ along the most recent search direction. From the secant condition $\mH \vs \approx \vy$, we get $\vs^T \vy \approx \vs^T \mH \vs$ and $\vy^T \vy \approx \vs^T \mH^2 \vs$, so $\gamma \approx (\vs^T \mH \vs) / (\vs^T \mH^2 \vs)$, a Rayleigh quotient that lands somewhere in the spectrum of $\mH^{-1}$.
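You can check the Rayleigh-quotient claim numerically. On a quadratic, $\vy = \mH \vs$ holds exactly, and $\gamma$ always lies between the smallest and largest eigenvalues of $\mH^{-1}$ (the diagonal test matrix below is an assumption for the demo):

```python
import numpy as np

# On a quadratic, y = H s exactly, so gamma = (s^T H s) / (s^T H^2 s).
H = np.diag([1.0, 4.0, 10.0])      # toy Hessian (assumption)
s = np.array([1.0, 1.0, 1.0])
y = H @ s

gamma = (s @ y) / (y @ y)          # = 15/117 ≈ 0.128
eigs_inv = 1.0 / np.diag(H)        # eigenvalues of H^{-1}: 1, 0.25, 0.1
assert eigs_inv.min() <= gamma <= eigs_inv.max()
```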
How much history do you need? $m = 5$ is the classic default in the original L-BFGS-B code; $m = 10$ in many other implementations. Above $m \approx 20$ you rarely see further benefit.

L-BFGS gives you most of full BFGS's speed at $O(mn)$ memory instead of $O(n^2)$. It is the default unconstrained optimizer in essentially every serious numerical package: scipy, MATLAB's fminunc, R's optim, Julia's Optim.jl, JAX's scipy.optimize. If you remember one method from this lecture, make it L-BFGS.
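A minimal usage sketch with SciPy, whose `L-BFGS-B` method exposes the history size $m$ as the `maxcor` option (the Rosenbrock test function and starting point are just a convenient example):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the 10-dimensional Rosenbrock function with L-BFGS-B.
x0 = np.full(10, 2.0)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               options={"maxcor": 10})   # maxcor = history size m

assert res.success
assert np.allclose(res.x, 1.0, atol=1e-3)   # minimizer is the all-ones vector
```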
| Method | Memory | Per-iter cost | Convergence |
|---|---|---|---|
| Gradient descent | $O(n)$ | $O(n)$ | linear |
| Conjugate gradient | $O(n)$ | $O(n)$ | linear (better constant) |
| L-BFGS | $\mathbf{O(mn)}$ | $\mathbf{O(mn)}$ | superlinear (in practice) |
| Full BFGS | $O(n^2)$ | $O(n^2)$ | superlinear |
| Newton | $O(n^2)$ | $O(n^3)$ | quadratic |