Limited-Memory BFGS

Most of the speed of full BFGS, none of the $n \times n$ memory. The trick: store the last few $(\vs, \vy)$ pairs and apply $\mT_k$ via a two-loop recursion.

BFGS recap

BFGS maintains an approximation $\mT_k$ to the inverse Hessian. Each step:

x_0, T_0 given     # T_0 = scaled identity; g_k denotes ∇f(x_k)
while not done:
    p_k = -T_k g_k                      # search direction
    α_k = line search along p_k
    x_{k+1} = x_k + α_k p_k
    s_k = x_{k+1} - x_k
    y_k = g_{k+1} - g_k
    T_{k+1} = V_k^T T_k V_k + ρ_k s_k s_k^T   # update
        where V_k = I - ρ_k y_k s_k^T,  ρ_k = 1 / (y_k^T s_k)

The problem. $\mT_k$ is dense $n \times n$. At $n = 10^6$, that's $\frac{1}{2}(10^6)^2 \cdot 8 \;\text{bytes} = 4 \;\text{TB}$. We can't store it. We can't even name it.
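For concreteness, here is that dense update as a few lines of numpy. This is a minimal sketch (the function name and test vectors are mine, not from the lecture); the point is that both storing $\mT$ and applying one update cost $O(n^2)$.

import numpy as np

def bfgs_update(T, s, y):
    # T_{k+1} = V^T T V + rho * s s^T,  with V = I - rho * y s^T,  rho = 1/(y^T s)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ T @ V + rho * np.outer(s, s)

n = 4
T = np.eye(n)                          # T_0: identity (scaled in practice)
s = np.array([0.5, -0.2, 0.1, 0.0])    # step s_k = x_{k+1} - x_k
y = np.array([1.0, -0.3, 0.4, 0.1])    # gradient change y_k = g_{k+1} - g_k
T = bfgs_update(T, s, y)               # a dense n-by-n array: the thing we cannot afford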

The key insight

Each BFGS update is a rank-2 change to $\mT$. After $k$ iterations, $\mT_k$ is at most a rank-$2k$ correction to $\mT_0$.

If our optimization converges in 10–20 steps (which Newton-type methods usually do!), $\mT_k$ is at most a rank-20 to rank-40 update on $\mT_0$. We don't need the dense matrix — we need a way to compute

$$\vp_k = -\mT_k \vg_k$$

using only the small set of vectors that actually changed it.

L-BFGS in one line: store the last $m$ pairs $(\vs_i, \vy_i)$ and apply $\mT_k \vv$ via a recursion. Memory is $\mathbf{2mn}$ doubles. With $m = 5$ and $n = 10^6$, that's 80 MB instead of 4 TB.
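In code, the entire "matrix" is nothing more than a bounded buffer of vector pairs. A minimal sketch (names mine), including the memory arithmetic:

from collections import deque

m, n = 5, 10**6
history = deque(maxlen=m)        # keeps the last m pairs (s_i, y_i); oldest is dropped
# after each accepted step:  history.append((s_k, y_k))

bytes_stored = 2 * m * n * 8     # 2 vectors per pair, 8 bytes per double
print(bytes_stored / 1e6)        # 80.0 (MB), versus ~4 TB for the dense matrix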

Deriving the recursion

Where does the two-loop algorithm come from? Start from the BFGS update and unroll it over the stored history.

The BFGS update, in one line
$$\mT_{k+1} = \mV_k^T \mT_k \mV_k + \rho_k \vs_k \vs_k^T, \qquad \mV_k = \mI - \rho_k \vy_k \vs_k^T, \quad \rho_k = \frac{1}{\vy_k^T \vs_k}.$$
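Applying this update repeatedly, and keeping only the last $m$ pairs (everything older is folded into the choice of $\mT_0$), gives the standard unrolled form:

$$\begin{aligned}
\mT_k ={}& \left(\mV_{k-1}^T \cdots \mV_{k-m}^T\right) \mT_0 \left(\mV_{k-m} \cdots \mV_{k-1}\right) \\
&+ \rho_{k-m} \left(\mV_{k-1}^T \cdots \mV_{k-m+1}^T\right) \vs_{k-m} \vs_{k-m}^T \left(\mV_{k-m+1} \cdots \mV_{k-1}\right) \\
&+ \cdots + \rho_{k-1} \vs_{k-1} \vs_{k-1}^T.
\end{aligned}$$

Multiplying this expression by a vector from the right is exactly what the two-loop recursion does: the backward loop accumulates the inner products coming from the right-hand $\mV$ factors, and the forward loop applies the transposed factors on the left.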

The two-loop recursion in pseudocode

# Input: gradient g, history pairs (s_i, y_i) with ρ_i = 1/(y_i^T s_i), scaling γ
# Output: r = T_k g

q = g
for i = k-1, k-2, ..., k-m:
    α_i = ρ_i * (s_i · q)        # save α_i for the second loop
    q   = q - α_i * y_i

r = γ * q                         # this is T_0 * q

for i = k-m, k-m+1, ..., k-1:
    β   = ρ_i * (y_i · r)
    r   = r + (α_i - β) * s_i

return r

This is exactly the unrolled update above, written compactly.
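Here is a runnable numpy version of the same recursion, as a minimal sketch (function and variable names are mine). The check at the end builds $\mT_k$ densely from the same three pairs and confirms that both routes agree on a tiny $n = 3$ example.

import numpy as np

def two_loop(g, history, gamma):
    # history: list of (s_i, y_i) pairs, oldest first; returns r = T_k g
    q = g.copy()
    alphas = []
    for s, y in reversed(history):                       # backward sweep, newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = gamma * q                                        # r = T_0 q with T_0 = gamma * I
    for (s, y), a in zip(history, reversed(alphas)):     # forward sweep, oldest pair first
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r

# check against the dense BFGS recursion with n = 3 and m = 3 stored pairs
rng = np.random.default_rng(0)
n, m = 3, 3
history, T = [], np.eye(n)                 # dense T_0 = I, i.e. gamma = 1
for _ in range(m):
    s = rng.standard_normal(n)
    y = s + 0.1 * rng.standard_normal(n)   # keeps y^T s > 0
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(y, s)
    T = V.T @ T @ V + rho * np.outer(s, s)
    history.append((s, y))
g = rng.standard_normal(n)
print(np.allclose(two_loop(g, history, gamma=1.0), T @ g))   # True

When the history contains every update since $\mT_0$ and $\gamma$ matches the scale of $\mT_0$, the two routes are algebraically identical; L-BFGS simply truncates the history to the last $m$ pairs and refreshes $\gamma$ each iteration.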

Demo 1 — the two-loop recursion, step by step

A tiny worked example with $n = 3$ and $m = 3$ stored pairs. Watch $\vq$ get whittled down in the backward sweep, then $\vr$ rebuild back to $\mT_k \vv$ in the forward sweep.

Choosing $\mT_0$

$\mT_0$ shows up exactly once in the recursion: as the multiplier $\gamma$. The standard choice is

$$\gamma_k = \frac{\vs_{k-1}^T \vy_{k-1}}{\vy_{k-1}^T \vy_{k-1}}.$$

Why this formula? It's a scalar approximation to $\mH^{-1}$ along the most recent search direction. From the secant condition $\mH \vs \approx \vy$, we get $\vs^T \vy \approx \vs^T \mH \vs$ and $\vy^T \vy \approx \vs^T \mH^2 \vs$, so $\gamma \approx (\vs^T \mH \vs) / (\vs^T \mH^2 \vs)$ — a Rayleigh quotient that lands somewhere in the spectrum of $\mH^{-1}$.

Why this matters so much. The very first step, $k = 0$, has no history. We use $\vp_0 = -\gamma_0 \vg_0$ — just gradient descent with step size $\gamma_0$. If $\gamma_0$ is wildly wrong, the first step is catastrophic and the line search wastes evaluations recovering. L-BFGS is more sensitive to initial scaling than full BFGS.
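In code the scaling is essentially one line; a minimal sketch (names mine), with the $k = 0$ fallback made explicit:

def initial_scaling(history, fallback=1.0):
    # gamma_k = s^T y / y^T y from the most recent pair;
    # at k = 0 there is no pair yet, so gamma_0 is whatever guess you supply
    if not history:
        return fallback
    s, y = history[-1]
    return float(s @ y) / float(y @ y)

# first step, no history:  p_0 = -initial_scaling([]) * g_0   (scaled gradient descent)
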
Demo 2 — sensitivity to $\mT_0$ (initial scaling)

Same problem, multiple runs. The auto choice uses $\gamma_k = \vs^T\vy / \vy^T\vy$ at every step. The fixed runs freeze $\gamma$ at whatever you set it to. Watch the spread: auto recovers quickly; badly-chosen fixed values stall.

Panels: convergence ($\|\vg_k\|$) on the left, and the auto $\gamma_k$ value over iterations on the right.

The auto-$\gamma$ curve (right) shows how the scaling self-corrects: it starts wherever you chose, then settles to a value that reflects the local curvature of $f$. If you freeze it at a bad value, L-BFGS never recovers.

Demo 3 — how much memory do you need? ($m$ sweep)

Same higher-dimensional problem, varying the number of stored pairs $m$. With $m = 0$ this is gradient descent. With $m \to \infty$, this is full BFGS.

The takeaway: $m = 5$ to $m = 10$ is usually enough. Returns diminish fast. In practice, $m = 5$ is the default in many implementations.
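If you want to run this sweep yourself, SciPy's L-BFGS-B exposes $m$ as the maxcor option. A sketch on a 100-dimensional Rosenbrock function, used here only as a stand-in for the demo's problem:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(100, -1.0)                     # 100-dimensional Rosenbrock valley
for m in (1, 2, 5, 10, 20):
    res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                   options={"maxcor": m, "maxiter": 2000})
    print(f"m = {m:2d}: {res.nit:4d} iterations, f = {res.fun:.2e}")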

Demo 4 — head-to-head: GD vs L-BFGS vs full BFGS

All three from the same starting point. Watch GD zigzag in the narrow valley, L-BFGS pick up curvature after a few steps, and full BFGS converge with about the same trajectory as L-BFGS.

Panels: iterates on the contour plot, and convergence ($\|\vg_k\|$).
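A quick way to reproduce the head-to-head with SciPy plus a hand-rolled gradient-descent loop; the fixed step size and iteration cap are arbitrary illustrative choices, and the 2-D Rosenbrock function stands in for the demo's valley:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])                  # classic Rosenbrock start

# gradient descent with a small fixed step; expect it to grind (or hit the cap)
x, gd_iters = x0.copy(), 0
while gd_iters < 50_000 and np.linalg.norm(rosen_der(x)) > 1e-5:
    x -= 1e-3 * rosen_der(x)
    gd_iters += 1

lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

print("GD:     ", gd_iters, "iterations, |g| =", np.linalg.norm(rosen_der(x)))
print("L-BFGS: ", lbfgs.nit, "iterations, f =", lbfgs.fun)
print("BFGS:   ", bfgs.nit, "iterations, f =", bfgs.fun)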


Wrap-up

L-BFGS gives you most of full BFGS's speed at $O(mn)$ memory instead of $O(n^2)$. It is the go-to unconstrained optimizer in essentially every serious numerical package — scipy, MATLAB's fminunc, R's optim, Julia's Optim.jl, JAX's scipy.optimize. If you remember one method from this lecture, make it L-BFGS.

Method             | Memory           | Per-iter cost     | Convergence
Gradient descent   | $O(n)$           | $O(n)$            | linear
Conjugate gradient | $O(n)$           | $O(n)$            | linear (better constant)
L-BFGS             | $\mathbf{O(mn)}$ | $\mathbf{O(mn)}$  | superlinear (in practice)
Full BFGS          | $O(n^2)$         | $O(n^2)$          | superlinear
Newton             | $O(n^2)$         | $O(n^3)$          | quadratic