Limited-Memory BFGS

Most of the speed of full BFGS, none of the $n \times n$ memory. The trick: store the last few $(\vs, \vy)$ pairs and apply $\mT_k$ via a two-loop recursion.

BFGS recap

BFGS maintains an approximation $\mT_k$ to the inverse Hessian. Each step:

x_0, T_0 given     # T_0 = scaled identity; g_k denotes ∇f(x_k)
while not done:
    p_k = -T_k g_k                      # search direction
    α_k = line search along p_k
    x_{k+1} = x_k + α_k p_k
    s_k = x_{k+1} - x_k
    y_k = g_{k+1} - g_k
    T_{k+1} = V_k^T T_k V_k + ρ_k s_k s_k^T   # update
        where V_k = I - ρ_k y_k s_k^T,  ρ_k = 1 / (y_k^T s_k)

The problem. $\mT_k$ is dense $n \times n$. At $n = 10^6$, that's $\frac{1}{2}(10^6)^2 \cdot 8 \;\text{bytes} = 4 \;\text{TB}$. We can't store it. We can't even name it.
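For concreteness, here is that dense update as a few lines of numpy. This is a minimal sketch (the function name and test vectors are mine, not from the lecture); the point is that both storing $\mT$ and applying one update cost $O(n^2)$.

import numpy as np

def bfgs_update(T, s, y):
    # T_{k+1} = V^T T V + rho * s s^T,  with V = I - rho * y s^T,  rho = 1/(y^T s)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ T @ V + rho * np.outer(s, s)

n = 4
T = np.eye(n)                          # T_0: identity (scaled in practice)
s = np.array([0.5, -0.2, 0.1, 0.0])    # step s_k = x_{k+1} - x_k
y = np.array([1.0, -0.3, 0.4, 0.1])    # gradient change y_k = g_{k+1} - g_k
T = bfgs_update(T, s, y)               # a dense n-by-n array: the thing we cannot afford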

The key insight

Each BFGS update is a rank-2 change to $\mT$. After $k$ iterations, $\mT_k$ is at most a rank-$2k$ correction to $\mT_0$.

If our optimization converges in 10–20 steps (which Newton-type methods usually do!), $\mT_k$ is at most a rank-20 to rank-40 update on $\mT_0$. We don't need the dense matrix — we need a way to compute

$$\vp_k = -\mT_k \vg_k$$

using only the small set of vectors that actually changed it.

L-BFGS in one line: store the last $m$ pairs $(\vs_i, \vy_i)$ and apply $\mT_k \vv$ via a recursion. Memory is $\mathbf{2mn}$ doubles. With $m = 5$ and $n = 10^6$, that's 80 MB instead of 4 TB.
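In code, the entire "matrix" is nothing more than a bounded buffer of vector pairs. A minimal sketch (names mine), including the memory arithmetic:

from collections import deque

m, n = 5, 10**6
history = deque(maxlen=m)        # keeps the last m pairs (s_i, y_i); oldest is dropped
# after each accepted step:  history.append((s_k, y_k))

bytes_stored = 2 * m * n * 8     # 2 vectors per pair, 8 bytes per double
print(bytes_stored / 1e6)        # 80.0 (MB), versus ~4 TB for the dense matrix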

Deriving the recursion

Where does the two-loop algorithm come from? Start from the BFGS update and unroll it over the stored history.

The BFGS update, in one line
$$\mT_{k+1} = \mV_k^T \mT_k \mV_k + \rho_k \vs_k \vs_k^T, \qquad \mV_k = \mI - \rho_k \vy_k \vs_k^T, \quad \rho_k = \frac{1}{\vy_k^T \vs_k}.$$
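Applying this update repeatedly, and keeping only the last $m$ pairs (everything older is folded into the choice of $\mT_0$), gives the standard unrolled form:

$$\begin{aligned}
\mT_k ={}& \left(\mV_{k-1}^T \cdots \mV_{k-m}^T\right) \mT_0 \left(\mV_{k-m} \cdots \mV_{k-1}\right) \\
&+ \rho_{k-m} \left(\mV_{k-1}^T \cdots \mV_{k-m+1}^T\right) \vs_{k-m} \vs_{k-m}^T \left(\mV_{k-m+1} \cdots \mV_{k-1}\right) \\
&+ \cdots + \rho_{k-1} \vs_{k-1} \vs_{k-1}^T.
\end{aligned}$$

Multiplying this expression by a vector from the right is exactly what the two-loop recursion does: the backward loop accumulates the inner products coming from the right-hand $\mV$ factors, and the forward loop applies the transposed factors on the left.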

The two-loop recursion in pseudocode

# Input: gradient g, history pairs (s_i, y_i) with ρ_i = 1/(y_i^T s_i), scaling γ
# Output: r = T_k g

q = g
for i = k-1, k-2, ..., k-m:
    α_i = ρ_i * (s_i · q)        # save α_i for the second loop
    q   = q - α_i * y_i

r = γ * q                         # this is T_0 * q

for i = k-m, k-m+1, ..., k-1:
    β   = ρ_i * (y_i · r)
    r   = r + (α_i - β) * s_i

return r

This is exactly the unrolled update above, written compactly.
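Here is a runnable numpy version of the same recursion, as a minimal sketch (function and variable names are mine). The check at the end builds $\mT_k$ densely from the same three pairs and confirms that both routes agree on a tiny $n = 3$ example.

import numpy as np

def two_loop(g, history, gamma):
    # history: list of (s_i, y_i) pairs, oldest first; returns r = T_k g
    q = g.copy()
    alphas = []
    for s, y in reversed(history):                       # backward sweep, newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = gamma * q                                        # r = T_0 q with T_0 = gamma * I
    for (s, y), a in zip(history, reversed(alphas)):     # forward sweep, oldest pair first
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r

# check against the dense BFGS recursion with n = 3 and m = 3 stored pairs
rng = np.random.default_rng(0)
n, m = 3, 3
history, T = [], np.eye(n)                 # dense T_0 = I, i.e. gamma = 1
for _ in range(m):
    s = rng.standard_normal(n)
    y = s + 0.1 * rng.standard_normal(n)   # keeps y^T s > 0
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(y, s)
    T = V.T @ T @ V + rho * np.outer(s, s)
    history.append((s, y))
g = rng.standard_normal(n)
print(np.allclose(two_loop(g, history, gamma=1.0), T @ g))   # True

When the history contains every update since $\mT_0$ and $\gamma$ matches the scale of $\mT_0$, the two routes are algebraically identical; L-BFGS simply truncates the history to the last $m$ pairs and refreshes $\gamma$ each iteration.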

Demo 1 — the two-loop recursion, step by step

A tiny worked example with $n = 3$ and $m = 3$ stored pairs. Watch $\vq$ get whittled down in the backward sweep, then $\vr$ rebuild back to $\mT_k \vv$ in the forward sweep.

Choosing $\mT_0$

$\mT_0$ shows up exactly once in the recursion: as the multiplier $\gamma$. The standard choice is

$$\gamma_k = \frac{\vs_{k-1}^T \vy_{k-1}}{\vy_{k-1}^T \vy_{k-1}}.$$

Why this formula? It's a scalar approximation to $\mH^{-1}$ along the most recent search direction. From the secant condition $\mH \vs \approx \vy$, we get $\vs^T \vy \approx \vs^T \mH \vs$ and $\vy^T \vy \approx \vs^T \mH^2 \vs$, so $\gamma \approx (\vs^T \mH \vs) / (\vs^T \mH^2 \vs)$ — a Rayleigh quotient that lands somewhere in the spectrum of $\mH^{-1}$.

Why this matters so much. The very first step, $k = 0$, has no history. We use $\vp_0 = -\gamma_0 \vg_0$ — just gradient descent with step size $\gamma_0$. If $\gamma_0$ is wildly wrong, the first step is catastrophic and the line search wastes evaluations recovering. L-BFGS is more sensitive to initial scaling than full BFGS.
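In code the scaling is essentially one line; a minimal sketch (names mine), with the $k = 0$ fallback made explicit:

def initial_scaling(history, fallback=1.0):
    # gamma_k = s^T y / y^T y from the most recent pair;
    # at k = 0 there is no pair yet, so gamma_0 is whatever guess you supply
    if not history:
        return fallback
    s, y = history[-1]
    return float(s @ y) / float(y @ y)

# first step, no history:  p_0 = -initial_scaling([]) * g_0   (scaled gradient descent)
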
Demo 2 — sensitivity to $\mT_0$ (initial scaling)

Same problem, multiple runs. The auto choice uses $\gamma_k = \vs^T\vy / \vy^T\vy$ at every step. The fixed runs freeze $\gamma$ at whatever you set it to. Watch the spread: auto recovers quickly; badly-chosen fixed values stall.

Panels: convergence ($\|\vg_k\|$) on the left, and the auto $\gamma_k$ value over iterations on the right.

The auto-$\gamma$ curve (right) shows how the scaling self-corrects: it starts wherever you chose, then settles to a value that reflects the local curvature of $f$. If you freeze it at a bad value, L-BFGS never recovers.

Demo 3 — how much memory do you need? ($m$ sweep)

Same higher-dimensional problem, varying the number of stored pairs $m$. With $m = 0$ this is gradient descent. With $m \to \infty$, this is full BFGS.

The takeaway: $m = 5$ to $m = 10$ is usually enough. Returns diminish fast. In practice, $m = 5$ is the default in many implementations.
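If you want to run this sweep yourself, SciPy's L-BFGS-B exposes $m$ as the maxcor option. A sketch on a 100-dimensional Rosenbrock function, used here only as a stand-in for the demo's problem:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(100, -1.0)                     # 100-dimensional Rosenbrock valley
for m in (1, 2, 5, 10, 20):
    res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                   options={"maxcor": m, "maxiter": 2000})
    print(f"m = {m:2d}: {res.nit:4d} iterations, f = {res.fun:.2e}")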

Demo 4 — head-to-head: GD vs L-BFGS vs full BFGS

All three from the same starting point. Watch GD zigzag in the narrow valley, L-BFGS pick up curvature after a few steps, and full BFGS converge with about the same trajectory as L-BFGS.

Panels: iterates on the contour plot, and convergence ($\|\vg_k\|$).
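A quick way to reproduce the head-to-head with SciPy plus a hand-rolled gradient-descent loop; the fixed step size and iteration cap are arbitrary illustrative choices, and the 2-D Rosenbrock function stands in for the demo's valley:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])                  # classic Rosenbrock start

# gradient descent with a small fixed step; expect it to grind (or hit the cap)
x, gd_iters = x0.copy(), 0
while gd_iters < 50_000 and np.linalg.norm(rosen_der(x)) > 1e-5:
    x -= 1e-3 * rosen_der(x)
    gd_iters += 1

lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

print("GD:     ", gd_iters, "iterations, |g| =", np.linalg.norm(rosen_der(x)))
print("L-BFGS: ", lbfgs.nit, "iterations, f =", lbfgs.fun)
print("BFGS:   ", bfgs.nit, "iterations, f =", bfgs.fun)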


Wrap-up

L-BFGS gives you most of full BFGS's speed at $O(mn)$ memory instead of $O(n^2)$. It is the go-to unconstrained optimizer in essentially every serious numerical package — scipy, MATLAB's fminunc, R's optim, Julia's Optim.jl, JAX's scipy.optimize. If you remember one method from this lecture, make it L-BFGS.

Method             | Memory           | Per-iter cost     | Convergence
Gradient descent   | $O(n)$           | $O(n)$            | linear
Conjugate gradient | $O(n)$           | $O(n)$            | linear (better constant)
L-BFGS             | $\mathbf{O(mn)}$ | $\mathbf{O(mn)}$  | superlinear (in practice)
Full BFGS          | $O(n^2)$         | $O(n^2)$          | superlinear
Newton             | $O(n^2)$         | $O(n^3)$          | quadratic