Why Large-Scale is Hard

When $n$ is a million, the linear algebra inside Newton-type methods stops being free.

What counts as "large"?

There's no formal cutoff. Operationally:

Working definition. A problem is large-scale if a vanilla quasi-Newton method runs out of memory or time. The transition lives somewhere around $n \sim 10^4$ on a laptop, $n \sim 10^5$ on a beefy workstation.

Real problems at scale

| Problem | Variables | Constraints |
|---|---|---|
| Metric-constrained clustering (Veldt & Gleich) | $1.6 \times 10^8$ | $3 \times 10^{12}$ |
| Aerospace trajectory / CFD | $10^5$–$10^7$ | varies |
| Modern LLM training (GPT-class) | $10^{12}$+ (parameters) | — |
| PDE-constrained inverse problems | $10^6$–$10^9$ (mesh DOFs) | PDE |

The Veldt/Gleich problem took days to weeks on a cluster. With 3 trillion constraints, you can't even list the constraints — you have to generate them on the fly inside the solver.

Our working assumptions

For this lecture, we assume:

Assumption 1 — Storage
We can store the iterate $\vx \in \RR^n$ in memory. (At $n = 10^9$, even this fails. We'll punt on that.)
Assumption 2 — Cheap function and gradient
Computing $f(\vx)$ and $\vg(\vx) = \nabla f(\vx)$ is cheap enough to do thousands of times. Formally: $$\text{cost of } f, \vg = o(n^2).$$

If even one function call takes hours (e.g., a CFD simulation), you're in surrogate optimization territory — a different lecture entirely.
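To make Assumption 2 concrete, here is a minimal sketch (the tridiagonal matrix and the quadratic $f$ are illustrative choices, not from the lecture): for $f(\vx) = \tfrac12 \vx^\top A \vx - b^\top \vx$ with a sparse $A$, one evaluation of $f$ or $\nabla f$ costs $O(\mathrm{nnz}(A)) = O(n)$, comfortably $o(n^2)$.

```python
import numpy as np
import scipy.sparse as sp

# Illustrative quadratic: f(x) = 1/2 x^T A x - b^T x with sparse SPD A.
# Each evaluation of f or grad costs O(nnz(A)) = O(n) here, so it is o(n^2).
n = 100_000
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csr")  # tridiagonal, SPD
b = np.ones(n)

def f(x):
    return 0.5 * x @ (A @ x) - b @ x

def grad(x):
    return A @ x - b

x0 = np.zeros(n)
print(f(x0), np.linalg.norm(grad(x0)))  # f(0) = 0, ||grad(0)|| = ||b|| = sqrt(n)
```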

Aside: little-$o$ vs big-$O$

Recall the asymptotic notations:

Big-$O$ (upper bound, possibly tight)
$g(n) = O(h(n))$ means there exist constants $C, n_0$ such that $|g(n)| \le C\,|h(n)|$ for all $n \ge n_0$.
Little-$o$ (strictly smaller)
$g(n) = o(h(n))$ means $\displaystyle \lim_{n \to \infty} \frac{g(n)}{h(n)} = 0$.

So "$f$ costs $o(n^2)$" says the cost grows strictly slower than $n^2$. Acceptable: $O(n)$, $O(n \log n)$, $O(n^{3/2})$ — even $O(n^2 / \log n)$, since $1/\log n \to 0$. Not acceptable: anything growing like $n^2$ or faster, e.g. $\Theta(n^2)$ or $\Theta(n^2 \log n)$.

Why this matters. If $f$ costs $O(n)$ (e.g., a sparse matvec), and we run $k$ iterations, the function-evaluation budget is $O(kn)$. As long as $k \ll n$, that's affordable. Newton, by contrast, needs $\Omega(n^2)$ memory per step regardless of how cheap $f$ is.
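The limit definition is easy to sanity-check numerically (a small illustration, not part of the lecture): evaluate $g(n)/h(n)$ for growing $n$ and watch it shrink.

```python
import math

# Numeric check of the little-o definition: g = o(h) iff g(n)/h(n) -> 0.
def ratio(g, h, n):
    return g(n) / h(n)

nlogn = lambda n: n * math.log(n)
nsq = lambda n: n ** 2

for n in (10**3, 10**6, 10**9):
    print(f"n = {n:>10}   (n log n)/n^2 = {ratio(nlogn, nsq, n):.2e}")
# The ratio tends to 0, so n log n = o(n^2). For g(n) = n^2 the ratio
# would sit at 1 forever, so Theta(n^2) is not o(n^2).
```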

What does Newton actually need?

Newton's method, stripped down:

x_0 given
while not done:
    Solve H_k p_k = -g_k        # ← the bottleneck
    Line search to find α_k
    x_{k+1} = x_k + α_k p_k

The line search is fine — it's just inner products and a few function calls. The trouble is the linear solve.

For a dense symmetric $\mH \in \RR^{n \times n}$, Cholesky factorization costs $\sim n^3/3$ flops (about $n^3/6$ multiply–adds), and the matrix itself takes $n(n+1)/2$ doubles.
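The loop above can be sketched in a few lines of NumPy. This is a minimal illustration on a hypothetical convex quadratic (for which a full Newton step is exact, so no line search is needed); the comments flag where the $O(n^2)$ storage and $O(n^3)$ solve live.

```python
import numpy as np

# Minimal dense Newton sketch. The line search is replaced by a unit step,
# which is exact for a convex quadratic; real code would line-search.
def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess(x)                   # O(n^2) storage
        p = np.linalg.solve(H, -g)    # O(n^3) time: the bottleneck
        x = x + p
    return x

# Hypothetical test problem: f(x) = 1/2 x^T A x - b^T x, A symmetric PD.
n = 50
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

x_star = newton(grad=lambda x: A @ x - b, hess=lambda x: A, x0=np.zeros(n))
print(np.linalg.norm(A @ x_star - b))  # essentially zero: one step solves it
```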

Live Quiz
Can we run Newton's method on $f : \RR^{100{,}000} \to \RR$?

- Yes — nothing fundamentally breaks.
- No — the linear algebra is too big.
- Maybe — depends on the structure.
Newton at scale: memory & time

It's worth working out what Newton's method actually demands as $n$ grows. Hessian storage assumes a dense symmetric matrix in double precision; time assumes a textbook Cholesky at $\sim 10^{10}$ flops/sec (a very fast laptop).

Three escape routes

If the problem is huge and dense Newton is hopeless, you have three options:

| Route | Idea | Covered in |
|---|---|---|
| 1. Use a simpler method | Drop the Hessian: gradient descent, conjugate gradient, SGD. | Next lecture |
| 2. Use scalable linear algebra | Exploit sparsity, banding, or low-rank structure in $\mH$. | Part 2 |
| 3. Change the method | Build a low-memory approximation to $\mH^{-1}$ that never gets formed explicitly. | Part 3 — L-BFGS |
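As a tiny preview of route 2 (a sketch under assumed structure, not the full treatment of Part 2): when $\mH$ is sparse, the Newton system $\mH \vp = -\vg$ can be solved iteratively with conjugate gradients, touching only matvecs and never forming a dense factorization.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Hypothetical sparse SPD model Hessian (tridiagonal) at n = 100,000.
n = 100_000
H = sp.diags([-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
             [-1, 0, 1], format="csr")
g = np.ones(n)

# Conjugate gradients: each iteration is one sparse matvec, O(nnz) = O(n).
p, info = cg(H, -g)                      # info == 0 means converged
print(info, np.linalg.norm(H @ p + g))   # small residual ||H p + g||
```

A dense factorization at this $n$ would need tens of gigabytes; CG here needs a handful of length-$n$ vectors.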