Why Large-Scale is Hard

When $n$ is a million, the linear algebra inside Newton-type methods stops being free.

What counts as "large"?

There's no formal cutoff. Operationally:

Working definition. A problem is large-scale if a vanilla quasi-Newton method runs out of memory or time. The transition lives somewhere around $n \sim 10^4$ on a laptop, $n \sim 10^5$ on a beefy workstation.

Real problems at scale

| Problem | Variables | Constraints |
|---|---|---|
| Metric-constrained clustering (Veldt & Gleich) | $1.6 \times 10^8$ | $3 \times 10^{12}$ |
| Aerospace trajectory / CFD | $10^5$–$10^7$ | varies |
| Modern LLM training (GPT-class) | $10^{12}$+ (parameters) | — |
| PDE-constrained inverse problems | $10^6$–$10^9$ (mesh DOFs) | PDE |

The Veldt/Gleich problem took days to weeks on a cluster. With 3 trillion constraints, you can't even list the constraints — you have to generate them on the fly inside the solver.

Our working assumptions

For this lecture, we assume:

Assumption 1 — Storage
We can store the iterate $\vx \in \RR^n$ in memory. (At $n = 10^9$, even this fails. We'll punt on that.)
Assumption 2 — Cheap function and gradient
Computing $f(\vx)$ and $\vg(\vx) = \nabla f(\vx)$ is cheap enough to do thousands of times. Formally: $$\text{cost of } f, \vg = o(n^2).$$

If even one function call takes hours (e.g., a CFD simulation), you're in surrogate optimization territory — a different lecture entirely.
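To make Assumption 2 concrete, here is a minimal sketch (the tridiagonal matrix and the quadratic $f$ are illustrative choices, not from the lecture): for $f(\vx) = \tfrac12 \vx^\top A \vx - b^\top \vx$ with a sparse $A$, one evaluation of $f$ or $\nabla f$ costs $O(\mathrm{nnz}(A)) = O(n)$, comfortably $o(n^2)$.

```python
import numpy as np
import scipy.sparse as sp

# Illustrative quadratic: f(x) = 1/2 x^T A x - b^T x with sparse SPD A.
# Each evaluation of f or grad costs O(nnz(A)) = O(n) here, so it is o(n^2).
n = 100_000
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csr")  # tridiagonal, SPD
b = np.ones(n)

def f(x):
    return 0.5 * x @ (A @ x) - b @ x

def grad(x):
    return A @ x - b

x0 = np.zeros(n)
print(f(x0), np.linalg.norm(grad(x0)))  # f(0) = 0, ||grad(0)|| = ||b|| = sqrt(n)
```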

Aside: little-$o$ vs big-$O$

Recall the asymptotic notations:

Big-$O$ (upper bound, possibly tight)
$g(n) = O(h(n))$ means there exist constants $C, n_0$ such that $|g(n)| \le C\,|h(n)|$ for all $n \ge n_0$.
Little-$o$ (strictly smaller)
$g(n) = o(h(n))$ means $\displaystyle \lim_{n \to \infty} \frac{g(n)}{h(n)} = 0$.

So "$f$ costs $o(n^2)$" says the cost grows strictly slower than $n^2$. Acceptable: $O(n)$, $O(n \log n)$, $O(n^{3/2})$ — even $O(n^2 / \log n)$, since $1/\log n \to 0$. Not acceptable: anything growing like $n^2$ or faster, e.g. $\Theta(n^2)$ or $\Theta(n^2 \log n)$.

Why this matters. If $f$ costs $O(n)$ (e.g., a sparse matvec), and we run $k$ iterations, the function-evaluation budget is $O(kn)$. As long as $k \ll n$, that's affordable. Newton, by contrast, needs $\Omega(n^2)$ memory per step regardless of how cheap $f$ is.
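The limit definition is easy to sanity-check numerically (a small illustration, not part of the lecture): evaluate $g(n)/h(n)$ for growing $n$ and watch it shrink.

```python
import math

# Numeric check of the little-o definition: g = o(h) iff g(n)/h(n) -> 0.
def ratio(g, h, n):
    return g(n) / h(n)

nlogn = lambda n: n * math.log(n)
nsq = lambda n: n ** 2

for n in (10**3, 10**6, 10**9):
    print(f"n = {n:>10}   (n log n)/n^2 = {ratio(nlogn, nsq, n):.2e}")
# The ratio tends to 0, so n log n = o(n^2). For g(n) = n^2 the ratio
# would sit at 1 forever, so Theta(n^2) is not o(n^2).
```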

What does Newton actually need?

Newton's method, stripped down:

x_0 given
while not done:
    Solve H_k p_k = -g_k        # ← the bottleneck
    Line search to find α_k
    x_{k+1} = x_k + α_k p_k

The line search is fine — it's just inner products and a few function calls. The trouble is the linear solve.

For a dense symmetric $\mH \in \RR^{n \times n}$, Cholesky factorization costs $\sim n^3/3$ flops (about $n^3/6$ multiply–adds), and the matrix itself takes $n(n+1)/2$ doubles.
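The loop above can be sketched in a few lines of NumPy. This is a minimal illustration on a hypothetical convex quadratic (for which a full Newton step is exact, so no line search is needed); the comments flag where the $O(n^2)$ storage and $O(n^3)$ solve live.

```python
import numpy as np

# Minimal dense Newton sketch. The line search is replaced by a unit step,
# which is exact for a convex quadratic; real code would line-search.
def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess(x)                   # O(n^2) storage
        p = np.linalg.solve(H, -g)    # O(n^3) time: the bottleneck
        x = x + p
    return x

# Hypothetical test problem: f(x) = 1/2 x^T A x - b^T x, A symmetric PD.
n = 50
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

x_star = newton(grad=lambda x: A @ x - b, hess=lambda x: A, x0=np.zeros(n))
print(np.linalg.norm(A @ x_star - b))  # essentially zero: one step solves it
```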

Live Quiz
Can we run Newton's method on $f : \RR^{100{,}000} \to \RR$?

- Yes — nothing fundamentally breaks.
- No — the linear algebra is too big.
- Maybe — depends on the structure.
Newton at scale: memory & time

It's worth working out what Newton's method actually demands as $n$ grows. Hessian storage assumes a dense symmetric matrix in double precision; time assumes a textbook Cholesky at $\sim 10^{10}$ flops/sec (a very fast laptop).

Three escape routes

If the problem is huge and dense Newton is hopeless, you have three options:

| Route | Idea | Covered in |
|---|---|---|
| 1. Use a simpler method | Drop the Hessian: gradient descent, conjugate gradient, SGD. | Next lecture |
| 2. Use scalable linear algebra | Exploit sparsity, banding, or low-rank structure in $\mH$. | Part 2 |
| 3. Change the method | Build a low-memory approximation to $\mH^{-1}$ that never gets formed explicitly. | Part 3 — L-BFGS |
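As a tiny preview of route 2 (a sketch under assumed structure, not the full treatment of Part 2): when $\mH$ is sparse, the Newton system $\mH \vp = -\vg$ can be solved iteratively with conjugate gradients, touching only matvecs and never forming a dense factorization.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Hypothetical sparse SPD model Hessian (tridiagonal) at n = 100,000.
n = 100_000
H = sp.diags([-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
             [-1, 0, 1], format="csr")
g = np.ones(n)

# Conjugate gradients: each iteration is one sparse matvec, O(nnz) = O(n).
p, info = cg(H, -g)                      # info == 0 means converged
print(info, np.linalg.norm(H @ p + g))   # small residual ||H p + g||
```

A dense factorization at this $n$ would need tens of gigabytes; CG here needs a handful of length-$n$ vectors.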