❯ Here we have some lectures I've been working out for my computational opt class. I want to do lecture 27 notes. There are notes and some readings in that directory about how I want things done. See the plan.txt file in that directory. Also look at the structure from other lectures. These are "brilliant"-like lectures with multiple sections and demos.

Section 1: Challenges in Large-Scale Optimization

min $f(x)$ where $x \in R^{100,000}$

Nate Veldt and I studied a problem with 2 trillion constraints and 161 million variables. It took days to weeks to solve. (This was metric-constrained optimization.) Modern LLMs have _trillions_ of parameters.

--> Add other examples as relevant. (Aerospace trajectories? CFD optimization?)

---

Why is large-scale optimization hard?

- $f$ may take a long time to evaluate --> in which case you want surrogate optimization, which isn't what we are talking about here.
- We are going to assume we can store $x$ easily.
- We are going to assume that $f(x)$ and $g(x)$ (the gradient) can be computed reasonably efficiently for 10k+ evaluations. Formally: computing $f$ and $g$ is $o(n^2)$. Something like $O(n \log n)$ is okay. Something like $O(n^{3/2})$ is okay.

Add a little aside on little-o notation vs. big-O. (A draft is at the end of these notes.)

---

So can we do Newton with $f : R^{100,000} \to R$?

This is a quiz! Can we do Newton with this problem? Yes/No/Maybe

- Yes -- well, what if the Hessian is dense because all variables are coupled? How will you store the Hessian? Is it easy to solve systems with a dense 100k-by-100k matrix?
- No -- well, what if $H$ is structured? Maybe it's sparse, like from a quadratic program!
- Maybe -- that's not an answer -- try and learn something!

---

Newton needs $O(n^2)$ space for the Hessian and $O(n^3)$ time in general to solve. For $n = 10^5$ that gives $10^{15}$ work and needs 80 GB of memory. That's not entirely infeasible. GPUs have 100GB+ memory now! But if $n = 10^6$, we get... (back-of-envelope sketch at the end of these notes).

---

But what about structured Hessians?

---

Section 2: Structured Hessians

Section 2 on structured Hessians is just a "show-and-tell."

Here are some examples of where Hessians are structured:

- log-barrier terms on LPs: show the diagonal Hessian structure with something like a spy plot (demo sketch at the end of these notes).
- anything with a quadratic plus a separable term like $\sum_i x_i^p$
- banded diagonals

Give some examples of where these arise.

- Please insert more here and explain where they arise.

---

Section 3: Quasi-Newton Methods

The next topic is low-memory / low-rank quasi-Newton methods. We start by studying the BFGS update for the inverse Hessian. We only need to compute $-T_k g$ (where $T_k$ is the inverse-Hessian approximation) to get the quasi-Newton search direction! Most of this is worked out in the tex document. The goal is to explain this and put some demos behind it. Can you suggest a few?

Demos: Need to show the sensitivity to the initial scaling. (One candidate sketch is at the end of these notes.)
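---

For the little-o vs. big-O aside, one standard way to state it (my draft phrasing, not taken from the readings):

```latex
% Standard asymptotic definitions, as n -> infinity:
f(n) = O(g(n)) \iff \exists\, C > 0,\ N \ \text{s.t.}\ f(n) \le C\, g(n) \ \text{for all } n \ge N
% big-O allows cost proportional to g itself: 3n^2 = O(n^2).
f(n) = o(g(n)) \iff \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0
% little-o rules that out: n \log n = o(n^2) and n^{3/2} = o(n^2),
% but 3n^2 \ne o(n^2).  So "$o(n^2)$" means strictly subquadratic.
```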
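---

For the dense-Newton quiz, a quick back-of-envelope sketch (assumes float64 storage and counts a dense solve as roughly $n^3$ flops; the constants don't matter here):

```python
# Back-of-envelope for dense Newton: memory for an n-by-n float64 Hessian,
# plus the rough O(n^3) flop count of a dense factorization/solve.
for n in (1e5, 1e6):
    mem_gb = n * n * 8 / 1e9   # 8 bytes per float64 entry
    flops = n**3               # dense solve, ignoring constants
    print(f"n = {n:.0e}: Hessian ~{mem_gb:,.0f} GB, solve ~{flops:.0e} flops")
# n = 1e5: 80 GB and 1e15 flops (borderline); n = 1e6: 8,000 GB and 1e18 flops.
```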
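---

For the Section 2 show-and-tell, a minimal spy-plot sketch (NumPy/Matplotlib assumed; the sizes and values are placeholders I made up):

```python
# Two structured Hessians side by side as spy plots:
#  1. log-barrier on x >= 0: f(x) = c^T x - mu*sum(log x_i) has Hessian
#     mu * diag(1/x_i^2) -- purely diagonal.
#  2. chain coupling: f(x) = 0.5*sum((x_{i+1} - x_i)^2) has a tridiagonal
#     (banded) Hessian.
import numpy as np
import matplotlib.pyplot as plt

n = 40
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=n)        # a strictly feasible interior point
mu = 1.0

H_barrier = mu * np.diag(1.0 / x**2)     # diagonal Hessian of the barrier term

# tridiagonal Hessian of the chain-coupled quadratic
H_chain = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
H_chain[0, 0] = H_chain[-1, -1] = 1.0    # boundary terms of the chain

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].spy(H_barrier, markersize=3)
axes[0].set_title("log-barrier: diagonal")
axes[1].spy(H_chain, markersize=3)
axes[1].set_title("chain quadratic: tridiagonal")
plt.show()
```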
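---

For the initial-scaling demo in Section 3, a minimal BFGS sketch using the inverse-Hessian update with $T_0 = \gamma I$ (the ill-conditioned diagonal quadratic, the $\gamma$ grid, and the tolerances are placeholders I picked, not from the tex document):

```python
import numpy as np

d = np.logspace(0, 4, 50)   # eigenvalues of an ill-conditioned diagonal quadratic

def f(x):                   # f(x) = 0.5 * x^T diag(d) x
    return 0.5 * np.dot(x, d * x)

def grad(x):
    return d * x

def bfgs(x0, gamma, max_iter=200, tol=1e-8):
    """BFGS with the inverse-Hessian update, starting from T_0 = gamma * I."""
    n = len(x0)
    x, g = x0.copy(), grad(x0)
    T = gamma * np.eye(n)                # inverse-Hessian approximation
    I = np.eye(n)
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            return k                     # iterations to converge
        p = -T @ g                       # search direction: one mat-vec, no solve
        t = 1.0                          # backtracking (Armijo) line search
        while f(x + t * p) > f(x) + 1e-4 * t * np.dot(g, p):
            t *= 0.5
        s = t * p
        g_new = grad(x + s)
        y = g_new - g
        rho = 1.0 / np.dot(y, s)         # y^T s > 0 on a convex quadratic
        # BFGS update of the *inverse* Hessian approximation
        T = (I - rho * np.outer(s, y)) @ T @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        x, g = x + s, g_new
    return max_iter                      # did not converge within the budget

x0 = np.ones(50)
for gamma in (1e-4, 1e-2, 1.0, 1e2):
    print(f"T_0 = {gamma:g} * I : {bfgs(x0, gamma)} iterations")
```

The iteration counts should differ sharply across the $\gamma$ values, which is exactly the sensitivity to the initial scaling the demo needs to show.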