❯ Here we have some lectures I've been working out for my computational opt class. I want to do lecture 27 notes. There are notes and some readings in that directory about how I want things done. See the plan.txt file in that directory. Also look at the structure from other lectures. These are "brilliant"-like lectures with multiple sections and demos.

Section 1: Challenges in Large-Scale Optimization

min $f(x)$ where $x \in R^{100,000}$

Nate Veldt and I studied a problem with 2 trillion constraints and 161 million variables. It took days to weeks to solve. (This was metric-constrained optimization.) Modern LLMs have _trillions_ of parameters.

--> Add other examples as relevant. (Aerospace trajectories? CFD optimization?)

---

Why is large-scale optimization hard?

- $f$ may take a long time to evaluate --> in which case you want surrogate optimization, which isn't what we are talking about here.
- We are going to assume we can store $x$ easily.
- We are going to assume that $f(x)$ and $g(x)$ (the gradient) can be computed reasonably efficiently for 10k+ evaluations. Formally: computing $f$ and $g$ is $o(n^2)$. Something like $O(n \log n)$ is okay. Something like $O(n^{3/2})$ is okay.

Add a little aside on little-o notation vs. big-O. (A draft is at the end of these notes.)

---

So can we do Newton with $f : R^{100,000} \to R$?

This is a quiz! Can we do Newton with this problem? Yes/No/Maybe

- Yes -- well, what if the Hessian is dense because all variables are coupled? How will you store the Hessian? Is it easy to solve systems with a dense 100k-by-100k matrix?
- No -- well, what if $H$ is structured? Maybe it's sparse, like from a quadratic program!
- Maybe -- that's not an answer -- try and learn something!

---

Newton needs $O(n^2)$ space for the Hessian and $O(n^3)$ time in general to solve. For $n = 10^5$ that gives $10^{15}$ work and needs 80 GB of memory. That's not entirely infeasible. GPUs have 100GB+ memory now! But if $n = 10^6$, we get... (back-of-envelope sketch at the end of these notes).

---

But what about structured Hessians?

---

Section 2: Structured Hessians

Section 2 on structured Hessians is just a "show-and-tell."

Here are some examples of where Hessians are structured:

- log-barrier terms on LPs: show the diagonal Hessian structure with something like a spy plot (demo sketch at the end of these notes).
- anything with a quadratic plus a separable term like $\sum_i x_i^p$
- banded diagonals

Give some examples of where these arise.

- Please insert more here and explain where they arise.

---

Section 3: Quasi-Newton Methods

The next topic is low-memory / low-rank quasi-Newton methods. We start by studying the BFGS update for the inverse Hessian. We only need to compute $-T_k g$ (where $T_k$ is the inverse-Hessian approximation) to get the quasi-Newton search direction! Most of this is worked out in the tex document. The goal is to explain this and put some demos behind it. Can you suggest a few?

Demos: Need to show the sensitivity to the initial scaling. (One candidate sketch is at the end of these notes.)
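---

For the little-o vs. big-O aside, one standard way to state it (my draft phrasing, not taken from the readings):

```latex
% Standard asymptotic definitions, as n -> infinity:
f(n) = O(g(n)) \iff \exists\, C > 0,\ N \ \text{s.t.}\ f(n) \le C\, g(n) \ \text{for all } n \ge N
% big-O allows cost proportional to g itself: 3n^2 = O(n^2).
f(n) = o(g(n)) \iff \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0
% little-o rules that out: n \log n = o(n^2) and n^{3/2} = o(n^2),
% but 3n^2 \ne o(n^2).  So "$o(n^2)$" means strictly subquadratic.
```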
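---

For the dense-Newton quiz, a quick back-of-envelope sketch (assumes float64 storage and counts a dense solve as roughly $n^3$ flops; the constants don't matter here):

```python
# Back-of-envelope for dense Newton: memory for an n-by-n float64 Hessian,
# plus the rough O(n^3) flop count of a dense factorization/solve.
for n in (1e5, 1e6):
    mem_gb = n * n * 8 / 1e9   # 8 bytes per float64 entry
    flops = n**3               # dense solve, ignoring constants
    print(f"n = {n:.0e}: Hessian ~{mem_gb:,.0f} GB, solve ~{flops:.0e} flops")
# n = 1e5: 80 GB and 1e15 flops (borderline); n = 1e6: 8,000 GB and 1e18 flops.
```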
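---

For the Section 2 show-and-tell, a minimal spy-plot sketch (NumPy/Matplotlib assumed; the sizes and values are placeholders I made up):

```python
# Two structured Hessians side by side as spy plots:
#  1. log-barrier on x >= 0: f(x) = c^T x - mu*sum(log x_i) has Hessian
#     mu * diag(1/x_i^2) -- purely diagonal.
#  2. chain coupling: f(x) = 0.5*sum((x_{i+1} - x_i)^2) has a tridiagonal
#     (banded) Hessian.
import numpy as np
import matplotlib.pyplot as plt

n = 40
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=n)        # a strictly feasible interior point
mu = 1.0

H_barrier = mu * np.diag(1.0 / x**2)     # diagonal Hessian of the barrier term

# tridiagonal Hessian of the chain-coupled quadratic
H_chain = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
H_chain[0, 0] = H_chain[-1, -1] = 1.0    # boundary terms of the chain

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].spy(H_barrier, markersize=3)
axes[0].set_title("log-barrier: diagonal")
axes[1].spy(H_chain, markersize=3)
axes[1].set_title("chain quadratic: tridiagonal")
plt.show()
```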
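---

For the initial-scaling demo in Section 3, a minimal BFGS sketch using the inverse-Hessian update with $T_0 = \gamma I$ (the ill-conditioned diagonal quadratic, the $\gamma$ grid, and the tolerances are placeholders I picked, not from the tex document):

```python
import numpy as np

d = np.logspace(0, 4, 50)   # eigenvalues of an ill-conditioned diagonal quadratic

def f(x):                   # f(x) = 0.5 * x^T diag(d) x
    return 0.5 * np.dot(x, d * x)

def grad(x):
    return d * x

def bfgs(x0, gamma, max_iter=200, tol=1e-8):
    """BFGS with the inverse-Hessian update, starting from T_0 = gamma * I."""
    n = len(x0)
    x, g = x0.copy(), grad(x0)
    T = gamma * np.eye(n)                # inverse-Hessian approximation
    I = np.eye(n)
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            return k                     # iterations to converge
        p = -T @ g                       # search direction: one mat-vec, no solve
        t = 1.0                          # backtracking (Armijo) line search
        while f(x + t * p) > f(x) + 1e-4 * t * np.dot(g, p):
            t *= 0.5
        s = t * p
        g_new = grad(x + s)
        y = g_new - g
        rho = 1.0 / np.dot(y, s)         # y^T s > 0 on a convex quadratic
        # BFGS update of the *inverse* Hessian approximation
        T = (I - rho * np.outer(s, y)) @ T @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        x, g = x + s, g_new
    return max_iter                      # did not converge within the budget

x0 = np.ones(50)
for gamma in (1e-4, 1e-2, 1.0, 1e2):
    print(f"T_0 = {gamma:g} * I : {bfgs(x0, gamma)} iterations")
```

The iteration counts should differ sharply across the $\gamma$ values, which is exactly the sensitivity to the initial scaling the demo needs to show.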