\documentclass{dgleich-article}
\usepackage{dgleich-math}
\usepackage{epstopdf}
\usepackage{microtype}

\newcommand{\vxn}{\vx^{+}}
\newcommand{\vgn}{\vg^{+}}
\newcommand{\fn}{f^{+}}

\usepackage{indentfirst}
\usenatbib

\title{Large scale optimization}
\author{David F. Gleich}

\begin{document}
\maketitle

 Consider the unconstrained optimization problem: 
\[ \MIN{}{f(\vx)} \]
where $f : \RR^{n} \to \RR$ is twice continuously differentiable.
\marginnote{These lecture notes are based on Nocedal and Wright, Chapter 7, as well as Griva, Sofer, and Nash, Chapter 13.}

Throughout this section, we will assume that $f(\vx)$ and $\vg(\vx)$ are both easy to evaluate.  What does this mean, precisely?  It means that your implementations of $f(\vx)$ and $\vg(\vx)$ are both ``fast'' programs.\footnote{For instance, suppose that $\mA$ is a large, sparse matrix.  Then we can compute the matrix-vector product $\mA \vx$ quite easily using sparse linear algebra.  Suppose, then, that we want to solve a large non-negative least-squares problem and are willing to tolerate a barrier approximation to the constraint: 
\[ \MIN{}{\normof{\mA \vx - \vb}^2 - \tau \ve^T \log \vx} \]
then computing $f$ and $\vg$ can be done as follows:\\[1ex]

\noindent \parbox{\linewidth}{
\ttfamily
 function nonnegls(x,A,b,tau)\\
 \hspace{1ex} y = A*x - b \# sparse matrix-vector product: efficient \\
 \hspace{1ex} f = norm(y)\char`\^ 2 - tau*sum(log.(x)) \\
 \hspace{1ex} g = 2*(A'*y) - tau*(1.0./x) \\
 \hspace{1ex} return f, g \\
 end
}
}
In a slightly more formal sense, we assume that the function and gradient correspond to $O(n)$ or $O(n \log n)$ computations.
\section{Why do we need large-scale methods?}

Let's consider the best methods we've studied for solving optimization problems: Newton's method and the quasi-Newton method. In brief pseudo-code, and without appropriate checks on the non-singularity of $\mH$, Newton's method is:

\begin{pseudocode}
 $\vx_0$ is given
 while not done
   Solve $\mH_k \vp_k = -\vg_k$.
   Set $\vx_{k+1} = \vx_k + \alpha_k \vp_k$ using a strong Wolfe line search
\end{pseudocode}
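In Python, a minimal sketch of this iteration might look as follows. The two-variable test function, the explicit $2 \times 2$ Hessian solve, and the simple backtracking (Armijo) line search standing in for a strong Wolfe search are all my own illustrative choices:

```python
import math

def f(x, y):
    # A strictly convex test function (illustrative choice); minimizer at (0, 0).
    return 0.25*x**4 + 0.5*x**2 + 0.5*y**2 + 0.25*x*y

def grad(x, y):
    return (x**3 + x + 0.25*y, y + 0.25*x)

def hess(x, y):
    # 2x2 Hessian; positive definite everywhere for this f.
    return ((3*x**2 + 1.0, 0.25), (0.25, 1.0))

def newton(x, y, tol=1e-10, maxit=50):
    for _ in range(maxit):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            break
        (a, b), (c, d) = hess(x, y)
        det = a*d - b*c
        # Solve H p = -g directly for the 2x2 system.
        px = (-gx*d + gy*b) / det
        py = (-gy*a + gx*c) / det
        # Backtracking (Armijo) line search in place of strong Wolfe.
        alpha, fx = 1.0, f(x, y)
        while f(x + alpha*px, y + alpha*py) > fx + 1e-4*alpha*(gx*px + gy*py):
            alpha *= 0.5
        x, y = x + alpha*px, y + alpha*py
    return x, y

xstar, ystar = newton(1.5, 1.0)
```

A real implementation would, of course, factor the Hessian rather than invert a $2 \times 2$ block; the point is only the shape of the loop.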

First, doing the line search with $\vx_k$ as a 1,000,000- or even 5,000,000-dimensional vector is not a problem.  All we need for a strong Wolfe line search are inner products with the search direction and the gradient, and we have assumed the function and gradient are easy to evaluate.

However, this method involves solving a symmetric, positive definite linear system. We can do this via the Cholesky factorization in approximately $n^3/6$ floating point operations.  If $n$ is 100,000 or 1,000,000 -- which are reasonable sizes for problems -- then this computation is impractical using a textbook Cholesky method.  We'll discuss the case of sparse Hessians, where solving this system is still reasonable, below.

If $f$ is even modestly complicated, getting Hessian information may be difficult. Thus, let's also consider a Quasi-Newton BFGS method.

\begin{pseudocode}
 $\vx_0$ is given
 $\mT_0$ is a scaled identity such that $\mT_0 \approx \mH(\vx_0)$.
 while not done
   Set $\vx_{k+1} = \vx_k - \alpha_k \mT_k \vg_k$ using a strong Wolfe line search
   Set $\mT_{k+1} = \mV_k^T \mT_k \mV_k + \rho_k \vs_k \vs_k^T$
     where $\vs_k = \vx_{k+1} - \vx_k$, $\vy_k = \vg_{k+1} - \vg_k$
           $\rho_k = (\vy_k^T \vs_k)^{-1}$,
           $\mV_k = (\eye - \rho_k \vy_k \vs_k^T)$.
\end{pseudocode}

In this case, the first update to the inverse Hessian approximation $\mT_k$ results in a dense $n \times n$ matrix.  If $n$ is even 500,000, this matrix is almost impossible to even store!\footnote{The matrix is symmetric, so we'd have to store ``500,000 choose 2'' double-precision numbers, or 931 GB of data; in theory this could be feasible.}

\emph{In a nutshell, the problem with both of these methods is that the linear algebra required does not scale.}
If your optimization problem is huge then you have three choices to solve it:
\begin{compactenum}
 \item Use a simple method
 \item Use scalable linear algebra
 \item Change the method
\end{compactenum}
We'll discuss these in turn.

\section{Simple methods}

Since the problem with these methods is that the linear algebra does not scale, let's use methods with simple linear algebra.  We've seen one such method: gradient descent. Another is nonlinear conjugate gradients, which computes a new search direction that is conjugate to the previous search directions.

For both of these methods, all we need at each step is the gradient, and we have assumed that computing the gradient is easy (see above).  For gradient descent, the pseudo-code is:

\begin{pseudocode}
 while not done
   $\vx_{k+1} = \vx_k - \alpha_k \vg_k$
\end{pseudocode}

And so the only work is doing the line-search which is scalable under the assumptions above.
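As a concrete sketch, here is gradient descent with a backtracking (Armijo) line search in Python; the ill-conditioned quadratic test problem and the backtracking rule in place of a strong Wolfe search are illustrative assumptions:

```python
def gradient_descent(f, grad, x, steps=300):
    for _ in range(steps):
        g = grad(x)
        gg = sum(gi*gi for gi in g)
        if gg == 0.0:
            break
        # Backtracking (Armijo) line search along -g.
        alpha, fx = 1.0, f(x)
        while f([xi - alpha*gi for xi, gi in zip(x, g)]) > fx - 1e-4*alpha*gg:
            alpha *= 0.5
        x = [xi - alpha*gi for xi, gi in zip(x, g)]
    return x

# Illustrative ill-conditioned quadratic: f(x) = 0.5*(x1^2 + 10*x2^2).
f = lambda x: 0.5*(x[0]**2 + 10.0*x[1]**2)
grad = lambda x: [x[0], 10.0*x[1]]
xstar = gradient_descent(f, grad, [1.0, 1.0])
```

The only work per iteration is one gradient and a few function evaluations in the line search; note how the condition number of the quadratic forces small steps, which is exactly the linear-convergence behavior discussed below.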

For conjugate gradients, the pseudo-code of the method is: 

\begin{pseudocode}
 while not done 
   ... 
   $\beta_k = \normof{\vg_k}^2/(\vg_{k} - \vg_{k-1})^T \vp_{k-1}$ % O(n) work
   $\vp_k = -\vg_k + \beta_k \vp_{k-1}$ % O(n) work
   $\vx_{k+1} = \vx_{k} + \alpha_k \vp_k$ % line-search
\end{pseudocode}

This method only involves $O(n)$ work beyond computing the line-search and gradients.
The proof of convergence for nonlinear CG relies only on the fact that every $n$ iterations we reset the process with a step of gradient descent. 
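To make this concrete, here is a Python sketch of the iteration on a small quadratic, using the $\beta_k$ formula above; the exact line search (valid for quadratics) and the test problem are my own illustrative choices. On a quadratic, CG terminates in at most $n$ steps:

```python
def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

# Illustrative SPD quadratic: f(x) = 0.5 x'Ax - b'x, so g(x) = Ax - b.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x = [0.0, 0.0, 0.0]
g = [gi - bi for gi, bi in zip(matvec(A, x), b)]
p = [-gi for gi in g]                      # first direction: steepest descent
for k in range(3):                         # at most n = 3 steps on a quadratic
    if dot(g, g) < 1e-24:
        break
    Ap = matvec(A, p)
    alpha = -dot(g, p) / dot(p, Ap)        # exact line search for a quadratic
    x = [xi + alpha*pi for xi, pi in zip(x, p)]
    g_new = [gi + alpha*api for gi, api in zip(g, Ap)]
    y = [gn - gi for gn, gi in zip(g_new, g)]
    beta = dot(g_new, g_new) / dot(y, p)   # the beta formula from the pseudo-code
    p = [-gn + beta*pi for gn, pi in zip(g_new, p)]
    g = g_new
```

Every line here is an $O(n)$ vector operation plus one matrix-vector product, matching the work estimate above.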

\emph{The problem with both of these methods is that they only converge linearly to a solution.}  Unless the problems have very special structure where this results in a small amount of work, these methods tend to be slow.

\paragraph{Modern ML/AI methods} There has been a lot of work on other simple, scalable optimization procedures for problems in machine learning and AI.  These methods tend to be based on stochastic gradient descent. Almost all of them have linear convergence (or worse!), yet they are widely used in practice. The rationale is that the optimization problem is fit to a particular sample of data. There is nothing perfect about this sample, and hence, we don't need to find the exact solution; ideally, we'd like an easy-to-compute solution that is near a robust, reliable minimizer. In fact, this \emph{approximation} is a form of regularization that helps the methods avoid overfitting the given data, although this behavior is much more complicated than you might expect; see the research on \emph{double descent} behavior. Thus, there are many large-scale problems where linear or even sub-linear convergence is perfectly acceptable.
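To illustrate the flavor of these methods, here is a minimal stochastic-gradient sketch in Python for a one-dimensional least-squares problem. The data, the diminishing $1/(2(t+1))$ step size, and the deterministic cycling through the samples (instead of random sampling) are all illustrative assumptions, not a recommendation:

```python
# Minimize (1/N) * sum_i (x - b_i)^2 by sampling one term at a time.
data = [0.5, 1.0, 1.5, 2.0, 2.5]

x = 10.0                    # deliberately bad starting point
t = 0
for epoch in range(50):     # cycle deterministically through the data
    for bi in data:
        step = 1.0 / (2.0*(t + 1))     # diminishing step size
        x -= step * 2.0*(x - bi)       # gradient of the one sampled term
        t += 1
# x is now (very nearly) the mean of the data, the true minimizer.
```

With this particular step size the iterate is exactly the running average of the samples seen so far, which is a nice way to see why diminishing steps kill the noise.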

\section{Scalable linear algebra}

The key to scalable linear algebra is to exploit structure in the matrix. In a general, $n \times n$, symmetric matrix, there is no structure to exploit except for symmetry.  Thus, scalable methods for these problems end up using many computers to store all the entries of the matrix. The linear algebra routines are parallel codes that use hundreds of machines to do things like compute a Cholesky factorization.\footnote{I've used the codes in ScaLAPACK, arguably the standard library, in order to compute all eigenvalues and vectors of a symmetric 250,000-by-250,000 matrix using around 6000 processors.}

\subsection{Sparsity}

The most common structure in large problems is sparsity.  It's a tricky concept to define, but the idea is that most of the entries of the matrix are zero.  If the sparsity isn't too bad, then there has been an enormous amount of work on how to solve linear systems by exploiting it.  The key idea is to minimize the number of new non-zeros introduced during a matrix factorization.  Here's a really simple example.

Let $\mA = \eye + \vv \ve_1^T + \ve_1 \vv^T$.  This matrix has the shape of an arrow: 
\[ \bmat{ \bullet & \bullet & \bullet & \bullet & \bullet & \bullet\\ \bullet & \bullet & 0 & 0 & 0 & 0 \\ \bullet & 0 & \bullet & 0 & 0 & 0 \\ \bullet & 0 & 0 & \bullet & 0 & 0 \\ \bullet & 0 & 0 & 0 & \bullet & 0 \\ \bullet & 0 & 0 & 0 & 0 & \bullet } \quad \text{\parbox{2in}{where the $\bullet$ denotes a non-zero entry. We often just omit the zeros: }} \quad
\bmat{ \bullet & \bullet & \bullet & \bullet & \bullet & \bullet\\ \bullet & \bullet &  &  &  &  \\ \bullet &  & \bullet &  &  &  \\ \bullet &  &  & \bullet &  &  \\ \bullet &  &  &  & \bullet &  \\ \bullet &  &  &  &  & \bullet }.
\]
If we compute the Cholesky factorization of this matrix, then we'll get a Cholesky factor: 
\[ \mA = \mL \mL^T \quad \text{ where } \quad \mL = \bmat{ \bullet &  &  &  &  &  \\ \bullet & \bullet &  &  &  &  \\ \bullet & \bullet & \bullet &  &  &  \\ \bullet & \bullet & \bullet & \bullet &  &  \\ \bullet & \bullet & \bullet & \bullet & \bullet &  \\ \bullet & \bullet & \bullet & \bullet & \bullet & \bullet }.\]
That is, we will have taken a very sparse matrix and made it entirely dense.  If, however, we had permuted the matrix so that $\mA = \eye + \vv \ve_n^T + \ve_n \vv^T$, then we would have found: 
\[ \mA = \bmat{ \bullet &  &  &  &  & \bullet \\  & \bullet &  &  &  & \bullet \\  &  & \bullet &  &  & \bullet \\  &  &  & \bullet &  & \bullet \\  &  &  &  & \bullet & \bullet \\ \bullet & \bullet & \bullet & \bullet & \bullet & \bullet }
\qquad \text{ with Cholesky factor } \qquad \mL = \bmat{ \bullet &  &  &  &  &  \\  & \bullet &  &  &  &  \\  &  & \bullet &  &  &  \\  &  &  & \bullet &  &  \\  &  &  &  & \bullet &  \\ \bullet & \bullet & \bullet & \bullet & \bullet & \bullet } \]
The goal with sparse linear algebra is to figure out how to find permutations $\mP$ that keep the Cholesky factors as sparse as possible. This procedure is very tricky and involves a host of beautiful relationships with graph theory. Tim Davis has an incredible book about it called ``Direct Methods for Sparse Linear Systems.''
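We can see this fill-in numerically. The sketch below (pure Python; the $6 \times 6$ arrow matrices and the diagonal value $6$, chosen by me to make them positive definite, are illustrative) computes a dense Cholesky factor for both orderings and counts the non-zeros in $\mL$:

```python
import math

def cholesky(A):
    # Textbook dense Cholesky: A = L L^T, A symmetric positive definite.
    n = len(A)
    L = [[0.0]*n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k]**2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j+1, n):
            L[i][j] = (A[i][j] - sum(L[i][k]*L[j][k] for k in range(j))) / L[j][j]
    return L

def nnz(L):
    # Count entries that are non-zero beyond rounding error.
    return sum(1 for row in L for v in row if abs(v) > 1e-12)

n = 6
# Arrow pointing the "wrong" way: dense first row and column.
arrow_first = [[6.0 if i == j else (1.0 if 0 in (i, j) else 0.0)
                for j in range(n)] for i in range(n)]
# Permuted arrow: dense last row and column.
arrow_last = [[6.0 if i == j else (1.0 if n-1 in (i, j) else 0.0)
               for j in range(n)] for i in range(n)]

fill = nnz(cholesky(arrow_first))    # completely dense lower triangle
nofill = nnz(cholesky(arrow_last))   # no fill-in at all
```

With the arrow in the first row and column, the factor has all $n(n+1)/2 = 21$ lower-triangular entries non-zero; with the arrow in the last row and column, the factor keeps exactly the $2n-1 = 11$ non-zeros of the matrix itself.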

For this reason, there has been a lot of work in Alex Pothen's group here at Purdue on how to compute sparse Hessians using automatic differentiation.

%\subsection{Structure}

%The 

\subsection{Iterative methods \famp\ inexact Newton}

Alternatively, if the problem will somehow let you compute matrix-vector products with the Hessian efficiently, then you could use an iterative method.\footnote{There are many problems where this is the case, even though the Hessian itself is dense, such as constrained least-squares problems where the constraint involves a Fourier-transform matrix, which is dense but has an $O(n \log n)$ matrix-vector product through the FFT.}
The goal would be to solve \[ \mH_k \vp_k = -\vg_k \] using only these matrix-vector products. As discussed in the book, if $\tilde{\vp}_k$ is a direction where
\[ \normof{\mH_k \tilde{\vp}_k + \vg_k} \le \eta \normof{\vg_k} \]
and $0 < \eta < 1$, then we are ``okay'' -- see Nocedal and Wright, Theorem 7.1.
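A Python sketch of this inner solve: linear CG applied to $\mH_k \vp_k = -\vg_k$, touching the Hessian only through a matrix-vector product callback and stopping once the residual test above holds. The tridiagonal test matrix, the gradient vector, and $\eta = 0.1$ are illustrative choices:

```python
import math

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def inexact_newton_direction(hessvec, g, eta=0.1, maxit=100):
    # Solve H p = -g by linear CG, using only H*v products, stopping once
    # ||H p + g|| <= eta * ||g||  (the inexact-Newton residual test).
    n = len(g)
    p = [0.0]*n
    r = [-gi for gi in g]            # residual of H p = -g at p = 0
    d = r[:]
    gnorm = math.sqrt(dot(g, g))
    for _ in range(maxit):
        if math.sqrt(dot(r, r)) <= eta*gnorm:
            break
        Hd = hessvec(d)
        alpha = dot(r, r) / dot(d, Hd)
        p = [pi + alpha*di for pi, di in zip(p, d)]
        r_new = [ri - alpha*hdi for ri, hdi in zip(r, Hd)]
        beta = dot(r_new, r_new) / dot(r, r)
        d = [ri + beta*di for ri, di in zip(r_new, d)]
        r = r_new
    return p

def hessvec(v):
    # H is the tridiagonal matrix with 4 on the diagonal and -1 off it,
    # applied without ever forming the matrix.
    n = len(v)
    return [4.0*v[i] - (v[i-1] if i > 0 else 0.0)
                     - (v[i+1] if i < n-1 else 0.0) for i in range(n)]

g = [1.0, -2.0, 3.0, -4.0, 5.0]
p = inexact_newton_direction(hessvec, g, eta=0.1)
```

Nothing here ever stores an $n \times n$ matrix; the memory footprint is a handful of vectors.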

\paragraph{Directions of negative curvature} A quick aside on why CG can fail and how that failure ends up helping us: CG breaks down when it encounters a direction $\vp$ with $\vp^T \mH_k \vp \le 0$, but such a direction of negative curvature is itself valuable, since the local quadratic model decreases without bound along it.

\section{New methods: Limited-Memory BFGS}

The goal in designing a new method is to exploit the structure of the BFGS quasi-Newton method and generate a limited-memory approximation of the Hessian with similar properties. The key idea follows from the question: \emph{How could we exploit the fact that we think our optimization method will converge fast, such as in 10 steps?}  If your routine only takes 10 iterations, then it seems crazy to do all the work of updating a Hessian approximation with roughly $n^2/2$ distinct matrix elements. In total, the 10 steps would generate a rank-20 change to the matrix, which would only take $20n$ numbers to store. So the idea with a limited-memory BFGS method is that we'll just store the last few changes to the Hessian as a low-rank update.

\subsection{L-BFGS}

In the BFGS update, we compute: 
\[ \mT_{k+1} = (\eye - \rho_k \vs_k \vy_k^T) \mT_k (\eye - \rho_k \vy_k \vs_k^T) + \rho_k \vs_k \vs_k^T. \]
Let $\mV_k = (\eye - \rho_k \vy_k \vs_k^T)$. Then 
\[ \mT_{k+1} = \mV_k^T \mT_k \mV_k + \rho_k \vs_k \vs_k^T. \]
Now, consider what happens if we store vectors $\vs_k$ and $\vy_k$ for the last $10$ (or so) steps. 

Then we have, starting from $\mT_0$: 
\[ \begin{aligned}
      \mT_k &= \mV_{k-1}^T \mV_{k-2}^T \cdots \mV_{k-m}^T \mT_{0} \mV_{k-m} \cdots \mV_{k-2} \mV_{k-1} \\
            & \quad + \rho_{k-m} \mV_{k-1}^T \cdots \mV_{k-m+1}^T \vs_{k-m} \vs_{k-m}^T \mV_{k-m+1} \cdots \mV_{k-1} \\
            & \quad + \rho_{k-m+1} \mV_{k-1}^T \cdots \mV_{k-m+2}^T \vs_{k-m+1} \vs_{k-m+1}^T \mV_{k-m+2} \cdots \mV_{k-1} \\
            & \quad \vdots \\
            & \quad + \rho_{k-1} \vs_{k-1} \vs_{k-1}^T.
   \end{aligned} 
\]
To compute $\mT_k \vg_k$, all we need to do is multiply $\mT_k$ by a vector; we never need the matrix elements themselves!
We can do that using the following algorithm:

\begin{pseudocode}
 $\vq = \vz$
 for i=$k-1$ to $k-m$
   $\alpha_i = \rho_i \vs_i^T \vq$ % need to save these for below
   $\vq \leftarrow \vq - \alpha_i \vy_i$
 $\vr = \mT_0 \vq$
 for i=$k-m$ to $k-1$
   $\beta = \rho_i \vy_i^T \vr$
   $\vr \leftarrow \vr + \vs_i (\alpha_i - \beta)$
\end{pseudocode}

Then we'll have $\vr = \mT_k \vz$.
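Here is the two-loop recursion in Python, with $\mT_0 = \eye$; the SPD ``Hessian'' used to generate the $(\vs_i, \vy_i)$ pairs is an illustrative assumption. A quick sanity check: BFGS satisfies the secant condition $\mT_k \vy_{k-1} = \vs_{k-1}$, so feeding the most recent $\vy$ through the recursion should return the most recent $\vs$:

```python
def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def two_loop(z, S, Y):
    # L-BFGS two-loop recursion with T0 = identity.
    # S, Y hold the stored pairs, oldest first; returns r = T_k z.
    rho = [1.0/dot(y, s) for s, y in zip(S, Y)]
    q = z[:]
    alphas = [0.0]*len(S)
    for i in reversed(range(len(S))):       # newest to oldest
        alphas[i] = rho[i]*dot(S[i], q)
        q = [qj - alphas[i]*yj for qj, yj in zip(q, Y[i])]
    r = q[:]                                # r = T0 q with T0 = I
    for i in range(len(S)):                 # oldest to newest
        beta = rho[i]*dot(Y[i], r)
        r = [rj + (alphas[i] - beta)*sj for rj, sj in zip(r, S[i])]
    return r

# Generate pairs from an SPD "Hessian" so that y = H s (illustrative data).
H = [[5.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 3.0]]
S = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
Y = [[dot(row, s) for row in H] for s in S]

r = two_loop(Y[-1], S, Y)   # should equal S[-1] by the secant condition
```

Each loop is $m$ inner products and vector updates, so the whole product costs $O(mn)$ -- the matrix $\mT_k$ is never formed.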

So the idea with limited-memory BFGS (L-BFGS) is that we store the last $m$ updates to $\mT_0$ and use those to estimate $\mT_k \vg_k$. 
We can actually change $\mT_0$ at each step, and a common choice is $\mT_0 = \frac{\vs_{k-1}^T \vy_{k-1}}{\vy_{k-1}^T \vy_{k-1}} \eye$.  This choice attempts to estimate the scale of the inverse Hessian along the most recent search direction.  

In practice, you keep the most recent $m$ iterates around.  Usually these are stored in a circular buffer.  

   
%\section{Separable functions}   







\end{document}
