Please answer the following questions in complete sentences in a clearly prepared manuscript and submit the solution by the due date on Blackboard (around Sunday, September 23rd, 2018).
Remember that this is a graduate class. There may be elements of the problem statements that require you to fill in appropriate assumptions. You are also responsible for determining what evidence to include. An answer alone is rarely sufficient, but an overly verbose description is not required either. Use your judgment to focus your discussion on the most interesting pieces. The answer to "should I include 'something' in my solution?" will almost always be: Yes, if you think it helps support your answer.
Please identify anyone, whether or not they are in the class, with whom you discussed your homework. This problem is worth 1 point, but on a multiplicative scale.
Make sure you have included your source-code and prepared your solution according to the most recent Piazza note on homework submissions.
The algorithms for coordinate and gradient descent for solving $Ax = b$ with a symmetric positive definite matrix $A$ correspond to trying to minimize
\[ f(x) = \tfrac{1}{2} x^T A x - x^T b \]
by, at each step, exactly minimizing this function over a single coordinate (coordinate descent) or exactly minimizing this function in the direction of the gradient (gradient descent). Let $x^{(k)}$ represent the approximate solution after $k$ steps. Let $g^{(k)} = A x^{(k)} - b$ be the gradient at the $k$th step. Then the algorithms can be written:
\[ x^{(k+1)} = x^{(k)} + \alpha_k e_{i_k}, \qquad \alpha_k = \frac{b_{i_k} - (A x^{(k)})_{i_k}}{A_{i_k i_k}} \]
and
\[ x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}, \qquad \alpha_k = \frac{(g^{(k)})^T (g^{(k)})}{(g^{(k)})^T A g^{(k)}}, \]
respectively.
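As a reference point, the two updates can be sketched as follows (a minimal dense-matrix sketch using NumPy, not the required implementation; function names are illustrative):

```python
import numpy as np

def gradient_descent_step(A, b, x):
    """One exact-line-search gradient descent step for f(x) = 0.5 x'Ax - x'b."""
    g = A @ x - b                      # gradient at the current iterate
    alpha = (g @ g) / (g @ (A @ g))    # exact minimizer along -g
    return x - alpha * g

def coordinate_descent_step(A, b, x, i):
    """Exactly minimize f over coordinate i, holding the other coordinates fixed."""
    x = x.copy()
    # solve d f / d x_i = 0:  x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii
    x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```

Both steps monotonically decrease $f$, and on a well-conditioned system the residual shrinks geometrically.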
Suppose that we wish to generalize coordinate descent to \emph{two} coordinates. Then the update can be written:
\[ x^{(k+1)} = x^{(k)} + \alpha e_i + \beta e_j, \]
where $i \neq j$ and $e_i$ denotes the $i$th standard basis vector. That is, we have to choose two coefficients, $\alpha$ and $\beta$, in order to figure out the next step. Like coordinate descent, we are going to pick these to exactly minimize $f(x^{(k)} + \alpha e_i + \beta e_j)$ as a function of $\alpha$ and $\beta$.
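One way to see what this exact minimization entails (a sketch under the notation above, writing $g = A x^{(k)} - b$): expanding the quadratic gives
\[ f(x^{(k)} + \alpha e_i + \beta e_j) = f(x^{(k)}) + \alpha g_i + \beta g_j + \tfrac{1}{2}\left( \alpha^2 A_{ii} + 2 \alpha \beta A_{ij} + \beta^2 A_{jj} \right), \]
so setting the partial derivatives with respect to $\alpha$ and $\beta$ to zero yields the $2 \times 2$ system
\[ \begin{pmatrix} A_{ii} & A_{ij} \\ A_{ij} & A_{jj} \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = - \begin{pmatrix} g_i \\ g_j \end{pmatrix}. \]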
Give the resulting iterative algorithm.
For gradient descent, the most expensive computational step is computing matrix-vector products with the matrix $A$. Show how to implement this algorithm with only a single matrix-vector product per iteration. Your algorithm should stop when the residual is sufficiently small. Your algorithm, including the check of the stopping condition, may use only one matrix-vector product per iteration.
(Hint: you will have to figure out how to update the expressions for the gradient and the solution and reformulate the algorithm!)
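One possible sketch of such a reformulation (an assumption about the intended approach, not the definitive answer; NumPy is assumed): maintain the gradient $g = Ax - b$ via the recurrence $g \leftarrow g - \alpha (Ag)$, so the loop body needs only the product $Ag$ and the stopping test reuses the maintained $g$:

```python
import numpy as np

def gradient_descent_one_matvec(A, b, x0, tol=1e-8, maxit=10000):
    """Gradient descent using a single product with A per iteration."""
    x = x0.copy()
    g = A @ x0 - b                      # one matvec before the loop starts
    nb = np.linalg.norm(b)
    for _ in range(maxit):
        if np.linalg.norm(g) <= tol * nb:  # residual norm = ||g||, no extra matvec
            break
        w = A @ g                       # the only matvec inside the loop
        alpha = (g @ g) / (g @ w)
        x = x - alpha * g
        g = g - alpha * w               # keeps g = Ax - b (in exact arithmetic)
    return x
```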
For coordinate descent, the most expensive step is computing $(A x^{(k)})_i$. Use the routines from Homework 2 to compute this term efficiently. Develop an algorithm to solve $Ax = b$ via coordinate descent with two strategies to pick the coordinates: (1) by cycling through $1$ to $n$ (called cyclic coordinate descent) and (2) by randomly picking a coordinate (called random coordinate descent). If row $i$ (or column $i$) of the matrix has $z_i$ non-zeros, your algorithm must do work proportional to $z_i$ in each iteration.
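A hedged sketch of the sparse bookkeeping (here scipy.sparse stands in for the Homework 2 routines, which is an assumption; the residual $r = b - Ax$ is maintained incrementally so each step touches only row $i$'s non-zeros):

```python
import numpy as np
import scipy.sparse as sp

def coord_descent_sparse(A, b, x0, strategy="cyclic", tol=1e-6,
                         max_steps=100000, seed=0):
    """Coordinate descent on a sparse symmetric positive definite A (CSR)."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    x = x0.copy()
    r = b - A @ x0                       # full residual once, before the loop
    nb = np.linalg.norm(b)
    rng = np.random.default_rng(seed)
    work = 0                             # total non-zeros touched so far
    for step in range(max_steps):
        # checking ||r|| costs O(n), so only test once per n coordinate steps
        if step % n == 0 and np.linalg.norm(r) <= tol * nb:
            break
        i = step % n if strategy == "cyclic" else int(rng.integers(n))
        lo, hi = A.indptr[i], A.indptr[i + 1]
        cols, vals = A.indices[lo:hi], A.data[lo:hi]
        Aii = vals[cols == i][0]         # diagonal entry from row i's non-zeros
        delta = r[i] / Aii               # exact minimizer along coordinate i
        x[i] += delta
        r[cols] -= delta * vals          # A symmetric: row i pattern = column i
        work += hi - lo                  # O(z_i) work this iteration
    return x, work
```

The returned `work` counter is the non-zeros-used measure discussed in the comparison problem below.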
For the 2d mesh with , , and , compare the performance of the algorithms from problems 2 and 3 in terms of the \emph{norm of the relative residual} compared with the \emph{number of non-zeros used}. Use a relative residual of .
We are not comparing by number of iterations because each iteration does vastly different amounts of work. An iteration of coordinate descent does a very small amount of work, and we need $n$ iterations of cyclic coordinate descent to give us the same amount of work as one iteration of gradient descent. Hence, we'll use the total number of non-zero entries used. So if a row/column used in a coordinate descent step has $z_i$ non-zero entries, then we increase our work count by $z_i$. For each iteration of gradient descent, we use the total number of non-zeros in $A$ in work.
Your answer should probably have both a table and a plot of how the residual decreases with \emph{work}. You should show how the residual decreases from its initial value down to the stopping tolerance as a function of work. This suggests a log scaling of the $y$-axis.
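A minimal plotting sketch (matplotlib assumed; the `work` and `relres` arrays are hypothetical logs that each solver is assumed to record during its run):

```python
import matplotlib
matplotlib.use("Agg")                    # headless backend; drop for interactive use
import matplotlib.pyplot as plt

def plot_convergence(curves):
    """curves maps a solver label to a (work, relres) pair of equal-length lists."""
    fig, ax = plt.subplots()
    for label, (work, relres) in curves.items():
        ax.semilogy(work, relres, label=label)   # log scale on the residual axis
    ax.set_xlabel("non-zeros used (work)")
    ax.set_ylabel("relative residual")
    ax.legend()
    return fig, ax
```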
Describe what happens when you use these algorithms on the linear system of equations for Candyland.