All right. Welcome back. We have a sparse group here today, so we'll definitely have a quiz; you're all in luck. What are we talking about today? Large-scale optimization. There are some handouts. I didn't get a chance to post them before class, but they go over a lot of the same material we'll see here, so you can check them out after class, too.

So, things we're going to see: why is large-scale optimization challenging? We'll then talk about an idea called limited-memory quasi-Newton, which gets you something a little bit better. It's a hybrid technique that gives you a little bit of curvature information; it's not perfect, but it generally does a little better than pure gradient descent. Another idea in that vein is conjugate gradient. And then we'll go back to cases where you can actually do something with the Hessian: sparsity and exact Newton. I don't think we're going to get to all of this today, so whatever we don't get to, we'll do on Thursday.

And, yeah, in terms of projects, I should say a little about projects. I forget exactly when they're due now, but they're due whenever I said they were due. Two weeks? Okay, due in two weeks. You don't have any more interim status reports; your last one was last time. Again, in terms of things that are important: if you're doing the optimization on the surface of the hypersphere, I do expect you to try some gradient-based optimization. If you're not going to do any gradient-based optimization, you probably want to have a conversation with me. And in particular, if you're doing gradient-based optimization, what do you always do with what you call your gradient? Check it. If you don't check it, you're probably going to lose some points on your write-up. If you implement your own optimization software, I'm not going to review it, but I do want you to give me some justification that it's plausibly correct. What might that look like? Take some problem where you know the answer, and make sure you get it.
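For concreteness, a gradient check against finite differences might look something like the sketch below. This is just my illustration of the idea, not required course code; the names f, grad_f, and the step size h are placeholders.

```julia
using LinearAlgebra

# Minimal gradient-check sketch: compare an analytic gradient to a
# central finite difference at a test point. If the gradient is coded
# correctly, expect agreement to several digits.
function check_gradient(f, grad_f, x; h=1e-6)
    g  = grad_f(x)
    gd = similar(x)
    for i in eachindex(x)
        e = zeros(length(x)); e[i] = h
        gd[i] = (f(x + e) - f(x - e)) / (2h)   # central difference
    end
    relerr = norm(g - gd) / max(norm(g), eps())
    println("relative gradient error: ", relerr)
    return relerr
end
```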
>> Many of the algorithms for the software are coming from textbooks. Is it okay if we basically go through those proofs?

>> That is a theoretical justification that what you have done is, in principle, correct. It does not mean your software is correct. So I'm looking for a software-based demonstration of this, not a "hey, in theory, these ideas work." Yes, in theory, lots of ideas work, but you may have made an error when you translated those ideas onto the computer. A subtle error, but still: how do you know you haven't made a subtle error in translating the algorithms onto the computer? I would like to see some justification of that as well.

If you're using libraries, again, that is not an excuse to trust the library. Libraries have errors just like your software does. They're less likely, but they still happen. So I would expect the same thing there. Yes?

>> What's the value you use for project 2?

>> I'll get to project 2 in a second. So, again, on project 1: essentially, I want a write-up where I don't have to read your code to understand that what you've done is correct. You should definitely justify that in your write-up, and then hopefully you've done something interesting in this general space. Other questions about project 1, the hypersphere? Yes?

>> Just a quick question. On Gradescope, there's a place to upload the report as well as the code. If we're doing project 1, do we...

>> You do not need to upload code.

>> Just a report?

>> Yeah, just a report.

>> We're not going to get marked down for not uploading anything?

>> You will not get marked down for not uploading anything. I'm only going to download the code for project 2. I guess if you upload code for project 1, I'm not going to run it, and I probably won't even look at it. Although I wouldn't test that, because I may look at it; I should say, test that at your own risk. If you want to upload something fun there, well, there you go. I'd suggest your time could be better spent improving your writing, or running your report through some grammar or spelling checkers.

All right. Other questions about project 1? So, project 2: I definitely need a Julia file, because what I'm going to do is include your Julia file. I'll probably specify a file-name convention that I'll want you to use, like your group's last name or something; actually, it will probably just be project2.jl. Let me make that clear. And if you need obscure packages, you definitely want to check those with me ahead of time.
If you're using things we've used in class, you can assume that I'll have them. But if you're using exotic or obscure packages, just let me know ahead of time. In particular, you probably shouldn't be using things like JuMP or other solver packages. I don't want to say those are forbidden, because if you have a good reason, I'm open to it. But you should have a very good reason for depending on any solver package. So I'm going to put a note here: packages are fine, except solvers.

What tolerance? 10^-6? 10^-4? 10^-5? I don't know; I haven't decided which one I'm going to run yet. But I would certainly try to make sure my software is well-behaved for a variety of tolerances. In the past, people have complained because I ran with a different tolerance than the one they thought I was going to run with, and then they said, "well, if I tweak this constant here, then it works." Yes, but I told you I wasn't going to commit to a particular tolerance ahead of time. If you're overly reliant on how you've tweaked your algorithm for one specific tolerance, that is the kind of thing you tend to lose points for on this one. So the best advice is: try a bunch of tolerances, and make sure your algorithms behave nicely and/or fail gracefully for lots of them. Yes?

Yes, you can do smarter linear algebra; I should be clear, smarter linear system solves, let me put it that way. But again, if you're using some obscure package, or things we haven't used in class, you might just want to check those with me. I haven't seen anything in the project reports yet that alarmed me, but if I do see it, I will let you know. I might also check tolerances like 10^-2 and 10^-3, to see whether you're solving the problems appropriately for the tolerance as well.

You look skeptical about 10^-6 as a tolerance. Of course the performance is going to vary between tolerances. Presumably, if I'm going to specify a tighter tolerance, I'll specify a higher maximum number of iterations, too. But if your solver runs out of iterations, you should tell me that, right? For instance, if I call your solver with a tolerance of 10^-6 and 5 iterations, you should tell me in your output: maximum number of iterations reached. That way I know your solver knows it needs more iterations. If it declares success, I'm going to assume that it correctly solved the problem and evaluate it on that basis; but if it declares that it needs more iterations, then it wouldn't be fair of me to demand that it solved the problem, or to evaluate it that way.
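The shape of what I mean is sketched below. This is only an illustration of the reporting behavior, not the actual project interface; the function name and the return fields are made up for the example.

```julia
using LinearAlgebra

# Sketch: a solver that reports a status flag instead of silently
# "succeeding" when it runs out of iterations. Names are illustrative.
function toy_solver(grad_f, x0; tol=1e-6, maxiter=100)
    x = copy(x0)
    for iter in 1:maxiter
        g = grad_f(x)
        norm(g) <= tol && return (x=x, status=:success, iters=iter)
        x -= 0.1 * g        # placeholder step; use a real line search
    end
    # Out of iterations: say so, so the caller can rerun with more.
    return (x=x, status=:max_iterations_reached, iters=maxiter)
end
```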
I do both manual inspections and a test harness, and the test-harness logic is exactly what you might expect: run it with something like that; check whether it reports that it needs more iterations, or throws an error, or something like that; and if so, go back and increase the number of iterations.

So again, you've got to have a file called project2.jl. I know someone was talking about doing this in Python. If you're still thinking about that, please do come up and chat with me, just to figure out whether you're still doing that or not.

Other questions about project 2?

Submit anything you need in order to run it; you've got the interface that I'm going to call your software on. You don't have to handle anything for loading data: you can assume I will give you IPLP problems. Well, you'll probably need the IPLP struct definition, too. I would put that kind of discussion into the report, not into what you submit, because if you submit something and it causes a problem on my end, you lose points for it. So make sure that I can load your file. If someone posts a note on edstem, I'll try to give you a test case, a test harness, that you can use. And you do not have to do integer programming.

All right. Should we go on to large-scale optimization?

So, what is large-scale optimization? I don't know. Certainly, I think we'd all agree that if you're trying to solve an optimization problem with a million variables, that's large-scale. So a million is definitely large-scale. Is 10,000 variables large-scale? Well, I would argue that 10,000 is an intermediate point, and I'll be a little more precise about this. What I really mean by large-scale, my mental concept of it, is a problem where, if you tried to use a vanilla quasi-Newton method, it wouldn't work, because you're out of memory or it's too expensive.

So can you do a quasi-Newton method with 10,000 variables? Yes. 25,000? Probably. 50,000? Yes, but it's getting expensive. 100,000? It's getting really expensive at 100,000, although you can still do it.
But you can see that there's a gradient here: it feels like there's a transition point around 10,000 variables, where things get more interesting and maybe you want to do something else.

An example: this was one of my PhD students; a former PhD student, now at Texas A&M. He was looking at a problem with 3 trillion constraints and 161 million variables, and it took days to weeks to solve. With 3 trillion constraints, you can't even really form all the constraints, so you end up with some interesting issues there. And this was just one problem out of many we would like to solve. Certainly, we're going to need some different strategies for something like that. Yes?

In this case, it was a particular method for clustering data: every data point had basically n-choose-3 associated constraints and n-choose-2 associated variables. Do we need to do this to cluster data? No. We were looking at some more interesting, slash exotic, slash really principled algorithms in this space, which are then much, much more expensive to solve than a lot of the alternatives. If you want a slightly more refined answer: we were trying to get lower bounds on the performance of some clustering algorithms.

But again, I think we all know of optimization problems with trillions of constraints -- sorry, with trillions of variables. Has anyone heard of something called ChatGPT? When they construct ChatGPT, what they're really doing is optimizing a function whose variables are the weights, or parameters, of the model. And although they haven't disclosed the count, it's reported to be in the multiple trillions.

>> Can I ask a question?

>> Yeah.

>> [INAUDIBLE]

>> Sorry. So you've got a machine learning model to approximate the constraints?

>> Yeah. Based on the given data.

>> In general, I think just throwing a machine learning model at it is bad science. On the other hand, it's very good practice, in the sense that it's a useful tool. But you don't really learn anything about the problem, and if it fails, you've got no understanding of why. So I think it's one of those things where you throw it at the problem and see if it works. And if it works, you say, all right, great; then maybe you think about it a little harder and ask whether there's a reason it works. But as far as whether I think it's a good idea or not: it is a tool in a tool chest, and tools are always good ideas.
Does that mean this particular idea is especially good? No; I don't think there's anything special about ML algorithms here. If you've got a tough optimization problem with lots of constraints, you need some way of handling them, and if there's a particular reason they work well for your application, great. But a priori, I don't see any special reason why they would work fundamentally better than other techniques.

>> [INAUDIBLE]

>> I guess we'll have to chat a little later, because with ChatGPT, the states here are constructed to build an algorithm, and it gets a little tricky to talk about some of these ideas in a way that's coherent. Certainly, you can optimize over a space of algorithms: you parameterize your algorithm by some features, and then you optimize over those. So you need a parametric class of algorithms, and then you can do things like this.

So why is large-scale optimization hard? Here's one reason it could be hard: my function takes a really long time to evaluate. If you're actually evaluating a function with trillions of variables, it might fundamentally take hours or days just to do a single function call and get a gradient. We are not going to look at methods for that particular scenario; for those, you want surrogate optimization, which is a different topic. There's a related idea called sequential quadratic programming, which also shows up a lot, where you essentially build a quadratic surrogate to your function and then spend all of your time optimizing that; every so often, you update the quadratic surrogate with calls to your function. (There's a toy sketch of that surrogate loop at the end of this passage.)

But we are going to assume that you can evaluate your function or gradient, say, 1,000 to 10,000 times within whatever time budget you want for your optimization. If you're willing to wait a week, then maybe you can go 10,000 or 100,000 times; if you're willing to wait a day, then maybe 1,000 to 10,000 times. So these things aren't so expensive that it takes a day just to evaluate your function a single time.

Does that make sense to everyone? Because this is a real difference from small-scale optimization: there, you generally assume your functions are not super hard to evaluate, although they may be; and if they are, you're simply not going to evaluate them very many times.
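Here is that toy quadratic-surrogate loop, just to make the idea concrete. It's an illustration under my own simplifying assumptions (a fixed model Hessian B, plain gradient steps on the model), not a full sequential quadratic programming method.

```julia
# Toy quadratic-surrogate loop: one expensive gradient call builds the
# model m(p) = f(x) + g'p + 0.5 p'Bp; we then take several cheap steps
# on m before refreshing it. B is a fixed, assumed model Hessian here.
function surrogate_loop(grad_f, B, x; outer=10, inner=20, step=0.1)
    for _ in 1:outer
        g = grad_f(x)                 # expensive call: refresh the model
        p = zero(x)
        for _ in 1:inner              # cheap work on the quadratic model
            p -= step * (g + B * p)   # gradient step on m at x + p
        end
        x += p
    end
    return x
end
```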
But in large-scale problems, you do get this break in how you might think about that.

If you want to be formal about it: I would want the function to take little-o of n^2 time to evaluate. That is, it takes less time to evaluate than the squared number of variables. E.g., if I've got n variables, then your function might take order n log n to evaluate, or n^(3/2), or something along those lines; it's just not order n^2.

The other assumption, which I guess I didn't state, is that you can store all of your variables. Hopefully everyone's okay with that. Again, in the ChatGPT case, where you honest-to-gosh have trillions of variables, or in the 3-trillion-constraint case, this does become an issue. But to keep our discussion compact, we are going to assume that we can store the variables.

The implication of these two assumptions is that we can run line search, and if we can run line search, then we can run gradient descent. So we can do optimization in this scenario; nothing here is fundamentally broken. We can also run conjugate gradient, which I think we're going to see next lecture. Typically, when I teach this material, I talk about conjugate gradient before this, but I decided to switch it up this year and talk about limited-memory quasi-Newton first.

But let's do it: could we do something like Newton on a 100,000-dimensional problem? So f takes as input 100,000 variables and outputs a single scalar. And this is going to be your quiz. I'd like you to think about it for a second and give one of the following answers: yes, no, maybe. And also why.

All right. I'm going to shuffle these up a little, go through them, and report out what the class thinks. "Yes, but very expensive." Yes. Yes. Maybe. Another "very expensive." No; I like the definitiveness. Sorry, I guess this one says 10^15 is too big. Yes. Another yes. Another "10^15 is too complicated." Another yes. Maybe. I'm going to call this one a maybe. No. Sorry, no. Another "yes, but very expensive." No. All right, another no. Another maybe. I think we're going to end up in a three-way tie amongst the class here in a second. "No or maybe." Okay, I'm going to go with maybe here. All right.
And at the end, the no's come in at "too big to handle." So we have almost a perfect tie amongst everyone in this class. Among the people who said it was possible, the argument was: look, nothing fundamentally goes wrong when you try to do this; 10^15 is just big. So what do we need to do? Newton needs O(n^3) work per iteration, and (10^5)^3 = 10^15. And just to store the Hessian, you need about 80 gigabytes of memory.

80 gigabytes is a lot of memory. Who has access to a computer with more than 80 gigabytes of memory? I know a few of you do. So it's not an impossible amount of memory. On the other hand, at every step you're going to have to solve a linear system of equations, which is about 10^15 work. Again, that's not impossible. So I would actually say the answer to this one is probably "maybe": it depends on how quickly you need it done. The point is that it's going to be really expensive. I'm actually pleased with this distribution of answers, in the sense that I think the no's and maybe's probably have it over the yeses; but on the other hand, if you have access to sufficient compute resources, you can do this.

Now, if n is equal to 10^6, can I get a quick show of hands for yes, no, and maybe? Can I see the yeses? Okay, six of you said yes for 10^5. Are you all going to say no for 10^6? You're going to say yes? What's that? We've got two yeses. That still won't store your Hessian; that's 8 terabytes of RAM. But on the other hand, you're right: you can rent a computer on Amazon with 32 terabytes of RAM, so we could run it on that one, which I think costs something like $200 an hour to rent. Per iteration! He's got his own research funding, by the way. No, I'm teasing. Who thinks no for this one? Okay, maybe ten. And maybe? A few.

So again, I think these are all consistent answers, because I haven't said anything about the type of problem. If your Hessian has any kind of structure in it, I think all of you would start to switch over to the yes column. If I tell you your Hessian is diagonal, you should all switch to yes, because then doing Newton is not really any more work or expense than gradient descent. Now, you do have to know that your Hessian is diagonal up front, in order to avoid computing all the irrelevant entries. (The back-of-envelope arithmetic from this exchange is spelled out below.)
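Since the 80 GB and 8 TB figures went by quickly, here is the arithmetic; this is just the dense-Hessian back-of-envelope from the discussion, assuming 8-byte floats.

```julia
# Dense-Newton back-of-envelope: memory to store the Hessian, and rough
# work per step, assuming 8 bytes per Float64 entry.
for n in (10^5, 10^6)
    bytes = 8 * n^2              # dense n-by-n Hessian
    flops = Float64(n)^3         # rough cost of one factorization/solve
    println("n = $n: ", bytes / 1e9, " GB, about ", flops, " flops per step")
end
# n = 10^5: 80 GB and ~1e15 flops; n = 10^6: 8000 GB (8 TB) and ~1e18 flops.
```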
But that would be a super easy Hessian to work with. Yes?

>> Didn't you say f was a scalar-valued function? So the Hessian would be that square, right?

>> The Hessian would be that square, yes. But I never said it couldn't have structure. It might not, of course. But if you're solving something, maybe you're solving a problem where you know your Hessian has tridiagonal structure or something like that. In particular, I solve a lot of problems where the Hessian is graph-structured, so it's a sparse matrix. If I tell you you've got to solve a million-by-million sparse linear system at every step, well, that's still expensive, but it doesn't become impossibly expensive. That's why I'd shift from the no column over to the maybe column. But again, all of this just gets harder as the problems get bigger and bigger and bigger. Yes?

>> Excuse me, a question: why is the space complexity cubed, not squared?

>> Space complexity is squared; time complexity is cubed. Oh, sorry, did I miswrite this? No, no, no: that n^3 was the work per iteration, not the storage. Okay.

We'll talk about structure in a little bit. Do quasi-Newton methods help? Quasi-Newton methods allowed us to get superlinear convergence rates, which is sort of what we want: really fast convergence. And they don't need the Hessian explicitly, right? They work with some kind of Hessian surrogate, a Hessian approximation. But I would say no, they don't help here. Why would I say no? If I tell you my answer is no, what do you think is the best reason quasi-Newton methods aren't going to help?

What's that? Right, it's about the history quasi-Newton methods keep. I think of memory as a bigger constraint than time, in that, if you're pressed, you can always let your computer run longer, but you can't get more memory. And quasi-Newton methods still take order n^2 memory, so they're going to take 8 terabytes of RAM for that million-variable problem. In fact, I would argue it's even worse: even if your Hessian does have structure, quasi-Newton methods won't capture that at all. They're going to give you a fully dense approximate matrix. All right.
But is all hope lost? Let's study BFGS for a second. Remember, BFGS maintains an approximation of your inverse Hessian. The way I like to think of it is

    T(k+1) = (I - rho*s*y') * T(k) * (I - rho*y*s') + rho*s*s',   with rho = 1/(y'*s).

Just because I get tired of writing these things out, I'm going to call the left factor L = (I - rho*s*y') and the right factor R = (I - rho*y*s'). In some other notes, you'll see V(k) used for those matrices; I'm just trying to keep things simple here as I explain the concept.

So what do we need to do? We don't actually need T(k+1) itself; what we need is to multiply it against the negative gradient. The theme of numerical linear algebra, and lots of other things, is that if a problem has structure, we should try to exploit that structure, and we do have a little structure here. Suppose I gave you a routine to multiply T(k) by a vector. I claim we could then compute T(k+1) times a vector in light of that routine. How would you do it? Well, you'd write

    p = T(k+1) * (-g) = L * ( T(k) * ( R * (-g) ) ) + rho * s * (s' * (-g)).

Here, s' * (-g) is just a scalar, and let me call R * (-g) the vector z. Hopefully I didn't screw up any constants while doing this. So what do I have to do to get z? That scalar is an inner product, and then I just need to form a linear combination of -g and that inner product times y. So does everyone agree I can compute the vector z? I've got access to the vectors y and s and the scalar rho.

I see a lot of people giving me blank stares. What are your questions about computing the quantity z? If I gave you access to y, s, and rho on the computer, and gave you the vector -g, I claim we can compute z: you just literally plug these quantities into Julia, and it'll happily do it. I'm seeing a few people nod along. If you've got questions about this, ask, because it's about to get more complicated; it's worth appreciating what's going on here first. (A small sketch of this one-step multiply, in the same notation, is below.)
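This sketch just transcribes the board computation: apply one BFGS update to a vector, given a routine for T(k) times a vector, without ever forming a matrix. The function names are mine, not anything official.

```julia
using LinearAlgebra

# Apply T(k+1)*v = (I - rho*s*y') * T(k) * (I - rho*y*s') * v + rho*s*(s'v),
# given applyTk(v) computing T(k)*v. rho = 1/(y's); only O(n) extra work.
function apply_next(applyTk, s, y, v)
    rho = 1 / dot(y, s)
    z = v - rho * dot(s, v) * y      # z = (I - rho*y*s') * v
    u = applyTk(z)                   # u = T(k) * z, via the given routine
    p = u - rho * dot(y, u) * s      # (I - rho*s*y') * u
    return p + rho * dot(s, v) * s   # add the rank-one rho*s*s' term
end

# e.g. with T(k) = I:  apply_next(v -> v, s, y, -g)  gives T(k+1) * (-g)
```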
So then, inductively: I've said we've got some way of doing T(k) times a vector. Let's call the result of T(k) times z... what's a good variable? Man, I'm running out of variables here; I've used y, s, and T. Let me go with d. So I'm going to call T(k)*z the vector d, and then the entire thing just becomes L*d, which we handle just like we handled z, plus beta, the scalar quantity from over here, times rho times s.

So what is going on? Are we all on the same page? Essentially, if I just store the changes (I think you were getting at this point with history; some of you have looked at limited-memory quasi-Newton methods already in project 1, but here we're going through them for the entire class), I can keep updating what I've seen before. So let's see what happens. Let

    V(k) = I - rho(k) * y(k) * s(k)'.

If T(k+1) = V(k)' * T(k) * V(k) + rho(k) * s(k) * s(k)', then T(k) is equal to that same expression with k switched to k-1, and this is true at every step of the process. I can substitute in one more step: T(k-1) = V(k-2)' * T(k-2) * V(k-2) + rho(k-2) * s(k-2) * s(k-2)'. And so, if I just substitute all of these together and unroll the whole recursion, I get

    T(k) = V(k-1)' * V(k-2)' * ... * V(0)' * T(0) * V(0) * ... * V(k-2) * V(k-1)
         + rho(0) * V(k-1)' * ... * V(1)' * s(0) * s(0)' * V(1) * ... * V(k-1)
         + ...
         + rho(k-2) * V(k-1)' * s(k-2) * s(k-2)' * V(k-1)
         + rho(k-1) * s(k-1) * s(k-1)'.

So I just take this guy and plug him in here, and that guy and plug him in there, except unrolled the other way. This is where you need to be a little careful to get your indices right, and where having access to a computer really helps: you can do all of this on paper and pencil, then plug it into something like Julia or Python or MATLAB and double-check that you're not off by an index somewhere. I can't remember offhand if that last term is exactly right, but I certainly assume you get the idea.
And so there's an algorithm to do this, which I guess I can write out in the last few minutes of class. Essentially, what we do is use this implicit representation, and the only thing not specified here is T(0). So it turns out, when you're using limited-memory quasi-Newton methods -- I should say, this is usually called a limited-memory quasi-Newton method -- the idea is this: when I say limited memory, we're only going to store a finite number of these updates; call that count m, to avoid colliding with the iteration index k. You might store 50 of them; you might store 100; you could store 1,000. The point is that you're not going to store n of them, because if you stored n of them, you'd be storing more than you would have stored for the Hessian itself. And then, when you want to work with them, you can compute inverse-Hessian-times-vector products using an algorithm that I can sketch out.

It is really important how you scale that initial diagonal approximation T(0). Usually, you use some scaled multiple of the identity.

>> [INAUDIBLE]

>> Scaling is important. Here is a common scaling you will see: T(0) = gamma * I with gamma = s'y / (y'y), computed from the most recent (s, y) pair. What this is trying to do is estimate the magnitude of the Hessian based on your most recent iteration, so that everything is roughly scaled. You could probably also use some ideas from momentum here; I imagine someone has worked out a few of those things. They're trying to get at this same kind of information, just with different types of techniques.

And then you just keep the m most recent updates. Usually, people will say something like "in a circular buffer": you have m slots allocated, and as new updates keep coming in, you overwrite the oldest one in your buffer.

So there's an algorithm to actually compute this T(k)-times-vector product; it's in the notes. It's not particularly interesting, in the sense that all we do is take advantage of the structure and compute things a step at a time. Conceptually, it runs down through all of the stored updates, from the most recent all the way back to the oldest, to compute some quantities, and then it runs back upward to compute the rest.
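That two-pass structure is the standard "two-loop recursion," and a compact sketch of it is below. The function name and the storage convention (oldest pair first) are my choices for the example.

```julia
using LinearAlgebra

# L-BFGS two-loop recursion: compute T(k)*q from the m stored pairs
# (s_i, y_i), oldest first, without forming any matrix. Uses the scaling
# T(0) = gamma*I with gamma = s'y / y'y from the most recent pair.
function two_loop(q, S, Y)
    m = length(S)
    rho   = [1 / dot(Y[i], S[i]) for i in 1:m]
    alpha = zeros(m)
    q = copy(q)
    for i in m:-1:1                      # downward pass: newest to oldest
        alpha[i] = rho[i] * dot(S[i], q)
        q -= alpha[i] * Y[i]
    end
    gamma = dot(S[m], Y[m]) / dot(Y[m], Y[m])
    r = gamma * q                        # apply T(0) = gamma*I
    for i in 1:m                         # upward pass: oldest to newest
        beta = rho[i] * dot(Y[i], r)
        r += (alpha[i] - beta) * S[i]
    end
    return r                             # approximately T(k) * (original q)
end

# Usage: the search direction is  d = -two_loop(g, S, Y)
```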
It's the kind of algorithm where I imagine anyone in this class could work it out if you were given a week, or maybe even just a day. It's nothing really super sophisticated or deep.

So that's the high-level pitch for limited-memory quasi-Newton methods. I don't actually think anyone has ever proved that the convergence rate of these methods is any faster than linear. But empirically, they do go quite a bit faster than linear, so you really do benefit from having this Hessian information on a lot of problems. But again, I don't think there has ever been a formal proof of that; at least as of a few years ago, it was still an open question, so I should double-check whether that's still the case.

What are people's questions about these methods? I don't have any good advice on how to pick the number of updates you store, m. In general, you don't want it too big, and you also don't want it too small; 50 to 100 tends to be where people often set it. The reason you don't want it too big is that, yes, you do a more accurate job of approximating your Hessian, but you then spend a lot more work computing these products, and it's not clear you get a benefit from that versus doing something a little more gradient-descent-like. So you have a trade-off between how much work you do at each iteration inside your optimization algorithm versus work towards your objective. If you wanted to do something fancy, I suspect that once you start getting close to a solution, you might want to increase m a little, to make the directions a bit more Newton-like there; so you might want something that grows slightly as the iterations proceed.

And I should also say that this is for unconstrained optimization. If you want to use these ideas on constrained optimization, then you go down the route of some of the augmented Lagrangian methods, all of which convert the constrained problem into an unconstrained problem that you solve at every iteration.

All right. Questions about this before I jump into a different topic? Give me a moment to figure out what I want to do next, because I don't have time to do too many things. So let me just give you a bit on
structured Hessians.

So imagine something like f(x) = c'x - mu * sum_i log(x_i). (Shoot, there's a scalar mu here whose value I can't remember, but there's some scalar.) The Hessian of this is basically diagonal. Where might you see this? Well, if you're doing some type of linear fitting where you want non-negativity and you're using a log-barrier term, then you're going to get a very nicely structured Hessian, provided your only nonlinearity comes from the log-barrier term. In which case, again, you can totally use Newton on very large-scale problems of this form, because your Hessian has tons of structure: it's just diagonal, with no off-diagonal terms at all.

Let me do another one. In a lot of problems I look at, you have something like

    f(x) = x' * L * x + sum_i x_i^p,

where L is a Laplacian matrix, if you know what that is; the point is that it's a sparse matrix with structure I know, basically an input. And I also have some power of x over here; think p = 4, something like that. In which case

    H(x) = 2L + p*(p-1) * diag(x_i^(p-2)).

So again, there's some sparsity structure to your Hessian, combined with some other type of nonlinearity: the Hessian isn't constant, but it has a very well-defined structure.

Another common structure you'll see is a banded diagonal. Where do these come from? If I look at all the interactions and, for whatever reason, each variable only interacts with, say, its second-nearest neighbors, then you get these local coupling terms, and that gives you a banded diagonal. This shows up a lot in signal processing, where you have some type of implicit filter structure over a sequential set of variables. It also shows up in partial differential equations, if you're doing anything on a one-dimensional grid where you just have nearest-neighbor interactions. Other places where this shows up? I don't know; it shows up all over the place. So that's another common structure you'll see. (A small sketch of exploiting the sparse Laplacian example in a Newton step is below.)
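To make the "exploit the structure" point concrete, here's a sketch of a Newton step for that Laplacian-plus-power example, with a made-up 1-D chain Laplacian standing in for L. The Hessian stays sparse, so the linear solve is cheap compared to a dense factorization.

```julia
using SparseArrays

# Newton step for f(x) = x'Lx + sum(x.^p):
#   gradient  g = 2L*x + p*x.^(p-1)
#   Hessian   H = 2L + p*(p-1)*diag(x.^(p-2))   -- still sparse!
function newton_step(L::SparseMatrixCSC, x, p)
    g = 2 * (L * x) + p * x .^ (p - 1)
    H = 2 * L + spdiagm(0 => p * (p - 1) * x .^ (p - 2))
    return x - H \ g          # sparse factorization under the hood
end

# Toy usage with a made-up chain Laplacian:
n = 10^5
L = spdiagm(-1 => -ones(n - 1), 0 => 2 * ones(n), 1 => -ones(n - 1))
x = newton_step(L, ones(n), 4)
```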
What else? Has anyone else seen structured Hessians? I guess your Hessian could be low-rank; that would be another type of structure you might see. But those are slightly less common, since usually you want your Hessian to be full-rank, for a lot of different reasons.

But again, the point here is not that these particular structures are the ones you'll see. It's that if you have any of these structures in your Hessian, and you can show that they arise, you can still do a lot in large-scale optimization without resorting to things like limited-memory quasi-Newton or those other types of ideas.

All right, I think that's it for today. We'll resume next time talking about conjugate gradient, and how to take advantage of some of these types of sparsity. So I will see you folks on Thursday.