All right. Welcome back. We have a sparse group here today, so we'll definitely have a quiz; you're all in luck. What are we talking about today? Large-scale optimization. There are some handouts. I didn't get a chance to post them before class, but they go over a lot of the same material we'll see here, so you can check them out after class, too.

So, things we're going to see: why is large-scale optimization challenging? We'll then talk about an idea called limited-memory quasi-Newton, which gets you something a little bit better. It's a hybrid technique that gives you a little bit of curvature information; it's not perfect, but it generally does a little better than pure gradient descent. Another idea in that vein is conjugate gradient. And then we'll go back to cases where you can actually do something with the Hessian: sparsity and exact Newton. I don't think we're going to get to all of this today, so whatever we don't get to, we'll do on Thursday.

And, yeah, in terms of projects, I should say a little about projects. I forget exactly when they're due now, but they're due whenever I said they were due. Two weeks? Okay, due in two weeks. You don't have any more interim status reports; your last one was last time. Again, in terms of things that are important: if you're doing the optimization on the surface of the hypersphere, I do expect you to try some gradient-based optimization. If you're not going to do any gradient-based optimization, you probably want to have a conversation with me. And in particular, if you're doing gradient-based optimization, what do you always do with what you call your gradient? Check it. If you don't check it, you're probably going to lose some points on your write-up. If you implement your own optimization software, I'm not going to review it, but I do want you to give me some justification that it's plausibly correct. What might that look like? Take some problem where you know the answer, and make sure you get it.
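For concreteness, a gradient check against finite differences might look something like the sketch below. This is just my illustration of the idea, not required course code; the names f, grad_f, and the step size h are placeholders.

```julia
using LinearAlgebra

# Minimal gradient-check sketch: compare an analytic gradient to a
# central finite difference at a test point. If the gradient is coded
# correctly, expect agreement to several digits.
function check_gradient(f, grad_f, x; h=1e-6)
    g  = grad_f(x)
    gd = similar(x)
    for i in eachindex(x)
        e = zeros(length(x)); e[i] = h
        gd[i] = (f(x + e) - f(x - e)) / (2h)   # central difference
    end
    relerr = norm(g - gd) / max(norm(g), eps())
    println("relative gradient error: ", relerr)
    return relerr
end
```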
>> Many of the algorithms for the software are coming from textbooks. Is it okay if we basically go through those proofs?

>> That is a theoretical justification that what you have done is, in principle, correct. It does not mean your software is correct. So I'm looking for a software-based demonstration of this, not a "hey, in theory, these ideas work." Yes, in theory, lots of ideas work, but you may have made an error when you translated those ideas onto the computer. A subtle error, but still: how do you know you haven't made a subtle error in translating the algorithms onto the computer? I would like to see some justification of that as well.

If you're using libraries, again, that is not an excuse to trust the library. Libraries have errors just like your software does. They're less likely, but they still happen. So I would expect the same thing there. Yes?

>> What's the value you use for project 2?

>> I'll get to project 2 in a second. So, again, on project 1: essentially, I want a write-up where I don't have to read your code to understand that what you've done is correct. You should definitely justify that in your write-up, and then hopefully you've done something interesting in this general space. Other questions about project 1, the hypersphere? Yes?

>> Just a quick question. On Gradescope, there's a place to upload the report as well as the code. If we're doing project 1, do we...

>> You do not need to upload code.

>> Just a report?

>> Yeah, just a report.

>> We're not going to get marked down for not uploading anything?

>> You will not get marked down for not uploading anything. I'm only going to download the code for project 2. I guess if you upload code for project 1, I'm not going to run it, and I probably won't even look at it. Although I wouldn't test that, because I may look at it; I should say, test that at your own risk. If you want to upload something fun there, well, there you go. I'd suggest your time could be better spent improving your writing, or running your report through some grammar or spelling checkers.

All right. Other questions about project 1? So, project 2: I definitely need a Julia file, because what I'm going to do is include your Julia file. I'll probably specify a file-name convention that I'll want you to use, like your group's last name or something; actually, it will probably just be project2.jl. Let me make that clear. And if you need obscure packages, you definitely want to check those with me ahead of time.
If you're using things we've used in class, you can assume that I'll have them. But if you're using exotic or obscure packages, just let me know ahead of time. In particular, you probably shouldn't be using things like JuMP or other solver packages. I don't want to say those are forbidden, because if you have a good reason, I'm open to it. But you should have a very good reason for depending on any solver package. So I'm going to put a note here: packages are fine, except solvers.

What tolerance? 10^-6? 10^-4? 10^-5? I don't know; I haven't decided which one I'm going to run yet. But I would certainly try to make sure my software is well-behaved for a variety of tolerances. In the past, people have complained because I ran with a different tolerance than the one they thought I was going to run with, and then they said, "well, if I tweak this constant here, then it works." Yes, but I told you I wasn't going to commit to a particular tolerance ahead of time. If you're overly reliant on how you've tweaked your algorithm for one specific tolerance, that is the kind of thing you tend to lose points for on this one. So the best advice is: try a bunch of tolerances, and make sure your algorithms behave nicely and/or fail gracefully for lots of them. Yes?

Yes, you can do smarter linear algebra; I should be clear, smarter linear system solves, let me put it that way. But again, if you're using some obscure package, or things we haven't used in class, you might just want to check those with me. I haven't seen anything in the project reports yet that alarmed me, but if I do see it, I will let you know. I might also check tolerances like 10^-2 and 10^-3, to see whether you're solving the problems appropriately for the tolerance as well.

You look skeptical about 10^-6 as a tolerance. Of course the performance is going to vary between tolerances. Presumably, if I'm going to specify a tighter tolerance, I'll specify a higher maximum number of iterations, too. But if your solver runs out of iterations, you should tell me that, right? For instance, if I call your solver with a tolerance of 10^-6 and 5 iterations, you should tell me in your output: maximum number of iterations reached. That way I know your solver knows it needs more iterations. If it declares success, I'm going to assume that it correctly solved the problem and evaluate it on that basis; but if it declares that it needs more iterations, then it wouldn't be fair of me to demand that it solved the problem, or to evaluate it that way.
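The shape of what I mean is sketched below. This is only an illustration of the reporting behavior, not the actual project interface; the function name and the return fields are made up for the example.

```julia
using LinearAlgebra

# Sketch: a solver that reports a status flag instead of silently
# "succeeding" when it runs out of iterations. Names are illustrative.
function toy_solver(grad_f, x0; tol=1e-6, maxiter=100)
    x = copy(x0)
    for iter in 1:maxiter
        g = grad_f(x)
        norm(g) <= tol && return (x=x, status=:success, iters=iter)
        x -= 0.1 * g        # placeholder step; use a real line search
    end
    # Out of iterations: say so, so the caller can rerun with more.
    return (x=x, status=:max_iterations_reached, iters=maxiter)
end
```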
I do both manual inspections and a test harness, and the test-harness logic is exactly what you might expect: run it with something like that; check whether it reports that it needs more iterations, or throws an error, or something like that; and if so, go back and increase the number of iterations.

So again, you've got to have a file called project2.jl. I know someone was talking about doing this in Python. If you're still thinking about that, please do come up and chat with me, just to figure out whether you're still doing that or not.

Other questions about project 2?

Submit anything you need in order to run it; you've got the interface that I'm going to call your software on. You don't have to handle anything for loading data: you can assume I will give you IPLP problems. Well, you'll probably need the IPLP struct definition, too. I would put that kind of discussion into the report, not into what you submit, because if you submit something and it causes a problem on my end, you lose points for it. So make sure that I can load your file. If someone posts a note on edstem, I'll try to give you a test case, a test harness, that you can use. And you do not have to do integer programming.

All right. Should we go on to large-scale optimization?

So, what is large-scale optimization? I don't know. Certainly, I think we'd all agree that if you're trying to solve an optimization problem with a million variables, that's large-scale. So a million is definitely large-scale. Is 10,000 variables large-scale? Well, I would argue that 10,000 is an intermediate point, and I'll be a little more precise about this. What I really mean by large-scale, my mental concept of it, is a problem where, if you tried to use a vanilla quasi-Newton method, it wouldn't work, because you're out of memory or it's too expensive.

So can you do a quasi-Newton method with 10,000 variables? Yes. 25,000? Probably. 50,000? Yes, but it's getting expensive. 100,000? It's getting really expensive at 100,000, although you can still do it.
But you can see that there's a gradient here: it feels like there's a transition point around 10,000 variables, where things get more interesting and maybe you want to do something else.

An example: this was one of my PhD students; a former PhD student, now at Texas A&M. He was looking at a problem with 3 trillion constraints and 161 million variables, and it took days to weeks to solve. With 3 trillion constraints, you can't even really form all the constraints, so you end up with some interesting issues there. And this was just one problem out of many we would like to solve. Certainly, we're going to need some different strategies for something like that. Yes?

In this case, it was a particular method for clustering data: every data point had basically n-choose-3 associated constraints and n-choose-2 associated variables. Do we need to do this to cluster data? No. We were looking at some more interesting, slash exotic, slash really principled algorithms in this space, which are then much, much more expensive to solve than a lot of the alternatives. If you want a slightly more refined answer: we were trying to get lower bounds on the performance of some clustering algorithms.

But again, I think we all know of optimization problems with trillions of constraints -- sorry, with trillions of variables. Has anyone heard of something called ChatGPT? When they construct ChatGPT, what they're really doing is optimizing a function whose variables are the weights, or parameters, of the model. And although they haven't disclosed the count, it's reported to be in the multiple trillions.

>> Can I ask a question?

>> Yeah.

>> [INAUDIBLE]

>> Sorry. So you've got a machine learning model to approximate the constraints?

>> Yeah. Based on the given data.

>> In general, I think just throwing a machine learning model at it is bad science. On the other hand, it's very good practice, in the sense that it's a useful tool. But you don't really learn anything about the problem, and if it fails, you've got no understanding of why. So I think it's one of those things where you throw it at the problem and see if it works. And if it works, you say, all right, great; then maybe you think about it a little harder and ask whether there's a reason it works. But as far as whether I think it's a good idea or not: it is a tool in a tool chest, and tools are always good ideas.
Does that mean this particular idea is especially good? No; I don't think there's anything special about ML algorithms here. If you've got a tough optimization problem with lots of constraints, you need some way of handling them, and if there's a particular reason they work well for your application, great. But a priori, I don't see any special reason why they would work fundamentally better than other techniques.

>> [INAUDIBLE]

>> I guess we'll have to chat a little later, because with ChatGPT, the states here are constructed to build an algorithm, and it gets a little tricky to talk about some of these ideas in a way that's coherent. Certainly, you can optimize over a space of algorithms: you parameterize your algorithm by some features, and then you optimize over those. So you need a parametric class of algorithms, and then you can do things like this.

So why is large-scale optimization hard? Here's one reason it could be hard: my function takes a really long time to evaluate. If you're actually evaluating a function with trillions of variables, it might fundamentally take hours or days just to do a single function call and get a gradient. We are not going to look at methods for that particular scenario; for those, you want surrogate optimization, which is a different topic. There's a related idea called sequential quadratic programming, which also shows up a lot, where you essentially build a quadratic surrogate to your function and then spend all of your time optimizing that; every so often, you update the quadratic surrogate with calls to your function. (There's a toy sketch of that surrogate loop at the end of this passage.)

But we are going to assume that you can evaluate your function or gradient, say, 1,000 to 10,000 times within whatever time budget you want for your optimization. If you're willing to wait a week, then maybe you can go 10,000 or 100,000 times; if you're willing to wait a day, then maybe 1,000 to 10,000 times. So these things aren't so expensive that it takes a day just to evaluate your function a single time.

Does that make sense to everyone? Because this is a real difference from small-scale optimization: there, you generally assume your functions are not super hard to evaluate, although they may be; and if they are, you're simply not going to evaluate them very many times.
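Here is that toy quadratic-surrogate loop, just to make the idea concrete. It's an illustration under my own simplifying assumptions (a fixed model Hessian B, plain gradient steps on the model), not a full sequential quadratic programming method.

```julia
# Toy quadratic-surrogate loop: one expensive gradient call builds the
# model m(p) = f(x) + g'p + 0.5 p'Bp; we then take several cheap steps
# on m before refreshing it. B is a fixed, assumed model Hessian here.
function surrogate_loop(grad_f, B, x; outer=10, inner=20, step=0.1)
    for _ in 1:outer
        g = grad_f(x)                 # expensive call: refresh the model
        p = zero(x)
        for _ in 1:inner              # cheap work on the quadratic model
            p -= step * (g + B * p)   # gradient step on m at x + p
        end
        x += p
    end
    return x
end
```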
But in large-scale problems, you do get this break in how you might think about that.

If you want to be formal about it: I would want the function to take little-o of n^2 time to evaluate. That is, it takes less time to evaluate than the squared number of variables. E.g., if I've got n variables, then your function might take order n log n to evaluate, or n^(3/2), or something along those lines; it's just not order n^2.

The other assumption, which I guess I didn't state, is that you can store all of your variables. Hopefully everyone's okay with that. Again, in the ChatGPT case, where you honest-to-gosh have trillions of variables, or in the 3-trillion-constraint case, this does become an issue. But to keep our discussion compact, we are going to assume that we can store the variables.

The implication of these two assumptions is that we can run line search, and if we can run line search, then we can run gradient descent. So we can do optimization in this scenario; nothing here is fundamentally broken. We can also run conjugate gradient, which I think we're going to see next lecture. Typically, when I teach this material, I talk about conjugate gradient before this, but I decided to switch it up this year and talk about limited-memory quasi-Newton first.

But let's do it: could we do something like Newton on a 100,000-dimensional problem? So f takes as input 100,000 variables and outputs a single scalar. And this is going to be your quiz. I'd like you to think about it for a second and give one of the following answers: yes, no, maybe. And also why.

All right. I'm going to shuffle these up a little, go through them, and report out what the class thinks. "Yes, but very expensive." Yes. Yes. Maybe. Another "very expensive." No; I like the definitiveness. Sorry, I guess this one says 10^15 is too big. Yes. Another yes. Another "10^15 is too complicated." Another yes. Maybe. I'm going to call this one a maybe. No. Sorry, no. Another "yes, but very expensive." No. All right, another no. Another maybe. I think we're going to end up in a three-way tie amongst the class here in a second. "No or maybe." Okay, I'm going to go with maybe here. All right.
And at the end, the no's come in at "too big to handle." So we have almost a perfect tie amongst everyone in this class. Among the people who said it was possible, the argument was: look, nothing fundamentally goes wrong when you try to do this; 10^15 is just big. So what do we need to do? Newton needs O(n^3) work per iteration, and (10^5)^3 = 10^15. And just to store the Hessian, you need about 80 gigabytes of memory.

80 gigabytes is a lot of memory. Who has access to a computer with more than 80 gigabytes of memory? I know a few of you do. So it's not an impossible amount of memory. On the other hand, at every step you're going to have to solve a linear system of equations, which is about 10^15 work. Again, that's not impossible. So I would actually say the answer to this one is probably "maybe": it depends on how quickly you need it done. The point is that it's going to be really expensive. I'm actually pleased with this distribution of answers, in the sense that I think the no's and maybe's probably have it over the yeses; but on the other hand, if you have access to sufficient compute resources, you can do this.

Now, if n is equal to 10^6, can I get a quick show of hands for yes, no, and maybe? Can I see the yeses? Okay, six of you said yes for 10^5. Are you all going to say no for 10^6? You're going to say yes? What's that? We've got two yeses. That still won't store your Hessian; that's 8 terabytes of RAM. But on the other hand, you're right: you can rent a computer on Amazon with 32 terabytes of RAM, so we could run it on that one, which I think costs something like $200 an hour to rent. Per iteration! He's got his own research funding, by the way. No, I'm teasing. Who thinks no for this one? Okay, maybe ten. And maybe? A few.

So again, I think these are all consistent answers, because I haven't said anything about the type of problem. If your Hessian has any kind of structure in it, I think all of you would start to switch over to the yes column. If I tell you your Hessian is diagonal, you should all switch to yes, because then doing Newton is not really any more work or expense than gradient descent. Now, you do have to know that your Hessian is diagonal up front, in order to avoid computing all the irrelevant entries. (The back-of-envelope arithmetic from this exchange is spelled out below.)
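Since the 80 GB and 8 TB figures went by quickly, here is the arithmetic; this is just the dense-Hessian back-of-envelope from the discussion, assuming 8-byte floats.

```julia
# Dense-Newton back-of-envelope: memory to store the Hessian, and rough
# work per step, assuming 8 bytes per Float64 entry.
for n in (10^5, 10^6)
    bytes = 8 * n^2              # dense n-by-n Hessian
    flops = Float64(n)^3         # rough cost of one factorization/solve
    println("n = $n: ", bytes / 1e9, " GB, about ", flops, " flops per step")
end
# n = 10^5: 80 GB and ~1e15 flops; n = 10^6: 8000 GB (8 TB) and ~1e18 flops.
```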
But that would be a super easy Hessian to work with. Yes?

>> Didn't you say f was a scalar-valued function? So the Hessian would be that square, right?

>> The Hessian would be that square, yes. But I never said it couldn't have structure. It might not, of course. But if you're solving something, maybe you're solving a problem where you know your Hessian has tridiagonal structure or something like that. In particular, I solve a lot of problems where the Hessian is graph-structured, so it's a sparse matrix. If I tell you you've got to solve a million-by-million sparse linear system at every step, well, that's still expensive, but it doesn't become impossibly expensive. That's why I'd shift from the no column over to the maybe column. But again, all of this just gets harder as the problems get bigger and bigger and bigger. Yes?

>> Excuse me, a question: why is the space complexity cubed, not squared?

>> Space complexity is squared; time complexity is cubed. Oh, sorry, did I miswrite this? No, no, no: that n^3 was the work per iteration, not the storage. Okay.

We'll talk about structure in a little bit. Do quasi-Newton methods help? Quasi-Newton methods allowed us to get superlinear convergence rates, which is sort of what we want: really fast convergence. And they don't need the Hessian explicitly, right? They work with some kind of Hessian surrogate, a Hessian approximation. But I would say no, they don't help here. Why would I say no? If I tell you my answer is no, what do you think is the best reason quasi-Newton methods aren't going to help?

What's that? Right, it's about the history quasi-Newton methods keep. I think of memory as a bigger constraint than time, in that, if you're pressed, you can always let your computer run longer, but you can't get more memory. And quasi-Newton methods still take order n^2 memory, so they're going to take 8 terabytes of RAM for that million-variable problem. In fact, I would argue it's even worse: even if your Hessian does have structure, quasi-Newton methods won't capture that at all. They're going to give you a fully dense approximate matrix. All right.
But is all hope lost? Let's study BFGS for a second. Remember, BFGS maintains an approximation of your inverse Hessian. The way I like to think of it is

    T(k+1) = (I - rho*s*y') * T(k) * (I - rho*y*s') + rho*s*s',   with rho = 1/(y'*s).

Just because I get tired of writing these things out, I'm going to call the left factor L = (I - rho*s*y') and the right factor R = (I - rho*y*s'). In some other notes, you'll see V(k) used for those matrices; I'm just trying to keep things simple here as I explain the concept.

So what do we need to do? We don't actually need T(k+1) itself; what we need is to multiply it against the negative gradient. The theme of numerical linear algebra, and lots of other things, is that if a problem has structure, we should try to exploit that structure, and we do have a little structure here. Suppose I gave you a routine to multiply T(k) by a vector. I claim we could then compute T(k+1) times a vector in light of that routine. How would you do it? Well, you'd write

    p = T(k+1) * (-g) = L * ( T(k) * ( R * (-g) ) ) + rho * s * (s' * (-g)).

Here, s' * (-g) is just a scalar, and let me call R * (-g) the vector z. Hopefully I didn't screw up any constants while doing this. So what do I have to do to get z? That scalar is an inner product, and then I just need to form a linear combination of -g and that inner product times y. So does everyone agree I can compute the vector z? I've got access to the vectors y and s and the scalar rho.

I see a lot of people giving me blank stares. What are your questions about computing the quantity z? If I gave you access to y, s, and rho on the computer, and gave you the vector -g, I claim we can compute z: you just literally plug these quantities into Julia, and it'll happily do it. I'm seeing a few people nod along. If you've got questions about this, ask, because it's about to get more complicated; it's worth appreciating what's going on here first. (A small sketch of this one-step multiply, in the same notation, is below.)
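This sketch just transcribes the board computation: apply one BFGS update to a vector, given a routine for T(k) times a vector, without ever forming a matrix. The function names are mine, not anything official.

```julia
using LinearAlgebra

# Apply T(k+1)*v = (I - rho*s*y') * T(k) * (I - rho*y*s') * v + rho*s*(s'v),
# given applyTk(v) computing T(k)*v. rho = 1/(y's); only O(n) extra work.
function apply_next(applyTk, s, y, v)
    rho = 1 / dot(y, s)
    z = v - rho * dot(s, v) * y      # z = (I - rho*y*s') * v
    u = applyTk(z)                   # u = T(k) * z, via the given routine
    p = u - rho * dot(y, u) * s      # (I - rho*s*y') * u
    return p + rho * dot(s, v) * s   # add the rank-one rho*s*s' term
end

# e.g. with T(k) = I:  apply_next(v -> v, s, y, -g)  gives T(k+1) * (-g)
```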
So then, inductively: I've said we've got some way of doing T(k) times a vector. Let's call the result of T(k) times z... what's a good variable? Man, I'm running out of variables here; I've used y, s, and T. Let me go with d. So I'm going to call T(k)*z the vector d, and then the entire thing just becomes L*d, which we handle just like we handled z, plus beta, the scalar quantity from over here, times rho times s.

So what is going on? Are we all on the same page? Essentially, if I just store the changes (I think you were getting at this point with history; some of you have looked at limited-memory quasi-Newton methods already in project 1, but here we're going through them for the entire class), I can keep updating what I've seen before. So let's see what happens. Let

    V(k) = I - rho(k) * y(k) * s(k)'.

If T(k+1) = V(k)' * T(k) * V(k) + rho(k) * s(k) * s(k)', then T(k) is equal to that same expression with k switched to k-1, and this is true at every step of the process. I can substitute in one more step: T(k-1) = V(k-2)' * T(k-2) * V(k-2) + rho(k-2) * s(k-2) * s(k-2)'. And so, if I just substitute all of these together and unroll the whole recursion, I get

    T(k) = V(k-1)' * V(k-2)' * ... * V(0)' * T(0) * V(0) * ... * V(k-2) * V(k-1)
         + rho(0) * V(k-1)' * ... * V(1)' * s(0) * s(0)' * V(1) * ... * V(k-1)
         + ...
         + rho(k-2) * V(k-1)' * s(k-2) * s(k-2)' * V(k-1)
         + rho(k-1) * s(k-1) * s(k-1)'.

So I just take this guy and plug him in here, and that guy and plug him in there, except unrolled the other way. This is where you need to be a little careful to get your indices right, and where having access to a computer really helps: you can do all of this on paper and pencil, then plug it into something like Julia or Python or MATLAB and double-check that you're not off by an index somewhere. I can't remember offhand if that last term is exactly right, but I certainly assume you get the idea.
And so there's an algorithm to do this, which I guess I can write out in the last few minutes of class. Essentially, what we do is use this implicit representation, and the only thing not specified here is T(0). So it turns out, when you're using limited-memory quasi-Newton methods -- I should say, this is usually called a limited-memory quasi-Newton method -- the idea is this: when I say limited memory, we're only going to store a finite number of these updates; call that count m, to avoid colliding with the iteration index k. You might store 50 of them; you might store 100; you could store 1,000. The point is that you're not going to store n of them, because if you stored n of them, you'd be storing more than you would have stored for the Hessian itself. And then, when you want to work with them, you can compute inverse-Hessian-times-vector products using an algorithm that I can sketch out.

It is really important how you scale that initial diagonal approximation T(0). Usually, you use some scaled multiple of the identity.

>> [INAUDIBLE]

>> Scaling is important. Here is a common scaling you will see: T(0) = gamma * I with gamma = s'y / (y'y), computed from the most recent (s, y) pair. What this is trying to do is estimate the magnitude of the Hessian based on your most recent iteration, so that everything is roughly scaled. You could probably also use some ideas from momentum here; I imagine someone has worked out a few of those things. They're trying to get at this same kind of information, just with different types of techniques.

And then you just keep the m most recent updates. Usually, people will say something like "in a circular buffer": you have m slots allocated, and as new updates keep coming in, you overwrite the oldest one in your buffer.

So there's an algorithm to actually compute this T(k)-times-vector product; it's in the notes. It's not particularly interesting, in the sense that all we do is take advantage of the structure and compute things a step at a time. Conceptually, it runs down through all of the stored updates, from the most recent all the way back to the oldest, to compute some quantities, and then it runs back upward to compute the rest.
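That two-pass structure is the standard "two-loop recursion," and a compact sketch of it is below. The function name and the storage convention (oldest pair first) are my choices for the example.

```julia
using LinearAlgebra

# L-BFGS two-loop recursion: compute T(k)*q from the m stored pairs
# (s_i, y_i), oldest first, without forming any matrix. Uses the scaling
# T(0) = gamma*I with gamma = s'y / y'y from the most recent pair.
function two_loop(q, S, Y)
    m = length(S)
    rho   = [1 / dot(Y[i], S[i]) for i in 1:m]
    alpha = zeros(m)
    q = copy(q)
    for i in m:-1:1                      # downward pass: newest to oldest
        alpha[i] = rho[i] * dot(S[i], q)
        q -= alpha[i] * Y[i]
    end
    gamma = dot(S[m], Y[m]) / dot(Y[m], Y[m])
    r = gamma * q                        # apply T(0) = gamma*I
    for i in 1:m                         # upward pass: oldest to newest
        beta = rho[i] * dot(Y[i], r)
        r += (alpha[i] - beta) * S[i]
    end
    return r                             # approximately T(k) * (original q)
end

# Usage: the search direction is  d = -two_loop(g, S, Y)
```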
It's the kind of algorithm where I imagine anyone in this class could work it out if you were given a week, or maybe even just a day. It's nothing really super sophisticated or deep.

So that's the high-level pitch for limited-memory quasi-Newton methods. I don't actually think anyone has ever proved that the convergence rate of these methods is any faster than linear. But empirically, they do go quite a bit faster than linear, so you really do benefit from having this Hessian information on a lot of problems. But again, I don't think there has ever been a formal proof of that; at least as of a few years ago, it was still an open question, so I should double-check whether that's still the case.

What are people's questions about these methods? I don't have any good advice on how to pick the number of updates you store, m. In general, you don't want it too big, and you also don't want it too small; 50 to 100 tends to be where people often set it. The reason you don't want it too big is that, yes, you do a more accurate job of approximating your Hessian, but you then spend a lot more work computing these products, and it's not clear you get a benefit from that versus doing something a little more gradient-descent-like. So you have a trade-off between how much work you do at each iteration inside your optimization algorithm versus work towards your objective. If you wanted to do something fancy, I suspect that once you start getting close to a solution, you might want to increase m a little, to make the directions a bit more Newton-like there; so you might want something that grows slightly as the iterations proceed.

And I should also say that this is for unconstrained optimization. If you want to use these ideas on constrained optimization, then you go down the route of some of the augmented Lagrangian methods, all of which convert the constrained problem into an unconstrained problem that you solve at every iteration.

All right. Questions about this before I jump into a different topic? Give me a moment to figure out what I want to do next, because I don't have time to do too many things. So let me just give you a bit on
structured Hessians.

So imagine something like f(x) = c'x - mu * sum_i log(x_i). (Shoot, there's a scalar mu here whose value I can't remember, but there's some scalar.) The Hessian of this is basically diagonal. Where might you see this? Well, if you're doing some type of linear fitting where you want non-negativity and you're using a log-barrier term, then you're going to get a very nicely structured Hessian, provided your only nonlinearity comes from the log-barrier term. In which case, again, you can totally use Newton on very large-scale problems of this form, because your Hessian has tons of structure: it's just diagonal, with no off-diagonal terms at all.

Let me do another one. In a lot of problems I look at, you have something like

    f(x) = x' * L * x + sum_i x_i^p,

where L is a Laplacian matrix, if you know what that is; the point is that it's a sparse matrix with structure I know, basically an input. And I also have some power of x over here; think p = 4, something like that. In which case

    H(x) = 2L + p*(p-1) * diag(x_i^(p-2)).

So again, there's some sparsity structure to your Hessian, combined with some other type of nonlinearity: the Hessian isn't constant, but it has a very well-defined structure.

Another common structure you'll see is a banded diagonal. Where do these come from? If I look at all the interactions and, for whatever reason, each variable only interacts with, say, its second-nearest neighbors, then you get these local coupling terms, and that gives you a banded diagonal. This shows up a lot in signal processing, where you have some type of implicit filter structure over a sequential set of variables. It also shows up in partial differential equations, if you're doing anything on a one-dimensional grid where you just have nearest-neighbor interactions. Other places where this shows up? I don't know; it shows up all over the place. So that's another common structure you'll see. (A small sketch of exploiting the sparse Laplacian example in a Newton step is below.)
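To make the "exploit the structure" point concrete, here's a sketch of a Newton step for that Laplacian-plus-power example, with a made-up 1-D chain Laplacian standing in for L. The Hessian stays sparse, so the linear solve is cheap compared to a dense factorization.

```julia
using SparseArrays

# Newton step for f(x) = x'Lx + sum(x.^p):
#   gradient  g = 2L*x + p*x.^(p-1)
#   Hessian   H = 2L + p*(p-1)*diag(x.^(p-2))   -- still sparse!
function newton_step(L::SparseMatrixCSC, x, p)
    g = 2 * (L * x) + p * x .^ (p - 1)
    H = 2 * L + spdiagm(0 => p * (p - 1) * x .^ (p - 2))
    return x - H \ g          # sparse factorization under the hood
end

# Toy usage with a made-up chain Laplacian:
n = 10^5
L = spdiagm(-1 => -ones(n - 1), 0 => 2 * ones(n), 1 => -ones(n - 1))
x = newton_step(L, ones(n), 4)
```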
What else? Has anyone else seen structured Hessians? I guess your Hessian could be low-rank; that would be another type of structure you might see. But those are slightly less common, since usually you want your Hessian to be full-rank, for a lot of different reasons.

But again, the point here is not that these particular structures are the ones you'll see. It's that if you have any of these structures in your Hessian, and you can show that they arise, you can still do a lot in large-scale optimization without resorting to things like limited-memory quasi-Newton or those other types of ideas.

All right, I think that's it for today. We'll resume next time talking about conjugate gradient, and how to take advantage of some of these types of sparsity. So I will see you folks on Thursday.