## Deep Policy Gradient Algorithms: A Closer Look

>>So today we’re going to

have Logan Engstrom speaking, he’s a student in Madry’s Lab. He’s going to be talking about

Reliable Machine Learning and Reinforcement Learning.>>Thanks, Jerry. So today we’re going to talk about

the algorithmic aspects of Deep Reinforcement Learning. This is joint work done with Andrew, right there, and Dimitris and Shibani, and also some of

our collaborators at Two Sigma, and of course, our wonderful

advisor, Aleksander Madry. So let’s just get started. So we’ve all heard quite a lot

about Reinforcement Learning, probably the most

well-publicized example of Reinforcement Learning

recently has been AlphaGo, in which DeepMind made a Reinforcement Learning algorithm to play the game of Go. There are also applications in robotics as well as

even self-driving cars. But all these applications

beg the question of whether Reinforcement Learning is really ready for prime time, and the answer is no. Deep Reinforcement Learning is pretty unreliable even in very simple settings, and you can see this

in these two examples. One example, so there’s

this little half cheetah and it’s supposed to be running along like a real cheetah except

in two dimensions. But it’s not really doing

that; it’s going on its back because it fell into a local minimum

and it can’t get out. Another example is

this little reacher robot, and it has an arm and it’s supposed to

be reaching to touch the dot, but what happened was that at initialization, this reacher robot had really high weights or something, and so it just got into a spin, and it takes too many steps. It’s a classic exploration-exploitation trade-off, and it was not able to get out of this spin because it’s in another local minimum. So how do we really get to this

reliable Reinforcement Learning? Of course, there’s a bunch of

problems that need to be solved, like value alignment

among other things. But the avenue that we’re

going to look at today is obtaining an algorithmic understanding

of Reinforcement Learning. Because you can’t

really make something reliable until you

understand how it works. So we’re going to

first give a brief overview of what the Reinforcement

Learning framework looks like. So you start off with

some environment and I’m going to use the example of a stock

trading environment. So you have the market, that’s your environment, and you have the initial state, which is a bunch of stock prices and a bunch of tickers. Our agent is going to be a robot, and the robot is going to be

trying to trade on the stock market and

make a lot of money. It has some initial policy, maybe some kind of expert policy

or just a random initialization. At the very beginning, the agent sees the state of the market, which is, as I just said, all of the stocks and their prices, and then the agent takes

that and then uses this policy to make a distribution over all the actions

it could possibly take. So that could be sell the stock,

buy the stock, whatever. Then it samples from the action distribution, and then it feeds that back into the environment by playing it. Then after that, the environment

shifts to the next state. So, like, if you sold a stock, maybe the stock price for that stock will go lower. Yes, and that’s why we’re not using it anymore. So anyway, the state will change

and then you’ll get some reward, like maybe you made money,

maybe you didn’t make money. Then based off of that information, you’re going to shift your policy to make your agent even

better for the next round. That’s going to repeat

a bunch of times, you’re going to update your policy, you’re going to do

the same thing again, you’re going to play an action and then you’re going to get

another reward and state.
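Just to make that loop concrete, here is a minimal sketch in Python, using the classic Gym API and a tiny stand-in softmax policy; the environment and the parameter shapes here are illustrative, not the ones from the talk:

```python
import gym
import numpy as np

env = gym.make("CartPole-v1")   # stand-in for the stock-trading environment
theta = np.zeros((4, 2))        # policy parameters: state features -> action logits

def policy(state):
    """Map a state to a distribution over actions (here, a linear softmax)."""
    logits = state @ theta
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

state, done, total_reward = env.reset(), False, 0.0
while not done:
    probs = policy(state)                                   # distribution over actions
    action = np.random.choice(env.action_space.n, p=probs)  # sample, don't argmax
    state, reward, done, _ = env.step(action)               # environment shifts state
    total_reward += reward
# based on the observed rewards, you would now update theta and repeat
```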

The ultimate goal here is to maximize your overall expected reward over the trajectory that

your agent is going to play. So a trajectory is just: you

start at some initial state, you take that state, you play

an action, you see the next state, you get a reward, then you see that state and then

you play another action, and then you get a

reward back and so on. You do state, action,

reward; state, action, reward over and over and over

until the end of your trajectory, and that’s one trajectory. You want to maximize the total reward that you see

throughout that trajectory.
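Written out, the objective being described is the standard one, with a trajectory τ and a discount factor γ (the discounting is confirmed later in the Q&A):

```latex
\max_{\theta}\; J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t \ge 0} \gamma^{t}\, r_{t}\right],
\qquad \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)
```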

So the class of algorithms that we’re going to look at today is Policy

Gradient Algorithms, whose key principle is that we’re going to view this expected-reward maximization goal

as an optimization problem. So this term inside

of the expectation is just the total reward that you

get throughout the trajectory, and your goal is to

maximize the expectation of the reward that you get while playing a single round

of the trajectory, and you want to find

the optimal parameters for maximizing this reward. The method of choice

that we’re going to use throughout this presentation is just first-order methods to maximize this reward. But the problem is that we

don’t have any gradient access, and it’s unclear exactly how we’re going to get the gradient

from this expectation, and so what we’re going

to do instead is we’re going to try to find

an estimate of the gradient. It turns out, I’m not going

to go into details here, but it’s pretty standard in the literature. You can basically model the gradient of this expectation as the expected value of a quantity that’s easily computable given a trajectory. So what you can do

here is, you can just take a finite sample approximation, and you can take a bunch of these gradient estimates,

average them together, and hopefully get something that

looks like the actual gradient, which we’re going

to analyze later.
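For reference, the standard score-function (REINFORCE) form of this estimator, which is presumably what is meant here, averages over k sampled trajectories:

```latex
\nabla_{\theta} J(\theta)
= \mathbb{E}_{\tau \sim \pi_{\theta}}\!\Big[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)\Big]
\;\approx\; \frac{1}{k} \sum_{i=1}^{k} \sum_{t} \nabla_{\theta} \log \pi_{\theta}\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)
```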

Then once we’ve got this gradient estimate, we’re going to use it

in gradient ascent. People have been really successful

at using these in practice. There’s OpenAI who has

done both OpenAI Five, which can beat professional humans at Dota using policy

gradient algorithms, and there’s also this other thing: they have some kind of robot hand where you can put a cube in it and then you can say, ”Oh, I have a cube in one orientation but I want to move it to another orientation.” Then it will manipulate it a little bit to shift the cube’s orientation, and it also works pretty

well in practice. But it turns out that

just like a rotten apple, it might look great on the outside, but there’s always like

underlying problems. So when you bite into

an apple that’s rotten, it looks good but then

you bite into it, it’s a little moist and juicy, maybe a little brown. You

don’t want to eat that. Just like that, you probably

also don’t want to do Deep Reinforcement Learning

because it’s really annoying. I’ll tell you why. So one reason why it’s

pretty annoying is because there is super poor reliability over repeated runs. So this is the same game

with the same algorithm; this axis corresponds to the time steps, and this axis corresponds to the return that you get at each time step. The only difference between these two clusters is that this line represents five random seeds. You start at a random seed, then you run the algorithm. So this represents five, this represents another five. They look like totally

different algorithms, even though it’s

the exact same algorithm. The only difference

between these two clusters is the choice of five random seeds. So we clearly have pretty bad

reliability over repeated runs. Another problem is super high sensitivity to hyperparameters. So the x-axis is going

to be Learning Rate, it’s logarithmic, and the y-axis

is the total reward that you get. Each one of these lines represents a different algorithm that

we use to train the agent, but we’re just going to

look at this green one, which ultimately achieves

the highest reward possible. So you can see at learning rate 11 times 10 to the negative four, you get reward zero. At learning rate eight times 10 to the negative four, you get reward 3,000.

That’s crazy, isn’t it? Super high variance just based on this tiny little change in learning rate. Then the final issue, and there are a lot of issues, but the final issue that we are going to discuss in this slide is poor robustness to

environmental artifacts. So one example is that you have the same game

again and we’re just going to scale the rewards at

the very end by a constant factor. Each one of these represents

a different constant factor. They should all have

the same ultimate reward, but because of the reward scaling, they do significantly worse. So it’s pretty weird, and that’s another issue with

Deep Reinforcement Learning. Notably, so the benchmarks that

everyone looks at is basically, we’re going to take the algorithm

and then we’re going to get the highest expected reward

at the very end, and that’s the benchmark that people care about in reinforcement learning, and none of these problems are

revealed by these benchmarks. So the question is, where

do these issues come from? It’s unclear, because Deep Reinforcement Learning algorithms

are super-complicated. They have tons of moving parts, and it’s just very unclear how to implement them

often from just the papers. So one example here is the OpenAI Baselines repository. It has high-quality implementations of Reinforcement Learning algorithms, and in particular, we’re going to look at

the PPO1 and PPO2 algorithms which are from the paper

that they have about PPO, which is just a Deep RL algorithm. So these are all GitHub issues

of people complaining about the differences from the paper and between these two implementations. So between the two implementations, there’s huge architectural

differences, there’s huge differences

between the policies, they have all these different

optimizations on top of the algorithm that they use that they don’t mention

in the paper at all. There are super non-trivial changes in the repository compared

to the paper, and so on. So the overall message of this slide is that the Deep RL algorithms are

really complicated and they’re really underspecified when

you just look at the papers. So basically, in PPO at least, they have the actual algorithm they have in the paper, and then they have the implementation, and the implementation has all these different kinds of optimizations on top. One example is orthogonal neural

network initialization. So it’s just a different way

of initializing the weights. So normally in PyTorch you use Xavier initialization

which works really well for image classification tasks, but they suggest using orthogonal

neural network initialization. So it turns out that when

you run the algorithm, using orthogonal

initialization, you do way better than Xavier, and

it’s a little unclear why; a priori, you wouldn’t think this would be a big deal.
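As a concrete illustration (a minimal sketch, not the actual Baselines code), the two initialization schemes differ by a single call in PyTorch:

```python
import torch.nn as nn

def init_policy_mlp(sizes, orthogonal=True):
    """Build a small policy MLP initialized either orthogonally (as the
    Baselines-style implementations do) or with Xavier initialization."""
    layers = []
    for i in range(len(sizes) - 1):
        layer = nn.Linear(sizes[i], sizes[i + 1])
        if orthogonal:
            # sqrt(2) gain is a common choice for hidden layers
            nn.init.orthogonal_(layer.weight, gain=2 ** 0.5)
        else:
            nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)
        layers.append(layer)
        if i < len(sizes) - 2:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)
```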

>>What task?>>What task is it?>>This is Humanoid, using PPO.>>Is it stable across

different tasks?>>Yeah, you see the same kind

of effects across them. It’s definitely more pronounced

with harder tasks. So where was I? Yeah. So this experiment

is essentially that we took all the different

optimizations and we did the Cartesian product of all of them, and then we plotted

the maximum reward for the half of the Cartesian product with this optimization, and for the half without the optimization. So it turns out that when

you use the optimization you do way better in terms

of the maximum reward. This is true over a bunch of

the different optimizations. I’m not going to go into what all these optimizations actually are. But when you look at

the maximum reward for all these different

optimizations with and without, they’re

drastically different. By the way, these are not

even listed in the paper because that’s how unimportant the authors originally thought that they were, even though they’re very common. But people clearly have

a very hard time reimplementing these algorithms because

it’s often unclear what’s exactly part of

the deep RL algorithms that they present and what’s

just another optimization on top. So even with these

seemingly small changes, performance can vary super widely. So the overall takeaway here is that these deep RL methods are underspecified and they’re really complicated, and the reasons for unreliability and performance are somewhat unclear. It’s not clear if it’s

the algorithms or if it’s all the little optimizations

that they put on top of them. So this calls for us to go back to first principles and look at what these algorithms are really doing. To do that, we’re going to look at a bunch of different tenets of the policy gradient framework. One of them is gradient estimates, and I’m going to explain

all of these as we go, so I’m just going to go

quickly through them. The other one is value prediction. We’re also going to look at

optimization landscapes, and finally, we’re going to look at trust regions at the very end. So the first thing we are going to

look at is gradient estimation. So if you recall in

our policy gradient framework, one of the key assumptions

that we have is that the gradient

that we actually take is pretty correlated

or at least correlated with the finite sample

approximation that we get, and we want to look at

how this is in practice. So the experiment that we’re

going to do is we’re going to fix a single policy and then we’re

going to take a bunch of steps. Each of which uses this case

sample gradient estimate. So we’re going to take a bunch of samples and then we’re going to

make a step based off this sample. So you should expect that

if you have more samples your gradient estimates

are going to get better. We’re going to do this

a bunch of times, and what probably we want to

be able to do is make sure their concentration and see how well these actually

concentrates the true gradient. The way we’re going to do that

is we’re going to measure the mean pairwise correlation

between all the different gradients. Between all the different gradient

estimates that we collect. So you can think of this as, if you have higher pairwise

gradient correlation you’re going to have

better concentration. If you have lower mean

pairwise correlation you’re going to have

worse concentration.
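A minimal sketch of what is being measured here, assuming you have already collected several independent k-sample gradient estimates for the same fixed policy, flattened into vectors:

```python
import numpy as np
from itertools import combinations

def mean_pairwise_cosine(grad_estimates):
    """Average cosine similarity between independent k-sample gradient
    estimates; higher means the estimates concentrate better."""
    sims = [g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
            for g1, g2 in combinations(grad_estimates, 2)]
    return float(np.mean(sims))
```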

So this is a plot where on the x-axis you have the number of samples that we use, and on the y-axis we have, basically, the concentration

in that regime. So higher means that you’re basically concentrating to

get the actual gradient, and lower means that

you’re not as much. So this black line is what the algorithms

actually use in practice. So you can see that roughly,

I don’t want to say half, but about half the time, a little less than half the time, the steps that you

take are actually in opposite directions from one another. The gradient is so much less

concentrated than it should be. But that’s not necessarily as big of a problem as you might think, because in high dimensions, even a very low cosine similarity is still pretty significant.>>So the x-axis is, you’re changing

the state action space?>>So the x-axis is the number

of samples that we use.>>Okay.>>Yeah, and so you

would expect, yeah.>>You can finish your sentence.>>No, no, no.>>Okay.>>You can just ask; if you guys have confusions, just ask a question.>>Is this consistent across different architectures

of the policy.>>So we only tested

one architecture of the policy, but if I had to guess,

I think it would be.>>Consistent across

tasks variance system.>>Yeah, it is very consistent across tasks across different

variations. Yeah.>>So in this case, what’s

the dimension of the parameter? What’s the dimension

of the parameter?>>About 5,000. So if you consider random Gaussians and look at the correlations between them, you get roughly one over the square root of the dimension as the cosine similarity between two random Gaussians. So you get maybe 0.01 or less than that in terms of correlation if you just drew Gaussians in 5,000 dimensions. So this correlation is non-trivial.
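To make that baseline precise: for two independent standard Gaussians in d dimensions, the cosine similarity is approximately Gaussian with standard deviation one over the square root of d, so:

```latex
x, y \sim \mathcal{N}(0, I_d) \;\Rightarrow\; \cos(x, y) \approx \mathcal{N}(0,\, 1/d),
\qquad d = 5000 \;\Rightarrow\; \text{typical } |\cos(x, y)| \approx \tfrac{1}{\sqrt{d}} \approx 0.014
```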

Alex pointed this out. Well, multiple people pointed out similar things throughout. Never mind, but anyway. Yeah. So it’s not quite as bad as it looks, but there’s clearly a lot of

room for improvement here.>>How does it vary across the iterations of the optimization algorithm?>>Yeah. So at earlier iterations, this looks much better. So if you’re at the very start, this graph is actually just

shifted left and so you get actually pretty

reasonable correlation at the very first iteration. I think this is iteration 150 or so out of 500 or 300, but it very quickly drops off. So as you go further in

the iteration process, this graph shifts to

the right and you’ll get much worse estimates in terms of concentrating.

Any other questions [inaudible]>>What do you mean harder tasks?>>Harder tasks, yeah. So there’s an informal hierarchy of how hard

all these different tasks are. I guess you can make it more formal, typically in terms of sample complexity: how many samples you need to learn this task. So this is for Humanoid, sorry, for Walker2d, which is considered one of the harder tasks in [inaudible] Then there are easier ones like Hopper, where it’s like, what’s going on there? For some reason that’s easier, yeah.>>Can you give some sense

of how bad this could be. Is this doing much better

than the worst-case scenario?>>Yeah. Actually, we’re going to

get to that in the next section, yeah, when we look

at value estimation. So I guess the key

takeaway here is that we don’t have

a great understanding of the training dynamics for

how this variance really impacts our optimization process, but it would be great if

we could use insights from stochastic optimization to

be able to look at this. It’s not exactly the same regime

because the samples that we get are not independent, and not only that, the actual objective, because of the way that the deep policy gradient methods are organized, is non-stationary. So you can’t exactly apply SGD analysis, but it would be great to

use insights from it. Yeah, and another key thing here

is that we’re really missing a link between reliability

and sample size. So it turns out that

when you really scale up these algorithms and use

many more [inaudible] samples, these algorithms become

much more reliable. It hints at this in this plot

because you can see that the gradient estimation is much better when you

use many more samples. So it’ll be great to get a better understanding

of what that’s like. There’s actually an OpenAI paper about that. But it would be great to

look into it even more. So the next aspect of policy gradient methods

that we’re going to look into is value prediction. So as we just saw, the concentration of the gradient estimates that we get is really hindered by high variance. So it’d be great if we could

lower this variance, and it turns out that

one way to do that is to estimate the values and then use that in

our policy gradient method. So the value of the state is

if you have a given state, the value is the expected reward that you get after

visiting that state. So the idea is that if you can estimate these values well, then you can better separate out the action quality, like how good the action that you take is, versus what the state quality is. So for example, if you have a robot and it’s about

to fall over something, then if you take an action you don’t want

to say the action is bad because you were about

to fall over anyway. So the idea here is that if you can understand what the state’s contribution to how well the algorithm does is, versus what the action’s contribution is, you can significantly lower the variance.
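In symbols, the value function and the resulting advantage estimate being described look like this (γ is the discount factor; subtracting V as a baseline leaves the gradient estimator unbiased while reducing its variance):

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s\Big],
\qquad
\hat{A}(s_t, a_t) \;=\; \Big(\sum_{t' \ge t} \gamma^{t'-t}\, r_{t'}\Big) \;-\; V^{\pi}(s_t)
```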

To reduce the variance, you need good value estimates. The way that we get value estimates

is: during training, we collect all these different

samples of states and rewards and you can calculate

what the values are from that. So we basically just perform a supervised regression task at every point in the training process

using the data that we collect.
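A minimal sketch of that per-iteration regression step, assuming a `value_net` module and tensors of visited states and their empirical returns (the hyperparameters here are illustrative):

```python
import torch
import torch.nn.functional as F

def fit_value_function(value_net, states, returns, epochs=25, lr=1e-3):
    """Fit the value network to the empirical returns observed in this
    iteration's trajectories, as a plain supervised regression."""
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.mse_loss(value_net(states).squeeze(-1), returns)
        loss.backward()
        opt.step()
    return value_net
```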

So we really want to understand here: what do the value estimates we get do in terms of reducing our variance, how well could we actually do, and also how bad is it, as a baseline,

if you don’t do anything? So the way we’re going to do

that is we’re going to do a similar experiment to last

time, where we vary the number of samples on the x-axis and we measure the concentration on the y-axis. It’s the same exact plot

as previously. But now, we’re going to look at

three different kinds of agents. So the first agent is when you

don’t use any baseline at all. So you don’t use

any value estimation. So then the red line here represents what happens when you use the standard

value estimation from the algorithm and then this blue line represents what would happen if you got a near ideal value function, and the way that

we’ve calculated that is we basically just take a ton of samples and we very closely approximate what the value function is.>>So just to make clear,

the star here actually refers to the value function of the current policy, not the

optimal value function.>>Yes.>>Okay.>>The value function

of the current policy. So this one has no value function, this is the agent’s value function, and that’s the true value function. So it turns out that the agent

does significantly worse than what it could be doing if

it had the true value function. But it’s still doing significantly better because, again, remember we’re in high dimensions, so this is actually pretty good, and it’s actually doing quite a bit better than no value function. But there’s clearly significant room

for improvement here. You can see that the concentration gets much better for

the true value function. So one of the key questions here is if we were able to get

better value functions, how would that affect training? How much better would we be able to do? How much more reliable would it be? Not only that, but how can we

actually get better value functions? Because it’s clear that

there’s a big benefit here, but what’s unclear is how

well that would really translate to optimization in general. So now the third thing that we’re going to look at

is optimization landscapes. So a key assumption again in our policy gradient framework is that when we take

these gradient steps, we increase the overall reward that we’re going to

get from that policy, and so what we want to see is how valid this assumption is in practice. So we’re going to

look at a lot of plots of this form in the next few slides, so we’re going to make it very clear what these plots are. Essentially, we fix a policy, and then this direction represents moving in the actual step direction that we get, so this point represents where the actual next step would take us. Then this direction represents going in a random direction chosen from a Gaussian, and then what the actual plots here represent is the reward that

you get at that new policy. So you fix a policy, then you move this much in the agent’s step direction and this much in

the random direction. So one example here is

this red stuff right here represents you move 2.5 times in the step direction and you move

1.5 times in the random direction, and then this is the ultimate reward that you get from that policy. So this is step zero.
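A sketch of how such a landscape plot can be produced, assuming flattened parameter vectors and an `eval_reward` function that rolls out trajectories and averages the return (all names here are hypothetical):

```python
import numpy as np

def reward_landscape(params, step_dir, eval_reward, n=11, scale=2.5):
    """Evaluate the estimated reward of policies perturbed along the
    update direction and a random Gaussian direction of matched norm."""
    rand_dir = np.random.randn(*params.shape)
    rand_dir *= np.linalg.norm(step_dir) / np.linalg.norm(rand_dir)
    alphas = np.linspace(0.0, scale, n)      # multiples of the step direction
    betas = np.linspace(-scale, scale, n)    # multiples of the random direction
    grid = np.zeros((n, n))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = eval_reward(params + a * step_dir + b * rand_dir)
    return grid
```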

You can see that we’re doing pretty well here. You move in the step

direction you get increasing rewards which is great. But then by Step 150, you can see that this really

degrades quite a lot. Just moving in the step direction actually lowers your reward. That stays true at Step 300, and we looked at a lot

of these kinds of plots, and essentially it shows that oftentimes the steps are not predictive later in the optimization process. This looks even worse for harder tasks. So for easier tasks it looks a little better. But for these harder tasks

it looks much worse. So the natural question to ask

here is, what’s going on? It turns out that when you look at what the algorithms

are actually doing, they’re not maximizing

the actual true rewards. Instead, what they’re maximizing is some surrogate reward.>>Sorry, I have a question

on the previous slide. So the x-axis is the step

taken by the agent?>>Yes. So you’re talking

about this one, right? Yeah. So I fix a policy and

then I move like x times step in policy space and I evaluate how that agent does in terms of reward.

Does that make sense?>>It’s like it’s moving

optimization like parameter space.>>Yes.>>Okay.>>Because agents step

taken sounds like some agent taken steps,

but [inaudible].>>Random detection would be like

adding noise to the [inaudible].>>Yes.>>But after results

it showed that adding noise to policy parameters

actually helps.>>I’m not aware of any, but I’m happy

to talk to a [inaudible] about that.>>Sure.>>You are also adding noise

to the policy parameters.>>Yes. Also adding noise to the policy parameters.

I’m not sure of that.>>Yeah. There’s a paper from

OpenAI on parameter space exploration.>>Yeah. Sure.>>Okay. Any other questions

about these lines? Is everyone clear about

what’s going on here? You move in parameter space in the step direction and you can move in a random direction. These are the kinds of

plots you get. Yeah?>>For the second part, since we are kind of following that direction, why is that reward going down? This part.>>We’re going to explain

that in the next slide. Yeah. That’s a good question here. It’s like why is it that when we go in the direction of

the actual step that we take, why is it that the true

reward is going down? It turns out that these methods actually don’t optimize

the true reward. What they instead optimize is

something called a surrogate reward. I’m not going to go into detail

about it but we can talk afterwards about what

this actually is, and what we want to check here is how the landscape of surrogate rewards compares to the landscape

of true rewards. This is a surrogate reward landscape. It’s the exact same format as our previous landscape

except that now, it’s the surrogate reward instead of the true reward.>>So [inaudible] is

coming in because of the proximal policy part of it?>>Yeah.>>Okay.>>Yeah. So they look

at basically the policy ratio times the advantage. That’s what they’re optimizing instead of the true rewards.
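For reference, this is the standard surrogate: the importance-weighted advantage, which PPO additionally clips to keep the probability ratio near 1:

```latex
L^{\text{surr}}(\theta) = \mathbb{E}_t\!\left[ r_t(\theta)\, \hat{A}_t \right],
\qquad
L^{\text{clip}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right],
\qquad
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```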

So this is what the agent

is actually optimizing. So you can see that it has a maximum right where the step direction is. So it’s actually maximizing the surrogate reward really well. That makes sense, because it has full access to the surrogate reward. So it should be able to

optimize it really well. It should be clear that the surrogate reward is based

off of what it sees in training. It’s just that they have some approximations and it’s based on some theory. But what actually ends up happening here is that these actually work pretty well. This is step zero. What happens is you move up in the

surrogate landscape direction. You also move up in

the corresponding reward direction. So this step does pretty well. But then by step 150, you run into the same problem

that we saw in the last slide, where you move in the surrogate direction. It looks like this is the optimum. But then what actually ends

up happening is that in the true reward landscape,

you’re going down. You’re getting worse and worse. This continues to be true at steps 300 and 450. This looks great. It is

optimized as much as we want. But it turns out that even when you move in the

surrogate direction, increasing the surrogate reward, you’re really not doing very well on the true

rewards landscape.>>So here, this surrogate

landscape is a landscape for the value network, or how do you get this surrogate landscape?>>Yes. So how do you

find out about it? Essentially, the surrogate landscape is just a function of the trajectories that you see. So you collect all these different trajectories in training and then you make a surrogate out of it, which is motivated by some theory work. Yeah, it uses

a value function as well.>>It’s for the policy. It’s

the one for the policy.>>Yeah, it’s a landscape for the policy. I’m sorry, what?>>It’s the landscape for the policy, like this is all the policy parameters.>>Yeah. This is, again, over policy space. Does that make sense?>>Yeah.>>So you can think about

this like just abstractly. You can think about the

surrogate reward as something that the algorithm

is actually optimizing over, like it takes in the trajectories and then it makes a landscape that it thinks corresponds to

the true landscape, and then it optimizes that instead of optimizing

over this true landscape.>>On the [inaudible] side, I think there are two approximations. One is you use a Monte Carlo sample to estimate, say, the expectation. That is one kind of approximation. The other approximation is that we do not know the true value function, so I need to use some network to approximate that well. So by surrogate here, which approximation are you referring to?>>It uses both of them.>>Both of them.>>Yeah. So it uses a value function and it uses

the samples that we get. But it’s not looking at

the actual rewards you get. It’s looking at some function of

the actual rewards you get. Yeah?>>So you’re measuring these functions on the same scale? Are we seeing->>Yeah. So you should->>You just scaled the reward by a very small amount of->>Right. So you should

ignore the scale; the only thing that’s important about this is the direction. Like, if this goes up, this should also go up.>>I’m just curious, in

trying to take a step to achieve a very small improvement

in the surrogate reward, are we seeing a comparatively

much larger decrement in the true reward or are

these scales not comparable?>>Yeah, these scales

are not comparable.>>Okay.>>But regardless, it is definitely concerning that when you’re increasing the surrogate reward, you are decreasing

the true reward. Yeah.>>Does the network, does

the agent become better?>>Yes. Yes, it does. Yes. So that’s the key thing throughout

this whole presentation. I am not saying that these deep RL algorithms do not work, because they do work.>>But how can it become

better if I always, pretty much, after some point,

decrease my true reward?>>That’s a great question.

I mean, we have some hypotheses, but it’s largely unclear. Yeah. This could be a source of the unreliability

issues that we see.>>So for these plots, do you think the error mainly comes from the value

function approximation?>>I think both of

them are, probably. I think both of these play a part in it. But actually, so when

we take this step, we use many, many samples. So these actually

should be pretty good. Like when we take the step, we’re not taking the actual agent step. What we’re actually doing

is we’re using many, many samples to get a pretty good approximation of

what the agent step should be. So I don’t think that

those issues are super big.>>It’s like the third source

of approximation.>>Yeah.>>It’s a kind of approximation.>>Yeah. It’s kind of like a third

source of approximation error. It’s like the actual function

that they build out of the trajectories, not the value function in isolation, or the estimates.>>So in the same way

that you can’t optimize accuracy when you train networks, you optimize cross-entropy. That’s exactly the same: you can’t optimize reward, so you optimize some surrogate

of the reward.>>Yeah.>>That’s the procedure.>>Yeah. That’s a great explanation.>>So I guess the quantity

on the x-axis, the way I should think

about it is like the step size in my

optimization algorithm, right?>>Yes.>>But then presumably, the scale at which you’re plotting

things is much larger than the regime in which the step sizes are operating in training. So what is the actual step?>>Yeah. So if you->>Oh, I see. You scale with the->>Yeah.>>I see. Okay.>>Yeah.>>So if you just do early

stopping at 150, would you get->>No. I mean, you definitely improve over time. So it’s not like these algorithms are not working. It’s just a little unclear what the mechanisms behind them

are that are making them work.>>[inaudible] being the total reward or do we turn [inaudible]?>>I’m sorry. What?>>Is the y-axis in the red plot

here the same thing you’re claiming will increase over time, the total reward, like it will accumulate?>>Yes. This quantity exactly corresponds to what should be increasing over time.>>So how can that be if

every step you take decreases?>>That’s not every step.>>Like so we sample

a couple of steps. It could be that you take like

very few high-magnitude jumps, but then if you sample at any point, you’re likely to see some small detriment.>>So maybe on average

like this doesn’t do so well but you might get a few steps

that are doing really well, so it balances out. Yeah?>>So is this reward the return

or is the step-wise reward?>>The total reward over->>Yeah. This is

the total reward. Yeah.>>So it’s just discounted return.>>Yeah, it’s discounted

return actually. Yeah. So as we can see here, the surrogate reward is often very misaligned with

the true reward landscape. It’s important to

note that everything we’ve looked at so far has been

in this high-sample regime. So every time that we look, we try to make an approximation of how good each one of these agents is. We get the discounted reward

using a thousand trajectories, which is pretty good, like you can get

a pretty good estimate of this. But when the agents

are actually learning, they only use about 20 trajectories. So what does the agent actually see as we’re going through

this optimization process? So these are 20 sample

estimates, which means that each one of these points corresponds to taking 20 trajectories for the reward estimate. So you take the step, you look at this new agent, then you run through 20 trajectories, and then you see what the mean discounted reward was, the mean return. Then this is what happens when you take 200 samples for every point. This is what happens when you take a thousand samples for every point. So you can see here

that if you use many, many samples, you get a really

nice like smooth landscape. So if you move in this direction, even though you are definitely

actually improving the reward, it’s a little hard to detect

in the agent sample regime. This is concerning because this is what the agent

actually uses to make steps. So it’s hard for the agent

to even know if it’s making progress because of how

noisy this landscape is. So the two key takeaways here are: first, these landscapes are not very reflective

of the true rewards, and it’ll be great to understand why, and how that impacts

the optimization process. It would also be great to understand how we can better navigate the reward landscape, because it seems like the surrogate reward maybe is not the best way to do this. The final aspect of

these policy gradient methods that we’re going to look

at is trust regions. So in parameter space you can think of our optimization

process as follows: you have this original point

in optimization space, then you take a step and

you go to the next point, take another step, you go

to the next point so on. At every step what happens is you

take a bunch of trajectories, and you make sure that

the step that you take based on those samples is

within a trust region. So each one of these steps

has to be within this trust region because

the samples that you take are only informative

locally around where the current policy is, because that’s where you took the samples. So you want to make sure that

the steps that you actually take are actually informed by

the samples that you took. So the idea is that you want

to be able to take steps in this trust region but if you

go outside the trust region then it’s a little unclear about

what you’re actually going to get because you didn’t take samples

there, you took samples here. So what PPO and TRPO use is motivated by this KL-based trust region

where you’re looking at the maximum KL distance between the action distributions

induced by states. So intuitively, you can think of this as saying: across all the states that you could possibly see, make sure that the way that I choose my actions is not too different from my current policy to my next policy. So when I take a step, I want to make sure that my next policy is not going to be too different in the way that it takes its actions, even across all the states. But this is hard to

enforce in practice because we don’t see the whole state space. So instead what we do is we relax it to an expectation, and what we do here instead is, intuitively, constrain the mean change in the way that we take actions across states. Does that make sense?
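Concretely, the relaxation being described replaces the max-KL constraint with its expectation over visited states:

```latex
\max_{s}\, D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\big) \le \delta
\quad \longrightarrow \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\Big[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\big) \Big] \le \delta
```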

So what we want to see here is what actually happens in practice: does our next agent actually

satisfy the constraint? So right here this represents the iteration in

our optimization process, so there’s 450 steps, and we’re going to

look at every step, we’re going to see what

the mean KL distance was between our current agent

and the next agent, and so it should be around

here which is what TRPO gets, and TRPO actually does

this very nicely; it really maintains the trust region. But is everything clear so far? That’s great, okay. The PPO algorithm, though, does not. So you can see here that we go

from two to the negative six mean KL to two to the negative three mean KL across the course of training. It doesn’t look like there’s any sign of stopping, but interestingly, the optimizations helped quite a lot. So we have this core PPO algorithm, and it purports to

keep this mean KL bounded; it has a relaxation, but the purported overall goal of this algorithm is to keep this mean KL bounded, while relaxing the constraint so that it’s computationally easier to enforce.>>So is it

just the approximation that makes them go from two to the minus five to two to the minus three?>>I’m sorry, what?>>The PPO is using

that in an approximate mode.>>Yes, right.>>So there is going to be a loss also just from the Taylor approximation that’s involved.>>Yes.>>Do you have a sense of

whether it’s all [inaudible].>>Yes. So that’s really interesting. Yes. So that’s what we’re

looking at right here: the fact that this algorithm and this algorithm use the exact same enforcement method. If you just read the paper, you would think that these would enforce this mean KL just as well, because they use the exact

same enforcement method. But when you put all these different

optimizations on top of PPO, it turns out that you can get significantly better

trust region enforcement and these optimizations include

learning rate annealing, value clipping, using orthogonal weights, and so on. So it’s unclear exactly

what is causing this trust region to be enforced, but this trust region ought to be enforced equally, because they are the same core algorithm. This green one is PPO and this blue one is PPO-M,

which is like PPO minimal which is what you

would get if you just implemented the PPO algorithm as

stated in the original paper. Then this is what

you get when you use all the different

optimizations that you find in the OpenAI GitHub repository. So what’s interesting here is that even though the enforcement

method mechanisms if you just looked at the algorithm

appear to be the same, the optimizations cause the actual enforcement in practice

to be drastically different between these two algorithms.>>This part is not

as surprising just because presumably a lot of the optimizations were going

to actually stabilize the numerical aspects, and at least if your policies are changing in a more stable fashion, then by definition trust regions can

be better maintained as well.>>So what the algorithm is for here is not stability; they aren’t trying to

make anything more stable. What they’re trying to do is maximize total rewards

at the end, right?>>Right, but they are

presumably trying to do it in somewhat more reliable manner across the different tasks

that they’re evaluating on. So it might be an artifact of that.>>Right. It could be an artifact of that optimization. I guess what’s interesting here in general is that,

I don’t know but I mean, maybe this is just me

but when I look at the optimizations, I don’t see anything about maintaining trust regions at all in any of them. The only mechanism

that I actually see in the algorithm for maintaining

trust regions is the key PPO ratio-clipping mechanism, which is kept constant

across both of these.>>It’s interesting, or at least it’s a somewhat surprising thing to us, that the mechanism that’s

designed to maintain the trust region does

not seem to be the thing that’s actually maintaining

the trust region. It seems to be some other stuff

that we add on top.>>So I guess as we

just talked about, one of the key questions to ask here is what part of

these algorithms are actually doing what and how do we reason about these algorithms when

they’re using such relaxations of the original trust regions that they were, at least theoretically, grounded in. Not only that, but how can we

capture, in our trust regions, the different kinds of uncertainty that we have in our algorithms? So the original theory that motivated the trust regions that these deep RL algorithms use doesn’t take into account stuff like bad value functions or really unconcentrated

gradients and so on. So it’ll be great to see

what kinds of trust regions we can come up with that

take these into account.>>I guess the difficulty am

having with this part is so PPO is fundamentally once you do

a tailor approximation of KL, sure you could still

go back and measure KL and which is kind of

what you’re doing but you could also say that it is just defining a different notion of

what a trust region should be and what do things look

like if you actually just evaluate what PPO is enforcing.>>Yeah, we actually

looked at that too. I didn’t choose to include that in

the slides because I thought it would be too much I guess but I’m happy to talk about that.

We have that in our paper.>>Sure.>>What’s the main takeaway from this fact? Because all of these, even the PPO people, they say, “Yeah, we use a very loose relaxation, but empirically we observe success,” which is essentially what this section of the

presentation is also saying.>>It doesn’t seem to be due to the relaxation; you could just not have it. It turns out that if you just remove the relaxation that PPO does, then with the optimizations you do slightly better, and you can just enforce the same trust region. So like the whole clipping thing, you can just set the hyperparameters exactly right so that it never leaves the clipping region, and then it’s all fine. So the clipping doesn’t actually seem to be doing it. It’s more like the optimizations that we added on top of that, I think, make the optimization so nice that you don’t actually need the trust region in the first place.>>Yeah. I see. So, just general takeaways that

we can get from this. In general, the deep RL methods are really complicated; they have a lot of moving parts and they’re hard to understand. Not only that, but these deep RL training dynamics are really poorly understood. The steps that we take are often really uncorrelated, the surrogate rewards don’t match the true rewards, and the trust regions oftentimes don’t hold, at least for

the reasons that we think. So the big question here is: how do we proceed? What are we going to do in the future about this? The first thing that we might want to do is try to reconcile deep RL with our conceptual framework: try to make our deep RL algorithms actually match the policy gradient framework better. So how can we do that? Another thing that we

could do is try to rethink our framework for these deep RL methods, trying to move our framework closer to them. For that, we would have to figure out how to deal with the high dimensionality in these algorithms, the different kinds of optimizations that they put on top of the core method, and, not only that, these non-convex function

approximations of deep networks. Finally, our results suggest that we need better evaluation for our RL systems. We have to move past a return-centric benchmark system and try to look holistically at

all the different aspects of these algorithms like trying

to look at reliability, and robustness, and safety. If you want to read

more, we have a paper and we also have

a bunch of blog posts.>>More questions? I’m curious to see if you’ve run similar probes on some bandit-like settings or just simpler settings, to see how much the gradient estimation issue or the trust region issue come up.>>Yeah. So we actually looked

at using SGD to maximize. So one of our buddies

looked at maximizing, basically just looking

at toy settings in SGD, like using SGD to maximize a quadratic or something. So it turns out that you can make

this stuff super uncorrelated. You’re still going to maximize

the quadratic pretty well. So we thought that was

pretty interesting, but the dynamics in RL

are very different. I guess for the reasons that

we mentioned before about lacking independence

and non-stationarity. So we’ve looked at some

experiments that are similar in these regimes. I think bandits would be a great place to look as well, but I mean, I think bandits are very theoretically well understood and there’s not too many

moving parts in them. There’s a core algorithm, but I think we could go with that.>>So how do you view [inaudible]>>Yeah.>>I’m just still confused about the part that your gradients seem to be very uncorrelated and your [inaudible] seems to be going down most of the time, so is it just very much like taking random directions? Like, what if you actually do that: instead of following your gradient, just take a random direction, and keep it if it does better, or something like that.>>Yes. So that’s actually

a technique that people use; it’s like finite difference methods. There’s a paper from Ben Recht about it called Random Search is a Competitive Baseline for [inaudible] or something. It was a good paper,

pretty interesting paper at least. So basically, what

they do is they just take a bunch of random directions

and see which ones do well. They do a bunch of

other optimizations on top of this. They do some wacky stuff about like throwing away

different directions, but yeah, it’s the same core

algorithm, and it works pretty well.
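A minimal sketch of one step of that kind of basic random search, in the spirit of the Mania, Guy, and Recht paper mentioned; `eval_reward` is an assumed rollout function, and the extra tricks from the paper are omitted:

```python
import numpy as np

def random_search_step(params, eval_reward, n_dirs=8, nu=0.02, lr=0.01):
    """Probe random directions with finite differences of the reward and
    step along the reward-weighted average of those directions."""
    update = np.zeros_like(params)
    for _ in range(n_dirs):
        delta = np.random.randn(*params.shape)
        r_plus = eval_reward(params + nu * delta)
        r_minus = eval_reward(params - nu * delta)
        update += (r_plus - r_minus) * delta
    return params + (lr / n_dirs) * update
```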

>>If you were going to start from scratch, approaching this problem domain, are there things you think you would leave out of

the present framework for RL, or replace with something else? Or what ideas do you have about how to avoid having

some of these issues at all?>>Yeah. So I think that, as we

design these algorithms, it would have been great to look at how the different optimizations that we use actually impact

the performance. Like, people at least, I mean, so the policy gradient

framework that people came up with is not intended for

the deep RL methods, or for these kinds of tasks, as much. I’m not sure, I guess, about how I would design the framework, but I think that in general, when developing these kinds of methods, I would be more careful about looking at what the impact is of different algorithmic aspects. Then trying to really

understand what’s causing performance and what’s causing reliability or unreliability

or lack of performance.>>Have you tried using a linear model? As in, is the problem here to classify it [inaudible]?>>Yeah. So actually, you

can solve this without any deep learning using a linear

model with these algorithms.>>Right. On the same,

have you done experiments?>>Have we done experiments on that?>>Like say like

the optimization landscape, you have a linear

function approximator. Does it also look like that?>>Yeah, that’s a great question. I would suspect that

these environments would be similar. But I think that it would be very interesting to look at. Actually, Ben Recht’s paper

uses a linear approximator.>>The random search one?>>Yes. [inaudible].>>The surrogate landscapes

look like [inaudible].>>Yeah. So if you look at the surrogate landscapes, they’re vaguely linear. I mean, which makes sense because, I mean, I don’t

know if it makes sense, but the actual thing you’re optimizing is linear in

the outputs of the network. So that’s preliminary; yeah, I mean, I’m not sure if that’s a good connection.>>If you have

a multi-layer linear network, the optimization is still not linear.>>Yeah, absolutely. Sorry.>>What’s the network in

the picture using all these?>>Yeah. It was a two-layer MLP.>>What learning rate did you mainly use for most of these?>>It varied a lot per experiment. I think we used whatever the best one was during

[inaudible].>>Like 10 to the negative four. I saw earlier you had this plot.>>Yeah. I think it was probably something around 10 to the negative three or 10 to the negative four.>>Do all the environments

show plots like this?>>Yeah. So actually, in our appendix we have everything. We’ve got like 30 pages of appendix or something, so you can take a look at that. So I mean, for easier tasks, they look much better, I would say. We’ve mostly looked at this Walker2d, which is the hardest one.>>Yeah. It’s curious, you

would imagine that more unstable environments are probably more sensitive; perhaps things that balance are more unstable, as well as maybe even [inaudible] stuff that’s more stable [inaudible].>>Yeah. I guess, I’m not sure. Yeah, I don’t really

have a good intuition for how the different games behave.>>Maybe I missed this, so

thinking about better evaluation, do you have a constructive suggestion for how we might go beyond just benchmarks and eventual average return? What might be an alternative evaluation that exposes this hypersensitivity to hyperparameters and things like that?>>Yeah. I mean, so we haven’t really

talked about ideas for that. We haven’t talked

about [inaudible] yet. I guess, we haven’t thought

about that too much. I would say one thing, at the very least: if you look at

a lot of these papers, when they showcase results, they show the results in a way that makes it look

more stable than it is. So one example is the super common practice of using smoothing. So they basically say, we’re going to look at a weighted average

of what my returns are over time rather than actually getting what

the true rewards are over time and plotting that.
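For illustration, this is the kind of exponential smoothing commonly applied to reward curves before plotting, which can hide run-to-run variance (a generic sketch, not any specific paper's code):

```python
import numpy as np

def smooth(returns, weight=0.9):
    """Exponentially weighted moving average of a reward curve."""
    out, prev = [], returns[0]
    for r in returns:
        prev = weight * prev + (1 - weight) * r
        out.append(prev)
    return np.array(out)
```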

I think it would be good, just as a very basic start, to have more rigorous

evaluation there. Actually, one big problem

with comparing methods is that when people use all of

these different kinds of tricks, like this smoothing, or they say, “Oh, I’m going to collect five seeds, and I’m going to choose the one that does the best,” which is crazy, right? I think as a very basic start, it would be good to just have

some honest guidelines for just even showing reward curves.

There’s a long way to go there.>>Are there any more questions? [inaudible] If there are any more questions, I think they’re around today and tomorrow, so if you want to meet with them, by all means. Then let’s

thank our speaker again.


>>Thanks for this talk.