Deep Policy Gradient Algorithms: A Closer Look

>>[MUSIC]>>So today we're going to have Logan Engstrom speaking; he's a student in Madry's lab. He's going to be talking about reliable machine learning and reinforcement learning.>>Thanks, Jerry. So today we're going to talk about the algorithmic aspects of Deep Reinforcement Learning. This is joint work done with Andrew, right there, and Dimitris and Shibani, and also some of our collaborators at Two Sigma, and of course, our wonderful advisor, Aleksander Madry. So let's just get started. So we've all heard quite a lot
about Reinforcement Learning; probably the most well-publicized example of Reinforcement Learning recently has been AlphaGo, in which Google made a Reinforcement Learning algorithm to play the game of Go. There are also applications in robotics as well as even self-driving cars. But all these applications beg the question of whether Reinforcement Learning is really ready for prime time, and the answer is no. Deep Reinforcement Learning is pretty unreliable even in very simple settings, and you can see this in these two examples. In one example, there's this little half-cheetah and it's supposed to be running along like a real cheetah, except in two dimensions. But it's not really doing that; it's going along on its back, because it fell into a local minimum and it can't get out. Another example is this little reacher robot: it has an arm and it's supposed to be reaching to touch the dot, but what happened was that at initialization this reacher robot had really high weights or something, and so it just got into a spin, and it takes too many steps. It's a classic exploration-exploitation trade-off, and it was not able to get out of this spin because it's in another local minimum. So how do we really get to this
reliable Reinforcement Learning? Of course, there’s a bunch of
problems that need to be solved, like value alignment
among other things. But the avenue that we’re
going to look at today is obtaining an algorithmic understanding
of Reinforcement Learning. Because you can’t
really make something reliable until you
understand how it works. So we're going to first give a brief overview of what the Reinforcement Learning framework looks like. So you start off with some environment, and I'm going to use the example of a stock trading environment. So you have the market, that's your environment, and you have the initial state, which is a bunch of stock prices and a bunch of tickers. Our agent is going to be a robot, and the robot is going to be trying to trade on the stock market and make a lot of money. It has some initial policy, maybe some kind of expert policy or just a random initialization. At the very beginning, the agent sees the state of the market, which is, as I just said, all of the stocks and the prices, and then the agent takes that state and uses its policy to make a distribution over all the actions it could possibly take. So that could be sell the stock, buy the stock, whatever. Then it samples an action from that distribution, and it feeds it back into the environment by playing it. Then after that, the environment shifts to the next state. So if you sold a stock, maybe the price for that stock will go lower. Yes, that's why we're not using it anymore. So anyway, the state will change and then you'll get some reward, like maybe you made money, maybe you didn't make money. Then based off of that information, you're going to shift your policy to make your agent even better for the next round. That's going to repeat a bunch of times: you're going to update your policy, you're going to do the same thing again, you're going to play an action, and then you're going to get another reward and state. The ultimate goal here is to maximize your overall expected reward over the trajectory that your agent is going to play. So a trajectory is just: you start at some initial state, you take that state, you play an action, you see the next state, you get a reward, then you see that state and you play another action, and then you get a reward back, and so on. You do state, action, reward; state, action, reward, over and over until the end of your trajectory, and that's one trajectory. You want to maximize the total reward that you see throughout that trajectory.
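To make that loop concrete, here is a minimal sketch of collecting one trajectory. It assumes a classic Gym-style environment API and a policy function that returns action probabilities; both interfaces are illustrative assumptions, not code from the talk.

```python
import numpy as np

def collect_trajectory(env, policy, max_steps=1000):
    # One trajectory: state, action, reward; state, action, reward; ...
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(max_steps):
        probs = policy(s)                          # distribution over actions for this state
        a = np.random.choice(len(probs), p=probs)  # sample an action from that distribution
        s_next, r, done, _ = env.step(a)           # environment shifts to the next state and gives a reward
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    return states, actions, rewards
```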
So the class of algorithms that we're going to look at today is policy gradient algorithms, whose key principle is that we're going to view this expected-reward maximization goal as an optimization problem. So, this pointer is pretty cool. So this term inside of the expectation is just the total reward that you get throughout a trajectory, and your goal is to maximize the expectation of the reward that you get while playing a single trajectory, and you want to find the optimal parameters for maximizing this reward. The method of choice that we're going to use throughout this presentation is just a first-order method to maximize this reward. But the problem is that we don't have any gradient access, and it's unclear exactly how we're going to get the gradient from this expectation, and so what we're going to do instead is try to find an estimate of the gradient. It turns out, and I'm not going to go into details here, but it's pretty common in the literature, that you can basically write the gradient of this expectation as the expected value of a quantity that's easily computable given a trajectory. So what you can do here is just take a finite-sample approximation: you can take a bunch of these gradient estimates, average them together, and hopefully get something that looks like the actual gradient, which we're going to analyze later. Then once we've got this gradient estimate, we're going to use it to take gradient steps.
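As a rough sketch of that finite-sample estimator (the standard REINFORCE form), assuming a policy object with a log_prob(state, action) method and a torch optimizer over its parameters; this is illustrative, not the talk's actual code.

```python
import torch

def estimate_policy_gradient(policy, trajectories, optimizer):
    # Finite-sample estimate of grad E[R(tau)]:
    # average over trajectories of (sum_t grad log pi(a_t | s_t)) * R(tau).
    optimizer.zero_grad()
    for states, actions, rewards in trajectories:
        total_reward = sum(rewards)
        log_probs = torch.stack([policy.log_prob(s, a) for s, a in zip(states, actions)])
        # Negate so that .backward() accumulates an ascent direction for the reward.
        loss = -(log_probs.sum() * total_reward) / len(trajectories)
        loss.backward()
    optimizer.step()  # one gradient step using the averaged estimate
```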
People have been really successful at using these in practice. There's OpenAI, who has done OpenAI Five, which can beat professional humans at Dota using policy gradient algorithms, and they also have this robot hand where you can put a cube in it and then you can say, "Oh, I have a cube in one orientation but I want to move it to another orientation." Then it will manipulate it a little bit to be able to shift the cube's orientation, and it also works pretty well in practice. But it turns out that, just like a rotten apple, it might look great on the outside while there are underlying problems. So when you bite into an apple that's rotten, it looks good, but then you bite into it, and it's a little moist, maybe a little brown. You don't want to eat that. Just like that, you probably also don't want to do Deep Reinforcement Learning, because it's really annoying. I'll tell you why. So one reason why it's
pretty annoying is because there is super poor reliability over repeated runs. So this is the same game with the same algorithm; the x-axis is the time steps, and the y-axis is the return that you get at each time step. This line represents five random seeds: you start at a random seed, then you run the algorithm. So this represents five seeds, and this represents another five. They look like totally different algorithms, even though it's the exact same algorithm. The only difference between these two clusters is the choice of five random seeds. So we clearly have pretty bad reliability over repeated runs. Another problem is super high sensitivity to hyperparameters. So the x-axis is going to be learning rate, it's logarithmic, and the y-axis is the total reward that you get. Each one of these lines represents a different algorithm that we used to train the agent, but we're just going to look at this green one, which ultimately achieves the highest reward possible. So you can see that at learning rate 11 times 10 to the negative four, you get reward zero. At learning rate eight times 10 to the negative four, you get reward 3,000. That's crazy, isn't it? Super high variance just based on this tiny little change in learning rate. Then the final issue, and there are a lot of issues, but the final issue that we're going to discuss on this slide is poor robustness to environment artifacts. So one example is that you have the same game again, and we're just going to scale the rewards at the very end by a constant factor. Each one of these represents a different constant factor. They should all reach the same ultimate reward, but because of the reward scaling, they do significantly worse. So it's pretty weird, and that's another issue with Deep Reinforcement Learning.
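For concreteness, scaling every reward by a constant is the kind of one-line environment change being tested here; a sketch using the classic Gym RewardWrapper API (the wrapper and its usage are illustrative, not the experiment's actual code):

```python
import gym

class ScaleReward(gym.RewardWrapper):
    """Multiply every reward by a constant factor, as in the robustness experiment."""
    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return self.scale * reward

# env = ScaleReward(gym.make("Walker2d-v2"), scale=10.0)  # hypothetical usage
```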
Notably, the benchmark that everyone looks at is basically: we take the algorithm and we report the highest expected reward at the very end, and that's the benchmark that people care about in reinforcement learning, and none of these problems are revealed by that benchmark. So the question is, where do these issues come from? It's unclear, because deep RL algorithms are super complicated. They have tons of moving parts, and it's often very unclear how to implement them from just the papers. So one example here is the OpenAI baselines repository. It has high-quality implementations of Reinforcement Learning algorithms, and in particular we're going to look at the PPO1 and PPO2 implementations, which are from the paper they have about PPO, which is just a deep RL algorithm. So these are all GitHub issues of people complaining about the differences from the paper and between these two implementations. So between the two implementations there are huge architectural differences, there are huge differences between the policies, and they have all these different optimizations on top of the algorithm that they don't mention in the paper at all. There are super non-trivial changes in the repository compared to the paper, and so on. So the overall message of the slide is that deep RL algorithms are really complicated and they're really underspecified when you just look at the papers. So basically, in PPO at least, they have the actual algorithm they describe in the paper, and then they have the implementation, with all
these different kinds of optimizations on top. One example is orthogonal, there we go, one example is orthogonal neural network initialization. So it's just a different way of initializing the weights. Normally in PyTorch you use Xavier initialization, which works really well for image classification tasks, but they suggest using orthogonal neural network initialization. So it turns out that when you run the algorithm using orthogonal initialization, you do way better than with Xavier, and it's a little unclear why. You wouldn't think a priori that this would be a big deal.>>What task?>>What task is this?>>Yes.>>This is->>Humanoid.>>Yeah, Humanoid using PPO.>>Is it stable across different tasks?>>Yeah, you see the same kind of effects across tasks. It's definitely more pronounced with harder tasks.
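As a concrete illustration of that implementation choice, here is roughly what orthogonal initialization looks like in PyTorch; the layer sizes and gains below are illustrative assumptions (Humanoid-sized), not the exact values from the baselines repository.

```python
import torch.nn as nn

def ortho_layer(layer, gain=1.0):
    # Orthogonal weight initialization instead of the PyTorch/Xavier default.
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

policy_net = nn.Sequential(
    ortho_layer(nn.Linear(376, 64)),            # 376 = assumed Humanoid observation size
    nn.Tanh(),
    ortho_layer(nn.Linear(64, 64)),
    nn.Tanh(),
    ortho_layer(nn.Linear(64, 17), gain=0.01),  # small gain on the action head (common practice)
)
```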
So where was I? Yeah. So this experiment is essentially that we took all the different optimizations and we did the Cartesian product of all of them, and then we plotted the maximum reward for the half of the Cartesian product with this optimization and the half without the optimization. So it turns out that when you use the optimization, you do way better in terms of the maximum reward. This is true over a bunch of the different optimizations. I'm not going to go into what all these optimizations actually are, but when you look at the maximum reward for all these different optimizations with and without, they're drastically different. By the way, these are not even listed in the paper, that's how unimportant the authors originally thought they were, or how common. But people clearly have a very hard time reimplementing these algorithms, because it's often unclear what exactly is part of the deep RL algorithm that they present and what's just another optimization on top. So even with these seemingly small changes, performance can vary super widely. So the overall takeaway here is that these deep RL methods are underspecified and really complicated, and the reasons for unreliability and performance are somewhat unclear. It's not clear if it's the algorithms or if it's all the little optimizations that they put on top of them. So this calls for us to go back to first principles and look at what these algorithms are really doing. To do that, we're going to look at a bunch of different tenets of the policy gradient framework. One of them is gradient estimates, and I'm going to explain all of these as we go, so I'm just going to go quickly through them. Another one is value prediction. We're also going to look at optimization landscapes, and finally, we're going to look at trust regions at the very end.
our policy gradient framework, one of the key assumptions
that we have is that the gradient
that we actually take is pretty correlated
or at least correlated with the finite sample
approximation that we get, and we want to look at
how this is in practice. So the experiment that we’re
going to do is we’re going to fix a single policy and then we’re
going to take a bunch of steps. Each of which uses this case
sample gradient estimate. So we’re going to take a bunch of samples and then we’re going to
make a step based off this sample. So you should expect that
if you have more samples your gradient estimates
are going to get better. We’re going to do this
a bunch of times, and what probably we want to
be able to do is make sure their concentration and see how well these actually
concentrates the true gradient. The way we’re going to do that
is we’re going to measure the mean pairwise correlation
between all the different gradients. Between all the different gradient
estimates that we collect. So you can think of this as, if you have higher pairwise
gradient correlation you’re going to have
better concentration. If you have lower mean
pairwise correlation you’re going to have
worse concentration. So this is a plot where on the x-axis you have
the number of samples that we use, and on the y-axis, we have the average, basically, just the concentration
in that regime. So higher means that you’re basically concentrating to
get the actual gradient, and lower means that
you’re not as much. So this black line is what the algorithms
actually use in practice. So you can see that roughly,
I don’t want to say half, but about half the time, a little less than half the time, the steps that you
take are actually in opposite directions from one another. The gradient is so much less
concentrated than it should be. But that’s not necessarily as big of a problem as you might
think because in high dimensions, if you have very low
cosine similarity it’s still pretty significant
because you’re in high dimensions.>>So the x-axis is, you’re changing
the state action space?>>So the x-axis is the number
of samples that we use.>>Okay.>>Yeah, and so you
would expect, yeah.>>You can finish your sentence.>>No, no no.>>Okay. Is this.>>You can just ask, if you guys have confusions just ask a question.>>Is this consistent across different architectures
of the policy.>>So we only tested
one architecture of the policy, but if I had to guess,
I think it would be.>>Consistent across
tasks variance system.>>Yeah, it is very consistent across tasks across different
variations. Yeah.>>So in this case, what’s
the dimension of the parameter? What’s the dimension
of the parameter?>>About 5,000. So if you consider random Gaussians and then you have correlations between
the random Gaussians, so those like, you get roughly one over square root
of de-correlation, cosine similarity I guess
between two random Gaussians. So you get maybe 0.01 or less than that in terms of correlation if you just drew Gaussians
and 5,000 dimensions. So this is non-trivial one.
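A quick sanity check of that baseline figure; this is just an illustrative calculation, not something from the talk.

```python
import numpy as np

d = 5000                                    # roughly the parameter dimension quoted above
g1, g2 = np.random.randn(d), np.random.randn(d)
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print(cos)  # typically on the order of 1/sqrt(d), i.e. about 0.014 in magnitude
```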
So this is non-trivial. Alex pointed this out; well, multiple people pointed out similar things throughout. Never mind, but anyway. Yeah. So it's not quite as bad as it looks, but there's clearly a lot of room for improvement here.>>How does it vary across the iterations of the optimization algorithm?>>Yeah. So at earlier iterations this looks much better. So if you're at the very start, this graph is actually just shifted left, and so you actually get pretty reasonable correlation at the very first iterations. I think this is iteration 150 or so out of 500 or 300, but it very quickly drops off. So as you go further in the iteration process, this graph shifts to the right and you'll get much worse estimates in terms of concentration. Any other questions? [inaudible]>>What do you mean by harder tasks?>>Harder tasks, yeah. So there's an informal hierarchy of how hard all these different tasks are. I guess you can make it more formal, typically in terms of sample complexity: how many samples you need to learn the task. So this is for Humanoid, sorry, for Walker2d, which is considered one of the harder tasks on [inaudible]. Then there are easier ones like Hopper. For some reason that's easier, yeah.>>Can you give some sense of how bad this could be? Is this doing much better than the worst-case scenario?>>Yeah. Actually, we're going to
get to that in the next section, yeah, when we look at value estimation. So I guess the key takeaway here is that we don't have a great understanding of the training dynamics, of how this variance really impacts our optimization process, but it would be great if we could use insights from stochastic optimization to be able to look at this. It's not exactly the same regime, because the samples that we get are not independent, and not only that, but the actual objective, because of the way that the deep policy gradient methods are organized, is non-stationary. So you can't exactly apply SGD theory, but it would be great to use insights from it. Yeah, and another key thing here is that we're really missing a link between reliability and sample size. So it turns out that when you really scale up these algorithms and use many more [inaudible] samples, these algorithms become much more reliable. This plot hints at that, because you can see that the gradient estimation is much better when you use many more samples. So it would be great to get a better understanding of what that's like. There's actually an OpenAI paper about that. But it would be great to
that we get is really hindered by in terms
of concentration. In terms of concentration, the concentration that we get is
really hindered by poor variance. So it be great if we could
lower this variance, and it turns out that
one way to do that is to estimate the values and then use that in
our policy gradient method. So the value of the state is
if you have a given state, the value is the expected reward that you get after
visiting that state. So the idea is that if you can
estimate these states well, then you can better separate
out the action quality like what action you take from the action that you take versus
what the state quality is. So for example, if you have a robot and it’s about
to fall over something, then if you take an action you don’t want
to say the action is bad because you were about
to fall over anyway. So the idea here is that if you can understand what the state
contribution to how good the algorithm does versus what the action
contribution is and you can significantly lower the variance. To reduce the variance, you
need good value estimates. The way that we get value estimates
is we use a during training, we collect all these different
samples of states and rewards and you can calculate
what the values are from that. So we basically just perform a supervised regression task at every point in training process
using the data that we collect. So we really want to
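A rough sketch of that value-fitting step, regressing a small network onto the empirical returns; the architecture and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim = 17  # depends on the environment; illustrative

value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                          nn.Linear(64, 64), nn.Tanh(),
                          nn.Linear(64, 1))
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_function(states, returns, epochs=10):
    # states: (N, obs_dim) tensor of visited states from the collected trajectories
    # returns: (N,) tensor of empirical discounted returns observed from each state onward
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)
        loss = ((pred - returns) ** 2).mean()   # plain supervised regression on returns
        value_opt.zero_grad()
        loss.backward()
        value_opt.step()
```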
So we really want to understand here what the value estimates we get do in terms of reducing our variance, how well we could actually do, and also how bad it is, as a baseline, if you don't do anything. So the way we're going to do that is a similar experiment to last time, where we vary the number of samples on the x-axis and show the concentration on the y-axis. It's the exact same plot as previously, but now we're going to look at three different kinds of agents. So the first agent is when you don't use any baseline at all, so you don't use any value estimation. Then the red line here represents what happens when you use the standard value estimation from the algorithm, and this blue line represents what would happen if you got a near-ideal value function, and the way that we've calculated that is we basically just take a ton of samples and very well approximate what the value function is.>>So just to make clear, the "true" value function here actually refers to the value function of the current policy, not the optimal value function.>>Yes.>>Okay.>>The value function of the current policy. So this has no value function, this is the agent's value function, and that's the true value function. So it turns out that the agent does significantly worse than what it could be doing if it had the true value function. But it's still doing significantly better, because again, remember we're in high dimensions, so this is actually pretty good, and it's doing quite a bit better than no value function. But there's clearly significant room for improvement here. You can see that the concentration gets much better for the true value function. So one of the key questions here is, if we were able to get better value functions, how would that affect training? How much better would we be able to do? How much more reliable would it be able to be? Not only that, but how can we actually get better value functions? Because it's clear that there's a big benefit here, but what's unclear is how well that would really translate to optimization in general.
So now the third thing that we're going to look at is optimization landscapes. So a key assumption, again, in our policy gradient framework is that when we take these gradient steps, we increase the overall reward that we're going to get from that policy, and so what we want to see is how valid this assumption is in practice. So we're going to look at a lot of plots of this form in the next few slides, so we're going to make it very clear what these plots are. Essentially, we fix a policy, and then this direction represents moving in the actual step direction that we get, so this point represents what the actual next step would be. Then this direction represents going in a random direction chosen from a Gaussian, and what the plots here show is the reward that you get at that new policy. So you fix a policy, then you move this much in the agent step direction and this much in the random direction. So one example here: this red stuff right here represents moving 2.5 times in the step direction and 1.5 times in the random direction, and then this is the ultimate reward that you get from that policy.
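Here is a minimal sketch of how such a landscape could be computed; eval_mean_return is an assumed helper that rolls out the perturbed policy and returns its average return, and the scale grid is arbitrary.

```python
import numpy as np

def reward_landscape(params, step_dir, eval_mean_return, scales=np.linspace(-1.0, 3.0, 9)):
    # Grid of rewards spanned by the estimated step direction and one random Gaussian direction.
    rand_dir = np.random.randn(*params.shape)
    rand_dir *= np.linalg.norm(step_dir) / np.linalg.norm(rand_dir)  # match the step's magnitude
    grid = np.zeros((len(scales), len(scales)))
    for i, a in enumerate(scales):          # multiples of the step direction
        for j, b in enumerate(scales):      # multiples of the random direction
            grid[i, j] = eval_mean_return(params + a * step_dir + b * rand_dir)
    return grid
```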
So this is step zero. You can see that we're doing pretty well here: you move in the step direction, you get increasing rewards, which is great. But then by step 150, you can see that this really degrades quite a lot. Just moving in the step direction actually lowers your reward. That stays true at step 300, and we looked at a lot of these kinds of plots, and essentially it shows that oftentimes the steps are not predictive later in the optimization process. This looks even worse for harder tasks. So for easier tasks it looks a little better, but for these harder tasks it looks much worse. So the natural question to ask here is, what's going on? It turns out that when you look at what the algorithms are actually doing, they're not maximizing the actual true rewards. Instead, what they're maximizing is some surrogate reward.>>Sorry, I have a question
on the previous plot. So the x-axis is the step taken by the agent?>>Yes. So you're talking about this one, right? Yeah. So I fix a policy and then I move x times the step in policy space, and I evaluate how that agent does in terms of reward. Does that make sense?>>It's like it's moving in optimization parameter space.>>Yes.>>Okay.>>Because "agent step taken" sounds like the agent taking steps, but [inaudible].>>The random direction would be like adding noise to the [inaudible].>>Yes.>>But other results showed that adding noise to policy parameters actually helps.>>I'm not aware of any, but I'm happy to talk to you [inaudible] about that.>>Sure.>>You are also adding noise to the policy parameters.>>Yes, also adding noise to the policy parameters. I'm not sure about that.>>Yeah, there's a paper from OpenAI on parameter space exploration.>>Yeah. Sure.>>Okay. Any other questions about these plots? Is everyone clear about what's going on here? You move in parameter space in the step direction, and you can move in a random direction. This is the kind of plot you get. Yeah?>>For the second plot, since we are kind of following the gradient, why is the reward going down? This part.>>We're going to explain
that in the next slide. Yeah, that's a good question: why is it that when we go in the direction of the actual step that we take, the true reward is going down? It turns out that these methods actually don't optimize the true reward. What they instead optimize is something called a surrogate reward. I'm not going to go into detail about it, but we can talk afterwards about what this actually is, and what we want to check here is how the landscape of surrogate rewards compares to the landscape of true rewards. This is a surrogate reward landscape. It's the exact same format as our previous landscapes, except that now it's the surrogate reward instead of the true reward.>>So the [inaudible] is coming in because of the proximal policy objective?>>Yeah.>>Okay.>>Yeah. So they look at basically the policy ratio times the advantage. That's what they're optimizing instead of the true rewards.
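For reference, the PPO form of that surrogate, the policy probability ratio times the advantage with clipping, looks roughly like this; a sketch, not the exact baselines code.

```python
import torch

def ppo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    # Clipped PPO surrogate: the probability ratio times the advantage,
    # with the ratio clipped to [1 - eps, 1 + eps] before the elementwise min.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()   # maximize this (or minimize its negative)
```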
So this is what the agent is actually optimizing. So you can see that it has a maximum right where the step direction is. So it's actually maximizing the surrogate reward really well. That makes sense, because it has full access to the surrogate reward, so it should be able to optimize it really well. It should be clear that the surrogate reward is based off of what it sees in training; it's just that they make some approximations, and it's based on some theory. But what actually ends up happening here is that at the start these actually work pretty well. This is step zero. What happens is, you move up in the surrogate landscape direction, you also move up in the corresponding true reward direction. So this step does pretty well. But then by step 150, you run into the same problem that we saw on the last slide, where you move in the surrogate direction, and it looks like this is the optimum, but what actually ends up happening is that in the true reward landscape you're going down. You're getting worse and worse. This continues to be true at steps 300 and 450. This looks great; it's optimizing the surrogate as much as we want. But it turns out that even when you move in the surrogate direction, increasing the surrogate reward, you're really not doing very well on the true reward landscape.
landscape is a landscape for the value network or how do you guess
this surrogates landscape?>>Yes. So how do you
find out about it? But essentially, the
surrogate landscape is it’s just a function of the
trajectories that you see. So you collect all these
different trajectories and training and then you make a circuit out of it which is
motivated by some theory work. Yeah, it uses
a value function as well.>>It’s for the policy. It’s
the one for the policy.>>Yeah, it’s a landscape that
the policy. I’m sorry, what?>>It’s the landscape for the policy, like this is all the
policy parameters.>>Yeah. This is where, again, offering a policy space.
Does that make sense?>>Yeah.>>So you can think about
this like just abstractly. You can think about the
surrogate reward as something that the algorithm
is actually optimizing over, like it takes in the trajectories and then it makes a landscape that it thinks corresponds to
the true landscape, and then it optimizes that instead of optimizing
over this true landscape.>>On the [inaudible] side, I think there’s two approximations. One is you use a multichannel to some port to estimate,
say, accreditation. That is one kind of approximation. The other approximation is that even do not know the true value function. So I need to use some network
to optimize that well. So by surrogate here, to which approximation
you are already->>It uses both of them.>>Both of them.>>Yeah. So it uses a value function and it uses
the samples that we get. But it’s not looking at
the actual rewards you get. It’s looking at some function of
the actual rewards you get. Yeah?>>So you’re measuring
these functions on the same scale, sir? Are we seeing->>Yeah. So you should->>You just sort of reward
by a very small amount of->>Right. So you should
ignore the scale in terms of like the only thing that’s important about,
this is a direction. Like this goes up, this
should also go up.>>I’m just curious in
trying to take a step to achieve a very small improvement
in the surrogate reward. Are we seeing the comparatively
much larger decrement in the true reward or are
these scales not comparable?>>Yeah, these scales
are not comparable.>>Okay.>>But it’s definitely regardless like it is concerning that you move. When you’re increasing
the surrogate reward, you are decreasing
the true reward. Yeah.>>Does the network, does
the agent become better?>>Yes. Yes, it does. Yes. So that’s the key thing throughout
this whole presentation. I am not saying that
these deeper algorithms do not work because they do work.>>But how can it become
better if I always, pretty much, after some point,
decrease my true reward?>>That’s a great question.
That’s a great question. I mean we have some hypotheses
but it’s long and clear. Yeah. This could be a source of the unreliability
issues that we see.>>So for these slides, do you think that’s the error, many form the value
function approximation?>>I think both of
them are, probably. But I think both probably
play a part in this. But it’s more about the, I think, both of these
play a part in it. But actually, so when
we take this step, we use many, many samples. So these actually
should be pretty good. Like when we take the step, we’re not taking the actual agent stuff. What we’re actually doing
is we’re using many, many samples to get a pretty good approximation of
what the agent step should be. So I don’t think that
those issues are super big.>>It’s like the third source
of approximation.>>Yeah.>>It’s a kind of approximation.>>Yeah. It’s kind of like a third
source of approximation error. It’s like the actual function
that they build out of the trajectories, not the value function in
social, in the estimates.>>So in the same way
that you can’t optimize accuracy when you train networks, you optimize cross entropy. That’s exactly the same.
You can optimize reward. You optimize some surrogate
of the reward.>>Yeah.>>That’s the procedure.>>Yeah. That’s a great explanation.>>So I guess the quantity
on the x-axis, the way I should think
about it is like the step size in my
optimization algorithm, right?>>Yes.>>But then presumably, the scale at which you’re plotting
things is much larger than the regime in which the step sizes
are operating and training it. So what is the actual step?>>Yeah. So if you->>Oh, I see. You scale with the->>Yeah.>>I see. Okay.>>Yeah.>>So if you’re just too early
stopping at 150, would you get->>No. I mean, you definitely improve overtime. So it’s not like these arguments
are not working. It’s just a little unclear what the mechanisms behind them
are there and making them work.>>[inaudible] being in total report
or do we turn [inaudible]?>>I’m sorry. What?>>Is the y-axis in the red plot
here is the same thing you’re claiming that the total reward will increase overtime
like it will cut them all?>>Yes. This quantity exactly corresponds to what
should be increasing overtime.>>So how can that be if
every step you take decreases?>>That’s not every step.>>Like so we sample
a couple of steps. It could be that you take like
very few high magnitude jumps but then if you sample at any point, you’re likely to see like
some small detrimental.>>So maybe on average
like this doesn’t do so well but you might get a few steps
that are doing really well, so it bounces out. Yeah?>>So is this reward the return
or is the step-wise reward?>>The total reward over->>Yeah. This is
the total reward. Yeah.>>So it’s just discounted return.>>Yeah, it’s discounted
return actually. Yeah. So as we can see here, the surrogate reward is often very misaligned with
the true reward landscape. It’s important to
note that everything we’ve looked at so far has been
in this high sample regime. So every time that we look, we try to make an approximation of how good each one
of these agents are. We get the discounted reward
using a thousand trajectories, which is pretty good, like you can get
a pretty good estimate of this. But when the agents
are actually learning, it only uses about 20 trajectories. So what does agent
actually see when we’re going throughout
this optimization process? So these are 20 sample
estimates which means that each one of
these points correspond to taking 20 trajectories
for reward estimate. So you you take the step, you look at this new agent, then you run through 20 trajectories, and then you see what the mean
discounted reward was, mean return. Then this is what happens when you take 200 samples or
unit of every point. This is what happens when you take a thousand samples of every point. So you can see here
that if you use many, many samples, you get a really
nice like smooth landscape. So if you move in this direction, even though you are definitely
actually improving these rewards, it’s little hard to detect
in the agent sample regime. This is concerning because this is what the agent
actually uses to make steps. So it’s hard for the agent
to even know if it’s making progress because of how
noisy this landscape is. So the two key takeaways here are, that first these landscapes are not very reflective
of the true rewards, and it’ll be great to understand why, and how that impacts
the optimization process. It would also be great to
understand how we can better navigate their award landscape because it seems like
the secret “Word”, maybe it’s not
the best way to do this. The final aspects of
these policy grading methods that we’re going to look
at is trust regions. So in parameter space you can think of our optimization
process as follows, you have this original point
in optimization space, then you take a step and
you go to the next point, take another step, you go
to the next point so on. At every step what happens is you
take a bunch of trajectories, and you make sure that
the step that you take based on those samples is
within a trust region. So each one of these steps
has to be within this trust region because
the samples that you take are only informative
locally around where the current region
is because that’s where you took the samples. So you want to make sure that
the steps that you actually take are actually informed by
the samples that you took. So the idea is that you want
to be able to take steps in this trust region but if you
go outside the trust region then it’s a little unclear about
what you’re actually going to get because you didn’t take samples
there, you took samples here. So what PPO and TRPO use, they’re motivated by this KL-based based trust region
where you’re looking at the maximum KL distance between the action distributions
induced by states. So intuitively, you can
think of this as key. Make sure that even in across all the states
that you could possibly see, make sure that the way that
I choose my action is not too different from my current policy
to my next policy. So when I take a step, I want to make sure that in my next policy it’s
not going to be too different than the way
that I take my steps, and the way that I take my actions, and even in across all the states. But this is hard to
enforce in practice because we don’t see
the whole state from the space. So instead what we do is we
relax you and expectation, and what we see here instead
is we want to intuitively constraint the mean way that we take actions at every state,
does it make sense? So what we want to see here is
what actually happens in practice, does our next agent actually
satisfy the constraint? So right here this represents the iteration in
our optimization process, so there’s 450 steps, and we’re going to
look at every step, we’re going to see what
the mean KL distance was between our current agent
and the next agent, and so it should be around
here which is what TRPO gets, and TRPO actually does
this very nicely, and it really maintains the trust region but is
everything clear so far? That’s great, okay. So with
a PPO algorithm does not. So you can see here that we go
from two to negative six mean KL, to two to negative three KL,
across the pointer training. It doesn’t look like there’s any sign of stopping but interestingly, the optimizations helped quite a lot. So we have this core PPO algorithm, and it purports to
keep this mean KL the same or it has a relaxation but it purports that this is
the overall goal of this algorithm is to keep
this mean KL all the same, and while relaxing this constraints
is computationally, so they’re easier to compute.>>Certainly is it
just approximation that they are going from two to the minus
five to two minus three?>>I’m sorry what?>>The PPO is using
that mode approximate.>>Yes right.>>So that is going to
be a loss also just from Taylor’s approximation
that’s involving.>>Yes.>>You have a sense of
whether it’s all [inaudible].>>Yes. So that’s really interesting. Yes. So that’s what we’re
looking at right here. Is the fact that this algorithm
and this algorithm use the exact same enforcement method
like these two methods in terms of if you just
read the paper you would think that these would have forced this mean KL just as well because they use the exact
same enforcement method. But when you put all these different
optimizations on top of PPO, it turns out that you can get significantly better
trust region enforcement and these optimizations include
learning rate annealing, value clipping and so on and using
orthogonal weights and so on. So it’s unclear exactly
what is causing this trust region to be enforced but this trust region
ought to be enforced because they are the same algorithm. This is PPO greenest BPO and this blue on his PPO M
which is like PPO minimal which is what you
would get if you just implemented the PPR algorithm as
stated in the original paper. Then this is what
you get when you use all the different
optimizations that you find in the opening
I GitHub repository. So what’s interesting here is that even though the enforcement
method mechanisms if you just looked at the algorithm
appear to be the same, the optimizations cause the actual enforcement in practice
to be drastically different along these two algorithms.>>This part is not
as surprising just because presumably a lot of the optimizations were going
to actually stabilize. The numerical aspects and at least if your policies are changing in a more stable fashion then by definition trust regions that can
be better maintained as well.>>So what they
algorithm is for here is not they aren’t trying to
make anything more stable. What they’re trying to do is maximize total rewards
at the end, right?>>Right, but they are
presumably trying to do it in somewhat more reliable manner across the different tasks
that they’re evaluating on. So it might be an artifact of.>>Right. It could me artifact
of that optimization. I guess what’s interesting here in general is that
I don’t know but I mean, maybe this is just me
but when I look at the optimizations that I don’t see maintaining trust regions
at all in any of them. The only mechanism
that I actually see in the algorithm for maintaining
trust regions is like the key PPO like
ratio clipping thing which is kept constant
across both of these.>>It’s interesting or at least it’s somewhat
surprising thing to us that the mechanism that’s
designed to maintain the trust region does
not seem to be the thing that’s actually maintaining
the trust region. It seems to be some other stuff
that we add on top.>>So I guess as we
just talked about, one of the key questions to ask here is what part of
these algorithms are actually doing what and how do we reason about these algorithms when
they’re using such relaxations to the original trust regions
that they were supposed to be at least in terms of
theoretically grounded using. Not only that but how can we
capture the different kinds of uncertainty that we have in our algorithms in our trust regions. So the original trust regions
that motivated the trust regions that these deep RL algorithms use
don’t take into account stuff like like bad value functions or really unconcentrated
gradients and so on. So it’ll be great to see
what kinds of trust regions we can come up with that
take these into account.>>I guess the difficulty am
having with this part is so PPO is fundamentally once you do
a tailor approximation of KL, sure you could still
go back and measure KL and which is kind of
what you’re doing but you could also say that it is just defining a different notion of
what a trust region should be and what do things look
like if you actually just evaluate what PPO is enforcing.>>Yeah, we actually
looked at that too. I didn’t choose to include that in
the slides because I thought it would be too much I guess but I’m happy to talk about that.
We have that in our paper.>>Sure.>>What’s the mean takeaway
from this fact because all of these even PPO people
they say that, “Yeah, we use very loose relaxation
but empirically we observe success which is essentially what this section of the
presentation is also saying.>>It doesn’t seem to be due to the relaxation like you
could just not have the. It turns out that if you just
remove the relaxation that PPO does then you draw the optimizations
that you do slightly better, you can just enforce
the same trust region. So like the whole clipping thing, you can just set the
hyper parameters exactly right since it never leaves the clipping
thing and then it’s all fine. So the clipping doesn’t
actually seem to be doing it. It’s more like the optimizations that we added on top of
that the thickening make it make the optimization so nice
that you don’t actually need the trust region in the first place.>>Yeah, easy. So just general takeaways that
we can get from this. In general, the deeper RL
methods are really complicated and they have a lot of moving parts and they’re hard to understand. Not only that but these deeper RL trained dynamics
I really poorly understood. The steps that we take are
often really uncorrelated. The surrogate words don’t match the true rewards and
the trust regions don’t hold oftentimes at least for
the reasons that we think. So the big question here is how do we proceed like what what are
we go do in the future about this on and so the first thing that we
might want to do is shut. It reconcile RL with
our conceptual framework tried to make our deep RL
algorithms actually match the policy gradient frame work better. So how can we do that? Another stuff that we
could do is try to rethink our framework for
these deep RL methods trying to move our framework closer. So for that we would have to
figure out how to deal with high dimensionality
and these algorithm, different kinds of
optimizations that they put on top of the core method and not only that but dealing with these non-convex function
approximations of deep networks. Finally, our results suggest that we need barrier evaluation
for our RL-systems. We have to move past a return-based centric
benchmark system and try to look holistically at
all the different aspects of these algorithms like trying
to look at reliability, and robustness, and safety. If you want to read
more we have of paper and we also have
a bunch of blog posts.>>More questions. I’m curious to see if
you run similar probes on some bandwidth like settings or just subway settings to see how much the gradient estimation issue or the transmission issue come up.>>Yeah. So we actually looked
at using SGD to maximize. So one of our buddies
looked at maximizing, basically just looking
at toy settings in SGD, like using SGD to maximize
our quadratic or something. So it turns out that you can make
this stuff super uncorrelated. You’re still going to maximize
the quadratic pretty well. So we thought that was
pretty interesting, but the dynamics in RL
are very different. I guess for the reasons that
we mentioned before about lacking independence
and non-stationarity. So we’ve looked at some
experiments that are similar in these regimes. I think bandwidths
would be a great place to look as well, but I mean, I think bandwidths are very theoretically well understood and there’s not too many
moving parts in them. There’s a core algorithm, but I think we can go with that.>>So how do you view [inaudible]>>Yeah.>>I’m just still over confused about the part that like
your gradient seem to be very uncorrelated and your [inaudible] seems to be
going down most of the time, so is it just very much like
they can random match them. Like what if you actually like do that try
to just instead of four, and your gradient just takes
a random direction but keep with a different because
of that or something like that.>>Yes. So that’s actually
a technique that people use, is like finite difference methods. There’s a paper from Ben Recht
about it called like Random Search is a Competitive
Baseline for [inaudible] or something. It was a good paper,
pretty interesting paper at least. So basically, what
they do is they just take a bunch of random directions
and see which ones do well only. They do a bunch of
other optimizations on top of this. They do some wacky stuff about like throwing away
different directions, but yeah, it’s the same core
algorithm, and it works pretty well.>>If you are going to
start from scratch, approaching this problem domain, are there things you think you would leave out out of
the present framework for RL, or replace with something else? Or what ideas do you have about how to avoid having
some of these issues at all?>>Yeah. So I think that, it would be great to look like as we
design these algorithms, it would have been great to look at how the different optimizations that we use actually impact
the performance. Like, people at least, I mean, so the policy gradient
framework that people came up with is not intended for
the deep RL methods, or for these kinds of tasks as much. So I think it’d be good to design. I think would be good,
I’m not sure, I guess, about how I would
design the framework, but I think that in general, when developing
these kinds of methods, I would be more careful about
looking what the impact is of different agronomic aspects. Then trying to really
understand what’s causing performance and what’s causing reliability or unreliability
or lack of performance.>>Have you tried using
linear model as in like, is a problem here to
classify it [inaudible].>>Yeah. So actually, you
can solve this without any deep learning using a linear
model with these algorithms.>>Right. On the same,
have you done experiments?>>Have we done experiments on that?>>Like say like
the optimization landscape, you have a linear
function approximator. Does it also look like that?>>Yeah, that’s a great question. I would suspect that
these environments would be similar. But I think that it would be a
very interestingly to look at. Actually, so Ben Recht’s paper
uses a linear approximator.>>The random search one?>>Yes. [inaudible].>>The surrogate landscapes
look like [inaudible].>>Yeah. So if you look at
the surrogate landscapes. So if you look at
the surrogate landscapes, they’re like vaguely linear. I mean, which makes sense because, I mean, I don’t
know if it makes sense, but the actual thing you’re optimizing is linear in
the outputs of the network. So these preliminary, yeah, I mean, I don’t know if
that’s, I don’t think, I’m not sure if that’s
a good connection.>>If you have
a multi layer linear network and the optimization is not linear.>>Yeah, absolutely. Sorry.>>What’s the network on
the picture using all these?>>Yeah. It was a two-layer MOP.>>What’s in there? We mainly use for most of these.>>It varied a lot per experiment. I think we used, whatever the best one it was during
[inaudible] I’ll be referring to any.>>Like 10 to a negative four. I saw earlier you had this bot.>>Yeah. I think I was
probably something around 10 million to
10 million or four.>>Do all the environments
show plots like this?>>Yeah. So we actually saw in
our appendix we have everything. We’ve got like 30 pages of appendix or something, so
you can take a look at that. So I mean, for easier task, they look much better, I would say. We’ve mostly looked at it in this Walker TD, which
is the hardest one.>>Yeah. It’s curious, you
would imagine that more unstable environment’s probably
they’re more sensitive, but more stable than
perhaps things that balance, they’re more unstable
as well as maybe even think [inaudible]
stuff they’re more stable, that was needed [inaudible].>>Yeah. I guess, I’m not sure. Yeah, I don’t really
have a good intuition for how the different
game shift work.>>Maybe I missed this, so
thinking about other evaluation, do you have a constructor
solution for how we might go beyond just benchmarks
and eventual average? What might be an
alternative evaluation that exposes like this
hyper sensitive video, hyper parameters and
things like that?>>Yeah. I mean, so we never
talked about ideas for that. We haven’t talked
about [inaudible] yet. I guess, we haven’t thought
about that too much. I would say that one thing
would at the very least. So if you look at
a lot of these papers, when they showcase results, they show the results in a way that makes it look
more stable than it is. So one example is the super
common practice to use smoothing. So they basically say
like we’re going to look at a weighted average
of what my returns are over time rather than actually getting what
their true rewards are overtime and plotting that. I think it would be good
just as a very basic start to have more rigorous
evaluation there. Actually, one big problem
with comparing methods is that when people use all of
these different kinds of, like these smoothing
or they say, "Oh, I'm going to collect five seeds, and I'm going to choose
the one that does the best,” which is crazy, right? I think as a very basic start, it would be good to just have
some honest guidelines for just even showing reward curves.
There’s a long way to go there.>>Is there any more questions [inaudible].
If there’s any more questions, I think they’re like around today and tomorrow, so if
you want to meet with them. By all means, then let's
thank our speaker again.
