Policies – Georgia Tech – Machine Learning
Okay, so Michael, in the spirit of what we just went through in deriving the geometric series, I’m now going to write down a bunch of math. And what I’m going to do is I’m just going to say it at you, and you’re going to interrupt me if it doesn’t make sense. Okay?>>That makes sense.>>It does. Okay, so here’s the first thing I’m going to write down. I’ve decided that by going through this exercise of utilities and this kind of reward, we can now write down what the optimal policy is. The optimal policy, which as you recall is simply pi star, is simply the one that maximizes our long-term expected reward. Which looks like what? Well, it looks like this. There, does that make sense?>>Let me think. So we have an expected value of the sum of the discounted rewards at time t, given pi, meaning that we’re going to be following pi?>>Mm-hm.>>So these are the sequence of states we’re going to see in a world where we follow pi. And it’s an expectation because things are non-deterministic, or may be non-deterministic. And do we know which state we started in?>>It doesn’t matter, it’s whatever s zero is.>>I see. Whatever s zero is, but isn’t that random? I mean s one, s two, s three; those are all random.>>Well, we start at some state, it doesn’t matter, so t is starting out at zero and going to infinity. Okay? So does this make sense?>>Yes. So we’re saying, I would like to know the policy that maximizes the value of that expression, so it gives us the highest expected reward. Yeah, that’s the kind of policy I would want.>>Exactly. So, good, we’re done, we know what the optimal policy is. Except that it’s not really clear what to do with this. All we’ve really done is written down what we knew we were trying to solve. But it turns out that we’ve defined utility in such a way that it’s going to help us solve this. So let me write that down as well. I’m going to say that the utility of a particular state, okay?
Well, it’s going to depend upon the policy that we’re following. So I’m going to rewrite the utility so that it takes the superscript pi. And that’s simply going to be the expected discounted reward over the sequence of states that I’m going to see from that point on, given that I’ve followed the policy. There, does that make sense?>>It feels like the same thing. I guess the difference now is that you’re saying the utility of the policy at a state is what happens if we start running from that state.>>Yep. And we follow that policy.>>Got it.>>Right. So, this answers the question you asked me before about, well, what’s s zero? Well, we talk about that in terms of the utility of the state. So how good is it to be in some state? Well, it’s exactly as good to be in that state as what we will expect to see from that point on, given that we’re following a specific policy and we started in that state.>>Hm.>>Does that make sense?>>Okay. Yeah.>>Very important point here, Michael, is that the reward for entering a state is not the same thing as the utility for that state. Right? In particular, what reward gives us is immediate gratification or immediate feedback. Okay? But utility gives us long-term feedback. Does that make sense? So when reward [UNKNOWN] is the actual value that we get for being in that state. Utility [UNKNOWN] state is both the reward we get for that state, but also all the reward that we’re going to get from that point on.>>I see. So yeah. That seems like a really important difference. Like, if I say, here’s a dollar. You know? Would you poke the president of your university in the eye? You’d be, like, okay. The immediate reward for that is one. But the long-term utility of that could actually be quite low.>>Right. On the other hand, I say, well, why don’t you go to college? And you say, but that’s going to cost me $40,000. Or better yet, why don’t you get a master’s degree in computer science from Georgia Tech? But you can say that’s going to cost me $6,600.
Yes, but at the end of it you will have a degree. And by the way, it turns out the average starting salary for people who are getting a master’s degree or undergraduate degree is about $45,000.>>So is it considered product placement if you plug your own product within the product itself?>>No, I’m just simply stating fact, Michael. This is all I’m doing. Just facts.>>Alright.>>This is called fact placement.>>Alright.>>The point is, there’s an immediate negative reward of, say, $6,600 for going through a degree. Or maybe it’s $10,000 by the time the 15th person sees this. But anyway, it’s some cost. But presumably it’s okay to go to college, or go to grad school, or whatever, because at the end of it you are going to get something positive out of it. So it is not just that it prevents you from taking short-term positive things if that is going to lead to long-term negative things. It also allows you to take short-term negatives if it will lead to long-term positives.>>That makes sense.>>What this does is get us back to what I mentioned earlier, which is this notion of delayed reward. So we have this notion of reward, but utilities are really about accounting for all delayed rewards. And if you think about that, I think you can begin to see how, given that you have a mathematical expression for delayed rewards, you will be able to start dealing with the credit assignment problem.>>Cool.>>Okay, so let’s keep going and write more equations.
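Editor's note: the expressions written on the board are not captured in this transcript. Read back from the dialogue, they are presumably pi* = argmax over pi of E[ sum for t = 0 to infinity of gamma^t R(s_t) | pi ], and U^pi(s) = E[ sum for t = 0 to infinity of gamma^t R(s_t) | pi, s_0 = s ]. The Python sketch below is not from the lecture. It estimates U^pi(s) directly from that definition by rolling out a policy and averaging discounted returns, to show how reward and utility can disagree. The toy world, its state names, its rewards, the discount of 0.9, and every function name are assumptions made up for illustration, and the infinite sum is truncated at a finite horizon.

```python
import random

GAMMA = 0.9  # assumed discount factor, not a value from the lecture

# Hypothetical toy world: one action per state, deterministic transitions.
REWARD = {"study": -1.0, "work": 1.0, "poke": 1.0, "trouble": -1.0}
NEXT = {"study": "work", "work": "work", "poke": "trouble", "trouble": "trouble"}


def reward(state):
    """Immediate reward R(s) for being in a state."""
    return REWARD[state]


def policy(state):
    """The policy pi; trivial here because each state has a single action."""
    return "go"


def step(state, action, rng):
    """Sample the next state. Deterministic here, so rng goes unused."""
    return NEXT[state]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_utility(start, gamma=GAMMA, horizon=200, episodes=1, seed=0):
    """Average discounted return over rollouts that start in `start` and follow the policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state, rewards = start, []
        for _ in range(horizon):  # finite truncation of the infinite sum
            rewards.append(reward(state))
            state = step(state, policy(state), rng)
        total += discounted_return(rewards, gamma)
    return total / episodes


u_study = estimate_utility("study")
u_poke = estimate_utility("poke")
print("study: reward", reward("study"), "utility", round(u_study, 3))
print("poke:  reward", reward("poke"), "utility", round(u_poke, 3))
```

Under these made-up numbers, "study" has a negative immediate reward but a positive utility, while "poke" has a positive immediate reward but a negative utility, which is the distinction drawn in the dialogue. With stochastic transitions, `step` would draw from `rng` and `episodes` would need to be larger for the average to approximate the expectation.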