Lecture 20: Markov Decision Processes

2.1 Markov Decision Processes
2.2 Example
5.1 Example (cont'd)
6.1 Policy Evaluation
6.2 Policy Improvement
6.3 Example (cont'd)
In this lecture, we extend our discussion of Markov reward processes to Markov decision processes (MDPs). For MDPs, we pose and solve the control problem: the search for an optimal policy. Specifically, we describe value iteration and policy iteration, two dynamic programming algorithms that are used to compute an optimal policy in an MDP.
What follows generalizes the definition of Markov reward processes presented in Lecture 19.
An agent operating in a Markovian environment transitions from state to state, in general making decisions and obtaining rewards along the way, as follows: at time t,
10:30 AM, Apr 9, 2009
Markov decision processes (MDPs) model such agent-environment interactions. A (discrete-time) Markov decision process is a tuple ⟨S, A, R, P⟩, where time is discrete: i.e., t ∈ T = {0, 1, ...}, and
Each TAC Travel flight "auction" (viewed in isolation) is an example of an MDP. Let us simplify one such auction and model it as an MDP.

The state is defined in terms of the price of the flight and the time remaining until the end of the auction. Specifically, the state space is the cross product of the set of possible prices, say P = {150, 160, ..., 590, 600}, and the time, which we assume varies discretely from t = 0 through time T = 30, unioned with a designated state end. Let p_t denote the price at time t.

The set of possible actions A includes buy now (B) and (re)consider later (C). Rewards depend on the flight's valuation. Assuming v represents this valuation, R(p_t, B) = v − p_t and R(p_t, C) = 0, for all p_t; in addition, R(end, a) = 0, for all actions a ∈ A. Finally, transition probabilities depend on states and actions: for all prices p ∈ P, actions a ∈ A, and times t ∈ {0, ..., T},

$$
\begin{aligned}
P[\text{end} \mid p_t = p,\ a_t = B] &= 1.0 \\
P[p_{t+1} = p + 10 \mid p_t = p,\ a_t = C] &= 0.5 \\
P[p_{t+1} = p - 10 \mid p_t = p,\ a_t = C] &= 0.5 \\
P[\text{end} \mid \text{end},\ a_t = a] &= 1.0 \\
P[\text{end} \mid p_T = p,\ a_T = a] &= 1.0
\end{aligned}
$$
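As a minimal sketch, the auction model above might be encoded in Python as follows. The function name `make_auction`, the valuation of 500, and the clamping of prices at the ends of the price set are assumptions made for illustration; the notes do not say what happens when a price would leave the set.

```python
from itertools import product

END = "end"
ACTIONS = ("B", "C")  # buy now, (re)consider later


def make_auction(prices, horizon, valuation, step):
    """Return (states, reward, transition) for the simplified flight auction."""
    lo, hi = min(prices), max(prices)
    states = [(p, t) for p, t in product(prices, range(horizon + 1))] + [END]

    def reward(state, action):
        # R(p_t, B) = v - p_t; every other (state, action) pair pays 0.
        if state == END or action == "C":
            return 0.0
        return float(valuation - state[0])

    def transition(state, action):
        # Buying, acting at the final time T, or sitting in `end` all lead to `end`.
        if state == END or action == "B" or state[1] == horizon:
            return {END: 1.0}
        p, t = state
        nxt = {}
        for q in (p + step, p - step):
            q = min(max(q, lo), hi)  # ASSUMPTION: clamp at the price bounds
            nxt[(q, t + 1)] = nxt.get((q, t + 1), 0.0) + 0.5
        return nxt

    return states, reward, transition


# Prices 150, 160, ..., 600 and T = 30 as in the text; v = 500 is an assumed valuation.
states, reward, transition = make_auction(range(150, 601, 10), 30, 500, 10)
print(len(states), transition((300, 5), "C"))
```

Representing the transition function as a dictionary from next states to probabilities keeps the expectation in the Bellman equations a one-line sum.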
At state pt, what is the optimal action?
A policy is a map from states to actions: i.e., π : S → A. The state value Vπ(s) associated with state s under policy π is defined as the expected reward that is accrued from state s on by following policy π. By Bellman's theorem (for state values), this state value can be equivalently expressed as the sum of the immediate reward obtained by taking action π(s) in state s and the discounted expected value of the next state s′, assuming the policy π is followed from state s′ on:

$$V^\pi(s) = R(s, \pi(s)) + \gamma\, E[V^\pi(s')] \tag{1}$$

Figure 1: TAC Travel Flight Auctions as an MDP. States are indicated by circles. Fat arrows indicate actions; they are labeled with rewards. Skinny arrows indicate transitions; they are labeled with probabilities.

Policy π dominates policy π̂ (notation: π ≫ π̂) iff Vπ(s) ≥ Vπ̂(s) for all states s ∈ S. We seek an optimal policy: i.e., π∗ s.t. π∗ ≫ π, for all policies π. It suffices to restrict our attention to deterministic, stationary policies π, in which the same pure (i.e., non-randomized) action is taken every time state s is visited. (Why?)
An optimal policy can be computed by solving Bellman’s optimality equations:
$$V(s) = \max_a \left\{ R(s,a) + \gamma\, E[V(s')] \right\} \tag{2}$$
These equations state that a state's value is that which can be obtained by choosing the action that maximizes the sum of the immediate reward at the current state and the discounted expected value of the next state. As in the case of Markov reward processes, to find a solution to this system of |S| equations (with |S| unknowns), we rely on Banach's fixed point theorem. The optimal value function V∗ is the unique solution to this system of equations.

Exercise: Show that the mapping implicit in Equation 2 is a contraction on (R^S, L∞).
Given V∗, the optimal policy π∗ maps state s into an optimal action, as follows:
$$\pi^*(s) \in \arg\max_a \left\{ R(s,a) + \gamma\, E[V^*(s')] \right\} \tag{3}$$

While the optimal value function V∗ is unique, the optimal policy π∗ need not be unique.
The action value Qπ(s,a) associated with state s and action a under policy π is defined as the expected reward that is accrued from state s on by following policy π, except at state s, where it is assumed that action a is taken instead of action π(s). By Bellman's theorem (for action values), this value can be equivalently expressed as the sum of the immediate reward obtained by taking action a in state s and the discounted expected value of the next state s′, assuming the policy π is followed from state s′ on:

$$
\begin{aligned}
Q^\pi(s,a) &= R(s,a) + \gamma\, E[V^\pi(s')] && (4) \\
           &= R(s,a) + \gamma\, E[Q^\pi(s', \pi(s'))] && (5)
\end{aligned}
$$
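Written out, the expectation in Equations 4 and 5 is taken over the next state s′ drawn from the transition distribution P[· | s, a]. The explicit sum below is the same form the update rules later in these notes use:

```latex
Q^\pi(s,a) \;=\; R(s,a) + \gamma \sum_{s' \in S} P[s' \mid s,a]\, V^\pi(s'),
\qquad
V^\pi(s) \;=\; Q^\pi(s, \pi(s)).
```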
Restating Bellman’s optimality equations in terms of action values yields:
$$
\begin{aligned}
Q(s,a) &= R(s,a) + \gamma\, E[V(s')] && (6) \\
       &= R(s,a) + \gamma\, E[\max_{a'} Q(s',a')] && (7)
\end{aligned}
$$
As in the case of Markov reward processes, to find a solution to this system of |S × A| equations (with |S × A| unknowns), we rely on Banach's fixed point theorem. The optimal action-value function Q∗ is the unique solution to this system of equations.

Exercise: Show that the mapping implicit in Equation 7 is a contraction on (R^(S×A), L∞).
Given Q∗, the optimal policy π∗ maps state s into an optimal action, as follows:
$$\pi^*(s) \in \arg\max_a Q^*(s,a) \tag{8}$$
While the optimal action-value function Q∗ is unique, the optimal policy π∗ need not be unique.
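The non-uniqueness shows up as ties in the argmax of Equation 8. The following sketch returns every maximizing action for a state. The helper name `greedy_actions` and the tolerance are assumptions; the action values are copied from the flight-auction example worked later in these notes (valuation 500, γ = 1).

```python
def greedy_actions(q_values, tol=1e-9):
    """Return every action whose value is within `tol` of the maximum."""
    best = max(q_values.values())
    return [a for a, q in q_values.items() if best - q <= tol]


# Q-values for two states of the example: price 200 at t = 2, price 300 at t = 0.
q_at_200_t2 = {"B": 300.0, "C": 300.0}
q_at_300_t0 = {"B": 200.0, "C": 300.0}

print(greedy_actions(q_at_200_t2))  # both actions tie, so either is optimal
print(greedy_actions(q_at_300_t0))  # a single best action
```

Any rule that picks one action from each returned list yields an optimal policy, which is why several optimal policies can share the same Q∗.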
5 Value Iteration
The value iteration algorithm, which is based on Equation 2, updates as follows:
$$V(s) \leftarrow \max_a \Big\{ R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s') \Big\} \tag{9}$$

Equivalently,

$$
\begin{aligned}
Q(s,a) &\leftarrow R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s') && (10) \\
V(s) &\leftarrow \max_a Q(s,a) && (11)
\end{aligned}
$$

The algorithm, which is depicted in Table 1, first computes the value of each state for all actions, and then sets each state's value to be the greatest value achieved among all courses of action. The actions that yield the optimal state values can be extracted as the optimal policy.
value iteration(MDP, γ, ε)
  Inputs      discount factor γ
              convergence test ε
  Output      optimal state-value function V∗
  Initialize  V = 0 and V′ = ∞
  while max_s |V(s) − V′(s)| > ε do
    1. V′ = V
    2. for all s ∈ S
       (a) for all a ∈ A
           i. Q(s,a) = R(s,a) + γ Σ_{s′} P[s′ | s,a] V(s′)
       (b) V(s) = max_a Q(s,a)
  return V

Table 1: Value Iteration à la Gauss-Seidel.
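The pseudocode in Table 1 can be sketched in Python on a small flight auction with three prices, four time steps, a valuation of 500, and γ = 1. The identifiers below are illustrative, and one modeling choice is an assumption: a price that would leave the price set stays where it is.

```python
PRICES = (100, 200, 300)
T = 3                 # final time step
V_FLIGHT = 500        # the flight's valuation v
ACTIONS = ("B", "C")  # buy now, consider later
END = "end"
STATES = [(p, t) for t in range(T + 1) for p in PRICES] + [END]


def R(s, a):
    """R(p_t, B) = v - p_t; all other rewards are 0."""
    return float(V_FLIGHT - s[0]) if s != END and a == "B" else 0.0


def P(s, a):
    """Next-state distribution {s': P[s' | s, a]}."""
    if s == END or a == "B" or s[1] == T:
        return {END: 1.0}
    p, t = s
    up = min(p + 100, PRICES[-1])   # ASSUMPTION: price is clamped at the bounds
    down = max(p - 100, PRICES[0])
    return {(up, t + 1): 0.5, (down, t + 1): 0.5}


def value_iteration(gamma=1.0, eps=1e-9):
    V = {s: 0.0 for s in STATES}
    Q = {}
    delta = float("inf")
    while delta > eps:
        delta = 0.0
        for s in STATES:  # in-place (Gauss-Seidel) sweep: new values are reused at once
            for a in ACTIONS:
                Q[s, a] = R(s, a) + gamma * sum(pr * V[s2] for s2, pr in P(s, a).items())
            best = max(Q[s, a] for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))  # largest change seen in this sweep
            V[s] = best
    return V, Q


V, Q = value_iteration()
for p in reversed(PRICES):
    print(p, [V[(p, t)] for t in range(T + 1)])
```

Tracking the largest change within a sweep plays the role of comparing V with the saved copy V′ in Table 1, without storing a second table.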
The following tables depict the computation of state and action values and the optimal policy in a TAC flight auction, assuming 3 prices, namely $100, $200, and $300, and 4 time steps, with valuation v = 500 and γ = 1.
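| Q(s,a) | t = 0, B | t = 0, C | t = 1, B | t = 1, C | t = 2, B | t = 2, C | t = 3, B | t = 3, C |
|---|---|---|---|---|---|---|---|---|
| 300 | 200 | 300 | 200 | 275 | 200 | 250 | 200 | 0 |
| 200 | 300 | 337.5 | 300 | 325 | 300 | 300 | 300 | 0 |
| 100 | 400 | 362.5 | 400 | 350 | 400 | 350 | 400 | 0 |

| V(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | 300 | 275 | 250 | 200 |
| 200 | 337.5 | 325 | 300 | 300 |
| 100 | 400 | 400 | 400 | 400 |

| π(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | C | C | C | B |
| 200 | C | C | C/B | B |
| 100 | B | B | B | B |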
The optimal policy is fairly intuitive: It prescribes that an agent should buy if all time has elapsed, regardless of price. Similarly, an agent should buy if ever the price hits the lower bound. Otherwise, if time remains and the price is not rock bottom, it is optimal to consider buying later, since there is some chance of seeing the price drop.
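Individual entries of the tables can be checked by hand from Equation 10 with γ = 1 and R(s, C) = 0. The last line relies on an assumption inferred from the table values rather than stated in the text: at the upper price bound, the "price rises" branch leaves the price at 300.

```latex
\begin{aligned}
Q((200, t{=}0), B) &= 500 - 200 = 300 \\
Q((200, t{=}0), C) &= 0.5\, V(300, t{=}1) + 0.5\, V(100, t{=}1)
                    = 0.5 \cdot 275 + 0.5 \cdot 400 = 337.5 \\
Q((300, t{=}2), C) &= 0.5\, V(300, t{=}3) + 0.5\, V(200, t{=}3)
                    = 0.5 \cdot 200 + 0.5 \cdot 300 = 250
\end{aligned}
```

Since 337.5 > 300, considering later is the better action at price 200 and time 0, in agreement with the policy table.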
Policy iteration is a two-phase dynamic programming method for computing optimal policies in an MDP. The first phase, policy evaluation, computes the state values for the current (fixed) policy via Equation 1. The second phase, policy improvement, improves upon the current policy (whenever possible) in a greedy fashion. Policy improvement updates based on Equations 4 and 8.

In practice, a single iteration of value iteration is cheaper than a single iteration of policy iteration; however, policy iteration typically takes far fewer iterations to converge. One modified version of policy iteration does not wait for the policy evaluation phase of policy iteration to converge, and instead produces approximations of Vπ. This modification leads to substantial speedups in the runtime of policy iteration.
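A minimal sketch of this modified scheme, assuming a three-price, four-step flight auction with valuation 500 and γ = 1: each round performs a single evaluation sweep instead of evaluating the policy to convergence. The starting policy, the zero initial values, the boundary clamp, and the tie-breaking rule (switch only on a strict improvement) are all assumptions made here.

```python
PRICES = (100, 200, 300)
T = 3                 # final time step
V_FLIGHT = 500        # the flight's valuation v
ACTIONS = ("B", "C")  # buy now, consider later
END = "end"
STATES = [(p, t) for t in range(T + 1) for p in PRICES] + [END]


def R(s, a):
    """R(p_t, B) = v - p_t; all other rewards are 0."""
    return float(V_FLIGHT - s[0]) if s != END and a == "B" else 0.0


def P(s, a):
    """Next-state distribution {s': P[s' | s, a]}."""
    if s == END or a == "B" or s[1] == T:
        return {END: 1.0}
    p, t = s
    up = min(p + 100, PRICES[-1])   # ASSUMPTION: price is clamped at the bounds
    down = max(p - 100, PRICES[0])
    return {(up, t + 1): 0.5, (down, t + 1): 0.5}


def backup(s, a, V, gamma):
    """One-step lookahead: R(s,a) + gamma * sum over s' of P[s'|s,a] V(s')."""
    return R(s, a) + gamma * sum(pr * V[s2] for s2, pr in P(s, a).items())


def modified_policy_iteration(gamma=1.0):
    V = {s: 0.0 for s in STATES}        # ASSUMPTION: values start at 0
    policy = {s: "B" for s in STATES}   # ASSUMPTION: start from "always buy"
    previous = None
    while policy != previous:
        previous = dict(policy)
        for s in STATES:                # one evaluation sweep, not run to convergence
            V[s] = backup(s, policy[s], V, gamma)
        for s in STATES:                # greedy policy improvement
            q = {a: backup(s, a, V, gamma) for a in ACTIONS}
            if q[policy[s]] < max(q.values()):  # switch only on a strict improvement
                policy[s] = max(q, key=q.get)
    return policy


policy = modified_policy_iteration()
for p in reversed(PRICES):
    print(p, [policy[(p, t)] for t in range(T + 1)])
```

Because the values are only approximate when the loop stops, V should not be read as Vπ at that point; only the policy is returned.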
Policy evaluation in Markov decision processes computes state values given some policy exactly as state values are evaluated in Markov reward processes:
$$V^\pi(s) \leftarrow R(s, \pi(s)) + \gamma \sum_{s'} P[s' \mid s, \pi(s)]\, V^\pi(s') \tag{12}$$
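As an illustration of Equation 12, the sketch below evaluates one fixed policy in a three-price, four-step flight auction (valuation 500, γ = 1): consider later at price 300 until the final time step, and buy everywhere else. The identifiers, the choice of policy, and the clamping of prices at the bounds are assumptions made for this example.

```python
PRICES = (100, 200, 300)
T = 3                 # final time step
V_FLIGHT = 500        # the flight's valuation v
END = "end"
STATES = [(p, t) for t in range(T + 1) for p in PRICES] + [END]


def R(s, a):
    """R(p_t, B) = v - p_t; all other rewards are 0."""
    return float(V_FLIGHT - s[0]) if s != END and a == "B" else 0.0


def P(s, a):
    """Next-state distribution {s': P[s' | s, a]}."""
    if s == END or a == "B" or s[1] == T:
        return {END: 1.0}
    p, t = s
    up = min(p + 100, PRICES[-1])   # ASSUMPTION: price is clamped at the bounds
    down = max(p - 100, PRICES[0])
    return {(up, t + 1): 0.5, (down, t + 1): 0.5}


def policy_evaluation(policy, gamma=1.0, eps=1e-9):
    """Repeat the update of Equation 12 until no state value changes by more than eps."""
    V = {s: 0.0 for s in STATES}
    delta = float("inf")
    while delta > eps:
        delta = 0.0
        for s in STATES:
            a = policy[s]
            new = R(s, a) + gamma * sum(pr * V[s2] for s2, pr in P(s, a).items())
            delta = max(delta, abs(new - V[s]))
            V[s] = new
    return V


# Buy everywhere, except consider later at price 300 while time remains.
policy = {s: "B" for s in STATES}
for t in range(T):
    policy[(300, t)] = "C"

V = policy_evaluation(policy)
print([V[(300, t)] for t in range(T + 1)])
```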
The convergence of policy iteration follows from the policy improvement theorem and the one-shot deviation principle. The former states that greedy improvements (i.e., improvements in immediate rewards) lead to improved policies. Conversely, the latter states: if there are no greedy improvements to a policy to be had, then the policy is optimal. Hence, by repeatedly improving a policy in a greedy fashion until no further improvements can be made, one arrives at the optimal policy. A proof of the one-shot deviation principle for Markov Decision Processes appears in Blackwell's paper entitled Discounted Dynamic Programming (Annals of Mathematical Statistics, 1965). Here is the formal statement of the policy improvement theorem.
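Theorem: Given policies π1 and π2, if Qπ1(s, π2(s)) ≥ Qπ1(s, π1(s)) for all states s ∈ S, then Vπ2(s) = Qπ2(s, π2(s)) ≥ Qπ1(s, π1(s)) = Vπ1(s) for all s ∈ S.

Proof: (Sketch)

$$
\begin{aligned}
V^{\pi_1}(s) &= Q^{\pi_1}(s, \pi_1(s)) \\
&\le Q^{\pi_1}(s, \pi_2(s)) \\
&= R(s, \pi_2(s)) + \gamma\, E[V^{\pi_1}(s')] \\
&= R(s, \pi_2(s)) + \gamma\, E[Q^{\pi_1}(s', \pi_1(s'))] \\
&\le R(s, \pi_2(s)) + \gamma\, E[Q^{\pi_1}(s', \pi_2(s'))] \\
&= R(s, \pi_2(s)) + \gamma\, E\big[R(s', \pi_2(s')) + \gamma\, E[V^{\pi_1}(s'')]\big] \\
&= R(s, \pi_2(s)) + \gamma\, E[R(s', \pi_2(s'))] + \gamma^2 E[V^{\pi_1}(s'')] \\
&\le \cdots \le V^{\pi_2}(s)
\end{aligned}
$$

The remaining steps repeat the same argument at s″ and at each later state: every application of the hypothesis contributes an inequality, and unrolling the recursion indefinitely yields Vπ2(s).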
The policy improvement steps in the policy iteration algorithm are as follows:
$$
\begin{aligned}
Q^\pi(s,a) &\leftarrow R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V^\pi(s') && (13) \\
\pi(s) &\in \arg\max_a Q^\pi(s,a) && (14)
\end{aligned}
$$
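To see one improvement step concretely, take the three-price auction with valuation 500 and γ = 1, and let π be the policy that always buys, so that Vπ(p, t) = 500 − p. Applying Equations 13 and 14 at time t = 2 gives the following. The figures assume that a price already at a bound stays there on the corresponding branch, which is inferred from the example tables rather than stated in the text.

```latex
\begin{aligned}
Q^\pi((300, t{=}2), B) &= 200, & Q^\pi((300, t{=}2), C) &= 0.5 \cdot 200 + 0.5 \cdot 300 = 250
  &&\Rightarrow\ \pi(300, t{=}2) = C \\
Q^\pi((200, t{=}2), B) &= 300, & Q^\pi((200, t{=}2), C) &= 0.5 \cdot 200 + 0.5 \cdot 400 = 300
  &&\Rightarrow\ \pi(200, t{=}2) \in \{B, C\} \\
Q^\pi((100, t{=}2), B) &= 400, & Q^\pi((100, t{=}2), C) &= 0.5 \cdot 300 + 0.5 \cdot 400 = 350
  &&\Rightarrow\ \pi(100, t{=}2) = B
\end{aligned}
```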
The following tables depict the iterative computation of policies, state, and action values in one TAC flight auction. The flight’s valuation is 500.
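Initialization

| π(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | B | B | B | B |
| 200 | B | B | B | B |
| 100 | B | B | B | B |

Iteration 0

| Vπ(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | 200 | 200 | 200 | 200 |
| 200 | 300 | 300 | 300 | 300 |
| 100 | 400 | 400 | 400 | 400 |

| Qπ(s,a) | t = 0, B | t = 0, C | t = 1, B | t = 1, C | t = 2, B | t = 2, C | t = 3, B | t = 3, C |
|---|---|---|---|---|---|---|---|---|
| 300 | 200 | 250 | 200 | 250 | 200 | 250 | 200 | 0 |
| 200 | 300 | 300 | 300 | 300 | 300 | 300 | 300 | 0 |
| 100 | 400 | 350 | 400 | 350 | 400 | 350 | 400 | 0 |

| π(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | C | C | C | B |
| 200 | C/B | C/B | C/B | B |
| 100 | B | B | B | B |

Iteration 1

| Vπ(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | 287.5 | 275 | 250 | 200 |
| 200 | 300 | 300 | 300 | 300 |
| 100 | 400 | 400 | 400 | 400 |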
policy iteration(MDP, γ, ε)
  Inputs      discount factor γ
              convergence test ε
  Output      optimal policy π∗
  Initialize  π ≠ π′
  while π ≠ π′ do
    1. π′ = π
    2. Vπ = policy evaluation(MDP, π, γ, ε)
    3. π = policy improvement(MDP, Vπ, γ)
  return π

policy evaluation(MDP, π, γ, ε)
  Inputs      policy π
              discount factor γ
              convergence test ε
  Output      state-value function Vπ
  Initialize  V = 0 and V′ = ∞
  while max_s |V(s) − V′(s)| > ε do
    1. V′ = V
    2. for all s ∈ S
       (a) V(s) = R(s,π(s)) + γ Σ_{s′} P[s′ | s,π(s)] V(s′)
  return V

policy improvement(MDP, V, γ)
  Inputs      value function V
              discount factor γ
  Output      improved policy π
  for all s ∈ S
    1. for all a ∈ A
       (a) Q(s,a) = R(s,a) + γ Σ_{s′} P[s′ | s,a] V(s′)
    2. π(s) ∈ argmax_a Q(s,a)
  return π

Table 2: Policy Iteration.
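The three routines of Table 2 can be sketched in Python on a three-price, four-step flight auction with valuation 500 and γ = 1. The identifiers are illustrative, and two choices are assumptions not found in Table 2: prices are clamped at the bounds of the price set, and ties in the argmax are resolved in favor of the current action so that the loop cannot oscillate between equally good policies.

```python
PRICES = (100, 200, 300)
T = 3                 # final time step
V_FLIGHT = 500        # the flight's valuation v
ACTIONS = ("B", "C")  # buy now, consider later
END = "end"
STATES = [(p, t) for t in range(T + 1) for p in PRICES] + [END]


def R(s, a):
    """R(p_t, B) = v - p_t; all other rewards are 0."""
    return float(V_FLIGHT - s[0]) if s != END and a == "B" else 0.0


def P(s, a):
    """Next-state distribution {s': P[s' | s, a]}."""
    if s == END or a == "B" or s[1] == T:
        return {END: 1.0}
    p, t = s
    up = min(p + 100, PRICES[-1])   # ASSUMPTION: price is clamped at the bounds
    down = max(p - 100, PRICES[0])
    return {(up, t + 1): 0.5, (down, t + 1): 0.5}


def backup(s, a, V, gamma):
    """One-step lookahead: R(s,a) + gamma * sum over s' of P[s'|s,a] V(s')."""
    return R(s, a) + gamma * sum(pr * V[s2] for s2, pr in P(s, a).items())


def policy_evaluation(policy, gamma, eps):
    V = {s: 0.0 for s in STATES}
    delta = float("inf")
    while delta > eps:
        delta = 0.0
        for s in STATES:
            new = backup(s, policy[s], V, gamma)
            delta = max(delta, abs(new - V[s]))
            V[s] = new
    return V


def policy_improvement(V, policy, gamma):
    improved = {}
    for s in STATES:
        q = {a: backup(s, a, V, gamma) for a in ACTIONS}
        # ASSUMPTION: keep the current action when it is already among the maximizers.
        improved[s] = policy[s] if q[policy[s]] == max(q.values()) else max(q, key=q.get)
    return improved


def policy_iteration(gamma=1.0, eps=1e-9):
    policy = {s: "B" for s in STATES}  # start from "always buy"
    while True:
        V = policy_evaluation(policy, gamma, eps)
        improved = policy_improvement(V, policy, gamma)
        if improved == policy:
            return policy, V
        policy = improved


policy, V = policy_iteration()
for p in reversed(PRICES):
    print(p, [policy[(p, t)] for t in range(T + 1)], [V[(p, t)] for t in range(T + 1)])
```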
modified policy iteration(MDP, γ)
  Inputs      discount factor γ
  Output      optimal policy π∗
  Initialize  π ≠ π′
  while π ≠ π′ do
    1. π′ = π
    2. for all s ∈ S   /* approximate policy evaluation */
       (a) V(s) = R(s,π(s)) + γ Σ_{s′} P[s′ | s,π(s)] V(s′)
    3. for all s ∈ S   /* policy improvement */
       (a) for all a ∈ A
           i. Q(s,a) = R(s,a) + γ Σ_{s′} P[s′ | s,a] V(s′)
       (b) π(s) ∈ argmax_a Q(s,a)
  return π
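Table 3: Modified Policy Iteration.

Iteration 1 (cont'd)

| Qπ(s,a) | t = 0, B | t = 0, C | t = 1, B | t = 1, C | t = 2, B | t = 2, C | t = 3, B | t = 3, C |
|---|---|---|---|---|---|---|---|---|
| 300 | 200 | 287.5 | 200 | 275 | 200 | 250 | 200 | 0 |
| 200 | 300 | 337.5 | 300 | 325 | 300 | 300 | 300 | 0 |
| 100 | 400 | 350 | 400 | 350 | 400 | 350 | 400 | 0 |

| π(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | C | C | C | B |
| 200 | C | C | C/B | B |
| 100 | B | B | B | B |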
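Iteration 2

| Vπ(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | 300 | 275 | 250 | 200 |
| 200 | 337.5 | 325 | 300 | 300 |
| 100 | 400 | 400 | 400 | 400 |

| Qπ(s,a) | t = 0, B | t = 0, C | t = 1, B | t = 1, C | t = 2, B | t = 2, C | t = 3, B | t = 3, C |
|---|---|---|---|---|---|---|---|---|
| 300 | 200 | 300 | 200 | 275 | 200 | 250 | 200 | 0 |
| 200 | 300 | 337.5 | 300 | 325 | 300 | 300 | 300 | 0 |
| 100 | 400 | 362.5 | 400 | 350 | 400 | 350 | 400 | 0 |

| π(s) | t = 0 | t = 1 | t = 2 | t = 3 |
|---|---|---|---|---|
| 300 | C | C | C | B |
| 200 | C | C | C/B | B |
| 100 | B | B | B | B |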
As the new policy does not differ from the old, policy iteration has converged. Moreover, the current values of Vπ represent the values of the optimal policy.