Lecture 20: Markov Decision Processes: Control

10:30 AM, Apr 9, 2009

Contents

1 Overview
2 Definitions and An Example
  2.1 Markov Decision Processes
  2.2 Example
3 State Values
4 Action Values
5 Value Iteration
  5.1 Example (cont'd)
6 Policy Iteration
  6.1 Policy Evaluation
  6.2 Policy Improvement
  6.3 Example (cont'd)

1 Overview

In this lecture, we extend our discussion of Markov reward processes to Markov decision processes (MDPs). For MDPs, we pose and solve the control problem: the search for an optimal policy. Specifically, we describe value iteration and policy iteration, two dynamic programming algorithms that are used to compute an optimal policy in an MDP.

2 Definitions and An Example

What follows generalizes the definition of Markov reward processes presented in Lecture 19.

2.1 Markov Decision Processes

An agent operating in a Markovian environment transitions from state to state, in general making decisions and obtaining rewards along the way, as follows: at time t,


  1. state is $s_t$
  2. choose action $a_t$
  3. receive reward $r_t$
  4. transition to state $s_{t+1}$ with probability $P[s_{t+1} \mid s_t, a_t]$

Markov decision processes (MDPs) model such agent-environment interactions. A (discrete-time) Markov decision process is a tuple $\langle S, A, R, P \rangle$, where time is discrete: i.e., $t \in T = \{0, 1, \ldots\}$, and

  • $S$ is a finite set of states ($s \in S$)
  • $A$ is a finite set of actions ($a \in A$)
  • $R : S \times A \to \mathbb{R}$ is a reward function
  • $P : S \times A \to \Delta(S)$ is a probability transition function (or matrix), where $\Delta(S)$ is the set of probability distributions over $S$
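The tuple above can be held in a small container. The following is only a sketch: the class name `MDP`, its field names, the `validate` helper, and the two-state `toy` instance are illustrative assumptions and do not come from the lecture.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable


@dataclass
class MDP:
    states: List[State]                                          # S
    actions: List[Action]                                        # A
    reward: Dict[Tuple[State, Action], float]                    # R(s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # P[s' | s, a]

    def validate(self, tol: float = 1e-9) -> bool:
        """Check that P[. | s, a] is a distribution over S for every pair (s, a)."""
        for s in self.states:
            for a in self.actions:
                dist = self.transition[(s, a)]
                if any(p < 0 or nxt not in self.states for nxt, p in dist.items()):
                    return False
                if abs(sum(dist.values()) - 1.0) > tol:
                    return False
        return True


# Hypothetical two-state MDP, used only to exercise the container.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    reward={("s0", "stay"): 0.0, ("s0", "go"): 1.0,
            ("s1", "stay"): 2.0, ("s1", "go"): 0.0},
    transition={("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
                ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}},
)
```

Storing each $P[\cdot \mid s, a]$ as a sparse dictionary keeps the later expectation sums short when most transitions have probability zero.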

2.2 Example

Each TAC Travel flight “auction” (viewed in isolation) is an example of an MDP. Let us simplify one such auction and model it as an MDP.

The state is defined in terms of the price of the flight and the time remaining until the end of the auction. Specifically, the state space is the cross product of the set of possible prices, say $P = \{150, 160, \ldots, 590, 600\}$, and the time, which we assume varies discretely from $t = 0$ through time $T = 30$, unioned with a designated state end. Let $p_t$ denote the price at time $t$.

The set of possible actions $A$ includes buy now ($B$) and (re)consider later ($C$). Rewards depend on the flight’s valuation. Assuming $v$ represents this valuation, $R(p_t, B) = v - p_t$ and $R(p_t, C) = 0$, for all $p_t$; in addition, $R(\text{end}, a) = 0$, for all actions $a \in A$. Finally, transition probabilities depend on states and actions: for all prices $p \in P$, actions $a \in A$, and times $t \in \{0, \ldots, T\}$,

$$\begin{aligned}
P[\text{end} \mid p_t = p, a_t = B] &= 1.0 \\
P[p_{t+1} = p + 10 \mid p_t = p, a_t = C] &= 0.5 \\
P[p_{t+1} = p - 10 \mid p_t = p, a_t = C] &= 0.5 \\
P[\text{end} \mid \text{end}, a_t = a] &= 1.0 \\
P[\text{end} \mid p_T = p, a_T = a] &= 1.0
\end{aligned}$$

At state pt, what is the optimal action?

3 State Values

Figure 1: TAC Travel Flight Auctions as an MDP: States are indicated by circles. Fat arrows indicate actions; they are labeled with rewards. Skinny arrows indicate transitions; they are labeled with probabilities.

A policy is a map from states to actions: i.e., $\pi : S \to A$. The state value $V^{\pi}(s)$ associated with state $s$ under policy $\pi$ is defined as the expected reward that is accrued from state $s$ on by following policy $\pi$. By Bellman’s theorem (for state values), this state value can be equivalently expressed as the sum of the immediate reward obtained by taking action $\pi(s)$ in state $s$ and the discounted expected value of the next state $s'$, assuming the policy $\pi$ is followed from state $s'$ on:

$$V^{\pi}(s) = R(s, \pi(s)) + \gamma\, E[V^{\pi}(s')] \qquad (1)$$

Policy $\pi$ dominates policy $\hat{\pi}$ (notation $\pi \succeq \hat{\pi}$) iff $V^{\pi}(s) \ge V^{\hat{\pi}}(s)$ for all states $s \in S$. We seek an optimal policy: i.e., $\pi^{*}$ s.t. $\pi^{*} \succeq \pi$, for all policies $\pi$. It suffices to restrict our attention to deterministic, stationary policies $\pi$, in which the same pure (i.e., non-randomized) action is taken every time state $s$ is visited. (Why?)
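For a fixed policy, Equation 1 is linear in the unknown state values, so it can be written in matrix form. The vector and matrix symbols $R^{\pi}$ and $P^{\pi}$ below are notation assumed for this sketch (they do not appear in the lecture), and the closed form relies on $0 \le \gamma < 1$ so that $I - \gamma P^{\pi}$ is invertible.

```latex
% Assumed notation: V^\pi and R^\pi are |S|-vectors, P^\pi is an |S| x |S| stochastic matrix.
\begin{align*}
R^{\pi}(s) &= R(s, \pi(s)), \qquad P^{\pi}(s, s') = P[s' \mid s, \pi(s)] \\
V^{\pi} &= R^{\pi} + \gamma P^{\pi} V^{\pi} \\
(I - \gamma P^{\pi})\, V^{\pi} &= R^{\pi} \\
V^{\pi} &= (I - \gamma P^{\pi})^{-1} R^{\pi} \qquad (0 \le \gamma < 1)
\end{align*}
```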

An optimal policy can be computed by solving Bellman’s optimality equations:

$$V^{*}(s) = \max_{a} \left\{ R(s,a) + \gamma\, E[V^{*}(s')] \right\} \qquad (2)$$

These equations state that a state’s value is that which can be obtained by choosing the action that maximizes the sum of the immediate reward at the current state and the discounted expected value of the next state. As in the case of Markov reward processes, to find a solution to this system of $|S|$ equations (with $|S|$ unknowns), we rely on Banach’s fixed point theorem. The optimal value function $V^{*}$ is the unique solution to this system of equations.


Exercise: Show that the mapping implicit in Equation 2 is a contraction on $(\mathbb{R}^{S}, L_{\infty})$.
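Before attempting the proof, the claim can be checked numerically. The sketch below is a sanity check, not a proof: it builds a random MDP (the sizes, the seed, and all helper names are assumptions made for illustration) and tests that one application of the mapping in Equation 2 shrinks the sup-norm distance between two value functions by at least the factor $\gamma$.

```python
import random


def bellman_backup(V, R, P, gamma):
    """One application of Equation 2: (TV)(s) = max_a {R[s][a] + gamma * sum_s' P[s][a][s'] * V[s']}."""
    n = len(V)
    return [
        max(R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(R[s])))
        for s in range(n)
    ]


def sup_norm(U, V):
    """L-infinity distance between two value functions."""
    return max(abs(u - v) for u, v in zip(U, V))


def random_mdp(n_states, n_actions, rng):
    """Random rewards in [-1, 1] and random transition rows normalized to sum to 1."""
    R = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]
    P = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
        P.append(rows)
    return R, P


rng = random.Random(0)
gamma = 0.9
R, P = random_mdp(4, 3, rng)

contraction_held = True
for _ in range(1000):
    U = [rng.uniform(-10, 10) for _ in range(4)]
    V = [rng.uniform(-10, 10) for _ in range(4)]
    lhs = sup_norm(bellman_backup(U, R, P, gamma), bellman_backup(V, R, P, gamma))
    rhs = gamma * sup_norm(U, V)
    if lhs > rhs + 1e-9:  # small slack for floating-point rounding
        contraction_held = False
```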

Given $V^{*}$, the optimal policy $\pi^{*}$ maps state $s$ into an optimal action, as follows:

$$\pi^{*}(s) \in \arg\max_{a} \left\{ R(s,a) + \gamma\, E[V^{*}(s')] \right\} \qquad (3)$$

While the optimal value function $V^{*}$ is unique, the optimal policy $\pi^{*}$ need not be unique.

4 Action Values

The action value $Q^{\pi}(s,a)$ associated with state $s$ and action $a$ under policy $\pi$ is defined as the expected reward that is accrued from state $s$ on by following policy $\pi$, except at state $s$, where it is assumed that action $a$ is taken instead of action $\pi(s)$. By Bellman’s theorem (for action values), this value can be equivalently expressed as the sum of the immediate reward obtained by taking action $a$ in state $s$ and the discounted expected value of the next state $s'$, assuming the policy $\pi$ is followed from state $s'$ on:

$$\begin{aligned}
Q^{\pi}(s,a) &= R(s,a) + \gamma\, E[V^{\pi}(s')] && (4) \\
&= R(s,a) + \gamma\, E[Q^{\pi}(s', \pi(s'))] && (5)
\end{aligned}$$

Restating Bellman’s optimality equations in terms of action values yields:

$$\begin{aligned}
Q^{*}(s,a) &= R(s,a) + \gamma\, E[V^{*}(s')] && (6) \\
&= R(s,a) + \gamma\, E[\max_{a'} Q^{*}(s', a')] && (7)
\end{aligned}$$

As in the case of Markov reward processes, to find a solution to this system of $|S \times A|$ equations (with $|S \times A|$ unknowns), we rely on Banach’s fixed point theorem. The optimal action-value function $Q^{*}$ is the unique solution to this system of equations.

Exercise: Show that the mapping implicit in Equation 7 is a contraction on $(\mathbb{R}^{S \times A}, L_{\infty})$.

Given $Q^{*}$, the optimal policy $\pi^{*}$ maps state $s$ into an optimal action, as follows:

$$\pi^{*}(s) \in \arg\max_{a} Q^{*}(s,a) \qquad (8)$$

While the optimal action-value function $Q^{*}$ is unique, the optimal policy $\pi^{*}$ need not be unique.
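Non-uniqueness shows up as ties in the argmax of Equation 8. The sketch below returns every maximizing action per state; the function name `greedy_actions` and the tolerance `tol` are assumptions for illustration, and the numbers are the $t = 2$ action values from the flight example of Section 5.1.

```python
def greedy_actions(Q, states, actions, tol=1e-9):
    """For each state, list every action whose value is within tol of max_a Q[(s, a)]."""
    policy = {}
    for s in states:
        best = max(Q[(s, a)] for a in actions)
        policy[s] = [a for a in actions if Q[(s, a)] >= best - tol]
    return policy


# Action values at t = 2 in the flight example (B = buy now, C = consider later).
states = [300, 200, 100]
actions = ["B", "C"]
Q = {(300, "B"): 200, (300, "C"): 250,
     (200, "B"): 300, (200, "C"): 300,
     (100, "B"): 400, (100, "C"): 350}

pi = greedy_actions(Q, states, actions)
# At price 200 both actions attain the maximum, so either choice is optimal there.
```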

5 Value Iteration

The value iteration algorithm, which is based on Equation 2, updates as follows:

$$V(s) \leftarrow \max_{a} \left\{ R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s') \right\} \qquad (9)$$

Equivalently,

$$\begin{aligned}
Q(s,a) &\leftarrow R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s') && (10) \\
V(s) &\leftarrow \max_{a} Q(s,a) && (11)
\end{aligned}$$

The algorithm, which is depicted in Table 1, first computes the value of each state for all actions, and then sets each state’s value to be the greatest value achieved among all courses of action. The actions that yield the optimal state values can be extracted as the optimal policy.


value iteration(MDP, $\gamma$, $\epsilon$)
  Inputs   discount factor $\gamma$
           convergence test $\epsilon$
  Output   optimal state-value function $V^{*}$

  Initialize $V = 0$ and $V' = \infty$
  while $\max_s |V(s) - V'(s)| > \epsilon$ do
    1. $V' = V$
    2. for all $s \in S$
       (a) for all $a \in A$
           i. $Q(s,a) = R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s')$
       (b) $V(s) = \max_a Q(s,a)$
  return $V$

Table 1: Value Iteration à la Gauss-Seidel.
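Table 1 might be rendered in Python as follows. This is a sketch under assumptions: the dictionary layout (`R[(s, a)]` for rewards, `P[(s, a)]` mapping next states to probabilities) and the two-state test MDP are hypothetical, and the largest per-state change within a sweep is tracked in place of keeping a full copy $V'$, which yields the same stopping test.

```python
def value_iteration(states, actions, R, P, gamma, eps):
    """Gauss-Seidel value iteration: V is updated in place, one state at a time."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0  # max_s |V(s) - V'(s)| for the current sweep
        for s in states:
            q = {a: R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items())
                 for a in actions}
            new_v = max(q.values())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v  # states visited later in this sweep already see this value
        if delta <= eps:
            return V


# Hypothetical deterministic two-state MDP (not from the lecture).
states = ["s0", "s1"]
actions = ["stay", "go"]
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}

V = value_iteration(states, actions, R, P, gamma=0.5, eps=1e-9)
# Staying in s1 forever is worth 2 / (1 - 0.5) = 4, so going from s0 is worth 1 + 0.5 * 4 = 3.
```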

5.1 Example(cont’d)

The following tables depict the computation of state and action values and the optimal policy in a TAC flight auction, assuming 3 prices (100, 200, and 300 dollars) and 4 time steps, with valuation $v = 500$ and $\gamma = 1$.

Action values $Q(s,a)$:

            t = 0           t = 1           t = 2           t = 3
price     B       C       B       C       B       C       B       C
300     200     300     200     275     200     250     200       0
200     300     337.5   300     325     300     300     300       0
100     400     362.5   400     350     400     350     400       0

State values $V(s)$:

price   t = 0   t = 1   t = 2   t = 3
300     300     275     250     200
200     337.5   325     300     300
100     400     400     400     400

Optimal policy $\pi(s)$:

price   t = 0   t = 1   t = 2   t = 3
300     C       C       C       B
200     C       C       C/B     B
100     B       B       B       B

The optimal policy is fairly intuitive: It prescribes that an agent should buy if all time has elapsed, regardless of price. Similarly, an agent should buy if ever the price hits the lower bound. Otherwise, if time remains and the price is not rock bottom, it is optimal to consider buying later, since there is some chance of seeing the price drop.
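The tabulated values can be reproduced with a backward pass over time that applies updates (10) and (11) once per state. The sketch rests on two assumptions that the lecture does not state but that are consistent with the tables: prices move in steps of 100, and at either end of the price range the price stays put with probability 0.5. All identifiers are illustrative.

```python
PRICES = [100, 200, 300]
T = 3             # last time step
VALUATION = 500   # v
GAMMA = 1.0


def next_prices(p):
    """Assumed dynamics: +/-100 with probability 0.5 each, clipped to the price range."""
    up = min(p + 100, PRICES[-1])
    down = max(p - 100, PRICES[0])
    return [(up, 0.5), (down, 0.5)]


Q = {}  # Q[(price, t, action)]
V = {}  # V[(price, t)]
for t in range(T, -1, -1):
    for p in PRICES:
        Q[(p, t, "B")] = VALUATION - p   # buying moves to `end`, whose value is 0
        if t == T:
            Q[(p, t, "C")] = 0.0         # the auction ends at time T whatever the action
        else:
            Q[(p, t, "C")] = GAMMA * sum(pr * V[(q, t + 1)] for q, pr in next_prices(p))
        V[(p, t)] = max(Q[(p, t, "B")], Q[(p, t, "C")])
```

Because the state includes the time index, no state is ever revisited, so a single backward pass suffices here; in a general MDP the sweep has to be repeated until convergence, as in Table 1.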


6 Policy Iteration

Policy iteration is a two-phase dynamic programming method for computing optimal policies in an MDP. The first phase, policy evaluation, computes the state values for the current (fixed) policy via Equation 1. The second phase, policy improvement, improves upon the current policy (whenever possible) in a greedy fashion. Policy improvement updates based on Equations 4 and 8.

In practice, value iteration is faster than policy iteration per iteration; however, policy iteration takes far fewer iterations to converge. One modified version of policy iteration does not wait for the policy evaluation phase of policy iteration to converge, and instead produces approximations of $V^{\pi}$. This modification leads to substantial speedups in the runtime of policy iteration.
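The modified scheme, which Table 3 gives in pseudocode, might look like this in Python. It is a sketch under assumptions: the `sweeps` and `max_iters` parameters, the zero initial value function, the choice of the first action as initial policy, and the two-state test MDP are all hypothetical. Note that the returned `V` only approximates $V^{\pi}$, and that a policy which is stable under an approximate value function is not thereby proven optimal.

```python
def modified_policy_iteration(states, actions, R, P, gamma, sweeps=1, max_iters=10_000):
    """Policy iteration with truncated evaluation: `sweeps` backups per iteration."""

    def backup(s, a, V):
        return R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items())

    pi = {s: actions[0] for s in states}   # arbitrary initial policy (assumption)
    V = {s: 0.0 for s in states}           # initial values (assumption)
    for _ in range(max_iters):
        for _ in range(sweeps):            # approximate policy evaluation
            for s in states:
                V[s] = backup(s, pi[s], V)
        # greedy policy improvement with respect to the approximate values
        new_pi = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
        if new_pi == pi:
            return pi, V
        pi = new_pi
    return pi, V


# Hypothetical deterministic two-state MDP (not from the lecture).
states = ["s0", "s1"]
actions = ["stay", "go"]
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}

pi, V = modified_policy_iteration(states, actions, R, P, gamma=0.5)
```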

6.1 Policy Evaluation

Policy evaluation in Markov decision processes computes state values given some policy exactly as state values are evaluated in Markov reward processes:

$$V^{\pi}(s) \leftarrow R(s, \pi(s)) + \gamma \sum_{s'} P[s' \mid s, \pi(s)]\, V^{\pi}(s') \qquad (12)$$

6.2 Policy Improvement

The convergence of policy iteration follows from the policy improvement theorem and the one-shot deviation principle. The former states that greedy improvements (i.e., improvements in immediate rewards) lead to improved policies. Conversely, the latter states: if there are no greedy improvements to a policy to be had, then the policy is optimal. Hence, by repeatedly improving a policy in a greedy fashion until no further improvements can be made, one arrives at an optimal policy. A proof of the one-shot deviation principle for Markov decision processes appears in Blackwell’s paper entitled Discounted Dynamic Programming (Annals of Mathematical Statistics, 1965). Here is the formal statement of the policy improvement theorem.

Theorem: Given policies $\pi_1$ and $\pi_2$, if $Q^{\pi_1}(s, \pi_2(s)) \ge Q^{\pi_1}(s, \pi_1(s))$ for all states $s \in S$, then $V^{\pi_2}(s) = Q^{\pi_2}(s, \pi_2(s)) \ge Q^{\pi_1}(s, \pi_1(s)) = V^{\pi_1}(s)$ for all $s \in S$.

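Proof: (Sketch) Apply the hypothesis at $s$, expand by Equations 4 and 5, apply the hypothesis again at the next state $s'$, and repeat:

$$\begin{aligned}
V^{\pi_1}(s) &= Q^{\pi_1}(s, \pi_1(s)) \\
&\le Q^{\pi_1}(s, \pi_2(s)) \\
&= R(s, \pi_2(s)) + \gamma\, E[V^{\pi_1}(s')] \\
&= R(s, \pi_2(s)) + \gamma\, E[Q^{\pi_1}(s', \pi_1(s'))] \\
&\le R(s, \pi_2(s)) + \gamma\, E[Q^{\pi_1}(s', \pi_2(s'))] \\
&= R(s, \pi_2(s)) + \gamma\, E[R(s', \pi_2(s')) + \gamma\, E[V^{\pi_1}(s'')]] \\
&= R(s, \pi_2(s)) + \gamma\, E[R(s', \pi_2(s'))] + \gamma^{2} E[V^{\pi_1}(s'')] \\
&\le \cdots \\
&\le V^{\pi_2}(s)
\end{aligned}$$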

The policy improvement steps in the policy iteration algorithm are as follows:

$$\begin{aligned}
Q^{\pi}(s,a) &\leftarrow R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V^{\pi}(s') && (13) \\
\pi(s) &\in \arg\max_{a} Q^{\pi}(s,a) && (14)
\end{aligned}$$

6.3 Example(cont’d)

The following tables depict the iterative computation of policies, state, and action values in one TAC flight auction. The flight’s valuation is 500.

Initialization

Policy $\pi$:

price   t = 0   t = 1   t = 2   t = 3
300     B       B       B       B
200     B       B       B       B
100     B       B       B       B

Iteration 0

State values $V^{\pi}$:

price   t = 0   t = 1   t = 2   t = 3
300     200     200     200     200
200     300     300     300     300
100     400     400     400     400

Action values $Q^{\pi}(s,a)$:

            t = 0           t = 1           t = 2           t = 3
price     B       C       B       C       B       C       B       C
300     200     250     200     250     200     250     200       0
200     300     300     300     300     300     300     300       0
100     400     350     400     350     400     350     400       0

Improved policy $\pi$:

price   t = 0   t = 1   t = 2   t = 3
300     C       C       C       B
200     C/B     C/B     C/B     B
100     B       B       B       B

Iteration 1

State values $V^{\pi}$:

price   t = 0   t = 1   t = 2   t = 3
300     287.5   275     250     200
200     300     300     300     300
100     400     400     400     400


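policy iteration(MDP, $\gamma$, $\epsilon$)
  Inputs   discount factor $\gamma$
           convergence test $\epsilon$
  Output   optimal policy $\pi^{*}$

  Initialize $\pi$ to an arbitrary policy, with $\pi' \ne \pi$
  while $\pi \ne \pi'$ do
    1. $\pi' = \pi$
    2. $V^{\pi}$ = policy evaluation(MDP, $\pi$, $\gamma$, $\epsilon$)
    3. $\pi$ = policy improvement(MDP, $V^{\pi}$, $\gamma$)
  return $\pi$

policy evaluation(MDP, $\pi$, $\gamma$, $\epsilon$)
  Inputs   policy $\pi$
           discount factor $\gamma$
           convergence test $\epsilon$
  Output   state-value function $V^{\pi}$

  Initialize $V = 0$ and $V' = \infty$
  while $\max_s |V(s) - V'(s)| > \epsilon$ do
    1. $V' = V$
    2. for all $s \in S$
       (a) $V(s) = R(s, \pi(s)) + \gamma \sum_{s'} P[s' \mid s, \pi(s)]\, V(s')$
  return $V$

policy improvement(MDP, $V$, $\gamma$)
  Inputs   value function $V$
           discount factor $\gamma$
  Output   improved policy $\pi$

  for all $s \in S$
    1. for all $a \in A$
       (a) $Q(s,a) = R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s')$
    2. $\pi(s) \in \arg\max_a Q(s,a)$
  return $\pi$

Table 2: Policy Iteration.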
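The three routines of Table 2 might be sketched in Python as follows. The dictionary layout and the two-state test MDP are hypothetical. One deliberate deviation from the table is flagged in the code: `policy_improvement` also receives the current policy and keeps its action whenever that action is still greedy, an assumed tie-breaking rule that stops the loop from alternating between equally good actions.

```python
def q_value(s, a, V, R, P, gamma):
    """R(s, a) + gamma * sum_s' P[s' | s, a] * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items())


def policy_evaluation(states, R, P, pi, gamma, eps):
    """Iterate Equation 12 in place until the largest change in a sweep is at most eps."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = q_value(s, pi[s], V, R, P, gamma)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta <= eps:
            return V


def policy_improvement(states, actions, R, P, V, gamma, pi, tol=1e-9):
    """Greedy step; the incumbent action wins ties (assumption, not in Table 2)."""
    new_pi = {}
    for s in states:
        q = {a: q_value(s, a, V, R, P, gamma) for a in actions}
        best = max(q.values())
        new_pi[s] = pi[s] if q[pi[s]] >= best - tol else max(q, key=q.get)
    return new_pi


def policy_iteration(states, actions, R, P, gamma, eps):
    pi = {s: actions[0] for s in states}   # arbitrary initial policy
    while True:
        V = policy_evaluation(states, R, P, pi, gamma, eps)
        new_pi = policy_improvement(states, actions, R, P, V, gamma, pi)
        if new_pi == pi:
            return pi, V
        pi = new_pi


# Hypothetical deterministic two-state MDP (not from the lecture).
states = ["s0", "s1"]
actions = ["stay", "go"]
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}

pi, V = policy_iteration(states, actions, R, P, gamma=0.5, eps=1e-10)
```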


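modified policy iteration(MDP, $\gamma$)
  Inputs   discount factor $\gamma$
  Output   optimal policy $\pi^{*}$

  Initialize $V = 0$ and $\pi$ to an arbitrary policy, with $\pi' \ne \pi$
  while $\pi \ne \pi'$ do
    1. $\pi' = \pi$
    2. for all $s \in S$   /* approximate policy evaluation */
       (a) $V(s) = R(s, \pi(s)) + \gamma \sum_{s'} P[s' \mid s, \pi(s)]\, V(s')$
    3. for all $s \in S$   /* policy improvement */
       (a) for all $a \in A$
           i. $Q(s,a) = R(s,a) + \gamma \sum_{s'} P[s' \mid s,a]\, V(s')$
       (b) $\pi(s) \in \arg\max_a Q(s,a)$
  return $\pi$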

Table 3: Modified Policy Iteration.

Iteration 1 (cont’d)

Action values $Q^{\pi}(s,a)$:

            t = 0           t = 1           t = 2           t = 3
price     B       C       B       C       B       C       B       C
300     200     287.5   200     275     200     250     200       0
200     300     337.5   300     325     300     300     300       0
100     400     350     400     350     400     350     400       0

Improved policy $\pi$:

price   t = 0   t = 1   t = 2   t = 3
300     C       C       C       B
200     C       C       C/B     B
100     B       B       B       B

Iteration 2

State values $V^{\pi}$:

price   t = 0   t = 1   t = 2   t = 3
300     300     275     250     200
200     337.5   325     300     300
100     400     400     400     400

Action values $Q^{\pi}(s,a)$:

            t = 0           t = 1           t = 2           t = 3
price     B       C       B       C       B       C       B       C
300     200     300     200     275     200     250     200       0
200     300     337.5   300     325     300     300     300       0
100     400     362.5   400     350     400     350     400       0

Improved policy $\pi$:

price   t = 0   t = 1   t = 2   t = 3
300     C       C       C       B
200     C       C       C/B     B
100     B       B       B       B

As the new policy does not differ from the old, policy iteration has converged. Moreover, the current values of Vπ represent the values of the optimal policy.
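The run above can be replayed in code. The sketch below makes the same assumptions as the tables appear to make (price steps of 100, prices clipped at the ends of the range, $\gamma = 1$) and adds one more: on ties the current action is kept, which resolves each "C/B" entry to B. All identifiers are illustrative.

```python
PRICES, T, V_FLIGHT = [100, 200, 300], 3, 500
ACTIONS = ["B", "C"]
END = "end"
STATES = [(p, t) for t in range(T + 1) for p in PRICES] + [END]


def step(p):
    """Assumed dynamics: +/-100 with probability 0.5 each, clipped to [100, 300]."""
    return {min(p + 100, 300): 0.5, max(p - 100, 100): 0.5}


R, P = {}, {}
for s in STATES:
    for a in ACTIONS:
        if s == END:
            R[(s, a)], P[(s, a)] = 0.0, {END: 1.0}
            continue
        p, t = s
        R[(s, a)] = float(V_FLIGHT - p) if a == "B" else 0.0
        if a == "B" or t == T:
            P[(s, a)] = {END: 1.0}
        else:
            P[(s, a)] = {(q, t + 1): pr for q, pr in step(p).items()}


def q_value(s, a, V):
    return R[(s, a)] + sum(pr * V[n] for n, pr in P[(s, a)].items())  # gamma = 1


def evaluate(pi):
    """Sweep Equation 12 until nothing changes; exact here since time only moves forward."""
    V = {s: 0.0 for s in STATES}
    changed = True
    while changed:
        changed = False
        for s in STATES:
            new_v = q_value(s, pi[s], V)
            if new_v != V[s]:
                V[s], changed = new_v, True
    return V


pi = {s: "B" for s in STATES}   # the initialization used in the example
improvements = 0
while True:
    V = evaluate(pi)
    new_pi = {}
    for s in STATES:
        best = max(q_value(s, a, V) for a in ACTIONS)
        # keep the incumbent action on ties (assumption)
        new_pi[s] = pi[s] if q_value(s, pi[s], V) >= best else max(
            ACTIONS, key=lambda a: q_value(s, a, V))
    if new_pi == pi:
        break
    pi, improvements = new_pi, improvements + 1
```

Under these assumptions the policy changes after Iterations 0 and 1 and is left unchanged by Iteration 2, matching the tables.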