Lecture 19: Markov Decision Processes: Prediction

10:30 AM, Apr 7, 2009

Contents

1 Overview
2 Definitions and An Example
  2.1 Markov Reward Processes
3 State Values
  3.1 Return
  3.2 Bellman's Theorem
  3.3 Bellman's Equations
4 Policy Evaluation
  4.1 Example: Gambler's Ruin

1 Overview

Our aim in this series of lectures is to extend our suite of heuristic search and optimization algorithms to the case in which transitions to successor states are non-deterministic. We introduce Markov reward processes (MRPs) and Markov decision processes (MDPs) as modeling tools in the study of non-deterministic state-space search problems. These models provide frameworks for computing optimal behavior in uncertain worlds. For example, we might be interested in planning an optimal route to work, or maximizing the returns on an investment in the stock market. Solutions may involve linear programming or dynamic programming methods (e.g., value iteration and policy iteration) in the case where the non-deterministic nature of the process is known with certainty and the state and action spaces are sufficiently small; otherwise, they may be solved using Monte Carlo simulations or reinforcement learning (e.g., TD-learning, Q-learning, and SARSA).

Our discussion of Markov processes is divided into two parts: the first is concerned with computing state values $V$ in Markov reward (or decision) processes, and the second with computing action values $Q$ in Markov decision processes. This division coincides with two related problems, namely:

  1. the (passive) prediction problem, or policy evaluation: compute the state-value function $V^\pi$, given policy $\pi$
  2. the (active) control problem: find an optimal policy $\pi$, by computing the optimal action-value function $Q$

This series of lectures (Lectures 19-22) is primarily based on Chapters 4, 5, and 6 of Sutton and Barto's book entitled Reinforcement Learning.

2 Definitions and An Example

A (discrete-time) stochastic process is a sequence of random variables $\{X_t\}_{t=0}^{\infty}$. A stochastic process $\{X_t\}_{t=0}^{\infty}$ induces a probability transition function of the form $P[X_{t+1} = s_{t+1} \mid X_t = s_t, \ldots, X_0 = s_0]$: i.e., the probability that the state at future time $t+1$ is $s_{t+1}$, given that the states at past times $t, \ldots, 0$ were $s_t, \ldots, s_0$, respectively.

A Markov process (or chain) is a stochastic process that satisfies the following conditional independence conditions: for all $t$, for all $s_0, \ldots, s_t, s_{t+1}$,

$$P[X_{t+1} = s_{t+1} \mid X_t = s_t, \ldots, X_0 = s_0] = P[X_{t+1} = s_{t+1} \mid X_t = s_t] \tag{1}$$

Equation 1 is the Markov property, sometimes called the memoryless property; it implies that the probability of transitioning to a future state $s_{t+1}$ depends on the present state $s_t$, but is otherwise independent of the remote past, $s_{t-1}, \ldots, s_0$.

2.1 Markov Reward Processes

An agent operating in a Markovian environment transitions from state to state, in general obtaining rewards along the way, as follows: at time $t$,

  1. state is $s_t$
  2. receive reward $r_t$
  3. transition to state $s_{t+1}$ with probability $P[s_{t+1} \mid s_t]$

We model this agent's interactions as a (discrete-time) Markov reward process, a tuple $\langle S, R, P \rangle$, where time is discrete: i.e., $t \in T = \{0, 1, \ldots\}$, and

  • $S$ is a finite set of states
  • $R : S \to \mathbb{R}$ is a reward function
  • $P : S \to \Delta(S)$ is a probability transition function (or matrix), where $\Delta(S)$ is the set of probability distributions over $S$

The form of the probability transition function encodes the fact that it satisfies the Markov property.

Remark: Markov reward processes can have stochastic rewards as well as stochastic transitions. But our framework is sufficiently general, since such processes can be reduced to Markov reward processes with deterministic rewards simply by letting the deterministic rewards equal the expected values of the corresponding stochastic rewards.

Example: Gambler's Ruin is an example of a Markov reward process. A gambler gambles until he either wins a set amount of money, say $N$ dollars, or loses all his money. At state $s_t$, his wealth increases by 1 dollar with probability 1/3, and it decreases by 1 dollar with probability 2/3.

The set of states is defined by the worth of the gambler: $S = \{0, \ldots, N, \text{end}\}$. The rewards are defined as $R(\text{end}) = R(i) = 0$, for $i = 0, \ldots, N-1$, but $R(N) = 1$. The transition probabilities are such that $P[i+1 \mid i] = 1/3$ and $P[i-1 \mid i] = 2/3$, for $i = 1, \ldots, N-1$; $P[\text{end} \mid i] = 1$, for $i = 0, N$; and $P[\text{end} \mid \text{end}] = 1$.

Given initial state $i \in \{1, \ldots, N-1\}$, what is the probability that the gambler wins: i.e., reaches state $N$?

Figure 1: Gambler's Ruin: $N = 4$. An absorbing state $s \in S$ is s.t. $P[s \mid s] = 1$. The end state is an absorbing state in Gambler's Ruin.
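
To make the tuple of states, rewards, and transition probabilities concrete, the Gambler's Ruin process can be written down as plain data. The following is a minimal Python sketch, not part of the lecture: the function name `gamblers_ruin`, the nested-dictionary encoding `P[s][s2]`, and the use of exact fractions are all illustrative assumptions.

```python
from fractions import Fraction

def gamblers_ruin(n, p_win=Fraction(1, 3)):
    """Build (states, R, P) for Gambler's Ruin with target wealth n."""
    states = list(range(n + 1)) + ["end"]
    # The reward is 1 only at the winning state n, and 0 everywhere else.
    R = {s: Fraction(0) for s in states}
    R[n] = Fraction(1)
    # P[s][s2] is the probability of moving from state s to state s2.
    P = {s: {} for s in states}
    for i in range(1, n):
        P[i][i + 1] = p_win        # win one dollar
        P[i][i - 1] = 1 - p_win    # lose one dollar
    P[0]["end"] = Fraction(1)      # ruined: move to the absorbing end state
    P[n]["end"] = Fraction(1)      # won: move to the absorbing end state
    P["end"]["end"] = Fraction(1)  # end is absorbing
    return states, R, P

states, R, P = gamblers_ruin(4)
# Every row of P should be a probability distribution over the states.
print(all(sum(P[s].values()) == 1 for s in states))  # True
```

Storing only the nonzero entries of each row keeps the encoding close to the way the transition probabilities are listed in the text.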

3 State Values

The value $V(s_t)$ of state $s_t$ is defined as the expected reward that is accrued from time $t$ on; that is, the expected value of $\rho_t^{\tau}$, where $\rho_t^{\tau}$ is the reward (or return) that is accrued along trajectory $\tau = (s_t, s_{t+1}, s_{t+2}, \ldots)$:

$$V(s_t) = \sum_{\tau} P[\tau \mid s_t] \, \rho_t^{\tau} \tag{2}$$

3.1 Return

Given trajectory $\tau = (s_t, s_{t+1}, s_{t+2}, \ldots)$, the return $\rho_t^{\tau}$ at time $t$ is a function of the current reward $r_t$ and the stream of future rewards $r_{t+1}, r_{t+2}, \ldots$. In the case of a finite horizon, say of length $T < \infty$, return can be computed simply as the sum of current and future rewards: i.e.,

$$\rho_t^{\tau} = r_t + r_{t+1} + r_{t+2} + \ldots + r_T = \sum_{i=0}^{T-t} r_{t+i} \tag{3}$$

In the case of infinite horizons, however, the sum of future rewards is potentially infinite. If all trajectories are proper (a trajectory is called a proper trajectory iff it transitions to a zero-reward, absorbing state with positive probability), return can be computed simply as the sum of current and future rewards, as in Equation 3. Otherwise, return is computed as the sum of the current reward and the discounted sum of future rewards: i.e., assuming discount factor $0 \le \gamma < 1$,

$$\rho_t^{\tau} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i} \tag{4}$$

If rewards are assumed to be bounded, return as defined by Equation 4 is finite. (Why?) But even in the case of finite horizons or proper trajectories, $\rho_t^{\tau}$ is often computed with discounting, because of the following economic "law": a dollar today is worth more than a dollar tomorrow.
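
The return along a finite stretch of a trajectory can be computed directly from its reward sequence. The sketch below is illustrative only: the function name `discounted_return` is an assumption, and the reward list stands for $r_t, r_{t+1}, \ldots$ truncated at some finite point. It accumulates backwards using the recursion $\rho_t = r_t + \gamma \rho_{t+1}$; setting `gamma` to 1 recovers the undiscounted sum of Equation 3.

```python
def discounted_return(rewards, gamma):
    """Return sum over i of gamma**i * rewards[i] for a finite reward list."""
    total = 0.0
    # Work backwards through the rewards: rho_t = r_t + gamma * rho_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], 1.0))  # undiscounted sum: 3.0
```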

This law encapsulates the following equation, which states that the future value (FV) of money equals the present value (PV) scaled by the interest rate ($r$):

$$FV = PV(1 + r)$$

For example, $d$ dollars today would be worth $(1+r)d$ dollars 1 year from now. Similarly, $d$ dollars that are scheduled to be accrued 1 year from now are worth only $d/(1+r)$ dollars today.

The discount factor $\gamma$ is inversely related to the interest rate: $\gamma = 1/(1+r)$. Intuitively, $\gamma$ determines the relative worth of immediate vs. future rewards. As $\gamma \to 0$, immediate rewards are deemed more and more relevant; agents that attempt to maximize return in these circumstances are called myopic. As $\gamma \to 1$, future rewards are weighted more and more heavily; agents that aim to maximize return in these circumstances exhibit foresight.
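
The relationship between the interest rate and the discount factor can be checked with a few lines of arithmetic. This is a small sketch under stated assumptions: the helper names `discount_factor` and `present_value`, and the 25% interest rate, are chosen only for illustration.

```python
def discount_factor(interest_rate):
    """gamma = 1 / (1 + r)."""
    return 1.0 / (1.0 + interest_rate)

def present_value(future_value, interest_rate, years=1):
    """Discount a payment received `years` from now back to today."""
    return future_value * discount_factor(interest_rate) ** years

gamma = discount_factor(0.25)      # r = 25% gives gamma of about 0.8
pv = present_value(125.0, 0.25)    # 125 dollars next year is worth about 100 today
print(round(gamma, 6), round(pv, 6))
```

A higher interest rate gives a smaller $\gamma$, which corresponds to the more myopic end of the spectrum described above.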

3.2 Bellman’s Theorem

We can now state Bellman's seminal theorem for Markov reward processes (i.e., for state values).

Theorem: The state value $V(s_t)$ at state $s_t$, which is defined as the expected reward that is accrued from time $t$ on, can be equivalently expressed as the sum of the immediate reward $r_t$ and the discounted expected value of state $s_{t+1}$: i.e., for $0 \le \gamma < 1$,

$$V(s_t) = r_t + \gamma E[V(s_{t+1})] \tag{5}$$

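Proof: (Sketch) In what follows, $\tau' = (s_{t+1}, s_{t+2}, \ldots)$ and $\tau'' = (s_{t+2}, \ldots)$.

$$\begin{aligned}
V(s_t) &= \sum_{\tau} P[\tau \mid s_t] \, \rho_t^{\tau} \\
&= r_t + \gamma \sum_{\tau} P[\tau \mid s_t] \, \rho_{t+1}^{\tau} \\
&= r_t + \gamma \sum_{s_{t+1}} P[s_{t+1} \mid s_t] \Big( r_{t+1} + \gamma \sum_{\tau''} P[\tau'' \mid s_{t+1}, s_t] \, \rho_{t+2}^{\tau''} \Big) \\
&= r_t + \gamma \sum_{s_{t+1}} P[s_{t+1} \mid s_t] \Big( r_{t+1} + \gamma \sum_{\tau''} P[\tau'' \mid s_{t+1}] \, \rho_{t+2}^{\tau''} \Big) \\
&= r_t + \gamma \sum_{s_{t+1}} P[s_{t+1} \mid s_t] \sum_{\tau'} P[\tau' \mid s_{t+1}] \, \rho_{t+1}^{\tau'} \\
&= r_t + \gamma \sum_{s_{t+1} \in S} P[s_{t+1} \mid s_t] \, V(s_{t+1}) \\
&= r_t + \gamma E[V(s_{t+1})]
\end{aligned}$$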

The fourth line follows from the Markov property. The seventh line is simply an abbreviation.

3.3 Bellman’s Equations

Bellman's theorem gives rise to the following system of $|S|$ equations with $|S|$ unknowns, known as Bellman's equations: for all states $s \in S$,

$$V(s) = R(s) + \gamma \sum_{s'} P[s' \mid s] \, V(s') \tag{6}$$

To find a solution to this system of equations, we rely on Banach's fixed point theorem, also called the contraction mapping theorem. Given a metric space¹ $(X, d)$, a mapping $f : X \to X$ is called a contraction iff there exists some $0 \le k < 1$ s.t. $d(f(x), f(y)) \le k \, d(x, y)$, for all $x, y \in X$.

The $L_\infty$, or max, norm is defined as follows on $\mathbb{R}^n$: for all $x, y \in \mathbb{R}^n$, $\|x - y\|_\infty = \max_{1 \le i \le n} |x_i - y_i|$. In the special case where $n = 1$, the max norm reduces to absolute value.

Example: The function $f(x) = \frac{1}{2} x$ is a contraction mapping on $(\mathbb{R}, L_\infty)$, since $|f(x) - f(y)| = \left| \frac{1}{2} x - \frac{1}{2} y \right| = \frac{1}{2} |x - y|$.

Banach's Theorem: Given a complete² metric space $(X, d)$ as well as a contraction mapping $f : X \to X$, (i) there exists a unique $x^* \in X$ s.t. $f(x^*) = x^*$; and (ii) for arbitrary $x^0 \in X$, the sequence $\{x^n\}$ defined by $x^{n+1} = f(x^n) = f^{n+1}(x^0)$ converges to $x^*$.
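
Part (ii) of the theorem can be observed numerically on the halving map from the example above, whose unique fixed point is 0. The following is a minimal sketch; the helper name `iterate` and the starting point 8.0 are arbitrary choices made for illustration.

```python
def iterate(f, x0, n):
    """Return the list [x0, f(x0), ..., f^n(x0)]."""
    xs = [x0]
    for _ in range(n):
        xs.append(f(xs[-1]))
    return xs

def half(x):
    """The contraction f(x) = x / 2, with contraction constant k = 1/2."""
    return x / 2

xs = iterate(half, 8.0, 5)
print(xs)  # [8.0, 4.0, 2.0, 1.0, 0.5, 0.25]
```

Each application halves the distance to the fixed point, so the sequence approaches 0 geometrically regardless of the starting point.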

Define the mapping $f : \mathbb{R}^S \to \mathbb{R}^S$ as follows:

$$(f(x))(s) = R(s) + \gamma \sum_{s'} P[s' \mid s] \, x(s') \tag{7}$$

Theorem: The mapping $f$ in Equation 7 is a contraction on $(\mathbb{R}^S, L_\infty)$.

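Proof: For all $x, y \in \mathbb{R}^S$, and for arbitrary state $s \in S$,

$$\begin{aligned}
|(f(x))(s) - (f(y))(s)|
&= \Big| R(s) + \gamma \sum_{s'} P[s' \mid s] \, x(s') - \Big( R(s) + \gamma \sum_{s'} P[s' \mid s] \, y(s') \Big) \Big| \\
&= \gamma \Big| \sum_{s'} P[s' \mid s] \big( x(s') - y(s') \big) \Big| \\
&\le \gamma \sum_{s'} P[s' \mid s] \max_{s''} |x(s'') - y(s'')| \\
&= \gamma \sum_{s'} P[s' \mid s] \, \|x - y\|_\infty \\
&= \gamma \|x - y\|_\infty
\end{aligned}$$

It follows that $|(f(x))(s) - (f(y))(s)| \le \gamma \|x - y\|_\infty$, for all states $s$. Therefore, $\|f(x) - f(y)\|_\infty = \max_s |(f(x))(s) - (f(y))(s)| \le \gamma \|x - y\|_\infty$.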
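
The contraction inequality can be spot-checked numerically. The sketch below applies the mapping of Equation 7 to two arbitrary value vectors and compares max-norm distances before and after. The two-state process, the vectors `x` and `y`, and all function names are hypothetical, chosen only for illustration; a single check like this illustrates the bound but does not prove it.

```python
def bellman_operator(x, R, P, gamma):
    """Apply (f(x))(s) = R(s) + gamma * sum over s2 of P[s2 | s] * x(s2)."""
    return {s: R[s] + gamma * sum(p * x[s2] for s2, p in P[s].items())
            for s in R}

def max_norm_distance(x, y):
    """The max-norm distance between two value vectors."""
    return max(abs(x[s] - y[s]) for s in x)

# Hypothetical two-state Markov reward process, for illustration only.
R = {"a": 1.0, "b": 0.0}
P = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.25, "b": 0.75}}
gamma = 0.9

x = {"a": 3.0, "b": -1.0}
y = {"a": 0.0, "b": 2.0}

before = max_norm_distance(x, y)
after = max_norm_distance(bellman_operator(x, R, P, gamma),
                          bellman_operator(y, R, P, gamma))
print(after <= gamma * before)  # True
```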

Corollary: Bellman's system of equations (Equation 6) indeed has a fixed point solution, and the iterative application of $f$ converges to this solution.

¹ A metric space $(X, d)$ is a set $X$ together with a distance function $d : X \times X \to \mathbb{R}$ that satisfies: (i) $d(x, x) = 0$, for all $x \in X$; (ii) $d(x, y) = d(y, x)$, for all $x, y \in X$; and (iii) the triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$, for all $x, y, z \in X$.

² An example of a complete metric space is $\mathbb{R}$.

4 Policy Evaluation

Policy evaluation is a dynamic programming method that computes state values via iterative updates based on Bellman's equations:

$$V(s) \leftarrow R(s) + \gamma \sum_{s'} P[s' \mid s] \, V(s') \tag{8}$$

The Gauss-Seidel version of this algorithm incorporates in-place updating: i.e., updating with $V$, as shown in Table 2, rather than $V'$, as shown in Table 1.

policy evaluation(MRP, $\gamma$, $\epsilon$)
  Inputs   discount factor $\gamma$
           convergence test $\epsilon$
  Output   state-value function $V$

  Initialize $V = 0$ and $V' = \infty$
  while $\max_s |V(s) - V'(s)| > \epsilon$ do
    1. $V' = V$
    2. for all $s \in S$
       (a) $V(s) = R(s) + \gamma \sum_{s'} P[s' \mid s] \, V'(s')$
  return $V$

Table 1: Policy Evaluation.

gauss seidel(MRP, $\gamma$, $\epsilon$)
  Inputs   discount factor $\gamma$
           convergence test $\epsilon$
  Output   state-value function $V$

  Initialize $V = 0$ and $V' = \infty$
  while $\max_s |V(s) - V'(s)| > \epsilon$ do
    1. $V' = V$
    2. for all $s \in S$
       (a) $V(s) = R(s) + \gamma \sum_{s'} P[s' \mid s] \, V(s')$
  return $V$

Table 2: Gauss-Seidel.
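
The two tables differ only in which values the backup reads. The following Python sketch makes that difference explicit with a single `in_place` flag. It is one possible rendering under stated assumptions, not the lecture's own code: the function names, the dictionary encoding of $R$ and $P$, the explicit sweep `order`, and the use of exact fractions are all illustrative choices. It is run on Gambler's Ruin with $N = 4$ and $\gamma = 1$.

```python
from fractions import Fraction

def backup(s, V, R, P, gamma):
    """One Bellman backup: R(s) + gamma * sum over s2 of P[s2 | s] * V(s2)."""
    return R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())

def sweep(V, R, P, gamma, order, in_place):
    """Update every state once, in the given order, modifying V."""
    # Gauss-Seidel reads from V itself; the plain version reads a frozen copy V'.
    source = V if in_place else dict(V)
    for s in order:
        V[s] = backup(s, source, R, P, gamma)

def policy_evaluation(R, P, gamma, eps, order, in_place=False):
    """Sweep until no state value changes by more than eps."""
    V = {s: Fraction(0) for s in R}
    while True:
        previous = dict(V)
        sweep(V, R, P, gamma, order, in_place)
        if max(abs(V[s] - previous[s]) for s in R) <= eps:
            return V

# Gambler's Ruin with N = 4: reward 1 at state 4, win probability 1/3.
third = Fraction(1, 3)
R = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, "end": 0}
P = {0: {"end": 1}, 4: {"end": 1}, "end": {"end": 1}}
for i in (1, 2, 3):
    P[i] = {i + 1: third, i - 1: 1 - third}
order = ["end", 4, 3, 2, 1, 0]  # backwards from the end state down to state 0

V_sync = {s: Fraction(0) for s in R}
for _ in range(5):                       # five synchronous sweeps (Table 1)
    sweep(V_sync, R, P, 1, order, in_place=False)

V_gs = {s: Fraction(0) for s in R}
for _ in range(3):                       # three in-place sweeps (Table 2)
    sweep(V_gs, R, P, 1, order, in_place=True)

V_limit = policy_evaluation(R, P, 1, Fraction(1, 10**6), order, in_place=True)
print(V_sync[2], V_gs[2])  # 13/81 133/729
```

With the sweep order shown, three in-place sweeps already carry information from state 4 all the way down to state 1, whereas the synchronous version propagates it by only one state per sweep.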

4.1 Example: Gambler’s Ruin

Assuming $\gamma = 1$, the extent of the gambler's ruin is indicated by the state values in the table below. These values are computed via policy evaluation as follows:

CS 141
Iteration   V(0)   V(1)   V(2)    V(3)    V(4)   V(end)
0           0      0      0       0       0      0
1           0      0      0       0       1      0
2           0      0      0       1/3     1      0
3           0      0      1/9     1/3     1      0
4           0      1/27   1/9     11/27   1      0
5           0      1/27   13/81   11/27   1      0

With in-place computation à la Gauss-Seidel, working backwards from the end state to state 4 down to state 0, the computation proceeds as follows:

Iteration   V(0)   V(1)       V(2)      V(3)      V(4)   V(end)
0           0      0          0         0         0      0
1           0      1/27       1/9       1/3       1      0
2           0      13/243     13/81     11/27     1      0
3           0      133/2187   133/729   107/243   1      0
100         0      0.0667     0.2       0.4667    1      0

Since the reward is 0 everywhere except at state $N$, which the gambler reaches only by winning and where the reward is 1, the final state values can be interpreted as the probability that the gambler is not ruined. At all states $1, \ldots, N-1$, the gambler is more likely to be ruined than not.
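
As a sanity check on the final row of the Gauss-Seidel computation, one can verify that a candidate set of limit values satisfies Bellman's equations exactly. The sketch below assumes that the rounded decimals 0.0667, 0.2, and 0.4667 correspond to the fractions 1/15, 1/5, and 7/15; that reading is an assumption, as are the dictionary encoding and the function name `bellman_residual`.

```python
from fractions import Fraction

# Gambler's Ruin with N = 4 and gamma = 1.
R = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, "end": 0}
P = {0: {"end": 1}, 4: {"end": 1}, "end": {"end": 1}}
for i in (1, 2, 3):
    P[i] = {i + 1: Fraction(1, 3), i - 1: Fraction(2, 3)}

# Candidate limit values (an assumed exact reading of 0.0667, 0.2, 0.4667).
V = {0: Fraction(0), 1: Fraction(1, 15), 2: Fraction(1, 5),
     3: Fraction(7, 15), 4: Fraction(1), "end": Fraction(0)}

def bellman_residual(s):
    """V(s) minus the right-hand side of Bellman's equation at s."""
    return V[s] - (R[s] + sum(p * V[s2] for s2, p in P[s].items()))

residuals = {s: bellman_residual(s) for s in V}
print(all(r == 0 for r in residuals.values()))        # True
print(all(V[i] < Fraction(1, 2) for i in (1, 2, 3)))  # True
```

The second check restates the closing observation: from every interior state, the probability of winning is below one half.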