Lecture 21–22: Reinforcement Learning

TBA

Contents

1 Overview
2 Incremental Estimation
3 Learning State Values
  3.1 Monte Carlo Policy Evaluation
  3.2 TD-Learning
  3.3 Example: Gambler’s Ruin
4 Learning Action Values
  4.1 Exploration vs. Exploitation
  4.2 Monte Carlo Control
  4.3 SARSA
  4.4 Q-Learning
  4.5 Example: Deterministic Maze

1 Overview

In this lecture, we continue our study of Markov reward and decision processes, shifting our emphasis from dynamic programming (which has its foundations in operations research) to reinforcement learning (which is true AI). Reinforcement learning is more generally applicable than dynamic programming, since (i) it does not require sweeps over the entire state space and (ii) it does not depend on the assumption that the probabilistic nature of the environment as well as the reward structure are known. In this lecture, we compute state and action value functions using only agents’ trial-and-error “experiences.” The algorithms we study, Monte Carlo simulation, TD-learning, SARSA, and Q-learning, incrementally estimate state and action values from sample trajectories.

2 Incremental Estimation

One plausible estimate of an unknown quantity is simply the average value, say $A_k$, of $k$ measurements, say $z_1, \ldots, z_k$. Given $A_k$ and the $(k+1)$st measurement, rather than recompute the sum of the first $k$ measurements, add the value of the $(k+1)$st measurement, and divide by $k+1$, we update $A_{k+1}$ incrementally as follows:

$$
\begin{aligned}
A_{k+1} &= \frac{1}{k+1} \sum_{t=0}^{k} z_{t+1} \\
&= \frac{1}{k+1} \left[ z_{k+1} + \sum_{t=0}^{k-1} z_{t+1} \right] \\
&= \frac{1}{k+1} \left[ z_{k+1} + k A_k + A_k - A_k \right] \\
&= \frac{1}{k+1} \left[ z_{k+1} + (k+1) A_k - A_k \right] \\
&= A_k + \frac{1}{k+1} \left[ z_{k+1} - A_k \right] && (1) \\
&= \frac{k}{k+1} A_k + \frac{1}{k+1} z_{k+1} && (2)
\end{aligned}
$$

That is, the new estimate $A_{k+1}$ depends in part on the old estimate $A_k$ and in part on the $(k+1)$st measurement.

More generally, the value of the $(k+1)$st measurement $z_{k+1}$ in Equation 1 can be replaced by an arbitrary “target” value $A$. Similarly, the fraction $1/(k+1)$, which decreases with the number of measurements, can be generalized by a function $0 \le \alpha_k \le 1$ that decays over time, in which case $k/(k+1)$ is replaced by $1 - \alpha_k$.

In the following equations, the new estimate $A_{k+1}$ depends in part on the old estimate $A_k$ and in part on the target $A$, where “in part” is quantified by $\alpha_k$:

$$
\begin{aligned}
A_{k+1} &= (1 - \alpha_k) A_k + \alpha_k A && (3) \\
&= A_k + \alpha_k \left[ A - A_k \right] && (4)
\end{aligned}
$$

Equation 3 generalizes Equation 2; Equation 4 generalizes Equation 1. The reinforcement learning update rules we study are all instances of Equation 4.
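The incremental update can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `incremental_update` and `running_mean` are hypothetical, and the step size $\alpha_k = 1/(k+1)$ is the special case that recovers the plain average of Equation 1.

```python
def incremental_update(estimate, target, alpha):
    """Equation 4: move the estimate a fraction alpha toward the target."""
    return estimate + alpha * (target - estimate)


def running_mean(measurements):
    """Average the measurements one at a time, using alpha_k = 1/(k+1)."""
    estimate = 0.0
    for k, z in enumerate(measurements):
        # With k = 0 the step size is 1, so the first measurement
        # overwrites the arbitrary initial estimate.
        estimate = incremental_update(estimate, z, 1.0 / (k + 1))
    return estimate


if __name__ == "__main__":
    data = [2.0, 4.0, 9.0, 1.0]
    # The two printed numbers should agree up to floating-point error.
    print(running_mean(data), sum(data) / len(data))
```

Replacing `1.0 / (k + 1)` with a constant such as 0.1 gives an exponentially weighted average, which weights recent targets more heavily than old ones.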

3 Learning State Values

Effective techniques for learning state-value functions (i.e., policy evaluation) include Monte Carlo policy evaluation and TD-learning. At a high level, these methods learn state values in an MDP by repeatedly sampling trajectories and averaging their rewards.

3.1 Monte Carlo Policy Evaluation

Recall that the value $V(s_t)$ of state $s_t$ is defined as the expected reward that is accrued from time $t$ on; that is, the expected value of $\rho^\tau_t$, where $\rho^\tau_t$ is the reward that is accrued along trajectory $\tau = (s_t, s_{t+1}, s_{t+2}, \ldots)$:

$$V(s_t) = \sum_{\tau} P[\tau \mid s_t] \, \rho^\tau_t \qquad (5)$$

Given policy $\pi$, Monte Carlo policy evaluation repeatedly generates state trajectories $\tau$ according to $\pi$ and computes $V^\pi(s_t)$ via Equation 4, setting the target value $A = \rho^\tau_t$ whenever trajectory $\tau$ is traversed, as follows:

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha_k \left[ \rho^\tau_t - V^\pi(s_t) \right] \qquad (6)$$

This technique depends on the computation of $\rho^\tau_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$. Thus, it is only applicable if there exists $t' > t$ s.t. for all $t'' > t'$, $r_{t''} = 0$. Given an MDP, an absorbing (or terminal) state is one at which reward is zero and from which it is impossible to depart. In particular, if an absorbing state is reached at time $t'$, then for all $t'' > t'$, $r_{t''} = 0$. A policy is called proper iff all trajectories it engenders eventually lead to an absorbing state, with probability 1. Assuming the policy $\pi$ is proper, Monte Carlo policy evaluation simulates episodes, beginning at a random start state and leading to an absorbing state (with probability 1). Note that for such episodes it is well-defined to simply let $\rho^\tau_t$ be the sum of future rewards (i.e., $\gamma = 1$).

mc_evaluation(MDP, π, γ)

Inputs      policy π
            discount factor γ
Output      value function Vπ
Initialize  V = 0, α according to schedule

repeat
  1. initialize s, τ, ρ
  2. while s ∉ T do
     (a) let τ = τ ∪ {s}
     (b) take action a = π(s)
     (c) observe reward r and next state s′
     (d) for all s ∈ τ, let ρ(s) = ρ(s) + r
     (e) let s = s′
  3. for all s ∈ τ, V(s) = V(s) + α[ρ(s) − V(s)]
  4. decay α according to schedule
forever

Table 1: Monte Carlo Method for Prediction, assuming γ = 1.

In the pseudocode given in Table 1, the values of the states that are visited during an episode are updated by letting the target $\rho(s)$ be the return following the first visit to state $s$. A variant of this approach instead lets the target be the average value of the returns following every visit to state $s$. Both methods converge to $V^\pi(s)$ as the number of visits to state $s$ approaches infinity.
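The first-visit procedure of Table 1 can be sketched in Python as follows. Everything about the environment is an assumption made for illustration: a hypothetical six-state chain with absorbing states 0 and 5, a fair one-unit bet at every step, and a reward of 1 on reaching state 5. The per-state step size $1/n$ is one possible decay schedule, not the one prescribed by the notes.

```python
import random

TERMINAL = {0, 5}  # hypothetical absorbing states of a six-state chain


def step(state, action, rng):
    """Hypothetical dynamics: each bet wins or loses one unit with equal odds.

    The action is ignored because this toy chain offers a single action.
    """
    next_state = state + 1 if rng.random() < 0.5 else state - 1
    reward = 1.0 if next_state == 5 else 0.0
    return reward, next_state


def mc_evaluation(policy, episodes=20000, seed=0):
    """First-visit Monte Carlo evaluation with gamma = 1, following Table 1."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(6)}
    visits = {s: 0 for s in range(6)}
    for _ in range(episodes):
        s = rng.choice([1, 2, 3, 4])       # random non-terminal start state
        rho = {}                           # return accrued since first visit
        while s not in TERMINAL:
            rho.setdefault(s, 0.0)         # tau = tau U {s}
            r, s_next = step(s, policy(s), rng)
            for visited in rho:            # every visited state accrues r
                rho[visited] += r
            s = s_next
        for visited, ret in rho.items():
            visits[visited] += 1
            alpha = 1.0 / visits[visited]  # assumed schedule: per-state 1/n
            V[visited] += alpha * (ret - V[visited])
    return V


if __name__ == "__main__":
    print(mc_evaluation(lambda s: "bet"))
```

For this fair chain the exact values are $V(s) = s/5$, so the estimates for states 1 through 4 should settle near 0.2, 0.4, 0.6, and 0.8.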

3.2 TD-Learning

TD-learning iteratively computes $V^\pi(s_t)$ via the following instantiation of Equation 4:

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha_k \left[ r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right] \qquad (7)$$

Here the target value $A = r_t + \gamma V^\pi(s_{t+1})$. The difference between $A$ and the current estimate $V^\pi(s_t)$ is called the temporal difference. Unlike Monte Carlo methods, which set the target value according to the returns achieved upon termination of a trajectory, TD-learning (inspired by Bellman’s theorem) updates based on intermediate rewards. For this reason, TD-learning does not rely on the assumption that the policy $\pi$ is proper.

td_learning(MDP, π, γ)

Inputs      policy π
            discount factor γ
Output      value function Vπ
Initialize  V = 0, α according to schedule

repeat
  1. initialize s
  2. while s ∉ T do
     (a) take action a = π(s)
     (b) observe reward r and next state s′
     (c) V(s) = V(s) + α[r + γV(s′) − V(s)]
     (d) let s = s′
  3. decay α according to schedule
forever

Table 2: TD-Learning.
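The loop in Table 2 can be sketched in Python as follows. The environment is a hypothetical six-state chain (absorbing states 0 and 5, a fair one-unit bet, reward 1 on reaching state 5), and the schedule $\alpha = 10/(100 + k)$ for episode $k$ is an arbitrary choice of decay; neither comes from the notes.

```python
import random

TERMINAL = {0, 5}  # hypothetical absorbing states of a six-state chain


def step(state, action, rng):
    """Hypothetical dynamics: each bet wins or loses one unit with equal odds."""
    next_state = state + 1 if rng.random() < 0.5 else state - 1
    reward = 1.0 if next_state == 5 else 0.0
    return reward, next_state


def td_learning(policy, gamma=1.0, episodes=20000, seed=0):
    """TD-learning of state values, following Table 2."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(6)}         # terminal values stay at zero
    for k in range(episodes):
        alpha = 10.0 / (100.0 + k)         # assumed decay schedule
        s = rng.choice([1, 2, 3, 4])       # random non-terminal start state
        while s not in TERMINAL:
            r, s_next = step(s, policy(s), rng)
            # Equation 7: the target is r + gamma * V(s'), available after
            # a single step rather than at the end of the episode.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


if __name__ == "__main__":
    print(td_learning(lambda s: "bet"))
```

Unlike the Monte Carlo loop, nothing here waits for the episode to terminate: each transition triggers exactly one update.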

Given policy $\pi$, Monte Carlo simulation and TD-learning are both guaranteed to converge to $V^\pi$ if the learning rate $\alpha_k$ decreases over time at a suitable rate (for example, $\alpha_k = 1/(k+1)$, as in Section 2), although fixed values such as 0.1 are often used in practice. TD typically converges faster, because it makes use of intermediate estimates, whereas Monte Carlo methods update based only on the final return.

3.3 Example: Gambler’s Ruin

We now compare the behavior of the Monte Carlo method and TD-learning on several sample trajectories in the Gambler’s Ruin, for fixed α =0.1 and γ =1.
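The hand calculations for this example can be replayed mechanically. The sketch below rests on conventions inferred from the arithmetic rather than stated in the notes: each trajectory lists the states visited before the episode ends, a trajectory ending in state 4 earns a final reward of 1, one ending in state 0 earns 0, and all intermediate rewards are 0. The names `EPISODES`, `replay_monte_carlo`, and `replay_td` are hypothetical.

```python
ALPHA, GAMMA = 0.1, 1.0

# (states visited, reward received on leaving the last state); assumed encoding
EPISODES = [([4], 1.0), ([3, 4], 1.0), ([2, 3, 4], 1.0), ([3, 2, 1, 0], 0.0)]


def replay_monte_carlo(episodes):
    """Apply Equation 6 to every state of each finished episode."""
    V = {s: 0.0 for s in range(5)}
    for states, final_reward in episodes:
        # With gamma = 1 and a single reward at the end, the return from
        # every state on the trajectory equals that final reward.
        for s in states:
            V[s] += ALPHA * (final_reward - V[s])
    return V


def replay_td(episodes):
    """Apply Equation 7 after each transition, in the order it occurs."""
    V = {s: 0.0 for s in range(5)}
    for states, final_reward in episodes:
        for i, s in enumerate(states):
            last = i == len(states) - 1
            r = final_reward if last else 0.0
            v_next = 0.0 if last else V[states[i + 1]]  # terminal value is 0
            V[s] += ALPHA * (r + GAMMA * v_next - V[s])
    return V


if __name__ == "__main__":
    print(replay_monte_carlo(EPISODES))
    print(replay_td(EPISODES))
```

Under these assumptions the Monte Carlo replay ends with V(2) = .09, V(3) = .171, V(4) = .271, and the TD replay ends with V(2) = .0009, V(3) = .0253, V(4) = .271.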

Trajectory   Monte Carlo                          TD-learning

4            V(4) = 0 + .1[1 − 0] = .1            V(4) = 0 + .1[1 + 0 − 0] = .1

3 4          V(3) = 0 + .1[1 − 0] = .1            V(3) = 0 + .1[0 + .1 − 0] = .01
             V(4) = .1 + .1[1 − .1] = .19         V(4) = .1 + .1[1 + 0 − .1] = .19

2 3 4        V(2) = 0 + .1[1 − 0] = .1            V(2) = 0 + .1[0 + .01 − 0] = .001
             V(3) = .1 + .1[1 − .1] = .19         V(3) = .01 + .1[0 + .19 − .01] = .028
             V(4) = .19 + .1[1 − .19] = .271      V(4) = .19 + .1[1 + 0 − .19] = .271

3 2 1 0      V(3) = .19 + .1[0 − .19] = .171      V(3) = .028 + .1[0 + .001 − .028] = .0253
             V(2) = .1 + .1[0 − .1] = .09         V(2) = .001 + .1[0 + 0 − .001] = .0009
             V(1) = 0 + .1[0 − 0] = 0             V(1) = 0 + .1[0 + 0 − 0] = 0
             V(0) = 0 + .1[0 − 0] = 0             V(0) = 0 + .1[0 + 0 − 0] = 0

4 Learning Action Values

We now turn our attention to algorithms that learn action-value functions, from which we can derive optimal policies. Following the structure of the previous section, we present one Monte Carlo-based learning algorithm for control, and another, called SARSA, which is based on TD-learning. We also present a third algorithm, Q-learning, that uses an update equation inspired by Bellman’s optimality equations. But before presenting any reinforcement learning algorithms for control, we revisit the issue of exploration vs. exploitation, which arises again in this application domain.

4.1 Exploration vs. Exploitation

Recall that in the reinforcement learning framework it is not assumed that the probabilistic nature of the environment is known. Moreover, it is also not assumed that the reward structure is known. Instead, reinforcement learning agents wander through their environments learning about rewards only at the states they visit for the actions they employ.

Naturally, such agents would aim to reinforce, that is “become more and more likely to employ,” those actions that are found to be the most rewarding. With this objective in mind, reinforcement learning agents are susceptible to the trade-offs between exploration and exploitation (as in simulated annealing) while learning action values. By exploiting actions that have proven themselves to be successful in the past, it is possible to perform well; but by exploring alternative actions, it is possible to perform even better.

One popular method of exploration is $\epsilon$-greedy: if $\pi$ is the policy that currently appears optimal and $s$ is the current state, then with probability $1 - \epsilon$, exploit (take action $\pi(s)$), but with probability $\epsilon$, explore (choose an action at random). Typically, $\epsilon$ is decayed over time (e.g., $\epsilon \propto 1/t$). When it explores, however, this technique chooses among seemingly optimal and sub-optimal actions with equal probability.
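The selection rule just described can be sketched in a few lines of Python. The function name `epsilon_greedy` and the representation of action values as a dictionary keyed by (state, action) pairs are assumptions made for illustration.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Explore uniformly with probability epsilon; otherwise act greedily.

    Q maps (state, action) pairs to estimated values (assumed layout).
    """
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])    # exploit


if __name__ == "__main__":
    Q = {("s", "left"): 1.0, ("s", "right"): 2.0}
    rng = random.Random(0)
    picks = [epsilon_greedy(Q, "s", ["left", "right"], 0.1, rng) for _ in range(1000)]
    print(picks.count("right") / len(picks))   # mostly the greedy action
```

Note that the exploratory branch may also return the greedy action, so the greedy action is chosen with probability somewhat above $1 - \epsilon$.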

An alternative is to use the softmax action selection method, which relies on the Boltzmann distribution. Specifically, given state $s_t$, action $a$ is selected with the following probability:

$$\frac{e^{Q(s_t, a)/T}}{\sum_{a'} e^{Q(s_t, a')/T}}$$

where the temperature parameter $T$ gradually decreases (as in simulated annealing). All actions are nearly equiprobable at initial higher temperatures; in contrast, at lower temperatures, actions with higher estimated values become far more probable than the rest.
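The effect of the temperature can be seen in a small Python sketch. The function names are hypothetical, and subtracting the maximum value before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged; it is not part of the formula in the notes.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Softmax probabilities for a list of action values at one state."""
    m = max(q_values)  # shift by the maximum to avoid overflow in exp
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    print(boltzmann_probabilities(q, 100.0))  # close to uniform
    print(boltzmann_probabilities(q, 0.1))    # almost all mass on the best action
```

In contrast with $\epsilon$-greedy, an action's chance of being explored here depends on how promising its current estimate is.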

4.2 Monte Carlo Control

Recall that policy iteration alternates between improving the current policy to arrive at a new policy, and then evaluating that new policy. To extend Monte Carlo evaluation to control, it suffices to insert improvement steps between the repeated evaluation steps (see Table 3).

Note that no Monte Carlo control algorithm can converge to a suboptimal policy. If it were to do so, then the value function corresponding to that policy would eventually be learned (via Monte Carlo evaluation), at which point it would be determined that alternative actions are preferable. Convergence requires both the policy and the value function to be optimal.

mc_control(MDP, π, γ, ε)

Inputs      policy π
            discount factor γ
            rate of exploration ε
Output      action-value function Q
Initialize  Q = 0, α according to schedule

repeat
  1. initialize s, τ, ρ
  2. while s ∉ T do
     (a) choose action a = π(s) with probability 1 − ε
         choose random action a with probability ε
     (b) let τ = τ ∪ {(s, a)}
     (c) take action a; observe reward r and next state s′
     (d) for all (s, a) ∈ τ, let ρ(s, a) = ρ(s, a) + r
     (e) let s = s′
  3. for all (s, a) ∈ τ, Q(s, a) = Q(s, a) + α[ρ(s, a) − Q(s, a)]
  4. for all s ∈ S, π(s) ∈ argmax_a Q(s, a)
  5. decay α according to schedule
forever

Table 3: Monte Carlo Method for Control, assuming γ = 1.
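The alternation of evaluation and improvement in Table 3 can be sketched in Python as follows. The environment is a hypothetical deterministic maze, and the reward of −1 per move is an assumption chosen so that, with $\gamma = 1$, shorter paths to the goal earn higher returns. The fixed step size 0.1 and the random initial policy are likewise illustrative choices.

```python
import random

GOAL = "F"
MOVES = {  # hypothetical deterministic maze: state -> {action: next state}
    "A": {"u": "C", "r": "B"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}


def step(s, a):
    """Hypothetical rewards: every move costs 1."""
    return -1.0, MOVES[s][a]


def mc_control(episodes=5000, alpha=0.1, epsilon=0.2, seed=0):
    """Monte Carlo control with gamma = 1, following Table 3."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    policy = {s: rng.choice(list(MOVES[s])) for s in MOVES}
    for _ in range(episodes):
        s = rng.choice(list(MOVES))
        rho = {}                                   # return since first visit
        while s != GOAL:
            if rng.random() < epsilon:
                a = rng.choice(list(MOVES[s]))     # explore
            else:
                a = policy[s]                      # exploit
            rho.setdefault((s, a), 0.0)            # tau = tau U {(s, a)}
            r, s = step(s, a)
            for pair in rho:                       # all visited pairs accrue r
                rho[pair] += r
        for pair, ret in rho.items():              # evaluation step
            Q[pair] += alpha * (ret - Q[pair])
        for state in MOVES:                        # improvement step
            policy[state] = max(MOVES[state], key=lambda a: Q[(state, a)])
    return Q, policy


if __name__ == "__main__":
    Q, policy = mc_control()
    print(policy)
```

In this sketch the learned values describe the $\epsilon$-greedy behavior that generated the episodes, so they are slightly more pessimistic than the values of the purely greedy policy.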

4.3 SARSA

Just as Monte Carlo control is a control algorithm that generalizes Monte Carlo evaluation, SARSA (see Table 4) is a control algorithm that generalizes TD-learning. SARSA updates not just on the trajectory $(s_t, r_t, s_{t+1})$, but rather on the trajectory $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$. More specifically, given state-action pair $(s_t, a_t)$, SARSA simulates the action $a_t$ in state $s_t$ to obtain the reward $r_t$ and transition to state $s_{t+1}$. The algorithm then uses its current optimal policy (based on the current Q-values) to generate its next action $a_{t+1}$ (but with probability $\epsilon$ it chooses an action at random). At this point, SARSA updates $Q(s_t, a_t)$ as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_k \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (8)$$

This update rule is based on the following variant of Bellman’s optimality equations:

$$Q(s_t, a_t) = R(s_t, a_t) + \gamma E\left[ Q(s_{t+1}, \pi(s_{t+1})) \right] \qquad (9)$$

where

$$\pi(s_{t+1}) \in \arg\max_a Q(s_{t+1}, a) \qquad (10)$$

sarsa(MDP, γ, ε)

Inputs      discount factor γ
            rate of exploration ε
Output      action-value function Q
Initialize  Q = 0, random π, α according to schedule

repeat
  1. initialize s, a
  2. while s ∉ T do
     (a) take action a
     (b) observe reward r and next state s′
     (c) choose random action a′, with probability ε
         choose action a′ = π(s′), with probability 1 − ε
     (d) Q(s, a) = Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
     (e) π(s) ∈ argmax_a″ Q(s, a″)
     (f) s = s′, a = a′
  3. decay α according to schedule
forever

Table 4: SARSA: On-policy Reinforcement Learning.
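The SARSA loop can be sketched in Python as follows. The maze, its reward of 100 on entering the goal, and the parameter values are assumptions chosen for illustration. Instead of storing $\pi$ explicitly, the sketch recomputes the greedy action from Q whenever it is needed, which has the same effect as step (e) of Table 4.

```python
import random

GOAL = "F"
MOVES = {  # hypothetical deterministic maze: state -> {action: next state}
    "A": {"u": "C", "r": "B"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}


def step(s, a):
    """Hypothetical rewards: 100 on entering the goal, 0 otherwise."""
    s_next = MOVES[s][a]
    return (100.0 if s_next == GOAL else 0.0), s_next


def sarsa(episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """On-policy TD control, following Table 4 (fixed alpha for simplicity)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}

    def choose(s):
        if rng.random() < epsilon:
            return rng.choice(list(MOVES[s]))              # explore
        return max(MOVES[s], key=lambda a: Q[(s, a)])      # follow greedy pi

    for _ in range(episodes):
        s = rng.choice(list(MOVES))
        a = choose(s)
        while s != GOAL:
            r, s_next = step(s, a)
            if s_next == GOAL:
                a_next, q_next = None, 0.0   # absorbing state has value 0
            else:
                a_next = choose(s_next)
                q_next = Q[(s_next, a_next)]
            # Equation 8: bootstrap from the action actually chosen next.
            Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])
            s, a = s_next, a_next
    return Q


if __name__ == "__main__":
    Q = sarsa()
    print({k: round(v, 1) for k, v in Q.items()})
```

Because the bootstrap term uses the action the agent will actually take, exploratory moves lower the learned values of states from which a random move is costly.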

4.4 Q-Learning

Whereas TD-learning is an application of Bellman’s theorem for $V$, Q-learning is based on Bellman’s optimality equations for $Q$:

$$Q(s_t, a_t) = R(s_t, a_t) + \gamma E\left[ \max_a Q(s_{t+1}, a) \right] \qquad (11)$$

The corresponding update rule is the basis for Q-learning (see Table 5):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_k \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (12)$$

SARSA is an on-policy reinforcement learning algorithm, which means that the algorithm learns a policy while simultaneously following that policy (or a close approximation thereof). In contrast, Q-learning is an off-policy reinforcement learning algorithm: the policy Q-learning learns need not bear any resemblance to the policy the algorithm follows while learning. Because it learns off-policy, the rate of exploration input to Q-learning (or any off-policy algorithm) can greatly exceed that which is input to SARSA (or any on-policy algorithm), which can lead to faster convergence. But Q-learning is not prevented from taking actions that are on-policy; doing so leads to behavior that is closely related to that of SARSA.

q_learning(MDP, γ, ε)

Inputs      discount factor γ
            rate of exploration ε
Output      action-value function Q
Initialize  Q = 0, α according to schedule

repeat
  1. initialize s, a
  2. while s ∉ T do
     (a) take action a
     (b) observe reward r and next state s′
     (c) Q(s, a) = Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
     (d) choose action a′
     (e) s = s′, a = a′
  3. decay α according to schedule
forever

Table 5: Q-Learning: Off-policy reinforcement learning.
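The Q-learning loop can be sketched in Python as follows. The maze, its reward of 100 on entering the goal, the $\epsilon$-greedy behavior policy, and the parameter values are assumptions chosen for illustration; Table 5 leaves the choice of behavior policy open.

```python
import random

GOAL = "F"
MOVES = {  # hypothetical deterministic maze: state -> {action: next state}
    "A": {"u": "C", "r": "B"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}


def step(s, a):
    """Hypothetical rewards: 100 on entering the goal, 0 otherwise."""
    s_next = MOVES[s][a]
    return (100.0 if s_next == GOAL else 0.0), s_next


def q_learning(episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Off-policy TD control, following Table 5 (fixed alpha for simplicity)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    for _ in range(episodes):
        s = rng.choice(list(MOVES))
        while s != GOAL:
            # Behavior policy (assumed): epsilon-greedy with respect to Q.
            if rng.random() < epsilon:
                a = rng.choice(list(MOVES[s]))
            else:
                a = max(MOVES[s], key=lambda act: Q[(s, act)])
            r, s_next = step(s, a)
            # Equation 12: bootstrap from the best next action, regardless
            # of which action the behavior policy will take. The goal has
            # no actions, so its value defaults to 0.
            best_next = max((Q[(s_next, b)] for b in MOVES.get(s_next, {})),
                            default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


if __name__ == "__main__":
    Q = q_learning()
    print({k: round(v, 1) for k, v in Q.items()})
```

The only difference from a SARSA loop is the bootstrap term: the maximum over next actions replaces the value of the action actually taken.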

4.5 Example: Deterministic Maze

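In the case of deterministic environments, the update rules for SARSA and Q-learning simplify as follows, respectively:

$$Q(s_t, a_t) \leftarrow r_t + \gamma Q(s_{t+1}, a_{t+1}) \qquad (13)$$

$$Q(s_t, a_t) \leftarrow r_t + \gamma \max_a Q(s_{t+1}, a) \qquad (14)$$

Figure 1 depicts a deterministic maze. Possible moves are indicated by arrows. The final (absorbing) state is F; upon transitioning into state F, a reward of 100 is obtained. All other rewards are zero. Let $\gamma = 0.9$.

[Figure: a grid maze with states C, E, F in the top row and A, B, D in the bottom row; the arrows into F, from E and from D, are labeled 100.]

Figure 1: Deterministic Maze.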

Value Iteration

Q(s,a)   A    B    C    D    E
l        -    73   -    81   81
r        81   90   90   -    100
u        81   90   -    100  -
d        -    -    73   -    81
V(s)     81   90   90   100  100

Q-Learning

Q(s,a)   A    B    C    D    E
l        -    0    -    0    0
r        81   90   90   -    100
u        81   90   -    100  -
d        -    -    0    -    0
V(s)     81   90   90   100  100

Trajectory     Q-Learning

D F            Q(D,u) = 100 + .9 max_a Q(F,a) = 100
E F            Q(E,r) = 100 + .9 max_a Q(F,a) = 100
C E F          Q(C,r) = 0 + .9 max_a Q(E,a) = 90
A C E F        Q(A,u) = 0 + .9 max_a Q(C,a) = 81
B A C E F      Q(B,l) = 0 + .9 max_a Q(A,a) = 73
D B A C E F    Q(D,l) = 0 + .9 max_a Q(B,a) = 66
E B D F        Q(B,r) = 0 + .9 max_a Q(D,a) = 90
               Q(E,d) = 0 + .9 max_a Q(B,a) = 81

Trajectory     SARSA

D F            Q(D,u) = 100 + .9 Q(F,q) = 100
E F            Q(E,r) = 100 + .9 Q(F,q) = 100
C E F          Q(C,r) = 0 + .9 Q(E,r) = 90
A C E F        Q(A,u) = 0 + .9 Q(C,r) = 81
B A C E F      Q(B,l) = 0 + .9 Q(A,u) = 73
D B A C E F    Q(D,l) = 0 + .9 Q(B,l) = 66
E B D F        Q(B,r) = 0 + .9 Q(D,u) = 90
               Q(E,d) = 0 + .9 Q(B,r) = 81
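The trajectory updates in this example can be replayed with the deterministic rules of Equations 13 and 14. The sketch below makes two assumptions that are inferred from the numbers rather than stated in the notes. First, the transition map `MOVES` is reconstructed from the Q-value tables. Second, within each trajectory the updates are applied from the last transition backward; this is what the final trajectory requires, since Q(E,d) = 81 presupposes that Q(B,r) = 90 has already been computed.

```python
GAMMA = 0.9
GOAL = "F"
MOVES = {  # transitions inferred from the example: state -> {action: next}
    "A": {"u": "C", "r": "B"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}
TRAJECTORIES = ["DF", "EF", "CEF", "ACEF", "BACEF", "DBACEF", "EBDF"]


def replay(trajectories, use_max):
    """Replay the trajectories with Equation 14 (use_max) or Equation 13."""
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    for traj in trajectories:
        # Each action is implied by a pair of consecutive states.
        actions = [next(a for a, nxt in MOVES[s].items() if nxt == s2)
                   for s, s2 in zip(traj, traj[1:])]
        for i in reversed(range(len(actions))):   # assumed backward order
            s, a, s2 = traj[i], actions[i], traj[i + 1]
            r = 100.0 if s2 == GOAL else 0.0
            if s2 == GOAL:
                q_next = 0.0                       # absorbing state
            elif use_max:
                q_next = max(Q[(s2, b)] for b in MOVES[s2])   # Q-learning
            else:
                q_next = Q[(s2, actions[i + 1])]              # SARSA
            Q[(s, a)] = r + GAMMA * q_next
    return Q


if __name__ == "__main__":
    for use_max in (True, False):
        Q = replay(TRAJECTORIES, use_max)
        print({k: round(v) for k, v in Q.items() if v > 0})
```

On these particular trajectories both rules yield the same rounded values, because every action taken after the first step of a trajectory is also the greedy one at the time of the update.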