TBA
3.1 Monte Carlo Policy Evaluation
3.2 TD-Learning
3.3 Example: Gambler’s Ruin
4.1 Exploration vs. Exploitation
4.2 Monte Carlo Control
4.3 SARSA
4.4 Q-Learning
4.5 Example: Deterministic Maze
In this lecture, we continue our study of Markov reward and decision processes, shifting our emphasis from dynamic programming (which has its foundations in operations research) to reinforcement learning (which is true AI). Reinforcement learning is more generally applicable than dynamic programming, since (i) it does not require sweeps over the entire state space and (ii) it does not depend on the assumption that the probabilistic nature of the environment and the reward structure are known. In this lecture, we compute state and action value functions using only agents’ trial-and-error “experiences.” The algorithms we study (Monte Carlo simulation, TD-learning, Q-learning, and sarsa) incrementally estimate state and action values from sample trajectories.
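One plausible estimate of an unknown quantity is simply the average value, say $A_k$, of $k$ measurements, say $z_1, \ldots, z_k$. Given $A_k$ and the $(k+1)$st measurement, rather than recompute the sum of the first $k$ measurements, add the value of the $(k+1)$st measurement, and divide by $k+1$, we update $A_{k+1}$ incrementally as follows:

\begin{align}
A_{k+1} &= \frac{1}{k+1} \sum_{t=0}^{k} z_{t+1} \notag \\
&= \frac{1}{k+1} \left[ z_{k+1} + \sum_{t=0}^{k-1} z_{t+1} \right] \notag \\
&= \frac{1}{k+1} \left[ z_{k+1} + k A_k + A_k - A_k \right] \notag \\
&= \frac{1}{k+1} \left[ z_{k+1} + (k+1) A_k - A_k \right] \notag \\
&= A_k + \frac{1}{k+1} \left[ z_{k+1} - A_k \right] \tag{1} \\
&= \frac{k}{k+1} A_k + \frac{1}{k+1} z_{k+1} \tag{2}
\end{align}

That is, the new estimate $A_{k+1}$ depends in part on the old estimate $A_k$ and in part on the $(k+1)$st measurement.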
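As a quick check of Equation 1, the following minimal sketch computes running averages incrementally and compares them with averages recomputed from scratch. The function name `running_averages` and the sample measurements are illustrative assumptions, not part of the lecture.

```python
def running_averages(measurements):
    """Return A_1, ..., A_n computed with the incremental rule of Equation 1."""
    estimate = 0.0  # A_0; its value is irrelevant because the first step size is 1
    averages = []
    for k, z in enumerate(measurements):  # k = 0, 1, ...; z plays the role of z_{k+1}
        estimate += (z - estimate) / (k + 1)
        averages.append(estimate)
    return averages


zs = [2.0, 4.0, 9.0, 1.0]  # hypothetical measurements
incremental = running_averages(zs)
batch = [sum(zs[:n]) / n for n in range(1, len(zs) + 1)]
print(incremental)
print(batch)
```

Both lists should agree up to floating-point rounding, while the incremental version never revisits earlier measurements.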
More generally, the value of the $(k+1)$st measurement $z_{k+1}$ in Equation 1 can be replaced by an arbitrary “target” value $A$. Similarly, the fraction $1/(k+1)$, which decreases with the number of measurements, can be generalized by a step size $0 < \alpha_k \le 1$ that decays as $k$ grows, in which case $k/(k+1)$ is replaced by $1 - \alpha_k$.
In the following equations, the new estimate $A_{k+1}$ depends in part on the old estimate $A_k$ and in part on the target $A$, where “in part” is quantified by $\alpha_k$:

\begin{align}
A_{k+1} &= (1 - \alpha_k) A_k + \alpha_k A \tag{3} \\
&= A_k + \alpha_k \left[ A - A_k \right] \tag{4}
\end{align}

Equation 3 generalizes Equation 2; Equation 4 generalizes Equation 1. The reinforcement learning update rules we study are all instances of Equation 4.
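The two forms of the general update can be sketched as follows. The function names `update` and `update_convex`, the constant step size 0.1, and the target 1 are assumptions chosen for illustration.

```python
def update(estimate, target, alpha):
    """Equation 4: move the estimate a fraction alpha of the way toward the target."""
    return estimate + alpha * (target - estimate)


def update_convex(estimate, target, alpha):
    """Equation 3: the same update written as a weighted average."""
    return (1 - alpha) * estimate + alpha * target


estimate = 0.0
history = []
for _ in range(3):
    estimate = update(estimate, target=1.0, alpha=0.1)
    history.append(round(estimate, 3))
print(history)
```

With a constant step size, each update closes a fixed fraction of the remaining gap to the target, so older targets are weighted less and less.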
Effective techniques for learning state-value functions (e.g., policy evaluation) include Monte Carlo policy evaluation and TD-learning. At a high level, these methods learn state values in an MDP by repeatedly sampling trajectories and averaging their rewards.
Recall that the value $V(s_t)$ of state $s_t$ is defined as the expected reward that is accrued from time $t$ on; that is, the expected value of $\rho^\tau_t$, where $\rho^\tau_t$ is the reward that is accrued along trajectory $\tau = (s_t, s_{t+1}, s_{t+2}, \ldots)$:

$$ V(s_t) = \sum_{\tau} P[\tau \mid s_t] \, \rho^\tau_t \tag{5} $$
Given policy $\pi$, Monte Carlo policy evaluation repeatedly generates state trajectories $\tau$ according to $\pi$ and computes $V^\pi(s_t)$ via Equation 4, setting the target value $A = \rho^\tau_t$ whenever trajectory $\tau$ is traversed, as follows:

$$ V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha_k \left[ \rho^\tau_t - V^\pi(s_t) \right] \tag{6} $$
This technique depends on the computation of $\rho^\tau_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$. Thus, it is only applicable if there exists $t' > t$ such that $r_{t''} = 0$ for all $t'' > t'$. Given an MDP, an absorbing (or terminal) state is one at which reward is zero and from which it is impossible to depart. In particular, if an absorbing state is reached at time $t'$, then $r_{t''} = 0$ for all $t'' > t'$. A policy is called proper iff all trajectories it engenders eventually lead to an absorbing state, with probability 1. Assuming the policy $\pi$ is proper, Monte Carlo policy evaluation simulates episodes, beginning at a random start state and leading to an absorbing state (with probability 1). Note that for such episodes it is well-defined to simply let $\rho^\tau_t$ be the sum of future rewards (i.e., $\gamma = 1$).
mc evaluation(MDP, π, γ)
| Inputs | policy π; discount factor γ |
|---|---|
| Output | value function Vπ |
| Initialize | V = 0, α according to schedule |
repeat
    (e) let s = s′
forever
Table 1: Monte Carlo Method for Prediction, assuming γ = 1.
In the pseudocode given in Table 1, the values of the states that are visited during an episode are updated by letting $R_t$ be the value of the returns following the first visit to state $s$. A variant of this approach instead lets $R_t$ be the average value of the returns following every visit to state $s$. Both methods converge to $V^\pi(s)$ as the number of visits to state $s$ approaches infinity.
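The first-visit variant can be sketched in Python as below. The environment is an assumption made for illustration: a five-state chain (states 0 to 4, both ends absorbing) in which each step moves up or down with equal probability and a reward of 1 is paid on entering state 4. The names `sample_episode` and `mc_evaluation`, the fixed step size, and the episode count are likewise hypothetical.

```python
import random

GOAL = 4  # assumed chain: states 0..4, with 0 and GOAL absorbing


def sample_episode(start, rng, p_win=0.5):
    """Simulate one episode; rewards[i] is received on leaving states[i]."""
    states, rewards = [start], []
    s = start
    while s not in (0, GOAL):
        s = s + 1 if rng.random() < p_win else s - 1
        rewards.append(1.0 if s == GOAL else 0.0)
        states.append(s)
    return states, rewards


def mc_evaluation(episodes=20000, alpha=0.01, seed=0):
    """First-visit Monte Carlo evaluation with gamma = 1 and a fixed step size."""
    rng = random.Random(seed)
    V = [0.0] * (GOAL + 1)
    for _ in range(episodes):
        states, rewards = sample_episode(rng.randint(1, GOAL - 1), rng)
        visited = set()
        for t, s in enumerate(states[:-1]):  # the final state is absorbing
            if s in visited:
                continue  # only the first visit to s is used
            visited.add(s)
            ret = sum(rewards[t:])  # return following the first visit to s
            V[s] += alpha * (ret - V[s])  # Equation 6
    return V


V = mc_evaluation()
print([round(v, 2) for v in V])
```

For this assumed chain the true values are V(s) = s/4 for the non-absorbing states, so the estimates should land near 0.25, 0.5, and 0.75, with some noise because the step size is held fixed.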
TD-learning iteratively computes $V^\pi(s_t)$ via the following instantiation of Equation 4:

$$ V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha_k \left[ r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right] \tag{7} $$

Here the target value is $A = r_t + \gamma V^\pi(s_{t+1})$. The difference between $A$ and the current estimate $V^\pi(s_t)$ is called the temporal difference. Unlike Monte Carlo methods, which set the target value according to the returns achieved upon termination of a trajectory, TD-learning (inspired by Bellman’s theorem) updates based on intermediate rewards. For this reason, TD-learning does not rely on the assumption that the policy $\pi$ is proper.
td learning(MDP, π, γ)
| Inputs | policy π; discount factor γ |
|---|---|
| Output | value function Vπ |
| Initialize | V = 0, α according to schedule |
repeat
    1. initialize s
    2. while s ∉ T do
        (a) take action a = π(s)
        (b) observe reward r and next state s′
        (c) V(s) = V(s) + α[r + γV(s′) − V(s)]
        (d) let s = s′
    3. decay α according to schedule
forever
Table 2: TD-Learning.
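A minimal Python sketch of the TD-learning loop follows. As an assumption for illustration, it uses a five-state chain (states 0 to 4, both ends absorbing, equal chance of moving up or down, reward 1 on entering state 4); there is no action choice in this chain, so the policy is implicit. The function name `td_evaluation` and the parameter values are hypothetical.

```python
import random

GOAL = 4  # assumed chain: states 0..4, with 0 and GOAL absorbing


def td_evaluation(episodes=20000, alpha=0.02, gamma=1.0, p_win=0.5, seed=0):
    """TD-learning with a fixed step size on the assumed chain."""
    rng = random.Random(seed)
    V = [0.0] * (GOAL + 1)  # values of absorbing states stay at 0
    for _ in range(episodes):
        s = rng.randint(1, GOAL - 1)  # 1. initialize s
        while s not in (0, GOAL):  # 2. while s is not terminal
            s_next = s + 1 if rng.random() < p_win else s - 1
            r = 1.0 if s_next == GOAL else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # Equation 7
            s = s_next
    return V


V = td_evaluation()
print([round(v, 2) for v in V])
```

Unlike the Monte Carlo method, each update here happens immediately after a single transition and uses the current estimate of the next state in place of the full return.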
Given policy $\pi$, Monte Carlo simulation and TD-learning are both guaranteed to converge to $V^\pi$ if the learning rate $\alpha_k$ decreases over time at a suitable rate and every state continues to be visited (fixed values such as 0.1 are often used in practice, at the cost of this guarantee). TD typically converges faster, because it makes use of intermediate estimates, whereas Monte Carlo methods update based on the final return.
We now compare the behavior of the Monte Carlo method and TD-learning on several sample trajectories in the Gambler’s Ruin, for fixed α = 0.1 and γ = 1.

| Trajectory | Monte Carlo | TD-learning |
|---|---|---|
| 2 → 3 → 4 | V(2) = 0 + .1[1 − 0] = .1; V(3) = .1 + .1[1 − .1] = .19; V(4) = .19 + .1[1 − .19] = .271 | V(4) = 0 + .1[1 + 0 − 0] = .1; V(3) = 0 + .1[0 + .1 − 0] = .01; V(4) = .1 + .1[1 + 0 − .1] = .19; V(2) = 0 + .1[0 + .01 − 0] = .001; V(3) = .01 + .1[0 + .19 − .01] = .028; V(4) = .19 + .1[1 + 0 − .19] = .271 |
| 3 → 2 → 1 → 0 | V(3) = .19 + .1[0 − .19] = .171; V(2) = .1 + .1[0 − .1] = .09; V(1) = 0 + .1[0 − 0] = 0; V(0) = 0 + .1[0 − 0] = 0 | V(3) = .028 + .1[0 + .001 − .028] = .0253; V(2) = .001 + .1[0 + 0 − .001] = .0009; V(1) = 0 + .1[0 + 0 − 0] = 0; V(0) = 0 + .1[0 + 0 − 0] = 0 |
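The TD-learning figures above can be replayed mechanically. The sketch below rests on one reading of the example, which is an assumption: the trajectory 2 → 3 → 4 is traversed three times (updates that leave a value at zero are not listed), then 3 → 2 → 1 → 0 once; a reward of 1 is received in state 4; and the value after the end of a trajectory is 0. The helper name `td_replay` is hypothetical.

```python
def td_replay(V, trajectory, reward, alpha=0.1, gamma=1.0):
    """Apply Equation 7 along one trajectory; past its end the value is taken to be 0."""
    for i, s in enumerate(trajectory):
        v_next = V[trajectory[i + 1]] if i + 1 < len(trajectory) else 0.0
        V[s] += alpha * (reward(s) + gamma * v_next - V[s])
    return V


def reward(s):
    """Assumed reward: 1 in state 4, 0 elsewhere."""
    return 1.0 if s == 4 else 0.0


V = {s: 0.0 for s in range(5)}
for _ in range(3):
    td_replay(V, [2, 3, 4], reward)
after_wins = dict(V)  # snapshot after the three winning passes
td_replay(V, [3, 2, 1, 0], reward)
print({s: round(v, 4) for s, v in after_wins.items()})
print({s: round(v, 4) for s, v in V.items()})
```

Under these assumptions the snapshot should match the .001, .028, and .271 entries, and the final values should match .0009 and .0253.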
We now turn our attention to algorithms that learn action-value functions, from which we can derive optimal policies. Following the structure of the previous section, we present one Monte Carlo based learning algorithm for control, and another, called sarsa, which is based on TD-learning. We also present a third algorithm, Q-learning, that uses an update equation inspired by Bellman’s optimality equations. But before presenting any reinforcement learning algorithms for control, we revisit the issue of exploration vs. exploitation, which arises again in this application domain.
Recall that in the reinforcement learning framework it is not assumed that the probabilistic nature of the environment is known. Moreover, it is also not assumed that the reward structure is known. Instead, reinforcement learning agents wander through their environments learning about rewards only at the states they visit for the actions they employ.
Naturally, such agents would aim to reinforce, that is “become more and more likely to employ,” those actions that are found to be the most rewarding. With this objective in mind, reinforcement learning agents are susceptible to the trade-offs between exploration and exploitation (as in simulated annealing) while learning action values. By exploiting actions that have proven themselves to be successful in the past, it is possible to perform well; but by exploring alternative actions, it is possible to perform even better.
One popular method of exploration is ε-greedy: if π is the currently optimal policy and s is the current state, then with probability 1 − ε, exploit (take action π(s)), but with probability ε, explore (choose an action at random). Typically, ε is decayed over time (e.g., ε ∼ 1/t). This technique, however, explores seemingly optimal and sub-optimal actions with equal probability: whenever it explores, every action is equally likely to be chosen, regardless of its estimated value.
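A minimal sketch of ε-greedy selection is shown below. Representing the estimates for the current state as a dictionary from actions to values is an assumption, as are the function name `epsilon_greedy` and the sample numbers.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values (a dict: action -> estimated value)."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform over all actions
    return max(actions, key=q_values.get)  # exploit: currently best action


q = {"l": 0.0, "r": 90.0, "u": 81.0}  # hypothetical estimates for one state
rng = random.Random(0)
picks = [epsilon_greedy(q, epsilon=0.2, rng=rng) for _ in range(10000)]
share_greedy = picks.count("r") / len(picks)
print(round(share_greedy, 2))
```

Note that the greedy action can also be drawn during exploration, so with three actions it is selected with probability 1 − ε + ε/3 rather than 1 − ε.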
An alternative is to use the softmax action selection method, which relies on the Boltzmann distribution. Specifically, given state $s_t$, action $a$ is selected with the following probability:

$$ \frac{e^{Q(s_t,a)/T}}{\sum_{a'} e^{Q(s_t,a')/T}} $$

where the temperature parameter $T$ gradually decreases (as in simulated annealing). All actions are nearly equiprobable at initial higher temperatures; in contrast, lower temperatures extol the virtues of some actions but belittle others.
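The effect of the temperature can be sketched as follows. The function names `boltzmann_probabilities` and `softmax_action` and the sample values are assumptions; subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Return a dict action -> selection probability under the Boltzmann distribution."""
    top = max(q_values.values())  # subtracted for numerical stability only
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def softmax_action(q_values, temperature, rng=random):
    """Sample one action according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


q = {"l": 0.0, "r": 1.0, "u": 0.5}  # hypothetical estimates for one state
hot = boltzmann_probabilities(q, temperature=100.0)
cold = boltzmann_probabilities(q, temperature=0.05)
print({a: round(p, 3) for a, p in hot.items()})
print({a: round(p, 3) for a, p in cold.items()})
```

At the high temperature the three probabilities should be close to one third each; at the low temperature nearly all of the probability should sit on the action with the largest estimate.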
Recall that policy iteration alternates between improving the current policy to arrive at a new policy, and then evaluating that new policy. To extend Monte Carlo evaluation to control, it suffices to insert improvement steps between the repeated evaluation steps (see Table 3).
Note that no Monte Carlo control algorithm can converge to a suboptimal policy. If it were to do so, then the value function corresponding to that policy would eventually be learned (via Monte Carlo evaluation), at which point it would be determined that alternative actions are preferable. Convergence requires both the policy and the value function to be optimal.
mc control(MDP, π, γ, ε)
| Inputs | policy π; discount factor γ; rate of exploration ε |
|---|---|
| Output | value function Vπ |
| Initialize | V = 0, α according to schedule |
repeat
    (e) let s = s′, a = a′
forever
Table 3: Monte Carlo Method for Control, assuming γ = 1.
Just as Monte Carlo control is a control algorithm that generalizes Monte Carlo evaluation, sarsa (see Table 4) is a control algorithm that generalizes TD-learning. sarsa updates not just on the trajectory $(s_t, r_t, s_{t+1})$, but rather on the trajectory $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$. More specifically, given state-action pair $(s_t, a_t)$, sarsa simulates the action $a_t$ in state $s_t$ to obtain the reward $r_t$ and transition to state $s_{t+1}$. The algorithm then uses its currently optimal policy, based on the current Q-values, to generate its next action $a_{t+1}$ (but with probability ε it chooses an action at random). At this point, sarsa updates $Q(s_t, a_t)$ as follows:

$$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_k \left[ r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \tag{8} $$
This update rule is based on the following variant of Bellman’s optimality equations:

$$ Q^*(s_t,a_t) = R(s_t,a_t) + \gamma E\left[ Q^*(s_{t+1}, \pi^*(s_{t+1})) \right] \tag{9} $$

where

$$ \pi^*(s_{t+1}) \in \arg\max_{a} Q^*(s_{t+1},a) \tag{10} $$
sarsa(MDP, γ, ε)
| Inputs | discount factor γ; rate of exploration ε |
|---|---|
| Output | action-value function Q∗ |
| Initialize | Q = 0, random π, α according to schedule |
repeat
    1. initialize s, a
    2. while s ∉ T do
        (a) take action a
        (b) observe reward r and next state s′
        (c) choose action a′ = π(s′), with probability 1 − ε (otherwise choose a′ at random)
        (d) Q(s,a) = Q(s,a) + α[r + γQ(s′,a′) − Q(s,a)]
        (e) π(s) ∈ argmax_{a′′} Q(s,a′′)
        (f) s = s′, a = a′
    3. decay α according to schedule
forever
Table 4: SARSA: On-policy Reinforcement Learning.
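The sarsa loop can be sketched in Python as follows, using the deterministic maze of Figure 1 as the environment. The transition table `MOVES` is an assumed reading of that figure (five states, absorbing state F, reward 100 on entering F), and the function names, step size, episode count, and fixed ε are illustrative choices rather than the lecture's settings.

```python
import random

# Assumed reading of Figure 1: MOVES[state][action] = next state; F is absorbing.
MOVES = {
    "A": {"r": "B", "u": "C"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}


def choose(Q, s, epsilon, rng):
    """Epsilon-greedy choice among the moves available in state s."""
    actions = list(MOVES[s])
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa(episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    for _ in range(episodes):
        s = rng.choice(list(MOVES))
        a = choose(Q, s, epsilon, rng)
        while s != "F":
            s_next = MOVES[s][a]
            r = 100.0 if s_next == "F" else 0.0
            if s_next == "F":
                a_next, q_next = None, 0.0  # no action is taken in the absorbing state
            else:
                a_next = choose(Q, s_next, epsilon, rng)
                q_next = Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])  # Equation 8
            s, a = s_next, a_next
    return Q


Q = sarsa()
greedy = {s: max(MOVES[s], key=lambda a: Q[(s, a)]) for s in MOVES}
print(greedy)
```

Because the next action is drawn from the same ε-greedy policy that is being followed, the learned values reflect the cost of occasional exploratory moves; the greedy policy read off from `Q` should nevertheless head toward F from every state.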
Whereas TD-learning is an application of Bellman’s theorem for $V$, Q-learning is based on Bellman’s optimality equations for $Q$:

$$ Q^*(s_t,a_t) = R(s_t,a_t) + \gamma E\left[ \max_{a} Q^*(s_{t+1},a) \right] \tag{11} $$

The corresponding update rule is the basis for Q-learning (see Table 5):

$$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_k \left[ r_t + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \right] \tag{12} $$
sarsa is an on-policy reinforcement learning algorithm, which means that the algorithm learns a policy while simultaneously following that policy (or a close approximation thereof). In contrast, Q-learning is an off-policy reinforcement learning algorithm: the policy Q-learning follows while learning need not bear any resemblance to the policy the algorithm is learning. Because it learns off-policy, the rate of exploration input to Q-learning (or any off-policy algorithm) can greatly exceed that which is input to sarsa (or any on-policy algorithm), which can lead to faster convergence. But Q-learning is not prevented from taking actions that are on-policy; doing so leads to behavior that is closely related to that of sarsa.
q learning(MDP, γ, ε)
| Inputs | discount factor γ; rate of exploration ε |
|---|---|
| Output | action-value function Q∗ |
| Initialize | Q = 0, α according to schedule |
repeat
    1. initialize s, a
    2. while s ∉ T do
        (a) take action a
        (b) observe reward r and next state s′
        (c) Q(s,a) = Q(s,a) + α[r + γ max_{a′} Q(s′,a′) − Q(s,a)]
        (d) choose action a′
        (e) s = s′, a = a′
    3. decay α according to schedule
forever
Table 5: Q-Learning: Off-policy reinforcement learning.
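To illustrate the off-policy property, the sketch below runs Q-learning on the deterministic maze of Figure 1 while behaving uniformly at random (ε = 1). The transition table `MOVES` is an assumed reading of that figure, and the function names and parameter values are hypothetical; the step size α = 1 is chosen because the environment is deterministic.

```python
import random

# Assumed reading of Figure 1: MOVES[state][action] = next state; F is absorbing.
MOVES = {
    "A": {"r": "B", "u": "C"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}


def choose(Q, s, epsilon, rng):
    """Epsilon-greedy behavior policy; epsilon = 1 means uniformly random moves."""
    actions = list(MOVES[s])
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def q_learning(episodes=500, alpha=1.0, gamma=0.9, epsilon=1.0, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    for _ in range(episodes):
        s = rng.choice(list(MOVES))
        while s != "F":
            a = choose(Q, s, epsilon, rng)
            s_next = MOVES[s][a]
            r = 100.0 if s_next == "F" else 0.0
            # max over the next state's moves; the absorbing state has none
            best_next = max((Q[(s_next, b)] for b in MOVES.get(s_next, {})), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # Equation 12
            s = s_next
    return Q


Q = q_learning()
for (s, a), value in sorted(Q.items()):
    print(s, a, round(value, 1))
```

Although the agent never acts greedily here, the update bootstraps from the best next action, so the learned values describe the greedy policy rather than the random one being followed.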
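In the case of deterministic environments, the update rules for sarsa and Q-learning simplify (taking $\alpha_k = 1$) as follows, respectively:

$$ Q(s_t,a_t) \leftarrow r_t + \gamma Q(s_{t+1},a_{t+1}) \tag{13} $$

$$ Q(s_t,a_t) \leftarrow r_t + \gamma \max_{a} Q(s_{t+1},a) \tag{14} $$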
Figure 1 depicts a deterministic maze. Possible moves are indicated by arrows. The final (absorbing) state is F; upon transitioning into state F, a reward of 100 is obtained. All other rewards are zero. Let γ = 0.9.
| C | E | F |
|---|---|---|
| A | B | D |

Figure 1: Deterministic Maze. The moves E → F and D → F are labeled with reward 100.
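Value Iteration

| Q(s,a) | A | B | C | D | E |
|---|---|---|---|---|---|
| l | - | 73 | - | 81 | 81 |
| r | 81 | 90 | 90 | - | 100 |
| u | 81 | 90 | - | 100 | - |
| d | - | - | 73 | - | 81 |
| V(s) | 81 | 90 | 90 | 100 | 100 |

Q-Learning

| Q(s,a) | A | B | C | D | E |
|---|---|---|---|---|---|
| l | - | 0 | - | 0 | 0 |
| r | 81 | 90 | 90 | - | 100 |
| u | 81 | 90 | - | 100 | - |
| d | - | - | 0 | - | 0 |
| V(s) | 81 | 90 | 90 | 100 | 100 |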
| Trajectory | Q-Learning |
|---|---|
| D → F | Q(D,u) = 100 + .9 max_a Q(F,a) = 100 |
| E → F | Q(E,r) = 100 + .9 max_a Q(F,a) = 100 |
| C → E → F | Q(C,r) = 0 + .9 max_a Q(E,a) = 90 |
| A → C → E → F | Q(A,u) = 0 + .9 max_a Q(C,a) = 81 |
| B → A → C → E → F | Q(B,l) = 0 + .9 max_a Q(A,a) = 73 |
| D → B → A → C → E → F | Q(D,l) = 0 + .9 max_a Q(B,a) = 66 |
| E → B → D → F | Q(B,r) = 0 + .9 max_a Q(D,a) = 90; Q(E,d) = 0 + .9 max_a Q(B,a) = 81 |
| Trajectory | sarsa |
|---|---|
| D → F | Q(D,u) = 100 + .9 Q(F,q) = 100 |
| E → F | Q(E,r) = 100 + .9 Q(F,q) = 100 |
| C → E → F | Q(C,r) = 0 + .9 Q(E,r) = 90 |
| A → C → E → F | Q(A,u) = 0 + .9 Q(C,r) = 81 |
| B → A → C → E → F | Q(B,l) = 0 + .9 Q(A,u) = 73 |
| D → B → A → C → E → F | Q(D,l) = 0 + .9 Q(B,l) = 66 |
| E → B → D → F | Q(B,r) = 0 + .9 Q(D,u) = 90; Q(E,d) = 0 + .9 Q(B,r) = 81 |
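The max-based trajectory updates listed above can be replayed with the deterministic rule of Equation 14. The sketch below makes two assumptions: the transition table `MOVES` is a reading of Figure 1, and within each trajectory the transitions are processed from last to first, which is the order in which the final row lists its two updates. The helper name `replay` is hypothetical.

```python
GAMMA = 0.9

# Assumed reading of Figure 1: MOVES[state][action] = next state; F is absorbing.
MOVES = {
    "A": {"r": "B", "u": "C"},
    "B": {"l": "A", "r": "D", "u": "E"},
    "C": {"r": "E", "d": "A"},
    "D": {"l": "B", "u": "F"},
    "E": {"l": "C", "r": "F", "d": "B"},
}


def replay(Q, path):
    """Apply Equation 14 to each transition of the path, last transition first."""
    transitions = list(zip(path, path[1:]))
    for s, s_next in reversed(transitions):
        a = next(a for a, t in MOVES[s].items() if t == s_next)  # move taken from s
        r = 100.0 if s_next == "F" else 0.0
        best_next = max((Q[(s_next, b)] for b in MOVES.get(s_next, {})), default=0.0)
        Q[(s, a)] = r + GAMMA * best_next


Q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
for path in ["DF", "EF", "CEF", "ACEF", "BACEF", "DBACEF", "EBDF"]:
    replay(Q, path)
for (s, a), value in sorted(Q.items()):
    print(s, a, round(value, 2))
```

Rounded to whole numbers, the entries touched by these seven trajectories should agree with the values listed in the table, and moves that never occur in them (such as r in state A) should remain at zero.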