CS 141 Introduction to AI Greenwald
TBA
5:00 PM, Apr 20, 2009
|
Homework 7: Markov Decision Processes Due: |
By the end of this homework, you will understand:
why deterministic Q-learning converges
how discounting relates to dying
By the end of this homework, you will be able to:
avert disaster in a casino
The goal of this question is to extend the MINIMAX algorithm to games of chance, like backgammon and monopoly.
(a) Formally define games of chance by extending the definition of game trees.
(b) Extend the MINIMAX algorithm to take as input a game of chance.
The goal of this question is to extend value and policy iteration to games of chance, like backgammon and monopoly.
(a) Formally define games of chance by extending the definition of MDPs.
(b) Extend the value iteration algorithm to take as input a game of chance.
(c) Extend the policy iteration algorithm to take as input a game of chance.
(a) Show that for all Markov reward processes, the value function V is well-defined (i.e., finite-valued) if 0 ≤ γ < 1.
(b) Show that there exists a Markov reward process s.t. the value function V is not necessarily well-defined if γ = 1.
(c) Prove that any Markov reward process M1, which for discount factor 0 ≤ γ < 1 yields value function V1, can be transformed into another Markov reward process M2, s.t. the undiscounted value function V2 = V1. [Hint: Introduce a zero-reward, absorbing state END in M2 s.t. all state-action pairs in M1 transition to END with probability 1 - γ.]
Consider the following controlled version of Gambler’s Ruin in
which the gambler places bets on the outcome of a biased coin flip.
Assume the gambler’s worth is between 0 and N (i.e., S = { 0, 1,
..., N } ∪ { END }). At each state s, the
gambler stakes some amount n in the range
. The coin turns up heads with probability p and
tails with probability 1 - p. If the coin turns up heads, the gambler
wins the amount he stakes (i.e., he transitions to state s + n);
otherwise, the gambler loses the amount he stakes (i.e., he transitions
to state s - n). A reward of 1 is received upon transitioning to
state N; all other rewards are 0. The gambler transitions from
states 0 and N to the absorbing state END,
deterministically. The constraints on the gambler’s range of actions
ensure that (i) he stakes at least $1 but no more than his worth;
(ii) he stakes no more than N - s, preventing his worth from ever
exceeding N.
(a) Draw the corresponding MDP, assuming the gambler’s worth is limited to $5 (i.e., N = 5).
(b) What is the optimal policy if p < 0.5, N = 100, and γ = 1. Why? (Solve this problem by implementing value iteration or policy iteration.)
(c) What is the optimal policy if p > 0.5, N = 100, and γ = 1. Why?
The figure below depicts a Markov reward process with seven states. Every state except the terminal state has two possible successors, each of which occurs with probability 0.5. Rewards are associated with transitions as shown.
|
Suppose the following sample trajectories are observed:

Assume the values V(s) are initialized to 0, the learning rate α = 0.5, and the discount factor γ = 1.
(a) After learning from the first of these trajectories, what value does Monte-Carlo learning assign to each state?
(b) After learning from the second of these trajectories, what value does TD learning assign to each state?
(c) An alternative to TD or Monte-Carlo learning is to use the observed sample trajectories to construct a “best guess” at the model that generated those trajectories, and then to solve that model using policy evaluation.
The transition probabilities of the “best guess” model are defined as follows:

The rewards model R(s,s′) of state transitions out of s into s′ is given by:

Draw the “best-guess” model corresponding to the two observed trajectories, and solve it using policy evaluation.
The Q-learning update rule for deterministic MDPs is as follows:

Prove that Q-learning converges in deterministic MDPs.