CSCI2951-F: Learning and Sequential Decision Making

Brown University
Fall 2017
Prof. Michael L. Littman (mlittman@cs.brown.edu, CIT 301)

Time: TTh 10:30-11:50
Place: Brown CIT 316 (subject to change)
Semester: Fall 2017
Web page: http://cs.brown.edu/courses/cs2951f/

Office hours: By appointment.

General Orientation to the Course: Through a combination of classic papers and more recent work, the course explores automated decision making from a computer-science perspective. It examines efficient algorithms, where they exist, for single agent and multiagent planning as well as approaches to learning near-optimal decisions from experience. Topics include Markov decision processes, stochastic and repeated games, partially observable Markov decision processes, and reinforcement learning. Of particular interest will be issues of generalization, exploration, and representation. Students will replicate a result in a published paper in the area. Participants should have taken a machine-learning course and should have had some exposure to reinforcement learning from a previous computer-science class or seminar; check with instructor if not sure. Students should already know how to program. No particular language will be required, but Python is growing interest in the community.

Course goals: Students should understand the main concepts of interest to the field of reinforcement learning, be able to implement standard algorithms and understand how to apply them to relevant problems, and be prepared to be able to contribute new results to the field.

Prerequisites: CSCI 1950F or CSCI 1420 or permission of the instructor.

Flipped structure: Students need to watch the lectures recorded by the professor and available online. Class time will be used for problem solving and discussion.

Result replication presentation: Students will form into small groups of two to four, and select a relevant paper from the literature. They will choose a graph in the paper and create an independent implementation/replication of this result. Students often find that important parameters needed to replicate the result are not stated in the paper and that obtaining the same pattern of results is sometimes not possible. Students will present their work at the end of the semester. Grades are based on the fidelity of the replication (25%), how well they show they understand the original paper (25%), the quality of the presentation itself in terms of clarity and creativity (25%), and their short written report (25%). The grade on this project will represent 50% of the final grade in the class.

Grading: Final grade is derived from: Per-class quizzes (50%), result replication presentation (50%).

Schedule

9/7: RL intro.
RL Demo, Intro slides.
9/12: Lesson: Introduction to Reinforcement Learnng. (Michael out).
Pedro Tsividis on learning to play like humans.
9/14: Lesson: Smoov and Curly's Bogus Journey.
9/19: Lesson: Reinforcement Learnng Basics.
Read Chapters 1 and 2 of Littman (1996).
9/21: Lesson: TD and Friends.
Read Sutton (1988).
9/26:
Read Thomas, Niekum, Theocharous, and Konidaris (2015).
9/28: Lesson: Convergence.
Read Littman and Szepesvári (1996).
10/3: (Michael out).
George Konidaris on alternative notions of return.
10/5:
10/10: Lesson: AAA.
Read Kocsis and Szepesvári (2006).
10/12: Project discussion.
10/17: Lesson: Messing with Rewards.
Read Ng, Harada, Russell (1999) and Asmuth, Littman, Zinkov (2008).
10/19: Lesson: Exploring Exploration.
Read Fong (1995).
10/24:
Read Li, Littman, Walsh (2008) and Poupart, Vlassis, Hoey, and Regan (2006).
10/26: Lesson: Generalization.
Read Gordon (1995), Baird (1995) and Smart and Kaelbling (2000).
10/31:
Read Baxter and Bartlett (1999) and Lagoudakis and Parr (2001).
11/2: Lesson: Partially Observable MDPs.
Read Littman (2009) and Smith, Simmons (2005).
11/7: (Michael out).
Dave Abel presents "SimpleRL" RL package.
11/9: Lesson: Options.
Read Sutton, Precup, Singh (1999), Jong, Hester, Stone (2008) (including slides), and Silver and Ciosek (2012).
11/14: Lesson: Game Theory.
Read Littman and Stone (2003).
11/16: (Michael out).
Jun Ki Lee presents implementing deep RL algorithms in TensorFlow.
11/21: Lesson: Game Theory Reloaded.
Read Littman (1994) and Greenwald and Hall (2003).
11/28: Lesson: Game Theory Revolutions.
Read Munoz de Cote and Littman (2008).
11/30: Lesson: CCC.
Read Zeibart et al. (2008) and Babes et al. (2011).
12/5: Discussion: Deep RL.
Read Minh et al. (2015).
Robot learning videos:
- Box Pushing.
- Aibo Ingenuity.
- Live Learning.
- Batting.
- Robocup Walking.
- Helicopter.
- Robot Preschool.
- Arm Farm.
- Cart Pole.
12/7: Presentations.
12/12: Presentations.
12/14: Presentations.
12/15: Presentations.

Topics and Paper Links

The space of sequential decision-making problems, geometric discounting implies no decision reversals. Work out a concrete example of a reversal.
Define MDP. Talk about representations (math, graph, table, OO). Work out a concrete example of an MDP and its value. Policies and optimality. Discuss 1/(1-gamma). Bellman's equation.
Value iteration and Bellman's equation. Value of a policy is the solution to a system of linear equations. Define contraction mapping and show value iteration contacts.
Model-based estimate. TD(lambda).
Hoeffding bounds for epsilon-delta assurance of Monte Carlo. Q-learning with a fixed policy. Connection to averaging, learning rates, and Bellman equation.
Coding VI for policy optimization.
Convergence proof of VI. Q-learning. The absolute difference between the k-th order statistic of two lists is less than the pairwise absolute differences between the lists.
Generalized Q-learning convergence. Policy iteration. Each iteration of policy iteration results in an improvement, policy iteration converges in finite time.
Linear programming. Linear programming can solve MDPs in polynomial time. MDP dual LP. Zero-sum games.
General sum games, Nash equilibria. Nash equilibria in repeated games.
Stochastic games. Minimax-Q. Grid games.
Exploration, bandits, algorithms.
KWIK learning and efficient RL.
POMDPs.
HSVI.
Shaping.
Project discussion.
Options.
Generalization. Introduce averagers and "hill car the hard way". Provide a counter example for more general convergence.
Policy gradient.
LSPI.
UCT.
Training RL agents.
Bayesian RL.
Memory-based RL.
Inverse reinforcement learning.
Deep reinforcement learning.