COMP-767 : Reinforcement Learning

Date	Topic	Materials
January 6	Introduction to reinforcement learning. Bandit algorithms	RL book, chapters 1,2.
January 13	Wrap-up of bandits. Finite MDPs, dynamic programming.	RL book, chapters 3, 4
January 20	Wrap-up of dynamic programming.	RL book chapters 4, 5
January 27	Monte Carlo methods	RL book, chapter 6
February 3	Temporal Difference Learning, Multi-step Bootstrapping	RL book, chapter 7
February 10	n-step methods, Planning and learning with tabular methods	RL book, chapter 8
February 17	Temporal Abstraction (Sutton's slides)	RL book, chapter 9
February 24	On-policy prediction with approximation	RL book, chapter 10, 11
March 10	Eligibility traces, stability analysis	RL book, chapter 12
March 17	Gradient-based Temporal Difference Methods
March 24	Invited speaker: Harm van Seijen on Dutch Traces. Second part: Emphatic TD with Doina.	RL book, chapter 12. Harm's slides
March 31	Policy gradient methods	RL book, chapter 13. Doina's slides
April 7	Policy gradient in average reward setting + invited speaker: Herke Van Hoof on REPS	Herke's slides
April 13	Final project presentations: 9AM-3PM
April 19	Final project presentations: 11AM-4PM
April 21	Final project due by the end of the day

January 6

banditalgs.com: First steps: Explore-then-Commit
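
A minimal sketch of Explore-then-Commit, assuming Bernoulli arms and round-robin exploration. The function name, its parameters, and the example bandit are illustrative assumptions, not taken from the linked post.

```python
import random

def explore_then_commit(means, m, horizon, seed=0):
    """Pull each arm m times, then commit to the empirically best arm.

    `means` are Bernoulli success probabilities of a hypothetical bandit.
    Assumes horizon > m * len(means). Returns (committed_arm, total_reward).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    committed = None
    for t in range(horizon):
        if t < m * k:
            arm = t % k  # exploration phase: round-robin over the arms
        else:
            if committed is None:
                # commit once, to the arm with the best empirical mean
                committed = max(range(k), key=lambda a: sums[a] / counts[a])
            arm = committed
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return committed, total

if __name__ == "__main__":
    print(explore_then_commit([0.4, 0.6], m=100, horizon=5000))
```

The only tuning parameter is `m`: a small `m` risks committing to the wrong arm, while a large `m` wastes pulls on suboptimal arms.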

January 13

Background on contextual bandits:

From Ads to Interventions: Contextual Bandits in Mobile Health, Tewari & Murphy, 2017.

Interpretation of the discount factor as a random horizon:

Proposition 5.3.1 in Puterman (1994)
Derman, C. 1970. Finite State Markovian Decision Processes. Academic Press, New York

of a discounted reward process, Haviv and Puterman, 1992.
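
The random-horizon interpretation can be checked in two lines. This is a sketch which assumes that the stopping time $T$ is geometric, independent of the reward process, and that rewards are bounded (so that sum and expectation can be exchanged):

```latex
\begin{align*}
\Pr(T = n) &= (1-\gamma)\,\gamma^{n-1}, \quad n \geq 1
  \quad\Longrightarrow\quad \Pr(T > t) = \gamma^{t}, \quad t \geq 0, \\
\mathbb{E}\left[\sum_{t=0}^{T-1} r_t\right]
  &= \sum_{t=0}^{\infty} \Pr(T > t)\,\mathbb{E}[r_t]
   = \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}[r_t].
\end{align*}
```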

Modelling problems with multiple discount factors:

Death and Discounting, Adam Shwartz, 2001.

January 20

Convergence and contraction mappings:

In Puterman (1994): see theorem 6.2.3 for an overview of the Banach fixed-point theorem and proposition 6.2.4 for a proof of value iteration using the contraction argument.
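
To see the contraction argument numerically, the sketch below applies the Bellman optimality operator to two arbitrary value functions and records their sup-norm distance before and after each backup. The two-state MDP is a made-up example, and the data layout (`P[s][a]` as a list of `(prob, next_state)` pairs) is an assumption of this sketch.

```python
def bellman_backup(v, P, R, gamma):
    """One application of the Bellman optimality operator.

    P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward.
    """
    return [max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(len(P))]

def sup_norm(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical 2-state, 2-action MDP.
P = [[[(1.0, 0)], [(0.5, 0), (0.5, 1)]],
     [[(1.0, 1)], [(0.7, 0), (0.3, 1)]]]
R = [[0.0, 1.0], [2.0, 0.5]]
gamma = 0.9

def contraction_gaps(u, v, steps=20):
    """Return (distance_before, distance_after) for successive backups of u and v."""
    gaps = []
    for _ in range(steps):
        before = sup_norm(u, v)
        u, v = bellman_backup(u, P, R, gamma), bellman_backup(v, P, R, gamma)
        gaps.append((before, sup_norm(u, v)))
    return gaps

if __name__ == "__main__":
    for before, after in contraction_gaps([0.0, 0.0], [10.0, -5.0]):
        print(f"{before:.6f} -> {after:.6f}")
```

Each distance should shrink by at least a factor `gamma`, which is what makes the Banach fixed-point theorem applicable.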

Convergence and spectral radius:

See appendix A.3 of Puterman (1994) for background on the spectral radius and Neumann series expansion.
As part of a presentation: show why a spectral radius strictly smaller than one gives convergence. One intuitive way to show this is to express the initial error in the eigenbasis of the transition matrix and see how each term in the expansion vanishes as the number of iterations grows. You can find the full argument in Watkins' "Fundamentals of Matrix Computations", section 8.3.
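
As a small numerical companion, the sketch below sums the Neumann series $\sum_k (\gamma P)^k r$ for a fixed policy and checks that the result satisfies the policy evaluation equation $v = r + \gamma P v$. The transition matrix and rewards are made-up values; since $P$ is stochastic, the spectral radius of $\gamma P$ is $\gamma < 1$.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def neumann_sum(A, r, terms):
    """Partial sum r + A r + A^2 r + ... using `terms` terms."""
    total = [0.0] * len(r)
    term = list(r)
    for _ in range(terms):
        total = [t + x for t, x in zip(total, term)]
        term = matvec(A, term)
    return total

gamma = 0.9
P = [[0.5, 0.5], [0.2, 0.8]]                 # hypothetical chain under a fixed policy
A = [[gamma * p for p in row] for row in P]  # A = gamma * P
r = [1.0, 0.0]

v = neumann_sum(A, r, 500)
# Residual of the policy evaluation equation v = r + A v.
residual = max(abs(vi - (ri + avi))
               for vi, ri, avi in zip(v, r, matvec(A, v)))

if __name__ == "__main__":
    print(v, residual)
```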

On the existence of an optimal deterministic Markov policy in discounted MDPs:

See section 6.2.4 of Puterman (1994)

On variants of value iteration and policy iteration:

Modified policy iteration: instead of fully (to convergence) evaluating a policy, just compute a few Bellman backups and then improve. See section 6.5 of Puterman (1994)
Asynchronous value iteration, Gauss-Seidel and Jacobi variants. See section 6.3.3 of Puterman (1994)
Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 1993.

An interesting exercise could be to implement these variants and compare their speed of convergence. You can then verify how the theoretical rates of convergence match the empirical ones.
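
A possible starting point for this comparison, as a sketch only: the same sweep loop run either with a frozen copy of the values (Jacobi) or in place (Gauss-Seidel). The MDP, the stopping rule, and the function names are assumptions made for illustration.

```python
def backup(s, v, P, R, gamma):
    return max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
               for a in range(len(P[s])))

def value_iteration(P, R, gamma, in_place, tol=1e-10, max_sweeps=100000):
    """Jacobi (in_place=False) or Gauss-Seidel (in_place=True) value iteration.

    Returns (values, number_of_sweeps).
    """
    v = [0.0] * len(P)
    for sweep in range(1, max_sweeps + 1):
        old = list(v)
        # Gauss-Seidel reads the values being updated; Jacobi reads the old copy.
        source = v if in_place else old
        for s in range(len(P)):
            v[s] = backup(s, source, P, R, gamma)
        if max(abs(a - b) for a, b in zip(v, old)) < tol:
            return v, sweep
    return v, max_sweeps

# Hypothetical 2-state, 2-action MDP: P[s][a] is a list of (prob, next_state).
P = [[[(1.0, 0)], [(0.5, 0), (0.5, 1)]],
     [[(1.0, 1)], [(0.7, 0), (0.3, 1)]]]
R = [[0.0, 1.0], [2.0, 0.5]]

if __name__ == "__main__":
    for in_place in (False, True):
        values, sweeps = value_iteration(P, R, 0.9, in_place)
        print("Gauss-Seidel" if in_place else "Jacobi", sweeps, values)
```

Counting sweeps on larger MDPs and with different state orderings is where the comparison becomes informative.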

On the interpretation of policy iteration as Newton's method:

See section 6.4.3 of Puterman (1994). Spoiler: modified policy iteration turns out to be a quasi-Newton's method.

Computational complexity of policy iteration and relation to the Simplex algorithm:

The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

January 27

Monte Carlo Matrix Inversion and Reinforcement Learning

Suggested presentation: Show how equation (3) is obtained using Curtiss' CLT approach. For theory-minded students: use concentration inequalities to derive a better estimate. Implement Wasow's method and compare to a "classical" iterative method.

Learning to act using real-time dynamic programming

Suggested presentation: implement and compare RTDP against classical DP methods (including Gauss-Seidel variants).

Simulation and the Monte Carlo Method

The origins of off-policy methods in the simulation literature. Suggested presentations:

Importance sampling is seen as a variance reduction method in the simulation community. Explain how you could design better behavior policies (the policy at the denominator) using this perspective. For example, implement the cross-entropy method of Rubinstein in an MDP of your choice. Compare to fixed/arbitrary behavior policies.
Explain and demonstrate empirically how to use control variates (see Rubinstein) with first-visit and every-visit MC. Show how to choose the "optimal coefficient" and compare this choice with fixed/arbitrary values. This notion of "control variates" will come back in disguise with policy gradient methods (it will then be called a "baseline").
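
For the control variate suggestion, here is a generic sketch of the "optimal coefficient" $c^* = \mathrm{Cov}(X, Y) / \mathrm{Var}(Y)$, estimated from the samples themselves. It is not tied to first-visit or every-visit MC: the toy problem (estimating $\mathbb{E}[e^U]$ with $U$ uniform on $[0, 1]$ and $U$ as the control variate) is an assumption chosen only because the answer, $e - 1$, is known.

```python
import math
import random

def variance(zs):
    mean = sum(zs) / len(zs)
    return sum((z - mean) ** 2 for z in zs) / (len(zs) - 1)

def control_variate_estimate(xs, ys, y_mean):
    """Estimate E[X] using Y (with known mean y_mean) as a control variate.

    Returns (estimate, coefficient, adjusted_samples).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    c = cov / variance(ys)  # sample version of the optimal coefficient
    adjusted = [x - c * (y - y_mean) for x, y in zip(xs, ys)]
    return sum(adjusted) / n, c, adjusted

if __name__ == "__main__":
    rng = random.Random(0)
    us = [rng.random() for _ in range(10000)]
    xs = [math.exp(u) for u in us]
    estimate, c, adjusted = control_variate_estimate(xs, us, 0.5)
    print(estimate, c, variance(xs), variance(adjusted))
```

Replacing `c` with a fixed, arbitrary value in the same code gives the comparison asked for above.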

Off-policy Learning with Recognizers

Importance sampling can be problematic when the behavior policy is not well suited to the target policy. Doina's idea of "recognizers" is to reshape the behavior policy to reduce variance. Suggested presentation: show theorem 1 and demonstrate the use of recognizers with first-visit and every-visit MC in an MDP of your choice.

Simulating Discounted Costs

Based on the interpretation of the discount factor as a random horizon, one can devise a MC algorithm for policy evaluation in which rewards are sampled up to a geometric stopping time. Explain this methodology and implement it in an example MDP. How does it compare to the discount-aware approach that we've seen in Sutton & Barto?
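
A minimal sketch of the idea, assuming a fixed policy so that the MDP reduces to a Markov reward chain: rewards are summed without discounting, and after each step the episode continues with probability `gamma`. The two-state chain and all names are hypothetical; the paper's actual estimator may differ in its details.

```python
import random

def sample_return(P, r, gamma, s, rng):
    """Undiscounted sum of rewards up to a geometric stopping time."""
    total = 0.0
    while True:
        total += r[s]
        if rng.random() >= gamma:  # stop with probability 1 - gamma
            return total
        s = rng.choices(range(len(P)), weights=P[s])[0]

def mc_estimate(P, r, gamma, s, episodes, seed=0):
    rng = random.Random(seed)
    return sum(sample_return(P, r, gamma, s, rng) for _ in range(episodes)) / episodes

def exact_values(P, r, gamma, iterations=2000):
    """Reference solution of v = r + gamma * P v by fixed-point iteration."""
    v = [0.0] * len(r)
    for _ in range(iterations):
        v = [r[s] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s]))
             for s in range(len(r))]
    return v

P = [[0.5, 0.5], [0.2, 0.8]]  # hypothetical chain under a fixed policy
r = [1.0, 0.0]

if __name__ == "__main__":
    print(mc_estimate(P, r, 0.9, 0, 20000), exact_values(P, r, 0.9)[0])
```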

Average Reward criterion

See chapter 8 of Puterman (1994). Regarding the Laurent series expansion that Pierre-Luc alluded to: see section 8.2.2. Suggested presentation: overview of problem setting + demo in a continuing task. Interesting question to develop: the so-called "advantage function" is defined as $A_\pi(s, a) = Q_\pi(s, a) - v_\pi(s)$ and often appears in RL under the discounted setting. How is it related to the average reward case?

February 3

A Theoretical and Empirical Analysis of Expected Sarsa, van Seijen, van Hasselt, Whiteson, and Wiering (2009)

Suggested presentation: explore the bias-variance tradeoff in Expected SARSA(0) vs SARSA(0). Present the variance analysis of section 5 and design new experiments (other than the cliff walking task and windy grid world). For the statisticians: try to relate the idea behind Expected SARSA to the notion of "conditioning" for variance reduction (see Rubinstein).
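
The conditioning view can be made concrete with the two one-step targets. Given the next state, the Expected SARSA target is the expectation of the SARSA target over the next action, so it removes the variance due to sampling that action. The sketch below uses hypothetical function names and made-up numbers.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """Target that bootstraps on the sampled next action."""
    return reward + gamma * q_next[next_action]

def expected_sarsa_target(reward, gamma, q_next, policy_next):
    """Target that averages over the next action under the policy."""
    return reward + gamma * sum(p * q for p, q in zip(policy_next, q_next))

def sarsa_target_variance(gamma, q_next, policy_next):
    """Variance of the SARSA target due only to sampling the next action."""
    mean = sum(p * q for p, q in zip(policy_next, q_next))
    return gamma ** 2 * sum(p * (q - mean) ** 2
                            for p, q in zip(policy_next, q_next))

if __name__ == "__main__":
    q_next, policy_next = [1.0, 3.0], [0.5, 0.5]
    print(expected_sarsa_target(0.0, 1.0, q_next, policy_next))
    print(sarsa_target_variance(1.0, q_next, policy_next))
```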

Double Q-learning, van Hasselt (2011)

Suggested presentation: show lemma 1, illustrate the benefits of this approach in an experiment.
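
For reference, a sketch of a single tabular Double Q-learning update as I understand it from the paper: one table selects the greedy next action and the other table evaluates it. The table layout (lists of lists) and the function signature are assumptions of this sketch.

```python
import random

def double_q_update(qa, qb, s, a, r, s_next, alpha, gamma, rng, terminal=False):
    """One Double Q-learning step on the tables qa and qb (updated in place)."""
    # Pick at random which table learns; the other one evaluates.
    if rng.random() < 0.5:
        learner, evaluator = qa, qb
    else:
        learner, evaluator = qb, qa
    if terminal:
        bootstrap = 0.0
    else:
        # Action selection uses the learner, evaluation uses the other table.
        best = max(range(len(learner[s_next])), key=lambda b: learner[s_next][b])
        bootstrap = evaluator[s_next][best]
    learner[s][a] += alpha * (r + gamma * bootstrap - learner[s][a])

if __name__ == "__main__":
    qa = [[0.0, 0.0], [1.0, 5.0]]
    qb = [[0.0, 0.0], [2.0, -1.0]]
    double_q_update(qa, qb, 0, 0, 2.0, 1, 1.0, 1.0, random.Random(0))
    print(qa, qb)
```

Decoupling selection from evaluation is what avoids the overestimation of the single-table max.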

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Singh, Jaakkola, Littman, Szepesvári (2000).

February 10

Q(sigma)

Suggested presentation: implement it! We currently have no idea how Q(sigma) performs in practice. Try to design an experiment which would highlight the benefits of Q(sigma). Compare with expected SARSA and tree backup. Email Richard Sutton with your results.
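
As a starting point, here is a sketch of the one-step Q(sigma) target only (the multi-step algorithm is what the presentation should implement). With `sigma = 1` it reduces to the SARSA target and with `sigma = 0` to the Expected SARSA (one-step tree backup) target. The function name and argument layout are assumptions.

```python
def q_sigma_target(reward, gamma, q_next, policy_next, next_action, sigma):
    """One-step Q(sigma) target: mix of sampled and expected bootstrap values."""
    sampled = q_next[next_action]
    expected = sum(p * q for p, q in zip(policy_next, q_next))
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)

if __name__ == "__main__":
    q_next, policy_next = [1.0, 3.0], [0.25, 0.75]
    for sigma in (0.0, 0.5, 1.0):
        print(sigma, q_sigma_target(0.5, 0.9, q_next, policy_next, 0, sigma))
```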

Eligibility Traces for Off-Policy Policy Evaluation

This is the original paper in which Doina introduced Tree Backup (renamed from "q_t-pi"). We haven't seen eligibility traces yet, so this paper shouldn't be considered for presentation next week.

TD Models: Modeling the World at a Mixture of Time Scales

You have seen n-step returns and the idea of planning with a model in Dyna. This paper shows that we don't have to limit ourselves to one-step models and can also consider multi-step extensions. Planning at multiple time scales can also be achieved by the idea of "beta-models". This paper was a precursor to the topic of temporally extended actions which we will see in a couple of weeks.

Suggested presentation: in the wall-following domain, use TD(0) to learn the models. Implement one more domain of your choice. Another interesting project would be to use multi-step models in Dyna. Can you reduce the number of "fake" iterations in Dyna when your model predicts further into the future?

February 17

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

The original journal paper on Options

The optimality equations for options are shown on page 190, eq. (11)
Learning option values by SMDP TD: see section 3.2
The intra-option Bellman equations that I showed on the board are shown in eq. (20)
The Bellman equations for option models are shown on page 202
The interruption execution rule is shown in Theorem 2 page 197

Predictive Knowledge:

Useful to understand how the notion of predictive knowledge shaped concepts such as eligibility traces, TD models, options, PSRs, TD networks, Horde, the Predictron ...

Sutton's "manifesto" on predictive knowledge: Mind Is About Conditional Predictions
Sutton's fourteen declarative principles of experience-oriented intelligence

" all knowledge can be thought of as predictions of the outcomes of temporally extended ways of behaving, that is, policies with termination conditions, also known as “options."

"Beyond Reward: The Problem of Knowledge and Data"

Options as behavioral programs:

A talk that Doina gave last December in which she developed the idea of options as "programs" and potential connections to Neural Turing Machines.

Slides: "From temporal abstraction to programs"

The bottleneck approach:

There is a vast literature on that topic. Just a few representative papers:

February 24

"Bootstrapping methods are not in fact instances of true gradient descent"

Temporal-Difference Methods and Markov Models

The original paper by Sutton which first provided an analysis of TD(0) through the induced linear system:

Learning to Predict by the Methods of Temporal Differences

Doina mentioned that there are ways to obtain true gradient-based TD algorithms. In fact, she co-invented the TDC algorithm shown in:

Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

The official tile coding implementation that the UofA group uses:

RLtoolkit
A technical note for implementing tile coding
- Suggested assignment/class project: implement a tile coder for the GPU (using Theano, TF, or raw CUDA). It would be very useful to have a "tile coding layer" in order to easily compare different function approximators and establish a baseline.
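
Before moving to the GPU, a plain CPU reference can help with testing. The sketch below is a simplified tile coder with uniformly shifted tilings and no hashing; it is not the RLtoolkit implementation, and the function name and arguments are assumptions.

```python
import math

def active_tiles(x, low, high, num_tilings, tiles_per_dim):
    """Return one active tile index per tiling for a point x (list of floats).

    Assumes low[i] <= x[i] <= high[i]. Tiling t is shifted by t / num_tilings
    of a tile width along every dimension.
    """
    dims = len(x)
    # Rescale each coordinate so that one tile has width 1.
    scaled = [(xi - lo) / (hi - lo) * tiles_per_dim
              for xi, lo, hi in zip(x, low, high)]
    width = tiles_per_dim + 1  # one extra tile per dimension covers the shift
    tiles_per_tiling = width ** dims
    indices = []
    for t in range(num_tilings):
        offset = t / num_tilings
        index = 0
        for sc in scaled:
            coord = min(int(math.floor(sc + offset)), width - 1)
            index = index * width + coord
        indices.append(t * tiles_per_tiling + index)
    return indices

if __name__ == "__main__":
    print(active_tiles([0.5], [0.0], [1.0], num_tilings=4, tiles_per_dim=4))
    print(active_tiles([0.6], [0.0], [1.0], num_tilings=4, tiles_per_dim=4))
```

Nearby inputs share most of their active tiles, which is the source of generalization; a linear value estimate is then the sum of the weights at the returned indices.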

The origins of tile coding in the "cerebellar model articulator controller" (CMAC). The "receptive field" idea of CMAC echoes "weight sharing" and convolutions in CNNs:

A theory of cerebellar function

On "optimal" aggregation using bisimulation:

Bisimulation metrics for continuous Markov Decision Processes

March 10

A nice overview of the stability analysis of TD algorithms with linear function approximation:

Section 2: An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

The paper in which TD(lambda) was properly formalized and analyzed (for the TD(0) case):

Learning to Predict by the Methods of Temporal Differences

Tsitsiklis and Van Roy, 1997:

An Analysis of Temporal-Difference Learning with Function Approximation

Doina discussed the projection point of view on linear TD. Bertsekas has a line of research leveraging this projection point of view to perform general large scale linear algebra. This paper also shows how the projected Bellman operator can be analyzed in the Galerkin approximation framework.

Projected Equation Methods for Approximate Solution of Large Linear Systems

The chattering effect in temporal difference methods for control:

Chattering in Sarsa(lambda)

Doina co-authored a paper which looked at the conditions for convergence in the control case:

A Convergent Form of Approximate Policy Iteration

Regarding non-parametric RL, a question raised by a student:

For a discussion about the bias-variance tradeoff:

Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA
Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition, Vol. II, Athena Scientific, Belmont, MA.

March 17

Suggested presentation: try linear or nonlinear TDC on larger experiments, compare to the usual TD (aka. "semi-gradient" TD)

The paper in which gradient-based TD methods were developed:

Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation
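
For orientation, a sketch of one linear TDC update as I recall it from this paper: the main weights `theta` take the usual TD step plus a correction term, and a second weight vector `w` is learned with its own step size. Check the exact form against the paper before relying on it; the function signature is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tdc_update(theta, w, phi, reward, phi_next, gamma, alpha, beta):
    """One linear TDC step. Returns the new (theta, w) without mutating inputs."""
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    # TD(0) step plus the gradient correction term -gamma * phi' * (phi^T w).
    new_theta = [t + alpha * (delta * f - gamma * fn * phi_w)
                 for t, f, fn in zip(theta, phi, phi_next)]
    # Secondary weights track the expected TD error given the features.
    new_w = [wi + beta * (delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w

if __name__ == "__main__":
    theta, w = [0.0, 0.0], [0.0, 0.0]
    for _ in range(2):
        theta, w = tdc_update(theta, w, [1.0, 0.0], 1.0, [0.0, 1.0], 0.9, 0.1, 0.5)
        print(theta, w)
```

With `w` fixed at zero the correction vanishes and the update is the usual semi-gradient TD(0) step.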

Gradient-based TD was extended to the nonlinear case by taking an orthogonal projection onto the tangent space of a manifold:

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

The proximal perspective on gradient-based TD

Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces

March 24

Least-Squares Methods:

Linear Least-Squares algorithms for temporal difference learning
Boyan's extension of LSTD to the general lambda case: Least-Squares Temporal Difference Learning
Lagoudakis 2003: Least-Squares Policy Iteration
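
A batch LSTD(0) sketch in pure Python, assuming the matrix $A = \sum \phi (\phi - \gamma \phi')^\top$ is invertible and using no regularization. The small Gaussian elimination routine and the two-transition example are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lstd(transitions, gamma, k):
    """LSTD(0): accumulate A and b over (phi, reward, phi_next), solve A theta = b."""
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for phi, r, phi_next in transitions:
        for i in range(k):
            b[i] += r * phi[i]
            for j in range(k):
                A[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve(A, b)

if __name__ == "__main__":
    # Hypothetical chain 0 -> 1 -> terminal with one-hot features.
    data = [([1.0, 0.0], 1.0, [0.0, 1.0]),
            ([0.0, 1.0], 2.0, [0.0, 0.0])]
    print(lstd(data, 0.5, 2))
```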

Interim forward view:

A new Q(lambda) with interim forward view and Monte Carlo equivalence

Emphatic TD

An Emphatic Approach to the Problem

The emergence of Dutch traces and "follow-on" traces:

Learning to Predict Independent of Span

March 31

The original policy gradient paper by Sutton and colleagues:

Policy Gradient Methods for Reinforcement Learning with Function Approximation
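
To build intuition before reading the paper, here is the simplest special case: a likelihood-ratio (REINFORCE-style) update for a softmax policy on a bandit, where $\nabla_\theta \log \pi(a) = e_a - \pi$. There is no state, no critic, and no baseline; the function names and the deterministic rewards are assumptions of this sketch.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(rewards, steps, alpha, seed=0):
    """Stochastic gradient ascent on expected reward; returns final action probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = rewards[a]
        for i in range(len(theta)):
            grad_log = (1.0 if i == a else 0.0) - probs[i]  # d log pi(a) / d theta_i
            theta[i] += alpha * r * grad_log
    return softmax(theta)

if __name__ == "__main__":
    print(reinforce_bandit([0.0, 1.0], steps=2000, alpha=0.1))
```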

The policy gradient theorem was also discovered independently by Konda and Tsitsiklis. This paper also contains a "two-timescales" analysis of the actor-critic architecture. (This kind of decoupling and two-timescale analysis might be of interest for the GAN enthusiasts).

On actor-critic algorithms

Unpublished "theorem 4" by Sutton and colleagues showing that "unbiased values don't help":

Comparing Policy-Gradient Algorithms

Deterministic extension:

Deterministic Policy Gradient Algorithms

Natural gradient extension:

Natural Actor-Critic Algorithms

Philip Thomas showed that policy gradient algorithms were neglecting the discount factor and introducing bias:

Bias in Natural Actor-Critic Algorithms

Fitted value methods (mentioned, but won't be covered in class):

Classification-based methods

Mentioned in class following Mike's presentation:

Linear methods implicitly building a linear model: "An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning"
Incremental Truncated LSTD
