| Date | Topic | Materials |
| --- | --- | --- |
| January 6 | Introduction to reinforcement learning. Bandit algorithms. | RL book, chapters 1, 2 |
| January 13 | Wrap-up of bandits. Finite MDPs, dynamic programming. | RL book, chapters 3, 4 |
| January 20 | Wrap-up of dynamic programming. | RL book, chapters 4, 5 |
| January 27 | Monte Carlo methods | RL book, chapter 6 |
| February 3 | Temporal difference learning, multi-step bootstrapping | RL book, chapter 7 |
| February 10 | n-step methods, planning and learning with tabular methods | RL book, chapter 8 |
| February 17 | Temporal abstraction (Sutton's slides) | RL book, chapter 9 |
| February 24 | On-policy prediction with approximation | RL book, chapters 10, 11 |
| March 10 | Eligibility traces, stability analysis | RL book, chapter 12 |
| March 17 | Gradient-based temporal difference methods | |
| March 24 | Invited speaker: Harm van Seijen on Dutch traces. Second part: emphatic TD with Doina. | RL book, chapter 12. Harm's slides |
| March 31 | Policy gradient methods | RL book, chapter 13. Doina's slides |
| April 7 | Policy gradient in the average reward setting + invited speaker: Herke van Hoof on REPS | Herke's slides |
| April 13 | Final project presentations: 9AM-3PM | |
| April 19 | Final project presentations: 11AM-4PM | |
| April 21 | Final project due by the end of the day | |

Background on contextual bandits:

Interpretation of the discount factor as a random horizon:

- Proposition 5.3.1 in Puterman (1994)
- Derman, C. 1970. Finite State Markovian Decision Processes. Academic Press, New York

Modelling problems with multiple discount factors:

Convergence and contraction mappings:

- In Puterman (1994): see theorem 6.2.3 for an overview of the Banach fixed-point theorem and proposition 6.2.4 for a proof of value iteration using the contraction argument.
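
To make the contraction argument concrete, here is a minimal sketch on a toy two-state MDP (all numbers invented for illustration): the Bellman optimality operator is a gamma-contraction in the sup norm, so by Banach's theorem iterating it converges to the unique fixed point, exactly as in proposition 6.2.4.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma = 0.9

def T(v):
    # Bellman optimality operator: (Tv)(s) = max_a [ R(s,a) + gamma * E[v(S')] ]
    return np.max(R + gamma * P @ v, axis=1)

# Contraction in the sup norm: ||Tv - Tu|| <= gamma * ||v - u||.
u, v = np.zeros(2), np.array([5.0, -3.0])
gap_before = np.max(np.abs(v - u))
gap_after = np.max(np.abs(T(v) - T(u)))   # shrinks by at least a factor gamma

# Banach's fixed-point theorem: iterating T from any start converges to v*.
v = np.zeros(2)
for _ in range(500):
    v = T(v)
```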

Convergence and spectral radius:

- See appendix A.3 of Puterman (1994) for background on the spectral radius and Neumann series expansion.
- As part of a presentation: show why a spectral radius strictly smaller than one gives us convergence. One intuitive way to show this is to express the initial error in the eigenbasis of the transition matrix and see how the terms in the expansion vanish. You can find the full argument in Watkins' "Fundamentals of Matrix Computations", section 8.3.
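
A quick numerical sketch of the argument (the transition matrix below is made up for illustration): the error of iterative policy evaluation obeys e_{k+1} = (gamma P) e_k, so it vanishes whenever the spectral radius of gamma P is below one.

```python
import numpy as np

# Illustrative 3-state transition matrix for some fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
gamma = 0.95

M = gamma * P
rho = np.max(np.abs(np.linalg.eigvals(M)))  # spectral radius of gamma*P
# P is stochastic (largest eigenvalue 1), so rho = gamma < 1.

# Error recursion of iterative policy evaluation: e_{k+1} = (gamma P) e_k.
e = np.array([1.0, -2.0, 0.5])
errors = [np.max(np.abs(e))]
for _ in range(300):
    e = M @ e
    errors.append(np.max(np.abs(e)))
# errors[k] shrinks roughly like rho**k, hence convergence.
```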

On the existence of an optimal deterministic Markov policy in discounted MDPs:

- See section 6.2.4 of Puterman (1994)

On variants of value iteration and policy iteration:

- Modified policy iteration: instead of evaluating a policy fully (to convergence), just compute a few Bellman backups and then improve. See section 6.5 of Puterman (1994)
- Asynchronous value iteration, Gauss-Seidel and Jacobi variants. See section 6.3.3 of Puterman (1994)
- Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 1993.

An interesting exercise could be to implement these variants and compare their speed of convergence. You can then verify how well the theoretical rates of convergence match the empirical ones.
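
As a starting point for that exercise, here is a minimal sketch of modified policy iteration on an invented two-state MDP: with m = 1 evaluation backup per sweep it behaves like value iteration, while larger m moves it toward full policy iteration.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def modified_policy_iteration(m, iters=200):
    """Greedy improvement followed by only m evaluation backups under the
    improved policy (Puterman, section 6.5), instead of solving for v_pi."""
    v = np.zeros(2)
    for _ in range(iters):
        pi = np.argmax(R + gamma * P @ v, axis=1)     # improvement step
        for _ in range(m):                            # m partial-evaluation backups
            v = (R + gamma * P @ v)[np.arange(2), pi]
    return v

# Different m values should reach the same fixed point at different speeds.
v1, v5 = modified_policy_iteration(1), modified_policy_iteration(5)
```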

On the interpretation of policy iteration as Newton's method:

- See section 6.4.3 of Puterman (1994). Spoiler: modified policy iteration turns out to be a quasi-Newton method.

Computational complexity of policy iteration and relation to the Simplex algorithm:

Suggested presentation: Show how equation (3) is obtained using Curtiss' CLT approach. For theory-minded students: use concentration inequalities to derive a better estimate. Implement Wasow's method and compare to a "classical" iterative method.

Suggested presentation: implement and compare RTDP against classical DP methods (including Gauss-Seidel variants).

The origins of off-policy methods in the simulation literature. Suggested presentations:

- Importance sampling is seen as a variance reduction method in the simulation community. Explain how you could design better behavior policies (the policy in the denominator) using this perspective. For example, implement the cross-entropy method of Rubinstein in an MDP of your choice. Compare to fixed/arbitrary behavior policies.
- Explain and demonstrate empirically how to use control variates (see Rubinstein) with first-visit and every-visit MC. Show how to choose the "optimal coefficient" and compare this choice with fixed/arbitrary values. This notion of "control variates" will come back in disguise with policy gradient methods (it will then be called a "baseline").
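
To make the control-variate idea concrete before mapping it onto MC returns, here is a minimal sketch in the plain Monte Carlo setting (the integrand and control variate are invented for illustration): we estimate E[exp(U)] for U uniform on [0, 1], using U itself, whose mean 1/2 is known, as the control variate, with the "optimal coefficient" c* = Cov(X, Y)/Var(Y).

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

x = np.exp(u)   # target: E[exp(U)] = e - 1
y = u           # control variate with known mean E[U] = 1/2

c_star = np.cov(x, y)[0, 1] / np.var(y)   # "optimal coefficient"
plain = x.mean()                          # vanilla MC estimate
cv = np.mean(x - c_star * (y - 0.5))      # control-variate estimate

# Both estimators are unbiased, but the per-sample variance drops sharply:
var_plain = np.var(x)
var_cv = np.var(x - c_star * (y - 0.5))
```

The same Cov/Var recipe is what the suggested presentation would apply to first-visit and every-visit MC returns, and it reappears later as the variance argument behind policy-gradient baselines.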

Importance sampling can be problematic when the behavior policy is not well suited to the target policy. Doina's idea of "recognizers" is to reshape the behavior policy to reduce variance. Suggested presentation: show theorem 1 and demonstrate the use of recognizers with first-visit and every-visit MC in an MDP of your choice.

See chapter 8 of Puterman (1994). Regarding the Laurent series expansion that Pierre-Luc alluded to: see section 8.2.2. Suggested presentation: overview of the problem setting + demo in a continuing task. Interesting question to develop: the so-called "advantage function" is defined as $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$ and often appears in RL under the discounted setting. How is it related to the average reward case?

Suggested presentation: explore the bias-variance tradeoff in Expected SARSA(0) vs SARSA(0). Present the variance analysis of section 5 and design new experiments (other than the cliff walking task and the windy grid world). For the statisticians: try to relate the idea behind Expected SARSA to the notion of "conditioning" for variance reduction (see Rubinstein).
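
A minimal sketch of the two update targets (toy Q-table and epsilon-greedy policy, invented for illustration) shows exactly where SARSA(0)'s extra variance comes from: Expected SARSA replaces the sampled next action with its expectation under the policy.

```python
import numpy as np

def epsilon_greedy_probs(q_row, eps):
    """Action probabilities of an epsilon-greedy policy in one state."""
    probs = np.full(len(q_row), eps / len(q_row))
    probs[np.argmax(q_row)] += 1.0 - eps
    return probs

def sarsa_target(Q, r, s2, a2, gamma):
    # Bootstraps on the *sampled* next action a2 -> extra variance.
    return r + gamma * Q[s2, a2]

def expected_sarsa_target(Q, r, s2, gamma, eps):
    # Bootstraps on the *expectation* over next actions ("conditioning").
    probs = epsilon_greedy_probs(Q[s2], eps)
    return r + gamma * probs @ Q[s2]

# The Expected SARSA target is the policy-weighted average of SARSA targets.
Q = np.array([[1.0, 0.5], [0.2, 2.0]])
probs = epsilon_greedy_probs(Q[1], eps=0.1)
avg = sum(p * sarsa_target(Q, 0.0, 1, a2, 0.9) for a2, p in enumerate(probs))
```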

Suggested presentation: show lemma 1, illustrate the benefits of this approach in an experiment.

Suggested presentation: implement it! We currently have no idea how Q(sigma) performs in practice. Try to design an experiment that would highlight the benefits of Q(sigma). Compare with Expected SARSA and tree backup. Email Richard Sutton with your results.

You have seen n-step returns and the idea of planning with a model in Dyna. This paper shows that we don't have to limit ourselves to one-step models and can also consider multi-step extensions. Planning at multiple time scales can also be achieved through the idea of "beta-models". This paper was a precursor to the topic of temporally extended actions, which we will see in a couple of weeks.

Suggested presentation: in the wall-following domain, use TD(0) to learn the models. Implement one more domain of your choice. Another interesting project would be to use multi-step models in Dyna. Can you reduce the number of "fake" iterations in Dyna when your model predicts further into the future?

The original journal paper on options:

- The optimality equations for options are shown on page 190, eq. (11)
- Learning option values by SMDP TD: see section 3.2
- The intra-option Bellman equations that I showed on the board are shown in eq. (20)
- The Bellman equations for option models are shown on page 202
- The interruption execution rule is shown in Theorem 2 page 197

Useful to understand how the notion of predictive knowledge shaped concepts such as eligibility traces, TD models, options, PSRs, TD networks, Horde, the Predictron, etc.

- Sutton's "manifesto" on predictive knowledge: Mind Is About Conditional Predictions
- Sutton's fourteen declarative principles of experience-oriented intelligence
- " all knowledge can be thought of as predictions of the outcomes of temporally extended ways of behaving, that is, policies with termination conditions, also known as “options."
- "Beyond Reward: The Problem of Knowledge and Data"

A talk that Doina gave last December in which she developed the idea of options as "programs" and potential connections to Neural Turing Machines.

There is a vast literature on this topic. Just a few representative papers:

- Skill characterization based on betweenness
- Automated discovery of options in reinforcement learning
- Automatic discovery of subgoals in reinforcement learning using diverse density

"Bootstrapping methods are not in fact instances of true gradient descent"

The original paper by Sutton which first provided an analysis of TD(0) through the induced linear system:

Doina mentioned that there are ways to obtain true gradient-based TD algorithms. In fact, she co-invented the TDC algorithm shown in:

The official tile coding implementation that the UofA group uses:

- RLtoolkit
- A technical note for implementing tile coding

Suggested assignment/class project: implement a tile coder for the GPU (using Theano, TF, or raw CUDA). It would be very useful to have a "tile coding layer" in order to easily compare different function approximators and establish a baseline.
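
For reference, a minimal NumPy sketch of a 2-D tile coder, a simplified stand-in for the RLtoolkit version (uniform rather than asymmetric offsets; all names and defaults are illustrative):

```python
import numpy as np

def active_tiles(x, y, num_tilings=8, tiles_per_dim=8):
    """Return the active feature indices for a point in [0, 1)^2.
    Each of the num_tilings grids is shifted by a fraction of a tile width,
    so nearby points share most of their active tiles (coarse coding)."""
    width = tiles_per_dim + 1   # offsets can push a coordinate one cell past the grid
    indices = []
    for t in range(num_tilings):
        off = t / num_tilings   # uniform offsets; RLtoolkit uses asymmetric ones
        cx = int(x * tiles_per_dim + off)
        cy = int(y * tiles_per_dim + off)
        indices.append((t * width + cx) * width + cy)
    return indices

def features(x, y, num_tilings=8, tiles_per_dim=8):
    """Binary feature vector with exactly num_tilings active entries."""
    phi = np.zeros(num_tilings * (tiles_per_dim + 1) ** 2)
    phi[active_tiles(x, y, num_tilings, tiles_per_dim)] = 1.0
    return phi
```

Nearby inputs share most of their active tiles while distant inputs share none, which is what makes linear TD with tile coding generalize locally.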

The origins of tile coding in the "cerebellar model articulator controller" (CMAC). The "receptive field" idea of CMAC echoes "weight sharing" and convolutions in CNNs:

On "optimal" aggregation using bisimulation:

A nice overview of the stability analysis of TD algorithms with linear function approximation:

The paper in which TD(lambda) was properly formalized and analyzed (for the TD(0) case):

Tsitsiklis and Van Roy, 1997:

Doina discussed the projection point of view on linear TD. Bertsekas has a line of research leveraging this projection point of view to perform general large scale linear algebra. This paper also shows how the projected Bellman operator can be analyzed in the Galerkin approximation framework.

The chattering effect in temporal difference methods for control:

Doina co-authored a paper which looked at the conditions for convergence in the control case:

Regarding non-parametric RL, a question raised by a student:

For a discussion about the bias-variance tradeoff:

- Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA
- Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition, Vol. II, Athena Scientific, Belmont, MA.

Suggested presentation: try linear or nonlinear TDC in larger experiments; compare to the usual TD (a.k.a. "semi-gradient" TD).

The paper in which gradient-based TD methods were developed:

Gradient-based TD was extended to the nonlinear case by taking an orthogonal projection onto the tangent space of a manifold:

The proximal perspective on gradient-based TD:

Least-Squares Methods:

- Linear Least-Squares algorithms for temporal difference learning
- Boyan's extension of LSTD to the general lambda case: Least-Squares Temporal Difference Learning
- Lagoudakis 2003: Least-Squares Policy Iteration
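
A minimal batch LSTD(0) sketch following Bradtke and Barto's construction (the two-state chain used to check it is invented): accumulate A = sum phi (phi - gamma phi')^T and b = sum phi r over transitions, then solve Aw = b.

```python
import numpy as np

def lstd(phis, rewards, next_phis, gamma, reg=1e-9):
    """Batch LSTD(0): solve A w = b with A = sum phi (phi - gamma phi')^T
    and b = sum r * phi. reg keeps A invertible on small samples."""
    k = phis.shape[1]
    A, b = reg * np.eye(k), np.zeros(k)
    for phi, r, phi2 in zip(phis, rewards, next_phis):
        A += np.outer(phi, phi - gamma * phi2)
        b += r * phi
    return np.linalg.solve(A, b)

# Sanity check on a deterministic 2-state chain with one-hot features:
# s0 --r=1--> s1 --r=0--> s0, gamma = 0.5, so v = (4/3, 2/3).
I = np.eye(2)
phis      = np.array([I[0], I[1]] * 50)
next_phis = np.array([I[1], I[0]] * 50)
rewards   = np.array([1.0, 0.0] * 50)
w = lstd(phis, rewards, next_phis, gamma=0.5)
```

With one-hot (tabular) features LSTD recovers the exact value function; with general features it converges to the TD fixed point in a single batch solve instead of many stochastic updates.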

Interim forward view:

Emphatic TD:

- An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
- On Convergence of Emphatic Temporal-Difference Learning
- Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

The emergence of Dutch traces and "follow-on" traces:

The original policy gradient paper by Sutton and colleagues:

The policy gradient theorem was also discovered independently by Konda and Tsitsiklis. This paper also contains a "two-timescales" analysis of the actor-critic architecture. (This kind of decoupling and two-timescale analysis might be of interest to GAN enthusiasts.)

Unpublished "theorem 4" by Sutton and colleagues showing that "unbiased values don't help":

Deterministic extension:

Natural gradient extension:

Philip Thomas showed that policy gradient algorithms were neglecting the discount factor and introducing bias:

Fitted value methods (mentioned, but won't be covered in class):

- Tree-Based Batch Mode Reinforcement Learning
- Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method

Classification-based methods

- Reinforcement learning as classification: Leveraging modern classifiers
- Classification-Based Approximate Policy Iteration

Mentioned in class following Mike's presentation:

- Linear methods implicitly build a linear model: "An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning"
- Incremental Truncated LSTD