# Lecture Schedule

| Date | Topic | Materials |
| --- | --- | --- |
| January 6 | Introduction to reinforcement learning. Bandit algorithms | RL book, chapters 1, 2 |
| January 13 | Wrap-up of bandits. Finite MDPs, dynamic programming. | RL book, chapters 3, 4 |
| January 20 | Wrap-up of dynamic programming. | RL book, chapters 4, 5 |
| January 27 | Monte Carlo methods | RL book, chapter 6 |
| February 3 | Temporal Difference Learning, Multi-step Bootstrapping | RL book, chapter 7 |
| February 10 | n-steps methods, Planning and learning with tabular methods | RL book, chapter 8 |
| February 17 | Temporal Abstraction (Sutton's slides) | RL book, chapter 9 |
| February 24 | On-policy prediction with approximation | RL book, chapters 10, 11 |
| March 10 | Eligibility traces, stability analysis | RL book, chapter 12 |
| March 17 | Gradient-based Temporal Difference Methods | |
| March 24 | Invited speaker: Harm van Seijen on Dutch Traces. Second part: Emphatic TD with Doina. | RL book, chapter 12. Harm's slides |
| March 31 | Policy gradient methods | RL book, chapter 13. Doina's slides |
| April 7 | Policy gradient in average reward setting + invited speaker: Herke Van Hoof on REPS | Herke's slides |
| April 13 | Final project presentations: 9AM-3PM | |
| April 19 | Final project presentations: 11AM-4PM | |
| April 21 | Final project due by the end of the day | |

### January 13

Background on contextual bandits:

Interpretation of the discount factor as a random horizon:

of a discounted reward process, Haviv and Puterman, 1992.

Modelling problems with multiple discount factors:

### January 20

Convergence and contraction mappings:

• In Puterman (1994): see theorem 6.2.3 for an overview of the Banach fixed-point theorem and proposition 6.2.4 for a proof of value iteration using the contraction argument.

• See appendix A.3 of Puterman (1994) for background on the spectral radius and Neumann series expansion.
• As part of a presentation: show why a spectral radius strictly smaller than one gives us convergence. One intuitive way that you could show this is by expressing the initial error in the eigenbasis of the transition matrix and see how the terms in the expansion can vanish. You can find the full argument in Watkins' "Fundamentals of Matrix Computations" section 8.3.
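
The contraction argument can be checked numerically before proving it. The sketch below is an illustration only: it assumes a hypothetical two-state Markov reward process (`P`, `r`, and `gamma` are made-up numbers) and iterates $v \leftarrow r + \gamma P v$ from $v = 0$, so that the $n$-th iterate is the $n$-th partial sum of the Neumann series.

```python
# Hypothetical two-state Markov reward process (all numbers are illustrative).
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]
gamma = 0.9


def backup(v):
    """One application of v <- r + gamma * P v."""
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(2)) for s in range(2)]


def exact_value():
    """Solve (I - gamma P) v = r with Cramer's rule (2x2 case only)."""
    a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
    c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def sup_norm_errors(n_iters):
    """Sup-norm distance to the exact solution after each backup, starting from v = 0."""
    v_star = exact_value()
    v = [0.0, 0.0]
    errors = []
    for _ in range(n_iters):
        v = backup(v)
        errors.append(max(abs(v[s] - v_star[s]) for s in range(2)))
    return errors


errors = sup_norm_errors(50)
```

In this example each successive error is at most `gamma` times the previous one, which is the contraction property in the sup norm. The spectral radius argument is the sharper statement: it is the modulus of the largest eigenvalue of $\gamma P$ that governs the asymptotic rate.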

On the existence of an optimal deterministic Markov policy in discounted MDPs:

On variants of value iteration and policy iteration:

• Modified policy iteration: instead of fully (to convergence) evaluating a policy, just compute a few Bellman backups and then improve. See section 6.5 of Puterman (1994)
• Asynchronous value iteration, Gauss-Seidel and Jacobi variants. See section 6.3.3 of Puterman (1994)
• Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 1993.

An interesting exercise would be to implement these variants and compare their speed of convergence. You can then check how well the theoretical rates of convergence match the empirical ones.
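
As a possible starting point for that exercise, here is a minimal sketch that contrasts the Jacobi (synchronous) and Gauss-Seidel (in-place) sweeps of value iteration. The three-state, two-action MDP is hypothetical, and the function names are not from Puterman; only the sweep order differs between the two variants.

```python
# Hypothetical MDP: P[a][s][t] are transition probabilities, R[a][s] expected rewards.
P = [
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.2, 0.0, 0.8]],
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]],
]
R = [[0.0, 0.0, 1.0], [-0.1, -0.1, 0.5]]
GAMMA = 0.95
N_STATES, N_ACTIONS = 3, 2


def bellman_backup(v, s):
    """Optimal Bellman backup of state s given the value estimates v."""
    return max(
        R[a][s] + GAMMA * sum(P[a][s][t] * v[t] for t in range(N_STATES))
        for a in range(N_ACTIONS)
    )


def value_iteration(in_place, tol=1e-8, max_sweeps=10_000):
    """Return (values, sweeps). in_place=True gives the Gauss-Seidel variant."""
    v = [0.0] * N_STATES
    for sweep in range(1, max_sweeps + 1):
        source = v                          # values read by the backups
        dest = v if in_place else list(v)   # values written by the backups
        delta = 0.0
        for s in range(N_STATES):
            new = bellman_backup(source, s)
            delta = max(delta, abs(new - source[s]))
            dest[s] = new                   # visible to later backups only if in_place
        v = dest
        if delta < tol:
            return v, sweep
    return v, max_sweeps


v_jacobi, sweeps_jacobi = value_iteration(in_place=False)
v_gauss_seidel, sweeps_gauss_seidel = value_iteration(in_place=True)
```

Comparing `sweeps_jacobi` with `sweeps_gauss_seidel` for several tolerances and state orderings is one way to set the empirical rates against the theoretical ones.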

On the interpretation of policy iteration as Newton's method:

• See section 6.4.3 of Puterman (1994). Spoiler: modified policy iteration turns out to be a quasi-Newton's method.
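
The correspondence fits in one line. In the sketch below the notation is an assumption made here, not necessarily Puterman's: $d_n$ is a greedy decision rule with respect to $v_n$, $r_d$ and $P_d$ are its reward vector and transition matrix, $\gamma$ is the discount factor, and $B$ is the Bellman residual operator.

$$
B v = \max_{d} \left\{ r_d + \gamma P_d v \right\} - v, \qquad
v_{n+1} = (I - \gamma P_{d_n})^{-1} r_{d_n} = v_n - (\gamma P_{d_n} - I)^{-1} B v_n .
$$

The second equality follows from $B v_n = r_{d_n} - (I - \gamma P_{d_n}) v_n$. Since $\gamma P_{d_n} - I$ plays the role of the derivative of $B$ at $v_n$, the policy evaluation step is a Newton step for the equation $B v = 0$.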

Computational complexity of policy iteration and relation to the Simplex algorithm:

### January 27

##### Monte Carlo Matrix Inversion and Reinforcement Learning

Suggested presentation: Show how equation (3) is obtained using Curtiss' CLT approach. For theory-minded students: use concentration inequalities to derive a better estimate. Implement Wasow's method and compare to a "classical" iterative method.

##### Learning to act using real-time dynamic programming

Suggested presentation: implement and compare RTDP against classical DP methods (including Gauss-Seidel variants).

##### Simulation and the Monte Carlo Method

The origins of off-policy methods in the simulation literature. Suggested presentations:

• Importance sampling is seen as a variance reduction method in the simulation community. Explain how you could design better behavior policies (the policy at the denominator) using this perspective. For example, implement the cross-entropy method of Rubinstein in an MDP of your choice. Compare to fixed/arbitrary behavior policies.
• Explain and demonstrate empirically how to use control variates (see Rubinstein) with first-visit and every-visit MC. Show how to choose the "optimal coefficient" and compare this choice with fixed/arbitrary values. This notion of "control variates" will come back in disguise with policy gradient methods (it will then be called a "baseline").
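
For the control variate suggestion, the choice of coefficient can be sketched independently of any MDP. The example below is an assumption-laden illustration: the "return" $G$ and the control $Y$ are synthetic (a uniform $Y$ with known mean 0.5 and $G = 2Y + \text{noise}$), and the helper names are hypothetical. It uses the estimator $G - c\,(Y - \mathbb{E}[Y])$ with the sample estimate of $c^\ast = \mathrm{Cov}(G, Y) / \mathrm{Var}(Y)$.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (normalized by n); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def optimal_coefficient(returns, controls):
    """Sample estimate of c* = Cov(G, Y) / Var(Y)."""
    return covariance(returns, controls) / covariance(controls, controls)


def control_variate_samples(returns, controls, control_mean, c):
    """Per-sample estimates G - c * (Y - E[Y]); their mean estimates E[G]."""
    return [g - c * (y - control_mean) for g, y in zip(returns, controls)]


# Synthetic data standing in for Monte Carlo returns (purely illustrative).
rng = random.Random(0)
controls = [rng.random() for _ in range(2000)]                 # Y ~ Uniform(0, 1), E[Y] = 0.5
returns = [2.0 * y + rng.gauss(0.0, 0.1) for y in controls]    # G correlated with Y

c_star = optimal_coefficient(returns, controls)
adjusted = control_variate_samples(returns, controls, 0.5, c_star)
variance_plain = covariance(returns, returns)
variance_adjusted = covariance(adjusted, adjusted)
```

In an MDP, `returns` would be first-visit or every-visit returns and `controls` any correlated quantity whose expectation is known, which is where the comparison with fixed or arbitrary coefficients comes in.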

##### Off-policy Learning with Recognizers

Importance sampling can be problematic when the behavior policy is not well suited to the target policy. Doina's idea of "recognizers" is to reshape the behavior policy to reduce variance. Suggested presentation: show theorem 1 and demonstrate the use of recognizers with first-visit and every-visit MC in an MDP of your choice.

##### Simulating Discounted Costs

Based on the interpretation of the discount factor as a random horizon, one can devise an MC algorithm for policy evaluation in which rewards are sampled up to a geometric stopping time. Explain this methodology and implement it in an example MDP. How does it compare to the discount-aware approach that we've seen in Sutton & Barto?
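
A minimal sketch of the random-horizon estimator, assuming a hypothetical two-state Markov reward process (the policy is already folded into `P` and `r`, and all numbers are illustrative). After every reward the episode continues with probability $\gamma$, so the horizon $T$ satisfies $P(T > t) = \gamma^t$ and the expected undiscounted sum equals the discounted value.

```python
import random

# Hypothetical two-state Markov reward process induced by a fixed policy.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]
gamma = 0.9


def sample_random_horizon_return(start, rng):
    """Sum of undiscounted rewards up to a geometric stopping time."""
    s, total = start, 0.0
    while True:
        total += r[s]                      # rewards are NOT discounted here
        if rng.random() >= gamma:          # stop with probability 1 - gamma
            return total
        s = 0 if rng.random() < P[s][0] else 1


def estimate_value(start, n_episodes, seed=0):
    """Monte Carlo average of random-horizon returns from the given start state."""
    rng = random.Random(seed)
    samples = (sample_random_horizon_return(start, rng) for _ in range(n_episodes))
    return sum(samples) / n_episodes


v0_estimate = estimate_value(start=0, n_episodes=40_000)
```

For this particular chain, solving $(I - \gamma P) v = r$ by hand gives $v(0) = 0.55 / 0.064 = 8.59375$, which the estimate should approach as the number of episodes grows.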

##### Average Reward criterion

See chapter 8 of Puterman (1994). Regarding the Laurent series expansion that Pierre-Luc alluded to: see section 8.2.2. Suggested presentation: overview of the problem setting + demo in a continuing task. Interesting question to develop: the so-called "advantage function" is defined as $A_\pi(s, a) = Q_\pi(s, a) - v_\pi(s)$ and often appears in RL under the discounted setting. How is it related to the average reward case?

### February 3

##### A Theoretical and Empirical Analysis of Expected Sarsa, van Seijen, van Hasselt, Whiteson, and Wiering (2009)

Suggested presentation: explore the bias-variance tradeoff in Expected SARSA(0) vs SARSA(0). Present the variance analysis of section 5 and design new experiments (other than the cliff walking task and windy grid world). For the statisticians: try to relate the idea behind Expected SARSA to the notion of "conditioning" for variance reduction (see Rubinstein).
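
The two update targets differ only in how the next state is evaluated, which a short sketch makes concrete. The function names and the list-based representation of $Q(S', \cdot)$ and $\pi(\cdot \mid S')$ are assumptions made for this illustration.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """SARSA(0) target: bootstrap on the action actually sampled in the next state."""
    return reward + gamma * q_next[next_action]


def expected_sarsa_target(reward, gamma, q_next, policy_next):
    """Expected SARSA(0) target: average Q(S', .) under the policy at S'."""
    return reward + gamma * sum(p * q for p, q in zip(policy_next, q_next))


# Illustrative numbers: two actions in the next state.
q_next = [1.0, 3.0]          # Q(S', a) for a = 0, 1
policy_next = [0.25, 0.75]   # pi(a | S')
reward, gamma = 0.5, 0.9

expected = expected_sarsa_target(reward, gamma, q_next, policy_next)
mean_of_sarsa = sum(
    p * sarsa_target(reward, gamma, q_next, a) for a, p in enumerate(policy_next)
)
```

Both targets have the same mean given $S'$; Expected SARSA removes the variance due to sampling $A'$, which is the "conditioning" idea mentioned above.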

##### Double Q-learning, van Hasselt (2011)

Suggested presentation: show lemma 1, illustrate the benefits of this approach in an experiment.
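
For the experiment, the core update can be sketched as follows. This is a minimal tabular version under stated assumptions: `q_a` and `q_b` are hypothetical dictionaries mapping a state to a list of action values, terminal states are not handled, and the table to update is chosen by a fair coin flip.

```python
import random


def double_q_update(q_a, q_b, s, a, reward, s_next, alpha, gamma, rng):
    """One Double Q-learning update on tables q_a and q_b (modified in place)."""
    # Pick which table learns; the other one evaluates the greedy action.
    if rng.random() < 0.5:
        learner, evaluator = q_a, q_b
    else:
        learner, evaluator = q_b, q_a
    n_actions = len(learner[s_next])
    greedy = max(range(n_actions), key=lambda i: learner[s_next][i])
    target = reward + gamma * evaluator[s_next][greedy]
    learner[s][a] += alpha * (target - learner[s][a])


# Illustrative tables with two states and two actions.
q_a = {0: [0.0, 0.0], 1: [1.0, 5.0]}
q_b = {0: [0.0, 0.0], 1: [2.0, -1.0]}
double_q_update(q_a, q_b, s=0, a=0, reward=0.0, s_next=1,
                alpha=1.0, gamma=1.0, rng=random.Random(0))
```

The action is selected with one table and evaluated with the other, which is what decouples selection from evaluation and counteracts the overestimation of the single-estimator maximum.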

### February 10

##### Q(sigma)

Suggested presentation: implement it! We currently have no idea how Q(sigma) performs in practice. Try to design an experiment which would highlight the benefits of Q(sigma). Compare with Expected SARSA and tree backup. Email Richard Sutton with your results.
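
As a starting point, the one-step case is easy to write down. The sketch below covers only the one-step target with a fixed $\sigma$, under the assumption that $\sigma = 1$ means full sampling and $\sigma = 0$ means full expectation; the multi-step algorithm is not reproduced here, and the function name is hypothetical.

```python
def q_sigma_target(reward, gamma, q_next, policy_next, next_action, sigma):
    """One-step Q(sigma) target mixing a sampled and an expected backup."""
    sampled = q_next[next_action]
    expected = sum(p * q for p, q in zip(policy_next, q_next))
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


# Illustrative numbers: two actions in the next state.
q_next = [1.0, 3.0]
policy_next = [0.25, 0.75]

full_sampling = q_sigma_target(0.5, 0.9, q_next, policy_next, next_action=1, sigma=1.0)
full_expectation = q_sigma_target(0.5, 0.9, q_next, policy_next, next_action=1, sigma=0.0)
halfway = q_sigma_target(0.5, 0.9, q_next, policy_next, next_action=1, sigma=0.5)
```

With these numbers the two extremes reduce to the SARSA target and the Expected SARSA target respectively, and intermediate values of `sigma` interpolate linearly between them.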

##### Eligibility Traces for Off-Policy Policy Evaluation

This is the original paper in which Doina introduced Tree Backup (renamed from "q_t-pi"). We haven't seen eligibility traces yet, so this paper shouldn't be considered for presentation next week.

##### TD Models: Modeling the World at a Mixture of Time Scales

You have seen n-step returns and the idea of planning with a model in Dyna. This paper shows that we don't have to limit ourselves to one-step models and can also consider multi-step extensions. Planning at multiple time scales can also be achieved with "beta-models". This paper was a precursor to the topic of temporally extended actions, which we will see in a couple of weeks.

Suggested presentation: in the wall-following domain, use TD(0) to learn the models. Implement one more domain of your choice. Another interesting project would be to use multi-step models in Dyna. Can you reduce the number of "fake" iterations in Dyna when your model predicts further into the future?

### February 17

##### Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

The original journal paper on Options

• The optimality equations for options are shown on page 190, eq. (11)
• Learning option values by SMDP TD: see section 3.2
• The intra-option Bellman equations that I showed on the board are shown in eq. (20)
• The Bellman equations for option models are shown on page 202
• The interruption execution rule is shown in Theorem 2 page 197
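
The SMDP TD update mentioned in the list above can be sketched in a few lines. This is a tabular illustration under assumptions: `q` is a hypothetical dictionary mapping a state to a list of option values, `rewards` holds the per-step rewards collected while the option ran, and termination of the episode is not handled.

```python
def smdp_q_update(q, s, o, rewards, s_next, alpha, gamma):
    """SMDP Q-learning update after option o ran for len(rewards) steps from s."""
    k = len(rewards)
    # Discounted reward accumulated during the option's execution.
    discounted = sum(gamma ** i * r_i for i, r_i in enumerate(rewards))
    # Bootstrap from the best option in the state where o terminated.
    target = discounted + gamma ** k * max(q[s_next])
    q[s][o] += alpha * (target - q[s][o])


# Illustrative table: two states, two options each.
q = {0: [0.0, 0.0], 1: [1.0, 2.0]}
smdp_q_update(q, s=0, o=1, rewards=[1.0, 1.0], s_next=1, alpha=1.0, gamma=0.5)
```

Note that the bootstrap term is discounted by $\gamma^k$, where $k$ is the duration of the option; this update happens only when the option terminates, which is what the intra-option equations improve on.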

##### Predictive Knowledge:

Useful to understand how the notion of predictive knowledge shaped concepts such as eligibility traces, TD models, options, PSRs, TD networks, Horde, the Predictron ...

##### Options as behavioral programs:

A talk that Doina gave last December in which she developed the idea of options as "programs" and potential connections to Neural Turing Machines.

##### The bottleneck approach:

There is a vast literature on that topic. Just a few representative papers:

### February 24

"Bootstrapping methods are not in fact instances of true gradient descent"

The original paper by Sutton which first provided an analysis of TD(0) through the induced linear system:

Doina mentioned that there are ways to obtain true gradient-based TD algorithms. In fact, she co-invented the TDC algorithm shown in:

The official tile coding implementation that the UofA group uses:

• RLtoolkit
• A technical note for implementing tile coding
• Suggested assignment/class project: implement a tile coder for the GPU (using Theano, TF, or raw CUDA). It would be very useful to have a "tile coding layer" in order to easily compare different function approximators and establish a baseline.
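
Before moving to the GPU, it may help to have a reference version on the CPU. The sketch below is a deliberately simple one-dimensional tile coder with uniformly shifted tilings; it is not the RLtoolkit implementation (which uses hashing), and the function name and parameters are assumptions made for this illustration.

```python
import math


def active_tiles(x, low, high, n_tilings, tiles_per_tiling):
    """Indices of the active tiles (one per tiling) for a scalar x in [low, high]."""
    width = (high - low) / tiles_per_tiling
    indices = []
    for k in range(n_tilings):
        offset = k * width / n_tilings              # shift each tiling by a fraction of a tile
        tile = int(math.floor((x - low + offset) / width))
        tile = min(max(tile, 0), tiles_per_tiling)  # one extra tile covers the shifted edge
        indices.append(k * (tiles_per_tiling + 1) + tile)
    return indices


def approximate_value(weights, x, low, high, n_tilings, tiles_per_tiling):
    """Linear value estimate: sum of the weights of the active tiles."""
    tiles = active_tiles(x, low, high, n_tilings, tiles_per_tiling)
    return sum(weights[i] for i in tiles)


n_tilings, tiles_per_tiling = 4, 8
weights = [0.0] * (n_tilings * (tiles_per_tiling + 1))
near_a = active_tiles(0.50, 0.0, 1.0, n_tilings, tiles_per_tiling)
near_b = active_tiles(0.51, 0.0, 1.0, n_tilings, tiles_per_tiling)
far = active_tiles(1.0, 0.0, 1.0, n_tilings, tiles_per_tiling)
```

Nearby inputs share active tiles and therefore generalize to each other, while distant inputs share none. A GPU "tile coding layer" would compute the same indices in batch and feed them to a sparse linear layer.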

The origins of tile coding in the "cerebellar model articulator controller" (CMAC). The "receptive field" idea of CMAC echoes "weight sharing" and convolutions in CNNs:

On "optimal" aggregation using bisimulation:

### March 10

A nice overview of the stability analysis of TD algorithms with linear function approximation:

The paper in which TD(lambda) was properly formalized and analyzed (for the TD(0) case):

Tsitsiklis and Van Roy, 1997:

Doina discussed the projection point of view on linear TD. Bertsekas has a line of research leveraging this projection point of view to perform general large scale linear algebra. This paper also shows how the projected Bellman operator can be analyzed in the Galerkin approximation framework.
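
The projection point of view can be summarized as follows. The notation is assumed here for illustration: $\Phi$ is the feature matrix (full column rank), $D$ the diagonal matrix of the stationary distribution, and $T_\pi v = r_\pi + \gamma P_\pi v$ the Bellman operator of the policy being evaluated.

$$
\Pi = \Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D, \qquad
\Phi w = \Pi \, T_\pi (\Phi w)
\;\Longleftrightarrow\;
\underbrace{\Phi^\top D (I - \gamma P_\pi) \Phi}_{A} \, w = \underbrace{\Phi^\top D \, r_\pi}_{b} .
$$

The equivalence follows by multiplying the fixed-point equation on the left by $\Phi^\top D$ and using $\Phi^\top D \, \Pi = \Phi^\top D$. The resulting condition, that the Bellman residual be $D$-orthogonal to the span of the features, is the form in which the Galerkin framework applies.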

The chattering effect in temporal difference methods for control:

Doina co-authored a paper which looked at the conditions for convergence in the control case:

Regarding non-parametric RL, a question raised by a student:

• Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA
• Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition, Vol. II, Athena Scientific, Belmont, MA.

### March 17

Suggested presentation: try linear or nonlinear TDC on larger experiments and compare to the usual TD (a.k.a. "semi-gradient" TD).
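
For the linear case, the two updates to compare can be sketched side by side. This is a minimal on-policy illustration with plain Python lists as feature vectors; the function names, the step-size names `alpha` and `beta`, and the secondary weight vector `w` are assumptions made here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td0_update(theta, phi, reward, phi_next, alpha, gamma):
    """Semi-gradient TD(0) update for linear value estimates theta . phi."""
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    return [t + alpha * delta * f for t, f in zip(theta, phi)]


def tdc_update(theta, w, phi, reward, phi_next, alpha, beta, gamma):
    """Linear TDC update: TD(0) plus a gradient correction driven by w."""
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    correction = dot(phi, w)  # estimate of the expected TD error given phi
    new_theta = [
        t + alpha * (delta * f - gamma * correction * f_next)
        for t, f, f_next in zip(theta, phi, phi_next)
    ]
    new_w = [w_i + beta * (delta - correction) * f for w_i, f in zip(w, phi)]
    return new_theta, new_w


# Illustrative transition with two features.
theta, w = [0.5, -0.5], [0.0, 0.0]
phi, phi_next = [1.0, 0.0], [0.0, 1.0]
theta_td = td0_update(theta, phi, 1.0, phi_next, alpha=0.1, gamma=0.9)
theta_tdc, w_tdc = tdc_update(theta, w, phi, 1.0, phi_next, alpha=0.1, beta=0.05, gamma=0.9)
```

With `w` at zero the correction term vanishes and the TDC step coincides with the semi-gradient TD(0) step; the two only diverge once `w` has started tracking the TD error.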

The paper in which gradient-based TD methods were developed:

Gradient-based TD was extended to the nonlinear case by taking an orthogonal projection onto the tangent space of a manifold:

The proximal perspective on gradient-based TD

### March 24

Least-Squares Methods:

Interim forward view:

Emphatic TD

The emergence of Dutch traces and "follow-on" traces:

### March 31

The original policy gradient paper by Sutton and colleagues:

The policy gradient theorem was also discovered independently by Konda and Tsitsiklis. This paper also contains a "two-timescales" analysis of the actor-critic architecture. (This kind of decoupling and two-timescale analysis might be of interest for the GAN enthusiasts).
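
For reference, the statement of the theorem in the episodic case, written in notation assumed here ($\mu_\pi$ is the on-policy state distribution and $b$ is any function of the state):

$$
\nabla_\theta J(\theta) \;\propto\; \sum_s \mu_\pi(s) \sum_a Q_\pi(s, a) \, \nabla_\theta \pi(a \mid s, \theta)
= \sum_s \mu_\pi(s) \sum_a \big( Q_\pi(s, a) - b(s) \big) \nabla_\theta \pi(a \mid s, \theta) .
$$

The second equality holds because $\sum_a \nabla_\theta \pi(a \mid s, \theta) = \nabla_\theta 1 = 0$, so subtracting $b(s)$ leaves the gradient unchanged while possibly reducing the variance of its sample estimate. This is the "baseline", that is, the control variate from the Monte Carlo lecture in disguise.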

Unpublished "theorem 4" by Sutton and colleagues showing that "unbiased values don't help":

Deterministic extension: