|January 6||Introduction to reinforcement learning. Bandit algorithms||RL book, chapters 1,2.|
|January 13||Wrap-up of bandits. Finite MDPs, dynamic programming.||RL book, chapters 3, 4|
|January 20||Wrap-up of dynamic programming.||RL book, chapters 4, 5|
|January 27||Monte Carlo methods||RL book, chapter 6|
|February 3||Temporal Difference Learning, Multi-step Bootstrapping||RL book, chapter 7|
|February 10||n-step methods, Planning and learning with tabular methods||RL book, chapter 8|
|February 17||Temporal Abstraction (Sutton's slides)||RL book, chapter 9|
|February 24||On-policy prediction with approximation||RL book, chapter 10, 11|
|March 10||Eligibility traces, stability analysis||RL book, chapter 12|
|March 17||Gradient-based Temporal Difference Methods|
|March 24||Invited speaker: Harm van Seijen on Dutch Traces. Second part: Emphatic TD with Doina.||RL book, chapter 12. Harm's slides|
|March 31||Policy gradient methods||RL book, chapter 13. Doina's slides|
|April 7||Policy gradient in average reward setting + invited speaker: Herke Van Hoof on REPS||Herke's slides|
|April 13||Final project presentations: 9AM-3PM|
|April 19||Final project presentations: 11AM-4PM|
|April 21||Final project due by the end of the day|
Background on contextual bandits:
Interpretation of the discount factor as a random horizon:
Modelling problems with multiple discount factors:
Convergence and contraction mappings:
Convergence and spectral radius:
On the existence of an optimal deterministic Markov policy in discounted MDPs:
On variants of value iteration and policy iteration:
An interesting exercise would be to implement these variants and compare their speed of convergence. You can then check how well the theoretical convergence rates match the empirical ones.
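As a starting point for such a comparison, here is a minimal sketch (on a made-up 2-state, 2-action MDP; all transition and reward numbers are arbitrary) contrasting value iteration with policy iteration:

```python
import numpy as np

# A made-up 2-state, 2-action MDP: P[a, s, s'] transition probabilities,
# R[a, s] expected immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(tol=1e-8):
    v = np.zeros(2)
    for it in range(100_000):
        v_new = (R + gamma * P @ v).max(axis=0)   # greedy Bellman backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, it + 1
        v = v_new

def policy_iteration():
    pi = np.zeros(2, dtype=int)
    for it in range(100):
        # Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        P_pi = P[pi, np.arange(2)]
        r_pi = R[pi, np.arange(2)]
        v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
        pi_new = (R + gamma * P @ v).argmax(axis=0)  # greedy improvement
        if np.array_equal(pi_new, pi):
            return v, it + 1
        pi = pi_new

v_vi, n_vi = value_iteration()
v_pi, n_pi = policy_iteration()
```

Both methods reach the same optimal values, but policy iteration takes far fewer (more expensive) iterations, which is exactly the kind of gap the theoretical rates predict.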
On the interpretation of policy iteration as Newton's method:
Computational complexity of policy iteration and relation to the Simplex algorithm:
Suggested presentation: Show how equation (3) is obtained using Curtiss' CLT approach. For theory-minded students: use concentration inequalities to derive a better estimate. Implement Wasow's method and compare to a "classical" iterative method.
Suggested presentation: implement and compare RTDP against classical DP methods (including Gauss-Seidel variants).
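A minimal sketch of the Gauss-Seidel variant, on a made-up 2-state MDP: in-place sweeps reuse freshly updated values within the same sweep, typically needing fewer sweeps than the synchronous (Jacobi) backup while reaching the same fixed point.

```python
import numpy as np

# Made-up 2-state, 2-action MDP numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def sweep_jacobi(v):
    # Synchronous backup: every state uses last sweep's values.
    return (R + gamma * P @ v).max(axis=0)

def sweep_gauss_seidel(v):
    # In-place backup: later states reuse values updated earlier this sweep.
    v = v.copy()
    for s in range(len(v)):
        v[s] = max(R[a, s] + gamma * P[a, s] @ v for a in range(2))
    return v

def solve(sweep, tol=1e-10):
    v = np.zeros(2)
    for it in range(100_000):
        v_new = sweep(v)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, it + 1
        v = v_new

v_j, n_j = solve(sweep_jacobi)
v_gs, n_gs = solve(sweep_gauss_seidel)
```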
The origins of off-policy methods in the simulation literature. Suggested presentations:
Importance sampling can be problematic when the behavior policy is not well suited to the target policy. Doina's idea of "recognizers" is to reshape the behavior policy to reduce variance. Suggested presentation: show theorem 1 and demonstrate the use of recognizers with first-visit and every-visit MC in an MDP of your choice.
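For context, a minimal sketch of ordinary vs weighted importance sampling on a one-step (bandit-style) problem; the policies and reward means below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions: target policy pi (to evaluate), behaviour policy b (data source).
pi = np.array([0.9, 0.1])
b = np.array([0.5, 0.5])
true_q = np.array([1.0, 0.0])   # expected reward of each action

n = 100_000
actions = rng.choice(2, size=n, p=b)
rewards = true_q[actions] + rng.normal(0.0, 1.0, size=n)
rho = pi[actions] / b[actions]  # importance sampling ratios

ordinary = np.mean(rho * rewards)               # unbiased, higher variance
weighted = np.sum(rho * rewards) / np.sum(rho)  # biased, lower variance
true_value = pi @ true_q                        # = 0.9
```

The larger the mismatch between b and pi, the wider the spread of rho, which is the variance problem recognizers aim to mitigate.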
See chapter 8 of Puterman (1994). Regarding the Laurent series expansion that Pierre-Luc alluded to: see section 8.2.2. Suggested presentation: overview of the problem setting + demo in a continuing task. Interesting question to develop: the so-called "advantage function" is defined as $A_\pi(s, a) = Q_\pi(s, a) - v_\pi(s)$ and often appears in RL under the discounted setting. How is it related to the average reward case?
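One way to sketch an answer, using the standard differential (bias) values from Puterman's chapter 8: with average reward and differential value

$$\rho_\pi = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\pi[R_t], \qquad \tilde{v}_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=1}^{\infty} \big(R_t - \rho_\pi\big) \,\middle|\, S_0 = s\right],$$

and $\tilde{q}_\pi(s,a)$ defined analogously, the natural average-reward analogue is the differential advantage $\tilde{A}_\pi(s, a) = \tilde{q}_\pi(s, a) - \tilde{v}_\pi(s)$. The Laurent expansion $v^\gamma_\pi(s) = \frac{\rho_\pi}{1-\gamma} + \tilde{v}_\pi(s) + o(1)$ as $\gamma \to 1$ (unichain case) suggests why: the $\frac{\rho_\pi}{1-\gamma}$ term does not depend on $(s, a)$ and cancels in $Q_\pi - v_\pi$, so the discounted advantage approaches the differential one as $\gamma \to 1$.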
Suggested presentation: explore the bias-variance tradeoff in Expected SARSA(0) vs SARSA(0). Present the variance analysis of section 5 and design new experiments (other than the cliff walking task and windy grid world). For the statisticians: try to relate the idea behind Expected SARSA to the notion of "conditioning" for variance reduction (see Rubinstein).
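A minimal sketch of the conditioning argument (with made-up $Q$ values and policy): both targets have the same expectation, but the SARSA target also carries the sampling variance of the next action.

```python
import numpy as np

def sarsa_target(Q, s_next, a_next, r, gamma):
    # Bootstraps on the sampled next action A': extra variance from sampling A'.
    return r + gamma * Q[s_next, a_next]

def expected_sarsa_target(Q, s_next, pi_next, r, gamma):
    # Conditions on S' and averages over A' ~ pi: same mean, lower variance.
    return r + gamma * pi_next @ Q[s_next]

# Quick check with made-up numbers.
rng = np.random.default_rng(0)
Q = np.array([[1.0, -1.0, 0.5]])   # one next state, three actions
pi = np.array([0.5, 0.3, 0.2])
r, gamma = 0.0, 0.9
samples = [sarsa_target(Q, 0, rng.choice(3, p=pi), r, gamma)
           for _ in range(50_000)]
```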
Suggested presentation: show lemma 1, illustrate the benefits of this approach in an experiment.
Suggested presentation: implement it! We currently have no idea how Q(sigma) performs in practice. Try to design an experiment which would highlight the benefits of Q(sigma). Compare with Expected SARSA and tree backup. Email Richard Sutton with your results.
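For reference, a minimal sketch of the one-step Q(sigma) backup target (following the unified view in the paper; the variable names are mine):

```python
import numpy as np

def q_sigma_target(Q, s_next, a_next, pi_next, r, gamma, sigma):
    # One-step Q(sigma) backup: sigma interpolates between the sampled
    # (SARSA-style) backup and the expected (tree-backup-style) backup.
    sarsa_part = Q[s_next, a_next]
    expected_part = pi_next @ Q[s_next]
    return r + gamma * (sigma * sarsa_part + (1 - sigma) * expected_part)
```

sigma=1 recovers SARSA's target and sigma=0 recovers Expected SARSA's (a one-step tree backup); intermediate or state-dependent sigma is exactly what remains to be evaluated empirically.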
You have seen n-step returns and the idea of planning with a model in Dyna. This paper shows that we don't have to limit ourselves to one-step models and can also consider multi-step extensions. Planning at multiple time scales can also be achieved by the idea of "beta-models". This paper was a precursor to the topic of temporally extended actions, which we will see in a couple of weeks.
Suggested presentation: in the wall-following domain, use TD(0) to learn the models. Implement one more domain of your choice. Another interesting project would be to use multi-step models in Dyna. Can you reduce the number of "fake" iterations in Dyna when your model predicts further into the future?
The original journal paper on Options
A talk that Doina gave last December in which she developed the idea of options as "programs" and potential connections to Neural Turing Machines.
There is a vast literature on that topic. Just a few representative papers:
"Bootstrapping methods are not in fact instances of true gradient descent"
The original paper by Sutton which first provided an analysis of TD(0) through the induced linear system:
Doina mentioned that there are ways to obtain true gradient-based TD algorithms. In fact, she co-invented the TDC algorithm shown in:
The official tile coding implementation that the UofA group uses:
Suggested assignment/class project: implement a tile coder for the GPU (using Theano, TF, or raw CUDA). It would be very useful to have a "tile coding layer" in order to easily compare different function approximators and establish a baseline.
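For anyone starting from scratch, here is a minimal CPU sketch of the tile-coding idea in NumPy. Note this is a simplified scheme (1-D input, uniform diagonal offsets, no hashing), not the official UofA implementation:

```python
import numpy as np

def tile_code(x, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    """Return the active tile index in each tiling for a scalar x in [low, high]."""
    x = (x - low) / (high - low)              # normalise to [0, 1]
    width = 1.0 / tiles_per_dim
    idx = []
    for t in range(n_tilings):
        offset = t * width / n_tilings        # each tiling shifted by a fraction of a tile
        tile = int((x + offset) / width)
        tile = min(tile, tiles_per_dim)       # guard the upper edge
        idx.append(t * (tiles_per_dim + 1) + tile)  # disjoint index range per tiling
    return np.array(idx)
```

Each input activates exactly one tile per tiling, and nearby inputs share most of their active tiles, which is what gives tile coding its local generalization.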
The origins of tile coding in the "cerebellar model articulation controller" (CMAC). The "receptive field" idea of CMAC echoes "weight sharing" and convolutions in CNNs:
On "optimal" aggregation using bisimulation:
A nice overview of the stability analysis of TD algorithms with linear function approximation:
The paper in which TD(lambda) was properly formalized and analyzed (for the TD(0) case):
Tsitsiklis and Van Roy, 1997:
Doina discussed the projection point of view on linear TD. Bertsekas has a line of research leveraging this projection point of view to perform general large scale linear algebra. This paper also shows how the projected Bellman operator can be analyzed in the Galerkin approximation framework.
The chattering effect in temporal difference methods for control:
Doina co-authored a paper which looked at the conditions for convergence in the control case:
Regarding non-parametric RL, a question raised by a student:
For a discussion about the bias-variance tradeoff:
Suggested presentation: try linear or nonlinear TDC on larger experiments; compare to the usual TD (a.k.a. "semi-gradient" TD).
The paper in which gradient-based TD methods were developed:
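A minimal sketch of the linear TDC update (step-size names alpha, beta as in the paper; the code layout is mine):

```python
import numpy as np

def tdc_step(theta, w, phi, r, phi_next, alpha, beta, gamma):
    # TDC: TD with a gradient correction. The auxiliary weights w estimate
    # E[delta | phi] and cancel the biased part of the semi-gradient update.
    delta = r + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```

As a sanity check before larger experiments: on a deterministic two-state chain with tabular features this converges to the exact values.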
Gradient-based TD was extended to the nonlinear case by taking an orthogonal projection onto the tangent space of a manifold:
The proximal perspective on gradient-based TD:
Interim forward view:
The emergence of Dutch traces and "follow-on" traces:
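A minimal sketch of True Online TD(lambda) with dutch traces, following van Seijen & Sutton's published update equations (the episode/data format here is made up):

```python
import numpy as np

def true_online_td_lambda(episodes, alpha, gamma, lam, n_features):
    # True Online TD(lambda): dutch trace e plus a correction term using
    # the previous value estimate v_old.
    theta = np.zeros(n_features)
    for episode in episodes:
        e = np.zeros(n_features)   # trace and v_old reset at episode start
        v_old = 0.0
        for phi, r, phi_next in episode:
            v = theta @ phi
            v_next = theta @ phi_next
            delta = r + gamma * v_next - v
            # Dutch trace update.
            e = gamma * lam * e + phi - alpha * gamma * lam * (e @ phi) * phi
            theta = (theta + alpha * (delta + v - v_old) * e
                     - alpha * (v - v_old) * phi)
            v_old = v_next
    return theta
```

Setting lam=0 collapses the dutch trace to e = phi and the update to plain TD(0), which is a convenient sanity check.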
The original policy gradient paper by Sutton and colleagues:
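As a minimal illustration of the likelihood-ratio idea underlying these papers, a REINFORCE-style sketch on a made-up two-armed bandit (softmax policy, no baseline):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()           # for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Made-up bandit: arm 0 pays ~1.0, arm 1 pays ~0.0.
theta = np.zeros(2)           # softmax policy parameters
alpha = 0.1
means = np.array([1.0, 0.0])
for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = means[a] + rng.normal(0.0, 0.1)
    grad_log = -p             # gradient of log pi(a) for a softmax policy
    grad_log[a] += 1.0
    theta += alpha * r * grad_log   # ascend r * grad log pi(a)
```

After training, the policy concentrates almost all its probability on the better arm; adding a baseline (as in the paper) would reduce the variance of these updates.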
The policy gradient theorem was also discovered independently by Konda and Tsitsiklis. This paper also contains a "two-timescales" analysis of the actor-critic architecture. (This kind of decoupling and two-timescale analysis might be of interest for the GAN enthusiasts).
Unpublished "theorem 4" by Sutton and colleagues showing that "unbiased values don't help":
Natural gradient extension:
Philip Thomas showed that policy gradient algorithms were neglecting the discount factor and introducing bias:
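The gist, sketched for the episodic discounted objective (my notation): the policy gradient theorem gives

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[\sum_{t \ge 0} \gamma^{t}\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\right], \qquad G_t = \sum_{k \ge 0} \gamma^{k} R_{t+k+1},$$

whereas most implementations drop the leading $\gamma^t$ factor, which amounts to following the gradient of a different (undiscounted state-distribution) objective, i.e. a biased estimate of $\nabla_\theta J$.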
Fitted value methods (mentioned, but won't be covered in class):
Mentioned in class following Mike's presentation: