Approximate Policy Iteration with a Policy Language Bias
We explore approximate policy iteration (API), replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs.
Bisimulation for Partially Observable Markov Decision Processes
POMDPs (Partially observable Markov decision processes) provide a rich
framework to model sequential decision processes with uncertainty in the
state of the world and the action effects. A wide range of domains such
as spoken dialogue systems, user modeling, preference elicitation,
stochastic resource allocation, robot navigation, helicopter control,
maintenance scheduling, etc. can be naturally modeled with POMDPs. In the
first part of this talk, I will introduce POMDPs and explain how they can
be used in partially observable stochastic domains to verify the
specifications of a system or to find an optimal control strategy.
In a second part, I will examine some computational issues with POMDPs
that prevent current solution algorithms to handle large scale problems.
For many domains, the exponential dimensionality of "belief states" is
an important source of intractability. I will analyze the problem of
compressing POMDP belief states in a way that minimally impacts future
rewards. Sufficient conditions will be proposed for a bisimulation that
ensures reward equivalence. Using those conditions, several algorithms
will be proposed to obtain lossless and lossy belief state compressions.
This is joint work with Craig Boutilier.
Metrics for Markov Decision Processes
We present a class of metrics, defined on the state space of a finite Markov
Decision Process (MDP), each of which is sound with respect to stochastic
bisimulation, a notion of MDP state equivalence derived from the theory of
concurrent processes. Such metrics are based on similar metrics developed in
the context of Labelled Markov Processes, and like those, are suitable for
state space aggregation. Furthermore, we restrict our attention to a subset of
this class that is appropriate for certain reinforcement learning (RL) tasks,
specifically, infinite horizon tasks with an expected total discounted reward
optimality criterion. Given such an RL metric, we provide bounds relating it
to the optimal value function of the original MDP as well as to the value
function of the aggregate MDP. Finally, we present an algorithm for
calculating such a metric up to a prescribed degree of accuracy and some
empirical results.
This is joint work with Vineet Gupta, Radha Jagadeesan, Prakash Panangaden and Doina Precup.
Verification of randomised distributed algorithms
Randomised algorithms are used widely in co-ordinating distributed
computer networks, message routing and cache management.
They act as symmetry breakers and often lead to more efficient
and elegant solutions. Since both nondeterminism and probabilistic
choice are present, the natural models of randomised distributed
algorithms are Markov decision processes.
To prove a randomised distributed algorithm correct one usually
involves two levels: classical, assertion-based reasoning, and a
probabilistic analysis based on a suitable probability space
on computations. In this talk we describe a number of approaches
to the verification of randomised distributed algorithms, including
automated techniques.
This is joint work with Gethin Norman
On a random walk strategy for the Q2SAT problem
We discuss a simple, randomized algorithm for the problem of checking whether a Q2CNF formula has a model (i.e., the Q2SAT problem). The algorithm is based on the coin-flipping strategy outlined by Papadimitriouu for the 2SAT problem. We show that a similar strategy works for the Q2SAT problem, in that the expected number of coin flips to discover a model for a satisfiable Q2CNF formula on n variables is n*n and the probability of not finding a model in 2n*n coin flips is less than one half.
Linear Value Function Approximation Methods for MDPs and Markov Games
For many practical problems that we would like to model and solve as Markov
decision processes (MDPs) or Markov games, the state space is too large to
permit an exact solution. In this talk I provide an overview of methods
for approximating value functions for these problems as linear combinations
of problem features. Our work in this area covers the case where the model
is known and described with state variable dynamics using a dynamic
Bayesian network, and the case where the model is not known in advance.
The latter case is often called reinforcement learning.
I will provide an overview of the general topic and highlight some
contributions I have made in this area with collaborators Carlos Guestrin
(Stanford University), Daphne Koller (Stanford University) and Michail
Lagoudakis (Duke University).