LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES

Date

2009

Abstract

We propose various computational schemes for solving Partially Observable Markov Decision Processes (POMDPs) under the finite stage additive cost and infinite horizon discounted cost criteria. Error bounds for the corresponding algorithms are given, and it is further shown that, at the expense of more computational effort, the POMDP can be solved to within any desired accuracy of the optimum.

It is well known that a sufficient statistic for taking the best action at any time in a POMDP is the a posteriori probability distribution on the underlying states given all the past history, and that this distribution can be updated recursively. We prove that the finite stage optimal costs, as well as the optimal cost for the infinite horizon discounted cost problem, are Lipschitz continuous (with domain the unit simplex of probability distributions over the underlying states) and give bounds for the Lipschitz constant. We use these bounds to provide error bounds for computational algorithms for solving POMDPs.
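
For concreteness, the recursive update of the a posteriori distribution (the belief state) can be sketched as the standard predict-and-correct Bayes step. The arrays, variable names, and numbers below are illustrative placeholders, not notation taken from the dissertation.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """One recursive step of the a posteriori (belief) update for a POMDP.

    b : current belief over hidden states, a point in the unit simplex, shape (S,)
    a : action taken
    o : observation received after taking action a
    P : P[a][s, s_next] = Pr(s_next | s, a)   (transition probabilities)
    O : O[a][s_next, o] = Pr(o | s_next, a)   (observation probabilities)
    """
    predicted = b @ P[a]                      # predict the next hidden state
    unnormalized = predicted * O[a][:, o]     # correct with the observation likelihood
    return unnormalized / unnormalized.sum()  # renormalize back onto the simplex

# Tiny two-state example with illustrative numbers only
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3],
                  [0.1, 0.9]])}
b_next = belief_update(np.array([0.5, 0.5]), a=0, o=1, P=P, O=O)
```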

We extend the almost sure convergence result of a very general stochastic approximation algorithm to the case in which the underlying Markov process exhibits periodicity. This result is used to extend the proof of convergence of Temporal Difference (TD) reinforcement learning schemes with linear function approximation for Markov cost processes, in order to estimate the cost-to-go function for the discounted cost criterion and the differential cost function for the average cost criterion, respectively.
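
As an illustration of the kind of scheme referred to here, a minimal TD(0) iteration with linear function approximation for estimating the discounted cost-to-go of a Markov cost process is sketched below. The feature map, step size, and simulated process are assumptions made for the example, not the particular schemes or convergence conditions analyzed in the dissertation.

```python
import numpy as np

def td0_linear(features, transitions, costs, gamma=0.95, alpha=0.01,
               num_steps=10000, rng=None):
    """TD(0) with linear function approximation: J(s) is approximated by features[s] @ theta.

    features    : array of shape (S, d), one feature vector per state
    transitions : array of shape (S, S), transition matrix of the Markov cost process
    costs       : array of shape (S,), expected immediate cost per state
    gamma       : discount factor
    alpha       : constant step size (illustrative; decreasing step sizes are also common)
    """
    rng = np.random.default_rng() if rng is None else rng
    S, d = features.shape
    theta = np.zeros(d)
    s = rng.integers(S)
    for _ in range(num_steps):
        s_next = rng.choice(S, p=transitions[s])
        # Temporal-difference error for the discounted cost criterion
        delta = costs[s] + gamma * features[s_next] @ theta - features[s] @ theta
        theta += alpha * delta * features[s]
        s = s_next
    return theta
```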

Adaptive control of Markov Decision Problems (MDPs) concerns the setting in which full knowledge of the system parameters, namely the transition probabilities and the distribution of the immediate costs, is not available a priori. We give direct adaptive control schemes for infinite horizon discounted cost and average cost MDPs. Approximate Policy Iteration using on-line TD schemes for policy evaluation is detailed for the discounted cost and average cost criteria.
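
The following schematic of Approximate Policy Iteration alternates TD-based policy evaluation (reusing the td0_linear sketch above) with a greedy improvement step. For simplicity the improvement step here uses a known model (P, C), whereas the direct adaptive schemes of the dissertation operate without prior knowledge of these parameters; all names below are assumptions made for the sketch.

```python
import numpy as np

def approximate_policy_iteration(features, P, C, gamma=0.95, num_iters=20, rng=None):
    """Schematic Approximate Policy Iteration for the discounted cost criterion.

    P[a] : transition matrix under action a, shape (S, S)
    C[a] : expected immediate cost under action a, shape (S,)
    Policy evaluation uses the td0_linear sketch; the improvement step
    uses the model (P, C) for illustration only.
    """
    S = features.shape[0]
    actions = list(P.keys())
    policy = np.zeros(S, dtype=int)
    for _ in range(num_iters):
        # Policy evaluation: TD(0) on the Markov cost process induced by `policy`
        P_pi = np.array([P[policy[s]][s] for s in range(S)])
        c_pi = np.array([C[policy[s]][s] for s in range(S)])
        theta = td0_linear(features, P_pi, c_pi, gamma=gamma, rng=rng)
        J = features @ theta
        # Policy improvement: greedy with respect to the estimated cost-to-go
        for s in range(S):
            policy[s] = min(actions, key=lambda a: C[a][s] + gamma * P[a][s] @ J)
    return policy, J
```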

Possible extensions of direct adaptive control schemes to the POMDP framework are discussed.

Auxiliary results relevant to the core results of the dissertation are stated and proved in the appendices. In particular, an efficient discretization scheme for the finite-dimensional unit simplex is given, along with some general error bounds for MDPs. TD schemes for learning in Stochastic Shortest Path (SSP) problems are also discussed.
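
One generic way to discretize the finite-dimensional unit simplex is a uniform rational grid of fixed resolution, onto which beliefs are rounded. The sketch below is such a construction, offered for illustration; it is not claimed to be the specific scheme developed in the appendix.

```python
from itertools import combinations
import numpy as np

def simplex_grid(n_states, resolution):
    """Enumerate all points of the unit simplex with coordinates k / resolution.

    Uses the stars-and-bars correspondence between grid points and
    compositions of `resolution` into `n_states` nonnegative parts.
    """
    points = []
    for dividers in combinations(range(resolution + n_states - 1), n_states - 1):
        parts, prev = [], -1
        for d in list(dividers) + [resolution + n_states - 1]:
            parts.append(d - prev - 1)
            prev = d
        points.append(np.array(parts) / resolution)
    return np.array(points)

def nearest_grid_point(b, grid):
    """Round a belief vector b to the closest grid point (Euclidean distance)."""
    return grid[np.argmin(np.linalg.norm(grid - b, axis=1))]

grid = simplex_grid(n_states=3, resolution=4)   # 15 grid points on the 2-simplex
b_hat = nearest_grid_point(np.array([0.2, 0.5, 0.3]), grid)
```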
