LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES

dc.contributor.advisor: MARCUS, STEVEN
dc.contributor.author: Thomas, Abraham
dc.contributor.department: Electrical Engineering
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2010-02-19T06:30:57Z
dc.date.available: 2010-02-19T06:30:57Z
dc.date.issued: 2009
dc.description.abstract: We propose various computational schemes for solving Partially Observable Markov Decision Processes (POMDPs) under the finite-stage additive cost and infinite-horizon discounted cost criteria. Error bounds for the corresponding algorithms are given, and it is further shown that, at the expense of more computational effort, the POMDP can be solved as close to optimal as desired. It is well known that a sufficient statistic for taking the best action at any time in a POMDP is the a posteriori probability distribution on the underlying states given all of the past history, and that this distribution can be updated recursively. We prove that the finite-stage optimal costs, as well as the optimal cost for the infinite-horizon discounted cost problem, are Lipschitz continuous (with domain the unit simplex of probability distributions over the underlying states) and give bounds for the corresponding Lipschitz constants. We use these bounds to provide error bounds for computational algorithms for solving POMDPs.

We extend the almost sure convergence result of a very general stochastic approximation algorithm to the case when the underlying Markov process exhibits periodicity. This result is used to extend the proof of convergence of Temporal Difference (TD) reinforcement learning schemes with linear function approximation for Markov cost processes, in order to estimate the cost-to-go function for the discounted cost criterion and the differential cost function for the average cost criterion, respectively.

Adaptive control of Markov Decision Problems (MDPs) concerns the setting in which full knowledge of the system parameters, namely the transition probabilities and the distribution of the immediate costs, is not available a priori. We give direct adaptive control schemes for infinite-horizon discounted cost and average cost MDPs. Approximate policy iteration using on-line TD schemes for policy evaluation is detailed for the discounted cost and average cost criteria. Possible extensions of direct adaptive control schemes to the POMDP framework are discussed.

Auxiliary results relevant to the core results of the dissertation are stated and proved in the appendices. In particular, an efficient discretization scheme for the finite-dimensional unit simplex is given, some general error bounds for MDPs are presented, and TD schemes for learning in Stochastic Shortest Path (SSP) problems are discussed.
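To make the recursive update of the a posteriori state distribution concrete, the following is a minimal Python sketch, not taken from the dissertation: the function name belief_update and the toy transition and observation matrices are illustrative assumptions. Given a belief b, an action a, and an observation o, the next belief is obtained by a prediction step through the transition kernel followed by a Bayes correction with the observation likelihood.

    import numpy as np

    def belief_update(b, a, o, P, Z):
        """One recursive update of the a posteriori state distribution (belief) for a POMDP.

        b : current belief, shape (S,), a point in the unit simplex
        a : action taken
        o : observation received after taking action a
        P : transition probabilities, P[a][s, s_next] = Pr(s_next | s, a)
        Z : observation probabilities, Z[a][s_next, o] = Pr(o | s_next, a)
        """
        # Predict: push the belief through the transition kernel for action a.
        predicted = b @ P[a]                   # shape (S,)
        # Correct: weight by the likelihood of the observed symbol o.
        unnormalized = predicted * Z[a][:, o]  # shape (S,)
        norm = unnormalized.sum()
        if norm == 0.0:
            raise ValueError("Observation has zero probability under this belief and action.")
        return unnormalized / norm

    if __name__ == "__main__":
        # Tiny 2-state, 2-action, 2-observation example with illustrative numbers only.
        P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
             1: np.array([[0.5, 0.5], [0.3, 0.7]])}
        Z = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
             1: np.array([[0.6, 0.4], [0.1, 0.9]])}
        b0 = np.array([0.5, 0.5])
        print(belief_update(b0, a=0, o=1, P=P, Z=Z))  # updated belief, sums to 1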
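The TD schemes with linear function approximation referred to above estimate the cost-to-go function from a single sample path of the Markov cost process. The sketch below is likewise an illustrative assumption rather than the dissertation's algorithm; in particular, the constant step size alpha stands in for the diminishing stochastic-approximation step sizes that the convergence analysis assumes. It shows a TD(0) update for the discounted cost criterion with the cost-to-go approximated as phi(s)' theta.

    import numpy as np

    def td0_linear(features, trajectory, gamma=0.95, alpha=0.01, theta=None):
        """TD(0) with linear function approximation for the discounted cost criterion.

        features   : array of shape (S, d); row s is the feature vector phi(s)
        trajectory : iterable of (s, cost, s_next) transitions from one sample path
        gamma      : discount factor in (0, 1)
        alpha      : constant step size (a simplification; the theory uses diminishing steps)
        """
        d = features.shape[1]
        theta = np.zeros(d) if theta is None else np.asarray(theta, dtype=float)
        for s, cost, s_next in trajectory:
            # Temporal-difference error for the discounted cost-to-go estimate.
            delta = cost + gamma * features[s_next] @ theta - features[s] @ theta
            # Stochastic-approximation update along the feature direction of state s.
            theta = theta + alpha * delta * features[s]
        return theta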
dc.identifier.uri: http://hdl.handle.net/1903/9810
dc.subject.pqcontrolled: Engineering, Electronics and Electrical
dc.subject.pquncontrolled: AVERAGE COST MDP
dc.subject.pquncontrolled: DISCOUNTED COST MDP
dc.subject.pquncontrolled: MARKOV DECISION PROCESSES
dc.subject.pquncontrolled: PARTIALLY OBSERVABLE MDPS
dc.subject.pquncontrolled: STOCHASTIC APPROXIMATION ALGORITHM
dc.subject.pquncontrolled: TEMPORAL DIFFERENCE SCHEMES
dc.title: LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES
dc.type: Dissertation

Files

Original bundle

Name: Thomas_umd_0117E_10603.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format