330d Optimal Control of Partially Observable Markov Decision Processes

Eduardo J. Dozal-Mejorada and B. Erik Ydstie. Carnegie Mellon University, Department of Chemical Engineering, Pittsburgh, PA 15213

Barto et al. (1995) describe Dynamic Programming (DP) based reinforcement learning (RL) algorithms such as Sutton's Temporal Difference methods, Watkins' Q-learning, and Werbos' Heuristic DP. In 1994, Bradtke introduced the Adaptive Policy Iteration (API) algorithm, which combines Watkins' Q-learning with the concept of Policy Iteration. The algorithm successfully solved the problem of adaptive LQ regulation with state feedback. These methods require full state information, or a Kalman filter to reconstruct it, and they cannot easily be applied to partially observable Markov decision processes (POMDPs) when no system model is available for state estimation.
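
For background, the Q-function exploited by Bradtke's API algorithm in the LQ setting has the standard quadratic form (textbook material, stated here only to fix notation):

Q(x_t, u_t) = r(x_t, u_t) + (A x_t + B u_t)^T P (A x_t + B u_t) = [x_t; u_t]^T H [x_t; u_t],

where r is the quadratic stage cost and P is the cost matrix of the current policy. Minimizing over u_t yields the improved feedback gain directly from the blocks of H, without explicit knowledge of A and B.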

The objective of this paper is to develop a method for optimal control of systems which does not rely on full state feedback. The RL approach we develop is motivated by Bradtke's method. The method uses input-output data to obtain a corresponding Q-function, which is estimated recursively by least squares. The estimates are then used to develop reduced-order representations of the optimal feedback strategies. The resulting optimal controllers do not require Kalman filtering, leading to a novel reinforcement-learning-based approach. The method works well for systems in which most modes dissipate quickly. The use of input-output data allows us to focus modeling attention on the low-order (observable and controllable) dynamics while the fast (poorly observable) dynamics are allowed to drift; the method is therefore suitable for optimal control of systems with a wide separation of time constants.
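
As an illustration, a minimal sketch of how recursive least-squares estimation of a quadratic Q-function from input-output data could be organized is given below. The feature construction, the forgetting factor, and all names are our own assumptions for purposes of illustration and are not taken from the paper.

import numpy as np

def quad_features(z, u):
    # Quadratic basis for Q(z, u) = [z; u]^T H [z; u], where z stacks past
    # inputs and outputs (the information state). Returns the monomials
    # x_i * x_j for the upper triangle of H.
    x = np.concatenate([z, u])
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

class RLSQFunction:
    # Recursive least-squares estimator for the Q-function parameters theta.
    def __init__(self, n_feats, forgetting=1.0):
        self.theta = np.zeros(n_feats)
        self.P = 1e3 * np.eye(n_feats)   # large initial covariance
        self.lam = forgetting

    def update(self, phi, target):
        # Standard RLS step for the regression target ~ phi @ theta.
        Pphi = self.P @ phi
        gain = Pphi / (self.lam + phi @ Pphi)
        self.theta = self.theta + gain * (target - phi @ self.theta)
        self.P = (self.P - np.outer(gain, Pphi)) / self.lam

One policy-evaluation step in the spirit of Bradtke's API would then regress the stage cost r_t on the feature difference phi = quad_features(z_t, u_t) - quad_features(z_next, policy(z_next)), followed by a policy-improvement step that minimizes the estimated Q-function over u.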

An alternative approach to these problems is indirect adaptive control. Here the controller is based on a reduced-order model of the plant, and estimation of the model parameters leads indirectly to adaptation of the controller parameters. Typically, the controllers are designed by generating state feedback gains through the corresponding Riccati equations.
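
For comparison, a minimal certainty-equivalence sketch of the indirect route is given below; the reduced-order model (A_hat, B_hat) would be supplied by a separate recursive parameter estimator, and the names and weights are illustrative assumptions rather than the procedure used in the paper.

import numpy as np
from scipy.linalg import solve_discrete_are

def certainty_equivalence_lqr(A_hat, B_hat, Q, R):
    # Indirect (certainty-equivalence) design: treat the current reduced-order
    # model estimate as the true plant and compute the LQ state-feedback gain
    # from the discrete algebraic Riccati equation.
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    return K   # control law: u_t = -K x_hat_t

Each time the parameter estimates are refreshed, the gain K is recomputed, so adaptation of the model translates indirectly into adaptation of the controller.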

The main contributions of this paper are:

1. We formulate input-output optimal controllers using Q-functions and apply the formulation to low-order control of high- and infinite-order problems.

2. We outline mathematical proofs of stability and convergence for the direct and indirect adaptive control methods. These show that our algorithm converges asymptotically to the optimal controller within a tolerance that can be related to the excitation level, the magnitude of model mismatch, and disturbances.

3. We compare and contrast the RL-based (direct adaptive control) method with the indirect adaptive control approach. We present two case-study simulations: a bioreactor and an inverted pendulum (double integrator) system.