Planning with expectation models

Published November 3, 2023
Title: Planning with expectation models
Authors: Yi Wan, Zaheer Abbas, Adam White, Martha White, Richard S. Sutton
Link: https://arxiv.org/pdf/1904.01191.pdf

What


The paper demonstrates how to use expectation models soundly for planning in model-based reinforcement learning (MBRL), particularly in stochastic environments.

One important insight is that planning with an expectation model is just as expressive as planning with a distribution model, provided that the state value function is linear in the state features. The observation-to-state-feature mapping itself may still be non-linear.

Furthermore, a strong argument is made for using non-linear instead of linear function approximation for the feature-vector and reward functions.

Finally, a stable planning algorithm called Gradient Dyna is introduced. It draws inspiration from Gradient-TD off-policy policy evaluation algorithms and is guaranteed to converge with both linear and non-linear expectation models, even when the model is imperfect and learned online.

Why


In MBRL there are many choices for the type of model used to predict the next feature-vector state and reward. Distribution and sample models are two options, but they can be very inefficient when state and action spaces are large: the distribution over next feature vectors is intractable to model, or too many samples are needed to approximate the environment's transition dynamics reasonably well. Expectation models, by contrast, are compact and deterministic in their output.
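To make the contrast concrete, here is a minimal sketch of the two model interfaces. The class and method names are mine, not the paper's, and the expectation model is shown with a simple linear parameterization for illustration:

```python
import numpy as np


class SampleModel:
    """Hypothetical sample model: queried with state features and an action,
    it returns one stochastic draw of the next feature vector and reward."""

    def step(self, phi: np.ndarray, a: int) -> tuple[np.ndarray, float]:
        # Would internally sample from learned transition dynamics.
        raise NotImplementedError


class LinearExpectationModel:
    """Hypothetical expectation model (linear case): queried with the same
    inputs, it returns a single deterministic output, the *expected* next
    feature vector and the expected reward."""

    def __init__(self, d: int, num_actions: int):
        # One d x d transition matrix and one d-vector of reward weights per action.
        self.F = np.zeros((num_actions, d, d))
        self.b = np.zeros((num_actions, d))

    def step(self, phi: np.ndarray, a: int) -> tuple[np.ndarray, float]:
        expected_next_phi = self.F[a] @ phi        # estimate of E[phi' | phi, a]
        expected_reward = float(self.b[a] @ phi)   # estimate of E[R | phi, a]
        return expected_next_phi, expected_reward
```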

When using an expectation model, we need to estimate the expected next feature vector and the expected next reward. Should this estimation be linear or non-linear? The answer is non-linear: model-free off-policy TD(0) applied to real data and to data generated from the best linear model does not converge to the same solution, whereas with a non-linear model the two coincide.

If we use expectation models with non-linear function approximation, is TD(0) a suitable algorithm for planning with the model? Model-free TD(0) is known to be unstable when combined with function approximation, bootstrapping, and off-policy learning (Sutton & Barto (2018), Sec. 11.3). Since the authors establish that TD(0) on real data has the same solution as TD(0) updates on data from a non-linear model, planning with TD(0) inherits that instability. Gradient-TD methods, by contrast, have robust convergence properties with both linear and non-linear function approximation.

How


The core idea in this paper is the development of a stable planning algorithm for MBRL using expectation models that is particularly suitable for stochastic environments.

TL;DR: The Gradient Dyna planning algorithm is inspired by Gradient-TD methods in both the objective it minimizes and the computationally efficient way it performs stochastic gradient descent (SGD). It can also be used online with imperfect linear or non-linear models.

Gradient-TD methods have strong convergence guarantees for both linear and non-linear function approximation, and they minimize the mean squared projected Bellman error. Taking inspiration from these methods, the authors introduce a new objective, the Model-Based Mean Square Projected Bellman Error (MB-MSPBE), written $\overline{\text{MB-PBE}}$ below:

$$ \begin{align} \overline{\text{MB-PBE}}(\mathbf{w}) \stackrel{.}{=} \mathop{\mathbb{E}}_{\zeta}[ \Delta_k \bm{\phi}_k ]^\top \cdot \mathop{\mathbb{E}}_{\zeta}[ \bm{\phi}_k \bm{\phi}^\top_k ]^{-1} \cdot \mathop{\mathbb{E}}_{\zeta}[ \Delta_k \bm{\phi}_k ] \notag \end{align} $$

Here $\zeta$ is the distribution from which planning states (feature vectors $\bm{\phi}_k$) are drawn, and $\Delta_k \stackrel{.}{=} \hat{r}(\bm{\phi}_k, A_k) + \gamma \mathbf{w}^\top \mathbf{\hat{x}}(\bm{\phi}_k, A_k) - \mathbf{w}^\top \bm{\phi}_k$ is the model-based TD error, with $\mathbf{\hat{x}}$ and $\hat{r}$ the expectation model's predicted next feature vector and reward.
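To make the objective concrete, here is a minimal numpy sketch that estimates $\overline{\text{MB-PBE}}(\mathbf{w})$ from a batch of planning states. The function names `x_hat` and `r_hat` and the batch interface are my assumptions, not the paper's API:

```python
import numpy as np


def mb_mspbe(w, phis, actions, x_hat, r_hat, gamma=0.99):
    """Estimate the MB-MSPBE from a batch of feature vectors phi_k drawn from zeta.

    x_hat(phi, a) -> expected next feature vector (from the expectation model)
    r_hat(phi, a) -> expected reward (from the expectation model)
    """
    d = phis.shape[1]
    e_delta_phi = np.zeros(d)       # estimates E_zeta[Delta_k phi_k]
    e_phi_phi = np.zeros((d, d))    # estimates E_zeta[phi_k phi_k^T]
    for phi, a in zip(phis, actions):
        # model-based TD error Delta_k
        delta = r_hat(phi, a) + gamma * w @ x_hat(phi, a) - w @ phi
        e_delta_phi += delta * phi
        e_phi_phi += np.outer(phi, phi)
    e_delta_phi /= len(phis)
    e_phi_phi /= len(phis)
    # MB-MSPBE = E[Delta phi]^T E[phi phi^T]^{-1} E[Delta phi]
    return float(e_delta_phi @ np.linalg.pinv(e_phi_phi) @ e_delta_phi)
```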

The gradient of the MB-MSPBE¹ is defined as:

$$ \begin{align} \nabla \overline{\text{MB-PBE}}(\mathbf{w}) \stackrel{.}{=} \underbrace{\mathop{\mathbb{E}}_{\zeta}[ ( \gamma \mathbf{\hat{x}}(\bm{\phi}_k, A_k) - \bm{\phi}_k)\bm{\phi}^\top_k] \cdot \mathop{\mathbb{E}}_{\zeta}[ \bm{\phi}_k \bm{\phi}^\top_k ]^{-1}}_{\text{estimated with } \mathbf{V} \text{ via the LMS method}} \notag \\ \cdot \underbrace{\mathop{\mathbb{E}}_{\zeta}[ \Delta_k \bm{\phi}_k ]}_{\text{sampled at every timestep}} \notag \end{align} $$

The essential steps in Gradient Dyna (cf. Algorithm 1 in the paper) then are:

$$ \begin{align} \mathbf{w}_{k+1} \stackrel{.}{=} \mathbf{w}_{k} - \alpha_k \mathbf{V}_{k} \Delta_k \bm{\phi}_k \notag \\ \mathbf{V}_{k+1} \stackrel{.}{=} \mathbf{V}_{k} + \beta_k ( ( \gamma \mathbf{\hat{x}}(\bm{\phi}_k, A_k) - \bm{\phi}_k)\bm{\phi}^\top_k - \mathbf{V}_{k} \bm{\phi}_k \bm{\phi}_k^\top) \notag \\ \text{with} ~ \mathbf{V}_0, \mathbf{w}_0 ~ \text{initialized arbitrarily} ~ \text{(e.g., }\stackrel{.}{=} \bm{0}) \notag \end{align} $$
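A minimal numpy sketch of these updates, assuming a search-control routine `sample_phi_a()` that draws feature-vector/action pairs from $\zeta$ and an expectation model exposing `x_hat` and `r_hat` (these names are mine; the update structure follows the equations above):

```python
import numpy as np


def gradient_dyna(x_hat, r_hat, sample_phi_a, d,
                  alpha=0.01, beta=0.01, gamma=0.99, num_steps=10_000):
    """Sketch of the Gradient Dyna planning updates (cf. Algorithm 1).

    x_hat(phi, a) -> expected next feature vector, r_hat(phi, a) -> expected
    reward (the expectation model); sample_phi_a() is a hypothetical
    search-control routine yielding (phi, a) pairs drawn from zeta.
    """
    w = np.zeros(d)        # value-function weights, w_0 = 0
    V = np.zeros((d, d))   # auxiliary matrix, V_0 = 0
    for _ in range(num_steps):
        phi, a = sample_phi_a()
        next_phi = x_hat(phi, a)
        # model-based TD error Delta_k
        delta = r_hat(phi, a) + gamma * w @ next_phi - w @ phi
        # w_{k+1} = w_k - alpha_k V_k Delta_k phi_k
        w = w - alpha * (V @ (delta * phi))
        # V_{k+1} = V_k + beta_k ((gamma x_hat - phi) phi^T - V_k phi phi^T)
        V = V + beta * (np.outer(gamma * next_phi - phi, phi)
                        - V @ np.outer(phi, phi))
    return w
```

The LMS update of $\mathbf{V}$ tracks the product of the first two expectations in the gradient above, so each step is a sampled stochastic-gradient step on the MB-MSPBE at $O(d^2)$ cost.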

  • Storing a linear model requires memory quadratic in the number of state features; a non-linear model may need less, depending on its parameterization.
  • The planning algorithm has $O(d^2)$ storage and per-step computation, with $d$ the number of state features.

Thoughts


  • How can Gradient Dyna be extended to the control problem?
  • What other parameterizations of non-linear models could be devised that require less than quadratic memory in the number of state features?

Footnotes


  1. I adopted the notation from Sutton & Barto (2018) using an overline over $\text{MB-PBE}$ to compactly denote the mean-square part. ↩︎