An emphatic approach to the problem of off-policy temporal-difference learning

Published April 15, 2023
Title: An emphatic approach to the problem of off-policy temporal-difference learning
Authors: Richard S. Sutton, A. Rupam Mahmood, Martha White
Link: https://www.jmlr.org/papers/volume17/14-488/14-488.pdf

What


Introduces a new family of temporal-difference (TD) methods that improve on the classic TD(λ) methods by naturally extending them with the notion of emphasizing and de-emphasizing states, based on the intrinsic interest in each state and on the interest that follows from previously visited states.

Why


It has been shown that TD(λ) can become unstable, meaning that the updates to the weight vector diverge, in several settings, most importantly off-policy learning (and, in some cases, state-dependent bootstrapping $\lambda$). Other model-free TD algorithms such as GTD($\lambda$) are stable under off-policy learning, but they require maintaining two weight vectors and two step sizes. Emphatic TD(λ), in contrast, is stable for off-policy learning under linear function approximation for any discount function $\gamma$, bootstrapping function $\lambda$, and interest function $i$, and it keeps linear per-step computational complexity, with only the negligible overhead of maintaining the scalar emphasis at each time step.

How


TL;DR: Emphatic TD(λ) is a direct successor of TD(λ). The subtle but effective change is that the distribution from which the weight-vector updates are sampled is warped: some states are emphasized more than others, based on the intrinsic interest in them and on the cumulative past interest leading to them. This overcomes the instability while preserving all the benefits of regular TD(λ).

Emphatic TD($\lambda$) can be specified in its most general form by the following equations¹, for $t \ge 0$:

$$ \begin{align} \bm{\theta}_{t+1} &\stackrel{.}{=} \bm{\theta}_{t} + \alpha ~ \Big( R_{t+1} + \gamma_{t+1} \bm{\theta}_{t}^\top \bm{\phi}_{t+1} - \bm{\theta}_{t}^\top \bm{\phi}_{t} \Big) ~ \bm{e}_{t} \\ \bm{e}_{t} &\stackrel{.}{=} \rho_t ( \gamma_t \lambda_t \bm{e}_{t-1} + M_t \bm{\phi}_{t}), \quad \text{with} ~~ \bm{e}_{-1} \stackrel{.}{=} \bm{0} \\ M_t &\stackrel{.}{=} \lambda_t i(S_t) + (1 - \lambda_t) ~ F_t \\ F_t &\stackrel{.}{=} \rho_{t-1} \gamma_t F_{t-1} + i(S_t), \quad \text{with} ~~ F_0 \stackrel{.}{=} i(S_0) \end{align} $$
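
To get a feel for what the emphasis does, it helps to read off the two extreme settings of the bootstrapping function directly from the equations above (a quick derivation of mine, not a separate result quoted from the paper):

$$ \begin{align} \lambda_t \equiv 1&: \quad M_t = i(S_t), \quad \bm{e}_{t} = \rho_t \big( \gamma_t \bm{e}_{t-1} + i(S_t) \bm{\phi}_{t} \big) \nonumber \\ \lambda_t \equiv 0&: \quad M_t = F_t, \quad \bm{e}_{t} = \rho_t F_t \bm{\phi}_{t} \nonumber \end{align} $$

With $\lambda = 1$ (and uniform interest) the followon trace drops out and the update is ordinary off-policy TD(1) with an importance-sampled accumulating trace; with $\lambda = 0$ every TD(0)-style update is scaled entirely by the accumulated emphasis $\rho_t F_t$.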

During learning, the updates to the weight vector can become excessively large, leading to an unstable regime in which subsequent updates spiral out of control and the weights grow without bound. Stability alone is not sufficient for convergence, but it is a necessary prerequisite. Emphatic TD aims to stabilize the regular TD updates by molding the distribution of states at which updates are made while preserving the core TD-error update.

If we consider the off-policy TD(0) update:

$$ \begin{align} \bm{\theta}_{t+1} &\stackrel{.}{=} \bm{\theta}_{t} + \rho_t \alpha ~ \Big( R_{t+1} + \gamma \bm{\theta}_{t}^\top \bm{\phi}_{t+1} - \bm{\theta}_{t}^\top \bm{\phi}_{t} \Big) ~ \bm{\phi}_{t} \\ &= \bm{\theta}_{t} + \alpha ( \bm{b}_t - \bm{A}_t \bm{\theta}_{t}) \nonumber \\ \bm{b}_t &\stackrel{.}{=} \rho_t R_{t+1} \bm{\phi}_{t} \\ \bm{A}_t &\stackrel{.}{=} \rho_t \bm{\phi}_{t} (\bm{\phi}_{t} - \gamma \bm{\phi}_{t+1})^\top \end{align} $$

and take the expectation of $\bm{A}_t$ in the limit, under the stationary state distribution of the behavior policy $\mu$, we get the following form:

$$ \begin{align} \bm{A} &= \lim\limits_{t \rightarrow \infty} \mathbb{E}_\mu[\bm{A}_t] = \lim\limits_{t \rightarrow \infty} \mathbb{E}_\mu[\rho_t \bm{\phi}_{t}(\bm{\phi}_{t} - \gamma \bm{\phi}_{t+1})^\top] \\ &= \bm{\Phi}^\top \bm{D}_\mu (\bm{I} - \gamma \bm{P}_{\pi}) \bm{\Phi} \end{align} $$

The iteration becomes unstable when the key matrix $\bm{A}$ is not positive definite, which can happen under the off-policy state distribution $\bm{D}_\mu$. Emphatic TD warps the distribution $\bm{D}_{\mu}$ of updated states such that the resulting key matrix is positive definite, thus ensuring stability. The TD-error core given by $\bm{I} - \gamma \bm{P}_{\pi}$ remains unchanged, thus preserving all the features of regular TD.
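
As a concrete illustration (a toy construction of mine, not an experiment from the paper), the sketch below builds this key matrix for a two-state problem in which off-policy TD(0) diverges in expectation, and then rebuilds it with $\bm{D}_\mu$ replaced by the followon weighting. In the $\lambda = 0$, constant-$\gamma$, unit-interest special case, the followon weights solve $\bm{f} = \bm{d}_\mu + \gamma \bm{P}_\pi^\top \bm{f}$:

     import numpy as np

     # Tiny two-state counterexample (toy construction): features 1 and 2,
     # the target policy always moves to the second state, and the behavior
     # policy visits both states equally often.
     gamma = 0.9
     Phi = np.array([[1.0], [2.0]])              # one feature per state
     P_pi = np.array([[0.0, 1.0],
                      [0.0, 1.0]])               # state transitions under pi
     d_mu = np.array([0.5, 0.5])                 # state distribution under mu

     # Key matrix of off-policy TD(0): A = Phi^T D_mu (I - gamma P_pi) Phi
     A_td = Phi.T @ np.diag(d_mu) @ (np.eye(2) - gamma * P_pi) @ Phi
     print(A_td)    # about -0.2 < 0: not positive definite, expected updates diverge

     # Followon weighting for lambda = 0, constant gamma, interest i(s) = 1:
     # f = d_mu + gamma * P_pi^T f  =>  f = (I - gamma * P_pi^T)^{-1} d_mu
     f = np.linalg.solve(np.eye(2) - gamma * P_pi.T, d_mu)

     # Emphatic key matrix: same (I - gamma P_pi) core, warped state weighting
     A_etd = Phi.T @ np.diag(f) @ (np.eye(2) - gamma * P_pi) @ Phi
     print(A_etd)   # about 3.4 > 0: positive definite, expected updates are stable

The $(\bm{I} - \gamma \bm{P}_{\pi})$ core is identical in both matrices; only the diagonal state weighting changes, which is exactly the warping described above.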

The above is for off-policy TD(0). The paper derives the matrix $\bm{A}$ in detail for the general case of arbitrary $\lambda$ and then for Emphatic TD, showing how the proposed interest function $i$ and followon trace $F$ naturally make $\bm{A}$ positive definite.

Below is pseudocode for an agent applying the Emphatic TD($\lambda$) learning update, shown here for constant $\gamma$, constant $\lambda$, and uniform interest $i(s) = 1$. The argument `rho` is the per-step importance sampling ratio $\rho_t$, and attributes such as `num_features`, `gamma`, `lambda_`, and `alpha` are assumed to be set elsewhere.

     import numpy as np

     def agent_init(self):
          self.i = 1.                    # uniform interest i(s) = 1
          self.F = 0.                    # followon trace F_t
          self.M = 0.                    # emphasis M_t
          self.rho_prev = 1.             # importance sampling ratio of the previous step
          self.e = np.zeros(self.num_features)      # eligibility trace e_t
          self.theta = np.zeros(self.num_features)  # weight vector theta_t

     def learn(self, r, phi, phi_prev, rho):
          # TD error: delta_t = R_{t+1} + gamma * theta^T phi_{t+1} - theta^T phi_t
          target = r + self.gamma * np.dot(self.theta, phi)
          pred = np.dot(self.theta, phi_prev)
          delta = target - pred

          # followon trace, emphasis, and emphasized eligibility trace
          self.F = self.rho_prev * self.gamma * self.F + self.i
          self.M = self.lambda_ * self.i + (1 - self.lambda_) * self.F
          self.e = rho * (self.gamma * self.lambda_ * self.e + self.M * phi_prev)
          self.theta += self.alpha * delta * self.e
          self.rho_prev = rho
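
For completeness, a hypothetical calling pattern (names such as `env`, `mu`, `pi`, `features`, and `agent` are placeholders of mine, not from the paper) could look like:

     # hypothetical interaction loop; env, mu, pi, features, and agent are placeholders
     phi_prev = features(s)
     a = mu.sample(s)                                  # act with the behavior policy mu
     s_next, r = env.step(a)
     rho = pi.prob(a, s) / mu.prob(a, s)               # importance sampling ratio rho_t
     agent.learn(r, features(s_next), phi_prev, rho)   # Emphatic TD(lambda) update
     s = s_next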

Thoughts


  • Emphatic TD addresses the known instability issues of TD(λ) while preserving all of its core features and per-step cost, making it a natural successor.

  • Emphatic TD updates may exhibit larger variance than regular TD updates, since the emphasis can vary substantially in magnitude from step to step. How to lower the variance of Emphatic TD is an open research question.

  • The interest function can be set uniformly across states or designed a priori. How to meta-learn the interest is a promising direction for future work.

  • More empirical results are needed to elucidate the full strength of Emphatic TD in its most general case with state-dependent discounting, state-dependent bootstrapping, and state-dependent interest.

Footnotes


  1. I adopted the notation from Sutton, R. S., & Barto, A. G. (2018): vectors are bold lowercase and matrices are bold uppercase. Also, in accordance with that notation, equality by definition is indicated with a dot above the equals sign, as in $\stackrel{.}{=}$.