Recent Developments in Emphatic Temporal-Difference Methods and Emphatic Weightings

Published April 15, 2023

I have recently read some of the work that has been done on emphatic temporal-difference (ETD) methods, and more generally on the use of emphatic weightings. Below I briefly summarize my understanding of the authors' findings and offer some of my own opinions where I think they are appropriate.

ETD methods have been proposed to address the long-standing problem of instability of off-policy temporal-difference (TD) methods under linear function approximation (Sutton, Mahmood, and White 2016 [1]). The emphatic algorithm elegantly extends TD by reweighting each update according to how much we care about the state being updated. Specifically, the user assigns an interest to each state; at every time step, a discounted follow-on trace accumulates the interest flowing in from previously visited states, and the resulting emphasis scales the TD update. The authors extended the algorithm to its most general case, including state-dependent bootstrapping, state-dependent termination, and state-dependent interest functions for linear function approximation, and showed that its expected updates are stable.
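
To make the reweighting concrete, below is a minimal sketch of one step of linear ETD(λ) as I understand it from [1]; the function signature and variable names are mine, and Python/NumPy is used purely for illustration.

```python
import numpy as np

def etd_lambda_step(theta, e, F, x, x_next, reward, rho, rho_prev,
                    gamma, gamma_next, lam, interest, alpha):
    """One step of linear ETD(lambda) as I read it in Sutton, Mahmood &
    White (2016); the names and argument layout are my own.

    x, x_next         : NumPy feature vectors for S_t and S_{t+1}
    rho, rho_prev     : importance-sampling ratios pi/mu at steps t and t-1
    gamma, gamma_next : state-dependent discounts gamma(S_t), gamma(S_{t+1})
    lam               : state-dependent bootstrapping parameter lambda(S_t)
    interest          : user-specified interest i(S_t)
    """
    # Follow-on trace: discounted, importance-weighted interest that has
    # flowed in from earlier states, plus the interest in the current state.
    F = rho_prev * gamma * F + interest
    # Emphasis: interpolates between the immediate interest and the follow-on.
    M = lam * interest + (1.0 - lam) * F
    # Emphasis-weighted eligibility trace.
    e = rho * (gamma * lam * e + M * x)
    # Standard linear TD error, then the emphatically weighted update.
    delta = reward + gamma_next * theta @ x_next - theta @ x
    theta = theta + alpha * delta * e
    return theta, e, F
```

Setting the emphasis M to 1 everywhere recovers ordinary off-policy TD(λ), while a constant interest of 1 with λ = 0 gives the one-step update scaled purely by the follow-on trace.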

Not surprisingly, emphatic weightings can also be used in the control setting for actor-critic algorithms. Graves et al. (2021) [2] developed ACE, an online off-policy actor-critic algorithm with emphatic weightings. For a given fixed target policy, the emphatic weightings can be estimated using the follow-on trace from [1], which yields an unbiased gradient update. Alternatively, for a changing target policy, the emphatic weightings can be estimated directly (at the cost of introducing some bias) by parameterizing the follow-on function. This introduces additional meta-parameters that need to be adjusted depending on what type of TD update is performed on the parameters of the follow-on function.
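
For a fixed target policy, I picture the emphatically weighted actor update roughly as in the sketch below. This is a simplification in the spirit of [2], with my own names and with the critic's TD error standing in for the advantage signal; it is not the paper's pseudocode.

```python
import numpy as np

def ace_actor_step(w, F, grad_log_pi, delta, rho, rho_prev, gamma,
                   interest, alpha_actor):
    """Sketch of an emphatically weighted actor update for a fixed target
    policy, in the spirit of ACE (Graves et al. 2021). The names and the use
    of the critic's TD error `delta` as the advantage signal are my own
    simplifications.

    grad_log_pi   : NumPy gradient of log pi(A_t | S_t; w) w.r.t. the actor params
    rho, rho_prev : importance-sampling ratios pi/mu at steps t and t-1
    """
    # Follow-on trace under the (fixed) target policy, exactly as in ETD.
    F = rho_prev * gamma * F + interest
    # Policy-gradient step, weighted by both the importance-sampling ratio
    # and the emphatic weighting carried by the follow-on trace.
    w = w + alpha_actor * rho * F * delta * grad_log_pi
    return w, F
```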

A well-known problem of ETD is the relatively high variance of its updates, which leads to higher sample complexity and thus slower learning (Ghiassian and Sutton 2021 [3]). Efforts to reduce this variance have been made in several ways. Recently, Guan, Xu, and Liang (2021) [4] developed an algorithm that reinitializes the follow-on trace (and thereby also the emphatic weightings) at a fixed restart period. The authors show a trade-off: frequent restarts give small variance but large bias, while less frequent restarts give larger variance but smaller bias. The method drastically reduces variance compared to ETD, but it is not clear to me how it can be used in an online, incremental fashion, since no updates to the weight vector are performed until the iterative updates to the (follow-on, emphatic, and eligibility) traces are complete.
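
For intuition, the sketch below shows how I picture one iteration of the restart scheme, based mainly on the description above; the exact bookkeeping (what is rolled forward during a period and what is updated at its end) is my guess rather than the paper's pseudocode.

```python
import numpy as np

def per_etd0_iteration(theta, window, alpha, gamma, interest=1.0):
    """Rough sketch of one iteration of periodically restarted ETD(0) in the
    spirit of PER-ETD (Guan, Xu & Liang 2021): the follow-on trace is
    reinitialized, rolled forward over a window of recent transitions, and
    only then is a single emphasis-weighted TD(0) update applied to theta.

    window : list of (x, x_next, reward, rho) transitions whose length is the
             restart period (a shorter window means more frequent restarts:
             less variance, more bias).
    """
    F, rho_prev = 0.0, 1.0
    for x, x_next, reward, rho in window:
        # Build up the follow-on trace; theta is left untouched here.
        F = gamma * rho_prev * F + interest
        rho_prev = rho
    # Single emphasis-weighted TD(0) update on the window's final transition.
    x, x_next, reward, rho = window[-1]
    delta = reward + gamma * theta @ x_next - theta @ x
    return theta + alpha * F * rho * delta * x
```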

Klissarov et al. (2022) [5] proposed a meta-gradient method to adaptively learn the interest function for ETD methods. This is in contrast to most methods, which simply set the interest to be uniform over all states, a choice that is unlikely to be appropriate in a large, complex environment. Note that even a uniformly initialized interest function still leads to states being emphasized differently, because the follow-on trace reflects their visitation frequency. Klissarov et al. (2022) [5] further showed that, by adaptively learning the interest function throughout training, off-policy ETD significantly outperforms competing methods in a high-variance environment. Surprisingly, the experiments suggest that the adaptive emphasis assigns more weight to important states (so-called bottlenecks), which may prove beneficial in conjunction with learning subtasks (Sutton, Machado, et al. 2022 [6]).
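
To illustrate the shape of the idea, here is a toy, one-step-truncated meta-gradient for a linearly parameterized interest function. Everything here, from the sigmoid parameterization to the choice of meta-objective, is my own construction for illustration; the actual method in [5] is considerably more general.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_interest_step(eta, theta, F, x, x_next, reward,
                           rho, rho_prev, gamma, alpha, beta):
    """Toy sketch of learning the interest function by a meta-gradient,
    truncated to a single step. The interest is a sigmoid of a linear
    function of the features, the inner update is emphatic TD(0), and the
    meta-objective (squared TD error after the update, on the same
    transition) is an assumption of mine, not the objective used in the
    paper."""
    # Interest in the current state, produced by the meta-parameters eta.
    i = sigmoid(eta @ x)
    # Follow-on trace and emphatically weighted TD(0) update of theta.
    F = rho_prev * gamma * F + i
    delta = reward + gamma * theta @ x_next - theta @ x
    theta_new = theta + alpha * F * rho * delta * x
    # Meta-objective: squared TD error evaluated with the updated weights.
    delta_after = reward + gamma * theta_new @ x_next - theta_new @ x
    # Truncated chain rule:
    #   dL/deta = (dL/dtheta_new . dtheta_new/dF) * dF/di * di/deta
    dL_dtheta = 2.0 * delta_after * (gamma * x_next - x)
    dtheta_dF = alpha * rho * delta * x
    di_deta = i * (1.0 - i) * x
    eta = eta - beta * (dL_dtheta @ dtheta_dF) * di_deta
    return eta, theta_new, F
```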


  1. Sutton, R. S., Mahmood, A. R., & White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1), 2603–2631.

  2. Graves, E., Imani, E., Kumaraswamy, R., & White, M. (2021). Off-policy actor-critic with emphatic weightings. arXiv preprint arXiv:2111.08172.

  3. Ghiassian, S., & Sutton, R. S. (2021). An empirical comparison of off-policy prediction learning algorithms in the four rooms environment. arXiv preprint arXiv:2109.05110.

  4. Guan, Z., Xu, T., & Liang, Y. (2021). PER-ETD: A polynomially efficient emphatic temporal difference learning method. arXiv preprint arXiv:2110.06906.

  5. Klissarov, M., Fakoor, R., Mueller, J. W., Asadi, K., Kim, T., & Smola, A. J. (2022). Adaptive interest for emphatic reinforcement learning. Advances in Neural Information Processing Systems, 35, 95–108.

  6. Sutton, R. S., Machado, M. C., Holland, G. Z., Timbers, D. S. F., Tanner, B., & White, A. (2022). Reward-respecting subtasks for model-based reinforcement learning. arXiv preprint arXiv:2202.03466.