Reinforcement learning with unsupervised auxiliary tasks

Published July 23, 2023
Title: Reinforcement learning with unsupervised auxiliary tasks
Authors: Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
Link: https://arxiv.org/pdf/1611.05397.pdf

What


The paper examines the idea of maximizing “pseudo-rewards” derived from various environment signals, in addition to the main reward. The objective is to improve parts of the agent’s feature representation, which in turn helps the main policy learn more effectively from rewarding states.

Why


To maximize cumulative reward in an environment, the agent must construct a sufficiently rich feature representation of the states that yield reward. This can be slow when the reward is sparse, i.e. zero in most states. To tackle this issue, the authors suggest giving the agent additional tasks to master alongside maximizing the total reward.

How


TL;DR: Learn to control an environment with sparse rewards by developing a rich feature representation derived from signals other than the main reward. To accomplish this, several auxiliary tasks are proposed with the aim of refining the feature representation. The promise is that solving these auxiliary tasks will ultimately enable more effective learning of the main policy once the reward becomes available.

These auxiliary tasks are defined by pseudo-reward functions. The goal is to maximize the pseudo-rewards and main reward simultaneously.

$$ \begin{equation} \argmax_{\bm{\theta}} \mathbb{E}_{\pi(\bm{\theta})} [ R_{1:\infty} ] + \lambda_c \sum_{c \in \mathcal{C}} \mathop{\mathbb{E}}_{\pi_c(\bm{\theta})} [R^{(c)}_{1:\infty}] \end{equation} $$

$ R^{(c)}_{t:t+n} \stackrel{.}{=} \sum_{k=1}^{n} \gamma^k r^{(c)}_{t+k} $: discounted n-step return for auxiliary reward $r^{(c)}$.

$\bm{\theta}$: shared weight vector for $\pi$ and all auxiliary policies $\pi_c$.

The objective in Eq. 1 is optimized via an off-policy n-step Q-learning loss [1]. By design, the auxiliary policies are trained in parallel from the single stream of experience generated by the base agent, although this is not strictly necessary.

$$ \mathcal{L}^{(c)}_Q \stackrel{.}{=} \mathop{\mathbb{E}}[(R^{(c)}_{t:t+n} + \gamma^n \max_{a^\prime} Q^{(c)}(s^\prime, a^\prime, \bm{\theta}^{-}) - Q^{(c)}(s, a, \bm{\theta}))^2] $$

$\bm{\theta}^{-}$: target network parameters that are held fixed rather than optimized. They are used to compute the targets for optimizing $\bm{\theta}$ at iteration $i$, and are updated (by copying $\bm{\theta}$) at a slower frequency than $\bm{\theta}$ [2].
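
As a concrete illustration, here is a minimal PyTorch sketch of this loss, not the authors’ implementation: `q_net` and `q_target` are assumed to be a Q-head and its frozen copy for one auxiliary task, and the rollout tensors are assumptions about how the experience is batched.

```python
import torch
import torch.nn.functional as F

def aux_nstep_q_loss(q_net, q_target, states, actions, pseudo_rewards,
                     next_state, gamma=0.99):
    """Off-policy n-step Q-learning loss for one auxiliary control task.

    states:         tensor [n, ...]  observations s_t ... s_{t+n-1}
    actions:        tensor [n]       actions taken in those states
    pseudo_rewards: tensor [n]       pseudo-rewards received after each action
    next_state:     tensor [...]     bootstrap state s'
    """
    n = pseudo_rewards.shape[0]
    with torch.no_grad():
        # max_a' Q^{(c)}(s', a'; theta^-) from the frozen target network
        bootstrap = q_target(next_state.unsqueeze(0)).max(dim=1).values.squeeze(0)

    # Backward recursion for R^{(c)}_{t:t+n} = sum_{k=1..n} gamma^k r^{(c)}_{t+k},
    # bootstrapped with gamma^n * max_a' Q^{(c)}(s', a'; theta^-),
    # giving an n-step target for every timestep in the segment.
    returns = torch.empty_like(pseudo_rewards)
    acc = bootstrap
    for k in reversed(range(n)):
        acc = gamma * (pseudo_rewards[k] + acc)
        returns[k] = acc

    # Q^{(c)}(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.view(-1, 1).long()).squeeze(1)
    return F.mse_loss(q_sa, returns)
```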

 

The authors examine two kinds of tasks: control auxiliary tasks and prediction auxiliary tasks.

Control tasks

For control tasks, they propose pixel control and feature control. Both are additional policies trained to maximize the accumulation of their respective pseudo-rewards. The two auxiliary policies share network parameters (ConvNet and LSTM) with the main policy $\pi$.

Pixel control (PC)
  • The pseudo-reward is based on the change in pixel intensity between two consecutive frames.
  • Concretely, the center crop of the input image is divided into a grid of non-overlapping 4x4 pixel cells. For each cell, the pseudo-reward is the absolute difference between the cell’s average intensity in two consecutive frames (see the sketch below).
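
A rough NumPy sketch of such a pseudo-reward, assuming an 80x80 center crop and 4x4 cells (the crop size is an assumption, not a value taken from the paper):

```python
import numpy as np

def pixel_control_rewards(prev_frame, frame, cell=4, crop=80):
    """Pixel-control pseudo-rewards: one reward per cell of a grid laid over
    the central crop of the observation.

    prev_frame, frame: arrays of shape [H, W, C].
    Returns an array of shape [crop // cell, crop // cell].
    """
    h, w = frame.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    a = frame[top:top + crop, left:left + crop].astype(np.float32)
    b = prev_frame[top:top + crop, left:left + crop].astype(np.float32)

    def cell_means(x):
        # Average over channels, then over each non-overlapping cell.
        g = x.mean(axis=-1)
        g = g.reshape(crop // cell, cell, crop // cell, cell)
        return g.mean(axis=(1, 3))

    # Absolute change of each cell's average intensity between frames.
    return np.abs(cell_means(a) - cell_means(b))
```
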
Feature control (FC)
  • In feature control, the agent learns to maximally control the activations of certain hidden units of the network.
  • Specifically, the agent learns to control the activations of the second hidden layer of the convolutional visual stream. In principle, the agent could control other parts of the network’s activations as well (sketched below).
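
A speculative PyTorch sketch of this pseudo-reward, assuming the chosen layer’s activations are captured with a forward hook and the reward per unit is the magnitude of its activation change between consecutive steps:

```python
import torch

def layer_activations(conv_net, obs, layer):
    """Capture the activations of a chosen hidden layer (assumed here to be
    the ConvNet's second layer) via a forward hook."""
    captured = {}
    handle = layer.register_forward_hook(
        lambda module, inputs, output: captured.update(out=output))
    with torch.no_grad():
        conv_net(obs)
    handle.remove()
    return captured["out"]

def feature_control_rewards(conv_net, prev_obs, obs, layer):
    # Pseudo-reward per hidden unit: how much its activation changed
    # between two consecutive observations.
    return (layer_activations(conv_net, obs, layer)
            - layer_activations(conv_net, prev_obs, layer)).abs()
```
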
Prediction tasks

For the prediction auxiliary tasks, the agent learns to predict rewards and value functions, again with the aim of improving its feature representation.

Reward prediction (RP)
  • Concretely, given a short sequence of observations, the goal is to predict the reward in the immediately following frame.
  • This task only shapes the feature representation of the ConvNet; no value function is computed to estimate a return.
  • A multi-class cross-entropy loss with three classes (zero, positive, or negative reward) is used. Sequences of frames are sampled in a skewed manner to over-represent frames with positive rewards, which are rare in sparse-reward environments.
  • The LSTM is not involved in this task; instead, a fully-connected layer predicts the immediate reward. The rationale is that an LSTM would focus too much on predicting long-term return from distant states in the experience history (a sketch of this head follows the list).
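
A minimal PyTorch sketch of such a reward-prediction head and the skewed sampling; `feat_dim`, the stack length of 3, and the `replay.rewarding_sequences()` / `replay.random_sequences()` helpers are all assumptions, not the paper’s exact setup.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardPredictionHead(nn.Module):
    """Feed-forward head classifying the next reward as zero / positive /
    negative from a short stack of encoded frames."""

    def __init__(self, feat_dim, stack=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(stack * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))            # classes: zero, positive, negative

    def forward(self, frame_features):       # [batch, stack, feat_dim]
        return self.net(frame_features.flatten(start_dim=1))

def sample_rp_sequence(replay, p_rewarding=0.5):
    """Skewed sampling: with probability p_rewarding draw a sequence ending
    just before a non-zero reward, otherwise a uniformly random one."""
    pool = (replay.rewarding_sequences() if random.random() < p_rewarding
            else replay.random_sequences())
    return random.choice(pool)

def rp_loss(head, frame_features, reward_class):
    # Multi-class cross-entropy against the sign of the next frame's reward.
    return F.cross_entropy(head(frame_features), reward_class)
```
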
Value function replay (VR)
  • A random sequence of past experience (generated by the behavior policy) is resampled from the experience replay buffer, and value-function regression is performed across all timesteps in the sequence, between the state-value estimates and the returns computed from the rewards actually received (sketched below).
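
A minimal PyTorch sketch of this regression, assuming `value_net` and `target_net` share the same architecture and the resampled sequence is batched as below:

```python
import torch
import torch.nn.functional as F

def value_replay_loss(value_net, target_net, states, rewards, gamma=0.99):
    """n-step value regression over a sequence resampled from the replay buffer.

    states:  tensor [n + 1, ...]  observations s_t ... s_{t+n}
    rewards: tensor [n]           rewards received after each of the first n states
    """
    n = rewards.shape[0]
    with torch.no_grad():
        # Bootstrap from the frozen target parameters theta^- at the last state.
        bootstrap = target_net(states[-1].unsqueeze(0)).squeeze()

    # Discounted n-step return target for every timestep in the sequence.
    returns = torch.empty_like(rewards)
    acc = bootstrap
    for k in reversed(range(n)):
        acc = gamma * (rewards[k] + acc)
        returns[k] = acc

    values = value_net(states[:-1]).squeeze(-1)   # V(s_t; theta)
    return F.mse_loss(values, returns)
```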

 

UNREAL Agent

The UNREAL agent is trained within the A3C (asynchronous advantage actor-critic) framework. The designer has the flexibility to choose the RL algorithm and function approximator for each type of task. For instance, in the authors’ experiments, each control task was learned with off-policy n-step Q-learning on top of an LSTM, while the reward prediction auxiliary task used a feed-forward neural network. A CNN encoded the observation frames, and its parameters were shared across all tasks.
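
A rough PyTorch sketch of the shared trunk described above; layer sizes and the 84x84 input assumption are illustrative, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class UnrealTrunk(nn.Module):
    """Shared encoder in the spirit of the UNREAL agent: a small ConvNet whose
    features feed an LSTM used by the A3C policy/value and the control tasks;
    the reward-prediction head would read the ConvNet features directly."""

    def __init__(self, channels=3, num_actions=6, lstm_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        self.lstm = nn.LSTMCell(32 * 9 * 9, lstm_size)   # assumes 84x84 inputs
        self.policy = nn.Linear(lstm_size, num_actions)  # pi(a | s)
        self.value = nn.Linear(lstm_size, 1)             # V(s)

    def forward(self, obs, lstm_state=None):
        feat = self.conv(obs)
        h, c = self.lstm(feat, lstm_state)
        return self.policy(h), self.value(h), (h, c)
```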

$$ \begin{align} \mathcal{L}(\bm{\theta}) &\stackrel{.}{=} \mathcal{L}_{\text{A3C}} + \lambda_\text{VR} \mathcal{L}_{\text{VR}} + \lambda_\text{PC} \sum_{c \in \mathcal{C}}^{} \mathcal{L}_{\text{Q}}^{(c)} + \lambda_{\text{RP}}\mathcal{L}_{\text{RP}} \\ \mathcal{L}_{\text{VR}} &\stackrel{.}{=} \mathop{\mathbb{E}}_{s \sim \pi}[ (R_{t:t+n} + \gamma^n V(s_{t+n+1}, \bm{\theta}^{-}) - V(s_t, \bm{\theta}))^2 ] \\ \mathcal{L}_{\text{A3C}} &\stackrel{.}{=} \mathcal{L}_{\text{VR}} + \mathcal{L}_{\pi} - \mathop{\mathbb{E}}_{s \sim \pi}[\alpha \underbrace{H(\pi(s, \cdot, \bm{\theta}))}_{\text{policy entropy regularization}}] \end{align} $$
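
Putting it together, a minimal sketch of the combined loss from the equation above; the $\lambda$ values are placeholders, not the paper’s tuned coefficients.

```python
def unreal_loss(l_a3c, l_vr, pc_q_losses, l_rp,
                lambda_vr=1.0, lambda_pc=0.01, lambda_rp=1.0):
    """Total UNREAL objective: the A3C loss plus the weighted value-replay,
    pixel-control (one Q-loss per cell/task c), and reward-prediction losses."""
    return (l_a3c
            + lambda_vr * l_vr
            + lambda_pc * sum(pc_q_losses)
            + lambda_rp * l_rp)
```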

Thoughts


  • In the Labyrinth experiments, A3C+PC was superior to A3C+RP. It would be interesting to find out when reward prediction from other signals is better. For this, a recurrent network such as an LSTM, which takes the entire history of experience into account and predicts rewards with a longer-term view, could be beneficial.
  • In these experiments, the auxiliary tasks were carefully chosen with some knowledge of the environment and of the sensory inputs available to the agent. How to let the agent discover its own auxiliary tasks remains an open research question.

 

References


  1. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928-1937). PMLR.

  2. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.