Loss of plasticity in continual deep reinforcement learning

Published October 30, 2023
Title: Loss of plasticity in continual deep reinforcement learning
Authors: Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, Marlos C. Machado
Link: https://arxiv.org/pdf/2303.07507.pdf

What


The study shows that deep reinforcement learning (RL) agents trained continually suffer a significant loss of plasticity. By carefully and thoroughly examining three statistics - weight change, gradient norm, and neural network activations - the authors demonstrate that canonical value-based deep RL methods such as DQN and Rainbow are inadequate for continual learning. The benchmark for continual learning is a sequence of Atari 2600 games.

Why


For many challenging and interesting real-world problems where RL is applied to learn controllers, the environment changes frequently, for example due to changing patterns in the environment or sensory drift of the agent. Learning a good policy in such settings requires systems that can learn continually.

How


To prevent the loss of plasticity caused by the diminishing number of non-zero activations in a neural network during continual training, a simple but highly effective technique is used: replacing Rectified Linear Units (ReLUs) with Concatenated ReLUs (CReLUs).

TL;DR: Replacing ReLU activations with CReLUs mitigates the drastic decrease in non-zero activations during continual learning.

Agent

Concretely, for the RL agent the authors focus on value-based deep RL methods and use DQN 1 and Rainbow 2 (an amalgam of the best DQN variants).

  • Single agent with single neural network and single environment (synchronous training)
Environment

The environment used is a sequence of games from the Atari Learning Environment (ALE) with some minor modifications:

  • They follow the evaluation-protocol recommendations of Machado et al. (2018) 3 for ALE:
    • injecting stochasticity through sticky actions
    • ignoring the lives signal
    • reporting average performance during training

ALE is modified for continual learning. Specifically:

  • Learning on a fixed sequence of games:
    • learn each game for a fixed number of frames (10M to 50M in the experiments), then switch to the next game without resetting the weights or flushing the replay buffer (see the sketch after this list)
    • the replay buffer is deliberately small, so it effectively contains frames from only one game at a time
  • Varying non-stationarity by changing the game mode within a single game:
    • 3 Atari 2600 games (Breakout, Freeway, and Space Invaders)
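
To make the sequential-game protocol concrete, here is a minimal sketch of the training loop; `make_env`, the `agent` interface, and the frame budget are hypothetical placeholders, not the authors' implementation:

```python
from itertools import cycle

def continual_ale_training(agent, make_env, game_sequence, frames_per_visit):
    """Hypothetical sketch of the sequential-game (continual ALE) protocol.

    `make_env` is assumed to build an ALE environment with sticky actions and an
    ignored lives signal; `agent` is a DQN/Rainbow-style agent whose weights and
    replay buffer persist across game switches.
    """
    for game in cycle(game_sequence):            # repeated visits to the same games
        env = make_env(game)
        obs = env.reset()
        for _ in range(frames_per_visit):        # fixed per-visit frame budget
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs, done)  # buffer is never flushed
            agent.learn()                                       # weights are never reset
            obs = env.reset() if done else next_obs
```

The game-mode variant follows the same loop, switching game modes within a single game instead of switching games.
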
Agent performance
The agent’s performance is measured as the average score received per episode over the last 1000 episodes 3.
Statistics used to characterize loss of plasticity

Below I attempt to give a more formal specification of the statistics computed. With $\bm{\theta}$ I denote all the parameters, structured layer-wise, such that indexing by the second subscript gives the parameters at layer $l$. Indexing by the first subscript gives the parameters at visit $\text{v}$. A visit is the sequence of experience an agent encounters in a single game 4.

Weight change
Defined as the per-visit weight change between the network parameters at the start of a visit and at its halfway point (sketched in code at the end of this subsection):
  • per-layer normalization
  • aggregation of all layers using weighted arithmetic mean

$$ \begin{align} f_{\text{weight\_change}}^{\text{v}}(\bm{\theta}^{\text{start\_visit}}_{\text{v}, :}, \bm{\theta}^{\text{halfway\_visit}}_{\text{v}, :}) \stackrel{.}{=} \\ \frac{\sum_{l \in \text{layers}} | \bm{\theta}_{\text{v}, l} | \, \frac{ \Vert \bm{\theta}^{\text{start\_visit}}_{\text{v}, l} - \bm{\theta}^{\text{halfway\_visit}}_{\text{v}, l} \Vert_2 }{ \Vert \bm{\theta}^{\text{start\_visit}}_{\text{1}, l} - \bm{\theta}^{\text{halfway\_visit}}_{\text{1}, l} \Vert_2 }}{\sum_{l \in \text{layers}} | \bm{\theta}_{\text{v}, l} | } \notag \end{align} $$

  • Maintaining plasticity: in a continually changing environment, weights that keep changing indicate that new knowledge keeps being acquired; if the weights stop changing, the agent fails to acquire new knowledge.
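
A minimal NumPy sketch of my reading of Eq. (1); the layer-keyed dictionaries and the `first_visit_change` argument are illustrative assumptions, not the authors' code:

```python
import numpy as np

def weight_change(start, halfway, first_visit_change):
    """Per-visit weight change: the per-layer L2 change between the start and the
    halfway point of a visit, normalized by that layer's first-visit change and
    aggregated with a parameter-count-weighted arithmetic mean."""
    num = den = 0.0
    for layer in start:
        n_params = start[layer].size                            # |theta_{v,l}|
        change = np.linalg.norm(start[layer] - halfway[layer])  # L2 change within the visit
        num += n_params * change / first_visit_change[layer]    # per-layer normalization
        den += n_params
    return num / den
```
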
Loss function
Defined as the average loss over 100 mini-batches (of size 32 here), evaluated with the network parameters from halfway through a visit (sketched below):

$$ \begin{align} f_{\text{loss}}(\bm{\theta}^{\text{halfway\_visit}}_{\text{v}, :}, \text{d}_{\text{batches}}) \stackrel{.}{=} \\ \frac{1}{100} \sum_{\text{b}=0}^{99} \frac{1}{32} \sum_{\text{i}=0}^{31} f_{\text{network}}(\bm{\theta}^{\text{halfway\_visit}}_{\text{v}, :}, \text{d}_{\text{batches}} [ \text{b}, \text{i} ]) \notag \end{align} $$

where $\text{d}_{\text{batches}}$ stacks the sampled mini-batches, so that $\text{d}_{\text{batches}} [ \text{b}, \text{i} ]$ is the $\text{i}$-th example of mini-batch $\text{b}$.

  • Maintaining plasticity: if a game is visited more than once (with other games in between) and the loss on that game keeps decreasing across visits as new knowledge is acquired, then plasticity is maintained.
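
A small sketch of the loss statistic under the same assumptions; `network_loss` is a hypothetical function returning the training loss of one mini-batch (already averaged over its 32 examples):

```python
def average_loss(network_loss, halfway_params, replay_batches):
    """Mean loss over 100 replayed mini-batches, evaluated with the network
    parameters from halfway through the visit (the outer 1/100 average)."""
    per_batch = [network_loss(halfway_params, batch) for batch in replay_batches]
    return sum(per_batch) / len(per_batch)
```
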
Gradient norm
The average norm of the gradients over the same 100 mini-batches used for the loss above, computed with the network parameters from halfway through a visit. As with the weight change in Eq. (1), the per-layer gradient norms are normalized by the corresponding gradient norms from the first visit and then aggregated over layers with a weighted arithmetic mean.
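
A PyTorch-style sketch of how this statistic could be computed; `loss_fn` and the `first_visit_norms` dictionary are assumptions rather than the authors' code:

```python
def gradient_norm(network, loss_fn, replay_batches, first_visit_norms):
    """Average per-layer gradient norm over the 100 mini-batches, normalized by
    the corresponding first-visit norms and aggregated with a
    parameter-count-weighted arithmetic mean."""
    avg_norm = {name: 0.0 for name, _ in network.named_parameters()}
    for batch in replay_batches:              # same mini-batches as for the loss statistic
        network.zero_grad()
        loss_fn(network, batch).backward()    # gradients w.r.t. the halfway parameters
        for name, p in network.named_parameters():
            avg_norm[name] += p.grad.norm().item() / len(replay_batches)
    num = sum(p.numel() * avg_norm[name] / first_visit_norms[name]
              for name, p in network.named_parameters())
    den = sum(p.numel() for _, p in network.named_parameters())
    return num / den
```
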
Activations
The average $\ell_0$ norm of the network's activations is computed at the output of the convolutional layers, the Value network layer, and the Advantage network layer. The activation norms are computed per layer and scaled by the number of units in that layer, i.e. they measure the fraction of non-zero activations. They are averaged over the same 100 mini-batches (used for computing the gradient norm), each averaged over its 32 examples (a sketch follows after this subsection).
  • Maintaining plasticity: if hidden units output zero activations, no gradient flows back through them, so gradients become increasingly diminished. Near-zero or exactly zero gradients mean the weights hardly change, which ultimately results in the failure to acquire new knowledge in S-ALE.
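
A small sketch of the activation statistic; how the layer outputs are captured (e.g. with forward hooks on the convolutional torso and the Value/Advantage layers) is left out and assumed to happen elsewhere:

```python
import torch

def fraction_of_nonzero_activations(layer_output: torch.Tensor) -> float:
    """l0 'norm' of a layer's activations scaled by the number of units: the
    fraction of non-zero activations, averaged over the examples in the batch."""
    active = (layer_output != 0).float()   # 1 where a unit fired, 0 where it stayed silent
    return active.mean().item()            # mean over units and batch examples
```

Averaged over the 100 mini-batches, a value drifting toward zero indicates the growing fraction of dead units described above.
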
Method: Mitigating loss of plasticity with CReLUs

To mitigate the diminishing number of non-zero activations, the authors propose replacing ReLUs with CReLUs. CReLU takes an input, concatenates it with its negation, and applies ReLU to the result (a minimal implementation is sketched after the list below):

$$ \text{CReLU}(x) \stackrel{.}{=} \text{ReLU}( [ x, -x ]^\top) $$

  • CReLU keeps the number of non-zero activations up: for each input, at least one of the two concatenated outputs is non-zero unless the input is exactly zero.
  • CReLU doubles the number of outputs for each input signal $\Rightarrow$ storage and per-step computation are doubled.
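
A minimal PyTorch sketch of a CReLU module, illustrating the definition above rather than reproducing the authors' implementation:

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    """Concatenated ReLU: applies ReLU to [x, -x] along a chosen dimension, so each
    input contributes a non-zero output unless the input is exactly zero."""

    def __init__(self, dim: int = -1):
        super().__init__()
        self.dim = dim   # -1 for dense features, 1 for convolutional channel maps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(torch.cat((x, -x), dim=self.dim))
```

For a dense feature vector of size 512, `CReLU()(x)` returns 1024 features, which is why the parameter count of the following layer doubles under the invariant input dimension discussed next.
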

To make methods with CReLU and ReLU comparable in terms of effective network capacity, the layer input/output sizes need to be controlled in the experiments.

In one approach, the authors keep the input dimension invariant: the number of inputs to each layer is fixed before applying the activation, so the doubled CReLU outputs double the parameter count of the following hidden layer (Fig. 1).

[Figure: invariant_input_dimension_crelu]
Fig. 1: An invariant input dimension doubles the parameters of a hidden layer with CReLU activations.

Thoughts


  • According to what criteria was the game sequence for S-ALE selected?
  • ReLU is the most established activation function for Rainbow and DQN agents. How much loss of plasticity would other activation functions, such as SELU, exhibit?
  • Varying the number of frames per visit (from 10M, 20M up to 50M) did not make a big difference and there was still loss of plasticity.

References


  1. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533. ↩︎

  2. Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., … & Silver, D. (2018, April). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1). ↩︎

  3. Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., & Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61, 523-562. ↩︎ ↩︎

  4. Multiple visits to the same game are possible with zero or more games between two consecutive visits. ↩︎