Loss of Plasticity in Deep Continual Learning

Published October 29, 2023
Title: Loss of Plasticity in Deep Continual Learning
Authors: Shibhansh Dohare, J. Fernando Hernandez-Garcia, Parash Rahman, Richard S. Sutton, A. Rupam Mahmood
Link: https://arxiv.org/pdf/2306.13812v2.pdf

What


This paper presents an extensive and systematic empirical study showing that standard deep learning systems fail to keep learning in a continual learning setting. It explains some of the core causes of this loss of plasticity and proposes a natural extension of the backpropagation algorithm that can reliably maintain plasticity for continual learning.

Why


Most deep learning systems are trained once on a dataset and then deployed in the real world. However, such systems should clearly be able to keep learning in a changing environment. In practice, when a system degrades because the data it was trained on no longer matches the world, it is retrained from scratch, which is computationally inefficient, especially when the world changes frequently.

The most feasible solution is for the systems to learn continually, but standard deep learning systems, with various network architectures, activation functions, optimizers, batch normalization, and dropout, fail to do so.

How


Continual learning involves both retaining what was learned before (not catastrophically forgetting) and being able to learn new things (maintaining plasticity). The paper addresses the second issue: do deep-learning systems lose plasticity in continual learning problems? The definitive answer is yes for continual supervised learning problems.

TL;DR: The paper proposes continual backpropagation (CBP), a natural extension of the standard backpropagation algorithm to the continual setting: it performs stochastic gradient descent on each example and extends initialization to all time steps by replacing a small fraction of less useful units, selected by a utility measure. CBP maintains plasticity for continual learning and is robust to hyper-parameter tuning.

To study loss of plasticity, the authors measure three essential properties (a minimal sketch of how they could be computed follows the list):

  • number of dead units: if a hidden unit’s output is zero or close to zero for all examples, then that unit is essentially useless $\Rightarrow$ the lower the better.
  • weight magnitude: large weights can lead to exploding gradients $\Rightarrow$ the lower the better.
  • effective rank: for a network layer, the effective rank of the weight matrix captures how many of the layer’s units contribute to its output $\Rightarrow$ the higher the rank the better.
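
The following is a minimal sketch, not the paper’s code, of how these three quantities could be measured for a single hidden layer, assuming NumPy arrays of activations (examples × units) and a weight matrix; the dead-unit threshold `eps` is an illustrative choice, and the effective rank follows the entropy-of-singular-values definition.

```python
import numpy as np

def dead_unit_fraction(h, eps=1e-6):
    """Fraction of units whose output is (near) zero on every example.

    h: activations of shape (num_examples, num_units); eps is an
    illustrative threshold, not the paper's exact criterion.
    """
    return np.mean(np.all(np.abs(h) < eps, axis=0))

def mean_weight_magnitude(W):
    """Average absolute weight of a layer; large values hint at unstable updates."""
    return np.mean(np.abs(W))

def effective_rank(W):
    """Effective rank as the exponential of the entropy of the normalized
    singular values: higher means more units contribute to the layer's output."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return float(np.exp(entropy))
```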

CBP extends standard backpropagation by continuing initialization throughout training: after each per-example gradient-descent step, a small fraction of weights in every layer can be reinitialized.

A fraction of the hidden units with the lowest utility are considered for replacement. The utility measures both a unit’s contribution to its consumers (the units that take its output as input) and its adaptability. To ensure that newly reinitialized units are not replaced immediately, a maturity threshold gives them time to specialize.

CBP has two new hyper-parameters:

  • replacement rate $\rho$ (float)
  • maturity threshold $M$ (int)

CBP adds only constant extra memory and computation per step, since it does not grow the number of units over time.
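
To make the per-step flow concrete, here is a rough sketch of the outer loop, assuming a stream of (x, y) examples and hypothetical helpers `sgd_step`, `update_utilities`, and `replace_low_utility_units` (the latter two are sketched after the utility equations below); this is an illustration, not the authors’ implementation.

```python
rho = 1e-4                 # replacement rate: fraction of mature units replaced per step
maturity_threshold = 100   # minimum age before a unit becomes eligible for replacement

for x, y in example_stream:                # online: one example at a time
    sgd_step(net, x, y)                    # ordinary per-example backprop/SGD update
    for layer in net.hidden_layers:        # constant extra work per step
        update_utilities(layer, x)         # maintain running-average utilities and unit ages
        replace_low_utility_units(layer, rho, maturity_threshold)
```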

The utility measure is built from the instantaneous contribution $y$, its running average $u$, and bias-corrected estimates $\hat{u}$ and $\hat{f}$ (with $f$ a running average of the unit’s output and $a_{l,i,t}$ the unit’s age):

$$
\begin{align}
y_{l, i, t} &\stackrel{.}{=} \frac{ | h_{l, i, t} - \hat{f}_{l, i, t} | \cdot \sum_{k=1}^{n_{l+1}} | w_{l, i, k, t} | }{\sum_{j=1}^{n_{l-1}} | w_{l-1, j, i, t} | }, \\
u_{l, i, t} &\stackrel{.}{=} \eta \cdot u_{l, i, t-1} + (1 - \eta) \cdot y_{l, i, t}, \\
\hat{u}_{l, i, t} &\stackrel{.}{=} \frac{u_{l, i, t-1}}{1 - \eta^{a_{l, i, t}}}, \\
\hat{f}_{l, i, t} &\stackrel{.}{=} \frac{f_{l, i, t-1}}{1 - \eta^{a_{l, i, t}}}, \\
f_{l, i, t} &\stackrel{.}{=} \eta \cdot f_{l, i, t-1} + (1 - \eta) \cdot h_{l, i, t}
\end{align}
$$
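
Below is a sketch of one step of these updates for a single hidden layer, using NumPy and my own variable names (`h`: the layer’s outputs on the current example, `W_in`/`W_out`: incoming and outgoing weight matrices, `f`/`u`: the running averages, `age`: steps since each unit was last initialized); this is not the authors’ code.

```python
import numpy as np

def update_utilities(h, W_in, W_out, f, u, age, eta=0.99):
    """One per-example update of the contribution utility for a layer's units.

    h:     (n_units,) activations on the current example
    W_in:  (n_inputs, n_units) incoming weights; W_out: (n_units, n_outputs)
    f, u:  running averages of outputs and utilities; age: steps since (re)init
    Returns the bias-corrected utility used to rank units for replacement.
    """
    age += 1
    f_hat = f / (1.0 - eta ** age)                          # bias-corrected mean output
    u_hat = u / (1.0 - eta ** age)                          # bias-corrected utility
    contribution = np.abs(h - f_hat) * np.abs(W_out).sum(axis=1)
    y = contribution / (np.abs(W_in).sum(axis=0) + 1e-12)   # instantaneous utility
    u[:] = eta * u + (1.0 - eta) * y                        # running-average utility
    f[:] = eta * f + (1.0 - eta) * h                        # running-average output
    return u_hat
```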

When units are selected for replacement in some layer $l$, their input weights are reinitialized from the initial distribution $d_l$ and their output weights are set to zero.
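
A matching sketch of the replacement step, under the same naming assumptions: the fractional replacement count is accumulated across steps, the lowest-utility mature units are selected, their input weights are resampled (a uniform distribution stands in for $d_l$ here, which is an assumption), and their outgoing weights and statistics are reset.

```python
def replace_low_utility_units(W_in, W_out, u_hat, u, f, age, state,
                              rho=1e-4, maturity_threshold=100,
                              init_scale=0.05, rng=None):
    """Selectively reinitialize the lowest-utility mature units of one layer."""
    if rng is None:
        rng = np.random.default_rng()
    mature = np.where(age > maturity_threshold)[0]
    # accumulate the fractional number of units to replace across steps
    state["to_replace"] = state.get("to_replace", 0.0) + rho * len(mature)
    n_replace = int(state["to_replace"])
    if n_replace == 0:
        return
    state["to_replace"] -= n_replace
    worst = mature[np.argsort(u_hat[mature])[:n_replace]]   # lowest-utility mature units
    # resample input weights from the initial distribution d_l (uniform is an assumption)
    W_in[:, worst] = rng.uniform(-init_scale, init_scale,
                                 size=(W_in.shape[0], len(worst)))
    W_out[worst, :] = 0.0   # zero outgoing weights so the rest of the network is undisturbed
    u[worst] = 0.0
    f[worst] = 0.0
    age[worst] = 0
```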

CBP vs. existing deep learning methods for mitigating loss of plasticity

Among the most widely used deep-learning methods for mitigating loss of plasticity, L2 regularization, shrink-and-perturb, dropout, online normalization, and adaptive momentum estimation (Adam), only the first two exhibited minimal loss of plasticity; the others failed to maintain plasticity. CBP was significantly better than all of them in terms of online classification accuracy.

CBP had far fewer dead units and a higher effective rank than the other methods. Shrink-and-perturb had a slightly lower weight magnitude than CBP, but both were low.

Thoughts


  • Why do Adam and Dropout drastically exacerbate loss of plasticity?

    • Dropout is really bad for continual learning. Since there is no purposeful selection involved, what exactly is discarded during this process?
  • To increase confidence in the results and to ensure that the performance drop is not due to some bug in the implementation, one of the authors independently reproduced the results of this experiment.

    • Is this common practice in a research project? If not, it should be!!
  • Online normalization is harder to implement effectively for continual learning than it seems.

  • Although the weight magnitude and number of dead units rose quickly for standard backprop on Online Permuted MNIST, the effective rank dropped at a much slower pace. So, while this is not sustainable in the long term, the deep neural network still managed to learn some robust and powerful deep representations.

  • In deep RL, using experience replay (ER) buffers creates a nearly train-once setting. ER requires a stable environment in which the dynamics do not change frequently; unfortunately, this condition is often not met in continual settings. If the transition dynamics change rapidly, the experience sampled from the buffer may be irrelevant or even harm the agent’s ability to learn.

  • I wonder how plasticity for continual learning will develop further in the RL setting. It seems to me that, for learning new things, some previously learned features are more important than others. Consequently, the agent should prepare for upcoming task changes by predicting what task it might face. With limited computational resources, that preparation would involve deciding which features need to be replaced by new ones, or perhaps by more “distilled” versions of existing features. Based on predictions of which tasks lie ahead, both short-term and long-term, the replacement should then be done with care so that re-learning previous features is easier.