Auxiliary task discovery through generate-and-test

Published December 3, 2023
Title: Auxiliary task discovery through generate-and-test
Authors: Banafsheh Rafiee, Sina Ghiassian, Jun Jin, Richard Sutton, Jun Luo, Adam White
Link: https://proceedings.mlr.press/v232/rafiee23a.html

What


  • This paper proposes a new way to measure how useful auxiliary tasks are by looking at the features they create and how much they contribute to the main task.
  • The test to measure usefulness follows the general generate-and-test framework.
  • In addition to the usefulness tester for auxiliary tasks, a new generator is developed based on the feature-attainment subtasks introduced by Sutton et al. (2023) 1.

Why


  • Previously, auxiliary subtasks have been hand-designed in advance. However, this does not scale: it requires significant domain knowledge in complex environments. Moreover, determining in advance what is essential to learn may not always be possible, and hand-designed subtasks may hinder learning in a non-stationary continual learning setting.

How


TL;DR: Auxiliary subtasks (defined as GVFs) are discovered by continually generating them, either randomly or through feature attainment 1, and testing them so that the less useful ones (judged by their feature values and weight contributions to the main task) are replaced.

  • The paper considers the episodic discounted setting of an agent interacting with an environment. The goal of the agent is to find the policy that maximizes the expected discounted sum of future rewards, as estimated by the state-action value function.
  • They estimate the state-action value function using Q-learning with neural networks (i.e., DQN).
    • A DQN with only one hidden layer is used, together with the standard stabilization methods: a replay buffer, target networks, and the Adam optimizer.
  • Auxiliary subtasks are defined using the general value functions (GVFs) 2 (cf. Sutton & Barto (2018), Sec. 17.1):

$$ \begin{align} v_{\pi, \gamma, c}(s) \stackrel{.}{=} \mathbb{E} \Bigg[ \sum^{\infty}_{k=t} \Bigg( \prod^{k}_{i=t+1} \gamma(S_i) \Bigg) c(S_{k+1}) \Bigg| S_t=s, A_{t:\infty} \sim \pi \Bigg]. \end{align} $$

The solution of a subtask is a dedicated policy $\pi$ that maximizes the corresponding value function; the cumulant $c(\cdot)$ and stopping function $\gamma(\cdot)$ are fixed (a minimal numeric sketch of this return follows below).
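
As a worked illustration of Equation 1, here is a minimal numpy sketch (the function name, toy trajectory, cumulant, and stopping function are my own choices, not from the paper) that computes a Monte Carlo estimate of a GVF value along a single trajectory.

```python
import numpy as np

def gvf_return(states, cumulant, gamma):
    """Monte Carlo estimate of the GVF value in Equation 1 along one
    trajectory: sum_k ( prod_{i=t+1..k} gamma(S_i) ) * c(S_{k+1})."""
    g, discount = 0.0, 1.0
    for k in range(len(states) - 1):
        g += discount * cumulant(states[k + 1])   # c(S_{k+1})
        discount *= gamma(states[k + 1])          # accumulate the product of gamma(S_i)
    return g

# Toy usage: cumulant = first observation component, constant stopping function.
trajectory = [np.array([0.1, 0.0]), np.array([0.4, 0.2]), np.array([0.9, 0.1])]
print(gvf_return(trajectory, cumulant=lambda s: s[0], gamma=lambda s: 0.9))
```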

  • The discovery and curation of auxiliary subtasks are implemented using the generate-and-test method introduced by Mahmood and Sutton (2013) 3. Here, two generators are developed:

    • The first generator generates auxiliary subtasks randomly from the space of input observations.
    • The second generator, called the feature-attainment generator, generates auxiliary subtasks from the space of features, specifically from the target features induced by the main task.
  • For the tester, a criterion for the usefulness of an auxiliary subtask is developed; it is used to replace the less useful auxiliary tasks with newly initialized ones. Below is the definition of the utility of an auxiliary task:

$$ \begin{align} u(\text{aux}^{i})_t &\stackrel{.}{=} \sum_{k \in F^{i}} u_{k, t} \\ u_{k, t} &\stackrel{.}{=} \bar{f}_{k, t-1} \cdot \sum_{a \in \mathcal{A}} | w^{\text{main}}_{k, a, t} | \\ \bar{f}_{k, t} &\stackrel{.}{=} (1 - \tau)~\bar{f}_{k, t-1} + \tau f_{k, t}, \qquad \bar{f}_{k, -1} \stackrel{.}{=} 0, ~~ \forall k, ~~ 0 \le \tau \le 1 \end{align} $$

where $F^i$ is the set of features induced by the $i$-th auxiliary task, $f_{k, t}$ is the activation of feature $k$ at time $t$, and $w^{\text{main}}_{k, a, t}$ is the output weight connecting feature $k$ to the main task's action value for action $a$. A minimal sketch of this computation follows below.
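
Below is a minimal numpy sketch of the tester's utility computation in Equations 2 to 4; the array shapes, function names, and the trace step-size value are my own choices for illustration, not the paper's.

```python
import numpy as np

def update_traces(f_bar, f, tau=0.01):
    """Equation 4: exponential moving average (trace) of feature activations."""
    return (1 - tau) * f_bar + tau * f

def feature_utilities(f_bar_prev, w_main):
    """Equation 3: per-feature utility = activation trace times the summed
    magnitude of the feature's outgoing main-task weights.

    f_bar_prev: shape (num_features,)              -- the trace at time t-1
    w_main:     shape (num_features, num_actions)  -- main-task output weights
    """
    return f_bar_prev * np.abs(w_main).sum(axis=1)

def aux_task_utility(feature_set, u_features):
    """Equation 2: an auxiliary task's utility is the sum of the utilities
    of the features it induces (feature_set plays the role of F^i)."""
    return u_features[list(feature_set)].sum()
```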

  • For a complete specification of the generate-and-test algorithm for auxiliary subtask discovery, see Algorithm 1 in the paper.

  • The paper presents a new feed-forward neural network architecture inspired by Javed, White, and Sutton (2021)4 that better aligns each subtask’s learning effect with the input by carefully controlling the gradient flow within the network.

  • Here, the network has one hidden layer, followed by a network head for each auxiliary subtask and one head for the main task. The hidden layer is called the representation layer since it is the last layer before the output heads.

    • For brevity, I’ll call the main task and auxiliary subtasks “tasks”.
    • In the hidden layer, the units are features, and they are grouped by task for the purposes of gradient backpropagation.
    • Each task receives input from all features during the forward pass.
    • During the backward pass, each group of features passes gradients back to the input only from its corresponding task’s network head. This makes it clear which features are influenced by which task (a minimal sketch of this masking follows this list).
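
Below is a minimal PyTorch sketch of this masked forward/backward structure; the class name, arguments, and sizes are illustrative assumptions, not the paper's code. Each head reads all hidden features in the forward pass, but `detach` blocks gradients so that only a head's own feature group is shaped by its loss (the heads' output weights still learn over all features).

```python
import torch
import torch.nn as nn

class MaskedHeadNetwork(nn.Module):
    """One shared representation layer with per-task feature groups."""

    def __init__(self, num_inputs, features_per_task, num_tasks, num_actions):
        super().__init__()
        self.features_per_task = features_per_task
        self.hidden = nn.Linear(num_inputs, features_per_task * num_tasks)
        # heads[0] is the main task; the rest are auxiliary subtasks.
        self.heads = nn.ModuleList(
            nn.Linear(features_per_task * num_tasks, num_actions)
            for _ in range(num_tasks)
        )

    def forward(self, x):
        phi = torch.relu(self.hidden(x))                     # all features
        groups = phi.split(self.features_per_task, dim=-1)   # one group per task
        outputs = []
        for task, head in enumerate(self.heads):
            # Forward: the head sees every feature.
            # Backward: gradients reach the hidden layer only through this
            # task's own feature group; the other groups are detached.
            masked = torch.cat(
                [g if i == task else g.detach() for i, g in enumerate(groups)],
                dim=-1,
            )
            outputs.append(head(masked))
        return outputs

# The figures' example: 2 inputs, 3 tasks (main + 2 aux), 2 features per task
# (6 hidden features in total), and 1 action value per task.
net = MaskedHeadNetwork(num_inputs=2, features_per_task=2, num_tasks=3, num_actions=1)
q_main, q_aux1, q_aux2 = net(torch.randn(1, 2))
```
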
The search space of the feature-attainment generator is the features of the main task

  • Here, the features to be maximized are only the ones induced by the main task, which are referred to as target features.
  • At each replacement time, the tester picks the target features with the highest utility scores according to Equation 3.
    • A subtask is assigned to each of the picked target features; its subgoal is to maximize that target feature’s value, after which the subtask’s GVF policy stops.
    • The remaining auxiliary subtasks, i.e., those belonging to low-utility target features, are reinitialized with new input and output weights (a minimal sketch of this replacement step follows this list).
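
The sketch below illustrates one replacement step of the tester with the feature-attainment generator: rank the main task's target features by the per-feature utility of Equation 3, keep the subtasks attached to the top-ranked features, and mark the rest for reinitialization. The function name, the replacement-fraction argument, and the return convention are my own assumptions, not the paper's.

```python
import numpy as np

def replacement_step(u_target, assigned_feature, replace_fraction=0.5):
    """One replacement step for the feature-attainment generator.

    u_target:         (num_target_features,) utilities of the main task's
                      features (Equation 3); this is the generator's search space.
    assigned_feature: list mapping auxiliary subtask i -> the target feature
                      it currently maximizes.
    Returns indices of subtasks to keep and of subtasks to reinitialize
    (new input/output weights, reassigned to a high-utility target feature).
    """
    num_picked = max(1, int((1 - replace_fraction) * len(u_target)))
    picked = set(np.argsort(u_target)[::-1][:num_picked])   # highest utility first
    keep = [i for i, k in enumerate(assigned_feature) if k in picked]
    reinit = [i for i, k in enumerate(assigned_feature) if k not in picked]
    return keep, reinit
```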

Below are two scenarios illustrating the forward-backward pass of the feature-attainment generator: Figure 1 for the main task and Figure 2 for the “purple” auxiliary task.

In this example, there are two input features and six hidden features in total, with two features per task and one action value per task.

Fig. 1 (image: fwd-bwd-main-task): Performing the forward and backward pass for the main task of the feature-attainment generator.
Fig. 2 (image: fwd-bwd-aux-task): Performing the forward and backward pass for the 'purple' auxiliary task of the feature-attainment generator, which maximizes the 'purple' feature of the main task.

In Figure 2, I color-coded the main task’s features to match those of the auxiliary task that is being learned. The purple feature has the highest utility score, so an auxiliary task is assigned to maximize it by finding the best-performing policy. However, the feature that the green auxiliary task is supposed to maximize is not among the highest-scoring features, so the green auxiliary task is re-initialized. The replacement factor is 0.5, so only the highest-scoring half of the target features are assigned subtasks to maximize.
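
To ground the figure discussion numerically, here is a tiny calculation with invented numbers (all values are illustrative, not from the paper): the purple target feature has both a larger activation trace and larger main-task weights, so its utility under Equation 3 dominates, and with a replacement factor of 0.5 only it keeps a subtask.

```python
import numpy as np

# Invented traces and main-task output weights for the two target features
# ("purple" first, "green" second); one action value for the main task.
f_bar = np.array([0.8, 0.2])              # feature traces (Equation 4)
w_main = np.array([[1.5], [0.3]])         # main-task output weights

u = f_bar * np.abs(w_main).sum(axis=1)    # Equation 3: [1.2, 0.06]
picked = np.argsort(u)[::-1][:1]          # replacement factor 0.5 -> pick 1 of 2
print(picked)                             # [0]: the purple feature keeps a subtask;
                                          # the green subtask is reinitialized.
```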

Thoughts


  • The paper is well written.
    • I appreciate the explanation provided for using the specific environments, and the clarity of the experimental section, where the subsections are titled with the conclusions drawn from each experiment.
  • The foundation of the paper is generic and based on the generate-and-test framework, allowing for many potential ways to extend and validate the work on complex environments. One possible way to expand is to apply it to a recurrent network design.
  • The auxiliary task utility function is a strong heuristic. How can it be (meta-)learned?
  • Generating auxiliary tasks using feature-attainment is an experiential5 practice of utilizing past experiences to explain internal agent changes instead of relying on external objective elements.

References


  1. Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers, F., Tanner, B., & White, A. (2023). Reward-respecting subtasks for model-based reinforcement learning. Artificial Intelligence, 324, 104001. ↩︎ ↩︎

  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press. ↩︎

  3. Mahmood, A. R., & Sutton, R. S. (2013). Representation Search through Generate and Test. In AAAI Workshop on Learning Rich Representations from Low-Level Sensors. ↩︎

  4. Javed, K., White, M., & Sutton, R. (2021). Scalable online recurrent learning using columnar neural networks. arXiv preprint arXiv:2103.05787. ↩︎

  5. Sutton, R. (2022). The Increasing Role of Sensorimotor Experience in Artificial Intelligence, Incomplete Ideas (blog). ↩︎