Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines

Published November 23, 2023
Title: Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines
Authors: Michael Schenk, ElĂ­as F. Combarro, Michele Grossi, Verena Kain, Kevin Shing Bruce Li, Mircea-Marian Popa, and Sofia Vallecorsa
Link: https://iopscience.iop.org/article/10.1088/2058-9565/ad261b/meta

What


The authors of this paper explore whether free energy-based reinforcement learning (FERL) with quantum Boltzmann machines (QBMs) is more sample efficient than classical RL algorithms in continuous action spaces. Previous research has already shown that FERL-QBM algorithms are more sample efficient than classical RL algorithms in environments with discrete action spaces.
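As a rough intuition for how FERL defines Q-values, here is a minimal sketch assuming the classical limit of the approach: a clamped restricted Boltzmann machine whose negative free energy serves as the Q-value estimate. The paper's QBMs additionally include a transverse field and are sampled via (simulated) quantum annealing, which this toy example does not capture.

```python
import numpy as np

def q_value(state, action, W, b_hidden):
    """Q(s, a) ~ -F(s, a): negative free energy of an RBM whose visible
    layer is clamped to the concatenated (state, action) vector."""
    v = np.concatenate([state, action])        # clamped visible units
    pre = b_hidden + W.T @ v                   # hidden pre-activations
    # Closed-form free energy for binary hidden units (visible bias omitted):
    # F(v) = -sum_j log(1 + exp(b_j + sum_i v_i W_ij))
    free_energy = -np.sum(np.logaddexp(0.0, pre))
    return -free_energy

# Toy usage: 1-D state and action, 8 hidden units, random couplings.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 8))
b = np.zeros(8)
print(q_value(np.array([0.3]), np.array([-0.7]), W, b))
```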

Why


CERN currently tunes some of its control systems manually. This paper examines the potential of quantum-based RL algorithms to train sample-efficient agents that can control these systems, specifically for proton beam line steering. Training with fewer samples reduces the beam time spent on tuning the accelerator, saving energy and freeing the machine for other experiments.

How


TL;DR: The first study compared classical deep Q-learning with a FERL-QBM-based algorithm for discrete action and continuous state spaces. The FERL-QBM approach was found to be more sample efficient.
The second study compares the classical DDPG algorithm (an actor-critic extension of deep Q-learning to continuous action spaces) with a newly developed hybrid actor-critic algorithm, which combines a FERL-QBM-based critic with a classical DDPG-style actor. The hybrid algorithm was found to be more sample efficient.
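A structural sketch of such a hybrid actor-critic loop is shown below. Everything here is an assumption made for illustration: the critic is a stand-in callable playing the role of the FERL-QBM critic, the actor is a tiny linear-tanh policy, and the actor update uses a finite-difference estimate of dQ/da rather than the paper's exact update rule; the critic's own TD update is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, ACTION_DIM, HIDDEN = 10, 10, 16

def critic_q(state, action, W):
    """Stand-in for the FERL-QBM critic: negative free energy of a clamped
    Boltzmann machine (classical closed form, as in the sketch above)."""
    v = np.concatenate([state, action])
    return np.sum(np.logaddexp(0.0, W.T @ v))

class LinearActor:
    """Deterministic policy a = tanh(A s), the classical DDPG-style actor."""
    def __init__(self):
        self.A = rng.normal(scale=0.1, size=(ACTION_DIM, STATE_DIM))
    def act(self, state):
        return np.tanh(self.A @ state)

actor = LinearActor()
W = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM, HIDDEN))

for step in range(200):
    state = rng.uniform(-1.0, 1.0, STATE_DIM)
    action = actor.act(state) + rng.normal(scale=0.1, size=ACTION_DIM)  # exploration noise
    # (Critic update with a TD target would happen here; omitted for brevity.)
    # Actor update: push the policy towards actions the critic values more,
    # using a finite-difference estimate of dQ/da.
    grad_a, eps = np.zeros(ACTION_DIM), 1e-3
    for i in range(ACTION_DIM):
        e = np.zeros(ACTION_DIM)
        e[i] = eps
        grad_a[i] = (critic_q(state, action + e, W)
                     - critic_q(state, action - e, W)) / (2 * eps)
    a_det = actor.act(state)
    actor.A += 1e-2 * np.outer(grad_a * (1.0 - a_det ** 2), state)  # chain rule through tanh
```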

Study A: FERL Q-learning with continuous state space

In the first study, the state and action spaces are both one-dimensional. Classical Q-learning and FERL-QBM are first evaluated on a discrete state-action space; the comparison is then repeated with a continuous state space (the actions remain discrete).

The two RL methods are compared both with and without an experience replay buffer.
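For reference, a generic experience replay buffer like the one below is what "with replay" refers to; this is a standard textbook sketch, not the paper's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions, sampled uniformly."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```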

The FERL method performed significantly better than deep Q-learning, improving sample efficiency by a factor of about 400. Moreover, the ranking of the two methods was maintained both with and without an experience replay buffer.

Study B: hybrid A-C scheme

The second study evaluated DDPG (a classical actor-critic algorithm) and the hybrid actor-critic (hybrid A-C) method in a 10-dimensional state and action space. The algorithms were trained in simulation and evaluated on the real machine; evaluation performance in simulation was almost identical to that on the real device. Hybrid A-C outperformed DDPG, but the performance advantage was not statistically significant.

QBM-based methods can be more challenging to deploy due to their complexity and hardware dependence. The hybrid actor-critic approach, however, has an advantage over previous FERL-QBM methods: the QBM-based critic is only needed during training. At inference time, only the actor, a classical policy network, is used, which makes deployment considerably easier.
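A minimal sketch of what this means at deployment time, assuming a hypothetical `env` with a reset/step interface and the trained classical actor from the hybrid scheme; the QBM critic never appears here.

```python
def steer(env, actor, episodes=1):
    """Run the trained classical policy on the beam line; no QBM hardware needed."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = actor.act(state)              # classical forward pass only
            state, reward, done = env.step(action) # hypothetical environment interface
```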

Thoughts


  • FERL-QBM-based methods showed better performance in both discrete and continuous state-action spaces. In the continuous case, however, the advantage over the classical A-C method was not statistically significant; further studies are required to determine whether the result holds in other environments.
  • It would be intriguing to test SAC on the beam line steering task in the AWAKE environment and compare it to hybrid A-C (with SAC-based actor).
  • CrossQ is a recently proposed method [1] that replaces target networks in DQN-based RL algorithms with batch normalization to improve sample efficiency.
    • Can CrossQ be combined with FERL in a hybrid A-C scheme, and how would it compare to its classical counterpart? (See the sketch after this list.)
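Below is a hedged sketch of the CrossQ-style critic update based on the cited preprint: batch normalization inside the critic, no target network, and a single joint forward pass over current and next state-action pairs so the batch statistics stay consistent. The network sizes and the `actor` callable are illustrative assumptions (PyTorch).

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 10, 10

# Critic with batch normalization; note there is no separate target network.
critic = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 1),
)

def td_loss(batch, actor, gamma=0.99):
    s, a, r, s_next, done = batch
    a_next = actor(s_next)
    # Single joint forward pass over (s, a) and (s', a') so the batch-norm
    # statistics cover both; the bootstrap target uses the same online critic.
    q_all = critic(torch.cat([torch.cat([s, a], dim=1),
                              torch.cat([s_next, a_next], dim=1)], dim=0))
    q, q_next = q_all.chunk(2, dim=0)
    target = r + gamma * (1.0 - done) * q_next.detach()
    return nn.functional.mse_loss(q, target)
```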

References


  1. Bhatt, A., Palenicek, D., Belousov, B., Argus, M., Amiranashvili, A., Brox, T., & Peters, J. (2019). CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity. arXiv preprint arXiv:1902.05605.