A key aspect of intelligence is the ability to demonstrate a broad spectrum of behaviors for adapting to unexpected situations. Over the past decade, advances in deep reinforcement learning have led to groundbreaking achievements in solving complex continuous control tasks. However, most approaches return only a single solution specialized for a specific problem.
We introduce Quality-Diversity Actor-Critic (QDAC), an off-policy actor-critic deep reinforcement learning algorithm that leverages a value function critic and a successor features critic to learn high-performing and diverse behaviors. In this framework, the actor optimizes an objective that seamlessly unifies both critics using constrained optimization to (1) maximize return, while (2) executing diverse skills.
Compared with other Quality-Diversity methods, QDAC achieves significantly higher performance and more diverse behaviors on six challenging continuous control locomotion tasks. We also demonstrate that the learned skills can be harnessed to adapt to five perturbed environments better than other baselines. Finally, qualitative analyses showcase a range of remarkable behaviors.
We formalize Quality-Diversity optimization as a constrained optimization problem. We aim to learn a skill-conditioned policy that (1) maximizes the expected return, (2) subject to the constraint that the expected features match the desired skill.
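As a concrete sketch (using assumed notation rather than the paper's exact formulation), the problem can be written as follows, where z is a skill from a skill space Z, r the reward, φ the features, and γ the discount factor; the (1 − γ) normalization, which keeps the expected discounted features in the same space as the skill, is an assumption of this sketch.

\[
\forall z \in \mathcal{Z}: \quad \max_{\pi(\cdot \mid z)} \; \mathbb{E}\!\left[\sum_{t \geq 0} \gamma^{t}\, r(s_t, a_t)\right] \quad \text{subject to} \quad \mathbb{E}\!\left[(1 - \gamma) \sum_{t \geq 0} \gamma^{t}\, \phi(s_t)\right] = z
\]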
We introduce an actor-critic method that leverages two critics: a performance critic (i.e., a value function) to optimize (1), and a behavior critic (i.e., successor features) to optimize (2).
The actor optimizes an objective that seamlessly unifies both critics using constrained optimization to (1) maximize return, while (2) executing diverse skills.
In this work, we introduce a novel Quality-Diversity algorithm fully formalized as an actor-critic method. This approach leverages both a performance critic and a behavior critic to learn high-performing and diverse behaviors. Within this framework, the actor optimizes an objective that seamlessly integrates these two critics with a Lagrange multiplier, using constrained optimization to (1) maximize return and (2) execute diverse skills.
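To illustrate how the two critics can be combined, below is a minimal, hypothetical sketch of such a Lagrangian actor objective in Python (PyTorch). All names, signatures, and the dual update with tolerance epsilon are illustrative assumptions, not the paper's exact implementation.

import torch

def actor_loss(states, skills, policy, q_critic, sf_critic, log_lambda):
    # Sketch of an actor objective unifying both critics with a Lagrange multiplier.
    actions = policy(states, skills)                      # skill-conditioned actions
    q = q_critic(states, actions, skills).squeeze(-1)     # performance critic: expected return
    psi = sf_critic(states, actions, skills)              # behavior critic: successor features
    violation = torch.linalg.norm(psi - skills, dim=-1)   # distance to the desired skill
    lam = log_lambda.exp().detach()                       # non-negative multiplier, fixed for the actor step
    # Lagrangian relaxation: maximize return while penalizing constraint violations.
    return (-q + lam * violation).mean()

def multiplier_loss(violation, log_lambda, epsilon=0.1):
    # Dual update sketch: grow lambda when the violation exceeds a tolerance epsilon, shrink it otherwise.
    return -(log_lambda.exp() * (violation.detach() - epsilon)).mean()

In practice, one would alternate gradient steps on the actor loss and on the multiplier loss, alongside standard critic updates.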
We show that our approach is competitive with traditional Quality-Diversity methods. Quantitative results demonstrate that QDAC is competitive in adaptation tasks, while qualitative analyses reveal a range of diverse and remarkable behaviors.
Most Quality-Diversity methods determine the skill (i.e., descriptor) after the episode has terminated, resulting in a backward-looking approach. In contrast, we introduce an innovative forward-looking approach that leverages successor features to predict the skill to be executed by the policy. This is crucial, as it allows the successor features to act as a critic, evaluating the policy's actions to ensure the execution of the desired skill.
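To make the forward-looking aspect concrete, here is a hedged sketch of a standard successor-features temporal-difference target, under which the behavior critic learns to predict the features the policy will accumulate from the current state onward; the (1 − γ) normalization, the batch field names, and the network signatures are assumptions of this sketch.

import torch
import torch.nn.functional as F

def sf_critic_loss(sf_critic, target_sf_critic, policy, batch, gamma=0.99):
    # Hypothetical TD loss for the successor features critic (behavior critic).
    with torch.no_grad():
        next_actions = policy(batch["next_states"], batch["skills"])
        psi_next = target_sf_critic(batch["next_states"], next_actions, batch["skills"])
        # Bellman target for successor features: immediate features plus discounted future features.
        target = (1.0 - gamma) * batch["next_features"] + gamma * psi_next
    psi = sf_critic(batch["states"], batch["actions"], batch["skills"])
    return F.mse_loss(psi, target)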
Furthermore, like the vast majority of Quality-Diversity algorithms, QDAC uses a manually defined diversity measure to guide the diversity search towards relevant behaviors. An exciting direction for future work would be to combine QDAC with an unsupervised method to discover task-agnostic skills.
@inproceedings{airl2024qdac,
title={Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics},
author={Grillotti, Luca and Faldor, Maxence and González León, Borja and Cully, Antoine},
booktitle={International Conference on Machine Learning},
year={2024},
organization={PMLR}
}