Quality-Diversity Actor-Critic:
Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics

*Equal Contribution, 1Imperial College London, 2Iconic AI
ICML 2024

QDAC is a quality-diversity reinforcement learning algorithm that discovers high-performing and diverse skills.

Abstract

A key aspect of intelligence is the ability to demonstrate a broad spectrum of behaviors for adapting to unexpected situations. Over the past decade, advancements in deep reinforcement learning have led to groundbreaking achievements in solving complex continuous control tasks. However, most approaches return only one solution specialized for a specific problem.

We introduce Quality-Diversity Actor-Critic (QDAC), an off-policy actor-critic deep reinforcement learning algorithm that leverages a value function critic and a successor features critic to learn high-performing and diverse behaviors. In this framework, the actor optimizes an objective that seamlessly unifies both critics using constrained optimization to (1) maximize return, while (2) executing diverse skills.

Compared with other Quality-Diversity methods, QDAC achieves significantly higher performance and more diverse behaviors on six challenging continuous control locomotion tasks. We also demonstrate that we can harness the learned skills to adapt better than other baselines to five perturbed environments. Finally, qualitative analyses showcase a range of remarkable behaviors.

Method

We formalize Quality-Diversity optimization as a constrained optimization problem. We aim to learn a skill-conditioned policy that (1) maximizes the expected return, (2) subject to the constraint that the expected features converge to the desired skill.
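Schematically, the constrained problem can be written as follows. This is a simplified rendering, not the paper's exact notation: here φ denotes the feature function, z the desired skill, and ε a hypothetical tolerance on the constraint.

```latex
\max_{\pi} \; \mathbb{E}_{\pi(\cdot \mid z)}\!\left[\sum_{t} \gamma^{t} \, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\left\lVert \, \mathbb{E}_{\pi(\cdot \mid z)}\!\left[(1 - \gamma) \sum_{t} \gamma^{t} \, \phi(s_t)\right] - z \, \right\rVert \le \epsilon
```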


We introduce an actor-critic method that leverages two critics: a performance critic (i.e., a value function) to optimize (1), and a behavior critic (i.e., a successor features function) to optimize (2).
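The two critics can be sketched as follows. This is a minimal illustration with linear function approximators and made-up dimensions, not the paper's network architecture: the performance critic outputs a scalar value, while the behavior critic outputs one prediction per feature dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, SKILL_DIM = 8, 3  # hypothetical dimensions, for illustration only


class ValueCritic:
    """Performance critic: estimates the expected return V(s, z)."""

    def __init__(self):
        self.w = rng.normal(scale=0.1, size=OBS_DIM + SKILL_DIM)

    def __call__(self, obs, skill):
        # Scalar value of the current state under the commanded skill.
        return np.concatenate([obs, skill]) @ self.w


class SuccessorFeaturesCritic:
    """Behavior critic: estimates the expected discounted sum of
    features psi(s, z), one output per feature/skill dimension."""

    def __init__(self):
        self.W = rng.normal(scale=0.1, size=(SKILL_DIM, OBS_DIM + SKILL_DIM))

    def __call__(self, obs, skill):
        # Vector of predicted successor features, comparable to the skill.
        return self.W @ np.concatenate([obs, skill])


obs = rng.normal(size=OBS_DIM)
skill = np.array([1.0, 0.0, 0.0])
v = ValueCritic()(obs, skill)                 # scalar: expected return
psi = SuccessorFeaturesCritic()(obs, skill)   # vector: predicted features
```

The key design point is that both critics are conditioned on the skill, so the same actor can be evaluated for return and for skill execution under every commanded behavior.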



QDAC seamlessly unifies performance and behavior critics using constrained optimization to (1) maximize return, while (2) executing diverse skills.
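The unification of the two critics via a Lagrange multiplier can be sketched as follows. This is an illustrative, simplified objective under assumed names (`actor_objective`, `update_multiplier`, tolerance `epsilon`), not the paper's exact update rules.

```python
import numpy as np


def actor_objective(value, successor_features, skill, lagrange_multiplier):
    """Lagrangian-style objective: maximize return while penalizing the
    gap between predicted expected features and the desired skill."""
    constraint_gap = np.linalg.norm(successor_features - skill)
    return value - lagrange_multiplier * constraint_gap


def update_multiplier(lagrange_multiplier, successor_features, skill,
                      epsilon=0.1, lr=0.01):
    """Dual ascent on the multiplier: it grows when the constraint is
    violated (gap > epsilon) and shrinks otherwise, clipped at zero."""
    gap = np.linalg.norm(successor_features - skill) - epsilon
    return max(0.0, lagrange_multiplier + lr * gap)
```

When the skill constraint is satisfied, the penalty vanishes and the actor purely maximizes return; when it is violated, the growing multiplier shifts the trade-off toward executing the commanded skill.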

Tasks

Humanoid - Jump

Humanoid Jump Archive

Humanoid - Angle

Humanoid Angle Archive

Humanoid - Feet Contact

Humanoid Feet Contact Archive

Walker - Feet Contact

Walker Feet Contact Archive

Ant - Feet Contact

Ant Feet Contact Archive

Ant - Velocity

Ant Velocity Archive

Conclusion

In this work, we introduce a novel Quality-Diversity algorithm fully formalized as an actor-critic method. This approach leverages both a performance critic and a behavior critic to learn high-performing and diverse behaviors. Within this framework, the actor optimizes an objective that seamlessly integrates these two critics with a Lagrange multiplier, using constrained optimization to (1) maximize return and (2) execute diverse skills.

We show that our approach is competitive with traditional Quality-Diversity methods. Quantitative results demonstrate that QDAC is competitive in adaptation tasks, while qualitative analyses reveal a range of diverse and remarkable behaviors.

Most Quality-Diversity methods determine the skill (i.e., descriptor) only after the episode terminates, resulting in a backward-looking approach. In contrast, we introduce an innovative forward-looking approach that leverages successor features to predict the skill to be executed by the policy. This is crucial, as it allows the successor features to act as a critic, evaluating the policy's actions to ensure the execution of the desired skill.
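The forward-looking prediction rests on the standard one-step bootstrap for successor features, which can be sketched as below. The function name is hypothetical; the recursion itself, psi(s, a) = phi(s) + gamma * psi(s', a'), is the standard successor-features Bellman identity.

```python
import numpy as np


def successor_features_target(features, next_psi, gamma=0.99):
    """One-step bootstrap target for the successor features critic:
    the current features plus the discounted prediction at the next step,
    i.e. psi(s, a) ~ phi(s) + gamma * psi(s', a')."""
    return features + gamma * next_psi
```

Training the behavior critic toward this target at every step is what lets it evaluate, online, whether the policy's actions are on track to realize the commanded skill.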

Furthermore, like the vast majority of Quality-Diversity algorithms, QDAC uses a manually defined diversity measure to guide the diversity search towards relevant behaviors. An exciting direction for future work would be to combine QDAC with an unsupervised method to discover task-agnostic skills.

BibTeX

@inproceedings{airl2024qdac,
	title={Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics},
	author={Grillotti, Luca and Faldor, Maxence and González León, Borja and Cully, Antoine},
	booktitle={International Conference on Machine Learning},
	year={2024},
	organization={PMLR}
}