## Abstract

Consider the problem of training robustly capable agents. One approach is to generate a diverse collection of agent policies. Training can then be viewed as a quality diversity (QD) optimization problem, where we search for a collection of performant policies that are diverse with respect to quantified behavior. Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available. However, agent policies typically assume that the environment is not differentiable. To apply DQD algorithms to training agent policies, we must approximate gradients for performance and behavior. We propose two variants of the current state-of-the-art DQD algorithm that compute gradients via approximation methods common in reinforcement learning (RL). We evaluate our approach on four simulated locomotion tasks. One variant achieves results comparable to the current state-of-the-art in combining QD and RL, while the other performs comparably in two locomotion tasks. These results provide insight into the limitations of current DQD algorithms in domains where gradients must be approximated. Source code is available at https://github.com/icaros-usc/dqd-rl

## 1 Introduction

We focus on the problem of extending differentiable quality diversity (DQD) to reinforcement learning (RL) domains. We propose to approximate gradients for the objective and measure functions, resulting in two variants of the DQD algorithm CMA-MEGA [18].

Consider a half-cheetah agent (Fig. 2) trained for locomotion, where the agent must continue walking forward even when one foot is damaged. If we frame this challenge as an RL problem, two approaches to design a robustly capable agent would be to (1) design a reward function and (2) apply domain randomization [58, 47]. However, prior work [29, 8] suggests that designing such a reward function is difficult, while domain randomization may require manually selecting hundreds of environment parameters [47, 44].

As an alternative approach, consider that we have intuition on what behaviors would be useful for adapting to damage. For instance, we can measure how often each foot is used during training, and we can pre-train a collection of policies that are diverse in how the agent uses its feet. When one of the agent’s feet is damaged during deployment, the agent can adapt to the damage by selecting a policy that did not move the damaged foot during training [13, 9].

[Fig. 2: Foot contact proportions for two example policies, e.g. Front Foot 5.8% / Back Foot 96.8% for one policy and Front Foot 83.9% / Back Foot 11.1% for another.]

Pre-training such a collection of policies may be viewed as a quality diversity (QD) optimization problem [49, 13, 40, 9]. Formally, QD assumes an objective function $f$ and one or more measure functions ${\bm{m}}$. The goal of QD is to find solutions satisfying all output combinations of ${\bm{m}}$, i.e. moving different combinations of feet, while maximizing each solution’s $f$, i.e. walking forward quickly. Most QD algorithms treat $f$ and ${\bm{m}}$ as black boxes, but recent work [18] proposes differentiable quality diversity (DQD), which assumes $f$ and ${\bm{m}}$ are differentiable functions with exact gradient information. QD algorithms have been applied to procedural content generation [25], robotics [13, 40], aerodynamic shape design [22], and scenario generation in human-robot interaction [20, 17].

The recently proposed DQD algorithm CMA-MEGA [18] outperforms QD algorithms by orders of magnitude when exact gradients are available, such as when searching the latent space of a generative model. However, RL problems like the half-cheetah lack these gradients because the environment is typically non-differentiable, thus limiting the applicability of DQD. To address this limitation, we draw inspiration from how evolution strategies (ES) [1, 60, 51, 39] and deep RL actor-critic methods [53, 54, 38, 21] optimize a reward objective by approximating gradients for gradient descent. Our key insight is to approximate objective and measure gradients for DQD algorithms by adapting ES and actor-critic methods.

Our work makes three contributions. (1) We formalize the problem of quality diversity for reinforcement learning (QD-RL) and reduce it to an instance of DQD. (2) We develop two QD-RL variants of the DQD algorithm CMA-MEGA, where each algorithm approximates objective and measure gradients with a different combination of ES and actor-critic methods. (3) We benchmark our variants on four PyBullet locomotion tasks from QDGym [15, 43]. One variant performs comparably (in terms of QD score; Sec. 5.1.3) to the state-of-the-art PGA-MAP-Elites [42] in two tasks. The other variant achieves QD score comparable to PGA-MAP-Elites in all tasks[^1] but is less efficient than PGA-MAP-Elites in two tasks.

[^1]: We note that the performance of CMA-MEGA is worse than PGA-MAP-Elites in two of the tasks, albeit within variance. We consider it likely that additional runs would result in PGA-MAP-Elites performing significantly better in these tasks. We leave further evaluation for future work.

These results contrast with prior work [18] where CMA-MEGA vastly outperforms a DQD algorithm inspired by PGA-MAP-Elites on benchmark functions where gradient information is available. Overall, we shed light on the limitations of CMA-MEGA in QD domains where the main challenge comes from optimizing the objective rather than from exploring measure space. At the same time, since we decouple gradient estimates from QD optimization, our work opens a path for future research that would benefit from independent improvements to either DQD or RL.

## 2 Problem Statement

### 2.1 Quality Diversity (QD)

We adopt the definition of QD from prior work [18]. For a solution vector ${\bm{\phi}}\in\mathbb{R}^{n}$, QD considers an objective function $f({\bm{\phi}})$ and $k$ measures $m_{i}({\bm{\phi}})\in\mathbb{R}$ (for $i\in 1..k$) or, as a joint measure, ${\bm{m}}({\bm{\phi}})\in\mathbb{R}^{k}$ (prior work refers to measure function outputs as “behavior characteristics,” “behavior descriptors,” or “feature descriptors”). These measures form a $k$-dimensional measure space $\mathcal{X}$. For every ${\bm{x}}\in\mathcal{X}$, the QD objective is to find a solution ${\bm{\phi}}$ such that ${\bm{m}}({\bm{\phi}})={\bm{x}}$ and $f({\bm{\phi}})$ is maximized. Since $\mathcal{X}$ is continuous, solving this problem exactly would require infinite memory, so algorithms in the MAP-Elites family [40, 13] discretize $\mathcal{X}$ by forming a tessellation $\mathcal{Y}$ consisting of $M$ cells. Thus, we relax the QD problem to one of searching for an archive $\mathcal{A}$ consisting of $M$ elites ${\bm{\phi}}_{i}$, one for each cell in $\mathcal{Y}$. Then, the QD objective is to maximize the performance $f({\bm{\phi}}_{i})$ of all elites:

$$\max_{{\bm{\phi}}_{1..M}}\;\sum_{i=1}^{M}f({\bm{\phi}}_{i})\tag{1}$$

#### 2.1.1 Differentiable Quality Diversity (DQD)

In DQD, we assume $f$ and ${\bm{m}}$ are first-order differentiable. We denote the objective gradient as ${\bm{\nabla}}f({\bm{\phi}})$, or abbreviated as ${\bm{\nabla}}f$, and the measure gradients as ${\bm{\nabla}}{\bm{m}}({\bm{\phi}})$ or ${\bm{\nabla}}{{\bm{m}}}$.

### 2.2 Quality Diversity for Reinforcement Learning (QD-RL)

We define QD-RL as an instance of the QD problem in which each solution ${\bm{\phi}}$ parameterizes an RL policy $\pi_{\bm{\phi}}$. Then, the objective $f({\bm{\phi}})$ is the expected discounted return of $\pi_{\bm{\phi}}$, and the measures ${\bm{m}}({\bm{\phi}})$ are functions of $\pi_{\bm{\phi}}$. Formally, drawing on the Markov Decision Process (MDP) formulation [55], we represent QD-RL as a tuple $(\mathcal{S},\mathcal{U},p,r,\gamma,{\bm{m}})$. On discrete timesteps $t$ in an episode of interaction, an agent observes state $s\in\mathcal{S}$ and takes action $a\in\mathcal{U}$ according to a policy $\pi_{\bm{\phi}}(a|s)$. The agent then receives scalar reward $r(s,a)$ and observes next state $s^{\prime}\in\mathcal{S}$ according to $s^{\prime}\sim p(\cdot|s,a)$. Each episode thus has a trajectory $\xi=\{s_{0},a_{0},s_{1},a_{1},\ldots,s_{T}\}$, where $T$ is the number of timesteps in the episode, and the probability that policy $\pi_{\bm{\phi}}$ takes trajectory $\xi$ is $p_{\bm{\phi}}(\xi)=p(s_{0})\prod_{t=0}^{T-1}\pi_{\bm{\phi}}(a_{t}|s_{t})\,p(s_{t+1}|s_{t},a_{t})$. Now, we define the expected discounted return of policy $\pi_{\bm{\phi}}$ as

$$f({\bm{\phi}})=\mathbb{E}_{\xi\sim p_{\bm{\phi}}}\left[\sum_{t=0}^{T}\gamma^{t}r(s_{t},a_{t})\right]\tag{2}$$

where the discount factor $\gamma\in(0,1)$ trades off between short- and long-term rewards. Finally, we quantify the behavior of policy $\pi_{\bm{\phi}}$ via a $k$-dimensional measure function ${\bm{m}}({\bm{\phi}})$.
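In practice, Eq. 2 is estimated from sampled trajectories by summing discounted rewards over each episode. A minimal single-episode sketch (the function name is ours, not from the paper's implementation):

```python
def discounted_return(rewards, gamma):
    """Single-episode estimate of the expected discounted return in Eq. 2."""
    total = 0.0
    # Accumulate backward so that G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```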

#### 2.2.1 QD-RL as an instance of DQD

We reduce QD-RL to a DQD problem. Since the exact gradients ${\bm{\nabla}}f$ and ${\bm{\nabla}}{\bm{m}}$ usually do not exist in QD-RL, we must instead approximate them.

## 3 Background

### 3.1 Single-Objective Reinforcement Learning

We review algorithms which train a policy to maximize a single objective, i.e. $f({\bm{\phi}})$ in Eq. 2, with the goal of applying these algorithms’ gradient approximations to DQD in Sec. 4.

#### 3.1.1 Evolution strategies (ES)

ES [4] is a class of evolutionary algorithms which optimizes the objective by sampling a population of solutions and moving the population towards areas of higher performance. Natural Evolution Strategies (NES) [60, 61] is a type of ES which updates the sampling distribution of solutions by taking steps on distribution parameters in the direction of the natural gradient [2]. For example, with a Gaussian sampling distribution, each iteration of an NES would compute natural gradients to update the mean ${\bm{\mu}}$ and covariance ${\bm{\Sigma}}$.

We consider an NES-inspired algorithm [51] which has demonstrated success in RL domains. This algorithm, which we refer to as OpenAI-ES, samples $\lambda_{es}$ solutions from an isotropic Gaussian but only computes a gradient step for the mean ${\bm{\phi}}$. Each solution sampled by OpenAI-ES is represented as ${\bm{\phi}}+\sigma{\bm{\epsilon}}_{i}$, where $\sigma$ is the fixed standard deviation of the Gaussian and ${\bm{\epsilon}}_{i}\sim{\mathcal{N}}(\mathbf{0},{\bm{I}})$. Once these solutions are evaluated, OpenAI-ES estimates the gradient as

$${\bm{\nabla}}f({\bm{\phi}})\approx\frac{1}{\lambda_{es}\sigma}\sum_{i=1}^{\lambda_{es}}f({\bm{\phi}}+\sigma{\bm{\epsilon}}_{i}){\bm{\epsilon}}_{i}\tag{3}$$

OpenAI-ES then passes this estimate to an Adam optimizer [32] which outputs a gradient ascent step for ${\bm{\phi}}$. To make the estimate more accurate, OpenAI-ES further includes techniques such as mirror sampling and rank normalization [5, 26, 60].
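A sketch of this estimator with mirror sampling and rank normalization folded in; note that rank normalization replaces the raw objective values of Eq. 3 with centered ranks, as in OpenAI-ES practice. The function name and signature are ours:

```python
import numpy as np

def openai_es_gradient(f, phi, sigma=0.05, lam=100, rng=None):
    """Estimate the objective gradient at phi in the style of Eq. 3.

    f: black-box objective mapping a parameter vector to a scalar.
    lam: number of perturbations (must be even; half are mirrored).
    """
    rng = np.random.default_rng(rng)
    half = lam // 2
    eps = rng.standard_normal((half, len(phi)))
    eps = np.concatenate([eps, -eps])  # mirror sampling
    fitnesses = np.array([f(phi + sigma * e) for e in eps])
    # Rank normalization: map raw objectives to centered ranks in [-0.5, 0.5].
    ranks = np.empty(len(fitnesses))
    ranks[np.argsort(fitnesses)] = np.arange(len(fitnesses))
    weights = ranks / (len(fitnesses) - 1) - 0.5
    return (eps.T @ weights) / (len(eps) * sigma)
```

The resulting estimate would then be handed to an Adam optimizer rather than applied directly.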

#### 3.1.2 Actor-critic methods

While ES treats the objective as a black box, actor-critic methods leverage the MDP structure of the objective, i.e. the fact that $f({\bm{\phi}})$ is a sum of Markovian values. We are most interested in Twin Delayed Deep Deterministic policy gradient (TD3) [21], an off-policy actor-critic method. TD3 maintains (1) an actor consisting of the policy $\pi_{\bm{\phi}}$ and (2) a critic consisting of state-action value functions $Q_{{\bm{\theta}}_{1}}(s,a)$ and $Q_{{\bm{\theta}}_{2}}(s,a)$ which differ only in random initialization. Through interactions in the environment, the actor generates experience which is stored in a replay buffer $\mathcal{B}$. This experience is sampled to train $Q_{{\bm{\theta}}_{1}}$ and $Q_{{\bm{\theta}}_{2}}$. Simultaneously, the actor improves by maximizing $Q_{{\bm{\theta}}_{1}}$ via gradient ascent ($Q_{{\bm{\theta}}_{2}}$ is only used during critic training). Specifically, for an objective $f^{\prime}$ which is based on the critic and approximates $f$, TD3 estimates a gradient ${\bm{\nabla}}f^{\prime}({\bm{\phi}})$ and passes it to an Adam optimizer. Notably, for stability, TD3 computes training targets not with the current networks but with target networks $\pi_{{\bm{\phi}}^{\prime}}$, $Q_{{\bm{\theta}}^{\prime}_{1}}$, $Q_{{\bm{\theta}}^{\prime}_{2}}$, which accumulate the current weights via an exponentially weighted moving average with update rate $\tau$.
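The target-network update is a Polyak average. A minimal sketch with weights represented as plain floats for brevity (real implementations iterate over network parameter tensors):

```python
def soft_update(target_params, params, tau):
    """Exponentially weighted moving average of network weights:
    each target weight moves a fraction tau toward the online weight."""
    return [(1 - tau) * t + tau * p for t, p in zip(target_params, params)]
```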

### 3.2 Quality Diversity Algorithms

#### 3.2.1 MAP-Elites extensions for QD-RL

One of the simplest QD algorithms is MAP-Elites [40, 13]. MAP-Elites creates an archive $\mathcal{A}$ by tessellating the measure space $\mathcal{X}$ into a grid of evenly-sized cells. Then, it draws $\lambda$ initial solutions from a multivariate Gaussian $\mathcal{N}({\bm{\phi}}_{0},\sigma{\bm{I}})$ centered at some ${\bm{\phi}}_{0}$. Next, for each sampled solution ${\bm{\phi}}$, MAP-Elites computes $f({\bm{\phi}})$ and ${\bm{m}}({\bm{\phi}})$ and inserts ${\bm{\phi}}$ into $\mathcal{A}$. In subsequent iterations, MAP-Elites randomly selects $\lambda$ solutions from $\mathcal{A}$ and adds Gaussian noise, i.e. solution ${\bm{\phi}}$ becomes ${\bm{\phi}}+\mathcal{N}(\mathbf{0},\sigma{\bm{I}})$. Solutions are placed into cells based on their measures; if a solution has higher $f$ than the solution currently in its cell, it replaces that solution. Once inserted into $\mathcal{A}$, solutions are known as elites.
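The archive insertion rule above can be sketched with a dictionary keyed by cell index (names are ours; pyribs provides a full-featured implementation):

```python
def insert_solution(archive, cell, phi, objective):
    """MAP-Elites archive insertion: phi becomes the elite of its cell if the
    cell is empty or phi outperforms the current elite. Returns True on insertion."""
    elite = archive.get(cell)
    if elite is None or objective > elite[1]:
        archive[cell] = (phi, objective)
        return True
    return False
```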

Due to the high dimensionality of neural network parameters, direct policy optimization with MAP-Elites has not proven effective in QD-RL [9], although indirect encodings have been shown to scale to large policy networks [50, 23]. For direct search, several extensions merge MAP-Elites with actor-critic methods and ES. For instance, Policy Gradient Assisted MAP-Elites (PGA-MAP-Elites) [42] combines MAP-Elites with TD3. Each iteration, PGA-MAP-Elites evaluates $\lambda$ solutions for insertion into the archive. $\frac{\lambda}{2}$ of these are created by selecting random solutions from the archive and taking gradient ascent steps with a TD3 critic. The other $\frac{\lambda}{2}$ solutions are created with a directional variation operator [59] which selects two solutions ${\bm{\phi}}_{1}$ and ${\bm{\phi}}_{2}$ from the archive and creates a new one according to ${\bm{\phi}}^{\prime}={\bm{\phi}}_{1}+\sigma_{1}\mathcal{N}(\mathbf{0},{\bm{I}})+\sigma_{2}({\bm{\phi}}_{2}-{\bm{\phi}}_{1})\mathcal{N}(0,1)$. Finally, PGA-MAP-Elites maintains a “greedy actor” which provides actions when training the critics (identically to the actor in TD3). Every iteration, PGA-MAP-Elites inserts this greedy actor into the archive. PGA-MAP-Elites achieves state-of-the-art performance on locomotion tasks in the QDGym benchmark [43].
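The directional variation operator combines isotropic Gaussian noise with a random step along the line between two elites. A sketch assuming numpy arrays (the function name is ours):

```python
import numpy as np

def directional_variation(phi1, phi2, sigma1, sigma2, rng):
    """Directional variation: isotropic Gaussian noise around phi1 plus a
    random perturbation along the line from phi1 toward phi2."""
    iso = sigma1 * rng.standard_normal(phi1.shape)
    line = sigma2 * (phi2 - phi1) * rng.standard_normal()
    return phi1 + iso + line
```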

Another MAP-Elites extension is ME-ES [9], which combines MAP-Elites with an OpenAI-ES optimizer. In the “explore-exploit” variant, ME-ES alternates between two phases. In the “exploit” phase, ME-ES restarts OpenAI-ES at a mean ${\bm{\phi}}$ and optimizes the objective for $k$ iterations, inserting the current ${\bm{\phi}}$ into the archive in each iteration. In the “explore” phase, ME-ES repeats this process, but OpenAI-ES instead optimizes for novelty, where novelty is the distance in measure space from a new solution to previously encountered solutions. ME-ES also has an “exploit” variant and an “explore” variant, which each execute only one type of phase.

Our work is related to ME-ES in that we also adapt OpenAI-ES, but instead of alternating between following a novelty gradient and objective gradient, we compute all objective and measure gradients and allow a CMA-ES [28] instance to decide which gradients to follow by sampling gradient coefficients from a multivariate Gaussian updated over time (Sec. 3.2.2). We include MAP-Elites, PGA-MAP-Elites, and ME-ES as baselines in our experiments. Refer to Fig. 3 for a diagram which compares these algorithms to our approach.

#### 3.2.2 Covariance Matrix Adaptation MAP-Elites via a Gradient Arborescence (CMA-MEGA)

We directly extend CMA-MEGA [18] to address QD-RL. CMA-MEGA is a DQD algorithm based on the QD algorithm CMA-ME [19]. The intuition behind CMA-MEGA is that if we knew which direction the current solution point ${\bm{\phi}}^{*}$ should move in objective-measure space, then we could calculate that change in search space via a linear combination of objective and measure gradients. From CMA-ME, we know a good direction is one that results in the largest archive improvement.

Each iteration, CMA-MEGA first calculates objective and measure gradients for a solution point ${\bm{\phi}}^{*}$. Next, it generates $\lambda$ new solutions by sampling gradient coefficients ${\bm{c}}\sim\mathcal{N}({\bm{\mu}},{\bm{\Sigma}})$ and computing ${\bm{\phi}}^{\prime}\leftarrow{\bm{\phi}}^{*}+c_{0}{\bm{\nabla}}f({\bm{\phi}}^{*})+\sum_{j=1}^{k}c_{j}{\bm{\nabla}}m_{j}({\bm{\phi}}^{*})$. CMA-MEGA inserts these solutions into the archive and computes their improvement, $\Delta$. $\Delta$ is defined as $f({\bm{\phi}}^{\prime})$ if ${\bm{\phi}}^{\prime}$ populates a new cell, and $f({\bm{\phi}}^{\prime})-f({\bm{\phi}}^{\prime}_{\mathcal{E}})$ if ${\bm{\phi}}^{\prime}$ improves an existing cell (replaces a previous solution ${\bm{\phi}}^{\prime}_{\mathcal{E}}$). After CMA-MEGA inserts the solutions, it ranks them by $\Delta$. If a solution populates a new cell, its $\Delta$ always ranks higher than that of a solution which only improves an existing cell. CMA-MEGA then moves the solution point ${\bm{\phi}}^{*}$ towards the largest archive improvement, but also adapts the distribution $\mathcal{N}({\bm{\mu}},{\bm{\Sigma}})$ towards better gradient coefficients by the same ranking. By leveraging gradient information, CMA-MEGA solves QD benchmarks with orders of magnitude fewer solution evaluations than previous QD algorithms.
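The branching step can be sketched as a matrix operation, assuming numpy arrays (the function name is ours, not from the reference implementation):

```python
import numpy as np

def branch_solutions(phi_star, grad_f, grad_m, mu, cov, lam, rng):
    """CMA-MEGA branching: sample gradient coefficients c ~ N(mu, cov) and
    form lam candidates phi' = phi* + c_0 grad_f + sum_j c_j grad_m_j."""
    grads = np.vstack([grad_f, grad_m])                   # (1 + k, n)
    coeffs = rng.multivariate_normal(mu, cov, size=lam)   # (lam, 1 + k)
    return phi_star + coeffs @ grads                      # (lam, n)
```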

#### 3.2.3 Beyond MAP-Elites

Several QD-RL algorithms have been developed outside the MAP-Elites family. NS-ES [11] builds on Novelty Search (NS) [35, 36], a family of QD algorithms which add solutions to an unstructured archive only if they are far away from existing archive solutions in measure space. Using OpenAI-ES, NS-ES concurrently optimizes several agents for novelty. Its variants NSR-ES and NSRA-ES optimize for a linear combination of novelty and objective. Meanwhile, the QD-RL algorithm [7] (distinct from the QD-RL problem we define) maintains an archive with all past solutions and optimizes agents along a Pareto front of the objective and novelty. Finally, Diversity via Determinants (DvD) [46] leverages a kernel method to maintain diversity in a population of solutions. As NS-ES, QD-RL, and DvD do not output a MAP-Elites grid archive, we leave their investigation for future work.

### 3.3 Diversity in Reinforcement Learning

Here we distinguish QD-RL from prior work which also applies diversity to RL. One area of work is in latent- and goal-conditioned policies. For latent-conditioned policy $\pi_{\bm{\phi}}(a|s,z)$ [16, 33, 37] or goal-conditioned policy $\pi_{\bm{\phi}}(a|s,g)$ [52, 3], varying the latent variable $z$ or goal $g$ results in different behaviors, e.g. different walking gaits or walking to a different location. While QD-RL also seeks a range of behaviors, the measures ${\bm{m}}({\bm{\phi}})$ are computed after evaluating ${\bm{\phi}}$, rather than before the evaluation. In general, QD-RL focuses on finding a variety of policies for a single task, rather than attempting to solve a variety of tasks with a single conditioned policy.

Another area of work combines evolutionary and actor-critic algorithms to solve single-objective hard-exploration problems [10, 31, 48, 56, 30]. In these methods, an evolutionary algorithm such as cross-entropy method [14] facilitates exploration by generating a diverse population of policies, while an actor-critic algorithm such as TD3 trains high-performing policies with this population’s environment experience. QD-RL differs from these methods in that it views diversity as a component of the output, while these methods view diversity as a means for environment exploration. Hence, QD-RL measures policy behavior via a measure function and collects diverse policies in an archive. In contrast, these RL exploration methods assume that trajectory diversity, rather than targeting specific behavioral diversity, is enough to drive exploration to discover a single optimal policy.

## 4 Approximating Gradients for DQD

Since DQD algorithms require exact objective and measure gradients, we cannot directly apply CMA-MEGA to QD-RL. To address this limitation, we replace exact gradients with gradient approximations (Sec. 4.1) and develop two CMA-MEGA variants (Sec. 4.2).

### 4.1 Approximating Objective and Measure Gradients

We adapt gradient approximations from ES and actor-critic methods. Since the objective has an MDP structure, we estimate objective gradients ${\bm{\nabla}}f$ with ES and actor-critic methods. Since the measures are black boxes, we estimate measure gradients ${\bm{\nabla}}{\bm{m}}$ with ES.

#### 4.1.1 Approximating objective gradients with ES and actor-critic methods

We estimate objective gradients with two methods. First, we treat the objective as a black box and estimate its gradient with a black box method, namely the OpenAI-ES gradient estimate in Eq. 3. Since OpenAI-ES performs well in RL domains [51, 45, 34], we believe this estimate is suitable for approximating gradients for CMA-MEGA in QD-RL settings. Importantly, this estimate requires environment interaction by evaluating $\lambda_{es}$ solutions.

Since the objective has a well-defined structure, i.e. it is a sum of rewards from an MDP (Eq. 2), we also estimate its gradient with an actor-critic method, TD3. TD3 is well-suited for this purpose because it efficiently estimates objective gradients for the multiple policies that CMA-MEGA and other QD-RL algorithms generate. In particular, once the critic is trained, TD3 can provide a gradient estimate for any policy without additional environment interaction.

Among actor-critic methods, we select TD3 since it achieves high performance while optimizing primarily for the RL objective. Prior work [21] shows that TD3 outperforms on-policy actor-critic methods [53, 54]. While the off-policy Soft Actor-Critic [27] algorithm can outperform TD3, it optimizes a maximum-entropy objective designed to encourage exploration. In our work, we explore by finding policies with different measures. Thus, we leave for future work the problem of integrating QD-RL with the action diversity encouraged by entropy maximization.

#### 4.1.2 Approximating measure gradients with ES

Since measures do not have special properties such as an MDP structure (Sec. 2.2), we only estimate their gradient with black box methods. Thus, similar to the objective, we approximate each measure’s gradient ${\bm{\nabla}}{m_{i}}$ with the OpenAI-ES gradient estimate, replacing $f$ with $m_{i}$ in Eq. 3.

Since the OpenAI-ES gradient estimate requires additional environment interaction, all of our CMA-MEGA variants require environment interaction to estimate gradients. However, the environment interaction required to estimate measure gradients remains constant even as the number of measures increases, since we can reuse the same $\lambda_{es}$ solutions to estimate each ${\bm{\nabla}}{m_{i}}$.
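This sample reuse can be made concrete: given one batch of perturbations and the objective and measure values of the perturbed solutions, the Eq. 3 estimator applies to $f$ and to every $m_i$ at once. A sketch with hypothetical names, shown without the rank normalization used in practice:

```python
import numpy as np

def estimate_all_gradients(eps, sigma, objectives, measures):
    """Apply the Eq. 3 estimator to f and every m_i, reusing the same batch
    of perturbations eps (shape (lam, n)).

    objectives: (lam,) objective values of the perturbed solutions.
    measures: (lam, k) measure values of the same solutions.
    """
    lam = len(eps)
    grad_f = eps.T @ objectives / (lam * sigma)    # (n,)
    grad_m = (eps.T @ measures).T / (lam * sigma)  # (k, n), one row per measure
    return grad_f, grad_m
```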

In problems where the measures have an MDP structure similar to the objective, it may be feasible to estimate each ${\bm{\nabla}}{m_{i}}$ with its own TD3 instance. In the environments in our work (Sec. 5.1), each measure is non-Markovian since it calculates the proportion of time a walking agent’s foot spends on the ground. This calculation depends on the entire agent trajectory rather than on one state.

### 4.2 CMA-MEGA Variants

Our choice of gradient approximations leads to two CMA-MEGA variants. CMA-MEGA (ES) approximates objective and measure gradients with OpenAI-ES, while CMA-MEGA (TD3, ES) approximates the objective gradient with TD3 and measure gradients with OpenAI-ES. Fig. 1 shows an overview of both algorithms, and Algorithm 1 shows their pseudocode. As CMA-MEGA (TD3, ES) builds on CMA-MEGA (ES), we present only CMA-MEGA (TD3, ES) and highlight lines that CMA-MEGA (TD3, ES) additionally executes.

Identically to CMA-MEGA, the two variants maintain three primary components: a solution point ${\bm{\phi}}^{*}$, a multivariate Gaussian distribution $\mathcal{N}({\bm{\mu}},{\bm{\Sigma}})$ for sampling gradient coefficients, and a MAP-Elites archive $\mathcal{A}$ for storing solutions. We initialize the archive and solution point on line 3, and we initialize the coefficient distribution as part of a CMA-ES instance on line 4.[^3]

[^3]: We set the CMA-ES batch size $\lambda^{\prime}$ slightly lower than the total batch size $\lambda$ (line 2). While CMA-MEGA (ES) and CMA-MEGA (TD3, ES) both evaluate $\lambda$ solutions each iteration, one evaluation is reserved for ${\bm{\phi}}^{*}$ (line 7). In CMA-MEGA (TD3, ES), one more evaluation is reserved for the greedy actor (line 26).

In the main loop (line 6), we follow the workflow shown in Fig. 1. First, after evaluating ${\bm{\phi}}^{*}$ and inserting it into the archive (line 7-8), we approximate its gradients with either ES or TD3 (line 9-10). This gradient approximation forms the key difference between our variants and the original CMA-MEGA algorithm [18].

Next, we branch from ${\bm{\phi}}^{*}$ to create solutions ${\bm{\phi}}^{\prime}_{i}$ by sampling ${\bm{c}}_{i}$ from the coefficient distribution and computing perturbations ${\bm{\nabla}}_{i}$ (line 13-15). We then evaluate each ${\bm{\phi}}^{\prime}_{i}$ and insert it into the archive (line 16-17).

Finally, we update the solution point and the coefficient distribution’s CMA-ES instance by forming an improvement ranking based on the improvement $\Delta_{i}$ (Sec. 3.2.2; line 19-21). Importantly, since we rank based on improvement, this update enables the CMA-MEGA variants to maximize the QD objective (Eq. 1) [18].

Our CMA-MEGA variants have two additional components. First, we check if no solutions were inserted into the archive at the end of the iteration, which would indicate that we should reset the coefficient distribution and the solution point (line 22-24). Second, in the case of CMA-MEGA (TD3, ES), we manage a TD3 instance similar to how PGA-MAP-Elites does (Sec. 3.2.1). This TD3 instance consists of a replay buffer $\mathcal{B}$, critic networks $Q_{{\bm{\theta}}_{1}}$ and $Q_{{\bm{\theta}}_{2}}$, a greedy actor $\pi_{{\bm{\phi}}_{q}}$, and target networks $Q_{{\bm{\theta}}^{\prime}_{1}}$, $Q_{{\bm{\theta}}^{\prime}_{2}}$, $\pi_{{\bm{\phi}}^{\prime}_{q}}$ (all initialized on line 5). At the end of each iteration, we use the greedy actor to train the critics, and we also insert it into the archive (line 26-29).

## 5 Experiments

We compare our two proposed CMA-MEGA variants (CMA-MEGA (ES), CMA-MEGA (TD3, ES)) with three baselines (PGA-MAP-Elites, ME-ES, MAP-Elites) in four locomotion tasks. We implement MAP-Elites as described in Sec. 3.2.1, and we select the explore-exploit variant for ME-ES since it has performed at least as well as both the explore variant and the exploit variant in several domains [9].

### 5.1 Evaluation Domains

#### 5.1.1 QDGym

We evaluate our algorithms in four locomotion environments from QDGym [43], a library built on PyBullet Gym [12, 15] and OpenAI Gym [6]. Appendix C lists all environment details. In each environment, the QD algorithm outputs an archive of walking policies for a simulated agent. The agent is primarily rewarded for its forward speed. There are also reward shaping [41] signals, such as a punishment for applying higher joint torques, intended to guide policy optimization. The measures compute the proportion of time (number of timesteps divided by total timesteps in an episode) that each of the agent’s feet contacts the ground.
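The measure computation reduces to averaging per-timestep contact indicators over an episode. A minimal sketch under the stated definition (the function name is ours, not QDGym's API):

```python
import numpy as np

def foot_contact_measures(contacts):
    """Proportion of timesteps each foot contacts the ground.

    contacts: (T, num_feet) array where contacts[t, i] is 1 (or True) if foot i
    touches the ground at timestep t. Returns one value in [0, 1] per foot.
    """
    return np.asarray(contacts, dtype=float).mean(axis=0)
```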

QDGym is challenging because the objective in each environment does not “align” with the measures, in that finding policies with different measures (i.e. exploring the archive) does not necessarily lead to optimization of the objective. While it may be trivial to fill the archive with low-performing policies which stand in place and lift the feet up and down to achieve different measures, the agents’ complexity (high degrees of freedom) makes it difficult to learn a high-performing policy for each value of the measures.

[Figure: the four QDGym environments: QD Ant, QD Half-Cheetah, QD Hopper, and QD Walker.]

#### 5.1.2 Hyperparameters

Each agent’s policy is a neural network which takes in states and outputs actions. There are two hidden layers of 128 nodes, and the hidden and output layers have tanh activation. We initialize weights with Xavier initialization [24].

For the archive, we tessellate each environment’s measure space into a grid of evenly-sized cells (see Table 6 for grid dimensions). Each measure is bounded to the range $[0,1]$, the minimum and maximum proportion of time that one foot can contact the ground.
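Mapping a measure vector to its grid cell is a simple binning operation. A sketch assuming evenly-sized cells (names are ours; pyribs' `GridArchive` handles this internally):

```python
import numpy as np

def measure_to_cell(measures, lower, upper, dims):
    """Map a measure vector to its cell in an evenly tessellated grid archive.

    lower, upper: per-dimension measure bounds (here 0 and 1).
    dims: number of cells along each dimension.
    """
    dims = np.asarray(dims)
    frac = (np.asarray(measures, dtype=float) - lower) / (np.asarray(upper) - lower)
    idx = np.floor(frac * dims).astype(int)
    # Clip so measures exactly at the upper bound fall in the last cell.
    return tuple(np.clip(idx, 0, dims - 1))
```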

Each algorithm evaluates 1 million solutions in the environment. Due to computational limits, we evaluate each solution once instead of averaging multiple episodes, so each algorithm runs 1 million episodes total. Refer to Appendix B for further hyperparameters.

#### 5.1.3 Metrics

Our primary metric is QD score [49], which provides a holistic view of algorithm performance. QD score is the sum of the objective values of all elites in the archive, i.e. $\sum_{i=1}^{M}\bm{1}_{{\bm{\phi}}_{i}\,\mathrm{exists}}f({\bm{\phi}}_{i})$, where $M$ is the number of archive cells. We note that the contribution of a cell to the QD score is 0 if the cell is unoccupied. We set the objective $f$ to be the expected undiscounted return, i.e. we set $\gamma=1$ in Eq. 2.

Since objectives may be negative, an algorithm’s QD score may be penalized when adding a new solution. To prevent this, we define a minimum objective in each environment by taking the lowest objective value that was inserted into the archive in any experiment in that environment. We subtract this minimum from every solution, such that every solution that was inserted into an archive has an objective value of at least 0. Thus, we use QD score defined as $\sum_{i=1}^{M}\bm{1}_{{\bm{\phi}}_{i}\,\mathrm{exists}}(f({\bm{\phi}}_{i})-\mathrm{min\ objective})$. We also define a maximum objective equivalent to each environment’s “reward threshold” in PyBullet Gym. This threshold is the objective value at which an agent is considered to have successfully learned to walk.
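The offset QD score definition above can be sketched in a few lines, assuming we have the objective values of the occupied cells (unoccupied cells simply do not appear in the list):

```python
def qd_score(archive_objectives, min_objective):
    """QD score with the per-environment offset: each occupied cell contributes
    its objective minus the minimum objective; unoccupied cells contribute 0."""
    return sum(f - min_objective for f in archive_objectives)
```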

We report two metrics in addition to QD score. Archive coverage, the proportion of cells for which the algorithm found an elite, gauges how well the QD algorithm explores measure space, and best performance, the highest objective of any elite in the archive, gauges how well the QD algorithm exploits the objective.

### 5.2 Experimental Design

We follow a between-groups design, where the two independent variables are environment (QD Ant, QD Half-Cheetah, QD Hopper, QD Walker) and algorithm (CMA-MEGA (ES), CMA-MEGA (TD3, ES), PGA-MAP-Elites, ME-ES, MAP-Elites). The dependent variable is the QD score. In each environment, we run each algorithm for 5 trials with different random seeds and test three hypotheses:

H1: CMA-MEGA (ES) will outperform all baselines (PGA-MAP-Elites, ME-ES, MAP-Elites).

H2: CMA-MEGA (TD3, ES) will outperform all baselines.

H3: CMA-MEGA (TD3, ES) will outperform CMA-MEGA (ES).

H1 and H2 are based on prior work [18] which showed that in QD benchmark domains, CMA-MEGA outperforms algorithms that do not leverage both objective and measure gradients. H3 is based on results [45] which suggest that actor-critic methods outperform ES in PyBullet Gym. Thus, we expect the TD3 objective gradient to be more accurate than the ES objective gradient, leading to more efficient traversal of objective-measure space and higher QD score.

### 5.3 Implementation

We implement all QD algorithms with the pyribs library [57] except for ME-ES, which we adapt from the authors’ implementation. We run each experiment with 100 CPUs on a high-performance cluster. We allocate one NVIDIA Tesla P100 GPU to algorithms that train TD3 (CMA-MEGA (TD3, ES) and PGA-MAP-Elites). Depending on the algorithm and environment, each experiment lasts 4-20 hours; refer to Table 12, Appendix E for mean runtimes. We have released our source code at https://github.com/icaros-usc/dqd-rl

## 6 Results

We ran 5 trials of each algorithm in each environment. In each trial, we allocated 1 million evaluations and recorded the QD score, archive coverage, and best performance. Fig. 5 plots these metrics, and Appendix E lists final values of all metrics. Appendix G shows example heatmaps and histograms of each archive, and the supplemental material contains videos of generated agents.

### 6.1 Analysis

To test our hypotheses, we conducted a two-way ANOVA which examined the effect of algorithm and environment on the QD score. We note that the ANOVA requires QD scores to have the same scale, but each environment’s QD score has a different scale by default. Thus, for this analysis, we normalized QD scores by dividing by each environment’s maximum QD score, defined as grid cells * (max objective - min objective) (see Appendix C for these quantities).
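The normalization is a single division by the environment’s maximum QD score; a minimal sketch with hypothetical values (a 1000-cell grid and an objective range of 0 to 3000):

```python
def normalized_qd_score(qd_score, num_cells, max_objective, min_objective):
    """Normalize a QD score to [0, 1] by the environment's maximum
    possible QD score: grid cells * (max objective - min objective)."""
    max_qd_score = num_cells * (max_objective - min_objective)
    return qd_score / max_qd_score

# Hypothetical: QD score of 1.5M in a 1000-cell archive, objectives in [0, 3000].
print(normalized_qd_score(1.5e6, num_cells=1000,
                          max_objective=3000.0, min_objective=0.0))  # 0.5
```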

We found a statistically significant interaction between algorithm and environment on QD score, $F(12,80)=16.82,p<0.001$. Simple main effects analysis indicated that the algorithm had a significant effect on QD score in each environment, so we ran pairwise comparisons (two-sided t-tests) with Bonferroni corrections (Appendix F). Our results are as follows:
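The pairwise-comparison step can be sketched with SciPy as follows. This is a minimal illustration with hypothetical per-trial scores; our actual analysis also includes the two-way ANOVA and simple main effects analysis described above:

```python
from itertools import combinations

from scipy import stats

def pairwise_bonferroni(scores_by_algo, alpha=0.05):
    """Two-sided pairwise t-tests with a Bonferroni correction.

    `scores_by_algo` maps algorithm name -> per-trial QD scores for one
    environment. Returns {(algo_a, algo_b): (p_value, significant)}.
    """
    pairs = list(combinations(scores_by_algo, 2))
    # Bonferroni: divide the significance level by the number of comparisons.
    corrected_alpha = alpha / len(pairs)
    results = {}
    for a, b in pairs:
        _, p = stats.ttest_ind(scores_by_algo[a], scores_by_algo[b])
        results[(a, b)] = (p, bool(p < corrected_alpha))
    return results

# Hypothetical per-trial QD scores for three algorithms.
results = pairwise_bonferroni({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C": [10.0, 11.0, 12.0, 13.0, 14.0],
})
print(results[("A", "C")][1])  # True: A and C differ significantly
```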

H1: There is no significant difference in QD score between CMA-MEGA (ES) and PGA-MAP-Elites in QD Ant and QD Half-Cheetah, but in QD Hopper and QD Walker, CMA-MEGA (ES) attains significantly lower QD score than PGA-MAP-Elites. CMA-MEGA (ES) achieves significantly higher QD score than ME-ES in all environments except QD Hopper, where there is no significant difference. There is no significant difference between CMA-MEGA (ES) and MAP-Elites in all domains except QD Hopper, where CMA-MEGA (ES) attains significantly lower QD score.

H2: In all environments, there is no significant difference in QD score between CMA-MEGA (TD3, ES) and PGA-MAP-Elites. CMA-MEGA (TD3, ES) achieves significantly higher QD score than ME-ES in all environments. CMA-MEGA (TD3, ES) achieves significantly higher QD score than MAP-Elites in QD Half-Cheetah and Walker, with no significant difference in QD Ant and QD Hopper.

H3: CMA-MEGA (TD3, ES) achieves significantly higher QD score than CMA-MEGA (ES) in QD Hopper and QD Walker, but there is no significant difference in QD Ant and QD Half-Cheetah.

### 6.2 Discussion

We discuss how the CMA-MEGA variants differ from the baselines (Sec. 6.2.1-6.2.4) and how they differ from each other (Sec. 6.2.5).

#### 6.2.1 PGA-MAP-Elites and objective-measure space exploration

Of the CMA-MEGA variants, CMA-MEGA (TD3, ES) performed the closest to PGA-MAP-Elites, with no significant QD score difference in any environment. This result differs from prior work [18] in QD benchmark domains, where CMA-MEGA outperformed OG-MAP-Elites, a baseline DQD algorithm inspired by PGA-MAP-Elites.

We attribute this difference to the difficulty of exploring objective-measure space in the benchmark domains. For example, the linear projection benchmark domain is designed to be “distorted” [19]. Values in the center of its measure space are easy to obtain with random sampling, while values at the edges are unlikely to be sampled. Hence, high QD score arises from exploring measure space and filling the archive. Since CMA-MEGA adapts its sampling distribution, it is able to perform this exploration, while OG-MAP-Elites remains “stuck” in the center of the measure space.

In contrast, as discussed in Sec. 5.1.1, it is relatively easy to fill the archive in QDGym. We see this empirically: in all environments, all algorithms achieve nearly 100% archive coverage, usually within the first 250k evaluations (Fig. 5). Hence, the best QD score is achieved by increasing the objective value of solutions after filling the archive. PGA-MAP-Elites achieves this by optimizing half of its generated solutions with respect to its TD3 critic. The genetic operator likely further enhances the efficacy of this optimization, by taking previously-optimized solutions and combining them to obtain high-performing solutions in other parts of the archive.

On the other hand, the CMA-MEGA variants place less emphasis than PGA-MAP-Elites on maximizing the performance of each solution: in each trial, PGA-MAP-Elites takes 5 million objective gradient steps with respect to its TD3 critic, while the CMA-MEGA variants compute only 5k objective gradients, because they dedicate most of their evaluation budget to estimating measure gradients. This difference suggests a possible extension to CMA-MEGA (TD3, ES) in which solutions are optimized with respect to the TD3 critic before being evaluated in the environment.

#### 6.2.2 PGA-MAP-Elites and optimization efficiency

While there was no significant difference in the final QD scores of CMA-MEGA (TD3, ES) and PGA-MAP-Elites, CMA-MEGA (TD3, ES) was less efficient than PGA-MAP-Elites in some environments. For instance, in QD Hopper, PGA-MAP-Elites reached 1.5M QD score after 100k evaluations, but CMA-MEGA (TD3, ES) required 400k evaluations.

We can quantify optimization efficiency with QD score AUC, the area under the curve (AUC) of the QD score plot. For a QD algorithm that executes $N$ iterations and evaluates $\lambda$ solutions per iteration, we define QD score AUC as a Riemann sum: QD score AUC $=\sum_{i=1}^{N}(\lambda\,\cdot$ QD score at iteration $i)$. After computing QD score AUC, we ran a statistical analysis similar to Sec. 6.1 and found that CMA-MEGA (TD3, ES) had significantly lower QD score AUC than PGA-MAP-Elites in QD Ant and QD Hopper. There was no significant difference in QD Half-Cheetah and QD Walker. As such, while CMA-MEGA (TD3, ES) obtained final QD scores comparable to PGA-MAP-Elites in all tasks, it was less efficient at achieving those scores in QD Ant and QD Hopper.
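The Riemann sum above translates directly into code. A minimal sketch with hypothetical per-iteration QD scores (each rectangle has width $\lambda$, the number of evaluations per iteration):

```python
def qd_score_auc(qd_scores, evals_per_iter):
    """Riemann-sum approximation of the area under the QD score plot:
    AUC = sum over iterations of (lambda * QD score at iteration i)."""
    return sum(evals_per_iter * score for score in qd_scores)

# Hypothetical QD scores recorded at each of 4 iterations, lambda = 200.
print(qd_score_auc([0.0, 1.0, 2.0, 4.0], evals_per_iter=200))  # 1400.0
```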

#### 6.2.3 ME-ES and archive insertions

With one exception (CMA-MEGA (ES) in QD Hopper), both CMA-MEGA variants achieved significantly higher QD score than ME-ES in all environments. We attribute this result to the number of solutions each algorithm inserts into the archive. Each iteration, ME-ES evaluates 200 solutions (Appendix B) but inserts only one into the archive, for a total of 5000 solutions inserted during each run. Given that each archive has at least 1000 cells, ME-ES has, on average, 5 opportunities to insert a solution that improves each cell. In contrast, the CMA-MEGA variants have 100 times more insertions: though they also evaluate 200 solutions per iteration, they insert 100 of these into the archive. This totals 500k insertions per run, allowing the CMA-MEGA variants to gradually improve archive cells.
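The insertion counts above follow from the shared evaluation budget; a back-of-the-envelope check, assuming 1 million evaluations per run and 200 evaluations per iteration for both algorithms:

```python
# Back-of-the-envelope insertion counts under the evaluation budget.
evaluations = 1_000_000
evals_per_iter = 200
iterations = evaluations // evals_per_iter  # 5000 iterations per run

me_es_insertions = iterations * 1       # ME-ES inserts 1 solution/iteration
cma_mega_insertions = iterations * 100  # CMA-MEGA variants insert 100/iteration

print(iterations, me_es_insertions, cma_mega_insertions)  # 5000 5000 500000
```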

#### 6.2.4 MAP-Elites and robustness

In most cases, both CMA-MEGA variants had significantly higher QD score than MAP-Elites or no significant difference, but in QD Hopper, MAP-Elites achieved significantly higher QD score than CMA-MEGA (ES). However, we found that MAP-Elites solutions were less robust (see Appendix D).

#### 6.2.5 CMA-MEGA variants and gradient estimates

In QD Hopper and QD Walker, CMA-MEGA (TD3, ES) had significantly higher QD score than CMA-MEGA (ES). One potential explanation is that PyBullet Gym (and hence QDGym) augments rewards with reward shaping signals intended to promote optimal solutions for deep RL algorithms. In prior work [45], these signals led PPO [54] to train successful walking agents, while they led OpenAI-ES into local optima. For instance, OpenAI-ES trained agents which stood still so as to maximize only the reward signal for staying upright.

Due to these signals, TD3’s objective gradient seems more useful than that of OpenAI-ES in QD Hopper and QD Walker. In fact, the algorithms which performed best in QD Hopper and QD Walker were ones that calculated objective gradients with TD3, i.e. PGA-MAP-Elites and CMA-MEGA (TD3, ES).

Prior work [45] found that rewards could be tailored for ES, such that OpenAI-ES outperformed PPO. Extensions of our work could investigate whether there is a similar effect for QD algorithms, where tailoring the reward leads CMA-MEGA (ES) to outperform PGA-MAP-Elites and CMA-MEGA (TD3, ES).

## 7 Conclusion

To extend DQD to RL settings, we adapted gradient approximations from actor-critic methods and ES. By integrating these approximations with CMA-MEGA, we proposed two novel variants that we evaluated on four locomotion tasks from QDGym. CMA-MEGA (TD3, ES) performed comparably to the state-of-the-art PGA-MAP-Elites in all tasks but was less efficient in two of the tasks. CMA-MEGA (ES) performed comparably in two tasks.

Our results contrast prior work [18] where CMA-MEGA outperformed a baseline algorithm inspired by PGA-MAP-Elites in QD benchmark domains. The difference seems to be that difficulty in the benchmarks arises from a hard-to-explore measure space, whereas difficulty in QDGym arises from an objective which requires rigorous optimization. As such, future work could formalize the notions of “exploration difficulty” of a measure space and “optimization difficulty” of an objective and evaluate algorithms in benchmarks that cover a spectrum of these metrics.

For practitioners looking to apply DQD in RL settings, we recommend estimating objective gradients with an off-policy actor-critic method such as TD3 instead of with an ES. Due to the difficulty of modern control benchmarks, it is important to efficiently optimize the objective — TD3 benefits over ES since it can compute the objective gradient without further environment interaction. Furthermore, reward signals in these benchmarks are designed for deep RL methods, making TD3 gradients more useful than ES gradients.

By reducing QD-RL to DQD, we have decoupled QD-RL into DQD optimization and RL gradient approximations. In the future, we envision algorithms which benefit from advances in either more efficient DQD or more accurate RL gradient approximations.
