Main
The primary goal of artificial intelligence is to design agents that, like humans, can predict and act in complex environments to achieve goals. Many of the most successful agents are based on reinforcement learning (RL), in which agents learn by interacting with their environments. Decades of research have produced ever more efficient RL algorithms, resulting in numerous landmarks in artificial intelligence, including the mastery of complex competitive games such as Go7, chess8, StarCraft9 and Minecraft10, the invention of new mathematical tools11, and the control of complex physical systems12.
Unlike the learning mechanisms of humans, which were discovered by biological evolution, RL algorithms are typically designed by hand. This process is slow and laborious, and is limited by its reliance on human knowledge and intuition. Although a number of attempts have been made to automatically discover learning algorithms1,2,3,4,5,6, none has proven sufficiently efficient and general to replace hand-designed RL systems.
In this work, we introduce an autonomous method for discovering RL rules solely through the experience of many generations of agents interacting with various environments (Fig. 1a). The discovered RL rule achieves state-of-the-art performance on a variety of challenging RL benchmarks. The success of our method contrasts with previous work in two dimensions. First, whereas previous methods searched over narrow spaces of RL rules (for example, hyperparameters13,14 or policy loss1,6), our method allows the agent to explore a far more expressive space of potential RL rules. Second, whereas previous work focused on meta-learning in simple environments (for example, grid-worlds3,15), our method meta-learns in complex and diverse environments at a much larger scale.
Fig. 1 | a, Discovery. Multiple agents, interacting with various environments, are trained in parallel according to the learning rule, defined by the meta-network. In the meantime, the meta-network is optimized to improve the agents’ collective performances. b, Agent architecture. An agent produces the following outputs: (1) a policy (π), (2) an observation-conditioned prediction vector (y), (3) action-conditioned prediction vectors (z), (4) action values (q) and (5) an auxiliary policy prediction (p). The semantics of y and z are determined by the meta-network. c, Meta-network architecture. A trajectory of the agent’s outputs is given as input to the meta-network, together with rewards and episode termination indicators from the environment (omitted for simplicity in the figure). Using this information, the meta-network produces targets for all of the agent’s predictions from the current and future time steps. The agent is updated to minimize the prediction errors with respect to their targets. LSTM, long short-term memory. d, Meta-optimization. The meta-parameters of the meta-network are updated by taking a meta-gradient step calculated from backpropagation through the agent’s update process (θ0 → θN), where the meta-objective is to maximize the collective returns of the agents in their environments.
To choose a general space of discovery, we observe that the essential component of standard RL algorithms is a rule that updates one or more predictions, as well as the policy itself, towards targets that are functions of quantities such as future rewards and future predictions. Examples of RL rules based on different targets include temporal-difference learning16, Q-learning17, proximal policy optimization (PPO)18, auxiliary tasks19, successor features20 and distributional RL21. In each case, the choice of target determines the nature of the predictions, for example, whether they become value functions, models or successor features.
In our framework, an RL rule is represented by a meta-network that determines the targets towards which the agent should move its predictions and policy (Fig. 1c). This allows the system to discover useful predictions without pre-defined semantics, as well as how they are used. The system may in principle rediscover past RL rules, but the flexible functional form also allows the agent to invent new RL rules that may be specifically adapted to environments of interest.
During the discovery process, we instantiate a population of agents, each of which interacts with its own instance of an environment taken from a diverse set of challenging tasks. Each agent’s parameters are updated according to the current RL rule. We then use the meta-gradient method13 to incrementally improve the RL rule such that it could lead to better-performing agents.
Our large-scale empirical results show that our discovered RL rule, which we call DiscoRL, surpasses all existing RL rules on the environments in which it was meta-learned. Notably, this includes Atari games22, arguably the most established and informative of RL benchmarks. Furthermore, DiscoRL achieved state-of-the-art performance on a number of other challenging benchmarks, such as ProcGen23, that it had never been exposed to during discovery. We also show that the performance and generality of DiscoRL improve further as more diverse and complex environments are used in discovery. Finally, our analysis shows that DiscoRL has discovered unique prediction semantics that are distinct from existing RL concepts such as value functions. To the best of our knowledge, this provides empirical evidence that surpassing manually designed RL algorithms, in terms of both generality and efficiency, is finally within reach.
Discovery method
Our discovery approach involves two types of optimization: agent optimization and meta-optimization. Agent parameters are optimized by updating their policies and predictions towards the targets produced by the RL rule. Meanwhile, the meta-parameters of the RL rule are optimized by updating its targets to maximize the cumulative rewards of the agents.
Agent network
Much RL research considers what predictions an agent should make (for example, values), and what loss function should be used to learn those predictions (for example, temporal-difference (TD) learning) and improve the policy (for example, policy gradient). Instead of hand-crafting them, we define an expressive space of predictions without pre-defined semantics and meta-learn what the agent needs to optimize by representing it using a meta-network. It is desirable to maintain the ability to represent key ideas in existing RL algorithms, while supporting a large space of novel algorithmic possibilities.
To this end, we let the agent, parameterized by θ, output two types of predictions in addition to a policy (π): an observation-conditioned vector prediction y(s) ∈ ℝn of arbitrary size n and an action-conditioned vector prediction z(s, a) ∈ ℝm of arbitrary size m, where s and a are an observation and an action, respectively (Fig. 1b). The form of these predictions stems from the fundamental distinction between prediction and control16. For example, value functions are commonly divided into state-value functions v(s) (for prediction) and action-value functions q(s, a) (for control), and many other concepts in RL, such as rewards and successor features, also have an observation-conditioned version and an action-conditioned version. Therefore, the functional form of the predictions (y, z) is general enough to represent, but is not restricted to, many existing fundamental concepts in RL.
In addition to the predictions to be discovered, in most of our experiments the agent makes predictions with pre-defined semantics. Specifically, the agent produces an action-value function q(s, a) and an action-conditional auxiliary policy prediction p(s, a)8. This encourages the discovery process to focus on discovering new concepts through y and z.
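To make the interface concrete, the following sketch (Python/JAX) shows one way the agent's outputs could be organized. The toy linear torso, the sizes of y and z and the number of value bins are illustrative assumptions, not the architecture used in this work.

```python
# A minimal sketch of the agent interface: besides the policy pi, the agent
# outputs discovered predictions y(s) and z(s, a) plus the pre-defined
# q(s, a) and p(s, a). Sizes and the linear torso are assumptions.
from typing import NamedTuple
import jax
import jax.numpy as jnp


class AgentOutput(NamedTuple):
    pi: jnp.ndarray  # [A] policy logits
    y: jnp.ndarray   # [n] observation-conditioned prediction (semantics meta-learned)
    z: jnp.ndarray   # [A, m] action-conditioned prediction (semantics meta-learned)
    q: jnp.ndarray   # [A, bins] action values as two-hot/categorical logits
    p: jnp.ndarray   # [A, A] auxiliary prediction of the next-step policy, per action


def agent_forward(params, obs, num_actions=4, m=16, num_bins=51):
    """Toy torso: one linear layer per head, standing in for the real network."""
    h = jnp.tanh(params['torso'] @ obs)
    return AgentOutput(
        pi=params['pi'] @ h,
        y=params['y'] @ h,
        z=(params['z'] @ h).reshape(num_actions, m),
        q=(params['q'] @ h).reshape(num_actions, num_bins),
        p=(params['p'] @ h).reshape(num_actions, num_actions),
    )


def init_params(key, obs_dim=16, hidden=64, num_actions=4, n=32, m=16, num_bins=51):
    keys = jax.random.split(key, 6)
    lin = lambda k, rows: 0.1 * jax.random.normal(k, (rows, hidden))
    return {
        'torso': 0.1 * jax.random.normal(keys[0], (hidden, obs_dim)),
        'pi': lin(keys[1], num_actions),
        'y': lin(keys[2], n),
        'z': lin(keys[3], num_actions * m),
        'q': lin(keys[4], num_actions * num_bins),
        'p': lin(keys[5], num_actions * num_actions),
    }


out = agent_forward(init_params(jax.random.PRNGKey(0)), jnp.ones(16))
```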
Meta-network
A large proportion of modern RL rules use the forward view of RL16. In this view, the RL rule receives a trajectory from time step t to t + n and uses this information to update the agent’s predictions or policy. Such rules typically update the predictions or policy towards bootstrapped targets, that is, targets constructed from future predictions.
Correspondingly, our RL rule uses a meta-network (Fig. 1c) as a function that determines targets towards which the agent should move its predictions and policy. To produce targets at time step t, the meta-network receives as input a trajectory of the agent’s predictions and policy as well as rewards and episode termination from time step t to t + n. It uses a standard long short-term memory24 to process these inputs, although other architectures may be used (Extended Data Fig. 3).
The choice of inputs and outputs to the meta-network maintains certain desirable properties of handcrafted RL rules. First, the meta-network can deal with any observation and with discrete action spaces of any size. This is possible because the meta-network does not receive the observation directly as input, but only indirectly via predictions. In addition, it processes action-specific inputs and outputs by sharing weights across action dimensions. As a result it can generalize to radically different environments. Second, the meta-network is agnostic to the design of the agent network, as it sees only the output of the agent network. As long as the agent network produces the required form of outputs (π, y, z), the discovered RL rule can generalize to arbitrary agent architectures or sizes. Third, the search space defined by the meta-network includes the important algorithmic idea of bootstrapping. Fourth, as the meta-network processes both policy and predictions together, it can not only meta-learn auxiliary tasks25 but also directly use predictions to update the policy (for example, to provide a baseline for variance reduction). Finally, outputting targets is strictly more expressive than outputting a scalar loss function, as it includes semi-gradient methods such as Q-learning in the search space. While building on these properties of standard RL algorithms, the rich parametric neural network allows the discovered rule to implement algorithms with potentially much greater efficiency and contextual nuance.
Agent optimization
The agent’s parameters (θ) are updated to minimize the distance from its predictions and policy to the targets from the meta-network. The agent’s loss function can be expressed as:
$$L(\theta )={{\mathbb{E}}}_{s,a\sim {{\boldsymbol{\pi }}}_{\theta }}\left[D(\hat{{\boldsymbol{\pi }}},{{\boldsymbol{\pi }}}_{\theta }(s))+D(\hat{{\bf{y}}},{{\bf{y}}}_{\theta }(s))+D(\hat{{\bf{z}}},{{\bf{z}}}_{\theta }(s,a))+{L}_{{\rm{aux}}}\right]$$
where s and a are distributed according to the policy πθ, and D(p, q) is a distance function between p and q. We chose the Kullback–Leibler divergence as the distance function, as it is sufficiently general and has previously been found to make meta-optimization easier3. Here πθ, yθ, zθ and \(\hat{{\boldsymbol{\pi }}}\), \(\hat{{\bf{y}}}\), \(\hat{{\bf{z}}}\) are the outputs of the agent network and the meta-network, respectively, with a softmax function applied to normalize each vector.
The auxiliary loss Laux is used for the predictions with pre-defined semantics, the action values (q) and the auxiliary policy predictions (p): Laux = D(\(\hat{{\bf{q}}}\), qθ(s, a)) + D(\(\hat{{\bf{p}}}\), pθ(s, a)), where \(\hat{{\bf{q}}}\) is an action-value target from Retrace26 projected to a two-hot vector8 and \(\hat{{\bf{p}}}\) = πθ(s′) is the policy at the one-step future state. To be consistent with the rest of the losses, we use the Kullback–Leibler divergence as the distance function D.
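The sketch below illustrates this loss under assumed shapes: each output and its meta-network target is normalized with a softmax and compared with the Kullback–Leibler divergence, and the auxiliary terms for q and p are handled in the same way. The field names and the way the taken action indexes the action-conditioned outputs are assumptions.

```python
# Assumed shapes: pi/y logits are vectors; z, q, p are [A, ...] arrays indexed
# by the taken action. `agent_out` and `targets` can be any structures with
# these fields (for example, the NamedTuple from the earlier sketch).
import jax.numpy as jnp
from jax.nn import log_softmax, softmax


def kl(target_logits, pred_logits):
    """D(target, prediction) = KL(softmax(target) || softmax(pred)), last axis."""
    t = softmax(target_logits, axis=-1)
    return jnp.sum(t * (log_softmax(target_logits, axis=-1)
                        - log_softmax(pred_logits, axis=-1)), axis=-1)


def agent_loss(agent_out, targets, action):
    loss = kl(targets.pi, agent_out.pi)          # policy towards its target
    loss += kl(targets.y, agent_out.y)           # discovered prediction y
    loss += kl(targets.z, agent_out.z[action])   # discovered prediction z(s, a)
    # Auxiliary loss for the pre-defined predictions: a Retrace-based two-hot
    # value target and the one-step-ahead policy target (both assumed given).
    loss += kl(targets.q, agent_out.q[action])
    loss += kl(targets.p, agent_out.p[action])
    return loss
```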
Meta-optimization
Our goal is to discover an RL rule, represented by the meta-network with meta-parameters η, that allows agents to maximize rewards in a variety of training environments. This discovery objective J(η) and its meta-gradient ∇ηJ(η) can be expressed as:
$$J(\eta )={{\mathbb{E}}}_{{\mathcal{E}}}{{\mathbb{E}}}_{\theta }[J(\theta )],\qquad {\nabla }_{\eta }J(\eta )\approx {{\mathbb{E}}}_{{\mathcal{E}}}{{\mathbb{E}}}_{\theta }[{\nabla }_{\eta }\theta \,{\nabla }_{\theta }J(\theta )],$$
where \({\mathcal{E}}\) indicates an environment sampled from a distribution and θ denotes agent parameters induced by an initial parameter distribution and their evolution over the course of learning with the RL rule. \(J(\theta )={\mathbb{E}}\left[{\sum }_{t}{\gamma }^{t}{r}_{t}\right]\), where γ is the discount factor and rt is the reward at step t, is the expected discounted sum of rewards, which is the typical RL objective. The meta-parameters are optimized using gradient ascent following the above equations.
To estimate the meta-gradient, we instantiate a population of agents that learn according to the meta-network in a set of sampled environments. To ensure this approximation is close to the true distribution of interest, we use a large number of complex environments taken from challenging benchmarks, in contrast to previous work that focused on a small number of simple environments. As a result the discovery process surfaces diverse RL challenges, such as the sparsity of rewards, the task horizon, and the partial observability or stochasticity of environments.
Each agent’s parameters are periodically reset to encourage the update rule to make fast learning progress within a limited agent lifetime. As in previous work on meta-gradient RL13, the meta-gradient term ∇ηJ(η) can be divided into two gradient terms by the chain rule: ∇ηθ and ∇θJ(θ). The first term can be understood as a gradient over the agent update procedure27, whereas the second term is the gradient of the standard RL objective. To estimate the first term, we iteratively update the agent multiple times and backpropagate through the entire update procedure, as illustrated in Fig. 1d. To make it tractable, we backpropagate over 20 agent updates using a sliding window. Finally, to estimate the second term, we use the advantage actor–critic method28. To estimate the advantage, we train a meta-value function, which is a value function used only for discovery.
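As a toy illustration of this meta-gradient, the sketch below backpropagates through a short chain of agent updates with JAX, mirroring the θ0 → θN unroll in Fig. 1d. Here `agent_update_loss` and `expected_return` are differentiable placeholders standing in for the meta-network-defined loss and the return estimate; they are not the losses used in this work.

```python
# Meta-gradient via backpropagation through the agent's update procedure:
# theta is updated with an eta-parameterized rule, and d(theta)/d(eta) is
# chained with the gradient of the return objective J(theta).
import jax
import jax.numpy as jnp


def agent_update_loss(theta, eta, batch):
    # Placeholder for the distance between agent outputs and meta-network targets.
    return jnp.sum((theta - jnp.tanh(eta) * batch.mean()) ** 2)


def expected_return(theta, batch):
    # Placeholder for J(theta), e.g. an advantage actor-critic estimate.
    return -jnp.sum((theta - batch.mean()) ** 2)


def unrolled_updates(eta, theta0, batches, lr=0.1):
    """theta_0 -> theta_N: inner gradient steps on the eta-defined loss
    (the paper backpropagates over a sliding window of ~20 such updates)."""
    theta = theta0
    for batch in batches:
        grads = jax.grad(agent_update_loss)(theta, eta, batch)
        theta = theta - lr * grads
    return theta


def meta_objective(eta, theta0, batches, eval_batch):
    theta_n = unrolled_updates(eta, theta0, batches)
    return expected_return(theta_n, eval_batch)


# Differentiate the final return through the whole update chain w.r.t. eta.
meta_grad_fn = jax.grad(meta_objective)

eta = jnp.array(0.5)
theta0 = jnp.zeros(3)
batches = [jnp.ones(4) * i for i in range(1, 4)]
meta_grad = meta_grad_fn(eta, theta0, batches, eval_batch=jnp.ones(4))
```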
Empirical results
We implemented our discovery method with a large population of agents in a set of complex environments. We call the discovered RL rule DiscoRL. In evaluation, the aggregated performance was measured by the interquartile mean (IQM) of normalized scores for benchmarks that consist of multiple tasks, which has proven to be a statistically reliable metric29.
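For reference, the interquartile mean used throughout the evaluation can be computed as in the following sketch; the example scores are illustrative.

```python
# Interquartile mean (IQM): discard the lowest and highest 25% of the
# normalized scores and average the middle 50%.
import numpy as np


def interquartile_mean(normalized_scores):
    scores = np.sort(np.asarray(normalized_scores).ravel())
    n = len(scores)
    lo, hi = int(np.floor(0.25 * n)), int(np.ceil(0.75 * n))
    return float(scores[lo:hi].mean())


# Example: per-game human-normalized scores (values are illustrative).
print(interquartile_mean([0.2, 0.9, 1.3, 2.0, 4.5, 15.0, 0.1, 1.1]))
```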
Atari
The Atari benchmark22, one of the most studied benchmarks in the history of RL, consists of 57 Atari 2600 games. These games require complex strategies, planning and long-term credit assignment, making them non-trivial for AI agents to master. Hundreds of RL algorithms have been evaluated on this benchmark over the past decade, including MuZero8 and Dreamer10.
To see how strong a rule can be when discovered directly from this benchmark, we meta-trained an RL rule, Disco57, and evaluated it on the same 57 games (Fig. 2a). In this evaluation, we used a network architecture with a parameter count comparable to that of MuZero. This is a larger network than the one used during discovery; the discovered RL rule must therefore generalize to this setting. Disco57 achieved an IQM of 13.86, outperforming all existing RL rules8,10,14,30 on the Atari benchmark, with substantially higher wall-clock efficiency than the state-of-the-art MuZero (Extended Data Fig. 4). This shows that our method can automatically discover a strong RL rule from such challenging environments.
Fig. 2 | a–f, Performance of DiscoRL compared with human-designed RL rules on Atari (a), ProcGen (b), DMLab (c), Crafter (d; figure inset shows results for 1 million environment steps), NetHack (e) and Sokoban (f). The x axis represents the number of environment steps in millions. The y axis represents the human-normalized IQM score for benchmarks consisting of multiple tasks (Atari, ProcGen and DMLab-30) and average return for the rest. Disco57 (blue) is discovered from the Atari benchmark and Disco103 (orange) is discovered from the Atari, ProcGen and DMLab-30 benchmarks. The shaded areas show 95% confidence intervals. The dashed lines represent manually designed RL rules such as MuZero8, efficient memory-based exploration agent (MEME)30, Dreamer10, self-tuning actor-critic algorithm (STACX)14, importance-weighted actor-learner architecture (IMPALA)34, deep Q-network (DQN)51, phasic policy gradient (PPG)52, proximal policy optimization (PPO)18, and Rainbow53.
Generalization
We further investigated the generality of Disco57 by evaluating it on a variety of held-out benchmarks that it was never exposed to during discovery. These benchmarks include unseen observation and action spaces, diverse environment dynamics, various reward structures and unseen agent network architectures. Meta-training hyperparameters were tuned on only training environments (that is, Atari) to prevent the rule from being implicitly optimized for held-out benchmarks.
The results on the ProcGen23 benchmark (Fig. 2b and Extended Data Table 2), which consists of 16 procedurally generated two-dimensional games, show that Disco57 outperformed all existing published methods, including MuZero8 and PPO18, even though it had never interacted with ProcGen environments during discovery. In addition, Disco57 achieved competitive performance on Crafter31 (Fig. 2d and Extended Data Table 5), in which the agent needs to learn a wide spectrum of abilities to survive. Disco57 also reached third place on the leaderboard of the NetHack NeurIPS 2021 Challenge32 (Fig. 2e and Extended Data Table 4), in which more than 40 teams participated. Unlike the top submitted agents in the competition33, Disco57 did not use any domain-specific knowledge for defining subtasks or reward shaping. For a fair comparison, we trained an agent with the importance-weighted actor-learner architecture (IMPALA) algorithm34 using the same settings as Disco57; IMPALA’s performance was much weaker, suggesting that Disco57 has discovered a more efficient RL rule than standard approaches. In addition to generalizing across environments, Disco57 proved robust to a range of agent-specific settings in evaluation, such as network size, replay ratio and hyperparameters (Extended Data Fig. 1).
Complex and diverse environments
To understand the importance of complex and diverse environments for discovery, we further scaled up meta-learning with additional environments. Specifically, we discovered another rule, Disco103, using a more diverse set of 103 environments consisting of the Atari, ProcGen and DMLab-3035 benchmarks. This rule performs similarly on the Atari benchmark while improving scores on every other seen and unseen benchmark in Fig. 2. In particular, Disco103 reached human-level performance on Crafter and neared MuZero’s state-of-the-art performance on Sokoban36. These results show that the more complex and diverse the set of environments used for discovery, the stronger and more general the discovered rule becomes, even on held-out environments that were not seen during discovery. Discovering Disco103 required no changes to the discovery method compared with Disco57 other than the set of environments. This shows that the discovery process itself is robust, scalable and general.
To further investigate the importance of using complex environments, we ran our discovery process on 57 grid-world tasks extended from previous work3, using the same meta-learning settings as for Disco57. The resulting rule performed significantly worse on the Atari benchmark (Fig. 3c). This verifies our hypothesis about the importance of meta-learning directly from complex and challenging environments. Although using such environments was crucial, there was no need for careful curation of the exact set of environments; we simply used popular benchmarks from the literature.
Fig. 3 | a, Discovery efficiency. The best DiscoRL was discovered within 3 simulations of the agent’s lifetimes (200 million steps) per game. b, Scalability. DiscoRL becomes stronger on the ProcGen benchmark (30 million environment steps for all methods) as the training set of environments grows. c, Ablation. The plot shows the performances of variations of DiscoRL on Atari. ‘Without auxiliary prediction’ is meta-learned without the auxiliary prediction (p). ‘Small agents’ uses a smaller agent network during discovery. ‘Without prediction’ is meta-learned without learned predictions (y, z). ‘Without value’ is meta-learned without the value function (q). ‘Toy environments’ is meta-learned from 57 grid-world tasks instead of Atari games.
Efficiency and scalability
To further understand the scalability and efficiency of our approach, we evaluated intermediate versions of Disco57 over the course of discovery (Fig. 3a). The best rule was discovered within approximately 600 million steps per Atari game, which amounts to just three experiments across the 57 Atari games. This is arguably more efficient than the manual discovery of RL rules, which typically requires many more experiments, in addition to the time of the human researchers.
Furthermore, DiscoRL performed better on the unseen ProcGen benchmark as more Atari games were used for discovery (Fig. 3b), showing that the resulting RL rule scales well with the number and diversity of environments used for discovery. In other words, the performance of the discovered rule is a function of data (that is, environments) and compute.
Effect of discovering new predictions
To study the effect of the discovered semantics of predictions (y, z in Fig. 1b), we compared different rules by varying the outputs of the agent, with and without certain types of prediction. The result in Fig. 3c shows that the use of a value function markedly improves the discovery process, which highlights the importance of this fundamental concept of RL. However, the result in Fig. 3c also shows the importance of discovering new prediction semantics (y and z) beyond pre-defined predictions. Overall, increasing the scope of discovery compared with previous work1,2,3,4,5,6 was essential. In the following section, we provide further analysis to uncover what semantics have been discovered.
Analysis
Qualitative analysis
We analysed the nature of the discovered rule, using Disco57 as a case study (Fig. 4). Qualitatively, the discovered predictions spike in advance of salient events such as receiving rewards or changes in the entropy of the policy (Fig. 4a). We also investigated which features of the observation cause the meta-learned predictions to respond strongly, by measuring the gradient norm associated with each part of the observation. The result in Fig. 4b shows that the meta-learned predictions tend to pay attention to objects that may become relevant in the future, which differs from where the policy and the value function focus their attention. These results indicate that DiscoRL has learned to identify and predict salient events over a modest horizon, thus complementing existing concepts such as the policy and the value function.
Fig. 4 | a, Behaviour of discovered predictions. The plot shows how the agent’s discovered prediction (y) changes along with other quantities in Ms Pacman (left) and Breakout (right). ‘Confidence’ is calculated as negative entropy. Spikes in prediction confidence are correlated with upcoming salient events. For example, they often precede large rewards in Ms Pacman and strong action preferences in Breakout. b, Gradient analysis. Each contour shows where each prediction focuses in the observation through a gradient analysis in Beam Rider. The predictions tend to focus more on enemies at a distance, whereas the policy and the value tend to focus on nearby enemies and the scoreboard, respectively. c, Prediction analysis. Future entropy and large-reward events can be better predicted from discovered predictions. The shaded areas represent 95% confidence intervals. d, Bootstrapping horizon. The plot shows how much the prediction target produced by DiscoRL changes when the prediction at each time step is perturbed. The individual curves correspond to 16 randomly sampled trajectories and the bold curve corresponds to the average over them. e, Reliance on predictions. The plot shows the performance of the controlled DiscoRL on Ms Pacman without bootstrapping when updating predictions and without using predictions at all. The shaded areas represent 95% confidence intervals.
Information analysis
To confirm the qualitative findings, we further investigated what information is contained in the predictions. We first collected data from the DiscoRL agent on 10 Atari games and trained a neural network to predict quantities of interest from either the discovered predictions, the policy or the value function. The results in Fig. 4c show that the discovered predictions contain more information about upcoming large rewards and the future policy entropy than the policy and value do. This suggests that the discovered predictions may capture unique task-relevant information that is not well captured by the policy and value.
Emergence of bootstrapping
We also found evidence that DiscoRL uses a bootstrapping mechanism. Perturbing the meta-network’s prediction input at future time steps (zt+k) strongly affects the target \({\hat{{\bf{z}}}}_{t}\) (Fig. 4d). This means that the future predictions are used to construct targets for the current predictions. This bootstrapping mechanism and the discovered predictions turned out to be critical for performance (Fig. 4e). If the y and z inputs to the meta-network are set to zero when computing their targets \(\hat{{\bf{y}}}\) and \(\hat{{\bf{z}}}\) (thus preventing bootstrapping), performance degrades substantially. If the y and z inputs are set to zero when computing all targets, including the policy target, performance drops even further. This shows that the discovered predictions are heavily used to inform the policy update, rather than merely serving as auxiliary tasks.
Previous work
The idea of meta-learning, or learning to learn, in artificial agents dates back to the 1980s37, with proposals to train meta-learning systems with backpropagation of gradients38. The core idea of using a slower meta-learning process to meta-optimize a fast learning or adaptation process39,40 has been studied for numerous applications in various contexts, including transfer learning41, continual learning42, multi-task learning43, hyperparameter optimization44 and automated machine learning45.
Early efforts to use meta-learning for RL agents comprised attempts to meta-learn information-seeking behaviours46. Many later works have focused on meta-learning a small number of hyperparameters of an existing RL algorithm13,14. Such approaches have produced promising results but cannot markedly depart from the underlying handcrafted algorithms. Another line of work has attempted to eschew inductive biases by meta-learning entirely black-box algorithms implemented, for example, as recurrent neural networks47 or as a synaptic learning rule48. Although conceptually appealing, these methods are prone to overfit to tasks seen in meta-training49.
The idea of representing knowledge using a wider class of predictions was first introduced in temporal-difference networks50 but without any meta-learning mechanism. A similar idea has been explored for meta-learning auxiliary tasks25. Our work extends this idea to effectively discover an entire loss function that the agent optimizes, covering a much broader range of possible RL rules. Furthermore, unlike previous work, the discovered knowledge can generalize to unseen environments.
Recently, there has been growing interest in discovering general-purpose RL rules1,3,4,5,6,15. However, most of this work has been limited to small agents and simple tasks, or the scope of discovery has been limited to a partial RL rule. Consequently, the resulting rules were not extensively compared with state-of-the-art rules on challenging benchmarks. In contrast, we search over a larger space of rules, including entirely new predictions, and scale up to a large number of complex environments for discovery. As a result, we demonstrate that it is possible to discover a general-purpose RL rule that outperforms a number of state-of-the-art rules on challenging benchmarks.
Conclusion
Enabling machines to discover learning algorithms for themselves is one of the most promising ideas in artificial intelligence owing to its potential for open-ended self-improvement. This work has taken a step towards machine-designed RL algorithms that can compete with and even outperform some of the best manually designed algorithms in challenging environments. We also showed that the discovered rule becomes stronger and more general as it gets exposed to more diverse environments. This suggests that the design of RL algorithms for advanced AI may in the future be led by machines that can scale effectively with data and compute.
Methods
Meta-network
The meta-network maps a trajectory of agent outputs, along with relevant quantities from the environment, to targets: \({m}_{\eta }:({f}_{\theta }({s}_{t}),{f}_{{\theta }^{-}}({s}_{t}),{a}_{t},{r}_{t},{b}_{t},\ldots ,{f}_{\theta }({s}_{t+n}),{f}_{{\theta }^{-}}({s}_{t+n}),{a}_{t+n},{r}_{t+n},{b}_{t+n})\mapsto (\hat{{\boldsymbol{\pi }}},\hat{{\bf{y}}},\hat{{\bf{z}}})\), where η represents the meta-parameters and \({f}_{\theta }(s)=[{{\boldsymbol{\pi }}}_{\theta }(s),{{\bf{y}}}_{\theta }(s),{{\bf{z}}}_{\theta }(s),{{\bf{q}}}_{\theta }(s)]\) is the agent output with parameters θ. Here, a, r and b are an action taken by the agent, a reward and an episode termination indicator, respectively, and θ− is an exponential moving average of the parameters θ. This functional form allows the meta-network to search over a strictly larger space of rules than meta-learning a scalar loss function; this is discussed further in the Supplementary Information.
The meta-network processes the inputs by unrolling a long short-term memory (LSTM) backwards in time, as illustrated in Fig. 1c. This allows it to take into account n-step future information when producing targets, as in multi-step RL methods such as TD(λ)54. We found that this architecture is computationally more efficient than alternatives such as transformers, while achieving similar performance, as shown in Extended Data Fig. 3b.
The action-specific inputs and outputs are processed in the meta-network using weights shared across the action dimension, and an intermediate embedding is computed by averaging across it. This allows the meta-network to process any number of actions. More details can be found in the Supplementary Information.
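The sketch below (Python/JAX) illustrates these two mechanisms, the backward LSTM unroll and the weight sharing across the action axis, with a hand-rolled LSTM cell. All parameter names, shapes and the pooling scheme are assumptions for illustration; they are not the actual meta-network.

```python
# A minimal meta-network sketch: action-conditioned inputs are embedded with
# weights shared over the action axis, the LSTM is unrolled backwards in
# time, and targets are produced per step. Shapes are assumptions.
import jax
import jax.numpy as jnp


def lstm_cell(params, carry, x):
    """A plain LSTM cell; params = {'W': [4H, X+H], 'b': [4H]}."""
    h, c = carry
    gates = params['W'] @ jnp.concatenate([x, h]) + params['b']
    i, f, g, o = jnp.split(gates, 4)
    c = jax.nn.sigmoid(f) * c + jax.nn.sigmoid(i) * jnp.tanh(g)
    h = jax.nn.sigmoid(o) * jnp.tanh(c)
    return (h, c), h


def meta_network(params, traj):
    """Produce targets (pi_hat, y_hat, z_hat) for every step of a trajectory.

    traj holds per-step arrays: 'pi' [T, A], 'y' [T, n], 'z' [T, A, m],
    'reward' [T] and 'done' [T]. Assumed parameter shapes:
      W_action [1+m, d], W [4H, n+d+2+H], b [4H],
      W_y [H, n], W_pi [H+d, 1], W_z [H+d, m].
    """
    T, A = traj['pi'].shape
    hidden = params['b'].shape[0] // 4

    # Action-specific inputs are embedded with weights shared across the
    # action axis and mean-pooled, so any number of actions A is supported.
    per_action_in = jnp.concatenate([traj['pi'][..., None], traj['z']], axis=-1)
    a_emb = jnp.tanh(per_action_in @ params['W_action'])          # [T, A, d]
    per_step = jnp.concatenate(
        [traj['y'], a_emb.mean(axis=1),
         traj['reward'][:, None], traj['done'][:, None]], axis=-1)

    # Backward unroll: scanning over the time-reversed trajectory lets the
    # hidden state at step t summarize steps t..t+n (bootstrapped targets).
    init = (jnp.zeros(hidden), jnp.zeros(hidden))
    _, h_rev = jax.lax.scan(lambda c, x: lstm_cell(params, c, x),
                            init, per_step[::-1])
    h = h_rev[::-1]                                                # [T, H]

    # Heads: observation-conditioned targets from h; action-conditioned
    # targets from (h, per-action embedding), again with shared action weights.
    y_hat = h @ params['W_y']                                      # [T, n]
    h_a = jnp.concatenate(
        [jnp.broadcast_to(h[:, None, :], (T, A, hidden)), a_emb], axis=-1)
    pi_hat = (h_a @ params['W_pi'])[..., 0]                        # [T, A]
    z_hat = h_a @ params['W_z']                                    # [T, A, m]
    return pi_hat, y_hat, z_hat
```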
To allow the meta-network to discover a wider class of algorithms, such as reward normalization, that require maintaining statistics over an agent’s lifetime, we add an additional recurrent neural network. This ‘meta-RNN’ is unrolled forward across agent updates (from θi to θi+1), rather than across time steps in an episode. The core of the meta-RNN is another LSTM module. For each of the agent updates, the whole batch of trajectories is embedded into a single vector that is passed to this LSTM. The meta-RNN can potentially capture the learning dynamics throughout the agent’s lifetime, producing targets that are adaptive to the specific agent and the environment. The meta-RNN slightly improved the overall performance, as shown in Extended Data Fig. 3a. Further details are described in Supplementary Information.
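A minimal sketch of the meta-RNN idea follows, reusing `lstm_cell` from the sketch above; the batch pooling and parameter names are assumptions.

```python
# The meta-RNN state is carried across agent updates (theta_i -> theta_(i+1))
# rather than across time steps, so it can track lifetime statistics such as
# reward scale. `lstm_cell` is the cell defined in the previous sketch.
import jax.numpy as jnp


def meta_rnn_step(params, lifetime_state, batch_of_trajectories):
    """One agent update: embed the whole batch [B, T, F] into a single vector
    and advance the lifetime LSTM by one step."""
    batch_embedding = jnp.tanh(
        batch_of_trajectories.mean(axis=(0, 1)) @ params['W_embed'])
    new_state, lifetime_features = lstm_cell(
        params['lifetime_lstm'], lifetime_state, batch_embedding)
    # `lifetime_features` would be given to the meta-network as an extra input
    # when it produces targets for this agent update.
    return new_state, lifetime_features
```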
Meta-optimization stabilization
A number of challenges arise when we discover at a large scale, mainly because of unbalanced gradient signals coming from agents in different environments and myopic gradients caused by long lifetimes of agents. We introduce a few methods to alleviate these problems.
First, when using the advantage actor–critic method to estimate ∇θJ(θ) in the meta-gradient, we normalize the advantage as \(\bar{A}=(A-\mu )/\sigma \), where \(\bar{A}\) is the normalized advantage and μ and σ are the exponential moving average and standard deviation of the advantages accumulated over the agent’s lifetime. We found that this balances the scale of the advantage term across different environments. In addition, when aggregating the meta-gradient from the population of agents, we take the average of the meta-gradients over all agents after applying a separate Adam optimizer to the meta-gradient calculated from each agent: \(\eta \leftarrow \eta +\frac{1}{n}{\sum }_{i=1}^{n}\mathrm{Adam}({g}_{i})\), where gi is the meta-gradient estimate from the ith agent in the population. We found that this helps to normalize the magnitude of the meta-gradients across agents.
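The sketch below, using optax, illustrates these two stabilization steps under assumed decay rates and names. Note that optax transformations follow a descent convention, so the ascent update in the equation above corresponds to feeding the negated meta-gradient before adding the averaged update to η.

```python
# Advantage normalization with lifetime moving statistics, and per-agent
# Adam applied to meta-gradients before averaging across the population.
import jax
import jax.numpy as jnp
import optax


def normalize_advantage(adv, stats, decay=0.99, eps=1e-8):
    """stats = (mean, var): exponential moving estimates over the lifetime."""
    mean, var = stats
    mean = decay * mean + (1 - decay) * adv.mean()
    var = decay * var + (1 - decay) * ((adv - mean) ** 2).mean()
    return (adv - mean) / (jnp.sqrt(var) + eps), (mean, var)


def aggregate_meta_gradients(per_agent_grads, per_agent_opt_states, adam):
    """Apply a separate Adam transform to each agent's meta-gradient,
    then average the transformed updates across the population."""
    updates, new_states = [], []
    for g, s in zip(per_agent_grads, per_agent_opt_states):
        u, s = adam.update(g, s)
        updates.append(u)
        new_states.append(s)
    mean_update = jax.tree_util.tree_map(lambda *u: sum(u) / len(u), *updates)
    return mean_update, new_states


# Example setup (eta is a pytree of meta-parameters):
# adam = optax.adam(1e-3)
# states = [adam.init(eta) for _ in range(num_agents)]
# Feed -g_i to realize the ascent update eta <- eta + mean_i Adam(g_i),
# then apply with optax.apply_updates(eta, mean_update).
```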
We add two meta-regularization losses (Lent and LKL) to the meta-objective J(η) as follows: \({{\mathbb{E}}}_{{\mathcal{E}}}{{\mathbb{E}}}_{\theta }[J(\theta )-{L}_{{\rm{ent}}}(\theta )-{L}_{{\rm{KL}}}(\theta )]\). \({L}_{{\rm{ent}}}(\theta )=-{{\mathbb{E}}}_{s,a}[H({{\bf{y}}}_{\theta }(s))+H({{\bf{z}}}_{\theta }(s,a))]\) is an entropy regularization of the predictions y and z, where H(⋅) is the entropy of the given categorical distribution. We found that this helps prevent the predictions from converging prematurely. \({L}_{{\rm{KL}}}(\theta )={D}_{{\rm{KL}}}({{\boldsymbol{\pi }}}_{{\theta }^{-}}\parallel \hat{{\boldsymbol{\pi }}})\) is the Kullback–Leibler divergence between the policy of a target network with an exponential moving average of the agent parameters (θ−) and the meta-network’s policy target (\(\hat{{\boldsymbol{\pi }}}\)). This prevents the meta-network from proposing excessively aggressive updates that could lead to collapse.
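A minimal sketch of the two regularizers is given below, assuming the relevant agent outputs and the policy target are categorical logits.

```python
# Entropy regularization of the discovered predictions and a KL penalty
# between the EMA target-network policy and the meta-network's policy target.
import jax.numpy as jnp
from jax.nn import log_softmax, softmax


def entropy(logits):
    p = softmax(logits, axis=-1)
    return -jnp.sum(p * log_softmax(logits, axis=-1), axis=-1)


def meta_regularizers(y_logits, z_logits, target_policy_logits, pi_hat_logits):
    # L_ent: keep the discovered predictions from collapsing prematurely.
    l_ent = -(entropy(y_logits).mean() + entropy(z_logits).mean())
    # L_KL: penalize policy targets that move far from the EMA target network.
    p_tgt = softmax(target_policy_logits, axis=-1)
    l_kl = jnp.sum(p_tgt * (log_softmax(target_policy_logits, axis=-1)
                            - log_softmax(pi_hat_logits, axis=-1)), axis=-1).mean()
    return l_ent, l_kl
```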
Note that these methods are used only to stabilize meta-optimization; they do not determine how the agents are updated, which remains solely the role of the meta-learned rule.
Implementation details
We developed a framework that uses the JAX library55,56 and distributes computation across tensor processing units (TPUs)57, inspired by the Podracer architectures58. In this framework, each agent is simulated independently, with the meta-gradients of all agents being calculated in parallel. The meta-parameters are updated synchronously by aggregating meta-gradients across all agents. We used MixFlow-MG59 to minimize the computational cost of the runs.
For Disco57, we instantiate 128 agents by cycling through the 57 Atari environments in lexicographic order. For Disco103, we instantiate 206 agents, with two copies of each environment from Atari, ProcGen and DMLab-30. Disco57 was discovered using 1,024 TPUv3 cores for 64 hours, and Disco103 was discovered using 2,048 TPUv3 cores for 60 hours.
The meta-value function used to calculate the meta-gradient is updated using V-Trace34, with a discount factor of 0.997 and a TD(λ) coefficient of 0.95. The meta-value function and agent networks are optimized using an Adam optimizer with a learning rate of 0.0003. For meta-parameter updates, we use the Adam optimizer with a learning rate of 0.001 and gradient clipping of 1.0. Each agent is updated based on a batch of 96 trajectories with 29 time steps each. In each batch, on-policy trajectories and trajectories sampled from the replay buffer are mixed, with replay trajectories accounting for 90% of each batch. At each meta-step, 48 trajectories are generated to calculate the meta-gradient and update the meta-value function.
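For convenience, the settings listed above can be summarized in a single configuration; this is a plain dictionary mirroring the values stated in the text, with the surrounding training code not shown.

```python
meta_training_config = {
    'meta_value_discount': 0.997,      # V-Trace discount factor
    'meta_value_td_lambda': 0.95,      # TD(lambda) coefficient
    'agent_learning_rate': 3e-4,       # Adam, agent and meta-value networks
    'meta_learning_rate': 1e-3,        # Adam, meta-parameters
    'meta_gradient_clip': 1.0,
    'agent_batch_size': 96,            # trajectories per agent update
    'trajectory_length': 29,           # time steps per trajectory
    'replay_fraction': 0.9,            # fraction of replayed trajectories per batch
    'meta_step_trajectories': 48,      # trajectories generated per meta-step
}
```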
Each agent’s parameters are reset after it has consumed its allocated experience budget. When resetting, a new experience budget is sampled from the categories (200 million, 100 million, 50 million, 20 million) with a weight inversely proportional to the budget, such that the same amount of total experience is sampled in each category. This was based on our observation that much of learning happens early in the lifetime and demonstrated a marginal improvement in our preliminary small-scale investigation.
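A minimal sketch of this budget sampling follows; because the sampling weight is inversely proportional to the budget, each category contributes roughly the same total experience in expectation.

```python
# Sample lifetime budgets with probability inversely proportional to size.
import numpy as np

BUDGETS = np.array([200e6, 100e6, 50e6, 20e6])  # environment steps per lifetime
PROBS = (1.0 / BUDGETS) / (1.0 / BUDGETS).sum()


def sample_lifetime_budget(rng):
    return rng.choice(BUDGETS, p=PROBS)


rng = np.random.default_rng(0)
draws = rng.choice(BUDGETS, p=PROBS, size=100_000)
for b in BUDGETS:
    # Total experience drawn per category is roughly equal.
    print(int(b), int((draws == b).sum() * b))
```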
Hyperparameters and evaluation
For evaluation on held-out benchmarks, we only tuned the learning rate from {0.0001, 0.0003, 0.0005}. The rest of the hyperparameters were selected based on baseline algorithms from the literature.
The evaluation on Atari games (shown in Fig. 2a and Extended Data Table 1) used a version of the IMPALA34 network with an increased parameter count that matches the agent network size used by MuZero8. Specifically, we used a network with 4 convolutional residual blocks with 256, 384, 384 and 256 filters, a shared fully connected final layer of 768 dimensions, and an LSTM-based action-conditional prediction component composed of an LSTM with a hidden state of 1,024 dimensions and a 1,024-dimensional fully connected layer. The DMLab-30 evaluations (Fig. 2c and Extended Data Table 3) use the same action-space discretization and agent network architecture as IMPALA. See Extended Data Table 6 for the list of hyperparameters. To assess the statistical significance of our evaluations, we used two random seeds for initialization on each environment from Atari, ProcGen and DMLab, three seeds on Crafter and NetHack, and five seeds on Sokoban.
Analysis details
For the prediction analysis in Fig. 4c, we train multiple three-layer multilayer perceptrons (MLPs) with 128, 64 and 32 hidden units per layer, respectively. The MLPs are trained to predict quantities such as future entropy and rewards from the outputs of an agent trained on different Atari games using Disco57. We use 10 Atari games (Alien, Amidar, Battle Zone, Frostbite, Gravitar, Qbert, Riverraid, Road Runner, Robotank and Zaxxon). The values shown in Fig. 4c are R2 scores for future entropy and test accuracy for large-reward events, using fivefold cross-validation. Extended Data Fig. 2 provides an additional prediction analysis for more quantities. For the high-dimensional outputs (y, z, za), we use a larger three-layer MLP with 256 hidden units per layer.
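A sketch of this probing setup using scikit-learn is shown below; the feature and target arrays are random stand-ins for the quantities collected from the trained agent, and the max_iter setting is an assumption.

```python
# Small MLP probes read quantities of interest out of the agent's outputs,
# evaluated with fivefold cross-validation (R^2 for regression targets,
# accuracy for large-reward event classification).
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 32))               # stand-in for y outputs
future_entropy = rng.normal(size=5000)               # stand-in regression target
large_reward_event = rng.integers(0, 2, size=5000)   # stand-in binary labels

reg = MLPRegressor(hidden_layer_sizes=(128, 64, 32), max_iter=500)
r2_scores = cross_val_score(reg, features, future_entropy, cv=5, scoring='r2')

clf = MLPClassifier(hidden_layer_sizes=(128, 64, 32), max_iter=500)
acc_scores = cross_val_score(clf, features, large_reward_event, cv=5,
                             scoring='accuracy')

print(r2_scores.mean(), acc_scores.mean())
```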
Data availability
No external data were used for the results presented in the article.
Code availability
We provide the meta-training and evaluation code, with the meta-parameters of Disco103, under an open source licence at https://github.com/google-deepmind/disco_rl. All of the benchmarks presented in the article are publicly available.
References
Kirsch, L., van Steenkiste, S. & Schmidhuber, J. Improving generalization in meta reinforcement learning using learned objectives. In Proc. International Conference on Learning Representations (ICLR, 2020).
Kirsch, L. et al. Introducing symmetries to black box meta reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence 36, 7202–7210 (Association for the Advancement of Artificial Intelligence, 2022).
Oh, J. et al. Discovering reinforcement learning algorithms. In Proc. Adv. Neural Inf. Process. Syst. 33, 1060–1070 (NeurIPS, 2020).
Xu, Z. et al. Meta-gradient reinforcement learning with an objective discovered online. In Proc. Adv. Neural Inf. Process. Syst. 33, 15254–15264 (NeurIPS, 2020).
Houthooft, R. et al. Evolved policy gradients. In Proc. Adv. Neural Inf. Process. Syst. 31, 5405–5414 (NeurIPS, 2018).
Lu, C. et al. Discovered policy optimisation. In Proc. Adv. Neural Inf. Process. Syst. 35, 16455–16468 (NeurIPS, 2022).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Schrittwieser, J. et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2020).
Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. Mastering diverse control tasks through world models. Nature 640, 647–653 (2025).
Fawzi, A. et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022).
Degrave, J. et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602, 414–419 (2022).
Xu, Z., van Hasselt, H. P. & Silver, D. Meta-gradient reinforcement learning. In Proc. Adv. Neural Inf. Process. Syst. 31, 2402–2413 (NeurIPS, 2018).
Zahavy, T. et al. A self-tuning actor–critic algorithm. In Proc. Adv. Neural Inf. Process. Syst. 33, 20913–20924 (NeurIPS, 2020).
Jackson, M. T. et al. Discovering general reinforcement learning algorithms with adversarial environment design. In Proc. Adv. Neural Inf. Process. Syst. 36, 79980–79998 (NeurIPS, 2023).
Sutton, R. S. & Barto, A. G. Reinforcement learning: An Introduction (MIT Press, 2018).
Watkins, C. J. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at https://arxiv.org/abs/1707.06347 (2017).
Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In Proc. International Conference on Learning Representations (ICLR, 2017).
Barreto, A. et al. Successor features for transfer in reinforcement learning. In Proc. Adv. Neural Inf. Process. Syst. 30, 4055–4065 (NeurIPS, 2017).
Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In Proc. International Conference on Machine Learning 449–458 (PMLR, 2017).
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Cobbe, K., Hesse, C., Hilton, J. & Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In Proc. International Conference on Machine Learning 2048–2056 (PMLR, 2020).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Veeriah, V. et al. Discovery of useful questions as auxiliary tasks. In Proc. Adv. Neural Inf. Process. Syst. 32, 9306–9317 (NeurIPS, 2019).
Munos, R., Stepleton, T., Harutyunyan, A. & Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proc. Adv. Neural Inf. Process. Syst. 29, 1054–1062 (NeurIPS, 2016).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning 70, 1126–1135 (PMLR, 2017).
Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. International Conference on Machine Learning 48, 1928–1937 (PMLR, 2016).
Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C. & Bellemare, M. Deep reinforcement learning at the edge of the statistical precipice. In Proc. Adv. Neural Inf. Process. Syst. 34, 29304–29320 (NeurIPS, 2021).
Kapturowski, S. et al. Human-level Atari 200x faster. In Proc. International Conference on Learning Representations (ICLR, 2023).
Hafner, D. Benchmarking the spectrum of agent capabilities. In Proc. International Conference on Learning Representations (ICLR, 2022).
Küttler, H. et al. The nethack learning environment. In Proc. Adv. Neural Inf. Process. Syst. 33, 7671–7684 (NeurIPS, 2020).
Hambro, E. et al. Insights from the NeurIPS 2021 NetHack challenge. In Proc. NeurIPS 2021 Competitions and Demonstrations Track 41–52 (PMLR, 2022).
Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. International Conference on Learning Representations (ICLR, 2018).
Beattie, C. et al. DeepMind Lab. Preprint at https://arxiv.org/abs/1612.03801 (2016).
Racanière, S. et al. Imagination-augmented agents for deep reinforcement learning. In Proc. Adv. Neural Inf. Process. Syst. 30, 5690–5701 (NeurIPS, 2017).
Schmidhuber, J. Evolutionary Principles in Self-referential Learning, or on Learning How to Learn: the Meta-meta-… Hook. PhD thesis, Technische Univ. München (1987).
Schmidhuber, J. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. International Conference on Simulation of Adaptive Behavior: from Animals to Animats 222–227 (MIT Press, 1991).
Schmidhuber, J., Zhao, J. & Wiering, M. Simple Principles of Metalearning. Report No. IDSIA-69-96 (Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, 1996).
Thrun, S. & Pratt, L. Learning to Learn: Introduction and Overview 3-17 (Springer, 1998).
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2009).
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C. & Wermter, S. Continual lifelong learning with neural networks: a review. Neural Netw. 113, 54–71 (2019).
Caruana, R. Multitask learning. Mach. Learn. 28, 41–75 (1997).
Feurer, M. & Hutter, F. Hyperparameter Optimization 3–33 (Springer, 2019).
Yao, Q. et al. Taking human out of learning applications: a survey on automated machine learning. Preprint at https://www.arxiv.org/abs/1810.13306v3 (2018).
Storck, J. et al. Reinforcement driven information acquisition in non-deterministic environments. In International Conference on Artificial Neural Networks 2, 159–164 (ICANN, 1995).
Duan, Y. et al. RL2: fast reinforcement learning via slow reinforcement learning. Preprint at https://arxiv.org/abs/1611.02779 (2016).
Niv, Y., Joel, D., Meilijson, I. & Ruppin, E. Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adapt. Behav. 10, 5–24 (2002).
Xiong, Z., Zintgraf, L., Beck, J., Vuorio, R. & Whiteson, S. On the practical consistency of meta-reinforcement learning algorithms. Preprint at https://arxiv.org/abs/2112.00478 (2021).
Sutton, R. S. & Tanner, B. Temporal-difference networks. In Proc. Adv. Neural Inf. Process. Syst. 17, 1377–1384 (NeurIPS, 2004).
Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Cobbe, K., Hilton, J., Klimov, O. & Schulman, J. Phasic policy gradient. In Proc. International Conference on Machine Learning 139, 2020–2027 (PMLR, 2021).
Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence 32, 3215–3222 (Association for the Advancement of Artificial Intelligence, 2018).
Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. GitHub http://github.com/jax-ml/jax (2018).
DeepMind et al. The DeepMind JAX ecosystem. GitHub http://github.com/google-deepmind (2020).
Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In Proc. Annual International Symposium on Computer Architecture 1–12 (ISCA, 2017).
Hessel, M. et al. Podracer architectures for scalable reinforcement learning. Preprint at https://arxiv.org/abs/2104.06272 (2021).
Kemaev, I., Calian, D. A., Zintgraf, L. M., Farquhar, G. & van Hasselt, H. Scalable meta-learning via mixed-mode differentiation. In Proc. International Conference on Machine Learning 267, 29687–19605 (PMLR, 2025).
Acknowledgements
We thank S. Flennerhag, Z. Marinho, A. Filos, S. Bhupatiraju, A. György and A. A. Rusu for their feedback and discussions about related ideas; B. Huergo Muñoz, M. Kroiss and D. Horgan for their help with the engineering infrastructure; R. Hadsell, K. Kavukcuoglu, N. de Freitas and O. Vinyals for their high-level feedback on the project; and S. Osindero and D. Precup for their feedback on an early version of this work.
Ethics declarations
Competing interests
A patent application(s) directed to aspects of the work described has been filed and is pending as of the date of manuscript submission. Google LLC has ownership and potential commercial interests in the work described.
Peer review
Peer review information
Nature thanks Kenji Doya, Joel Lehman and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Robustness of DiscoRL.
The plots show the performance of Disco57 and Muesli on Ms Pacman by varying agent settings. ‘Discovery’ and ‘Evaluation’ represent the setting used for discovery and evaluation, respectively. (a) Each rule was evaluated on various agent network sizes. (b) Each rule was evaluated on various replay ratios, which define the proportion of replay data in a batch compared to on-policy data. (c) A sweep over optimizers (Adam or RMSProp), learning rates, weight decays, and gradient clipping thresholds was evaluated (36 combinations in total) and ranked according to the final score.
Extended Data Fig. 2 Detailed results for the regression and classification analysis.
Each cell represents the test score of one MLP model that has been trained to predict some quantity (columns) given the agent’s outputs (rows).
Extended Data Fig. 3 Effect of meta-network architecture.
(a, b) The x-axis represents the number of environment steps in evaluation and the y-axis the IQM on the Atari benchmark; all methods are discovered from 16 randomly selected Atari games, and the shaded areas show 95% confidence intervals. (a) The meta-RNN component slightly improves performance. (b) Each curve corresponds to a different meta-network architecture, with a varying number of LSTM hidden units or with the LSTM component replaced by a transformer. The choice of the meta-network architecture minimally affects performance.
Extended Data Fig. 4 Computational cost comparison.
The x-axis represents the number of TPU hours spent on evaluation. The y-axis represents the performance on the Atari benchmark. Each algorithm was evaluated on 57 Atari games for 200 million environment steps. DiscoRL reached MuZero’s final performance with approximately 40% less computation.
Supplementary information
Supplementary Information
This Supplementary Information file contains three sections: (1) Design of meta-learned rule; (2) Meta-network details; and (3) Meta-optimization details. It includes 4 Supplementary figures, 1 Supplementary table and Supplementary references.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Oh, J., Farquhar, G., Kemaev, I. et al. Discovering state-of-the-art reinforcement learning algorithms. Nature 648, 312–319 (2025). https://doi.org/10.1038/s41586-025-09761-x