Examples of discovered warriors competing against each other in Core War.
Abstract
Large language models (LLMs) are increasingly being used to evolve solutions to problems in many domains, in a process inspired by biological evolution. However, unlike biological evolution, most LLM-evolution frameworks are formulated as static optimization problems, overlooking the open-ended adversarial dynamics that characterize real-world evolutionary processes. Here, we study Digital Red Queen (DRQ), a simple self-play algorithm that embraces these so-called "Red Queen" dynamics via continual adaptation to a changing objective. DRQ uses an LLM to evolve assembly-like programs, called warriors, which compete against each other for control of a virtual machine in the game of Core War, a Turing-complete environment studied in artificial life and connected to cybersecurity. In each round of DRQ, the model evolves a new warrior to defeat all previous ones, producing a sequence of adapted warriors. Over many rounds, we observe that warriors become increasingly general (relative to a set of held-out human warriors). Interestingly, warriors also become less behaviorally diverse across independent runs, indicating a convergence pressure toward a general-purpose behavioral strategy, much like convergent evolution in nature. This result highlights a potential value of shifting from static objectives to dynamic Red Queen objectives. Our work positions Core War as a rich, controllable sandbox for studying adversarial adaptation in artificial systems and for evaluating LLM-based evolution methods. More broadly, the simplicity and effectiveness of DRQ suggest that similarly minimal self-play approaches could prove useful in other more practical multi-agent adversarial domains, like real-world cybersecurity or combating drug resistance.
Introduction
Now, here, you see, it takes all the running you can do, to keep in the same place.
— Red Queen to Alice, Lewis Carroll, Through the Looking-Glass

Building on recent test-time scaling trends
In the real world, biological, cultural, and technological evolution do not operate as optimization on a static fitness landscape, but are better described as open-ended arms races
To prepare for such a future, it is important to study the Red Queen dynamics of LLMs in an isolated scientific setting.
This pursuit requires a test bed that is rich enough to yield insights relevant to the real world, while still being in a sandbox where the researcher maintains full control.
Simulations from artificial life and cybersecurity naturally lend themselves to this goal since they prioritize adversarial dynamics in controlled environments
For this reason, we use Core War
To study adversarial evolution in Core War, we develop an algorithm called Digital Red Queen (DRQ) that uses LLMs to perform multiple rounds of evolutionary optimization to create new warriors.
DRQ is initialized with a single warrior program.
In the first round, it uses an LLM to evolve a second warrior that defeats the initial warrior within the Core War simulation.
In each subsequent round, DRQ continually evolves a new warrior to defeat all previous ones in a multi-agent simulation.
The champions of each round form a sequence of adapted warriors.
Rather than treating DRQ as a fundamentally novel algorithm, it should be viewed as a deliberately minimal instantiation of prior self-play approaches
DRQ uses MAP-Elites
We evaluate DRQ by putting the generated warriors in competition against human-designed warriors. The baseline of static optimization (single-round DRQ) is able to synthesize specialist warriors that collectively defeat or match 283 out of 294 human warriors. However, inspection of individual performance reveals that these warriors are brittle and overfit to their training opponent: any single warrior defeats only about 28% of the human-designed warriors. In contrast, DRQ trains against a growing history of opponents, implicitly incentivizing the emergence of robust, generalist strategies capable of handling diverse threats. When running full DRQ, analysis of its sequence of warriors reveals an intriguing pattern: with more DRQ rounds, the resulting warriors become increasingly general, while simultaneously exhibiting reduced behavioral diversity across independent runs. Together, these two trends indicate an emergent convergence pressure toward a single general-purpose behavior in Core War. This phenomenon is reminiscent of convergent evolution in nature, such as the independent evolution of mammalian and insect eyes to address similar functional demands.
Given the strong performance of DRQ in Core War, we investigate the extent to which LLMs understand this domain. The mapping from a warrior's Redcode source code to its performance requires an expensive simulation that is highly chaotic: small changes in code can lead to drastic changes in battle outcomes. Given the large number of warriors generated by DRQ, we ask whether an LLM can directly predict the outcome of a battle between two warriors using only their source code. To test this, we embed warrior source code using a text embedding model and train a linear probe to predict the warrior's final generality score. We are able to predict generality scores with a test R²=0.461. This result opens a path toward strengthening such predictors and ultimately using them either as surrogate models to bypass simulation or as interpretability tools for understanding what makes source code effective.
Overall, DRQ illustrates how LLM research might move beyond static problem settings and toward more realistic open-ended environments characterized by Red Queen dynamics. At the same time, we hope this work encourages adoption of Core War as (1) a safe and expressive testbed for studying artificial evolution and (2) a benchmark for evaluating an LLM's Red Queen capabilities. The DRQ algorithm itself is simple and general, and could be applied to other adversarial domains, such as discovering real cybersecurity exploits/defenses, designing biological viruses/drugs, or exploring any other complex multi-agent environment of interest. Systematically exploring adversarial dynamics in controlled environments is an important step toward discovering potential dangers before they arise in real-world systems.
Related Work
Program-Based Competition in Artificial Life
Driven by the goal of understanding "life as it could be"
Core War
Core War was originally made as a competitive programming game in 1984
Because the Core does not distinguish between code and data, every instruction can be read, written, or executed.
This creates a highly volatile environment where self-modifying code is commonplace.
A program can inject a DAT instruction in front of an opponent's process, terminating it when that process attempts to execute it.
Some strategies include bombing (placing DATs throughout the Core), replication (copying the warrior's own code into multiple memory locations), and scanning (probing the Core to locate enemies before striking)
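To make these mechanics concrete, the toy sketch below (in Python, with deliberately simplified semantics that ignore most of what a real MARS simulator does; the `Warrior` and `step` names are illustrative, not part of any real simulator) shows why executing a DAT is fatal to a process and why multi-process warriors are harder to eliminate.

```python
from collections import deque
from dataclasses import dataclass, field

CORE_SIZE = 8000  # a common core size; real hills vary

@dataclass
class Warrior:
    name: str
    processes: deque = field(default_factory=deque)  # queue of instruction pointers

def step(core: list[str], w: Warrior) -> None:
    """Execute one instruction for w's next process (toy semantics only)."""
    if not w.processes:
        return  # no live processes remain: the warrior is dead
    pc = w.processes.popleft()
    opcode = core[pc % CORE_SIZE].split()[0]
    if opcode == "DAT":
        return  # executing DAT kills this process: it is not re-queued
    # Any other opcode: a real MARS would also apply its effect on the core here.
    w.processes.append((pc + 1) % CORE_SIZE)
```

Under this picture, a bomber wins by overwriting locations its opponent is about to execute with DAT, while a replicating or multi-process warrior survives until every one of its queued processes has been steered into one.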
Many prior works have evolved warriors using genetic programming
Open-Ended Coevolution and Self-Play
One of the core mechanisms that led to complexity in biology is the Red Queen dynamics of evolutionary arms races
Reinforcement learning has also taken inspiration from Red Queen dynamics in the form of self-play.
Self-play describes situations in which agents are trained in environments where the opponent is themselves, a historical copy of themselves, or related to them in some way
Within self-play, our DRQ algorithm is closely related to Fictitious Self-Play (FSP)
LLM-Guided Evolution
Recent work has begun to merge LLMs with evolutionary algorithms, using LLMs as intelligent mutation or generation operators.
This approach is appealing because it exploits the model's prior knowledge to propose domain-aware edits, while grounded selection expands discovery capabilities beyond the model's pretraining distribution
LLMs for Self-Play
Dharna et al.
Bachrach et al.
Self-play with LLMs has also been used to improve LLM capabilities
Our work unifies these threads by connecting LLMs, coevolution, self-play, and ALife within the rich testbed of Core War. This combination enables the study of Red Queen dynamics in a controlled, yet expressive, environment.
Methods: Digital Red Queen
Our approach, which we call Digital Red Queen (DRQ), is built on prior works on self-play
DRQ Algorithm
DRQ begins with an initial warrior and proceeds through a sequence of rounds, each performing an evolutionary optimization. In each round $t$, a new warrior $w_t$ is evolved to defeat the set of all previous warriors $\{w_0, \dots, w_{t-1}\}$. This process induces a competitive pressure that changes every round, driving the emergence of novel strategies and counter-strategies. The algorithm is detailed below:
Step 1 - Initialization: Start with a base warrior $w_0$, which is either human-designed or LLM-generated.
Step 2 - Adversarial Optimization: At round $t$, optimize a new warrior $w_t$ to maximize its expected fitness in an environment that includes all prior warriors $\{w_0, \dots, w_{t-1}\}$. The expectation is over different seeds of evaluation.
Step 3 - Iteration: Repeat for a fixed number of rounds, generating a lineage of warriors $w_0, w_1, w_2, \dots$ (a minimal sketch of this loop is given below).
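The sketch below captures this outer loop; `evolve_with_map_elites` stands in for the intra-round optimization described in the next subsection, and the `history` argument anticipates the history-length ablation discussed later (both names are ours, not the paper's).

```python
def digital_red_queen(initial_warrior, num_rounds, history=None):
    """Run DRQ for `num_rounds` rounds and return the lineage of champions."""
    lineage = [initial_warrior]  # Step 1: initialization with w_0
    for t in range(1, num_rounds + 1):
        # Step 2: the environment is every previous champion (full DRQ),
        # or only the most recent `history` champions when ablating.
        opponents = lineage if history is None else lineage[-history:]
        champion = evolve_with_map_elites(opponents, init_pop=opponents)
        lineage.append(champion)  # Step 3: extend the lineage with the round's champion
    return lineage
```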
We do not update older warriors in the lineage, as prior work has shown that historical self-play promotes stability and mitigates cyclic dynamics
Because the number of warriors increases each round, the marginal influence of any newly introduced warrior on the environment decreases over time, implying that the induced fitness function changes less and less as $t \to \infty$.
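To make this concrete (under the simplifying assumption that the round-$t$ objective weights the opponents in the environment roughly equally):

$$E_t = \{w_0, \dots, w_{t-1}\}, \qquad \frac{|E_{t+1} \setminus E_t|}{|E_{t+1}|} = \frac{1}{t+1} \;\longrightarrow\; 0,$$

so each newly added champion accounts for a vanishing fraction of the objective as rounds accumulate.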
Since program synthesis presents a highly deceptive search landscape, most greedy algorithms can get stuck in local optima
Intra-round Optimization with MAP-Elites
MAP-Elites is a widely used quality-diversity algorithm that discretizes a user-defined behavioral descriptor space into a set of cells, each storing at most one elite solution that exhibits the behavioral characteristics of that cell. By restricting competition to solutions that fall within the same cell, MAP-Elites imposes localized selection pressure while preserving global diversity. Partitioning with respect to behavior allows the archive to maintain a broad set of stepping stones, many of which may be individually poor but crucial for discovering strong strategies in other regions of behavior space. This property makes MAP-Elites particularly well suited for Redcode program synthesis.
MAP-Elites follows a simple evolutionary procedure. The archive maps a predefined set of behavioral cells to their current elite solutions. After initializing with random solutions, it performs the following steps: (i) randomly sample an individual $x$ from the archive; (ii) mutate $x$ to produce an offspring $x'$; (iii) evaluate its fitness $f(x')$ and behavioral descriptor cell $c(x')$; (iv) insert $x'$ into the archive at $c(x')$ if $f(x')$ exceeds the fitness of the current elite in that cell (or if that cell is empty). Iterating this process gradually fills the archive with increasingly high-performing, behaviorally diverse solutions.
The fitness function depends on the current round $t$ of DRQ, yielding $f_t$. We define the behavior descriptor as the discretized tuple (total spawned processes, total memory coverage), which captures two high-level aspects of a warrior's behavior during simulation. We optionally initialize the MAP-Elites archive in round $t$ using all previous champions to bootstrap the optimization.
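Under these definitions, the intra-round optimizer can be sketched as follows (the helpers `simulate`, `llm_mutate`, and `descriptor_cell` are placeholders for the battle simulation, the LLM mutation operator described next, and the descriptor discretization):

```python
import random

def evolve_with_map_elites(opponents, init_pop, iterations=1000):
    """One DRQ round: MAP-Elites over warriors, with fitness defined vs. `opponents`."""
    archive = {}  # cell -> (fitness, warrior)

    def try_insert(warrior):
        fitness, stats = simulate(warrior, opponents)  # fitness f_t depends on the round's opponents
        cell = descriptor_cell(stats)                  # (spawned processes, memory coverage), discretized
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, warrior)         # (iv) keep the better elite

    for warrior in init_pop:                           # bootstrap, e.g. with previous champions
        try_insert(warrior)
    for _ in range(iterations):
        parent = random.choice(list(archive.values()))[1]  # (i) sample an elite
        child = llm_mutate(parent)                         # (ii) LLM-driven mutation
        try_insert(child)                                  # (iii)-(iv) evaluate and insert
    return max(archive.values(), key=lambda e: e[0])[1]    # champion of the round
```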
LLMs as the Mutation Operator
Within DRQ, LLMs are used to generate new warriors and to mutate existing ones. In all cases, the model receives a system prompt describing the Core War environment and a concise manual for the Redcode assembly language, including its opcodes, addressing modes, and an example warrior. To generate a new warrior, the LLM is given a user prompt instructing it to produce a novel Redcode program. To mutate an existing warrior, the LLM is provided with the original program and instructed to modify it in ways that could improve performance.
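A sketch of this mutation operator using the OpenAI chat API is shown below; the prompts are illustrative stand-ins, not the exact system prompt or instructions used in the study.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative only: the real system prompt contains the Core War rules, a concise
# Redcode manual (opcodes, addressing modes), and an example warrior.
SYSTEM_PROMPT = "You write Core War warriors in Redcode. <concise Redcode manual here>"

def llm_mutate(warrior_source: str, model: str = "gpt-4.1-mini") -> str:
    """Ask the LLM to modify an existing warrior in ways that could improve it."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": "Modify this Redcode warrior to improve its performance. "
                        "Return only the complete modified program.\n\n" + warrior_source},
        ],
    )
    return response.choices[0].message.content
```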
We intentionally chose this simplistic use of LLMs to keep the focus of the study on Core War and the analysis of evolution, rather than on LLM-specific techniques.
Other methods for applying an LLM to modify code exist and could easily be integrated into the DRQ framework.
For example, an LLM could output a diff
It is possible to run DRQ without LLMs by relying solely on random generation and random mutation over the space of opcodes, addressing modes, and numeric parameters.
However, in extremely sparse search spaces, where most points and mutations produce invalid or non-functional programs, some prior over the search space is crucial for practical search efficiency
Self-Play and Red Queen Dynamics
DRQ is purposely one of the simplest multi-agent self-play algorithms that can be constructed for evolving warriors in Core War. DRQ's multi-round design ensures that fitness is not measured by performance against a fixed opponent, but rather against a continually growing population of opponents. This shifting landscape embodies Red Queen dynamics: each new warrior must continually adapt to overcome the latest strategies, driving a process of adversarial innovation.
Experiments
We evaluate DRQ with experiments designed to assess 1) its ability to evolve generally competitive Core War programs, and 2) its capacity for continual improvement through Red Queen dynamics.
All experiments use the following fitness function, which accounts for both survival and dominance within the battle. In a battle with $N$ warriors and $T$ simulation timesteps, a total of $N$ units of fitness are distributed evenly over time. At each timestep, the remaining (living) warriors share a fitness of $N/T$. This design incentivizes warriors to survive as long as possible while also eliminating others to increase their share of the reward. The cumulative reward across all timesteps defines warrior $i$'s fitness, where $a_i(t)$ is an indicator for whether warrior $i$ is alive at simulation timestep $t$:

$$f_i \;=\; \sum_{t=1}^{T} \frac{N}{T} \cdot \frac{a_i(t)}{\sum_{j=1}^{N} a_j(t)}.$$
Note that a warrior's fitness is context-dependent: it is defined relative to the other warriors in the battle. A warrior is said to defeat another warrior if it achieves higher fitness in a 1-on-1 battle between the two.
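A direct sketch of this computation is below; the absolute scale of the budget is a convention and does not change which warrior defeats which.

```python
def battle_fitness(alive: list[list[bool]]) -> list[float]:
    """alive[t][i]: whether warrior i is alive at timestep t; returns per-warrior fitness."""
    T, N = len(alive), len(alive[0])
    fitness = [0.0] * N
    for t in range(T):
        survivors = sum(alive[t])
        if survivors == 0:
            continue                 # nobody left to share this timestep's budget
        share = (N / T) / survivors  # living warriors split N/T equally
        for i in range(N):
            if alive[t][i]:
                fitness[i] += share
    return fitness
```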
All experiments use the following MAP-Elites behavioral descriptors: 1) the total number of spawned threads (via SPL opcodes), and 2) the total memory coverage of the warrior during simulation.
These axes capture two important strategic aspects of warriors in Core War.
The grid is discretized in log space.
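A sketch of this discretization is shown below; the number of bins and the upper bounds are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

def descriptor_cell(stats, n_bins=16, max_processes=8000, max_coverage=8000):
    """Map (total spawned processes, total memory coverage) to a log-spaced grid cell."""
    processes, coverage = stats
    proc_edges = np.logspace(0, np.log10(max_processes), n_bins + 1)
    cov_edges = np.logspace(0, np.log10(max_coverage), n_bins + 1)
    i = int(np.clip(np.digitize(processes, proc_edges) - 1, 0, n_bins - 1))
    j = int(np.clip(np.digitize(coverage, cov_edges) - 1, 0, n_bins - 1))
    return (i, j)
```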
All experiments use GPT-4.1 mini (gpt-4.1-mini-2025-04-14)
For terminology, rounds correspond to steps of DRQ (outer loop), while iterations correspond to optimization steps within a round (inner loop).
Static Target Optimization Against Human Warriors
The first experiment evaluates the effectiveness of static optimization against a target. This baseline corresponds to a single round of DRQ. We use a dataset of 294 human warriors and perform one 1000-iteration optimization run for each. We do not initialize the optimization with the human warriors.

The figure above summarizes the results. A single warrior generated by the LLM zero-shot defeats, on average, only 1.7% of all human warriors, which is expected given that Redcode is relatively out-of-distribution in most LLM pretraining datasets. Using a best-of-N sampling strategy produces a set of warriors that can collectively defeat 22.1% of human warriors for N=8. In contrast, evolutionary optimization against each human warrior generates a specialized warrior for every opponent; this set can collectively defeat 89.1% of human warriors and defeat or tie 96.3%. The large jump in performance from best-of-N to evolved warriors demonstrates how evolution can drive performance in out-of-distribution domains.
These numbers reflect specialist performance: the percentage of human warriors defeated by at least one of the evolved warriors. Another metric is generalist performance: the percentage of human warriors defeated or tied by a single warrior. On average, an evolved warrior can defeat or tie only 27.9% of all human warriors, indicating that they are brittle and likely overfit to their training opponent.
Iterative Red Queen Dynamics

Our second experiment investigates the dynamics of running DRQ for multiple rounds. Due to the computational cost, we select a smaller dataset of 96 diverse human warriors and conduct multi-round DRQ runs against each one. We ablate the effect of the history length $H$ in DRQ, which determines how many previous champions each round optimizes against. For example, $H=1$ plays against only the previous round's champion, while $H=3$ considers the champions from the previous three rounds. We initialize the optimization in each round with all prior champions.
To ground the analysis, for each query warrior we measure its fitness in 1-on-1 battles against a dataset of 317 human warriors.
A warrior's generality is defined as the fraction of human warriors it defeats or ties, measuring its robustness to new threats in a zero-shot manner.
A warrior's phenotype is defined as the vector of fitness values against each unseen human opponent, capturing its black-box performance profile against a diverse range of strategies.
A warrior's genotype is defined as a text embedding of its source code, representing the lowest-level description of the warrior.
We get embeddings using the OpenAI text-embedding-3-small model
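Concretely, given the matrix of 1-on-1 results against the human suite, generality and phenotype can be computed as sketched below (the array names are ours).

```python
import numpy as np

# warrior_fitness[i, j]: fitness of query warrior i in its 1-on-1 battle against
# human warrior j; opponent_fitness[i, j]: that human warrior's fitness in the same battle.
def generality(warrior_fitness: np.ndarray, opponent_fitness: np.ndarray) -> np.ndarray:
    """Fraction of human warriors each query warrior defeats or ties."""
    return (warrior_fitness >= opponent_fitness).mean(axis=1)

# Each row of warrior_fitness is that warrior's phenotype: its black-box
# performance profile against the full suite of human opponents.
```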
The figure above summarizes the dynamics of multi-round DRQ across 96 independent runs. Across all history lengths $H$, we observe a consistent increase in average generality over rounds (Left), indicating that DRQ progressively discovers more robust warriors. This trend suggests that optimizing against a small but changing set of adversaries can induce a pressure towards generality.
At the phenotype level, DRQ exhibits two distinct forms of convergence. First, the variance of warriors' phenotypes across independent runs decreases over rounds (Middle Left), indicating convergence across different initial conditions. Second, the rate of change of the phenotype decreases over rounds within each run (Middle Right), indicating convergence toward a stable phenotype within a single run. The latter effect is partly expected, as the fitness function changes more slowly in later rounds. However, convergence across different independent runs is largely unexpected and suggests a universal attractor in phenotype space.
In contrast, no corresponding convergence is observed at the genotype level. The variance of genotypes across runs remains approximately constant over many rounds (Right), indicating that DRQ does not collapse onto a single canonical implementation. This dissociation between phenotypic and genotypic convergence is further emphasized in the figure below, which visualizes two principal axes of the phenotype and genotype spaces.

Under the logarithmic fits, full phenotypic convergence would require an exponential number of rounds, implying that while convergence pressure exists, it is weak and only detectable statistically when aggregating many runs.
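As a concrete reading of this claim, assume the cross-run phenotypic variance follows the fitted form $\sigma^2(t) \approx a - b \ln t$ with constants $a, b > 0$. It then reaches zero only at

$$t^{*} = e^{a/b},$$

which is exponential in the fitted coefficients, so far more rounds than we ran would be required for full convergence.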
Taken together, these results suggest that DRQ drives warriors toward similar general-purpose behaviors while preserving diversity in their underlying implementations. This mirrors the phenomenon of convergent evolution in biology: different species have evolved similar traits (like eyes or wings) independently, but through distinct genetic mechanisms. In both DRQ and biology, this phenomenon is likely driven by the selection pressure on phenotypic function rather than on the underlying genotypic representation.
Cyclic Dynamics

Cyclic dynamics are a well-known phenomenon in self-play and coevolutionary systems, where agents rotate among strategies that dominate one another, analogous to rock–paper–scissors
We define a cycle as a triplet of warriors $(A, B, C)$ such that $A$ defeats $B$, $B$ defeats $C$, and $C$ defeats $A$.
As the history length increases from $H=1$ to the full history of all previous champions (full DRQ), we observe a 77% reduction in the total number of cycles across all runs.
This finding is consistent with prior work showing that incorporating historical opponents into self-play reduces cyclic behavior
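The cycle count reported above can be computed directly from the pairwise results, as sketched below (assuming `beats[a][b]` is true when warrior `a` defeats warrior `b` in their 1-on-1 battle).

```python
from itertools import combinations

def count_cycles(warriors, beats):
    """Count rock-paper-scissors triplets: a beats b, b beats c, and c beats a."""
    cycles = 0
    for a, b, c in combinations(warriors, 3):
        for x, y, z in ((a, b, c), (a, c, b)):  # the two possible cyclic orientations
            if beats[x][y] and beats[y][z] and beats[z][x]:
                cycles += 1
    return cycles
```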
What Makes a Good Core War Warrior?

This section investigates what makes a good warrior in Core War.
Since our search was conducted using MAP-Elites
The figure above visualizes the MAP-Elites grid along the two predefined axes of memory coverage and spawned threads. Reported fitness values within each bin are averaged across 1,920 MAP-Elites grids from the full DRQ runs. Although this averaging is not strictly justified, since fitness is defined relative to an opponent, it serves as a rough heuristic that provides meaningful insights. Warriors that fork many threads tend to perform best. This aligns with intuition: eliminating such a warrior requires halting all of its threads, and having more threads makes this increasingly difficult. Interestingly, among programs that create fewer threads, a different strategy emerges: maximizing memory coverage, suggesting that spatial spread is robust primarily when parallelism is limited.

The figures above show two warriors discovered by DRQ called Ring Warrior Enhanced v9 and Spiral Bomber Optimized v22.
These examples were selected to illustrate two complementary aspects of DRQ: its ability to synthesize qualitatively distinct strategies within a single program, and to produce generally performant warriors.
Does MAP-Elites Matter?

This section investigates the role of MAP-Elites in DRQ. We replace MAP-Elites with a single-cell variant that maps all candidate warriors to the same cell, thereby removing the critical diversity-preserving mechanism.
As shown in the figure above, this variant significantly reduces optimization performance in each round. These results highlight the importance of preserving diversity during search in Core War program synthesis and justify MAP-Elites as the intra-round optimization algorithm.
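In terms of the MAP-Elites sketch given earlier, this ablation amounts to replacing the behavioral descriptor with a constant, so the archive never holds more than one warrior:

```python
def single_cell_descriptor(stats):
    return (0, 0)  # every candidate lands in the same cell; only the single best survives
```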
Is Fitness Predictable?

Determining the generality of a warrior requires many simulations against a suite of human-designed opponents.
These simulations are computationally expensive.
This raises a natural question: can we statistically predict a warrior's final generality score more cheaply using only its source code?
To investigate this, we embed the raw Redcode source code of all warriors discovered by DRQ using the OpenAI text-embedding-3-small and text-embedding-3-large models
As shown in the figure above, the linear regression achieves a test R² = 0.442 using the small embedding model and R² = 0.461 using the large embedding model. These results indicate that a warrior's generality can be moderately predicted from its source code alone. This is notable given the complexity of the underlying mapping: generality is determined by 317 separate 80,000-timestep simulations, each involving chaotic interactions with opponents and extreme sensitivity to small code changes.
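A sketch of this probe is shown below; the regression variant, split, and regularization are illustrative choices and may differ from those used in the paper.

```python
import numpy as np
from openai import OpenAI
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

client = OpenAI()

def embed(source_codes: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed raw Redcode source with an OpenAI text embedding model."""
    response = client.embeddings.create(model=model, input=source_codes)
    return np.array([item.embedding for item in response.data])

def fit_generality_probe(source_codes, generality_scores):
    """Train a linear probe from source-code embeddings to generality scores."""
    X = embed(source_codes)
    X_train, X_test, y_train, y_test = train_test_split(
        X, np.asarray(generality_scores), test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_train, y_train)
    return probe, r2_score(y_test, probe.predict(X_test))
```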
Predictive models of battles could open new doors for future exploration.
First, they may enable mechanistic interpretability of the embedding model and linear probe, helping to decipher what makes good source code.
Second, they could potentially be used to pre-filter warriors or even bypass full simulations entirely during the search for new programs.
If successful, this approach would challenge a prevailing intuition that complex systems cannot be predicted without running the full simulation
Conclusion
Summary This work studies a minimal self-play algorithm that leverages LLMs to drive adversarial program evolution in Core War. We show that evolving against a growing history of opponents produces more robust strategies and exhibits convergence across independent runs, a phenomenon reminiscent of convergent evolution in biology.
Discussion
Recently, malicious hackers have started leveraging LLMs to their advantage, and the cybersecurity arms race between offense and defense is well underway
Because Core War is Turing-complete, it can simulate arbitrary algorithms, providing a rich environment for exploring behaviors relevant to real-world systems. At the same time, Core War is entirely self-contained: its programs run on an artificial machine with an artificial language, making it impossible for any generated code to execute outside the sandbox. This isolation provides a necessary layer of safety for this line of research.
Algorithmically, DRQ is a simple loop: each new agent is optimized to defeat a fixed set of past agents, creating a linear lineage with no updating of earlier strategies. Future extensions could explore richer settings where many agents simultaneously co-evolve within a shared ecosystem. Such extensions would more closely mirror real-world phenomena, from microbial communities to the modern cybersecurity landscape, where large populations adapt in parallel rather than along a single line of descent.
Despite its simplicity, vanilla DRQ performs remarkably well in a rich testbed like Core War, suggesting that this minimal self-play algorithm is worth studying in greater depth. DRQ is a promising candidate for application to other competitive multi-agent environments. In principle, the core ideas in DRQ could transfer to other domains like artificial life simulations, biological modeling for drug design, real-world cybersecurity, and even competitive market ecosystems.