Hiranmay Darshane


Thinking out loud: evolution and pretraining

February 2026

Meta: Thinking out loud, super speculative. Rough chain-of-thought, no proofreads.


tl;dr
Evolution doesn't pretrain specific behaviors or policies; it meta-trains learning systems across timescales. The genome encodes architecture and learning priors, not weights, while pretraining optimizes weights directly. The analogy holds only at the level of "both are tasked with producing useful priors," and breaks mechanistically. But much good can still be derived from thinking in its terms.

A very appealing analogy in machine learning circles, one you see tweeted about every now and then and one that has been particularly popularized recently by Karpathy[1] on his Dwarkesh appearance, is that of evolution as some kind of pretraining mechanism. The analogy is intuitive on the surface: pretraining and evolution are both long, expensive processes that hand downstream learning a useful starting point. My speculation as to why it is so popular is that people conflate that surface similarity with mechanistic similarity.

In my opinion, the analogy between evolution and pretraining is shallow: it holds only at a first-order level of utility. The interesting biology, and likely the interesting machine learning, is in the higher-order levels.

We know that pretraining initialises weights exactly: SGD drives each weight to a specific value, and those values aren't dynamic - they stay constant thereafter. Assume, for the sake of the analogy, that evolution is SGD and that it initialises the genome as some equivalent of the actual weight matrices.

From this point on, the analogy breaks. The neuroscientific evidence sits uneasily with it.

Synaptic connectivity is demonstrably experience-dependent. If a fixed genome set P dictated some behavior, that behavior should be consistent across a population carrying P and independent of the environment. That isn't the case.

Hubel and Wiesel's "monocular deprivation" experiments[2] showed that occluding one eye during a critical window permanently reorganizes the connectivity of primary visual cortex - the deprived eye loses cortical representation to the open eye. The final wiring is not read out and implemented from the genome; it is shaped by what input arrived and when. The genome probably just provides the "reaction conditions", while the reactants come from the environment - environmental input the genome treats as a kind of prior while tuning those "reaction conditions".

The same pattern appears in a few different ways. Children raised in severe sensory or social deprivation - feral child cases, or the Romanian orphanage studies - fail to develop foundational capacities: language syntax, face recognition, attachment. These are not obscure skills; one would assume they are baked into the genome given their criticality. Once again, the genome probably encodes a scaffold-type prior for them, but requires environmental input during specific windows to finally structure them.

Similar experience-dependence has been documented across sensory systems, in hippocampal circuit formation, and in the organization of higher cortical areas. The genome does not appear to be storing a configuration that development merely unpacks. One can push back on "experience-dependency" with an argument like "that's just neural superposition", and I think that is a valid argument: any kind of pluripotency or experience-dependency would necessitate superposition, in the sense of a given direction in weight space being used for multiple features. Let's term this environment-variant superposition. We will encounter another superposition-type phenomenon ahead too.
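For the ML sense of "superposition" invoked here, a minimal toy helps: pack more features than dimensions into a space and the feature directions are forced to be shared, so reading out one feature leaks interference into the others. Everything below is illustrative, not from the post:

```python
import numpy as np

# Toy superposition: 3 "features" embedded in a 2-D space, so the
# feature directions must overlap (more features than dimensions).
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 3 features x 2 dims

x = np.zeros(3)
x[0] = 1.0                # activate feature 0 alone
hidden = W.T @ x          # compress into the 2-D space
readout = W @ hidden      # decode all features back out

# Feature 0 is recovered strongly, but because directions are shared,
# interference leaks into the other two features.
print(readout)            # ~[1.0, -0.5, -0.5]
```

The environment-variant case the post describes is a stronger claim than this toy: here the shared directions are fixed once the matrix is set, whereas the biological claim is that which feature a direction serves is decided at runtime by the environment.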

This is getting murky, and there is a satisfying reconciliatory end to this section, but let me ramble on for a while before getting to it.

Thus, instead of specific "weights" or "circuits", what DNA plausibly encodes is architectural constraints and learning-relevant structure: the gross organization of visual cortex, the distinct circuit motifs of hippocampus and cerebellum, and the like. In this sense, the genome is responsible for creating a basic substrate that learns and adapts dynamically to its environment, rather than warmly initialised weights.

Also, we know that genetic material is present in each cell of organisms like humans. The genetic material must guide each cell division, across an insane amount of variety. There is another pressure on the genome here: the genome itself must be pluripotent to some degree - you can think of this as the entire weight matrix being responsible for each token - and there is evidence that the same coding regions in DNA code for different things depending on the cell. Once again, this becomes easily amenable to a "this is superposition" argument. Let's term this expression-variant superposition.

There is a simple objection covering both superpositions: superposition does the same thing in DNNs. Attention weights are also context-dependent in their computational role, not just their output, and each value in the weight matrix affects each token in a dense model. Completely fair.

But in a DNN, while the same directions in weight space are reused for multiple features, the weight values themselves are determined entirely by training, not by what the model encounters at inference.

This weights-vs-genome argument is very murky. In a very general sense, the actions of an autoregressive model depend on three things: the context/environment, the weights, and the hyperparameters.

So while there is definite superposition, it is best to think of the genome as a kind of hyperparameter, and to shift the brunt of superposition onto that, while making the analogy that the weights are more local-level, non-genomic features that guide action trajectories - think the epigenetic state of the cell. The context is simply the environmental state.

The genome is different in that the physical structure of the regulatory relationship - which enhancer contacts which promoter, which transcription factors bind - is determined at runtime by cellular context, not fixed at "training time" (i.e. by evolution).

Thus, a satisfying end here is that pretraining optimizes weights directly while evolution optimizes hyperparameters. Some sort of pluripotency and superposition exists in both cases, but it is higher-order with evolution.
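One way to make the weights-vs-hyperparameters split concrete is a nested-loop sketch: an outer loop ("evolution") that only ever selects over a hyperparameter, and an inner loop ("lifetime learning") that runs SGD on weights within each candidate's lifetime. All names and numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def lifetime_learning(lr, env_w, steps=50):
    """Inner loop: SGD on the weights within a single 'lifetime'."""
    w = np.zeros(2)
    for _ in range(steps):
        x = rng.normal(size=2)
        err = w @ x - env_w @ x       # regression against this environment
        w -= lr * err * x             # plain SGD on the weights
    return -np.sum((w - env_w) ** 2)  # fitness: how well the lifetime went

def evolution(generations=25, pop=20):
    """Outer loop: selection over the learning rate (a hyperparameter),
    never touching the weights directly."""
    lr = 0.02
    for _ in range(generations):
        candidates = np.clip(lr + 0.01 * rng.normal(size=pop), 1e-4, 0.5)
        env_w = rng.normal(size=2)    # each generation sees a fresh environment
        fitness = [lifetime_learning(c, env_w) for c in candidates]
        lr = candidates[int(np.argmax(fitness))]
    return lr

lr_evolved = evolution()
print(lr_evolved)  # what "evolved" is a learning rate, not a set of weights
```

Note what survives across generations: only the hyperparameter. The weights are re-learned from scratch in every lifetime, which is exactly the asymmetry the post is pointing at.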


2. Evolution is definitely not doing vanilla SGD on the genome either. AdamW? Probably, yeah.

Okay, let's say you agree that evolution is parametrising the hyperparams (the genome). In this setup, the genome is now the parameter/weight set, and the hyperparameters are things like mutation rate, mutation density, environmental fitness functions, etc.

Sourbut (2022)[3] shows formally that under simplifying assumptions (fixed fitness function, radially symmetric mutation density, infinitesimal mutation limit), natural selection is mathematically equivalent to stochastic gradient descent. The expected update direction aligns with the fitness gradient obtained from the environment, and the step size scales monotonically with gradient magnitude.
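The flavour of Sourbut's result can be illustrated with a toy under its assumptions: a fixed quadratic fitness function, radially symmetric Gaussian mutations, and selection of the fittest. The selected steps then align, on average, with the analytic fitness gradient. (This is my sketch of the result's flavour, not the paper's construction.)

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(g):
    return -np.sum(g ** 2)            # fixed fitness, optimum at the origin

g = np.full(5, 3.0)                   # parent "genome"
alignments = []
for _ in range(200):
    # Radially symmetric mutations; keep the fittest of parent + mutants.
    pool = np.vstack([g, g + 0.05 * rng.normal(size=(50, 5))])
    best = pool[int(np.argmax([fitness(m) for m in pool]))]
    grad = -2 * g                     # analytic gradient of the fitness
    alignments.append((best - g) @ grad)
    g = best

print(fitness(g) > fitness(np.full(5, 3.0)))  # True: fitness improved
print(np.mean(alignments) > 0)                # True: selected steps move uphill
```

No gradient is ever computed by the "selection" itself; the uphill movement is an emergent property of symmetric mutation plus survival of the fittest, which is the equivalence the paper formalises.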

But the key assumption for unbiased gradient descent is radial symmetry of the mutation probability density - mutations equally likely in all directions in genome space. In an astonishingly beautiful and elegant synthesis of Monroe et al. and Martincorena et al., Keijser (2022)[4] shows this assumption does not hold empirically. Mutation rates vary approximately 20-fold across the E. coli and Arabidopsis genomes. True to naive intuition, essential genes - those whose function is highly conserved across species and environments - mutate significantly slower than genes with environmentally-dependent function.

The way to think about this is that some meta-optimisation is going on at the organism level, modulating the inherent mutational characteristics of cells - the data supports decreased mutation rates in essential genes. Independent of that paper, I think there could be an alternative, non-organism-level evolutionary explanation too: some path-dependency of evolution, in the sense that you can't easily mutate these critical genome subsets because the path to an alternative follows an extremely unlikely trajectory of multiple mutations. These are as good as permanent and can probably be thought of as warmly initialised or even manually-engineered weights.

The non-radial mutation density is itself shaped by selection: the genome implements lower effective step sizes on the dimensions that have been reliably optimal across some time, and higher effective step sizes where continued search is worthwhile.

If one squints a bit, this is structurally analogous to adaptive and momentum-like gradient methods in machine learning. Algorithms like Adam maintain per-parameter learning rates scaled by some recent history of gradient variance - lower rates on directions that have been consistently informative, higher rates where the signal is noisy. Evolution appears to have arrived at something similar, operating over timescales rather than training steps.
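The Adam side of this analogy is easy to make concrete. Below is the standard Adam update rule (with bias correction), applied to two illustrative gradient streams: a parameter with a clean, consistent gradient takes steps at roughly the full learning rate, while one with the same average gradient but high variance gets its steps shrunk by the second-moment scaling.

```python
import numpy as np

def adam_steps(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Per-parameter step sizes Adam takes for a gradient sequence
    (one row of `grads` per optimisation step)."""
    m = np.zeros(grads.shape[1])      # first-moment EMA
    v = np.zeros(grads.shape[1])      # second-moment EMA
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)     # bias correction
        v_hat = v / (1 - b2 ** t)
        steps.append(lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(steps)

rng = np.random.default_rng(0)
# Param 0: consistent gradient of +1. Param 1: mean gradient of +1, but noisy.
grads = np.stack([np.ones(500), 1 + 5 * rng.normal(size=500)], axis=1)
steps = adam_steps(grads)

mean_step = np.abs(steps[-100:]).mean(axis=0)
print(mean_step)  # param 0 moves at ~lr; the noisy param 1 moves much slower
```

The mapping being gestured at: effective step size per dimension is a learned quantity, maintained from the history of that dimension's signal - in Adam via the moment EMAs, in evolution (per Keijser's synthesis) via locally modulated mutation rates.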

I think the two explanations are hard to pull apart. If it's just path-dependency, you'd expect the pattern of low mutation rates to track how central a gene is in the interaction network, i.e. how many other genes depend on it. If it's active meta-optimisation, you'd expect the pattern to shift as environments change over long timescales, since what counts as "essential" would change too. Most likely both are happening at once.


3. Why not encode more in DNA, or less?

If the genome encodes architecture and learning rules rather than behaviors directly, a natural question arises - why not encode more in DNA, or less?

At the genomic level, there are hard constraints on write speed. And plasticity is directly at odds with preserving genetic information.

The answer probably has to do with what can be stably encoded at a given timescale or at a given level, subject to all the constraints (for example, genomes have the simple constraint of preserving genetic information losslessly).

Perhaps each memory medium or learning pattern (genomic, cellular, behavioral, etc.) can only stably encode what varies more slowly than it updates.

DNA, updating on the timescale of generations, can stably encode whatever has been consistently adaptive across millions of years: basic cellular machinery, gross neural architecture, the identity of neuromodulatory systems. What it cannot reliably encode are synaptic connectivity patterns that are adaptive only for a particular individual in a particular environment for a particular skill being learnt.

Let's say you're trying to learn math: synaptic plasticity probably handles some of the connectivity patterns, while the actual content you learn is handled by learning processes of the brain one layer less general than synaptic plasticity. Culture and language - operating on century-to-millennium timescales - handle what varies too quickly for biological learning but is too important, in a survival-optimising sense, for individual learning alone to capture. Each layer exists because there is variance at that timescale that the layers below cannot efficiently absorb.
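The claim that each medium can only stably encode what varies more slowly than it updates has a standard signal-processing toy behind it: a learner with a given update rate tracks drift slower than its timescale well (it averages out noise) and fails on drift faster than it. A sketch, with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000

def track(target, lr):
    """Exponential-moving-average learner with update rate `lr`;
    returns mean squared error tracking a noisily observed target."""
    est, errs = 0.0, []
    for t in target:
        obs = t + rng.normal()        # noisy observation of the target
        est += lr * (obs - est)
        errs.append((est - t) ** 2)
    return float(np.mean(errs))

slow_world = np.sin(np.arange(T) / 2000.0)  # drifts slowly ("generations")
fast_world = np.sin(np.arange(T) / 5.0)     # drifts quickly ("a lifetime")

slow_on_slow = track(slow_world, 0.01)      # slow learner, slow world
fast_on_slow = track(slow_world, 0.5)
slow_on_fast = track(fast_world, 0.01)
fast_on_fast = track(fast_world, 0.5)

# A slow update rate wins when the world changes slowly (noise averages
# out); it cannot keep up when the world changes faster than it updates.
print(slow_on_slow < fast_on_slow)  # True
print(fast_on_fast < slow_on_fast)  # True
```

Stacking learners at different update rates - genome, synapse, individual learning, culture - then reads as assigning each band of environmental variance to the cheapest layer that can still track it.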

The corollary is that the number of layers in the hierarchy is determined by the variance structure of the environment - how many distinct clusters of timescales exist at which the world is approximately stationary. This is not an architectural choice but a consequence of what the environment is like.

How does this tie into pretraining? I think pretraining is half oblivious to all of this. You could argue that, in theory, a "cognitive core" model as suggested by Karpathy and others (one that learns no factual info, just general cognitive capabilities) sits one level above the model itself being called for general tasks. And pretraining obviously has little regard for the general phenomenon of encoding higher-level patterns that modulate lower-level learning sometime after your priors are initialised. But I think in-context learning is important here. My view is that the objective is such that the model needs to find ways to maximally extract transformations from the given tokens to best predict the next token. This is definitely some form of higher-order learning. It does not tie in cleanly at all, but it is something.


4. The data efficiency question

A separate claim, worth distinguishing from the above, is that pretraining on internet-scale text is analogous to evolution in terms of what it produces - a broadly capable prior over tasks - even if the mechanisms differ.

Karpathy on Dwarkesh described pretraining as "crappy evolution": a practically available method for arriving at some kind of minimum viable prior with built-in knowledge and structure, analogous to what evolution provides organisms.

One thing worth noting is that if we treat pretraining as playing a role similar to evolution's, it accomplishes it with a staggering difference in speed. Natural selection needed ~3.5 billion years to produce the human genome. By contrast, an LLM develops wide-ranging language and reasoning abilities in a matter of weeks of wall-clock training time. Part of this gap comes down to how much more precisely gradient descent can optimize against a clear, stable ground-truth target, compared to the trial-and-error grind of evolution operating on shifting, environment-dependent fitness criteria. But another piece of the puzzle is that the training data itself - human-generated text - is already the output of minds that evolution spent so long building. Pretraining has a natural advantage here, and it grabs it.

If this entire framing is right, i.e. evolution encodes a prior over learning algorithms, not a prior over behaviors, then the actually interesting question is not how to make gradient descent more biologically plausible, but what the brain's actual learning algorithm is, and what assumptions about environmental structure are baked into it. This goes back to our point about each layer learning what is more variant than the level below.

I want to end where I started: the analogy between evolution and pretraining is shallow, holding only at a first-order level of utility. The interesting biology, and likely the interesting machine learning, is for sure in the higher-order levels.


Thanks to the authors of the pieces by Sourbut (2022) and Keijser (2022), which this post draws on directly.