Language Models Are Few-Shot Learners, They Just Can't Remember

In 2020, OpenAI published a paper with a bold title: Language Models are Few-Shot Learners. GPT-3 could be shown a handful of examples in its prompt and perform tasks it was never trained on. Translation, arithmetic, code generation, all from a few demonstrations at inference time. It felt like a breakthrough.

It was. But the title is misleading.

What GPT-3 demonstrated was few-shot prompting, not few-shot learning. Show the model three examples of sentiment classification, and it classifies the fourth correctly. Close the chat window, and it has learned nothing. The "learning" lived in the context window, and the context window is a whiteboard that gets erased after every conversation.

Six years later, this is still true. Every personalisation feature in today's LLMs, from ChatGPT's memory to system prompts to RAG pipelines, is a workaround for the same limitation: the model cannot learn from experience. It can only be reminded of it.

Here's a concrete example. Ask any frontier coding model to estimate how long a complex refactor will take. It will come back with multi-week timelines. The models don't know that, because they exist, those timelines have been slashed to hours. You can say this in the prompt, and the model will nod along, but the next estimate reverts to the old priors. The world changed fundamentally after their knowledge cutoff, and context alone doesn't seem to durably rewrite what's baked into the weights.

What would it take to make the title true? The pieces already exist. They just haven't been connected.

Attention is (almost) gradient descent

Von Oswald et al. (2023) showed by construction that a single linear self-attention layer can implement a computation equivalent to one step of gradient descent on a regression loss. Mahankali et al. showed this is provably optimal for one-layer linear transformers.

The intuition is simple. Standard learning updates weights via a gradient step: w_new = w_old - α · ∇L(w_old, data). In self-attention, KV cache entries from in-context examples create a temporary delta on the model's output, steering behaviour toward the demonstrated pattern. For linear attention, this delta is mathematically what you'd get from one gradient step on those examples.
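A minimal numerical sketch of that equivalence, under stated assumptions: scalar-target linear regression, weights initialised to zero, squared-error loss averaged over the in-context examples, and an unnormalised linear attention layer (no softmax) whose keys are the example inputs and whose values are the example targets. The function names and toy data are mine, for illustration only; they are not taken from either paper.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict_after_one_gd_step(xs, ys, x_query, lr):
    """Start from w = 0, take one gradient step on the mean squared error
    L(w) = 1/(2N) * sum_i (w.x_i - y_i)^2, then predict for x_query."""
    n, d = len(xs), len(x_query)
    w = [0.0] * d
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        residual = dot(w, x) - y
        for j in range(d):
            grad[j] += residual * x[j] / n
    w_new = [w[j] - lr * grad[j] for j in range(d)]
    return dot(w_new, x_query)


def predict_with_linear_attention(xs, ys, x_query, lr):
    """Unnormalised linear attention: score each example by key.query,
    weight its value (the target y_i) by that score, and sum."""
    n = len(xs)
    return (lr / n) * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Toy in-context examples (assumed data).
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
ys = [1.0, -2.0, 0.5]
x_query = [2.0, 1.0]

gd = predict_after_one_gd_step(xs, ys, x_query, lr=0.1)
attn = predict_with_linear_attention(xs, ys, x_query, lr=0.1)
print(abs(gd - attn) < 1e-9)
```

The two functions agree because, starting from zero weights, the gradient step is just a learning-rate-scaled sum of y_i · x_i, and dotting that sum with the query is exactly what the attention layer computes.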

The KV cache is a transient weight update. Every transformer already has the machinery for learning. It's just trapped in volatile memory.

The equivalence is exact for linear attention on regression tasks. Real transformers are messier. But the mental model holds, and it reframes the problem. The question isn't "how do we teach LLMs to learn?" They already know how. The question is "how do we make the learning stick?"

Context vs weights

The mainstream answer is: make context windows longer. Opus 4.7 has a 1M-token context window. KV cache compression keeps improving. Linear attention variants are making long context computationally cheap. The implicit assumption: if context gets long enough and cheap enough, you don't need weight updates.

This assumption is incomplete in a way that matters. I wrote about it in detail in Context Is Software, Weights Are Hardware. The short version: both KV cache and weights modulate activations to steer model output, but they're not equally powerful. Context is software running on frozen hardware. Weight modification redesigns the hardware itself. Even within what context can express, weights win on efficiency (O(1) vs O(n) at inference), compression (a LoRA adapter vs millions of context tokens), and composability (each weight update builds on the last, while context additions all flow through the same frozen computation).
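To make the compression point concrete, here is a back-of-the-envelope sketch comparing the storage of a KV cache with that of a LoRA adapter. Every dimension below (32 layers, 8 KV heads, head size 128, hidden size 4096, four adapted matrices per layer, 16-bit values) is an assumption chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """One key vector and one value vector per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def lora_adapter_bytes(layers, matrices_per_layer, d_in, d_out, rank,
                       bytes_per_value=2):
    """Each adapted matrix stores A (rank x d_in) and B (d_out x rank)."""
    return layers * matrices_per_layer * rank * (d_in + d_out) * bytes_per_value


# Hypothetical model dimensions (assumptions, see lead-in).
kv = kv_cache_bytes(tokens=1_000_000, layers=32, kv_heads=8, head_dim=128)
lora = lora_adapter_bytes(layers=32, matrices_per_layer=4,
                          d_in=4096, d_out=4096, rank=8)

print(f"KV cache for 1M tokens: {kv / 1e9:.1f} GB")
print(f"Rank-8 LoRA adapter:    {lora / 1e6:.1f} MB")
```

Under these assumed dimensions the cache is roughly 131 GB and the adapter roughly 17 MB. This compares storage only; it says nothing about how much of that context a rank-8 adapter can faithfully retain.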

But the deeper problem isn't speed. It's that weights are entangled. Billions of parameters jointly encode everything the model knows, all overlapping in the same high-dimensional space. Change one capability and you risk quietly corrupting another. This is why production systems are deliberately designed not to learn online. It's not a missing feature. It's a safety constraint.

What's needed is a way to make weight updates cheap, fast, and interference-safe: inject new knowledge without degrading existing competence.

Injecting knowledge into weights

In February 2026, Sakana AI published "Doc-to-LoRA." It is, in my view, a proof of concept for something much bigger than the paper claims.

A 309-million parameter hypernetwork (Perceiver architecture) takes a document as input and outputs LoRA adapter weights for a target LLM:

Feed the document through a frozen LLM to extract per-layer activations
The hypernetwork reads those activations via cross-attention
It outputs rank-8 LoRA matrices for each target layer
Merge the LoRA matrices into the target LLM

The whole pipeline runs in under one second. After injection, the model answers questions about the document without the document in its context window. The knowledge lives in the weights.
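The final merge step in the list above can be sketched in a few lines. This is the standard LoRA merge (W' = W + scale · B·A), not code from the Doc-to-LoRA paper: the hypernetwork that produces the matrices is not modelled here, the scale factor is an assumption, and the toy uses rank 1 on a 2x2 matrix so the arithmetic is easy to check by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, x):
    """Multiply a matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def merge_lora(weight, lora_b, lora_a, scale=1.0):
    """Fold a low-rank update into the base weight: W + scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy values (assumed): a 2x2 base weight and a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]      # shape (d_out, rank)
lora_a = [[3.0, 4.0]]        # shape (rank, d_in)
x = [1.0, 1.0]

merged = merge_lora(weight, lora_b, lora_a)
merged_out = matvec(merged, x)

# Same result as keeping the adapter as a side path: W x + B (A x).
side_out = [base + adapter
            for base, adapter in zip(matvec(weight, x),
                                     matvec(lora_b, matvec(lora_a, x)))]
print(merged_out == side_out)
```

Once merged, the adapter adds no inference cost: the model runs with a single weight matrix per layer, which is the sense in which the knowledge "lives in the weights".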

The numbers: near-perfect needle-in-haystack retrieval at 4x the model's native context length, 83.5% of full-context QA performance with sub-second update time (vs 40+ seconds for standard distillation), and under 50 MB of constant memory regardless of document length.

But the numbers aren't the point. What Doc-to-LoRA proves is more fundamental: there exists a learnable function f(context) → Δweights that preserves the information in context. A neural network can learn to map context into weight modifications. This is the Optimal Brain Surgeon (Hassibi & Stork, 1993) made constructive. OBS used second-order derivatives to find which weights to remove with minimal damage. Doc-to-LoRA's hypernetwork has learned the inverse: which weights to modify, and how, to add knowledge with minimal disruption. A learned surgeon that builds instead of cuts.

Two questions remain. Does the injection break what's already there? And who decides what to inject?

Why safe injection might be possible

The catastrophic forgetting objection is the obvious one. You're modifying weights. How do you know you're not corrupting existing knowledge?

Here's why there's reason for optimism.

Consider what in-context learning already does. Within a single conversation, you can give the model few-shot examples for sentiment analysis, then ask it to write code, then ask it to translate French. The KV cache entries from the sentiment examples don't destroy the model's ability to code or translate. ICL modulates activations for the target task without catastrophically interfering with other capabilities. Same forward passes. Same context window.

This isn't accidental. During pretraining on trillions of tokens, the model processes sequences where wildly diverse tasks and patterns coexist. The weights evolved to handle KV cache entries that steer activations in non-interfering directions. The model didn't just learn to do ICL. It learned to do ICL safely, steering activations for one task without corrupting the pathways for others.

Now connect this to the Von Oswald result. The formal equivalence between ICL and gradient descent is proven for linear attention; real transformers are more complex. But the key observation doesn't depend on the exact mathematics: ICL steers activations for diverse tasks without catastrophic interference. Whatever the mechanism, KV cache entries act as transient weight deltas that stay in safe directions. Which means: safe modification directions exist in the parameter space. They're already being used every time ICL works. They're just temporary.

A hypernetwork trained to inject knowledge while preserving performance across a diverse capability set could learn to find those same safe directions for permanent modifications. To inject without breaking, the hypernetwork must learn something about how knowledge is organised inside the transformer. It can't produce non-interfering modifications by accident.
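For intuition about what a "safe direction" means, here is a toy with a single linear layer. It uses null-space projection, a technique from the continual-learning literature that I am swapping in as the simplest possible illustration; it is not how Doc-to-LoRA works. The inputs, the update vectors, and the function names are all assumptions. In a real nonlinear transformer no such closed form is available, which is exactly why the safe directions would have to be learned.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(v, protected):
    """Remove from v its component along the protected input direction."""
    coeff = dot(v, protected) / dot(protected, protected)
    return [vi - coeff * pi for vi, pi in zip(v, protected)]


def rank_one_update(weight, u, v):
    """Return W + u v^T for a weight matrix given as a list of rows."""
    return [[w + ui * vj for w, vj in zip(row, v)]
            for row, ui in zip(weight, u)]


def apply(weight, x):
    """Apply the linear layer to an input vector."""
    return [dot(row, x) for row in weight]


weight = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
x_old = [1.0, 1.0, 0.0]   # input whose output must be preserved
x_new = [0.0, 1.0, 2.0]   # input the update is meant to change
u = [1.0, -1.0]           # arbitrary output-side direction for the update

v_safe = project_out(x_new, x_old)            # orthogonal to x_old
updated = rank_one_update(weight, u, v_safe)

print(apply(weight, x_old), apply(updated, x_old))   # unchanged on x_old
print(apply(weight, x_new), apply(updated, x_new))   # changed on x_new
```

Because v_safe is orthogonal to x_old, the update contributes nothing when the layer sees x_old, yet still changes the output for x_new.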

The hard problem: sequential composition

But ICL's safety has a bound. The modulations are temporary and contained within one context window. They don't compound.

Permanent weight modifications are different. Each update changes the parameter landscape. The safe subspace for update #1 might not be safe after update #500 has shifted the geometry. Each modification changes the terrain the next modification operates on.

This is the hard open problem. It isn't speed: Doc-to-LoRA solved that. It isn't the decision policy: RL has the right structure (more on this next). The hard problem is the guarantee that a model which has updated itself a thousand times still behaves coherently, and that new knowledge doesn't silently corrupt old competence.

Doc-to-LoRA shows one injection works. Nobody has shown a thousand do.

Learning what to keep and what to discard

The second question from the Doc-to-LoRA section: who decides what to inject?

Mem-α, Memory-R1 (2025), and Neural Garbage Collection (Li et al., 2026) all use RL to train models to manage their own memory: what to store, what to discard, optimised for downstream performance. NGC frames it sharply: "if it can learn to reason, why can't it learn to forget?" It achieves 2-3x KV cache compression while maintaining accuracy, with the eviction policy learned entirely from task rewards.
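The mechanical half of such a policy is simple; the hard part is learning the scores. A minimal sketch, assuming each cache entry already carries a retention score: keep the highest-scoring entries within a budget and preserve their original order. In the systems above the scores would come from a policy trained on task reward; here they are plain numbers, and the function name is hypothetical rather than taken from any of those papers.

```python
def evict_to_budget(entries, scores, budget):
    """Keep the `budget` highest-scoring entries, in their original order."""
    if len(entries) != len(scores):
        raise ValueError("entries and scores must have the same length")
    budget = max(budget, 0)
    if budget >= len(entries):
        return list(entries)
    # Rank positions by score, best first, then restore sequence order.
    ranked = sorted(range(len(entries)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [entries[i] for i in keep]


# Toy cache of four entries with assumed retention scores.
cache = ["user prefers tabs", "greeting", "repo uses pytest", "deadline is Friday"]
scores = [0.9, 0.1, 0.5, 0.7]

print(evict_to_budget(cache, scores, budget=2))
```

Halving the cache here corresponds to 2x compression; whether accuracy survives depends entirely on how good the scores are.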

All three target token-space. But the fundamental result is deeper than the specific medium: neural networks can learn, via RL, to evaluate incoming information and decide what's worth keeping. The retention policy is learnable. The target storage medium is an implementation detail.

Redirect this at weight-space. The RL policy decides what is worth committing to weights. The hypernetwork handles how to commit it safely.

Now consider what this combination produces. The hypernetwork has learned something about how knowledge is structured in the model's weights; it must have, that's how it injects without interference. The RL policy has learned which information helps future tasks. Together, the system doesn't just filter incoming information. It has a model of what it knows and what would improve it.

Once the RL policy is good enough, something shifts. The system stops being a passive filter. It can recognise what it doesn't know, because the hypernetwork maps the model's knowledge structure and the RL policy evaluates what's missing against future task demands. It starts signalling what information would be most valuable, and in agentic settings, actively requesting it through tool calls and actions. This follows directly from the RL objective: the policy is rewarded when stored knowledge helps downstream tasks, so it learns to pursue useful knowledge proactively, not just filter what arrives. The system goes from reactive (information arrives, decide keep or discard) to proactive (I have gaps, find what fills them, inject).

The hypernetwork gets absorbed

The hypernetwork doesn't need to stay external.

Stage 1 (past): Manual fine-tuning. A human decides what to train on. An engineer runs gradient descent offline. The model is a passive recipient.

Stage 2 (present): Learned injection. Doc-to-LoRA. A hypernetwork learns how to inject context into weights. Fast and automatic, but the what-to-inject decision is still external, and the hypernetwork is a separate module.

Stage 3 (future): Self-directed learning. Irie et al. (ICML 2022) demonstrated self-referential weight matrices that modify themselves at runtime. Toy-scale so far, but the theoretical machinery exists. If the hypernetwork is absorbed into the model's forward pass, the model becomes self-referential: the injection mechanism lives inside the same parameters it modifies. The model knows what it knows, can evaluate what's worth learning, and can update itself.

The whole system collapses into one loop. The knowledge-aware injection mechanism, the RL retention policy, the active information-seeking behaviour, all unified in a single self-directed learner.

State: the model's current weights plus information in context. Action: what to consolidate, which layers, what rank. Reward: future task performance. The temporal credit assignment is hard. The reward for remembering something today might not arrive for weeks. But this is exactly what RL is built for, and the reward signal is clean: downstream accuracy improves, or it doesn't.
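The action space and the credit-assignment problem can be written down in a few lines. This is a sketch of the framing only: the field names are hypothetical, and the discount factor is an assumption used to show how a reward that arrives late is worth less to the decision that earned it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ConsolidationAction:
    """One decision by the policy (all field names are illustrative)."""
    content_id: str                 # what to consolidate
    target_layers: Tuple[int, ...]  # which layers receive the update
    rank: int                       # LoRA rank of the update


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t, computed backwards for stability."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


action = ConsolidationAction(content_id="doc-17", target_layers=(10, 11), rank=8)

# The payoff for this action arrives 20 steps later (assumed timeline).
delayed = [0.0] * 20 + [1.0]
print(action)
print(discounted_return(delayed, gamma=0.99))
```

With a discount of 0.99 the delayed reward is worth 0.99^20 of its face value at decision time, and the policy still has to work out which of the intervening actions deserved the credit.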

What this gets us

If this works, personalisation becomes real: not prepended preferences, but knowledge woven into parameters. Knowledge stays current without retraining. Context windows become working memory, not the whole story.

But the most interesting consequence is specialisation without infrastructure. Today, improving a model in a domain means building verifiers, curating tasks, defining reward functions, and running a training job. This works for math (ground-truth answers) and code (test suites). It completely fails for long-tail expertise: the conventions of a specific codebase, the nuances of a specific industry, how a specific user thinks. You cannot build an RL environment for everything, so those things never make it into weights. A model that learns from deployment doesn't need that infrastructure. Failures become training signal. Successful strategies consolidate into weights. The RL environment emerges from the model's interaction with the world.

The open problems are real. Sequential interference dominates: there is no guarantee yet that a thousand updates compose without degradation (see "The hard problem: sequential composition"). Temporal credit assignment in the RL loop is hard, because the reward for remembering something today might not materialise for weeks. Weight-space memories are opaque, unlike token-space memories, which are human-readable. And the compute cost of online weight updates during serving is nontrivial.

But unlike six years ago, we can point at each component and say: this piece works. ICL as a learning mechanism, proven. Learned injection, proven. Safe modulation directions evidenced by ICL's non-interference. Learnable retention policies, proven. Self-modifying networks, proven at small scale. The assembly is the engineering challenge. History suggests it won't stay unsolved for long: attention, LoRA, and RLHF each went from paper to production in under three years.

GPT-3's title was six years early. The few-shot learners are coming. They just need to learn what's worth remembering.