If you’ve used ChatGPT or Claude for more than a few sessions, you’ve probably noticed something frustrating: they forget. Not everything – these systems now have “memory” features that store facts about you. But they don’t really learn you. They don’t internalize how you think, how you make decisions, what kinds of explanations click for you. Every conversation still feels a bit like starting over.
That’s not a bug in the product. It’s a limitation of the architecture. Today’s language models are essentially stateless – they see whatever fits in the current context window, plus maybe some stored notes, and that’s it. They’re incredibly capable pattern matchers, but they don’t have real long-term memory the way we humans do.
I’ve long believed this is one of the key gaps that needs to close for AI to get meaningfully better. Not just bigger models, not just more data, but architectural changes that let models actually remember and learn over time. So when Google published their work on Titans last year, and then made a recent blog post on it + the MIRAS framework, I got excited. This is the kind of change I’ve been waiting for.
The problem Titans is solving
The dominant architecture in AI right now – the Transformer – uses something called “attention” to process text. Attention is fantastic for understanding relationships within whatever the model is currently looking at. But it has a hard limit: the context window. That’s the maximum amount of text the model can see at once.

Context windows have gotten bigger (some models like Gemini now handle millions of tokens), but there are two problems. First, cost scales quadratically – twice the context means four times the compute (so it becomes expensive, slower, and needs a lot more electricity and water). Second, even when information is technically inside the window, models still struggle to use it reliably. Research from Chroma calls this “context rot” – performance degrades as you stuff more into the window, even on simple tasks.
The practical upshot: what the model sees is all it can use. This is why “context engineering” has become such a big deal. Carefully curating what goes into that limited window is now a core skill for building AI applications.
Titans takes a different approach. Instead of trying to cram more into the context window, it adds a separate long-term memory system that works alongside attention.
How Titans works
The core insight is treating attention as what it actually is: short-term memory. Attention is precise but local and expensive to scale. Titans keeps attention for that job, but adds a neural long-term memory module — not a fixed vector or matrix like previous approaches, but an actual deep network (a multi-layer perceptron) that learns to compress historical context into its parameters and retrieve it later.
This is a significant change from existing recurrent architectures. Models like Mamba and other state-space models compress everything into a fixed-size state (I’m going deep on these and will cover them in another post). The problem is obvious: you can’t losslessly compress arbitrarily long sequences into a fixed-size vector. Something has to give. Titans sidesteps this by making the memory module itself learnable and deep, giving it the expressive power to capture non-linear relationships in historical data rather than forcing everything through a linear bottleneck.
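To make that concrete, here’s a minimal sketch (mine, not from the paper) of what a deep memory module could look like in PyTorch: a small MLP whose weights *are* the memory. Reading from it is just a forward pass; writing to it, as we’ll see below, is a gradient step. The sizes and layer count are illustrative.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """A toy long-term memory: a small MLP whose weights store the past.

    Reading is a forward pass; writing is a gradient step on those weights.
    Sizes here are illustrative, not the paper's configuration.
    """
    def __init__(self, dim: int = 256, hidden: int = 512, depth: int = 2):
        super().__init__()
        layers, d_in = [], dim
        for _ in range(depth):
            layers += [nn.Linear(d_in, hidden), nn.SiLU()]
            d_in = hidden
        layers.append(nn.Linear(d_in, dim))
        self.net = nn.Sequential(*layers)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # "Retrieval" is just inference through the memory network.
        return self.net(query)
```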

A clever part is how Titans decides what to remember. The paper introduced a “surprise metric” that directly parallels how human memory works: unexpected events are more memorable. In Titans, surprise is measured by the gradient each input induces on the memory network. Large gradient = the input diverges significantly from what the memory currently encodes = high surprise = write to memory. Small gradient = expected, predictable = mostly ignore.
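In code, you can read the momentary surprise for an input straight off the gradient it induces on the memory’s weights. A hedged sketch, reusing the `NeuralMemory` above and an L2 key/value reconstruction loss in the spirit of the paper’s associative-memory objective (the exact key/value projections are my assumption):

```python
import torch
import torch.nn.functional as F

def momentary_surprise(memory: NeuralMemory, key: torch.Tensor, value: torch.Tensor):
    """Gradients induced by one (key, value) pair on the memory's weights.

    A large gradient norm means the memory reconstructs this input poorly,
    i.e. the input is "surprising" and worth writing down.
    """
    loss = F.mse_loss(memory(key), value)  # how badly does memory recall value from key?
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    surprise_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    return grads, surprise_norm.item()
```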

But the authors found that pure momentary surprise isn’t enough. If you only respond to the current token’s gradient, you miss important information that comes right after a surprising event. Think about how human memory works: a car crash is surprising, but you also remember the five seconds before and after, even if those moments weren’t individually surprising. Titans handles this with a momentum term that captures “past surprise” – recent context that should be retained because it’s connected to something important, even if the individual tokens aren’t surprising on their own.
The update rule combines three components: momentary surprise (the current gradient), past surprise (momentum from recent tokens), and a forgetting mechanism (weight decay) that prevents memory overflow. This forgetting gate is data-dependent – the model learns when to clear old information based on context, not just time. That means it won’t (or at least shouldn’t) overindex on any one thing. When you see a topic change, the gate can flush stale memories. When you’re deep in a continuous thread, it retains more.
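Put together, my reading of the update rule looks roughly like the sketch below: a momentum buffer accumulates surprise, and a forgetting gate decays old memory before each write. I’ve used fixed scalars for the gates; in Titans they are data-dependent and learned per token.

```python
import torch

def update_memory(memory, grads, momentum, eta=0.9, theta=0.1, alpha=0.01):
    """One Titans-style write, roughly:
        S_t = eta * S_{t-1} - theta * grad      (past + momentary surprise)
        M_t = (1 - alpha) * M_{t-1} + S_t       (forget, then write)

    eta, theta and alpha are fixed here; in the paper they are learned,
    data-dependent gates.
    """
    with torch.no_grad():
        for p, g, s in zip(memory.parameters(), grads, momentum):
            s.mul_(eta).add_(g, alpha=-theta)   # accumulate surprise with momentum
            p.mul_(1.0 - alpha).add_(s)         # decay old memory, then write the update
    return momentum

# Momentum buffers start at zero, one per memory parameter:
# momentum = [torch.zeros_like(p) for p in memory.parameters()]
```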
Architecturally, Titans comes in three variants that combine short-term memory (attention), long-term memory (the neural module), and persistent memory (fixed parameters that encode task knowledge). The best-performing variant, MAC (Memory as Context), retrieves from long-term memory before attention runs, letting the attention mechanism decide whether historical context is relevant to the current query. This gives you the best of both worlds: precise local reasoning from attention, plus access to compressed historical context that would otherwise be lost.
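Here’s how I picture the MAC flow, heavily simplified: retrieve from long-term memory, prepend that (plus the persistent tokens) to the current segment, and let attention decide what’s relevant. The segmentation and projection details are my simplifications, not the paper’s exact wiring.

```python
import torch
import torch.nn as nn

class MACBlock(nn.Module):
    """Simplified Memory-as-Context block: attention over
    [persistent tokens | retrieved long-term memory | current segment]."""
    def __init__(self, dim: int = 256, n_heads: int = 4, n_persistent: int = 16):
        super().__init__()
        self.persistent = nn.Parameter(torch.randn(n_persistent, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.memory = NeuralMemory(dim)  # the long-term module from the earlier sketch

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, seq_len, dim) – the chunk attention is currently processing
        b = segment.shape[0]
        retrieved = self.memory(segment)                        # query long-term memory first
        persistent = self.persistent.unsqueeze(0).expand(b, -1, -1)
        context = torch.cat([persistent, retrieved, segment], dim=1)
        out, _ = self.attn(segment, context, context)           # attention decides what matters
        return out
```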
The results validate the design. On BABILong, a benchmark requiring reasoning across facts distributed in extremely long documents, Titans outperformed GPT-4 and a RAG-augmented Llama-3.1-70B, despite the latter having roughly 70× more parameters (noting this research is about a year old, so a lot has changed!).
It scales to context windows beyond 2 million tokens with higher accuracy than baselines on needle-in-haystack retrieval. And critically, it does this while maintaining fast, parallelizable training – the memory updates can be computed with matrix operations, not sequential steps.
How memory works in products today
To understand why this matters, it helps to look at how “memory” currently works in consumer AI products like ChatGPT.
There are basically three layers we have at our disposal to “give AI memory”. First, the context window – whatever fits in the current conversation, sometimes with summarization to compress older messages. Effective, but stateless and sometimes expensive. Second, persistent memory – a profile the system builds about you (preferences, facts you ask it to remember) that it can reference across sessions. Third, retrieval systems like RAG – external documents that get searched and injected into the prompt when relevant.
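Stitched together at request time, those three layers look roughly like this (the names and structure are mine, not any vendor’s actual implementation):

```python
def build_prompt(user_message, conversation, profile, retriever, max_history=20):
    """Assemble a prompt from the three 'memory' layers products use today:
    recent conversation (context window), a stored profile (persistent memory),
    and retrieved documents (RAG). Purely illustrative."""
    retrieved = retriever(user_message)          # e.g. top-k chunks from a vector store
    recent = conversation[-max_history:]         # crude stand-in for summarization
    return "\n\n".join([
        "Known about this user:\n" + profile,
        "Relevant documents:\n" + "\n".join(retrieved),
        "Conversation so far:\n" + "\n".join(recent),
        "User: " + user_message,
    ])
```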
I don’t know exactly how ChatGPT’s memory works under the hood (edit, now I do, thanks to Simon), but if I had to guess, it’s writing important user information to text files, choosing specific things to memorize by embedding, and pulling summarized conversations if helpful. That’s roughly how we do it in our products at Automattic, and it scales well enough.
Claude Code’s docs basically detail exactly how its “memory” is implemented in the product. Pretty interesting.
But these approaches are “flat.” Persistent memory is just structured text the model can reference – not a deep representation of your behavior (and even if it were, the model couldn’t use it that way). RAG is stateless: retrieved passages are context for this one answer, not something the model internalizes over time. Cross-session patterns – how (and, more importantly, why) you actually think and decide across months – are barely captured.
RAG vs learned memory
It’s worth being precise about how Titans-style memory differs from RAG, because on the surface they solve similar problems.
RAG stores documents and retrieves text chunks. You embed your corpus, query it with the user’s input, and inject the most semantically relevant passages into the context window. The model then reasons over those passages alongside the user’s question. It works well, it’s widely deployed, and it’s easy to understand: you’re giving the model relevant reference material.
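Stripped to its core, the retrieval loop is just embedding plus cosine similarity – here with a hypothetical `embed()` function standing in for whatever embedding model you use:

```python
import numpy as np

def retrieve(query: str, corpus: list[str], embed, k: int = 3) -> list[str]:
    """Minimal RAG retrieval: embed everything, rank by cosine similarity,
    return the top-k passages. `embed` is assumed to map text -> 1-D vector."""
    q = embed(query)
    doc_vecs = np.stack([embed(doc) for doc in corpus])
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [corpus[i] for i in top]

# The key property: same query + same corpus -> same passages, every time.
# Nothing in the system changes as a result of the retrieval.
```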
<aside> It was also AI’s first bubble. Some of these companies raised hundreds of millions of dollars but were then effectively replaced by pgvector. Now many databases support native vector lookups, which made WPVDB possible. </aside>
But RAG is basically a lookup table with fuzzy matching. The same query against the same corpus returns the same passages. The model doesn’t learn anything from the retrieval – it just gets additional context for this one request. There’s no accumulation of understanding over time.

Titans memory is something different. The memory module is a neural network whose weights update during inference. When the model “retrieves” from memory, it’s doing a forward pass through that network – the output is a compressed representation of patterns the network has learned, not verbatim text. You can’t point to a specific document that’s “stored” in memory, because the storage is distributed across weights. The memory encodes relationships and patterns, not retrievable documents.
This has tradeoffs. With RAG, you can inspect what’s in the database, add or remove specific documents, and explain exactly why a particular passage was retrieved. This is especially important for enterprise deployments, which typically require audit trails and logs.
With Titans-style memory, the “memories” are opaque – distributed across network parameters, not individually addressable. You can save and load the memory state, reset it, or scope it per-user, but you can’t do granular “add this fact, delete that one” operations the way you can with a document store. Kinda like a brain :).
What you gain is something RAG can’t provide: genuine learning. A Titans memory module that processes a year of your interactions doesn’t just store retrievable facts. Instead, it develops a compressed model of your behavior. The patterns are implicit in the weights, accessible through inference, and shaped by the surprise metric’s selective attention to what matters.
This isn’t self-learning in the fullest sense – the base model weights are frozen, and the memory module’s updates are guided by the surprise metric rather than an open-ended objective. But it’s a meaningful step toward systems that accumulate understanding rather than just accessing stored text.
What real long-term memory unlocks
Once you have a neural memory that learns during use – not a static profile, but an actual network that updates its weights based on what surprises it – the possibilities shift dramatically.
Think about what that surprise metric actually captures. It’s not just “is this fact new?” It’s “does this input diverge from the patterns I’ve learned?” That means a Titans-style memory doesn’t just store that you prefer email over Slack. It learns the distribution of your communication patterns – when you use which channel, how your tone shifts by context, what kinds of requests you escalate versus handle yourself. The memory could encode behavioral patterns, not just facts.
In the past, to get more personalized content generation for WordPress, I experimented with small-scale LoRAs (using LoRAX) that were continually trained on organically created content from individual authors. This is similar in a way, but probably even more effective.
The momentum term matters here too. Human expertise isn’t a collection of isolated facts – it’s connected knowledge, where one insight leads to another. By retaining context around surprising events, Titans can capture these connections: not just “the deployment failed” but the chain of decisions and signals leading up to it. Over time, perhaps the model builds something closer to intuition than a lookup table.
At the organizational level, this could be transformative. Imagine a system trained on years of incident reports, postmortems, support tickets, sales calls, and internal policy documents. Not just searchable, actually internalized. The memory module would encode patterns like “when this customer segment complains about latency, the root cause is usually X” or “proposals structured this way get approved faster.” It’s the kind of institutional knowledge that currently lives only in the heads of long-tenured employees.
The forgetting mechanism is equally important for enterprise use. Without it, memory fills up and old information interferes with new. But Titans’ adaptive forgetting means you can deploy a system that naturally prioritizes recent, relevant context while gracefully deprecating stale information. And because the forgetting gate is explicit and data-dependent, you can audit what’s being retained, reset memory for specific tenants, or scope it per user – all of which are critical for regulated environments.
The test-time learning aspect deserves emphasis. The memory module’s weights actually update during inference. This isn’t fine-tuning or retraining. It’s the model learning from each interaction as it happens, using the surprise metric to decide what’s worth encoding. For products, this means an assistant that genuinely improves with use, without any backend training pipeline. The learning is automatic, continuous, and scoped to the individual user or organization.
Where this gets used
Drawing this together, I see a few clear commercialization paths.
First, persistent personal AI – assistants that grow an internal model of how you think and work, applicable across tools and time. Not a profile string, but a learned representation of your decision-making patterns. A Titans-style memory module is one of the few ways to make this scalable without exploding context costs.
Second, enterprise copilots with organizational memory – systems that sit on logs, tickets, and metrics, and actually improve as the company encounters incidents and edge cases. Not retrained weekly, but continuously learning under policy. The roughly 70× parameter efficiency over a RAG-augmented Llama-3.1-70B on long-context reasoning suggests this could be economically viable even for mid-size organizations.
Third, memory state management as infrastructure – not “memory as an API” in the way you’d query a database, but tooling for saving, loading, resetting, and scoping memory states. Checkpoint a user’s memory before a risky interaction. Maintain separate memory states per tenant. Merge organizational memory with individual memory at inference time. The clean separation between frozen backbone and live memory module makes this architecturally tractable, even if individual memories aren’t inspectable the way documents are.
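Because the memory is just a module’s parameters, the tooling can be as mundane as checkpointing state dicts per tenant – a sketch under those assumptions (the paths and scoping policy here are made up):

```python
import torch

def save_memory(memory: NeuralMemory, tenant_id: str) -> None:
    # Checkpoint one tenant's memory, e.g. before a risky interaction.
    torch.save(memory.state_dict(), f"memory/{tenant_id}.pt")

def load_memory(memory: NeuralMemory, tenant_id: str) -> None:
    # Restore (or roll back) a tenant's memory state.
    memory.load_state_dict(torch.load(f"memory/{tenant_id}.pt"))

def reset_memory(memory: NeuralMemory) -> None:
    # Re-initialize the memory module, wiping what it has learned.
    for layer in memory.net:
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()
```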
None of this replaces what already works. RAG stays for external knowledge where you need explainability and granular control. Simple profile memory stays for user-visible preferences. Long-term neural memory slots in as the piece that learns patterns over time; something genuinely new, not just a faster database.
Much of the future will combine past research
This pattern of combining existing research into novel architectures isn’t hypothetical – we’re already seeing it work elsewhere. Personally, I’ve been running many experiments that splice together pieces of different architectures. These are at extremely small scales, but they demonstrate stability!
DeepSeek’s recent models are a masterclass in architectural recombination. Their V3 and R1 models layer together Multi-Head Latent Attention (which slashes the KV cache while actually improving performance over standard attention), Mixture of Experts with auxiliary-loss-free load balancing, and multi-token prediction for denser training signals.

A new, better soup with the same ingredients.
For reinforcement learning, they introduced GRPO (Group Relative Policy Optimization) in their DeepSeekMath work – a variant of PPO that eliminates the separate critic model by computing advantages relative to a group of sampled responses, cutting RL compute requirements roughly in half.
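The core trick is small enough to show: sample a group of responses to one prompt, score them, and use each response’s reward relative to its group as the advantage – no learned critic needed. A sketch of just that step:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and std of its own group (rewards shape: [group_size])."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 8 responses to one prompt, scored by a reward model:
# adv = group_relative_advantages(torch.tensor([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.3]))
```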
None of these ideas are entirely new; what’s new is the specific combination and the willingness to rethink which components are actually necessary. The result is a frontier model trained for a fraction of typical costs.
Titans represents a similar opportunity: take the Transformer stack that works, add a principled long-term memory module, and unlock capabilities that neither component achieves alone.
The last few years in AI have been dominated by a simple playbook: make models bigger, train on more data, expand the context window. That approach still works, but it’s hitting real constraints – cost, energy, diminishing returns.

The next wave of progress will come from smarter architectures, not just more scale.
Titans is exactly that: a way for models to remember and adapt at test time without throwing away the Transformer stack that already works. The surprise metric is elegant, the momentum term solves a real problem, and the benchmark results are hard to argue with.
I’m hoping it shows up in products soon. The teams that figure out how to use this ingredient will build meaningfully better AI.