Working Memory: Good Enough for a Single Session
When an agent runs on a single task — say, extracting data from one company’s financial spreadsheet — that’s called a rollout or a single session. During that session, the agent maintains what we’d call working memory: information stored in files that helps it keep track of what it’s doing.
This is genuinely useful. Working memory helps agents achieve short-term learning within a single rollout. It also solves a practical problem: by offloading session data into files, the agent can selectively pull in only the relevant bits (using tools like grep) rather than cramming everything into the LLM's context window. Less noise, better performance.
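The offloading pattern above can be sketched in a few lines. Everything here is illustrative: the `WorkingMemory` class, its `note`/`recall` methods, and the spreadsheet notes are hypothetical stand-ins, but the mechanism is the one described: append facts to a file, then grep back only the lines that match.

```python
import re
import tempfile
from pathlib import Path

class WorkingMemory:
    """Hypothetical sketch of file-backed working memory for one session."""

    def __init__(self, path: Path):
        self.path = path
        self.path.write_text("")  # fresh file: working memory resets each session

    def note(self, line: str) -> None:
        # Offload a fact to disk instead of keeping it in the context window.
        with self.path.open("a") as f:
            f.write(line + "\n")

    def recall(self, pattern: str) -> list[str]:
        # Pull in only the relevant lines (a grep-style selective read),
        # rather than loading the whole file into context.
        regex = re.compile(pattern, re.IGNORECASE)
        return [l for l in self.path.read_text().splitlines() if regex.search(l)]

# Usage within a single rollout:
mem = WorkingMemory(Path(tempfile.mkstemp()[1]))
mem.note("Q3 revenue cell found at Sheet2!B14")
mem.note("Currency is EUR, not USD")
mem.note("TODO: verify amortization row")
relevant = mem.recall("revenue")  # only the matching note enters the context
```

The point of `recall` is the selectivity: the agent pays context-window tokens only for lines that match the query, not for the whole session log.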
But here’s the catch — working memory resets between sessions. If you run that same financial analysis agent on a different company’s spreadsheet, it starts from scratch. No accumulated knowledge. No improvement over time.
Working memory doesn’t learn.
Long-Term Memory: The Real Challenge
Long-term memory is memory that persists across multiple rollouts. Think of an agent that analyzes financial data for dozens of companies over weeks or months. Ideally, it should get better at this over time — recognizing patterns, avoiding past mistakes, building on prior insights.
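To make the contrast with working memory concrete, here is a minimal sketch of persistence across rollouts, assuming a file-based store. The `run_rollout` function, the JSONL format, and the example insights are all hypothetical; the essential difference from working memory is simply that the file is appended to, never reset.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical persistent store; a real agent would choose its own location.
MEMORY_FILE = Path(tempfile.gettempdir()) / "long_term_memory.jsonl"
MEMORY_FILE.unlink(missing_ok=True)  # start clean for this demo only

def run_rollout(company: str, insight: str) -> list[dict]:
    # Load everything learned in previous rollouts...
    prior = ([json.loads(line) for line in MEMORY_FILE.read_text().splitlines()]
             if MEMORY_FILE.exists() else [])
    # ...then persist this rollout's insight for future ones.
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps({"company": company, "insight": insight}) + "\n")
    return prior

run_rollout("AcmeCo", "Fiscal year ends in March")         # first rollout: no prior memory
prior = run_rollout("BetaCorp", "Reports revenue in EUR")  # second rollout sees AcmeCo's note
```

Note that nothing here makes the agent *better* over time; the file just grows. That unbounded growth is exactly the problem discussed next.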
There are broadly two approaches to long-term memory: files controlled by the LLM agent itself, and bespoke memory representations with task-specific evolution and retrieval systems.
The file-based approach is the most common today. It’s also deeply flawed.
Why File-Based Long-Term Memory Breaks Down
The fundamental problem is scale. Memory grows with each rollout. Over time, and across diverse tasks, the accumulated memory inevitably becomes large. And a large memory makes retrieval both inefficient and ineffective: the LLM spends time and tokens searching, and what it pulls back is too noisy to construct useful context from.
Two common solutions get proposed here. Neither works.
RAG (Retrieval-Augmented Generation)
The idea is simple: store memories in a vector database and retrieve semantically similar content when needed. In practice, RAG struggles with precision.
Consider a concrete example: you want to extract all the products a user likes from prior conversations. RAG will happily return every conversation that mentions user preferences for products — including the ones where the user said they disliked something. You get high recall, low precision, and a pile of irrelevant information.
Worse, RAG doesn’t return information in the structured form you actually need. The burden of making sense of RAG’s messy output falls on the LLM at inference time, adding cost and latency while reducing reliability.
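This failure mode is easy to reproduce even without a real embedding model. The sketch below stands in for vector retrieval with a toy bag-of-words cosine similarity; that substitution is an assumption (production RAG uses learned embeddings), but the like/dislike confusion it demonstrates is the same, because "like" and "dislike" statements are near-identical in surface form and semantics.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Toy bag-of-words cosine, standing in for embedding similarity.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical stored conversation snippets:
memories = [
    "User said they like the espresso machine",
    "User said they dislike the blender",
    "User asked about shipping times",
]

query = Counter("products the user likes".lower().split())
scored = sorted(
    memories,
    key=lambda m: cosine(Counter(m.lower().split()), query),
    reverse=True,
)
# In this toy run the "dislike" memory actually outranks the "like" one:
# similarity-based retrieval cannot separate the two preference polarities.
top_2 = scored[:2]
```

A top-2 retrieval here returns both preference statements, liked and disliked alike: high recall, low precision, and the LLM is left to sort out the mess at inference time.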
Agentic Search Over File Hierarchies
The alternative is to let the agent organize its own memory into a hierarchy of files and then search through that hierarchy when it needs something. This sounds elegant but has serious problems.
First, the agent is now responsible for architectural decisions about how to organize information — decisions that are high-level, long-term, and largely irreversible. It’s not clear that agents can make good choices here. These aren’t the kind of step-by-step reasoning tasks that LLMs excel at.
Second, retrieval becomes an iterative process of sifting through an ad-hoc file structure, which is painfully slow and expensive. For time-sensitive agentic workflows, this is a non-starter.
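A rough sketch of what that iteration looks like. The directory layout, the `agentic_search` function, and the substring heuristic standing in for the LLM's per-hop decision are all hypothetical; the thing to notice is that every descent into the hierarchy corresponds to a full LLM round trip in a real agent.

```python
import tempfile
from pathlib import Path
from typing import Optional

def agentic_search(root: Path, query: str, max_hops: int = 5) -> Optional[str]:
    # Each hop corresponds to an LLM round trip in a real agent:
    # list a directory, ask the model where to look next, descend, repeat.
    current = root
    for _ in range(max_hops):
        entries = sorted(current.iterdir())
        # Stub for the LLM's judgment call: naive filename matching.
        hit = next((e for e in entries if e.is_file() and query in e.name), None)
        if hit is not None:
            return hit.read_text()
        subdirs = [e for e in entries if e.is_dir()]
        if not subdirs:
            return None
        current = subdirs[0]  # descend: another round trip, more latency
    return None

# Usage: even a tiny ad-hoc hierarchy costs three round trips per query.
root = Path(tempfile.mkdtemp()) / "memory"
(root / "companies" / "acme").mkdir(parents=True)
(root / "companies" / "acme" / "revenue_notes.txt").write_text("Q3 revenue: 12M EUR")
found = agentic_search(root, "revenue")
```

With a toy three-level tree the cost is already three sequential model calls for one lookup; a memory that has grown over weeks of diverse rollouts makes each query proportionally slower and more expensive.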
The Shared Failure
Both approaches — RAG and agentic search — share a fundamental flaw: they make architectural and retrieval decisions that are not designed for the specific nature of the task at hand. They’re general-purpose solutions applied to problems that demand specificity. And critically, neither approach leads to genuine learning — neither enables a model’s performance on a given task to actually improve over time.