Every time you start a new conversation with an LLM, you start from zero.
The model has no idea who you are. It doesn’t know what you’ve built, what you care about, or what you’ve already tried. You can spend months talking to it, and every Monday morning, you have to reintroduce yourself from scratch.
This is the stateless AI problem. And it’s not a limitation of the model’s intelligence. It’s a limitation of how conversations are structured. The model only sees what’s in its context window: the text in front of it right now. Nothing before. Nothing after.
For one-shot tasks, statelessness is fine. But for anything requiring real personalization, like building a product that knows its users, a coding assistant that remembers your codebase, or a mentor that tracks your progress, statelessness is a product killer.
The field has responded with over a dozen distinct approaches to solving this. Most teams pick one and hope for the best. This article maps the ten that matter most: what each system actually does, when it wins, when it loses, and which one you should be building with right now.
The Mental Model Before the Systems
Before comparing frameworks, you need to understand the spectrum. Memory for AI systems ranges from ephemeral to permanent.
Every memory system is making a bet about where on this spectrum to focus. Some systems are optimized for retrieving facts from long conversations. Others are designed to compile knowledge over weeks. Others focus on tracking how facts change over time. The choice of system is really a choice about what kind of memory problem you have.
There are also two fundamentally different problems hidden under the label “AI memory”:
- Personalization: remembering who the user is, what they prefer, what they’ve told you
- Institutional knowledge: accumulating domain expertise, operational patterns, learned workflows
Most systems solve one of these well. Few solve both.
The Primitives: What All Systems Are Built From
Before the frameworks, the vocabulary.
Vector embeddings convert text into lists of numbers that encode meaning. Semantically similar text produces geometrically similar vectors, so you can search by meaning rather than keywords. The key metric: use cosine similarity (which compares the angle between vectors and ignores their length), not raw L2/Euclidean distance (which is also sensitive to vector magnitude). On unit-normalized embeddings the two produce the same ranking, so this only bites when your vectors aren’t normalized; when they aren’t, switching from L2 to cosine is often the most impactful single fix for a broken retrieval system. Takes 15 minutes.
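To make the angle-versus-magnitude distinction concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; unaffected by vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Straight-line distance between a and b; grows with differences in length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vectors, not real embeddings.
query = [1.0, 2.0, 3.0]
same_direction = [10.0, 20.0, 30.0]     # same direction as the query, 10x the length
different_direction = [3.0, 2.0, 1.0]   # different direction, same length as the query

cosine_ranking = sorted(
    ["same_direction", "different_direction"],
    key=lambda name: cosine_similarity(query, globals()[name]),
    reverse=True,  # higher similarity first
)
l2_ranking = sorted(
    ["same_direction", "different_direction"],
    key=lambda name: l2_distance(query, globals()[name]),  # smaller distance first
)
```

Cosine ranks `same_direction` first because it points the same way as the query, while L2 ranks `different_direction` first simply because its length is closer to the query's.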
BM25 is keyword search that still works. It scores documents with a TF-IDF-style formula (term frequency × inverse document frequency, plus term-frequency saturation and document-length normalization), which makes it great for exact matches like usernames, IDs, or specific technical terms that semantic search misses. Hybrid retrieval combines both signals, and it’s not optional for production systems:
final_score = α × cosine_similarity + β × bm25_normalized
RAG (Retrieval-Augmented Generation) is the standard pattern: embed a query → find relevant chunks → inject them into the LLM prompt → generate a grounded response. Its quality ceiling is retrieval quality. Garbage retrieval, garbage response.
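The hybrid final_score formula above can be sketched as follows. This assumes you already have a cosine score and a raw BM25 score per document, and it min-max normalizes BM25 so the two signals sit on a comparable scale. The weights (α = 0.7, β = 0.3), the normalization choice, and the function names are illustrative assumptions, not values from any particular library.

```python
def min_max_normalize(scores):
    """Rescale raw scores to [0, 1]; if all scores are equal, return zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(doc_ids, cosine_scores, bm25_scores, alpha=0.7, beta=0.3):
    """Return (doc_id, final_score) pairs, best first.

    final_score = alpha * cosine_similarity + beta * bm25_normalized
    alpha and beta are assumed defaults; tune them on your own data.
    """
    bm25_normalized = min_max_normalize(bm25_scores)
    scored = [
        (doc_id, alpha * cos + beta * bm25)
        for doc_id, cos, bm25 in zip(doc_ids, cosine_scores, bm25_normalized)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Document "b" is slightly less similar semantically than "a",
# but it contains the exact keyword, so BM25 lifts it to the top.
ranked = hybrid_rank(
    doc_ids=["a", "b", "c"],
    cosine_scores=[0.80, 0.78, 0.30],
    bm25_scores=[0.0, 7.5, 2.0],
)
```

The design choice worth noting is the normalization step: raw BM25 scores are unbounded, so adding them directly to a cosine score would let the keyword signal swamp the semantic one.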
Chunking is why you can’t just embed a 100-page document. Too small (50 chars) loses context. Too large (5000 chars) dilutes the embedding. For atomic facts: 150–400 chars. For procedural workflows: 2000–4000 chars.
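A minimal character-window chunker, as a sketch. The default of 300 characters falls inside the atomic-fact range mentioned above; the overlap between neighboring chunks is an added assumption (a common way to avoid cutting a fact in half), and the function name is hypothetical. Production chunkers usually split on sentence or section boundaries rather than raw character offsets.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into windows of chunk_size characters.

    Each window shares `overlap` characters with the previous one.
    Both defaults are illustrative assumptions.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

For procedural workflows you would call it with a larger window, for example `chunk_text(doc, chunk_size=3000, overlap=200)`.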
With those in hand, here are the 10 systems that actually matter.
The 10 Systems
1. mem0: The Fastest Path to Production
GitHub: ~48,000 stars | License: Apache 2.0 | Funding: $24M raised Oct 2025
mem0 is the most widely deployed semantic memory layer as of mid-2026. Its core value proposition: pip install mem0ai, five lines of code, and you have working memory.
The architecture runs LLM-based fact extraction on every conversation turn, stores extracted facts in a vector database (Qdrant, FAISS, Pinecone, or ChromaDB), and retrieves with a 4-signal hybrid stack: semantic similarity + BM25 keyword matching + entity linking boost + temporal recency.
The April 2026 redesign removed the DELETE operation. When a user says “I moved from Mexico City to Dubai” and the system already knows “User lives in Mexico City,” both facts are stored. Conflicts are resolved at read-time via temporal recency, with the newer fact surfacing higher. This ADD-only approach cut extraction cost by 60–70% (one LLM call instead of three) and eliminated permanent data loss from wrong DELETE decisions.
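The ADD-only idea can be illustrated with a toy store. This is a sketch of the general pattern, not mem0's implementation: the class and method names are hypothetical, Jaccard token overlap stands in for embedding similarity, and multiplying similarity by an exponential recency decay (with a 180-day half-life) is an assumed scoring rule.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Fact:
    text: str
    created_at: datetime

class AddOnlyStore:
    """Toy ADD-only memory: facts are appended, never updated or deleted."""

    def __init__(self):
        self.facts = []

    def add(self, text, created_at):
        # Contradicting facts simply coexist in the store.
        self.facts.append(Fact(text, created_at))

    def search(self, query, now, half_life_days=180.0):
        """Rank facts by similarity * recency, so newer facts surface first."""
        query_tokens = set(query.lower().split())
        ranked = []
        for fact in self.facts:
            fact_tokens = set(fact.text.lower().split())
            union = query_tokens | fact_tokens
            # Stand-in for embedding similarity: Jaccard token overlap.
            similarity = len(query_tokens & fact_tokens) / len(union) if union else 0.0
            age_days = (now - fact.created_at).total_seconds() / 86400
            recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
            ranked.append((similarity * recency, fact))
        ranked.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in ranked]
```

Both the Mexico City fact and the Dubai fact stay in `facts`; the conflict is only resolved in the ordering that `search` returns.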
What it’s great at: Any SaaS product where multiple users need personalized memory. user_id isolation is built in. The free tier handles 10K memories. If you need memory and you need it now, this is the answer.
What it misses: Graph features require $249/month Pro. The flat vector store means “I like pizza” and “I love pizza” are stored as separate facts. No temporal validity on facts, so you know the newer fact ranks higher, but the old one never truly expires. And if your memory problem is graph-shaped (entities, relationships, how they change over time), mem0 is the wrong layer.
2. Letta (formerly MemGPT): The Self-Editing Agent
GitHub: ~21,000 stars | License: Apache 2.0 | Funding: $10M seed
Letta’s insight is drawn from operating systems. Your computer creates the illusion that programs have unlimited RAM by transparently paging data to disk and retrieving it when needed. Letta does the same for LLMs, managing what stays in context (“fast memory”) and what goes to external storage (“disk”).
The architecture has three tiers:
Core Memory: a compact block always present in the system prompt. Contains identity facts, preferences, current context. The agent can directly rewrite this in real time via tool calls:
core_memory_replace(
    label="human",
    old_text="User lives in Mexico City",
    new_text="User moved to Dubai, June 2026"
)
The moment the user corrects something, it’s updated. No lag. No batch job. This is what separates Letta from everything else: the agent actively manages its own understanding.
Recall Memory: the conversation history. Searchable. When the context window fills, oldest messages are compressed via auto-summarization, keeping conversations effectively indefinite.
Archival Memory: unlimited vector storage (SQLite for dev, Postgres + pgvector for production). The agent calls archival_memory_insert() to save things worth keeping long-term, and archival_memory_search() to retrieve them.
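The three tiers can be sketched as a small class. This is a toy model of the idea, not Letta's implementation: the method names mirror the tool names mentioned above, but the class, the message-count limit (standing in for a token budget), the concatenating summarizer (standing in for an LLM call), and the substring search (standing in for vector search) are all assumptions made for illustration.

```python
def naive_summarize(previous_summary, evicted_messages):
    # Stand-in for an LLM summarization call: just concatenates text.
    return " ".join([previous_summary, *evicted_messages]).strip()

class TieredContext:
    def __init__(self, summarize=naive_summarize, max_messages=4):
        self.core = {"human": "", "persona": ""}  # always in the prompt
        self.summary = ""                          # compressed older history
        self.messages = []                         # recent turns, verbatim
        self.archival = []                         # long-term store
        self.summarize = summarize
        self.max_messages = max_messages

    def core_memory_replace(self, label, old_text, new_text):
        if old_text not in self.core[label]:
            raise ValueError(f"{old_text!r} not found in core block {label!r}")
        self.core[label] = self.core[label].replace(old_text, new_text)

    def append_message(self, message):
        self.messages.append(message)
        overflow = len(self.messages) - self.max_messages
        if overflow > 0:
            # Oldest messages leave the window and are folded into the summary.
            evicted = self.messages[:overflow]
            self.messages = self.messages[overflow:]
            self.summary = self.summarize(self.summary, evicted)

    def archival_memory_insert(self, text):
        self.archival.append(text)

    def archival_memory_search(self, query):
        # Substring match stands in for vector similarity search.
        return [t for t in self.archival if query.lower() in t.lower()]
```

The point of the split is that only `core`, `summary`, and `messages` need to fit in the prompt; `archival` can grow without bound because it is only read through search.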
What it’s great at: Stateful agents where the core value is that the AI genuinely updates its understanding as it learns new things. The real-time self-edit is the gold standard for conversational memory.
What it misses: It’s a framework, not a library. You adopt Letta’s architecture, not just a memory module. A hallucinating agent can corrupt its own memory. Quality of self-editing depends entirely on LLM quality. No built-in contradiction detection.
3. Graphiti / Zep: Temporal Knowledge Graphs
GitHub: ~24,000 stars (Graphiti) | License: Apache 2.0 | Paper: arXiv 2501.13956
Graphiti is Zep’s open-source temporal knowledge graph engine (Zep Community Edition was deprecated in 2025; use Graphiti directly). It’s the right answer to a question that most systems get wrong: facts change over time, and you need to know when a fact was true, not just whether it is.
The key innovation: every fact in the graph has explicit validity timestamps.
Entity: Alan
  lives_in → Mexico City | valid: 2025-01-01 → 2026-06-01
  lives_in → Dubai | valid: 2026-06-01 → present
When you query “where does Alan live?” you get Dubai. The old fact isn’t deleted; it’s expired. Query “where did Alan live in early 2025?” and you get Mexico City, correctly. This is temporal validity as a first-class feature, not a retrieval score.
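The validity-window idea can be sketched in a few lines. This is not Graphiti's API: the `Edge` and `TemporalGraph` names are hypothetical, and the sketch assumes each subject/relation pair holds one value at a time, so asserting a new fact expires the previous one.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still true"

class TemporalGraph:
    def __init__(self):
        self.edges = []

    def assert_fact(self, subject, relation, obj, valid_from):
        # Expire the currently valid edge instead of deleting it.
        for edge in self.edges:
            if (edge.subject == subject and edge.relation == relation
                    and edge.valid_to is None):
                edge.valid_to = valid_from
        self.edges.append(Edge(subject, relation, obj, valid_from))

    def query(self, subject, relation, as_of):
        """Return the value that was valid on the date `as_of`, or None."""
        for edge in self.edges:
            if (edge.subject == subject and edge.relation == relation
                    and edge.valid_from <= as_of
                    and (edge.valid_to is None or as_of < edge.valid_to)):
                return edge.obj
        return None

graph = TemporalGraph()
graph.assert_fact("Alan", "lives_in", "Mexico City", date(2025, 1, 1))
graph.assert_fact("Alan", "lives_in", "Dubai", date(2026, 6, 1))
```

Querying with `as_of=date(2025, 2, 15)` returns Mexico City and a date after the move returns Dubai, while both edges remain in the graph.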
The architecture layers three retrieval mechanisms: semantic embeddings for similarity, BM25 for keyword matching, and graph traversal for relational queries. Combined, this hits sub-200ms retrieval latency. Benchmarks: 94.8% on DMR, and up to 18.5% accuracy improvement over the baseline implementations reported in the paper on LongMemEval enterprise tasks.
What it’s great at: Any system where facts evolve over time, and where relationships between entities matter: a mentor tracking a founder’s evolving relationships, a CRM-style AI that needs to know how a customer’s situation changed, or any long-lived agentic system.
What it misses: More complex to set up than mem0. Entity resolution is an open problem (“Alan the founder” and “Alan the writer” are different; the system needs to figure that out). The free tier caps at 1K credits.
4. Cognee: Graph-First Memory
GitHub: ~12,000 stars | License: Apache 2.0 | Funding: €7.5M
Cognee takes a different bet: rather than retrieving the most similar text chunks, it builds a queryable knowledge graph over everything you’ve ingested, and uses vector search as a hint for graph traversal rather than as the retrieval mechanism itself.
The architecture runs three linked storage systems simultaneously:
- Graph store (entities, relationships): Kuzu by default, Neo4j/Memgraph/FalkorDB supported
- Vector store (semantic embeddings): LanceDB by default, Qdrant/pgvector supported
- Relational store (chunks, provenance): SQLite by default, PostgreSQL supported
The cognify pipeline runs six stages on every document: classify → verify → chunk → LLM entity/relationship extraction → summarize → embed. The result: every graph node has a corresponding embedding, so you can move between semantic similarity and relational traversal without losing coherence.
The default retrieval mode is GRAPH_COMPLETION: vector search finds candidate entities, then the graph is traversed to build structured multi-hop context before generation. Complex queries like "how has Company X's relationship with their market position changed over the past year?" become tractable.
Cognee runs fully locally (no external DB required by default). It ingests from 30+ data sources including text, images, and audio.
What it’s great at: Institutional knowledge problems: document corpora, codebases, competitive intelligence, domain expertise accumulation. If your memory problem is shaped like a graph, this is built for it.
What it misses: Python-only (no TypeScript SDK as of mid-2026). Smaller community than mem0 or Letta. Slower on pure personalization (user facts, conversational memory), because it’s optimized for structured knowledge, not conversational episodics.
5. Hindsight: The New Benchmark Leader
GitHub: ~4,000 stars (growing fast) | Pricing: Free self-hosted
Hindsight is the newest system on this list and the one with the highest verified benchmark: 94.6% on LongMemEval as of mid-2026. It does something others don’t: it runs four retrieval strategies in parallel, then reranks with a cross-encoder.
Incoming query
↓ simultaneously:
├── Semantic search (vector similarity)
├── Keyword search (BM25)
├── Graph traversal (entity connections)
└── Temporal search (recency + validity)
↓
Cross-encoder reranking across all results
↓
`reflect`: LLM synthesis across top results into coherent context
↓
Response generation
The reflect feature is the key differentiator: instead of injecting raw retrieved chunks into the prompt, Hindsight synthesizes them first. The LLM sees a coherent, deduplicated summary of what's known, not a pile of potentially contradictory fragments.
The cost: reflect adds 100–600ms latency versus 10–50ms for vector-only systems. Whether that matters depends on your use case.
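The fan-out-then-rerank shape of that pipeline can be sketched with asyncio. Everything here is a stand-in rather than Hindsight's code: the four strategy functions are stubs returning fixed ids, the memory store is a hypothetical dict, and a token-overlap score takes the place of a cross-encoder. The reflect synthesis step is omitted because it needs an LLM call.

```python
import asyncio

# Hypothetical memory store: id -> text.
MEMORIES = {
    "m1": "user moved to dubai in june 2026",
    "m2": "user is building edtech for mexico",
    "m3": "user prefers terse answers",
}

# Stub strategies. A real system would query a vector index, a BM25 index,
# a graph, and a temporal index here.
async def semantic_search(query):
    return ["m1", "m2"]

async def keyword_search(query):
    return ["m1"]

async def graph_search(query):
    return ["m3", "m1"]

async def temporal_search(query):
    return ["m2"]

STRATEGIES = [semantic_search, keyword_search, graph_search, temporal_search]

async def gather_candidates(query, strategies):
    # Run every strategy concurrently; gather() keeps results in strategy order.
    result_lists = await asyncio.gather(*(strategy(query) for strategy in strategies))
    seen, candidates = set(), []
    for results in result_lists:
        for memory_id in results:
            if memory_id not in seen:  # deduplicate across strategies
                seen.add(memory_id)
                candidates.append(memory_id)
    return candidates

def token_overlap(query, text):
    # Stand-in for a cross-encoder, which would score the (query, text) pair jointly.
    return len(set(query.lower().split()) & set(text.lower().split()))

def rerank(query, candidates, score_pair=token_overlap):
    return sorted(candidates, key=lambda mid: score_pair(query, MEMORIES[mid]), reverse=True)
```

Because the strategies run concurrently, the retrieval stage costs roughly as much as the slowest strategy rather than the sum of all four; the reranking and synthesis steps are where the extra latency comes from.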
Hindsight is also MCP-first: it is designed to plug natively into the Model Context Protocol (MCP), the open standard introduced by Anthropic, which makes it a strong fit for anyone building on top of Anthropic’s tooling.
What it’s great at: High-accuracy retrieval where latency tolerance exists. Anyone building with Claude/MCP who wants the best benchmark numbers without a heavy framework commitment.
What it misses: Newest project with a smaller ecosystem and fewer documented production deployments. reflect latency is real.
6. HippoRAG: Associative Memory
Paper: arXiv 2405.14831 (v1), 2501.14247 (v2) | Lab: OSU NLP Group (Ohio State University)
HippoRAG is the most biologically grounded system in this list. The hippocampus is your brain’s indexer for episodic memory; it doesn’t store memories, it builds an associative index that lets you traverse from one memory to adjacent ones. Seeing an old classroom triggers a teacher’s face, which triggers a lesson, which triggers a conversation with your dad about it. None of those connections were explicitly stored. They emerged from associative indexing.
HippoRAG applies this to RAG. At indexing time, an LLM extracts entities and relationships, building a knowledge graph. At retrieval time, instead of returning top-K semantically similar chunks, it runs Personalized PageRank (PPR) from the query entities through the graph, propagating relevance associatively across connected nodes, with no additional LLM calls at retrieval time.
The result: query “pricing” and you get not just memories that mention pricing, but memories about customer discovery, revenue model, and the specific conversation where pricing anxiety first appeared, all of them associatively connected.
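Personalized PageRank itself is a short algorithm. The sketch below runs power iteration over a hand-written toy graph whose node names echo the pricing example; the graph, the damping factor of 0.85, and the iteration count are illustrative assumptions. It ranks graph nodes only, whereas a full system would then map node scores back to the stored passages.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration Personalized PageRank.

    graph: dict mapping node -> list of neighbor nodes (every neighbor must be a key).
    seeds: set of query nodes (must be keys in graph); the random walk restarts there.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walk jumps back to a seed node.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            neighbors = graph[node]
            if not neighbors:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
                continue
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        rank = nxt
    return rank

# Toy entity graph; edges are listed in both directions.
GRAPH = {
    "pricing": ["revenue_model", "pricing_anxiety"],
    "revenue_model": ["pricing", "customer_discovery"],
    "customer_discovery": ["revenue_model"],
    "pricing_anxiety": ["pricing"],
    "hiring": ["office"],
    "office": ["hiring"],
}

scores = personalized_pagerank(GRAPH, seeds={"pricing"})
```

Here `customer_discovery` receives a non-zero score even though it never touches `pricing` directly, while the disconnected `hiring` and `office` nodes stay at zero. That propagation through intermediate nodes is the associative behavior described above.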
Benchmarks: +20–30% over standard RAG on PopQA multi-hop questions. Consistently dominates on queries requiring more than 4 reasoning hops.
Key advantage over GraphRAG: Incremental indexing. Add a new memory → extract entities → add to graph → done. No $2–7/document re-indexing. Scales naturally for personal memory use cases.
7. A-MEM: Zettelkasten for AI
Paper: arXiv 2502.12110 (Rutgers, Feb 2025) | Code: github.com/agentic-memory/amem
A-MEM applies the Zettelkasten note-taking philosophy to AI memory. Most existing systems append new memories to a store and retrieve the most similar ones later. The store is flat and static; memories don’t affect each other after insertion. A-MEM breaks this pattern.
When a new memory arrives, A-MEM doesn’t just store it. It:
- Finds semantically related existing memories
- Creates explicit links between the new memory and those neighbors
- Contextually refines the linked existing memories, updating them with the new context
Memory 1: "User is building edtech for Mexico"
Memory 2 arrives: "User's target is preschool teachers"
  → Links to Memory 1
  → Memory 1 refined: "User is building edtech for Mexico, targeting preschool teachers"
  → New item stored with link: [002 → links: 001]
Memory 3: "User frustrated by teacher engagement"
  → Links to 002 (teachers) and 001 (product)
  → 002 refined: "Target is preschool teachers (engagement is known challenge)"
The retrieval of any single memory now surfaces its connected context cluster automatically. You ask about one thing and adjacent relevant things emerge naturally, which is how a mentor recalls relevant context the founder didn’t know to ask for.
A-MEM also solves semantic deduplication more elegantly than hash-checking. Instead of storing “I prefer terse answers” and “User dislikes long explanations” as two separate facts, the second refines the first. One richer item instead of two redundant ones.
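The link-and-refine loop can be sketched as a toy store. This is not the A-MEM codebase: the class name is hypothetical, Jaccard token overlap stands in for embedding similarity, and a string append stands in for the LLM call that rewrites a neighboring note. The 0.2 link threshold is also an assumption.

```python
def jaccard(a, b):
    # Stand-in for embedding similarity.
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def append_context(existing_text, new_text):
    # Stand-in for an LLM call that rewrites the old note with the new context.
    return f"{existing_text} (related: {new_text})"

class LinkedMemoryStore:
    def __init__(self, similarity=jaccard, refine=append_context, threshold=0.2):
        self.notes = {}  # id -> {"text": str, "links": set of ids}
        self.similarity = similarity
        self.refine = refine
        self.threshold = threshold
        self._next_id = 1

    def add(self, text):
        new_id = self._next_id
        self._next_id += 1
        # 1. Find related existing notes.
        neighbors = [
            note_id for note_id, note in self.notes.items()
            if self.similarity(text, note["text"]) >= self.threshold
        ]
        # 2. Store the new note with links to its neighbors.
        self.notes[new_id] = {"text": text, "links": set(neighbors)}
        # 3. Link back, and refine each neighbor with the new context.
        for note_id in neighbors:
            self.notes[note_id]["links"].add(new_id)
            self.notes[note_id]["text"] = self.refine(self.notes[note_id]["text"], text)
        return new_id

    def recall(self, note_id):
        """Return the note's text followed by the text of every linked note."""
        note = self.notes[note_id]
        return [note["text"]] + [self.notes[n]["text"] for n in sorted(note["links"])]
```

The key difference from an append-only store is step 3: inserting a note mutates its neighbors, so the store changes shape as it grows.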
Benchmarks: Outperforms SOTA baselines (RAG, MemoryBank, ChatDB) across 6 foundation models on multi-step reasoning and long-horizon conversation tasks.
8. Microsoft GraphRAG: Large Corpus Analysis
GitHub: ~25,000 stars | License: MIT | Paper: arXiv 2404.16130
GraphRAG answers a different question than most systems. Not “what did this user say about X?” but “what are the themes across this entire corpus of documents?”
The indexing phase (done once) extracts entities and relationships from every chunk, runs Leiden community detection to cluster them, and pre-generates LLM summaries of each cluster. When you ask a global question (“What are the main themes in our research corpus?”), GraphRAG reads pre-built cluster summaries and runs a parallel map-reduce synthesis, dramatically faster and more comprehensive than naive RAG on the same corpus.
Benchmarks on 1M+ token corpora: 81.67% overall accuracy vs 57.50% for vector RAG. Especially dominant on thematic aggregation and multi-hop reasoning (>5 entities).
The cost: indexing runs $2–7 per document (at GPT-4 rates). Re-indexing on document changes costs the same as initial indexing. This makes it wrong for personal memory (which changes every conversation) and right for stable, large document corpora (research reports, legal documents, competitive intelligence).
9. Karpathy’s LLM Wiki: Knowledge That Compounds
Source: gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
The least “framework-y” system on this list. Karpathy’s insight: humans abandon wikis because maintenance is tedious. LLMs don’t get bored and can update 15 files in one pass. So: build a wiki where the LLM is the only editor.
Three operations:
- INGEST: Feed a source document → LLM reads it → creates a summary page → updates up to 15 existing pages with cross-references and corrections → logs everything
- QUERY: Ask a question → LLM reads relevant wiki pages → answers → files the answer back as a new wiki page (so synthesized knowledge compounds, not just raw retrieval)
- LINT: Periodic health check. Find contradictions, stale claims, orphan pages, missing cross-references
The LINT operation is what most systems skip. Contradictions accumulate silently everywhere else. Here they get caught proactively.
What it’s great at: Individual knowledge workers accumulating domain expertise over months. Research, competitive intelligence, course notes, book learning. The wiki gets smarter every time something new is ingested, because knowledge is compiled into structured pages rather than re-derived from raw documents every time.
What it misses: Not a library. You implement the pattern yourself. Manual INGEST (you have to feed it documents). No real-time memory capture from conversations. At 1000+ pages, the index.md approach starts breaking down.
10. LangMem: Simple, Free, LangGraph-Tied
GitHub: ~1,300 stars | License: MIT
LangMem is included for completeness: it’s flat key-value + vector search, deeply integrated with LangGraph, free, and MIT-licensed. If you’re already in the LangGraph ecosystem and need basic memory without standing up another service, it works. If you’re not in LangGraph, there’s no reason to pick it over the alternatives above. The value is entirely in the ecosystem integration.
The Benchmark Reality
Here’s what mid-2026 LongMemEval numbers look like across the field:
System | LongMemEval | Notes
Hindsight | 94.6% | Multi-strategy + reflect synthesis
Zep/Graphiti | 94.8% (DMR) | Temporal KG, enterprise tasks
SuperMemory | 81.6% | Closed source
mem0 | 49–94.8% | Varies by benchmark config and tier
Letta | Not standardized | Self-editing architecture, different evaluation
A few honest notes on these numbers:
First, LoCoMo and LongMemEval both test the same thing: can the system retrieve facts from long conversational histories? They don’t test whether the system improves actual task performance over time (the institutional knowledge problem). The benchmarks measure retrieval, not value.
Second, mem0’s numbers vary wildly depending on whether you’re measuring the free tier (vector-only) or Pro (graph-enhanced), and which benchmark config is used. The April 2026 ADD-only redesign dramatically improved scores on some evals.
Third, no standard benchmark exists for multi-hop associative reasoning in personal memory. HippoRAG and A-MEM are likely the winners there, but the benchmarks that would prove it don’t exist yet.
How to Choose
Conversation memory for an existing chatbot?
→ mem0 (fastest, most integrations, best benchmarks for this use case)
Stateful agent that actively manages its own memory?
→ Letta (self-editing core blocks, OS-inspired tiering)
Facts that change over time + relationship graphs?
→ Graphiti (temporal validity, native entity relationships)
Structured knowledge / document corpus / institutional expertise?
→ Cognee (graph-first, multi-modal) or GraphRAG (if corpus is stable + large)
Maximum retrieval accuracy, building with Claude/MCP?
→ Hindsight (94.6% LongMemEval, MCP-native)
Associative multi-hop memory / mentor-style surfacing?
→ HippoRAG (PPR traversal) + A-MEM (linked memory)
Personal knowledge worker who reads a lot?
→ Karpathy Wiki pattern (compiling knowledge > re-deriving it)
LangGraph-native and want simplest possible option?
→ LangMem
Already in Postgres and want to add memory without a new service?
→ pgvector + hybrid BM25 (the pragmatic answer nobody talks about)
The honest multi-system answer: Production hyperpersonalization usually requires a stack, not a single library.
A pattern that works: Letta’s core memory blocks (always-present identity context) + mem0 or Graphiti for long-term fact storage + HippoRAG-style retrieval for associative surfacing. Each handles what it’s best at.
What No System Has Solved Yet
The systems above are genuinely impressive. And they still miss things that matter.
Forgetting. ADD-only accumulation means stale facts never disappear; they just rank lower. After 2 years of daily use at roughly five extracted facts a day, you have about 3,650 facts, most of which are about finished projects, resolved concerns, and contexts that no longer exist. They dilute retrieval signal for current queries. No production system has implemented proper salience decay (track access frequency per memory item, archive what hasn’t been touched in 6+ months). This is architecturally straightforward and semantically hard.
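The architecturally straightforward half looks roughly like the sketch below: record every access, and periodically move untouched items to an archive. The class, method names, and the 180-day idle limit are hypothetical. The semantically hard half, deciding whether an untouched fact is truly stale or just rarely needed, is exactly what this sketch does not address.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MemoryItem:
    text: str
    last_accessed: datetime
    access_count: int = 0

class DecayingStore:
    def __init__(self, max_idle=timedelta(days=180)):
        self.active = {}    # key -> MemoryItem, eligible for retrieval
        self.archive = {}   # key -> MemoryItem, kept but out of the retrieval path
        self.max_idle = max_idle

    def add(self, key, text, now):
        self.active[key] = MemoryItem(text, last_accessed=now)

    def touch(self, key, now):
        """Read a memory and record the access."""
        item = self.active[key]
        item.access_count += 1
        item.last_accessed = now
        return item.text

    def sweep(self, now):
        """Archive every item that has been idle longer than max_idle."""
        stale = [
            key for key, item in self.active.items()
            if now - item.last_accessed > self.max_idle
        ]
        for key in stale:
            self.archive[key] = self.active.pop(key)
        return stale
```

Note that `access_count` is tracked but not yet used by `sweep`; a fuller version might combine frequency and recency into a single salience score before archiving.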
Proactive memory. Every system on this list is reactive: it retrieves information when asked. A human mentor doesn’t wait to be asked. When you say “I’m thinking of spending two months building this feature,” a good mentor immediately surfaces “you mentioned three weeks ago you hadn’t validated whether anyone would pay for it.” Building that requires trigger pattern recognition, background relevance monitoring, and conversational timing judgment. None of those exist in any open-source framework yet.
Cognitive profiles. None of these systems model how a person thinks, only what they’ve said. The difference between a good database and a good mentor is that the mentor knows when you’re avoiding something, what kind of push you actually respond to, and where your confidence is miscalibrated. Building this requires inference from patterns across many sessions, not explicit fact storage. It’s the hardest problem in the field.
These aren’t reasons to not build. They’re the frontier. The teams that figure out proactive memory and cognitive modeling are building something genuinely new.
The Bottom Line
If you’re building an AI product with any kind of user relationship over time, you need memory architecture. The stateless default is not a valid product choice for anything beyond one-shot tasks.
The right answer depends on your problem:
- Ship fastest? → mem0
- Self-editing agent? → Letta
- Facts that evolve? → Graphiti
- Knowledge that compounds? → Cognee or Karpathy Wiki
- Highest accuracy, MCP-native? → Hindsight
- Associative reasoning? → HippoRAG + A-MEM
The field is moving fast. Hindsight didn’t exist 18 months ago. The Graphiti paper only appeared in January 2025. The next breakthrough, probably proactive memory or cognitive modeling, is being built right now by someone who read all of these papers and asked “what’s still missing?”
Interested in how this applies to building AI mentors and hyperpersonalized learning systems? Follow for the next piece; I’m going deeper on the cognitive architecture layer that none of these libraries ship.
Sources & Further Reading
- mem0: github.com/mem0ai/mem0
- Letta: github.com/letta-ai/letta
- Graphiti: github.com/getzep/graphiti | Paper: arXiv 2501.13956
- Cognee: github.com/topoteretes/cognee
- Hindsight: usehindight.ai
- HippoRAG: arXiv 2405.14831 (v1), 2501.14247 (v2)
- A-MEM: arXiv 2502.12110
- GraphRAG: github.com/microsoft/graphrag | arXiv 2404.16130
- Karpathy LLM Wiki: gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
- LangMem: github.com/langchain-ai/langmem
- Benchmark comparison (2026): vectorize.io/articles/best-ai-agent-memory-systems