Hey HN,
4.5 months ago I was climbing cell towers and installing closets. Zero programming background. Today I'm releasing VAC Memory System, an open-source conversational memory system that scores 80.1% on the LoCoMo benchmark, beating Mem0 (68.5%), Letta/MemGPT (74%), Zep (75.1%), and Memobase (75.8%).
GitHub: https://github.com/vac-architector/VAC-Memory-System
The Problem I Solved
Vector databases have a critical blind spot: they return semantically similar but factually wrong results.
Example:
Query: "Where did I meet Alice?"
FAISS returns: "I met Bob at the coffee shop"
Why? High cosine similarity: same sentence structure, a location mention. But the wrong entity.
BM25 catches "Alice" via exact match but misses paraphrasing ("encountered Alice", "ran into her").
Mem0's approach: LLM-driven extraction → graph storage → semantic retrieval. Great for preferences, but still vulnerable to entity confusion on factual questions.
MemGPT/Letta's approach: OS-like virtual memory with paging between "core" and "archival" memory. Elegant for context management, but retrieval relies on the same semantic search that causes false positives.
My Solution: MCA (Multi-Candidate Assessment)
I invented a physics-inspired pre-filter that runs before expensive vector search:
# The "gravitational" ranking formula coverage = len(query_keywords & memory_keywords) / len(query_keywords) distance = max(0. 1, 1. 0 - coverage) mass = coverage * importance_weight force = G * (query_mass * memory_mass) / (distanceΒ² + Ξ΄)
The insight: Treat memories as planets with mass. Keywords create "gravitational attraction". High entity overlap = strong pull toward the query.
Why it works:
- Query: "Where did I meet Alice?" → keywords: {alice, meet}
- Memory: "Met Bob at coffee shop" → coverage = 0/2 = 0% → filtered out
- Memory: "I met Alice at the library" → coverage = 2/2 = 100% → passed through
MCA eliminates false positives before FAISS even runs.
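Here's a minimal, self-contained sketch of the idea (function and field names are illustrative, not the repo's actual mca_lite.py API; query_mass is fixed at 1.0):

```python
# Minimal MCA pre-filter sketch. Keywords are taken as-is here; upstream
# synonym expansion is what maps "met" -> "meet" in the real pipeline.
def mca_filter(query_keywords, memories, threshold=0.1, G=1.0, delta=0.001):
    ranked = []
    for mem in memories:
        coverage = len(query_keywords & mem["keywords"]) / len(query_keywords)
        if coverage < threshold:
            continue  # entity mismatch: dropped before FAISS ever runs
        distance = max(0.1, 1.0 - coverage)
        mass = coverage * mem.get("importance", 1.0)
        force = G * (1.0 * mass) / (distance ** 2 + delta)
        ranked.append((force, mem["text"]))
    return [text for _, text in sorted(ranked, reverse=True)]

memories = [
    {"text": "Met Bob at coffee shop", "keywords": {"bob", "met", "coffee"}},
    {"text": "I met Alice at the library", "keywords": {"alice", "meet", "library"}},
]
print(mca_filter({"alice", "meet"}, memories))  # ['I met Alice at the library']
```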
Full Architecture (8 Steps)
Query: "Where did I meet Alice?"
β
[1] Query Classification (factual/temporal/conceptual)
β
[2] LLM Synonym Expansion (Qwen 14B via Ollama)
"alice" β ["alice", "alicia", "her"]
"meet" β ["meet", "met", "encountered", "ran into"]
β
[3] MCA-FIRST FILTER (coverage β₯ 0.1)
1000 memories β ~30 candidates
β
[4] FAISS (BGE-large, 1024D)
Adds semantic matches: "visited Alice", "saw her"
β 100 candidates
β
[5] BM25 (Okapi with custom tokenization)
Catches keyword variations FAISS missed
β 40 more candidates
β
[6] Union + Deduplication β ~120 unique
β
[7] Cross-Encoder Reranking (bge-reranker-v2-m3, 278M params)
120 β 15 best
β
[8] GPT-4o-mini (T=0.0, max_tokens=150)
β Final answer
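To make steps [5] and [6] concrete, here is a hedged sketch of BM25 scoring plus the union/dedup step using the rank_bm25 package (the candidate IDs and whitespace tokenization are stand-ins; the real pipeline uses custom tokenization):

```python
# Sketch of steps [5]-[6]: BM25 scoring, then union + order-preserving dedup.
from rank_bm25 import BM25Okapi

corpus = ["i met alice at the library", "met bob at coffee shop"]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # toy tokenizer: str.split
scores = bm25.get_scores("where did i meet alice".split())
bm25_ids = [i for i, s in enumerate(scores) if s > 0]

mca_ids, faiss_ids = [0], [0, 1]  # pretend outputs of steps [3] and [4]
candidates = list(dict.fromkeys(mca_ids + faiss_ids + bm25_ids))  # step [6]
```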
Head-to-Head Comparison
| Aspect | VAC Memory | Mem0 | Letta/MemGPT | Zep |
|---|---|---|---|---|
| LoCoMo Accuracy | 80.1% | 68.5% | 74.0% | 75.1% |
| Architecture | MCA + FAISS + BM25 + Cross-Encoder | LLM extraction + Graph | OS-like paging + Archive search | Summarize + Vector |
| Entity Protection | ✅ MCA pre-filter | ❌ Semantic only | ❌ Semantic only | ❌ Semantic only |
| Latency | 2.5 sec/query | ~3-5 sec | ~2-4 sec | ~2-3 sec |
| Cost per 1M tokens | <$0.10 | ~$0.50+ | ~$0.30+ | ~$0.20+ |
| Reproducibility | 100% (seed-locked) | Variable | Variable | Variable |
| Conversation Isolation | 100% | Partial | Partial | Partial |
Why Existing Solutions Fail on Factual Questions
Mem0's Graph Memory
- Strength: Great for storing relationships and preferences ("User likes pizza")
- Weakness: On factual retrieval ("When did I meet Alice?"), the LLM-driven extraction can miss nuances. Graph traversal still relies on semantic similarity for node matching.
- VAC advantage: MCA ensures entity-level precision before any semantic matching
Letta/MemGPT's Virtual Memory
- Strength: Elegant OS-inspired design. Self-editing memory blocks. Multi-step reasoning via "heartbeats".
- Weakness: Archival retrieval uses `archival_memory_search`, which is... vector search. Same entity-confusion problem.
- VAC advantage: Hybrid retrieval (MCA + FAISS + BM25) covers all failure modes
Pure Vector Search (FAISS/Pinecone)
- Strength: Fast, scalable, catches paraphrasing
- Weakness: Optimizes for cosine similarity, not factual correctness
- VAC advantage: Cross-encoder reranking on filtered candidates, not raw vectors
The Numbers Behind 80.1%
Validated across:
- 10 conversations × 10 seeds = 100 runs
- 1,540 total questions
- 3 question types: Single-hop (87%), Multi-hop (78%), Temporal (72%)
Component Recall (ground truth coverage):
- MCA alone: 40-50%
- FAISS alone: 65-70%
- BM25 alone: 50%
- Union (MCA + FAISS + BM25): 94-100%
Key insight: No single retrieval method is sufficient. The union catches what each individual method misses.
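As a toy illustration of that insight (made-up memory IDs, not benchmark data):

```python
# Toy recall arithmetic: each method misses something,
# but the union covers the full ground truth.
def recall(retrieved, ground_truth):
    return len(retrieved & ground_truth) / len(ground_truth)

gt = {"m1", "m2", "m3", "m4"}
mca, faiss_hits, bm25_hits = {"m1", "m2"}, {"m2", "m3"}, {"m1", "m4"}
print(recall(mca, gt), recall(faiss_hits, gt), recall(bm25_hits, gt))  # 0.5 0.5 0.5
print(recall(mca | faiss_hits | bm25_hits, gt))                        # 1.0
```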
Technical Deep Dive
Why MCA Works (The Physics Metaphor)
When I explained my idea to Claude CLI, I said: "Memories are like planets. They have MASS based on frequency. They ATTRACT the query with gravitational force."
Claude thought I was crazy. Three hours of arguing later:
```python
def gravitational_force(m1, m2, distance):
    G = 1.0
    return G * (m1 * m2) / (distance ** 2 + 0.001)
```
Result: +15% recall improvement.
Embedding Stack
- Model: BAAI/bge-large-en-v1.5 (1024D)
- Index: FAISS IVF1024,Flat
- Why BGE: #1 on MTEB leaderboard for retrieval
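A hedged sketch of that stack (a flat inner-product index stands in for the production IVF1024,Flat build so the toy example runs without a training pass):

```python
# Embedding + retrieval sketch. IndexFlatIP substitutes for "IVF1024,Flat";
# on unit-normalized vectors, inner product equals cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
texts = ["I met Alice at the library", "Met Bob at coffee shop"]
emb = model.encode(texts, normalize_embeddings=True)  # unit-norm 1024-D vectors

index = faiss.IndexFlatIP(1024)
index.add(np.asarray(emb, dtype="float32"))

q = model.encode(["Where did I meet Alice?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), 2)
```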
Cross-Encoder Precision
- Model: BAAI/bge-reranker-v2-m3 (278M params)
- Why: Cross-encoders see query+document together, not separate embeddings
- Impact: Converts 94% recall → 80% accuracy
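Roughly, via sentence-transformers (the candidate texts are illustrative):

```python
# Reranking sketch: the cross-encoder scores (query, document) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
query = "Where did I meet Alice?"
docs = ["I met Alice at the library", "Met Bob at coffee shop"]
scores = reranker.predict([(query, d) for d in docs])
top = [d for _, d in sorted(zip(scores, docs), reverse=True)][:15]
```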
LLM Generation
- Model: GPT-4o-mini
- Temperature: 0.0 (deterministic)
- Why: Cheapest + fastest + reproducible
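The final call looks roughly like this (the prompt wording is a placeholder, not the repo's actual prompt):

```python
# Final answer generation sketch with the OpenAI SDK.
from openai import OpenAI

client = OpenAI()
top_memories = ["I met Alice at the library"]  # output of the reranking step
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.0,
    max_tokens=150,
    messages=[
        {"role": "system", "content": "Answer using only the memories provided."},
        {"role": "user", "content": "Memories:\n"
            + "\n".join(top_memories)
            + "\n\nQuestion: Where did I meet Alice?"},
    ],
)
print(resp.choices[0].message.content)
```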
Cost Comparison
| System | Cost per 1M tokens | Notes |
|---|---|---|
| VAC Memory | <$0.10 | GPT-4o-mini at T=0.0 |
| Mem0 | ~$0.50+ | LLM extraction overhead |
| Letta Cloud | ~$0.30+ | Agent orchestration |
| OpenAI Memory | ~$0.30+ | Built-in, but limited |
VAC is 5-10x cheaper because:
- MCA filter reduces candidate pool before expensive operations
- Single LLM call for final answer only
- No LLM-driven memory extraction/consolidation
Reproducibility
Every result is verifiable:
```bash
# Run with a fixed seed
SEED=2001 LOCOMO_CONV_INDEX=0 python orchestrator.py
# Same seed = same results; 100 runs validated
```
This matters. Most AI benchmarks are non-reproducible. VAC locks everything:
- Random seeds
- Temperature = 0.0
- Deterministic FAISS search
- Hash verification of indexes
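A sketch of what that locking looks like in practice (the hash check is illustrative; the repo's actual mechanism may differ):

```python
# Illustrative seed locking + index hash verification; not the repo's
# exact mechanism.
import hashlib, os, random
import numpy as np

seed = int(os.environ.get("SEED", "2001"))
random.seed(seed)
np.random.seed(seed)

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Compare against a pinned hash before trusting a pre-built index, e.g.:
# assert sha256_of("data/faiss.index") == PINNED_HASH
```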
What's in the Repo
```
VAC-Memory-System/
├── mca_lite.py       # ~40 lines: learn the MCA concept
├── pipeline_lite.py  # ~250 lines: 4-step demo pipeline
├── Core/*.so         # Compiled production binaries
├── data/             # Pre-built SQLite + FAISS indexes
├── baseline_100/     # 100 verified benchmark runs
└── run_test.sh       # One-click testing
```
LITE version: Fully open-source Python to learn the architecture.
FULL version: Compiled binaries that achieve 80.1%.