Librarian - Intelligent Context Management for AI


Open Source

Stop burning tokens.
Cure context rot.

AI agents re-read your entire context on every turn - costs explode, quality drops. The Librarian fixes this: up to 85% fewer tokens, no context rot, and near-infinite scalability. Open source.

💸

Quadratic Cost

By turn 50, brute-force approaches send 6× more tokens than necessary. Every turn re-processes the entire history - costs scale as n².

Up to 85% cost reduction vs. brute-force at 50 turns
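The quadratic-vs-linear scaling can be checked with a back-of-envelope calculation. The per-message size and context budget below are illustrative assumptions, not the project's measured benchmarks:

```python
# Back-of-envelope: cumulative tokens sent over n turns.
# TOKENS_PER_MSG is a hypothetical figure, not a measured one.
TOKENS_PER_MSG = 500

def brute_force_total(turns: int) -> int:
    # Each turn re-sends the entire history: 1 + 2 + ... + n messages,
    # so cumulative cost grows as n^2.
    return sum(t * TOKENS_PER_MSG for t in range(1, turns + 1))

def librarian_total(turns: int, index_tokens: int = 100,
                    hydrated_budget: int = 800) -> int:
    # Each turn sends the compressed index (~100 tokens per past message)
    # plus a curated context capped at a fixed budget.
    return sum(t * index_tokens + hydrated_budget for t in range(1, turns + 1))

bf = brute_force_total(50)
lib = librarian_total(50)
print(f"brute-force: {bf}, librarian: {lib}, savings: {1 - lib / bf:.0%}")
```

The exact savings depend on message sizes and the index compression ratio; the point is that brute-force cost is quadratic in turn count while the indexed approach stays close to linear.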

🧠

Context Rot

As context grows, LLMs lose track. Key instructions get buried under noise. Research shows the "Lost in the Middle" effect can cause quality to drop by 20-85% as context length increases.

82% answer accuracy - beats brute-force (78%) with less context

⏱️

Latency Ceiling

At 100K tokens, brute-force response generation can take up to 60 seconds. The prefill cost scales linearly with history size.

3-4× faster at scale - at 100K tokens vs. brute-force

How the Librarian Works

A simple three-step process that replaces brute-force context with intelligent reasoning.

1. Index

After each message, a lightweight model creates a ~100-token summary. This builds a compressed index of the entire conversation - 10× smaller than the raw history. This happens asynchronously, so the user never waits.
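The indexing step above can be sketched as follows. `summarize` is a hypothetical stand-in for a call to a lightweight summarization model, and the class and method names are illustrative, not the Librarian's actual API:

```python
import asyncio

async def summarize(message: str) -> str:
    # Placeholder: in practice this asks a small LLM for a ~100-token
    # summary of the message.
    return message[:400]

class SummaryIndex:
    def __init__(self) -> None:
        # Each entry pairs a message id with its compressed summary.
        self.entries: list[tuple[int, str]] = []

    async def add(self, message_id: int, message: str) -> None:
        summary = await summarize(message)
        self.entries.append((message_id, summary))

    def render(self) -> str:
        # The compressed index the Librarian reads at selection time.
        return "\n".join(f"[{mid}] {s}" for mid, s in self.entries)

# Indexing can run in the background so the user never waits, e.g.:
#   asyncio.create_task(index.add(msg_id, msg))
```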

2. Select

When a new message arrives, the Librarian reads the summary index and reasons about which messages are relevant. Unlike vector search, it understands temporal logic and dependencies between messages.
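A minimal sketch of the selection step: the model reasons over the summary index and returns the ids of messages worth fetching in full. `call_llm` is a hypothetical stand-in for your LLM client, and the prompt wording is illustrative:

```python
import json

SELECT_PROMPT = """You are a context librarian. Given the summary index
below and a new user message, return a JSON list of the message ids whose
full content is needed to answer. Consider temporal order and
dependencies between messages, not just keyword overlap.

Index:
{index}

New message: {query}
"""

def select_messages(index_text: str, query: str, call_llm) -> list[int]:
    # The model returns something like "[3, 7, 12]"; parse it into ids.
    raw = call_llm(SELECT_PROMPT.format(index=index_text, query=query))
    return [int(i) for i in json.loads(raw)]
```

Because the model reads the summaries as prose, it can apply temporal logic ("the fix discussed *after* the bug report") that a pure embedding lookup would miss.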

3. Hydrate

Only the selected messages are fetched in full and passed to the responder. The result: a highly curated context of ~800 tokens instead of 2,000+ tokens of noise. Less noise → better answers.
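The hydration step can be sketched like this. `store` is a hypothetical mapping from message id to full message text, and the 4-characters-per-token estimate is a common rough heuristic, not the Librarian's tokenizer:

```python
def hydrate(selected_ids: list[int], store: dict[int, str],
            budget: int = 800) -> str:
    # Fetch only the selected messages in full, stopping once the
    # curated context would exceed the token budget.
    context: list[str] = []
    used = 0
    for mid in selected_ids:
        text = store[mid]
        cost = len(text) // 4  # rough token estimate (~4 chars/token)
        if used + cost > budget:
            break
        context.append(text)
        used += cost
    return "\n\n".join(context)
```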

Built for Everyone

Coming Soon

Fine-Tuned Librarian Endpoints

We're building specialized LLM endpoints optimized for the Librarian's selection task. Early benchmarks show 1.3s context creation - an 84% reduction from general-purpose models. Zero config, drop-in replacement.