Why RAG Failed Us for SRE and How We Built Dynamic Memory Retrieval Instead

2 points by TheBengaluruGuy 4 days ago · 1 comment

We built DrDroid, an AI agent for SRE and production operations, and found that standard embedding-based retrieval (RAG) did not work well in engineering contexts. Keywords and jargon carry little semantic meaning for an embedding model, and a single-character difference (e.g. us-east-1 vs us-east-2) can refer to completely unrelated things.
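To make the us-east-1 vs us-east-2 point concrete, here is a minimal sketch of exact-token keyword lookup. It is an illustration only, not DrDroid's implementation: the tokenizer regex, the function names, and the sample alert records are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Assumption: identifiers such as "us-east-1" or "payments-api" should stay
# intact as single tokens rather than being split on "-", "_" or ".".
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-_.][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its tokens, keeping identifiers whole."""
    return TOKEN_RE.findall(text.lower())

def build_index(records):
    """Build an inverted index mapping each token to the ids of records containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index

def search(index, query):
    """Return the ids of records that contain every query token exactly."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical records, invented for the example.
records = {
    "alert-1": "High latency on payments-api in us-east-1",
    "alert-2": "High latency on payments-api in us-east-2",
}
index = build_index(records)

print(search(index, "us-east-1"))     # only the us-east-1 alert
print(search(index, "payments-api"))  # both alerts
```

Because the match is on whole tokens, the two regions never collide, which is the property we found missing when near-identical identifiers were compared by similarity.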

This writeup explains how we designed Dynamic Memory Retrieval (DMR), a multi-layered agentic search system over 80+ integrations (Grafana, Datadog, K8s, AWS, etc.). It indexes 200+ record types and lets the agent iteratively discover and extract relevant context during production investigations.
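As a rough picture of what "iteratively discover" can mean, the sketch below starts from a seed term and follows the exact identifiers found in each matched record to reach related records. This is a toy loop under stated assumptions, not the DMR design: the record ids, the identifier regex, and the `iterative_lookup` function are hypothetical.

```python
import re

# Assumption: only hyphenated, dotted, or underscored names count as
# identifiers worth following (e.g. "checkout-svc", "prod-eks-1").
ID_RE = re.compile(r"[a-z0-9]+(?:[-_.][a-z0-9]+)+")

def iterative_lookup(records, seed_terms, max_rounds=3):
    """Expand from seed terms by following identifiers found in matched records."""
    seen_terms = set()
    found = {}
    frontier = {term.lower() for term in seed_terms}
    for _ in range(max_rounds):
        if not frontier:
            break
        seen_terms |= frontier
        next_frontier = set()
        for record_id, text in records.items():
            if record_id in found:
                continue
            identifiers = set(ID_RE.findall(text.lower()))
            # Exact match only: "prod-eks-1" never matches "prod-eks-2".
            if identifiers & frontier:
                found[record_id] = text
                next_frontier |= identifiers - seen_terms
        frontier = next_frontier
    return found

# Hypothetical records spanning several integrations.
records = {
    "grafana:dashboard": "Dashboard for checkout-svc latency",
    "k8s:deployment": "Deployment checkout-svc runs on cluster prod-eks-1",
    "aws:cluster": "EKS cluster prod-eks-1 in us-east-1",
    "aws:other": "EKS cluster prod-eks-2 in us-east-2",
}

context = iterative_lookup(records, ["checkout-svc"])
print(sorted(context))
```

Starting from the service name alone, the loop reaches the deployment and then its cluster, while the unrelated prod-eks-2 record stays out of the result. The `max_rounds` cap is an assumed guard to keep the expansion bounded.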

Key takeaways: why keyword search outperformed embeddings for our domain, how we structure short-term vs long-term memory, and what it takes to make an agent reliably navigate a company's entire production stack.
