A self-hosted deep research agent that browses the web, takes notes, and writes reports — surviving crashes, rate limits, and API outages by checkpointing every LLM turn to Postgres.
Built on Pi Agent SDK + Absurd + Steel.
## Quick Start

```bash
# Prerequisites: Docker, Bun
git clone https://github.com/steel-experiments/durable-researcher.git
cd durable-researcher

# One-command setup (installs deps, starts Postgres, initializes schema)
./setup.sh

# Edit .env to add your API keys, then run a research task
bun run dev "quantum error correction advances"
```
Or do it manually:
```bash
bun install
docker compose up -d   # Start Postgres
bun run db:init        # Initialize Absurd schema + default queue
ZAI_API_KEY=... STEEL_API_KEY=... bun run dev "quantum error correction advances"
```
## How It Works

```
You → CLI → Absurd task (Postgres) → Pi Agent loop (glm-5.1 + reasoning)
                                           ↓
         plan → prefetch → note → evaluate → scout → report
                    ↑                ↑          ↑
               Bing/DDG         Steel Browse  Cache
              (fallback)          (scrape)   (Postgres)
```
The agent follows a plan → prefetch → note → evaluate → follow-up → report cycle:
- Plan — Decomposes your topic into targeted sub-queries
- Prefetch — Fans out all sub-queries in parallel: concurrent search + browse with relevance filtering
- Note — Records structured findings with source attribution and auto-deduplication
- Evaluate — Assesses coverage gaps with quality-ranked notes
- Follow-up — Targeted search + browse for gaps, or direct browsing of known URLs
- Report — Writes a sourced, analytical report (format adapts to user instructions)
Every message is checkpointed to Postgres via Absurd. Kill the process mid-run, restart, and it picks up exactly where it left off. Reports are saved to output/ automatically.
## CLI

```bash
# Basic research
bun run dev "impact of AI on journalism"

# Control depth: quick (1 iteration), standard (3), deep (5)
bun run dev "Rust vs Go for microservices" --depth deep

# Ask clarifying questions before researching
bun run dev "AI safety" --clarify

# Use a different model
bun run dev "AI safety" --model anthropic:claude-sonnet-4-6

# Limit sources
bun run dev "quantum computing" --max-sources 10
```
### Working with existing research
When you run a topic that already has completed research, you're prompted to choose:
```
Found completed research on this topic:

  019d6494-... "quantum error correction advances" [completed] (30m ago)

What would you like to do?
  [v] View existing report
  [e] Extend research with more sources
  [n] Start fresh research

Choice (v/e/n):
```
Or use flags to skip the prompt:
bun run dev "quantum error correction" --view # view existing report + follow-up mode bun run dev "quantum error correction" --extend # extend with more sources bun run dev "quantum error correction" --new # start fresh
Extend mode seeds the new task with all prior notes and visited URLs. The agent focuses on gaps, newer developments, and low-confidence areas without re-browsing pages it already visited.
After any report, you enter follow-up mode where you can ask questions about the findings:
```
================================================================================
  FOLLOW-UP MODE — ask questions about the research (type 'exit' to quit)
================================================================================

> What are the main differences between surface codes and qLDPC codes?
```
### Task management

```bash
bun run dev --list               # list recent research tasks
bun run dev --resume <task-id>   # resume a specific task
bun run dev --cleanup            # remove completed/failed tasks
```
In-progress tasks with the same or similar topic are auto-detected and resumed. Similarity matching uses an LLM to catch differently-worded queries on the same subject.
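As a rough sketch of what that fuzzy match does (the actual prompt and wiring live in `src/task-finder.ts`; `callUtilityModel` below is just a placeholder for a utility-model call):

```ts
// Hypothetical helper: ask the utility model whether two topics describe the
// same research subject. Prompt wording and plumbing are illustrative only.
async function isSameTopic(
  newTopic: string,
  existingTopic: string,
  callUtilityModel: (prompt: string) => Promise<string>,
): Promise<boolean> {
  const prompt = [
    "Do these two research topics ask about the same subject?",
    `A: ${newTopic}`,
    `B: ${existingTopic}`,
    "Answer with exactly YES or NO.",
  ].join("\n");
  const answer = await callUtilityModel(prompt);
  return answer.trim().toUpperCase().startsWith("YES");
}
```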
## Configuration
Copy .env.example to .env or export directly — shell env vars take precedence over .env.
| Variable | Required | Description |
|---|---|---|
| `ZAI_API_KEY` | Yes | Z.AI API key for GLM models |
| `STEEL_API_KEY` | Yes | Steel Cloud API key |
| `DATABASE_URL` | No | Postgres connection (default: `postgresql://postgres:postgres@localhost:5432/absurd`) |
| `AGENT_MODEL` | No | Agent model (default: `zai:glm-5.1`) |
| `AGENT_REASONING` | No | Agent reasoning effort (default: `high`) |
| `UTILITY_MODEL` | No | Utility model (default: `zai:glm-5.1`) |
| `UTILITY_REASONING` | No | Utility reasoning effort (default: off) |
| `MAX_DURATION` | No | Task timeout in seconds (default: 1200 = 20 min) |
| `JUDGE_MODEL` | Eval only | Judge model: `gemini-2.5-pro` or `claude-haiku-4-5-20251001` |
| `GEMINI_API_KEY` | Eval only | Google API key (required if using Gemini judge) |
| `ANTHROPIC_API_KEY` | Eval only | Anthropic API key (required if using Claude judge) |
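As a minimal sketch of how the shell-over-`.env` precedence and the defaults above could be wired (not the actual `config.ts`; it assumes a dotenv-style loader, which by default does not overwrite variables already exported in the shell):

```ts
import dotenv from "dotenv";

// .env values fill in the gaps; anything already exported in the shell wins.
dotenv.config();

export const config = {
  databaseUrl:
    process.env.DATABASE_URL ??
    "postgresql://postgres:postgres@localhost:5432/absurd",
  agentModel: process.env.AGENT_MODEL ?? "zai:glm-5.1",
  agentReasoning: process.env.AGENT_REASONING ?? "high",
  maxDurationSeconds: Number(process.env.MAX_DURATION ?? 1200),
};
```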
## Architecture

```
src/
├── agent.ts # Absurd task registration + durable agent loop
├── bench.ts # Headless CLI bridge for benchmarking
├── browse-cache.ts # Postgres-backed cache for scraped pages (survives crashes)
├── config.ts # Centralized config from .env (models, reasoning, timeout)
├── durable-turns.ts # Checkpoint bridge: Absurd steps ↔ Pi Agent messages
├── steel-client.ts # Steel SDK wrapper, multi-engine search, relevance filtering
├── task-finder.ts # Task deduplication (exact + LLM fuzzy match)
├── notes-ranker.ts # Trigram similarity dedup + confidence ranking
├── clarify.ts # Pre-research clarification questions via LLM
├── follow-up.ts # Interactive follow-up questions after report
├── content.ts # Text cleaning, truncation, quality checks
├── prompts.ts # Handlebars template loader
├── index.ts # CLI entry point
└── tools/
├── plan.ts # Generate sub-queries + search strategy
├── prefetch.ts # Pipelined parallel search+browse fan-out (concurrency 10)
├── scout.ts # Search+browse in one call for follow-up gaps
├── search.ts # Web search with relevance filtering
├── browse.ts # Scrape + smart summarize (raw ≤4KB, LLM >4KB)
├── screenshot.ts # Capture page screenshots
├── note.ts # Record structured findings with auto-dedup
└── evaluate.ts # Assess research coverage with ranked notes
prompts/
├── system.hbs # Main agent system prompt
├── plan.hbs # Research planning prompt
├── summarize.hbs # Page summarization prompt
└── clarify.hbs # Clarification question generation prompt
```
## Durable Turns Pattern

The core innovation: every `message_end` event from the Pi Agent loop is persisted as an Absurd step checkpoint. On resume:

1. `loadMessageLog()` replays all checkpointed messages from Postgres
2. `rebuildStateFromMessages()` reconstructs notes + the URL dedup set from replayed tool calls
3. `runAgentLoopContinue()` feeds the full conversation back to the LLM — it continues seamlessly
The LLM doesn't know it crashed. The conversation transcript IS the state.
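A minimal sketch of the bridge (the real one is `src/durable-turns.ts`; the checkpoint store and turn runner below are placeholders, not the actual Absurd or Pi SDK signatures):

```ts
// Stand-ins for the Absurd step store and the Pi Agent turn runner.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

interface CheckpointStore {
  append(taskId: string, msg: Message): Promise<void>; // one Absurd step per message
  loadAll(taskId: string): Promise<Message[]>;
}

async function runDurableTurns(
  taskId: string,
  store: CheckpointStore,
  runAgentTurn: (history: Message[]) => Promise<Message[]>, // one LLM turn
  isDone: (history: Message[]) => boolean,
): Promise<Message[]> {
  // On resume this replays every checkpointed message; on a fresh run it is empty.
  const history = await store.loadAll(taskId);

  while (!isDone(history)) {
    // A turn may yield several messages (assistant text, tool calls, tool results).
    const newMessages = await runAgentTurn(history);
    for (const msg of newMessages) {
      // Checkpoint each message individually, so a crash loses at most
      // the turn that was in flight.
      await store.append(taskId, msg);
      history.push(msg);
    }
  }
  return history;
}
```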
## Hard Limits & Graceful Timeout
The agent enforces hard stops via steering messages injected into the conversation:
| Limit | Standard depth | Trigger |
|---|---|---|
| Max sources | 20 | Browsed URL count |
| Max turns | 45 | Assistant message count |
| Task timeout | 1200s (configurable) | Wall-clock time (60s warning buffer) |
When a limit is hit, the agent is told to stop researching and write its report immediately. On timeout, the agent loop is aborted via AbortController and a partial report is built from accumulated notes. Timeout is configurable via MAX_DURATION env var.
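A sketch of that shape, assuming an `AbortController`-driven loop (names and error handling are illustrative, not the actual `agent.ts`):

```ts
// Abort the agent loop shortly before the deadline and fall back to a partial
// report built from the notes accumulated so far.
const MAX_DURATION = Number(process.env.MAX_DURATION ?? 1200); // seconds
const WARNING_BUFFER = 60; // stop this long before the hard deadline

async function runWithDeadline(
  runLoop: (signal: AbortSignal) => Promise<string>,
  buildPartialReport: () => string,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(),
    (MAX_DURATION - WARNING_BUFFER) * 1000,
  );
  try {
    return await runLoop(controller.signal);
  } catch (err) {
    if (controller.signal.aborted) {
      // Timed out: salvage a report from whatever was gathered.
      return buildPartialReport();
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```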
## Parallel Prefetch
After planning, the prefetch_sources tool fans out all sub-queries concurrently:
- Pipelined: browses start as each search completes (not waiting for all searches)
- Browses top 2 results per query with a semaphore (max 10 concurrent)
- Budget capped at `maxSources / 2` to leave room for targeted follow-ups
- Results filtered by relevance scoring before browsing
- Smart summarization: content ≤4KB returned raw (preserves specific data), longer content LLM-summarized
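A rough sketch of that pipelining and the concurrency cap (illustrative only; the real logic is in `src/tools/prefetch.ts`, and `searchWeb`/`browsePage` stand in for the Steel-backed tools):

```ts
type SearchResult = { url: string; title: string; relevant: boolean };

// Tiny concurrency limiter standing in for the semaphore.
function limitConcurrency(max: number) {
  const queue: (() => void)[] = [];
  let active = 0;
  const release = () => {
    active--;
    queue.shift()?.();
  };
  return <T>(fn: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        fn().then(resolve, reject).finally(release);
      };
      if (active < max) run();
      else queue.push(run);
    });
}

async function prefetchSources(
  subQueries: string[],
  searchWeb: (q: string) => Promise<SearchResult[]>,
  browsePage: (url: string) => Promise<string>,
  budget: number, // e.g. maxSources / 2
) {
  const limit = limitConcurrency(10); // at most 10 concurrent browses
  let used = 0;

  // All searches run in parallel, and each schedules its browses as soon as
  // it returns instead of waiting for the other searches to finish.
  await Promise.all(
    subQueries.map(async (query) => {
      const results = await searchWeb(query);
      const top = results.filter((r) => r.relevant).slice(0, 2); // top 2 per query
      await Promise.all(
        top.map((r) =>
          limit(async () => {
            if (used >= budget) return; // stay inside the source budget
            used++;
            await browsePage(r.url);
          }),
        ),
      );
    }),
  );
}
```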
## Scout Tool
The scout tool combines search + browse in one call for follow-up gaps. Instead of the 3-turn pattern (search → LLM decides → browse), scout does it in 1 turn — searching a query, filtering by relevance, and browsing the top 3 results in parallel. Saves 1-2 LLM turns per follow-up cycle.
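Roughly, the tool body does this (a sketch, not the actual `src/tools/scout.ts`; the search, scoring, and browse helpers are placeholders):

```ts
async function scout(
  query: string,
  searchWeb: (q: string) => Promise<{ url: string; title: string }[]>,
  scoreRelevance: (r: { url: string; title: string }) => number,
  browsePage: (url: string) => Promise<string>,
) {
  const results = await searchWeb(query);
  const top = results
    .map((r) => ({ ...r, score: scoreRelevance(r) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, 3); // browse the top 3 relevant hits

  // Browse in parallel and return everything in one tool result, so the
  // LLM spends one turn instead of three.
  return Promise.all(
    top.map(async (r) => ({ url: r.url, content: await browsePage(r.url) })),
  );
}
```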
## Browse Cache
Scraped pages are cached in Postgres keyed by (task_id, url). On crash/resume, cached pages are reused without re-scraping Steel. Cache entries expire after 7 days and are cleaned up automatically with --cleanup.
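A sketch of the cache access path, assuming a `browse_cache` table keyed by `(task_id, url)` (the real code is `src/browse-cache.ts`; table and column names here are guesses, and `sql` is a placeholder for whatever Postgres client is wired up):

```ts
type Sql = (query: string, params?: unknown[]) => Promise<{ rows: any[] }>;

// Return a cached page if it exists and is younger than 7 days.
async function getCachedPage(sql: Sql, taskId: string, url: string) {
  const { rows } = await sql(
    `SELECT content FROM browse_cache
      WHERE task_id = $1 AND url = $2
        AND fetched_at > now() - interval '7 days'`,
    [taskId, url],
  );
  return rows[0]?.content ?? null;
}

// Upsert a freshly scraped page so a resumed task can skip Steel entirely.
async function putCachedPage(sql: Sql, taskId: string, url: string, content: string) {
  await sql(
    `INSERT INTO browse_cache (task_id, url, content, fetched_at)
     VALUES ($1, $2, $3, now())
     ON CONFLICT (task_id, url)
     DO UPDATE SET content = EXCLUDED.content, fetched_at = now()`,
    [taskId, url, content],
  );
}
```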
## Search Result Relevance Filtering
Search results are scored against both the research topic and the specific query using keyword matching with basic stemming. Results must match at least 2 topic keywords to pass. This prevents browsing irrelevant pages (dictionaries, product sites, unrelated content). The agent is also instructed to browse known URLs directly when search results are poor.
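A sketch of the scoring gate (the real filter lives in `src/steel-client.ts`; the suffix-trimming stemmer and the exact scoring formula below are assumptions):

```ts
// Crude stemmer: trim common English suffixes so "codes" matches "code".
const stem = (w: string) => w.toLowerCase().replace(/(ing|ed|es|s)$/, "");

const keywords = (text: string) =>
  new Set(text.split(/\W+/).filter((w) => w.length > 3).map(stem));

function scoreResult(
  result: { title: string; snippet: string },
  topic: string,
  query: string,
) {
  const hay = keywords(`${result.title} ${result.snippet}`);
  const topicHits = [...keywords(topic)].filter((k) => hay.has(k)).length;
  const queryHits = [...keywords(query)].filter((k) => hay.has(k)).length;
  return {
    score: topicHits + queryHits,
    passes: topicHits >= 2, // must match at least 2 topic keywords to be browsed
  };
}
```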
## Note Deduplication & Ranking
Notes are automatically deduplicated using trigram Jaccard similarity (threshold 0.6). Near-duplicate notes are merged: longer content is kept, source URLs are unioned, and higher confidence is preserved. The evaluate tool displays notes in quality-ranked order (confidence → source count → content length).
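A sketch of the similarity check and the merge rule (illustrative; the actual ranking and merging live in `src/notes-ranker.ts`):

```ts
type Note = { content: string; sources: string[]; confidence: number };

function trigrams(text: string): Set<string> {
  const s = text.toLowerCase().replace(/\s+/g, " ");
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= s.length; i++) grams.add(s.slice(i, i + 3));
  return grams;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const g of a) if (b.has(g)) inter++;
  return inter / (a.size + b.size - inter || 1);
}

// If the new note is a near-duplicate (similarity ≥ 0.6), merge it: keep the
// longer content, union the source URLs, keep the higher confidence.
function mergeIfDuplicate(existing: Note, incoming: Note): Note | null {
  if (jaccard(trigrams(existing.content), trigrams(incoming.content)) < 0.6) {
    return null; // not a duplicate; record as a new note
  }
  return {
    content:
      incoming.content.length > existing.content.length
        ? incoming.content
        : existing.content,
    sources: [...new Set([...existing.sources, ...incoming.sources])],
    confidence: Math.max(existing.confidence, incoming.confidence),
  };
}
```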
## Models

All LLM calls use glm-5.1 by default, configurable via environment variables:

| Role | Env Variable | Default | Description |
|---|---|---|---|
| Agent loop | `AGENT_MODEL` | `zai:glm-5.1` | Main research agent + follow-up |
| Agent reasoning | `AGENT_REASONING` | `high` | Thinking effort: minimal, low, medium, high, xhigh |
| Utility calls | `UTILITY_MODEL` | `zai:glm-5.1` | Summarization, planning, matching, clarification |
| Utility reasoning | `UTILITY_REASONING` | (off) | Reasoning effort for utility calls |
The --model CLI flag overrides AGENT_MODEL. Token usage is tracked per-model and printed at the end of each run.
## Development

```bash
bun test             # Run tests
bun run typecheck    # TypeScript check
bun run db:up        # Start Postgres
bun run db:init      # Initialize Absurd schema (idempotent)
./setup.sh           # Full setup: deps, Postgres, schema, eval
```
## Evaluation
Benchmark against ResearchRubrics (101 tasks, 2,593 criteria) and DRACO (100 tasks, 3,934 criteria) using LLM-as-judge (Gemini or Claude).
```bash
cd eval
uv sync --dev                # Install eval dependencies
uv run bench download all    # Download datasets from HuggingFace

uv run bench run researchrubrics --limit 10 --depth quick --project-root ..

# Judge with Gemini (default) or Claude
uv run bench judge researchrubrics --model gemini-2.5-pro
uv run bench judge researchrubrics --model claude-haiku-4-5-20251001

uv run bench score researchrubrics     # Compute scores
uv run bench report researchrubrics    # Generate summary report
```
Set JUDGE_MODEL in .env to change the default judge. Each stage is resumable — re-running skips completed work. See eval/README.md for full details.
## License
MIT