Runtime safety net for LLM agents. Does nothing when things work. Saves your budget and tells you exactly why when they don't.

    from state_harness import GrowthRatioGuard, FailureReport

    guard = GrowthRatioGuard(token_budget=50_000)

    with guard:
        for turn in agent_loop:
            result = llm.invoke(turn.prompt)
            guard.record_step(tokens_used=result.usage.total_tokens)

    # What went wrong? (zero-cost, no LLM calls)
    report = FailureReport.from_guard(guard)
    print(report)

⚠️ STABILITY TRIPPED at turn 12
Pattern: Context Accumulation Spiral (confidence: 92%)
• Last 5 turns all exceeded 1.5× baseline (4/4 were accelerating).
• Peak growth ratio: 5.2× baseline.
• Without intervention, projected cost was $0.0396 (actual: $0.0039).
Energy: ▁▁▁▁▁▂▂▃▄▆█
Baseline: 1050 tokens/turn
Peak ratio: 5.2× baseline
Cost: $0.0039 (saved ~$0.0357 by tripping early)
Suggested actions:
🔴 1. Enable RG history compression in your agent loop.
→ Compressing older messages reduces prompt tokens by 40-60%.
🟡 2. Lower the growth ratio threshold to 1.8×.
→ A lower threshold would have caught it earlier.
🟢 3. Add a sliding-window context strategy.
→ Send only the last N messages plus a summary of earlier ones.
Why this exists
Every team running LLM agents in production has experienced this: an agent gets stuck in a loop, token usage spirals, and you find a $15 charge for a single failed request the next morning. You kill the process — but you have no idea why it happened or how to prevent it next time.
A hard budget cap solves the cost problem — but tells you nothing. You know the task was killed. You don't know if it was a context accumulation spiral, a retry storm, or policy drift. You can't fix what you can't diagnose.
State-harness is a library, not a platform. pip install and go. It uses Lyapunov stability theory to detect runaway behavior before it becomes expensive — and when it trips, it classifies the failure pattern and tells you exactly what went wrong, how to fix it, and how much you saved. All at zero cost — no extra LLM calls, no external APIs.
What it catches
| Pattern | Signal | Example |
|---|---|---|
| Context Spiral | Token growth accelerating beyond baseline | Agent replaying full history each turn |
| Retry Storm | Low-variance repeated calls | Tool failing, agent retrying identically |
| Policy Drift | VSA similarity score dropping | Agent going off-topic mid-conversation |
| Early Explosion | Token spike in first 3 turns | Oversized system prompt or tool response |
| Budget Exhaustion | Cumulative spend hits ceiling | Complex task, not necessarily broken |
What you get — and what you don't
| | |
|---|---|
| ✅ Know WHY your agent failed | Pattern classification + evidence + fix suggestions, at zero LLM cost |
| ✅ Save compute on failing tasks | 38.6% fewer search nodes on SWE-bench |
| ✅ Never interfere with healthy agents | Zero false positives across 1,886 short/medium-loop runs |
| ✅ Validated across 3,175 runs | 4 benchmarks, 5-condition ablation, multi-trial with bootstrap CIs |
| ✅ Model-agnostic | Zero false positives confirmed across 7 models: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B (Ollama) |
| ❌ Does NOT make your agent smarter | Resolve rates are statistically identical with or without monitoring |
| ❌ Does NOT replace a budget cap | A naive cap achieves comparable success rates — but tells you nothing |
The value is diagnostics. A budget cap tells you "task killed." State-harness tells you "task killed because of a context accumulation spiral — enable history compression to fix it." That difference is why this exists.
Who should use this
- Teams running search-tree agents (MCTS, beam search) — the architecture behind SWE-bench solvers and tools like Devin. Branches, not loops, drive cost. A per-branch iteration cap looks fine in isolation; the tree-level cost explosion happens silently.
- Platform teams running 1,000+ agent tasks/day — manual trace inspection doesn't scale. State-harness classifies failure patterns at the edge (zero cost, no LLM calls) and exports them as OpenTelemetry attributes for aggregate analysis.
- Researchers benchmarking agents — the nondeterminism floor (~4–5% stdev on Gemini 2.5 Flash) means single-run comparisons with <8% delta are noise. State-harness quantifies this.
Who should NOT use this
- Chatbots, RAG pipelines, or single-turn apps — these don't spiral. You don't need monitoring.
- Simple ReAct loops with <10 turns: `max_iterations=10` and a budget cap are sufficient. Frameworks such as LangGraph and CrewAI support this natively.
Installation
pip install state-harness
Requires Python ≥ 3.10. Pre-built wheels are available for Linux, macOS, and Windows (x86_64 and ARM64). No Rust toolchain needed.
From source (for development)

    git clone https://github.com/vishal-dehurdle/state-harness.git
    cd state-harness
    python -m venv .venv && source .venv/bin/activate

    # Install Rust (if not already installed)
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

    pip install maturin
    maturin develop --release

    # Run tests
    pip install pytest
    pytest tests/

Quickstart
Basic: GrowthRatioGuard (recommended)
The GrowthRatioGuard normalizes token usage against a baseline, so it only trips on disproportionate growth — not the natural growth of multi-turn context windows.

    from state_harness import GrowthRatioGuard, StabilityViolation

    guard = GrowthRatioGuard(
        token_budget=100_000,   # hard ceiling
        ratio_threshold=2.0,    # trip when turn is 2× the baseline
        window=3,               # 3 consecutive escalating turns to trip
        budget_gate=8_000,      # don't trip until 8K tokens spent
    )

    with guard:
        for turn in agent_loop:
            try:
                result = llm.invoke(turn.prompt)
                guard.record_step(
                    tokens_used=result.usage.total_tokens,
                    errors=0,
                )
            except StabilityViolation as e:
                print(f"Agent killed: {e}")
                break

    print(f"Total cost: {guard.total_tokens} tokens")
    print(f"Baseline: {guard.baseline} tokens/turn")
    print(f"Peak ratio: {guard.current_ratio}×")

Failure Diagnostics
After any execution (tripped or not), get a structured failure report:

    import json

    from state_harness import FailureReport

    report = FailureReport.from_guard(guard, model="gemini-2.5-flash")

    # Human-readable terminal output
    print(report)

    # Structured dict for logging / dashboards
    print(json.dumps(report.to_dict(), indent=2))

The report classifies the failure pattern, provides evidence, estimates cost impact, and suggests specific fixes — all without any LLM calls.
Classic: BoundaryGuard
For lower-level control using raw token counts (no normalization):

    from state_harness import BoundaryGuard

    with BoundaryGuard(token_budget=100_000, lambda_=1.0, window=5) as guard:
        for turn in agent_loop:
            result = llm.invoke(turn.prompt)
            guard.record_step(
                tokens_used=result.usage.total_tokens,
                errors=0,
                tool_name="search",
            )

Decorator: @boundary_guard

    from state_harness import boundary_guard

    @boundary_guard(
        token_budget=50_000,
        token_counter=lambda r: r.usage.total_tokens,
    )
    def agent_step(prompt: str):
        return llm.invoke(prompt)

Framework Integration
LangGraph (recommended)

    from langgraph.prebuilt import create_react_agent
    from state_harness.adapters import monitor_graph

    agent = create_react_agent(model, tools=[search, calculate])
    safe = monitor_graph(agent, token_budget=100_000)

    result = safe.invoke({"messages": [("user", "Fix the login bug")]})

    # After execution (always available):
    print(safe.total_tokens)   # cumulative usage
    print(safe.tripped)        # did stability trip?
    print(safe.report)         # full FailureReport with pattern + suggestions

For streaming:

    for chunk in safe.stream({"messages": [("user", "Refactor this module")]}):
        print(chunk)

With a trip callback (e.g., for Slack alerts):

    safe = monitor_graph(
        agent,
        token_budget=100_000,
        on_trip=lambda report: slack.send(f"Agent tripped: {report.pattern}"),
    )

Advanced: per-tool wrapping with LangGraphMiddleware

    from state_harness import BoundaryGuard
    from state_harness.adapters import LangGraphMiddleware

    guard = BoundaryGuard(token_budget=150_000)
    middleware = LangGraphMiddleware(guard)

    @middleware.wrap_tool
    def search_database(query: str):
        return db.search(query)

    with guard:
        result = agent.invoke({"messages": [...]})

CrewAI

    from crewai import Agent, Task, Crew
    from state_harness.adapters import CrewAICallback

    callback = CrewAICallback(token_budget=200_000)

    crew = Crew(
        agents=[researcher, writer],
        tasks=[research_task, write_task],
        step_callback=callback.step_callback,
        task_callback=callback.task_callback,
    )

    result = crew.kickoff()
    print(callback.report)  # FailureReport
    callback.close()

Vanilla Python Hooks

    from state_harness import BoundaryGuard
    from state_harness.adapters import VanillaHook

    guard = BoundaryGuard(token_budget=50_000)
    hook = VanillaHook(guard)

    with guard:
        for step in agent_loop:
            hook.before_call(tool_name="search")
            result = execute_tool(step)
            hook.after_call(tokens_used=result.tokens)

CLI

    # Simulate a token trajectory: see what the guard would do
    state-harness simulate 1000 1200 1500 2000 3000 5000 8000 --budget 50000

    # Analyze a saved report
    state-harness analyze report.json
    state-harness analyze report.json --json   # JSON output
    state-harness analyze report.json --otel   # OpenTelemetry attributes

    # Batch analyze all reports in a directory
    state-harness batch --dir ./reports/ --output results.csv

Structured Output
Every FailureReport supports multiple output formats:

    report = FailureReport.from_guard(guard)

    # JSON (for logging, APIs, storage)
    report.to_json()              # pretty-printed
    report.to_json(indent=None)   # compact, single line

    # CSV (for batch analysis of 1000s of runs)
    with open("results.csv", "w") as f:
        f.write(FailureReport.csv_header() + "\n")
        for r in reports:
            f.write(r.to_csv_row() + "\n")

    # OpenTelemetry (for Datadog, Grafana, Honeycomb)
    from opentelemetry import trace

    span = trace.get_current_span()
    span.set_attributes(report.to_otel_attributes())
    # Adds: state_harness.pattern, state_harness.confidence, etc.

Architecture
State-harness combines three physics-inspired mechanisms, implemented in Rust for microsecond-speed enforcement:
graph TD
A["Agent Loop"] --> B["GrowthRatioGuard\n(Python SDK)"]
B --> |"Normalizes tokens → growth ratio\nWarmup baseline · Budget gate"| C{" "}
C --> D["Lyapunov Monitor\nV(k) = S + λθ\nΔV ≥ 0?"]
C --> E["RG Decimator\nTF-IDF\nCompression"]
C --> F["Holographic Engine\n(VSA)\nDrift Detection"]
style D fill:#1a1a1a,stroke:#555,color:#e8e8e8
style E fill:#1a1a1a,stroke:#555,color:#e8e8e8
style F fill:#1a1a1a,stroke:#555,color:#e8e8e8
style B fill:#0d1117,stroke:#30363d,color:#e6edf3
The Rust core is exposed to Python through PyO3 bindings.
| Component | Purpose | Speed |
|---|---|---|
| Lyapunov Monitor | Tracks energy derivative ΔV(k). Trips when ΔV ≥ 0 for W consecutive steps. | ~1μs/step |
| RG Decimator | Compresses conversation history via RG-inspired decimation (TF-IDF scoring). Retains structurally important messages. | ~100μs/compress |
| Holographic Engine | VSA-based policy drift detection. Binds domain invariants to high-dimensional vectors. | ~10μs/check |
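
The Lyapunov Monitor's trip rule from the table can be sketched in pure Python. This is an illustration only, not the library's Rust implementation: the class name `ToyLyapunovMonitor` is hypothetical, and the sketch assumes S(k) is the tokens used at step k and θ(k) is the error count at step k.

```python
class ToyLyapunovMonitor:
    """Illustrative only: trips when dV >= 0 for `window` consecutive steps."""

    def __init__(self, lambda_: float = 1.0, window: int = 3):
        self.lambda_ = lambda_
        self.window = window
        self.prev_energy = None   # V(k-1); None until the first step
        self.streak = 0           # consecutive steps with dV >= 0

    def record_step(self, tokens_used: float, errors: int = 0) -> bool:
        """Return True if the monitor trips on this step."""
        # Assumed energy function: V(k) = S(k) + lambda * theta(k)
        energy = tokens_used + self.lambda_ * errors
        if self.prev_energy is not None:
            delta = energy - self.prev_energy   # dV(k) = V(k) - V(k-1)
            self.streak = self.streak + 1 if delta >= 0 else 0
        self.prev_energy = energy
        return self.streak >= self.window


monitor = ToyLyapunovMonitor(lambda_=1.0, window=3)
trajectory = [1000, 900, 1200, 1500, 2100]   # made-up tokens per step
trips = [monitor.record_step(t) for t in trajectory]
print(trips)
```

With this made-up trajectory the energy falls once and then rises three times in a row, so only the last step trips. Any decrease resets the streak, which is why a single spike does not trip the monitor.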
Benchmarks
Evaluated across four complementary benchmarks with a 5-condition ablation study (3,175 total runs) isolating each mechanism's contribution. Full methodology and data in the research paper.
Ablation Conditions
| Condition | Lyapunov | RG Decimation | VSA Dual-Gate | Description |
|---|---|---|---|---|
| A. Baseline | — | — | — | Unmonitored agent |
| B. Lyapunov-only | ✅ | — | — | Energy monitoring, no intervention |
| C. Lyapunov+RG | ✅ | ✅ | — | + history compression on violation |
| D. Full-stack | ✅ | ✅ | ✅ | + policy drift gating |
| E. Naive Cap | — | — | — | Hard budget cap (control) |
Summary: Non-invasive monitoring with zero-cost diagnostics
| Benchmark | Runs | Stability Trips | Cost Savings (D vs A) | Resolve-Rate Δ | Diagnostics |
|---|---|---|---|---|---|
| MINT (reasoning + coding) | 1,136 | 0 | ~0% | −0.7pp (noise) | N/A (no trips) |
| τ³-bench (customer service) | 750 | 0 | 8.1% | within ±12pp nondeterminism | N/A (no trips) |
| SWE-bench Verified (coding) | 333 + 148 | ~38% | 38.6% (nodes) | −3.6pp (within ±4–5% noise) | ✅ Pattern classification |
| Custom Local (4 models) | 240 | 3 (true pos.) | 15.2% | 0pp | ✅ Pattern classification |
| MINT Local (Qwen3:4B) | 568 | 0 | ~0% | +1.8pp | N/A (no trips) |
What the harness does — and doesn't do:
- ✅ Never interferes with healthy agents — zero stability trips across 1,886 short/medium-loop runs (MINT + τ³)
- ✅ Saves compute on spiraling tasks — 38.6% fewer search nodes, 30% faster wall time on SWE-bench
- ✅ Tells you why tasks failed — zero-cost failure diagnostics (context spiral, retry storm, policy drift) with actionable fixes
- ⚠️ Does not improve resolve rate: multi-trial SWE-bench (333 runs) shows harness 40.5% ± 2.7% vs naive cap 45.9% ± 5.4% vs baseline 44.1% ± 4.1%, all within noise
A naive budget cap achieves comparable task success rates. The harness's unique value is diagnostics (understanding why failures happen) and compute efficiency (33% fewer nodes than naive cap).
SWE-bench Verified (central result)
37 Django instances from SWE-bench Verified. Agent: moatless-tools SearchTree with 50-node budget. Model: Gemini 2.5 Flash.
Single-trial ablation (148 runs)
| Condition | Resolved | Rate | Total Nodes | Wall Time | Nodes/Resolve |
|---|---|---|---|---|---|
| A. Baseline | 15 / 37 | 40.5% | 945 | 80 min | 63.0 |
| B. Lyapunov | 16 / 37 | 43.2% | 620 | 69 min | 38.8 |
| D. Full-stack | 14 / 37 | 37.8% | 580 | 56 min | 41.4 |
| E. Naive Cap | 21 / 37 | 56.8% | 876 | 77 min | 41.7 |
Note: Single-trial resolve rates have ~±8pp standard error. E's apparent 56.8% is not statistically significant vs A's 40.5%. Multi-trial results below confirm this.
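
The ~±8pp figure follows from the binomial standard error of a proportion with n = 37. The sketch below reproduces it from the table's resolved counts and applies a simple unpooled z-test to the A vs E gap. The z-test is an assumption made for illustration: it treats the two conditions as independent samples, and it is not the analysis used in the paper.

```python
from math import sqrt

def proportion_se(successes: int, n: int) -> float:
    """Binomial standard error of a proportion: sqrt(p * (1 - p) / n)."""
    p = successes / n
    return sqrt(p * (1 - p) / n)

n = 37
se_a = proportion_se(15, n)   # A. Baseline: 15 / 37 resolved
se_e = proportion_se(21, n)   # E. Naive Cap: 21 / 37 resolved

diff = 21 / n - 15 / n               # observed gap in resolve rate
se_diff = sqrt(se_a**2 + se_e**2)    # unpooled SE of the difference
z = diff / se_diff

print(f"SE(A) = {se_a * 100:.1f}pp, SE(E) = {se_e * 100:.1f}pp")
print(f"gap = {diff * 100:.1f}pp, z = {z:.2f}")
```

Both standard errors come out near 8pp, and the z-score of about 1.4 is below the 1.96 needed for significance at the 5% level, consistent with the note above.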
What the harness provides:
- Compute-efficient: 38.6% fewer search tree nodes than baseline, 33% fewer than naive cap
- Faster: 30% wall-time reduction (80 → 56 min)
- Eliminates burnout: Baseline had 7 tasks burning the full 50-node budget (all failed). With monitoring: zero
- Diagnostics: Every tripped task gets a classified failure pattern with actionable fix suggestions — at zero LLM cost
- Simple integration: Lyapunov monitoring alone (Condition B) delivers ~90% of total benefit — 5 lines of code
Ablation — each mechanism contributes independently:
| Layer Added | Compute (nodes) | Δ vs Baseline | Cumulative Reduction |
|---|---|---|---|
| A. No monitoring | 945 | — | — |
| B. + Lyapunov | 620 | −325 | 34.4% |
| D. + RG + VSA | 580 | −40 | 38.6% |
Lyapunov monitoring alone delivers ~90% of the total benefit. RG decimation and VSA add incremental value.
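
The percentages in the ablation table, and the ~90% share, can be checked directly from the node counts. The snippet uses only the numbers reported above; the helper name `reduction_pct` is ours.

```python
nodes = {"A": 945, "B": 620, "D": 580}   # total search nodes per condition

def reduction_pct(baseline: int, value: int) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return 100 * (baseline - value) / baseline

lyapunov_only = reduction_pct(nodes["A"], nodes["B"])
full_stack = reduction_pct(nodes["A"], nodes["D"])
# Share of the full-stack saving that Lyapunov monitoring alone achieves.
lyapunov_share = 100 * (nodes["A"] - nodes["B"]) / (nodes["A"] - nodes["D"])

print(f"B vs A: {lyapunov_only:.1f}% fewer nodes")
print(f"D vs A: {full_stack:.1f}% fewer nodes")
print(f"Lyapunov share of total saving: {lyapunov_share:.0f}%")
```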
Multi-trial validation (333 runs)
To quantify nondeterminism and validate the single-trial findings, we ran 3 independent trials per condition (A, D, E) across all 37 instances — 333 total runs (12 runs resulted in stuck Docker containers killed after 28+ min; counted as failures):
| Condition | Trial 1 | Trial 2 | Trial 3 | Mean ± σ |
|---|---|---|---|---|
| A. Baseline | 18/37 (48.6%) | 16/37 (43.2%) | 15/37 (40.5%) | 44.1% ± 4.1% |
| D. Full-stack | 15/37 (40.5%) | 16/37 (43.2%) | 14/37 (37.8%) | 40.5% ± 2.7% |
| E. Naive Cap | 19/37 (51.4%) | 15/37 (40.5%) | 17/37 (45.9%) | 45.9% ± 5.4% |
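
The Mean ± σ column can be recomputed from the per-trial resolved counts in the table above. The sketch assumes σ is the sample standard deviation (n − 1 in the denominator), which matches the reported values.

```python
from statistics import mean, stdev

INSTANCES = 37
resolved = {                      # resolved counts per trial, from the table
    "A. Baseline":   [18, 16, 15],
    "D. Full-stack": [15, 16, 14],
    "E. Naive Cap":  [19, 15, 17],
}

summary = {}
for condition, counts in resolved.items():
    rates = [100 * c / INSTANCES for c in counts]
    summary[condition] = (mean(rates), stdev(rates))   # sample stdev (n - 1)
    avg, sd = summary[condition]
    print(f"{condition}: {avg:.1f}% ± {sd:.1f}%")
```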
Key finding: Cross-condition variance (2.9%) ≤ within-condition nondeterminism (4.1%). The differences between conditions are entirely within the noise band of LLM nondeterminism — confirming non-invasiveness with statistical rigor.
Note on nondeterminism: The ~4% within-condition stdev converges with τ³-bench findings (±4.6%), establishing a ~4–5% nondeterminism floor as a fundamental property of Gemini 2.5 Flash on code tasks. Any single-run benchmark comparison is unreliable for deltas < 8%.
Statistical validation: Bootstrap confidence intervals (10,000 resamples) and Welch's t-tests confirm no significant pairwise differences: A−D = +3.6pp [−0.9, +8.1], p ≈ 0.17; A−E = −1.8pp [−8.1, +4.5], p ≈ 0.68; D−E = −5.4pp [−10.8, 0.0], p ≈ 0.09. Full analysis in the research paper §7.3.1.
τ³-bench Airline (non-invasiveness confirmation)
50 tasks × 3 trials × 5 conditions = 750 total runs. Agent handles airline reservations via tool calls. Model: Gemini 2.5 Flash. Concurrency=1.
| Condition | Trial Pass | Rate | Task Pass (maj) | Rate | Cost | Cost Δ |
|---|---|---|---|---|---|---|
| A. Baseline | 99/150 | 66.0% | 35/50 | 70.0% | $2.47 | — |
| B. Lyapunov-only | 83/150 | 55.3% | 28/50 | 56.0% | $2.42 | −2.0% |
| C. Lyapunov+RG | 79/150 | 52.7% | 26/50 | 52.0% | $1.69 | −31.8% |
| D. Full-stack | 86/150 | 57.3% | 30/50 | 60.0% | $2.28 | −8.1% |
| E. Naive Cap | 81/150 | 54.0% | 26/50 | 52.0% | $2.33 | −5.7% |
Key findings:
- Zero stability trips across all 750 runs. The monitor correctly identifies all airline tasks as stable and never intervenes — confirming non-invasiveness on medium-loop customer-service agents.
- Pass-rate variance is LLM nondeterminism, not harness impact. The naive cap (E), which has zero monitoring, drops 18pp in task pass rate from baseline (70.0% → 52.0%), more than full-stack monitoring (D, −10pp). This indicates the ~10–18pp spread is intrinsic benchmark variance rather than a monitoring-caused regression.
- 25% of tasks flip pass/fail within the same condition across 3 trials — the airline domain's intrinsic nondeterminism floor (~±12pp).
- 8.1% cost savings from full-stack monitoring, with the harness observing passively (zero interventions).
MINT (non-invasiveness validation)
284 tasks × 4 conditions = 1,136 total runs across GSM8K (48), MATH (100), HumanEval (45), MBPP (91). Agent uses up to 5 turns per task.
| Condition | GSM8K | MATH | Total | Tokens |
|---|---|---|---|---|
| A. Baseline | 91.7% | 39.0% | 29.2% | 1,909,582 |
| B. Lyapunov | 91.7% | 41.0% | 29.9% | 1,904,421 |
| C. Lyapunov+RG | 89.6% | 37.0% | 28.2% | 1,910,926 |
| D. Full-stack | 87.5% | 39.0% | 28.5% | 1,949,708 |
Zero stability violations across all 1,136 runs. The monitor correctly identifies short-loop tasks as stable and never intervenes. Token usage is nearly invariant: the largest deviation from baseline is about 2.1% (condition D).
Failed tasks cost disproportionately more — validating the economic thesis:
| Task | Success Avg | Failure Avg | Ratio |
|---|---|---|---|
| GSM8K | 2,613 tok | 8,857 tok | 3.4× |
| MATH | 5,154 tok | 8,188 tok | 1.6× |
Note: HumanEval and MBPP show 0% across all conditions due to a MINT framework limitation in code execution evaluation — consistent across all conditions, confirming the harness does not introduce new failure modes.
Local Model Validation (edge deployment)
20 custom tasks (5 easy, 10 medium, 5 hard) × 4 models × 3 conditions = 240 runs. Hardware: Apple M4 MacBook Pro, 16 GB RAM, Ollama local inference.
| Model | Size | Baseline | Harness | Naive Cap | Token Savings | FP |
|---|---|---|---|---|---|---|
| Llama 3.2:3B | 2.0 GB | 45% | 45% | 60% | 1.2% | 0 |
| Phi-4-Mini | 2.5 GB | 30% | 30% | 40% | 20.7% | 0 |
| Qwen3:4B | 2.5 GB | 30% | 30% | 40% | 0.9% | 0 |
| Gemma4:E4B | 9.6 GB | 35% | 35% | 70% | 37.9% | 0 |
Key findings:
- Zero false positives across all 80 harness runs — 4 model families, 3 difficulty tiers. The growth-ratio metric generalizes across architectures without threshold retuning.
- Small-model self-sabotage: Naive cap outperforms baseline by +17.5pp on average (median +12.5pp). Small models solve early turns correctly but destroy solutions in later turns. Strongest on Gemma4:E4B (+35pp).
- Model-family behavioral signatures:
  - Llama 3.2:3B: Classic spirals (ratios: 2.3×, 5.9×, 7.6×), 3 true-positive trips
  - Phi-4-Mini: Spike-and-recover, 20.7% passive savings
  - Qwen3:4B: 255K tokens but flat ratios (≤1.06×), correctly classified as stable despite 3× volume
  - Gemma4:E4B: Decreasing ratios, 37.9% passive savings, zero trips
Practical takeaway: If you’re deploying ≤4B models via Ollama, state-harness works out of the box with zero false positives. The self-sabotage finding suggests you should also add a turn limit (2–3 turns) for open-ended code generation tasks.
MINT on Qwen3:4B (568 runs)
| Task | Harness (max=5) | Naive Cap (max=2) | Δ |
|---|---|---|---|
| GSM8K | 37.5% | 27.1% | +10.4pp |
| MATH | 0.0% | 0.0% | — |
| HumanEval | 11.1% | 11.1% | — |
| MBPP | 14.3% | 14.3% | — |
| Total | 12.7% | 10.9% | +1.8pp |
Zero harness interventions across all 284 tasks. On MINT’s short-loop tasks (max 5 turns), the harness monitoring window (W=3) cannot trigger within the available post-warmup turns — a structural guarantee, not a probabilistic observation.
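
The structural argument is simple arithmetic. The sketch below assumes the default `warmup_turns=3` from the Configuration Guide and assumes that only post-warmup turns count toward the trip window; `can_trip` is a hypothetical helper, not part of the library.

```python
def can_trip(max_turns: int, warmup_turns: int, window: int) -> bool:
    """True if enough monitored (post-warmup) turns exist to fill the trip window."""
    monitored_turns = max(0, max_turns - warmup_turns)
    return monitored_turns >= window

mint = can_trip(max_turns=5, warmup_turns=3, window=3)        # 2 monitored turns
long_loop = can_trip(max_turns=50, warmup_turns=3, window=3)  # 47 monitored turns
print(mint, long_loop)
```

For MINT, 5 − 3 leaves only 2 monitored turns, fewer than W=3, so a trip is impossible regardless of how the tokens grow.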
Reproducing the benchmarks
Full reproduction steps (all three benchmarks)

    # 1. Clone repos
    git clone https://github.com/vishal-dehurdle/state-harness.git
    git clone https://github.com/sierra-research/tau-bench.git tau3-bench

    # 2. Install state-harness
    cd state-harness
    python -m venv .venv && source .venv/bin/activate
    pip install maturin && maturin develop --release

    # 3. Install τ³-bench (with state-harness agent)
    cd ../tau3-bench
    uv sync
    cp ../state-harness/tau3_integration/harness_agent.py src/tau2/agent/
    cp ../state-harness/tau3_integration/naive_cap_agent.py src/tau2/agent/

    # 4. Configure Vertex AI
    export GOOGLE_CLOUD_PROJECT=your-project-id
    export VERTEXAI_LOCATION=asia-south1

    # 5. Run τ³ 5-phase benchmark
    bash benchmarks/tau3/run_5phase_airline.sh

    # 6. Run SWE-bench (requires Docker images)
    bash benchmarks/swe_bench/run_benchmark.sh
    bash benchmarks/swe_bench/run_benchmark_dbe.sh

    # 7. Run MINT
    bash benchmarks/mint/run_mint_fullstack.sh

Ablation conditions are controlled via environment variables:
| Variable | Values | Effect |
|---|---|---|
| `HARNESS_RG` | on / off | Enable/disable RG history compression |
| `HARNESS_VSA` | on / off | Enable/disable VSA policy drift detection |
| `HARNESS_RATIO_THRESHOLD` | float (e.g., 2.0) | Override growth ratio threshold |
| `HARNESS_BUDGET_GATE` | int (e.g., 8000) | Override minimum spend before trip |
See benchmarks/ for full setup, configs, and reproduction instructions for all three benchmarks.
Future evaluations
- [x] Multi-trial SWE-bench: 333 runs (3 trials × 3 conditions × 37 instances) confirming non-invasiveness within ±4% noise band
- [x] Local model validation: 240 runs across 4 open-weight models (Llama, Phi, Qwen, Gemma) + 568 MINT runs on Qwen3:4B
- [ ] Terminal-Bench: Terminal-based agent tasks; command-line tool loops where spirals manifest as repeated failed commands
- [ ] SWE-bench Pro: Harder, contamination-resistant variant of SWE-bench
- [x] Cross-model validation: 7 models total: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B
Known limitations
- 37 SWE-bench instances — A larger sample would improve statistical power (n=3 trials gives limited degrees of freedom for t-tests).
- No causal intervention — The harness currently kills spiraling tasks. Redirect/repair is on the roadmap.
- Physics-inspired, not physics-equivalent — Terms like "Renormalization Group" and "Lyapunov stability" are used as structural inspirations. The mathematical mapping is analogical, not isomorphic.
- Custom benchmark scale — The 20-task local battery is smaller than standard benchmarks. The self-sabotage finding (mean +17.5pp, median +12.5pp) is consistent across 4 models but requires larger-scale replication.
Configuration Guide
| Parameter | Default | Description |
|---|---|---|
| `token_budget` | 100,000 | Hard ceiling on cumulative tokens |
| `ratio_threshold` | 2.0 | Growth ratio above which a turn counts as "escalating" (domain-tuned: airline=2.0, retail=2.5, telecom=2.0) |
| `window` | 3 | Consecutive escalating turns before circuit breaker trips |
| `warmup_turns` | 3 | Turns used to establish baseline (no monitoring during warmup) |
| `budget_gate` | 8,000 | Minimum cumulative tokens before the monitor can trip (retail: 12,000) |
| `lambda_` | 1.0 | Error weighting in the Lyapunov energy function |
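
To make the interaction between these parameters concrete, here is a minimal pure-Python sketch. It is a toy re-implementation for illustration, not the library's code: the class `ToyGrowthRatioGuard` and its return strings are hypothetical, and it assumes the baseline is the mean of the warmup turns, which the table does not specify.

```python
from statistics import mean

class ToyGrowthRatioGuard:
    """Illustrative model of the documented parameters (not the real class)."""

    def __init__(self, token_budget=100_000, ratio_threshold=2.0,
                 window=3, warmup_turns=3, budget_gate=8_000):
        self.token_budget = token_budget
        self.ratio_threshold = ratio_threshold
        self.window = window
        self.warmup_turns = warmup_turns
        self.budget_gate = budget_gate
        self.history = []      # tokens per turn
        self.escalating = 0    # consecutive turns above the threshold

    def record_step(self, tokens_used: int) -> str:
        """Return 'ok', 'tripped', or 'budget_exceeded' for this turn."""
        self.history.append(tokens_used)
        total = sum(self.history)
        if total >= self.token_budget:
            return "budget_exceeded"   # hard ceiling, independent of ratios
        if len(self.history) <= self.warmup_turns:
            return "ok"                # still establishing the baseline
        # Assumption: baseline is the mean of the warmup turns.
        baseline = mean(self.history[: self.warmup_turns]) or 1.0
        ratio = tokens_used / baseline
        self.escalating = self.escalating + 1 if ratio > self.ratio_threshold else 0
        # Trip only after `window` escalating turns AND enough total spend.
        if self.escalating >= self.window and total >= self.budget_gate:
            return "tripped"
        return "ok"


guard = ToyGrowthRatioGuard(token_budget=50_000)
spiral = [1000, 1050, 1100, 1300, 2300, 3100, 4400, 6000]   # made-up trajectory
statuses = [guard.record_step(t) for t in spiral]
print(statuses)
```

In this made-up trajectory the baseline is 1,050 tokens, turns 5 to 7 all exceed 2× that baseline, and cumulative spend has passed the 8,000-token gate by turn 7, so the toy guard first trips on turn 7. A flat trajectory of any volume never trips because its ratio stays near 1.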
Environment variable overrides (highest precedence, for threshold sweeps):
| Env Var | Description |
|---|---|
| `HARNESS_RATIO_THRESHOLD` | Override `ratio_threshold` (e.g., 2.5) |
| `HARNESS_BUDGET_GATE` | Override `budget_gate` (e.g., 12000) |
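
One way to picture "highest precedence" is a resolver that lets the environment win over explicitly passed arguments. The function below is a hypothetical sketch of that rule, not the library's implementation; only the two variable names come from the table above.

```python
import os

def resolve_thresholds(ratio_threshold=2.0, budget_gate=8_000, env=os.environ):
    """Apply environment overrides on top of constructor-style arguments."""
    if "HARNESS_RATIO_THRESHOLD" in env:
        ratio_threshold = float(env["HARNESS_RATIO_THRESHOLD"])
    if "HARNESS_BUDGET_GATE" in env:
        budget_gate = int(env["HARNESS_BUDGET_GATE"])
    return ratio_threshold, budget_gate

# An explicit argument loses to the environment variable.
overridden = resolve_thresholds(ratio_threshold=1.8,
                                env={"HARNESS_RATIO_THRESHOLD": "2.5"})
# With no overrides set, the passed (or default) values are used.
defaults = resolve_thresholds(env={})
print(overridden, defaults)
```

Passing `env` as a parameter keeps the sketch testable without touching the real process environment, which is convenient for threshold sweeps.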
Tuning tips:
- More aggressive (catch spirals earlier): `ratio_threshold=1.8, window=2`
- More conservative (fewer false positives): `ratio_threshold=2.5, window=3`
- High-value tasks: Increase `budget_gate` to 20K+ to let expensive tasks run longer
- Complex domains (retail, multi-tool): Start with `ratio_threshold=2.5`
Theoretical Foundations
State-harness applies control theory to LLM agent execution:
- Lyapunov stability: The energy function V(k) = S(k) + λθ(k) models token consumption as a dynamical system, where λ (`lambda_`) weights the error term θ(k). When ΔV ≥ 0 for W consecutive steps, the monitor treats the run as unstable and trips.
- Renormalization Group (RG) theory: Message compression is modeled as coarse-graining — eliminating high-frequency noise while preserving scale-invariant task objectives.
- Vector Symbolic Architecture (VSA): Domain policies are bound to high-dimensional bipolar vectors (10,000-d, i8 space), enabling constant-time semantic drift detection outside the LLM context window.
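
The VSA idea can be illustrated with plain Python lists standing in for the 10,000-dimensional bipolar vectors. This is a sketch under stated assumptions, not the Holographic Engine's implementation: binding by elementwise multiplication and bundling by majority vote are common VSA conventions assumed here, and the role and value names are hypothetical.

```python
import random

DIM = 10_000
rng = random.Random(0)   # fixed seed so the sketch is repeatable

def random_bipolar():
    """A random vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Bind two bipolar vectors by elementwise multiplication."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors):
    """Superpose vectors by elementwise majority vote (ties go to +1)."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]

def similarity(a, b):
    """Normalized dot product: 1.0 for identical vectors, near 0 for unrelated ones."""
    return sum(x * y for x, y in zip(a, b)) / DIM

# Hypothetical policy: role/value pairs bound together, then bundled.
roles = {name: random_bipolar() for name in ("domain", "tool", "tone")}
values = {name: random_bipolar()
          for name in ("airline", "booking_api", "formal", "gaming")}

policy = bundle([
    bind(roles["domain"], values["airline"]),
    bind(roles["tool"], values["booking_api"]),
    bind(roles["tone"], values["formal"]),
])

on_topic = bind(roles["domain"], values["airline"])
off_topic = bind(roles["domain"], values["gaming"])

print(f"on-topic similarity:  {similarity(policy, on_topic):.2f}")
print(f"off-topic similarity: {similarity(policy, off_topic):.2f}")
```

A component that was bundled into the policy scores roughly 0.5, while an unrelated binding scores close to 0. A falling similarity score is the kind of signal the "Policy Drift" pattern relies on, and each check is a single fixed-size dot product regardless of conversation length.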
Research
This library implements the framework described in:
Empirical Lyapunov Stability: Growth-Ratio Energy Functions as Leading Indicators of Agent Task Failure Vishal Verma, 2026 Read the full paper →
Key findings from the paper (updated with multi-trial and local model validation):
- Non-invasiveness confirmed across 333 SWE-bench runs — resolve rate delta (−3.6pp) falls within the ±4.1% nondeterminism band
- Zero stability violations across 1,886 short/medium-loop cloud runs (MINT + τ³) — the monitor never interferes with healthy agents
- Zero false positives across 80 local-model harness runs — spanning Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B on Apple M4 via Ollama
- Zero-cost failure diagnostics — every tripped task is classified (context spiral, retry storm, policy drift) with actionable fix suggestions, requiring no additional LLM calls
- Lyapunov monitoring alone delivers ~90% of the total benefit: the simplest integration (5 lines of `GrowthRatioGuard` code) captures the majority of the value
- On long-loop agents (SWE-bench), full-stack monitoring reduces compute by 38.6% and wall time by 30%
- Failed tasks cost 1.6–3.4× more than successful ones — economic justification for early termination
- Eliminates all max-budget burnout events (7 → 0 tasks hitting the 50-node ceiling on SWE-bench)
- ~4–5% nondeterminism floor established across both τ³-bench and SWE-bench — any single-run comparison is unreliable for deltas < 8%
- Small-model self-sabotage: Naive turn-limiting outperforms unconstrained baselines by +17.5pp on average on ≤4B models — runtime governance is capability-preserving, not just cost-saving
Based on the theoretical framework from:
The Fluid Dynamics of Multi-Agent AI: Resolving d'Alembert's Paradox of Generative Workflows Vishal Verma, 2026 Read →
Contributing
Contributions are welcome. See CONTRIBUTING.md for dev environment setup, code style, and PR guidelines.
Roadmap
- [ ] Adaptive threshold: Auto-tune τ based on task complexity signal from early turns
- [ ] Causal intervention: Instead of killing spiraling tasks, redirect them (prompt injection, tool restriction)
- [ ] Streaming support: Token-level monitoring for streaming LLM responses
- [x] Multi-model validation: 7 models validated: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + 4 local models via Ollama
- [ ] Dashboard / observability: Optional lightweight UI for monitoring energy trajectories in real-time
Security
For security vulnerabilities, see SECURITY.md. Please do not open public issues for security reports.
License
Split-core licensing:
| Component | License | Notes |
|---|---|---|
| Rust Core (`src/`) | BSL 1.1 | Free for non-commercial + ARR < $1M. Converts to Apache 2.0 on May 26, 2030. |
| Python SDK (`python/`) | Apache 2.0 | Fully permissive. |
See LICENSE.md for full details.