state-harness

Runtime safety net for LLM agents. Does nothing when things work. Saves your budget and tells you exactly why when they don't.

from state_harness import GrowthRatioGuard, FailureReport

guard = GrowthRatioGuard(token_budget=50_000)

with guard:
    for turn in agent_loop:
        result = llm.invoke(turn.prompt)
        guard.record_step(tokens_used=result.usage.total_tokens)

# What went wrong? (zero-cost, no LLM calls)
report = FailureReport.from_guard(guard)
print(report)

⚠️  STABILITY TRIPPED at turn 12

Pattern: Context Accumulation Spiral (confidence: 92%)
  • Last 5 turns all exceeded 1.5× baseline (4/4 were accelerating).
  • Peak growth ratio: 5.2× baseline.
  • Without intervention, projected cost was $0.0396 (actual: $0.0039).

Energy: ▁▁▁▁▁▂▂▃▄▆█
  Baseline: 1050 tokens/turn
  Peak ratio: 5.2× baseline

Cost: $0.0039 (saved ~$0.0357 by tripping early)

Suggested actions:
  🔴 1. Enable RG history compression in your agent loop.
     → Compressing older messages reduces prompt tokens by 40-60%.
  🟡 2. Lower the growth ratio threshold to 1.8×.
     → A lower threshold would have caught it earlier.
  🟢 3. Add a sliding-window context strategy.
     → Send only the last N messages plus a summary of earlier ones.

Why this exists

Every team running LLM agents in production has experienced this: an agent gets stuck in a loop, token usage spirals, and you find a $15 charge for a single failed request the next morning. You kill the process — but you have no idea why it happened or how to prevent it next time.

A hard budget cap solves the cost problem — but tells you nothing. You know the task was killed. You don't know if it was a context accumulation spiral, a retry storm, or policy drift. You can't fix what you can't diagnose.

State-harness is a library, not a platform. pip install and go. It uses Lyapunov stability theory to detect runaway behavior before it becomes expensive — and when it trips, it classifies the failure pattern and tells you exactly what went wrong, how to fix it, and how much you saved. All at zero cost — no extra LLM calls, no external APIs.

What it catches

Pattern	Signal	Example
Context Spiral	Token growth accelerating beyond baseline	Agent replaying full history each turn
Retry Storm	Low-variance repeated calls	Tool failing, agent retrying identically
Policy Drift	VSA similarity score dropping	Agent going off-topic mid-conversation
Early Explosion	Token spike in first 3 turns	Oversized system prompt or tool response
Budget Exhaustion	Cumulative spend hits ceiling	Complex task, not necessarily broken

What you get — and what you don't


✅ Know WHY your agent failed	Pattern classification + evidence + fix suggestions — zero LLM cost
✅ Save compute on failing tasks	38.6% fewer search nodes on SWE-bench
✅ Never interfere with healthy agents	Zero false positives across 1,886 short/medium-loop runs
✅ Validated across 3,175 runs	4 benchmarks, 5-condition ablation, multi-trial with bootstrap CIs
✅ Model-agnostic	Zero false positives confirmed across 7 models: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B (Ollama)
❌ Does NOT make your agent smarter	Resolve rates are statistically identical with or without monitoring
❌ Does NOT replace a budget cap	A naive cap achieves comparable success rates — but tells you nothing

The value is diagnostics. A budget cap tells you "task killed." State-harness tells you "task killed because of a context accumulation spiral — enable history compression to fix it." That difference is why this exists.

Who should use this

Teams running search-tree agents (MCTS, beam search) — the architecture behind SWE-bench solvers and tools like Devin. Branches, not loops, drive cost. A per-branch iteration cap looks fine in isolation; the tree-level cost explosion happens silently.
Platform teams running 1,000+ agent tasks/day — manual trace inspection doesn't scale. State-harness classifies failure patterns at the edge (zero cost, no LLM calls) and exports them as OpenTelemetry attributes for aggregate analysis.
Researchers benchmarking agents — the nondeterminism floor (~4–5% stdev on Gemini 2.5 Flash) means single-run comparisons with <8% delta are noise. State-harness quantifies this.

Who should NOT use this

Chatbots, RAG pipelines, or single-turn apps — these don't spiral. You don't need monitoring.
Simple ReAct loops with <10 turns — max_iterations=10 and a budget cap are sufficient. Every modern framework (LangGraph, CrewAI) supports this natively.

Installation

pip install state-harness

Requires Python ≥ 3.10. Pre-built wheels are available for Linux, macOS, and Windows (x86_64 and ARM64). No Rust toolchain needed.

From source (for development)

git clone https://github.com/vishal-dehurdle/state-harness.git
cd state-harness

python -m venv .venv && source .venv/bin/activate

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

pip install maturin
maturin develop --release

# Run tests
pip install pytest
pytest tests/

Quickstart

Basic: GrowthRatioGuard (recommended)

The GrowthRatioGuard normalizes token usage against a baseline, so it only trips on disproportionate growth — not the natural growth of multi-turn context windows.

from state_harness import GrowthRatioGuard, StabilityViolation

guard = GrowthRatioGuard(
    token_budget=100_000,     # hard ceiling
    ratio_threshold=2.0,      # a turn counts as escalating above 2× baseline
    window=3,                 # 3 consecutive escalating turns to trip
    budget_gate=8_000,        # don't trip until 8K tokens spent
)

with guard:
    for turn in agent_loop:
        try:
            result = llm.invoke(turn.prompt)
            guard.record_step(
                tokens_used=result.usage.total_tokens,
                errors=0,
            )
        except StabilityViolation as e:
            print(f"Agent killed: {e}")
            break

print(f"Total cost: {guard.total_tokens} tokens")
print(f"Baseline: {guard.baseline} tokens/turn")
print(f"Peak ratio: {guard.current_ratio}×")

Failure Diagnostics

After any execution (tripped or not), get a structured failure report:

import json

from state_harness import FailureReport

report = FailureReport.from_guard(guard, model="gemini-2.5-flash")

# Human-readable terminal output
print(report)

# Structured dict for logging / dashboards
print(json.dumps(report.to_dict(), indent=2))

The report classifies the failure pattern, provides evidence, estimates cost impact, and suggests specific fixes — all without any LLM calls.

Classic: BoundaryGuard

For lower-level control using raw token counts (no normalization):

from state_harness import BoundaryGuard

with BoundaryGuard(token_budget=100_000, lambda_=1.0, window=5) as guard:
    for turn in agent_loop:
        result = llm.invoke(turn.prompt)
        guard.record_step(
            tokens_used=result.usage.total_tokens,
            errors=0,
            tool_name="search",
        )

Decorator: `@boundary_guard`

from state_harness import boundary_guard

@boundary_guard(
    token_budget=50_000,
    token_counter=lambda r: r.usage.total_tokens,
)
def agent_step(prompt: str):
    return llm.invoke(prompt)

Framework Integration

LangGraph (recommended)

from langgraph.prebuilt import create_react_agent
from state_harness.adapters import monitor_graph

agent = create_react_agent(model, tools=[search, calculate])
safe = monitor_graph(agent, token_budget=100_000)

result = safe.invoke({"messages": [("user", "Fix the login bug")]})

# After execution — always available:
print(safe.total_tokens)  # cumulative usage
print(safe.tripped)       # did stability trip?
print(safe.report)        # full FailureReport with pattern + suggestions

For streaming:

for chunk in safe.stream({"messages": [("user", "Refactor this module")]}):
    print(chunk)

With a trip callback (e.g., for Slack alerts):

safe = monitor_graph(
    agent,
    token_budget=100_000,
    on_trip=lambda report: slack.send(f"Agent tripped: {report.pattern}"),
)

Advanced: per-tool wrapping with LangGraphMiddleware

from state_harness import BoundaryGuard
from state_harness.adapters import LangGraphMiddleware

guard = BoundaryGuard(token_budget=150_000)
middleware = LangGraphMiddleware(guard)

@middleware.wrap_tool
def search_database(query: str):
    return db.search(query)

with guard:
    result = agent.invoke({"messages": [...]})

CrewAI

from crewai import Agent, Task, Crew
from state_harness.adapters import CrewAICallback

callback = CrewAICallback(token_budget=200_000)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    step_callback=callback.step_callback,
    task_callback=callback.task_callback,
)

result = crew.kickoff()
print(callback.report)  # FailureReport
callback.close()

Vanilla Python Hooks

from state_harness import BoundaryGuard
from state_harness.adapters import VanillaHook

guard = BoundaryGuard(token_budget=50_000)
hook = VanillaHook(guard)

with guard:
    for step in agent_loop:
        hook.before_call(tool_name="search")
        result = execute_tool(step)
        hook.after_call(tokens_used=result.tokens)

CLI

# Simulate a token trajectory — see what the guard would do
state-harness simulate 1000 1200 1500 2000 3000 5000 8000 --budget 50000

# Analyze a saved report
state-harness analyze report.json
state-harness analyze report.json --json    # JSON output
state-harness analyze report.json --otel    # OpenTelemetry attributes

# Batch analyze all reports in a directory
state-harness batch --dir ./reports/ --output results.csv

Structured Output

Every FailureReport supports multiple output formats:

report = FailureReport.from_guard(guard)

# JSON (for logging, APIs, storage)
report.to_json()            # pretty-printed
report.to_json(indent=None) # compact, single line

# CSV (for batch analysis of 1000s of runs)
with open("results.csv", "w") as f:
    f.write(FailureReport.csv_header() + "\n")
    for r in reports:
        f.write(r.to_csv_row() + "\n")

# OpenTelemetry (for Datadog, Grafana, Honeycomb)
from opentelemetry import trace
span = trace.get_current_span()
span.set_attributes(report.to_otel_attributes())
# Adds: state_harness.pattern, state_harness.confidence, etc.

Architecture

State-harness combines three physics-inspired mechanisms:

graph TD
    A["Agent Loop"] --> B["GrowthRatioGuard\n(Python SDK)"]
    B --> |"Normalizes tokens → growth ratio\nWarmup baseline · Budget gate"| C{" "}
    C --> D["Lyapunov Monitor\nV(k) = S + λθ\nΔV ≥ 0?"]
    C --> E["RG Decimator\nTF-IDF\nCompression"]
    C --> F["Holographic Engine\n(VSA)\nDrift Detection"]
    
    style D fill:#1a1a1a,stroke:#555,color:#e8e8e8
    style E fill:#1a1a1a,stroke:#555,color:#e8e8e8
    style F fill:#1a1a1a,stroke:#555,color:#e8e8e8
    style B fill:#0d1117,stroke:#30363d,color:#e6edf3

All three mechanisms are implemented in Rust (via PyO3) for microsecond-speed enforcement.

Component	Purpose	Speed
Lyapunov Monitor	Tracks energy derivative ΔV(k). Trips when ΔV ≥ 0 for W consecutive steps.	~1μs/step
RG Decimator	Compresses conversation history via RG-inspired decimation (TF-IDF scoring). Retains structurally important messages.	~100µs/compress
Holographic Engine	VSA-based policy drift detection. Binds domain invariants to high-dimensional vectors.	~10μs/check
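
As a rough illustration of the Lyapunov Monitor row, the sketch below assumes the energy has the form V(k) = S(k) + λθ(k) shown in the diagram, with S(k) read as a per-step token signal and θ(k) as a per-step error count. Those readings, and the function name `lyapunov_trip_step`, are assumptions made for illustration; the real monitor lives in the Rust core and may differ.

```python
from typing import Optional, Sequence


def lyapunov_trip_step(
    signal: Sequence[float],
    errors: Sequence[int],
    lambda_: float = 1.0,
    window: int = 5,
) -> Optional[int]:
    """Return the 0-based step at which dV >= 0 has held for `window`
    consecutive steps, or None if that never happens.

    Hypothetical helper: `signal` stands in for S(k), `errors` for theta(k).
    """
    # Energy per step: V(k) = S(k) + lambda * theta(k)
    energies = [s + lambda_ * e for s, e in zip(signal, errors)]
    streak = 0  # consecutive steps with non-decreasing energy
    for k in range(1, len(energies)):
        delta_v = energies[k] - energies[k - 1]
        streak = streak + 1 if delta_v >= 0 else 0
        if streak >= window:
            return k
    return None


# Decaying energy never trips; steadily rising energy trips once the
# window is filled.
print(lyapunov_trip_step([5, 4, 3, 2, 1], [0] * 5, window=3))     # None
print(lyapunov_trip_step([1, 2, 3, 4, 5, 6], [0] * 6, window=3))  # 3
```

Note that the condition ΔV ≥ 0 also counts flat steps, so a perfectly constant signal would trip this sketch. The normalization and budget gate described for GrowthRatioGuard are not modeled here.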

Benchmarks

Evaluated across four complementary benchmarks with a 5-condition ablation study (3,175 total runs) isolating each mechanism's contribution. Full methodology and data in the research paper.

Ablation Conditions

Condition	Lyapunov	RG Decimation	VSA Dual-Gate	Description
A. Baseline	—	—	—	Unmonitored agent
B. Lyapunov-only	✅	—	—	Energy monitoring, no intervention
C. Lyapunov+RG	✅	✅	—	+ history compression on violation
D. Full-stack	✅	✅	✅	+ policy drift gating
E. Naive Cap	—	—	—	Hard budget cap (control)

Summary: Non-invasive monitoring with zero-cost diagnostics

Benchmark	Runs	Stability Trips	Cost Savings (D vs A)	Resolve-Rate Δ	Diagnostics
MINT (reasoning + coding)	1,136	0	~0%	−0.7pp (noise)	N/A (no trips)
τ³-bench (customer service)	750	0	8.1%	within ±12pp nondeterminism	N/A (no trips)
SWE-bench Verified (coding)	333 + 148	~38%	38.6% (nodes)	−3.6pp (within ±4–5% noise)	✅ Pattern classification
Custom Local (4 models)	240	3 (true pos.)	15.2%	0pp	✅ Pattern classification
MINT Local (Qwen3:4B)	568	0	~0%	+1.8pp	N/A (no trips)

What the harness does — and doesn't do:

✅ Never interferes with healthy agents — zero stability trips across 1,886 short/medium-loop runs (MINT + τ³)
✅ Saves compute on spiraling tasks — 38.6% fewer search nodes, 30% faster wall time on SWE-bench
✅ Tells you why tasks failed — zero-cost failure diagnostics (context spiral, retry storm, policy drift) with actionable fixes
⚠️ Does not improve resolve rate — multi-trial SWE-bench (333 runs) confirms: harness 40.5% ± 2.7% vs naive cap 45.9% ± 5.4% vs baseline 44.1% ± 4.1% — all within noise

A naive budget cap achieves comparable task success rates. The harness's unique value is diagnostics (understanding why failures happen) and compute efficiency (33% fewer nodes than naive cap).

SWE-bench Verified (central result)

37 Django instances from SWE-bench Verified. Agent: moatless-tools SearchTree with 50-node budget. Model: Gemini 2.5 Flash.

Single-trial ablation (148 runs)

Condition	Resolved	Rate	Total Nodes	Wall Time	Nodes/Resolve
A. Baseline	15 / 37	40.5%	945	80 min	63.0
B. Lyapunov	16 / 37	43.2%	620	69 min	38.8
D. Full-stack	14 / 37	37.8%	580	56 min	41.4
E. Naive Cap	21 / 37	56.8%	876	77 min	41.7

Note: Single-trial resolve rates have ~±8pp standard error. E's apparent 56.8% is not statistically significant vs A's 40.5%. Multi-trial results below confirm this.

What the harness provides:

Compute-efficient: 38.6% fewer search tree nodes than baseline, 33% fewer than naive cap
Faster: 30% wall-time reduction (80 → 56 min)
Eliminates burnout: Baseline had 7 tasks burning the full 50-node budget (all failed). With monitoring: zero
Diagnostics: Every tripped task gets a classified failure pattern with actionable fix suggestions — at zero LLM cost
Simple integration: Lyapunov monitoring alone (Condition B) delivers ~90% of total benefit — 5 lines of code

Ablation — each mechanism contributes independently:

Layer Added	Compute (nodes)	Δ vs Baseline	Cumulative Reduction
A. No monitoring	945	—	—
B. + Lyapunov	620	−325	34.4%
D. + RG + VSA	580	−40	38.6%

Lyapunov monitoring alone delivers ~90% of the total benefit. RG decimation and VSA add incremental value.
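
The percentages in the ablation table follow directly from the node counts. The short check below recomputes them; the variable and function names are illustrative only.

```python
# Node counts taken from the ablation table above.
nodes = {"A. No monitoring": 945, "B. + Lyapunov": 620, "D. + RG + VSA": 580}
baseline = nodes["A. No monitoring"]


def reduction_pct(value: int, reference: int) -> float:
    """Percentage reduction of `value` relative to `reference`."""
    return 100.0 * (reference - value) / reference


lyapunov_reduction = reduction_pct(nodes["B. + Lyapunov"], baseline)
full_reduction = reduction_pct(nodes["D. + RG + VSA"], baseline)
# Share of the total node saving that Lyapunov monitoring alone accounts for.
lyapunov_share = 100.0 * (baseline - nodes["B. + Lyapunov"]) / (
    baseline - nodes["D. + RG + VSA"]
)

print(f"Lyapunov only: {lyapunov_reduction:.1f}%")   # 34.4%
print(f"Full stack:    {full_reduction:.1f}%")       # 38.6%
print(f"Lyapunov share of saving: {lyapunov_share:.1f}%")  # 89.0%
```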

Multi-trial validation (333 runs)

To quantify nondeterminism and validate the single-trial findings, we ran 3 independent trials per condition (A, D, E) across all 37 instances — 333 total runs (12 runs resulted in stuck Docker containers killed after 28+ min; counted as failures):

Condition	Trial 1	Trial 2	Trial 3	Mean ± σ
A. Baseline	18/37 (48.6%)	16/37 (43.2%)	15/37 (40.5%)	44.1% ± 4.1%
D. Full-stack	15/37 (40.5%)	16/37 (43.2%)	14/37 (37.8%)	40.5% ± 2.7%
E. Naive Cap	19/37 (51.4%)	15/37 (40.5%)	17/37 (45.9%)	45.9% ± 5.4%

Key finding: Cross-condition variance (2.9%) ≤ within-condition nondeterminism (4.1%). The differences between conditions are entirely within the noise band of LLM nondeterminism — confirming non-invasiveness with statistical rigor.

Note on nondeterminism: The ~4% within-condition stdev converges with τ³-bench findings (±4.6%), establishing a ~4–5% nondeterminism floor as a fundamental property of Gemini 2.5 Flash on code tasks. Any single-run benchmark comparison is unreliable for deltas < 8%.

Statistical validation: Bootstrap confidence intervals (10,000 resamples) and Welch's t-tests confirm no significant pairwise differences: A−D = +3.6pp [−0.9, +8.1], p ≈ 0.17; A−E = −1.8pp [−8.1, +4.5], p ≈ 0.68; D−E = −5.4pp [−10.8, 0.0], p ≈ 0.09. Full analysis in the research paper §7.3.1.
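
The Mean ± σ column in the multi-trial table can be reproduced from the per-trial resolved counts using the sample standard deviation. This is only a consistency check on the table; it does not reproduce the bootstrap intervals or p-values, which need the per-instance data from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

INSTANCES = 37  # SWE-bench Verified instances per trial

# Resolved counts per trial, copied from the multi-trial table.
resolved: Dict[str, List[int]] = {
    "A. Baseline": [18, 16, 15],
    "D. Full-stack": [15, 16, 14],
    "E. Naive Cap": [19, 15, 17],
}


def summarize(counts: List[int], n: int = INSTANCES) -> Tuple[float, float]:
    """Return (mean, sample stdev) of the per-trial resolve rates in percent."""
    rates = [100.0 * c / n for c in counts]
    return mean(rates), stdev(rates)


summary = {name: summarize(counts) for name, counts in resolved.items()}
for name, (m, s) in summary.items():
    print(f"{name}: {m:.1f}% ± {s:.1f}%")
# A. Baseline: 44.1% ± 4.1%
# D. Full-stack: 40.5% ± 2.7%
# E. Naive Cap: 45.9% ± 5.4%
```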

τ³-bench Airline (non-invasiveness confirmation)

50 tasks × 3 trials × 5 conditions = 750 total runs. Agent handles airline reservations via tool calls. Model: Gemini 2.5 Flash. Concurrency=1.

Condition	Trial Pass	Rate	Task Pass (maj)	Rate	Cost	Cost Δ
A. Baseline	99/150	66.0%	35/50	70.0%	$2.47	—
B. Lyapunov-only	83/150	55.3%	28/50	56.0%	$2.42	−2.0%
C. Lyapunov+RG	79/150	52.7%	26/50	52.0%	$1.69	−31.8%
D. Full-stack	86/150	57.3%	30/50	60.0%	$2.28	−8.1%
E. Naive Cap	81/150	54.0%	26/50	52.0%	$2.33	−5.7%

Key findings:

Zero stability trips across all 750 runs. The monitor correctly identifies all airline tasks as stable and never intervenes — confirming non-invasiveness on medium-loop customer-service agents.
Pass-rate variance is LLM nondeterminism, not harness impact. The naive cap (E), which has zero monitoring, drops 18pp below baseline on majority-vote task pass rate (70.0% → 52.0%), worse than full-stack monitoring (D, a 10pp drop, 70.0% → 60.0%). This indicates the ~10–18pp spread is intrinsic benchmark variance, not monitoring-caused regression.
25% of tasks flip pass/fail within the same condition across 3 trials — the airline domain's intrinsic nondeterminism floor (~±12pp).
8.1% cost savings from full-stack monitoring, with the harness observing passively (zero interventions).
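
To make the three metrics above concrete, here is a minimal sketch of how a trial pass rate, a majority-vote task pass rate, and a flip rate could be computed from per-task trial outcomes. The function name and the four-task example data are hypothetical; they are not the benchmark's data or its scoring code.

```python
from typing import Dict, List


def summarize_trials(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize pass/fail outcomes, given one list of trial results per task."""
    n_tasks = len(outcomes)
    n_trials = sum(len(trials) for trials in outcomes.values())
    passed_trials = sum(sum(trials) for trials in outcomes.values())
    # A task passes by majority vote if more than half of its trials pass.
    majority = sum(1 for t in outcomes.values() if 2 * sum(t) > len(t))
    # A task "flips" if its trials disagree (some pass, some fail).
    flipped = sum(1 for t in outcomes.values() if 0 < sum(t) < len(t))
    return {
        "trial_pass_rate": passed_trials / n_trials,
        "task_pass_rate": majority / n_tasks,
        "flip_rate": flipped / n_tasks,
    }


# Hypothetical outcomes for four tasks with three trials each.
example = {
    "task_1": [True, True, True],
    "task_2": [True, True, False],
    "task_3": [False, False, True],
    "task_4": [False, False, False],
}
trial_summary = summarize_trials(example)
print(trial_summary)
```

In this toy example two of the four tasks flip, which shows how a flip rate can be high even when trial and task pass rates agree.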

MINT (non-invasiveness validation)

284 tasks × 4 conditions = 1,136 total runs across GSM8K (48), MATH (100), HumanEval (45), MBPP (91). Agent uses up to 5 turns per task.

Condition	GSM8K	MATH	Total	Tokens
A. Baseline	91.7%	39.0%	29.2%	1,909,582
B. Lyapunov	91.7%	41.0%	29.9%	1,904,421
C. Lyapunov+RG	89.6%	37.0%	28.2%	1,910,926
D. Full-stack	87.5%	39.0%	28.5%	1,949,708

Zero stability violations across all 1,136 runs. The monitor correctly identifies short-loop tasks as stable and never intervenes. Token usage is effectively invariant: every monitored condition stays within about 2% of the baseline token count.

Failed tasks cost disproportionately more — validating the economic thesis:

Task	Success Avg	Failure Avg	Ratio
GSM8K	2,613 tok	8,857 tok	3.4×
MATH	5,154 tok	8,188 tok	1.6×

Note: HumanEval and MBPP show 0% across all conditions due to a MINT framework limitation in code execution evaluation — consistent across all conditions, confirming the harness does not introduce new failure modes.

Local Model Validation (edge deployment)

20 custom tasks (5 easy, 10 medium, 5 hard) × 4 models × 3 conditions = 240 runs. Hardware: Apple M4 MacBook Pro, 16 GB RAM, Ollama local inference.

Model	Size	Baseline	Harness	Naive Cap	Token Savings
Llama 3.2:3B	2.0 GB	45%	45%	60%	1.2%
Phi-4-Mini	2.5 GB	30%	30%	40%	20.7%
Qwen3:4B	2.5 GB	30%	30%	40%	0.9%
Gemma4:E4B	9.6 GB	35%	35%	70%	37.9%

Key findings:

Zero false positives across all 80 harness runs — 4 model families, 3 difficulty tiers. The growth-ratio metric generalizes across architectures without threshold retuning.
Small-model self-sabotage: Naive cap outperforms baseline by +17.5pp on average (median +12.5pp). Small models solve early turns correctly but destroy solutions in later turns. Strongest on Gemma4:E4B (+35pp).
Model-family behavioral signatures:
- Llama 3.2:3B: Classic spirals (ratios: 2.3×, 5.9×, 7.6×) — 3 true-positive trips
- Phi-4-Mini: Spike-and-recover — 20.7% passive savings
- Qwen3:4B: 255K tokens but flat ratios (≤1.06×) — correctly classified as stable despite 3× volume
- Gemma4:E4B: Decreasing ratios — 37.9% passive savings, zero trips

Practical takeaway: If you’re deploying ≤4B models via Ollama, state-harness works out of the box with zero false positives. The self-sabotage finding suggests you should also add a turn limit (2–3 turns) for open-ended code generation tasks.

MINT on Qwen3:4B (568 runs)

Task	Harness (max=5)	Naive Cap (max=2)	Δ
GSM8K	37.5%	27.1%	+10.4pp
MATH	0.0%	0.0%	—
HumanEval	11.1%	11.1%	—
MBPP	14.3%	14.3%	—
Total	12.7%	10.9%	+1.8pp

Zero harness interventions across all 284 tasks. On MINT’s short-loop tasks (max 5 turns), the harness monitoring window (W=3) cannot trigger within the available post-warmup turns — a structural guarantee, not a probabilistic observation.

Reproducing the benchmarks

Full reproduction steps (all three benchmarks)

# 1. Clone repos
git clone https://github.com/vishal-dehurdle/state-harness.git
git clone https://github.com/sierra-research/tau-bench.git tau3-bench

# 2. Install state-harness
cd state-harness
python -m venv .venv && source .venv/bin/activate
pip install maturin && maturin develop --release

# 3. Install τ³-bench (with state-harness agent)
cd ../tau3-bench
uv sync
cp ../state-harness/tau3_integration/harness_agent.py src/tau2/agent/
cp ../state-harness/tau3_integration/naive_cap_agent.py src/tau2/agent/

# 4. Configure Vertex AI
export GOOGLE_CLOUD_PROJECT=your-project-id
export VERTEXAI_LOCATION=asia-south1

# 5. Run τ³ 5-phase benchmark
bash benchmarks/tau3/run_5phase_airline.sh

# 6. Run SWE-bench (requires Docker images)
bash benchmarks/swe_bench/run_benchmark.sh
bash benchmarks/swe_bench/run_benchmark_dbe.sh

# 7. Run MINT
bash benchmarks/mint/run_mint_fullstack.sh

Ablation conditions are controlled via environment variables:

Variable	Values	Effect
`HARNESS_RG`	`on` / `off`	Enable/disable RG history compression
`HARNESS_VSA`	`on` / `off`	Enable/disable VSA policy drift detection
`HARNESS_RATIO_THRESHOLD`	float (e.g., `2.0`)	Override growth ratio threshold
`HARNESS_BUDGET_GATE`	int (e.g., `8000`)	Override minimum spend before trip

See benchmarks/ for full setup, configs, and reproduction instructions for all three benchmarks.

Future evaluations

Multi-trial SWE-bench (completed): 333 runs (3 trials × 3 conditions × 37 instances) confirming non-invasiveness within the ±4% noise band
Local model validation (completed): 240 runs across 4 open-weight models (Llama, Phi, Qwen, Gemma) plus 568 MINT runs on Qwen3:4B
Terminal-Bench (planned): terminal-based agent tasks; command-line tool loops where spirals manifest as repeated failed commands
SWE-bench Pro (planned): harder, contamination-resistant variant of SWE-bench
Cross-model validation (completed): 7 models total, namely GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B

Known limitations

37 SWE-bench instances — A larger sample would improve statistical power (n=3 trials gives limited degrees of freedom for t-tests).
No causal intervention — The harness currently kills spiraling tasks. Redirect/repair is on the roadmap.
Physics-inspired, not physics-equivalent — Terms like "Renormalization Group" and "Lyapunov stability" are used as structural inspirations. The mathematical mapping is analogical, not isomorphic.
Custom benchmark scale — The 20-task local battery is smaller than standard benchmarks. The self-sabotage finding (mean +17.5pp, median +12.5pp) is consistent across 4 models but requires larger-scale replication.

Configuration Guide

Parameter	Default	Description
`token_budget`	100,000	Hard ceiling on cumulative tokens
`ratio_threshold`	2.0	Growth ratio above which a turn counts as "escalating" (domain-tuned: airline=2.0, retail=2.5, telecom=2.0)
`window`	3	Consecutive escalating turns before circuit breaker trips
`warmup_turns`	3	Turns used to establish baseline (no monitoring during warmup)
`budget_gate`	8,000	Minimum cumulative tokens before the monitor can trip (retail: 12,000)
`lambda_`	1.0	Error weighting in the Lyapunov energy function
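
The sketch below shows one way the parameters in this table could interact. It is a pure-Python approximation built only from the descriptions above, not the library's implementation: it assumes the baseline is the mean of the warmup turns, that a turn is "escalating" when its tokens exceed `ratio_threshold` times that baseline, and that reaching `token_budget` always stops the run. The function name `first_trip_turn` is hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def first_trip_turn(
    tokens_per_turn: Sequence[int],
    token_budget: int = 100_000,
    ratio_threshold: float = 2.0,
    window: int = 3,
    warmup_turns: int = 3,
    budget_gate: int = 8_000,
) -> Optional[int]:
    """Return the 1-based turn at which this sketch would trip, or None."""
    if warmup_turns < 1:
        raise ValueError("warmup_turns must be at least 1")
    cumulative = 0
    streak = 0  # consecutive post-warmup turns above the threshold
    baseline = 0.0
    for turn, tokens in enumerate(tokens_per_turn, start=1):
        cumulative += tokens
        if cumulative >= token_budget:
            return turn  # hard ceiling reached
        if turn <= warmup_turns:
            if turn == warmup_turns:
                # Assumed baseline: mean token usage over the warmup turns.
                baseline = mean(tokens_per_turn[:warmup_turns])
            continue  # no monitoring during warmup
        escalating = baseline > 0 and tokens / baseline > ratio_threshold
        streak = streak + 1 if escalating else 0
        if streak >= window and cumulative >= budget_gate:
            return turn
    return None


# Trajectory from the CLI example: ratios exceed 2.0 on turns 5, 6 and 7.
print(first_trip_turn([1000, 1200, 1500, 2000, 3000, 5000, 8000],
                      token_budget=50_000))  # 7
```

Under these assumptions a 5-turn run can never fill a window of 3 after a 3-turn warmup, so only the hard budget ceiling could stop it.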

Environment variable overrides (highest precedence, for threshold sweeps):

Env Var	Description
`HARNESS_RATIO_THRESHOLD`	Override ratio_threshold (e.g., `2.5`)
`HARNESS_BUDGET_GATE`	Override budget_gate (e.g., `12000`)
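
A minimal sketch of the stated precedence, where environment variables win over values passed in code. The helper name `resolve_thresholds` is hypothetical and the library's own parsing may differ; only the two variable names come from the table above.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_thresholds(
    ratio_threshold: float = 2.0,
    budget_gate: int = 8_000,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[float, int]:
    """Apply HARNESS_* overrides on top of the values passed in code."""
    env = os.environ if env is None else env
    if "HARNESS_RATIO_THRESHOLD" in env:
        ratio_threshold = float(env["HARNESS_RATIO_THRESHOLD"])
    if "HARNESS_BUDGET_GATE" in env:
        budget_gate = int(env["HARNESS_BUDGET_GATE"])
    return ratio_threshold, budget_gate


# An explicit mapping is passed here so the example does not depend on
# the real process environment.
print(resolve_thresholds(env={}))  # (2.0, 8000)
print(resolve_thresholds(env={"HARNESS_RATIO_THRESHOLD": "2.5",
                              "HARNESS_BUDGET_GATE": "12000"}))  # (2.5, 12000)
```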

Tuning tips:

More aggressive (catch spirals earlier): ratio_threshold=1.8, window=2
More conservative (fewer false positives): ratio_threshold=2.5, window=3
High-value tasks: Increase budget_gate to 20K+ to let expensive tasks run longer
Complex domains (retail, multi-tool): Start with ratio_threshold=2.5

Theoretical Foundations

State-harness applies control theory to LLM agent execution:

Lyapunov stability: The energy function V(k) = S(k) + λθ(k) models token consumption as a dynamical system. When ΔV ≥ 0 for W consecutive steps, the energy is no longer decaying, so the monitor treats the run as unstable and trips.
Renormalization Group (RG) theory: Message compression is modeled as coarse-graining — eliminating high-frequency noise while preserving scale-invariant task objectives.
Vector Symbolic Architecture (VSA): Domain policies are bound to high-dimensional bipolar vectors (10,000-d, i8 space), enabling constant-time semantic drift detection outside the LLM context window.
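
To illustrate the VSA idea in general terms, the sketch below uses standard bipolar-vector operations: binding by element-wise multiplication, bundling by majority sign, and a normalized dot product for similarity. The role and filler names, the helper functions, and the way a policy is encoded are all assumptions for illustration; the library's Holographic Engine is implemented in Rust and may encode policies differently.

```python
import random
from typing import List, Sequence

DIM = 10_000  # dimensionality quoted in the text
Vector = List[int]


def random_bipolar(rng: random.Random, dim: int = DIM) -> Vector:
    """Random vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a: Vector, b: Vector) -> Vector:
    """Bind two vectors by element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors: Sequence[Vector]) -> Vector:
    """Superpose vectors by taking the sign of the element-wise sum."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


def similarity(a: Vector, b: Vector) -> float:
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
roles = {name: random_bipolar(rng) for name in ("domain", "tool", "tone")}
fillers = {name: random_bipolar(rng)
           for name in ("airline", "booking_api", "formal")}

# Hypothetical policy: three role/filler pairs bundled into one vector.
policy = bundle([
    bind(roles["domain"], fillers["airline"]),
    bind(roles["tool"], fillers["booking_api"]),
    bind(roles["tone"], fillers["formal"]),
])

on_policy = bind(roles["domain"], fillers["airline"])
off_policy = bind(roles["domain"], random_bipolar(rng))  # unrelated filler

on_score = similarity(policy, on_policy)
off_score = similarity(policy, off_policy)
print(f"on-policy similarity:  {on_score:.2f}")
print(f"off-policy similarity: {off_score:.2f}")
```

An observation that matches part of the policy scores well above zero, while an unrelated one scores near zero. Each check costs a fixed number of operations in the vector dimension, independent of conversation length.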

Research

This library implements the framework described in:

Empirical Lyapunov Stability: Growth-Ratio Energy Functions as Leading Indicators of Agent Task Failure. Vishal Verma, 2026.

Key findings from the paper (updated with multi-trial and local model validation):

Non-invasiveness confirmed across 333 SWE-bench runs — resolve rate delta (−3.6pp) falls within the ±4.1% nondeterminism band
Zero stability violations across 1,886 short/medium-loop cloud runs (MINT + τ³) — the monitor never interferes with healthy agents
Zero false positives across 80 local-model harness runs — spanning Llama 3.2:3B, Phi-4-Mini, Qwen3:4B, Gemma4:E4B on Apple M4 via Ollama
Zero-cost failure diagnostics — every tripped task is classified (context spiral, retry storm, policy drift) with actionable fix suggestions, requiring no additional LLM calls
Lyapunov monitoring alone delivers ~90% of the total benefit — the simplest integration (5 lines of GrowthRatioGuard code) captures the majority of the value
On long-loop agents (SWE-bench), full-stack monitoring reduces compute by 38.6% and wall time by 30%
Failed tasks cost 1.6–3.4× more than successful ones — economic justification for early termination
Eliminates all max-budget burnout events (7 → 0 tasks hitting the 50-node ceiling on SWE-bench)
~4–5% nondeterminism floor established across both τ³-bench and SWE-bench — any single-run comparison is unreliable for deltas < 8%
Small-model self-sabotage: Naive turn-limiting outperforms unconstrained baselines by +17.5pp on average on ≤4B models — runtime governance is capability-preserving, not just cost-saving

Based on the theoretical framework from:

The Fluid Dynamics of Multi-Agent AI: Resolving d'Alembert's Paradox of Generative Workflows. Vishal Verma, 2026.

Contributing

Contributions are welcome. See CONTRIBUTING.md for dev environment setup, code style, and PR guidelines.

Roadmap

Adaptive threshold — Auto-tune τ based on task complexity signal from early turns
Causal intervention — Instead of killing spiraling tasks, redirect them (prompt injection, tool restriction)
Streaming support — Token-level monitoring for streaming LLM responses
Multi-model validation — 7 models validated: GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash + 4 local models via Ollama
Dashboard / observability — Optional lightweight UI for monitoring energy trajectories in real-time

Security

For security vulnerabilities, see SECURITY.md. Please do not open public issues for security reports.

License

Split-core licensing:

Component	License	Notes
Rust Core (`src/`)	BSL 1.1	Free for non-commercial + ARR < $1M. Converts to Apache 2.0 on May 26, 2030.
Python SDK (`python/`)	Apache 2.0	Fully permissive.

See LICENSE.md for full details.