pythea
LLM reliability research from Hassana Labs.
Three tools, one goal: catch models that know the answer but don't use it.
π Strawberry: Procedural Hallucination Toolkit
Ask Claude to count the r's in "strawberry." It writes "s-t-r-a-w-b-e-r-r-y," identifies each r, gets to 3. Then outputs "2."
The model didn't lack information. The answer was right there, in text it generated moments earlier. The computation worked. The routing failed.
This toolkit detects those failures mathematically.
pip install pythea
python -m strawberry.factual_recall \
--question "Which US senators from Minnesota graduated from Princeton" \
--out report.json

What it catches:
- RAG that retrieves but doesn't read
- Chain-of-thought that cites steps it ignored
- Self-verification that validates without checking
- Citation confabulation (decorative sources)
How: Scrub the cited evidence, measure confidence change. No change? The model was confabulating.
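As a rough illustration of that check (not the toolkit's actual API), the sketch below assumes a caller-supplied `score_answer` function that returns the model's confidence for a prompt, and flags a citation as decorative when scrubbing the evidence barely moves that confidence:

```python
# Illustrative sketch only: `score_answer` stands in for any callable that
# returns the model's confidence (e.g. an answer-token logprob) for a prompt.
from typing import Callable

def is_confabulated(
    score_answer: Callable[[str], float],  # prompt -> confidence in the fixed answer
    question: str,
    evidence: str,
    threshold: float = 0.05,
) -> bool:
    """Flag a citation as decorative if scrubbing the evidence barely moves confidence."""
    with_evidence = score_answer(f"Evidence: {evidence}\n\nQuestion: {question}")
    scrubbed = score_answer(f"Evidence: [REDACTED]\n\nQuestion: {question}")
    # If confidence is essentially unchanged, the model never actually used the evidence.
    return abs(with_evidence - scrubbed) < threshold
```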
Integrations: Strawberry ships with an MCP server (strawberry.mcp_server) that can be used from
Claude Code or OpenAI Codex (Codex CLI / IDE extension) to run detect_hallucination and
audit_trace_budget as tools.
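As one illustration, registering the server for Claude Code could look roughly like the project-scoped MCP config below (e.g. a `.mcp.json` at the repo root); launching the server with `python -m strawberry.mcp_server` is an assumption here, not a documented invocation:

```json
{
  "mcpServers": {
    "strawberry": {
      "command": "python",
      "args": ["-m", "strawberry.mcp_server"]
    }
  }
}
```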
Agent workflows (Codex skills): This repo includes repo-scoped Codex skills under .codex/skills/:
- rca-fix-agent: evidence-first debugging agent implemented as a composition of planning-agent + execution-agent (reproduce → evidence → hypotheses → Strawberry-verified ROOT_CAUSE → plan-of-record → execute patches with Strawberry-gated acceptance → run a mandatory final verification test → iterate).
- proof-attack-agent: a focused, anti-repeat brute-force "proof gap attack" loop (usually for a stuck Lean goal). It uses Strawberry verification as a heartbeat after every micro-plan update and maintains an Attempt Ledger so you don't retry the same dead ends.
- proof-repair-agent: plan-driven, evidence-first proof repair / proof synthesis agent for LaTeX + a formal backstop (Lean/Coq/...). It uses planning-agent to produce a plan-of-record, uses proof-attack-agent to explore stuck gaps, checks for theorem drift, and iterates until theorem Y is machine-checked.
- planning-agent: evidence-first hierarchical planning agent that does local environment forensics + web lookup, then Strawberry-verifies each plan step against an explicit "what does success look like?" spec.
- execution-agent: Strawberry-gated plan execution agent that ships code patches, accepting each step only when "this step has succeeded" is provably true (tests/output/diffs), and (for full runs) requires passing the plan's mandatory final verification test (see the sketch after this list).
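The common pattern behind these skills (a step is accepted only when an explicit success check passes) can be sketched as follows; `apply_step` and `verify` are hypothetical placeholders, not part of this repo:

```python
# Illustrative sketch only: Strawberry-gated execution accepts a step solely
# when a concrete "this step has succeeded" check (tests/output/diffs) passes.
from typing import Callable, Sequence

def run_gated_plan(
    steps: Sequence[str],
    apply_step: Callable[[str], None],   # hypothetical: perform one plan step
    verify: Callable[[str], bool],       # hypothetical: evidence the step actually worked
    max_attempts: int = 3,
) -> bool:
    for step in steps:
        for _ in range(max_attempts):
            apply_step(step)
            if verify(step):             # accept only on verified success
                break
        else:
            return False                 # never verified: stop instead of ploughing on
    return True                          # all steps, including the final verification, passed
```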
π Thea API Client
Lightweight client for the Thea Mini Reasoning API.
from pythea import TheaClient

with TheaClient(base_url="https://...") as client:
    resp = client.unified_answer(
        question="What is 2+2?",
        backend="aoai-pool",
        m=6,
    )
    print(resp.get("answer"))
π Offline QMV Probing
Model-agnostic permutation-mixture evaluation via Bernoulli first-token logprob probes.
from pythea.offline import qmv

res = qmv.evaluate_permutation_family(
    probe=probe,
    parts=parts,
    cfg=qmv.PermutationEvalConfig(m=6, num_bands=2, seed=0),
)
print(res.q_bar, res.q_lo, res.js_bound)
Install
git clone https://github.com/leochlon/pythea.git
cd pythea
pip install -e .
Extras:
pip install -e ".[dev]"       # tests
pip install -e ".[offline]"   # tiktoken for logit bias
pip install -e ".[vllm]"      # local inference
Repo layout
pythea/
├── strawberry/       # Procedural hallucination toolkit
│   ├── README.md
│   └── src/
├── src/pythea/       # Thea client + QMV probing
├── docs/             # Detailed documentation
├── examples/
├── tests/
└── benchmarks/
Citation
@article{hassanalabs2026procedural,
  title={An Information-Theoretic and Causal Theory of Procedural Hallucinations},
  author={{Hassana Labs}},
  journal={arXiv preprint},
  year={2026}
}
License
MIT; see LICENSE.md