Hallucination Risk Calculator & Prompt Re-engineering Toolkit (Multi-Provider Support)
Post-hoc calibration without retraining for large language models. This toolkit turns a raw prompt into:
- a bounded hallucination risk using the Expectation-level Decompression Law (EDFL), and
- a decision to ANSWER or REFUSE under a target SLA, with transparent math (nats).
Key Features
- Multi-Provider Support: Works with OpenAI, Anthropic (Claude), Hugging Face, Ollama, and OpenRouter models
- No Retraining Required: Pure inference-time calibration
- Two Deployment Modes:
- Evidence-based: prompts include evidence/context; rolling priors are built by erasing that evidence
- Closed-book: prompts have no evidence; rolling priors are built by semantic masking of entities/numbers/titles
- Mathematically Grounded: Based on the EDFL/B2T/ISR framework from the accompanying preprint (see the reference at the end of this README)
Table of Contents
- Install & Setup
- Supported Model Providers
- Quick Start Examples
- Core Mathematical Framework
- Understanding System Behavior
- Two Ways to Build Rolling Priors
- API Surface
- Calibration & Validation
- Practical Considerations
- Project Layout
- Deployment Options
Install & Setup
Basic Installation
```bash
# Core requirement
pip install --upgrade openai

# For additional providers (optional)
pip install anthropic           # For Claude models
pip install transformers torch  # For local Hugging Face models
pip install ollama              # For Ollama models
pip install requests            # For HTTP-based backends
```
API Keys Setup
```bash
# For OpenAI
export OPENAI_API_KEY=sk-...

# For Anthropic (Claude)
export ANTHROPIC_API_KEY=sk-ant-...

# For Hugging Face Inference API
export HF_TOKEN=hf_...
```
Supported Model Providers
The toolkit now supports multiple LLM providers through universal backend adapters:
1. OpenAI (Original)
- GPT-4o, GPT-4o-mini, and other Chat Completions models
- Requires `OPENAI_API_KEY`
2. Anthropic (Claude)
- Claude 3.5 Sonnet, Claude 3 Opus, and other Claude models
- Requires the `anthropic` package and `ANTHROPIC_API_KEY`
3. Hugging Face (Three Modes)
- Local Transformers: Run models locally with `transformers`
- TGI Server: Connect to Text Generation Inference servers
- Inference API: Use hosted models via the Hugging Face API
4. Ollama
- Run any Ollama-supported model locally
- Supports both Python SDK and HTTP API
5. OpenRouter (Recommended for Production)
- Single API for 100+ models from OpenAI, Anthropic, Google, Meta, and more
- Automatic fallbacks and load balancing
- Often cheaper than direct API access due to volume aggregation
- Built-in rate limiting and retry logic
Quick Start Examples
Using OpenAI (Original)
```python
from hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner

backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)

item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=7,
    m=6,
    skeleton_policy="closed_book",
)

metrics = planner.run(
    [item],
    h_star=0.05,            # Target 5% hallucination max
    isr_threshold=1.0,      # Standard ISR gate
    margin_extra_bits=0.2,  # Safety margin
    B_clip=12.0,            # Clipping bound
    clip_mode="one-sided",  # Conservative mode
)

for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Risk bound: {m.roh_bound:.3f}")
```
Using Anthropic (Claude)
```python
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend

# Use Claude instead of GPT
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
planner = OpenAIPlanner(backend, temperature=0.3)

# Rest of the code remains identical
items = [OpenAIItem(prompt="What is quantum entanglement?", n_samples=7, m=6)]
metrics = planner.run(items, h_star=0.05)
```
Using Hugging Face (Local)
```python
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import HuggingFaceBackend

# Run Llama locally
backend = HuggingFaceBackend(
    mode="transformers",
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",  # or "cuda" or "cpu"
)
planner = OpenAIPlanner(backend, temperature=0.3)

# Same evaluation flow
metrics = planner.run([...], h_star=0.05)
```
Using Hugging Face (TGI Server)
```python
from hallucination_toolkit import OpenAIPlanner
from htk_backends import HuggingFaceBackend

# Connect to a Text Generation Inference server
backend = HuggingFaceBackend(
    mode="tgi",
    tgi_url="http://localhost:8080",
)
planner = OpenAIPlanner(backend, temperature=0.3)
```
Using Hugging Face (Inference API)
```python
import os

from hallucination_toolkit import OpenAIPlanner
from htk_backends import HuggingFaceBackend

# Use Hugging Face's hosted models
backend = HuggingFaceBackend(
    mode="inference_api",
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    hf_token=os.environ["HF_TOKEN"],
)
planner = OpenAIPlanner(backend, temperature=0.3)
```
Using OpenRouter (Recommended for Multi-Model Testing)
OpenRouter provides access to 100+ models through a single API, making it ideal for comparing hallucination bounds across providers:
```python
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import OpenRouterBackend

# Access any model through OpenRouter's unified API
backend = OpenRouterBackend(
    model="openrouter/auto",                # Auto-selects best available model
    # model="anthropic/claude-3.5-sonnet",  # Or specify exact model
    # api_key="...",                        # Uses OPENROUTER_API_KEY env var if not provided
    http_referer="https://your.app",        # Optional but recommended
    x_title="EDFL Decision Head (prod)",    # Optional app identifier
    providers={"allow": ["anthropic", "google", "openai"]},  # Optional: limit providers
)

planner = OpenAIPlanner(
    backend=backend,
    temperature=0.5,
    max_tokens_decision=8,  # Tiny JSON decision head
    q_floor=None,           # Or set your prior floor
)

items = [OpenAIItem(
    prompt="What is quantum entanglement?",
    n_samples=3,
    m=6,
    skeleton_policy="auto",
)]

metrics = planner.run(
    items,
    h_star=0.05,
    isr_threshold=1.0,
    B_clip=12.0,
    clip_mode="one-sided",
)

for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"ISR: {m.isr:.3f}, RoH bound: {m.roh_bound:.3f}")
```
Why OpenRouter for this toolkit?
- Test calibration across many models without managing multiple API keys
- Automatic fallbacks ensure high availability for production deployments
- Cost optimization through intelligent routing
- Perfect for A/B testing different models' hallucination characteristics
Using Ollama
```python
from hallucination_toolkit import OpenAIPlanner
from htk_backends import OllamaBackend

# Use any Ollama model
backend = OllamaBackend(
    model="llama3.1:8b-instruct",
    host="http://localhost:11434",  # Default Ollama port
)
planner = OpenAIPlanner(backend, temperature=0.3)
```
Backend Configuration Details
AnthropicBackend
```python
AnthropicBackend(
    model="claude-3-5-sonnet-latest",  # or any Claude model
    api_key=None,                      # Uses ANTHROPIC_API_KEY env var if None
    request_timeout=60.0,
)
```
Requirements: `pip install anthropic`
OpenRouterBackend
```python
OpenRouterBackend(
    model="openrouter/auto",                       # Auto-routing or specific model
    api_key=None,                                  # Uses OPENROUTER_API_KEY env var
    http_referer="https://your.app",               # Recommended for tracking
    x_title="Your App Name",                       # Optional identifier
    providers={"allow": ["anthropic", "google"]},  # Optional filtering
)
```
Requirements: `pip install openai` (OpenRouter exposes an OpenAI-compatible API)
Available models include:
- `anthropic/claude-3.5-sonnet`
- `openai/gpt-4-turbo`
- `google/gemini-pro`
- `meta-llama/llama-3-70b-instruct`
- `mistralai/mixtral-8x7b`
- See OpenRouter models for the full list
HuggingFaceBackend
The Hugging Face backend supports three operational modes:
Mode 1: Local Transformers
```python
HuggingFaceBackend(
    mode="transformers",
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",       # GPU allocation strategy
    torch_dtype="float16",   # Optional: precision setting
    trust_remote_code=True,  # For custom model code
    model_kwargs={},         # Additional model parameters
)
```
Requirements: `pip install transformers torch`
Mode 2: TGI Server
```python
HuggingFaceBackend(
    mode="tgi",
    tgi_url="http://localhost:8080",  # Your TGI server URL
    model_id=None,                    # Not needed for TGI
)
```
Requirements: `pip install requests` and a running TGI server
Mode 3: Inference API
```python
HuggingFaceBackend(
    mode="inference_api",
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    hf_token="hf_...",  # Your Hugging Face token
)
```
Requirements: `pip install requests` and a Hugging Face account
OllamaBackend
```python
OllamaBackend(
    model="llama3.1:8b-instruct",   # Any Ollama model
    host="http://localhost:11434",  # Ollama server URL
    request_timeout=60.0,
)
```
Requirements: `pip install ollama` (optional) or `pip install requests`, and Ollama installed locally
Comparing Providers
Here's a complete example comparing different providers on the same prompt:
```python
from hallucination_toolkit import OpenAIBackend, OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, HuggingFaceBackend, OllamaBackend

# Define test prompt
prompt = "What are the main differences between quantum and classical computing?"
item = OpenAIItem(prompt=prompt, n_samples=5, m=6, skeleton_policy="closed_book")

# Test configuration
config = dict(
    h_star=0.05,
    isr_threshold=1.0,
    margin_extra_bits=0.2,
    B_clip=12.0,
    clip_mode="one-sided",
)

# Compare providers
providers = {
    "GPT-4o-mini": OpenAIBackend(model="gpt-4o-mini"),
    "Claude-3.5": AnthropicBackend(model="claude-3-5-sonnet-latest"),
    "Llama-3.1": HuggingFaceBackend(mode="transformers",
                                    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
    "Ollama": OllamaBackend(model="llama3.1:8b-instruct"),
}

results = {}
for name, backend in providers.items():
    try:
        planner = OpenAIPlanner(backend, temperature=0.3)
        metrics = planner.run([item], **config)
        results[name] = metrics[0]
        print(f"{name}: {'ANSWER' if metrics[0].decision_answer else 'REFUSE'} "
              f"(RoH={metrics[0].roh_bound:.3f})")
    except Exception as e:
        print(f"{name}: Error - {e}")
```
Core Mathematical Framework
The EDFL Principle
Let the binary event $\mathcal{A}$ be "the model attempts an answer", and let $y$ denote the realized output. Build an ensemble of $m$ content-weakened prompts (the rolling priors) $\{S_k\}_{k=1}^{m}$:

- Information budget: $$\bar{\Delta} = \tfrac{1}{m}\sum_{k=1}^{m} \mathrm{clip}_+\big(\log P(y) - \log S_k(y),\, B\big)$$ (one-sided clipping; default $B = 12$ nats to prevent outliers while maintaining conservative bounds).
- Prior masses: $q_k = S_k(\mathcal{A})$, with:
  - $\bar{q} = \tfrac{1}{m}\sum_k q_k$ (average prior for the EDFL bound)
  - $q_{\text{lo}} = \min_k q_k$ (worst-case prior for SLA gating)
- By EDFL, the achievable reliability $p$ is bounded by $$\bar{\Delta} \ge \mathrm{KL}\big(\mathrm{Ber}(p) \,\|\, \mathrm{Ber}(\bar{q})\big).$$
- Thus the hallucination risk (error) is bounded by $$\mathrm{RoH} \le 1 - p^{\ast}, \qquad p^{\ast} = \max\big\{\, p \in [\bar{q}, 1] : \mathrm{KL}\big(\mathrm{Ber}(p) \,\|\, \mathrm{Ber}(\bar{q})\big) \le \bar{\Delta} \,\big\}.$$
Decision Rule (SLA Gating)
For a target hallucination rate $h^*$:

- Bits-to-Trust: $\mathrm{B2T} = \mathrm{KL}\big(\mathrm{Ber}(1-h^*) \,\|\, \mathrm{Ber}(q_{\text{lo}})\big)$
- Information Sufficiency Ratio: $\mathrm{ISR} = \bar{\Delta}/\mathrm{B2T}$
- ANSWER iff $\mathrm{ISR} \ge 1$ and $\bar{\Delta} \ge \mathrm{B2T} + \text{margin}$ (default margin ≈ 0.2 nats)

Why two priors? The gate uses the worst-case $q_{\text{lo}}$ for strict SLA compliance, while the RoH bound uses the average $\bar{q}$ per EDFL theory. This dual approach ensures conservative safety while providing realistic risk bounds. A worked numeric sketch follows below.
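To make these quantities concrete, here is a minimal, self-contained sketch of the gate arithmetic in plain Python. `bernoulli_kl` and `edfl_roh_bound` are local helpers (not toolkit API), and the numbers are illustrative rather than taken from a real run:

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def edfl_roh_bound(delta_bar: float, q_avg: float) -> float:
    """RoH bound = 1 - p*, with p* the largest p in [q_avg, 1] such that
    KL(Ber(p) || Ber(q_avg)) <= delta_bar (bisection on the upper branch)."""
    lo, hi = q_avg, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if bernoulli_kl(mid, q_avg) <= delta_bar:
            lo = mid
        else:
            hi = mid
    return 1.0 - lo

# Illustrative numbers, not from a real run
delta_bar, q_lo, q_avg = 1.0, 0.25, 0.30
h_star, margin = 0.05, 0.2

b2t = bernoulli_kl(1 - h_star, q_lo)  # Bits-to-Trust (worst-case prior)
isr = delta_bar / b2t                 # Information Sufficiency Ratio
answer = isr >= 1.0 and delta_bar >= b2t + margin
print(f"B2T={b2t:.3f} nats, ISR={isr:.3f}, "
      f"RoH bound={edfl_roh_bound(delta_bar, q_avg):.3f}, "
      f"decision={'ANSWER' if answer else 'REFUSE'}")
```

On these numbers the gate refuses (ISR < 1 against the worst-case prior) even though the average-prior RoH bound is low, which is exactly the dual-prior behavior described above.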
Understanding System Behavior
Expected Behavioral Patterns
The toolkit exhibits different behaviors across query types, which is mathematically consistent with the framework:
Simple Arithmetic Queries
Observation: May abstain despite apparent simplicity
Explanation:
- Models often attempt answers even with masked numbers (pattern recognition)
- This yields low information lift $\bar{\Delta} \approx 0$ between the full prompt and skeletons
- Despite a potentially low EDFL risk bound, the worst-case prior gate triggers abstention (ISR < 1)
Named-Entity Factoids
Observation: Generally answered with confidence
Explanation:
- Masking entities/dates substantially reduces answer propensity in skeletons
- Restoring them yields a large $\bar{\Delta}$ that clears the B2T threshold
- The system answers with a tight EDFL risk bound
This is not a bug but a feature: The framework prioritizes safety through worst-case guarantees while providing realistic average-case bounds.
Provider-Specific Considerations
Different model providers may exhibit varying behaviors:
- OpenAI/Anthropic: Generally produce clean JSON decisions with high compliance
- Hugging Face (Local): May require instruction-tuned variants for best results
- Ollama: Performance depends on the specific model; instruction-tuned models recommended
- Base Models: May need adjusted prompting or higher sampling for stable priors (see the sketch below)
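For instance, a base (non-instruct) model can be given more samples and a larger skeleton ensemble to stabilize the prior estimates; a sketch with illustrative parameter values:

```python
from hallucination_toolkit import OpenAIItem

# Base models: more samples and a larger skeleton ensemble stabilize the
# q_k prior estimates, at the cost of proportionally more decision calls.
item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=10,                  # more samples per prompt variant (illustrative)
    m=8,                           # larger skeleton ensemble (illustrative)
    skeleton_policy="closed_book",
)
```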
Two Ways to Build Rolling Priors
1) Evidence-based (when you have context)
- Prompt contains a field like `Evidence:` (or JSON keys)
- Skeletons erase the evidence content but preserve structure and roles, then permute blocks deterministically (seeded); an illustrative sketch follows this list
- Decision head: "Answer only if the provided evidence is sufficient; otherwise refuse."
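As a rough illustration of the erasure step (a simplified sketch, not the toolkit's actual implementation; the deterministic block permutation is omitted), applied to a prompt like the one in the example below it keeps the `Evidence:` header and list shape while blanking each item:

```python
def erase_evidence(prompt: str, field: str = "Evidence") -> str:
    """Illustrative only: blank the content under an 'Evidence:' block while
    keeping headers and structure, as the toolkit's skeletons do conceptually."""
    out, in_block = [], False
    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped.startswith(f"{field}:"):
            in_block = True
            out.append(line)          # keep the field header
        elif in_block and stripped.startswith("-"):
            out.append("- [erased]")  # erase content, preserve list structure
        else:
            in_block = False
            out.append(line)
    return "\n".join(out)
```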
Example with Multiple Providers
```python
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import AnthropicBackend

backend = AnthropicBackend(model="claude-3-5-sonnet-latest")

prompt = """Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""

item = OpenAIItem(
    prompt=prompt,
    n_samples=5,
    m=6,
    fields_to_erase=["Evidence"],
    skeleton_policy="auto",
)

planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05, isr_threshold=1.0)
```
2) Closed-book (no evidence)
- Prompt has no evidence
- Skeletons apply semantic masking of:
  - Multi-word proper nouns (e.g., "James Peebles" → "[…]")
  - Years (e.g., "2019" → "[…]")
  - Numbers (e.g., "3.14" → "[…]")
  - Quoted spans (e.g., '"Nobel Prize"' → "[…]")
- Masking strengths: Progressive levels (0.25, 0.35, 0.5, 0.65, 0.8, 0.9) across the skeleton ensemble; a minimal masking sketch follows below
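A minimal illustration of this kind of semantic masking (a regex-based sketch, not the toolkit's actual implementation, which also applies the progressive strength levels listed above):

```python
import re

def mask_closed_book(prompt: str) -> str:
    """Illustrative semantic masking: replace quoted spans, multi-word proper
    nouns, years, and remaining numbers with a placeholder."""
    masked = re.sub(r'"[^"]+"', "[…]", prompt)                           # quoted spans
    masked = re.sub(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b", "[…]", masked)  # proper nouns
    masked = re.sub(r"\b(?:19|20)\d{2}\b", "[…]", masked)                # years
    masked = re.sub(r"\b\d+(?:\.\d+)?\b", "[…]", masked)                 # numbers
    return masked

print(mask_closed_book("Who won the 2019 Nobel Prize in Physics?"))
# -> Who won the […] […] in Physics?
```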
Example with Multiple Providers
```python
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import OllamaBackend

backend = OllamaBackend(model="mixtral:8x7b-instruct")

item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=7,
    m=6,
    skeleton_policy="closed_book",
)

planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05)
```
API Surface
Core Classes
- `OpenAIBackend(model, api_key=None)` – Original OpenAI wrapper
- `AnthropicBackend(model, api_key=None)` – Anthropic Claude adapter
- `HuggingFaceBackend(mode, model_id, ...)` – Hugging Face adapter (3 modes)
- `OllamaBackend(model, host)` – Ollama local model adapter
- `OpenRouterBackend(model, api_key=None, ...)` – OpenRouter multi-model adapter
- `OpenAIItem(prompt, n_samples=5, m=6, fields_to_erase=None, skeleton_policy="auto")` – One evaluation item
- `OpenAIPlanner(backend, temperature=0.5, q_floor=None)` – Runs evaluation (works with any backend):
  - `run(items, h_star, isr_threshold, margin_extra_bits, B_clip=12.0, clip_mode="one-sided") -> List[ItemMetrics]`
  - `aggregate(items, metrics, alpha=0.05, h_star, ...) -> AggregateReport`
Helper Functions
- `make_sla_certificate(report, model_name)` – Creates a formal SLA certificate
- `save_sla_certificate_json(cert, path)` – Exports the certificate for audit
- `generate_answer_if_allowed(backend, item, metric)` – Only emits an answer if the decision was ANSWER
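A typical end-to-end flow with these helpers (a sketch built from the classes above; it assumes the helpers import from `hallucination_toolkit` alongside the core classes and that `generate_answer_if_allowed` returns `None` when the decision was REFUSE):

```python
from hallucination_toolkit import (
    OpenAIBackend, OpenAIItem, OpenAIPlanner,
    make_sla_certificate, save_sla_certificate_json, generate_answer_if_allowed,
)

backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)
items = [OpenAIItem(prompt="Who won the 2019 Nobel Prize in Physics?", n_samples=7, m=6)]
metrics = planner.run(items, h_star=0.05)

# Emit answers only where the gate said ANSWER
for item, metric in zip(items, metrics):
    answer = generate_answer_if_allowed(backend, item, metric)
    print(answer if answer is not None else "[refused]")

# Certify the run for audit
report = planner.aggregate(items, metrics)
cert = make_sla_certificate(report, model_name="gpt-4o-mini")
save_sla_certificate_json(cert, "sla.json")
```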
ItemMetrics Fields
Every `ItemMetrics` includes:

- `delta_bar`: Information budget $\bar{\Delta}$ (nats)
- `q_conservative`: Worst-case prior $q_{\text{lo}}$
- `q_avg`: Average prior $\bar{q}$
- `b2t`: Bits-to-Trust requirement
- `isr`: Information Sufficiency Ratio
- `roh_bound`: EDFL hallucination risk bound
- `decision_answer`: Boolean decision
- `rationale`: Human-readable explanation
- `meta`: Dict with `q_list`, `S_list_y`, `P_y`, `closed_book`, etc.
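These fields make per-item audit logging straightforward; for example, given `metrics` from any of the runs above:

```python
# Log one audit row per evaluated item (fields listed above)
for m in metrics:
    print(
        f"delta_bar={m.delta_bar:.3f} nats | q_lo={m.q_conservative:.3f} | "
        f"q_avg={m.q_avg:.3f} | B2T={m.b2t:.3f} | ISR={m.isr:.3f} | "
        f"RoH<={m.roh_bound:.3f} | "
        f"{'ANSWER' if m.decision_answer else 'REFUSE'} | {m.rationale}"
    )
```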
Calibration & Validation
Validation Set Calibration
On a labeled validation set:
- Sweep the margin parameter from 0 to 1 nats
- For each margin, compute:
  - Empirical hallucination rate among answered items
  - Wilson upper bound at 95% confidence
- Select the smallest margin where the Wilson upper bound ≤ target $h^*$ (e.g., 5%)
- Freeze the policy: $(h^*, \tau, \text{margin}, B, \text{clip\_mode}, m, r, \text{skeleton\_policy})$ (a sketch of this sweep follows below)
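A sketch of the sweep, assuming you already have per-item `(delta_bar, b2t, hallucinated)` tuples from a labeled validation run; `wilson_upper` and `calibrate_margin` are local helpers (not toolkit API), and answering is approximated by the rule $\bar{\Delta} \ge \mathrm{B2T} + \text{margin}$:

```python
import math

def wilson_upper(k: int, n: int, z: float = 1.96) -> float:
    """Wilson score upper bound for k errors out of n answered items (95%)."""
    if n == 0:
        return 0.0
    p = k / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + half) / denom

def calibrate_margin(records, h_star=0.05, step=0.05):
    """records: (delta_bar, b2t, hallucinated) per validation item.
    Returns the smallest margin whose answered subset passes the Wilson test."""
    margin = 0.0
    while margin <= 1.0:
        answered = [r for r in records if r[0] >= r[1] + margin]
        errors = sum(1 for r in answered if r[2])
        if wilson_upper(errors, len(answered)) <= h_star:
            return round(margin, 2)
        margin += step
    return None  # no margin in [0, 1] meets the target

# Synthetic validation records: (delta_bar, b2t, hallucinated)
good = [(1.5 + 0.01 * i, 1.3, False) for i in range(200)]
bad = [(1.4, 1.3, True)] * 5
print(calibrate_margin(good + bad))  # 0.15 on this synthetic data
```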
Portfolio Reporting
The toolkit provides comprehensive metrics:
- Answer/abstention rates
- Empirical hallucination rate + Wilson bound
- Distribution of per-item EDFL RoH bounds
- Worst-case and median risk bounds
- Complete audit trail
Practical Considerations
Choosing the Right Provider
| Provider | Best For | Considerations |
|---|---|---|
| OpenAI | Production deployment, consistent JSON | Requires API key, costs per token |
| Anthropic | High-quality reasoning, safety-critical | Requires API key, may have rate limits |
| OpenRouter | Multi-model testing, cost optimization | Single API for 100+ models, automatic fallbacks |
| HuggingFace (Local) | Full control, no API costs | Requires GPU, setup complexity |
| HuggingFace (TGI) | Team deployments, caching | Requires server setup |
| HuggingFace (API) | Quick prototyping | Rate limits, requires HF token |
| Ollama | Local experimentation | Easy setup, model quality varies |
Performance Characteristics by Provider
| Provider | Latency per Item | Cost | Setup Complexity |
|---|---|---|---|
| OpenAI | 2-5 seconds | ~$0.01-0.03 | Low |
| Anthropic | 3-6 seconds | ~$0.02-0.05 | Low |
| HF Local | 1-10 seconds | Free (GPU cost) | Medium-High |
| HF TGI | 1-3 seconds | Server costs | High |
| HF API | 3-8 seconds | Free tier/paid | Low |
| Ollama | 2-15 seconds | Free (local) | Low |
Common Issues & Solutions
Issue: Non-JSON responses from local models
Solution: Use instruction-tuned model variants (e.g., `-Instruct` suffixes)
Issue: Different risk bounds across providers
Expected: Models have different knowledge/calibration; the framework adapts accordingly
Issue: Timeouts with local models
Solution: Increase the `request_timeout` parameter or reduce batch size (see below)
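For example, with the Ollama backend (using the `request_timeout` parameter shown in its configuration above):

```python
from htk_backends import OllamaBackend

# Give slow local models more headroom before timing out
backend = OllamaBackend(
    model="llama3.1:8b-instruct",
    request_timeout=180.0,  # default is 60.0
)
```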
Project Layout
```
.
├── app/                          # Application entry points
│   ├── web/web_app.py            # Streamlit UI
│   ├── cli/frontend.py           # Interactive CLI
│   ├── examples/                 # Example scripts
│   └── launcher/entry.py         # Unified launcher
├── hallbayes/                    # Core modules
│   ├── hallucination_toolkit.py  # Main toolkit
│   ├── htk_backends.py           # Universal backend adapters
│   └── build_offline_backend.sh
├── electron/                     # Desktop wrapper
├── launch/                       # Platform launchers
├── release/                      # Packaged artifacts
├── bin/                          # Offline backend binary
├── requirements.txt
├── pyproject.toml
└── README.md
```
Deployment Options
1. Direct Python Usage (Any Provider)
```python
from hallbayes import (
    OpenAIPlanner, OpenAIItem, make_sla_certificate, save_sla_certificate_json,
)
from hallbayes.htk_backends import AnthropicBackend  # or any other backend

# Choose your provider
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")

# Configure and run
items = [OpenAIItem(prompt="...", n_samples=7, m=6)]
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(items, h_star=0.05)

# Generate SLA certificate
report = planner.aggregate(items, metrics)
cert = make_sla_certificate(report, model_name="Claude-3.5-Sonnet")
save_sla_certificate_json(cert, "sla.json")
```
2. Web Interface (Streamlit)
```bash
streamlit run app/web/web_app.py
```
3. Batch Processing with Multiple Providers
```python
import json

from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, OllamaBackend

# Load prompts
with open("prompts.json") as f:
    prompts = json.load(f)

# Setup providers
providers = {
    "claude": AnthropicBackend(model="claude-3-5-sonnet-latest"),
    "llama": OllamaBackend(model="llama3.1:8b-instruct"),
}

# Process with each provider
results = {}
for name, backend in providers.items():
    planner = OpenAIPlanner(backend, temperature=0.3)
    items = [OpenAIItem(prompt=p, n_samples=5, m=6) for p in prompts]
    metrics = planner.run(items, h_star=0.05)
    results[name] = planner.aggregate(items, metrics)
```
Quick Migration Guide
If you're already using the toolkit with OpenAI, here's how to try other providers:
```python
# Original (OpenAI only)
from hallucination_toolkit import OpenAIBackend
backend = OpenAIBackend(model="gpt-4o-mini")

# New (Any provider) - just change these two lines:
from htk_backends import AnthropicBackend  # or HuggingFaceBackend, OllamaBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")

# Everything else stays exactly the same!
planner = OpenAIPlanner(backend, temperature=0.3)
# ... rest of your code unchanged
```
Based on the Paper: Predictable Compression Failures: Why Language Models Actually Hallucinate - https://arxiv.org/abs/2509.11208
License
This project is licensed under the MIT License; see the LICENSE file for details.
Attribution
Developed by Hassana Labs (https://hassana.io).