Pre-flight cost estimation for Claude Code.
Know what your AI coding task will cost before it runs. Tarmac hooks into Claude Code, intercepts your prompt, and shows a calibrated cost range — so you can proceed, switch models, or cancel before spending a cent.
## The Problem
Claude Code has zero cost visibility. You type a prompt, it runs for 2 minutes or 20 minutes, and you find out the cost after it's done. For complex tasks on Opus, that can be $5-20+ per prompt — and there's no way to know in advance.
## The Solution
Tarmac installs as a Claude Code hook. Every time you submit a prompt, Tarmac intercepts it, extracts features, runs a trained regression model with conformal prediction intervals, and injects a cost estimate into Claude's context. Claude then presents the estimate and asks whether to proceed.
```
⚡ TARMAC COST ESTIMATE
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sonnet 4.6   $0.12 - $0.89
Opus 4.6     $0.58 - $4.34
Haiku 4.5    $0.03 - $0.22

Task type: code modification
Input: 847 tokens
Coverage: 80% confidence interval
Method: conformal-regression
━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
No API key required. No external calls. Everything runs locally in ~5ms.
## Quick Start
```bash
npm install -g tarmac-cost
tarmac-cost setup
```
That's it. Open Claude Code and every prompt (5+ words) will now include a cost estimate.
After a session, run `tarmac-cost report` to compare the estimate to what actually happened:
```
$ tarmac-cost report

📊 TARMAC SESSION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model:      Opus 4.6
Estimated:  $0.58 - $4.34
Actual:     $2.17
Result:     ✅ Within estimate
API calls:  12
Duration:   94s
━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
The report compares the last estimate against the actual cost from that session's transcript. Run it after exiting a Claude Code session to see how the prediction held up.
To uninstall, remove the Tarmac hook entries from `~/.claude/settings.json`.
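`tarmac-cost setup` registers the hooks for you, but for reference, the entries follow Claude Code's hook-settings shape. A rough sketch of what they look like (the exact commands and fields written by setup may differ):

```json
{
  "hooks": {
    "UserPromptSubmit": [
      { "hooks": [{ "type": "command", "command": "tarmac-cost estimate" }] }
    ],
    "Stop": [
      { "hooks": [{ "type": "command", "command": "tarmac-cost report" }] }
    ]
  }
}
```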
## Benchmarks
Validated on 3,381 real tasks (3,000 SWE-bench + 381 local Claude Code sessions):
| Dataset | Coverage (80% target) | Median Interval Width | vs Heuristic Baseline |
|---|---|---|---|
| Overall (n=3,381) | 81.1% | $0.78 | +19.3pp |
| SWE-bench (n=3,000) | 83.6% | $0.85 | +14.3pp |
| Opus 4.6 | 84.5% | $1.13 | +17.3pp |
| Sonnet 4.6 | 81.7% | $0.67 | +13.0pp |
| Haiku 4.5 | 84.6% | $0.47 | +7.8pp |
"Coverage" = percentage of actual costs that fell within the predicted range. An 80% target means you should expect ~4 out of 5 estimates to contain the true cost. We hit 81.1% overall.
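The coverage metric itself is simple to compute from (interval, actual) pairs. A quick sketch, not the project's validation code:

```typescript
// Fraction of actual costs that landed inside their predicted interval.
function empiricalCoverage(
  intervals: Array<[number, number]>, // [low, high] in dollars
  actuals: number[],
): number {
  const hits = intervals.filter(
    ([lo, hi], i) => actuals[i] >= lo && actuals[i] <= hi,
  ).length;
  return hits / intervals.length;
}
```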
## How It Works
```
┌────────────┐     ┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│  You type  │────▶│ Claude Code  │────▶│     Tarmac     │────▶│    Claude    │
│  a prompt  │     │  hook fires  │     │ estimates cost │     │ presents it  │
└────────────┘     └──────────────┘     └────────────────┘     └──────────────┘
```
- Hook intercept — Claude Code's `UserPromptSubmit` hook pipes your prompt to `tarmac-cost estimate` via stdin
- Feature extraction — 24 features extracted from prompt text (length, code blocks, file paths, task keywords, vocabulary richness, etc.)
- Per-model regression — Separate ridge regression models for Opus, Sonnet, and Haiku predict log₁₀(cost)
- Conformal calibration — Residuals from a held-out calibration set determine the interval width needed for 80% coverage
- Output — The estimate is injected as `additionalContext` into Claude's system prompt, which Claude then presents to the user
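The hook's I/O contract can be sketched as a small pure function: the hook reads JSON containing the prompt on stdin and prints JSON whose `additionalContext` is injected into Claude's context. `formatEstimate` below is a hypothetical stand-in for Tarmac's real formatter:

```typescript
// Simplified sketch of the estimate hook's I/O contract.
interface HookInput {
  prompt: string;
}

// Hypothetical stand-in for Tarmac's real estimate formatter.
function formatEstimate(prompt: string): string {
  const words = prompt.split(/\s+/).filter(Boolean).length;
  return `TARMAC COST ESTIMATE for a ${words}-word prompt`;
}

// Build the stdout payload Claude Code expects from a UserPromptSubmit hook.
function buildHookOutput(input: HookInput): string {
  return JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "UserPromptSubmit",
      additionalContext: formatEstimate(input.prompt),
    },
  });
}
```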
## Why Conformal Prediction?
Traditional approaches (heuristic multipliers, percentile-based ranges) can't provide coverage guarantees. Conformal prediction is a distribution-free method that gives calibrated prediction intervals: if you ask for 80% coverage, you get ~80% coverage, regardless of the underlying distribution. No assumptions about normality or homoscedasticity needed.
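The calibration step fits in a few lines. Given point predictions and observed costs (in log₁₀ dollars) from a held-out calibration set, split conformal prediction takes a quantile of the absolute residuals and widens every future prediction by that amount. A minimal sketch, not Tarmac's actual implementation:

```typescript
// Split conformal calibration: find the half-width q such that
// [pred - q, pred + q] covers ~`coverage` of calibration points.
function conformalHalfWidth(
  calibPreds: number[],   // model predictions (log10 dollars)
  calibActuals: number[], // observed costs (log10 dollars)
  coverage = 0.8,
): number {
  const residuals = calibPreds
    .map((p, i) => Math.abs(calibActuals[i] - p))
    .sort((a, b) => a - b);
  // Finite-sample-corrected quantile: 1-indexed rank ceil((n + 1) * coverage)
  const n = residuals.length;
  const k = Math.min(n - 1, Math.ceil((n + 1) * coverage) - 1);
  return residuals[k];
}

// Convert a log10 point prediction plus half-width into a dollar interval.
function dollarInterval(logPred: number, q: number): [number, number] {
  return [10 ** (logPred - q), 10 ** (logPred + q)];
}
```

Because the guarantee comes from the empirical residual quantile rather than a parametric error model, it holds for whatever skewed, heavy-tailed shape real task costs have.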
## Features Used (24 total)
| Category | Features |
|---|---|
| Size | log char count, word count, line count, sentence count |
| Code signals | code blocks, file paths, function names, class names |
| Error signals | stack traces, error messages |
| Text properties | vocabulary richness, technical density, avg/max line length |
| Task indicators | mentions fix, add, refactor, test, deprecation, regression, performance |
| Structure | question count, URL count, inline code references |
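Several of these features reduce to simple text statistics. An illustrative subset (the real extractor in `src/` computes all 24; the regexes here are assumptions, not the shipped ones):

```typescript
// Illustrative subset of Tarmac-style prompt features.
interface PromptFeatures {
  logCharCount: number;   // size signal
  wordCount: number;      // size signal
  codeBlockCount: number; // code signal
  filePathCount: number;  // code signal
  mentionsFix: number;    // task indicator (0 or 1)
  questionCount: number;  // structure signal
}

function extractFeatures(prompt: string): PromptFeatures {
  return {
    logCharCount: Math.log10(prompt.length + 1),
    wordCount: prompt.split(/\s+/).filter(Boolean).length,
    codeBlockCount: (prompt.match(/```/g) ?? []).length / 2,
    filePathCount: (prompt.match(/[\w./-]+\.(ts|js|py|json)\b/g) ?? []).length,
    mentionsFix: /\bfix\b/i.test(prompt) ? 1 : 0,
    questionCount: (prompt.match(/\?/g) ?? []).length,
  };
}
```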
## Reproducing the Results
The model was trained on SWE-bench data (3,000 instances across Opus 4.6, Sonnet 4.6, and Haiku 4.5).
```bash
# Install dependencies
npm install

# Train the model (outputs src/data/model-weights.ts)
npx tsx train-model.ts

# Run head-to-head validation against heuristic baseline
npx tsx validate-conformal.ts

# Feature importance analysis
npx tsx signal-analysis.ts
```
Training data files:
- `data-swebench.json` — SWE-bench leaderboard data with per-instance costs
- `data-swebench-statements.json` — Problem statements for each SWE-bench instance
## Architecture
```
tarmac/
├── src/
│   ├── cli.ts                     # CLI entry point
│   ├── types.ts                   # TypeScript interfaces
│   ├── commands/
│   │   ├── estimate.ts            # Cost estimation (UserPromptSubmit hook)
│   │   ├── report.ts              # Outcome recording (Stop hook)
│   │   └── setup.ts               # Hook installation + config
│   ├── core/
│   │   ├── conformal-predictor.ts # Regression model + conformal intervals
│   │   ├── prompt-classifier.ts   # Task type classification
│   │   ├── context-estimator.ts   # Prior context estimation
│   │   ├── token-counter.ts       # Token counting (API or heuristic)
│   │   ├── output-estimator.ts    # Output token estimation
│   │   ├── cost-calculator.ts     # Token → dollar conversion
│   │   ├── formatter.ts           # Estimate → formatted output
│   │   ├── history-analyzer.ts    # Past session analysis
│   │   ├── haiku-preflight.ts     # Optional Haiku pre-analysis
│   │   └── telemetry.ts           # Config + last-estimate persistence
│   └── data/
│       ├── model-weights.ts       # Trained model weights (auto-generated)
│       └── pricing.ts             # Claude model pricing
├── train-model.ts                 # Training script
├── validate-conformal.ts          # Validation script
├── signal-analysis.ts             # Feature analysis
├── data-swebench.json             # Training data
├── data-swebench-statements.json  # Problem statements
├── package.json
├── tsconfig.json
└── LICENSE
```
## Limitations
What Tarmac can't do (yet):
- Short/vague prompts — A 5-word prompt like "fix the login bug" gives the model very little signal. Estimates will be wide.
- No context awareness — The model sees only the current prompt text, not the conversation history or codebase. A follow-up "do it differently" has no features to work with.
- Local data gap — Trained primarily on SWE-bench (3,000 instances). Only 381 local sessions in the validation set. Real-world usage patterns may differ.
- Irreducible variance — Even with perfect features, ~38% of cost variance is irreducible (the same prompt can cost 2x or 0.5x depending on codebase state, model behavior, etc.). This is a fundamental limit, not a model problem.
- Claude Code only — Currently only supports Claude Code's hook system. No support for other AI coding tools yet.
## Contributing
Contributions welcome. The biggest impact areas:
- More training data — Run `validate-conformal.ts` on your local sessions and share anonymized results
- Better features — The model uses 24 text features. Codebase-aware features (repo size, language, recent changes) could help
- Context awareness — Using conversation history to improve follow-up estimates
- Multi-provider — Extending beyond Claude to support other LLM providers
## License
MIT
